2022 SQL Outage

From EOVSA Wiki
Revision as of 18:42, 8 March 2022 by Dgary (talk | contribs) (→‎Status)

Background

A brush fire that started on 2022 Feb 16 interrupted observing with EOVSA. The fire posed a severe threat to the observatory, but in the end there was no damage and there were no injuries. However, fighting the fire required shutting off the power to the site, which has caused a number of problems in getting back to a running condition. The most difficult problem is that the SQL server would not boot. We finally got it to start, but there are disk errors that unfortunately appear to have been occurring for some time without our being aware of them. Reading from the database appears to work, but any extended read fails because of the disk read errors.

Code Status

As of this writing (2022 Mar 08), I have rewritten numerous routines to permit recording data without the SQL server. The strategy is to use tables for the two main real-time records, delay_centers and DCM_master_table, and to write these tables to the new NAS RAID system (/nas4) whenever they are changed. I created a folder, /nas4/Tables, that contains those tables. In addition, the stateframe (updated once a second) and the scanheader (once per scan) are also written to files. The idea is that whenever the database is again available, we will be able to write these saved records to it. The remaining problem is that no calibration is possible, because all of the calibration procedures require reading the SQL database.
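The write-a-table-whenever-it-changes strategy can be sketched as follows. This is only an illustration under stated assumptions, not the code actually in use: the function names save_table and latest_table, the timestamped file naming, and the plain-text line format are all hypothetical. The only details taken from the description above are the table names and the /nas4/Tables folder.

```python
import os
import time

def save_table(lines, name, root, when=None):
    """Write a timestamped snapshot of a table (a list of text lines).

    name is e.g. 'delay_centers' or 'DCM_master_table'; root is the
    folder holding the snapshots (e.g. /nas4/Tables). The file naming
    scheme is an assumption made for this sketch.
    """
    os.makedirs(root, exist_ok=True)
    stamp = time.strftime('%Y%m%d_%H%M%S', time.gmtime(when))
    path = os.path.join(root, '{}_{}.txt'.format(name, stamp))
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    # Rename only after the write completes, so a reader never
    # sees a half-written table.
    os.replace(tmp, path)
    return path

def latest_table(name, root):
    """Return the lines of the most recent snapshot, or None if none exist."""
    prefix = name + '_'
    files = sorted(f for f in os.listdir(root)
                   if f.startswith(prefix) and f.endswith('.txt'))
    if not files:
        return None
    with open(os.path.join(root, files[-1])) as f:
        return f.read().splitlines()
```

Because each change produces a new timestamped file rather than overwriting the previous one, the full history of changes is preserved and could later be replayed into the database in time order once it is available again.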

Because we will eventually have to restore the code to its use of the SQL database, I list below the routines that have been changed. I tried to mark each change with this comment:

# ************ This block commented out due to loss of SQL **************

Routines with changes are sf_display.py, daily_xsp.py, dbutil.py, adc_plot.py, delay_widget.py, flare_monitor.py, pcal_anal.py, schedule.py, stateframe.py.
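When the time comes to restore the SQL code, the marked blocks could be located with a short script along these lines. This is a sketch only: the function name find_marked_blocks is hypothetical, and it searches for a substring of the marker comment rather than the exact line of asterisks, on the assumption that the number of asterisks may not be identical everywhere. Changes that were not marked will not be found this way.

```python
# Substring of the marker comment; asterisk counts may vary between files.
MARKER = 'commented out due to loss of SQL'

def find_marked_blocks(paths, marker=MARKER):
    """Return (path, line number) for every line containing the marker."""
    hits = []
    for path in paths:
        # errors='replace' keeps the scan going past any odd bytes.
        with open(path, errors='replace') as f:
            for lineno, line in enumerate(f, start=1):
                if marker in line:
                    hits.append((path, lineno))
    return hits
```

Calling find_marked_blocks with the paths of the routines listed above would give a checklist of places to revisit.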

SQL Status

The SQL database can be read, and the problem seems to be not actual errors in the database, but rather read errors from the disk (possibly controller errors rather than errors on the disk itself?). However, none of the lengthy database integrity checks can complete, because the reported disk errors stop the procedures immediately.