2022 SQL Outage

Background

Due to a brush fire that started on 2022 Feb 16, observing with EOVSA was interrupted. The fire posed a severe threat to the observatory, but in the end there were no injuries and no damage. However, fighting the fire required shutting off power to the site, which has caused a number of issues in getting back to a running condition. The most difficult problem is that the SQL server would not boot up. We finally got it to start, but there are disk errors that unfortunately appear to have been going on for some time without our being aware of it. Reading from the database appears to work, but any extended read fails due to the disk errors.

Code Status

As of this writing (2022 Mar 08), I have rewritten numerous routines to permit recording data without the SQL server. The strategy is to keep file copies of the two main real-time records, the delay_centers and the DCM_master_table, writing them to the new NAS RAID system (/nas4) whenever they change. I created a folder /nas4/Tables that contains those tables. In addition, the stateframe (updated once a second) and the scanheader (written once per scan) are also written to files. The idea is that whenever the database is available again, we will be able to write these saved records to it. The problem is that in the meantime no calibration is possible, because all of the calibration procedures require reading the SQL database.
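
To illustrate the idea, here is a minimal sketch of the file-based fallback. The helper functions and file-naming convention below are placeholders for illustration, not the actual code in stateframe.py or schedule.py:

import os
import glob
from time import time

TABLEDIR = '/nas4/Tables'

def write_table(name, buf):
    # Save the current binary table (e.g. delay_centers or DCM_master_table)
    # to a time-stamped file whenever it changes, so that it can be inserted
    # into the SQL database once the server is healthy again.
    fname = os.path.join(TABLEDIR, '%s_%d.dat' % (name, int(time())))
    with open(fname, 'wb') as f:
        f.write(buf)
    return fname

def read_latest_table(name):
    # Read back the most recently saved copy of a table, in place of an SQL
    # query, while the database is unavailable.
    files = glob.glob(os.path.join(TABLEDIR, name + '_*.dat'))
    if not files:
        return None
    with open(max(files, key=os.path.getmtime), 'rb') as f:
        return f.read()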

Because we will eventually have to restore the code to its use of the SQL database, I list below the routines that have been changed. I tried to mark changes with this comment:

# ************ This block commented out due to loss of SQL **************

Routines with changes are sf_display.py, daily_xsp.py, dbutil.py, adc_plot.py, delay_widget.py, flare_monitor.py, pcal_anal.py, schedule.py, stateframe.py.

SQL Status

The SQL database can be read, and it seems that the problem is not actual errors in the database, but rather read errors from the disk (possibly not disk errors, but controller errors?). However, none of the lengthy checks of the database integrity can complete, because the reported disk errors stop the procedures immediately. I have copied the entire database file eOVSA06.mdf to the NAS drive, /nas4/SQL_backup/SQL3/SQL_DB/eOVSA06.mdf. However, I am not at all certain that this disk transfer worked without errors. Gil Jeffer had a better idea, which is to "script" the existing abin, hbin, and fbin tables (these hold the actual binary data) into a new database, located at /nas4/SQL_LOG/test.mdf (and test_log.ldf), and then transfer the binary data by inserting records read from the old tables. The idea is like this, using test:

set identity_insert abin on
insert into abin (Id, Bin) select * from eOVSA06.dbo.abin

This actually worked. We then did the same for hbin:

set identity_insert hbin on
insert into hbin (Id, Bin) select * from eOVSA06.dbo.hbin

I thought that worked, but in fact it turns out that the "Bin" column is a text buffer (actually a binary buffer), and SQL defaults to reading only the first 4096 bytes of such columns. Instead, one has to do:

set identity_insert hbin on
set textsize 2147483647 
insert into hbin (Id, Bin) select * from eOVSA06.dbo.hbin

This also worked, although it took a while because there were some 50,000 records.

The problem comes with fbin. I read the Id of the last record as -1922969933, whereas the first record is -2147483646. The difference is some 224 million, which is the number of fbin records. At one record per second (86400 per day), this corresponds to 2598 days, or 7+ years. Since each record is of order 15000 bytes, this is about 3.5 TB. I calculated that it will take about 50 hours to transfer the records. Still, it could be done, except that when I try it I get a disk read error after a few minutes. I did have some success, however, in transferring a set number of records in a specific range of Id numbers. For example,

DECLARE @cnt BIGINT = -2147483646
set identity_insert fbin on
set textsize 2147483647 
WHILE @cnt < -2146483646
BEGIN
  insert into fbin (Id, Bin) select top 100000 * from eOVSA06.dbo.fbin where id between @cnt and @cnt + 100000
  set @cnt = @cnt + 100000
END

This will transfer 1 million records, 100,000 at a time, so if there is an error it will still have transferred some whole number of 100,000-record batches. I did this several times, advancing the starting value by 1 million each time, and it seemed to work. So a viable method would be to set the WHILE limit to the last record number, -1922969933, and then just let 'er rip for 50 hours or so. If it does die, I will know where it stopped by checking the highest Id transferred so far (see the sketch below), and can restart from there. It is worth a shot.
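
If the long transfer does die partway through, the restart point can be found from the highest Id already copied into the new table. Here is a minimal sketch, assuming a pyodbc connection to the new test database (the connection string is a placeholder, not the actual EOVSA server configuration):

import pyodbc

# Placeholder connection string; substitute the real server name and credentials
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<sqlserver>;DATABASE=test;'
                      'Trusted_Connection=yes')
cur = conn.cursor()

# Highest Id already copied into the new fbin table
cur.execute('select max(Id) from fbin')
last_copied = cur.fetchone()[0]

# Everything at or below last_copied is already in the new table, so the
# WHILE loop above can be restarted with DECLARE @cnt BIGINT = last_copied + 1
print('Restart the transfer loop with @cnt = %d' % (last_copied + 1))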

Also worthy of note is that there are two tables, ScanHeaderDef and StateFrameDef, which are histories of how to translate the binary data in hbin and fbin, respectively. I copied the scripts for creating those tables, and also saved the data as .csv files, in /nas4/SQL_script/, with the names ScanHeaderDef.csv, ScanHeaderDef.sql, StateFrameDef.csv, and StateFrameDef.sql.