Runaway Rollbacks… or I stopped my SQL Query, so why is it still running?
I received a call for help today. A Developer called to say that they had “killed a query, but it just won’t stop”. Actually, it wasn’t the Developer responsible, but his colleague, but that’s irrelevant. “can you take a look”?
As a DBA, killing a SQL Server query is generally used a last resort. It often gets used after people have pressed Stop on their query, waited a minute or so and got impatient as SQL Server rolls back the query. Next, they close the Query window and maybe they will find the Process’ SPID number, try to kill it by issuing “sp_kill” against it and then contact the DBA to ask why the query didn’t stop. There’s another common step too – the user reboots the Server. Luckilly, there were no reboots this time..
The first thing I done was to fire up “sp_whoisactive” to find the errant query.
The “sp_kill” command has an optional parameter – STATUSONLY. This provides information on an already KILLed SPID, giving you the estimated percentage completion of the Rollback and the estimated time remaining. Oh yes, I can read “Books Online”. In this case, the Rollback was saying 0% completed and estimated time remaining 0 seconds.
Rolling back the effects of a query can take significantly longer than the query itself. When running a query, multiple CPU threads can be utilised, but only one is used during a Rollback.
The results of “sp_whoisactive” were a bit strange – they didn’t show any CPU or Disk Activity. This led the user to believe that the query hadn’t been doing anything. I was not so sure.
The screenshots below weren’t from the incident, but included to give you an idea of the steps that I took. sp_whoisactive game me the “session_id” (or SPID).
My first instinct was to check locking to see if the query had any Locks open. I issued the command “sp_lock <SPID>” which returned a series of Table locks (“sp_lock 58” for these screenshots).
I was able to see the Table name by moving to the appropriate Database and issuing SELECT OBJECT_NAME(object_id) with the object_id returned from sp_lock.
The next step was to check the Transaction Log file usage – DBCC SQLPERF(LOGSPACE) showed me that the Log file for the associated Database was 98% full. The Log file was 80Gb in size, so this confirmed that SQL Server had some serious work left to roll back the query. Again, apologies but the screenshot is just to give an example – 52%.
Had the SQL Server been re-started, it would have seen the Database as being in an inconsistent state. The Database would have been put into Recovery and the Rollback would have started all over again.
People often say that you should learn from your Children. I took a leaf out of their book and “did nothing”. Sat on my hands for a while, just using DBCC SQLPERF(LOGSPACE) from time to time to see the Log file usage dropping very slowly. Actually, I did something – I prepared yesterday’s Backup file to ensure that it wouldn’t be overwritten for any reason and scripted ready for a restore to ensure that we had a Plan B.
After about twenty minutes, the Rollback suddenly completed, proving just how difficult it can be to predict how long it’d take.
Only one question remained… why would they need a Log file of more than 80Gb on a 160Gb Database?
Learnings from this episode are:
Understanding where to look for Database Health vital signs is crucial.
Procrastination can be useful sometimes.