Virtual Log Files – Too many and you’re gone!

January 8, 2018

We had a situation whereby a Database maintenance job over-ran on a Data Warehouse Server. The overrun was so bad that it was blocking the overnight Data Load process. At the time, there was a separate team that handled general Operations, so I contacted them. They stopped the maintenance job, to no avail. The blocking was still there, so their next move (without consultation) was to issue a KILL command on the blocking process and suddenly the process went into a “Rollback” state. This is where the trouble started.

After an hour, we had an ETA for the “Rollback” completion – 14 hours. This would have been too late, since daily Data Loads are required… well, daily.

At this point, the Operations team suggested that they could do something – they didn’t say what, but they just re-booted the Server. The Server came back up, and guess what? During startup, the Database went into “Recovery Pending” mode and stopped there as it tried to complete the Rollback.

The reason for these problems is, unbeknown to me, the number of Virtual Log Files (VLFs) that existed on this Database. When Database Log Files needs to grow, a new VLF is created. This in itself is not a problem, but the number of VLFs that are created is based on the Autogrowth settings of the Database. Problems occur when you have too many VLFs, since this can cause performance issues with Log backups and can cause Database Recovery to slow considerably.

The issue was caused by the Database being left to the Default “Autogrowth” settings (I.e. 1Mb initial size, 10% autogrowth) . This meant that as data was loaded (in this case, a large amount of data), the Log file grew and this created an enormous number of Virtual Log Files (VLFs). A good rule of thumb is to keep the number of VLFs below a thousand. Investigations showed that this Database had tens of thousands, if not more.

The Database was owned and used by an Application support team and maintained by an Operations team. The Application support team were concentrating on what was going into the Database, and to the Operations team, this was just another of thousands of Servers that they look after. Ultimately, the Operations team and the Application Support team were only acting to serve one master – The Business. This is a common scenario, but in this case allowed for a dangerous gap to exist.

You’re probably expecting a happy ending here…

There was a fixed time window during which the Data needed to be loaded. The Database recovery wasn’t going to complete within that time. The Database in question was holding transient Staging data – Delta records recording changes to Tables on a Source System. This was lucky – I could create another copy of the Database structure by restoring a backup from before the new Table load, re-point the Change Data Capture software into this Database and re-load. This allowed the Data to be loaded and Business Requirements to be met.

The Log file was re-sized, by shrinking down to a minimal size and then re-sizing appropriately.

Autogrowth settings were updated to a more reasonable 1024Mb.

My learnings from this episode:

Check Database autogrowth settings – establish your own standard and enforce it

Where Database maintenance is shared between teams, establish clear responsibilities and ensure that all parties play their part in proper maintenance.

Further Reading:

There’s some excellent Resources (as usual) on the SQLSkills Website:

https://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/

Leave a Reply Cancel reply