Options for deleting large amounts of data

Recently, I had a situation where a lot of data needed to be deleted.

This was caused by an incorrectly configured monitoring system. When monitoring, you should record just the information that you need, and no more. In this case, far too much information was being collected and recorded, causing the database that holds it to balloon in size.

I was notified of the issue when the disk holding the monitoring data breached its usage threshold, getting close to filling. In a virtual environment, it’s so easy to add disk space that a full disk is often “resolved” by extending the disk or partition. That might close the current incident, but it doesn’t fix the underlying problem. In fact, it can make it far more difficult to resolve in the future.

Oversized databases cause problems: they consume disk space, inflate backups, and can slow down both the application that is creating the data and any reporting that runs against it.

In my example, I brought the issue to the attention of the users of the monitoring application, and they reviewed and corrected the collection settings. This greatly reduced the amount of new information being stored, but the excess data that had already been collected still remained.

 

Any data deletion needs to be carefully managed, both in terms of what to delete and how to delete it. In this case, the manufacturer of the monitoring tool provided a script for data deletion, normally used for general housekeeping – e.g. holding only the last three months of data. If a safe method of data deletion is not available, consult the owners of the data structure to create one.

Simple answer – just delete the data…

Once you have a query that will remove the data, it’s not always safe to just run it. Deleting data often requires additional disk space while the operation runs, which may put you in an even more difficult position.

If you are trying to remove a large amount of data, the deletion query may fail for a number of reasons.

Deleting data is a logged operation, so the database that you are cleaning up needs a transaction log big enough to hold the deleted rows while the operation runs – a problem if you are trying to delete all of the excess data in one go. The transaction log will grow as required, but you might not have enough space for it to grow that big (after all, you might already be out of disk space). This leaves us with (at least) two options.
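Before starting, it’s worth checking how much log space you have in play. Assuming SQL Server (which the examples in this post use), a quick way to see log sizes and usage is:

```sql
-- Report each database's transaction log size and percentage in use.
DBCC SQLPERF(LOGSPACE);

-- List the log files and their current sizes in MB.
SELECT DB_NAME(database_id) AS database_name,
       name                 AS logical_file_name,
       size * 8 / 1024      AS size_mb
FROM sys.master_files
WHERE type_desc = 'LOG';
```

Knowing the current log size, and the free space on the volume holding it, tells you how much headroom a deletion actually has.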

 

Option 1 – Delete the data in chunks

One solution is to delete the data in multiple passes. To do this, I generally work out how much data to remove in each deletion operation and then restrict the DELETE statement to process only that amount. Removing data in this way may take longer, but it provides more control, gives you a better chance of setting your change window accordingly, and allows you to answer more accurately when asked how long it’s going to take.

For example, suppose you have a DELETE statement with an appropriate WHERE clause:
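As a sketch – the table and column names (dbo.MonitoringData, CollectedDate) are assumed for illustration:

```sql
-- Delete all monitoring rows older than three months.
-- dbo.MonitoringData and CollectedDate are illustrative names.
DELETE FROM dbo.MonitoringData
WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE());
```

Run as-is against a large table, this is the single huge logged operation described above.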

This can be adjusted to remove a set number of rows – e.g. 1,000,000 – as follows:
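Using the same illustrative names, the TOP clause caps how many rows a single statement removes:

```sql
-- Delete at most 1,000,000 qualifying rows per execution.
DELETE TOP (1000000) FROM dbo.MonitoringData
WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE());
```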

Percentages can also be used:
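Again with the assumed names, the PERCENT keyword makes TOP proportional rather than absolute:

```sql
-- Delete 10 percent of the qualifying rows per execution.
DELETE TOP (10) PERCENT FROM dbo.MonitoringData
WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE());
```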

Getting the settings right can be a case of trial and error. All of this should be worked out on a test server before applying it to a live environment.

It’s best to set the figures low, monitor the effect on your transaction log, and aim to identify the point where it’s just “touching disk” – that is, the log grows to a manageable size and is almost full, with no need to grow any further. This is the optimal point. Leave the log file at this size until the deletion work is complete, then shrink it again.
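Repeating the capped DELETE until nothing qualifies can be sketched as a simple loop – batch size and names are assumptions to tune on a test server:

```sql
-- Repeat a capped delete until no qualifying rows remain.
-- Start the batch size low and tune it while watching log growth.
DECLARE @BatchSize INT = 100000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) FROM dbo.MonitoringData
    WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete

    -- Optional pause so other workloads (and log backups) can breathe.
    WAITFOR DELAY '00:00:05';
END;
```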

You can put additional clauses in your WHERE statement, or impose an ORDER BY, to delete rows in a particular order.
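DELETE TOP does not itself guarantee which rows go first; one common pattern (shown with the same assumed names) is to delete through an ordered common table expression:

```sql
-- Delete the oldest 100,000 qualifying rows first.
WITH OldestRows AS
(
    SELECT TOP (100000) *
    FROM dbo.MonitoringData
    WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE())
    ORDER BY CollectedDate
)
DELETE FROM OldestRows;
```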

The old-school method was to use SET ROWCOUNT, but this is deprecated for DELETE, INSERT and UPDATE statements. I have used it in the past and mention it for completeness; the TOP operator should be used instead.
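For completeness, the older pattern looked something like this (assumed names again) – and note that ROWCOUNT must be reset afterwards, or every subsequent statement in the session stays limited:

```sql
-- Deprecated approach: limit DML to 100,000 rows at a time.
SET ROWCOUNT 100000;

DELETE FROM dbo.MonitoringData
WHERE CollectedDate < DATEADD(MONTH, -3, GETDATE());

-- Always reset to remove the limit.
SET ROWCOUNT 0;
```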

Other Considerations

You should also consider the potential downstream impacts of the deletion. If the database is using the Full recovery model, you need to ensure that log backups are taken at appropriate intervals to keep the log size under control. You may consider switching the database to the Simple recovery model for the duration of the deletion work. Also, if your database participates in a High Availability or Disaster Recovery setup, remember that the deletions will also be performed on the other servers in the system. Again, consider all of this before setting out your change window to perform the work.
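If switching recovery models is acceptable in your environment (it breaks the log backup chain, so plan a full backup afterwards), the change itself is a one-liner each way – MonitoringDB and the backup path are assumed names:

```sql
-- Switch to Simple recovery for the duration of the purge.
ALTER DATABASE MonitoringDB SET RECOVERY SIMPLE;

-- ... run the chunked deletes here ...

-- Switch back, then take a full backup to restart the log chain.
ALTER DATABASE MonitoringDB SET RECOVERY FULL;
BACKUP DATABASE MonitoringDB
    TO DISK = N'MonitoringDB_postpurge.bak';
```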

 

Option 2 – Just keep the data that you want

If you can easily identify the rows that you want to retain, then another option is to copy just those rows into another table, truncate the existing table, and move the rows back.
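As a concept sketch, with the same assumed names and deliberately ignoring the caveats that follow:

```sql
-- 1. Copy the rows worth keeping into a holding table.
SELECT *
INTO dbo.MonitoringData_Keep
FROM dbo.MonitoringData
WHERE CollectedDate >= DATEADD(MONTH, -3, GETDATE());

-- 2. Truncate the original table (minimally logged, very fast).
TRUNCATE TABLE dbo.MonitoringData;

-- 3. Move the retained rows back.
INSERT INTO dbo.MonitoringData
SELECT * FROM dbo.MonitoringData_Keep;

-- 4. Tidy up.
DROP TABLE dbo.MonitoringData_Keep;
```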

The above is, of course, an over-simplification. You will need to consider things such as referential integrity, identity columns and the like, but as a concept it can work really well.

Because you are dealing with far less data, log usage will be much reduced and the operation will be far quicker.

Once again, test any method prior to running on the Live server.

 

This post is written as a brief discussion of general concepts. You will need to adjust the suggestions to fit your specific needs. Hopefully it’ll help out though.
