EXCEPT Operator Troubles (RowGoal)
Sometimes, queries just go bad. They can take you on a Journey. Not always a Journey that you want to go on, but so long as you’re learning then “The Journey is the reward”.
I’ve seen a lot of queries recently that use the EXCEPT operator. Frequently used in Data Warehouses to process Slowly Changing Dimensions.
EXCEPT has a nice simple syntax – i.e. to return the rows in Table A that don’t exist in Table B:
SELECT col1, col2, col3
FROM tableA
EXCEPT
SELECT col1, col2, col3
FROM tableB
In many cases, this works well, but recently I’ve seen examples where it becomes troublesome, specifically when trying to process higher volumes of data.
The same code can behave perfectly on a small dataset, but then cause issues on a larger database built in exactly the same format, with queries going from taking a few seconds to struggling to complete at all.
Comparing the Query Plans from the two Databases, I noticed a difference in one of the Join operators: a “Merge Join” had become a “Nested Loop”.
The observant amongst you will have noticed that I mentioned the query being just fine on small volumes of data, and causing issues with larger volumes.
So, what happened?
“Nested Loop” Joins are useful when there are a low number of Rows passing through the plan that are then looked up in the Joined Table. They are not good news if there are a large number of Rows to be processed. The Optimizer knows this, and would normally choose a “Hash Match” or “Merge” Join where it has to deal with a lot of rows. The behaviour in this case is a bit odd.
To confirm that the incorrect Join type was causing the problem, I used a Query Hint to stop the Optimizer using “Nested Loop” joins. The Hint is “OPTION (HASH JOIN, MERGE JOIN)”, which reduces the options that the Query Optimizer has so that it can only perform “Hash Match” and “Merge” Joins – i.e. no “Nested Loop” Joins. Sure enough, the query performed quickly again, returning results almost immediately.
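As a sketch, applied to the sample query from earlier (the table and column names are placeholders), the hint goes at the end of the statement:

```sql
SELECT col1, col2, col3
FROM tableA
EXCEPT
SELECT col1, col2, col3
FROM tableB
-- Restrict the Optimizer to Hash Match and Merge Joins only
OPTION (HASH JOIN, MERGE JOIN);
```

Note that join hints in an OPTION clause apply to every join in the query, which is another reason to treat this as a diagnostic step rather than a fix.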
Note
Query Hints can be useful to test hypotheses, and maybe as a crutch to get over an immediate issue. They are not for long-term use in production.
Step 2
Further investigation into the Query Plan showed that the Query Optimizer was mis-estimating the number of rows that would be returned from a Table Scan fairly early in the Plan. I checked the Statistics on the Table and found that they had been recently updated using a Full Scan, so they were unlikely to be the culprit.
The next step was to open the Query Plan XML (right-click the Plan and select “Show Execution Plan XML”) to get more detail on the estimate for this Table Scan – there’s always good stuff in the XML. I found not one, but two estimated values: EstimateRows and EstimateRowsWithoutRowGoal. EstimateRows seemed a bit low, but the EstimateRowsWithoutRowGoal value matched what I was expecting to see.
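For illustration, the two attributes sit side by side on the operator’s RelOp element in the plan XML. The snippet below is a trimmed, made-up example (the node id, estimates and other attributes are invented, and most attributes are omitted):

```xml
<RelOp NodeId="4" PhysicalOp="Table Scan" LogicalOp="Table Scan"
       EstimateRows="100" EstimateRowsWithoutRowGoal="1000000" />
```

A large gap between the two values, as here, is the tell-tale sign that a row goal is driving the plan shape.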
The Optimizer uses the RowGoal as an optimization technique – when it believes that the Query has requested fewer rows, it may be able to read fewer rows throughout the plan instead of processing everything and filtering it down at the end.
For example, if the Query has “SELECT TOP 10” and the Query would naturally give 100,000 rows then it makes sense to optimize the query for just the 10 rows instead of processing 100,000 rows only to filter down to 10 at the end.
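A minimal, hypothetical illustration of that situation (the table and column names are made up):

```sql
-- Only 10 rows are requested, so the Optimizer sets a row goal of 10
-- and builds a plan aimed at producing those 10 rows cheaply, rather
-- than a plan for all 100,000 matching rows.
SELECT TOP 10 col1, col2
FROM bigTable
ORDER BY col1;
```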
The next step was to switch RowGoal optimization off and see how that changed the situation. I originally wrote that as “if that fixes the issue”, but in actuality it just changes things.
Another Query Hint to the rescue (after removing the first one):
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));
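Applied to the sample query (table names as before are placeholders), it looks like this:

```sql
SELECT col1, col2, col3
FROM tableA
EXCEPT
SELECT col1, col2, col3
FROM tableB
-- Tell the Optimizer not to apply row goal logic anywhere in this query
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));
```

This hint requires SQL Server 2016 SP1 or later; on earlier builds, trace flag 4138 has a similar effect.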
Checking the Query Plan again, the Join has reverted from “Nested Loop” to a “Merge Join”. The query returns in a reasonable time again.
So, although the RowGoal optimization is trying to be helpful, it is causing an issue here. What prompts SQL Server to use RowGoal optimization? According to the documentation, the main keywords are TOP, FAST, IN and EXISTS. No sign of EXCEPT.
When the Optimizer breaks down a query, the EXCEPT operator is converted to an EXISTS statement and that is on our list of RowGoal prompts.
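Conceptually, the rewrite looks something like the query below. This is an illustrative equivalent rather than the Optimizer’s exact internal form, and it glosses over EXCEPT’s special NULL handling (EXCEPT treats two NULLs as equal, whereas plain `=` does not):

```sql
-- Roughly what EXCEPT means: distinct rows from tableA
-- for which no matching row EXISTS in tableB
SELECT DISTINCT col1, col2, col3
FROM tableA AS a
WHERE NOT EXISTS (
    SELECT 1
    FROM tableB AS b
    WHERE b.col1 = a.col1
      AND b.col2 = a.col2
      AND b.col3 = a.col3
);
```

The EXISTS in that rewrite is what brings row goal logic into play, even though EXCEPT is never mentioned in the documentation’s list.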
How does it get the value for RowGoal? I’m not sure. This appears to be a guess based on the EXISTS statement.
Possible Solution
Note that this is described as “Possible Solution”. There are likely to be other solutions, perhaps better, perhaps worse.
The issue here is that SQL Server thinks there will be a low number of rows to process, and it builds the query plan accordingly. With a large number of rows, this goes spectacularly wrong.
It is possible to switch off the RowGoal functionality by adding a TOP (x) clause to the query, telling the optimizer to expect a large number of rows – i.e. higher than the guess provided by the Optimizer itself.
The recommended value for (x) is 9223372036854775807 – the maximum value that a bigint can hold. I have seen a number of presentations by Adam Machanic where he uses it to similar effect.
While a lower number could be used, using a fixed “magic number” increases maintainability – if a support person or developer sees this number, they will know what is happening and why it is there.
By switching off RowGoal on this Query, SQL Server gets a more accurate idea of the number of rows required. With a larger number of rows, it is less likely to choose a “Nested Loop” Join.
It appears that in order to “trust” the EXCEPT operator to behave itself, it pays to change the second SELECT statement to “SELECT TOP (9223372036854775807)”.
This command needs to go on the second part of the Query – just after the EXCEPT statement – i.e.
SELECT col1, col2, col3
FROM tableA
EXCEPT
SELECT TOP (9223372036854775807) col1, col2, col3
FROM tableB
This solution was put in place across the small and larger Databases and appears to work just fine.
I hope that you find this useful – not just the “fix”, but the explanation of how I came to resolve the issue. Sometimes, “The Journey is the Reward”. That is also the name of a great book about Steve Jobs which I highly recommend reading.
Apologies for the lack of pictures in this post. Time was too limited for creating examples.