I started with blogging about COUNT = 0 situation. Then I followed it with COUNT = 1. So, let’s do a COUNT = 2 today. No, no, I am just kidding, don’t worry 🙂 But I still have to do one more COUNT post, one that will cover all other COUNT situations.
Believe it or not, sometimes you really, honestly, do need to perform an actual COUNT. You just need to know exactly how many of rows there are. It may be 0, 1 or 75 or whatever, but you need to know exactly how many. These situations are few and far between, but every once in a long while this is what you need.
So, let’s give the COUNT function one last kick from another angle.
To make it absolutely clear, when you ask SQL Server to do a COUNT on a table, you can’t magically make it do anything but a table/index scan. If you do a SELECT COUNT(*) FROM… then SQL Server will just have to do the muster.
But is it really that, when you do a Record.COUNT, the NST translates it to SELECT COUNT(*) FROM… or can it possibly be anything else?
As a matter of fact, is can be something else, under right circumstances.
To see what this something else is, do a COUNT on two very different tables: G/L Entry and Sales Price. Then check with SQL Server Profiler what happens deep down in the trenches.
For Sales Price, SQL will just do a SELECT COUNT(*) FROM [Sales Price], as expected.
However, for G/L Entry, it will do a SELECT SUM($Cnt) FROM [G_ L_ Entry$VSIFT1].
What is this G_ L_ Entry$VSIFT1 and what does this $Cnt do in there? What is $Cnt? Why doesn’t it do the same thing for Sales Price, too?
Table$VSIFTn is the indexed view (materialized view) that NST creates for every SIFT index you create. An indexed view behaves just like a physical table: it starts with a SELECT statement just like any other view, but SQL Server then creates a physical table for it, and maintains the view results in this physical table. Indexed views typically have maintenance costs associated with them, but these costs are justified with substantial performance gains elsewhere.
To not venture too far off-topic, I won’t go into the structure of $VSIFT views, but for the sake of understanding how NST utilizes them to perform COUNT, just know this: for every combination of indexed columns the $Cnt column keeps information about how many rows with this specific combination there are.
And now you have your answers. By doing a SUM($Cnt) you get the same results as with COUNT(*), only faster.
And this is the rule: if there is a SIFT index on a table, NST will favor it over doing COUNT to perform the actual counting of records. Now, I don’t honestly know if it uses the very first $VSIFT view for a table or it does a smart job by using the most efficient one (my hunch based on a sample of attempts with different tables I did is that it will simply go for the first $VSIFT for the table), but it doesn’t really matter. What matters is that the answer to yesterday’s conundrum lies in the fact that $VSIFT is used whenever it exists.
Well, consider this. A SUM($Cnt) will do an index scan, and will have to retrieve values from the materialized table , no matter what. Depending on complexity of indexed view being used it may still do a large number of read operations. However, statistically the number of read operations will be smaller than the actual number of rows it would have to count if there was no $VSIFT in the first place. There may be some edgy situations where $VSIFT will perform slower than COUNT, but these will be so rare, and when you hit them, the performance benefit will be measured in nanoseconds, if even so.
This tells you one thing: if you can avoid COUNT, you should avoid COUNT (that was the topic of yesterday’s post).
To go back to yesterday’s post topic once again – the situations where COUNT = 1 will be faster than FIND(‘-‘) AND (NEXT = 0) are not exactly “edge” – you could reasonably argue that they will be more common, and that situations where FIND(‘-‘) AND (NEXT = 0) are actually very rare. And you would be right. Thing is, retrieval of 50 rows will most likely cost more than counting 50 rows or counting 5000 rows. However, if we are talking 50 rows (or 5000 rows) then either will still be blazing fast, and performance penalty of FIND(‘-‘) AND (NEXT = 0) will be microscopic (I had to do 10000 of repeats to be able to measure them in seconds rather than milliseconds). What matters though, is that there is a tipping point: a point where no matter what, retrieving 50 rows will be faster than COUNT, and once you hit your tipping point, COUNT will be slower ever after. For some tables it may be thousands of rows, for some it may be millions. But once you hit it, and the table continues to grow, you get a slower and slower COUNT.
You see, NST doesn’t even consider all this. It simply uses $VSIFT over COUNT whenever it can, because it knows: if $VSIFT is slower, it won’t matter; when it’s faster, it will matter.
Morale of the story? Avoid COUNT if you can. If you can’t, and you expect COUNT to be slow, then just create a SIFT on a key, or simply create a key and then create a SIFT on it. Just make sure to do it on a key with medium selectivity, as low selectivity will have high maintenance cost, and high selectivity will have higher selection cost.
That’s all. I won’t blog about COUNT anymore, I solemnly promise.