datetimeoffset Uniqueness

Datetimeoffset is an interesting data type, because there are multiple ways to represent the same value. And I don't mean like using a string to store "1" and "1.0", I mean the fact that I might have inserted a value in one time zone, but need to be query it in a different one.

Insert in one time zone, query in another

Notice that I inserted a value in the time zone UTC+10:30, and queried it back in UTC+11, and the value that it gave me in the results is in UTC+10:30. Clearly the system knows that the value in the table was stored in UTC+10:30, and would store something different if I had told it to insert the value in UTC+11 (feel free to try it and see that it'll return the same value that you inserted, whichever time zone your WHERE clause value is in).

Sort order understands this too. If I insert other rows, it knows what order they are actually in, regardless of where in the world those times happen.

As weird as it looks, this is in chronological order

And as I have a unique index on this, it won't let me insert 00:30 in UTC+11, because 00:00 in UTC+10:30 is already there. It tells me "Msg 2627, Level 14, State 1, Line 19. Violation of PRIMARY KEY constraint 'pkTimesOffset'. Cannot insert duplicate key in object 'dbo.TimesOffsets'. The duplicate key value is (2021-01-01 00:30:00.0000000 +11:00)."

Errors if I try to insert the same time but translated to a different time zone

I'm okay with this. They are definitely the same value, even if they are written differently. The same thing happens with strings in different collations, because they sort to the same value. Müller and Mueller will clash in a table with a German PhoneBook collation (and yes, that's what the German collations are called, complete with a capital 'B' in the middle of 'PhoneBook').

So you see, the time zone thing isn't about whether they are actually identical or not. It's about whether they are considered the same based on the things that the SQL engine uses to consider them equal. When you're dealing with data that is case-insensitive, it will consider that 'FARLEY' and 'Farley' are the same, and complain if I'm trying to have both those values in a unique index.

And for a fun trick, let's consider what happens with GROUP BY or DISTINCT.

For this, I'm going to first recreate those tables with non-unique indexes. Still using indexes, just non-unique ones.

And the results show different values according to whether we sort the data ascending or descending.

Different order, different results

All those times someone told you the order of the data didn't affect what the results were, just what order they were in… yeah.

And if I run the script again but insert the rows in the opposite order…

Same, but different

You see, the rows are in the table in the order they were inserted. And when we are moving through the index either forwards or backwards, we're coming across values, and the version of the value is the one that gets seen first. If it moves through it forwards, it'll (probably) come across the first row that was inserted first. If it moves through it backwards, it'll (probably) come across the last row first. I say 'probably', because there are times that it won't quite work like this. Parallelism could make this be different. There may have been index rebuilds, or page splits, or the execution plan might decide to do run things differently.

But the idea is there – those two values are the same, even though they're not.

@rob_farley

This post fits into T-SQL Tuesday, hosted this month by Brent Ozar (@brento) on the topic of data types.

Beware the width of the covering range

I'm sure you've heard by now that casting a datetime to a date is still SARGable. Like, that if you have an index on a datetime column (such as DateTimeCol in dbo.SomeTable), then you will get a seek even if you use a predicate like "WHERE CAST(DateTimeCol as date) = @DateParam". Sargability is all about whether your predicates can be used for a Seek, and it's something I've been preaching about for a very long time.

The trouble with this is that casting a datetime column to a date isn't actually a Sargable predicate. It feels like one, but it's not. Either way, I thought I'd write about this for T-SQL Tuesday, which is hosted this month by Brent Ozar (@brento). He asks us to write about our favourite data types, and I figure that talking about the pseudo-sargability of datetime-date conversion fits. (It was either that or write about how a unique index on datetimeoffset complains that '20210101 00:00 +10:30' and '20210101 00:30 +11:00' are the same value, even though they are clearly a little different. Maybe I'll do both… Edit: I did.)

Sargability is about the fact that an index is on the values in a column, and that if you're actually looking for something else, then the index doesn't work. So converting a column from one datatype to another doesn't cut it.

It doesn't even work to add zero to a integer value. That's enough to confuse the Query Optimizer into thinking that an index might not be right (and this is still the case if we use OPTION (RECOMPILE) so that it doesn't try to parameterise the query). It's easy for us to see that the order of things in the index isn't changed by adding a constant value, but the Query Optimizer doesn't look for this. I had a Connect item open for years about this.

The operation on the index is a Scan when I add a "plus zero" to the predicate

So what's special about converting from datetime to date? Well, it's about being able to add a helper predicate to the mix. (This is also how LIKE works against strings.) The Query Optimizer knows that all the values must be in a certain range, so that it can seek to that range. It works out this range using an internal function called GetRangeThroughConvert. It does a good job on this for LIKE, but doesn't do such a good job with dates.

Let me show you.

First I'm going to create a table called Times and put 100,000 rows in it, one per minute from the start of the year until roughly now. It's actually in a couple of days' time, mid-morning on March 11th.

Now when I query this table to find all the rows from today, I see an Index Seek and can be happy. "Obviously" this is sargable.

Looks like we've got ourselves some sargability!

But it's not really. The properties of that Clustered Index Seek show that our WHERE clause is being used as the Residual Predicate, and that our Seek Predicate is based on the Expressions created in that Compute Scalar operator.

The properties show there's something else going on

Now, you'll notice that the operators used in that Seek predicate are > and <. Greater-than and less-than. Not Greater-than-or-equal and less-than. This is a range which is exclusive at both ends. And this is interesting because if I want to write out a date range myself, I do it using an inclusive start and an exclusive end, like:

So I got to thinking about this. I wasn't quite curious enough to pull up the debugger to try to see what the values were, but I did do some investigation.

Notice in that tooltip I showed earlier, that the "Actual Number of Rows for All Executions" property says 1440. That's the number of minutes in a day, so that makes sense. But the "Number of Rows Read" property is 2879. This means that the Seek Predicate is returning 2879 rows, and then the Residual Predicate (shown as just "Predicate") is filtering these down to the 1440 that actually produce the correct value when converted to the date type.

2879 is one less than the number of minutes in two days. I figure the "one less" bit is because it's an exclusive-exclusive range. But still it's a range which is twice as big. It's not "a little bit bigger", it's twice as big as it needs to be. To compare this with LIKE:

When LIKE knows what values are being used, it doesn't use those ComputeScalar & ConstantScan operators, and we see helper predicates that do an inclusive-exclusive range:

Inclusive-exclusive range

But when we use a parameter, we have the expressions and an exclusive-exclusive range, but the same number of rows read as returned.

Exclusive-exclusive range, but still the right number of rows read

The Query Optimizer clearly understands how to do the range for LIKE.

But with the date conversion, twice as big.

Back to what that time period range is…

To figure this out, I started deleting rows from my table. I figured I'll be able to see the range by when the Number of Rows Read value starts dropping.

First I deleted rows from mid-morning March 10th on:

And the size of the range didn't change.

I removed all the datetime values from the 10th and the range didn't change. And all the datetime values from before the 8th. Only when I deleted '20210308 00:01' did the range start shrinking.

Without 00:01 on the 8th, my range to find values from the 9th was smaller.

So the range of my helper predicate hits the day before my date as well as the day I actually want. TWO days. Twice as big.

It certainly sounds like I'd be better off running my query using my own range predicate, typing out my inclusive-exclusive range instead of relying on the supposed sargability of casting as date. When I write this out, my plan becomes a lot more simple-looking, because it doesn't have to work out the range, but the plans tell me the cost is greater!

Looks simpler, but apparently the cost is greater?

It's lying though. It thinks the first query is cheaper because it does a bad job of estimating. It thinks there's only one row being produced when there are actually way more.

It estimates badly to make us think it's cheaper.

To try to even things up, let's try using a parameter for the date in both queries.

And whoa! It's even worse. Now when we do the range ourselves, it thinks it's 93% of the batch, and the simplicity of the plan has gone out the window.

Ugliness prevails

And yet this "more expensive" plan is actually better. The Index Seek looks through 1440 rows to find the 1440 rows, while the CAST query looks through 2879 rows. The estimate on this second version says that over 16,000 rows might be produced (which is what causes it to be 93% of the batch). But if I'm going to be passing 1440 rows through a query, I'd rather guess 16,000 than guess just 1.

The conclusion of all this is to point out that although we often refer casually to the idea that a datetime column cast to a date column is sargable, it's actually not quite, and you're better off handling your ranges yourself.

@rob_farley