Being sure of your data

Victoria Holt (@victoria_holt) says "Data governance is really about 'data erudition', showing an interest in learning about the data we have, improving the quality, and creating a more productive and trusted data asset."

This is so much part of what I do as a consultant. What my team do as consultants. This isn't a post to spruik about LobsterPot Solutions though, it's a response to Victoria's T-SQL Tuesday invitation. This is month 144, 12 years, a gross. But nevertheless, I feel like it's worth talking about one of my favourite things we often do with our customers.

And that is to help them identify situations in their data that should not occur, so that they can fix it.

All database professionals should understand the significance of foreign key constraints, making sure that the thing (product, case, offer, location, etc) you're referring to actually exists. Or about check constraints to make sure that a delivery date isn't before an order date.
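
To make that concrete, here's a minimal sketch of both kinds of constraint, using hypothetical Orders and Products tables (the names are mine, purely for illustration):

-- Hypothetical product catalogue, so the foreign key has something to point at
CREATE TABLE dbo.Products (
    ProductID int NOT NULL CONSTRAINT pkProducts PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL
);
GO
CREATE TABLE dbo.Orders (
    OrderID int NOT NULL CONSTRAINT pkOrders PRIMARY KEY,
    -- the product being referred to must actually exist
    ProductID int NOT NULL CONSTRAINT fkOrders_Products REFERENCES dbo.Products (ProductID),
    OrderDate date NOT NULL,
    DeliveryDate date NULL,
    -- a delivery date can't be before the order date
    CONSTRAINT chkDeliveryNotBeforeOrder CHECK (DeliveryDate >= OrderDate)
);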

But the checks that we do are more about things that the database can allow, but are business scenarios that should never happen.

Plenty of businesses seem to recognise these scenarios all too well, and can point them out when they come across them. You hear phrases like "Oh, we know that's not right, it should be XYZ instead". And they become reasons why they don't really trust their data. It's a data quality issue, and every time someone comes across a data quality issue, they trust the data a little less.

Pretty soon this distrust means they become reluctant to automate anything, feeling they need to eyeball the data first. And the paralysis doesn't stop there – it seeps into just about any piece of information about the data. "Are we sure about this?" gets asked about every report.

Data governance can help with this. Approved dictionaries for definitions are a start. Documenting processes is excellent. But you also need to discover which situations cause people not to trust the data, and develop ways to alert when they have occurred. This not only gives an opportunity to fix those situations, but to see that (ideally) they're happening less often. And eventually to develop trust that it's solved.
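
As a sketch of what I mean (the rule and the table names here are hypothetical), an alert can be as simple as a scheduled query that returns rows whenever a should-never-happen scenario has happened:

-- Hypothetical rule: a case should never be marked as Closed while it
-- still has incomplete tasks. The database happily allows it, so we look for it.
SELECT c.CaseID, c.ClosedDate, COUNT(*) AS IncompleteTasks
FROM dbo.Cases AS c
JOIN dbo.Tasks AS t ON t.CaseID = c.CaseID
WHERE c.Status = 'Closed'
AND t.Status <> 'Completed'
GROUP BY c.CaseID, c.ClosedDate;

Any rows coming back mean there's something to fix, and a row count that trends towards zero over time is exactly the kind of evidence that rebuilds trust.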

I've spoken before about the relationship between data quality and trust. That improved data quality can lead to trust. That trust with poor data quality puts you on dangerous ground. Our T-SQL Tuesday host Victoria commented about how data governance includes improving the data quality and developing a more trusted data asset. Data quality can lead to the trust, but only when it has been demonstrated repeatedly over time. Trust must be earned.

When we help our customers discover data quality issues, address those issues, and flag when they occur, they start to develop that trust. That trust can then form a strong foundation for so much more in data.

Never ignore the importance of data governance. Of developing trust that the data quality is strong. Go through the process of documenting everything and tracking everything, but also remember that the goal of data governance should be that trusted data asset.

@rob_farley

Go-to scripts

Actual John McCormack (@actualjohn) asks about our go-to handy short scripts. It's a nice question, because some of the first programs I ever wrote were short and featured the GOTO command.

10 PRINT "Rob was here"
20 GOTO 10
RUN

It was a Commodore machine in a department store. I don't remember if it was a Vic20 or a Commodore 64 – I think it was probably the former, because we got a Commodore 64 at some point later. I had seen another kid do similar, and I guess I learned three commands that day and the principle of line numbers. It wasn't until years later that I understood what Edsger Dijkstra meant.

Of course, this isn't the GOTO script that John was asking about. He wants to know about the tools we use, the scripts we turn to regularly. I had to think about this, about what I find myself typing at client sites, and when, and why.

As a consultant, the kind of work that I do from customer to customer can change a bit. Sometimes I'm reviewing people's environments; sometimes I'm performance tuning; sometimes I'm developing code or reports or cubes; sometimes I'm writing T-SQL, but it's often DAX or PowerShell.

But often, I simply need to figure out where things are being referenced. The "View Dependencies" feature can be handy, and gives me a pretty tree with branches that I can expand, but it just isn't as useful as running:

-- find any module (proc, view, function, trigger, etc) whose definition mentions TheThing
select 
    object_schema_name(object_id)
  , object_name(object_id)
from sys.all_sql_modules
where definition like '%TheThing%'
;

This lets me see the definition if I choose to show it, or do something with the results. Plus if I run it in Azure Data Studio rather than SSMS, I can easily save the results to a file, even as an Excel file.

And yet this still isn't my ideal.

My ideal scenario for this kind of situation is to look through the source control repository, where it'll list the files, show me how often TheThing is mentioned, show me where it is, still be easily selectable, and I don't even have to connect to a database to do it.

And maybe that's the thing… many of my useful scripts get replaced by other systems which are more useful. The ones that don't get replaced tend to be about the data rather than the database, and then it's less about having a go-to script and more about being fluent in the languages, typing quickly, and getting used to the business logic at my customers.

@rob_farley

Infrastructure as code

For T-SQL Tuesday this month we've been asked by Frank Geisler (@FrankGeisler) to write about how we use scripting to ensure consistency in our provisioned Azure platforms.

Of course! This is definitely important. Whenever something needs to happen multiple times in a consistent way, scripts are key to this. But let me wax lyrical for a moment about how this isn't new.

You see, back in the days when all our data was on-prem, we would run through the SQL Server installer, choosing the various options and hitting Next. But the best thing about this installer was that it would create a file called ConfigurationFile.ini, which could be used to perform another installation with the same configuration.

This was critical to being able to spin up a separate server. And because these files remained on the server even if someone had just used the installation wizard, when an additional machine needed to be stood up, we could just grab the ini file and run an unattended installation, confident that we had the same collation settings and everything. (Don't get me started about the kind of pain you'd have by picking the wrong collation in an install.)

If you're reading this and the idea of using scripts is alien to you, please change your habits.

When you see a dialog box in SSMS, use it of course, but then use the Script button and hit Cancel. Apply your change by running the script, ideally through a proper deployment process. This is such a good habit to get into, I often wonder whether we'll one day get a setting in SSMS to disable the "Ok" button on dialogs. I would certainly be enabling such a thing at my clients, to reduce the likelihood of ad hoc (and even accidental) changes.

@rob_farley

Work-life balance

Sometimes I don't feel like I'm the best at work-life balance, but I think it's more complicated than that. My old friend TJay Belt (@tjaybelt) – we've also been friends quite a long time – has invited us all to write about it for this month's T-SQL Tuesday.

As an employer, I'm completely in favour of having my employees finish work at the end of the working day. I have told them off for staying too late before, even though I have a tendency to work late myself. It's one thing for me to make bad decisions, but I never want my employees to feel like I expect them to do so as well.

Let me say that again – I don't want my employees to think I want them to have an unhealthy balance of work and life. Of course I want them to enjoy work, and to love every day even if it's a Tuesday after a long weekend, but that definitely does not equate to working long hours or writing blog posts at 9:30pm on a Monday evening. Ahem.

I don't even want them thinking about work the rest of the time. Because I want their minds to be elsewhere. I want them to be thinking about other things. Enjoying life. Getting the serotonin moving. Solving different kinds of problems. Using their hands maybe. Volunteering. Being creative. And not because those things make them better at their jobs (although I'm sure they do). But because those things make them better people. They help them live longer, healthier lives. And not so they can be still working for me in their 70s and 80s either.

Let me tell you about some things that I do which I think help me.

One is to be part of a business leaders group. I'm in a TEC group, which internationally is known as Vistage. Once a month I spend half a day with a bunch of other people who have their own business, and we talk about various things. Sometimes we have a guest speaker who covers some topic, but every month we talk about what's been significant in the last month, what things we're trying to figure out, how we're going on goals, and importantly, I'm not trying to do data or analytics. I'm not there to solve the problems of my customers, I'm there to be part of a group. To help solve different kinds of problems. To reflect. To share the burden. To be with like-minded people. I put my head into a different place, and it helps.

Another is a hobby I did for a few months a few years ago, but then stopped until earlier this year. And that's stand-up comedy. Stand-up is slightly terrifying, even for someone who's used to getting up in front of audiences. Trying to make people laugh is very different to explaining how an execution plan operator works. I talk about all kinds of things on stage, and I'm not trying to make them understand any new concepts, I'm not trying to inspire them to develop a new skill, I'm not even trying to make them into better people. I'm just trying to make them laugh. To provoke a response where they smile and have a chuckle. It often doesn't work out that way, but I do find that I spend a bunch of time thinking about material, and that's useful for a work-life balance. Because those things that occur in my head when I'm having a cup of tea or I'm in the shower (never both – that's not a thing) are not whispering a possible answer to a business logic thing, but a punchline for a joke I was mulling over a few days earlier.

Both these things are useful to me, and relevant to my quest in achieving a work-life balance because I'm thinking about different things. I can't say that TEC is not about work, because it really is. But it occupies a different part of my head. And comedy feels like hard work, but again it occupies a different part of my head. And thinking about things differently helps.

There are a lot of other things I could write about.

I could write about the time that I spend talking to people and trying to help them with the world. Friends who are going through hard times. Or who are going through good times. It's tremendously useful to me to help people with anything. I typically find that I'm happier with life when I'm helping people, because it feels closer to what I'm meant to be doing.

I could write about the time I spend doing church things, because again, I'm focused on things which are not me. I think many sports are similar, because you're having to give your attention to the particular activity you're doing. It's not routine like driving or running, it's about coordinating what you're doing with everyone else in the team. About being part of something bigger. Volunteering also fits for the same reason.

However I distract myself, I find that where my headspace goes is important to achieving that balance. It's not just about watching a mindless TV show or playing some mindless game (which are both welcome distractions way too often) – the fact that they are mindless makes them less balancing on the work-life scale.

What I want in any distraction is for my mind to be engaged in something different. And ideally, doing something that actually matters like serving something bigger than me. Or writing a joke.

@rob_farley

On containers

Anthony Nocentino (@nocentino) has invited us to write about what we're doing with containers. It's kinda his thing. It's good to have a thing. I got him to speak to the Adelaide User Group about it in February 2020, which turned out to be our last in-person meeting.

(We could switch back to in-person, but the room at Microsoft only holds 15 people under the current restrictions, and I don't think we've ever had attendance that low in the 16 years I've been running it.)

Two years ago, I wrote about how I'm running SQL on Linux using docker, and I do find this to be really useful (because I don't have to spin up a whole VM or make my machine dual-boot as I did back in the 1990s). I've also used containers for various Machine Learning tasks. In both situations, it gives me the flexibility to be doing things without having to use my local operating system. In the last couple of years I've replaced my main laptop, and the impact on my containers has been really small.

I know I'm not getting as much out of my containers as Anthony is. I don't have a large testing suite letting me spin up umpteen versions of SQL. But I know that containers will let me do this as and when I need it. I know that I can use containers to spin up things without their having a lasting impact on my local operating system. And that's really useful.

I encourage you – get used to containers. They're here to stay.

@rob_farley

A hybrid world

This month, Ben Weissman (@bweissman) challenges us to write about hybrid solutions. He says:


It has become pretty evident to me, that we'll be living in a hybrid world for a long time – potentially indefinitely.

Ben Weissman

This is what I want to address in this post.

I like what Ben says here because for a long time it felt like hybrid was only the domain of the partially migrated. For those organisations that had dabbled with the cloud but weren't prepared to commit. A halfway house for people who were on the journey towards being in the cloud, but who had a few things they weren't prepared to give up.

In some ways, I suspect this is actually the rationale behind hybrid solutions like Azure Arc. There are many organisations who like the features that Azure offers, but simply aren't prepared to have their data in the cloud. It's a stance I've seen many times, although with multiple Azure data regions in Australia now, the reluctance towards the cloud is not what it was.

Power BI has also caused organisations to have some amount of data in the cloud even if their data still resides on-site. And of course, this leads to the situation where they push data into cloud storage to make refreshes easier.

I agree with Ben. I think many organisations will continue to have a large amount of their data within their own data centres for a long time yet. As much as I like the idea of everyone moving fully to the cloud, there might always be reasons why that's not practical. And so hybrid solutions will continue to improve. Azure Arc is an excellent example of how to think smarter around hybrid, so that some of the benefits of Azure can be realised without the need to compromise on having data in external data centres.

As was discussed last month, technology keeps moving on, and the features available in a cloud-only world are developing quicker than ever. But hybrid is one of the ways to make sure that people aren't left behind, and I really like the fact that hybrid is such a strong story.

@rob_farley

Ever-changing technology

This month, Andy Leonard (@andyleonard) describes a situation where a technology change made his book inaccurate between its completion and when it was published. Ouch – that must've been a pain. He asks the blogosphere how we respond to changes in technology.

My answer is acceptance.

It happens. Technology moves on, and generally quicker than we can keep up with. And so I don't try to pretend that I know everything. I like to think I know a lot, and that I probably keep up better than someone who doesn't do what I do – but I'm happy to admit that there are things I'm not across. Particularly if they're new changes.

I'm sure if I were a doctor, I'd need to be across a lot of things very thoroughly. But I also suspect that's why doctors tend to prescribe the same meds over and over – because it means they have fewer advisories to be across. If my doctor gives me some antibiotics, I hope they're reading journals about them and have considered whether they're right or not. I care slightly less about whether there might be a better alternative, but I do want to know that the one I'm being given is okay.

And so it is with tech. If I'm teaching someone about how to do something, and there's a new option that's available in the latest release (looking at you, Power BI and DAX, with your frequent updates – I love them, but every so often I miss a detail), then I might look foolish for a moment. But that foolishness is probably short-lived, because I'm happy to admit that I didn't know. I learn, as do my students, and we're all better for it.

There are times when I feel like I've missed quite a lot of things. Once upon a time, data was a relatively small area of expertise, and I felt like I had a good grasp on just about every aspect of the SQL Server product. Except that the more I looked into it, the more I realised there were areas in which I couldn't call myself an expert at all. As time goes on, I know I could spend all my time learning if I wanted. So instead, I spend most of my time doing, and only part of my time learning.

And I just accept that technology will change in the moments when I'm not looking.

@rob_farley

Invoke-SqlNotebook

My old friend Aaron Nelson (@sqlvariant) told me about this back in the middle of last year – the concept of executing a Jupyter Notebook (.ipynb file) against a SQL instance and storing a copy of it (complete with the results of the various queries) in a separate file. Aaron and I have had a lot of conversations about SQL and PowerShell over the years, and Invoke-SqlNotebook fits nicely.

I don't think I was particularly convinced right away. For a start, not many of my customers were using Azure Data Studio, which is probably the easiest way to look at Notebook files. But mostly, I didn't think I really had a gap to fill.

The concept is that if I have a notebook with a bunch of queries in it, I can easily call that using Invoke-SqlNotebook, and get the results of the queries stored in an easily-viewable file. But I can also just call Invoke-SqlCmd and get the results stored. Or I can create an RDL and have a subscription email me the results. And I wasn't sure I needed another method for running something.

But the materialised notebook fits into a useful spot in between these. It gets created in an environment that is convenient for running queries, which is harder to argue for RDLs. And it contains multiple queries complete with nicely-formatted documentation, where the results of queries are displayed in amongst the text. SQL files created in SSMS don't fit that description.

So these days one useful option that's open to me is to schedule a call to Invoke-SqlNotebook, telling it to materialise a notebook with a bunch of useful queries, so that it can be stored somewhere in a file system and easily opened with the results. I can then give that to just about anyone to digest, because the combination of documentation and results can easily be made understandable and useful.

If there are actions to be taken based on the results of queries, I'm still going to use PowerShell to run the queries, and then take further actions based on the results. But when I'm just wanting to create something that shows data and explains what it is – Invoke-SqlNotebook does the job. Well done Aaron.

This post fits into Steve Jones (@way0utwest)'s T-SQL Tuesday this month. For more on notebooks, go to Steve's post and read some of the other posts linked to via the comments.

@rob_farley

datetimeoffset Uniqueness

Datetimeoffset is an interesting data type, because there are multiple ways to represent the same value. And I don't mean like using a string to store "1" and "1.0", I mean the fact that I might have inserted a value in one time zone, but need to query it in a different one.

CREATE TABLE dbo.TimesOffsets (TimeAndDateAndOffset datetimeoffset, CONSTRAINT pkTimesOffset PRIMARY KEY(TimeAndDateAndOffset))
GO

INSERT dbo.TimesOffsets(TimeAndDateAndOffset) 
VALUES ('20210101 00:00 +10:30');
GO

SELECT TimeAndDateAndOffset
FROM dbo.TimesOffsets
WHERE TimeAndDateAndOffset = '20210101 00:30 +11:00';
Insert in one time zone, query in another

Notice that I inserted a value in the time zone UTC+10:30, and queried it back in UTC+11, and the value that it gave me in the results is in UTC+10:30. Clearly the system knows that the value in the table was stored in UTC+10:30, and would store something different if I had told it to insert the value in UTC+11 (feel free to try it and see that it'll return the same value that you inserted, whichever time zone your WHERE clause value is in).
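
If you do want to try it, a quick sketch of that experiment might look like this (using a separate throwaway table, so that the primary key above doesn't get involved yet):

-- Hypothetical scratch table: same column, but no primary key
CREATE TABLE dbo.TimesOffsets2 (TimeAndDateAndOffset datetimeoffset);
GO
-- Store the same moment, but expressed in UTC+11:00 this time
INSERT dbo.TimesOffsets2 (TimeAndDateAndOffset)
VALUES ('20210101 00:30 +11:00');
GO
-- Query it back using the UTC+10:30 representation of that moment.
-- The row is found, and the value comes back as it was stored: in UTC+11:00.
SELECT TimeAndDateAndOffset
FROM dbo.TimesOffsets2
WHERE TimeAndDateAndOffset = '20210101 00:00 +10:30';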

Sort order understands this too. If I insert other rows, it knows what order they are actually in, regardless of where in the world those times happen.

As weird as it looks, this is in chronological order

And as I have a unique index on this, it won't let me insert 00:30 in UTC+11, because 00:00 in UTC+10:30 is already there. It tells me "Msg 2627, Level 14, State 1, Line 19. Violation of PRIMARY KEY constraint 'pkTimesOffset'. Cannot insert duplicate key in object 'dbo.TimesOffsets'. The duplicate key value is (2021-01-01 00:30:00.0000000 +11:00)."

Errors if I try to insert the same time but translated to a different time zone

I'm okay with this. They are definitely the same value, even if they are written differently. The same thing happens with strings in different collations, because they sort to the same value. Müller and Mueller will clash in a table with a German PhoneBook collation (and yes, that's what the German collations are called, complete with a capital 'B' in the middle of 'PhoneBook').

DROP TABLE IF EXISTS dbo.Names;
GO
CREATE TABLE dbo.Names (LastName varchar(100) COLLATE German_PhoneBook_CI_AI, CONSTRAINT pkGermanNames PRIMARY KEY(LastName));
GO
INSERT dbo.Names (LastName) VALUES ('Müller');
GO
INSERT dbo.Names (LastName) VALUES ('Mueller');
--Fails

So you see, the time zone thing isn't about whether they are actually identical or not. It's about whether they are considered the same based on the things that the SQL engine uses to consider them equal. When you're dealing with data that is case-insensitive, it will consider that 'FARLEY' and 'Farley' are the same, and complain if you try to have both those values in a unique index.

And for a fun trick, let's consider what happens with GROUP BY or DISTINCT.

For this, I'm going to first recreate those tables with non-unique indexes. Still using indexes, just non-unique ones.

DROP TABLE IF EXISTS dbo.TimesOffsets;
GO
CREATE TABLE dbo.TimesOffsets (TimeAndDateAndOffset datetimeoffset, INDEX cixTDO CLUSTERED (TimeAndDateAndOffset))
GO
INSERT dbo.TimesOffsets(TimeAndDateAndOffset) 
VALUES ('20210101 00:00 +10:30');
GO
INSERT dbo.TimesOffsets(TimeAndDateAndOffset) 
VALUES ('20210101 00:30 +11:00');
GO
SELECT DISTINCT TimeAndDateAndOffset
FROM dbo.TimesOffsets
ORDER BY TimeAndDateAndOffset ASC
GO
SELECT DISTINCT TimeAndDateAndOffset
FROM dbo.TimesOffsets
ORDER BY TimeAndDateAndOffset DESC

And the results show different values according to whether we sort the data ascending or descending.

Different order, different results

All those times someone told you the order of the data didn't affect what the results were, just what order they were in… yeah.

And if I run the script again but insert the rows in the opposite order…

Same, but different

You see, the rows are in the table in the order they were inserted. And when we are moving through the index either forwards or backwards, we're coming across values, and the version of the value we get is whichever one gets seen first. If it moves through it forwards, it'll (probably) come across the first-inserted row first. If it moves through it backwards, it'll (probably) come across the last row first. I say 'probably', because there are times that it won't quite work like this. Parallelism could make this different. There may have been index rebuilds, or page splits, or the execution plan might decide to run things differently.

But the idea is there – those two values are the same, even though they're not.

@rob_farley

This post fits into T-SQL Tuesday, hosted this month by Brent Ozar (@brento) on the topic of data types.

Beware the width of the covering range

I'm sure you've heard by now that casting a datetime to a date is still SARGable. Like, that if you have an index on a datetime column (such as DateTimeCol in dbo.SomeTable), then you will get a seek even if you use a predicate like "WHERE CAST(DateTimeCol as date) = @DateParam". Sargability is all about whether your predicates can be used for a Seek, and it's something I've been preaching about for a very long time.

The trouble with this is that casting a datetime column to a date isn't actually a Sargable predicate. It feels like one, but it's not. Either way, I thought I'd write about this for T-SQL Tuesday, which is hosted this month by Brent Ozar (@brento). He asks us to write about our favourite data types, and I figure that talking about the pseudo-sargability of datetime-date conversion fits. (It was either that or write about how a unique index on datetimeoffset complains that '20210101 00:00 +10:30' and '20210101 00:30 +11:00' are the same value, even though they are clearly a little different. Maybe I'll do both… Edit: I did.)

Sargability is about the fact that an index is built on the values in a column; if you're actually searching on something else, then the index doesn't work. So converting a column from one datatype to another doesn't cut it.

It doesn't even work to add zero to an integer value. That's enough to confuse the Query Optimizer into thinking that an index might not be right (and this is still the case if we use OPTION (RECOMPILE) so that it doesn't try to parameterise the query). It's easy for us to see that the order of things in the index isn't changed by adding a constant value, but the Query Optimizer doesn't look for this. I had a Connect item open for years about this.

The operation on the index is a Scan when I add a "plus zero" to the predicate
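
In case you want to see that for yourself, a minimal sketch of the kind of query I mean would be something like this (dbo.SomeTable and IntCol are hypothetical; any indexed int column will do):

-- Sargable: the predicate can be used as a Seek Predicate
SELECT IntCol
FROM dbo.SomeTable
WHERE IntCol = 42;
GO
-- Adding zero doesn't change the order of anything, but the Query Optimizer
-- doesn't look for that, so the index gets scanned instead
-- (even with OPTION (RECOMPILE) to stop the constant being parameterised)
SELECT IntCol
FROM dbo.SomeTable
WHERE IntCol + 0 = 42
OPTION (RECOMPILE);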

So what's special about converting from datetime to date? Well, it's about being able to add a helper predicate to the mix. (This is also how LIKE works against strings.) The Query Optimizer knows that all the values must be in a certain range, so that it can seek to that range. It works out this range using an internal function called GetRangeThroughConvert. It does a good job on this for LIKE, but doesn't do such a good job with dates.

Let me show you.

First I'm going to create a table called Times and put 100,000 rows in it, one per minute from the start of the year until roughly now – the 100,000th minute actually lands a couple of days from now, mid-morning on March 11th.

DROP TABLE IF EXISTS dbo.Times;
GO

CREATE TABLE dbo.Times (TimeAndDate datetime, CONSTRAINT pkTimes PRIMARY KEY (TimeAndDate));
GO

INSERT dbo.Times (TimeAndDate)
SELECT TOP (100000) DATEADD(minute,ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1,'20210101') 
FROM master..spt_values t1, master..spt_values t2;
GO

Now when I query this table to find all the rows from today, I see an Index Seek and can be happy. "Obviously" this is sargable.
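
A sketch of the kind of query I mean, using March 9th so that it lines up with the explicit range a little further down:

SELECT TimeAndDate
FROM dbo.Times
WHERE CAST(TimeAndDate AS date) = '20210309';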

Looks like we've got ourselves some sargability!

But it's not really. The properties of that Clustered Index Seek show that our WHERE clause is being used as the Residual Predicate, and that our Seek Predicate is based on the Expressions created in that Compute Scalar operator.

The properties show there's something else going on

Now, you'll notice that the operators used in that Seek predicate are > and <. Greater-than and less-than. Not Greater-than-or-equal and less-than. This is a range which is exclusive at both ends. And this is interesting because if I want to write out a date range myself, I do it using an inclusive start and an exclusive end, like:

SELECT TimeAndDate
FROM dbo.Times
WHERE TimeAndDate >= '20210309'
AND TimeAndDate < '20210310'
;
So I got to thinking about this. I wasn't quite curious enough to pull up the debugger to try to see what the values were, but I did do some investigation.

Notice in that tooltip I showed earlier, that the "Actual Number of Rows for All Executions" property says 1440. That's the number of minutes in a day, so that makes sense. But the "Number of Rows Read" property is 2879. This means that the Seek Predicate is returning 2879 rows, and then the Residual Predicate (shown as just "Predicate") is filtering these down to the 1440 that actually produce the correct value when converted to the date type.

2879 is one less than the number of minutes in two days. I figure the "one less" bit is because it's an exclusive-exclusive range. But still it's a range which is twice as big. It's not "a little bit bigger", it's twice as big as it needs to be. To compare this with LIKE:

When LIKE knows what values are being used, it doesn't use those Compute Scalar & Constant Scan operators, and we see helper predicates that do an inclusive-exclusive range:

Inclusive-exclusive range

But when we use a parameter, we have the expressions and an exclusive-exclusive range, but the same number of rows read as returned.

Exclusive-exclusive range, but still the right number of rows read

The Query Optimizer clearly understands how to do the range for LIKE.

But with the date conversion, twice as big.

Back to what that time period range is...

To figure this out, I started deleting rows from my table. I figured I'll be able to see the range by when the Number of Rows Read value starts dropping.

First I deleted rows from mid-morning March 10th on:

DELETE 
FROM dbo.Times
WHERE TimeAndDate >= '20210310 10:00';

And the size of the range didn't change.

I removed all the datetime values from the 10th and the range didn't change. And all the datetime values from before the 8th. Only when I deleted '20210308 00:01' did the range start shrinking.
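
Roughly, that sequence of deletes looked like this (a sketch reconstructed from the description above, not the exact statements):

-- the rest of March 10th: Number of Rows Read didn't change
DELETE FROM dbo.Times WHERE TimeAndDate >= '20210310';
GO
-- everything before March 8th: still no change
DELETE FROM dbo.Times WHERE TimeAndDate < '20210308';
GO
-- but removing the first minute past midnight on the 8th did change it
DELETE FROM dbo.Times WHERE TimeAndDate = '20210308 00:01';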

Without 00:01 on the 8th, my range to find values from the 9th was smaller.

So the range of my helper predicate hits the day before my date as well as the day I actually want. TWO days. Twice as big.

It certainly sounds like I'd be better off running my query using my own range predicate, typing out my inclusive-exclusive range instead of relying on the supposed sargability of casting as date. When I write this out, my plan becomes a lot more simple-looking, because it doesn't have to work out the range, but the plans tell me the cost is greater!

Looks simpler, but apparently the cost is greater?

It's lying though. It thinks the first query is cheaper because it does a bad job of estimating. It thinks there's only one row being produced when there are actually way more.

It estimates badly to make us think it's cheaper.

To try to even things up, let's try using a parameter for the date in both queries.
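
The two parameterised versions would be along these lines (a sketch, using a local variable to stand in for the parameter):

DECLARE @DateParam date = '20210309';

-- Version 1: cast the column to date and compare with the parameter
SELECT TimeAndDate
FROM dbo.Times
WHERE CAST(TimeAndDate AS date) = @DateParam;

-- Version 2: spell out the inclusive-exclusive range myself
SELECT TimeAndDate
FROM dbo.Times
WHERE TimeAndDate >= @DateParam
AND TimeAndDate < DATEADD(day, 1, @DateParam);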

And whoa! It's even worse. Now when we do the range ourselves, it thinks it's 93% of the batch, and the simplicity of the plan has gone out the window.

Ugliness prevails

And yet this "more expensive" plan is actually better. The Index Seek looks through 1440 rows to find the 1440 rows, while the CAST query looks through 2879 rows. The estimate on this second version says that over 16,000 rows might be produced (which is what causes it to be 93% of the batch). But if I'm going to be passing 1440 rows through a query, I'd rather guess 16,000 than guess just 1.

The conclusion of all this is to point out that although we often refer casually to the idea that casting a datetime column to date is sargable, it's actually not quite, and you're better off handling your ranges yourself.

@rob_farley