SQL Spatial: Getting “nearest” calculations working properly

August 14, 2014

If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.

You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!

You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.

And my actual query:

image

Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.

But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.

image

This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.

The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.

Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.

image

I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.

But I can persuade it with hints!

If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:

image

It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.

The way it works is to hook into the GeodeticTessellation function, which is essentially finds where the point is, and works out through the spatial index cells that surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. One of those times when we have a much more complex-looking plan, but just because of the good that’s going on.

This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.

When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:

And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.

But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.

It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query…

Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.

Let’s start from the top.

1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.

2. No ‘PERCENT’. Yeah, I don’t have that.

3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.

4. Yeah, I don’t have multiple predicates.

5. The first expression in the ORDER BY is my distance, that’s fine.

6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.

7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.

…but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…

…then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.

image

It just wasn’t overly intuitive, despite being documented.

@rob_farley

This Post Has 3 Comments

  1. Milan Stojic, SQL Engineer

    Hi Rob, we tried to make it as intuitive as possible, without trying to explain all constraints and requirements in order to have correct behavior in all cases.
    NULL values in SQL Server are on top with ORDER BY clause and that would ruin your TOP k nearest neighbor results but more importantly would kill performance benefit of using NN query plan. As you mentioned NULLs can be produced by STDistance if the two objects have different SRID. Unfortunately SQL Server does not support ORDER BY NULLS LAST and we don’t keep index on SRID value to know if you are using multiple SRIDs. We would need a table scan to figure it out. That’s why the easiest solution is to require proper query semantic to make sure that NULL values for STDistance are filtered out.
    I’d like to hear if there are any suggestions on how to make docs more intuitive.

  2. Rob Farley

    Hi Milan! Thanks for responding. We should talk about PDW more too…
    Anyway – I think the docs aren’t bad, but the keys are the bits around points 3 and 7, plus the fact that the example doesn’t actually use the indexes on my machine!
    So I wouldn’t mind an explicit explanation in the page saying "The query must include an explicit predicate to eliminate the possibility of NULL results which could occur in certain situations even if the columns disallow NULL values. For further information, see the pages regarding STDistance(), etc"
    This kind of statement is going to mean that I don’t just dismiss the requirement as needless in my situation.
    As for the example not working, a query hint of OPTION (MAXDOP 1) could help.
    Of course, as a fan of execution plans, it would be nice to show what people should be looking for as evidence that the index is being used effectively…
    …or maybe just provide a link to this post to help people. 😉

  3. chris

    hey rob – great post. any suggestions on where to get a list of lat/long for intersections in NYC?

Leave a Reply

LobsterPot Blogs

Blog posts by Rob Farley and other LobsterPot Solutions team members.

Search

Related Blogs