SQL Spatial: Getting “nearest” calculations working properly

August 14, 2014

If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.

You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!

You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.

CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation);

1	CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation);

And my actual query:

WITH MyLocations AS        
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),         
('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))         
) t (Name, Geo))         
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country         
FROM MyLocations AS l         
CROSS APPLY (         
SELECT TOP (1) *         
FROM Person.Address AS ad         
ORDER BY l.Geo.STDistance(ad.SpatialLocation)         
) AS a         
JOIN Person.StateProvince AS s         
ON s.StateProvinceID = a.StateProvinceID         
JOIN Person.CountryRegion AS c         
ON c.CountryRegionCode = s.CountryRegionCode         
;

WITH MyLocations AS

(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),

('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))

) t (Name, Geo))

SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country

FROM MyLocations AS l

CROSS APPLY (

SELECT TOP (1) *

FROM Person.Address AS ad

ORDER BY l.Geo.STDistance(ad.SpatialLocation)

) AS a

JOIN Person.StateProvince AS s

ON s.StateProvinceID = a.StateProvinceID

JOIN Person.CountryRegion AS c

ON c.CountryRegionCode = s.CountryRegionCode

;

Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.

But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.

This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.

The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.

Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.

I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.

But I can persuade it with hints!

If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:

It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.

The way it works is to hook into the GeodeticTessellation function, which is essentially finds where the point is, and works out through the spatial index cells that surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. One of those times when we have a much more complex-looking plan, but just because of the good that’s going on.

This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.

When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:

Msg 8622, Level 16, State 1, Line 1        
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

1 2	Msg 8622, Level 16, State 1, Line 1 Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.

WITH MyLocations AS        
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),         
('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))         
) t (Name, Geo))         
SELECT         
l.Name,         
COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1),         
COALESCE(a1.City,a2.City,a3.City),         
s.Name AS [State],         
c.Name AS Country         
FROM MyLocations AS l         
OUTER APPLY (         
SELECT TOP (1) *         
FROM Person.Address AS ad         
WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000         
ORDER BY l.Geo.STDistance(ad.SpatialLocation)         
) AS a1         
OUTER APPLY (         
SELECT TOP (1) *         
FROM Person.Address AS ad         
WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000         
AND a1.AddressID IS NULL         
ORDER BY l.Geo.STDistance(ad.SpatialLocation)         
) AS a2         
OUTER APPLY (         
SELECT TOP (1) *         
FROM Person.Address AS ad         
WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000         
AND a2.AddressID IS NULL         
ORDER BY l.Geo.STDistance(ad.SpatialLocation)         
) AS a3         
JOIN Person.StateProvince AS s         
ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID)         
JOIN Person.CountryRegion AS c         
ON c.CountryRegionCode = s.CountryRegionCode         
;

WITH MyLocations AS

(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),

('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))

) t (Name, Geo))

SELECT

l.Name,

COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1),

COALESCE(a1.City,a2.City,a3.City),

s.Name AS [State],

c.Name AS Country

FROM MyLocations AS l

OUTER APPLY (

SELECT TOP (1) *

FROM Person.Address AS ad

WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000

ORDER BY l.Geo.STDistance(ad.SpatialLocation)

) AS a1

OUTER APPLY (

SELECT TOP (1) *

FROM Person.Address AS ad

WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000

AND a1.AddressID IS NULL

ORDER BY l.Geo.STDistance(ad.SpatialLocation)

) AS a2

OUTER APPLY (

SELECT TOP (1) *

FROM Person.Address AS ad

WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000

AND a2.AddressID IS NULL

ORDER BY l.Geo.STDistance(ad.SpatialLocation)

) AS a3

JOIN Person.StateProvince AS s

ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID)

JOIN Person.CountryRegion AS c

ON c.CountryRegionCode = s.CountryRegionCode

;

But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.

It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query…

WITH MyLocations AS        
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),         
('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))         
) t (Name, Geo))         
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country         
FROM MyLocations AS l         
CROSS APPLY (         
SELECT TOP (1) *         
FROM Person.Address AS ad         
ORDER BY l.Geo.STDistance(ad.SpatialLocation)         
) AS a         
JOIN Person.StateProvince AS s         
ON s.StateProvinceID = a.StateProvinceID         
JOIN Person.CountryRegion AS c         
ON c.CountryRegionCode = s.CountryRegionCode         
;

WITH MyLocations AS

(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),

('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))

) t (Name, Geo))

SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country

FROM MyLocations AS l

CROSS APPLY (

SELECT TOP (1) *

FROM Person.Address AS ad

ORDER BY l.Geo.STDistance(ad.SpatialLocation)

) AS a

JOIN Person.StateProvince AS s

ON s.StateProvinceID = a.StateProvinceID

JOIN Person.CountryRegion AS c

ON c.CountryRegionCode = s.CountryRegionCode

;

Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.

The following requirements must be met for a Nearest Neighbor query to use a spatial index:    <ol>     <li>       
A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.     </li>      <li>       
The TOP clause cannot contain a PERCENT statement.     </li>      <li>       
The WHERE clause must contain a STDistance() method.     </li>      <li>       
If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.     </li>      <li>       
The first expression in the ORDER BY clause must use the STDistance() method.     </li>      <li>       
Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.     </li>      <li>       
All the rows for which STDistance returns NULL must be filtered out.     </li>   </ol>

The following requirements must be met for a Nearest Neighbor query to use a spatial index: <ol> <li>

A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses. </li> <li>

The TOP clause cannot contain a PERCENT statement. </li> <li>

The WHERE clause must contain a STDistance() method. </li> <li>

If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause. </li> <li>

The first expression in the ORDER BY clause must use the STDistance() method. </li> <li>

Sort order for the first STDistance() expression in the ORDER BY clause must be ASC. </li> <li>

All the rows for which STDistance returns NULL must be filtered out. </li> </ol>

Let’s start from the top.

1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.

2. No ‘PERCENT’. Yeah, I don’t have that.

3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.

4. Yeah, I don’t have multiple predicates.

5. The first expression in the ORDER BY is my distance, that’s fine.

6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.

7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.

…but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…

…then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.

It just wasn’t overly intuitive, despite being documented.

@rob_farley

This Post Has 3 Comments

Milan Stojic, SQL Engineer August 20, 2014 Reply

Hi Rob, we tried to make it as intuitive as possible, without trying to explain all constraints and requirements in order to have correct behavior in all cases.
NULL values in SQL Server are on top with ORDER BY clause and that would ruin your TOP k nearest neighbor results but more importantly would kill performance benefit of using NN query plan. As you mentioned NULLs can be produced by STDistance if the two objects have different SRID. Unfortunately SQL Server does not support ORDER BY NULLS LAST and we don’t keep index on SRID value to know if you are using multiple SRIDs. We would need a table scan to figure it out. That’s why the easiest solution is to require proper query semantic to make sure that NULL values for STDistance are filtered out.
I’d like to hear if there are any suggestions on how to make docs more intuitive.
Rob Farley August 21, 2014 Reply

Hi Milan! Thanks for responding. We should talk about PDW more too…
Anyway – I think the docs aren’t bad, but the keys are the bits around points 3 and 7, plus the fact that the example doesn’t actually use the indexes on my machine!
So I wouldn’t mind an explicit explanation in the page saying "The query must include an explicit predicate to eliminate the possibility of NULL results which could occur in certain situations even if the columns disallow NULL values. For further information, see the pages regarding STDistance(), etc"
This kind of statement is going to mean that I don’t just dismiss the requirement as needless in my situation.
As for the example not working, a query hint of OPTION (MAXDOP 1) could help.
Of course, as a fan of execution plans, it would be nice to show what people should be looking for as evidence that the index is being used effectively…
…or maybe just provide a link to this post to help people. 😉
chris December 9, 2015 Reply

hey rob – great post. any suggestions on where to get a list of lat/long for intersections in NYC?