"It just runs slow these days"
I'm sure you've heard this, or even said it, about a computer that's a few years old. We remember the days when the computer was new, and it seemed to just fly – but that was then, and this is now. Change happens, things erode, and become slower. Cars, people, computers. I can accept that cars get slower. They lose horsepower over time as the precision components wear and become less precise. I also know that my youth is a thing of the past. But electronics? What happens there?
Well, in my experience, computers don't get slower. They just feel slower. I see two main reasons, and neither of them is ageing hardware.
Your computer might be slower than it was yesterday even. In the world of databases we might even be investigating why the computer is slower than it was five minutes ago. Again, it's probably not because of ageing hardware.
One possible reason is that we're simply asking systems to do more. If we're comparing our laptops to when we bought them, we're probably refreshing webpages more frequently (often in the background) and have installed too many utilities (hopefully not in the background, but you never know), and the system has more to get done in a given minute compared to when it was new. With a database server, the amount of data has probably grown, there may be more VLFs in the log file to deal with, and more users pushing more transactions. These are not things you want to uninstall like that annoying browser search bar on your aunt's ageing computer, but they can be a very valid reason for things to be slower. Hopefully you are tuning your system to make sure that scalability is possible, and you're very happy with the amount of extra work that's being done, even if it does mean that some processes take a little longer than they once did.
This problem can be summarised by the fact that the system is having to wait for resources to be free so that it can get its stuff done.
Another reason for slowness is that the system is having to wait more for other reasons, things that you don't want it having to wait for. An increase in busyness will cause slowness because of waiting, but you can easily make the argument that this is 'acceptable'. It's much more of a problem if the system is being slower without actually achieving any more than it was before.
In the SQL Server world, this kind of problem is identified by looking at wait stats. The system records what processes are waiting for, and you can see these by querying sys.dm_os_wait_stats. It's very useful, but querying it in isolation isn't as useful as taking snapshots of it. If you want to store copies of it over time, you may prefer to do something along the lines of:
--A schema for monitoring data can be useful
create schema monitoring;
go
--Create a table that has the structure of sys.dm_os_wait_stats
select top (0) *
into monitoring.waits
from sys.dm_os_wait_stats;
go
--Add a column to know when the stats were collected
alter table monitoring.waits
add snapshot_time datetime default sysdatetime();
go
--Run this section regularly
insert monitoring.waits (wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
select * from sys.dm_os_wait_stats;
Regularly collecting snapshots of sys.dm_os_wait_stats like this can give you a picture of what has occurred over time. You can easily pull this data into a report, or into Excel, or even see what happened in the most recent interval, using a query such as:
with numbered as (
    select wait_type, snapshot_time,
        wait_time_ms - lead(wait_time_ms) over (partition by wait_type order by snapshot_time desc) as diff_wait_time,
        waiting_tasks_count - lead(waiting_tasks_count) over (partition by wait_type order by snapshot_time desc) as diff_wait_count,
        1000 * datediff(second,lead(snapshot_time) over (partition by wait_type order by snapshot_time desc),snapshot_time) as diff_ms,
        row_number() over (partition by wait_type order by snapshot_time desc) as rownum
    from monitoring.waits
)
select wait_type, snapshot_time, diff_wait_count, diff_wait_time, diff_ms
from numbered
where rownum = 1
order by diff_wait_time desc, wait_type;
This query compares each wait type's cumulative wait time (the type is, frustratingly, stored as a string) with the value in the previous snapshot, using the LEAD function that was introduced in SQL Server 2012. It's LEAD rather than LAG because we're ordering by snapshot_time desc, not asc, so the 'next' row is the earlier snapshot. Using ROW_NUMBER(), we can easily pick out the latest snapshot by filtering to rownum = 1, but if you're just wanting to chart them, the contents of the CTE will be enough.
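For charting, something like the following might be a starting point. It's only a sketch: it assumes the monitoring.waits table described earlier, the wait type in the filter is just an example, and the avg_wait_ms column is my own addition rather than anything the DMV provides. Because it orders by snapshot_time ascending, it uses LAG where the earlier query used LEAD.

```sql
-- Sketch: per-snapshot deltas for a single wait type, oldest first, ready to chart.
-- Assumes monitoring.waits exists; 'PAGEIOLATCH_SH' is only an example wait type.
with deltas as (
    select wait_type, snapshot_time,
        wait_time_ms - lag(wait_time_ms) over (partition by wait_type order by snapshot_time) as diff_wait_time,
        waiting_tasks_count - lag(waiting_tasks_count) over (partition by wait_type order by snapshot_time) as diff_wait_count
    from monitoring.waits
    where wait_type = 'PAGEIOLATCH_SH'
)
select snapshot_time, diff_wait_count, diff_wait_time,
    -- Average wait per task in the interval; nullif avoids a divide-by-zero
    1.0 * diff_wait_time / nullif(diff_wait_count, 0) as avg_wait_ms
from deltas
where diff_wait_time is not null   -- the first snapshot has nothing to compare against
order by snapshot_time;
```

Filtering on wait_type inside the CTE is safe here because the window functions partition by wait_type anyway, so the deltas come out the same as they would without the filter.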
Make sure you keep an eye on the amount of data you're storing, of course, and be careful of the impact of someone inadvertently clearing the stats (though as the query picks up deltas, you should be able to consider a filter that will ignore the deltas that might have spanned a period during which the stats were cleared).
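One way that filter could look is sketched below. It relies on the assumption that a reset shows up as a counter going backwards, so the delta is negative. That won't catch a reset followed by enough waiting to exceed the previous total, although with frequent snapshots that should be rare. The 30-day retention period in the housekeeping statement is an arbitrary figure for illustration.

```sql
-- Sketch: deltas with a guard against snapshots that span a stats reset.
-- Assumes monitoring.waits exists as described earlier.
with deltas as (
    select wait_type, snapshot_time,
        wait_time_ms - lead(wait_time_ms) over (partition by wait_type order by snapshot_time desc) as diff_wait_time,
        waiting_tasks_count - lead(waiting_tasks_count) over (partition by wait_type order by snapshot_time desc) as diff_wait_count
    from monitoring.waits
)
select wait_type, snapshot_time, diff_wait_count, diff_wait_time
from deltas
where diff_wait_time >= 0    -- a negative delta means the counter went backwards
  and diff_wait_count >= 0   -- (this also drops the null delta on the oldest snapshot)
order by snapshot_time desc, diff_wait_time desc;

-- Housekeeping: remove old snapshots. 30 days is an assumed retention period.
delete monitoring.waits
where snapshot_time < dateadd(day, -30, sysdatetime());
```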
This post is not going to go into all the different wait types to tell you which ones are worth worrying about and which ones are worth ignoring. But what I would suggest to you is that you track what's going on with your environment and keep an eye out for things that seem unusual. When troubleshooting, you will find any history invaluable.