The BigData Legacy

Trends come along, and trends pass. Some hang around for quite a while, and then move on, and some seem to disappear quickly. Often we’re glad that they’ve gone, but we still bear scars. We live and work differently because they were there. In the world of IT, I feel like this is all too common.

When ORMs became trendy, people were saying that writing T-SQL would be a thing of the past. LINQ was another way the developer community was reassured that writing database queries would never again be needed. The trend of avoiding T-SQL through ORMs has hung around a bit, and many developers have recognised that ORMs don't necessarily create the best database experiences.

And yet when we consider what's happening with Azure SQL Data Warehouse (SQL DW), you find yourself querying the data through an interface. Sure, that interface looks like another database, but it's not where the data is (because the data is in the 60 databases that live in the back end), and it has to translate our query into a series of other queries that actually run. And we're fine with this. I don't hear anyone complaining about the queries that appear in SQL DW's explain plans.
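
If you want to see that for yourself, SQL DW will happily show you what it plans to do. A quick sketch, against a hypothetical table:

```sql
-- EXPLAIN in Azure SQL DW returns the distributed plan, including the queries it
-- will run against the underlying distributions (dbo.Sales is a made-up table).
EXPLAIN
SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY CustomerID;
```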

When CLR came in, people said it was a T-SQL killer. I remember a colleague of mine telling me that he didn’t need to learn T-SQL, because CLR meant that he would be able to do it all in .Net. Over time, we’ve learned that CLR is excellent for all kinds of things, but it’s by no means a T-SQL killer. It’s excellent for a number of reasons – CLR stored procedures or functions have been great for things like string splitting and regular expressions – and we’ve learned its place now.

I don’t hear people talking about NoSQL like they once did, and it’s been folded somehow into BigData, but even that seems to have lost a little of its lustre from a year or two ago when it felt like it was ‘all the rage’. And yet we still have data which is “Big”. I don’t mean large, necessarily, just data that satisfies one of the three Vs – volume, velocity, variety.

Of these Vs, Volume has always felt like a bit of a misnomer. Everyone thinks what they have is big, but compared to others, it probably isn't actually that big. Generally, if people are thinking "BigData" just because they think their data is big, then they need a reality check, and then they can deal with it like all their regular data.

Velocity is interesting. If your system can’t respond to things quickly enough, then perhaps pushing your data through something like Stream Analytics could be reasonable, to pick up the alert conditions. But if your data is flowing through to a relational database, then is it really “BigData”?
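
Just as a sketch of what I mean – a Stream Analytics query to pick up an alert condition might look something like this (it's the Stream Analytics query language rather than T-SQL, and the input, output, and column names here are all placeholders):

```sql
-- Raise an alert row whenever a device's average temperature over a 30-second
-- window goes above 80. SensorInput and AlertOutput are placeholder names for
-- the configured input and output.
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemp,
    System.Timestamp() AS WindowEnd
INTO AlertOutput
FROM SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
HAVING AVG(Temperature) > 80;
```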

And then we have Variety. This is about whether your data is structured or not. I'm going to suggest that your data probably is structured – and BigData solutions wouldn't disagree with this. It's just that you might not want to define the structure when the data first arrives. To get data into a structured environment (such as a database table), types need to be tested, the data needs to be converted appropriately, and if you don't have enough control over the data that's coming in, the potential for something to break is high. Deferring that sorting-out until you need to query the data back again gives you a larger window to deal with it.

So this is where I think BigData is leaving its legacy – in the ability to accept data even if it doesn't exactly fit the structure you have. I know plenty of systems that will break if the data arriving is in the wrong structure, which makes change and adaptability hard to achieve. A BigData solution can help mitigate that risk. Of course, there's a price to pay, but for those times when the structure changes more often than you'd like, BigData's ideology can definitely help.

We see this through the adoption of JSON within SQL Server, which is even less structured than XML. We see PolyBase's external tables define structure separately from the collection of the data. Concepts that were learned outside the relational world have now become part of our relational databases.
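
As a small sketch of that schema-on-read idea (the JSON document and column names here are made up): the raw document is accepted as-is, and the structure is only imposed when we query it back.

```sql
-- Accept the document without validating it up front; apply types at query time.
-- TRY_CONVERT means a bad value becomes NULL rather than breaking the load.
DECLARE @doc nvarchar(max) = N'{"orders":[{"id":1,"amount":"49.95"},{"id":2,"amount":"oops"}]}';

SELECT j.id,
       TRY_CONVERT(money, j.amount) AS amount
FROM OPENJSON(@doc, '$.orders')
     WITH (id int '$.id', amount nvarchar(50) '$.amount') AS j;
```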

Don’t dismiss fads that come through. Look into them, and try to spot those things which have more longevity. By adopting those principles, you might find yourself coming through as a stronger professional.

@rob_farley

This post was put together for T-SQL Tuesday 95, hosted by Derik Hammer (@sqlhammer). Thanks Derik!

Interviews and niches

T-SQL Tuesday turns this month to the topic of job interviews. Kendra Little (@kendra_little) is our host, and I really hope her round-up post is in the style of an interview. I’m reminded of a T-SQL Tuesday about three years ago on a similar topic, but I’m sure there will be plenty of new information this time around – the world has moved on.

I'm not sure when my last successful job interview was. I know I went through phases when I guess I was fairly good in job interviews (because I was getting job offers), and phases when I was clearly not very good in job interviews (because I would get interviews but not be able to convert them into job offers), and eventually I reached a point where I stopped doing interviews completely. That's the phase I'm still in.

I hit that point when I discovered my niche (which sounds like "neesh" in my language, not "nitch"). For me it was because I realised that I had a knack for databases and started exploring that area more – writing, presenting, helping others – until people noticed and started approaching me. That's when interviewing stops being a thing. It doesn't necessarily mean starting your own business, or even changing jobs – it just means that people know who you are and come to you. You no longer have to sit in front of a panel and prove your worth, because they've already decided they want you.

So now people approach me for work through LobsterPot Solutions, and although there is sometimes a bidding phase when we need to compete against other companies, there is no ‘interview’ process in the way that there was when I was an employee.

What’s your niche? And are you spending time developing that?

There's career advice that talks about the overlap between something you enjoy doing, something you're good at, and something that people are prepared to pay for. The thing is that people won't pay you for it unless they know that you're the person they need, rather than someone else. So get yourself out there. Prove yourself. Three years ago I asked "When is your interview?" and said that you need to realise that even before your interview they've researched you, considered your reputation, and all of that. Today I want to ask you how your niche is going. Have you identified that thing you enjoy, and that people will pay for? And are you developing your skills in that area?

Your career is up to you. You can respond to job ads and have interviews. Or you can carve your own space.

Good luck.

@rob_farley

Learning the hard way – referenced objects or actual objects

This month's T-SQL Tuesday is about lessons we've learned the hard way. Which, of course, is the way you remember best. It's not the best way to learn, but if you've suffered in your learning somewhat, then you're probably going to remember it better. Big thanks to Raul Gonzalez (@sqldoubleg) for dragging up these memories.

Oh, I could list all kinds of times I’ve learned things the hard way, in almost every part of my life. But let’s stick to SQL.

This was a long while back… 15-20 years ago.

There was a guy who needed to get his timesheets in. It wasn't me – I just thought I could help… by making a copy of his timesheets in a separate table, so that he could prepare them there instead of having to use the clunky Access form. I'd gone into the shared Access file that people were using, made a copy of it, and then proceeded to clear out all the data that wasn't about him, so that he could get his data ready. I figured once he was done, I'd just drop his data in amongst everyone else's – and that would be okay.

Except that right after I’d cleared out everyone else’s data, everyone else started to complain that their data wasn’t there.

Heart-rate increased. I checked that I was using the copy, not the original… I closed it, opened the original, and saw that sure enough, only his data was there. Everyone else’s (including my own) data was gone.

And then it dawned on me – these tables were linked back to SQL in the back end. I’d copied the reference, but it was still pointing at the same place. All that data I’d deleted was gone from the actual table. I walked over to the boss and apologised. Luckily there was a recent backup, but I was still feeling pretty ordinary.

These kinds of problems can hurt in all kinds of situations, even if you’re not using Access as a front-end. Other applications, views within SQL, Linked Servers, linked reports – plenty of things contain references rather than the actual thing. When you delete something, or change something, or whatever, you had better be sure that you’re working in the right environment.

I don’t even know the best way to have confidence that you’re safe on this. You can help by colouring Prod tabs differently in SSMS with SSMS Tools Pack, but it’s not going to guarantee that you’re okay. You need to be a little paranoid about it. Learn to check and double-check. Because ultimately, data is too valuable to make that kind of mistake.
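
One small habit that helps (though it's no guarantee either) is to put a guard at the top of anything destructive, so the script refuses to run anywhere except where you meant it to. A minimal sketch, with made-up server, database, and table names:

```sql
-- Abort unless we're on the dev server and in the dev copy of the database.
DECLARE @EmployeeID int = 42;

IF @@SERVERNAME <> 'DEV-SQL01' OR DB_NAME() <> 'Timesheets_Dev'
BEGIN
    RAISERROR('Wrong server or database - aborting.', 16, 1);
    RETURN;
END;

-- Only reached when we're definitely in the copy, not the original.
DELETE FROM dbo.TimesheetEntries
WHERE EmployeeID <> @EmployeeID;
```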

@rob_farley

DevOps and your database

I’m a consultant. That means I have to deal with whatever I come across at customer sites. I can recommend change, but when I’m called in to fix something, I generally don’t get to insist on it. I just have to get something fixed. That means dealing with developers (if they exist) and with DBAs, and making sure that anything that I try to fix somehow works for both sides. That means I often have to deal with the realm of DevOps, whether or not the customer knows it.

DevOps is the idea of having a development story which improves operations.

Traditionally, developers would develop code without thinking much about operations. They’d get some new code ready, deploy it somehow, and hope it didn’t break much. And the Operations team would brace themselves for a ton of pain, and start pushing back on change, and be seen as a “BOFH”, and everyone would be happy. I still see these kinds of places, although for the most part, people try to get along.

With DevOps, the idea is that developers work in a way that means that things don’t break.

I know, right.

If you’re doing the DevOps things at your organisation, you’re saying “Yup, that’s normal.” If you’re not, you’re probably saying “Ha – like that’s ever going to happen.”

But let me assure you – it can. For years now, developers have been doing Continuous Integration, Test-Driven Development, Automated Builds, and more. I remember seeing these things demonstrated at TechEd conferences in the middle of the last decade.

But somehow, these things are still considered ‘new’ in the database world. Database developers look at TDD and say “It’s okay for a stateless environment, but my database changes state with every insert, update, or delete. By its very definition, it’s stateful.”

The idea that a stored procedure with particular parameters should have a specific impact on a table with particular characteristics (values and statistics – I would assume structure and indexes would be a given) isn't unreasonable. And it's this that can lead to the understanding that whilst a database is far from stateless, state can be a controllable thing. Various states can become part of various tests: does the result still apply when there are edge-case rows in the table? Is the execution plan suitable when there are particular statistics in play? Is the amount of blocking reasonable when the number of transactions is at an extreme level?

Test-driven development is a lot harder in the database-development world than in the web-development world. But it’s certainly not unreasonable, and to have confidence that changes won’t be breaking changes, it’s certainly worthwhile.
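
To give a flavour of it, here's a sketch of a database unit test using the open-source tSQLt framework (the procedure and table names are hypothetical). The point is the shape of it: the state of the table becomes part of the test setup.

```sql
EXEC tSQLt.NewTestClass 'OrderTests';
GO
CREATE PROCEDURE OrderTests.[test GetOrderTotal ignores zero-quantity rows]
AS
BEGIN
    -- Replace the real table with an empty, constraint-free copy so we control the state.
    EXEC tSQLt.FakeTable 'dbo.OrderLines';
    INSERT dbo.OrderLines (OrderID, Quantity, UnitPrice)
    VALUES (1, 2, 10.00),
           (1, 0, 99.99);   -- edge case: zero quantity

    DECLARE @actual money, @expected money = 20.00;
    EXEC dbo.GetOrderTotal @OrderID = 1, @Total = @actual OUTPUT;

    EXEC tSQLt.AssertEquals @expected, @actual;
END;
GO
EXEC tSQLt.Run 'OrderTests';
```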

The investment to implement a full test suite for a database can be significant, depending on how thorough it needs to be. But it can be an incremental thing. Elements such as source control ought to be put in place first, but there is little reason why database development shouldn’t adhere to DevOps principles.

@rob_farley

(Thanks to Grant Fritchey (@gfritchey) for hosting this month's T-SQL Tuesday event.)

“Stored procedures don’t need source control…”

Hearing this is one of those things that really bugs me.

And it’s not actually about stored procedures, it’s about the mindset that sits there.

I hear this sentiment in environments where there are multiple developers. Where they’re using source control for all their application code. Because, you know, they want to make sure they have a history of changes, and they want to make sure two developers don’t change the same piece of code, maybe they even want to automate builds, all those good things.

But checking out code and needing it to pass all those tests is a pain. So if there’s some logic that can be put in a stored procedure, then that logic can be maintained outside the annoying rigmarole of source control. I guess this is appealing because developers are supposed to be creative types, and should fight against the repression, fight against ‘the man’, fight against [source] control.

When I come across this mindset, I worry a lot.

I worry that code within stored procedures could be lost if multiple people decide to work on something at the same time.

I worry that code within stored procedures won’t be part of a test regime, and could potentially be failing to consider edge cases.

I worry that the history of changes won’t exist and people won’t be able to roll back to a good version.

I worry that people are considering that this is a way around source control, as if source control is a bad thing that should be circumvented.

I just worry.

And this is just talking about code in stored procedures. Let alone database design, constraints, indexes, rows of static data (such as lookup codes), and so on. All of which contribute to a properly working application, but which many developers don’t consider worthy of source control.

Luckily, there are good options available to change this behaviour. Red Gate's SQL Source Control is tremendously useful, of course, and the inclusion of many of Red Gate's DevOps tools within VS2017 would suggest that Microsoft wants developers to take this more seriously than ever.

For more on this kind of stuff, go read the other posts about this month’s T-SQL Tuesday!


@rob_farley

Time waits for no one

And technology changes as quickly as the numbers on a clock. A digital clock, of course – the numbers never change on an analogue one.

I think it's nice to have this month's T-SQL Tuesday (hosted by Koen Verbeeck (@ko_ver)) on this topic, as I delivered a keynote at the Difinity conference a couple of months ago on the same thing.

In the keynote, I talked about the fear people have of becoming obsolete as technology changes. Technology is introduced that trivialises their particular piece of skill – the database that removes the need for a filing cabinet, the expert system that diagnoses sick people, and the platform as a service that is managed by someone other than the company DBA. As someone who lives in Adelaide, where a major car factory has closed down, costing thousands of jobs, this topic is very much at the forefront of a lot of people’s thoughts. The car industry has been full of robots for a very long time – jobs have been disappearing to technology for ages. But now we are seeing the same happen in other industries, such as IT.

Does Automatic Tuning in Azure mean the end of query tuners? Does Self-Service BI in Excel and Power BI mean the end of BI practitioners? Does PaaS mean the end of DBAs?

I think yes. And no.

Yes, because there are tasks that will disappear. People who only do one very narrow thing probably have reason to fear. But they've had reason to fear for a lot longer than Azure has been around. If all you do is check that backups have worked, you should have expected to be replaced by a script a very long time ago. The same has applied in many industries, from production lines in factories to ploughing lines in fields. If your contribution is narrow, you are at risk.

But no, because the opportunity here is to use the tools to become a different kind of expert. The person who drove animals to plough fields learned to drive tractors, but could use their skills in ploughing to offer a better service. The person who painted cars in a factory makes an excellent candidate for retouching dent repair, or custom paint jobs. Their expertise sets them apart from those whose careers didn’t have the same background.

For a BI practitioner today, self-service BI doesn't present a risk. It's an opportunity. The opportunity is to lead businesses in their BI strategies. In training and mentoring people to apply BI to their businesses. To help create visualisations that convey the desired meaning in a more effective way than the business people realise. This then turns the BI practitioner into a consultant with industry knowledge. Or a data scientist who can transform data to bring out messages that the business users couldn't see.

As the leader of a company of database experts, these are questions I’ve had to consider. I don’t want my employees or me to become obsolete. We don’t simply offer health checks, BI projects, Azure migrations, troubleshooting, et cetera. We lead business through those things. We mentor and train. We consult. Of course, we deliver, but we are not simply technicians. We are consultants.

@rob_farley


SQL WTF for T-SQL Tuesday #88

The topic for this month’s T-SQL Tuesday is:

“Be inspired by the IT horror stories from http://thedailywtf.com, and tell your own daily WTF story. The truly original way developers generated SQL in project X. Or what the grumpy "DBA" imposed on people in project Y. Or how the architect did truly weird "database design" on project Z”

And I’m torn.

I haven’t missed a T-SQL Tuesday yet. Some months (okay, most months) it’s the only blog post I write. I know I should write more posts, but I simply get distracted by other things. Other things like working for clients, or spending time with the family, or sometimes nothing (you know – those occasions when you find yourself doing almost nothing and time just slips away, lost to some newspaper article or mindless game that looked good in the iTunes store). So I don’t want to miss one.

But I find the topic painful to write about. Not because of the memories of some of the nasty things I've seen at customer sites – that's a major part of why we get called in. But because I wouldn't ever want to be a customer who had a bad story that got told. When I see you tweeting things like "I'm dying in scalar-function hell today", I always wonder who knows which customer you're visiting today, or, if you're not a consultant, whether your employer knows what you're tweeting. Is your boss/customer okay with that tweet's announcement that their stuff is bad? What if you tweet "Wow – turns out our website is susceptible to SQL Injection attacks!"? Or what if you write "Oh geez, this customer hasn't had a successful backup in months…"? At what point does that become a problem for them? Is it when customers leave? Is it when they get hacked? Is it when their stock price drops? (I doubt the tweet of a visiting consultant would cause a stock price to fall, but still…)

So I’m quite reluctant to write this blog post at all. I had to think for some time before I thought of a scenario that I was happy to talk about.

This place was never a customer, and this happened a long time ago. Plus, it’s not a particularly rare situation – I just hadn’t seen it become this bad. So I’m happy enough to talk about this…

There was some code that was taking a long time to execute. It was populating a table with a list of IDs of interest, along with a guid that had been generated for this particular run. The main queries ran, doing whatever transforms they needed to do, inserting and updating some other tables, and then the IDs of interest were deleted from that table that was populated in the first part. It all seems relatively innocuous.

But execution was getting worse over time. It had gone from acceptable, to less than ideal, to painful. And the guy who was asking me the question was a little stumped. He knew there was a Scan on the list of IDs – he was okay with that because it was typically only a handful of rows. Once it had been a temporary table, but someone had switched it to be a regular table – I never found out why. The plans had looked the same, he told me, from when it was a temporary table even to now. But the temporary table solution hadn’t seen this nasty degradation. He was hoping to fix it without making a change to the procedures though, because that would have meant source control changes. I’m hoping that the solution I recommended required a source control change too, but you never know.

What I found was that the list of IDs was being stored in a table without a clustered index. A heap. Now – I'm not opposed to heaps at all. Heaps are often very good, and shouldn't be derided. But you need to understand something about heaps – which is that they're not suited to tables that see a large number of deletes. Every time you insert a row into a heap, it goes into the first available slot on the last page of the heap. If there aren't any slots available, it creates a new page, and the story continues. It doesn't keep track of what's happened earlier. Heaps can be excellent for getting data in – and Lookups are very quick because every row is addressed by the actual Row ID, rather than some key values which then require a Seek operation to find them (that said, it's often cheap to avoid Lookups, by adding extra columns to the Include list of a non-clustered index). But because they don't think about what kind of state the earlier pages might be in, you can end up with heaps that are almost completely empty – a chain of pages with header information, but no actual rows in them. If you're deleting rows from a heap, this is what you'll get.

This guy’s heap had only a few rows in it. 8 in fact, when I looked – although I think a few moments later those 8 had disappeared, and were replaced by 13 others.

But the table was more than 400MB in size. For 8 small rows.

At 8kB per page, that’s over 50,000 pages. So every time the table was scanned, it was having to look through 50,000 pages.
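
The symptom is easy enough to spot, too. A quick sketch (with a made-up table name) of the kind of check that shows it:

```sql
-- A heap (index_id = 0) reporting tens of thousands of used pages for a handful
-- of rows is showing exactly this problem.
SELECT o.name,
       ps.row_count,
       ps.used_page_count,
       ps.used_page_count * 8.0 / 1024 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.objects AS o ON o.object_id = ps.object_id
WHERE o.name = 'InterestingIDs'
  AND ps.index_id = 0;
```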

When it had been a temporary table, a new table was created every time. The rows would typically have fitted on one or two pages, and then at the end, the temporary table would've disappeared. But I think multiple processes needed to look at the list, so making sure it wasn't bound to a single session might've been useful. I wasn't going to judge, only to offer a solution. My solution was to put a clustered index in place. I could've suggested they rebuild the heap regularly, which would've been a quick process they could run as often as they liked – but a clustered index was going to suit them better. Compared to a single-page heap, things wouldn't've been any faster, but compared to a large empty heap, Selects and Deletes would've been much faster. Inserts are what heaps do well – but that wasn't a large part of the process here.
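
Sketching the two options against the same made-up table:

```sql
-- Option 1: rebuild the heap to reclaim the empty pages (needs repeating as it grows again).
ALTER TABLE dbo.InterestingIDs REBUILD;

-- Option 2 (what I recommended): give it a clustered index, so emptied pages get cleaned up.
CREATE CLUSTERED INDEX cix_InterestingIDs ON dbo.InterestingIDs (RunGuid, ID);
```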

You see, a clustered index maintains a b-tree of data. The very structure of an index needs to be able to know what range of rows are on each page. So if all the rows on a page are removed, this is reflected within the index, and the page can be removed. This is something that is done by the Ghost Cleanup process, which takes care of actually deleting rows within indexes to reduce the effort within the transaction itself, but it does still happen. Heaps don’t get cleaned up in the same way, and can keep growing until they get rebuilt.

Sadly, this is the kind of problem that people can face all the time – the system worked well at first, testing didn’t show any performance problems, the scale of the system hasn’t changed, but over time it just starts getting slower. Defragmenting heaps is definitely worth doing, but better is to find those heaps which fragment quickly, and turn them into clustered indexes.


…but while I hope you never come across heaps that have grown unnecessarily, my biggest hope is that you be very careful about publicly discussing situations you’ve seen at customers.

@rob_farley

Backups – are you missing the point?

It's a common question: "Do you have a backup?" But it's the wrong question. It's very relevant for this month's T-SQL Tuesday, hosted by Ken Fisher (@sqlstudent144), on the topic of backups.

I think the question should be “Can you recover if needed?”

We all know that a backup is only as good as your ability to restore from it – that you must test your backups to prove their worth. But there’s more to it than being able to restore a backup. Do you know what to do in case of a disaster? Can you restore what you want to restore? Does that restore get your applications back up? Does your reporting become available again? Do you have everything you need? Are there dependencies on other databases?
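
Even the basics deserve to be scripted and rehearsed rather than assumed. Something along these lines, with placeholder names and paths:

```sql
-- VERIFYONLY confirms the backup is readable, but the real test is restoring it
-- somewhere and running the application (and the reports) against it.
RESTORE VERIFYONLY FROM DISK = N'\\backups\SalesDB_full.bak';

RESTORE DATABASE SalesDB_Test
FROM DISK = N'\\backups\SalesDB_full.bak'
WITH MOVE 'SalesDB'     TO N'D:\Data\SalesDB_Test.mdf',   -- logical file names are placeholders too
     MOVE 'SalesDB_log' TO N'L:\Logs\SalesDB_Test.ldf',
     STATS = 10,
     RECOVERY;
```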

I often find that organisations don’t quite have the Disaster Recovery story they need, and this is mostly down to not having practised specific scenarios.

Does your disaster testing include getting applications to point at the new server? Did anything else break while you did that?

Does your disaster testing include a scenario where a rogue process changed values, but there is newer data that you want to keep?

Does your disaster testing include losing an extract from a source system which does incremental extracts?

Does your disaster testing include a situation where a well-meaning person has taken an extra backup, potentially spoiling differential or log backups?

Does your disaster testing include random scenarios where your team needs to figure out what’s going on and what needs to happen to get everything back?

The usefulness of standard SQL backups for some of these situations isn’t even clear. Many people take regular VM backups, but is that sufficient? Can you get the tail of the log if your VM disappears? Does a replicated copy of your database provide enough of a safety net here, or in general?
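
And if you've never practised a tail-log backup, it's worth knowing what one looks like before you need it in anger. A minimal sketch, with placeholder names:

```sql
-- NO_TRUNCATE lets you take the tail-of-the-log backup even when the data files
-- are damaged or missing; NORECOVERY leaves the database ready to be restored.
BACKUP LOG SalesDB
TO DISK = N'\\backups\SalesDB_tail.trn'
WITH NO_TRUNCATE, NORECOVERY;
```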

The key issue is not whether you have a backup. It’s not even whether you have a restorable backup. It’s whether you have what you need to survive if things go south – whichever southbound route you’ve been taken down.

@rob_farley


How I prepare for a presentation

Some people say I talk a lot – but I guess it depends on the context.

Certainly, for many years, I've been fairly comfortable about standing up in front of people and explaining things. Whether it's teaching a course, leading a workshop, presenting at a conference, or preaching at a church, it all has that same "I'm talking, and people are looking at me" feeling. I totally understand why people get nervous about it, and I still have a certain amount of terror that I suffer from before getting up to present. It doesn't stop me doing it – I would happily present all the time, despite the fear factor.

It’s almost a cliché, but the biggest advice I have for new speakers is to realise that the people in the room do actually want to hear what you have to say. They don’t want you to fail.

…but there’s more to it than that.

I can present on just about any topic, so long as I have time to prepare. That preparation time is NOT in creating an effective talk (although that’s part of it) – it’s in getting to know the subject matter well.

Suppose I’m giving a talk about Columnstore indexes, like I just did at the PASS Summit. By all means, I want to craft a story for my presentation, and be able to work out which things I want to communicate through that story. If slides will work, then I’ll need to create them. If demos will work, then I’ll need to plan them too. But most of all, I want to get myself deep into Columnstore. I want to read everything there is on the subject. I want to create them, alter them, explore the DMVs about them, find ways to break them, and generally immerse myself in them. That way, I can speak confidently on the topic, knowing that I’m quite probably the most qualified person in the room to be up the front. I want to be explaining concepts that I know intimately.
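
That's the kind of exploration I mean. For example, poking at the row-group DMVs to see how a columnstore index is really behaving (the table name here is hypothetical):

```sql
-- How healthy are the row groups? Are they full, compressed, carrying deleted rows?
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id, state_desc, total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.FactSales');
```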

When people ask questions, there’s no guarantee that I’ll know the answer. At the end of my talk at the PASS Summit, someone asked me if I’d tried using columnstore indexes in a particular way, and I had to say no. She went on to tell me what she’d found, and it was interesting and piqued my curiosity for an area I hadn’t explored. Would I have been thrown if she’d asked me during the session, in front of everyone else? No – not at all. Because I felt comfortable with the depth of my knowledge.

This applies just the same if I’m preaching in a church. If I’m preaching on a section of Galatians, I want to know that section backwards. I want to know the rest of the chapter, the rest of the book, what the rest of the Bible says on the matter, how it has applied in my own life, and what other people say on it too. I want to have a thorough picture of what God is saying to me, and to the rest of the church, through that passage.

When I get stuck in my words, and stumble in some way, I need to know the topic well. I will have a bunch of sound bites that I’ve rehearsed, and expect to explain things using particular phrases. But those are the things that can disappear from my head when the nerves strike. My safety net is the deep knowledge of the subject, so that I can find a different way of explaining it.

I don’t like giving word-perfect speeches. The idea of talking from a script that I need to stick to exactly doesn’t work for me – I get too nervous and wouldn’t be able to pull it off (although one day I will give stand-up comedy a try, which means having well-crafted jokes that need to be word-perfect to work). Knowing the material is way better than knowing the words, and for me is way less stressful.

My advice to anyone is to get into public speaking. It's a great way of stretching yourself. But do get into your topic as deeply as you can. If you've looked at something from a variety of angles, you will be able to explain it to anyone.

Big thanks to Andy Yun (@sqlbek) for hosting this month’s T-SQL Tuesday.


@rob_farley