Getting started with Power BI Push Datasets via REST APIs

Real-time Power BI datasets are a really useful feature, and there's a good description of them at https://docs.microsoft.com/en-us/power-bi/connect-data/service-real-time-streaming. I thought I'd do a quick walkthrough specifically around the Push side and show you how it works – including the odd gotcha that you might not have noticed.

To create a dataset that you want to push data into, you need to go to the Power BI service, open your Workspace, and create a Streaming dataset. Even if you don't want to use it with a streaming service, this is the one you need.

Having said I want to create a dataset, I need to tell it that I want to use this via the API, rather than via Azure Stream or PubNub (for more on those, check the documentation link above). I also need to give the dataset a name and define the columns. I only get to use number, text, and datetime values, so for now I'll use one of each. I'm also going to tell it that I want to enable historical data analysis. Be careful about whether or not you pick this, as we'll see in a moment.

Once we hit the Create button, we get to a screen that gives us the Push URL and some sample code. This is really useful because you can immediately put some code together to start pushing data in. I can click on that window to grab the code and paste it straight into a script that runs regularly.

(Don't bother trying that particular endpoint, by the way. I created it for this post and it was long gone before this post was finished.)
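For reference, the generated script looks roughly like the sketch below. The endpoint URL, column names, and values here are placeholders of mine rather than what the service generated – yours will match the columns you defined.

```powershell
# A minimal sketch of the generated push script (placeholder URL and column names)
$endpoint = "https://api.powerbi.com/beta/<tenant>/datasets/<datasetId>/rows?key=<key>"

$payload = @{
    NumberValue = 98.6                      # the number column
    TextValue   = "Hello from PowerShell"   # the text column
    DateValue   = (Get-Date).ToString("o")  # the datetime column, ISO 8601
}

# The push endpoint expects a JSON array of row objects
Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))
```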

This PowerShell could be kicked off by a number of things. I could poll a database, or have it run when certain conditions are met, or as part of a wider process. And if PowerShell isn't my thing, then I can use some other method – I just need to POST data to the URL.

Let's talk a bit more about that Historical Data Analysis option.

If I don't choose that option, then the data hangs around for an hour and then disappears. That might be fine, but it might not. Also, I only get to use this data in a dashboard tile with a handful of visuals – not in reports at all.

If I do choose the Historical Data Analysis option, then 200k rows of data will hang around for me (that's not a configurable value), and I can even use the data in reports.

Let's explore this.

In a dashboard there's an Add tile option under the ellipsis.

Amongst the tiles that I can add is a Real Time tile for Custom Streaming Data.

The fact that it's Real Time means that it will refresh when the dataset underneath refreshes – which is absolutely what we want here if we're going to be showing this on a screen somewhere. Then we can choose the dataset we want, but the tiles we can use are very limited. Only Card, Gauge, and a few types of chart. And things are limited even more, because a Card can only show the number data, not text or datetime.

Suppose I tell it to display a Card, and then I run the PowerShell code repeatedly to push newer and newer data in – the tile updates with the new data almost immediately and this is great. Someone viewing this dashboard can get near-real-time updates from an application or script.

This happens whether or not we chose to do historical data analysis. The tile shows the most recently pushed row, even if its datetime value is older than what was there before. We push a new row, and it replaces what's in that tile. When this is all I want to do, I don't even care too much about sending a timestamp column. Any data I send will disappear an hour later, and that's fine by me.

But if we did choose that option, we can also create a report based on the dataset.

However, you can't do this in Power BI Desktop. You have to do it from the service, where your dataset was defined.

You can either choose New Report and then pick the dataset from the list, or (especially if you can't see it in the list), you can pick "Create Report" from the dataset's ellipsis.

Now I can create all my usual visualisations based on this data, showing multiple rows, and all that. In my example, because I didn't change the timestamp value every time, I need to tell it explicitly "do not summarize" the number value, because I can't edit the dataset except to change the field names and types. I certainly can't do anything like setting default aggregation types. Also notice that the dataset contains a table called RealTimeData. I didn't choose that – it was set for me.

At this point I find myself wondering about those 200k rows it's holding onto. Filtering to "Top N" can be handy, as can filtering the datetime column on relative time (being careful about UTC settings). Otherwise, you might find yourself inadvertently showing more data than you'd like.

But if I need to clear out my dataset at the start of each day, then that's an option too – I just need to do it deliberately. Luckily, Power BI's REST API is good for that too, and I can delete all the rows in a dataset with a few lines of PowerShell.
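A rough sketch of what I mean, assuming the MicrosoftPowerBIMgmt module is installed and using a placeholder dataset ID (the table in these datasets is called RealTimeData, as we saw above):

```powershell
# Sign in interactively, then call the Delete Rows endpoint for the push dataset's table.
# The dataset ID below is a placeholder – use your own, e.g. from Get-PowerBIDataset.
# (For a workspace other than My Workspace, include groups/<groupId>/ before datasets/.)
Connect-PowerBIServiceAccount
Invoke-PowerBIRestMethod -Method Delete `
    -Url "datasets/00000000-0000-0000-0000-000000000000/tables/RealTimeData/rows"
```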

Clearing the rows will cause a dashboard tile to empty almost immediately, just as it updated when we pushed new data in.

But what about the report visual? That doesn't refresh automatically. To get the report to show the latest data you need to do a refresh.

However, you can pin the visual to a dashboard, and see it refresh automatically.

I made a simple loop to send a row, wait a second, and repeat, and the result was pretty pleasing. I had the visual filtered to the Top 3 rows based on the latest timestamp.
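Something along these lines – a rough sketch reusing the placeholder endpoint and column names from earlier:

```powershell
# Push a row every second until interrupted (placeholder endpoint and column names)
while ($true) {
    $payload = @{
        NumberValue = Get-Random -Minimum 0 -Maximum 100
        TextValue   = "Loop test"
        DateValue   = (Get-Date).ToString("o")
    }
    Invoke-RestMethod -Method Post -Uri $endpoint -Body (ConvertTo-Json @($payload))
    Start-Sleep -Seconds 1
}
```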

Now, it would be nice if reports could be set to auto-refresh when new data is pushed in, but having it in a dashboard is still very useful.

You see, streaming datasets don't have to involve Azure Stream, and they can involve all your usual visuals if you choose the option for historical data analysis – it's just that you won't see those visuals refresh automatically unless you use a dashboard. Now that you understand these aspects of what's going on, and the couple of extra hoops you need to jump through, hopefully you can see a good use for these types of datasets. You can decide whether you need to clear rows occasionally, figure out how you want to visualise your data, and really start to make push datasets your own.

@rob_farley

That time the warehouse figures didn't match

I don't write about situations at my customers. It's just a rule I have. Regular readers may have heard me say it before. So if I'm writing about a problem, it's much more likely that it's something I've come across elsewhere, or a general thing. This month's topic for T-SQL Tuesday (hosted by Kerry Tyler (@airbornegeek)) is about "crash reports" – the stuff we've learned from disasters that can teach us (and others) how to avoid disaster in the future.

So it's a bit of a tricky topic to tackle without writing about something that's happened at a customer. I'm still not going to do that, but I am going to write about a problem I've seen plenty of times, and what I do to avoid it.

It's about not sufficiently protecting the data that is being populated into a warehouse. I've been guilty of this before, and I continue to try to do better.

The scenario is that there's a source system, and there's a table in a warehouse that is being used to report on it. Maybe it's being populated by Integration Services or Data Factory. Maybe it's being populated by T-SQL. I don't really care. What I care about is whether the data in the warehouse is a true representation of what's in the source system.

If it's not a true representation, then we have all kinds of problems.

Mostly, that our warehouse is rubbish.

Seriously – if the data in the warehouse can't be shown to be correct, then what's the point of it? People won't want to use it, because they won't want their reports to be wrong. If they use it because it's the best they've got, knowing the data is sometimes wrong, then anything that comes from it will need further analysis and the inaccuracies will be a constant frustration. A warehouse needs to be correct.

It should be easy to validate too. It should be easy to pull totals from the source system (which could be a staging environment – it doesn't have to be production, it just has to be trusted to be correct) and see whether they match what's in the warehouse. This should be tested after each refresh of the warehouse data, and alerts should be raised if there's a problem.
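As a rough sketch of the kind of check I mean – the database, table, and column names here are made up, and you'd wire the alerting into whatever your ETL tooling uses:

```sql
-- Compare a total from the trusted source (or staging) to the warehouse fact table.
-- All names here are hypothetical; adjust to your own schema.
DECLARE @SourceTotal    decimal(19,4) = (SELECT SUM(Amount) FROM Staging.dbo.Sales);
DECLARE @WarehouseTotal decimal(19,4) = (SELECT SUM(Amount) FROM Warehouse.dbo.FactSales);

IF ISNULL(@SourceTotal, 0) <> ISNULL(@WarehouseTotal, 0)
    THROW 50001, 'Warehouse total does not match the source system total.', 1;
```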

So there's a problem, and you probably have a good idea about how to detect it. But to be a "crash report", we need to explain what caused it in the first place. There could be several factors involved, but in the situation(s) I'm remembering, it came down to not protecting the base table.

An ETL process gets data from the source system(s) into a warehouse, doing some amount of transformation to get it into the right shape – the things we're counting might need information from other tables in the source to work out how they should be represented in the warehouse tables, and so on. But the whole way along, we need to be protecting that data. Particularly from joins.

A join isn't a bad thing. It lets us get to the relevant data from other tables. It's very useful and you learn about it very early on in a T-SQL class. If you're using ADF or SSIS to do your ETL, you might have T-SQL in a source query. Or you might be pushing data through a Lookup or a Merge. You're almost certainly going to be needing information from another table. And you need to take care with that join.

You see, joins don't just pull data from other tables. They can also filter your data, if you're doing an inner join. They might make extra copies of the rows in your base table, if you hit multiple rows on the other side. You might even introduce rows full of nulls if you're doing a right outer join or full outer join.
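To make that concrete, here's a tiny made-up example – the table and column names are mine, not from any real system:

```sql
-- Orders is the base table we're trying to represent faithfully in the warehouse.
SELECT o.OrderID, o.Amount, c.CustomerName
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

-- Three ways this join can change the number of Orders rows coming through:
-- 1. Filtering: any order whose CustomerID is NULL, or has no match in Customers,
--    silently disappears, because this is an inner join.
-- 2. Duplication: if Customers somehow has two rows with the same CustomerID,
--    the matching orders come through twice.
-- 3. Null rows: switch to a RIGHT or FULL OUTER JOIN and unmatched customers
--    appear as rows with NULL order columns.
```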

This is a very real problem that I've had to deal with.

In an ideal world, every row in my base table goes through to my warehouse as exactly one row. But these three things that joins do… well, they can trip us up.

It's probably quite easy to stop pulling in null rows, because you're probably not using many right outer joins or full outer joins.

As for the filtering – well, it's easy to use left joins everywhere… except that you need to be careful not to accidentally eliminate the nulls those left joins produce, by using those columns in another join or in a filter somewhere. You might figure that you have foreign keys in place to make sure your join can't filter out any rows, but you'll also need to make sure the joining column in your base table doesn't allow nulls, even if it's currently always populated.
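Here's the kind of thing I mean, again with made-up names – the left join preserves every order, and then the WHERE clause quietly throws some of them away:

```sql
SELECT o.OrderID, o.Amount, c.Region
FROM dbo.Orders AS o
LEFT JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE c.Region = N'South Australia';   -- orders with no customer fail this test and vanish

-- If the filter belongs to the lookup rather than the base table, put it in the ON clause:
-- LEFT JOIN dbo.Customers AS c
--        ON c.CustomerID = o.CustomerID
--       AND c.Region = N'South Australia'
```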

Hitting multiple rows should be easy to handle, because you can make sure the join predicates you're using can only hit one row. But this isn't always straightforward either. If you're referring to a primary (or other enforced-to-be-unique) key, then great. But will you be told if that changes? Or what if it's more complicated than that? Suppose you're joining to a row that describes a date range (like a dimension table with startdate and enddate), but somehow you've got overlapping ranges?

I've seen too many crash reports that cover this. Some have been of my own making. It's not always in warehouses – sometimes it's in basic queries used for other things – where I've failed to protect the base table.

What I try to do is make sure that it's impossible for my joins to impact the number of rows in my base table. That means knowing what I consider the base table to be in the first place, and then being cautious about the way I'm joining to it. I often find myself replacing joins with "outer apply (select top (1) …)", even when I'm pretty sure it should be okay. This might hurt the performance of my query – but correctness is critical. Plus, depending on how it's being used, I can often find ways to make performance even better this way (because the Query Optimizer knows the number of rows isn't being affected by the join).
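To show the shape of that pattern, with hypothetical names and using the date-range dimension scenario from earlier:

```sql
-- Each fact row gets at most one rate band, even if the ranges in the dimension overlap,
-- so the join can never change the number of FactSales rows.
SELECT f.SaleID, f.SaleDate, f.Amount, rb.RateBandName
FROM dbo.FactSales AS f
OUTER APPLY (
    SELECT TOP (1) d.RateBandName
    FROM dbo.RateBand AS d
    WHERE f.SaleDate >= d.StartDate
      AND f.SaleDate <  d.EndDate
    ORDER BY d.StartDate DESC       -- deterministic choice if ranges ever overlap
) AS rb;
```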

Disasters are no fun. I'm all about making sure that my customers can trust their data. If my query gives incorrect results, they lose confidence in future queries that I write, and possibly in their data environment in general. It's critical that I get it right.

My "crash report" today is stopping before we reach the point of the customer losing confidence in things, but I can assure you that I've seen this happen. Today, I'm choosing those times when we've noticed what's going on and mitigated things before the disaster takes hold. But I implore you – protect your base tables. It really is very important.

@rob_farley