How to use Explain Data in Tableau Desktop 2019.3 and newer.
Instead of manually digging through tables to explain an outlier, let Tableau's machine learning do it for you with Explain Data.
- Explain Data runs a range of machine learning models on a selected mark to surface why it behaves as an outlier, such as a higher-than-expected number of records or extreme values.
- You can open any Explain Data finding as a new worksheet, including charts that exclude the extreme value, and the Explain Data window stays open alongside the view so you can keep working while it is visible.
- Explain Data only analyses the measures already on your columns, rows and marks; adding a new measure invalidates the analysis and prompts you to rerun it.
- The feature compares your selected mark (shown in blue) against the spread of all other marks (in grey), revealing distributional differences like a higher proportion of terraces and flats or young adults in a given borough.
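
The first two kinds of finding listed above (an unusually high number of records behind a mark, and a single extreme value) can be approximated by hand. The sketch below is only an illustration of that idea in plain Python, not Tableau's actual method: the rows are made-up, `explain_mark` is a hypothetical helper, and the "5x the median" rule is an arbitrary threshold chosen for the example.

```python
from statistics import mean, median

# Made-up row-level records: (city, order_id, sales). Not real data.
ROWS = [
    ("City A", "A1", 120.0),
    ("City A", "A2", 95.0),
    ("City A", "A3", 110.0),
    ("City A", "A4", 4800.0),  # one very large order
    ("City A", "A5", 130.0),
    ("City B", "B1", 140.0),
    ("City B", "B2", 100.0),
    ("City C", "C1", 90.0),
    ("City C", "C2", 150.0),
]


def explain_mark(rows, selected):
    """Hypothetical helper: summarise the rows behind one aggregated mark."""
    by_mark = {}
    for city, _order_id, sales in rows:
        by_mark.setdefault(city, []).append(sales)

    values = by_mark[selected]
    other_counts = [len(v) for city, v in by_mark.items() if city != selected]

    # Assumption for illustration: the largest row counts as "extreme"
    # when it exceeds 5x the median of the rows behind the mark.
    largest = max(values)
    is_extreme = largest > 5 * median(values)

    remaining = list(values)
    if is_extreme:
        remaining.remove(largest)

    return {
        "record_count": len(values),
        "mean_other_record_count": mean(other_counts),
        "extreme_value": largest if is_extreme else None,
        "total": sum(values),
        "total_without_extreme": sum(remaining),
    }


result = explain_mark(ROWS, "City A")
print(result)
```

Comparing `total` with `total_without_extreme` mirrors the chart that Explain Data draws with the extreme value excluded, and comparing `record_count` with `mean_other_record_count` mirrors the number-of-records finding.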
0:00One of the exciting new features in 2019.3
0:05is a feature called Explain Data. In order
0:09to set the context, I'm going to walk you
0:10through some ad hoc analysis that you would
0:12do to arrive at a very similar conclusion.
0:16In the scatterplot, you can see sales and
0:18profit on columns and rows. And this chart
0:21is broken down by cities. Now immediately,
0:24you can see that there's some outliers over
0:26here on the right hand side. If I was to
0:29select
0:29this top one, one of the things I'd
0:32normally want to do is try and understand
0:34why is this
0:35value an outlier. There's multiple ways I
0:38could do that. I could view the table of
0:41data,
0:41go into full data, and then start doing
0:44this analysis in probably the least ideal format,
0:48just essentially analyzing the profit, rows,
0:50ratios, and trying to see if I can analyze
0:53anything from this table. It's actually
0:56quite hard to do. The other option, if I'm
0:58a desktop
0:59author of a Tableau dashboard is I'd
1:01probably select this item, keep just that
1:04item in my
1:05data set by pretty much just creating a set.
1:08You'll see that it's created an inclusion
1:10set over here in the filters pane. And then
1:13breaking this view down to its most granular
1:16level. In this particular case, it's going
1:19to be order ID and the product name because
1:22a customer can order multiple products in a
1:25single order. And then what we can now see
1:28is the spread of data that was originally
1:30on that one data point. And you can very
1:32clearly
1:33see that there's an outlier over here that's
1:35probably skewing our data. You've also got
1:39a large sort of center of mass over here on
1:40the left. So actually, that one data point
1:43is being driven heavily by some of these
1:46outliers out here. Now, if I just go back a
1:49few steps,
1:52that's an analysis that's very hard to do
1:55just from this one data point. So one of
1:58the
1:58nice new features in 2019.3 is I can click
2:01on this option. And you'll now see this new
2:04option, which is a circle just on this
2:07command tool pane. I'll just highlight it
2:10here for
2:11you. This is the feature that allows you to
2:14explain the data. When I click on this,
2:17Tableau is actually running a range of machine
2:19learning models. You'll see it's already
2:21finished doing
2:22them. And it will highlight the data point
2:24I selected across the top and show me the
2:27options that I saw in the tooltip. But you'll
2:30also see that it's done some analysis on
2:33that
2:33particular data point. And the first thing
2:35it's realized is that this mark has a
2:37higher
2:37than expected number of records. So this
2:40particular one data point has a higher than
2:42normal number
2:43of records sitting behind it. That means
2:46that its position could be skewed simply by
2:48the
2:48number of records sitting behind it. The
2:51second thing you'll see here is that Tableau has
2:53actually identified that there is an
2:56extreme value in our data set. And it's
2:58generated
2:59this chart to show you that value. It's
3:01also highlighting the particular line and
3:03the particular
3:04customer in this particular case. And if I
3:07scroll down, you'll notice that it actually
3:10draws a view that excludes that extreme
3:13value. What I can do is I can click on this
3:16icon
3:16here on the top right to open as a new
3:18worksheet. When I do that, it actually
3:21brings out that
3:22worksheet onto view. And I can move this
3:24annotation over. And it now shows me my new
3:27data point
3:28with that extreme value removed. The great
3:32thing is that this window is non-modal. So I
3:34can
3:34still see this analysis whilst I'm working
3:37with this view, I can head back to the tab
3:39that originally activated this view. And I
3:42can keep doing my analysis. I'll switch
3:45over
3:45to the sum of sales. You'll notice that it's
3:47pulled out these two measures because
3:50they're
3:50fundamentally the measures that are on the
3:52columns and rows. If I added a third
3:54measure
3:55onto my marks pane, let's say quantity, you'll
3:58notice that Tableau warns you when its
4:01analysis
4:01is no longer valid. Essentially, it's
4:03noticed that I've added a third measure
4:05onto my marks
4:06pane. And it gives you a warning to let you
4:08know that the visualization has changed and I
4:10need to rerun this analysis. So if I go
4:13back, click on that same data point, and
4:17then rerun
4:17the analysis, you'll see that I now get a
4:21new option for the quantity which I added
4:24onto the size. If I head over to the
4:26quantity, I'm now able to interrogate that.
4:30And the
4:30only standout item here is the number of
4:32records sitting behind this data point.
4:35That's a consistent
4:36feature across all the measures. Because it
4:38refers to that one data point, it's going
4:40to be a consistent item. You'll also see
4:43that in terms of profit, this data point
4:45also has
4:46another extreme value related to profit. It's,
4:49again, the same customer. So that one
4:51customer
4:52from a sales and profit perspective is an
4:56outlier. That's just one example of Explain
5:00Data. In a different example, I'm going to
5:03show you how this works in a more real life
5:06situation. Let me switch over to another
5:09visualization that I have. This chart
5:13prepared by Tableau
5:14shows the average duration of home
5:17ownership in London. You can see that the
5:20color of each borough represents the
5:22average duration of a home ownership. So
5:24you can see
5:25here that Newham has the lowest value of 4.8
5:30years, whereas Sutton and Bexley and
5:34nearly
5:35Hillingdon have some of the higher values.
5:38If I want to analyze some of that data, I
5:40can. Again, I just select the data point.
5:45Then hit Explain Data. You'll see that Tableau
5:50runs its machine learning algorithms. This
5:52time it's taking a bit longer because the data set
5:54is a little bit larger. And you can see
5:56that Tableau has done analysis into that
5:58one particular
5:59item and it's generated some new insight
6:02for us. So the first thing to notice is
6:04that in
6:05blue, it represents the mark that we've
6:08selected. In this particular case, this is
6:11the Sutton
6:12region. And in gray is the rest of our data.
6:16And so what this is showing is that in
6:20Sutton
6:20in particular, there is a higher proportion
6:23of households in terraces and flats, nearly 40%.
6:27And then you can see the remainder of
6:29properties that aren't terraces or flats over here on the
6:31right
6:32hand side in blue. So this blue percentage
6:34on the right, plus this on the left
6:36constitute
6:37100%. It also shows you the spread for all
6:40marks. So you can see here that for all
6:43other
6:43data, that value is much, much lower,
6:46probably just under 10%. If I go over to
6:50this next
6:51dimension, you can see that it's looked at
6:53one of the attributes and it's realized
6:55that
6:55in this particular borough, there's a
6:58higher percentage of young adults. Less
7:01than 20% and 20 to 30% are almost exclusively
7:06the only values available in Sutton.
7:10However,
7:10if you compare to the spread in our data
7:12set, you'll see that the spread is a lot
7:14higher,
7:15and a lot more evenly spread out across all
7:18the items. This is some of the powerful
7:20analysis
7:21that Explain Data can do right off the bat
7:23without you having to create ad hoc
7:25analysis
7:25for the user to dive into. That's it.
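
The blue-versus-gray comparison described in the second example boils down to comparing the share of each category within the selected mark against the share across all other marks. Here is a minimal sketch of that comparison in plain Python. It is not how Tableau computes it; the records are made-up, and `shares` and `compare_mark` are hypothetical helper names used only for this illustration.

```python
from collections import Counter

# Made-up (mark, category) records for illustration only; not real housing data.
RECORDS = (
    [("Selected", "terrace_or_flat")] * 4
    + [("Selected", "other")] * 6
    + [("Other A", "terrace_or_flat")] * 1
    + [("Other A", "other")] * 4
    + [("Other B", "other")] * 5
)


def shares(counts):
    """Turn category counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def compare_mark(records, selected):
    """Return (shares within the selected mark, shares across all other marks)."""
    in_mark = Counter(cat for mark, cat in records if mark == selected)
    elsewhere = Counter(cat for mark, cat in records if mark != selected)
    return shares(in_mark), shares(elsewhere)


blue, gray = compare_mark(RECORDS, "Selected")
for category in sorted(blue):
    print(f"{category}: selected {blue[category]:.0%}, all others {gray.get(category, 0):.0%}")
```

With these made-up numbers the selected mark has a 40% share of `terrace_or_flat` against 10% elsewhere, which is the kind of gap the transcript describes for the selected borough.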
Be sure to check out my newer videos on enhancements to this feature as they’re released.