Getting the Version of a Tableau Workbook in a Few Clicks

In helping other Tableau users as part of DataBlick or my pro-bono contributions to the community I get a lot of Tableau workbooks in a lot of versions; in the last two weeks I've received v8.3, v9.0, v9.2, v9.3, and v10beta workbooks, and when I edit them I need to make sure I'm using the same version of Tableau. And I'm often frustrated because I'll open the workbook in the wrong version of Tableau and get this message:

Screen Shot 2016-07-13 at 2.30.52 PM

or this one:

Screen Shot 2016-07-13 at 2.31.34 PM

And then I have to open it up in other versions, which can take a while. I shared this problem in a Tableau Zen Master email thread and Shawn Wallwork replied with his trick for Windows that only takes a few seconds, and I was able to take that and come up with one for the Mac as well.

The basis for this technique is that the version of the Tableau workbook is stored in the XML of the .twb (Tableau Workbook) file, in particular in the version attribute of the <workbook> tag, and we can open up the XML in any old text editor. I've highlighted the <workbook> tag that shows this workbook was created in version 9.1:

Screen Shot 2016-07-13 at 2.29.44 PM

However, a .twbx (Tableau Packaged Workbook) is stored as a zip file, so I've always just opened workbooks up in different versions of Tableau until I found the right one, or once in a while taken the time to fire up a zip application, extract the TWBX to a folder, and then look at the .twb file. Both of these methods are slow. Shawn pointed out a shortcut on Windows using the free 7-Zip application that only takes a few seconds; here's a demo:

check version windows

The steps on Windows are:

  1. Right-click on the TWBX and choose 7-Zip->Open Archive.
  2. Right-click on the .twb and choose View.
  3. Find the <workbook> tag and version attribute.
  4. Close the window.

That inspired me to figure out an equivalent on the Mac. I use BBEdit for my text editor (Bare Bones also offers a free text editor called TextWrangler that can do the same thing) and BBEdit natively supports zip files. It turns out I can just drag the TWBX onto the BBEdit icon on the dock and BBEdit will open up the zip file:

check version mac

The steps on Mac are:

  1. Drag the TWBX to the BBEdit or TextWrangler icon.
  2. Click on the .twb.
  3. Find the <workbook> tag and version attribute.
  4. Close the window.
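
If you'd rather script the check (say, for a whole folder of workbooks), the same idea works in a few lines of code. Here's a minimal Python sketch (my own addition, not part of Shawn's trick) that reads the version attribute from a .twb directly, or from the .twb packaged inside a .twbx; the file name at the bottom is just a placeholder:

import zipfile
import xml.etree.ElementTree as ET

def workbook_version(path):
    """Return the version attribute of the <workbook> tag in a .twb or .twbx."""
    if path.lower().endswith(".twbx"):
        # A .twbx is just a zip archive; parse the .twb packaged inside it.
        with zipfile.ZipFile(path) as archive:
            twb_name = next(name for name in archive.namelist()
                            if name.lower().endswith(".twb"))
            with archive.open(twb_name) as twb:
                root = ET.parse(twb).getroot()
    else:
        root = ET.parse(path).getroot()
    # The root element of the workbook XML is <workbook version="...">.
    return root.get("version")

print(workbook_version("My Workbook.twbx"))  # e.g. prints "9.1"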

This will easily save me hours over the course of a year, thanks, Shawn! There's also a feature request for making version-checking easier that you can vote up. Also, if you'd like to change the version of the file you can edit that yourself, or (if you're ok running the workbook XML through a web-based tool) use Jeffrey Shaffer's Tableau File Conversion Utility.

TRIMMEAN() in Tableau

Excel’s TRIMMEAN() function can be quite useful at removing outliers, essentially it removes the top and bottom Nth percent of values and then computes the mean of the rest. Here’s the equivalent formula in Tableau that in Superstore Sales computes the TRIMMEAN() of sales at the customer level removing the top and bottom 5th percentile of customers when used with the AVG() aggregation:

{FIXED [Customer Name]: SUM(
    IF {FIXED [Customer Name] : SUM([Sales])} < 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}
    AND {FIXED [Customer Name] : SUM([Sales])} > 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .05)} THEN
       [Sales]
    END)
}

Read on for how to build and validate your own TRIMMEAN() equivalent in Tableau.
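
Before walking through the Tableau build, here's a quick way to sanity-check the end result outside of Tableau: a small pandas sketch (my addition, assuming a data frame with Customer Name and Sales columns) that mirrors the logic of the calc above. Note that Tableau's PERCENTILE() and pandas' quantile() can use slightly different interpolation, so the two results may differ by a hair.

import pandas as pd

def trimmed_mean(orders, lower=0.05, upper=0.95):
    """Mean of customer-level sales, keeping only customers strictly inside the percentile band."""
    # Inner LOD equivalent: sales summed up to the Customer Name level.
    customer_sales = orders.groupby("Customer Name")["Sales"].sum()
    lo, hi = customer_sales.quantile(lower), customer_sales.quantile(upper)
    # Mirror the < and > comparisons in the Tableau calc.
    kept = customer_sales[(customer_sales > lo) & (customer_sales < hi)]
    return kept.mean()

# orders = pd.read_csv("superstore.csv")  # hypothetical export of the Superstore data
# print(trimmed_mean(orders))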

When building out calculations in Tableau I try to let Tableau do as much of the computation as possible for both the calculations and the validation, so I'm typing as little as I can. Starting with Superstore, let's identify the top and bottom 5th percentiles; here's a view using a reference distribution:

Screen Shot 2016-07-12 at 3.57.38 PM

Now we know what we're going to have to remove. The next step is to duplicate this worksheet as a crosstab, then build out calcs that can return the 5th and 95th percentiles of Sales at the Customer Name level. While this can be done with table calculations (here's an example from the Tableau forums), I'm going to use FIXED Level of Detail Expressions so I've got a dimension I can use; for example, I could compare the trimmed group to the non-trimmed group. Here's the 95th percentile Level of Detail Expression:

{FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}

The inner LOD is calculating the sales at the Customer level, then the outer LOD returns the 95th percentile as a record-level value. Here are the two calcs, whose values match the reference lines above:

Screen Shot 2016-07-12 at 4.03.46 PM
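
If it helps to see those two levels outside of Tableau, here's the same inner/outer idea in pandas terms (a sketch, continuing with the same assumed orders data frame as above); the first line broadcasts the customer-level sum back against every record, which is what a record-level FIXED LOD does:

# Inner LOD: sales at the Customer Name level, broadcast back to every record.
customer_sales = orders.groupby("Customer Name")["Sales"].transform("sum")

# Outer LOD: the 95th percentile of the distinct customer-level sums, as a single record-level value.
p95 = orders.groupby("Customer Name")["Sales"].sum().quantile(0.95)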

The next step is to filter out the values outside of the desired range; here's the TRIMMEAN Filter formula:

{FIXED [Customer Name] : SUM([Sales])} < 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}
AND {FIXED [Customer Name] : SUM([Sales])} > 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .05)}

This uses the 5th and 95th percentile formulas and only returns True when the Customer-level sales is less than the 95th percentile and greater than the 5th percentile; we can visually validate it by dropping it on the Color Shelf:

Screen Shot 2016-07-12 at 4.06.37 PM

Now that we have this, the next step is to calculate what the trimmed mean would be. Again we can use a view with a reference line; this time it's been filtered using the TRIMMEAN Filter calc and the reference line is an average:

Screen Shot 2016-07-12 at 4.08.16 PM

Now we can embed the TRIMMEAN Filter formula inside an IF/THEN statement to only return the sales for the filtered values; this is the Trimmed Sales calc:

IF {FIXED [Customer Name] : SUM([Sales])} < 
   {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}
AND {FIXED [Customer Name] : SUM([Sales])} > 
   {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .05)} THEN
   [Sales]
END

And here it is in the workout view, only returning the sales for the trimmed customers:

Screen Shot 2016-07-12 at 4.15.36 PM

Now that we have the trimmed sales there are two ways we can go. If we want the trimmed mean without the Customer Name in the Level of Detail then we can validate that in our workout view by using Tableau’s two-pass Grand Totals to get the average of the customer-level trimmed sales. This was created by:

  1. Removing the TRIMMEAN Filter pill from Color (it increases the vizLOD and is no longer necessary).
  2. Clicking on the Analytics tab.
  3. Dragging out a Column Grand Total.
  4. Right-clicking the SUM(Trimmed Sales) pill on Measure Values and setting Total Using->Average.

Scrolling down to the bottom we can see that the overall trimmed mean of 2,600.79 matches the one from the reference line.

Screen Shot 2016-07-12 at 4.20.33 PM

Note that we could have used the Summary Card instead, however using the Grand Total lets us see exact values.

There’s a problem, though, if we use the Trimmed Sales all on its own in a view it breaks, whether using SUM() or AVG():

Screen Shot 2016-07-12 at 4.25.49 PM

The reason why is that Trimmed Sales is a record-level value and Superstore is at the level of detail of individual order items, but we're trying to compute the trimmed mean across Customer Names. For the true trimmed mean in this case we need to aggregate the trimmed sales to the Customer Name level like we did in the workout view. Here's the Trimmed Sales (Customer Level) formula that takes the Trimmed Sales and wraps it in an LOD to get the Customer-level sales:

{FIXED [Customer Name]: SUM(
    IF {FIXED [Customer Name] : SUM([Sales])} < 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}
    AND {FIXED [Customer Name] : SUM([Sales])} > 
    {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .05)} THEN
       [Sales]
    END)
}

This returns the same results in the workout view:

Screen Shot 2016-07-12 at 4.31.49 PM

And works all on its own in a view:

Screen Shot 2016-07-12 at 4.32.32 PM

Now this is a case where the FIXED level of detail expression returns different results depending on the level of detail of the view; if we want it to return the same result everywhere then we can wrap all that in one more LOD expression. This is the TRIMMEAN Fixed calculation:

{FIXED : AVG(
    {FIXED [Customer Name]: SUM(
        IF {FIXED [Customer Name] : SUM([Sales])} < 
        {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .95)}
        AND {FIXED [Customer Name] : SUM([Sales])} > 
        {FIXED : PERCENTILE({FIXED [Customer Name] : SUM([Sales])}, .05)} THEN
            [Sales]
        END)
    })
}

And here it is in the workout view and a view without any dimensions:

Screen Shot 2016-07-12 at 4.34.36 PM

Screen Shot 2016-07-12 at 4.35.35 PM

Final Comments

This is a good (and bad) example of how Tableau is different from Excel. In one bad sense, note that I didn't parameterize the percentage for the trimmed mean; this is because in Tableau it would require two parameters, since we can't use calculations as the arguments to the PERCENTILE() function. In another bad sense, the calculation requires understanding Level of Detail expressions and is not wrapped into a simple formula. On the other hand, we've got very precise control over what the calculation is computing over with those Level of Detail expressions and we aren't just limited to doing trimmed means: we could do trimmed medians, get the Nth percentile of the trimmed values, etc.

Here’s the trimmed mean workbook on Tableau Public.

Parallel Coordinates via Pivot and LOD Expressions

Parallel coordinates are a useful chart type for comparing a number of variables at once across a dimension. They aren't a native chart type in Tableau, but they have been built at different times; here's one by Joe Mako that I use in this post for the data and basic chart. The data is a set of vehicle attributes from the 1970s; I first saw it used in this post from Robert Kosara. This post updates the method Joe used with two enhancements that make the parallel coordinates plot easier to create and more extensible, namely pivot and Level of Detail Expressions.

The major challenge in creating a parallel coordinates chart is getting all the ranges of data for each variable into a common scale. The easiest way to do this is to linearly scale each measure to a range from 0 to 1; the equation is of the form (x – min(x))/(max(x) – min(x)). Once that scale is made, laying out the viz only needs 4 pills to get the initial chart:

Screen Shot 2016-07-08 at 3.25.20 PM

The Category dimension holds the different variables, ID identifies the different cars in this case, and Value Scaled is the scaled measure that draws the axis. Value Scaled is hidden in the tooltips while Value is used in the tooltips.

Where this gets easier is in using Tableau's pivot feature. In Joe's original version the data is in a "wide" format like this:

Screen Shot 2016-07-08 at 3.20.33 PM

So for each of the measures a calculation had to be built, and then the view was built using Measure Names and Measure Values:

Screen Shot 2016-07-08 at 3.47.05 PM

The major limitation here is in the tooltips (in fact, Joe had rightly hidden them in the original, they were so useless):

Screen Shot 2016-07-08 at 3.33.36 PM

The tooltip is showing the scaled value, not the actual value of acceleration. This is a limitation of Tableau’s Measure Names/Measure Values pills…If I put the other measures on the tooltip then I see all of them for every measure and it’s harder to identify the one I’m looking at. Plus axis ranges are harder to describe.

Pivoting Makes A Dimension

I think of Tableau’s Measure Names as a form of pivoting the data, to create a faux dimension. I write faux because beyond the limits mentioned above we can’t group Measure Names, we can’d blend on Measure Names, we can’t do cascading filters on Measure Names, etc. The workaround is to pivot our data so we turn those columns of measures into rows and get an actual “Pivot field names” dimension (renamed to Category in my case) and a single “Pivot field values” measure (renamed to Value in my case):

Screen Shot 2016-07-08 at 3.45.22 PM
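
For reference, the same wide-to-tall reshape outside of Tableau is a single melt in pandas (a sketch; the measure column names here are stand-ins, not necessarily the exact field names in Joe's data set):

import pandas as pd

wide = pd.read_csv("cars.csv")  # hypothetical wide file: one row per car, one column per measure
tall = wide.melt(
    id_vars=["ID"],                                              # the car identifier stays as a column
    value_vars=["MPG", "Horsepower", "Weight", "Acceleration"],  # assumed measure columns
    var_name="Category",                                         # becomes the "Pivot field names" dimension
    value_name="Value",                                          # becomes the "Pivot field values" measure
)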

Then for the scaling we can use a single calculation (instead of one for every original column); here's the Value Scaled measure's formula:

([Value] - {EXCLUDE [ID] : MIN([Value])})/
({EXCLUDE [ID] : MAX([Value])} - {EXCLUDE [ID] : MIN([Value])})

I used an EXCLUDE Level of Detail Expression here rather than a TOTAL() table calculation as an example of how we can use LODs to replace table calculations and have a simpler view because we don’t have to set the compute using of the table calculation.
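
In pandas terms the same scaling is a per-Category transform, which may make the EXCLUDE behavior easier to picture (a sketch, continuing from the tall data frame above):

# MIN and MAX of Value within each Category, ignoring ID and broadcast back to every row,
# which is what the EXCLUDE [ID] LODs do in the Tableau calc.
category_min = tall.groupby("Category")["Value"].transform("min")
category_max = tall.groupby("Category")["Value"].transform("max")
tall["Value Scaled"] = (tall["Value"] - category_min) / (category_max - category_min)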

Now with a real Category dimension in the view the Value Scaled calc is computed for each Category & ID, and this also means that if we put the Value measure in the view then that is computed for each Category & ID as well, immediately leading to more usable tooltips:

Screen Shot 2016-07-08 at 3.55.53 PM

For a quick interactive analysis this view takes just a couple of minutes to set up and the insights can be well worth the effort. Prior to the existence of Pivot and LOD expressions this view would have taken several times as long to create, so for me this revised method takes this chart type from “do I want to?” to “why not??”

Cleaning Up

To put this on a dashboard some further cleanup and additions are necessary. Identifying the axis ranges is also easier with the pivoted data. In this case I used a table calculation to identify the bottom- and top-most marks on each axis and used that as mark labels to identify the axis range:

Screen Shot 2016-07-08 at 3.58.04 PM

The Value for Label calculation has the formula:

IF FIRST()==0 OR LAST()==0 THEN
    SUM([Value])
END

The addressing is an advanced Compute Using so that it identifies the very first or last mark in each Category based on the value:

Screen Shot 2016-07-08 at 4.01.00 PM
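
The intent is just to keep each Category's smallest and largest value and blank out everything else; a rough pandas analogue of that labeling (a sketch, not the table calc itself, and again continuing from the tall data frame) would be:

# Keep Value only for the bottom-most and top-most mark on each axis (Category); NaN elsewhere.
category_min = tall.groupby("Category")["Value"].transform("min")
category_max = tall.groupby("Category")["Value"].transform("max")
tall["Value for Label"] = tall["Value"].where(
    (tall["Value"] == category_min) | (tall["Value"] == category_max)
)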

In addition I created two different versions of the Value pill that each had different number formatting and used those on the tooltips, used Joe's original parameters for setting the color and sort order with revised calculations (which were also easier to use since Category is a dimension), and finally added a couple of other worksheets to be the target of a Filter Action to show details of the vehicle:

Screen Shot 2016-07-08 at 4.02.54 PM

Click on the image above to download the workbook from Tableau Public.

Sorting a Dimension by Two Values At Once

I recently got a question via email about how to sort a view by two different criteria at the same time. Here’s an Excel table of opportunities where for each month and Opportunity ID there’s a forecast of size of opportunity:

Screen Shot 2016-06-30 at 2.12.45 PM

The goal here is to sort the opportunities within each account type with the nearest (by month) and biggest (by Forecast size) opportunities first, so it looks more like this in Tableau:

Screen Shot 2016-06-30 at 2.13.38 PM

Now with that data in Excel arranged in a tabular form I can sort on December, then November, then October, and so on, and get the desired sort:

Screen Shot 2016-06-30 at 2.14.29 PM

But in Tableau I can’t do that, if I try to work with the data in that “wide” format with a column for each month it just won’t work. If I use Tableau’s pivot feature to make the data “tall” with a record for each Opportunity ID and Month then I still run into problems. I want to sort the opportunities by each month and by Forecast but when I try to sort the Opportunity ID dimension I can’t get it to work, it only sorts by a single month’s values, so in the view below Opportunity ID 864280 should be the first new one for August since:

Screen Shot 2016-06-30 at 2.15.31 PM

The Excel way isn’t good because each month I have to manually re-sort the data. And in Tableau it just seems impossible to get the right sort because it looks like we need to sort in two different directions at once (get the earliest non-zero month for each opportunity, and then sort up and down the opportunities in each account type), and Tableau only lets us sort on one thing at a time. However, it is possible – read on for how to get this kind of sort in Tableau and maybe learn a few math tricks!

Part of how this kind of problem can be more challenging is the way the problem (and data) is initially presented to us. When we see the data in the crosstab form in Tableau the *appearance* is that we need to sort in two different directions. In fact, we really only need to sort in one direction based on the forecast value in the first month for each opportunity, so in the view below we’d want Opportunity ID 864271 to be the first one sorted because it’s from July 2016.

Screen Shot 2016-06-30 at 2.16.45 PM

Each opportunity row in the view needs to be sorted within the account type by the first (earliest) month where there is a non-zero forecast and then by the value of Forecast in descending order for that month.

The key to sorting headers and panes in Tableau is that it's done using the discrete (blue) pills on Rows or Columns from left to right. So the left-most discrete (blue) pill's headers are sorted, then the 2nd discrete pill's headers are sorted, and so on. For discrete dimensions from a primary source we can sort by a measure, use the default alphanumeric sort, or sort manually; otherwise any discrete pills are by default alphanumerically sorted or manually sorted.

Therefore in this case I knew I needed to either return a measure that could sort some dimension (like the Opportunity ID) or return a discrete dimension value that would sort correctly with the default alphanumeric sort. Note that filtering wouldn't work here because the goal is to show a sorted crosstab.

The next part of working out the solution is how to structure this value for sorting. I've done some multi-level sorting in the past where I needed a nested sort of a single dimension by two different criteria, and a common construct is a number of the form X.Y where the integer portion X is from one criterion and the decimal portion Y is from the other. So with the default alphanumeric sort 1.2 comes before 1.3, which comes before 2.1, etc.

So for the integer part of the sort I need to convert the date for each opportunity into a number where the Forecast is greater than 0. The Date Number (temp) calc has the formula:

IF [Forecast] > 0 THEN
    INT([Date])
END

Screen Shot 2016-06-30 at 2.17.34 PM

This converts the date into an integer, in this case the number of days since 1/1/1900. To get the first (earliest) month for each opportunity all I need to do is aggregate it with MIN() at the level of Opportunity ID:

Screen Shot 2016-06-30 at 2.18.01 PM

Ultimately, this is what we're going to do to get the pill sort of Opportunity IDs in the final view.

For the decimal part of the sort I needed a number where the smallest numbers reflected the largest values, and it needed to be between 0 and 0.999999 (it can't be a whole value of 1 because that would affect the integer sort). A way to turn a set of positive numbers into decimal numbers between 0 and 1 is to compute X/(max X). In this case X is the Forecast, so to get the max X in the data I used a Level of Detail Expression; here's the Max Forecast (temp) formula:

{FIXED : MAX([Forecast])}

Screen Shot 2016-06-30 at 2.18.35 PM

Now if I do [Forecast]/MAX([Forecast]) that's going to return a number between 0 and 1 that preserves the original ordering of values, i.e. bigger values of Forecast are closer to 1. So to invert that I use 1 – X/(max X). So if (max X) is 10 and X is 9, then the result of (1 – 9/10) is 0.1, while if X is 2 then the result of (1 – 2/10) is 0.8, a bigger number.

We avoid results of 1 that could affect the sort by skipping values where the Forecast is 0; here's the Invert Forecast (temp) formula:

IF [Forecast] > 0 THEN
    1-[Forecast]/[LOD Max Forecast Value]
END

Screen Shot 2016-06-30 at 2.19.19 PM

I could have avoided the LOD expression for the max value by just setting a gigantically huge number, however past experience with foreign currencies has shown me that whatever huge number I can imagine is likely to be smaller than reality, so I chose to make sure that the value is coming from the data.

With all the values worked out I could now put everything together into a single calculation; this is the Sort calc that returns record-level values:

IF [Forecast] > 0 THEN
    INT([Date])
    + (1-[Forecast]/{FIXED : MAX([Forecast])})
//Forecast is 0, return a really big number that will be put at the end of the sort
ELSE
    999999
END

Screen Shot 2016-06-30 at 2.20.09 PM

This calc returns the numbers as record level values.
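
Written outside of Tableau, the whole integer-plus-decimal trick fits in a couple of lines, which may make it easier to see (a pandas sketch; the file name and column names are assumptions, and the exact date epoch doesn't matter for the ordering):

import pandas as pd

opps = pd.read_csv("opportunities.csv", parse_dates=["Date"])  # hypothetical export of the data

# Integer part: the date as a day number. Decimal part: the Forecast scaled to 0-1 and inverted
# so bigger forecasts sort first. Zero-forecast rows get a huge value so they fall to the end.
day_number = (opps["Date"] - pd.Timestamp("1900-01-01")).dt.days
inverted = 1 - opps["Forecast"] / opps["Forecast"].max()
opps["Sort"] = (day_number + inverted).where(opps["Forecast"] > 0, 999999)

# The pill sort in the view is MIN(Sort) per Opportunity ID; this is the equivalent ordering.
sort_order = opps.groupby("Opportunity ID")["Sort"].min().sort_values()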

To show how the sort works out I set up this view where the Sort calc is used as the leftmost discrete dimension to show what gets sorted first; with the bar chart we can quickly visually verify that the dates are sorted with earliest months first and then by the Forecast within each month:

Screen Shot 2016-06-30 at 2.20.41 PM

Note that there’s a different value for each Opportunity ID/month combination, when what we really want is that single minimum value for each Opportunity ID/month. So we need to aggregate the Sort measure with MIN() at the level of detail of Opportunity ID, and we can do just that using a pill sort on the Opportunity ID dimension:

Screen Shot 2016-06-30 at 2.21.14 PM

Now we can arrange pills to build the original crosstab view and have the desired sort:

Screen Shot 2016-06-30 at 2.13.38 PM

And as the data updates the sort will automatically work, in this case I’ve added January 2017 to the data:

Screen Shot 2016-06-30 at 2.22.37 PM

Conclusion

The following bits of Tableau knowledge were necessary to get the desired results:

  • How Tableau sorts views using discrete pills.
  • How Tableau’s pill sorts work.
  • A tiny bit of Level of Detail Expression building.

And then the following math tricks were used:

  • Combining two different numbers into one using an integer number for one value and a decimal number for the second value.
  • Making positive numbers into a range of 0-1 using X/(max X). A different formula would be needed if there were negative numbers and/or the desired range was different.
  • Inverting ranges to make big numbers small and small numbers big using 1 – X/(max X).

FYI if LOD expressions are not available in your particular data source then you could use a table calculation, a data blend, or just manually enter your equivalent of the Max Forecast value. I set up a table calculation version as well in the Sorting by Two Values Tableau Public workbook.

Show Me *All* The Numbers: Displaying Every Record for Too Tall Data

Here’s a sample data set with 4 records:

When we bring that data set into Tableau and build a text table, though, we only see 3 rows:

If we want to show all 4 underlying records as 4 rows in the Tableau text table we have to jump through a couple of hoops; the rest of this post describes why Tableau behaves that way and how to fix it.

The Data is Too Tall

There are all sorts of resources about working with data that is "too wide", for example the old Preparing Excel Files for Analysis KB article, the new Pivot feature introduced in Tableau v9.0, or this post on Tiny Habits from Emily Kund with commentary from yours truly. Too wide data has too many columns for the kind of analysis that we want to do. There aren't so many resources on "too tall" data, of which this is an excellent example.

To explain what “too tall” data is, let’s first look at the data:

What is the grain of this data? In other words, what combination of field(s) makes a unique record in the data?  We might be tempted to say Group, Color, and Size, but for Tableau there is no difference between the first two records:

Effectively the data has no unique grain. Yes, there’s a difference in position between these two records but that is not detectable by Tableau because record order (position) is not something stored in each record. This highlights something I talk about in my training classes: the difference in mental models between Tableau and WYSIWYG tools like Excel. Tableau approaches data as a database does, and the default behavior in databases is that record order doesn’t matter. The reason databases abstract record order away is to get higher performance.

So when we bring this data into Tableau and create a view, Tableau's default behavior is to aggregate the data to the level of the dimensions in the view (i.e. on all Shelves and the Marks Card except for Filters). Here's what happens when we bring all the fields in this data set into the view as dimensions:

There aren’t enough dimensions to separate out the two A/Red/Small records. This explains what I wrote earlier about the data lacking enough dimensionality. What we really need is another column (field) to identify the records. So we now have a simple definition:

  • “too tall” data has too few columns to effectively perform the analyses we want
  • "too wide" data has too many columns to effectively perform the analyses we want

The rest of this post describes three ways to show all the records: viewing the underlying data, editing the source, and constructing a specific Tableau view.

Do You Need to Show the Data in a View?

If you don’t need to show the data in a Tableau view users can still view the underlying data in both Tableau Desktop and from Tableau Server & Online. Here’s the underlying data in Tableau Desktop:

And on Tableau Server:

So some user education might be all you need to show all the records. If the data is too tall and you do have to show a view with all the records then you can alter the source data or set it up in Tableau.

Do You Control the SpiceSource?

If you have control over the source data then many sources have features to add a unique record identifier that would add that necessary column to make the data not too tall, not too wide, but just right. For example Excel has the ROW() function:


Creating a view with this that shows every record is trivial; we just need to add Row ID as a dimension:

If you’re not using Excel then you’ll need to look for a function that adds a row ID, record ID, etc. Part of why this is a rare problem is that most relational data tables are set up with unique keys (indexes) that give us those unique values to draw tables. Where I typically see too tall data coming from is from hand-entered data sources and ancient systems.

When you don’t have that option and you’re stuck with too tall data we can still get a view showing every record in Tableau.

Building a Tableau View Showing Every Record for Too Tall Data

There are three main steps to building a view to show every record:

  1. Turning off aggregation so Tableau will return every record from the data source.
  2. Creating a table calculation to increment over each record and provide a unique identifier.
  3. Using that table calculation as a discrete pill to sort the view.

Here’s how using the above data source:

  1. Turn off Analysis->Aggregate Measures:

    The view now looks like this:

    The reason why there’s a lot of white space is that Tableau is now returning multiple records (the two 1’s) for A/Red/Small and has turned on Mark Stacking by default. This is not a problem, we’ll be rearranging the view later on to get rid of Mark Stacking.
  2. Create a Rank calculated field with the following formula:
    RANK_UNIQUE(MIN([Number of Records]))

    I use RANK_UNIQUE() here instead of INDEX() because rank only counts non-Null values and should there be any unwanted densification those ranks will return Null, whereas INDEX() would return values that would throw off the desired ordering.

  3. Drag the Rank field to the Level of Detail Shelf and set the Compute Using to an Advanced… Compute Using where all the dimensions are used for addressing in the order that you want the records to appear:

    Something I’ll typically do at this point to validate is to add the table calc (Rank) to the text Shelf or Measure Values (here I have it on text):
    And we can see that the Rank is accurate.
  4. Turn Rank into a discrete (blue) pill:
  5. Drag Rank to the Rows Shelf to the left of all the dimensions. With the unique identifier for each record (mark) the Mark Stacking goes away:
  6. As the last step turn off Show Headers for the Rank pill:

    The view now shows each individual record:

Conclusion

Tableau is designed to help us dive & swoop through thousands/millions/billions of rows of data to discover insights, so Tableau's default behavior is to aggregate the data. Tack on Tableau's mental model of treating data as a database does, and a task like showing every record can be more complicated when the data source isn't built with more modern database concepts and lacks the necessary dimensions to uniquely identify each record. A feature request for row numbering has been created to make this easier; vote it up if this is something that interests you!

Here’s a link to the too tall workbook on Tableau Public.

Screen Shot 2016-02-07 at 7.14.21 PM

An Exploration of Custom Color Palettes

This post is an exploration of why the two views above look different even though the red for 100% is the exact same for each, as in South Dakota for 2013 Q4:
Screen Shot 2016-02-07 at 7.18.31 PM

Screen Shot 2016-02-07 at 7.20.00 PM

I recently had the opportunity to dive into how Tableau assigns colors, as the goal for this particular worksheet is to make all the 0.0%’s have a white background so the rest of the colors have a little more breathing room. I came out of it impressed with the work that Tableau’s color designers (Maureen Stone and others) have done to create good looking color palettes in Tableau. Read on for some details on how to build your own palettes.

Continue reading

TDE or Live? When to Use Tableau Data Extracts (or not)

I recently answered a question for a new Tableau user on when to use a Tableau Data Extract (TDE) vs. a live connection, here’s a cleaned-up version of my notes:

Why Use a Tableau Data Extract?

My preference is to first consider using a live connection because extracting data adds another step to the data delivery chain. However there are many situations where that isn’t a workable solution, so Tableau has created Tableau Data Extracts to support situations where a live connection to the source is:

  1. Not possible. Sometimes a Tableau viz can't have a live connection to a production system, for example when you want to share a viz with someone outside your premises. The extract can be published (whether in a viz or as a published data source) to Tableau Server or Online, saved in a Tableau packaged workbook (TWBX) or packaged data source (TDSX), or sent "naked" as a TDE file.
  2. Too slow. There are a number of variations on this:
    1. For example a production system might be on a slow network connection, so a TDE can be created locally and only the extract refresh has to go over the slow network.
    2. Because data extracts are highly optimized for queries they can be much, much faster than a live connection. I regularly see 100x improvement in load times using Tableau data extracts over MS Access, to the degree that my muscle memory is tuned to making an extract as the first thing I do after connecting to an Access-based source.

      Note that older posts (prior to the release of Tableau v8.2) on extracts will talk about them being faster than raw Excel/text connections because at the time the “legacy” aka MS JET engine was used. That is no longer the case with the new Excel/text connector.  The new connector takes advantage of Tableau’s data extract (data engine) technology to create a data extract in the background, that’s why the first connection can be slow and then creating an extract can be instantaneous.

    3. The Tableau queries to the live connection might slow down operational queries too much, so having Tableau only query at scheduled extract refresh times is preferable.
    4. Data volumes could be such that raw data arriving at millions of records per week would be too slow to run live Tableau queries on, but could be aggregated in a TDE down to dozens or hundreds of records per week based on some set of categories/dimensions in the data.
  3. Unable to handle the record volumes. MS Excel is limited to 1M records, MS Access tops out anywhere from the hundreds of thousands to a couple million records depending on the complexity of the table, etc., whereas a TDE can potentially handle billions of records. Another case is situations where the data is stored in multiple tables (potentially across multiple databases) and a UNION query is used to generate a result that is too big for a live connection but fine for a TDE.
  4. Exposing too much data. There are four cases where extracts can effectively improve security by reducing what data is made available:
    1. We can create extract filters on TDEs so only the necessary records are included.
    2. We can set up the extract to only include fields used in the workbook, in other words we can exclude columns from the extract.
    3. Extracts can be configured to aggregate the data and therefore hide record-level detail.
    4. For file-based sources when we include the files in a TWBX it’s the whole file, so for an Excel file that means that every worksheet in that file is included in the TWBX. If we extract the data then only the necessary data for the workbook is in the TDE.
  5. Unable to handle the data volumes. A related case is that since a TDE is highly compressed it can be a lot smaller than the original uncompressed source. I’ve seen people use TDEs instead of file-based sources to make distribution of packaged workbooks easier.
  6. Not supporting certain calculations. Tableau data extracts have generally supported more functions than any particular data source (with the exception of RAWSQL functions). One example is that in the “old days” before Tableau 8.2 with the new Excel/text connector we were stuck with the MS Jet engine for connecting with Excel & text files and that couldn’t handle COUNTD(), MEDIAN(), In/Out of Sets among other drawbacks, so we’d create an extract. Another example is that currently not all sources support the Level of Detail Expressions introduced in v9.0 and again we can work around that by creating an extract.
  7. Unable to handle the complexity. There are various computations (such as using top and conditional filters, nested calculated fields, etc.) that TDEs can handle in combined ways that some data sources can’t. For example MS Access databases are one of my main data sources and in some Tableau worksheets if I switch from the TDE to the live connection the MS JET engine gives me a “query too complex” error.
  8. Actually a situation where multiple file-based sources need to be put together. With TDEs it's possible to add data to an extract from multiple file-based sources, which can be handy when you are integrating data from various producers at different times. Tableau is working on improving this: at the 2015 Tableau Conference they demoed a feature for creating federated queries across multiple data sources (including server-based sources, other TDEs, etc.). From what I saw Tableau will be able to do this in a live connection, however I'm guessing that we'll often want to be using TDEs for performance reasons.

Other Features of Tableau Data Extracts

A few other advantages of TDEs are:

  • Materialized expressions. Tableau will “materialize” record-level calculations that use only fields from a single data source and are not dependent on run-time values — i.e. not using TODAY(), NOW(), USERNAME(), ISMEMBEROF(), or a parameter — as fully indexed & compressed fields in an extract. This can improve performance in many cases, for example when splitting name or address fields and/or creating datetime fields out of strings.
  • Access to cloud-based data sources. In order to make cloud-based sources such as Salesforce.com, Google Analytics, oData, and the Tableau Web Data Connectors useful for the kinds of at-the-speed-of-thought analytics that Tableau enables we have to use Tableau data extracts. Other cloud-based sources such as Amazon Redshift, Google BigQuery, and Microsoft Azure can be used as a live connection or extracted as needed.
  • Option to publish to Tableau Public. For performance reasons we can only use TDEs when publishing to Tableau Public.

TDE Limitations

However, Tableau Data Extracts do have some limitations and there are cases when they are not suitable or more difficult to work with than a live connection:

  1. TDEs are by definition not a live connection to the source. This means that Tableau Data Extracts are not usable if you need "real-time" data in your Tableau viz. Also, if the refresh time of a TDE is longer than the desired data refresh interval then TDEs aren't really feasible.
  2. Tableau Data Extracts can’t be created from OLAP sources such as Oracle Essbase or SSAS. They can be created from SAP BW cubes, however.
  3. Changing the data structure of the underlying data can require rebuilding the entire TDE, which may not be very easy, take too much time, become impossible if the file-based source you used for an incremental append is no longer available, etc.
  4. Tableau’s support for incremental loads, slowly changing dimensions, and updates to existing rows is minimal to non-existent.
  5. Tableau Data Extracts do not support RAWSQL functions, nor can we use Custom SQL on an already-created extract. One use case for RAWSQL is when the underlying data source supports a given function and Tableau does not yet support that feature for that source.
  6. TDEs can become too slow to refresh and/or queries on them can become too slow depending on the data structure; here are some known factors:
    1. many rows (anywhere from millions to billions)
    2. many columns (when they get into the hundreds)
    3. lots of relatively non-compressible (high-cardinality) columns
    4. many complex materialized expressions

      So a billion-row extract might be plenty fast and a million-row extract on a complex data structure might be too slow, your best bet is to do your own testing.

  7. As of this writing (January 2016) I haven't heard of anyone else being licensed to read from TDEs, so the only pieces of software that can read from TDEs are Tableau Desktop, Tableau Reader, Tableau Server, Tableau Online, and Tableau Public. There's no published API for reading TDEs, and trying to save large CSVs from a Tableau worksheet is likely to run into out-of-memory problems, so if you're looking for more permanent storage for data so you can get at it later you're likely to want to look elsewhere.
  8. Refreshing TDEs puts more and more load onto Tableau Server and that can impact delivering visualizations, so doing the work to make the underlying source fast enough to use a live connection may be preferable to the extra hardware & configuration needed to make the TDE refresh fast enough.
  9. TDEs don’t include user-level security, those have to be set up higher up in the stack in the Tableau Server data source and/or Tableau workbooks that use the TDE, which means there’s extra work to prevent unauthorized users from getting access to the data in the Tableau views and the TDE itself. It may be better to implement that security in the raw data source (which I know makes my DBAs happy because they get to retain control).

To eliminate and/or work around the performance limitations of TDEs I see people doing one or more of the following:

  1. Read Designing Efficient Workbooks by Alan Eldridge and implement the suggestions there; it's the [insert holy-book-of-your-choice metaphor here] for Tableau performance tuning.
  2. Create multiple data sources on the same underlying data, the basic distinction is using a fast & lightweight TDE for the high-level views and then the detail reached via drill-down (i.e. Filter Actions) is stored in a big, relatively slower TDE or live connection.
  3. Use ETL tools such as Alteryx or Trifacta to pre-compute, pre-aggregate, and transform the data to make it fast in Tableau (and potentially use a TDE).
  4. Do the necessary performance tuning on the existing data source so it's fast enough to use as a live connection.
  5. Deal with high volume/high performance requirements by creating a new data source whether that be a tuned datamart/data warehouse/data lake or using something like Teradata, Vertica, Hadoop, Exasol, etc.

Conclusion

Thanks to Brian Bickell for To Extract or Not to Extract (published 2014-04-29) and Tom Brown for Tableau Extracts (published 2011-01-20); those posts helped validate and round out bits that I'd missed. Also thanks to Alan Eldridge for Designing Efficient Workbooks, it's on my "must read" list of Tableau resources. If you have other pros & cons of extracts, please let me know!