Cross Data Source Joins + Join on Calcs = UNION Almost Anything in Tableau v10.2

Since Tableau v9.0 or so every new release has come with new features that simplify and reduce the amount of data prep I have to do outside of Tableau. Pivot in version 9.0, the first batch of union support in v9.3, support for ad hoc groups in calculations, cross data source joins and filters in v10.0, and more in-database unions and join calculations in v10.2. With the join calculations we can now do unions and cross/cartesian joins within or across almost any data source without needing Custom SQL or linked databases and without waiting for Tableau to implement more union support, read on to learn how!

Here are some use cases for unions across data sources:

  • Union data that is coming from different systems, for example when different subsidiaries of an organization are using different databases but you want a single view of the company.
  • Union actual sales data from transactional systems and budget data that might come from an Excel spreadsheet.
  • Union customer & store/facility data sets so you can draw both on the same map.

This post goes through examples of all three using a combination of text files and superstore, and Rody Zakovich will be doing a post sometime soon on unions and joins with Tableau data extracts. (Did you know you could do cross data source joins to extracts? That capability came with v10.0, and we can have all sorts of fun with that using join calculations!)

Union Requirements

If all your data is located in a single Excel file, a single Google Sheets workbook, a single folder of text files, a single SQL Server, MySQL, Oracle, Postgres, Redshift, or HP Vertica database, then you can use Tableau’s existing union functionality in v10.2 (or earlier in the case of Excel workbooks, text documents, and Google Sheets workbooks) and skip this post.

If your source is not one of those then to use the union via cross join technique that I’m going to demonstrate here the source must support cross data source joins . If it doesn’t support cross data source joins (I’m looking at you, OData) there’s a workaround using a Tableau data extract  where you connect to the raw source, extract if necessary (OData sources automatically create extracts), then use the technique on the extract.

If the data source is something like one of the OLAP/cube sources that can’t be extracted and can’t do a cross data source joins then you’re out of luck and can’t use this technique

Union via Cross Join Technique Overview

The union that we’re doing is actually a form of cross join (aka cartesian join or cartesian product, see Alexander Mou’s post for some background) that uses Tableau’s cross data source join and the new join calculations in Tableau v10.2. Here’s how it works:

  1. An initial connection is set up to a “scaffold” source that is an N-row text file where N is the number of raw sources that need to be unioned. Here’s the two row text file source that I created for this post with just values 1 and 2:
  2. For each raw source to union:
    1. The raw source is added as another connection.
    2. The source is left-joined to the scaffold source where the left side of the join uses the scaffold’s value column and the right-side of the join uses a join on calculation with one of the values. The cross join happens here because the left side of the join has a single value that is joined to every record of the right side of the join. Here’s a screenshot of the source for the actual sales + budgeted sales use case:
  3. Since this isn’t a true union that would union the values of fields based on the field names we need to “coalesce” or “merge” the values ourselves  We do that coalescing with the IFNULL() function for each pair of fields that needs to be coalesced. If we are unioning 3 or more sources this way then we’d nest the IFNULL() expressions. For example IFNULL([Sales],[Budgeted Sales]) creates a coalesced measure for sales.
  4. Just as in a regular union where we’d typically want to identify each source via a dimension (such as the Table Name dimension that Tableau helpfully adds) we do that here using the scaffold’s values.

With that overview let’s get down to three specific examples: unioning data from different systems/regions, unioning actuals & budget, and unioning to draw customers & facilities on the same map.

Unioning Data from Different Systems/Regions

I’ve worked with a number of clients where they’ve acquired or merged with other organizations and have multiple pieces of software/databases all doing the same thing. So for this example we’ve got sales data coming from the East region in an Excel file and the sales data coming from the West region in a text file. Not every region has sales on every date, so to show a combined sales we can’t use a regular join, instead we need to union, and because these are different data sources we can’t use Tableau’s native union.

We can set up the union via cross join as follows:

  1. Create a text file with a Value column and two rows with values 1 and 2. You can do this in any text editor.
  2. Create a new data source in Tableau using the two row source.
  3. Add a connection to the East source.
  4. Drag the East data worksheet onto the canvas.
  5. Set up a left join to the Value column from the scaffold source.
  6. For the right-hand side of the join click on join on Edit join calculation… Tableau’s calculation editor opens.
  7. Enter a value of 1.
  8. Click OK to close the calculation editor.
  9. Add a connection to the West data source.
  10. If necessary, drag the West data file onto the canvas.
  11. Set up a left join to the Value column from the scaffold source.
  12. For the right-hand side of the join click on join on Edit join calculation… Tableau’s calculation editor opens.
  13. Enter a value of 2.
  14. Click OK to close the calculation editor. The join should now look like this:
  15. In the data source window you can now see rows for value 1 from the East source (with Null values for fields from the West source) and rows for value 2 from the West source (with Null values for fields from the East source.
  16. Now to coalesce the fields: Hover over the Order Date for the East field and click on the caret dropdown->Create Calculated Field… to open the calculation editor.
  17. Create a calculated field with the formula IFNULL([Order Date], [Order Date (west sales)]), I called it Coalesced Date here.
  18. Repeat steps 15 & 16 for the Sales measure to coalesce that field as well. Repeat as necessary for other fields you’d want to coalesce.
  19. Now for the last step, creating text to use for identifying each source. Since I want to have a field that displays number values and a field that displays text files I use a duplicate of the Value dimension: Go to a worksheet.
  20. Right-click on the Value dimension and duplicate it.
  21. Rename the Value dimension.
  22. Add aliases for each of the values of the Value dimension.

Here’s a simple view showing the results of the union:

So that’s the first example of how to union almost any type of data source by using a cross data source join and a cross join via a join calculation.

Unioning Data for Actuals and Budget

When working with actuals & budget, actuals & target, forecast & budget, etc. we can often use a regular join or a Tableau data blend to effectively use the two measures as separate columns, for example here’s a Tableau data blend in a simple Measure Names/Values crosstab:

However there are some cases where for interactivity or ease of use this doesn’t work so well. For example I’ve seen the request to “depending on the selected mark go to a different worksheet” where the marks are initially made by different columns in the data (i.e. measures, like in the table above).  The only way to do that kind of Filter Action in Tableau is when the marks are made by different rows in the data, i.e. instead of having two different measures we have a single column for the values and a second column identifying the measure names. The reason why we need rows is that Filter Actions depend on dimensions, and a dimension is created by a single column of multiple values (as opposed to multiple columns that would spread values across a single row).

If the measures we’d like to treat as a dimension are in a single source then (if it’s supported) we can use Tableau’s pivot feature to do the transformation of columns to a row. Or we can pivot in the data source (or use a cross join via cross data source join technique I’ll be demonstrating in a future post). However when there are multiple data sources with the multiple measures as columns then we need a union and this union via cross join technique works for that purpose.

In the following Tableau data source the actual sales are from a sample Superstore source and the budgeted sales are coming from another source where the budget is per year per product per product sub-category. I’m using the union via cross join technique here:

From the 2 row scaffold source Value 1 is joined to the actual sales data, and Value 2 is joined to the budgeted sales  data, and the Measure dimension is an aliased version of the Value dimension. With the Measure as an actual dimension in the data (instead of the almost-but-not-quite dimension that is Measure Names) I can use this to create a dashboard where depending on which bar is clicked we go to separate worksheets, here’s an interactive view on Tableau Public:

Another use is that in building out views that are a comparison of actuals to budget instead of having to write calculated fields I can use Tableau’s quick table calculations and never have to touch the keyboard:

In this case the % Difference in Budget/Sales is not a % difference over time but a % difference from actual to target.

This technique can also support the kinds of pivots needed for complicated combo charts, see http://drawingwithnumbers.artisart.org/bars-and-lines/ for more details.

Note that with this unioned source we can still use the separate Sales and Budget measures with Measure Names/Values to build views so we get the benefits of both the “wide” setup with separate columns and the “tall” setup with the Measure dimension and separate rows.

Unioning Data for Facilities and Customers

In order to draw a map in Tableau from two or more different raw data sets like facilities (stores, shipping centers, manufacturing plants, etc.) and customers or transactional data there are some requirements, the critical one is that the geography — a geographic field such as City, State or latitude & longitude pairs, or the geometry (spatial) objects that are new in Tableau v10.2 — must come from the same Tableau source. This means that we must join or union the different data sets. When the two data sets have significantly different grain like customers & shipping facilities a join usually isn’t feasible, instead we need a union. When those sources are from totally different sources such as customer data from an OLTP database and facilities from an Excel spreadsheet then we can’t use a native Tableau union, we have to use the union via cross join technique.

In this example I’ve used the union via cross join technique where customer data is coming from Superstore sample data and there’s a set of shipping facilities in New England with a mapping table of what states those facilities cover:

In this view Shape marks are used to draw the customer zip codes (circles) and shipping centers (red stars) , and a Filled Map draws the six states colored by the 4 shipping centers:

Because the data source is unioned we can coalesce the customer & shipping center zip codes to plot both of those on the Shape Marks card, then on the second Filled maps marks card it’s only plotting values from the shipping center data (there’s a Null mark for the customer data but Tableau doesn’t map Null marks unless we tell it where to plot them, in this case we don’t want Tableau to do anything).

Here’s the Marks Card for the Shapes & Filled Map, I’ve captured both of the Latitude (generated) cards to show the level of detail for each:

The level of detail of the Shape marks card is set by the Coalesced Zip Codes field (that has a value for every customer zip code and shipping facility), the Type dimension gets us the color, size, and shape. The level of detail of the Filled Map Marks card is two dimensions from the shipping facilities source: Customer State (i.e. the state served by the shipping facility) to set the overall level of detail of the 6 states and then the Shipping Center State (i.e. where the shipping facility is located) is on Color.

How to Conceptualize the Shapes and Filled Map

This can be a little complicated to think about, one key is to remember that each Marks card has its own viz level of detail (vizLOD) based on the dimensions on the particular marks card (plus Rows, Columns, and Pages).  So when we want a Marks card to show values from both of the unioned sources (such as the Shape marks) then we use a dimension (possibly a coalesced field) that has usable values in both sources. When we want a Marks card to show values from just one of the unioned sources then we use a dimension that only has usable values from that one source. Therefore the filled map for the states only uses dimensions from the shipping facilities source.

Note that we do need to be careful about filters with sources that are unioned like this. If you are trying this technique out and running into problems with filters removing values that you need to see you’re probably going to need to use parameter based filters (which will let you filter just one source without affecting the other) or you’ll need to start using a cross product with the union, that’s a subject for another post.

How about Drawing Lines?

You mean like this?

The key to drawing lines in Tableau is that we need origin/destination pairs as rows in the data source so we need some sort of tall data source, usually from a union. The view above is using the zip codes of customers from Superstore and the zip codes of the shipping centers as the origin/destination pairs. In the union there are two copies of the Superstore customer data with the shipping facility data joined to each. With two full copies of the data then we have maximum flexibility in how we draw, color, and size the marks.

If your data source supports a native Tableau self-union then the source looks like this and we don’t need to use the union via cross join technique:

If Tableau doesn’t support a self-union for one of your data sources then we do need the union via cross join technique and the source looks something like this where the scaffold table has two separate joins to the one of the data sets (thereby creating the union for that data set) and then each of those is joined again to the second data set.

Another way of thinking of this is that if we could use a Tableau self-union either the Orders source or the Shipping Centers source then we could just join the other to the union results and be done with it like the in the Tableau Union Path source above. It’s only the condition where neither source can be self-untoned in Tableau that we need to use the union via cross join technique.

The calculations for the union via cross join source get more complicated because we have to do more coalescing ourselves, but we can get the exact same results in views, check out the two Union Stars and Lines tabs below, they look the same:

Conceptualizing the Lines & Stars

The vizLOD of the star Shape marks used to draw the stars is the shipping center zip code so that’s only going to draw 4 marks. The vizLOD for the line marks is more complicated because we need a Path ID dimension to determine what the line segments will be, then we need a Path Order dimension to specify the order in which to draw the points, and in this case it’s also increasing the level of detail to create the two points for each line.  (This is the crux of why we need two records to draw each line: we need at least one dimension in the vizLOD to make the two points for each line and only record-level fields can be dimensions…ergo, two records).

Here’s the Marks Card for the Stars and Lines using Union via Cross Join sheet:

The Path ID is a calculated field with the coalesced customer zip code (with a Geographic Role of None so Tableau doesn’t try to use it to draw marks) because we want one line for each customer zip code.

The Path Order (Cust & Shipping Zips) is a calculated field using the customer’s zip code from one side of the union and the shipping center’s zip code from the other so in each line (defined by the Path ID) we’ll get a point for the customer zip code and a point for the shipping center zip code. The Path Order (Customer & Shipping Zips) calculated field is using a CASE statement so it returns different values based on which side of the union it’s on, here’s the one used by the union via cross join version:

CASE [Value]
WHEN 1 THEN [Postal Code]
WHEN 2 THEN [Zip Code (Shipping Centers1)]
END

Conclusion

The combination of cross data source joins introduced in Tableau v10.0 and join on calculations introduced in v10.2 is wonderfully powerful for all the Tableau users who don’t have total control over their raw data sources and need to do a certain amount of restructuring of the data to get the desired results.

Besides unioning data across different systems we can use variations on this technique for other use cases like computing moving count distinct without needing table calculations, padding out data so we don’t need to resort to complicated data densification techniques just to get 0’s to show up, and do all sorts of padding to support more complex chart types like Sankeys, jump plots, waffle charts, and beyond. I’ll be posting (or link to your posts!) about how to build those as time permits.

Here’s the link to the Union via Cross Join workbook on Tableau Public. If you’d like to build out the Tableau data sources for these examples yourself here’s the union via cross join sources.zip file.

 

Please add your thoughts and perspectives