Tag Archives: data extract

Creating Lists of Values for Tableau from Text & Excel Sources

There are various use cases where we start out with a “flat” table like the Superstore sample data that has a number of columns with various dimensions and we want to make a simple list of unique values of one or more dimensions. such as a list that has just the six continents in Superstore:

The use cases for this include:

  • Using a filter action value as a parameter in the target source (look for posts from myself and Rody Zakovich on this in the next week).
  • Cross data source filters with higher performance when the list of filter values can be small compared to the volume of data.
  • Creating scaffold data sources to pad out data and ensure there are no sparse combinations of values.
  • Situations where we’d want to do a union or cross product of the data to do something like a market basket analysis but the union or cross product would be prohibitively large, so instead we only union or cross product desired dimension(s) and then join in the original data as necessary.
  • The last multi-select highlighter method from Multiple Ways to Multi-Select and Highlight in Tableau can use a self-union.

If you are starting out with a well-structured data warehouse with dimension tables, can write SQL, Python, or R, build custom views on the data source, use data preparation tools like Alteryx or Easymorph or Trifacta, etc. then obtaining or generating these kinds of lists is pretty straightforward. But not everyone has those skills or resources, and in the case of users who just have Excel and/or text files we need to get creative. This post goes through a three different methods to get these lists in Tableau:

    1. Ask!
    2. Aggregated Extract
    3. Excel Pivot Table as a Data Source
    4. Custom SQL

In this post I’ll go through each of these options. [Note: this post was updated on 10 Jan 2018 to make the aggregated extract method a little simpler.]

1. Ask!

This might seem obvious, but sometimes we’re stressed out and under deadlines and don’t realize we might be able to get help. If the data you are working with is coming from someone else then go ahead and ask them if they have a list of unique values. I’ve found that most people want the data they produce to be used and used well and if I’m coming back to them asking for something so I can do more with “their” data they are happy to accommodate me. I might phrase the request like “I want to make sure I’m using the latest list of departments, can you give me that list?”

The one caveat to getting data back from your ask is that you’ll need to go through some validation to make sure the list matches up with the “real” data, sometimes the amount of validation and cleansing isn’t worth the effort and one of these other approaches is better. However if you’re in a data-starved environment the kind of relationships you can make by asking for data can lead to more trust and ultimately more access to the data you want (and need).

2. Aggregated Extracts

For this method we’re going to connect to the data source and build an extract only we’ll be telling Tableau to aggregate the data to the desired level of detail (the field(s) we want to use) before Tableau builds the extract. The resulting extract then just has a record for each combination of field(s) that we want to use.

  1. Connect to the data source.
  2. Create a single worksheet with the field(s) you want to use as dimension pills, I usually just put them on Rows as discrete (blue) pills:
  3. Right-click on the source and choose Extract Data… The Extract Data window opens.
  4. Click on the Aggregate data for visible dimensions checkbox.
  5. Click the Hide All Unused Fields button.
  6. Click Extract. Tableau will ask where to save the extract. Choose a location and click OK.

Voila, you now have an aggregated extract source that you can use in Tableau data blends and/or join to!

Notes on Aggregated Extracts

There are a few things to keep in mind when using aggregated extracts: First of all there’s the need to refresh them to keep up with the data so if you have Tableau Server you’ll need to set up an appropriate schedule, if not then you’ll need to set up your own manual or automated workflow that gets the results you need. One possibility is using Tableau’s extract API.

Secondly if new columns are later added to the data they are automatically added to the extract. This may be ok for some use cases, there are others where this will break views that depend on that extracted data.

Finally, if you want to join on this aggregated extract you’ll need to join directly to the .tde or .hyper file.  Where this gets complicated is handling data updates. You’ll need one workbook or workflow to update the extract and then use the extract in a second workbook. Unfortunately we can’t publish the extract to Tableau Server or Online and join to that published data source (yet), otherwise that would be an easy workaround. There are a number of cases where a Tableau data blend is sufficient, we’ll be demonstrating one in the next week.

3. Excel Pivot Table as a Data Source

For Excel sources besides connecting to worksheets with raw data we can connect to worksheets that are built as a pivot table.

Here’s how using Excel 2016 for Mac:
  1. Open the source in Excel.
  2. Create a pivot table in a new worksheet.
  3. Drag the field(s) you are interested in to Rows.
  4. Rename the Row Labels header to have appropriate values if necessary.
  5. Remove the grand total.
  6. Rename the worksheet to something more meaningful than Sheet2.
  7. Save the workbook in Excel.
  8. Open up Tableau and connect to the Excel workbook.
  9. Drag the pivot table you just added onto the canvas:
Now you can use this to join to other tables and/or use in data blends.

Notes on using Excel Pivot Tables as a Data Source

Before Tableau introduced Level of Detail expressions in version 9 I used pivot tables in production views to pre-aggregate the data for some values and also to create tables I could join on to pad out the data so I could be sure to see records for every (person, office, metric) for every month. This method has one potentially major challenge around data updates, though, and that is that if we have data in worksheet A and a pivot table in worksheet B and we update the data in A (such as adding a new value that should appear in the pivot table B) that change won’t be reflected in the pivot table B until there is an explicit command in Excel to update the pivot table B and then save the workbook.

Even though we can tell Excel to do things like “Refresh data when opening file” this flag is only detected by Excel, not Tableau. Therefore to get updates to the data to be reflected in the pivot table the workflow has to include the steps to do a Data->Refresh All or open the pivot table worksheet before saving the workbook.

4. Custom SQL for Excel & Text Files

When I’m delivering Tableau training classes and we get to the point of talking about SQL & Tableau there are two common reactions: 1) yeay! and 2) [eyes glaze over]. This part is for the people in the latter category. Tableau hasn’t turned everything we might want to do into point & click, so sometimes we need to work with raw data. We do this in our everyday lives…there’s no good vegetarian restaurant in my town so when my wife & I want African ground nut stew we’ve got to make it ourselves. So I think of using Custom SQL as using the raw ingredients of the data to get a result I don’t have another way to get. However, in this case we’re going to be lazy (in a good way) and make Tableau write the SQL for us! Here’s how (these instructions don’t work for Tableau for Mac, see the Notes section below for more info):

    1. Start adding a new data source that is the Excel or text file you want to connect to.
    2. In the Open dialog select the file, then on the Open button click the drop down carat and choose “Open with Legacy Connection”.  You’ll return to the data source window.
    3. Drag the worksheet or file if necessary onto the canvas.
    4. Use the Data->Convert to Custom SQL menu option. The Convert to Custom SQL window will appear.
    5. Edit the Custom SQL to remove all the fields that you don’t need.
    6. Make sure to delete the trailing comma from the last field in the SELECT before the FROM.
    7. Add the DISTINCT keyword after the SELECT before the first field. The SQL query will now look something like this:
    8. Click Preview Results… to test. If it comes back with an error then check your syntax (see notes below for some tips) and try again. If it works by showing a View Data window with your results close the View Data window and then click OK to close the Custom SQL window You’ve now created a unique list of values using custom SQL!

The advantage of using Custom SQL compared to using an aggregated extract or pivot table is that it updates with the data and doesn’t require the more complicated workflows of the other methods.

Simple SQL SELECT Query Syntax

Here’s a really simple example for getting one field from one table:
SELECT DISTINCT [table].[field1] AS [field1]
FROM [table]
If you want multiple fields from one table the SQL query looks like this:
SELECT DISTINCT  [table].[field1] AS [field1],
   [table].[field2] AS [field2],
   [table].[field3] AS [field3]
FROM [table]

In some ways SQL is written a little backwards, and in more complicated queries backwards and forwards. To me the real “starting place” of a SQL query is the FROM part because that is telling the SQL engine where (what table, worksheet, or text file, generically called “table”) to get the data from. Then the SELECT is going to grab the set of fields that we specify. The DISTINCT keyword tells the SQL engine to only get the unique (distinct) combinations of values of those fields instead of grabbing every single record.

The field names themselves use the [table name].[field name] convention so that if there are multiple tables in a query each field referenced can be uniquely identified. The table and field names are surrounded by square brackets by default to handle situations where the table or field name might have spaces. Finally Tableau uses the AS [field name] aliasing option to ensure that the name used by Tableau is a usable name in Tableau.

SQL doesn’t care about spaces & line feeds, we could write SELECT DISTINCT [table].[field1] AS [field1] FROM [table] all one one line and it would work just fine.

SQL cares very much about the placement of square brackets & commas, if one is out of place or missing then the whole query will fail. Make sure that you have all brackets in place and make sure that the last field in the SELECT doesn’t have a comma after it.

Notes on Custom SQL for Excel & Text Files

The Legacy Connector is not available on Tableau for Mac, so we can’t use this particular method for connecting to Excel or text files on the Mac.

The Legacy Connector is actually the Microsoft JET driver that was phased out in Tableau version 8.3 for a variety of reasons, here’s a link of differences to be aware of from the Tableau legacy connector documentation. Also here’s the Tableau documentation on Connect to a Custom SQL Query. Finally I did a post awhile back on details of using the Custom SQL in the context of Microsoft Access connections which also use the MS JET driver, some of the points there are useful to keep in mind.

Hacky…or not?

If it all seems a bit hacky and contrived then I agree with you. At this time if all we have are Excel or text files and what features Tableau provides we’re in a low-resource environment and workarounds are necessary.

I regularly see projects I’m working with needing to invest more in data preparation in order to keep Tableau humming along. That investment could be in scripting languages like Python or PowerShell or R, using PowerQuery, starting the process of moving data into a database (there are free versions of many databases), and/or use more dedicated data preparation tools like Alteryx, Easymorph, or Trifacta. I like to set expectations around this early on in new projects because once they start using Tableau invariably projects run into imitations of their existing data pipeline to provide the volume and variety of data that they can now analyze in Tableau.

Conclusion

The goal for this post was to set you up with the skills you need to get a custom list of distinct values to support several different use cases and I hope this did that for you. As mentioned early on, Rody Zakovich and I have some posts in the works that use this to do some new things in Tableau!

TDE or Live? When to Use Tableau Data Extracts (or not)

I recently answered a question for a new Tableau user on when to use a Tableau Data Extract (TDE) vs. a live connection, here’s a cleaned-up version of my notes:

Why Use a Tableau Data Extract?

My preference is to first consider using a live connection because extracting data adds another step to the data delivery chain. However there are many situations where that isn’t a workable solution, so Tableau has created Tableau Data Extracts to support situations where a live connection to the source is:

  1. Not possible. Sometimes a Tableau viz can’t have live connection to a production system, for example when you want to share a viz with someone not inside your premises. The extract can be published (whether in a viz or as a published data source) to Tableau Server or Online, or saved in a Tableau packaged workbook (TWBX) or packaged datasource (TDSX), or sent “naked” as a TDE file.
  2. Too slow. There are a number of variations on this:
    1. For example a production system is on a slow network connection so a TDE can be created locally and only have to go over the slow network.
    2. Because data extracts are highly optimized for queries they can be much, much faster than a live connection. I regularly see 100x improvement in load times using Tableau data extracts over MS Access, to the degree that my muscle memory is tuned to making an extract as the first thing I do after connecting to an Access-based source.

      Note that older posts (prior to the release of Tableau v8.2) on extracts will talk about them being faster than raw Excel/text connections because at the time the “legacy” aka MS JET engine was used. That is no longer the case with the new Excel/text connector.  The new connector takes advantage of Tableau’s data extract (data engine) technology to create a data extract in the background, that’s why the first connection can be slow and then creating an extract can be instantaneous.

    3. The Tableau queries to the live connection might slow down operational queries too much, so having Tableau only query at scheduled extract refresh times is preferable.
    4. Data volumes could be such that millions of records per week of raw data that would be too slow to run live Tableau queries on might be aggregated in a TDE down to dozens or hundreds per week based on some set of categories/dimensions in the data.
  3. Unable to handle the record volumes. MS Excel is limited to 1M records, MS Access tops out anywhere in the 100s of thousands to couple M records depending on the complexity of the table, etc. whereas a TDE can potentially handle billions of records. Another case is situations where the data is stored in multiple tables (potentially across multiple data bases) and a UNION query is used to generate a result that is too big for a live connection but fine for a TDE.
  4. Exposing too much data. There are four cases where extracts can effectively improve security by reducing what data is made available:
    1. We can create extract filters on TDEs so only the necessary records are included.
    2. We can set up the extract to only include fields used in the workbook, in other words we can exclude columns from the extract.
    3. Extracts can be configured to aggregate the data and therefore hide record-level detail.
    4. For file-based sources when we include the files in a TWBX it’s the whole file, so for an Excel file that means that every worksheet in that file is included in the TWBX. If we extract the data then only the necessary data for the workbook is in the TDE.
  5. Unable to handle the data volumes. A related case is that since a TDE is highly compressed it can be a lot smaller than the original uncompressed source. I’ve seen people use TDEs instead of file-based sources to make distribution of packaged workbooks easier.
  6. Not supporting certain calculations. Tableau data extracts have generally supported more functions than any particular data source (with the exception of RAWSQL functions). One example is that in the “old days” before Tableau 8.2 with the new Excel/text connector we were stuck with the MS Jet engine for connecting with Excel & text files and that couldn’t handle COUNTD(), MEDIAN(), In/Out of Sets among other drawbacks, so we’d create an extract. Another example is that currently not all sources support the Level of Detail Expressions introduced in v9.0 and again we can work around that by creating an extract.
  7. Unable to handle the complexity. There are various computations (such as using top and conditional filters, nested calculated fields, etc.) that TDEs can handle in combined ways that some data sources can’t. For example MS Access databases are one of my main data sources and in some Tableau worksheets if I switch from the TDE to the live connection the MS JET engine gives me a “query too complex” error.
  8. Actually a situation where multiple file-based sources needed to be put together…with TDEs it’s possible to add data to an extract from multiple file-based sources, which can be handy when you are integrating data from various producers at different times. Tableau is working on improving this: At the 2015 Tableau Conference they demoed a feature for creating federated queries across multiple data sources (including server-based sources, other TDEs, etc.). From what I saw Tableau will be able to do this in a live connection, however I’m guessing that we’ll often want to be using TDEs for performance reasons.

Other Features of Tableau Data Extracts

A few other advantages of TDEs are:

  • Materialized expressions. Tableau will “materialize” record-level calculations that use only fields from a single data source and are not dependent on run-time values — i.e. not using TODAY(), NOW(), USERNAME(), ISMEMBEROF(), or a parameter — as fully indexed & compressed fields in an extract. This can improve performance in many cases, for example when splitting name or address fields and/or creating datetime fields out of strings.
  • Access to cloud-based data sources. In order to make cloud-based sources such as Salesforce.com, Google Analytics, oData, and the Tableau Web Data Connectors useful for the kinds of at-the-speed-of-thought analytics that Tableau enables we have to use Tableau data extracts. Other cloud-based sources such as Amazon Redshift, Google BigQuery, and Microsoft Azure can be used as a live connection or extracted as needed.
  • Option to publish to Tableau Public. For performance reasons we can only use TDEs when publishing to Tableau Public.

TDE Limitations

However, Tableau Data Extracts do have some limitations and there are cases when they are not suitable or more difficult to work with than a live connection:

  1. TDEs are by definition not a live connection to the source. This means that Tableau Data Extracts are not usable if you’re needing “real-time” data in your Tableau viz. Also if the refresh time of a TDE is more than the desired data refresh time then TDEs aren’t really feasible.
  2. Tableau Data Extracts can’t be created from OLAP sources such as Oracle Essbase or SSAS. They can be created from SAP BW cubes, however.
  3. Changing the data structure of the underlying data can require rebuilding the entire TDE, which may not be very easy, take too much time, become impossible if the file-based source you used for an incremental append is no longer available, etc.
  4. Tableau’s support for incremental loads, slowly changing dimensions, and updates to existing rows is minimal to non-existent.
  5. Tableau Data Extracts do not support RAWSQL functions, nor can we use Custom SQL on an already-created extract. One use case for RAWSQL is when the underlying data source supports a given function and Tableau does not yet support that feature for that source.
  6. TDEs can become too slow to refresh and/or queries on them become too slow based on the data structure, here are some known factors:
    1. many rows (anywhere from millions to billions)
    2. many columns (when they get into the hundreds)
    3. lots relatively non-compressible (high-cardinality) columns
    4. many complex materialized expressions

      So a billion-row extract might be plenty fast and a million-row extract on a complex data structure might be too slow, your best bet is to do your own testing.

  7.  As of this writing (January 2016) I haven’t heard of anyone else being licensed to read from TDEs so the only pieces of software that can read from TDEs are Tableau Desktop, Tableau Reader, Tableau Server, Tableau Online, and Tableau Public. There’s no published API for reading TDEs and trying to save large CSVs from a Tableau worksheet is likely to run into out-of-memory problems so if you’re looking for more permanent storage for data so you can get at it later you’re likely to want to look elsewhere.
  8. Refreshing TDEs puts more and more load onto Tableau Server and that can impact delivering visualizations, so doing the work to make the underlying source fast enough to use a live connection may be preferable to the extra hardware & configuration needed to make the TDE refresh fast enough.
  9. TDEs don’t include user-level security, those have to be set up higher up in the stack in the Tableau Server data source and/or Tableau workbooks that use the TDE, which means there’s extra work to prevent unauthorized users from getting access to the data in the Tableau views and the TDE itself. It may be better to implement that security in the raw data source (which I know makes my DBAs happy because they get to retain control).

To eliminate and/or work around the performance limitations of TDEs I see people doing one or more of the following:

  1. Read Designing Efficient Workbooks by Alan Eldridge and implementing the suggestions there, it’s the [insert holy-book-of-your-choice metaphor here] for Tableau performance tuning.
  2. Create multiple data sources on the same underlying data, the basic distinction is using a fast & lightweight TDE for the high-level views and then the detail reached via drill-down (i.e. Filter Actions) is stored in a big, relatively slower TDE or live connection.
  3. Use ETL tools such as Alteryx or Trifacta to pre-compute, pre-aggregate, and transform the data to make it fast in Tableau (and potentially use a TDE).
  4. Do the necessary performance tuning in the existing data source fast enough to use as a live connection.
  5. Deal with high volume/high performance requirements by creating a new data source whether that be a tuned datamart/data warehouse/data lake or using something like Teradata, Vertica, Hadoop, Exasol, etc.

Conclusion

Thanks to Brian Bickell for To Extract or Not to Extract (published 2014-04-29) and Tom Brown for Tableau Extracts (published 2011-01-20), those posts helped validate and round-out bits that I’d missed. Also thanks to Alan Eldridge for Designing Efficient Workbooks, it’s on my “must read” list of Tableau resources. If you have other pros & cons of extracts, please let me know!

Grr... Argh...

O Extract, Where Art Thou?

This was yesterday’s contribution to a Tableau forums discussion on data extracts, I thought it deserved a separate post that I could keep updated. There are some subtle behaviors and idiosyncracies in working with data connections, Tableau data extracts, and Tableau Server that aren’t fully fleshed out in the documentation, here’s my attempt! I start out with a review of the common file types and .twb vs. .twbx, and then get into some details on different types of connections and what happens based on different orders of operations, and toss in a gratuitous Buffy reference.

Continue reading