Skip to content

Visualizing crosstabs (or pivot tables) with sankey diagrams

April 23, 2010

I recently ran across an incredible blog, Excel Hero, that presented a Sankey diagram of energy flow created in Excel. I have been inspired by this to finally post an example of a Sankey diagram I have been using to visualize crosstabulations. The presented example is changes in overall and category grades for NYC High School Progress Report from publicly available data from the NYC DOE (to clarify, these are grades evaluating schools, not students).

Caveat: There is a decent chance that the actual figures I present are erroneously labeled or calculated since I did not bother to review. If the data is of interest, I would work from the raw data. My aim is to present an approach to data visualization, not to present an actual analysis.

Note: If anyone is interested in the workbook used to create these diagrams, let me know and I will provide it.

Value of Sankey v crosstab or pivot table

Crosstab

crosstab_table

Sankey, overall

sankey_overall

Sankey, environment

sankey_environment

Sankey, performance

sankey_performance

Sankey, progress

sankey_progress

I think the Sankey offers improved pattern recognition over a crosstabulation, even when formatted (although a more nuanced heatmap may have made the table more revealing). For simple and small tables there may be relatively little to gain from visualizing data; but as is the case with visualization in general, the greater the quantity and complexity of the data, the more advantageous it is to visualize. The Sankey I have created has increases in grade represented on the top (green), static grades in the middle (brown), and decreases on the bottom (blue). The height of the path or arc represents the magnitude of the increase or decrease (i.e. big or small jump), and the width of the path represents the portion of schools it represents. In this case, you can learn a lot with a quick glance. For example. there is relatively less movement among overall grades than category grades, environment had relatively little increases compared to other categories, and performance has more varied movement than performance. When I look at the tables, they all look more or less the same.

Typically visualizations improve pattern recognition at the loss of look-up functionality, i.e. exact figures are not available. In this case, the exact number and percent of schools label each line, so it contains all the detail available in the table. The coloring may be more effective if red colored the decreases, but I was inspired by the riverbank metaphor used in this music history visualization (the metaphor does not really apply in my case, but I like how it looks). The line are made semi-transparent and use relatively light colors to improve clarity in cases of overlapping lines. One limitation of this approach is that unless you have a good printer, the diagram looks like junk.

This circular Sankey is generalizeable to any crosstab of categorical data that is ordinal and symmetrical. Sankeys commonly depict  asymmetrical or geographic data, but in my opinion Sankeys offer a good alternative to more run-of-the mill crosstabs as well.

I have seen other Sankey or network visualizations created in Processing, and in most cases my Excel creation looks crude in comparison. Here is an example that came to mind, but upon close review it is not a Sankey since line widths do not convey information. Nonetheless, it is an interesting and beautiful visualization.

Creation
The workbook consists of three worksheets:

*Worksheet with the original actual crosstab table. This provides an easy location to populate new raw data.

Worksheet, raw data

data_raw

*An intermediate data worksheet. This references and restructures the crosstab into one value per row, and each value is named; this provides and easier way of referencing the values in VBA. The worksheet also performs calculations, e.g. each values percent of total and width, subtotals.

Worksheet, intermediate data

data_int

*A worksheet containing a template of the diagram, where all of the lines are uniform in width. I created the initial lines and labels manually, and named each line and width with the connection is represents, e.g. a_b.

Worksheet, template

sankey_template

A macro uses the values from the intermediate data worksheet to set the widths of all the lines. It enters data from the intermediate data worksheet into the labels. So the crux of the program is:

Sub master()
line_transparency = 0.2
Sheets("graph_template").Copy Before:=Sheets(1)
Sheets(1).Name = "graph"
Set ws_graph = Worksheets("graph")
Set ws_table = Worksheets("table")
For r = 2 To 26
Call obtain_parameters_from_ws
If line_weight <> 0 And line_weight <= 0.4 Then line_weight = 0.6
label = create_label(count, percent)
Call format_line
Next r
Call label_summary_cells
End Sub

One annoyance when creating this was figuring out which shapes are actually accessible to VBA. I think something like smart shape objects cannot be referenced, but lines can. So for example, I originally attempted to have the brown lines be actual box shape objects, but this did not work, so I had to make them lines. And the labels on the brown lines did not seem to be positionable where I wanted them, so I had to use text boxes. Dealing with small stuff like this accounted for about half the time it took to put this together.

When all is said and done, the end-user just needs to populate the crosstab tables and provide a single overall width multiplier for all lines (this takes a few attempts before a width that is aesthetically pleasing and results a clear digram is arrived at) and hit run to generate a new Sankey.

Advertisement
10 Comments leave one →
  1. April 23, 2010 4:31 am

    Thanks for the shout out.

    Very nice work. I’ve added your RSS to my Google Reader and will follow your blog with interest.

    Regards,

    Daniel Ferry
    excelhero.com

  2. September 29, 2010 8:12 pm

    Hi!

    It seems that I can use this very nice approach to produce a material and value flow. Can you please to send me the workbook to have a try?

    Thanks in advance,

    Jenő

  3. Karen permalink
    November 12, 2010 7:29 pm

    I would love a copy of the workbook as well. Is this something you could send? Thank you!

  4. Ricardo permalink
    February 20, 2011 9:56 pm

    Thanks for this…I’d be grateful if you could send me your workbook, to try it out too. Hope this is possible

    rg

  5. Dan R permalink
    April 5, 2011 10:17 pm

    I love this way of representing crosstab data. Could I also have a copy of this work book?

  6. April 6, 2011 8:01 am

    Just,… wow!
    This is exact what I was looking for.

    I am doing my graduate investigation and writing an thesis about it.. and could not find a proper way to visualize increases and decreases.
    Could you Please send me the worksheet?

    You will be in the source reference!

    Greetings,

    Eric

  7. John permalink
    July 19, 2011 10:45 am

    Seems very interesting. I could use it for my combustion process diagrams!Let me know if I can have a copy of it. Thanks!

Leave a Reply

Fill in your details below or click an icon to log in:

Gravatar
WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.