Generating Sankey diagrams using Python

Visualising household expenditure with a Sankey diagram

Joshua

Sankey diagrams are a great data visualisation, named after Matthew Henry Phineas Riall Sankey, who used this type of diagram to communicate the efficiency of steam engine components. They can be useful for visualising the flow of users through a process or application, where the width of each arrow is proportional to the number of users who follow that route. If you’ve used Google Analytics, you’ll be familiar with this as the “Behaviour Flow” chart.

You can also use them for looking at hierarchical data. So for today’s blog, let’s have a look at a breakdown of average weekly household expenditure for the United Kingdom and visualise it as a Sankey diagram.

Getting the data

In the United Kingdom, the Office for National Statistics (ONS) makes a lot of interesting data available as part of their reporting. One such dataset covers the average weekly expenditure of families across the UK. You can download the data from the ONS website.

The data is fairly self-explanatory: you get to see expenditure broken down into various categories for different regions of the UK. Once I had downloaded the data, I manually flattened the hierarchy into three columns and saved the resulting spreadsheet. For today, I’m only interested in the general UK figures, not the regional breakdowns. Below you can see me loading it up in pandas and the shape of the data:

%matplotlib inline

import pandas as pd

df = pd.read_excel('expenditure.xlsx')
df.head()
   Category                                SubCategory             Item                                          UnitedKingdom
0  Alcoholic drink, tobacco and narcotics  Alcoholic drinks        Spirits and liqueurs (brought home)           1.8
1  Alcoholic drink, tobacco and narcotics  Alcoholic drinks        Wines, fortified wines (brought home)         4.2
2  Alcoholic drink, tobacco and narcotics  Alcoholic drinks        Beer, lager, ciders and perry (brought home)  2.1
3  Alcoholic drink, tobacco and narcotics  Alcoholic drinks        Alcopops (brought home)                       0.0
4  Alcoholic drink, tobacco and narcotics  Tobacco and narcotics1  Cigarettes                                    2.8

Generating Sankey diagrams with Python

To generate our Sankey diagrams we’re going to use the IPython Sankey diagram widget (ipysankeywidget). This does the drawing for us, and it’s compatible with Jupyter Notebooks so we can explore the data interactively. I’m also using Seaborn to generate the colours, so we’ll install that too.

To install:

$ pip install ipysankeywidget
$ pip install seaborn
$ jupyter nbextension enable --py --sys-prefix ipysankeywidget

To generate the Sankey diagram, the IPython Sankey widget expects the data in the following form:


data = [
  {'source': 'a', 'target': 'b', 'value': 1, 'color': '#000000' },
  {'source': 'a', 'target': 'c', 'value': 1, 'color': '#000000' },
  {'source': 'a', 'target': 'd', 'value': 1, 'color': '#000000' },
  {'source': 'a', 'target': 'e', 'value': 1, 'color': '#000000' },
  # ...
]
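
As a minimal sketch of how flat rows map onto that link format, the snippet below builds the same two-level structure using only the standard library. The sample rows, the `make_links` and `hex_colours` helpers, and the colour scheme are all hypothetical and purely illustrative; they are not part of ipysankeywidget.

```python
import colorsys
from collections import defaultdict

# Hypothetical sample rows: (category, item, weekly spend in pounds).
ROWS = [
    ("Food", "Bread", 2.0),
    ("Food", "Milk", 1.5),
    ("Food", "Bread", 0.5),
    ("Transport", "Fuel", 20.0),
]


def hex_colours(n):
    """Return n evenly spaced hues as '#rrggbb' strings."""
    colours = []
    for i in range(n):
        r, g, b = colorsys.hls_to_rgb(i / max(n, 1), 0.5, 0.6)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colours


def make_links(rows, root="Expenditure"):
    """Build root -> category and category -> item links from flat rows."""
    item_totals = defaultdict(float)
    category_totals = defaultdict(float)
    for category, item, value in rows:
        item_totals[(category, item)] += value
        category_totals[category] += value

    # One colour per link: one per category plus one per (category, item) pair.
    colours = iter(hex_colours(len(category_totals) + len(item_totals)))
    links = []
    for category, total in category_totals.items():
        links.append(
            {"source": root, "target": category, "value": total, "color": next(colours)}
        )
        for (cat, item), value in item_totals.items():
            if cat == category:
                links.append(
                    {"source": category, "target": item, "value": value, "color": next(colours)}
                )
    return links


links = make_links(ROWS)
for link in links:
    print(link)
```

Each dictionary describes one flow: the root feeds each category with its total, and each category feeds its items with their sub-totals.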

The following code groups and loops through the dataframe to get the data into the correct shape. It uses pandas groupby to sum the expenditure at each level and pandas.DataFrame.iterrows to loop over the resulting rows.

import seaborn as sns

data = []

# Generate a colour palette to let us colour in each line
palette = sns.color_palette('cubehelix', len(df.index) + df.Category.nunique())
colours = palette.as_hex()

level1 = df[['Category', 'UnitedKingdom']].groupby('Category').agg('sum')

# Counter so we can iterate through the colours
c_count = 0

for i,r in level1.reset_index().iterrows():
    # Map from the top-level "Expenditure" node to each category
    data.append({'source': 'Expenditure', 'target': r['Category'], 'value': r['UnitedKingdom'], 'color': colours[c_count]})
    
    # Get the item sub-totals for the category
    items = df[df['Category'] == r['Category']][['Category', 'Item', 'UnitedKingdom']]
    item_totals = items.groupby(['Category', 'Item']).agg('sum').reset_index()
    for item_i, item_r in item_totals.iterrows():
        
        # Increment the colour counter
        c_count += 1

        # Record the Category --> Item breakdown
        data.append({'source': item_r['Category'], 'target': item_r['Item'], 'value': item_r['UnitedKingdom'], 'color': colours[c_count]})
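
Because the category totals above are summed from the same rows as the item values, the flows into and out of each category should balance by construction. If you later filter or hand-edit the links, a quick sanity check can be useful. The sketch below is one way to do it; `unbalanced_nodes` is a hypothetical helper and the example links are made up.

```python
import math
from collections import defaultdict


def unbalanced_nodes(links, tolerance=1e-9):
    """Return nodes whose total inflow differs from their total outflow.

    Nodes with only inflow (leaves) or only outflow (the root) are skipped.
    """
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for link in links:
        outflow[link["source"]] += link["value"]
        inflow[link["target"]] += link["value"]

    return sorted(
        node
        for node in inflow.keys() & outflow.keys()
        if not math.isclose(inflow[node], outflow[node], abs_tol=tolerance)
    )


# Hypothetical links in the same shape as `data` (colours omitted).
example = [
    {"source": "Expenditure", "target": "Food", "value": 4.0},
    {"source": "Food", "target": "Bread", "value": 2.5},
    {"source": "Food", "target": "Milk", "value": 1.5},
    {"source": "Expenditure", "target": "Transport", "value": 20.0},
    {"source": "Transport", "target": "Fuel", "value": 19.0},
]

print(unbalanced_nodes(example))  # ['Transport']
```

Here "Transport" is flagged because 20.0 flows in but only 19.0 flows out.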

To draw the Sankey diagram:

from ipywidgets import Layout
from ipysankeywidget import SankeyWidget

# Set the layout parameters as we're going to have a big one...
layout = Layout(width="1600", height="3000")

# Generate the Sankey diagram
w = SankeyWidget(layout=layout, links=data, margins=dict(top=0, bottom=0, left=100, right=150))

# Display it in our Notebook
w

You can see how the width of each flow is proportional to the value of expenditure, giving us a nice way of seeing where people spend their money at both the category and the item level.
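
To put numbers on those proportions, you can compute each category's share of the total from the links themselves. This is a rough sketch with made-up values; `category_shares` is a hypothetical helper, not something the widget provides.

```python
def category_shares(links, root="Expenditure"):
    """Return each top-level target's share of the root's total outflow."""
    top_level = {l["target"]: l["value"] for l in links if l["source"] == root}
    total = sum(top_level.values())
    if total == 0:
        return {}
    return {name: value / total for name, value in top_level.items()}


# Hypothetical links; the item-level link is ignored by the calculation.
example = [
    {"source": "Expenditure", "target": "Food", "value": 5.0},
    {"source": "Expenditure", "target": "Transport", "value": 15.0},
    {"source": "Food", "target": "Bread", "value": 5.0},
]

for name, share in category_shares(example).items():
    print(f"{name}: {share:.1%}")
# Food: 25.0%
# Transport: 75.0%
```

A category with three times the spend gets a flow three times as wide, which is exactly what the shares express.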

If you want to save the chart for use outside of the Notebook, you can use one of the following methods, depending on the format you want:

w.save_svg('Sankey.svg')
w.save_png('Sankey.png')

So, there you have a nice easy way of visualising flows and hierarchies. If you’ve generated a nice Sankey diagram then let me know in the comments.


