Joshua

8 minute read

Python for Product Managers

This is part of an ongoing series of posts to introduce you to Python and explore how it can help make your life as a Product Manager esier. In Part 1 of this series, we looked at installing Python, setting up Jupyter Notebooks and a glimpse of how you can use Python in your data analytics workflow. In Part 2 we explored Pandas in a bit more detail and loaded and explored some data.

Now, let’s start manipulating and changing the data to see what we can learn. In this post we’ll look at grouping and aggregating data, summarising the data, joining data together and reshaping it for our purposes.

About the examples here

As with the previous posts we’re working with some hypothetical session data from a SaaS site. The basic data looks like this:

idend_datetimestart_datetimeuser_id
04962152018-01-01 01:38:392018-01-01 00:57:31e21fffeb-3006-42b6-929d-9cbef78cf616
14962162018-01-01 00:56:352018-01-01 00:17:448ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
24962172018-01-01 01:10:002018-01-01 00:51:34a4389be1-3f79-408b-996e-631705fe5cf9
34962182018-01-01 01:38:242018-01-01 00:45:05cd50dd63-491b-418a-8794-366636c3c4cd
44962192018-01-01 00:30:362018-01-01 00:21:410062817b-1486-4018-86fa-def7f5ac782f

Grouping data

At the moment, our dataframe has one row per session. We can group it to answer questions like “How many sessions for each user?” or “How many sessions per month?”

Pandas has a groupby method which does exactly this. We can specify which column(s) we’re grouping by, and the aggregation function to apply. Common aggregation functions are count, mean and median.

With grouping, we create a new dataframe with a different shape. Whilst our source dataframe has one row per session, grouping will create a new dataframe with one row per unique value for the specified columns. This new dataframe can be used like any other to explore and visualise the data.

# How many sessions for each user?
df.groupby(df.user_id).agg('count').sort_values('id').head()
idend_datetimestart_datetime
user_id
9131dcfb-da92-4c62-b890-ed2f871475ff424242
2d05c9c5-7727-4b9b-91f5-3b4ee4b579b8434343
d2db1d43-e85b-4209-bb45-ce55a46236ef444444
d5e63b7d-00ca-408f-b219-08030b7ce5fb444444
0e80b290-b502-4fce-87ad-49577d7be33e454545

Because our session start_datetime is parsed as a datetime object, we can use special functions to look at the elements of the date such as the day, dayofweek, month, year and so on. So to group by month we can do the following:

# How many sessions each month
df.groupby(df.start_datetime.dt.month).agg('count')
idend_datetimestart_datetimeuser_id
start_datetime
1149721149721149721149721
2113412113412113412113412
3127430127430127430127430
497859978599785997859
596677966779667796677
674060740607406074060
775250752507525075250
8114956114956114956114956
9106937106937106937106937
10132332132332132332132332
11138873138873138873138873
1256976569765697656976

Now, notice how in the above example we have the count of each column for each month of the year - there are twelve rows in the result. That’s probably not what we want. You can group by two data items, so let’s group by the year and the month to see how the number of sessions has changed over time. The groupby method accepts an array of columns for grouping, and they’re used in the order specified.

# How many sessions each year and month
df.groupby([df.start_datetime.dt.year, df.start_datetime.dt.month]).agg('count').tail(n=24)
idend_datetimestart_datetimeuser_id
start_datetimestart_datetime
2017342653426534265342653
431684316843168431684
533031330313303133031
625018250182501825018
724638246382463824638
839624396243962439624
935488354883548835488
1045254452544525445254
1147048470484704847048
1215090150901509015090
2018140415404154041540415
236083360833608336083
348878488784887848878
436715367153671536715
536821368213682136821
628517285172851728517
728030280302803028030
842663426634266342663
939054390543905439054
1050814508145081450814
1152263522635226352263
1216808168081680816808
2019147420474204742047420
221696216962169621696

So that’s counting things nailed. Next lets look at averaging data.

To start off, let’s calculated the length of each session. Again, because our start and end dates are datetime objects, Pandas makes life simple and gives us special treats:


# Calculate the length of each session
df['session_duration'] = df.end_datetime - df.start_datetime

You can find the average of any columns simply by using the mean method of the column. There are also methods for the median as well as the max and min values.

# Now look at some stats for the session duration
print(df['session_duration'].mean())
print(df['session_duration'].median())
print(df['session_duration'].min())
print(df['session_duration'].max())
0 days 00:34:02.443117
0 days 00:33:57
0 days 00:00:01
0 days 01:45:53

To see how the average length of a session has changed over time, we can take everything we’ve learned here and combine our averaging with our grouping. First though, we need to represent the session length numerically - so let’s convert it to a number of seconds using the numpy library.

import numpy as np

# Calculate the session duration in seconds
df['session_duration_seconds'] = df['session_duration']/np.timedelta64(1,'s')

# Average session duration (in seconds) by year and month
df[['session_duration_seconds']].groupby([df.start_datetime.dt.year, df.start_datetime.dt.month]).agg('mean').tail(n=12)
session_duration_seconds
start_datetimestart_datetime
201832045.078011
42041.533134
52044.280682
62039.592489
72045.694684
82039.892764
92046.747913
102038.984611
112042.482961
122044.036292
201912039.241459
22037.908555

If we wanted to work in minutes, we could, it’s just less accurate of course. But we’d only have to change one character!

# Calculate the session duration in minutes
df['session_duration_seconds'] = df['session_duration']/np.timedelta64(1,'m')

Simple!

Describe the data

In the example above, we use some methods of the dataframe to look at the mean, median, max and min of the session duration column. Pandas provides a useful utility to quickly view stats like this for a dataframe called describe.

df['session_duration'].describe()
count                   1284483
mean     0 days 00:34:02.443117
std      0 days 00:14:49.732860
min             0 days 00:00:01
25%             0 days 00:23:51
50%             0 days 00:33:57
75%             0 days 00:44:06
max             0 days 01:45:53
Name: session_duration, dtype: object

It’s super useful for quickly getting a handle on the data you’re working with.

Joining Pandas Dataframes together

Sometimes you won’t be able to answer all your questions with one dataset alone. You’ll need to combine and join data together to provide more insight.

Building on our sessions data, let’s bring in some more information about the users. In our session data we have a user_id. We also have some data about each user:

  • Their unique Id
  • Role (user / administrator)
  • Department

As with our previous data, let’s load it in and take a look at what we have:

# Load the user data from a tab delimited text file
user_df = pd.read_csv('users.csv', sep='\t')
user_df.head()
user_iddepartmentrole
058fa53a8-4524-4ccf-89b0-8c88f9787b32salesuser
17599b109-afbd-4cc4-a2e6-d799d6814920salesuser
2da74df23-08e6-48f3-96dc-ee4ad93200f6generaluser
305c2a22c-d36f-4460-a85b-3c4356e3c723marketinguser
40a34bf0d-b2ef-4d53-b3b5-8acbfba9194ageneraluser

Now, to join the data we need a common column - a bit of data that we can use to match the sessions together. What we would like is to have the user info from the users_df joined on to the sessions data in our main dataframe. We can do this with the pandas join method. This joins one dataframe to another, creating a third.

# Join the sessions data to the user data
full_df = df.join(user_df, rsuffix='_user')
full_df.drop('user_id_user', axis=1, inplace=True)
full_df.set_index('id', inplace=True)
full_df.head()
end_datetimestart_datetimeuser_idsession_durationsession_duration_secondsdepartmentrole
id
02015-11-30 03:39:452015-11-30 03:39:1758fa53a8-4524-4ccf-89b0-8c88f9787b3200:00:2828.0salesuser
12015-11-30 04:44:082015-11-30 03:53:447599b109-afbd-4cc4-a2e6-d799d681492000:50:243024.0salesuser
22015-11-30 04:00:542015-11-30 03:32:19da74df23-08e6-48f3-96dc-ee4ad93200f600:28:351715.0generaluser
32015-11-30 04:20:192015-11-30 03:51:0405c2a22c-d36f-4460-a85b-3c4356e3c72300:29:151755.0marketinguser
42015-11-30 04:06:032015-11-30 03:38:580a34bf0d-b2ef-4d53-b3b5-8acbfba9194a00:27:051625.0generaluser

Because the two dataframes shared a column with the same name (user_id) the rsuffix parameter is used to ensure each column in the resulting dataframe has a unique name - there’s a lsuffix too if you want. Because we don’t need two columns with the same value, we can drop the redundant copy. I also tell Pandas to use the id column as the index for the dataframe - it’s the unique Id for a session and so is unique within the data.

If you’re used to working in SQL, this will hopefully all be familiar. The join method accepts a ‘how’ parameter to specify the type of join you want (inner, outer etc.)

So now we have joined our data, we can ask more interesting questions like “which department has had the most sessions?” or does the users role influence the average length of a session?

I’ll leave these with you to mull before the next installment where we’ll look at answering common Product Management questions with Python and Pandas.

If you’ve got any comments, suggestions or requests then please let me know in the comments.

Read the rest of the series

Follow the full series of posts to master Python!

  • Part 1 : Installing and setting up Python, Pandas and Jupyter
  • Part 2 : Loading and viewing data
  • Part 3 : Grouping, aggregating and summarising data

To be notified of new posts subscribe to the mailing list.



comments powered by Disqus