Joshua

Python for Product Managers

In Part 1 of this series, we looked at installing Python, setting up Jupyter Notebooks and getting a glimpse of how you can use Python in your data analytics workflow.

As we saw last time, Pandas is a really powerful module for data analysis in Python. In this installment, we’ll look at the tools available to us to load and explore some data.

Importing pandas

Importing the pandas module is the first step, and luckily it's very easy. A common convention is to import the pandas module and give it the alias “pd” - it saves a few keystrokes every time you use it.

To import pandas you simply need to include the following line at the top of your script or notebook:

import pandas as pd

If you’re using Jupyter notebooks and want visualisations to appear in line then the following snippet will import pandas, enable inline visualisation and set some default styles:

# Show graphs and charts inline in the notebook
%matplotlib inline

# Import some libraries
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Set some default plot styles to make things look nice
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
plt.style.use('bmh')

I like the bmh style, but there are many others available, so take a look and choose one that you like.
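You can list the styles installed with your copy of matplotlib to see what's on offer:

# Show the plot styles available in this matplotlib installation
print(plt.style.available)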

Understanding Pandas dataframes

The main data structure in Pandas is the dataframe. You can think of it a bit like an Excel spreadsheet or a database table: a collection of data organised into rows and columns.

   id         end_datetime       start_datetime  user_id
0   0  2015-11-30 01:49:04  2015-11-30 01:12:18  e3472cfd-e249-43f5-badf-8130c3e93f51
1   1  2015-11-30 01:53:50  2015-11-30 01:11:14  69ddcdb5-4635-43c4-94a1-a10e701082cf
2   2  2015-11-30 02:05:56  2015-11-30 01:15:58  21dc4ed3-b8b9-4913-8b3b-ed843f78e8af
3   3  2015-11-30 04:30:17  2015-11-30 04:19:02  2cda4b97-71d3-4f32-8b89-f39a7a41a75f
4   4  2015-11-30 06:02:42  2015-11-30 05:50:51  f39984e0-4c08-4e29-aa41-d6dddfdbd7f5

Each row has an index, by default an integer, but you can set other columns of unique data to be the index.
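For example, since the id column here is unique, you could use it as the index - a small sketch (note that set_index returns a new dataframe rather than changing the original in place):

# Use the unique id column as the index instead of the default integers
df = df.set_index('id')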

Loading data

From text files

Loading a CSV file is easy with the read_csv function:

df = pd.read_csv('my_csv_file.csv')

There are a few useful parameters you can set to help with different situations:

  • sep - define the column separator used in the file (e.g. \t for tab, or |) - the default is a comma (‘,’)
  • parse_dates - specify a column or a list of columns that contain dates, and pandas will try to ensure those columns end up with a datetime datatype - this makes it easier to do date-based grouping (such as grouping by year or month)
  • header=None - use this if the data file has no header row. The names parameter is useful alongside this to set the name of each column, for example names=["col1", "col2", "col3"]

And so, a more complex example of loading a tab delimited file containing data like the example above would look like:

df = pd.read_csv('my_csv_file.txt', sep='\t', header=None, names=['id', 'end_datetime', 'start_datetime', 'user_id'], parse_dates=['end_datetime', 'start_datetime'])
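A quick way to confirm that the dates really were parsed is to check the column datatypes:

# The parsed columns should show up as datetime64
df.dtypes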

From a database

Reading data from a database is also relatively easy using the read_sql function. Obviously you'll need to be able to access the database and to write SQL to define the data that you want out. Here's an example using the mysql.connector module to read the sessions data directly from the database.

import mysql.connector

# Create a connection to the database
conn = mysql.connector.connect(host='hostname',
                               database='db_name',
                               user='user_name',
                               password='xxxxxx')

# Define the SQL query that will extract the data
sql = "SELECT * FROM sessions"

# Create a dataframe containing the results of the query
df = pd.read_sql(sql, conn, parse_dates=['start_datetime', 'end_datetime'])
df.head()
   id         end_datetime       start_datetime  user_id
0   0  2016-11-30 02:00:37  2016-11-30 01:06:43  ee945829-9e34-4c7a-a9b4-2685ea5e2714
1   1  2016-11-30 01:47:48  2016-11-30 01:32:50  8d6cc8b8-b42a-4223-9557-6ff4a4a3871e
2   2  2016-11-30 02:21:45  2016-11-30 01:41:27  954979ec-cd64-4b22-b0c1-e5ad7ed87aab
3   3  2016-11-30 04:22:22  2016-11-30 03:24:40  fb252299-63cc-47ea-9d10-1b5db1e55166
4   4  2016-11-30 04:30:19  2016-11-30 03:47:54  dcd3bdfd-cdef-47f7-9635-fa4e24ca4531

I just used MySQL as an example there; assuming you have the right database modules installed and imported, you can just as easily get data from SQLite, MS SQL Server or PostgreSQL using the same syntax.
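For example, here's a minimal sketch of the same pattern against a local SQLite file (assuming a file called my_database.db containing the same sessions table):

import sqlite3

# Connect to a local SQLite database file
conn = sqlite3.connect('my_database.db')

# Exactly the same read_sql call works with the new connection
df = pd.read_sql("SELECT * FROM sessions", conn, parse_dates=['start_datetime', 'end_datetime'])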

Obviously, the advantages of reading directly from the database are twofold:

  1. If you know SQL, you might be able to do some filtering and data manipulation before you even load the data into Python
  2. Assuming your database gets updated frequently, it's easy to refresh and repeat your analysis to get up-to-date insights into your data

Whilst point 1 above is true, one of the advantages of Python is that a lot of the data manipulation happens in memory on your local machine, so it may be faster than running the same operations on a large or slow database.

Queries with parameters

Depending on what you’re doing, you might want to pass in a parameter to the query. There’s a nice way to do this without messing about with the SQL as a string.

# Define the sql with parameter placeholders
sql = "SELECT * FROM sessions WHERE YEAR(start_datetime) = %s"

# Extract the data with the parameters supplied
df = pd.read_sql(sql, conn, parse_dates=['start_datetime', 'end_datetime'], params=[2018])
df.head()
       id         end_datetime       start_datetime  user_id
0  496215  2018-01-01 01:38:39  2018-01-01 00:57:31  e21fffeb-3006-42b6-929d-9cbef78cf616
1  496216  2018-01-01 00:56:35  2018-01-01 00:17:44  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
2  496217  2018-01-01 01:10:00  2018-01-01 00:51:34  a4389be1-3f79-408b-996e-631705fe5cf9
3  496218  2018-01-01 01:38:24  2018-01-01 00:45:05  cd50dd63-491b-418a-8794-366636c3c4cd
4  496219  2018-01-01 00:30:36  2018-01-01 00:21:41  0062817b-1486-4018-86fa-def7f5ac782f

And of course, you can have several parameters in the query - just pass more in via the params list.
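For example, to filter on both year and month (a sketch using the same %s placeholder style - note that the placeholder syntax depends on your driver, so mysql.connector uses %s while sqlite3 uses ?):

# Two placeholders, so two parameters - the order matters
sql = "SELECT * FROM sessions WHERE YEAR(start_datetime) = %s AND MONTH(start_datetime) = %s"

df = pd.read_sql(sql, conn, parse_dates=['start_datetime', 'end_datetime'], params=[2018, 6])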

From existing Python variables

Finally, you might have some existing Python data structures kicking around that you'd like as a dataframe. Pandas provides an easy way to load a list of dictionaries - one per row - into a dataframe via from_dict:

my_data = [
    {'id': 0, 'name':'Bob', 'age':64},
    {'id': 1, 'name':'Sally', 'age':61},
    {'id': 2, 'name':'June', 'age':32},
    {'id': 3, 'name':'Nish', 'age':41},
    {'id': 4, 'name':'Lexi', 'age':25}
]

df = pd.DataFrame.from_dict(my_data)
df.head()
   age  id   name
0   64   0    Bob
1   61   1  Sally
2   32   2   June
3   41   3   Nish
4   25   4   Lexi
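from_dict is also happy with a dictionary of columns, where each key becomes a column name and each list holds that column's values:

# A dictionary of columns rather than a list of rows
my_columns = {'id': [0, 1, 2], 'name': ['Bob', 'Sally', 'June'], 'age': [64, 61, 32]}

df = pd.DataFrame.from_dict(my_columns)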

Heads or tails?

You might have noticed the use of df.head() above to display a summary of the data in the dataframe. As you've probably guessed, df.head() shows the first 5 rows of the dataframe. If you want to see the last 5 rows, there's df.tail(). You can also set how many rows you want to see: df.head(n=10)
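In other words:

df.tail()         # the last 5 rows
df.head(n=10)     # the first 10 rows
df.tail(n=3)      # the last 3 rows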

Sorting dataframes

Of course, first 5 and last 5 is pretty meaningless unless that data is in some kind of order!

df.sort_values('start_datetime').head(n=3)
       id         end_datetime       start_datetime  user_id
5  496220  2018-01-01 00:55:40  2018-01-01 00:04:23  3fa82aab-a93b-4d83-9b85-559fc3dbe0f1
1  496216  2018-01-01 00:56:35  2018-01-01 00:17:44  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
4  496219  2018-01-01 00:30:36  2018-01-01 00:21:41  0062817b-1486-4018-86fa-def7f5ac782f

By default, sort_values sorts in ascending order. To flip that, we can do the following:

df.sort_values('start_datetime', ascending=False).head(n=3)
             id         end_datetime       start_datetime  user_id
506519  1002734  2018-11-29 23:18:05  2018-11-29 22:56:23  8cdfb109-1e83-4191-9f75-4d1b4f88428d
506546  1002761  2018-11-29 23:05:22  2018-11-29 22:56:04  8cc9255d-35f1-4d7f-b363-189f19d383d2
506538  1002753  2018-11-29 23:50:39  2018-11-29 22:55:56  0062817b-1486-4018-86fa-def7f5ac782f

And finally, you can sort on several columns by passing in a list:

df.sort_values(['start_datetime', 'user_id']).head(n=3)
       id         end_datetime       start_datetime  user_id
5  496220  2018-01-01 00:55:40  2018-01-01 00:04:23  3fa82aab-a93b-4d83-9b85-559fc3dbe0f1
1  496216  2018-01-01 00:56:35  2018-01-01 00:17:44  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
4  496219  2018-01-01 00:30:36  2018-01-01 00:21:41  0062817b-1486-4018-86fa-def7f5ac782f
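You can also mix directions by passing a matching list to ascending - for example, users in order but with each user's most recent session first:

# user_id ascending, start_datetime descending within each user
df.sort_values(['user_id', 'start_datetime'], ascending=[True, False]).head(n=3)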

Filtering data

Another common operation might be to filter data based on a column value:

df[df['user_id']=='8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be'].head()
         id         end_datetime       start_datetime  user_id
1    496216  2018-01-01 00:56:35  2018-01-01 00:17:44  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
125  496340  2018-01-01 09:50:43  2018-01-01 08:52:35  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
324  496539  2018-01-01 11:00:34  2018-01-01 10:14:01  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
600  496815  2018-01-01 12:20:26  2018-01-01 12:00:42  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be
833  497048  2018-01-01 15:37:02  2018-01-01 15:16:32  8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be

Remember how, when we loaded the data, we made extra sure that the dates were parsed as datetimes? Well, that means we can find all the sessions that started in October of 2017 or any later year as follows:

df[(df['start_datetime'].dt.month==10) & (df['start_datetime'].dt.year >= 2017)].head()
            id         end_datetime       start_datetime  user_id
366535  366535  2017-10-01 01:05:05  2017-10-01 00:44:44  0aea0470-e4a7-42b0-9839-7248a944933a
366536  366536  2017-10-01 01:24:23  2017-10-01 00:24:33  3fa82aab-a93b-4d83-9b85-559fc3dbe0f1
366537  366537  2017-10-01 00:36:31  2017-10-01 00:13:30  2b183d63-3299-46f8-bbfa-46e0e7a0e674
366538  366538  2017-10-01 01:15:37  2017-10-01 00:31:04  66a13e1a-f5fc-4e64-9421-7fd641a616aa
366539  366539  2017-10-01 01:07:22  2017-10-01 00:54:49  673ec73d-eb3a-4fce-9cdd-bc5bcd3bcfbb
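When you want to match a column against several values at once, isin is handy - for example, sessions belonging to a couple of specific users:

# Keep only the sessions for a list of user IDs
users = ['8ec6e7e7-a5d7-4a48-bb8a-d4d899e1f5be', '0062817b-1486-4018-86fa-def7f5ac782f']

df[df['user_id'].isin(users)].head()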

Summary

So, there you go, some powerful support for loading and exploring data. Next time we’ll look at grouping, transforming and joining data!

If you’ve got any comments, suggestions or requests then please let me know in the comments.

Read the rest of the series

Follow the full series of posts to master Python!

  • Part 1 : Installing and setting up Python, Pandas and Jupyter
  • Part 2 : Loading and viewing data
  • Part 3 : Grouping, aggregating and summarising data

To be notified of new posts subscribe to the mailing list.


