Today I want to kick off a series of posts about slicing and dicing numerical data with Python. In particular, we’re going to do this with the pandas library (stylized lowercase). Rather than showing off all of pandas’ fanciest features, our goal will simply be to build intuition for the core abstractions that pandas gives us.
We’ll do this organically, in layers, by sticking our mitts in some concrete data. For this series, I’ve prepped a dataset stored in a comma-separated value file. This dataset contains demographic and contact information about each current legislator in the United States Congress. It’s distilled from data provided by the Sunlight Foundation, courtesy their wonderful Congress API.
The CSV is available as a gist here.
Let’s start by setting up our workbench!
Up and running
We’ll be using Python 3.4, which you may need to install on your system. On OS X you can use a tool like Homebrew, and on Linux it should be available via your distro’s package manager.
The right way to use Python is in a virtualenv. You can read more about them
(and how to get the
here. We’ll begin by
creating and activating a fresh virtualenv, which properly sets up your shell
Next we’ll install
pandas and the
ipython REPL replacement:
Last but not least, grab a copy of the dataset:
Now we can fire up the
ipython REPL and load pandas:
We’re ready to go!
Framing the problem
The CSV file is called
legislators.csv. This file has a row of column
headers followed by rows of data. If we like, we can think of it as a SQL
table (and we’ll extend this analogy in a bit!). For data like this, pandas
offers us a class called a
We can read the file into a
DataFrame like so:
Now we have a
DataFrame instance! Fun fact: instead of curl-ing the CSV and
loading it from the local file, we could’ve passed
read_csv the remote file
URL as a string. But, for our purposes, it makes more sense to keep a local
copy for future practice.
We’ll ignore the details of the
index_col keyword argument for now, but by
setting it to
False, we are telling pandas “read this data and create an
auto-incrementing surrogate key for each row”, analogous to the
id columns we
frequently find in SQL database schemas.
Let’s learn about the
DataFrame class by answering some questions about the
Question 1: How many legislators are there in congress, anyway?
Each (non-header) row of our CSV file describes a distinct legislator. If we
assume no row in our frame is degenerate (all
NaN), we can count the number
of legislators by calling the generic Python
len function on the
Note: Our total count of 540 legislators comes from the sum of 435 House representatives (less 1 vacant seat), 100 senators, and 6 non-voting representatives.
DataFrame is like a 2-dimensional
ndarray from the
NumPy package. In fact, it used to be an honest
ndarray, though that has changed. Regardless, they still
respond to many of the same methods.
For example, like an
DataFrame has a shape:
DataFrame will always be two-dimensional, and its shape will be (#rows,
#columns). In particular, a
DataFrame has two arrays of indices which
define these rows and columns.
We have indices for the rows:
And indices for the columns:
And we can observe that the
shape is derived from the indexing arrays:
Notice that calling
len on the
DataFrame gives us the number of rows. This
corresponds to the first (or 0th) axis of our
DataFrame. Thus we have:
One caveat about counting rows this way is that there may be cases in which we
have missing data that we want to ignore. pandas represents missing data with
the IEEE 754 floating point
NaN value. The core pandas objects respond to a
count method which accounts for this. For each column index,
the number of non-
We’ll come back to this later, but for now, I promise that all of our rows have
at least one non-
NaN column. Since we know a bit about the silhouette of a
DataFrame, let’s crack it open!
Question 2: What is the binary gender distribution in congress?
Now we want to ask some more interesting questions, so it’s time to learn about
Series class and selecting data from a
Recall that one of our column indices is
gender. We can index by a single
The pandas object we get back from a single-column index is a
A series is basically a labelled one-dimensional array. That is, there is an array of data:
And, in parallel, there is an array of indices which label the data:
Thus, we can think of a
DataFrame as a
Series, all with
compatible, properly-aligned index arrays.
Now back to our question! First, let’s show what we don’t want to do:
Notice that the above solution is extremely imperative. It is also not very compositional, as we’ve done crude manual counting, and we don’t have an obvious interesting data structure that we could pass on to another function in a chain.
Series objects are iterable, we could be a little more explicit about
what we’re trying to do and use the
Counter class from the
package of the standard library:
Better, but we’re still not thinking with arrays. This is where pandas benefits
from Python’s generic operator overloading. Observe how the
== operator works
What is the type of the resulting object?
What we’ve essentially done is apply the predicate
lambda x: x == 'F' to each
coordinate of the
Series. In NumPy terms, we’ve applied the predicate as a
ufunc, or a function whose action on an array is given by its action on each
In this context, the resulting
Series may seem of dubious value, as it is
just a sequence of
False scalars. But in pandas (as in NumPy), we
can use such masking arrays to select the
DataFrame rows whose indices map
And from this we can get the counts of male and female legislators:
This gets the job done, and is “vector-y”, but that doesn’t make it good. For
one, we’ve assumed that values in the
gender column only take on the values
"F". We could verify that this assumption is true for this dataset,
but we’d still have to collect the results, which would entail looping over our
expected variable values. Let’s ask another question and do better!
Question 3 : What is the distribution of party membership?
When answering the last question, what we were really trying to do was group
the rows by the value of
gender and then aggregate them by counting. Using
pandas, we can do this explicitly using the
groupby method of
What kind of object does
groupby gives us?
GroupBy object is kind of like an enriched Python
dict that we can easily
turn into a
DataFrame. This is most clear when we inspect its
We can see that the
GroupBy has a
dict that maps values in the
column to row indices. In common statistical parlance, we might call
'party' a categorical variable with three levels:
We can grab one of the individual groups denoted by those row indices, as a
DataFrame, using the
get_group method of the
For example, to obtain a
DataFrame of all Democrats, we could do any of the following:
Which is extensionally the same as:
But one nicety of the
GroupBy object is that we can easily aggregate within
the groups in various ways. Thus, we can count the number of rows (and thus
number of legislators) in each group by
len funtion to the groups:
Similarly, this would’ve been a better way to summarize the gender distribution amongst legislators:
What’s nice about this method is that the result composes better with other pandas
data structures. In particular, we can again utilize operator overloading to divide
each entry of the resulting
Series by the total number of legislators:
From the above, we can read off some simple facts about how gender and party membership are distributed in Congress.
We’ve covered a lot of ground already. Next time we’ll look at how to answer complex questions involving many variables at once. Along the way, we’ll learn more powerful methods of rearranging and aggregating data. Stay tuned!
Main image from Flickr user internetarchivebookimages.