# Striding through data with Python and pandas

## Sep 11, 2015 • Joe Ranweiler

Today I want to kick off a series of posts about slicing and dicing numerical
data with Python. In particular, we’re going to do this with the
*pandas* library (stylized
lowercase). Rather than showing off all of pandas’ fanciest features, our goal
will simply be to build intuition for the core abstractions that pandas gives
us.

We’ll do this organically, in layers, by sticking our mitts in some concrete data. For this series, I’ve prepped a dataset stored in a comma-separated value file. This dataset contains demographic and contact information about each current legislator in the United States Congress. It’s distilled from data provided by the Sunlight Foundation, courtesy of their wonderful Congress API.

The CSV is available as a gist here.

Let’s start by setting up our workbench!

## Up and running

We’ll be using Python 3.4, which you may need to install on your system. On OS X you can use a tool like Homebrew, and on Linux it should be available via your distro’s package manager.

The right way to use Python is in a *virtualenv*. You can read more about them
(and how to get the `virtualenv` binary) here. We’ll begin by creating and
*activating* a fresh virtualenv, which properly sets up your shell `PATH`.
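A minimal sketch of that setup, assuming `python3` is on your `PATH` (I use the stdlib `venv` module here, which serves the same purpose as the `virtualenv` binary; the directory name `venv` is arbitrary):

```shell
# Create a fresh virtual environment in ./venv...
python3 -m venv venv

# ...and activate it, which puts venv/bin at the front of PATH.
. venv/bin/activate
```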

Next we’ll install `numpy`, `pandas`, and the `ipython` REPL replacement:
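Inside the activated virtualenv, that’s a single `pip` invocation:

```shell
pip install numpy pandas ipython
```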

Last but not least, grab a copy of the dataset:

Now we can fire up the `ipython` REPL and load pandas:
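At the shell, running `ipython` starts the REPL; from there, the conventional alias for pandas is `pd`:

```python
# Inside the ipython REPL (started with: $ ipython)
import pandas as pd

print(pd.__version__)
```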

We’re ready to go!

## Framing the problem

The CSV file is called `legislators.csv`. This file has a row of *column
headers* followed by rows of data. If we like, we can think of it as a SQL
table (and we’ll extend this analogy in a bit!). For data like this, pandas
offers us a class called a `DataFrame`.

We can read the file into a `DataFrame` like so:
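Since the gist itself isn’t reproduced here, the snippet below substitutes a tiny inline stand-in via `io.StringIO` (the `first_name` column is my guess at the schema); with the real file on disk you’d simply pass the string `'legislators.csv'`:

```python
import io

import pandas as pd

# Stand-in for the real file; with it on disk, this would be
# df = pd.read_csv('legislators.csv', index_col=False)
csv = io.StringIO("first_name,gender,party\n"
                  "Tammy,F,D\n"
                  "John,M,R\n")
df = pd.read_csv(csv, index_col=False)
```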

Now we have a `DataFrame` instance! Fun fact: instead of curl-ing the CSV and
loading it from the local file, we could’ve passed `read_csv` the remote file
URL as a string. But, for our purposes, it makes more sense to keep a local
copy for future practice.

We’ll ignore the details of the `index_col` keyword argument for now, but by
setting it to `False`, we are telling pandas “read this data and create an
auto-incrementing surrogate key for each row”, analogous to the `id` columns we
frequently find in SQL database schemas.

Let’s learn about the `DataFrame` class by answering some questions about the
dataset.

## Question 1: *How many legislators are there in Congress, anyway?*

Each (non-header) row of our CSV file describes a distinct legislator. If we
assume no row in our frame is degenerate (all `NaN`), we can count the number
of legislators by calling the generic Python `len` function on the `DataFrame`:
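With a hypothetical four-row stand-in for the frame (the real call yields 540, per the note below):

```python
import pandas as pd

# Four-row stand-in for legislators.csv, with two of its columns.
df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(len(df))  # → 4 on the stand-in; 540 for the full dataset
```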

Note: Our total count of 540 legislators comes from the sum of 435 House representatives (less 1 vacant seat), 100 senators, and 6 non-voting representatives.

A `DataFrame` is like a 2-dimensional `ndarray` from the
NumPy package. In fact, it used to be an honest
*subclass* of `ndarray`, though that has changed. Regardless, they still
respond to many of the same methods.

For example, like an `ndarray`, our `DataFrame` has a *shape*:
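Using the same hypothetical four-row, two-column stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(df.shape)  # → (4, 2)
```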

A `DataFrame` will always be two-dimensional, and its shape will be (#rows,
#columns). In particular, a `DataFrame` has two arrays of *indices* which
define these rows and columns.

We have indices for the *rows*:
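On the stand-in frame, the row index is the auto-generated surrogate key:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(df.index)  # the auto-generated row labels 0..3
```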

And indices for the columns:
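Again on the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(df.columns)  # → Index(['gender', 'party'], dtype='object')
```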

And we can observe that the `shape` is derived from the indexing arrays:
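That is, on the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(df.shape == (len(df.index), len(df.columns)))  # → True
```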

Notice that calling `len` on the `DataFrame` gives us the number of *rows*. This
corresponds to the first (or *0th*) *axis* of our `DataFrame`. Thus we have:
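On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(len(df) == df.shape[0])  # → True
```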

One caveat about counting rows this way is that there may be cases in which we
have missing data that we want to *ignore*. pandas represents missing data with
the IEEE 754 floating point `NaN` value. The core pandas objects respond to a
`count` method which accounts for this. For each column index, `count` gives
the number of non-`NaN` entries.
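For instance, on a stand-in frame where one (hypothetical) phone number is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M'],
                   'phone':  ['555-0100', np.nan, '555-0102']})

print(df.count())  # gender → 3, phone → 2
```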

We’ll come back to this later, but for now, I promise that all of our rows have
at least one *non*-`NaN` column. Since we know a bit about the silhouette of a
`DataFrame`, let’s crack it open!

## Question 2: *What is the binary gender distribution in Congress?*

Now we want to ask some more interesting questions, so it’s time to learn about
the `Series` class and *selecting* data from a `DataFrame`.

Recall that one of our column indices is `gender`. We can *index* by a single
column label:
The pandas object we get back from a single-column index is a `Series`:
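We can check this directly:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(type(df['gender']))  # → <class 'pandas.core.series.Series'>
```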

A series is basically a *labelled one-dimensional array*. That is, there
is an array of *data*:
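On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

s = df['gender']
print(s.values)  # a NumPy array: ['F' 'M' 'M' 'F']
```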

And, in parallel, there is an array of *indices* which label the data:
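The `Series` shares its row labels with the parent `DataFrame`:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

s = df['gender']
print(s.index)  # the labels 0..3, same as df.index
```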

Thus, we can think of a `DataFrame` as a `Series` of `Series`, all with
compatible, properly-aligned index arrays.

Now back to our question! First, let’s show what we *don’t* want to do:

## Bad
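A sketch of the imperative approach on the stand-in frame (manual tallies, one branch per expected value):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

# Walk the column and keep manual tallies.
n_female = 0
n_male = 0
for gender in df['gender']:
    if gender == 'F':
        n_female += 1
    elif gender == 'M':
        n_male += 1

print(n_female, n_male)
```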

Notice that the above solution is extremely imperative. It is also not very
*compositional*, as we’ve done crude manual counting, and we don’t have an
obvious interesting data structure that we could pass on to another function in
a chain.

Since `Series` objects are iterable, we could be a little more explicit about
what we’re trying to do and use the `Counter` class from the `collections`
package of the standard library:
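On the stand-in frame:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

# Counter tallies each distinct value in one pass.
print(Counter(df['gender']))  # F → 2, M → 2
```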

Better, but we’re still not thinking with arrays. This is where pandas benefits
from Python’s generic operator overloading. Observe how the `==` operator works
with a `Series`:
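On the stand-in frame (note the elementwise result):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

s = df['gender'] == 'F'
print(s)  # True, False, False, True — labelled 0..3
```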

What is the type of the resulting object?
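Another `Series`, as it happens:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

print(type(df['gender'] == 'F'))  # → <class 'pandas.core.series.Series'>
```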

What we’ve essentially done is apply the predicate `lambda x: x == 'F'` to each
coordinate of the `Series`. In NumPy terms, we’ve applied the predicate as a
`ufunc`, or a function whose action on an array is given by its action on each
coordinate.

In this context, the resulting `Series` may seem of dubious value, as it is
just a sequence of `True` and `False` scalars. But in pandas (as in NumPy), we
can use such *masking* arrays to select the `DataFrame` rows whose indices map
to `True` in `s`.

Concretely:
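On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

s = df['gender'] == 'F'
print(df[s])  # just rows 0 and 3, where the mask holds
```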

And from this we can get the counts of male and female legislators:
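One masked selection per value, counted with `len`:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

n_female = len(df[df['gender'] == 'F'])
n_male = len(df[df['gender'] == 'M'])
print(n_female, n_male)  # → 2 2 on the stand-in
```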

This gets the job done, and is “vector-y”, but that doesn’t make it *good*. For
one, we’ve assumed that values in the `gender` column only take on the values
`"M"` and `"F"`. We could verify that this assumption is true for this dataset,
but we’d still have to collect the results, which would entail looping over our
expected variable values. Let’s ask another question and do better!

## Question 3: *What is the distribution of party membership?*

When answering the last question, what we were really trying to do was *group*
the rows by the value of `gender` and then *aggregate* them by counting. Using
pandas, we can do this explicitly using the `groupby` method of `DataFrame`.

What kind of object does `groupby` give us?
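On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

by_party = df.groupby('party')
print(type(by_party))  # a DataFrameGroupBy object
```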

A (`DataFrame`)`GroupBy` object is kind of like an enriched Python `dict`
that we can easily turn into a `DataFrame`. This is most clear when we inspect
its `groups` property:
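On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

# Maps each party value to the row labels in that group,
# e.g. 'D' → [0, 3] on the stand-in.
print(df.groupby('party').groups)
```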

We can see that the `GroupBy` has a `dict` that maps *values* in the `'party'`
column to *row indices*. In common statistical parlance, we might call
`'party'` a categorical variable with three levels: `'D'`, `'I'`, and `'R'`.

We can grab one of the individual groups denoted by those row indices, as a
`DataFrame`, using the `get_group` method of the `GroupBy` object.

For example, to obtain a `DataFrame` of all Democrats, we could do any of the
following:
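Two equivalent spellings, on the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

by_party = df.groupby('party')
democrats = by_party.get_group('D')
# or, without naming the intermediate object:
democrats = df.groupby('party').get_group('D')
print(democrats)
```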

Which is extensionally the same as:
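That is, masking on the party column yields the same frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

same = df[df['party'] == 'D'].equals(df.groupby('party').get_group('D'))
print(same)  # → True
```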

But one nicety of the `GroupBy` object is that we can easily aggregate *within*
the groups in various ways. Thus, we can count the number of rows (and thus
number of *legislators*) in each group by `apply`-ing the `len` function to the
groups:
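On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

counts = df.groupby('party').apply(len)
print(counts)  # D → 2, I → 1, R → 1 on the stand-in
```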

Similarly, this would’ve been a better way to summarize the gender distribution amongst legislators:
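The same grouping idiom, applied to `gender` on the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

by_gender = df.groupby('gender').apply(len)
print(by_gender)  # F → 2, M → 2 on the stand-in
```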

What’s nice about this method is that the result composes better with other
pandas data structures. In particular, we can again utilize operator
overloading to divide each entry of the resulting `Series` by the total number
of legislators:
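On the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                   'party':  ['D', 'R', 'I', 'D']})

# Elementwise division turns counts into proportions.
dist = df.groupby('gender').apply(len) / len(df)
print(dist)  # F → 0.5, M → 0.5 on the stand-in
```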

From the above, we can read off some simple facts about how gender and party membership are distributed in Congress.

## Conclusion

We’ve covered a lot of ground already. Next time we’ll look at how to answer
complex questions involving *many* variables at once. Along the way, we’ll
learn more powerful methods of rearranging and aggregating data. Stay tuned!

*Main image from Flickr user internetarchivebookimages*.