Today I want to kick off a series of posts about slicing and dicing numerical data with Python. In particular, we’re going to do this with the pandas library (stylized lowercase). Rather than showing off all of pandas’ fanciest features, our goal will simply be to build intuition for the core abstractions that pandas gives us.

We’ll do this organically, in layers, by sticking our mitts in some concrete data. For this series, I’ve prepped a dataset stored in a comma-separated values (CSV) file. This dataset contains demographic and contact information about each current legislator in the United States Congress. It’s distilled from data provided by the Sunlight Foundation, courtesy of their wonderful Congress API.

The CSV is available as a gist here.

Let’s start by setting up our workbench!

Up and running

We’ll be using Python 3.4, which you may need to install on your system. On OS X you can use a tool like Homebrew, and on Linux it should be available via your distro’s package manager.

The right way to use Python is in a virtualenv. You can read more about them (and how to get the virtualenv binary) here. We’ll begin by creating and activating a fresh virtualenv, which properly sets up your shell PATH.

$ virtualenv venv
$ source venv/bin/activate

Next we’ll install numpy, pandas and the ipython REPL replacement:

$ pip install pandas numpy ipython

Last but not least, grab a copy of the dataset:

$ curl -O https://gist.githubusercontent.com/ranweiler/628b2c6ff6fd93ac5f90/raw/0242bb9d7690a6857e8a5ef8c72032c3ddafb470/legislators.csv

Now we can fire up the ipython REPL and load pandas:

$ ipython
In [1]: import pandas as pd

We’re ready to go!

Framing the problem

The CSV file is called legislators.csv. This file has a row of column headers followed by rows of data. If we like, we can think of it as a SQL table (and we’ll extend this analogy in a bit!). For data like this, pandas offers us a class called a DataFrame.

We can read the file into a DataFrame like so:

In [2]: df = pd.read_csv('legislators.csv', index_col=False)

Now we have a DataFrame instance! Fun fact: instead of curl-ing the CSV and loading it from the local file, we could’ve passed read_csv the remote file URL as a string. But, for our purposes, it makes more sense to keep a local copy for future practice.

We’ll ignore the details of the index_col keyword argument for now, but by setting it to False, we are telling pandas not to use any of the data’s own columns as the row index. Instead, pandas generates an auto-incrementing integer index for the rows, analogous to the surrogate-key id columns we frequently find in SQL database schemas.
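To see what this does on a toy example (an inline two-row CSV standing in for our real file), note the generated integer row index:

```python
import io

import pandas as pd

# A tiny stand-in for legislators.csv.
csv = io.StringIO("last_name,gender\nDoe,F\nRoe,M\n")

toy = pd.read_csv(csv, index_col=False)

# pandas has generated a 0-based integer index for the rows,
# rather than promoting one of the data columns to be the index.
print(toy.index.tolist())  # [0, 1]
print(list(toy.columns))   # ['last_name', 'gender']
```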

Let’s learn about the DataFrame class by answering some questions about the dataset.

Question 1: How many legislators are there in congress, anyway?

Each (non-header) row of our CSV file describes a distinct legislator. If we assume no row in our frame is degenerate (all NaN), we can count the number of legislators by calling the built-in Python len function on the DataFrame:

In [3]: len(df)
Out[3]: 540

Note: Our total count of 540 legislators comes from the sum of 435 House representatives (less 1 vacant seat), 100 senators, and 6 non-voting representatives.

A DataFrame is like a 2-dimensional ndarray from the NumPy package. (In fact, the one-dimensional Series class used to be an honest subclass of ndarray, though that has changed.) Regardless, pandas objects still respond to many of the same methods as ndarrays.

For example, like an ndarray, our DataFrame has a shape:

In [4]: df.shape
Out[4]: (540, 38)

A DataFrame will always be two-dimensional, and its shape will be (#rows, #columns). In particular, a DataFrame has two arrays of indices which define these rows and columns.

We have indices for the rows:

In [5]: df.index
Out[5]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            530, 531, 532, 533, 534, 535, 536, 537, 538, 539],
          dtype='int64', length=540)

And indices for the columns:

In [6]: df.columns
Out[6]:
Index(['bioguide_id', 'birthday', 'chamber', 'contact_form', 'crp_id',
       'district', 'facebook_id', 'fax', 'fec_ids', 'first_name', 'gender',
       'govtrack_id', 'icpsr_id', 'in_office', 'last_name', 'leadership_role',
       'lis_id', 'middle_name', 'name_suffix', 'nickname', 'oc_email',
       'ocd_id', 'office', 'other_names', 'party', 'phone', 'senate_class',
       'state', 'state_name', 'state_rank', 'term_end', 'term_start',
       'thomas_id', 'title', 'twitter_id', 'votesmart_id', 'website',
       'youtube_id'],
      dtype='object')

And we can observe that the shape is derived from the indexing arrays:

In [7]: (len(df.index), len(df.columns)) == df.shape
Out[7]: True

Notice that calling len on the DataFrame gives us the number of rows. This corresponds to the first (or 0th) axis of our DataFrame. Thus we have:

In [8]: len(df) == df.shape[0]
Out[8]: True

One caveat about counting rows this way is that there may be cases in which we have missing data that we want to ignore. pandas represents missing data with the IEEE 754 floating point NaN value. The core pandas objects respond to a count method which accounts for this. For each column index, count gives the number of non-NaN entries.
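A contrived example (not our legislator data) makes the distinction concrete:

```python
import numpy as np
import pandas as pd

# Two columns with some deliberately missing entries.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, 6.0]})

# len counts every row, NaN or not...
print(len(toy))  # 3

# ...while count tallies the non-NaN entries in each column.
print(toy.count()['a'])  # 2
print(toy.count()['b'])  # 1
```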

We’ll come back to this later, but for now, I promise that all of our rows have at least one non-NaN column. Since we know a bit about the silhouette of a DataFrame, let’s crack it open!

Question 2: What is the binary gender distribution in congress?

Now we want to ask some more interesting questions, so it’s time to learn about the Series class and selecting data from a DataFrame.

Recall that one of our column indices is gender. We can index by a single column label:

In [9]: s = df['gender']; s
Out[9]:
0      M
1      M
2      M
3      M
4      M
5      M
...
537    M
538    F
539    M
Name: gender, dtype: object

The pandas object we get back from a single-column index is a Series:

In [10]: type(s)
Out[10]: pandas.core.series.Series

A Series is basically a labelled one-dimensional array. That is, there is an array of data:

In [11]: s.values
Out[11]:
array(['M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M',
       'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F',
       ...
       'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M',
       'F', 'M', 'M', 'M', 'M', 'F', 'M'], dtype=object)

And, in parallel, there is an array of indices which label the data:

In [12]: s.index
Out[12]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            530, 531, 532, 533, 534, 535, 536, 537, 538, 539],
           dtype='int64', length=540)
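We can also build a Series from scratch with labels of our own choosing, which shows the data/index pairing in miniature (the names here are made up, of course):

```python
import pandas as pd

# A labelled one-dimensional array: data plus a parallel index.
s2 = pd.Series(['M', 'F', 'F'], index=['Smith', 'Jones', 'Lee'])

print(s2['Jones'])      # F
print(list(s2.values))  # ['M', 'F', 'F']
print(list(s2.index))   # ['Smith', 'Jones', 'Lee']
```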

Thus, we can think of a DataFrame as a Series of Series, all with compatible, properly-aligned index arrays.
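To make that picture concrete, we can build a small DataFrame out of a dict of Series and check that the columns share an aligned row index (toy data again):

```python
import pandas as pd

gender = pd.Series(['M', 'F'])
party = pd.Series(['D', 'R'])

toy = pd.DataFrame({'gender': gender, 'party': party})

# Each column is itself a Series, and all columns share the frame's row index.
print(type(toy['gender']))                    # a pandas Series
print(toy['gender'].index.equals(toy.index))  # True
```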

Now back to our question! First, let’s show what we don’t want to do:

Bad

women = 0
men = 0
for leg in s:
    if leg == 'M':
        men += 1
    else:
        women += 1

Notice that the above solution is extremely imperative. It is also not very compositional, as we’ve done crude manual counting, and we don’t have an obvious interesting data structure that we could pass on to another function in a chain.

Since Series objects are iterable, we could be a little more explicit about what we’re trying to do and use the Counter class from the collections module of the standard library:

# In an alternate universe...

In [13]: from collections import Counter
In [14]: c = Counter(s)

In [15]: c
Out[15]: Counter({'F': 108, 'M': 432})

In [16]: c['F']
Out[16]: 108

Better, but we’re still not thinking with arrays. This is where pandas benefits from Python’s generic operator overloading. Observe how the == operator works with a Series:

In [13]: s == 'F'
Out[13]:
0      False
1      False
...
538     True
539    False
Name: gender, dtype: bool

What is the type of the resulting object?

In [14]: type(s == 'F')
Out[14]: pandas.core.series.Series

In [15]: (s == 'F').shape
Out[15]: (540,)

In [16]: (s == 'F').dtype
Out[16]: dtype('bool')

What we’ve essentially done is apply the predicate lambda x: x == 'F' to each coordinate of the Series. In NumPy terms, we’ve applied the predicate as a ufunc, or a function whose action on an array is given by its action on each coordinate.

In this context, the resulting Series may seem of dubious value, as it is just a sequence of True and False scalars. But in pandas (as in NumPy), we can use such boolean masks to select exactly those DataFrame rows at whose labels the mask is True.

Concretely:

In [17]: df[s == 'F']
Out[17]:
bioguide_id    birthday chamber  \
6       E000295  1970-07-01  senate
9       R000600  1947-12-29   house
14      P000610  1964-05-13   house
...
533   http://www.feinstein.senate.gov          SenatorFeinstein
538    http://www.cantwell.senate.gov           SenatorCantwell

[108 rows x 38 columns]

And from this we can get the counts of male and female legislators:

In [18]: len(df[s == 'F'])
Out[18]: 108
In [19]: len(df[s == 'M'])
Out[19]: 432

This gets the job done, and is “vector-y”, but that doesn’t make it good. For one, we’ve assumed that the gender column only takes on the values "M" and "F". We could verify that this assumption holds for this dataset, but we’d still have to collect the results, which would entail looping over our expected variable values. Let’s ask another question and do better!

Question 3: What is the distribution of party membership?

When answering the last question, what we were really trying to do was group the rows by the value of gender and then aggregate them by counting. Using pandas, we can do this explicitly using the groupby method of DataFrame.

What kind of object does groupby give us?

In [20]: g = df.groupby('party')

In [21]: type(g)
Out[21]: pandas.core.groupby.DataFrameGroupBy

A (DataFrame)GroupBy object is kind of like an enriched Python dict that we can easily turn into a DataFrame. This is most clear when we inspect its groups property:

In [22]: g.groups
Out[22]:
{'D': [14,
       16,
...
       538,
       539],
 'I': [113, 227, 527],
 'R': [0,
       1,
       2,
...
       523,
       532,
       534]}

We can see that the GroupBy has a dict that maps values in the 'party' column to row indices. In common statistical parlance, we might call 'party' a categorical variable with three levels: 'D', 'I', and 'R'.
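A toy frame shows the shape of this mapping (hypothetical values, naturally):

```python
import pandas as pd

toy = pd.DataFrame({'party': ['D', 'R', 'D'],
                    'name': ['a', 'b', 'c']})

g = toy.groupby('party')

# groups maps each level of 'party' to the row labels it contains.
groups = {k: list(v) for k, v in g.groups.items()}
print(groups)  # {'D': [0, 2], 'R': [1]}
```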

We can grab one of the individual groups denoted by those row indices, as a DataFrame, using the get_group method of the GroupBy object.

For example, to obtain a DataFrame of all Democrats, we could do any of the following:

df.groupby('party').get_group('D')

Which is extensionally the same as:

df[df['party'] == 'D']
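A quick sanity check of that equivalence on a toy frame, comparing the two results with DataFrame.equals:

```python
import pandas as pd

toy = pd.DataFrame({'party': ['D', 'R', 'D'],
                    'seat': [1, 2, 3]})

via_groupby = toy.groupby('party').get_group('D')
via_mask = toy[toy['party'] == 'D']

# Both selections keep the original row labels, so they compare equal.
print(via_groupby.equals(via_mask))  # True
```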

But one nicety of the GroupBy object is that we can easily aggregate within the groups in various ways. Thus, we can count the number of rows (and thus the number of legislators) in each group by apply-ing the len function to the groups:

In [23]: g.apply(len)
Out[23]:
party
D    236
I      3
R    301
dtype: int64
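As an aside, GroupBy objects also respond to a size method, which I believe computes the same per-group row counts without an explicit apply; a toy check is consistent with that:

```python
import pandas as pd

toy = pd.DataFrame({'party': ['D', 'R', 'D', 'I']})
g = toy.groupby('party')

# size() is a more direct spelling of apply(len) for row counts.
print(g.size().equals(g.apply(len)))  # True
```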

Similarly, this would’ve been a better way to summarize the gender distribution amongst legislators:

In [24]: df.groupby('gender').apply(len)
Out[24]:
gender
F    108
M    432
dtype: int64

What’s nice about this method is that the result composes better with other pandas data structures. In particular, we can again utilize operator overloading to divide each entry of the resulting Series by the total number of legislators:

In [25]: df.groupby('gender').apply(len) / len(df)
Out[25]:
gender
F    0.2
M    0.8
dtype: float64

In [26]: df.groupby('party').apply(len) / len(df)
Out[26]:
party
D    0.437037
I    0.005556
R    0.557407
dtype: float64

From the above, we can read off some simple facts about how gender and party membership are distributed in Congress.

Conclusion

We’ve covered a lot of ground already. Next time we’ll look at how to answer complex questions involving many variables at once. Along the way, we’ll learn more powerful methods of rearranging and aggregating data. Stay tuned!

Main image from Flickr user internetarchivebookimages.