And we’re back, closing out our series on parsing data in Python using pandas!
We’ll be picking up from where we left off in
Part 1. Last time we covered the
Series classes, as well as basic selection and indexing. This time we’ll focus on more general forms of selection, hierarchical grouping, numerical aggregation, how to handle missing data, and even plotting!
We’ll continue our strategy of learning by doing—answering questions about our dataset of members of the Congress of the United States.
Question 4: How do gender and party relate?
Previously, we just counted legislators which were equivalent up to their level in some particular categorical variable—in pandas terms, rows which had the same value at one particular column index. We grouped by a column, then aggregated by counting.
Suppose we want to know the gender distribution by political party. This too can be thought of as a group-then-count question.
Before, we grouped by a single column index. Now we want to group by multiple column indices. With this comes a new notion: that of a non-scalar index, or
MultiIndex. Instead of passing a column index (here, a scalar
str) as an argument to
groupby, we pass a
Once again, we have a
GroupBy object. However, let’s look at the keys of the underlying
dict, which lives at
Before, our keys were simply scalars. Now they are tuples (a technical requirement, as general Python
lists are not hashable).
We can think of what is happening as grouping by a new “compound” categorical variable. We might called this
gender x party, and and its levels would be the Cartesian product of levels of the component variables:
Comparing this to the value of
g.groups.keys(), we can see that the sets of pairs mostly agree. The only exception is
('F', 'I'), which does not appear in the set of keys. This is only because it doesn’t correspond to any rows, as there aren’t currently any female Independents in Congress.
We’ll try to count rows using the same invocation as before:
Now we’re presented with a friendly table of our data, hierarchically-organized, and aggregated by counts. How did pandas know what to do? Let’s figure it out by analyzing
It is a
Series, as before. What does its
index look like?
As we can see, a
MultiIndex is a fairly complex object. All we really need to know is that it encodes the grouping we saw explicitly in the
dict keys of the
GroupBy object. This gives us some intuitive new indexing mechanisms:
However, some aspects are less intuitive:
What’s going on here? Due to Python’s parsing, the first two expressions are equivalent. However, a Python
list is “array-like”, which means that the indexing attempt gets interpreted as the query: “give me the values of
s whose value at the first component of the
"gender") is the string
'D' is not a valid level for
'gender', including it in our query never causes any new rows to match. Thus
s[['F', 'D']] ends ups being the same as
This is contrary to what we might’ve meant in this context, namely: “give me the values of
s indexed by
'F' in the first component of the
MultiIndex and then by
'D' in the second component”. As we saw, that query would be expressed as
Another nice feature of this multi-level grouping is that it makes it easy to perform within-group computations.
For example, as we saw last time, we can divide a
Series by a scalar to get the result of a corresponding coordinate-wise division. This lets us obtain the ratio of each gender/party combination relative to the overall population of legislators:
But what if we want the ratios within each gender, i.e. relative to the overall count of lawmakers with a given gender? By simply making another, coarser
GroupBy object, we can get this in one fell swoop using coordinate-wise array division:
From this, we can easily draw conclusions like “26.9% of women in Congress are Republican”, or “36.3% of men in Congress are Democrat”.
Question 5: What are the descriptive statistics on legislator count by state?
Now that we’ve warmed up a bit and know how to group and count, let’s look at these counts across states. We might care to ask some standard statistical questions. Which states have the fewest legislators? The most? What’s the average number of legislators? The median? What’s the distribution look like?
These questions really only make sense for the House, not the Senate, since there are exactly 2 Senators per state. So, we’ll first select only House legislators from our original
Then we’ll perform a group-then-count:
This time, instead of working with a “raw”
Series, we are going to convert it into a
DataFrame. In doing so, we provide a name for the column that contains our new aggregate value:
One thing we can do with the
DataFrame is call the
describe method on our new column to get some quick summary statistics:
It’s also easy to get an explicit list of the top 5 states by House representation. We will sort by
reps column, reverse the result with the Python slice syntax
[::-1], and then use the pandas
head method to take the first 5 rows of the result:
DataFrame objects respond to the
tail methods, which optionally accept a number of rows (the default is 5) to return from the start (
head) or end (
tail) of the array of rows.
So, we also could’ve taken the
tail and then reversed:
Or simply stated our preferred ordering as an argument to
In truth, we could’ve done the same thing with the plain old
Series. But the
DataFrame can be a more convenient object to work with at times.
One example is with plotting. Though we can plot with
Series objects, the extra labeling we usually have on a
DataFrame makes producing a box plot of our state house count as easy as:
And the resulting image visually summarizes the result of
We’ll see another example when answering the next (and last!) question.
Question 6: What are the top 10 states with the highest proportion of female legislators?
This question can be answered with our previous techniques plus a bit of sorting. This time we’ll include senators.
We need to:
- Group and count the women legislators by state
- Find the proportion of women lawmakers within each state
- Sort the resulting rows from greatest to least proportion
We’ve already figured out how to do step 1:
And computing step 2 is similar to the last example:
This time, however, we have some
NaN values, since when we grouped, some states didn’t correspond to any rows. The pandas
fillna method lets us replace
NaN values with the appropriate value—in this case, 0:
Finally, we sort and take the top 10 states with the
There is much, much more to pandas than the simple selecting, sorting, grouping, and plotting we’ve done so far. Furthermore, one of the benefits of using a popular library like pandas is that you can hook into the huge ecosystem built around it. For example, if we were to try to visualize our data set in a more interesting way, I would probably reach for something like the Seaborn library, rather than using the pandas defaults.
Despite everything we haven’t covered, my hope is that after reading this, you now feel like you really understand the central data structures that pandas gives us. With a solid conceptual background, you can dig into the (excellent) official pandas docs.
That’s all in this series for now. In the meantime, happy histogramming!
Main image from Flickr user internetarchivebookimages.