A trip through pandas
The moment we start doing data science with Python, a name appears on the horizon: pandas. It is one of the best-known libraries for data processing, and it integrates very well with most machine learning frameworks, such as scikit-learn or XGBoost. That is why learning how to handle your data in Python with this library is almost essential, and in this article we are going to go through some of the most useful things you can do with it.
Creating or Importing DataFrames
The first element you will need to know is the DataFrame, the structure where all the data will be stored. It is a database-like object that we will use and modify throughout the tutorial.
Note that the DataFrame is an in-memory structure that works well for typical machine learning workloads. If you are dealing with Big Data, where you might have gigabytes and gigabytes of data and could profit from cloud scalability, there are more suitable libraries, such as Dask.
First things first, we need to import the library:
import pandas as pd
It is a well-established convention to use the pd alias for pandas, and since you are going to write it so many times, the shorter the better.
Now, the general way to build a DataFrame is as follows:
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6]])
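You can also name the columns at creation time, either by passing a dictionary or by adding a columns argument. A minimal sketch (the column names 'a', 'b', 'c' are just for illustration):
df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})
# or, equivalently, from a list of rows with explicit column names
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])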
Usually you are not going to be creating your datasets by hand; rather, you will import them from somewhere. In this tutorial, we are going to use the wine dataset found here. CSV files like this one are widely used in machine learning to store datasets. To load it:
df = pd.read_csv('./wine.csv')
To check that it loaded properly, let’s check the first five elements with:
df.head(5)
Wine Alcohol Malic.acid Ash Acl Mg ... Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.23 1.71 2.43 15.6 127 ... 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 ... 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 ... 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 ... 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 ... 0.39 1.82 4.32 1.04 2.93 735
We can see that it even loaded the column names. The read_csv function takes the first row as the column names by default. If you don't want that to happen (because the column names are not in the file, for example), you have to add header=None. If you have a list of column names (they must be unique), you can pass it with names=my_list_of_cols. Finally, if your CSV uses a delimiter other than a comma, say a semicolon, you can specify it with sep=';'. With these options, you will be able to load pretty much any dataset you get your hands on, and you are ready to go.
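As an illustration, this is how those options could be combined to load a headerless, semicolon-delimited file (the file name and column list here are hypothetical, not part of the tutorial's dataset):
my_list_of_cols = ['Wine', 'Alcohol', 'Ash']  # hypothetical column names
df_other = pd.read_csv('./other_wine.csv', sep=';', header=None, names=my_list_of_cols)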
Indexing
Now that we have a set of data, let’s see how to pick parts of it for further study.
We already know how to take the first n elements, but what if we want a random sample? For that, we use:
df.sample(5)
Wine Alcohol Malic.acid Ash Acl Mg ... Nonflavanoid.phenols Proanth Color.int Hue OD Proline
79 2 12.70 3.87 2.40 23.0 101 ... 0.43 1.95 2.57 1.19 3.13 463
135 3 12.60 2.46 2.20 18.5 94 ... 0.63 0.94 7.10 0.73 1.58 695
164 3 13.78 2.76 2.30 22.0 90 ... 0.41 1.03 9.58 0.70 1.68 615
144 3 12.25 3.88 2.20 18.5 112 ... 0.29 1.14 8.21 0.65 2.00 855
133 3 12.70 3.55 2.36 21.5 106 ... 0.17 0.84 5.00 0.78 1.29 600
[5 rows x 14 columns]
This can be very useful for plotting graphs or checking the distribution of the data. But let's say you want to be in control, and you need rows 5 to 14.
You then have to use:
df.iloc[5:15]
Wine Alcohol Malic.acid Ash Acl ... Proanth Color.int Hue OD Proline
5 1 14.20 1.76 2.45 15.2 ... 1.97 6.75 1.05 2.85 1450
6 1 14.39 1.87 2.45 14.6 ... 1.98 5.25 1.02 3.58 1290
7 1 14.06 2.15 2.61 17.6 ... 1.25 5.05 1.06 3.58 1295
8 1 14.83 1.64 2.17 14.0 ... 1.98 5.20 1.08 2.85 1045
9 1 13.86 1.35 2.27 16.0 ... 1.85 7.22 1.01 3.55 1045
10 1 14.10 2.16 2.30 18.0 ... 2.38 5.75 1.25 3.17 1510
11 1 14.12 1.48 2.32 16.8 ... 1.57 5.00 1.17 2.82 1280
12 1 13.75 1.73 2.41 16.0 ... 1.81 5.60 1.15 2.90 1320
13 1 14.75 1.73 2.39 11.4 ... 2.81 5.40 1.25 2.73 1150
14 1 14.38 1.87 2.38 12.0 ... 2.96 7.50 1.20 3.00 1547
[10 rows x 14 columns]
But what if you only want the Alcohol and Ash of those rows? Then you need:
df.loc[5:14,['Alcohol', 'Ash']]
Alcohol Ash
5 14.20 2.45
6 14.39 2.45
7 14.06 2.61
8 14.83 2.17
9 13.86 2.27
10 14.10 2.30
11 14.12 2.32
12 13.75 2.41
13 14.75 2.39
14 14.38 2.38
Whoa, some changes here. Why? The first change is the replacement of iloc by loc. This one is fairly simple: while iloc works with integer positions, loc works with labels. If we want to select a column by its name, we need a label; if we knew its position instead, we could still use iloc.
The second change is the range. Before we used 5:15, but now 5:14? This is because iloc uses standard Python ranges, which exclude the last element. But given that loc works with all kinds of labels, including strings, it is more convenient for it to include the last element. If you wanted to select labels from a to s, it is easier to write a:s than a:t (knowing that t comes after s), and it gets even harder when the labels are full words. The only caveat is to remember this when indexing with integers.
Lastly, the columns are passed as a list. Even if it is only one column (Alcohol), passing it as a list keeps the result as a DataFrame.
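For comparison, here is a sketch of the same selection done with iloc, using integer positions instead of labels (assuming Alcohol and Ash sit at positions 1 and 3, as they do in this dataset) and the exclusive end of the range:
df.iloc[5:15, [1, 3]]  # same rows and columns, selected by position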
Now, let's say we want to be even more precise. We want to know which of those rows have Alcohol greater than 14:
df.loc[5:14,['Alcohol', 'Ash']].loc[df.Alcohol > 14]
Alcohol Ash
5 14.20 2.45
6 14.39 2.45
7 14.06 2.61
8 14.83 2.17
10 14.10 2.30
11 14.12 2.32
13 14.75 2.39
14 14.38 2.38
We can chain different loc operations to filter the DataFrame even further. And see that df.Alcohol? Yes, we can access the columns of a DataFrame as if they were attributes of an object. Let's make it more complex and also filter for Ash smaller than 2.4:
df.loc[5:14,['Alcohol', 'Ash']].loc[(df.Alcohol > 14) & (df.Ash < 2.4)]
Alcohol Ash
8 14.83 2.17
10 14.10 2.30
11 14.12 2.32
13 14.75 2.39
14 14.38 2.38
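By the way, accessing a column as an attribute (df.Alcohol, df.Ash) only works when the column name is a valid Python identifier; a column whose name contains a dot, like Color.int in this dataset, has to be accessed with brackets:
df['Color.int']  # df.Color.int would not work here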
Note that each comparison produces a boolean mask (a Series of True/False values), and loc uses these masks to filter the rows. To combine them element-wise we need the bitwise operator &, not Python's and keyword (there is no && in Python). If we wanted to specify an "or", we would use a single pipe, |.
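As a quick sketch, the same kind of filter with an "or" condition would look like this:
df.loc[(df.Alcohol > 14) | (df.Ash < 2.4), ['Alcohol', 'Ash']]  # rows matching either condition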
Creating new values
When processing data to study relationships, or to train a model, you will often need to create new values. Sometimes you need to build a value from existing ones, or to split a certain value into two. Sometimes you have a string that represents multiple values and you want to encode them. Other times you just think you might have spotted an interesting ratio and you want to check how some values behave together.
Let's say you want a new value that represents the ratio of Alcohol to Ash. It is as simple as writing the following:
df['AlcoholAshRatio'] = df['Alcohol'] / df['Ash']
df.head(5)
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline AlcoholAshRatio
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065 5.855967
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050 6.168224
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185 4.928839
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480 5.748000
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735 4.613240
Here we have our ratio, ready to plot on a graph, pass to a model, and so on. But instead of looking at every value to see how the ratio behaves, we can study the most important statistics about it with the following command:
df.AlcoholAshRatio.describe()
count 178.000000
mean 5.562817
std 0.693379
min 3.578947
25% 5.107217
50% 5.561649
75% 5.991097
max 9.095588
Name: AlcoholAshRatio, dtype: float64
Well, our new value has a very low standard deviation, and its percentiles are close to one another. Except for some outliers (as shown by the max and min values), it seems that Alcohol and Ash are indeed related. This could be important information, depending on what we do with the data.
Functional Programming
We have seen how to create values from other columns, but we can go even further with a bit of functional programming.
.map()
The first tool we are going to use is a map. For those who don't yet have a grasp of functional programming, a map is a function that takes a set of values and "maps" each of them to a value in a second set. In pandas this is typically done with a lambda function. Let's say we want to invert the AlcoholAshRatio column. We can just do:
df.AlcoholAshRatio.map(lambda p: 1/p)
0 0.170766
1 0.162121
2 0.202888
3 0.173974
4 0.216767
...
173 0.178702
174 0.185075
175 0.170309
176 0.179954
177 0.193914
Name: AlcoholAshRatio, Length: 178, dtype: float64
Now we can see it as a ratio from 0 to 1. Maybe easier to plot or to study. But… wait. Have we lost the original ratio?
df.AlcoholAshRatio
0 5.855967
1 6.168224
2 4.928839
3 5.748000
4 4.613240
...
173 5.595918
174 5.403226
175 5.871681
176 5.556962
177 5.156934
Name: AlcoholAshRatio, Length: 178, dtype: float64
No. As we can see, the original ratio is not modified; these functions do not modify the original values. This is especially useful for building even more new columns:
df['AshAlcoholRatio'] = df.AlcoholAshRatio.map(lambda p: 1/p)
df.head(5)
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline AlcoholAshRatio AshAlcoholRatio
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065 5.855967 0.170766
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050 6.168224 0.162121
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185 4.928839 0.202888
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480 5.748000 0.173974
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735 4.613240 0.216767
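Besides a function, .map() also accepts a dictionary of replacements, which comes in handy for encoding categorical values. A minimal sketch (the cultivar labels here are made up for illustration, they are not part of the dataset):
df.Wine.map({1: 'cultivar_1', 2: 'cultivar_2', 3: 'cultivar_3'})  # hypothetical labels for the Wine classes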
.apply()
Say now that we want to transform the whole DataFrame by applying a function to each row. We want to create a new value that tells whether AlcoholAshRatio is a low or a high outlier, or a mid-range value. We can write a function such as this:
def AAR_outlier(row):
    # mean and standard deviation of the ratio over the whole DataFrame
    # (recomputed on every row; fine for a dataset this small)
    aar_mean = df.AlcoholAshRatio.mean()
    aar_std = df.AlcoholAshRatio.std()
    if row.AlcoholAshRatio > aar_mean + aar_std:
        row['AAR_outlier'] = 'high'
    elif row.AlcoholAshRatio < aar_mean - aar_std:
        row['AAR_outlier'] = 'low'
    else:
        row['AAR_outlier'] = 'mid'
    return row
df.apply(AAR_outlier, axis='columns')
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline AlcoholAshRatio AshAlcoholRatio AAR_outlier
0 1.0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 5.855967 0.170766 mid
1 1.0 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 6.168224 0.162121 mid
2 1.0 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 4.928839 0.202888 mid
3 1.0 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 5.748000 0.173974 mid
4 1.0 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 4.613240 0.216767 low
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 3.0 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0 5.595918 0.178702 mid
174 3.0 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0 5.403226 0.185075 mid
175 3.0 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0 5.871681 0.170309 mid
176 3.0 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0 5.556962 0.179954 mid
177 3.0 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0 5.156934 0.193914 mid
[178 rows x 17 columns]
Just as before, it did not modify the original DataFrame. Note that to apply the function row-wise, we passed the parameter axis='columns'. This might seem counter-intuitive, so be careful when applying functions this way. Conversely, axis='index' (the default) means that the object passed to the function will be a column.
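As a quick sketch of that other direction: with the default axis, the function receives one column at a time, so something like the following would compute the range of a few numeric columns (just an illustration, not part of the original tutorial):
df[['Alcohol', 'Ash', 'AlcoholAshRatio']].apply(lambda col: col.max() - col.min())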
Lastly, the keen-eyed might have noticed that this is just a more elaborate version of what we did earlier to create AlcoholAshRatio. Indeed, a DataFrame supports a set of basic operations on its columns, such as addition, multiplication and so on. These operations are much faster and more efficient, but also much more limited. In the end, a good programmer knows the tools at hand and when to use each of them to build an efficient and powerful pipeline. Sometimes you will need the speed of a simple sum between two columns; other times you will need the expressiveness of a function applied per row (or column), despite it being slower. pandas gives you both tools, and it is up to you to choose wisely.
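To make the trade-off concrete, here is a sketch of a vectorized alternative to the row-wise AAR_outlier function above; using numpy's np.select is my choice for this sketch, not something the tutorial itself relies on:
import numpy as np
aar_mean = df.AlcoholAshRatio.mean()
aar_std = df.AlcoholAshRatio.std()
# classify every row at once instead of calling a Python function per row,
# and write the result straight into a new column
df['AAR_outlier'] = np.select(
    [df.AlcoholAshRatio > aar_mean + aar_std,
     df.AlcoholAshRatio < aar_mean - aar_std],
    ['high', 'low'],
    default='mid')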
Closure… or not?
pandas is truly a marvel when it comes to treating and studying data. It lets you do a great deal with your data once you know how to handle it, and it is one of the most versatile tools you can have in your data scientist toolbox. In this tutorial we have but scratched the surface of what a DataFrame can do; and even so, a great deal of data processing can be done with it.
In a second part, we will explore more aspects of pandas, such as how to group columns, or do SQL-like queries. How to sort by different indices, or how to split and join multiple similar (or not so similar) datasets. How to build a transforming pipeline that can process similar DataFrames. We are not done yet.