Filtering In Pandas Dataframe
Pandas dataframe is like a small database, we can use it to inject some data and do some in-memory filtering without any external SQL. This post is much like a summary of this StackOverflow thread.
Building dataframe
In [1]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
...: 'B': 'one one two three two two one three'.split(),
...: 'C': np.arange(8), 'D': np.arange(8) * 2})
In [2]: df
Out[2]:
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
Some basic filtering conditions
Filtering by A = ‘foo’
In [3]: df[df.A == 'foo']
Out[3]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Filtering by A = ‘foo’ and B = ‘one’
In [4]: df[(df.A == 'foo') & (df.B == 'one')]
Out[4]:
A B C D
0 foo one 0 0
6 foo one 6 12
Filtering by A = ‘foo’ or B = ‘one’
In [5]: df[(df.A == 'foo') | (df.B == 'one')]
Out[5]:
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Different ways to achieve the same filtering
Let’s take the example for filtering by
A = 'foo' and B = 'one'
Column as dataframe property
In [4]: df[(df.A == 'foo') & (df.B == 'one')]
Out[4]:
A B C D
0 foo one 0 0
6 foo one 6 12
Column as dataframe dict key
In [7]: df[(df['A'] == 'foo') & (df['B'] == 'one')]
Out[7]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using multiple single filters
In [16]: df[df.A == 'foo'][df.B == 'one']
C:\Users\xiang\AppData\Local\PackageManagement\NuGet\Packages\python.3.7.0\tools\Scripts\ipython:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Out[16]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using numpy array
In [24]: df[(df.A.values == 'foo') & (df.B.values == 'one')]
Out[24]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using isin function
In [9]: df[( df['A'].isin(['foo']) ) & ( df['B'].isin(['one']) )]
Out[9]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using underlying numpy in1d function
In [25]: df[(np.in1d(df['A'].values, ['foo'])) & (np.in1d(df['B'].values, ['one']))]
Out[25]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using query API (developer friendly)
In [10]: df.query("(A == 'foo') & (B == 'one')")
Out[10]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using numpy where function and dataframe iloc positional indexing
In [20]: df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
Out[20]:
A B C D
0 foo one 0 0
6 foo one 6 12
Using xs label indexing
The Syntax is too complicated.
Developer friendly filtering
As mentioned previously, the query API method is a developer friendly filtering method.
Why? All the other methods must include the original df
object in the filter. If we have a dynamic filter conditions, it will be difficult to generate the filters (pandas Series) with the df
object. I haven’t found the solution to build this kind of filter by looping over a Python dict.
For example:
The filter is based on a Python dict, the key of the dict
corresponds to the dataframe column, and the value of the dict
corresponds to the value to dataframe column value to filter. One more context, if the value is None, don’t filter on the corresponding key (column).
Suppose the filter dict is like this one:
filter_dict = {'A': 'foo', 'B': 'one', 'C': None, 'D': None}
By using df
object in the filter, we should see something like this:
df[(df['A'] == 'foo') & (df['B'] == 'one')]
It’s easy to type manually the filter directly from a shell (ipython or jupyter as you like), but how you build the same filter from a Python script ? Not simple.
Please let me know if you have any suggestions :)
But with the query API, we just need to convert the filter_dict
to a string like: "(A == 'foo') & (B == 'one')"
. This is pretty easy in pure Python:
In [32]: filter_dict = {'A': 'foo', 'B': 'one', 'C': None, 'D': None}
In [33]: filter_string = " & ".join(["{} == '{}'".format(k,v) for k,v in filter_dict.items() if v is not None])
In [34]: filter_string
Out[34]: "A == 'foo' & B == 'one'"}
Benchmark
You can get the benchmark from this StackOverflow thread.
Generally speaking, except for the query API
and the xs label indexing
methods, all the others are fast.
But for a large quantity of data, the query API becomes pretty fast.
Some benchmarks I tested from my laptop:
For 8 lines of data
In [35]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
...: 'B': 'one one two three two two one three'.split(),
...: 'C': np.arange(8), 'D': np.arange(8) * 2})
In [36]: %timeit df.query("(A == 'foo') & (B == 'one')")
1.48 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit df[df.A == 'foo'][df.B == 'one']
1.01 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [38]: %timeit df[(df.A == 'foo') & (df.B == 'one')]
688 µs ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit df[(df.A.values == 'foo') & (df.B.values == 'one')]
248 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
287 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 30k lines of data
In [51]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'A': ('foo bar ' * 15000).split(),
...: 'B': ('one one two two three three ' * 5000).split(),
...: 'C': np.arange(30000), 'D': np.arange(30000) * 2})
In [52]: %timeit df.query("(A == 'foo') & (B == 'one')")
2.83 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit df[df.A == 'foo'][df.B == 'one']
6.51 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [54]: %timeit df[(df.A == 'foo') & (df.B == 'one')]
5.58 ms ± 480 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [55]: %timeit df[(df.A.values == 'foo') & (df.B.values == 'one')]
1.47 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [56]: %timeit df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
1.5 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Leave a comment