Filtering In Pandas Dataframe#

Pandas dataframe is like a small database, we can use it to inject some data and do some in-memory filtering without any external SQL. This post is much like a summary of this StackOverflow thread.

Building dataframe#

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
   ...:                    'B': 'one one two three two two one three'.split(),
   ...:                    'C': np.arange(8), 'D': np.arange(8) * 2})

In [2]: df
Out[2]:
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
3  bar  three  3   6
4  foo    two  4   8
5  bar    two  5  10
6  foo    one  6  12
7  foo  three  7  14

Some basic filtering conditions#

Filtering by A = 'foo'#

In [3]: df[df.A == 'foo']
Out[3]:
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Filtering by A = 'foo' and B = 'one'#

In [4]: df[(df.A == 'foo') & (df.B == 'one')]
Out[4]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Filtering by A = 'foo' or B = 'one'#

In [5]: df[(df.A == 'foo') | (df.B == 'one')]
Out[5]:
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Different ways to achieve the same filtering#

Let's take the example for filtering by A = 'foo' and B = 'one'

Column as dataframe property#

In [4]: df[(df.A == 'foo') & (df.B == 'one')]
Out[4]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Column as dataframe dict key#

In [7]: df[(df['A'] == 'foo') & (df['B'] == 'one')]
Out[7]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using multiple single filters#

In [16]: df[df.A == 'foo'][df.B == 'one']
C:\Users\xiang\AppData\Local\PackageManagement\NuGet\Packages\python.3.7.0\tools\Scripts\ipython:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Out[16]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using numpy array#

In [24]: df[(df.A.values == 'foo') & (df.B.values == 'one')]
Out[24]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using isin function#

In [9]: df[( df['A'].isin(['foo']) ) & ( df['B'].isin(['one']) )]
Out[9]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using underlying numpy in1d function#

In [25]: df[(np.in1d(df['A'].values, ['foo'])) & (np.in1d(df['B'].values, ['one']))]
Out[25]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using query API (developer friendly)#

In [10]: df.query("(A == 'foo') & (B == 'one')")
Out[10]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using numpy where function and dataframe iloc positional indexing#

In [20]: df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
Out[20]:
     A    B  C   D
0  foo  one  0   0
6  foo  one  6  12

Using xs label indexing#

The Syntax is too complicated.

Developer friendly filtering#

As mentioned previously, the query API method is a developer friendly filtering method.

Why? All the other methods must include the original df object in the filter. If we have a dynamic filter conditions, it will be difficult to generate the filters (pandas Series) with the df object. I haven't found the solution to build this kind of filter by looping over a Python dict.

For example:

The filter is based on a Python dict, the key of the dict corresponds to the dataframe column, and the value of the dict corresponds to the value to dataframe column value to filter. One more context, if the value is None, don't filter on the corresponding key (column).

Suppose the filter dict is like this one:

filter_dict = {'A': 'foo', 'B': 'one', 'C': None, 'D': None}

By using df object in the filter, we should see something like this:

df[(df['A'] == 'foo') & (df['B'] == 'one')]

It's easy to type manually the filter directly from a shell (ipython or jupyter as you like), but how you build the same filter from a Python script ? Not simple.

Please let me know if you have any suggestions :)

But with the query API, we just need to convert the filter_dict to a string like: "(A == 'foo') & (B == 'one')". This is pretty easy in pure Python:

In [32]: filter_dict = {'A': 'foo', 'B': 'one', 'C': None, 'D': None}

In [33]: filter_string = " & ".join(["{} == '{}'".format(k,v) for k,v in filter_dict.items() if v is not None])

In [34]: filter_string
Out[34]: "A == 'foo' & B == 'one'"}

Benchmark#

You can get the benchmark from this StackOverflow thread.

Generally speaking, except for the query API and the xs label indexing methods, all the others are fast.

Note

But for a large quantity of data, the query API becomes pretty fast.

Some benchmarks I tested from my laptop:

For 8 lines of data#

In [35]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
   ...:                    'B': 'one one two three two two one three'.split(),
   ...:                    'C': np.arange(8), 'D': np.arange(8) * 2})

In [36]: %timeit df.query("(A == 'foo') & (B == 'one')")
1.48 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [37]: %timeit df[df.A == 'foo'][df.B == 'one']
1.01 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [38]: %timeit df[(df.A == 'foo') & (df.B == 'one')]
688 µs ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %timeit df[(df.A.values == 'foo') & (df.B.values == 'one')]
248 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]: %timeit df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
287 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For 30k lines of data#

In [51]: import pandas as pd
    ...: import numpy as np
    ...: df = pd.DataFrame({'A': ('foo bar ' * 15000).split(),
    ...:                    'B': ('one one two two three three ' * 5000).split(),
    ...:                    'C': np.arange(30000), 'D': np.arange(30000) * 2})

In [52]: %timeit df.query("(A == 'foo') & (B == 'one')")
2.83 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit df[df.A == 'foo'][df.B == 'one']
6.51 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [54]: %timeit df[(df.A == 'foo') & (df.B == 'one')]
5.58 ms ± 480 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [55]: %timeit df[(df.A.values == 'foo') & (df.B.values == 'one')]
1.47 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [56]: %timeit df.iloc[np.where( (df.A.values=='foo') & (df.B.values=='one') )]
1.5 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)