pandas(9)

"Python Data Science Handbook" study notes

High-Performance Pandas: eval() and query()

Motivating query() and eval(): Compound Expressions

NumPy and Pandas support fast vectorized operations. For example,
the elements of two arrays can be added like this:

import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y
4.01 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is much faster than doing the addition with an ordinary Python loop or
comprehension:

%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))
563 ms ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, this approach becomes less efficient when evaluating compound expressions.
Consider, for example, the following expression:

mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression separately, this is roughly equivalent to:

tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

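That is, every intermediate step is explicitly allocated in memory. If the x and y arrays
are very large, this costs a significant amount of memory and computation time. The Numexpr
library can evaluate this kind of compound expression element by element, without allocating
full intermediate arrays: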

import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)
True

The benefit is that Numexpr evaluates the expression without allocating full-sized temporary
arrays, so it can be much more efficient than NumPy, particularly for large arrays. The Pandas
eval() and query() tools are built on Numexpr.
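To make the cost of the temporaries concrete, here is a minimal sketch that measures the intermediate arrays NumPy allocates for the mask expression above. It uses only NumPy; the variable name temp_bytes is introduced here for illustration and is not from the original notes.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# NumPy materializes each subexpression as a full boolean array
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Boolean arrays use one byte per element, so the two temporaries
# together occupy as many bytes as there are elements in x and y
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(temp_bytes)  # 2000000
```

With float-valued intermediates (for example x * 2 + y) each temporary would be eight times larger, which is where avoiding them starts to matter.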

pandas.eval() for Efficient Operations

The Pandas eval() function uses string expressions to compute operations on DataFrames
efficiently. For example, consider the following DataFrames:

import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

To compute the sum of all four DataFrames using the usual Pandas approach, you could write:

%timeit df1 + df2 + df3 + df4
116 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The same result can be computed with pd.eval by writing the expression as a string:

%timeit pd.eval('df1 + df2 + df3 + df4')
53.7 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The eval() version is about twice as fast as the ordinary expression (and uses less memory),
while giving the same result:

np.allclose(df1 + df2 + df3 + df4,
            pd.eval('df1 + df2 + df3 + df4'))
True
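The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. This is only an illustration under some assumptions: the DataFrames are smaller than in the text so it runs quickly, the helper names plain_sum and eval_sum are invented here, and the frames are passed through pd.eval's local_dict argument so the lookup does not depend on the calling scope. Timings vary by machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(10_000, 10)) for _ in range(4))

# Explicit namespace for pd.eval (an assumption made for robustness)
frames = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}

def plain_sum():
    # Ordinary Pandas arithmetic: allocates intermediate DataFrames
    return df1 + df2 + df3 + df4

def eval_sum():
    # Same expression evaluated from a string
    return pd.eval('df1 + df2 + df3 + df4', local_dict=frames)

t_plain = timeit.timeit(plain_sum, number=5)
t_eval = timeit.timeit(eval_sum, number=5)
print(f'plain: {t_plain:.4f} s, eval: {t_eval:.4f} s (5 runs each)')
```

On small inputs like these the string version may well be slower, because parsing the expression has a fixed overhead.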

Operations Supported by pd.eval()

As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate them,
create the following integer DataFrames:

df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))
  • Arithmetic operators.

pd.eval() supports all arithmetic operators, for example:

result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
True
  • Comparison operators.

pd.eval() supports all comparison operators, including chained expressions:

result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
True
  • Bitwise operators.

pd.eval() supports the & (and) and | (or) bitwise operators:

result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
True

In addition, the literal words and and or can be used in Boolean expressions:

result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
True
  • Object attributes and indices.

pd.eval() supports access to object attributes via the obj.attr syntax and
to indexes via the obj[index] syntax:

result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)
True
  • Other operations.

Function calls, conditional statements, loops, and other more involved constructs are
currently not supported by pd.eval(). If you need these more complex operations, you can
use the Numexpr library directly.
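The operator families listed above can be checked on a tiny example where the answer is easy to verify by hand. This is a minimal sketch; the single-column DataFrames a, b, and c are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'v': [1, 5, 9]})
b = pd.DataFrame({'v': [2, 5, 7]})
c = pd.DataFrame({'v': [3, 6, 7]})

# Chained comparison versus the explicit mask it stands for
chained = pd.eval('a < b <= c')
explicit = (a < b) & (b <= c)

# Word operators as an alternative spelling of the bitwise ones
worded = pd.eval('(a < b) and (b <= c)')

print(chained['v'].tolist())                              # [True, False, False]
print(chained.equals(explicit), worded.equals(explicit))  # True True
```

Only the first row satisfies 1 < 2 <= 3; the second fails at 5 < 5 and the third at 9 < 7.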

DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an
eval() method that works in a similar way. The advantage of the eval() method is that
columns can be referred to by name, for example:

df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
          A         B         C
0  0.375506  0.406939  0.069938
1  0.069087  0.235615  0.154374
2  0.677945  0.433839  0.652324
3  0.264038  0.808055  0.347197
4  0.589161  0.252418  0.557789

Using pd.eval() as described above, an expression over the three columns can be computed like this:

result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

The DataFrame.eval() method allows the same expression to be written more concisely using only the column names:

result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
True

Note that the column names are treated as variables inside the evaluated expression, and the result is the same.
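The same idea on a two-row DataFrame, where the arithmetic can be followed by hand. This is a small sketch; df_small and its values are invented for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0],
                         'B': [3.0, 4.0],
                         'C': [3.0, 4.0]})

# Column names act as variables inside the expression string
by_name = df_small.eval('(A + B) / (C - 1)')

# Equivalent expression with explicit column indexing
by_index = (df_small['A'] + df_small['B']) / (df_small['C'] - 1)

print(by_name.tolist())  # [2.0, 2.0]
```

Row 0 gives (1 + 3) / (3 - 1) = 2.0 and row 1 gives (2 + 4) / (4 - 1) = 2.0.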

  1. Assignment in DataFrame.eval()

In addition to the operations described above, DataFrame.eval() can also assign to a
column. We will reuse the DataFrame from before, whose columns are 'A', 'B', and 'C':

df.head()
          A         B         C
0  0.375506  0.406939  0.069938
1  0.069087  0.235615  0.154374
2  0.677945  0.433839  0.652324
3  0.264038  0.808055  0.347197
4  0.589161  0.252418  0.557789

We can use df.eval() to create a new column 'D' and assign to it a value computed from
the other columns:

df.eval('D = (A + B) / C', inplace=True)
df.head()
          A         B         C          D
0  0.375506  0.406939  0.069938  11.187620
1  0.069087  0.235615  0.154374   1.973796
2  0.677945  0.433839  0.652324   1.704344
3  0.264038  0.808055  0.347197   3.087857
4  0.589161  0.252418  0.557789   1.508776

You can also modify existing columns:

df.eval('D = (A - B) / C', inplace=True)
df.head()
          A         B         C         D
0  0.375506  0.406939  0.069938 -0.449425
1  0.069087  0.235615  0.154374 -1.078728
2  0.677945  0.433839  0.652324  0.374209
3  0.264038  0.808055  0.347197 -1.566886
4  0.589161  0.252418  0.557789  0.603708
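The assignments above pass inplace=True, which modifies df itself. As a complementary sketch, leaving inplace at its default of False returns a new DataFrame and leaves the original untouched. The names df_small and with_d are made up for this example.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# Without inplace=True, eval() returns a copy that carries the new column
with_d = df_small.eval('D = A + B')

print(list(df_small.columns))  # ['A', 'B']
print(with_d['D'].tolist())    # [4.0, 6.0]
```

This form is convenient when the original DataFrame should stay unchanged, at the cost of holding a second copy in memory.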
  2. Local variables in DataFrame.eval()

The DataFrame.eval() method supports an additional syntax, the @ character, for working
with local Python variables, as follows:

column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)
True

The @ character marks a variable name rather than a column name, which lets you evaluate
expressions that draw on two "namespaces": the namespace of columns and the namespace of
Python objects. Note that the @ character is only supported by the DataFrame.eval() method,
not by the pandas.eval() function, because pandas.eval() only has access to a single
(Python) namespace.
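The two namespaces are easiest to see when a column and a local variable share a name. In this minimal sketch (df_small and the local variable A are invented for illustration), the bare A refers to the column and @A to the Python variable.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0, 3.0]})
A = 10.0  # local Python variable with the same name as the column

# 'A' resolves to the column, '@A' to the local variable
result = df_small.eval('A + @A')

print(result.tolist())  # [11.0, 12.0, 13.0]
```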

The DataFrame.query() Method

DataFrames have another method based on evaluated strings, called
query(). Consider the following:

result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)
True

As in the earlier DataFrame.eval() examples, this is an expression over the columns of the
DataFrame, but it cannot be written using the DataFrame.eval() syntax. For this kind of
filtering operation, you can use the query() method instead:

result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
True

Besides being more efficient to compute, this syntax is also easier to read than the masking
expression. Note that the query() method also accepts the @ character to refer to local variables:

Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
True
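A three-row example makes the filtering easy to check by eye. This is a small sketch; df_small and limit are hypothetical names chosen for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [0.1, 0.7, 0.3],
                         'B': [0.2, 0.1, 0.9]})
limit = 0.5  # local variable referenced with @ inside the query string

# Keep the rows where both columns are below the limit
selected = df_small.query('A < @limit and B < @limit')

# The equivalent masking expression
masked = df_small[(df_small.A < limit) & (df_small.B < limit)]

print(selected.index.tolist())  # [0]
```

Only the first row has both values below 0.5, and the query result matches the masked selection.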

Performance: When to Use These Functions

When deciding whether to use these two functions, there are two things to consider: computation
time and memory use, of which memory use is the more important factor. As described earlier,
every compound expression involving NumPy arrays or Pandas DataFrames implicitly creates
temporary arrays. For example, this:
x = df[(df.A < 0.5) & (df.B < 0.5)]

is roughly equivalent to:

tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]

If the temporary DataFrames are large compared to your available system memory (typically
several gigabytes), it is best to use an eval() or query() expression. You can estimate the
approximate memory footprint of a DataFrame in bytes like this:

df.values.nbytes
32000
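As a complement to df.values.nbytes, here is a sketch that uses the memory_usage() method to size both a DataFrame and the temporaries created by a masking expression. The DataFrame df_small is a stand-in with the same shape as df in the text (1000 rows, four float columns); its values are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_small = pd.DataFrame(rng.rand(1000, 4), columns=['A', 'B', 'C', 'D'])

# Bytes used by the column data, excluding the index
total = int(df_small.memory_usage(index=False).sum())

# The three temporaries behind df[(df.A < 0.5) & (df.B < 0.5)]
tmp1 = df_small.A < 0.5
tmp2 = df_small.B < 0.5
tmp3 = tmp1 & tmp2
temp_bytes = sum(int(t.memory_usage(index=False)) for t in (tmp1, tmp2, tmp3))

print(total)       # 32000
print(temp_bytes)  # 3000
```

Here the temporaries are small boolean Series, but expressions that produce float-valued intermediates allocate a full-sized copy at every step.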

In terms of performance, eval() can be faster than the conventional approach even when you are
not exhausting system memory. The bottleneck then becomes how the temporary DataFrames compare
in size with the CPU's L1 and L2 caches: if they are much larger than the cache, eval() avoids
some slow movement of values between the different memory caches. In practice, I have found
that the difference in computation time between the conventional approach and eval/query is
not always significant, and the conventional approach is actually faster on smaller arrays.
The main advantage of eval/query is the memory saved, along with a syntax that is sometimes
more concise.

Origin blog.csdn.net/weixin_41503009/article/details/104185522