8 cool operations for filtering data in pandas

The most common task in day-to-day data analysis with Python is query filtering: selecting the data we want by various conditions, dimensions, and combinations of them, so that it is easier to analyze and mine.

Brother Dong has summarized 8 commonly used operations for everyday querying and filtering, for your reference. They are introduced with examples using the boston data from sklearn.

from sklearn import datasets
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

1. []

The first is the fastest and most convenient: write the filter condition, or a combination of conditions, directly inside the dataframe's []. For example, below we select all rows where NOX is greater than the mean of that variable, then sort them by NOX in descending order.

df[df['NOX']>df['NOX'].mean()].sort_values(by='NOX',ascending=False).head()

Of course, you can also combine conditions, joining them with logical operators such as & and |. In the following example, an AND condition that CHAS equals 1 is added to the condition above. Note that each condition joined by a logical operator must be wrapped in ().

df[(df['NOX']>df['NOX'].mean())& (df['CHAS'] ==1)].sort_values(by='NOX',ascending=False).head()
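
As a self-contained sketch of the same pattern, the snippet below uses a small made-up frame named toy (the values are illustrative assumptions, not taken from the boston data). It also shows why the parentheses matter: & binds more tightly than > and ==, so without them the expression would be parsed differently.

```python
import pandas as pd

# Hypothetical toy data standing in for the NOX and CHAS columns.
toy = pd.DataFrame({
    'NOX': [0.40, 0.50, 0.70, 0.45, 0.65],
    'CHAS': [0, 1, 1, 0, 0],
})

# Rows with NOX above the column mean (0.54), largest first.
above_mean = toy[toy['NOX'] > toy['NOX'].mean()].sort_values(by='NOX', ascending=False)

# Each comparison is wrapped in () before being joined with &.
both = toy[(toy['NOX'] > toy['NOX'].mean()) & (toy['CHAS'] == 1)]

print(above_mean.index.tolist())
print(both.index.tolist())
```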

2. loc/iloc

Besides [], loc and iloc are probably the two most commonly used query methods. loc accesses data by label (column name and row index value), while iloc accesses it by integer position; both support single-value access and slicing. In addition to filtering rows by condition as [] does, loc can also specify which columns to return, so it filters along both the row and column dimensions.

In the following example, rows are selected by a condition, a specific column is selected, and a value is then assigned to that selection.

df.loc[(df['NOX']>df['NOX'].mean()),['CHAS']] = 2
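
The difference between label-based and position-based access can be sketched on a small made-up frame (toy and its string index are assumptions for illustration only). Note that a loc slice includes both endpoints, while an iloc slice excludes the stop position.

```python
import pandas as pd

# Hypothetical toy data with a string index so labels and positions differ.
toy = pd.DataFrame(
    {'NOX': [0.40, 0.50, 0.70], 'CHAS': [0, 1, 0]},
    index=['a', 'b', 'c'],
)

# loc: label-based; the slice 'a':'b' includes both 'a' and 'b'.
by_label = toy.loc['a':'b', 'NOX']

# iloc: position-based; the slice 0:2 covers positions 0 and 1 only.
by_position = toy.iloc[0:2, 0]

# loc with a Boolean row condition and a column list, followed by assignment.
toy.loc[toy['NOX'] > toy['NOX'].mean(), ['CHAS']] = 2

print(toy)
```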

3. isin

The filter conditions above (<, >, ==, !=) all compare against a range, but we often need to match specific values, and that is when isin is needed. For example, suppose we want to restrict NOX to the values 0.538, 0.713, and 0.437.

df.loc[df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)

Of course, you can also invert the selection by adding ~ in front of the filter condition.

df.loc[~df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
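
A minimal sketch of isin and its negation on made-up values (the frame toy and the list wanted are illustrative names). The examples above use sample(5), which returns different rows on each run; the sketch avoids it so the result is deterministic.

```python
import pandas as pd

# Hypothetical toy data; 0.600 is the only value not in the wanted list.
toy = pd.DataFrame({'NOX': [0.538, 0.713, 0.437, 0.600, 0.538]})
wanted = [0.538, 0.713, 0.437]

# Rows whose NOX is one of the wanted values.
matched = toy.loc[toy['NOX'].isin(wanted), :]

# ~ flips the Boolean mask, keeping every other row.
unmatched = toy.loc[~toy['NOX'].isin(wanted), :]

print(matched.index.tolist())
print(unmatched.index.tolist())
```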

4. str.contains

The examples above all filter by numeric comparison. Besides numeric values, strings of course need to be queried too. In pandas, .str.contains() can be used for this, a bit like LIKE in SQL statements.

The following uses the titanic data (held in a dataframe named train) as an example to select rows whose passenger name contains Mrs or Lily. The OR operator | is written inside the quotation marks.

train.loc[train['Name'].str.contains('Mrs|Lily'),:].head()

.str.contains() also accepts parameters that control the matching logic:

  • case=True: whether matching is case sensitive
  • na=True: treat NaN as the Boolean value True
  • flags=re.IGNORECASE: flags passed through to the re module, such as re.IGNORECASE
  • regex=True: if True, the pattern is treated as a regular expression; otherwise it is treated as a literal string
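
The effect of these parameters can be sketched on a few made-up names (the Series names is an assumption, not the titanic data). The sketch passes na=False rather than na=True so that missing names count as non-matches, which is usually what a row filter wants.

```python
import re
import pandas as pd

# Hypothetical names, including one missing value.
names = pd.Series(['Mrs. Smith', 'Mr. Lily', 'MRS. JONES', None])

# Default: case-sensitive regex, so 'MRS. JONES' does not match.
default = names.str.contains('Mrs|Lily', na=False)

# Two ways to ignore case: case=False, or a flag passed to the re module.
ignore_case = names.str.contains('mrs|lily', case=False, na=False)
via_flags = names.str.contains('mrs|lily', flags=re.IGNORECASE, na=False)

# regex=False: the pattern is the literal text 'Mrs|Lily', so nothing matches.
literal = names.str.contains('Mrs|Lily', regex=False, na=False)

print(default.tolist())
print(ignore_case.tolist())
print(literal.tolist())
```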

5. where/mask

In SQL, we know that where filters for the rows that meet a condition. pandas also has where for filtering, but the usage is slightly different.

where accepts a condition of Boolean type. Wherever the condition is not met, the value is replaced with the default NaN or with another specified value. For example, when filtering on Sex being male, cond is a Boolean Series, and all non-male values are replaced with the default NaN.

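cond = train['Sex'] == 'male'
# Assign back rather than using inplace=True on train['Sex'],
# which does not update train under Copy-on-Write.
train['Sex'] = train['Sex'].where(cond)
train.head()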

You can also use the other parameter to substitute a specified value.

cond = train['Sex'] == 'male'
train['Sex'] = train['Sex'].where(cond, other='FEMALE')

You can even write combined conditions.

train['quality'] = ''
cond1 = train['Sex'] == 'male'
cond2 = train['Age'] > 25

# Rows that do not satisfy both conditions receive the label.
train['quality'] = train['quality'].where(cond1 & cond2, other='low-quality male')

mask and where are a pair of operations: mask does exactly the opposite of where, replacing the values where the condition is True.

# Rows that satisfy both conditions receive the label.
train['quality'] = train['quality'].mask(cond1 & cond2, other='low-quality male')
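
To make the relationship between where and mask concrete, here is a small sketch on made-up data (toy and the replacement value 'other' are illustrative assumptions). where keeps values where the condition holds and replaces the rest; mask replaces values where the condition holds and keeps the rest.

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex and Age columns.
toy = pd.DataFrame({'Sex': ['male', 'female', 'male'], 'Age': [30, 40, 20]})
cond = (toy['Sex'] == 'male') & (toy['Age'] > 25)  # True only for row 0

# where: rows failing cond become NaN by default.
kept = toy['Sex'].where(cond)

# where with other: rows failing cond get the replacement value.
replaced = toy['Sex'].where(cond, other='other')

# mask: rows satisfying cond get the replacement value instead.
masked = toy['Sex'].mask(cond, other='other')

print(replaced.tolist())
print(masked.tolist())
```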

6. query

This is a very elegant way to filter data: the whole filter expression is written inside the quotes ''.

# usual way
train[train.Age > 25]
# query way
train.query('Age > 25')

The two approaches above give the same result. For a more complex example, combine the str.contains usage from above with another condition. Note that when '' is already used inside the expression, the expression as a whole needs to be wrapped in "".

train.query("Name.str.contains('William') & Age > 25")

query also lets you reference Python variables by prefixing them with @.

name = 'William'
train.query("Name.str.contains(@name)")
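
The equivalence between [] and query, and the use of @ variables, can be sketched on a made-up frame (toy, min_age, and name are illustrative). Passing engine='python' is a conservative assumption here so that the string method call does not depend on the optional numexpr backend; the examples above rely on the default engine.

```python
import pandas as pd

# Hypothetical toy data standing in for the Name and Age columns.
toy = pd.DataFrame({
    'Name': ['William A', 'Anna B', 'William C'],
    'Age': [30, 40, 20],
})

min_age = 25
name = 'William'

# The same filter written with [] and with query.
via_brackets = toy[toy['Age'] > min_age]
via_query = toy.query('Age > @min_age')

# A string condition combined with a numeric one, both using @ variables.
combined = toy.query('Name.str.contains(@name) and Age > @min_age', engine='python')

print(via_query.index.tolist())
print(combined.index.tolist())
```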

7. filter

filter is another distinctive filtering feature. Rather than filtering on the data itself, filter selects rows or columns by their labels. It supports three filtering methods, plus an axis parameter:

  • items: exact column names
  • regex: regular expression
  • like: fuzzy (substring) match
  • axis: controls whether the row index or the column names are queried

An example is given below.

train.filter(items=['Age', 'Sex'])

train.filter(regex='S', axis=1) # column names containing S

train.filter(like='2', axis=0) # index labels containing 2

train.filter(regex='^2', axis=0).filter(like='S', axis=1)
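
Here is a runnable sketch of the same calls on a small made-up frame (toy and its index values 2, 12, 30 are assumptions chosen to show the difference between like and regex). Index labels are matched as strings, so like='2' matches both 2 and 12, while regex='^2' matches only labels that start with 2.

```python
import pandas as pd

# Hypothetical toy data with an integer index chosen to exercise the matching.
toy = pd.DataFrame(
    {'Age': [22, 38, 26], 'Sex': ['male', 'female', 'female'], 'SibSp': [1, 1, 0]},
    index=[2, 12, 30],
)

by_items = toy.filter(items=['Age', 'Sex'])     # exact column names
by_regex = toy.filter(regex='S', axis=1)        # column names containing S
by_like = toy.filter(like='2', axis=0)          # index labels containing 2
starts_with_2 = toy.filter(regex='^2', axis=0)  # index labels starting with 2

print(by_regex.columns.tolist())
print(by_like.index.tolist())
print(starts_with_2.index.tolist())
```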

8. any/all

The any method returns True if at least one value is True, while all returns True only if every value is True, as shown below.

>>> train['Cabin'].all()
False
>>> train['Cabin'].any()
True

any and all generally need to be used together with other operations, such as checking whether each column contains null values.

train.isnull().any(axis=0)

Another example is counting the number of rows that contain null values.

>>> train.isnull().any(axis=1).sum()
708
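
The null checks above can be sketched on a small made-up frame (toy and its values are illustrative assumptions, not the titanic data). With axis=0 the reduction runs down each column; with axis=1 it runs across each row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is complete, rows 0 and 2 each have a gap.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan],
    'Fare': [7.25, 71.28, 7.92],
})

nulls = toy.isnull()

# Per column: does it contain at least one null?
cols_with_nulls = nulls.any(axis=0)

# Per column: is every value null?
cols_all_null = nulls.all(axis=0)

# Number of rows containing at least one null.
rows_with_nulls = nulls.any(axis=1).sum()

print(cols_with_nulls)
print(rows_with_nulls)
```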

Origin blog.csdn.net/mmmmm44444/article/details/132825711