The most commonly used method in daily Python data analysis is query filtering: selecting the data we want based on various conditions, dimensions, and combinations, to facilitate analysis and mining.
Brother Dong has summarized the commonly used filtering operations for daily querying and screening for your reference. This article introduces them with examples using the boston dataset from sklearn.
from sklearn import datasets
import pandas as pd
boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2; this requires an older version
df = pd.DataFrame(boston.data, columns=boston.feature_names)
1. []
Bracket indexing is the fastest and most convenient method: write the filtering condition, or a combination of conditions, directly inside the dataframe's brackets. For example, below we filter out all rows where NOX is greater than the mean of that variable, then sort by NOX in descending order.
df[df['NOX']>df['NOX'].mean()].sort_values(by='NOX',ascending=False).head()
Of course, you can also use combined conditions, joining them with the logical operators & and |. For example, the following adds an AND condition, CHAS == 1, to the condition above. Note that each condition joined by a logical operator must be wrapped in parentheses ().
df[(df['NOX'] > df['NOX'].mean()) & (df['CHAS'] == 1)].sort_values(by='NOX', ascending=False).head()
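The pattern above can be sketched with a small hand-made DataFrame (the toy values below are illustrative, not the real boston data):

```python
import pandas as pd

# A tiny illustrative DataFrame standing in for the boston data
toy = pd.DataFrame({'NOX': [0.4, 0.5, 0.7, 0.9],
                    'CHAS': [0, 1, 1, 0]})

# Single condition: rows where NOX exceeds its mean (0.625 here)
above = toy[toy['NOX'] > toy['NOX'].mean()]

# Combined condition: each sub-condition wrapped in parentheses, joined with &
both = toy[(toy['NOX'] > toy['NOX'].mean()) & (toy['CHAS'] == 1)]
```

Forgetting the parentheses raises an error because & binds more tightly than the comparison operators.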
2. loc/iloc
In addition, loc/iloc are probably the two most commonly used query methods. loc accesses by label (column name and row index value), while iloc accesses by integer position; both support single-value access and slice queries. Besides filtering rows by condition, loc can also specify which columns to return, filtering along both the row and column dimensions.
For example, the following filters rows by condition, selects the specified column, and assigns a value to it.
df.loc[(df['NOX']>df['NOX'].mean()),['CHAS']] = 2
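A minimal self-contained sketch of the label-based vs. position-based distinction (toy data, not the boston set):

```python
import pandas as pd

df = pd.DataFrame({'NOX': [0.4, 0.5, 0.7], 'CHAS': [0, 1, 0]})

# loc filters rows by a Boolean condition and selects columns by name,
# so a value can be assigned to just that row/column intersection
df.loc[df['NOX'] > df['NOX'].mean(), ['CHAS']] = 2

# iloc uses integer positions instead of labels: first two rows, first column
first_two_nox = df.iloc[:2, 0]
```

Only the row satisfying the condition has its CHAS value overwritten; the rest of the frame is untouched.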
3. isin
The filtering conditions above (<, >, ==, !=) all compare against a range, but often we need to lock onto certain specific values, and that is where isin comes in. For example, suppose we want to restrict NOX to the values 0.538, 0.713, and 0.437.
df.loc[df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
Of course, you can also invert the selection by adding a ~ in front of the filter condition.
df.loc[~df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
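Both directions can be sketched on a toy frame (the values below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'NOX': [0.538, 0.713, 0.4, 0.437]})

# isin keeps only rows whose NOX is one of the listed values
kept = df.loc[df['NOX'].isin([0.538, 0.713, 0.437]), :]

# ~ negates the Boolean mask, keeping everything else
dropped = df.loc[~df['NOX'].isin([0.538, 0.713, 0.437]), :]
```

Every row lands in exactly one of the two results, since ~ is a pure complement.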
4. str.contains
The examples above all filter by comparing numeric values. Besides numbers, there are of course also query requirements for strings. In pandas, .str.contains() can be used for this; it is a bit like LIKE in SQL statements.
The following uses the titanic data as an example to filter out rows whose Name contains Mrs or Lily; the | logical symbol goes inside the quotation marks.
train.loc[train['Name'].str.contains('Mrs|Lily'),:].head()
.str.contains() can also be configured with regular-expression filtering logic:
- case=True: specifies case sensitivity
- na=True: converts NaN values into the Boolean value True
- flags=re.IGNORECASE: flags passed through to the re module, such as re.IGNORECASE
- regex=True: if True, the pattern is treated as a regular expression; otherwise as a literal string
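A small sketch of the pattern and the na/case parameters on a hand-made name Series (illustrative data, not the titanic file):

```python
import pandas as pd

names = pd.Series(['Mrs. Smith', 'Lily Jones', 'Mr. Brown', None])

# '|' inside the pattern means OR; na=False makes the missing name count as no-match
hit = names.str.contains('Mrs|Lily', na=False)

# case=False makes the match case-insensitive
hit_ci = names.str.contains('mrs', case=False, na=False)
```

Without the na= argument, the missing entry would propagate as NaN in the Boolean result and break plain Boolean indexing.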
5. where/mask
In SQL, we know that where filters the rows that satisfy a condition. pandas also has where, but its usage is slightly different.
The condition accepted by where needs to be Boolean: values that do not satisfy the condition are replaced with NaN by default, or with another specified value. For example, if the filter condition cond is Sex == 'male' (a Boolean Series), all non-male values will be replaced with the default NaN null value.
cond = train['Sex'] == 'male'
train['Sex'].where(cond, inplace=True)
train.head()
You can also use other to assign a specified replacement value.
cond = train['Sex'] == 'male'
train['Sex'].where(cond, other='FEMALE', inplace=True)
You can even write combined conditions.
train['quality'] = ''
cond1 = train['Sex'] == 'male'
cond2 = train['Age'] > 25
train['quality'].where(cond1 & cond2, other='低质量男性', inplace=True)
mask and where are a pair of operations: mask is exactly the opposite of where, replacing the values where the condition IS met.
train['quality'].mask(cond1 & cond2, other='低质量男性', inplace=True)
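The mirror-image relationship can be sketched on a toy Series (note this sketch returns new Series rather than using inplace=True, which avoids chained-assignment pitfalls):

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'male'])
cond = sex == 'male'

# where keeps values satisfying the condition and replaces the rest
kept = sex.where(cond, other='FEMALE')

# mask is the opposite: it replaces values WHERE the condition holds
masked = sex.mask(cond, other='X')
```

Every position is replaced by exactly one of the two calls, never both.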
6. query
This is a very elegant way to filter data: the entire filtering expression is written inside a quoted string.
# common way
train[train.Age > 25]
# query way
train.query('Age > 25')
The two methods above have the same effect. For a more complex example, combine query with the str.contains usage above. Note that when the expression contains single quotes '', the outside sometimes needs to be wrapped in double quotes "".
train.query("Name.str.contains('William') & Age > 25")
query can also reference Python variables with the @ prefix.
name = 'William'
train.query("Name.str.contains(@name)")
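Both query forms can be sketched self-contained (toy data; engine='python' is passed explicitly because, when the numexpr backend is installed, the default engine does not support string methods like .str.contains):

```python
import pandas as pd

train = pd.DataFrame({'Name': ['William Carter', 'Anna Lee'],
                      'Age': [30, 22]})

# Plain comparison written as a string expression
adults = train.query('Age > 25')

# Referencing an outer Python variable with @
name = 'William'
hits = train.query("Name.str.contains(@name)", engine='python')
```

The @ prefix pulls the variable from the surrounding scope into the query string, so the pattern need not be hard-coded.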
7. filter
filter is another unique filtering feature. Instead of filtering data by value, filter selects specific rows or columns by their labels. It supports the following parameters:
- items: exact column (or index) names
- regex: a regular expression on the labels
- like: fuzzy (substring) matching on the labels
- axis: controls whether the row index or the columns are filtered
Examples are given below.
train.filter(items=['Age', 'Sex'])
train.filter(regex='S', axis=1) # columns whose names contain S
train.filter(like='2', axis=0) # index labels containing 2
train.filter(regex='^2', axis=0).filter(like='S', axis=1)
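A runnable sketch of the three modes on a toy frame with a string index (illustrative labels):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['m', 'f'], 'Age': [30, 22], 'Survived': [1, 0]},
                  index=['row1', 'row2'])

cols = df.filter(items=['Age', 'Sex'])     # exact column names, in the given order
s_cols = df.filter(regex='S', axis=1)      # columns whose name matches the regex
rows = df.filter(like='2', axis=0)         # index labels containing the substring '2'
```

Note that filter only ever looks at the labels, never at the cell values, which is what distinguishes it from the value-based methods earlier in this article.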
8. any/all
The any method returns True if at least one value is True, while all requires every value to be True for the result to be True. For example:
>>> train['Cabin'].all()
False
>>> train['Cabin'].any()
True
any and all generally need to be used in conjunction with other operations, for example to check which columns contain null values.
train.isnull().any(axis=0)
Another example is checking the number of rows containing null values.
>>> train.isnull().any(axis=1).sum()
708
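Both uses can be sketched on a small frame with deliberately missing values (toy data, not the titanic file):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Cabin': ['C85', None, 'E46'],
                   'Age': [22.0, 38.0, np.nan]})

# axis=0: for each column, is there at least one null value?
null_cols = df.isnull().any(axis=0)

# axis=1: for each row, is there at least one null value? Summing counts such rows.
n_null_rows = df.isnull().any(axis=1).sum()
```

The axis argument decides whether any collapses down each column or across each row.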