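How box plots work:
A box plot gives a program a concrete rule for identifying outliers: any value greater than the upper bound or less than the lower bound of the box plot is flagged as an outlier. A box plot is shown in the figure below: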
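Upper quartile U: only 1/4 of all samples are greater than U, i.e., U sits at the 25% position when the samples are sorted in descending order.
Lower quartile L: only 1/4 of all samples are less than L, i.e., L sits at the 75% position when the samples are sorted in descending order.
Interquartile range (the difference between the upper and lower quartiles): IQR = U - L
Upper bound: U + 1.5 * IQR
Lower bound: L - 1.5 * IQR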
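Because the bounds are derived from quartiles, which extreme values barely affect, and require no assumption about how the data is distributed, a box plot identifies outliers fairly objectively.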
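The rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the pandas API: the helper names iqr_bounds and find_outliers and the sample data are made up for this example, and the quartiles use pandas' default linear interpolation, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) for the k * IQR rule (hypothetical helper)."""
    s = pd.Series(values, dtype="float64")
    lower_q = s.quantile(0.25)   # L: lower quartile
    upper_q = s.quantile(0.75)   # U: upper quartile
    iqr = upper_q - lower_q      # IQR = U - L
    return lower_q - k * iqr, upper_q + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the bounds (hypothetical helper)."""
    s = pd.Series(values, dtype="float64")
    low, high = iqr_bounds(s, k)
    return s[(s < low) | (s > high)]


# Made-up sample: nine values close together plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 50]
low, high = iqr_bounds(data)
outliers = find_outliers(data)
print(low, high)          # bounds derived from the quartiles
print(outliers.tolist())  # values flagged as outliers
```

For this sample, L = 11 and U = 12.75, so IQR = 1.75, the bounds are 8.375 and 15.375, and only the value 50 is flagged.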
pandas.DataFrame.boxplot
Parameters:
column : str or list of str, optional
Column name or list of names, or vector. Can be any valid input to pandas.DataFrame.groupby().
by : str or array-like, optional
Column in the DataFrame to pandas.DataFrame.groupby(). One box-plot will be done per value of columns in by.
ax : object of class matplotlib.axes.Axes, optional
The matplotlib axes to be used by boxplot.
fontsize : float or str
Tick label font size in points or as a string (e.g., large).
rot : int or float, default 0
The rotation angle of labels (in degrees) with respect to the screen coordinate system.
grid : boolean, default True
Setting this to True will show the grid.
figsize : A tuple (width, height) in inches
The size of the figure to create in matplotlib.
layout : tuple (rows, columns), optional
For example, (3, 5) will display the subplots using 3 rows and 5 columns, starting from the top-left.
return_type : {‘axes’, ‘dict’, ‘both’} or None, default ‘axes’
The kind of object to return. The default is axes.
‘axes’ returns the matplotlib axes the boxplot is drawn on.
‘dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.
‘both’ returns a namedtuple with the axes and dict.
When grouping with by, a Series mapping columns to return_type is returned. If return_type is None, a NumPy array of axes with the same shape as layout is returned.
**kwds
All other plotting keyword arguments to be passed to matplotlib.pyplot.boxplot().
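To make the return_type options concrete, here is a small sketch that calls boxplot three times and inspects what comes back. The data, the planted value 10.0, and the variable names are assumptions for illustration only. The Agg backend is selected only so the sketch runs without a display. With matplotlib's default whisker setting (whis=1.5), the points stored under the 'fliers' key are the values beyond the 1.5 * IQR bounds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 2)), columns=["Col1", "Col2"])
df.loc[0, "Col1"] = 10.0  # planted extreme value (illustrative)

# 'axes' (the default): the matplotlib Axes the box plot is drawn on.
ax = df.boxplot(return_type="axes")
plt.close("all")

# 'dict': a dictionary of the matplotlib Line2D objects that make up the plot.
lines = df.boxplot(return_type="dict")
# lines['fliers'] holds one Line2D per plotted column; its y-data are the
# points drawn as outliers for that column.
col1_fliers = lines["fliers"][0].get_ydata()
print(sorted(lines.keys()))
print(col1_fliers)
plt.close("all")

# 'both': a namedtuple that unpacks into the Axes and the dict.
both_ax, both_lines = df.boxplot(return_type="both")
plt.close("all")
```

Reading the fliers from the returned dict is one way to get the plotted outliers back as numbers instead of reading them off the figure.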
Reference code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# boxplot() can draw a box plot for every column of the DataFrame, or only for the columns named in `column`:
np.random.seed(222)
df = pd.DataFrame(np.random.randn(10, 4), columns=['Col1', 'Col2', 'Col3', 'Col4'])
boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])
plt.show()
plt.close()
# The `by` option creates box plots of each variable's distribution, grouped by the values of a column. For example:
df = pd.DataFrame(np.random.randn(10, 2), columns=['Col1', 'Col2'])
df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
boxplot = df.boxplot(by='X')
plt.show()
plt.close()
# A list of strings (e.g. ['X', 'Y']) can be passed to `by` to group the data by combinations of those variables along the x-axis
df = pd.DataFrame(np.random.randn(10, 3), columns=['Col1', 'Col2', 'Col3'])
df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
df['Y'] = pd.Series(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'])
boxplot = df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])
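# The layout parameter changes how the subplots are arranged
# Additional formatting can also be applied to the box plot,
# such as hiding the grid (grid=False), rotating the x-axis labels (rot=45), and changing the font size (fontsize=15)
boxplot = df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'], layout=(2, 1), grid=False, rot=45, fontsize=15)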
plt.show()
plt.close()
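Grouped box plots draw separate boxes for each group, so each group has its own bounds. As a rough sketch of the same idea without plotting, the 1.5 * IQR bounds can be computed per group with groupby. The data and the is_outlier column name below are invented for illustration.

```python
import pandas as pd

# Made-up data: two groups on very different scales, each with one odd value.
df = pd.DataFrame({
    "X": ["A"] * 6 + ["B"] * 6,
    "Col1": [1.0, 1.2, 0.9, 1.1, 1.0, 5.0,
             10.0, 10.5, 9.8, 10.2, 10.1, 3.0],
})

grouped = df.groupby("X")["Col1"]
# transform broadcasts each group's quartile back onto that group's rows.
lower_q = grouped.transform(lambda s: s.quantile(0.25))
upper_q = grouped.transform(lambda s: s.quantile(0.75))
iqr = upper_q - lower_q

# Flag rows that fall outside their own group's bounds.
df["is_outlier"] = (df["Col1"] < lower_q - 1.5 * iqr) | (df["Col1"] > upper_q + 1.5 * iqr)
print(df[df["is_outlier"]])
```

Here 5.0 is flagged within group A and 3.0 within group B, because each value is judged against the quartiles of its own group.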