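get_dummies
Converts categorical variables into dummy/indicator variables.
pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
data is a DataFrame or Series. If no column is specified for one-hot encoding, get_dummies automatically detects the string-like columns in data (object, string, or category dtype) and one-hot encodes them. prefix sets the prefix used to name the columns produced by the encoding, and dummy_na controls whether NaN is also treated as a category with its own indicator column.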
import pandas as pd

# No column is specified for one-hot encoding, so only the string column 'A' is encoded
df = pd.DataFrame({'A': ['A', 'B', 'C'], 'B': [1, 2, 3],
                   'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['A'])
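The effect of dummy_na, prefix, and prefix_sep is easier to see on a small Series. The following is a minimal sketch; the sample values and the name 'letter' are made up for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a string Series with one missing value.
s = pd.Series(['a', 'b', np.nan, 'a'], name='letter')

# Default: the missing value gets no indicator column, so its row is all zeros.
plain = pd.get_dummies(s)

# dummy_na=True adds one extra column that flags the missing value.
with_na = pd.get_dummies(s, dummy_na=True)

# prefix and prefix_sep control how the generated columns are named.
named = pd.get_dummies(s, prefix='letter', prefix_sep='-')

print(plain.columns.tolist())
print(with_na.shape)
print(named.columns.tolist())
```

With dummy_na=False, a row whose value is missing cannot be told apart from a row that simply belongs to none of the listed categories, which is the main reason to turn the option on when missing values carry information.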
When none of the columns in data is a string column and no column is specified for encoding, get_dummies does no encoding and returns the data as is.
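df = pd.DataFrame({'A': [1, 2, 2], 'B': [1, 2, 3],
                   'C': [1, 2, 3]})
# No prefix is passed here: a list-like prefix must have one entry per encoded
# column, and with all-integer columns nothing is selected for encoding.
pd.get_dummies(df)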
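To one-hot encode the first column of data in this case, select that column explicitly, encode it, and then join the remaining columns back on.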
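df = pd.DataFrame({'A': [1, 2, 2], 'B': [1, 2, 3],
                   'C': [1, 2, 3]})
# Encode column 'A' on its own; the new columns are named after its values (1 and 2)
a_hot = pd.get_dummies(df['A'])
# Attach the untouched columns 'B' and 'C' by index
a_hot.join(df[['B', 'C']])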
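The columns parameter from the signature above offers a shorter route to the same goal. This is a small sketch of that alternative, reusing the same sample frame; it is not the approach the notes themselves take.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [1, 2, 3], 'C': [1, 2, 3]})

# columns= names the columns to encode explicitly, whatever their dtype.
# The remaining columns are kept, and the new columns are prefixed with 'A'.
encoded = pd.get_dummies(df, columns=['A'])

print(encoded.columns.tolist())  # ['B', 'C', 'A_1', 'A_2']
```

Compared with the manual join, this keeps the source column name in the generated column names (A_1, A_2) instead of bare integer labels.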
df.drop
drop(labels, axis=0, level=None, inplace=False, errors='raise')
axis=0 drops rows (matched by index label); axis=1 drops columns.
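A short sketch of the axis, inplace, and errors parameters, using a made-up frame; the label 'missing' is a hypothetical column name that does not exist in the data.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [1, 2, 3], 'C': [1, 2, 3]})

# axis=0: remove the row whose index label is 0.
rows_dropped = df.drop(0, axis=0)

# axis=1: remove column 'B'.
cols_dropped = df.drop(['B'], axis=1)

# errors='ignore' skips labels that are not present instead of raising KeyError.
unchanged = df.drop(['missing'], axis=1, errors='ignore')

# inplace defaults to False, so df itself is not modified by any call above.
print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(df.shape)
```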
df.join
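join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
join is a DataFrame method, so the calling object must be a DataFrame (a Series has to be converted first, or passed as other).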
Parameter | Usage |
---|---|
other | DataFrame, Series with name field set, or list of DataFrame. Its index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that name is used as the column name in the resulting joined DataFrame. |
on | Column name, tuple/list of column names, or array-like. |
how | {'left', 'right', 'outer', 'inner'}, default 'left'. How to handle the indexes of the two objects. left: use the calling frame's index; right: use the input frame's index; outer: form the union of indexes; inner: use the intersection of indexes. |
lsuffix | String. Suffix to use for the left frame's overlapping columns. |
rsuffix | String. Suffix to use for the right frame's overlapping columns. |
sort | Boolean, default False. Order the result DataFrame lexicographically by the join key. If False, the index order of the calling (left) DataFrame is preserved. |
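The parameters in the table can be tried out on two small frames. This is a minimal sketch; the frames, index labels, and the Series name 'flag' are invented for illustration.

```python
import pandas as pd

# Hypothetical frames that share the column name 'val' and part of their index.
left = pd.DataFrame({'val': ['a', 'b', 'c']}, index=['x', 'y', 'z'])
right = pd.DataFrame({'val': [10, 20]}, index=['y', 'w'])

# The overlapping column 'val' requires lsuffix/rsuffix to disambiguate.
left_join = left.join(right, lsuffix='_l', rsuffix='_r')                # index x, y, z
inner_join = left.join(right, how='inner', lsuffix='_l', rsuffix='_r')  # index y only
outer_join = left.join(right, how='outer', lsuffix='_l', rsuffix='_r')  # union of both

# A Series can be passed as other as long as its name is set.
flag = pd.Series([True, False], index=['x', 'y'], name='flag')
with_flag = left.join(flag)

print(left_join)
print(inner_join)
print(outer_join)
print(with_flag)
```

Rows of the calling frame that have no match in other (such as 'z' here) are kept by the default left join and filled with NaN in the joined columns.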