Copyright notice: please do not repost or share without the author's permission. https://blog.csdn.net/Nerver_77/article/details/85053021
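3. Working with Data in Pandas
- Simple arithmetic on Series and DataFrame
- Series: values are combined by matching index labels
- DataFrame
- Simple arithmetic: values are combined by matching index and column labels
- Built-in methods: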
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df3 = DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])
--------------------------------------------------------------
   c1  c2   c3
A   1   2  3.0
B   4   5  NaN
C   7   8  9.0
- sum(): sums each column by default (axis=0) and skips NaN; pass axis=1 to sum each row
df3.sum()
--------
c1    12.0
c2    15.0
c3    12.0
dtype: float64
==============
df3.sum(axis=1)
---------------
A     6.0
B     9.0
C    24.0
dtype: float64
- min(), max(): minimum and maximum of each column
df3.max()
---------
c1    7.0
c2    8.0
c3    9.0
dtype: float64
- describe(): summary statistics (count, mean, standard deviation, min/max, quartiles)
df3.describe()
-------------
        c1   c2        c3
count  3.0  3.0  2.000000
mean   4.0  5.0  6.000000
std    3.0  3.0  4.242641
min    1.0  2.0  3.000000
25%    2.5  3.5  4.500000
50%    4.0  5.0  6.000000
75%    5.5  6.5  7.500000
max    7.0  8.0  9.000000
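The index-aligned arithmetic mentioned at the top of this section is described but not demonstrated. Here is a minimal sketch of it; the data and the names `s_a`, `s_b`, `d_a`, `d_b` are made up for illustration. Labels that exist on only one side produce NaN, and `add()` with `fill_value` is one way to avoid that.

```python
import numpy as np
import pandas as pd

# Two Series whose indexes only partly overlap (made-up data).
s_a = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s_b = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

# Addition aligns on index labels; a label missing on either side gives NaN.
s_sum = s_a + s_b

# add() accepts fill_value, which treats the missing side as 0 instead.
s_filled = s_a.add(s_b, fill_value=0)

# DataFrames align on both the index and the columns.
d_a = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['c1', 'c2'])
d_b = pd.DataFrame([[10, 20], [30, 40]], index=['B', 'C'], columns=['c2', 'c3'])
d_sum = d_a + d_b

print(s_sum)
print(s_filled)
print(d_sum)
```

In `d_sum` only the cell at row B, column c2 exists in both frames, so it is the only cell that is not NaN.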
- Sorting Series and DataFrame: by values and by index
- Series
s1 = Series(np.random.randn(5))
------------------------------
0   -0.453149
1   -0.135939
2    0.637722
3    0.699666
4   -0.421094
dtype: float64
- sort_values()
# Sort by value; ascending=False gives descending order (the default is ascending)
s2 = s1.sort_values(ascending=False)
------------------------------------
3    0.699666
2    0.637722
1   -0.135939
4   -0.421094
0   -0.453149
dtype: float64
- sort_index()
# Sort by index; ascending by default
s2.sort_index()
---------------
0   -0.453149
1   -0.135939
2    0.637722
3    0.699666
4   -0.421094
dtype: float64
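- DataFrame: works the same way; mind the difference between sorting by values and sorting by index, and the sort order (ascending/descending)
# Sort the DataFrame by column 'A' in descending order
df1.sort_values(['A'], ascending=False)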
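The `df1` in the snippet above is not defined in this post, so here is a small self-contained sketch of the same idea. The frame `df` and its contents are hypothetical; only `sort_values` and `sort_index` come from the text.

```python
import pandas as pd

# Hypothetical frame with columns 'A' and 'B' and an unordered index.
df = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]}, index=['z', 'x', 'y'])

# Sort rows by the values in column A, largest first.
by_a_desc = df.sort_values('A', ascending=False)

# Sort rows by index label, ascending (the default).
by_index = df.sort_index()

# axis=1 sorts the column labels instead of the row labels.
by_cols = df.sort_index(axis=1, ascending=False)

print(by_a_desc)
print(by_index)
print(by_cols)
```

All three calls return a new DataFrame and leave `df` untouched.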
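- homework
Task: in a single line of code, read data from a CSV file, build a DataFrame, sort it, and write the result to a new data file.
A one-liner just chains all of these operations together; as a beginner, I do it step by step.
# Read the file (raw string, so the backslashes in the Windows path are not treated as escapes)
f = pd.read_csv(r"J:\csv\movie_metadata.csv")
# Build a DataFrame, keeping only the columns of interest
df = DataFrame(f, columns=['imdb_score', 'director_name', 'movie_title'])
# Sort by imdb_score in descending order
df_ = df.sort_values(['imdb_score'], ascending=False)
# Save to a file
df_.to_csv('imdb.csv', index=False)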
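For comparison, the single-statement version the homework asks for might look like the sketch below. To keep it runnable without movie_metadata.csv, it first writes a tiny stand-in CSV to a temporary directory; the file names, the extra `budget` column and the sample rows are all assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd

# Build a tiny stand-in CSV so the sketch runs without the real data file.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'movies.csv')   # hypothetical input file
dst = os.path.join(tmp_dir, 'imdb.csv')     # hypothetical output file
pd.DataFrame({
    'movie_title': ['m1', 'm2', 'm3'],
    'director_name': ['d1', 'd2', 'd3'],
    'imdb_score': [7.1, 9.0, 8.2],
    'budget': [1, 2, 3],
}).to_csv(src, index=False)

cols = ['imdb_score', 'director_name', 'movie_title']

# Read, select columns, sort and write, chained into one statement.
(pd.read_csv(src)[cols]
   .sort_values('imdb_score', ascending=False)
   .to_csv(dst, index=False))

result = pd.read_csv(dst)
print(result)
```

Selecting with `[cols]` after reading keeps the columns in the order listed, which matches the step-by-step version.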
- Renaming a DataFrame's index
df1 = DataFrame(np.arange(9).reshape(3, 3), index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])
------------------------------------------------------------------
   BJ  SH  GZ
A   0   1   2
B   3   4   5
C   6   7   8
- Direct assignment
df1.index = Series(['a', 'b', 'c'])
-----------------------------------
   BJ  SH  GZ
a   0   1   2
b   3   4   5
c   6   7   8
- Using map
# map builds a new index from the old one
df1.index = df1.index.map(str.upper)
------------------------------------
   BJ  SH  GZ
A   0   1   2
B   3   4   5
C   6   7   8
- Using rename
# rename with a built-in function as the mapper
df1.rename(index=str.lower, columns=str.lower)
----------------------------------------------
   bj  sh  gz
a   0   1   2
b   3   4   5
c   6   7   8
==========================
# rename with a dict as the mapper
df1.rename(index={'A':'aaa'}, columns={'BJ': 'beijing'})
--------------------------------------------------------
     beijing  SH  GZ
aaa        0   1   2
B          3   4   5
C          6   7   8
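One detail the rename output above implies: `rename` returns a new DataFrame rather than changing `df1`, which is why the second call still sees the uppercase labels. A minimal sketch, using a fresh frame named `df` (a name chosen here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])

# A function mapper for the index and a dict mapper for the columns.
renamed = df.rename(index=str.lower, columns={'BJ': 'beijing'})

print(list(df.index), list(df.columns))            # original labels are unchanged
print(list(renamed.index), list(renamed.columns))

# Dict keys that match no label ('Q' here) are ignored by default.
partial = df.rename(index={'A': 'aaa', 'Q': 'qqq'})
print(list(partial.index))
```

To keep the renamed labels, assign the result back, for example `df = df.rename(...)`.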
- Map recap: [1, 2, 3, 4] -> ['1', '2', '3', '4']
- for loop
list1 = [1, 2, 3, 4]
list2 = []
for x in list1:
    list2.append(str(x))
- List comprehension
[str(x) for x in list1]
- The built-in map() function
list(map(str, list1))
- Custom map function
# An example mapper: append an underscore to each label
def map_(x):
    return x + '_'

# Set the new index
df1.index = df1.index.map(map_)
-------------------------------
    BJ  SH  GZ
A_   0   1   2
B_   3   4   5
C_   6   7   8
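As a small variation on the custom mapper, the sketch below uses a lambda instead of a named function and applies `map` to the columns as well, since the column labels are also an Index. The frame `df` is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])

# A lambda does the same job as the named map_ function.
df.index = df.index.map(lambda label: label + '_')

# The column labels are an Index too, so map() works on them as well.
df.columns = df.columns.map(str.lower)

print(df)
```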
- The DataFrame merge operation
df1 = DataFrame({'key': ['X','Y','Z'], 'data_set_1':[1,2,3]})
-------------------------------------------------------------
  key  data_set_1
0   X           1
1   Y           2
2   Z           3
===========================
df2 = DataFrame({'key': ['X','B','C'], 'data_set_2':[4,5,6]})
-------------------------------------------------------------
  key  data_set_2
0   X           4
1   B           5
2   C           6
- on: the column to join on; it must exist in both the left and the right DataFrame
pd.merge(df1,df2,on='key')
--------------------------
  key  data_set_1  data_set_2
0   X           1           4
- how: the join type; the default is 'inner'
- how='left': left join; every key from df1 is kept, and keys with no match in df2 get NaN
pd.merge(df1, df2, on='key',how='left')
---------------------------------------
  key  data_set_1  data_set_2
0   X           1         4.0
1   Y           2         NaN
2   Z           3         NaN
- how='right': right join; every key from df2 is kept, and keys with no match in df1 get NaN
pd.merge(df1, df2, on='key',how='right')
----------------------------------------
  key  data_set_1  data_set_2
0   X         1.0           4
1   B         NaN           5
2   C         NaN           6
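- how='outer': full outer join; keys from both sides are kept and missing values are filled with NaN (the row order below is what the author's pandas version printed; newer versions may sort the keys)
pd.merge(df1, df2, on='key',how='outer')
----------------------------------------
  key  data_set_1  data_set_2
0   X         1.0         4.0
1   Y         2.0         NaN
2   Z         3.0         NaN
3   B         NaN         5.0
4   C         NaN         6.0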
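- how='inner': the default; only keys present in both DataFrames are kept (this is the result shown for on='key' above)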
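To compare the four join types side by side, the following sketch reuses the df1 and df2 from this section and collects the keys that survive each `how` value. The keys are sorted before printing so the result does not depend on the row order a given pandas version produces; the name `keys_by_how` is just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Y', 'Z'], 'data_set_1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['X', 'B', 'C'], 'data_set_2': [4, 5, 6]})

# Collect the keys that survive each join type.
keys_by_how = {
    how: sorted(pd.merge(df1, df2, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

Only 'X' appears in both frames, so the inner join keeps a single row, while the outer join keeps all five distinct keys.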
- Concatenate and Combine
- Concatenate: joining objects together
- Array
arr1 = np.arange(9).reshape(3, 3)
arr2 = np.arange(9).reshape(3, 3)
# Concatenate; the default axis=0 appends rows (stacks vertically)
np.concatenate([arr1, arr2])
----------------------------
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
============================
# axis=1 appends columns (joins side by side)
np.concatenate([arr1, arr2], axis=1)
------------------------------------
array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5],
       [6, 7, 8, 6, 7, 8]])
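Which axis does what is easy to mix up, so a quick check of the result shapes can help: axis=0 grows the number of rows and axis=1 grows the number of columns. The sketch below also compares against `np.vstack` and `np.hstack`, which are not used in the original post but do the same thing for 2-D arrays.

```python
import numpy as np

arr1 = np.arange(9).reshape(3, 3)
arr2 = np.arange(9).reshape(3, 3)

rows_stacked = np.concatenate([arr1, arr2])          # axis=0: 6 rows, 3 columns
cols_joined = np.concatenate([arr1, arr2], axis=1)   # axis=1: 3 rows, 6 columns

print(rows_stacked.shape)
print(cols_joined.shape)

# vstack / hstack are shorthand for the same two operations on 2-D arrays.
assert np.array_equal(rows_stacked, np.vstack([arr1, arr2]))
assert np.array_equal(cols_joined, np.hstack([arr1, arr2]))
```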
- Series: note that Series are joined with pd.concat()
s1 = Series([1, 2, 3], index=['X', 'Y', 'Z'])
s2 = Series([4, 5], index=['A', 'B'])
# Concatenate end to end (default axis=0)
pd.concat([s1, s2])
-------------------
X    1
Y    2
Z    3
A    4
B    5
dtype: int64
# axis=1 places each Series in its own column, aligned on the index, so the result is a DataFrame; sort=True sorts the combined index
pd.concat([s1, s2], axis=1, sort=True)
--------------------------------------
     0    1
A  NaN  4.0
B  NaN  5.0
X  1.0  NaN
Y  2.0  NaN
Z  3.0  NaN
- DataFrame: DataFrames are also joined with pd.concat(), the same as Series
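The post gives no DataFrame example for concat, so here is a minimal sketch under assumed data; `df_a` and `df_b` are hypothetical frames that share one column and one index label. It also shows `ignore_index`, a `pd.concat` parameter the post does not mention.

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['A', 'B'])
df_b = pd.DataFrame({'y': [5, 6], 'z': [7, 8]}, index=['B', 'C'])

# axis=0 (default): rows are appended, columns are unioned, gaps become NaN.
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and renumbers the rows from 0.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# axis=1: columns are placed side by side and rows are aligned on the index.
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```

Note that `stacked` keeps the duplicate index label 'B'; concat does not deduplicate labels.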
- Combine: filling gaps with combine_first()
# Prepare two Series whose missing values interleave
s1 = Series([1, np.nan, 3, np.nan], index=['A','B','C','D'])
s2 = Series([np.nan, 2, np.nan, 4], index=['A','B','C','D'])
# Fill the missing values of s1 with values from s2
s1.combine_first(s2)
--------------------
A    1.0
B    2.0
C    3.0
D    4.0
dtype: float64
===================
df1 = DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [1, np.nan, 3, np.nan],
    'C': [1, np.nan, 3, np.nan]
})
--------------------------------
     A    B    C
0  1.0  1.0  1.0
1  NaN  NaN  NaN
2  3.0  3.0  3.0
3  NaN  NaN  NaN
===================
df2 = DataFrame({
    'A': [np.nan, 2, np.nan, 4],
    'Y': [np.nan, 2, np.nan, 4]
})
-------------------------------
     A    Y
0  NaN  NaN
1  2.0  2.0
2  NaN  NaN
3  4.0  4.0
========================
# df1 and df2 share column A, so df2's A fills the NaN in df1's A; columns that exist in only one frame are kept as they are
df1.combine_first(df2)
----------------------
     A    B    C    Y
0  1.0  1.0  1.0  NaN
1  2.0  NaN  NaN  2.0
2  3.0  3.0  3.0  NaN
3  4.0  NaN  NaN  4.0
=======================
df2.combine_first(df1)
-----------------------
     A    B    C    Y
0  1.0  1.0  1.0  NaN
1  2.0  NaN  NaN  2.0
2  3.0  3.0  3.0  NaN
3  4.0  NaN  NaN  4.0
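In the examples above both objects share the same index, so it is not obvious what happens when the labels differ. The sketch below uses made-up Series with partly overlapping indexes and contrasts `combine_first` with `fillna`, which is not covered in the post: `combine_first` returns the union of the labels, while `fillna` only patches labels the caller already has.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan, 3], index=['A', 'B', 'C'])
s2 = pd.Series([20, 30, 40], index=['B', 'C', 'D'])

# Values of s1 win where present; s2 fills the hole at B and contributes label D.
combined = s1.combine_first(s2)

# fillna with a Series only patches labels that s1 already has, so D is not added.
patched = s1.fillna(s2)

print(combined)
print(patched)
```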