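Say goodbye to overtime: using pandas instead of Excel for data processing (2)

These are my notes from a tutorial.

Tutorial: https://study.163.com/course/courseMain.htm?courseId=1209401897

Thanks to the instructor: 城市数据团大鹏

The notes learn by example. The experimental data sets are the shop data (商铺数据.csv) and california_housing_train.

Data link: search for it on CSDN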

[Figure: screenshot of the data]

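1 Reading data files

2 pandas data structures

3 Processing the comment data with the Series string methods

4 Common pitfalls in data indexing

5 Plotting with pandas, e.g. histograms

6 Processing stock data with pandas, mainly via the tushare package (pip install tushare)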

1 Reading data files

# Read the CSV data
import pandas as pd

csv_path = '/Users/luo/workspace/pycharm/DataAnalysis/商铺数据.csv'
df = pd.read_csv(csv_path)

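How data is read depends on its format. The common readers are:

pd.read_csv(filename): import data from a CSV file

pd.read_table(filename): import data from a delimited text file (tab-separated by default)

pd.read_excel(filename): import data from an Excel file

pd.read_sql(query, connection_object): import data from a SQL table or database

pd.read_json(json_string): import data from a JSON string (recent pandas versions prefer a file path or a file-like object such as io.StringIO(json_string))

pd.read_html(url): parse a URL, string, or HTML file and extract its tables as a list of DataFrames

pd.read_clipboard(): read the contents of the clipboard and pass them to read_csv()

pd.DataFrame(dict): build a DataFrame from a dict; the keys are the column names and the values are the column data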
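Source of the list above: 大熊_7d48, on Jianshu (简书)
Link: https://www.jianshu.com/p/550eb6424fa0
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, cite the source.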
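To try two of these readers without any file on disk, here is a minimal sketch. The CSV text and the shop names in it are made up for illustration; only the column names classify and comment come from the shop data used in these notes.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "classify,name,comment\n美食,店铺A,12 条\n购物,店铺B,7 条\n"

# read_csv accepts a path or any file-like object, so StringIO works here.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dict: keys are column names, values are the columns.
df_from_dict = pd.DataFrame({
    "classify": ["美食", "购物"],
    "name": ["店铺A", "店铺B"],
    "comment": ["12 条", "7 条"],
})

print(df_from_csv.shape)           # (2, 3)
print(list(df_from_csv.columns))   # ['classify', 'name', 'comment']
print(df_from_dict)
```

Both routes should produce the same two-row table; the dict form is handy for small test data, while read_csv is what the real shop data needs.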

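2 pandas data structures

Dimensions	Name	Description
1	Series	1D labeled array of a single data type
2	DataFrame	2D labeled, size-mutable table whose columns may have different data types

Put simply, a DataFrame is two-dimensional data, such as the shop data we just read. A Series is one-dimensional data, and a DataFrame can be split into several Series, one per column.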

# Get the first column as a Series (selected by position)
series = df.iloc[:, 0]

print(series)

The result:

0       美食
1       美食
2       美食
3       美食
4       美食
        ..
1260    购物
1261    购物
1262    购物
1263    购物
1264    购物
Name: classify, Length: 1265, dtype: object

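This shows that a Series carries several pieces of information: the index labels, the values, the name, the length, and the dtype.

3 Processing the comment data with the Series string methods

For more string-handling functions, see the official pandas documentation: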
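The pieces of a Series can also be inspected one by one. The sketch below uses a hypothetical three-row table (the price column is invented) rather than the real shop data.

```python
import pandas as pd

# Hypothetical toy table; only the column name "classify" mirrors the shop data.
df_demo = pd.DataFrame({"classify": ["美食", "美食", "购物"], "price": [35, 48, 120]})

first_col = df_demo.iloc[:, 0]    # first column, selected by position

print(type(first_col).__name__)   # Series
print(first_col.name)             # classify
print(list(first_col.index))      # [0, 1, 2]
print(first_col.tolist())         # ['美食', '美食', '购物']
print(len(first_col))             # 3
```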

String handling

Series.str can be used to access the values of the series as strings and apply several methods to it. These can be accessed like Series.str.<function/property>.

`Series.str.capitalize`(self)	Convert strings in the Series/Index to be capitalized.
`Series.str.casefold`(self)	Convert strings in the Series/Index to be casefolded.
`Series.str.cat`(self[, others, sep, na_rep, join])	Concatenate strings in the Series/Index with given separator.
`Series.str.center`(self, width[, fillchar])	Filling left and right side of strings in the Series/Index with an additional character.
`Series.str.contains`(self, pat[, case, …])	Test if pattern or regex is contained within a string of a Series or Index.
`Series.str.count`(self, pat[, flags])	Count occurrences of pattern in each string of the Series/Index.
`Series.str.decode`(self, encoding[, errors])	Decode character string in the Series/Index using indicated encoding.
`Series.str.encode`(self, encoding[, errors])	Encode character string in the Series/Index using indicated encoding.

These are just a few examples; look up the rest in the documentation when needed.

In this example, the methods used are .str.contains() and .str.split():

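# Clean the comment field of the shop data: keep only rows that have a review count, and keep only the number
# na=False treats missing comments as non-matching; .copy() avoids SettingWithCopyWarning on the assignment below
df1 = df[df['comment'].str.contains('条', na=False)].copy()  # row filtering by condition
df1['comment'] = df1['comment'].str.split(' ').str[0]  # extract the number (the part before the space)
print(df1['comment'])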

Open question: after series.str.split(), why is .str[x] needed to get the string (here .str[0]) instead of plain [0]?
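A likely answer, sketched on made-up comment strings: .str.split() returns a Series in which every element is a list. Plain [0] indexes the Series itself, so it returns the whole list stored in the row labeled 0, whereas .str[0] indexes into each element and returns the first item of every list.

```python
import pandas as pd

# Hypothetical comment values; the real column holds strings like "<number> 条".
comments = pd.Series(["12 条", "暂无评论", "7 条", None], name="comment")

# na=False makes missing values count as "does not contain".
has_count = comments.str.contains("条", na=False)
kept = comments[has_count]            # rows labeled 0 and 2 remain

parts = kept.str.split(" ")           # each element is now a list
print(parts.tolist())                 # [['12', '条'], ['7', '条']]

print(parts[0])                       # ['12', '条']  (the row labeled 0, not "first item of each row")
print(parts.str[0].tolist())          # ['12', '7']   (first item of every row)

counts = parts.str[0].astype(int)     # convert the extracted text to integers
print(counts.tolist())                # [12, 7]
```

In other words, the .str accessor applies element-wise indexing, which is what the cleaning step needs.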

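4 Common pitfalls in data indexing

4.1 Column indexing

# Column indexing
# df[column_name]: select a single column
# df[[column_name1, column_name2]]: select several columns, passed as a list

All the code so far has used column indexing.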
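One pitfall worth seeing in code is that the two forms return different types. This is a small sketch on a hypothetical table:

```python
import pandas as pd

# Hypothetical toy table for illustration.
df_demo = pd.DataFrame({
    "classify": ["美食", "购物"],
    "name": ["店铺A", "店铺B"],
    "price": [35, 120],
})

one_col = df_demo["price"]             # single name -> Series
two_cols = df_demo[["name", "price"]]  # list of names -> DataFrame
one_col_df = df_demo[["price"]]        # one-element list -> still a DataFrame

print(type(one_col).__name__)     # Series
print(type(two_cols).__name__)    # DataFrame
print(type(one_col_df).__name__)  # DataFrame
```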

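4.2 Row indexing

# california_housing_dataframe is the DataFrame loaded in section 5
# Index by row position with .iloc[]: the first 10 rows
df_10 = california_housing_dataframe.iloc[:10]
print(df_10)

# Index by row label with .loc[]: rows labeled 2 through 4
df_24 = california_housing_dataframe.loc[2:4]  # 2 and 4 are row labels (both ends are included); in other data sets the labels may take other forms
print(df_24)

# Index by condition: df[condition]
# df1 = df[df['comment'].str.contains('条')]  # filter rows
# This is filtering by condition; many other conditions work too, such as greater than, less than, or contains

Results of the code above: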

   longitude  latitude  ...  median_income  median_house_value
0    -114.31     34.19  ...         1.4936               66900
1    -114.47     34.40  ...         1.8200               80100
2    -114.56     33.69  ...         1.6509               85700
3    -114.57     33.64  ...         3.1917               73400
4    -114.57     33.57  ...         1.9250               65500
5    -114.58     33.63  ...         3.3438               74000
6    -114.58     33.61  ...         2.6768               82400
7    -114.59     34.83  ...         1.7083               48500
8    -114.59     33.61  ...         2.1782               58400
9    -114.60     34.83  ...         2.1908               48100

[10 rows x 9 columns]
   longitude  latitude  ...  median_income  median_house_value
2    -114.56     33.69  ...         1.6509               85700
3    -114.57     33.64  ...         3.1917               73400
4    -114.57     33.57  ...         1.9250               65500

[3 rows x 9 columns]
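The difference between the two row indexers is easiest to see on a tiny table. The sketch below uses invented values: .iloc slices by position and excludes the end, .loc slices by label and includes it, and after filtering the labels no longer match the positions.

```python
import pandas as pd

# Hypothetical five-row table with the default 0..4 index.
df_demo = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

print(len(df_demo.iloc[2:4]))   # 2  (positions 2 and 3; the end is excluded)
print(len(df_demo.loc[2:4]))    # 3  (labels 2, 3 and 4; the end is included)

# After filtering, the surviving rows keep their original labels.
filtered = df_demo[df_demo["value"] > 20]
print(filtered.index.tolist())      # [2, 3, 4]
print(filtered.iloc[0]["value"])    # 30  (first row by position)
print(filtered.loc[2, "value"])     # 30  (row labeled 2, the same row)
```

On the filtered table, filtered.loc[0] would raise a KeyError because no row is labeled 0 any more, which is a common source of confusion.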

5 Plotting with pandas, e.g. histograms

This section mainly covers DataFrame.describe() and DataFrame.hist(). The data comes from

https://blog.csdn.net/gaishi_hero/article/details/81433595

# Plotting with pandas
import pandas as pd
import matplotlib.pyplot as plt

california_housing_dataframe = pd.read_csv('/Users/luo/workspace/pycharm/DataAnalysis/california_housing_train.csv')
print(california_housing_dataframe)
print(california_housing_dataframe.describe())  # quick summary statistics: mean, max, and so on
# Draw a histogram of one column
california_housing_dataframe.hist('housing_median_age')
plt.show()  # show() is needed to display the figure

Result: [Figure: histogram of housing_median_age]
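To look at the numbers behind these two calls without opening a plot window, here is a rough sketch. The ten age values are invented, and the bin counts are computed with np.histogram as a stand-in for what a histogram draws; bins=5 is an arbitrary choice for this small sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real column has thousands of rows.
ages = pd.DataFrame({"housing_median_age": [5, 12, 15, 18, 21, 25, 30, 35, 41, 52]})

# describe() returns a DataFrame indexed by statistic name.
summary = ages.describe()
print(summary.loc[["count", "mean", "min", "max"]])

# Equal-width bins between the min and the max, as a histogram would use.
counts, edges = np.histogram(ages["housing_median_age"], bins=5)
print(counts.tolist())   # [2, 3, 2, 2, 1]
print(edges.round(1).tolist())
```

Each count is the height of one bar, and consecutive entries of edges are the left and right borders of that bar.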

6 Processing stock data with pandas, mainly via the tushare package (pip install tushare)

# Processing stock data with pandas
import pandas as pd
import tushare as ts
ts_df = ts.get_today_all()  # fetch today's quotes for all stocks
pd.set_option('display.max_columns', None)   # show all columns
pd.set_option('display.max_rows', None)  # show all rows
print(ts_df)

Possible error: ModuleNotFoundError

If it appears, pip install the package named in the message (lxml, for example); install whatever is reported missing.
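Instead of hitting the errors one at a time, the missing packages can be listed up front. This is only a sketch: the names in required are an assumed list for illustration, not tushare's official dependency list.

```python
import importlib.util

# Assumed list of packages to check; adjust it to whatever the error messages name.
required = ["pandas", "lxml", "bs4", "requests"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All listed packages are available.")
```

Note that the import name can differ from the pip name; bs4, for instance, is installed as beautifulsoup4.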

Result:

[Getting data:]############################################################
code name changepercent trade open high low \
0 688399 硕世生物 0.89 94.94 93.48 96.38 91.69
1 688398 赛特新材 -1.32 83.88 76.00 84.01 71.01
2 688389 普门科技 1.55 25.58 24.31 26.18 24.19
3 688388 嘉元科技 1.99 71.87 69.14 73.00 68.74
4 688369 致远互联 3.86 83.40 80.36 85.86 79.68
5 688368 晶丰明源 8.68 116.30 106.99 119.00 104.01
6 688366 昊海生科 2.41 96.39 94.60 97.00 93.19
7 688363 华熙生物 1.23 82.10 81.10 82.99 80.88
8 688358 祥生医疗 2.47 59.00 57.45 59.94 57.45
9 688357 建龙微纳 3.05 57.50 55.21 57.89 55.21
10 688333 铂力特 2.97 58.66 55.99 59.06 55.02

And so on.

Stock processing has many more details; since I do not study stocks, I will skip them for now.
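The pd.set_option calls used before printing change the display settings for the rest of the session. If the full output is wanted for a single print only, pd.option_context can scope the change. The sketch below uses a hypothetical wide table in place of the tushare result.

```python
import pandas as pd

# Hypothetical wide table standing in for the tushare DataFrame.
wide = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

before = pd.get_option("display.max_columns")

# Inside the block every row and column is shown; the old settings return afterwards.
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

after = pd.get_option("display.max_columns")
print(inside, before == after)
```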

Author: z智慧
