Hands-on: Wolf Warrior 2 (战狼2) Box Office Data Analysis, Part 3: Reading and Analyzing the Data

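Preface

In the previous parts we scraped the data from the website and saved it to a CSV file. In this part we read the data back from the CSV file and run some analysis on it.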

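Reading the data

pandas provides the read_csv function for reading data. By default it expects the fields in a CSV file to be separated by commas.

For example, if we open the CSV file from the previous part in Notepad, we can see that its fields are separated by commas.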

Let's read the file directly in the IPython console and display the first three rows.

In [190]: df = pd.read_csv('data/out.csv')

In [191]: df[:3]
Out[191]: 
   Unnamed: 0   name       box  boxRatio  playRatio  attendance
0  2017-08-01    战狼2  29249.43      86.3       56.4        42.4
1  2017-08-01   建军大业   3248.29       9.6       22.3        19.3
2  2017-08-01  神偷奶爸3    464.67       1.4        4.7        13.7

read_csv takes many parameters. For example, the sep parameter overrides the default separator:

df = pd.read_csv('data/out.csv', sep=';')

df[:3]
Out[198]: 
  ,name,box,boxRatio,playRatio,attendance
0  2017-08-01,战狼2,29249.43,86.3,56.4,42.4
1   2017-08-01,建军大业,3248.29,9.6,22.3,19.3
2    2017-08-01,神偷奶爸3,464.67,1.4,4.7,13.7

Now every line is read into a single column, because the data contains no semicolons to split on.
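To make the effect of sep easy to reproduce without the file on disk, here is a minimal sketch that feeds the same three rows to read_csv through io.StringIO. The CSV_TEXT constant is just the raw content shown above, inlined for the example; it is not part of the original scripts.

```python
import io
import pandas as pd

# The first lines of out.csv as shown above, inlined so the example is self-contained.
CSV_TEXT = """,name,box,boxRatio,playRatio,attendance
2017-08-01,战狼2,29249.43,86.3,56.4,42.4
2017-08-01,建军大业,3248.29,9.6,22.3,19.3
2017-08-01,神偷奶爸3,464.67,1.4,4.7,13.7
"""

# Default separator (comma): six columns, the unnamed first one becomes "Unnamed: 0".
comma_df = pd.read_csv(io.StringIO(CSV_TEXT))

# Wrong separator: nothing to split on, so each line lands in one column.
semi_df = pd.read_csv(io.StringIO(CSV_TEXT), sep=';')

print(comma_df.shape)  # (3, 6)
print(semi_df.shape)   # (3, 1)
```

Checking df.shape right after loading is a cheap way to notice a wrong separator early.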

The index_col parameter chooses which column becomes the DataFrame index. The following example uses the name column as the index:

df = pd.read_csv('data/out.csv', index_col='name')

df[:3]
Out[202]: 
       Unnamed: 0       box  boxRatio  playRatio  attendance
name                                                        
战狼2    2017-08-01  29249.43      86.3       56.4        42.4
建军大业   2017-08-01   3248.29       9.6       22.3        19.3
神偷奶爸3  2017-08-01    464.67       1.4        4.7        13.7

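The encoding parameter sets the file encoding. The default is utf8. Our file is UTF-8, so reading it as latin1 garbles the Chinese titles:

df = pd.read_csv('data/out.csv', encoding='latin1')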

df[:3]
Out[213]: 
   Unnamed: 0           name       box  boxRatio  playRatio  attendance
0  2017-08-01        æˆ˜ç‹¼2  29249.43      86.3       56.4        42.4
1  2017-08-01   å»ºå†›å¤§ä¸š   3248.29       9.6       22.3        19.3
2  2017-08-01  ç¥žå·å¥¶çˆ¸3    464.67       1.4        4.7        13.7
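The garbling above can be reproduced in memory. The sketch below is only an illustration: it encodes one row from the file as UTF-8 bytes, reads those bytes with both encodings, and then shows that this particular kind of mojibake is reversible. The variable names are made up for the example.

```python
import io
import pandas as pd

# One header line and one row from out.csv, encoded the way the file is stored.
raw = "name,box\n战狼2,29249.43\n".encode("utf-8")

good = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
bad = pd.read_csv(io.BytesIO(raw), encoding="latin1")

print(good["name"][0])  # the correct title
print(bad["name"][0])   # mojibake, like the output above

# latin1 maps every byte to one character, so the damage can be undone:
# turn the characters back into bytes, then decode them as UTF-8.
repaired = bad["name"][0].encode("latin1").decode("utf-8")
print(repaired == good["name"][0])  # True
```

In practice it is simpler to pass the right encoding to read_csv in the first place; the round trip is mainly useful for confirming which encoding the file really uses.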

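Now let's read the data with the date column as the index. We never gave the date column a name, so pandas labels it "Unnamed: 0", as the earlier output shows. It is the first column, so we refer to it by position 0. The dates are already in year-month-day order, so no extra date-order options are needed.

In [222]: df = pd.read_csv('data/out.csv', index_col=0, parse_dates=[0])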

In [223]: df[:3]
Out[223]: 
             name       box  boxRatio  playRatio  attendance
2017-08-01    战狼2  29249.43      86.3       56.4        42.4
2017-08-01   建军大业   3248.29       9.6       22.3        19.3
2017-08-01  神偷奶爸3    464.67       1.4        4.7        13.7
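A quick way to confirm that the dates were really parsed (rather than kept as strings) is to check the type of the index. This is a minimal sketch on the three rows shown above, inlined as CSV_TEXT for the example; it uses parse_dates=True, which asks read_csv to parse the index column.

```python
import io
import pandas as pd

# The first lines of out.csv as shown earlier, inlined so the example is self-contained.
CSV_TEXT = """,name,box,boxRatio,playRatio,attendance
2017-08-01,战狼2,29249.43,86.3,56.4,42.4
2017-08-01,建军大业,3248.29,9.6,22.3,19.3
2017-08-01,神偷奶爸3,464.67,1.4,4.7,13.7
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, parse_dates=True)

print(type(df.index).__name__)  # DatetimeIndex

# With a datetime index, rows can be filtered by comparing against a Timestamp.
names = df[df.index == pd.Timestamp("2017-08-01")]["name"].tolist()
print(names)
```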

Selecting columns

The data is read into a DataFrame, and a column can be retrieved the same way you would look up a key in a dictionary:

In [224]: df['box'][:3]
Out[224]: 
2017-08-01    29249.43
2017-08-01     3248.29
2017-08-01      464.67
Name: box, dtype: float64

Plotting a column

Just call .plot() on the selected column. So easy.

In [225]: df['box'].plot()
Out[225]: <matplotlib.axes._subplots.AxesSubplot at 0x10a2b5b0>

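The chart shows that the box office peaked on 2017-08-05 and 2017-08-06. My guess was that those days fell on a weekend, and a look at the calendar confirms it: they were a Saturday and a Sunday.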
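Instead of reaching for a calendar, the weekday can also be checked with pandas itself. A small sketch, using only the two peak dates mentioned above:

```python
import pandas as pd

peak_days = pd.to_datetime(["2017-08-05", "2017-08-06"])

# day_name() returns English weekday names by default.
print(peak_days.day_name().tolist())  # ['Saturday', 'Sunday']

# dayofweek counts from Monday=0, so 5 and 6 are the weekend.
print(peak_days.dayofweek.tolist())   # [5, 6]
```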

Plotting all the columns is just as easy. Let's also make the figure a bit larger.

In [226]: df.plot(figsize=(15, 10))
Out[226]: <matplotlib.axes._subplots.AxesSubplot at 0x10f6f110>

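Because the box values are so much larger than everything else, the other columns are squashed flat and are effectively invisible.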
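One possible way around the scale problem (not what this post does) is to give each column its own axis with subplots=True. The sketch below uses a few made-up numbers purely for illustration, and selects the Agg backend so that it also runs without a display; in an interactive session you would leave that line out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up numbers for illustration only (not real box office data).
toy = pd.DataFrame(
    {
        "box": [29000.0, 31000.0, 27000.0],
        "boxRatio": [86.0, 85.0, 84.0],
        "playRatio": [56.0, 57.0, 58.0],
        "attendance": [42.0, 44.0, 40.0],
    },
    index=pd.to_datetime(["2017-08-01", "2017-08-02", "2017-08-03"]),
)

# subplots=True draws one panel per column, each with its own y-axis scale.
axes = toy.plot(subplots=True, figsize=(15, 10))
print(len(axes))  # 4
```

Another option is the secondary_y argument of plot, which puts the selected columns on a second y-axis in the same panel.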

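Data analysis

Our dataset is still too small, so we rerun the scraper to fetch 50 days of data. (If you fetch too much at once, or too frequently, Maoyan will ask you to enter a CAPTCHA.)

The date helper from the earlier part took a start date and counted forward a given number of days, which is inconvenient here. We change it to count backward instead, so that it yields the 50 days ending today.

It's simple: just change the days argument of timedelta to -1.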

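from datetime import timedelta

def buildDates(start, days):
    # Step back one day at a time: start, start - 1 day, start - 2 days, ...
    day = timedelta(days=-1)
    for i in range(days):
        yield start + day*i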
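To see exactly which window this generator covers, here is a small standalone run. It repeats the function so the snippet is self-contained, and it assumes start is a datetime.date (the earlier part of the series may build it differently).

```python
from datetime import date, timedelta

def buildDates(start, days):
    # Same logic as above: walk backward from start, one day per step.
    day = timedelta(days=-1)
    for i in range(days):
        yield start + day * i

dates = list(buildDates(date(2017, 8, 11), 50))

print(dates[0])    # 2017-08-11
print(dates[-1])   # 2017-06-23
print(len(dates))  # 50
```

So 50 days counted back from August 11, 2017 reach June 23, 2017, with the start date itself included.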

Fetching 50 days of data

if __name__ == "__main__":
    # getData and writeToCSV are the helpers written in the earlier parts
    df = getData(2017, 8, 11, 50)
    writeToCSV(df, 'data/out.csv')
    df = pd.read_csv('data/out.csv')

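Displaying Chinese text in plots

By default, Chinese text in a matplotlib plot does not display correctly. matplotlib itself can handle Unicode text; the problem is that its default fonts do not contain Chinese glyphs.

The fix is to edit the matplotlibrc configuration file in the matplotlib installation directory: uncomment the font.family line, and add a Chinese font such as SimHei to the font lists in font.serif and font.sans-serif. By default the matplotlibrc file is located in C:\Python27\Lib\site-packages\matplotlib\mpl-data: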

font.family         : sans-serif
font.style          : normal
font.variant        : normal
font.weight         : medium
font.stretch        : normal
# note that font.size controls default text sizes.  To configure
# special text sizes tick labels, axes, labels, title, etc, see the rc
# settings for axes and ticks. Special text sizes can be defined
# relative to font.size, using the following values: xx-small, x-small,
# small, medium, large, x-large, xx-large, larger, or smaller
#font.size           : 12.0
font.serif          : SimHei, Bitstream Vera Serif, New Century Schoolbook, Century Schoolbook L, Utopia, ITC Bookman, Bookman, Nimbus Roman No9 L, Times New Roman, Times, Palatino, Charter, serif
font.sans-serif     : SimHei, Bitstream Vera Sans, Lucida Grande, Verdana, Geneva, Lucid, Arial, Helvetica, Avant Garde, sans-serif
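If you would rather not edit the installed matplotlibrc, the same settings can be applied per session through matplotlib's rcParams. This is an alternative sketch, not the approach used in this post, and it assumes the SimHei font is installed on the machine.

```python
import matplotlib.pyplot as plt

# Assumption: SimHei is installed; swap in any other Chinese font you have.
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.sans-serif"] = ["SimHei"]

# Draw the minus sign as an ASCII hyphen, since many Chinese fonts
# have no glyph for the Unicode minus character.
plt.rcParams["axes.unicode_minus"] = False
```

These lines need to run before the plots are created, and they only affect the current Python session.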

Wolf Warrior 2's box office trend since release

In [23]: df = pd.read_csv('data/out.csv', index_col=0)

In [24]: df[df.name == '战狼2']['box'].sort_index().plot()
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x111ffb30>

Box office share trend

In [25]: df[df.name == '战狼2']['boxRatio'].sort_index().plot()
Out[25]: <matplotlib.axes._subplots.AxesSubplot at 0x10ca3770>

Wolf Warrior 2's box office to date

In [51]: sum(df[df.name == '战狼2']['box'])
Out[51]: 405248.93000000005
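The same total can be computed with the Series.sum() method, and the unit conversion is worth spelling out. The sketch below runs on the three sample rows shown earlier (inlined as CSV_TEXT for the example), so its sum covers a single day only. It assumes, as the figures suggest, that box is recorded in units of 10,000 yuan.

```python
import io
import pandas as pd

# The first lines of out.csv as shown earlier, inlined so the example is self-contained.
CSV_TEXT = """,name,box,boxRatio,playRatio,attendance
2017-08-01,战狼2,29249.43,86.3,56.4,42.4
2017-08-01,建军大业,3248.29,9.6,22.3,19.3
2017-08-01,神偷奶爸3,464.67,1.4,4.7,13.7
"""
df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

# Select the film's rows and the box column in one step, then sum.
wolf_box = df.loc[df["name"] == "战狼2", "box"].sum()
print(wolf_box)  # 29249.43 (one day only in this sample)

# Assumption: box is in units of 10,000 yuan, so dividing by 10,000
# gives hundreds of millions of yuan.
total = 405248.93  # the 50-day total from Out[51]
print(round(total / 10000, 2))  # 40.52
```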

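Has it already passed 4 billion yuan? The box column is in units of 10,000 yuan, so 405,248.93 works out to about 4.05 billion yuan. I looked it up, and it has indeed passed 4 billion. Congratulations!

All-time box office ranking

Sum the box office by film title, sort by box office in descending order, and take the top 10. (DataFrame.sort has been removed from pandas, so we use sort_values.)

df2 = df.groupby(['name']).sum().sort_values('box', ascending=False)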

df2[:10]
Out[74]: 
                   box  boxRatio  playRatio  attendance
name                                                   
战狼2          405248.93    1200.2      789.8      1052.0
变形金刚5：最后的骑士  150792.95     865.1      756.3       497.2
神偷奶爸3        102270.50     644.9      527.8       695.2
悟空传           69202.04     514.9      409.2       491.0
三生三世十里桃花      51162.79     130.3      209.2       167.6
建军大业          36883.16     117.5      245.9       634.9
绣春刀II：修罗战场    26366.93     238.9      237.3       634.9
京城81号II       21849.87     154.6      173.6       491.5
逆时营救          20115.38     173.2      190.5       429.0
父子雄兵          12449.89     109.1      132.8      1023.5

The Mermaid (美人鱼) is missing because our data only covers the 50 days up to August 11, 2017. Not enough data :(

In [87]: df2[:10].plot(kind='bar')
Out[87]: <matplotlib.axes._subplots.AxesSubplot at 0x207c5e70>

Attendance rate of the box office top 10

In [88]: df2['attendance'][:10].plot(kind='bar')
Out[88]: <matplotlib.axes._subplots.AxesSubplot at 0x20968d70>

Screening share of the box office top 10

In [89]: df2['playRatio'][:10].plot(kind='bar')
Out[89]: <matplotlib.axes._subplots.AxesSubplot at 0x2094cd70>
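One caveat about the last two charts: df2 was built with sum(), and attendance and playRatio appear to be daily percentages, so their sums grow with the number of days a film was in the data rather than measuring a rate. A possible refinement, sketched here on made-up numbers (the film names "A" and "B" and all values are purely illustrative), is to sum box but average the rate columns:

```python
import pandas as pd

# Made-up numbers for illustration only (not real box office data).
toy = pd.DataFrame(
    {
        "name": ["A", "A", "B", "B"],
        "box": [100.0, 50.0, 30.0, 10.0],
        "attendance": [40.0, 20.0, 10.0, 30.0],
    }
)

# Total the box office, but take the mean of the daily attendance rate.
ranking = (
    toy.groupby("name")
    .agg({"box": "sum", "attendance": "mean"})
    .sort_values("box", ascending=False)
)
print(ranking)
```

With this aggregation, film A ends up with a box of 150.0 and an average attendance of 30.0, which stays comparable across films no matter how many days each one has.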

Finally

Congratulations once again to Wolf Warrior 2 on passing 4 billion yuan at the box office.
