Teaching My Girlfriend Data Analysis: The pandas Data Analysis Library

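Copyright notice: This is an original article by the blogger; wherever other work is referenced, a link is given in the text. Reposting requires the blogger's permission. https://blog.csdn.net/striver6/article/details/87597263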


Study notes

The pandas data analysis library

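pandas is a tool built on top of NumPy, created for data analysis tasks. It brings together a number of libraries and standard data models, provides the tools needed to work efficiently with large datasets, and offers many functions and methods for processing data quickly and conveniently.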

1. Reading data

Attachment: data file link (https://pan.baidu.com/s/1hhnuiYJvUAMoHd4gwXz2Vw)

import pandas
food_info = pandas.read_csv("G:\\food_info.csv")
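
The path above is specific to the author's machine. As a minimal sketch that runs anywhere, the same read_csv call can be pointed at an in-memory buffer. The three rows below are copied from the output shown later in this post and restricted to four columns; the names sample_csv and sample are introduced here only for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for food_info.csv (values copied from the first rows of the post's output).
sample_csv = io.StringIO(
    "NDB_No,Shrt_Desc,Water_(g),Energ_Kcal\n"
    "1001,BUTTER WITH SALT,15.87,717\n"
    "1002,BUTTER WHIPPED WITH SALT,15.87,717\n"
    "1003,BUTTER OIL ANHYDROUS,0.24,876\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)
print(type(sample))
print(sample.shape)
```

On Windows, a raw string such as r"G:\food_info.csv" is equivalent to the escaped form "G:\\food_info.csv" used above.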

Check the type of the loaded object:

print(type(food_info))
<class 'pandas.core.frame.DataFrame'>

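In the previous part we saw that NumPy's core structure is the ndarray; the core structure of pandas is the DataFrame. A DataFrame can be thought of as a matrix-like table, except that each column has a name and can have its own data type.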

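Writing food_info.head without parentheses does not call the method. It prints the bound method object, whose representation contains the whole DataFrame, abbreviated to the first and last rows with an ellipsis in between: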

print(food_info.head)
<bound method NDFrame.head of       NDB_No                                          Shrt_Desc  Water_(g)  \
0       1001                                   BUTTER WITH SALT      15.87   
1       1002                           BUTTER WHIPPED WITH SALT      15.87   
2       1003                               BUTTER OIL ANHYDROUS       0.24   
3       1004                                        CHEESE BLUE      42.41   
4       1005                                       CHEESE BRICK      41.11   
5       1006                                        CHEESE BRIE      48.42   
6       1007                                   CHEESE CAMEMBERT      51.80   
7       1008                                     CHEESE CARAWAY      39.28   
8       1009                                     CHEESE CHEDDAR      37.10   
9       1010                                    CHEESE CHESHIRE      37.65   
10      1011                                       CHEESE COLBY      38.20   
11      1012                CHEESE COTTAGE CRMD LRG OR SML CURD      79.79   
12      1013                        CHEESE COTTAGE CRMD W/FRUIT      79.64   
13      1014   CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD      81.01   
14      1015                   CHEESE COTTAGE LOWFAT 2% MILKFAT      81.24   
15      1016                   CHEESE COTTAGE LOWFAT 1% MILKFAT      82.48   
16      1017                                       CHEESE CREAM      54.44   
17      1018                                        CHEESE EDAM      41.56   
18      1019                                        CHEESE FETA      55.22   
19      1020                                     CHEESE FONTINA      37.92   
20      1021                                     CHEESE GJETOST      13.44   
21      1022                                       CHEESE GOUDA      41.46   
22      1023                                     CHEESE GRUYERE      33.19   
23      1024                                   CHEESE LIMBURGER      48.42   
24      1025                                    CHEESE MONTEREY      41.01   
25      1026                         CHEESE MOZZARELLA WHL MILK      50.01   
26      1027                CHEESE MOZZARELLA WHL MILK LO MOIST      48.38   
27      1028                   CHEESE MOZZARELLA PART SKIM MILK      53.78   
28      1029               CHEESE MOZZARELLA LO MOIST PART-SKIM      45.54   
29      1030                                    CHEESE MUENSTER      41.77   
...      ...                                                ...        ...   
8588   43544         BABYFOOD CRL RICE W/ PEARS & APPL DRY INST       2.00   
8589   43546                     BABYFOOD BANANA NO TAPIOCA STR      76.70   
8590   43550                     BABYFOOD BANANA APPL DSSRT STR      83.10   
8591   43566       SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL)       1.30   
8592   43570  CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD       5.00   
8593   43572                         POPCORN MICROWAVE LOFAT&NA       2.80   
8594   43585                       BABYFOOD FRUIT SUPREME DSSRT      81.60   
8595   43589                               CHEESE SWISS LOW FAT      59.60   
8596   43595             BREAKFAST BAR CORN FLAKE CRUST W/FRUIT      14.50   
8597   43597                            CHEESE MOZZARELLA LO NA      49.90   
8598   43598                           MAYONNAISE DRSNG NO CHOL      21.70   
8599   44005                          OIL CORN PEANUT AND OLIVE       0.00   
8600   44018                   SWEETENERS TABLETOP FRUCTOSE LIQ      23.90   
8601   44048                              CHEESE FOOD IMITATION      55.50   
8602   44055                                CELERY FLAKES DRIED       9.00   
8603   44061           PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX       4.20   
8604   44074                    BABYFOOD GRAPE JUC NO SUGAR CND      84.40   
8605   44110                   JELLIES RED SUGAR HOME PRESERVED      53.00   
8606   44158                         PIE FILLINGS BLUEBERRY CND      54.66   
8607   44203               COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ      28.24   
8608   44258            PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX       6.80   
8609   44259  PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX      10.40   
8610   44260  PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY...       6.84   
8611   48052                                 VITAL WHEAT GLUTEN       8.20   
8612   80200                                      FROG LEGS RAW      81.90   
8613   83110                                    MACKEREL SALTED      43.00   
8614   90240                         SCALLOP (BAY&SEA) CKD STMD      70.25   
8615   90480                                         SYRUP CANE      26.00   
8616   90560                                          SNAIL RAW      79.20   
8617   93600                                   TURTLE GREEN RAW      78.50   

      Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0            717         0.85          81.11     2.11            0.06   
1            717         0.85          81.11     2.11            0.06   
2            876         0.28          99.48     0.00            0.00   
3            353        21.40          28.74     5.11            2.34   
4            371        23.24          29.68     3.18            2.79   
5            334        20.75          27.68     2.70            0.45   
6            300        19.80          24.26     3.68            0.46   
7            376        25.18          29.20     3.28            3.06   
8            406        24.04          33.82     3.71            1.33   
9            387        23.37          30.60     3.60            4.78   
10           394        23.76          32.11     3.36            2.57   
11            98        11.12           4.30     1.41            3.38   
12            97        10.69           3.85     1.20            4.61   
13            72        10.34           0.29     1.71            6.66   
14            81        10.45           2.27     1.27            4.76   
15            72        12.39           1.02     1.39            2.72   
16           342         5.93          34.24     1.32            4.07   
17           357        24.99          27.80     4.22            1.43   
18           264        14.21          21.28     5.20            4.09   
19           389        25.60          31.14     3.79            1.55   
20           466         9.65          29.51     4.75           42.65   
21           356        24.94          27.44     3.94            2.22   
22           413        29.81          32.34     4.30            0.36   
23           327        20.05          27.25     3.79            0.49   
24           373        24.48          30.28     3.55            0.68   
25           300        22.17          22.35     3.28            2.19   
26           318        21.60          24.64     2.91            2.47   
27           254        24.26          15.92     3.27            2.77   
28           301        24.58          19.72     3.80            6.36   
29           368        23.41          30.04     3.66            1.12   
...          ...          ...            ...      ...             ...   
8588         389         6.60           0.90     2.00           88.60   
8589          91         1.00           0.20     0.76           21.34   
8590          68         0.30           0.20     0.29           16.30   
8591         465         8.70          15.20     1.85           73.40   
8592         401         7.12           5.46     1.22           81.19   
8593         429        12.60           9.50     1.71           73.39   
8594          73         0.50           0.20     0.52           17.18   
8595         179        28.40           5.10     3.50            3.40   
8596         377         4.40           7.50     0.80           72.90   
8597         280        27.50          17.10     2.40            3.10   
8598         688         0.00          77.80     0.40            0.30   
8599         884         0.00         100.00     0.00            0.00   
8600         279         0.00           0.00     0.00           76.10   
8601         257         4.08          19.50     4.74           16.18   
8602         319        11.30           2.10    13.90           63.70   
8603         356         5.30           2.40     9.90           78.20   
8604          62         0.00           0.00     0.22           15.38   
8605         179         0.30           0.03     0.08           46.10   
8606         181         0.41           0.20     0.35           44.38   
8607         287         0.08           0.01     0.07           71.60   
8608         365        10.08           3.00     5.70           74.42   
8609         351         1.60           0.10     1.86           86.04   
8610         350         0.81           0.90     6.80           84.66   
8611         370        75.16           1.85     1.00           13.79   
8612          73        16.40           0.30     1.40            0.00   
8613         305        18.50          25.10    13.40            0.00   
8614         111        20.54           0.84     2.97            5.41   
8615         269         0.00           0.00     0.86           73.14   
8616          90        16.10           1.40     1.30            2.00   
8617          89        19.80           0.50     1.20            0.00   

      Fiber_TD_(g)  Sugar_Tot_(g)       ...        Vit_A_IU  Vit_A_RAE  \
0              0.0           0.06       ...          2499.0      684.0   
1              0.0           0.06       ...          2499.0      684.0   
2              0.0           0.00       ...          3069.0      840.0   
3              0.0           0.50       ...           721.0      198.0   
4              0.0           0.51       ...          1080.0      292.0   
5              0.0           0.45       ...           592.0      174.0   
6              0.0           0.46       ...           820.0      241.0   
7              0.0            NaN       ...          1054.0      271.0   
8              0.0           0.28       ...           994.0      263.0   
9              0.0            NaN       ...           985.0      233.0   
10             0.0           0.52       ...           994.0      264.0   
11             0.0           2.67       ...           140.0       37.0   
12             0.2           2.38       ...           146.0       38.0   
13             0.0           1.85       ...             8.0        2.0   
14             0.0           4.00       ...           225.0       68.0   
15             0.0           2.72       ...            41.0       11.0   
16             0.0           3.21       ...          1343.0      366.0   
17             0.0           1.43       ...           825.0      243.0   
18             0.0           4.09       ...           422.0      125.0   
19             0.0           1.55       ...           913.0      261.0   
20             0.0            NaN       ...          1113.0      334.0   
21             0.0           2.22       ...           563.0      165.0   
22             0.0           0.36       ...           948.0      271.0   
23             0.0           0.49       ...          1155.0      340.0   
24             0.0           0.50       ...           769.0      198.0   
25             0.0           1.03       ...           676.0      179.0   
26             0.0           1.01       ...           745.0      197.0   
27             0.0           1.13       ...           481.0      127.0   
28             0.0           2.24       ...           846.0      254.0   
29             0.0           1.12       ...          1012.0      298.0   
...            ...            ...       ...             ...        ...   
8588           2.6           1.35       ...             0.0        0.0   
8589           1.6          11.36       ...             5.0        0.0   
8590           1.0          14.66       ...            30.0        2.0   
8591           5.7           0.53       ...            81.0        4.0   
8592           4.2          19.79       ...          2731.0      806.0   
8593          14.2           0.54       ...           147.0        7.0   
8594           2.0          14.87       ...            50.0        3.0   
8595           0.0           1.33       ...           152.0       40.0   
8596           2.1          35.10       ...          2027.0      608.0   
8597           0.0           1.23       ...           517.0      137.0   
8598           0.0           0.30       ...             0.0        0.0   
8599           0.0           0.00       ...             0.0        0.0   
8600           0.1          76.00       ...             0.0        0.0   
8601           0.0           8.21       ...           900.0       45.0   
8602          27.8          35.90       ...          1962.0       98.0   
8603           6.1           0.70       ...             0.0        0.0   
8604           0.1            NaN       ...             8.0        NaN   
8605           0.8          45.30       ...             3.0        0.0   
8606           2.6          37.75       ...            22.0        1.0   
8607           0.0          24.53       ...            12.0        1.0   
8608          10.1           0.70       ...             0.0        0.0   
8609           0.9           2.90       ...             0.0        0.0   
8610           0.8           0.90       ...             0.0        0.0   
8611           0.6           0.00       ...             0.0        0.0   
8612           0.0           0.00       ...            50.0       15.0   
8613           0.0           0.00       ...           157.0       47.0   
8614           0.0           0.00       ...             5.0        2.0   
8615           0.0          73.20       ...             0.0        0.0   
8616           0.0           0.00       ...           100.0       30.0   
8617           0.0           0.00       ...           100.0       30.0   

      Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  \
0           2.32        1.5      60.0          7.0      51.368       21.021   
1           2.32        1.5      60.0          7.0      50.489       23.426   
2           2.80        1.8      73.0          8.6      61.924       28.732   
3           0.25        0.5      21.0          2.4      18.669        7.778   
4           0.26        0.5      22.0          2.5      18.764        8.598   
5           0.24        0.5      20.0          2.3      17.410        8.013   
6           0.21        0.4      18.0          2.0      15.259        7.023   
7            NaN        NaN       NaN          NaN      18.584        8.275   
8           0.78        0.6      24.0          2.9      19.368        8.428   
9            NaN        NaN       NaN          NaN      19.475        8.671   
10          0.28        0.6      24.0          2.7      20.218        9.280   
11          0.08        0.1       3.0          0.0       1.718        0.778   
12          0.04        0.0       0.0          0.4       2.311        1.036   
13          0.01        0.0       0.0          0.0       0.169        0.079   
14          0.08        0.0       0.0          0.0       1.235        0.516   
15          0.01        0.0       0.0          0.1       0.645        0.291   
16          0.29        0.6      25.0          2.9      19.292        8.620   
17          0.24        0.5      20.0          2.3      17.572        8.125   
18          0.18        0.4      16.0          1.8      14.946        4.623   
19          0.27        0.6      23.0          2.6      19.196        8.687   
20           NaN        NaN       NaN          NaN      19.160        7.879   
21          0.24        0.5      20.0          2.3      17.614        7.747   
22          0.28        0.6      24.0          2.7      18.913       10.043   
23          0.23        0.5      20.0          2.3      16.746        8.606   
24          0.26        0.6      22.0          2.5      19.066        8.751   
25          0.19        0.4      16.0          2.3      13.152        6.573   
26          0.21        0.5      18.0          2.5      15.561        7.027   
27          0.14        0.3      12.0          1.6      10.114        4.510   
28          0.43        0.4      15.0          1.3      11.473        5.104   
29          0.26        0.6      22.0          2.5      19.113        8.711   
...          ...        ...       ...          ...         ...          ...   
8588        0.13        0.0       0.0          0.3       0.185        0.252   
8589        0.25        0.0       0.0          0.5       0.072        0.028   
8590        0.02        0.0       0.0          0.1       0.058        0.018   
8591        3.53        0.0       0.0          0.7       2.837        6.341   
8592        1.22        4.6     183.0          3.0       0.600        2.831   
8593        5.01        0.0       0.0         15.7       1.415        4.085   
8594        0.79        0.0       0.0          5.1       0.030        0.025   
8595        0.07        0.1       4.0          0.5       3.304        1.351   
8596        0.76        0.0       0.0         13.8       1.500        5.000   
8597        0.15        0.3      13.0          1.8      10.867        4.844   
8598       11.79        0.0       0.0         24.7      10.784       18.026   
8599       14.78        0.0       0.0         21.0      14.367       48.033   
8600        0.00        0.0       0.0          0.0       0.000        0.000   
8601        2.15        0.0       0.0         36.7       7.996        3.108   
8602        5.55        0.0       0.0        584.2       0.555        0.405   
8603        0.02        0.0       0.0          0.4       0.984        1.154   
8604         NaN        NaN       NaN          NaN       0.000        0.000   
8605        0.00        0.0       0.0          0.2       0.009        0.001   
8606        0.23        0.0       0.0          3.9       0.000        0.000   
8607        0.02        0.0       0.0          0.0       0.003        0.001   
8608        0.02        0.0       0.0          0.5       1.578        1.150   
8609        0.05        0.0       0.0          1.1       0.018        0.032   
8610        0.08        0.0       0.0          1.7       0.099        0.116   
8611        0.00        0.0       0.0          0.0       0.272        0.156   
8612        1.00        0.2       8.0          0.1       0.076        0.053   
8613        2.38       25.2    1006.0          7.8       7.148        8.320   
8614        0.00        0.0       2.0          0.0       0.218        0.082   
8615        0.00        0.0       0.0          0.0       0.000        0.000   
8616        5.00        0.0       0.0          0.1       0.361        0.259   
8617        0.50        0.0       0.0          0.1       0.127        0.088   

      FA_Poly_(g)  Cholestrl_(mg)  
0           3.043           215.0  
1           3.012           219.0  
2           3.694           256.0  
3           0.800            75.0  
4           0.784            94.0  
5           0.826           100.0  
6           0.724            72.0  
7           0.830            93.0  
8           1.433           102.0  
9           0.870           103.0  
10          0.953            95.0  
11          0.123            17.0  
12          0.124            13.0  
13          0.003             7.0  
14          0.083            12.0  
15          0.031             4.0  
16          1.437           110.0  
17          0.665            89.0  
18          0.591            89.0  
19          1.654           116.0  
20          0.938            94.0  
21          0.657           114.0  
22          1.733           110.0  
23          0.495            90.0  
24          0.899            89.0  
25          0.765            79.0  
26          0.778            89.0  
27          0.472            64.0  
28          0.861            65.0  
29          0.661            96.0  
...           ...             ...  
8588        0.231             0.0  
8589        0.041             0.0  
8590        0.047             0.0  
8591        5.024             0.0  
8592        1.307             0.0  
8593        3.572             0.0  
8594        0.068             0.0  
8595        0.180            35.0  
8596        0.900             0.0  
8597        0.509            54.0  
8598       45.539             0.0  
8599       33.033             0.0  
8600        0.000             0.0  
8601        7.536             6.0  
8602        1.035             0.0  
8603        0.131             0.0  
8604        0.000             0.0  
8605        0.008             0.0  
8606        0.000             0.0  
8607        0.009             0.0  
8608        0.130             0.0  
8609        0.050             0.0  
8610        0.433             0.0  
8611        0.810             0.0  
8612        0.102            50.0  
8613        6.210            95.0  
8614        0.222            41.0  
8615        0.000             0.0  
8616        0.252            50.0  
8617        0.170            50.0  

[8618 rows x 36 columns]>

Calling head() with parentheses returns the first 5 rows by default.

food_info.head()
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 1001 BUTTER WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0 0.06 ... 2499.0 684.0 2.32 1.5 60.0 7.0 51.368 21.021 3.043 215.0
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0 0.06 ... 2499.0 684.0 2.32 1.5 60.0 7.0 50.489 23.426 3.012 219.0
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 3069.0 840.0 2.80 1.8 73.0 8.6 61.924 28.732 3.694 256.0
3 1004 CHEESE BLUE 42.41 353 21.40 28.74 5.11 2.34 0.0 0.50 ... 721.0 198.0 0.25 0.5 21.0 2.4 18.669 7.778 0.800 75.0
4 1005 CHEESE BRICK 41.11 371 23.24 29.68 3.18 2.79 0.0 0.51 ... 1080.0 292.0 0.26 0.5 22.0 2.5 18.764 8.598 0.784 94.0

5 rows × 36 columns
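
The difference between head and head() is easy to miss, so here is a small sketch on a hypothetical 8-row frame (df and its column n are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with 8 rows, used only to illustrate the difference.
df = pd.DataFrame({"n": range(8)})

method = df.head        # no parentheses: a bound method, nothing is called
first_rows = df.head()  # with parentheses: a new DataFrame holding the first 5 rows

print(callable(method))
print(type(first_rows).__name__, len(first_rows))
```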

Check the data type of each column:

print(food_info.dtypes)
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object
  • object - For string values
  • int - For integer values
  • float - For float values
  • datetime - For time values
  • bool - For Boolean values
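
To see these dtype families side by side, the sketch below builds a small frame with one column of each kind. The column names and the dates are hypothetical; only the food names and numbers come from the output above. Note that string columns show up as object in the pandas version used in this post, while newer pandas versions may report a dedicated string dtype instead.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame({
    "name": ["BUTTER WITH SALT", "CHEESE BLUE"],                  # strings
    "kcal": [717, 353],                                           # integers
    "water_g": [15.87, 42.41],                                    # floats
    "checked": [True, False],                                     # booleans (hypothetical)
    "measured_on": pd.to_datetime(["2019-01-01", "2019-01-02"]),  # datetimes (hypothetical)
})

print(df.dtypes)

# The helper functions in pandas.api.types test a column's dtype family.
print(ptypes.is_integer_dtype(df["kcal"]))
print(ptypes.is_datetime64_any_dtype(df["measured_on"]))
```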

Show the first 3 rows:

food_info.head(3)
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 1001 BUTTER WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0 0.06 ... 2499.0 684.0 2.32 1.5 60.0 7.0 51.368 21.021 3.043 215.0
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0 0.06 ... 2499.0 684.0 2.32 1.5 60.0 7.0 50.489 23.426 3.012 219.0
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 3069.0 840.0 2.80 1.8 73.0 8.6 61.924 28.732 3.694 256.0

3 rows × 36 columns

Show the last 4 rows:

food_info.tail(4)
#print(food_info.tail(4))
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
8614 90240 SCALLOP (BAY&SEA) CKD STMD 70.25 111 20.54 0.84 2.97 5.41 0.0 0.0 ... 5.0 2.0 0.0 0.0 2.0 0.0 0.218 0.082 0.222 41.0
8615 90480 SYRUP CANE 26.00 269 0.00 0.00 0.86 73.14 0.0 73.2 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000 0.000 0.000 0.0
8616 90560 SNAIL RAW 79.20 90 16.10 1.40 1.30 2.00 0.0 0.0 ... 100.0 30.0 5.0 0.0 0.0 0.1 0.361 0.259 0.252 50.0
8617 93600 TURTLE GREEN RAW 78.50 89 19.80 0.50 1.20 0.00 0.0 0.0 ... 100.0 30.0 0.5 0.0 0.0 0.1 0.127 0.088 0.170 50.0

4 rows × 36 columns
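
Both head and tail take an optional row count. A quick sketch on a hypothetical 10-row frame shows which index labels each call returns:

```python
import pandas as pd

# Hypothetical frame with index labels 0..9.
df = pd.DataFrame({"n": range(10)})

top3 = df.head(3)   # first three rows
last4 = df.tail(4)  # last four rows

print(top3.index.tolist())   # [0, 1, 2]
print(last4.index.tolist())  # [6, 7, 8, 9]
```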

List the column names:
print(food_info.columns)
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')

Check the dimensions (rows, columns):

print(food_info.shape)
(8618, 36)
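
Since shape is a plain tuple and columns is an Index, both are easy to use in ordinary Python code. A minimal sketch, using a three-row excerpt of the data shown above:

```python
import pandas as pd

df = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Water_(g)": [15.87, 15.87, 0.24],
})

n_rows, n_cols = df.shape             # tuple unpacking: (rows, columns)
column_names = df.columns.tolist()    # Index -> ordinary Python list

print(n_rows, n_cols)
print(column_names)
print("Water_(g)" in df.columns)      # membership test on column names
```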

2. Indexing and computation

View the 7th row (index label 6):

food_info.loc[6]
NDB_No                         1007
Shrt_Desc          CHEESE CAMEMBERT
Water_(g)                      51.8
Energ_Kcal                      300
Protein_(g)                    19.8
Lipid_Tot_(g)                 24.26
Ash_(g)                        3.68
Carbohydrt_(g)                 0.46
Fiber_TD_(g)                      0
Sugar_Tot_(g)                  0.46
Calcium_(mg)                    388
Iron_(mg)                      0.33
Magnesium_(mg)                   20
Phosphorus_(mg)                 347
Potassium_(mg)                  187
Sodium_(mg)                     842
Zinc_(mg)                      2.38
Copper_(mg)                   0.021
Manganese_(mg)                0.038
Selenium_(mcg)                 14.5
Vit_C_(mg)                        0
Thiamin_(mg)                  0.028
Riboflavin_(mg)               0.488
Niacin_(mg)                    0.63
Vit_B6_(mg)                   0.227
Vit_B12_(mcg)                   1.3
Vit_A_IU                        820
Vit_A_RAE                       241
Vit_E_(mg)                     0.21
Vit_D_mcg                       0.4
Vit_D_IU                         18
Vit_K_(mcg)                       2
FA_Sat_(g)                   15.259
FA_Mono_(g)                   7.023
FA_Poly_(g)                   0.724
Cholestrl_(mg)                   72
Name: 6, dtype: object

View the 4th through 7th rows (a loc slice includes both endpoints):
# Returns a DataFrame containing the rows at indexes 3, 4, 5, and 6.
food_info.loc[3:6]
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
3 1004 CHEESE BLUE 42.41 353 21.40 28.74 5.11 2.34 0.0 0.50 ... 721.0 198.0 0.25 0.5 21.0 2.4 18.669 7.778 0.800 75.0
4 1005 CHEESE BRICK 41.11 371 23.24 29.68 3.18 2.79 0.0 0.51 ... 1080.0 292.0 0.26 0.5 22.0 2.5 18.764 8.598 0.784 94.0
5 1006 CHEESE BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0 0.45 ... 592.0 174.0 0.24 0.5 20.0 2.3 17.410 8.013 0.826 100.0
6 1007 CHEESE CAMEMBERT 51.80 300 19.80 24.26 3.68 0.46 0.0 0.46 ... 820.0 241.0 0.21 0.4 18.0 2.0 15.259 7.023 0.724 72.0

4 rows × 36 columns
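
Unlike ordinary Python slicing, loc[3:6] includes the row labelled 6. The sketch below contrasts it with position-based iloc, which follows the usual half-open convention; the 8-row frame is an excerpt of the NDB_No column shown earlier.

```python
import pandas as pd

df = pd.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]})

by_label = df.loc[3:6]      # label-based: both endpoints included
by_position = df.iloc[3:6]  # position-based: stop position excluded

print(by_label.index.tolist())     # [3, 4, 5, 6]
print(by_position.index.tolist())  # [3, 4, 5]
```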

View the 3rd, 6th and 11th rows:
# Returns a DataFrame containing the rows at indexes 2, 5, and 10. Either of the following approaches will work.
# Method 1
two_five_ten = [2,5,10] 
food_info.loc[two_five_ten]
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 3069.0 840.0 2.80 1.8 73.0 8.6 61.924 28.732 3.694 256.0
5 1006 CHEESE BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0 0.45 ... 592.0 174.0 0.24 0.5 20.0 2.3 17.410 8.013 0.826 100.0
10 1011 CHEESE COLBY 38.20 394 23.76 32.11 3.36 2.57 0.0 0.52 ... 994.0 264.0 0.28 0.6 24.0 2.7 20.218 9.280 0.953 95.0

3 rows × 36 columns

# Method 2
food_info.loc[[2,5,10]]
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 3069.0 840.0 2.80 1.8 73.0 8.6 61.924 28.732 3.694 256.0
5 1006 CHEESE BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0 0.45 ... 592.0 174.0 0.24 0.5 20.0 2.3 17.410 8.013 0.826 100.0
10 1011 CHEESE COLBY 38.20 394 23.76 32.11 3.36 2.57 0.0 0.52 ... 994.0 264.0 0.28 0.6 24.0 2.7 20.218 9.280 0.953 95.0

3 rows × 36 columns
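
In the examples above the index labels happen to equal the row positions, because read_csv assigns a default 0, 1, 2, ... index. As a sketch of what changes otherwise, the frame below uses NDB_No as its index (a choice made here for illustration, not something the post does): loc then expects those labels, while iloc still takes positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Shrt_Desc": ["BUTTER WITH SALT", "BUTTER WHIPPED WITH SALT", "BUTTER OIL ANHYDROUS"]},
    index=[1001, 1002, 1003],  # index labels are no longer 0, 1, 2
)

picked = df.loc[[1003, 1001]]  # select by label; rows come back in the requested order
by_pos = df.iloc[[2, 0]]       # the same rows selected by position

print(picked.index.tolist())
print(picked.equals(by_pos))
```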

Select a single column by name:

# Series object representing the "NDB_No" column.
ndb_col = food_info["NDB_No"]
print(ndb_col)
0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
7        1008
8        1009
9        1010
10       1011
11       1012
12       1013
13       1014
14       1015
15       1016
16       1017
17       1018
18       1019
19       1020
20       1021
21       1022
22       1023
23       1024
24       1025
25       1026
26       1027
27       1028
28       1029
29       1030
        ...  
8588    43544
8589    43546
8590    43550
8591    43566
8592    43570
8593    43572
8594    43585
8595    43589
8596    43595
8597    43597
8598    43598
8599    44005
8600    44018
8601    44048
8602    44055
8603    44061
8604    44074
8605    44110
8606    44158
8607    44203
8608    44258
8609    44259
8610    44260
8611    48052
8612    80200
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64
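
The last line of that output summarizes what a Series carries: a name, a length and a dtype. A small sketch on a three-row excerpt shows how to read those attributes directly:

```python
import pandas as pd

df = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Shrt_Desc": ["BUTTER WITH SALT", "BUTTER WHIPPED WITH SALT", "BUTTER OIL ANHYDROUS"],
})

ndb_col = df["NDB_No"]  # a one-dimensional Series that shares the frame's index

print(type(ndb_col).__name__)
print(ndb_col.name, len(ndb_col))
print(ndb_col.index.equals(df.index))
print(ndb_col.tolist())
```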

Select several columns by name:

columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
print(zinc_copper)
# Or, equivalently:
# zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
      Zinc_(mg)  Copper_(mg)
0          0.09        0.000
1          0.05        0.016
2          0.01        0.001
3          2.66        0.040
4          2.60        0.024
5          2.38        0.019
6          2.38        0.021
7          2.94        0.024
8          3.43        0.056
9          2.79        0.042
10         3.07        0.042
11         0.40        0.029
12         0.33        0.040
13         0.47        0.030
14         0.51        0.033
15         0.38        0.028
16         0.51        0.019
17         3.75        0.036
18         2.88        0.032
19         3.50        0.025
20         1.14        0.080
21         3.90        0.036
22         3.90        0.032
23         2.10        0.021
24         3.00        0.032
25         2.92        0.011
26         2.46        0.022
27         2.76        0.025
28         3.61        0.034
29         2.81        0.031
...         ...          ...
8588       3.30        0.377
8589       0.05        0.040
8590       0.05        0.030
8591       1.15        0.116
8592       5.03        0.200
8593       3.83        0.545
8594       0.08        0.035
8595       3.90        0.027
8596       4.10        0.100
8597       3.13        0.027
8598       0.13        0.000
8599       0.02        0.000
8600       0.09        0.037
8601       0.21        0.026
8602       2.77        0.571
8603       0.41        0.838
8604       0.05        0.028
8605       0.03        0.023
8606       0.10        0.112
8607       0.02        0.020
8608       1.49        0.854
8609       0.19        0.040
8610       0.10        0.038
8611       0.85        0.182
8612       1.00        0.250
8613       1.10        0.100
8614       1.55        0.033
8615       0.19        0.020
8616       1.00        0.400
8617       1.00        0.250

[8618 rows x 2 columns]

Select every column whose name ends with "(g)" (that is, measured in grams) and show the first 3 rows:

# Convert the column names to a list
col_names = food_info.columns.tolist()
# print(col_names)
gram_columns = []

for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06   
1      15.87         0.85          81.11     2.11            0.06   
2       0.24         0.28          99.48     0.00            0.00   

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  
0           0.0           0.06      51.368       21.021        3.043  
1           0.0           0.06      50.489       23.426        3.012  
2           0.0           0.00      61.924       28.732        3.694  
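The loop above can be shortened. Here is a minimal sketch on a made-up three-column DataFrame (`toy` and its values are assumptions for illustration, not the real food_info.csv), using a list comprehension and, alternatively, a Boolean mask over the column names:

```python
import pandas as pd

# Hypothetical toy data standing in for food_info.csv (values are made up)
toy = pd.DataFrame({
    "Water_(g)": [15.87, 0.24],
    "Protein_(g)": [0.85, 0.28],
    "Iron_(mg)": [0.02, 0.00],
})

# List comprehension: keep the column names that end with "(g)"
gram_columns = [c for c in toy.columns if c.endswith("(g)")]
gram_df = toy[gram_columns]

# Equivalent selection with a Boolean mask built from the column index
gram_df_mask = toy.loc[:, toy.columns.str.endswith("(g)")]

print(gram_df.head(3))
```

Both forms pick the same columns; the list comprehension is probably the easier one to read when you are just starting out.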

Convert the "Iron_(mg)" column from milligrams to grams by dividing by 1000, then add the result to the original DataFrame as a new column named "Iron_(g)":

iron_grams = food_info["Iron_(mg)"] / 1000  
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams
print(food_info.shape)

# Subtracts 100 from each value in the column and returns a Series object.
#sub_100 = food_info["Iron_(mg)"] - 100

# Multiplies each value in the column by 2 and returns a Series object.
#mult_2 = food_info["Iron_(mg)"]*2
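(8618, 37)
(8618, 37)

(Both shapes are identical here, presumably because the cell had already been run once and "Iron_(g)" already existed; on a first run the second shape gains one column.)

Multiply the "Water_(g)" column by the "Energ_Kcal" column, element by element:
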
#It applies the arithmetic operator to the first value in both columns, the second value in both columns, and so on
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]

Compute a score from the Protein and Lipid_Tot columns:

#Score=2×(Protein_(g))−0.75×(Lipid_Tot_(g))
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat
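To see column arithmetic end to end, here is a small sketch on a made-up DataFrame (`toy` and its numbers are assumptions, not rows from food_info.csv). It adds a derived column, shows the shape growing by one column, and applies the score formula from the comment above:

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    "Protein_(g)": [0.85, 21.40],
    "Lipid_Tot_(g)": [81.11, 28.74],
    "Iron_(mg)": [0.02, 0.31],
})

print(toy.shape)  # (2, 3)

# Arithmetic on a column is applied to every row and returns a Series
toy["Iron_(g)"] = toy["Iron_(mg)"] / 1000
print(toy.shape)  # (2, 4): assigning to a new name appends a column

# Score = 2 x Protein_(g) - 0.75 x Lipid_Tot_(g), computed row by row
toy["Score"] = 2 * toy["Protein_(g)"] - 0.75 * toy["Lipid_Tot_(g)"]
print(toy[["Iron_(g)", "Score"]])
```

Assigning to an existing column name overwrites that column instead of appending, so the shape only changes the first time.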

Normalize the data (here, by dividing each column by its maximum value):

# the "Vit_A_IU" column ranges from 0 to 100000, while the "Fiber_TD_(g)" column ranges from 0 to 79
#For certain calculations, columns like "Vit_A_IU" can have a greater effect on the result, 
#due to the scale of the values
# The largest value in the "Energ_Kcal" column.
max_calories = food_info["Energ_Kcal"].max()
# Divide the values in "Energ_Kcal" by the largest value.
normalized_calories = food_info["Energ_Kcal"] / max_calories
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat

Sort by "Sodium_(mg)" in ascending order:

# By default, sort_values() sorts in ascending order and returns a new DataFrame.
# Passing inplace=True sorts the existing DataFrame instead of returning a new one.
# print(food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])

Pass ascending=False to sort "Sodium_(mg)" in descending order:

#Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])
276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
1243    24000.0
1244    23875.0
292     17000.0
1254    11588.0
5811    10600.0
8575     9690.0
291      8068.0
1249     8031.0
5812     7893.0
1292     7851.0
293      7203.0
4472     7027.0
4836     6820.0
1261     6580.0
3747     6008.0
1266     5730.0
4835     5586.0
4834     5493.0
1263     5356.0
1553     5203.0
1552     5053.0
1251     4957.0
1257     4843.0
294      4616.0
8613     4450.0
         ...   
8153        NaN
8155        NaN
8156        NaN
8157        NaN
8158        NaN
8159        NaN
8160        NaN
8161        NaN
8163        NaN
8164        NaN
8165        NaN
8167        NaN
8169        NaN
8170        NaN
8172        NaN
8173        NaN
8174        NaN
8175        NaN
8176        NaN
8177        NaN
8178        NaN
8179        NaN
8180        NaN
8181        NaN
8183        NaN
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
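Notice in the output above that the NaN rows sit at the bottom even though the sort is descending. A minimal sketch on made-up values (`toy` is an assumption, not the real data) shows this default and the `na_position` parameter that changes it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Sodium_(mg)" column
toy = pd.DataFrame({"Sodium_(mg)": [714.0, np.nan, 2.0, 1395.0]})

# Descending sort that returns a new DataFrame; NaN rows go last by default
by_sodium = toy.sort_values("Sodium_(mg)", ascending=False)
print(by_sodium)

# na_position="first" moves the missing values to the top instead
nan_first = toy.sort_values("Sodium_(mg)", ascending=False, na_position="first")
print(nan_first)
```

The sketch leaves out inplace=True on purpose, so `toy` keeps its original order and the sorted copies can be compared side by side.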

3. Data preprocessing example: Titanic data analysis

This is a classic Kaggle competition: https://www.kaggle.com/c/titanic

Attachment: data file link (https://pan.baidu.com/s/1VeNYxEuXo7Fy-rhDRGzi1A)

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("G:\\train.csv")
#titanic_survival.shape
titanic_survival.head()
   PassengerId  Survived  Pclass                                               Name  \
0            1         0       3                            Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3                             Heikkinen, Miss. Laina
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4            5         0       3                           Allen, Mr. William Henry

      Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1  female  38.0      1      0          PC 17599  71.2833    C85         C
2  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3  female  35.0      1      0            113803  53.1000   C123         S
4    male  35.0      0      0            373450   8.0500    NaN         S

Column descriptions:

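  • PassengerId: a unique number for each passenger; this dataset contains 891 passengers
  • Survived: takes only two values, 0 and 1; 0 means the passenger died, 1 means the passenger survived
  • Pclass: passenger class; there are three classes
  • Name: passenger name
  • Sex: passenger sex
  • Age: passenger age
  • SibSp: the number of siblings and spouses the passenger had aboard; families tended to be large back then and usually travelled together
  • Parch: short for Parents and children, the number of each aboard; it is 0 for most passengers, so apparently people did not bring parents and kids along on a fun trip
  • Ticket: ticket number
  • Fare: ticket price
  • Cabin: cabin number; missing values are shown as NaN, and this column has so many of them that we will not use it later
  • Embarked: port of embarkation; there are three ports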

Count the missing values in the Age column:

#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survival["Age"]
#print(age.loc[0:10])
age_is_null = pd.isnull(age)
#print age_is_null
age_null_true = age[age_is_null]
#print age_null_true
age_null_count = len(age_null_true)
print(age_null_count)
177

With missing values present, computing the mean of Age by hand yields nan:

#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)
nan

Filter out the missing Age values, then compute the mean:

#we have to filter out the missing values before we calculate the mean.
good_ages = titanic_survival["Age"][~age_is_null]
#print good_ages
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)
29.6991176471

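Of course, the built-in .mean() method gives the same result, because it skips missing values automatically. In general, though, deleting the samples that have missing values is not a good approach; usually the missing values are replaced with the mean, median, or mode instead.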

# missing data is so common that many pandas methods automatically filter for it
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)
29.69911764705882
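As a sketch of the "replace instead of delete" idea mentioned above, here is `fillna` with the mean and with the median. The `ages` Series is made up for illustration and is not the real Age column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages standing in for titanic_survival["Age"]
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# mean() and median() skip NaN by default
mean_age = ages.mean()
median_age = ages.median()

# fillna() returns a new Series; the original is left unchanged
ages_mean_filled = ages.fillna(mean_age)
ages_median_filled = ages.fillna(median_age)

print(ages_mean_filled.tolist())
print(ages_median_filled.tolist())
```

Which statistic fits best depends on the data; the median is less sensitive to a few very old or very young passengers.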

Compute the mean fare for each passenger class:

#mean fare for each class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}

Is there a simpler way? Of course there is: the pivot_table method. Set the index, values, and aggfunc parameters to compute the mean of Survived for each Pclass.

#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363

As you can see, the survival rate drops as the passenger class gets lower; even survival was tied to money.

Use pivot_table to compute the mean age for each Pclass (when the aggfunc parameter is not specified, it defaults to the mean):

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620

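The mean age is about 38 in first class, 30 in second class, and 25 in third class, which suggests that the wealthier passengers tended to be older and the younger ones poorer.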

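Now suppose we want to relate one variable to two others at once, for example the port of embarkation to both the fare and survival. The approach is the same as above (aggfunc=np.sum computes totals):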

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
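For comparison, the same per-group statistics can also be computed with `groupby`. This is only a sketch on made-up rows (`toy` is an assumption, not train.csv), and it passes the aggregation name as the string "mean" rather than a NumPy function:

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of Survived per class via pivot_table (returns a DataFrame)
survival_pivot = toy.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# The same statistic via groupby (returns a Series)
survival_grouped = toy.groupby("Pclass")["Survived"].mean()

print(survival_pivot)
print(survival_grouped)
```

pivot_table is handy when you want a table laid out by rows and columns; groupby is the more general tool underneath.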

dropna removes missing values, while fillna fills them in:

# axis=1 (or axis='columns') drops every column that contains a null value
drop_na_columns = titanic_survival.dropna(axis=1)
# axis=0 with subset drops every row that has a null value in "Age" or "Sex"
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
#print new_titanic_survival
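A quick sketch of what the two dropna calls do, on a made-up three-row DataFrame (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with some missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

# Drop every column that contains at least one null: only "Sex" survives
no_null_columns = toy.dropna(axis=1)

# Drop only the rows where "Age" is null: the row labelled 1 disappears
rows_with_age = toy.dropna(axis=0, subset=["Age"])

print(no_null_columns)
print(rows_with_age)
```

Both calls return new DataFrames, so `toy` itself is unchanged afterwards.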

Look up the element at a given row label and column:

row_index_83_age = titanic_survival.loc[83, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
28.0
1

Sort by age in descending order, then reset the index:

new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print(new_titanic_survival[0:10])
print('--------------')
# Discard the old index and replace it with a fresh one starting at 0:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:10])
     PassengerId  Survived  Pclass                                  Name  \
630          631         1       1  Barkworth, Mr. Algernon Henry Wilson   
851          852         0       3                   Svensson, Mr. Johan   
493          494         0       1               Artagaveytia, Mr. Ramon   
96            97         0       1             Goldschmidt, Mr. George B   
116          117         0       3                  Connors, Mr. Patrick   
672          673         0       2           Mitchell, Mr. Henry Michael   
745          746         0       1          Crosby, Capt. Edward Gifford   
33            34         0       2                 Wheadon, Mr. Edward H   
54            55         0       1        Ostby, Mr. Engelhart Cornelius   
280          281         0       3                      Duane, Mr. Frank   

      Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked  
630  male  80.0      0      0       27042  30.0000   A23        S  
851  male  74.0      0      0      347060   7.7750   NaN        S  
493  male  71.0      0      0    PC 17609  49.5042   NaN        C  
96   male  71.0      0      0    PC 17754  34.6542    A5        C  
116  male  70.5      0      0      370369   7.7500   NaN        Q  
672  male  70.0      0      0  C.A. 24580  10.5000   NaN        S  
745  male  70.0      1      1   WE/P 5735  71.0000   B22        S  
33   male  66.0      0      0  C.A. 24579  10.5000   NaN        S  
54   male  65.0      0      1      113509  61.9792   B30        C  
280  male  65.0      0      0      336439   7.7500   NaN        Q  
--------------
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male   
1          852         0       3                   Svensson, Mr. Johan  male   
2          494         0       1               Artagaveytia, Mr. Ramon  male   
3           97         0       1             Goldschmidt, Mr. George B  male   
4          117         0       3                  Connors, Mr. Patrick  male   
5          673         0       2           Mitchell, Mr. Henry Michael  male   
6          746         0       1          Crosby, Capt. Edward Gifford  male   
7           34         0       2                 Wheadon, Mr. Edward H  male   
8           55         0       1        Ostby, Mr. Engelhart Cornelius  male   
9          281         0       3                      Duane, Mr. Frank  male   

    Age  SibSp  Parch      Ticket     Fare Cabin Embarked  
0  80.0      0      0       27042  30.0000   A23        S  
1  74.0      0      0      347060   7.7750   NaN        S  
2  71.0      0      0    PC 17609  49.5042   NaN        C  
3  71.0      0      0    PC 17754  34.6542    A5        C  
4  70.5      0      0      370369   7.7500   NaN        Q  
5  70.0      0      0  C.A. 24580  10.5000   NaN        S  
6  70.0      1      1   WE/P 5735  71.0000   B22        S  
7  66.0      0      0  C.A. 24579  10.5000   NaN        S  
8  65.0      0      1      113509  61.9792   B30        C  
9  65.0      0      0      336439   7.7500   NaN        Q  
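The two printouts above differ only in their index, and that matters for lookups. A small sketch on made-up ages (`toy` is an assumption) shows how `loc` (by label) and `iloc` (by position) behave after a sort:

```python
import pandas as pd

# Hypothetical toy ages; the labels are 0, 1, 2
toy = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})
by_age = toy.sort_values("Age", ascending=False)

# iloc is positional: position 0 is now the oldest passenger
oldest_age = by_age.iloc[0]["Age"]

# loc is label-based: label 0 still refers to the original first row
label_zero_age = by_age.loc[0, "Age"]

# After reset_index(drop=True), labels and positions agree again
reindexed = by_age.reset_index(drop=True)

print(oldest_age, label_zero_age, reindexed.loc[0, "Age"])
```

So if you plan to address rows by number after sorting, either use iloc or reset the index first.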

pandas provides many functions, and when they do not cover our needs we can write our own. For example, define a function that returns the 100th row:

# This function returns the hundredth item from a series
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item

# Return the hundredth item from each column
# (stored under a new name so the function hundredth_row is not overwritten)
hundredth_row_values = titanic_survival.apply(hundredth_row)
print(hundredth_row_values)
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object

Define a custom function that returns the number of missing values in each column:

def null_count(column):
    # Boolean mask that is True where the value is missing
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
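The same count is available without a custom function, by chaining `isnull()` and `sum()`. A minimal sketch on made-up rows (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with some missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [np.nan, "C85", "C123"],
    "Sex": ["male", "female", "female"],
})

# isnull() gives a True/False table; sum() counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

This works because True counts as 1 and False as 0 when summed.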

Use apply with a custom function to relabel classes 1, 2, and 3 as First Class, Second Class, and Third Class:

#By passing in the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print(classes)
0       Third Class
1       First Class
2       Third Class
3       First Class
4       Third Class
5       Third Class
6       First Class
7       Third Class
8       Third Class
9      Second Class
10      Third Class
11      First Class
12      Third Class
13      Third Class
14      Third Class
15     Second Class
16      Third Class
17     Second Class
18      Third Class
19      Third Class
20     Second Class
21     Second Class
22      Third Class
23      First Class
24      Third Class
25      Third Class
26      Third Class
27      First Class
28      Third Class
29      Third Class
           ...     
861    Second Class
862     First Class
863     Third Class
864    Second Class
865    Second Class
866    Second Class
867     First Class
868     Third Class
869     Third Class
870     Third Class
871     First Class
872     First Class
873     Third Class
874    Second Class
875     Third Class
876     Third Class
877     Third Class
878     Third Class
879     First Class
880    Second Class
881     Third Class
882     Third Class
883    Second Class
884     Third Class
885     Third Class
886    Second Class
887     First Class
888     Third Class
889     First Class
890     Third Class
Length: 891, dtype: object
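For a plain value-to-label lookup like this one, `Series.map` with a dictionary is a shorter alternative to a row-wise apply. The sketch below uses a made-up `pclass` Series, and it treats any value missing from the dictionary as "Unknown", which is a slightly broader rule than the function above:

```python
import pandas as pd

# Hypothetical toy class values; 4 is deliberately not a valid class
pclass = pd.Series([3, 1, 2, 4])

class_names = {1: "First Class", 2: "Second Class", 3: "Third Class"}

# map() looks each value up in the dictionary; values not found become NaN,
# which fillna() then replaces with "Unknown"
classes = pclass.map(class_names).fillna("Unknown")
print(classes)
```

On a real DataFrame this would be applied to the single column (for example the Pclass column) rather than to whole rows.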

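Next, define a custom function that labels passengers younger than 18 as minor and those aged 18 or older as adult (missing ages are labelled unknown):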

def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
#print minors

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
0        adult
1        adult
2        adult
3        adult
4        adult
5      unknown
6        adult
7        minor
8        adult
9        minor
10       minor
11       adult
12       adult
13       adult
14       minor
15       adult
16       minor
17     unknown
18       adult
19     unknown
20       adult
21       adult
22       minor
23       adult
24       minor
25       adult
26     unknown
27       adult
28     unknown
29     unknown
        ...   
861      adult
862      adult
863    unknown
864      adult
865      adult
866      adult
867      adult
868    unknown
869      minor
870      adult
871      adult
872      adult
873      adult
874      adult
875      minor
876      adult
877      adult
878    unknown
879      adult
880      adult
881      adult
882      adult
883      adult
884      adult
885      adult
886      adult
887      adult
888    unknown
889      adult
890      adult
Length: 891, dtype: object
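The same three-way labelling can be written without a row-wise apply by nesting `np.where`. This is a sketch on a made-up `ages` Series, not the author's method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including a missing value and the boundary age 18
ages = pd.Series([22.0, np.nan, 2.0, 18.0])

# Outer where handles missing ages; inner where splits minor from adult
labels = np.where(ages.isnull(), "unknown",
                  np.where(ages < 18, "minor", "adult"))
age_labels = pd.Series(labels, index=ages.index)
print(age_labels)
```

Checking for missing values first matters: a comparison such as NaN < 18 is False, so without that check a missing age would be labelled adult.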

Use pivot_table to compare the survival rates of adults and minors:

titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
            Survived
age_labels          
adult       0.381032
minor       0.539823
unknown     0.293785

The survival rate is about 0.38 for adults and about 0.54 for minors. This may reflect the saying "women and children first", with the adults paying the price.

4. The Series structure

So far we have worked with the DataFrame structure; a single row or a single column of a DataFrame is called a Series:

  • Series (collection of values)
  • DataFrame (collection of Series objects)
  • Panel (collection of DataFrame objects; removed from recent pandas releases)

A Series object can hold many data types, including

  • float - for representing float values
  • int - for representing integer values
  • bool - for representing Boolean values
  • datetime64[ns] - for representing date & time, without time-zone
  • datetime64[ns, tz] - for representing date & time, with time-zone
  • timedelta64[ns] - for representing differences in dates & times (seconds, minutes, etc.)
  • category - for representing categorical values
  • object - for representing String values

The file "fandango_score_comparison.csv" contains movie ratings: the film name plus a series of scores.

Attachment: data file link (https://pan.baidu.com/s/18kfSAqWTmtFIaijNXXKa0Q)

import pandas as pd
fandango = pd.read_csv('G:\\fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print (series_rt[0:5])
<class 'pandas.core.series.Series'>
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64

Column descriptions:

  • FILM - film name
  • RottenTomatoes - Rotten Tomatoes critics average score
  • RottenTomatoes_User - Rotten Tomatoes user average score
  • RT_norm - Rotten Tomatoes critics average score (normalized to a 0 to 5 point system)
  • RT_user_norm - Rotten Tomatoes user average score (normalized to a 0 to 5 point system)
  • Metacritic - Metacritic critics average score
  • Metacritic_User - Metacritic user average score

Construct a Series:

# Import the Series object from pandas
from pandas import Series

film_names = series_film.values
print(type(film_names))
#print film_names
rt_scores = series_rt.values
#print rt_scores
series_custom = Series(rt_scores , index=film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
<class 'numpy.ndarray'>

Minions (2015)      54
Leviathan (2014)    99
dtype: int64

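As we can see, the values inside a Series are stored as an ndarray, which shows that pandas is built on top of NumPy. That is why many operations carry over between the two.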

When a Series uses strings as its index, elements can be looked up by label, and positional integer slices still work:

# Integer (positional) slicing is also available
series_custom = Series(rt_scores , index=film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
fiveten = series_custom[5:10]
print(fiveten)
The Water Diviner (2015)        63
Irrational Man (2015)           42
Top Five (2014)                 86
Shaun the Sheep Movie (2015)    99
Love & Mercy (2015)             89
dtype: int64
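To keep label-based and position-based access apart, it can help to spell them out with `loc` and `iloc`. A small sketch with made-up film names and scores (all hypothetical, not taken from the Fandango file):

```python
import pandas as pd

# Hypothetical toy Series indexed by film name
scores = pd.Series([74, 85, 80],
                   index=["Film A (2015)", "Film B (2015)", "Film C (2015)"])

# A list of labels returns a Series in the order requested
by_label = scores[["Film C (2015)", "Film A (2015)"]]

# loc with a single label returns a single value
single = scores.loc["Film B (2015)"]

# iloc slices by position, regardless of what the labels are
by_position = scores.iloc[0:2]

print(by_label)
print(single)
print(by_position)
```

Writing iloc explicitly avoids any doubt about whether a number is meant as a label or as a position.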

Sort the index labels with the built-in sorted function, then use reindex to reorder the Series accordingly:

original_index = series_custom.index.tolist()
#print original_index
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
#print sorted_by_index

A Series can also be sorted directly: sort_index sorts by index label, and sort_values sorts by value:

sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
#print(sc2[0:10])
print(sc3[0:10])
Paul Blart: Mall Cop 2 (2015)     5
Hitman: Agent 47 (2015)           7
Hot Pursuit (2015)                8
Fantastic Four (2015)             9
Taken 3 (2015)                    9
The Boy Next Door (2015)         10
The Loft (2015)                  11
Unfinished Business (2015)       11
Mortdecai (2015)                 12
Seventh Son (2015)               12
dtype: int64

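Series can be added together element by element, which produces a new Series. NumPy functions can also be applied to them directly: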

#The values in a Series object are treated as an ndarray, the core data type in NumPy
import numpy as np
# Add each value with each other
print(np.add(series_custom, series_custom))
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value not a Series)
np.max(series_custom)
Avengers: Age of Ultron (2015)                    148
Cinderella (2015)                                 170
Ant-Man (2015)                                    160
Do You Believe? (2015)                             36
Hot Tub Time Machine 2 (2015)                      28
The Water Diviner (2015)                          126
Irrational Man (2015)                              84
Top Five (2014)                                   172
Shaun the Sheep Movie (2015)                      198
Love & Mercy (2015)                               178
Far From The Madding Crowd (2015)                 168
Black Sea (2015)                                  164
Leviathan (2014)                                  198
Unbroken (2014)                                   102
The Imitation Game (2014)                         180
Taken 3 (2015)                                     18
Ted 2 (2015)                                       92
Southpaw (2015)                                   118
Night at the Museum: Secret of the Tomb (2014)    100
Pixels (2015)                                      34
McFarland, USA (2015)                             158
Insidious: Chapter 3 (2015)                       118
The Man From U.N.C.L.E. (2015)                    136
Run All Night (2015)                              120
Trainwreck (2015)                                 170
Selma (2014)                                      198
Ex Machina (2015)                                 184
Still Alice (2015)                                176
Wild Tales (2014)                                 192
The End of the Tour (2015)                        184
                                                 ... 
Clouds of Sils Maria (2015)                       178
Testament of Youth (2015)                         162
Infinitely Polar Bear (2015)                      160
Phoenix (2015)                                    198
The Wolfpack (2015)                               168
The Stanford Prison Experiment (2015)             168
Tangerine (2015)                                  190
Magic Mike XXL (2015)                             124
Home (2015)                                        90
The Wedding Ringer (2015)                          54
Woman in Gold (2015)                              104
The Last Five Years (2015)                        120
Mission: Impossible – Rogue Nation (2015)       184
Amy (2015)                                        194
Jurassic World (2015)                             142
Minions (2015)                                    108
Max (2015)                                         70
Paul Blart: Mall Cop 2 (2015)                      10
The Longest Ride (2015)                            62
The Lazarus Effect (2015)                          28
The Woman In Black 2 Angel of Death (2015)         44
Danny Collins (2015)                              154
Spare Parts (2015)                                104
Serena (2015)                                      36
Inside Out (2015)                                 196
Mr. Holmes (2015)                                 174
'71 (2015)                                        194
Two Days, One Night (2014)                        194
Gett: The Trial of Viviane Amsalem (2015)         200
Kumiko, The Treasure Hunter (2015)                174
Length: 146, dtype: int64





100
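A compact sketch of the difference between element-wise functions and reductions, on a made-up three-film Series (names and scores are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series indexed by film name
scores = pd.Series([74, 85, 18], index=["Film A", "Film B", "Film C"])

doubled = np.add(scores, scores)   # element-wise, returns a Series
same_as_plus = scores + scores     # the + operator gives the same result
highest = np.max(scores)           # reduction, returns a single value

print(doubled)
print(highest)
```

Element-wise results keep the original index, so the film names stay attached to their values.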

Test the values of a Series against a condition, then use the resulting Boolean mask to extract elements:

#will actually return a Series object with a boolean value for each film
series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print(both_criteria)
Avengers: Age of Ultron (2015)                                            74
The Water Diviner (2015)                                                  63
Unbroken (2014)                                                           51
Southpaw (2015)                                                           59
Insidious: Chapter 3 (2015)                                               59
The Man From U.N.C.L.E. (2015)                                            68
Run All Night (2015)                                                      60
5 Flights Up (2015)                                                       52
Welcome to Me (2015)                                                      71
Saint Laurent (2015)                                                      51
Maps to the Stars (2015)                                                  60
Pitch Perfect 2 (2015)                                                    67
The Age of Adaline (2015)                                                 54
The DUFF (2015)                                                           71
Ricki and the Flash (2015)                                                64
Unfriended (2015)                                                         60
American Sniper (2015)                                                    72
The Hobbit: The Battle of the Five Armies (2014)                          61
Paper Towns (2015)                                                        55
Big Eyes (2014)                                                           72
Maggie (2015)                                                             54
Focus (2015)                                                              57
The Second Best Exotic Marigold Hotel (2015)                              62
The 100-Year-Old Man Who Climbed Out the Window and Disappeared (2015)    67
Escobar: Paradise Lost (2015)                                             52
Into the Woods (2014)                                                     71
Inherent Vice (2014)                                                      73
Magic Mike XXL (2015)                                                     62
Woman in Gold (2015)                                                      52
The Last Five Years (2015)                                                60
Jurassic World (2015)                                                     71
Minions (2015)                                                            54
Spare Parts (2015)                                                        52
dtype: int64
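Conditions can also be combined inline instead of being stored in variables first. A sketch on made-up scores (all values hypothetical) showing `&`, `|`, and `~`:

```python
import pandas as pd

# Hypothetical toy Series indexed by film name
scores = pd.Series([74, 18, 51, 99],
                   index=["Film A", "Film B", "Film C", "Film D"])

# Each comparison needs its own parentheses because & binds tighter than > and <
mid_range = scores[(scores > 50) & (scores < 75)]

# | is element-wise "or", ~ is element-wise "not"
extremes = scores[(scores <= 50) | (scores >= 75)]
not_low = scores[~(scores <= 50)]

print(mid_range)
print(extremes)
print(not_low)
```

Python's `and` and `or` keywords do not work here, since they expect a single True or False rather than one per element.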

Define two Series with the same index, then compute the average of the 'RottenTomatoes' and 'RottenTomatoes_User' scores for each film:

#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2

print(rt_mean)
FILM
Avengers: Age of Ultron (2015)                    80.0
Cinderella (2015)                                 82.5
Ant-Man (2015)                                    85.0
Do You Believe? (2015)                            51.0
Hot Tub Time Machine 2 (2015)                     21.0
The Water Diviner (2015)                          62.5
Irrational Man (2015)                             47.5
Top Five (2014)                                   75.0
Shaun the Sheep Movie (2015)                      90.5
Love & Mercy (2015)                               88.0
Far From The Madding Crowd (2015)                 80.5
Black Sea (2015)                                  71.0
Leviathan (2014)                                  89.0
Unbroken (2014)                                   60.5
The Imitation Game (2014)                         91.0
Taken 3 (2015)                                    27.5
Ted 2 (2015)                                      52.0
Southpaw (2015)                                   69.5
Night at the Museum: Secret of the Tomb (2014)    54.0
Pixels (2015)                                     35.5
McFarland, USA (2015)                             84.0
Insidious: Chapter 3 (2015)                       57.5
The Man From U.N.C.L.E. (2015)                    74.0
Run All Night (2015)                              59.5
Trainwreck (2015)                                 79.5
Selma (2014)                                      92.5
Ex Machina (2015)                                 89.0
Still Alice (2015)                                86.5
Wild Tales (2014)                                 94.0
The End of the Tour (2015)                        90.5
                                                  ... 
Clouds of Sils Maria (2015)                       78.0
Testament of Youth (2015)                         80.0
Infinitely Polar Bear (2015)                      78.0
Phoenix (2015)                                    90.0
The Wolfpack (2015)                               78.5
The Stanford Prison Experiment (2015)             85.5
Tangerine (2015)                                  90.5
Magic Mike XXL (2015)                             63.0
Home (2015)                                       55.0
The Wedding Ringer (2015)                         46.5
Woman in Gold (2015)                              66.5
The Last Five Years (2015)                        60.0
Mission: Impossible – Rogue Nation (2015)       91.0
Amy (2015)                                        94.0
Jurassic World (2015)                             76.0
Minions (2015)                                    53.0
Max (2015)                                        54.0
Paul Blart: Mall Cop 2 (2015)                     20.5
The Longest Ride (2015)                           52.0
The Lazarus Effect (2015)                         18.5
The Woman In Black 2 Angel of Death (2015)        23.5
Danny Collins (2015)                              76.0
Spare Parts (2015)                                67.5
Serena (2015)                                     21.5
Inside Out (2015)                                 94.0
Mr. Holmes (2015)                                 82.5
'71 (2015)                                        89.5
Two Days, One Night (2014)                        87.5
Gett: The Trial of Viviane Amsalem (2015)         90.5
Kumiko, The Treasure Hunter (2015)                75.0
Length: 146, dtype: float64
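The "data alignment" mentioned in the comment above means that values are matched by index label, not by position. A sketch with two made-up Series whose indexes differ in order and length (names and scores are hypothetical):

```python
import pandas as pd

# Hypothetical toy scores; note the different order and the extra "Film C"
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([80, 86, 90], index=["Film B", "Film A", "Film C"])

# Values are paired up by label before the arithmetic is applied
mean_score = (critics + users) / 2
print(mean_score)
```

"Film C" has no critics score, so its result is NaN rather than an error.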

Read 'G:\fandango_score_comparison.csv' and use the 'FILM' values as the index:

import pandas as pd
# set_index returns a new DataFrame that is indexed by the values in the specified column.
# By default it also drops that column from the DataFrame;
# passing drop=False keeps FILM as a regular column as well.
fandango = pd.read_csv('G:\\fandango_score_comparison.csv')
print(type(fandango))
fandango_films = fandango.set_index('FILM', drop=False)
#print(fandango_films.index)
<class 'pandas.core.frame.DataFrame'>

Select rows by film name:

# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]

# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']

# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]

#When selecting multiple rows, a DataFrame is returned, 
#but when selecting an individual row, a Series object is returned instead
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm ... IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
FILM
Kumiko, The Treasure Hunter (2015) Kumiko, The Treasure Hunter (2015) 87 63 68 6.4 6.7 3.5 3.5 4.35 3.15 ... 3.35 4.5 3.0 3.5 3.0 3.5 19 5289 41 0.0
Do You Believe? (2015) Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 0.90 4.20 ... 2.70 1.0 4.0 1.0 2.5 2.5 31 3136 1793 0.5
Ant-Man (2015) Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.50 ... 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5

3 rows × 22 columns
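
The three selection styles above return different kinds of objects. A minimal sketch on a made-up frame (titles and scores are placeholders, not taken from the CSV) shows which type each one gives back, and that a label slice with `loc` includes both endpoints:

```python
import pandas as pd

# Hypothetical toy frame, already indexed by film title
toy = pd.DataFrame(
    {"IMDB": [7.8, 6.7, 5.4, 7.1]},
    index=pd.Index(["Film A", "Film B", "Film C", "Film D"], name="FILM"),
)

one_row = toy.loc["Film B"]                # single label -> Series
many_rows = toy.loc[["Film D", "Film A"]]  # list of labels -> DataFrame, in list order
label_slice = toy.loc["Film A":"Film C"]   # label slice -> DataFrame, both ends included

print(type(one_row).__name__)    # Series
print(type(many_rows).__name__)  # DataFrame
print(list(many_rows.index))     # ['Film D', 'Film A']
print(list(label_slice.index))   # ['Film A', 'Film B', 'Film C']
```

This differs from positional slicing such as `toy.iloc[0:2]`, which excludes the end position.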

Filter the columns by data type, then compute the standard deviation of each float column:

# apply() runs a Python function over a DataFrame.
# By default (axis=0) the function is called once per column,
# and each call receives that column as a Series object.
import numpy as np

# returns the data types as a Series
types = fandango_films.dtypes
#print(types)
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
#print(float_df)
# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))

print(deviations)
Metacritic_User               1.505529
IMDB                          0.955447
Fandango_Stars                0.538532
Fandango_Ratingvalue          0.501106
RT_norm                       1.503265
RT_user_norm                  0.997787
Metacritic_norm               0.972522
Metacritic_user_nom           0.752765
IMDB_norm                     0.477723
RT_norm_round                 1.509404
RT_user_norm_round            1.003559
Metacritic_norm_round         0.987561
Metacritic_user_norm_round    0.785412
IMDB_norm_round               0.501043
Fandango_Difference           0.152141
dtype: float64
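
Two details of the step above are worth checking on a small example: `select_dtypes` is a shorter way to pick columns by type, and `np.std` computes the population standard deviation (`ddof=0`), whereas pandas' own `std()` defaults to the sample standard deviation (`ddof=1`). The frame below is made up for illustration; its column names and values are assumptions, not data from the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one string, one int and one float column
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "votes": [10, 20, 30],     # int64
    "score": [1.0, 2.0, 3.0],  # float64
})

# Keep only the float64 columns
float_df = toy.select_dtypes(include="float64")

np_std = float_df.apply(lambda x: np.std(x))  # population std (ddof=0)
pd_std = float_df.std()                       # sample std (ddof=1)
pd_std0 = float_df.std(ddof=0)                # pandas, forced to ddof=0

print(list(float_df.columns))  # ['score']
print(np_std["score"], pd_std["score"], pd_std0["score"])
```

For `[1.0, 2.0, 3.0]` the population value is sqrt(2/3), about 0.816, while the sample value is 1.0, so the two conventions can differ noticeably on short columns.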

Use apply() with axis=1 and np.std to compute, for each film, the standard deviation across the two user-score columns:

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
FILM
Avengers: Age of Ultron (2015)                    0.375
Cinderella (2015)                                 0.125
Ant-Man (2015)                                    0.225
Do You Believe? (2015)                            0.925
Hot Tub Time Machine 2 (2015)                     0.150
The Water Diviner (2015)                          0.150
Irrational Man (2015)                             0.575
Top Five (2014)                                   0.100
Shaun the Sheep Movie (2015)                      0.150
Love & Mercy (2015)                               0.050
Far From The Madding Crowd (2015)                 0.050
Black Sea (2015)                                  0.150
Leviathan (2014)                                  0.175
Unbroken (2014)                                   0.125
The Imitation Game (2014)                         0.250
Taken 3 (2015)                                    0.000
Ted 2 (2015)                                      0.175
Southpaw (2015)                                   0.050
Night at the Museum: Secret of the Tomb (2014)    0.000
Pixels (2015)                                     0.025
McFarland, USA (2015)                             0.425
Insidious: Chapter 3 (2015)                       0.325
The Man From U.N.C.L.E. (2015)                    0.025
Run All Night (2015)                              0.350
Trainwreck (2015)                                 0.350
Selma (2014)                                      0.375
Ex Machina (2015)                                 0.175
Still Alice (2015)                                0.175
Wild Tales (2014)                                 0.100
The End of the Tour (2015)                        0.350
                                                  ...  
Clouds of Sils Maria (2015)                       0.100
Testament of Youth (2015)                         0.000
Infinitely Polar Bear (2015)                      0.075
Phoenix (2015)                                    0.025
The Wolfpack (2015)                               0.075
The Stanford Prison Experiment (2015)             0.050
Tangerine (2015)                                  0.325
Magic Mike XXL (2015)                             0.250
Home (2015)                                       0.200
The Wedding Ringer (2015)                         0.825
Woman in Gold (2015)                              0.225
The Last Five Years (2015)                        0.225
Mission: Impossible – Rogue Nation (2015)         0.250
Amy (2015)                                        0.075
Jurassic World (2015)                             0.275
Minions (2015)                                    0.125
Max (2015)                                        0.350
Paul Blart: Mall Cop 2 (2015)                     0.300
The Longest Ride (2015)                           0.625
The Lazarus Effect (2015)                         0.650
The Woman In Black 2 Angel of Death (2015)        0.475
Danny Collins (2015)                              0.100
Spare Parts (2015)                                0.300
Serena (2015)                                     0.700
Inside Out (2015)                                 0.025
Mr. Holmes (2015)                                 0.025
'71 (2015)                                        0.175
Two Days, One Night (2014)                        0.250
Gett: The Trial of Viviane Amsalem (2015)         0.200
Kumiko, The Treasure Hunter (2015)                0.025
Length: 146, dtype: float64
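
With `axis=1` the function receives one row at a time, so each result describes a single film. As a rough sketch on made-up numbers (the values below are placeholders, not rows from the CSV), the same result can also be obtained without a Python-level loop by calling `std(axis=1, ddof=0)`; with only two columns it reduces to half the absolute difference between them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores on a 0-5 scale, indexed by film title
toy = pd.DataFrame(
    {
        "RT_user_norm": [4.0, 3.0, 2.5],
        "Metacritic_user_nom": [3.0, 3.0, 2.0],
    },
    index=pd.Index(["Film A", "Film B", "Film C"], name="FILM"),
)

# Row-wise apply: x is a Series holding one film's two scores
by_apply = toy.apply(lambda x: np.std(x), axis=1)

# Same computation using the built-in reduction
by_method = toy.std(axis=1, ddof=0)

# For exactly two values, the population std is |a - b| / 2
by_hand = (toy["RT_user_norm"] - toy["Metacritic_user_nom"]).abs() / 2

print(by_apply.tolist())  # [0.5, 0.0, 0.25]
```

The built-in reduction is usually the faster choice on large frames, since `apply` with `axis=1` calls the function once per row.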

End
