Study Notes
The pandas Data Analysis Library
pandas is a tool built on top of NumPy, created to make data-analysis tasks easier. It incorporates a large number of libraries and some standard data models, provides the tools needed to operate efficiently on large datasets, and offers a wealth of functions and methods for processing data quickly and conveniently.
1. Reading Data
Attached: file link (https://pan.baidu.com/s/1hhnuiYJvUAMoHd4gwXz2Vw)
import pandas
food_info = pandas.read_csv("G:\\food_info.csv")
Check the data structure:
print(type(food_info))
<class 'pandas.core.frame.DataFrame'>
In the previous section we saw that NumPy's core structure is the ndarray; pandas' core structure is the DataFrame, which we can think of as a matrix-like table.
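As a minimal illustration of the structure (a tiny hand-made table, not the food_info data), a DataFrame can be built from a dict of columns, and each column is itself a Series:

```python
import pandas as pd

# Two columns, two rows -- a miniature stand-in for food_info.
df = pd.DataFrame({
    "Shrt_Desc": ["BUTTER WITH SALT", "CHEESE BLUE"],
    "Water_(g)": [15.87, 42.41],
})
print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(df["Water_(g)"]))  # <class 'pandas.core.series.Series'>
print(df.shape)               # (2, 2)
```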
Note: the call below is missing its parentheses, so Python prints the bound method itself, followed by the full DataFrame repr, which shows the first and last rows with the middle rows elided:
print(food_info.head)
<bound method NDFrame.head of NDB_No Shrt_Desc Water_(g) \
0 1001 BUTTER WITH SALT 15.87
1 1002 BUTTER WHIPPED WITH SALT 15.87
2 1003 BUTTER OIL ANHYDROUS 0.24
3 1004 CHEESE BLUE 42.41
4 1005 CHEESE BRICK 41.11
5 1006 CHEESE BRIE 48.42
6 1007 CHEESE CAMEMBERT 51.80
7 1008 CHEESE CARAWAY 39.28
8 1009 CHEESE CHEDDAR 37.10
9 1010 CHEESE CHESHIRE 37.65
10 1011 CHEESE COLBY 38.20
11 1012 CHEESE COTTAGE CRMD LRG OR SML CURD 79.79
12 1013 CHEESE COTTAGE CRMD W/FRUIT 79.64
13 1014 CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD 81.01
14 1015 CHEESE COTTAGE LOWFAT 2% MILKFAT 81.24
15 1016 CHEESE COTTAGE LOWFAT 1% MILKFAT 82.48
16 1017 CHEESE CREAM 54.44
17 1018 CHEESE EDAM 41.56
18 1019 CHEESE FETA 55.22
19 1020 CHEESE FONTINA 37.92
20 1021 CHEESE GJETOST 13.44
21 1022 CHEESE GOUDA 41.46
22 1023 CHEESE GRUYERE 33.19
23 1024 CHEESE LIMBURGER 48.42
24 1025 CHEESE MONTEREY 41.01
25 1026 CHEESE MOZZARELLA WHL MILK 50.01
26 1027 CHEESE MOZZARELLA WHL MILK LO MOIST 48.38
27 1028 CHEESE MOZZARELLA PART SKIM MILK 53.78
28 1029 CHEESE MOZZARELLA LO MOIST PART-SKIM 45.54
29 1030 CHEESE MUENSTER 41.77
... ... ... ...
8588 43544 BABYFOOD CRL RICE W/ PEARS & APPL DRY INST 2.00
8589 43546 BABYFOOD BANANA NO TAPIOCA STR 76.70
8590 43550 BABYFOOD BANANA APPL DSSRT STR 83.10
8591 43566 SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL) 1.30
8592 43570 CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD 5.00
8593 43572 POPCORN MICROWAVE LOFAT&NA 2.80
8594 43585 BABYFOOD FRUIT SUPREME DSSRT 81.60
8595 43589 CHEESE SWISS LOW FAT 59.60
8596 43595 BREAKFAST BAR CORN FLAKE CRUST W/FRUIT 14.50
8597 43597 CHEESE MOZZARELLA LO NA 49.90
8598 43598 MAYONNAISE DRSNG NO CHOL 21.70
8599 44005 OIL CORN PEANUT AND OLIVE 0.00
8600 44018 SWEETENERS TABLETOP FRUCTOSE LIQ 23.90
8601 44048 CHEESE FOOD IMITATION 55.50
8602 44055 CELERY FLAKES DRIED 9.00
8603 44061 PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX 4.20
8604 44074 BABYFOOD GRAPE JUC NO SUGAR CND 84.40
8605 44110 JELLIES RED SUGAR HOME PRESERVED 53.00
8606 44158 PIE FILLINGS BLUEBERRY CND 54.66
8607 44203 COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ 28.24
8608 44258 PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX 6.80
8609 44259 PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX 10.40
8610 44260 PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY... 6.84
8611 48052 VITAL WHEAT GLUTEN 8.20
8612 80200 FROG LEGS RAW 81.90
8613 83110 MACKEREL SALTED 43.00
8614 90240 SCALLOP (BAY&SEA) CKD STMD 70.25
8615 90480 SYRUP CANE 26.00
8616 90560 SNAIL RAW 79.20
8617 93600 TURTLE GREEN RAW 78.50
Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) \
0 717 0.85 81.11 2.11 0.06
1 717 0.85 81.11 2.11 0.06
2 876 0.28 99.48 0.00 0.00
3 353 21.40 28.74 5.11 2.34
4 371 23.24 29.68 3.18 2.79
5 334 20.75 27.68 2.70 0.45
6 300 19.80 24.26 3.68 0.46
7 376 25.18 29.20 3.28 3.06
8 406 24.04 33.82 3.71 1.33
9 387 23.37 30.60 3.60 4.78
10 394 23.76 32.11 3.36 2.57
11 98 11.12 4.30 1.41 3.38
12 97 10.69 3.85 1.20 4.61
13 72 10.34 0.29 1.71 6.66
14 81 10.45 2.27 1.27 4.76
15 72 12.39 1.02 1.39 2.72
16 342 5.93 34.24 1.32 4.07
17 357 24.99 27.80 4.22 1.43
18 264 14.21 21.28 5.20 4.09
19 389 25.60 31.14 3.79 1.55
20 466 9.65 29.51 4.75 42.65
21 356 24.94 27.44 3.94 2.22
22 413 29.81 32.34 4.30 0.36
23 327 20.05 27.25 3.79 0.49
24 373 24.48 30.28 3.55 0.68
25 300 22.17 22.35 3.28 2.19
26 318 21.60 24.64 2.91 2.47
27 254 24.26 15.92 3.27 2.77
28 301 24.58 19.72 3.80 6.36
29 368 23.41 30.04 3.66 1.12
... ... ... ... ... ...
8588 389 6.60 0.90 2.00 88.60
8589 91 1.00 0.20 0.76 21.34
8590 68 0.30 0.20 0.29 16.30
8591 465 8.70 15.20 1.85 73.40
8592 401 7.12 5.46 1.22 81.19
8593 429 12.60 9.50 1.71 73.39
8594 73 0.50 0.20 0.52 17.18
8595 179 28.40 5.10 3.50 3.40
8596 377 4.40 7.50 0.80 72.90
8597 280 27.50 17.10 2.40 3.10
8598 688 0.00 77.80 0.40 0.30
8599 884 0.00 100.00 0.00 0.00
8600 279 0.00 0.00 0.00 76.10
8601 257 4.08 19.50 4.74 16.18
8602 319 11.30 2.10 13.90 63.70
8603 356 5.30 2.40 9.90 78.20
8604 62 0.00 0.00 0.22 15.38
8605 179 0.30 0.03 0.08 46.10
8606 181 0.41 0.20 0.35 44.38
8607 287 0.08 0.01 0.07 71.60
8608 365 10.08 3.00 5.70 74.42
8609 351 1.60 0.10 1.86 86.04
8610 350 0.81 0.90 6.80 84.66
8611 370 75.16 1.85 1.00 13.79
8612 73 16.40 0.30 1.40 0.00
8613 305 18.50 25.10 13.40 0.00
8614 111 20.54 0.84 2.97 5.41
8615 269 0.00 0.00 0.86 73.14
8616 90 16.10 1.40 1.30 2.00
8617 89 19.80 0.50 1.20 0.00
Fiber_TD_(g) Sugar_Tot_(g) ... Vit_A_IU Vit_A_RAE \
0 0.0 0.06 ... 2499.0 684.0
1 0.0 0.06 ... 2499.0 684.0
2 0.0 0.00 ... 3069.0 840.0
3 0.0 0.50 ... 721.0 198.0
4 0.0 0.51 ... 1080.0 292.0
5 0.0 0.45 ... 592.0 174.0
6 0.0 0.46 ... 820.0 241.0
7 0.0 NaN ... 1054.0 271.0
8 0.0 0.28 ... 994.0 263.0
9 0.0 NaN ... 985.0 233.0
10 0.0 0.52 ... 994.0 264.0
11 0.0 2.67 ... 140.0 37.0
12 0.2 2.38 ... 146.0 38.0
13 0.0 1.85 ... 8.0 2.0
14 0.0 4.00 ... 225.0 68.0
15 0.0 2.72 ... 41.0 11.0
16 0.0 3.21 ... 1343.0 366.0
17 0.0 1.43 ... 825.0 243.0
18 0.0 4.09 ... 422.0 125.0
19 0.0 1.55 ... 913.0 261.0
20 0.0 NaN ... 1113.0 334.0
21 0.0 2.22 ... 563.0 165.0
22 0.0 0.36 ... 948.0 271.0
23 0.0 0.49 ... 1155.0 340.0
24 0.0 0.50 ... 769.0 198.0
25 0.0 1.03 ... 676.0 179.0
26 0.0 1.01 ... 745.0 197.0
27 0.0 1.13 ... 481.0 127.0
28 0.0 2.24 ... 846.0 254.0
29 0.0 1.12 ... 1012.0 298.0
... ... ... ... ... ...
8588 2.6 1.35 ... 0.0 0.0
8589 1.6 11.36 ... 5.0 0.0
8590 1.0 14.66 ... 30.0 2.0
8591 5.7 0.53 ... 81.0 4.0
8592 4.2 19.79 ... 2731.0 806.0
8593 14.2 0.54 ... 147.0 7.0
8594 2.0 14.87 ... 50.0 3.0
8595 0.0 1.33 ... 152.0 40.0
8596 2.1 35.10 ... 2027.0 608.0
8597 0.0 1.23 ... 517.0 137.0
8598 0.0 0.30 ... 0.0 0.0
8599 0.0 0.00 ... 0.0 0.0
8600 0.1 76.00 ... 0.0 0.0
8601 0.0 8.21 ... 900.0 45.0
8602 27.8 35.90 ... 1962.0 98.0
8603 6.1 0.70 ... 0.0 0.0
8604 0.1 NaN ... 8.0 NaN
8605 0.8 45.30 ... 3.0 0.0
8606 2.6 37.75 ... 22.0 1.0
8607 0.0 24.53 ... 12.0 1.0
8608 10.1 0.70 ... 0.0 0.0
8609 0.9 2.90 ... 0.0 0.0
8610 0.8 0.90 ... 0.0 0.0
8611 0.6 0.00 ... 0.0 0.0
8612 0.0 0.00 ... 50.0 15.0
8613 0.0 0.00 ... 157.0 47.0
8614 0.0 0.00 ... 5.0 2.0
8615 0.0 73.20 ... 0.0 0.0
8616 0.0 0.00 ... 100.0 30.0
8617 0.0 0.00 ... 100.0 30.0
Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) \
0 2.32 1.5 60.0 7.0 51.368 21.021
1 2.32 1.5 60.0 7.0 50.489 23.426
2 2.80 1.8 73.0 8.6 61.924 28.732
3 0.25 0.5 21.0 2.4 18.669 7.778
4 0.26 0.5 22.0 2.5 18.764 8.598
5 0.24 0.5 20.0 2.3 17.410 8.013
6 0.21 0.4 18.0 2.0 15.259 7.023
7 NaN NaN NaN NaN 18.584 8.275
8 0.78 0.6 24.0 2.9 19.368 8.428
9 NaN NaN NaN NaN 19.475 8.671
10 0.28 0.6 24.0 2.7 20.218 9.280
11 0.08 0.1 3.0 0.0 1.718 0.778
12 0.04 0.0 0.0 0.4 2.311 1.036
13 0.01 0.0 0.0 0.0 0.169 0.079
14 0.08 0.0 0.0 0.0 1.235 0.516
15 0.01 0.0 0.0 0.1 0.645 0.291
16 0.29 0.6 25.0 2.9 19.292 8.620
17 0.24 0.5 20.0 2.3 17.572 8.125
18 0.18 0.4 16.0 1.8 14.946 4.623
19 0.27 0.6 23.0 2.6 19.196 8.687
20 NaN NaN NaN NaN 19.160 7.879
21 0.24 0.5 20.0 2.3 17.614 7.747
22 0.28 0.6 24.0 2.7 18.913 10.043
23 0.23 0.5 20.0 2.3 16.746 8.606
24 0.26 0.6 22.0 2.5 19.066 8.751
25 0.19 0.4 16.0 2.3 13.152 6.573
26 0.21 0.5 18.0 2.5 15.561 7.027
27 0.14 0.3 12.0 1.6 10.114 4.510
28 0.43 0.4 15.0 1.3 11.473 5.104
29 0.26 0.6 22.0 2.5 19.113 8.711
... ... ... ... ... ... ...
8588 0.13 0.0 0.0 0.3 0.185 0.252
8589 0.25 0.0 0.0 0.5 0.072 0.028
8590 0.02 0.0 0.0 0.1 0.058 0.018
8591 3.53 0.0 0.0 0.7 2.837 6.341
8592 1.22 4.6 183.0 3.0 0.600 2.831
8593 5.01 0.0 0.0 15.7 1.415 4.085
8594 0.79 0.0 0.0 5.1 0.030 0.025
8595 0.07 0.1 4.0 0.5 3.304 1.351
8596 0.76 0.0 0.0 13.8 1.500 5.000
8597 0.15 0.3 13.0 1.8 10.867 4.844
8598 11.79 0.0 0.0 24.7 10.784 18.026
8599 14.78 0.0 0.0 21.0 14.367 48.033
8600 0.00 0.0 0.0 0.0 0.000 0.000
8601 2.15 0.0 0.0 36.7 7.996 3.108
8602 5.55 0.0 0.0 584.2 0.555 0.405
8603 0.02 0.0 0.0 0.4 0.984 1.154
8604 NaN NaN NaN NaN 0.000 0.000
8605 0.00 0.0 0.0 0.2 0.009 0.001
8606 0.23 0.0 0.0 3.9 0.000 0.000
8607 0.02 0.0 0.0 0.0 0.003 0.001
8608 0.02 0.0 0.0 0.5 1.578 1.150
8609 0.05 0.0 0.0 1.1 0.018 0.032
8610 0.08 0.0 0.0 1.7 0.099 0.116
8611 0.00 0.0 0.0 0.0 0.272 0.156
8612 1.00 0.2 8.0 0.1 0.076 0.053
8613 2.38 25.2 1006.0 7.8 7.148 8.320
8614 0.00 0.0 2.0 0.0 0.218 0.082
8615 0.00 0.0 0.0 0.0 0.000 0.000
8616 5.00 0.0 0.0 0.1 0.361 0.259
8617 0.50 0.0 0.0 0.1 0.127 0.088
FA_Poly_(g) Cholestrl_(mg)
0 3.043 215.0
1 3.012 219.0
2 3.694 256.0
3 0.800 75.0
4 0.784 94.0
5 0.826 100.0
6 0.724 72.0
7 0.830 93.0
8 1.433 102.0
9 0.870 103.0
10 0.953 95.0
11 0.123 17.0
12 0.124 13.0
13 0.003 7.0
14 0.083 12.0
15 0.031 4.0
16 1.437 110.0
17 0.665 89.0
18 0.591 89.0
19 1.654 116.0
20 0.938 94.0
21 0.657 114.0
22 1.733 110.0
23 0.495 90.0
24 0.899 89.0
25 0.765 79.0
26 0.778 89.0
27 0.472 64.0
28 0.861 65.0
29 0.661 96.0
... ... ...
8588 0.231 0.0
8589 0.041 0.0
8590 0.047 0.0
8591 5.024 0.0
8592 1.307 0.0
8593 3.572 0.0
8594 0.068 0.0
8595 0.180 35.0
8596 0.900 0.0
8597 0.509 54.0
8598 45.539 0.0
8599 33.033 0.0
8600 0.000 0.0
8601 7.536 6.0
8602 1.035 0.0
8603 0.131 0.0
8604 0.000 0.0
8605 0.008 0.0
8606 0.000 0.0
8607 0.009 0.0
8608 0.130 0.0
8609 0.050 0.0
8610 0.433 0.0
8611 0.810 0.0
8612 0.102 50.0
8613 6.210 95.0
8614 0.222 41.0
8615 0.000 0.0
8616 0.252 50.0
8617 0.170 50.0
[8618 rows x 36 columns]>
Calling head() (with parentheses) outputs the first 5 rows of the data by default.
food_info.head()
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | BUTTER WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 2499.0 | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 51.368 | 21.021 | 3.043 | 215.0 |
1 | 1002 | BUTTER WHIPPED WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 2499.0 | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 50.489 | 23.426 | 3.012 | 219.0 |
2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 3069.0 | 840.0 | 2.80 | 1.8 | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 |
3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.40 | 28.74 | 5.11 | 2.34 | 0.0 | 0.50 | ... | 721.0 | 198.0 | 0.25 | 0.5 | 21.0 | 2.4 | 18.669 | 7.778 | 0.800 | 75.0 |
4 | 1005 | CHEESE BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 1080.0 | 292.0 | 0.26 | 0.5 | 22.0 | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 |
5 rows × 36 columns
View the data type of each column:
print(food_info.dtypes)
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
Protein_(g) float64
Lipid_Tot_(g) float64
Ash_(g) float64
Carbohydrt_(g) float64
Fiber_TD_(g) float64
Sugar_Tot_(g) float64
Calcium_(mg) float64
Iron_(mg) float64
Magnesium_(mg) float64
Phosphorus_(mg) float64
Potassium_(mg) float64
Sodium_(mg) float64
Zinc_(mg) float64
Copper_(mg) float64
Manganese_(mg) float64
Selenium_(mcg) float64
Vit_C_(mg) float64
Thiamin_(mg) float64
Riboflavin_(mg) float64
Niacin_(mg) float64
Vit_B6_(mg) float64
Vit_B12_(mcg) float64
Vit_A_IU float64
Vit_A_RAE float64
Vit_E_(mg) float64
Vit_D_mcg float64
Vit_D_IU float64
Vit_K_(mcg) float64
FA_Sat_(g) float64
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object
- object - For string values
- int - For integer values
- float - For float values
- datetime - For time values
- bool - For Boolean values
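Columns can be converted between these dtypes with astype(); a quick sketch on a synthetic string column (not the food_info data):

```python
import pandas as pd

s = pd.Series(["1", "2", "3"])  # string values load as object dtype
print(s.dtype)                  # object
nums = s.astype("int64")        # convert to 64-bit integers
print(nums.dtype)               # int64
print(nums.sum())               # 6
```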
Output the first 3 rows:
food_info.head(3)
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | BUTTER WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 2499.0 | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 51.368 | 21.021 | 3.043 | 215.0 |
1 | 1002 | BUTTER WHIPPED WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 2499.0 | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 50.489 | 23.426 | 3.012 | 219.0 |
2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 3069.0 | 840.0 | 2.80 | 1.8 | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 |
3 rows × 36 columns
Output the last 4 rows:
food_info.tail(4)
#print(food_info.tail(4))
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8614 | 90240 | SCALLOP (BAY&SEA) CKD STMD | 70.25 | 111 | 20.54 | 0.84 | 2.97 | 5.41 | 0.0 | 0.0 | ... | 5.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.218 | 0.082 | 0.222 | 41.0 |
8615 | 90480 | SYRUP CANE | 26.00 | 269 | 0.00 | 0.00 | 0.86 | 73.14 | 0.0 | 73.2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 |
8616 | 90560 | SNAIL RAW | 79.20 | 90 | 16.10 | 1.40 | 1.30 | 2.00 | 0.0 | 0.0 | ... | 100.0 | 30.0 | 5.0 | 0.0 | 0.0 | 0.1 | 0.361 | 0.259 | 0.252 | 50.0 |
8617 | 93600 | TURTLE GREEN RAW | 78.50 | 89 | 19.80 | 0.50 | 1.20 | 0.00 | 0.0 | 0.0 | ... | 100.0 | 30.0 | 0.5 | 0.0 | 0.0 | 0.1 | 0.127 | 0.088 | 0.170 | 50.0 |
4 rows × 36 columns
View the column names:
print(food_info.columns)
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
View the shape of the data:
print(food_info.shape)
(8618, 36)
2. Indexing and Computation
View the 7th row of data (index 6):
food_info.loc[6]
NDB_No 1007
Shrt_Desc CHEESE CAMEMBERT
Water_(g) 51.8
Energ_Kcal 300
Protein_(g) 19.8
Lipid_Tot_(g) 24.26
Ash_(g) 3.68
Carbohydrt_(g) 0.46
Fiber_TD_(g) 0
Sugar_Tot_(g) 0.46
Calcium_(mg) 388
Iron_(mg) 0.33
Magnesium_(mg) 20
Phosphorus_(mg) 347
Potassium_(mg) 187
Sodium_(mg) 842
Zinc_(mg) 2.38
Copper_(mg) 0.021
Manganese_(mg) 0.038
Selenium_(mcg) 14.5
Vit_C_(mg) 0
Thiamin_(mg) 0.028
Riboflavin_(mg) 0.488
Niacin_(mg) 0.63
Vit_B6_(mg) 0.227
Vit_B12_(mcg) 1.3
Vit_A_IU 820
Vit_A_RAE 241
Vit_E_(mg) 0.21
Vit_D_mcg 0.4
Vit_D_IU 18
Vit_K_(mcg) 2
FA_Sat_(g) 15.259
FA_Mono_(g) 7.023
FA_Poly_(g) 0.724
Cholestrl_(mg) 72
Name: 6, dtype: object
View rows 4 through 7 (indexes 3 to 6):
# Returns a DataFrame containing the rows at indexes 3, 4, 5, and 6.
food_info.loc[3:6]
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.40 | 28.74 | 5.11 | 2.34 | 0.0 | 0.50 | ... | 721.0 | 198.0 | 0.25 | 0.5 | 21.0 | 2.4 | 18.669 | 7.778 | 0.800 | 75.0 |
4 | 1005 | CHEESE BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 1080.0 | 292.0 | 0.26 | 0.5 | 22.0 | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 |
5 | 1006 | CHEESE BRIE | 48.42 | 334 | 20.75 | 27.68 | 2.70 | 0.45 | 0.0 | 0.45 | ... | 592.0 | 174.0 | 0.24 | 0.5 | 20.0 | 2.3 | 17.410 | 8.013 | 0.826 | 100.0 |
6 | 1007 | CHEESE CAMEMBERT | 51.80 | 300 | 19.80 | 24.26 | 3.68 | 0.46 | 0.0 | 0.46 | ... | 820.0 | 241.0 | 0.21 | 0.4 | 18.0 | 2.0 | 15.259 | 7.023 | 0.724 | 72.0 |
4 rows × 36 columns
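Note that .loc slices by label and includes both endpoints, which is why loc[3:6] returned four rows; the position-based .iloc excludes the end, like ordinary Python slicing. A sketch on a small synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})
print(len(df.loc[1:3]))   # 3 rows: labels 1, 2, 3 (endpoint included)
print(len(df.iloc[1:3]))  # 2 rows: positions 1, 2 (endpoint excluded)
```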
View the 3rd, 6th, and 11th rows (indexes 2, 5, and 10):
# Returns a DataFrame containing the rows at indexes 2, 5, and 10. Either of the following approaches will work.
# Method 1
two_five_ten = [2,5,10]
food_info.loc[two_five_ten]
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 3069.0 | 840.0 | 2.80 | 1.8 | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 |
5 | 1006 | CHEESE BRIE | 48.42 | 334 | 20.75 | 27.68 | 2.70 | 0.45 | 0.0 | 0.45 | ... | 592.0 | 174.0 | 0.24 | 0.5 | 20.0 | 2.3 | 17.410 | 8.013 | 0.826 | 100.0 |
10 | 1011 | CHEESE COLBY | 38.20 | 394 | 23.76 | 32.11 | 3.36 | 2.57 | 0.0 | 0.52 | ... | 994.0 | 264.0 | 0.28 | 0.6 | 24.0 | 2.7 | 20.218 | 9.280 | 0.953 | 95.0 |
3 rows × 36 columns
# Method 2
food_info.loc[[2,5,10]]
NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 3069.0 | 840.0 | 2.80 | 1.8 | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 |
5 | 1006 | CHEESE BRIE | 48.42 | 334 | 20.75 | 27.68 | 2.70 | 0.45 | 0.0 | 0.45 | ... | 592.0 | 174.0 | 0.24 | 0.5 | 20.0 | 2.3 | 17.410 | 8.013 | 0.826 | 100.0 |
10 | 1011 | CHEESE COLBY | 38.20 | 394 | 23.76 | 32.11 | 3.36 | 2.57 | 0.0 | 0.52 | ... | 994.0 | 264.0 | 0.28 | 0.6 | 24.0 | 2.7 | 20.218 | 9.280 | 0.953 | 95.0 |
3 rows × 36 columns
Select a single column by name:
# Series object representing the "NDB_No" column.
ndb_col = food_info["NDB_No"]
print(ndb_col)
0 1001
1 1002
2 1003
3 1004
4 1005
5 1006
6 1007
7 1008
8 1009
9 1010
10 1011
11 1012
12 1013
13 1014
14 1015
15 1016
16 1017
17 1018
18 1019
19 1020
20 1021
21 1022
22 1023
23 1024
24 1025
25 1026
26 1027
27 1028
28 1029
29 1030
...
8588 43544
8589 43546
8590 43550
8591 43566
8592 43570
8593 43572
8594 43585
8595 43589
8596 43595
8597 43597
8598 43598
8599 44005
8600 44018
8601 44048
8602 44055
8603 44061
8604 44074
8605 44110
8606 44158
8607 44203
8608 44258
8609 44259
8610 44260
8611 48052
8612 80200
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
Select multiple columns by name:
columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
print(zinc_copper)
#Or equivalently:
#zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
Zinc_(mg) Copper_(mg)
0 0.09 0.000
1 0.05 0.016
2 0.01 0.001
3 2.66 0.040
4 2.60 0.024
5 2.38 0.019
6 2.38 0.021
7 2.94 0.024
8 3.43 0.056
9 2.79 0.042
10 3.07 0.042
11 0.40 0.029
12 0.33 0.040
13 0.47 0.030
14 0.51 0.033
15 0.38 0.028
16 0.51 0.019
17 3.75 0.036
18 2.88 0.032
19 3.50 0.025
20 1.14 0.080
21 3.90 0.036
22 3.90 0.032
23 2.10 0.021
24 3.00 0.032
25 2.92 0.011
26 2.46 0.022
27 2.76 0.025
28 3.61 0.034
29 2.81 0.031
... ... ...
8588 3.30 0.377
8589 0.05 0.040
8590 0.05 0.030
8591 1.15 0.116
8592 5.03 0.200
8593 3.83 0.545
8594 0.08 0.035
8595 3.90 0.027
8596 4.10 0.100
8597 3.13 0.027
8598 0.13 0.000
8599 0.02 0.000
8600 0.09 0.037
8601 0.21 0.026
8602 2.77 0.571
8603 0.41 0.838
8604 0.05 0.028
8605 0.03 0.023
8606 0.10 0.112
8607 0.02 0.020
8608 1.49 0.854
8609 0.19 0.040
8610 0.10 0.038
8611 0.85 0.182
8612 1.00 0.250
8613 1.10 0.100
8614 1.55 0.033
8615 0.19 0.020
8616 1.00 0.400
8617 1.00 0.250
[8618 rows x 2 columns]
Select the first 3 rows of every column whose name ends with the unit "(g)":
#Convert the column names to a list
col_names = food_info.columns.tolist()
#print(col_names)
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
Water_(g) Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) \
0 15.87 0.85 81.11 2.11 0.06
1 15.87 0.85 81.11 2.11 0.06
2 0.24 0.28 99.48 0.00 0.00
Fiber_TD_(g) Sugar_Tot_(g) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g)
0 0.0 0.06 51.368 21.021 3.043
1 0.0 0.06 50.489 23.426 3.012
2 0.0 0.00 61.924 28.732 3.694
Convert the "Iron_(mg)" column to grams (divide by 1000) and append it to the DataFrame as a new column named "Iron_(g)":
iron_grams = food_info["Iron_(mg)"] / 1000
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams
print(food_info.shape)
# Subtracts 100 from each value in the column and returns a Series object.
#sub_100 = food_info["Iron_(mg)"] - 100
# Multiplies each value in the column by 2 and returns a Series object.
#mult_2 = food_info["Iron_(mg)"]*2
(8618, 36)
(8618, 37)
Multiply the "Water_(g)" column by the "Energ_Kcal" column, element-wise:
#It applies the arithmetic operator to the first value in both columns, the second value in both columns, and so on
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
Compute a Score from the Protein and Lipid_Tot columns:
#Score = 2 * Protein_(g) - 0.75 * Lipid_Tot_(g)
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat
Normalize the data:
# the "Vit_A_IU" column ranges from 0 to 100000, while the "Fiber_TD_(g)" column ranges from 0 to 79
#For certain calculations, columns like "Vit_A_IU" can have a greater effect on the result,
#due to the scale of the values
# The largest value in the "Energ_Kcal" column.
max_calories = food_info["Energ_Kcal"].max()
# Divide the values in "Energ_Kcal" by the largest value.
normalized_calories = food_info["Energ_Kcal"] / max_calories
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat
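Dividing by the maximum only maps into [0, 1] when the column's minimum is 0; full min-max normalization subtracts the minimum first. A sketch with synthetic values:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 50.0])

# Divide-by-max rescaling, as used above:
by_max = s / s.max()                          # [0.2, 0.4, 1.0]

# Full min-max normalization:
minmax = (s - s.min()) / (s.max() - s.min())  # [0.0, 0.25, 1.0]
print(by_max.tolist())
print(minmax.tolist())
```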
Sort by "Sodium_(mg)" in ascending order:
#By default, pandas will sort the data by the column we specify in ascending order and return a new DataFrame
# inplace=True sorts the DataFrame in place, rather than returning a new DataFrame.
#print(food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info["Sodium_(mg)"])
Specify ascending=False to sort "Sodium_(mg)" in descending order:
#Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])
276 38758.0
5814 27360.0
6192 26050.0
1242 26000.0
1245 24000.0
1243 24000.0
1244 23875.0
292 17000.0
1254 11588.0
5811 10600.0
8575 9690.0
291 8068.0
1249 8031.0
5812 7893.0
1292 7851.0
293 7203.0
4472 7027.0
4836 6820.0
1261 6580.0
3747 6008.0
1266 5730.0
4835 5586.0
4834 5493.0
1263 5356.0
1553 5203.0
1552 5053.0
1251 4957.0
1257 4843.0
294 4616.0
8613 4450.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
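The block of NaN rows at the bottom of this output is sort_values' default behavior: missing values are placed last regardless of sort direction, and na_position="first" moves them to the top instead. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0])
desc = s.sort_values(ascending=False)       # 3.0, 1.0, NaN (NaN still last)
first = s.sort_values(na_position="first")  # NaN, 1.0, 3.0
print(desc.tolist())
print(first.tolist())
```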
3. A Data-Preprocessing Example: Analyzing the Titanic Dataset
This is a classic Kaggle competition: https://www.kaggle.com/c/titanic
Attached: file link (https://pan.baidu.com/s/1VeNYxEuXo7Fy-rhDRGzi1A)
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("G:\\train.csv")
#titanic_survival.shape
titanic_survival.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Field descriptions:
- PassengerId: a unique number for each of the 891 passengers aboard
- Survived: takes only two values, 0 and 1, where 0 means the passenger died and 1 means they survived
- Pclass: cabin class, with three levels
- Name: the passenger's name
- Sex: the passenger's sex
- Age: the passenger's age
- SibSp: the number of siblings and spouses the passenger had aboard
- Parch: short for parents and children; the number traveling with the passenger, which is mostly 0
- Ticket: the ticket number
- Fare: the ticket price
- Cabin: the passenger's cabin number, with missing values recorded as NaN; this column has too many missing values and will not be used later
- Embarked: the port of embarkation, one of three ports
Count the missing values in the Age column:
#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survival["Age"]
#print(age.loc[0:10])
age_is_null = pd.isnull(age)
#print(age_is_null)
age_null_true = age[age_is_null]
#print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)
177
If there are missing values, computing the mean of Age this way yields nan, so the calculation fails:
#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
print(mean_age)
nan
Filter out the missing values in the Age column, then compute the mean:
#we have to filter out the missing values before we calculate the mean.
good_ages = titanic_survival["Age"][age_is_null == False]
#print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)
29.6991176471
Of course, the code above can also be done with the built-in mean function .mean(). Note, too, that deleting samples with missing values is rarely a good approach; more often, missing values are replaced with the mean, median, or mode.
# missing data is so common that many pandas methods automatically filter for it
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)
29.69911764705882
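Rather than discarding rows, missing ages can be imputed, e.g. with the median, as the note above suggests. A sketch on synthetic ages (not the real Age column):

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])
filled = age.fillna(age.median())  # median of the non-missing values is 26.0
print(filled.tolist())             # [22.0, 26.0, 26.0, 35.0, 26.0]
```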
Compute the mean fare for each cabin class:
#mean fare for each class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
Is there a simpler way to write this? Of course: the pivot_table function. Setting the index, values, and aggfunc parameters gives the mean of Survived for each Pclass.
#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
As we can see, the survival rate drops as the cabin class gets lower; survival was tied to money.
Using pivot_table, compute the mean age for each Pclass (when the aggfunc parameter is not specified, the default is the mean):
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
Age
Pclass
1 38.233441
2 29.877630
3 25.140620
First class averaged about 38 years old, second class about 30, and third class about 25, which suggests the wealthier passengers tended to be older and the younger ones poorer.
Now suppose we want to relate one quantity to two others at once: the relationship between the port of embarkation and both the total fare and the number of survivors. Same approach as above (aggfunc=np.sum computes totals):
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
Fare Survived
Embarked
C 10072.2962 93
Q 1022.2543 30
S 17439.3988 217
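A pivot_table like this is essentially a groupby plus an aggregation; the same totals can be produced either way. A sketch on a tiny synthetic stand-in for the Titanic columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q"],
    "Fare": [8.0, 53.0, 71.0, 7.75],
    "Survived": [0, 1, 1, 0],
})
port_stats = df.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
grouped = df.groupby("Embarked")[["Fare", "Survived"]].sum()
print(port_stats.loc["S", "Fare"])  # 61.0
print(grouped.loc["S", "Fare"])     # 61.0
```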
The dropna function removes missing values; the fillna function fills them in:
#specifying axis=1 or axis='columns' will drop any columns that have null values
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])
#print(new_titanic_survival)
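The difference between the two axes, on a synthetic frame with the same kind of missing values (hypothetical values, not rows from train.csv):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C85"],
    "Sex": ["male", "female", "female"],
})
# axis=1 drops every column that contains a null:
print(df.dropna(axis=1).columns.tolist())      # ['Sex']
# axis=0 with subset= drops only rows that are null in those columns:
print(len(df.dropna(axis=0, subset=["Age"])))  # 2
```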
View the element at a specified row and column:
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
28.0
1
Sort by Age in descending order:
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print(new_titanic_survival[0:10])
print('--------------')
#Drop the old index and assign a fresh one:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:10])
PassengerId Survived Pclass Name \
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson
851 852 0 3 Svensson, Mr. Johan
493 494 0 1 Artagaveytia, Mr. Ramon
96 97 0 1 Goldschmidt, Mr. George B
116 117 0 3 Connors, Mr. Patrick
672 673 0 2 Mitchell, Mr. Henry Michael
745 746 0 1 Crosby, Capt. Edward Gifford
33 34 0 2 Wheadon, Mr. Edward H
54 55 0 1 Ostby, Mr. Engelhart Cornelius
280 281 0 3 Duane, Mr. Frank
Sex Age SibSp Parch Ticket Fare Cabin Embarked
630 male 80.0 0 0 27042 30.0000 A23 S
851 male 74.0 0 0 347060 7.7750 NaN S
493 male 71.0 0 0 PC 17609 49.5042 NaN C
96 male 71.0 0 0 PC 17754 34.6542 A5 C
116 male 70.5 0 0 370369 7.7500 NaN Q
672 male 70.0 0 0 C.A. 24580 10.5000 NaN S
745 male 70.0 1 1 WE/P 5735 71.0000 B22 S
33 male 66.0 0 0 C.A. 24579 10.5000 NaN S
54 male 65.0 0 1 113509 61.9792 B30 C
280 male 65.0 0 0 336439 7.7500 NaN Q
--------------
PassengerId Survived Pclass Name Sex \
0 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
1 852 0 3 Svensson, Mr. Johan male
2 494 0 1 Artagaveytia, Mr. Ramon male
3 97 0 1 Goldschmidt, Mr. George B male
4 117 0 3 Connors, Mr. Patrick male
5 673 0 2 Mitchell, Mr. Henry Michael male
6 746 0 1 Crosby, Capt. Edward Gifford male
7 34 0 2 Wheadon, Mr. Edward H male
8 55 0 1 Ostby, Mr. Engelhart Cornelius male
9 281 0 3 Duane, Mr. Frank male
Age SibSp Parch Ticket Fare Cabin Embarked
0 80.0 0 0 27042 30.0000 A23 S
1 74.0 0 0 347060 7.7750 NaN S
2 71.0 0 0 PC 17609 49.5042 NaN C
3 71.0 0 0 PC 17754 34.6542 A5 C
4 70.5 0 0 370369 7.7500 NaN Q
5 70.0 0 0 C.A. 24580 10.5000 NaN S
6 70.0 1 1 WE/P 5735 71.0000 B22 S
7 66.0 0 0 C.A. 24579 10.5000 NaN S
8 65.0 0 1 113509 61.9792 B30 C
9 65.0 0 0 336439 7.7500 NaN Q
pandas provides many functions for us, but when they do not meet our needs we can write our own. For example, define a function that returns the 100th item (index 99) of a column:
# This function returns the hundredth item from a series
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item
# Apply it to return the hundredth item from each column
hundredth_row = titanic_survival.apply(hundredth_row)
print(hundredth_row)
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object
Define a custom function that returns the number of missing values in each column:
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Using apply with a custom function, rewrite classes 1, 2, and 3 as First Class, Second Class, and Third Class:
#By passing in the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)
print(classes)
0 Third Class
1 First Class
2 Third Class
3 First Class
4 Third Class
5 Third Class
6 First Class
7 Third Class
8 Third Class
9 Second Class
10 Third Class
11 First Class
12 Third Class
13 Third Class
14 Third Class
15 Second Class
16 Third Class
17 Second Class
18 Third Class
19 Third Class
20 Second Class
21 Second Class
22 Third Class
23 First Class
24 Third Class
25 Third Class
26 Third Class
27 First Class
28 Third Class
29 Third Class
...
861 Second Class
862 First Class
863 Third Class
864 Second Class
865 Second Class
866 Second Class
867 First Class
868 Third Class
869 Third Class
870 Third Class
871 First Class
872 First Class
873 Third Class
874 Second Class
875 Third Class
876 Third Class
877 Third Class
878 Third Class
879 First Class
880 Second Class
881 Third Class
882 Third Class
883 Second Class
884 Third Class
885 Third Class
886 Second Class
887 First Class
888 Third Class
889 First Class
890 Third Class
Length: 891, dtype: object
接下来,自定义函数,将年龄小于18的数据改写为minor;年龄大于18的,改写为adult:
def is_minor(row):
if row["Age"] < 18:
return True
else:
return False
minors = titanic_survival.apply(is_minor, axis=1)
#print minors
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)
0 adult
1 adult
2 adult
3 adult
4 adult
5 unknown
6 adult
7 minor
8 adult
9 minor
10 minor
11 adult
12 adult
13 adult
14 minor
15 adult
16 minor
17 unknown
18 adult
19 unknown
20 adult
21 adult
22 minor
23 adult
24 minor
25 adult
26 unknown
27 adult
28 unknown
29 unknown
...
861 adult
862 adult
863 unknown
864 adult
865 adult
866 adult
867 adult
868 unknown
869 minor
870 adult
871 adult
872 adult
873 adult
874 adult
875 minor
876 adult
877 adult
878 unknown
879 adult
880 adult
881 adult
882 adult
883 adult
884 adult
885 adult
886 adult
887 adult
888 unknown
889 adult
890 adult
Length: 891, dtype: object
通过pivot_table函数,分析成年人和未成年人的平均获救率:
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
Survived
age_labels
adult 0.381032
minor 0.539823
unknown 0.293785
可以看到,成年人的平均获救率约为0.38,未成年人约为0.54。这或许应了那句"让妇女和儿童先走",成年人把生存机会让给了孩子。
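pivot_table 默认按均值聚合(所以上面得到的是获救率);通过 aggfunc 参数可以改为求和、计数等(以下为假设的演示数据):

```python
import pandas as pd

# 假设数据:舱位与是否获救
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# 默认 aggfunc 为均值,得到获救率;aggfunc="sum" 得到获救人数
rate = df.pivot_table(index="Pclass", values="Survived")
count = df.pivot_table(index="Pclass", values="Survived", aggfunc="sum")
print(rate)
print(count)
```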
四、Series结构
上面我们讲到的是DataFrame结构,而其中的一行或者是一列我们管它叫做Series结构:
- Series (collection of values)
- DataFrame (collection of Series objects)
- Panel (collection of DataFrame objects)
A Series object can hold many data types, including
- float - for representing float values
- int - for representing integer values
- bool - for representing Boolean values
- datetime64[ns] - for representing date & time, without time-zone
- datetime64[ns, tz] - for representing date & time, with time-zone
- timedelta[ns] - for representing differences in dates & times (seconds, minutes, etc.)
- category - for representing categorical values
- object - for representing String values
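可以构造几个 Series 直观地看一下这些 dtype(演示数据为假设):

```python
import pandas as pd

# 不同 dtype 的 Series 演示
s_float = pd.Series([1.5, 2.5])
s_int = pd.Series([1, 2])
s_bool = pd.Series([True, False])
s_str = pd.Series(["a", "b"])  # 字符串默认以 object 存储
print(s_float.dtype, s_int.dtype, s_bool.dtype, s_str.dtype)
```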
"fandango_score_comparison.csv"文件存放的是对电影的评分,包含了电影名字和一系列指标。
附:文件链接(https://pan.baidu.com/s/18kfSAqWTmtFIaijNXXKa0Q)
import pandas as pd
fandango = pd.read_csv('G:\\fandango_score_comparison.csv')
series_film = fandango['FILM']
print(type(series_film))
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']
print (series_rt[0:5])
<class 'pandas.core.series.Series'>
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0 74
1 85
2 80
3 18
4 14
Name: RottenTomatoes, dtype: int64
数据指标含义解释:
- FILM - film name
- RottenTomatoes - Rotten Tomatoes critics average score
- RottenTomatoes_User - Rotten Tomatoes user average score
- RT_norm - Rotten Tomatoes critics average score (normalized to a 0 to 5 point system)
- RT_user_norm - Rotten Tomatoes user average score (normalized to a 0 to 5 point system)
- Metacritic - Metacritic critics average score
- Metacritic_User - Metacritic user average score
构造Series结构:
# Import the Series object from pandas
from pandas import Series
film_names = series_film.values
print(type(film_names))
#print film_names
rt_scores = series_rt.values
#print rt_scores
series_custom = Series(rt_scores, index=film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
<class 'numpy.ndarray'>
Minions (2015) 54
Leviathan (2014) 99
dtype: int64
我们可以看到,Series的values就是ndarray,这也说明pandas是基于numpy构建的。所以,二者之间很多操作是互通的。
在Series结构中,除了上面那样用str索引取值,也可以用整数位置切片提取元素:
# 整数位置索引同样可用(int position index is also available)
fiveten = series_custom[5:10]
print(fiveten)
The Water Diviner (2015) 63
Irrational Man (2015) 42
Top Five (2014) 86
Shaun the Sheep Movie (2015) 99
Love & Mercy (2015) 89
dtype: int64
利用sorted函数对索引排序,再用reindex按排好序的索引重排Series:
original_index = series_custom.index.tolist()
#print original_index
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
#print sorted_by_index
其实不借助reindex,Series本身就可以排序:sort_index函数按索引排序,sort_values函数按值排序:
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
#print(sc2[0:10])
print(sc3[0:10])
Paul Blart: Mall Cop 2 (2015) 5
Hitman: Agent 47 (2015) 7
Hot Pursuit (2015) 8
Fantastic Four (2015) 9
Taken 3 (2015) 9
The Boy Next Door (2015) 10
The Loft (2015) 11
Unfinished Business (2015) 11
Mortdecai (2015) 12
Seventh Son (2015) 12
dtype: int64
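sort_values 默认升序;传入 ascending=False 可以降序排列,sort_index 同理(演示数据为假设):

```python
import pandas as pd

scores = pd.Series([5, 99, 54], index=["A", "B", "C"])

# ascending=False 按值从大到小排序
desc = scores.sort_values(ascending=False)
print(desc.index.tolist())
```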
Series结构可以直接相加,得到新的Series。
#The values in a Series object are treated as an ndarray, the core data type in NumPy
import numpy as np
# Add each value with each other
print(np.add(series_custom, series_custom))
# Apply sine function to each value
np.sin(series_custom)
# Return the highest value (will return a single value not a Series)
np.max(series_custom)
Avengers: Age of Ultron (2015) 148
Cinderella (2015) 170
Ant-Man (2015) 160
Do You Believe? (2015) 36
Hot Tub Time Machine 2 (2015) 28
The Water Diviner (2015) 126
Irrational Man (2015) 84
Top Five (2014) 172
Shaun the Sheep Movie (2015) 198
Love & Mercy (2015) 178
Far From The Madding Crowd (2015) 168
Black Sea (2015) 164
Leviathan (2014) 198
Unbroken (2014) 102
The Imitation Game (2014) 180
Taken 3 (2015) 18
Ted 2 (2015) 92
Southpaw (2015) 118
Night at the Museum: Secret of the Tomb (2014) 100
Pixels (2015) 34
McFarland, USA (2015) 158
Insidious: Chapter 3 (2015) 118
The Man From U.N.C.L.E. (2015) 136
Run All Night (2015) 120
Trainwreck (2015) 170
Selma (2014) 198
Ex Machina (2015) 184
Still Alice (2015) 176
Wild Tales (2014) 192
The End of the Tour (2015) 184
...
Clouds of Sils Maria (2015) 178
Testament of Youth (2015) 162
Infinitely Polar Bear (2015) 160
Phoenix (2015) 198
The Wolfpack (2015) 168
The Stanford Prison Experiment (2015) 168
Tangerine (2015) 190
Magic Mike XXL (2015) 124
Home (2015) 90
The Wedding Ringer (2015) 54
Woman in Gold (2015) 104
The Last Five Years (2015) 120
Mission: Impossible – Rogue Nation (2015) 184
Amy (2015) 194
Jurassic World (2015) 142
Minions (2015) 108
Max (2015) 70
Paul Blart: Mall Cop 2 (2015) 10
The Longest Ride (2015) 62
The Lazarus Effect (2015) 28
The Woman In Black 2 Angel of Death (2015) 44
Danny Collins (2015) 154
Spare Parts (2015) 104
Serena (2015) 36
Inside Out (2015) 196
Mr. Holmes (2015) 174
'71 (2015) 194
Two Days, One Night (2014) 194
Gett: The Trial of Viviane Amsalem (2015) 200
Kumiko, The Treasure Hunter (2015) 174
Length: 146, dtype: int64
100
对Series的值进行条件判断,利用布尔索引提取元素:
#will actually return a Series object with a boolean value for each film
series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
print(both_criteria)
Avengers: Age of Ultron (2015) 74
The Water Diviner (2015) 63
Unbroken (2014) 51
Southpaw (2015) 59
Insidious: Chapter 3 (2015) 59
The Man From U.N.C.L.E. (2015) 68
Run All Night (2015) 60
5 Flights Up (2015) 52
Welcome to Me (2015) 71
Saint Laurent (2015) 51
Maps to the Stars (2015) 60
Pitch Perfect 2 (2015) 67
The Age of Adaline (2015) 54
The DUFF (2015) 71
Ricki and the Flash (2015) 64
Unfriended (2015) 60
American Sniper (2015) 72
The Hobbit: The Battle of the Five Armies (2014) 61
Paper Towns (2015) 55
Big Eyes (2014) 72
Maggie (2015) 54
Focus (2015) 57
The Second Best Exotic Marigold Hotel (2015) 62
The 100-Year-Old Man Who Climbed Out the Window and Disappeared (2015) 67
Escobar: Paradise Lost (2015) 52
Into the Woods (2014) 71
Inherent Vice (2014) 73
Magic Mike XXL (2015) 62
Woman in Gold (2015) 52
The Last Five Years (2015) 60
Jurassic World (2015) 71
Minions (2015) 54
Spare Parts (2015) 52
dtype: int64
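上面的两个条件组合,也可以用 Series.between 改写。注意它默认包含端点;这里假设使用的是支持 inclusive="neither" 参数的较新版本 pandas:

```python
import pandas as pd

scores = pd.Series([74, 40, 63, 80], index=["a", "b", "c", "d"])

# inclusive="neither" 表示两端都不包含,等价于 >50 且 <75
mask = scores.between(50, 75, inclusive="neither")
print(scores[mask].index.tolist())
```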
定义两个index相同的Series结构,计算'RottenTomatoes'和'RottenTomatoes_User'两项评分的平均值:
#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)
FILM
Avengers: Age of Ultron (2015) 80.0
Cinderella (2015) 82.5
Ant-Man (2015) 85.0
Do You Believe? (2015) 51.0
Hot Tub Time Machine 2 (2015) 21.0
The Water Diviner (2015) 62.5
Irrational Man (2015) 47.5
Top Five (2014) 75.0
Shaun the Sheep Movie (2015) 90.5
Love & Mercy (2015) 88.0
Far From The Madding Crowd (2015) 80.5
Black Sea (2015) 71.0
Leviathan (2014) 89.0
Unbroken (2014) 60.5
The Imitation Game (2014) 91.0
Taken 3 (2015) 27.5
Ted 2 (2015) 52.0
Southpaw (2015) 69.5
Night at the Museum: Secret of the Tomb (2014) 54.0
Pixels (2015) 35.5
McFarland, USA (2015) 84.0
Insidious: Chapter 3 (2015) 57.5
The Man From U.N.C.L.E. (2015) 74.0
Run All Night (2015) 59.5
Trainwreck (2015) 79.5
Selma (2014) 92.5
Ex Machina (2015) 89.0
Still Alice (2015) 86.5
Wild Tales (2014) 94.0
The End of the Tour (2015) 90.5
...
Clouds of Sils Maria (2015) 78.0
Testament of Youth (2015) 80.0
Infinitely Polar Bear (2015) 78.0
Phoenix (2015) 90.0
The Wolfpack (2015) 78.5
The Stanford Prison Experiment (2015) 85.5
Tangerine (2015) 90.5
Magic Mike XXL (2015) 63.0
Home (2015) 55.0
The Wedding Ringer (2015) 46.5
Woman in Gold (2015) 66.5
The Last Five Years (2015) 60.0
Mission: Impossible – Rogue Nation (2015) 91.0
Amy (2015) 94.0
Jurassic World (2015) 76.0
Minions (2015) 53.0
Max (2015) 54.0
Paul Blart: Mall Cop 2 (2015) 20.5
The Longest Ride (2015) 52.0
The Lazarus Effect (2015) 18.5
The Woman In Black 2 Angel of Death (2015) 23.5
Danny Collins (2015) 76.0
Spare Parts (2015) 67.5
Serena (2015) 21.5
Inside Out (2015) 94.0
Mr. Holmes (2015) 82.5
'71 (2015) 89.5
Two Days, One Night (2014) 87.5
Gett: The Trial of Viviane Amsalem (2015) 90.5
Kumiko, The Treasure Hunter (2015) 75.0
Length: 146, dtype: float64
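需要注意,相加是按 index 对齐的;如果两个 Series 的 index 不完全相同,对不上的位置结果为 NaN(演示数据为假设):

```python
import pandas as pd

# index 只有部分重合的两个 Series
s1 = pd.Series([80, 90], index=["Film A", "Film B"])
s2 = pd.Series([70, 50], index=["Film B", "Film C"])

# 相加时按 index 对齐,只有 Film B 能对上,其余位置为 NaN
total = s1 + s2
print(total)
```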
读取’G:\fandango_score_comparison.csv’文件,将’FILM’值作为索引:
import pandas as pd
#set_index will return a new DataFrame indexed by the values in the specified column
#by default that column is dropped from the DataFrame
#passing drop=False keeps the FILM column
fandango = pd.read_csv('G:\\fandango_score_comparison.csv')
print(type(fandango))
fandango_films = fandango.set_index('FILM', drop=False)
#print(fandango_films.index)
<class 'pandas.core.frame.DataFrame'>
按照电影名进行取值:
# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
#When selecting multiple rows, a DataFrame is returned,
#but when selecting an individual row, a Series object is returned instead
FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FILM | |||||||||||||||||||||
Kumiko, The Treasure Hunter (2015) | Kumiko, The Treasure Hunter (2015) | 87 | 63 | 68 | 6.4 | 6.7 | 3.5 | 3.5 | 4.35 | 3.15 | ... | 3.35 | 4.5 | 3.0 | 3.5 | 3.0 | 3.5 | 19 | 5289 | 41 | 0.0 |
Do You Believe? (2015) | Do You Believe? (2015) | 18 | 84 | 22 | 4.7 | 5.4 | 5.0 | 4.5 | 0.90 | 4.20 | ... | 2.70 | 1.0 | 4.0 | 1.0 | 2.5 | 2.5 | 31 | 3136 | 1793 | 0.5 |
Ant-Man (2015) | Ant-Man (2015) | 80 | 90 | 64 | 8.1 | 7.8 | 5.0 | 4.5 | 4.00 | 4.50 | ... | 3.90 | 4.0 | 4.5 | 3.0 | 4.0 | 4.0 | 627 | 103660 | 12055 | 0.5 |
3 rows × 22 columns
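上面注释提到的返回类型差异可以直接验证:选单行得到 Series,选多行得到 DataFrame(演示数据为假设):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# 选单行返回 Series,选多行(传入列表)返回 DataFrame
print(type(df.loc["a"]))
print(type(df.loc[["a", "b"]]))
```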
筛选出float64类型的列,并对每列求标准差:
#The apply() method in Pandas allows us to specify Python logic
#The apply() method requires you to pass in a vectorized operation
#that can be applied over each Series object.
import numpy as np
# returns the data types as a Series
types = fandango_films.dtypes
#print types
# filter data types to just floats, index attributes returns just column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]
#print float_df
# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))
print(deviations)
Metacritic_User 1.505529
IMDB 0.955447
Fandango_Stars 0.538532
Fandango_Ratingvalue 0.501106
RT_norm 1.503265
RT_user_norm 0.997787
Metacritic_norm 0.972522
Metacritic_user_nom 0.752765
IMDB_norm 0.477723
RT_norm_round 1.509404
RT_user_norm_round 1.003559
Metacritic_norm_round 0.987561
Metacritic_user_norm_round 0.785412
IMDB_norm_round 0.501043
Fandango_Difference 0.152141
dtype: float64
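筛选某一类型的列,还可以用 DataFrame.select_dtypes,不必手动过滤 dtypes(演示数据为假设):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b"],
    "score": [1.5, 2.5],
    "count": [3, 4],
})

# 直接按 dtype 挑出 float64 列
float_df = df.select_dtypes(include="float64")
print(float_df.columns.tolist())
```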
利用apply及np.std,沿axis=1对每部电影计算'RT_user_norm'和'Metacritic_user_nom'两项评分的标准差:
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
FILM
Avengers: Age of Ultron (2015) 0.375
Cinderella (2015) 0.125
Ant-Man (2015) 0.225
Do You Believe? (2015) 0.925
Hot Tub Time Machine 2 (2015) 0.150
The Water Diviner (2015) 0.150
Irrational Man (2015) 0.575
Top Five (2014) 0.100
Shaun the Sheep Movie (2015) 0.150
Love & Mercy (2015) 0.050
Far From The Madding Crowd (2015) 0.050
Black Sea (2015) 0.150
Leviathan (2014) 0.175
Unbroken (2014) 0.125
The Imitation Game (2014) 0.250
Taken 3 (2015) 0.000
Ted 2 (2015) 0.175
Southpaw (2015) 0.050
Night at the Museum: Secret of the Tomb (2014) 0.000
Pixels (2015) 0.025
McFarland, USA (2015) 0.425
Insidious: Chapter 3 (2015) 0.325
The Man From U.N.C.L.E. (2015) 0.025
Run All Night (2015) 0.350
Trainwreck (2015) 0.350
Selma (2014) 0.375
Ex Machina (2015) 0.175
Still Alice (2015) 0.175
Wild Tales (2014) 0.100
The End of the Tour (2015) 0.350
...
Clouds of Sils Maria (2015) 0.100
Testament of Youth (2015) 0.000
Infinitely Polar Bear (2015) 0.075
Phoenix (2015) 0.025
The Wolfpack (2015) 0.075
The Stanford Prison Experiment (2015) 0.050
Tangerine (2015) 0.325
Magic Mike XXL (2015) 0.250
Home (2015) 0.200
The Wedding Ringer (2015) 0.825
Woman in Gold (2015) 0.225
The Last Five Years (2015) 0.225
Mission: Impossible – Rogue Nation (2015) 0.250
Amy (2015) 0.075
Jurassic World (2015) 0.275
Minions (2015) 0.125
Max (2015) 0.350
Paul Blart: Mall Cop 2 (2015) 0.300
The Longest Ride (2015) 0.625
The Lazarus Effect (2015) 0.650
The Woman In Black 2 Angel of Death (2015) 0.475
Danny Collins (2015) 0.100
Spare Parts (2015) 0.300
Serena (2015) 0.700
Inside Out (2015) 0.025
Mr. Holmes (2015) 0.025
'71 (2015) 0.175
Two Days, One Night (2014) 0.250
Gett: The Trial of Viviane Amsalem (2015) 0.200
Kumiko, The Treasure Hunter (2015) 0.025
Length: 146, dtype: float64
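最后提醒一点:np.std 与 pandas 的 Series.std 默认自由度不同。np.std 默认 ddof=0(总体标准差),Series.std 默认 ddof=1(样本标准差),两者结果会有差别(演示数据为假设):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

print(np.std(s))      # ddof=0,总体标准差
print(s.std())        # ddof=1,样本标准差
print(s.std(ddof=0))  # 与 np.std(s) 一致
```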
End