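(6) Learning pandas, Part 1: Python Data Analysis and Machine Learning in Practice
Original article. Last updated: 2018-05-2
1. Reading data with pandas
2. Indexing and computation with pandas
Course source: python数据分析与机器学习实战 (Python Data Analysis and Machine Learning in Practice), by 唐宇迪 (Tang Yudi)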
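1. Reading data with pandas
1.1 Using the read_csv function
food_info.csv lists foods together with their nutritional measurements, such as vitamin content. A .csv file is a text file whose fields are separated by commas.
Use the read_csv function to read the file's data.
The result, food_info, is a DataFrame.
Use the dtypes attribute to see which data type each column holds. object is the dtype used for strings.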
import pandas
food_info=pandas.read_csv("food_info.csv")
type(food_info)
Out[13]: pandas.core.frame.DataFrame
food_info.dtypes
Out[14]:
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
Protein_(g) float64
Lipid_Tot_(g) float64
Ash_(g) float64
Carbohydrt_(g) float64
Fiber_TD_(g) float64
Sugar_Tot_(g) float64
Calcium_(mg) float64
Iron_(mg) float64
Magnesium_(mg) float64
Phosphorus_(mg) float64
Potassium_(mg) float64
Sodium_(mg) float64
Zinc_(mg) float64
Copper_(mg) float64
Manganese_(mg) float64
Selenium_(mcg) float64
Vit_C_(mg) float64
Thiamin_(mg) float64
Riboflavin_(mg) float64
Niacin_(mg) float64
Vit_B6_(mg) float64
Vit_B12_(mcg) float64
Vit_A_IU float64
Vit_A_RAE float64
Vit_E_(mg) float64
Vit_D_mcg float64
Vit_D_IU float64
Vit_K_(mcg) float64
FA_Sat_(g) float64
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object
Use help(pandas.read_csv) to read the full documentation for read_csv: general usage, parameters, and so on.
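To give a feel for the parameters that help() lists, here is a minimal sketch that uses two of them, usecols and nrows. It reads from an in-memory string instead of the real file, so the three rows below are only a stand-in for food_info.csv and the variable names csv_text and sample are illustrative.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for food_info.csv (assumption: same column names).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1002,BUTTER WHIPPED WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
"""

# usecols keeps only the named columns; nrows stops after that many data rows.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["NDB_No", "Energ_Kcal"], nrows=2)

print(sample.shape)   # (2, 2)
print(sample.dtypes)  # both remaining columns hold integers
```

With the real file, the same call would take the path "food_info.csv" in place of the StringIO object.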
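Aside:
read_csv takes the path of the CSV file as its argument. Here it is a relative path, resolved against the current working directory. So how do you find out what the current working directory is?
Use os.getcwd() to get the current working directory.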
import os
os.getcwd()
Out[11]: 'C:\\Users\\Administrator'
Place the file in that directory, here 'C:\Users\Administrator', so that the relative path resolves.
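If read_csv cannot find the file, it can help to build the full path yourself and check it before reading. The sketch below assumes the file is called food_info.csv, as in this tutorial; the directory is whatever os.getcwd() returns on your machine.

```python
import os

# Join the current working directory with the file name used in this tutorial.
csv_path = os.path.join(os.getcwd(), "food_info.csv")

print(csv_path)
# False here means read_csv("food_info.csv") would fail from this directory.
print(os.path.exists(csv_path))
```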
1.2 Using the head function
head() displays the first 5 rows of the data read from food_info.csv.
import pandas as pd
food_info = pd.read_csv("food_info.csv")
food_info.head()
Out[19]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
3 1004 CHEESE BLUE 42.41 353 21.40
4 1005 CHEESE BRICK 41.11 371 23.24
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
0 81.11 2.11 0.06 0.0 0.06
1 81.11 2.11 0.06 0.0 0.06
2 99.48 0.00 0.00 0.0 0.00
3 28.74 5.11 2.34 0.0 0.50
4 29.68 3.18 2.79 0.0 0.51
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
0 ... 2499.0 684.0 2.32 1.5 60.0
1 ... 2499.0 684.0 2.32 1.5 60.0
2 ... 3069.0 840.0 2.80 1.8 73.0
3 ... 721.0 198.0 0.25 0.5 21.0
4 ... 1080.0 292.0 0.26 0.5 22.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 7.0 51.368 21.021 3.043 215.0
1 7.0 50.489 23.426 3.012 219.0
2 8.6 61.924 28.732 3.694 256.0
3 2.4 18.669 7.778 0.800 75.0
4 2.5 18.764 8.598 0.784 94.0
[5 rows x 36 columns]
What if you only want the first 3 rows?
Calling head(3) displays just the first 3 rows.
food_info.head(3)
Out[20]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
0 81.11 2.11 0.06 0.0 0.06
1 81.11 2.11 0.06 0.0 0.06
2 99.48 0.00 0.00 0.0 0.00
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
0 ... 2499.0 684.0 2.32 1.5 60.0
1 ... 2499.0 684.0 2.32 1.5 60.0
2 ... 3069.0 840.0 2.80 1.8 73.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 7.0 51.368 21.021 3.043 215.0
1 7.0 50.489 23.426 3.012 219.0
2 8.6 61.924 28.732 3.694 256.0
[3 rows x 36 columns]
1.3 Using the tail function
tail() displays the last 5 rows by default.
import pandas as pd
food_info = pd.read_csv("food_info.csv")
food_info.tail()
Out[21]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
8613 83110 MACKEREL SALTED 43.00 305 18.50
8614 90240 SCALLOP (BAY&SEA) CKD STMD 70.25 111 20.54
8615 90480 SYRUP CANE 26.00 269 0.00
8616 90560 SNAIL RAW 79.20 90 16.10
8617 93600 TURTLE GREEN RAW 78.50 89 19.80
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
8613 25.10 13.40 0.00 0.0 0.0
8614 0.84 2.97 5.41 0.0 0.0
8615 0.00 0.86 73.14 0.0 73.2
8616 1.40 1.30 2.00 0.0 0.0
8617 0.50 1.20 0.00 0.0 0.0
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
8613 ... 157.0 47.0 2.38 25.2 1006.0
8614 ... 5.0 2.0 0.00 0.0 2.0
8615 ... 0.0 0.0 0.00 0.0 0.0
8616 ... 100.0 30.0 5.00 0.0 0.0
8617 ... 100.0 30.0 0.50 0.0 0.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
8613 7.8 7.148 8.320 6.210 95.0
8614 0.0 0.218 0.082 0.222 41.0
8615 0.0 0.000 0.000 0.000 0.0
8616 0.1 0.361 0.259 0.252 50.0
8617 0.1 0.127 0.088 0.170 50.0
[5 rows x 36 columns]
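Both head and tail take an optional row count. A small sketch on a made-up ten-row frame (df is hypothetical, not the food data) makes the counts easy to verify:

```python
import pandas as pd

# Hypothetical ten-row frame, used only so the row counts are easy to check.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)        # rows 0, 1, 2
last_two = df.tail(2)           # rows 8, 9
all_but_last_two = df.head(-2)  # negative n: everything except the last 2 rows

print(first_three.index.tolist())  # [0, 1, 2]
print(last_two.index.tolist())     # [8, 9]
print(len(all_but_last_two))       # 8
```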
1.4 Using the columns attribute
The columns attribute returns the column names, i.e. the header row of the file.
food_info.columns
Out[22]:
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
1.5 Using the shape attribute
The shape attribute tells you how many rows and columns the DataFrame has.
food_info.shape
Out[23]: (8618, 36)
This dataset has 8,618 samples (rows), each with 36 features (columns).
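Because shape is a plain tuple, it can be unpacked directly. A minimal sketch on a made-up frame (the names df, n_rows and n_cols are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape

print(n_rows, n_cols)  # 3 2
print(len(df))         # 3, len() counts rows only
```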
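2. Indexing and computation with pandas
2.1 Selecting data with loc (rows)
For label-based indexing of DataFrame rows, pandas provides the special indexing operator loc. It lets you select a subset of the rows and columns of a DataFrame by axis label, using NumPy-like notation.
import pandas as pd
food_info = pd.read_csv("food_info.csv")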
food_info.loc[0]
Out[24]:
NDB_No 1001
Shrt_Desc BUTTER WITH SALT
Water_(g) 15.87
Energ_Kcal 717
Protein_(g) 0.85
Lipid_Tot_(g) 81.11
Ash_(g) 2.11
Carbohydrt_(g) 0.06
Fiber_TD_(g) 0
Sugar_Tot_(g) 0.06
Calcium_(mg) 24
Iron_(mg) 0.02
Magnesium_(mg) 2
Phosphorus_(mg) 24
Potassium_(mg) 24
Sodium_(mg) 643
Zinc_(mg) 0.09
Copper_(mg) 0
Manganese_(mg) 0
Selenium_(mcg) 1
Vit_C_(mg) 0
Thiamin_(mg) 0.005
Riboflavin_(mg) 0.034
Niacin_(mg) 0.042
Vit_B6_(mg) 0.003
Vit_B12_(mcg) 0.17
Vit_A_IU 2499
Vit_A_RAE 684
Vit_E_(mg) 2.32
Vit_D_mcg 1.5
Vit_D_IU 60
Vit_K_(mcg) 7
FA_Sat_(g) 51.368
FA_Mono_(g) 21.021
FA_Poly_(g) 3.043
Cholestrl_(mg) 215
Name: 0, dtype: object
food_info.loc[0] returns the row labeled 0, which is the first data row of the file; row labels start at 0.
food_info.loc[6] returns the row labeled 6.
food_info.loc[6]
Out[26]:
NDB_No 1007
Shrt_Desc CHEESE CAMEMBERT
Water_(g) 51.8
Energ_Kcal 300
Protein_(g) 19.8
Lipid_Tot_(g) 24.26
Ash_(g) 3.68
Carbohydrt_(g) 0.46
Fiber_TD_(g) 0
Sugar_Tot_(g) 0.46
Calcium_(mg) 388
Iron_(mg) 0.33
Magnesium_(mg) 20
Phosphorus_(mg) 347
Potassium_(mg) 187
Sodium_(mg) 842
Zinc_(mg) 2.38
Copper_(mg) 0.021
Manganese_(mg) 0.038
Selenium_(mcg) 14.5
Vit_C_(mg) 0
Thiamin_(mg) 0.028
Riboflavin_(mg) 0.488
Niacin_(mg) 0.63
Vit_B6_(mg) 0.227
Vit_B12_(mcg) 1.3
Vit_A_IU 820
Vit_A_RAE 241
Vit_E_(mg) 0.21
Vit_D_mcg 0.4
Vit_D_IU 18
Vit_K_(mcg) 2
FA_Sat_(g) 15.259
FA_Mono_(g) 7.023
FA_Poly_(g) 0.724
Cholestrl_(mg) 72
Name: 6, dtype: object
If the label is not in the index, for example because it is past the last row, a KeyError is raised:
food_info.loc[8620]
KeyError: 'the label [8620] is not in the [index]'
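One way to avoid the exception is to test whether the label exists before looking it up, or to catch the KeyError. This is only a sketch on a three-row stand-in frame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Energ_Kcal": [717, 717, 876]})  # index labels 0, 1, 2

label = 8620

# Option 1: check membership in the index first.
if label in df.index:
    row = df.loc[label]
else:
    row = None
    print(f"label {label} is not in the index")

# Option 2: attempt the lookup and handle the failure.
caught = False
try:
    df.loc[label]
except KeyError:
    caught = True

print(caught)  # True
```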
Aside:
Common dtypes found in a DataFrame:
- object - for string values
- int - for integer values
- float - for floating-point values
- datetime - for time values
- bool - for Boolean values
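To see these dtypes side by side, the sketch below builds a small frame with one column of each kind. The column names and values are made up for illustration. Note that the string column is reported as object in the pandas version this tutorial was written with; newer versions may report a dedicated string dtype instead.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["butter", "cheese"],                              # strings
    "kcal": [717, 353],                                        # integers
    "water_g": [15.87, 42.41],                                 # floats
    "measured": pd.to_datetime(["2018-05-01", "2018-05-02"]),  # datetimes
    "dairy": [True, True],                                     # booleans
})

print(df.dtypes)
```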
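loc also accepts a slice of labels or a list of labels:
For example, to get rows 3, 4, 5 and 6 (unlike a Python list slice, a loc slice includes its end label):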
food_info.loc[3:6]
Out[28]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
3 1004 CHEESE BLUE 42.41 353 21.40
4 1005 CHEESE BRICK 41.11 371 23.24
5 1006 CHEESE BRIE 48.42 334 20.75
6 1007 CHEESE CAMEMBERT 51.80 300 19.80
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
3 28.74 5.11 2.34 0.0 0.50
4 29.68 3.18 2.79 0.0 0.51
5 27.68 2.70 0.45 0.0 0.45
6 24.26 3.68 0.46 0.0 0.46
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
3 ... 721.0 198.0 0.25 0.5 21.0
4 ... 1080.0 292.0 0.26 0.5 22.0
5 ... 592.0 174.0 0.24 0.5 20.0
6 ... 820.0 241.0 0.21 0.4 18.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
3 2.4 18.669 7.778 0.800 75.0
4 2.5 18.764 8.598 0.784 94.0
5 2.3 17.410 8.013 0.826 100.0
6 2.0 15.259 7.023 0.724 72.0
[4 rows x 36 columns]
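The fact that loc[3:6] returns four rows is easy to trip over, because position-based slicing with iloc excludes the stop value. A short comparison on a made-up frame (df is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

by_label = df.loc[3:6]      # label slice: both endpoints included
by_position = df.iloc[3:6]  # position slice: stop excluded, like a Python list

print(by_label.index.tolist())     # [3, 4, 5, 6]
print(by_position.index.tolist())  # [3, 4, 5]
```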
For example, to get rows 2, 5 and 10:
two_five_ten = [2, 5, 10]
food_info.loc[two_five_ten]  # equivalent to food_info.loc[[2, 5, 10]]
Out[33]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
5 1006 CHEESE BRIE 48.42 334 20.75
10 1011 CHEESE COLBY 38.20 394 23.76
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
2 99.48 0.00 0.00 0.0 0.00
5 27.68 2.70 0.45 0.0 0.45
10 32.11 3.36 2.57 0.0 0.52
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
2 ... 3069.0 840.0 2.80 1.8 73.0
5 ... 592.0 174.0 0.24 0.5 20.0
10 ... 994.0 264.0 0.28 0.6 24.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
2 8.6 61.924 28.732 3.694 256.0
5 2.3 17.410 8.013 0.826 100.0
10 2.7 20.218 9.280 0.953 95.0
[3 rows x 36 columns]
2.2 Selecting data by column name (columns)
How do you extract a single column?
Use food_info[column_name], passing the column name inside the brackets:
ndb_col=food_info["NDB_No"]
ndb_col
Out[36]:
0 1001
1 1002
2 1003
3 1004
4 1005
5 1006
6 1007
7 1008
8 1009
9 1010
10 1011
11 1012
12 1013
13 1014
14 1015
15 1016
16 1017
17 1018
18 1019
19 1020
20 1021
21 1022
22 1023
23 1024
24 1025
25 1026
26 1027
27 1028
28 1029
29 1030
...
8588 43544
8589 43546
8590 43550
8591 43566
8592 43570
8593 43572
8594 43585
8595 43589
8596 43595
8597 43597
8598 43598
8599 44005
8600 44018
8601 44048
8602 44055
8603 44061
8604 44074
8605 44110
8606 44158
8607 44203
8608 44258
8609 44259
8610 44260
8611 48052
8612 80200
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
This is all the data in the column named "NDB_No".
How do you extract two columns?
Use food_info[[name1, name2]], passing the column names as a list:
food_info[["Zinc_(mg)","Copper_(mg)"]]
Out[37]:
Zinc_(mg) Copper_(mg)
0 0.09 0.000
1 0.05 0.016
2 0.01 0.001
3 2.66 0.040
4 2.60 0.024
5 2.38 0.019
6 2.38 0.021
7 2.94 0.024
8 3.43 0.056
9 2.79 0.042
10 3.07 0.042
11 0.40 0.029
12 0.33 0.040
13 0.47 0.030
14 0.51 0.033
15 0.38 0.028
16 0.51 0.019
17 3.75 0.036
18 2.88 0.032
19 3.50 0.025
20 1.14 0.080
21 3.90 0.036
22 3.90 0.032
23 2.10 0.021
24 3.00 0.032
25 2.92 0.011
26 2.46 0.022
27 2.76 0.025
28 3.61 0.034
29 2.81 0.031
... ...
8588 3.30 0.377
8589 0.05 0.040
8590 0.05 0.030
8591 1.15 0.116
8592 5.03 0.200
8593 3.83 0.545
8594 0.08 0.035
8595 3.90 0.027
8596 4.10 0.100
8597 3.13 0.027
8598 0.13 0.000
8599 0.02 0.000
8600 0.09 0.037
8601 0.21 0.026
8602 2.77 0.571
8603 0.41 0.838
8604 0.05 0.028
8605 0.03 0.023
8606 0.10 0.112
8607 0.02 0.020
8608 1.49 0.854
8609 0.19 0.040
8610 0.10 0.038
8611 0.85 0.182
8612 1.00 0.250
8613 1.10 0.100
8614 1.55 0.033
8615 0.19 0.020
8616 1.00 0.400
8617 1.00 0.250
[8618 rows x 2 columns]
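The two forms return different types: a single name gives a Series, while a list of names gives a DataFrame, even when the list holds only one name. A minimal sketch on a two-row stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"Zinc_(mg)": [0.09, 0.05], "Copper_(mg)": [0.0, 0.016]})

one_col = df["Zinc_(mg)"]                    # Series
one_col_frame = df[["Zinc_(mg)"]]            # DataFrame with one column
two_cols = df[["Zinc_(mg)", "Copper_(mg)"]]  # DataFrame with two columns

print(type(one_col).__name__)        # Series
print(type(one_col_frame).__name__)  # DataFrame
print(two_cols.shape)                # (2, 2)
```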
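How do you find the column names that end with (g)?
First get all the column names, then check which ones end with (g). tolist() converts the column index to a plain Python list.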
col_names=food_info.columns.tolist()
col_names
Out[39]:
['NDB_No',
'Shrt_Desc',
'Water_(g)',
'Energ_Kcal',
'Protein_(g)',
'Lipid_Tot_(g)',
'Ash_(g)',
'Carbohydrt_(g)',
'Fiber_TD_(g)',
'Sugar_Tot_(g)',
'Calcium_(mg)',
'Iron_(mg)',
'Magnesium_(mg)',
'Phosphorus_(mg)',
'Potassium_(mg)',
'Sodium_(mg)',
'Zinc_(mg)',
'Copper_(mg)',
'Manganese_(mg)',
'Selenium_(mcg)',
'Vit_C_(mg)',
'Thiamin_(mg)',
'Riboflavin_(mg)',
'Niacin_(mg)',
'Vit_B6_(mg)',
'Vit_B12_(mcg)',
'Vit_A_IU',
'Vit_A_RAE',
'Vit_E_(mg)',
'Vit_D_mcg',
'Vit_D_IU',
'Vit_K_(mcg)',
'FA_Sat_(g)',
'FA_Mono_(g)',
'FA_Poly_(g)',
'Cholestrl_(mg)']
Then check which elements of col_names end with (g):
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):  # equivalent to c[-3:] == "(g)"
        gram_columns.append(c)
gram_df = food_info[gram_columns]
gram_df.head(3)
Out[43]:
Water_(g) Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) \
0 15.87 0.85 81.11 2.11 0.06
1 15.87 0.85 81.11 2.11 0.06
2 0.24 0.28 99.48 0.00 0.00
Fiber_TD_(g) Sugar_Tot_(g) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g)
0 0.0 0.06 51.368 21.021 3.043
1 0.0 0.06 50.489 23.426 3.012
2 0.0 0.00 61.924 28.732 3.694
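The same selection can be written more compactly. The sketch below shows a list comprehension and a boolean mask built with the string methods of the column index; it runs on a one-row stand-in frame rather than the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Iron_(mg)": [0.02],
    "Protein_(g)": [0.85],
})

# List comprehension version of the loop.
gram_columns = [c for c in df.columns if c.endswith("(g)")]

# Vectorised version: a boolean mask over the column labels.
mask = df.columns.str.endswith("(g)")
gram_df = df.loc[:, mask]

print(gram_columns)           # ['Water_(g)', 'Protein_(g)']
print(list(gram_df.columns))  # ['Water_(g)', 'Protein_(g)']
```

"Iron_(mg)" is not matched, because its last three characters are "mg)" rather than "(g)".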
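2.3 Arithmetic with pandas
One of the most important features of pandas is arithmetic between objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the two indexes. For users with database experience, this is like an automatic outer join on the index labels.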
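The alignment behaviour described above can be seen with two small Series whose indexes only partly overlap. The Series s1 and s2 are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union a, b, c, d.
# Labels present on only one side (a and d) become NaN.
total = s1 + s2
print(total)

# add() with fill_value treats a missing side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]
```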
As a simple example, convert a column measured in mg into one measured in g.
First, print the "Iron_(mg)" data:
food_info["Iron_(mg)"]
Out[58]:
0 0.02
1 0.16
2 0.00
3 0.31
4 0.43
5 0.50
6 0.33
7 0.64
8 0.16
9 0.21
10 0.76
11 0.07
12 0.16
13 0.15
14 0.13
15 0.14
16 0.38
17 0.44
18 0.65
19 0.23
20 0.52
21 0.24
22 0.17
23 0.13
24 0.72
25 0.44
26 0.20
27 0.22
28 0.23
29 0.41
...
8588 9.00
8589 0.30
8590 0.10
8591 1.63
8592 34.82
8593 2.28
8594 0.17
8595 0.17
8596 4.86
8597 0.25
8598 0.23
8599 0.13
8600 0.11
8601 0.68
8602 7.83
8603 3.11
8604 0.30
8605 0.18
8606 0.80
8607 0.04
8608 3.87
8609 0.05
8610 0.38
8611 5.20
8612 1.50
8613 1.40
8614 0.58
8615 3.60
8616 3.50
8617 1.40
Name: Iron_(mg), Length: 8618, dtype: float64
food_info[column_name] / 1000 converts mg to g. As in NumPy, adding, subtracting, multiplying or dividing by a single number applies the operation to every element.
div_1000=food_info["Iron_(mg)"]/1000
div_1000
Out[60]:
0 0.00002
1 0.00016
2 0.00000
3 0.00031
4 0.00043
5 0.00050
6 0.00033
7 0.00064
8 0.00016
9 0.00021
10 0.00076
11 0.00007
12 0.00016
13 0.00015
14 0.00013
15 0.00014
16 0.00038
17 0.00044
18 0.00065
19 0.00023
20 0.00052
21 0.00024
22 0.00017
23 0.00013
24 0.00072
25 0.00044
26 0.00020
27 0.00022
28 0.00023
29 0.00041
...
8588 0.00900
8589 0.00030
8590 0.00010
8591 0.00163
8592 0.03482
8593 0.00228
8594 0.00017
8595 0.00017
8596 0.00486
8597 0.00025
8598 0.00023
8599 0.00013
8600 0.00011
8601 0.00068
8602 0.00783
8603 0.00311
8604 0.00030
8605 0.00018
8606 0.00080
8607 0.00004
8608 0.00387
8609 0.00005
8610 0.00038
8611 0.00520
8612 0.00150
8613 0.00140
8614 0.00058
8615 0.00360
8616 0.00350
8617 0.00140
Name: Iron_(mg), Length: 8618, dtype: float64
Other similar exercises:
add_100 = food_info["Iron_(mg)"] + 100
sub_100 = food_info["Iron_(mg)"] - 100
mult_2 = food_info["Iron_(mg)"]*2
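Combining scalar arithmetic with the endswith check from section 2.2, every mg column can be converted in one loop. This is a sketch on a small stand-in frame; naming the new columns by replacing "(mg)" with "(g)" is an illustrative choice, not something the course prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "Iron_(mg)": [0.02, 0.16],
    "Zinc_(mg)": [0.09, 0.05],
    "Water_(g)": [15.87, 15.87],
})

mg_columns = [c for c in df.columns if c.endswith("(mg)")]
for c in mg_columns:
    # "Iron_(mg)" -> "Iron_(g)"; dividing by 1000 applies to every row.
    df[c.replace("(mg)", "(g)")] = df[c] / 1000

print(df.columns.tolist())
```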
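You can also combine two columns, provided they have the same length. Arithmetic between them is applied row by row, pairing up the values at matching positions: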
food_info["Water_(g)"]
Out[61]:
0 15.87
1 15.87
2 0.24
3 42.41
4 41.11
5 48.42
6 51.80
7 39.28
8 37.10
9 37.65
10 38.20
11 79.79
12 79.64
13 81.01
14 81.24
15 82.48
16 54.44
17 41.56
18 55.22
19 37.92
20 13.44
21 41.46
22 33.19
23 48.42
24 41.01
25 50.01
26 48.38
27 53.78
28 45.54
29 41.77
...
8588 2.00
8589 76.70
8590 83.10
8591 1.30
8592 5.00
8593 2.80
8594 81.60
8595 59.60
8596 14.50
8597 49.90
8598 21.70
8599 0.00
8600 23.90
8601 55.50
8602 9.00
8603 4.20
8604 84.40
8605 53.00
8606 54.66
8607 28.24
8608 6.80
8609 10.40
8610 6.84
8611 8.20
8612 81.90
8613 43.00
8614 70.25
8615 26.00
8616 79.20
8617 78.50
Name: Water_(g), Length: 8618, dtype: float64
food_info["Energ_Kcal"]
Out[62]:
0 717
1 717
2 876
3 353
4 371
5 334
6 300
7 376
8 406
9 387
10 394
11 98
12 97
13 72
14 81
15 72
16 342
17 357
18 264
19 389
20 466
21 356
22 413
23 327
24 373
25 300
26 318
27 254
28 301
29 368
...
8588 389
8589 91
8590 68
8591 465
8592 401
8593 429
8594 73
8595 179
8596 377
8597 280
8598 688
8599 884
8600 279
8601 257
8602 319
8603 356
8604 62
8605 179
8606 181
8607 287
8608 365
8609 351
8610 350
8611 370
8612 73
8613 305
8614 111
8615 269
8616 90
8617 89
Name: Energ_Kcal, Length: 8618, dtype: int64
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
water_energy
Out[64]:
0 11378.79
1 11378.79
2 210.24
3 14970.73
4 15251.81
5 16172.28
6 15540.00
7 14769.28
8 15062.60
9 14570.55
10 15050.80
11 7819.42
12 7725.08
13 5832.72
14 6580.44
15 5938.56
16 18618.48
17 14836.92
18 14578.08
19 14750.88
20 6263.04
21 14759.76
22 13707.47
23 15833.34
24 15296.73
25 15003.00
26 15384.84
27 13660.12
28 13707.54
29 15371.36
...
8588 778.00
8589 6979.70
8590 5650.80
8591 604.50
8592 2005.00
8593 1201.20
8594 5956.80
8595 10668.40
8596 5466.50
8597 13972.00
8598 14929.60
8599 0.00
8600 6668.10
8601 14263.50
8602 2871.00
8603 1495.20
8604 5232.80
8605 9487.00
8606 9893.46
8607 8104.88
8608 2482.00
8609 3650.40
8610 2394.00
8611 3034.00
8612 5978.70
8613 13115.00
8614 7797.75
8615 6994.00
8616 7128.00
8617 6986.50
Length: 8618, dtype: float64
Other similar exercises:
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
Now let's look at the shape of food_info:
food_info.shape
Out[65]: (8618, 36)
So there are 8,618 rows and 36 columns. How do we add a new column?
Example: "Iron_(mg)" is measured in mg; add a new column holding the same values in g.
iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams
food_info["Iron_(g)"]
Out[68]:
0 0.00002
1 0.00016
2 0.00000
3 0.00031
4 0.00043
5 0.00050
6 0.00033
7 0.00064
8 0.00016
9 0.00021
10 0.00076
11 0.00007
12 0.00016
13 0.00015
14 0.00013
15 0.00014
16 0.00038
17 0.00044
18 0.00065
19 0.00023
20 0.00052
21 0.00024
22 0.00017
23 0.00013
24 0.00072
25 0.00044
26 0.00020
27 0.00022
28 0.00023
29 0.00041
...
8588 0.00900
8589 0.00030
8590 0.00010
8591 0.00163
8592 0.03482
8593 0.00228
8594 0.00017
8595 0.00017
8596 0.00486
8597 0.00025
8598 0.00023
8599 0.00013
8600 0.00011
8601 0.00068
8602 0.00783
8603 0.00311
8604 0.00030
8605 0.00018
8606 0.00080
8607 0.00004
8608 0.00387
8609 0.00005
8610 0.00038
8611 0.00520
8612 0.00150
8613 0.00140
8614 0.00058
8615 0.00360
8616 0.00350
8617 0.00140
Name: Iron_(g), Length: 8618, dtype: float64
Printing food_info.shape shows that it has changed from (8618, 36) to (8618, 37): food_info has gained a column.
food_info.shape
Out[69]: (8618, 37)
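Assigning with food_info["Iron_(g)"] = ... modifies the DataFrame in place. If you would rather keep the original unchanged, DataFrame.assign returns a new frame with the extra column. A sketch on a stand-in frame; the keyword arguments are passed as a dict because the column name contains parentheses.

```python
import pandas as pd

df = pd.DataFrame({"Iron_(mg)": [0.02, 0.16, 0.00]})

# assign returns a copy with the new column; df itself is left as it was.
with_grams = df.assign(**{"Iron_(g)": df["Iron_(mg)"] / 1000})

print(df.shape)          # (3, 1)
print(with_grams.shape)  # (3, 2)
```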
How do you find the maximum value of a column?
Call max() on the column, i.e. food_info[column_name].max():
max_calories=food_info["Energ_Kcal"].max()
max_calories
Out[71]: 902
Other related exercises:
normalized_calories = food_info["Energ_Kcal"] / max_calories
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
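Dividing a column by its own maximum rescales it so that the largest value becomes 1. The sketch below checks this on a three-row stand-in frame. It assumes the values are non-negative, as they are for these nutrition columns, so the results fall between 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({
    "Energ_Kcal": [717, 876, 353],
    "Protein_(g)": [0.85, 0.28, 21.40],
})

normalized_calories = df["Energ_Kcal"] / df["Energ_Kcal"].max()
normalized_protein = df["Protein_(g)"] / df["Protein_(g)"].max()

print(normalized_calories.max())  # 1.0
print(normalized_protein.max())   # 1.0
```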