Pandas - 2. Selecting Rows and Columns
2022-04-30
陈天睡懒觉
import pandas as pd
df = pd.read_csv('data/gapminder.tsv',sep='\t')
print(df.head())
       country  continent  year  lifeExp       pop   gdpPercap
0  Afghanistan       Asia  1952   28.801   8425333  779.445314
1  Afghanistan       Asia  1957   30.332   9240934  820.853030
2  Afghanistan       Asia  1962   31.997  10267083  853.100710
3  Afghanistan       Asia  1967   34.020  11537966  836.197138
4  Afghanistan       Asia  1972   36.088  13079460  739.981106
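Check the type of each column with df.dtypes or df.info(). Each entry below reads: pandas dtype -- Python type -- meaning
- object -- string -- text
- int64 -- int -- integer
- float64 -- float -- floating-point number
- datetime64 -- datetime -- date and time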
print(df.dtypes)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
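The dtype table above can be tried without the gapminder file. The sketch below is a minimal example on a hypothetical two-row frame: the values are copied from the head() output, while the "observed" column is invented purely to show a datetime dtype. Depending on the pandas version, text columns may print as object or as a dedicated string dtype, so the checks use the helpers in pd.api.types instead of comparing dtype names.

```python
import pandas as pd

# Hypothetical frame: "observed" is not a gapminder column, it is added
# here only so that a datetime dtype shows up.
demo = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": [1952, 1957],
    "lifeExp": [28.801, 30.332],
    "observed": pd.to_datetime(["1952-01-01", "1957-01-01"]),
})

print(demo.dtypes)

# These helpers test the kind of dtype without depending on the exact
# name pandas prints for it.
is_int = pd.api.types.is_integer_dtype(demo["year"])
is_float = pd.api.types.is_float_dtype(demo["lifeExp"])
is_datetime = pd.api.types.is_datetime64_any_dtype(demo["observed"])
print(is_int, is_float, is_datetime)
```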
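Check the number of rows and columns
# df.shape is an attribute, not a method; calling df.shape() raises a TypeError
print(df.shape) # (number of rows, number of columns)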
(1704, 6)
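Because shape is a plain tuple, it can be unpacked directly, and calling it fails the way the comment above warns. A minimal sketch on a small stand-in frame (the name demo is illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1952, 1957, 1962],
})

# shape is a tuple, so it unpacks into two integers.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

error_message = ""
try:
    demo.shape()  # the mistake: calling a tuple
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```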
Get the column names and the row index
# df.columns (column names)
print(df.columns)
# df.index (row index)
print(df.index)
print(list(df.index)[:10])
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
RangeIndex(start=0, stop=1704, step=1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
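Both df.columns and df.index are Index objects, which is why the output above is wrapped in Index(...) and RangeIndex(...). The sketch below, on an illustrative three-row frame, converts them to plain lists and shows that the row labels stop being 0, 1, 2 once another column is used as the index.

```python
import pandas as pd

demo = pd.DataFrame(
    {"year": [1952, 1957, 1962], "lifeExp": [28.801, 30.332, 31.997]}
)

column_names = demo.columns.tolist()  # plain Python list of column names
row_labels = demo.index.tolist()      # plain Python list of row labels
print(column_names, row_labels)

# After set_index, the row labels are the values of the chosen column.
by_year = demo.set_index("year")
year_labels = by_year.index.tolist()
print(year_labels)
```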
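Select a subset of columns
# single column
continent = df.continent # attribute access needs a column name that is a valid Python identifier and not an existing DataFrame attribute (df.pop is a method, not the pop column)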
continent = df['continent']
print(continent[:5])
# multiple columns
year_continent = df[['year','continent']]
print(year_continent[:5])
0 Asia
1 Asia
2 Asia
3 Asia
4 Asia
Name: continent, dtype: object
   year  continent
0  1952       Asia
1  1957       Asia
2  1962       Asia
3  1967       Asia
4  1972       Asia
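The two outputs above differ in type: a single label returns a Series, while a list of labels returns a DataFrame, even when the list has only one entry. The sketch below checks this on a small frame built from the first rows shown earlier, and also shows why attribute access is unreliable for the pop column of this dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1957, 1962],
    "pop": [8425333, 9240934, 10267083],
})

one_col = demo["continent"]          # single label -> Series
one_col_frame = demo[["continent"]]  # list with one label -> DataFrame
print(type(one_col).__name__, type(one_col_frame).__name__)

# "pop" is also the name of a DataFrame method, so attribute access
# returns the method rather than the column.
pop_attr_is_method = callable(demo.pop)
pop_values = demo["pop"].tolist()    # bracket access always gets the column
print(pop_attr_is_method, pop_values)
```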
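Select a subset of rows
- by row label (loc)
- by row position (iloc)
# select one row
sample = df.loc[0] # a single row comes back as a Series
print(sample)
# select multiple rows
samples = df.loc[[0,100,200]]
print(samples)
# df.loc[-1] raises a KeyError because no row has the label -1
# select one row
sample = df.iloc[0] # a single row comes back as a Series
# select multiple rows
samples = df.iloc[[0,100,200]]
# iloc accepts negative positions; -1 is the last row
sample = df.iloc[-1]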
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445
Name: 0, dtype: object
          country  continent  year  lifeExp       pop   gdpPercap
  0   Afghanistan       Asia  1952   28.801   8425333  779.445314
100    Bangladesh       Asia  1972   45.252  70759295  630.233627
200  Burkina Faso     Africa  1992   50.260   8878303  931.752773
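With the default 0..n-1 index, loc and iloc happen to return the same rows, which hides the difference between a label and a position. A minimal sketch, assuming a small frame that is filtered so labels and positions no longer line up (demo and recent are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
})

# Keep only rows from 1962 onward; the original labels 2, 3, 4 survive.
recent = demo[demo["year"] >= 1962]
remaining_labels = recent.index.tolist()

first_by_position = recent.iloc[0]  # first row of the filtered frame
row_labelled_2 = recent.loc[2]      # row whose label is 2 (the same row)
last_row = recent.iloc[-1]          # negative positions count from the end

label_0_missing = False
try:
    recent.loc[0]                   # label 0 was filtered out
except KeyError:
    label_0_missing = True

print(remaining_labels, label_0_missing)
```

After filtering, position 0 holds the row labelled 2, so iloc[0] and loc[2] agree while loc[0] fails.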
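Combined: select a subset of rows and columns
In loc[rows, columns] and iloc[rows, columns], the part left of the comma selects rows and the part right of it selects columns.
# select entire columns
subset = df.loc[:,['year','pop']]
subset = df.iloc[:,[1,3,-1]] # iloc selects columns by integer position
subset = df.iloc[:,3:6]
subset = df.iloc[:,:3]
# multiple rows and multiple columns
subset = df.loc[[1,10,20],['year','pop']]
subset = df.iloc[[1,10,20],[1,-1]]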
print(subset)
    continent    gdpPercap
 1       Asia   820.853030
10       Asia   726.734055
20     Europe  2497.437901
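One detail worth checking when mixing rows and columns is how slices behave: a loc slice includes both endpoints, while an iloc slice excludes the end, like a normal Python slice. The sketch below demonstrates this on a small frame built from the head() rows, and also passes a boolean mask as the row selector to loc, which the examples above do not cover.

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "continent": ["Asia"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
})

# Label slices include both ends: rows 0, 1, 2 and three columns.
by_label = demo.loc[0:2, "continent":"lifeExp"]

# Position slices exclude the end: rows 0, 1 and columns 1, 2.
by_position = demo.iloc[0:2, 1:3]

# A boolean mask can stand in for the row selector in loc.
later = demo.loc[demo["year"] > 1960, ["year", "lifeExp"]]

print(by_label.shape, by_position.shape, later.shape)
```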