Pandas CSV - read_csv / to_csv()
2021-05-14
shellblock
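CSV (Comma-Separated Values, sometimes called character-separated values because the delimiter need not be a comma) files store tabular data (numbers and text) as plain text.
CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications.
This article uses meal_order_info.csv as its example.
Syntax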
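Basic syntax (the signature shown is from an older pandas release; pandas 2.0 removed several of these parameters, including squeeze, prefix, mangle_dupe_cols, error_bad_lines, and warn_bad_lines):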
pd.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]],
sep=',', delimiter=None, header='infer', names=None, index_col=None,
usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True,
dtype=None, engine=None, converters=None, true_values=None,
false_values=None, skipinitialspace=False, skiprows=None,
skipfooter=0, nrows=None, na_values=None, keep_default_na=True,
na_filter=True, verbose=False, skip_blank_lines=True,
parse_dates=False, infer_datetime_format=False,
keep_date_col=False, date_parser=None, dayfirst=False,
cache_dates=True, iterator=False, chunksize=None,
compression='infer', thousands=None, decimal: str = '.',
lineterminator=None, quotechar='"', quoting=0,
doublequote=True, escapechar=None, comment=None,
encoding=None, dialect=None, error_bad_lines=True,
warn_bad_lines=True, delim_whitespace=False,
low_memory=True, memory_map=False, float_precision=None)
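Parameters
Common parameters of pandas.read_csv:
Parameter | Description |
---|---|
filepath_or_buffer | Accepts a str giving the file path; no default |
sep | Accepts a str giving the field separator; defaults to "," |
header | Accepts an int or a list of ints giving the row(s) to use as column names. An int n uses row n (counted from 0); a list of ints builds multi-level column names from those rows. Defaults to 'infer', which detects the header automatically |
names | Accepts an array-like of column names to use |
index_col | Accepts an int, a sequence, or False; selects the column(s) to use as the row index, and False disables using any column as the index |
dtype | Accepts a dict mapping column names to data types |
engine | Accepts 'c' or 'python'; selects the parser engine |
nrows | Accepts an int; the number of rows to read |
encoding | Accepts a str; the file's text encoding |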
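To make the parameters in the table more concrete, here is a minimal sketch that combines several of them. The data is a hypothetical in-memory sample (not meal_order_info.csv), read through io.StringIO so the snippet runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data; fields are separated by ';' instead of ','.
raw = io.StringIO(
    "id;name;score\n"
    "1;Ann;90.5\n"
    "2;Bob;78\n"
    "3;Cid;88\n"
)

df = pd.read_csv(
    raw,
    sep=";",                     # separator other than the default ','
    header=0,                    # row 0 holds the column names
    index_col=0,                 # use the first column ('id') as the row index
    dtype={"score": "float64"},  # force the dtype of one column
    nrows=2,                     # read only the first 2 data rows
)
print(df)

# When the file has no header row, pass header=None and supply names yourself.
raw_no_header = io.StringIO("1;Ann;90.5\n2;Bob;78\n")
df2 = pd.read_csv(raw_no_header, sep=";", header=None,
                  names=["id", "name", "score"])
print(df2)
```

Passing header=None together with names keeps the first line of the file as data rather than consuming it as column labels.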
Example
import pandas as pd
df = pd.read_csv('.../data/meal_order_info.csv', encoding='gbk')
print(df.head())
Output:
info_id emp_id number_consumers mode dining_table_id \
0 417 1442 4 NaN 1501
1 301 1095 3 NaN 1430
2 413 1147 6 NaN 1488
3 415 1166 4 NaN 1502
4 392 1094 10 NaN 1499
dining_table_name expenditure dishes_count accounts_payable \
0 1022 165 5 165
1 1031 321 6 321
2 1009 854 15 854
3 1023 466 10 466
4 1020 704 24 704
use_start_time ... lock_time cashier_id pc_id order_number \
0 2016/8/1 11:05:36 ... 2016/8/1 11:11:46 NaN NaN NaN
1 2016/8/1 11:15:57 ... 2016/8/1 11:31:55 NaN NaN NaN
2 2016/8/1 12:42:52 ... 2016/8/1 12:54:37 NaN NaN NaN
3 2016/8/1 12:51:38 ... 2016/8/1 13:08:20 NaN NaN NaN
4 2016/8/1 12:58:44 ... 2016/8/1 13:07:16 NaN NaN NaN
org_id print_doc_bill_num lock_table_info order_status phone \
0 330 NaN NaN 1 18688880641
1 328 NaN NaN 1 18688880174
2 330 NaN NaN 1 18688880276
3 330 NaN NaN 1 18688880231
4 330 NaN NaN 1 18688880173
name
0 苗宇怡
1 赵颖
2 徐毅凡
3 张大鹏
4 孙熙凯
[5 rows x 21 columns]
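The example above passes encoding='gbk' because the file contains Chinese text saved in that encoding. As a rough illustration of why the parameter matters, the sketch below builds a small GBK-encoded sample in memory (the two rows are a hypothetical excerpt, not the real file) and reads it back with the matching encoding.

```python
import io
import pandas as pd

# Hypothetical two-row sample, encoded as GBK to mimic a file saved in that encoding.
gbk_bytes = "info_id,name\n417,苗宇怡\n301,赵颖\n".encode("gbk")

# The encoding argument must match how the bytes were written,
# otherwise the Chinese text cannot be decoded correctly.
df = pd.read_csv(io.BytesIO(gbk_bytes), encoding="gbk")
print(df)
```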
Similarly, the to_csv() method saves a DataFrame to a CSV file.
Example
import pandas as pd
# Three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]
# Build a dictionary (named data so it does not shadow the built-in dict)
data = {'name': nme, 'site': st, 'age': ag}
df = pd.DataFrame(data)
# Save the DataFrame
df.to_csv('site.csv')
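After the script runs successfully, opening site.csv shows the following:
site.csv
,name,site,age
0,Google,www.google.com,90
1,Runoob,www.runoob.com,40
2,Taobao,www.taobao.com,80
3,Wiki,www.wikipedia.org,98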
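By default, to_csv() also writes the row index as an unnamed first column. If that column is not wanted, index=False omits it. The following is a minimal sketch of that option; it writes to an in-memory buffer (an assumption made here to keep the snippet self-contained) and reads the text back to check the round trip.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob"],
    "site": ["www.google.com", "www.runoob.com"],
    "age": [90, 40],
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False leaves out the row-index column
csv_text = buf.getvalue()
print(csv_text)

# Reading the text back gives a frame with the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Without index=False, reading the file back would produce an extra column (typically shown as "Unnamed: 0") unless index_col=0 is passed to read_csv.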
Data processing
head()
The head(n) method returns the first n rows; if n is omitted, it returns 5 rows by default.
Example - read the first 5 rows
import pandas as pd
df = pd.read_csv('nba.csv')
print(df.head())
Output:
Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
Example - read the first 10 rows
import pandas as pd
df = pd.read_csv('nba.csv')
print(df.head(10))
Output:
Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 PF 29.0 6-9 240.0 NaN 12000000.0
6 Jordan Mickey Boston Celtics 55.0 PF 21.0 6-8 235.0 LSU 1170960.0
7 Kelly Olynyk Boston Celtics 41.0 C 25.0 7-0 238.0 Gonzaga 2165160.0
8 Terry Rozier Boston Celtics 12.0 PG 22.0 6-2 190.0 Louisville 1824360.0
9 Marcus Smart Boston Celtics 36.0 PG 22.0 6-4 220.0 Oklahoma State 3431040.0
tail()
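The tail(n) method returns the last n rows; if n is omitted, it returns 5 rows by default. For an empty row in the file, every field is returned as NaN.
Example - read the last 5 rows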
import pandas as pd
df = pd.read_csv('nba.csv')
print(df.tail())
Output:
Name Team Number Position Age Height Weight College Salary
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Example - read the last 10 rows
import pandas as pd
df = pd.read_csv('nba.csv')
print(df.tail(10))
Output:
Name Team Number Position Age Height Weight College Salary
448 Gordon Hayward Utah Jazz 20.0 SF 26.0 6-8 226.0 Butler 15409570.0
449 Rodney Hood Utah Jazz 5.0 SG 23.0 6-8 206.0 Duke 1348440.0
450 Joe Ingles Utah Jazz 2.0 SF 28.0 6-8 226.0 NaN 2050000.0
451 Chris Johnson Utah Jazz 23.0 SF 26.0 6-6 206.0 Dayton 981348.0
452 Trey Lyles Utah Jazz 41.0 PF 20.0 6-10 234.0 Kentucky 2239800.0
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN NaN NaN NaN NaN NaN
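The examples above depend on nba.csv. For readers without that file, the behavior of head() and tail() can be checked on any DataFrame; the sketch below uses a hypothetical 20-row frame built in memory.

```python
import pandas as pd

# Hypothetical frame with 20 rows, standing in for data loaded from a CSV file.
df = pd.DataFrame({"n": range(20)})

default_head = df.head()  # no argument: first 5 rows
first3 = df.head(3)       # first 3 rows
last3 = df.tail(3)        # last 3 rows

print(default_head)
print(first3)
print(last3)
```

Both methods return a new DataFrame and keep the original row labels, which is why tail() output starts at a high index rather than at 0.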
info()
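The info() method prints basic information about the table:
Example
import pandas as pd
df = pd.read_csv('nba.csv')
# info() prints its report directly and returns None, so no print() is needed
df.info()
Output: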
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457 # row count: 458 rows, the first row is numbered 0
Data columns (total 9 columns): # column count: 9 columns
# Column Non-Null Count Dtype # data type of each column
--- ------ -------------- -----
0 Name 457 non-null object
1 Team 457 non-null object
2 Number 457 non-null float64
3 Position 457 non-null object
4 Age 457 non-null float64
5 Height 457 non-null object
6 Weight 457 non-null float64
7 College 373 non-null object # non-null means the value is not missing
8 Salary 446 non-null float64
dtypes: float64(4), object(5) # number of columns of each type
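The non-null count is the number of values that are not missing. The summary above shows 458 rows in total, and the College column has the most missing values (only 373 of its 458 entries are non-null).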
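The missing-value counts can also be computed directly rather than read off the info() report. The sketch below is one possible way to do it, using a hypothetical miniature frame in place of nba.csv (the column names are borrowed from it, the values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for nba.csv; the last row is entirely empty.
df = pd.DataFrame({
    "Name": ["A", "B", "C", None],
    "College": ["Texas", None, None, None],
    "Salary": [1.0, 2.0, np.nan, np.nan],
})

# Count missing values per column, then find the column with the most.
missing = df.isna().sum()
print(missing)
print(missing.idxmax())

# Drop rows in which every field is missing, such as a trailing empty line.
cleaned = df.dropna(how="all")
print(len(cleaned))
```

Using how="all" removes only fully empty rows, so rows that are missing just one field (for example a player with no College) are kept.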