pandas数据读入
Pandas中关于读入数据的相关函数列表如下:
pd.read_clipboard() pd.read_gbq() pd.read_parquet() pd.read_sql_query()
pd.read_csv() pd.read_hdf() pd.read_pickle() pd.read_sql_table()
pd.read_excel() pd.read_html() pd.read_sas() pd.read_stata()
pd.read_feather() pd.read_json() pd.read_spss() pd.read_table()
pd.read_fwf() pd.read_msgpack() pd.read_sql()
以下函数都有对应的导出函数:
pd.read_clipboard() pd.read_feather() pd.read_html() pd.read_pickle()
pd.read_csv() pd.read_gbq() pd.read_json() pd.read_sql()
pd.read_excel() pd.read_hdf() pd.read_msgpack()
本文先主要介绍2个参数,陆续更新 。。。。。。
- pandas.read_csv() :主要读取文本文件.csv,.txt,.xls
- pandas.read_excel() :主要读取超文本文件.xlsx
1. pandas.read_csv() 读入文本文件或者URL等
1)函数详解:
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None,
usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False,
skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False,
date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer',
thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None,
comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True,
warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True,
memory_map=False, float_precision=None)
常用参数介绍:
- sep :分隔符(str)默认是","。若读入的是txt,应设置为sep='\t"
- delimiter :定界符(str),默认是"None"。备选分隔符(如果指定该参数,则sep参数失效)
- header :指定行数用来作为列名( int or list of ints), 默认是 ‘infer’。 如果同时header和skiprows指定,header对应skiprows相对的结束的位置,header使用新的行号(忽略注释/空行)
- names:用于结果的列名列表(array-like),默认是 "None"。如果数据文件中没有列标题行,就需要执行header=None。默认列表中不能出现重复,除非设定参数mangle_dupe_cols=True
- index_col:用作行索引的列编号或者列名(int or sequence or False), 默认是 “None”。如果给定一个序列则有多个行索引。如果文件不规则,行尾有分隔符,则可以设定index_col=False 来是的pandas不适用第一列作为行索引。
- encoding:字符编码。encoding="latin-1" 来读取法语字符,encoding="utf-8" 或 encoding="gbk"读取中文
- skiprows: 跳过某些行不读入。skiprows=[1,2] 表示不读取第 1 行和第 2 行。会忽略comment,若第一行满足comment,则skiprows=[1],则结果只会不读第一行,行编号不会由comment而改变,即skiprows使用原始行号(包括注释/空行)
- nrows : 设置读取行数。默认是"None”,若想要读取前10行,应设置为nrow=10
- comment:str(1个字符), 默认是“None",匹配到str后的所有字符会被解析为"NaN"
2)参数应用示例:
读入文件(example.csv)内容如下:
Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
#chr6,117643659,117643659,C,G,0,3199,100.00%
chr12,25368462,25368462,C,T,1,2258,99.90%
chr17,7579801,7579801,G,C,51,1727,99.30%
chr9,21975017,21975017,C,T,1,3593,99.90%
默认读入时,会忽略空行
>>> import pandas as pd
>>> #pd.__version__ #查看Pandas版本
>>> df = pd.read_csv("example.csv")
>>> df
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
0 #chr6 117643659 117643659 C G 0 3199 100.00%
1 chr12 25368462 25368462 C T 1 2258 99.90%
2 chr17 7579801 7579801 G C 51 1727 99.30%
3 chr9 21975017 21975017 C T 1 3593 99.90%
设置comment="#",忽略匹配到#之后的所有内容,这里"#"可以换成任意其他字符,字符数必须时1。在python中类似于匹配到该字符。在pandas中会把匹配到该字符之后的所有字符转化为NaN进行写入
>>> df = pd.read_csv("example.csv",comment="#")
>>> df #这里#开头的那行没有读入
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
0 chr12 25368462 25368462 C T 1 2258 99.90%
1 chr17 7579801 7579801 G C 51 1727 99.30%
2 chr9 21975017 21975017 C T 1 3593 99.90%
''' example-1.csv"文件内容如下:'''
Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
#chr6,117643659,117643659,C,G,0,3199,100.00%
chr12,#25368462,25368462,C,T,1,2258,99.90%
chr17,7579#801,7579801,G,C,51,1727,99.30%
chr9,21975017,21975017,C,T,1,3593,99.90%
>>> df1 = pd.read_csv("example-1.csv",comment="#")
>>> df1 #发现#之后的内容全部转为NaN了
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
0 chr12 NaN NaN NaN NaN NaN NaN NaN
1 chr17 7579.0 NaN NaN NaN NaN NaN NaN
2 chr9 21975017.0 21975017.0 C T 1.0 3593.0 99.90%
header 默认第一行为列名(即header=0),若同时使用comment时,这里会忽略注释行,重新进行索引排序后的第一行
>>> df = pd.read_csv("example.csv",comment="#")
>>> df
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
0 chr12 25368462 25368462 C T 1 2258 99.90%
1 chr17 7579801 7579801 G C 51 1727 99.30%
2 chr9 21975017 21975017 C T 1 3593 99.90%
>>> df = pd.read_csv("example-2.csv",comment="#",header=0)
>>> df
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
0 chr12 25368462 25368462 C T 1 2258 99.90%
1 chr17 7579801 7579801 G C 51 1727 99.30%
2 chr9 21975017 21975017 C T 1 3593 99.90%
当header=n,且n > 0时,默认不会读取n(索引)前面的行,n行会作为列名
>>> df = pd.read_csv("example-2.csv",comment="#",header=1)
>>> df
chr12 25368462 25368462.1 C T 1 2258 99.90%
0 chr17 7579801 7579801 G C 51 1727 99.30%
1 chr9 21975017 21975017 C T 1 3593 99.90%
当 header=None 即不指定列名,read_csv为其自动加上列索引(从0开始)
>>> df = pd.read_csv("example-2.csv",comment="#",header=None)
>>> df
0 1 2 3 4 5 6 7
0 Chr Start End Ref Alt Ref_Reads Alt_Reads Var
1 chr12 25368462 25368462 C T 1 2258 99.90%
2 chr17 7579801 7579801 G C 51 1727 99.30%
3 chr9 21975017 21975017 C T 1 3593 99.90%
当 header=None时,可以通过names=[],指定新的列名,新的数据框列数由names长度决定
- names长度小于原始数据列数时,则会进行截取
- names长度大于原始数据列数时, 则会通过NaN进行填补
>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H'])
>>> df
A B C D E F G H
0 Chr Start End Ref Alt Ref_Reads Alt_Reads Var
1 chr12 25368462 25368462 C T 1 2258 99.90%
2 chr17 7579801 7579801 G C 51 1727 99.30%
3 chr9 21975017 21975017 C T 1 3593 99.90%
>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G'])
>>> df
A B C D E F G
Chr Start End Ref Alt Ref_Reads Alt_Reads Var
chr12 25368462 25368462 C T 1 2258 99.90%
chr17 7579801 7579801 G C 51 1727 99.30%
chr9 21975017 21975017 C T 1 3593 99.90%
>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H','I'])
>>> df
A B C D E F G H I
0 Chr Start End Ref Alt Ref_Reads Alt_Reads Var NaN
1 chr12 25368462 25368462 C T 1 2258 99.90% NaN
2 chr17 7579801 7579801 G C 51 1727 99.30% NaN
3 chr9 21975017 21975017 C T 1 3593 99.90% NaN
index_col设置行名,若同时设置header,如下,’Chr‘会被作为行名与列名产生了冲突,所以 header参数和index_col参数不能混用,header指定列索引,index_col指定行索引,一个DataFrame对象只有一种索引
>>> df = pd.read_csv("example-2.csv",comment="#",index_col=0)
>>> df
Start End Ref Alt Ref_Reads Alt_Reads Var
Chr
chr12 25368462 25368462 C T 1 2258 99.90%
chr17 7579801 7579801 G C 51 1727 99.30%
chr9 21975017 21975017 C T 1 3593 99.90%
2. pandas.read_excel() 读取 xlsx格式文件, 需要安装xlrd包
1) 函数详解:
pandas.read_excel(io,sheet_name = 0,header = 0,names = None,index_col = None,usecols = None,
squeeze = False,dtype = None, ...)
主要参数介绍:
- io :输入文件
- sheet_name:默认为0,可以是字符串列表或整数列表(None,string,int)。可以是sheet的索引,也可以是sheet名,为None时获取所有工作表。
- header:用法同pandas.read_csv()
- names:用法同pandas.read_csv()
- index_col:用法同pandas.read_csv()
2) 参数举例说明:
输入文件(example.xlsx)包含三个sheet,分别是A,B,C
当 不设置 sheet_name 时 和 sheet_name = 0 以及 sheet_name = 'A' 结果一致
>>> import pandas as pd
>>> df = pd.read_excel('example.xlsx') #不设置sheet_name
>>> df
Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%
>>> df = pd.read_excel('example.xlsx',sheet_name=0) #sheet_name = 0
>>> df
Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%
>>> df = pd.read_excel('example.xlsx',sheet_name='A') #sheet_name = 'A','A'是第一个sheet
>>> df
Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%
当sheet_name = None 时,会读取全部sheet的内容
>>> df = pd.read_excel('example.xlsx',sheet_name=None)
>>> df
OrderedDict([('A', Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%), ('B', Chr Start End Ref Alt Gene.refGene VAF
0 chr2 111907691 111907691 T C BCL2L11 99.37%
1 chr12 25368462 25368462 C T KRAS 99.74%
2 chr5 38950776 38950776 G A RICTOR 99.52%), ('C', Chr Start End Ref Alt Gene.refGene VAF
0 chr7 116435768 116435768 C T MET 99.49%
1 chr7 140449150 140449150 T C BRAF 52.17%
2 chr2 30143499 30143499 G C ALK 99.58%
3 chr1 11288758 11288758 G A MTOR 99.68%
4 chr12 133208979 133208979 T C POLE 46.77%)])
sheet_name 也可以是列表,包含想要读取的sheet的索引或者名称,也可以混合使用,比如
sheet_name = [0,1],显示的索引是0,1
>>> df = pd.read_excel('example.xlsx',sheet_name=[0,1])
>>> df
OrderedDict([(0, Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%), (1, Chr Start End Ref Alt Gene.refGene VAF
0 chr2 111907691 111907691 T C BCL2L11 99.37%
1 chr12 25368462 25368462 C T KRAS 99.74%
2 chr5 38950776 38950776 G A RICTOR 99.52%)])
sheet_name = ['A','C'],显示的索引是'A','C'
>>> df = pd.read_excel('example.xlsx',sheet_name=['A','C'])
>>> df
OrderedDict([('A', Chr Start End Ref Alt Gene.refGene VAF
0 chr10 43600607 43600607 C A RET 100.00%
1 chr7 140624440 140624440 C T BRAF 53.50%), ('C', Chr Start End Ref Alt Gene.refGene VAF
0 chr7 116435768 116435768 C T MET 99.49%
1 chr7 140449150 140449150 T C BRAF 52.17%
2 chr2 30143499 30143499 G C ALK 99.58%
3 chr1 11288758 11288758 G A MTOR 99.68%
4 chr12 133208979 133208979 T C POLE 46.77%)])
sheet_name = [1,'C'],显示的索引是1和'C'
>>> df = pd.read_excel('example.xlsx',sheet_name=[1,'C'])
>>> df
OrderedDict([(1, Chr Start End Ref Alt Gene.refGene VAF
0 chr2 111907691 111907691 T C BCL2L11 99.37%
1 chr12 25368462 25368462 C T KRAS 99.74%
2 chr5 38950776 38950776 G A RICTOR 99.52%), ('C', Chr Start End Ref Alt Gene.refGene VAF
0 chr7 116435768 116435768 C T MET 99.49%
1 chr7 140449150 140449150 T C BRAF 52.17%
2 chr2 30143499 30143499 G C ALK 99.58%
3 chr1 11288758 11288758 G A MTOR 99.68%
4 chr12 133208979 133208979 T C POLE 46.77%)])
参考:
https://www.cnblogs.com/datablog/p/6127000.html
https://www.jianshu.com/p/d1eed925509b