pandas数据读入

2020-10-20  本文已影响0人  千千罐

Pandas中关于读入数据的相关函数列表如下:

pd.read_clipboard() pd.read_gbq()       pd.read_parquet()   pd.read_sql_query()
pd.read_csv()       pd.read_hdf()       pd.read_pickle()    pd.read_sql_table()
pd.read_excel()     pd.read_html()      pd.read_sas()       pd.read_stata()
pd.read_feather()   pd.read_json()      pd.read_spss()      pd.read_table()
pd.read_fwf()       pd.read_msgpack()   pd.read_sql()   

以下函数都有对应的导出函数:

pd.read_clipboard() pd.read_feather()   pd.read_html()      pd.read_pickle()
pd.read_csv()       pd.read_gbq()       pd.read_json()      pd.read_sql()
pd.read_excel()     pd.read_hdf()       pd.read_msgpack()   

本文先主要介绍2个参数,陆续更新 。。。。。。

1. pandas.read_csv() 读入文本文件或者URL等

1)函数详解:

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, 
                usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None,
                converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None,
                nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, 
                skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False,       
                date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', 
                thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, 
                comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, 
                warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, 
                memory_map=False, float_precision=None)

常用参数介绍:

2)参数应用示例:

读入文件(example.csv)内容如下:

Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
#chr6,117643659,117643659,C,G,0,3199,100.00%

chr12,25368462,25368462,C,T,1,2258,99.90%
chr17,7579801,7579801,G,C,51,1727,99.30%
chr9,21975017,21975017,C,T,1,3593,99.90%

默认读入时,会忽略空行

>>> import pandas as pd
>>> #pd.__version__  #查看Pandas版本
>>> df = pd.read_csv("example.csv")
>>> df
     Chr      Start        End Ref Alt  Ref_Reads  Alt_Reads      Var
0  #chr6  117643659  117643659   C   G          0       3199  100.00%
1  chr12   25368462   25368462   C   T          1       2258   99.90%
2  chr17    7579801    7579801   G   C         51       1727   99.30%
3   chr9   21975017   21975017   C   T          1       3593   99.90% 

设置comment="#",忽略匹配到#之后的所有内容,这里"#"可以换成任意其他字符,字符数必须时1。在python中类似于匹配到该字符。在pandas中会把匹配到该字符之后的所有字符转化为NaN进行写入

>>> df = pd.read_csv("example.csv",comment="#")
>>> df        #这里#开头的那行没有读入
     Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
0  chr12  25368462  25368462   C   T          1       2258  99.90%
1  chr17   7579801   7579801   G   C         51       1727  99.30%
2   chr9  21975017  21975017   C   T          1       3593  99.90%

''' example-1.csv"文件内容如下:'''

Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
#chr6,117643659,117643659,C,G,0,3199,100.00%

chr12,#25368462,25368462,C,T,1,2258,99.90%
chr17,7579#801,7579801,G,C,51,1727,99.30%
chr9,21975017,21975017,C,T,1,3593,99.90%

>>> df1 = pd.read_csv("example-1.csv",comment="#")
>>> df1      #发现#之后的内容全部转为NaN了
     Chr       Start         End  Ref  Alt  Ref_Reads  Alt_Reads     Var
0  chr12         NaN         NaN  NaN  NaN        NaN        NaN     NaN
1  chr17      7579.0         NaN  NaN  NaN        NaN        NaN     NaN
2   chr9  21975017.0  21975017.0    C    T        1.0     3593.0  99.90%

header 默认第一行为列名(即header=0),若同时使用comment时,这里会忽略注释行,重新进行索引排序后的第一行

>>> df = pd.read_csv("example.csv",comment="#")
>>> df
     Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
0  chr12  25368462  25368462   C   T          1       2258  99.90%
1  chr17   7579801   7579801   G   C         51       1727  99.30%
2   chr9  21975017  21975017   C   T          1       3593  99.90%

>>> df = pd.read_csv("example-2.csv",comment="#",header=0)
>>> df
     Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
0  chr12  25368462  25368462   C   T          1       2258  99.90%
1  chr17   7579801   7579801   G   C         51       1727  99.30%
2   chr9  21975017  21975017   C   T          1       3593  99.90%

当header=n,且n > 0时,默认不会读取n(索引)前面的行,n行会作为列名

>>> df = pd.read_csv("example-2.csv",comment="#",header=1)
>>> df
   chr12  25368462  25368462.1  C  T   1  2258  99.90%
0  chr17   7579801     7579801  G  C  51  1727  99.30%
1   chr9  21975017    21975017  C  T   1  3593  99.90%

当 header=None 即不指定列名,read_csv为其自动加上列索引(从0开始)

>>> df = pd.read_csv("example-2.csv",comment="#",header=None)
>>> df
       0         1         2    3    4          5          6       7
0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
1  chr12  25368462  25368462    C    T          1       2258  99.90%
2  chr17   7579801   7579801    G    C         51       1727  99.30%
3   chr9  21975017  21975017    C    T          1       3593  99.90%

当 header=None时,可以通过names=[],指定新的列名,新的数据框列数由names长度决定

>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H'])
>>> df
       A         B         C    D    E          F          G       H
0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
1  chr12  25368462  25368462    C    T          1       2258  99.90%
2  chr17   7579801   7579801    G    C         51       1727  99.30%
3   chr9  21975017  21975017    C    T          1       3593  99.90%
>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G'])
>>> df
              A         B    C    D          E          F       G
Chr       Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
chr12  25368462  25368462    C    T          1       2258  99.90%
chr17   7579801   7579801    G    C         51       1727  99.30%
chr9   21975017  21975017    C    T          1       3593  99.90%

>>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H','I'])
>>> df
       A         B         C    D    E          F          G       H   I
0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var NaN
1  chr12  25368462  25368462    C    T          1       2258  99.90% NaN
2  chr17   7579801   7579801    G    C         51       1727  99.30% NaN
3   chr9  21975017  21975017    C    T          1       3593  99.90% NaN

index_col设置行名,若同时设置header,如下,’Chr‘会被作为行名与列名产生了冲突,所以 header参数和index_col参数不能混用,header指定列索引,index_col指定行索引,一个DataFrame对象只有一种索引

>>> df = pd.read_csv("example-2.csv",comment="#",index_col=0)
>>> df
          Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
Chr                                                            
chr12  25368462  25368462   C   T          1       2258  99.90%
chr17   7579801   7579801   G   C         51       1727  99.30%
chr9   21975017  21975017   C   T          1       3593  99.90%
2. pandas.read_excel() 读取 xlsx格式文件, 需要安装xlrd包

1) 函数详解:

pandas.read_excel(io,sheet_name = 0,header = 0,names = None,index_col = None,usecols = None,        
                  squeeze = False,dtype = None, ...)

主要参数介绍:

2) 参数举例说明:
输入文件(example.xlsx)包含三个sheet,分别是A,B,C

当 不设置 sheet_name 时 和 sheet_name = 0 以及 sheet_name = 'A' 结果一致

>>> import pandas as pd
>>> df = pd.read_excel('example.xlsx')    #不设置sheet_name
>>> df
     Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%

>>> df = pd.read_excel('example.xlsx',sheet_name=0)  #sheet_name = 0 
>>> df
     Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%

>>> df = pd.read_excel('example.xlsx',sheet_name='A')  #sheet_name = 'A','A'是第一个sheet
>>> df
     Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%

当sheet_name = None 时,会读取全部sheet的内容

>>> df = pd.read_excel('example.xlsx',sheet_name=None)
>>> df
OrderedDict([('A',      Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%), ('B',      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
1  chr12   25368462   25368462   C   T         KRAS  99.74%
2   chr5   38950776   38950776   G   A       RICTOR  99.52%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr7  116435768  116435768   C   T          MET  99.49%
1   chr7  140449150  140449150   T   C         BRAF  52.17%
2   chr2   30143499   30143499   G   C          ALK  99.58%
3   chr1   11288758   11288758   G   A         MTOR  99.68%
4  chr12  133208979  133208979   T   C         POLE  46.77%)])

sheet_name 也可以是列表,包含想要读取的sheet的索引或者名称,也可以混合使用,比如

sheet_name = [0,1],显示的索引是0,1

>>> df = pd.read_excel('example.xlsx',sheet_name=[0,1])
>>> df
OrderedDict([(0,      Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%), (1,      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
1  chr12   25368462   25368462   C   T         KRAS  99.74%
2   chr5   38950776   38950776   G   A       RICTOR  99.52%)])

sheet_name = ['A','C'],显示的索引是'A','C'

>>> df = pd.read_excel('example.xlsx',sheet_name=['A','C'])
>>> df
OrderedDict([('A',      Chr      Start        End Ref Alt Gene.refGene      VAF
0  chr10   43600607   43600607   C   A          RET  100.00%
1   chr7  140624440  140624440   C   T         BRAF   53.50%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr7  116435768  116435768   C   T          MET  99.49%
1   chr7  140449150  140449150   T   C         BRAF  52.17%
2   chr2   30143499   30143499   G   C          ALK  99.58%
3   chr1   11288758   11288758   G   A         MTOR  99.68%
4  chr12  133208979  133208979   T   C         POLE  46.77%)])

sheet_name = [1,'C'],显示的索引是1和'C'

>>> df = pd.read_excel('example.xlsx',sheet_name=[1,'C'])
>>> df
OrderedDict([(1,      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
1  chr12   25368462   25368462   C   T         KRAS  99.74%
2   chr5   38950776   38950776   G   A       RICTOR  99.52%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
0   chr7  116435768  116435768   C   T          MET  99.49%
1   chr7  140449150  140449150   T   C         BRAF  52.17%
2   chr2   30143499   30143499   G   C          ALK  99.58%
3   chr1   11288758   11288758   G   A         MTOR  99.68%
4  chr12  133208979  133208979   T   C         POLE  46.77%)])

参考:
https://www.cnblogs.com/datablog/p/6127000.html
https://www.jianshu.com/p/d1eed925509b

上一篇下一篇

猜你喜欢

热点阅读