2020-11-29

2020-12-06 byejya

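1. Prepare the data:

bad: 207

good: 77

That is not enough data, so add the good images from the tea dataset.

Use the Excel sheets to pick out the good_tea images.

Method: for each Excel file that is read, loop once over the image paths.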
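The loop described above could look roughly like the sketch below. It is only an illustration: the column name `filename`, the helper names, and the idea of copying the matches into a separate folder are assumptions, since the notes do not show the real sheet layout or script.

```python
import shutil
from pathlib import Path


def names_from_excel(xlsx_path, column="filename"):
    """Return the image names listed in one Excel sheet (column name is hypothetical)."""
    import pandas as pd  # imported here so the copy helper also works without pandas

    return pd.read_excel(xlsx_path)[column].dropna().astype(str).tolist()


def copy_listed_images(wanted_names, src_dir, dst_dir):
    """Loop once over the image paths and copy those whose file name is wanted."""
    wanted = set(wanted_names)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for image_path in sorted(Path(src_dir).iterdir()):
        if image_path.is_file() and image_path.name in wanted:
            shutil.copy2(image_path, dst / image_path.name)
            copied.append(image_path.name)
    return copied
```

For each Excel file, one would call `copy_listed_images(names_from_excel(xlsx), src_dir, dst_dir)`. Turning the names into a set keeps each pass over the image folder cheap.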

Error:

raise ImportError('Could not import PIL.Image. '

ImportError: Could not import PIL.Image. The use of `load_img` requires PIL.

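Fix: 1. Activate the environment that keras is installed in. I had been in base the whole time, where install did nothing because the package was reported as already present. The real mistake was not being in the right environment; once inside it, I found that the pillow package was not installed there.

2. Deactivate and then re-activate the environment.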

conda deactivate

conda activate
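To confirm which environment is actually active and whether PIL can be imported from it, a small check like the following may help. It is a sketch using only the standard library; the function name `describe_environment` is made up for illustration.

```python
import importlib.util
import sys


def describe_environment(package="PIL"):
    """Report which interpreter is running and whether `package` is importable from it."""
    spec = importlib.util.find_spec(package)
    return {
        "executable": sys.executable,  # path of the running Python
        "prefix": sys.prefix,          # root of the active environment
        "package": package,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `prefix` points at base rather than the keras environment, the install went to the wrong place.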

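Batch data download script:

Key issue: paths.

The current working directory versus the directory the file lives in. If python is added to PATH, it is unclear what counts as the current directory.

Test: at the top level, add python to PATH, cd into some folder, invoke it from there, and check the current working directory. Result:

Also use a relative path of the form ../../ and check the working directory.

Open question: how should the working-directory path be changed?

The goal is to avoid writing a relative path that then cannot be found.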
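One way to make relative paths predictable is to resolve them against the script's own folder instead of the working directory. The sketch below assumes nothing about the real download script; `resolve_from` and `report` are hypothetical helper names.

```python
from pathlib import Path


def resolve_from(anchor_dir, relative):
    """Resolve `relative` against `anchor_dir` instead of the current working directory."""
    return (Path(anchor_dir) / relative).resolve()


def report(script_file):
    """Compare the working directory with the folder that contains `script_file`."""
    script_dir = Path(script_file).resolve().parent
    return {
        "working_directory": Path.cwd(),  # where python was invoked from
        "script_directory": script_dir,   # where the file lives
        "../../ from cwd": resolve_from(Path.cwd(), "../../"),
        "../../ from script": resolve_from(script_dir, "../../"),
    }
```

Calling `report(__file__)` inside a script and running it from different folders shows that only the working-directory entries change. `os.chdir(script_dir)` would switch the working directory itself, but resolving paths explicitly avoids that side effect.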

#####

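Current bottleneck: amount of data

Specifically: there are not enough results from completed runs, so even after reading the papers I cannot tell how the methods would perform on my own data.

Solution: finish modifying the pipeline so that it runs on human data

Modify the data download script

Look at the databases and methods other people use, and try the two methods that were selected.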

########

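Problems in the current pipeline: 1. Adding the header every time is too complicated. Check whether PyCharm offers a better way.

2. Overall, the pipeline is too complex.

3. The reference genome cannot be detected automatically. It has to be preprocessed by hand first; otherwise stitching together the longest intron cannot be done.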

###########

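The only changes: the recorded chromosomes now run from chr1 through X and Y, the start of the intron records was changed, and ChrUn was removed.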
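Restricting the records to chr1 through chrY can be sketched as a simple filter. This assumes the chromosome name is the first tab-separated field, as in GFF, and that names look like `chr1`; the actual layout of the intron file is not shown in these notes.

```python
CANONICAL = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY"}


def keep_canonical(lines, allowed=CANONICAL):
    """Yield only records whose first tab-separated field is an allowed chromosome."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        if line.split("\t", 1)[0] in allowed:
            yield line
```

Anything on unplaced contigs such as `chrUn_...` is dropped because its name is not in the allowed set.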

intron path: /mnt/x110/wus/BP_new/BWA_mapping/test_full/hg38_intron_111

Result file path: /mnt/x110/wus/BP_new/BWA_mapping/dingh/te

.fa:/mnt/x110/guosy/Database/hg38_gff/hg38.fa

circRNA_fa:/mnt/x110/dingh/prostate-cancer/circRNA

-sam:/mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam

Command: python full_human.py -i /mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam -f /mnt/x110/guosy/Database/hg38_gff/hg38.fa -I /mnt/x110/wus/BP_new/BWA_mapping/test_full/hg38_intron_111 -o /mnt/x110/wus/BP_new/BWA_mapping/dingh/te

Input and output: given only the freshly aligned SAM file, the reference genome FASTA file, and the intron GFF file, the script outputs SAM files for the 5 types of reads that fall within introns and SAM files for the two kinds of one_pair mate sequences.

Error:

The .fai index file is missing.

Switch the FASTA to: /mnt/T30/zhengy/book/database/hg38/hg38.fa

Command: python full_human.py -i /mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam -f /mnt/T30/zhengy/book/database/hg38/hg38.fa -I /mnt/x110/wus/BP_new/BWA_mapping/test_full/hg38_intron_111 -o /mnt/x110/wus/BP_new/BWA_mapping/dingh/te

Another error. Checking the SAM header showed that bwa had aligned against this reference genome:
/mnt/x110/guosy/Database/hg19/bwa-index/hg19.fa

So the command becomes: python full_human.py -i /mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam -f /mnt/x110/guosy/Database/hg19/bwa-index/hg19.fa -I /mnt/x110/wus/BP_new/BWA_mapping/test_full/hg38_intron_111 -o /mnt/x110/wus/BP_new/BWA_mapping/dingh/te

There is no .fai there either, so I located a copy that has one:

/mnt/x110/guosy/Database/hg19/samtools-index

So the command becomes: python full_human.py -i /mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam -f /mnt/x110/guosy/Database/hg19/samtools-index/hg19.fa -I /mnt/x110/wus/BP_new/BWA_mapping/dingh/hg38_intron_111 -o /mnt/x110/wus/BP_new/BWA_mapping/dingh/te
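Both checks made by hand above (which reference the aligner used, and whether a .fai index exists) can be scripted. The sketch below reads the SAM header as plain text. It assumes the aligner recorded its command line in the `CL:` field of an `@PG` line, which bwa usually does but which the SAM format treats as optional; the function names are illustrative only.

```python
import os


def read_sam_header(sam_path):
    """Return the header lines (those starting with '@') of a SAM file."""
    header = []
    with open(sam_path) as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # header lines always precede the alignments
            header.append(line.rstrip("\n"))
    return header


def program_commands(header):
    """Extract the CL: (command line) field from each @PG header line, if present."""
    commands = []
    for line in header:
        if line.startswith("@PG"):
            for field in line.split("\t")[1:]:
                if field.startswith("CL:"):
                    commands.append(field[3:])
    return commands


def has_fai(fasta_path):
    """True if the samtools-style index <fasta>.fai sits next to the FASTA."""
    return os.path.exists(str(fasta_path) + ".fai")
```

Printing `program_commands(read_sam_header(sam_path))` before running the pipeline shows the reference path the aligner was given, and `has_fai` flags a missing index up front.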

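After that: the run kept getting stuck at the step that extracts intron positions, because previously the file was deduplicated on the last column, renamed, and then sorted.

Now, after the earlier processing, 111 is already deduplicated and sorted, and the sorting is the problem. The gff3 content was originally written out as one flat list, with everything in a single list and no separation by chromosome. The code therefore depends heavily on 111 being sorted correctly, and right now it is not.

First, it starts from chr10; second, the positions do not start from the beginning.

The smallest position on chr10 is not the first record.

Sorting alone turned out to be very tedious, so the plan is to load the file into a DataFrame, use the index to split it into one DataFrame per chromosome, sort each by the fourth column, and finally convert the result back to a list.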
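A minimal pandas sketch of that plan follows. It assumes the file was read without a header (so columns are numbered from 0), that column 0 holds the chromosome, and that the fourth column (index 3) holds the start position; it uses `groupby` rather than index slicing, so it is not necessarily how the actual script does it.

```python
import pandas as pd

CHROM_ORDER = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


def split_and_sort(df, chrom_col=0, start_col=3):
    """Return {chromosome: DataFrame sorted by start}, ordered chr1..chr22, chrX, chrY.

    Chromosomes outside CHROM_ORDER (for example chrUn contigs) are left out.
    """
    groups = {name: part for name, part in df.groupby(chrom_col)}
    return {
        chrom: groups[chrom].sort_values(start_col).reset_index(drop=True)
        for chrom in CHROM_ORDER
        if chrom in groups
    }
```

A frame for this could come from `pd.read_csv(path, sep="\t", header=None, comment="#")`. Keeping an explicit chromosome order avoids relying on how the names happen to sort as strings.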

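Done: reading, saving, sorting

Next: split into one piece per chr.

Success:

After that, take columns d and e and save them as a list:

This mostly works, but there are still problems.

Because .ix is deprecated (and was removed in pandas 1.0), switch to iloc.

But the result is a multi-dimensional array and needs to be flattened into a one-dimensional one.

Use the itertools library for that.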
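Selecting two columns by position and flattening them could be sketched as below. Reading "columns d and e" as the fourth and fifth columns (positions 3 and 4) is an assumption, as is the helper name.

```python
from itertools import chain

import pandas as pd


def flatten_columns(df, positions=(3, 4)):
    """Select columns by position with iloc and flatten the rows into one flat list."""
    rows = df.iloc[:, list(positions)].values.tolist()  # list of [start, end] pairs
    return list(chain.from_iterable(rows))
```

`chain.from_iterable` walks the pairs row by row, so the output alternates start, end, start, end in the same order as the sorted frame.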

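Testing with 22 instead also works fine.

The modified code can be used to produce intron_list.

Original overall steps: rename, sort, uniq

Current steps: deduplicate in Linux; grouping and building the list are done entirely with the pandas library.

With the code changes finished, rerun with the command:

python full_human_1.py -i /mnt/x110/wus/BP_new/BWA_mapping/dingh/SRR6999003_mapped.sam -f /mnt/x110/guosy/Database/hg19/samtools-index/hg19.fa -I /mnt/x110/wus/BP_new/BWA_mapping/dingh/hg38_intron_111 -o /mnt/x110/wus/BP_new/BWA_mapping/dingh/te

After the changes it runs correctly and quickly.

The remaining problem is hard-coding.

The step that writes into the list inside a loop failed, and I have not yet tested how to delete items from the list or how that affects the list's length.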
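The notes do not say what exactly went wrong in the loop. One common cause, shown here only as a possible explanation, is removing items from a list while iterating over it, which shifts the remaining elements and makes the loop skip some of them. Both function names are made up for the demonstration.

```python
def drop_while_iterating(items, unwanted):
    """Buggy pattern: removing from the list being looped over skips elements."""
    items = list(items)
    for item in items:
        if item in unwanted:
            items.remove(item)
    return items


def drop_with_new_list(items, unwanted):
    """Safer pattern: build a new list and leave the original untouched."""
    return [item for item in items if item not in unwanted]


data = [1, 2, 2, 3]
print(drop_while_iterating(data, {2}))  # one 2 survives because the loop skipped it
print(drop_with_new_list(data, {2}))
print(len(data))                        # the original list keeps its length
```

Building a new list leaves the length of the original unchanged, which makes the effect of a deletion easy to check.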

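This needs further inspection in IGV.

Most of the time goes to the alignment that produces the SAM file; the script itself finishes quickly.

The main problem is that the intron file and the reference genome file cannot be swapped freely.

This shows the script is still not general enough.

Currently the inputs are just the SAM file, the intron file, and reference.fa.

But the FASTA used to produce the SAM file must be the same as the FASTA used to add the header, so how to keep the two consistent is an open question.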
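One possible consistency check is to compare the sequence names and lengths in the SAM header with those in the FASTA index. The sketch assumes the SAM file has `@SQ` lines with `SN:` and `LN:` fields and that the .fai file is the usual tab-separated samtools index with name and length in its first two columns; none of these helpers come from the original script.

```python
def sam_sequences(sam_path):
    """Map sequence name -> length from the @SQ lines of a SAM header."""
    sequences = {}
    with open(sam_path) as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # end of header
            if line.startswith("@SQ"):
                fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
                sequences[fields["SN"]] = int(fields["LN"])
    return sequences


def fai_sequences(fai_path):
    """Map sequence name -> length from a .fai index (first two tab-separated columns)."""
    sequences = {}
    with open(fai_path) as handle:
        for line in handle:
            if line.strip():
                name, length = line.split("\t")[:2]
                sequences[name] = int(length)
    return sequences


def same_reference(sam_path, fai_path):
    """True when both files list the same sequence names with the same lengths."""
    return sam_sequences(sam_path) == fai_sequences(fai_path)
```

Matching names and lengths does not prove the sequences are identical, but a mismatch such as an hg19 header against an hg38 index would be caught immediately.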

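Current problems and next steps:

1. Hard-coding

2. Adding the header: how to simplify it

3. Moving off the shell: reduce the amount of shell code in the script

4. Make use of mate information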
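For the hard-coding issue, values that are currently fixed in the script could be moved into command-line options. The sketch below reuses the `-i`, `-f`, `-I` and `-o` flags that appear in the commands in these notes. The long option names and the `--chroms` option are hypothetical, and the notes do not show how full_human.py actually parses its arguments.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the -i/-f/-I/-o flags used in the commands."""
    parser = argparse.ArgumentParser(description="Extract reads that fall within introns.")
    parser.add_argument("-i", "--sam", required=True, help="aligned SAM file")
    parser.add_argument("-f", "--fasta", required=True, help="reference genome FASTA")
    parser.add_argument("-I", "--introns", required=True, help="intron GFF file")
    parser.add_argument("-o", "--out", required=True, help="output location")
    parser.add_argument(
        "--chroms",
        nargs="+",
        default=[f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"],
        help="chromosomes to process (hypothetical option replacing a hard-coded list)",
    )
    return parser
```

With the chromosome list exposed as an option, switching to a genome with different chromosome names would not require editing the script.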

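Previously raised problems that are now solved:

Migrated the script that ran on plant data so that it runs on human data

Simplified the intron -> list step and rewrote it in pure Python, with no shell calls

Next steps:

Compare the results from this run with the lab's BP results

Inspect the intron file and the result files in IGV.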
