如何批量获取基因的转录组序列
2020-01-16 本文已影响0人
落寞的橙子
一、使用biomaRt包的getSequence
我的一个例子,ensembl编号的基因集(eg:ENSG00000272106.1)
rm(list=ls())
suppressMessages(library(biomaRt))
#移除genelist的小数点
genelist<-unlist(lapply(human_overlap_lnc, FUN = function(x) {return(strsplit(x, split = ".",fixed = T)[[1]][1])}))
#查看有那些dataset
#选用对应的数据集,这里采用的是hsapiens_gene_ensembl
mart = useMart('ensembl')
listDatasets(mart)
Ensemble_to_seq<-function(x) {
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
seq = getSequence(id = x,
type = "ensembl_gene_id",
seqType = "cdna",
mart = mart)
seq<-as.data.frame(seq)
seq$"Length"<-lapply(seq[,1],function(y){return(nchar(y))})
return(seq)
}
outTab<-data.frame()
for (i in genelist){
outTab<-rbind(outTab,Ensemble_to_seq(i))
}
write.table(outTab,paste0(ori_dir_lnc,“/your_GeneList_sequence.txt"),sep = "\t",quote = F,row.names = F,col.names = F)
二、EASER 不会Python 没run
三、seqkit grep
后来我发现biomaRT实在太慢了,于是想到可以用grep,可以先去genecode上下载transcripts.fa 文件,使用seqkit的grep功能去做这个事情。
四、awk+grep
参考方法
虽然笨,但是确实work。同样去GENECODE网站上去下载transcripts.fa,然后往下走就行了,注意release版本要一致。我的例子:
#######prepare the files
#Get the transcript sequence
data_dir=/your_dir/sequence
human_transcript_fasta=/your_dir/hg38/fasta/gencode.v32.transcripts.fa
mouse_transcript_fasta=/your_dir/mouse/fasta/gencode.vM23.transcripts.fa
human_fa_dir=/your_dir/sequence/human/fasta
mouse_fa_dir=/your_dir/sequence/mouse/fasta
mkdir -p ${human_fa_dir}
mkdir -p ${mouse_fa_dir}
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < ${human_transcript_fasta} > /you/hg38/fasta/gencode.v32.transcripts.linear.fa
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < ${mouse_transcript_fasta} > /your_dir/hg38/fasta/gencode.vM23.transcripts.linear.fa
human_fa_linear=/your_dir/hg38/fasta/gencode.v32.transcripts.linear.fa
mouse_fa_linear=/your_dir/hg38/fasta/gencode.vM23.transcripts.linear.fa
cat ${data_dir}/human_fasting_refeeding_overlap_lncRNAs_logFC_0_pvalue_0.05.txt | while read line
do
cat ${human_fa_linear} | grep ${line} --no-group-separator | tr '\t' '\n' |awk '{ if ($0~/^>/) { n=split($0, a, "|"); gsub(/_/," ", a[1]); printf("%s|%s\n", a[1], substr(a[2], 2)); } else { print $0; } }' > ${human_fa_dir}/${line}.fa
done
cat ${mouse_fa_linear} | grep -f ${data_dir}/mouse_genelist.txt --no-group-separator | tr '\t' '\n' > ${mouse_fa_dir}/mouse.lncRNA.fa