Sort genes by expression level and output the genes in each expression range
2023-04-10
余绕
I have a file with two tab-separated columns. The first column contains IDs and the second column contains corresponding values.
I need to sort the IDs based on their corresponding values. After sorting, I want to divide the IDs into four ranges: highly expressed, moderately expressed, lowly expressed, and non-expressed.
To determine the expression ranges, I treat genes with a value below 1 as non-expressed (that is, effectively 0). I then sort the remaining values from high to low and divide them into three equal parts. Finally, I write the four groups to separate files.
The following Perl script accomplishes this task:
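use strict;
use warnings;

# Usage: perl perl_sort.pl input.txt high.txt mid.txt low.txt non.txt
die "Usage: perl $0 input.txt high.txt mid.txt low.txt non.txt\n"
    unless @ARGV == 5;

# First pass: collect the values of all expressed genes (value >= 1).
my @tpm;
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    my ($gene_id, $tpm) = split /\t/, $_;
    next unless defined $tpm;          # skip blank or malformed lines
    next if $gene_id eq 'gene_id';     # skip the header line
    push @tpm, $tpm if $tpm >= 1;
}
close $in;

# Sort from high to low and find the two cutoffs that divide the
# expressed values into three parts. Each of the first two parts holds
# int(n/3) values; any remainder ends up in the low group.
@tpm = sort { $b <=> $a } @tpm;
my $n = scalar @tpm;
die "No expressed genes (value >= 1) found in $ARGV[0]\n" unless $n;
my $third    = int($n / 3);
my $high_cut = $tpm[$third];        # values above this are "high"
my $mid_cut  = $tpm[2 * $third];    # values above this (up to $high_cut) are "mid"

# Second pass: assign every gene to exactly one group.
# A value equal to a cutoff falls into the lower of the two groups.
my (@top_id, @mid_id, @end_id, @nonexpress_id);
open $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    my ($gene_id, $tpm) = split /\t/, $_;
    next unless defined $tpm;
    next if $gene_id eq 'gene_id';
    if    ($tpm > $high_cut) { push @top_id,        $gene_id }
    elsif ($tpm > $mid_cut)  { push @mid_id,        $gene_id }
    elsif ($tpm >= 1)        { push @end_id,        $gene_id }
    else                     { push @nonexpress_id, $gene_id }
}
close $in;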
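# Write each group of IDs to its own output file, one ID per line.
my @outputs = (
    [ $ARGV[1], \@top_id ],
    [ $ARGV[2], \@mid_id ],
    [ $ARGV[3], \@end_id ],
    [ $ARGV[4], \@nonexpress_id ],
);
for my $pair (@outputs) {
    my ($file, $ids) = @$pair;
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} "$_\n" for @$ids;
    close $out or die "Cannot close $file: $!";
}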
To run the script, pass the input file followed by the four output files, in the order high, moderate, low, and non-expressed:
perl perl_sort.pl average_CK_TPM.txt high.txt mid.txt low.txt Non.txt
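To see what kind of grouping to expect before running this on real data, here is a minimal, self-contained sketch of the procedure described at the top of the post: values below 1 count as non-expressed, and the remaining IDs are sorted from high to low and divided into three equal parts. It is not the script above. It splits by rank rather than by value cutoffs, the helper name group_by_expression is hypothetical, and the sample IDs and values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash ref of id => value and the minimum
# value that counts as "expressed"; returns the four groups of IDs.
sub group_by_expression {
    my ($values, $min_expressed) = @_;

    # Expressed IDs, sorted from high to low value
    # (ties are broken by ID so the result is reproducible).
    my @expressed = sort { $values->{$b} <=> $values->{$a} or $a cmp $b }
                    grep { $values->{$_} >= $min_expressed } keys %$values;
    my @non = sort { $a cmp $b }
              grep { $values->{$_} < $min_expressed } keys %$values;

    # Split by rank: the first two groups get int(n/3) IDs each,
    # and the low group takes whatever is left.
    my $third = int(@expressed / 3);
    my @high  = splice(@expressed, 0, $third);
    my @mid   = splice(@expressed, 0, $third);
    my @low   = @expressed;

    return { high => \@high, mid => \@mid, low => \@low, non => \@non };
}

# Made-up example data (not from a real experiment).
my %tpm = (
    geneA => 120.5, geneB => 48.0, geneC => 30.2,
    geneD => 12.7,  geneE => 5.1,  geneF => 1.4,
    geneG => 0.6,   geneH => 0,
);

my $groups = group_by_expression(\%tpm, 1);
for my $name (qw(high mid low non)) {
    print "$name: @{ $groups->{$name} }\n";
}
```

With the six expressed genes in this toy table, each of the three expressed groups receives two IDs, and geneG and geneH are reported as non-expressed. Splitting by rank guarantees that every ID lands in exactly one group even when several genes share the same value.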