Topic 5. 克隆进化之 CITUP

2022-03-25 本文已影响0人桓峰基因

这期介绍CITUP 全称 Conality Inference in Tumors Using Phylogeny，是一种使用系统发育论进行肿瘤的克隆推断生物信息学工具，可用于从单个患者获得的多个样本来推断肿瘤异质性。利用Pyclone的输出文件tables/loci.tsv 获得CITUP的输入文件。

研究动机：肿瘤内的异质性通过肿瘤进展过程中亚克隆的进化而呈现。虽然最近的研究表明，这种异质性具有临床意义，生物信息学中测定克隆亚群仍然是一个挑战。

结果：该软件通过一种新的组合方法来解决这个问题，即使用系统发育(CITUP)学对肿瘤克隆进行推断(clonality inference in tumor using phylogeny)，该方法在满足系统发育约束的情况下推断克隆种群及其频率，并能够利用多个样本的数据。利用来自两项癌症研究的模拟数据集和深度测序数据，表明CITUP预测克隆频率和潜在的系统发育具有较高的准确性。

可用性和实现：CITUP可在http://sourceforge.net/projects/citup/免费获得。

关于CITUP软件的安装：

安装CITUP的前提条件是在conda环境下，python 版本为2.7，方能顺利完成安装，当时安装的时候总是出现版本问题，后来有位大神写了个博客，按照他给出来的版本一一安装，才成功，中间卸载，加载了好多次，我这里也给出来，免得大家一遍一遍尝试，浪费时间。

# 添加channelsconda config --add channels http://conda.anaconda.org/dranewconda config --add channels https://conda.anaconda.org/IBMDecisionOptimization/linux-64# 创建小环境，然后安装，其实可以安装在前面的 pyclone 小环境里conda create -n citupconda activate citupconda install -y citup h5py# 安装完成后可以调用软件的帮助文档run_citup_iter.py -hrun_citup_qip.py  -h

版本号来自大神的帖子，这个非常重要，各种包的版本号如下：

$ conda install citup --prefix ./Fetching package metadata .................Solving package specifications: .Package plan for installation in environment /home/test:The following NEW packages will be INSTALLED:    blas:            1.0-mkl                                            blosc:           1.15.0-hd408876_0                                  boost_source:    1.60.0-0                dranew                     bzip2:           1.0.6-h14c3975_5                                   ca-certificates: 2019.1.23-0                                        certifi:         2019.3.9-py27_0                                    citup:           0.1.1-py27_1            dranew                     cplex:           12.8-py27_0             IBMDecisionOptimization    decorator:       4.4.0-py27_1                                       hdf5:            1.10.4-hb1b8bf9_0                                  intel-openmp:    2019.3-199                                         libedit:         3.1.20181209-hc058e9b_0                            libffi:          3.2.1-hd88cf55_4                                   libgcc-ng:       8.2.0-hdf63c60_1                                   libgfortran-ng:  7.3.0-hdf63c60_0                                   libstdcxx-ng:    8.2.0-hdf63c60_1                                   lzo:             2.10-h49e0be7_2                                    mkl:             2019.3-199                                         mkl_fft:         1.0.12-py27ha843d7b_0                              mkl_random:      1.0.2-py27hd81dba3_0                               ncurses:         6.1-he6710b0_1                                     networkx:        2.2-py27_1                                         numexpr:         2.6.9-py27h9e4a6bb_0                               numpy:           1.16.3-py27h7e9f1db_0                              numpy-base:      1.16.3-py27hde5b4d6_0                              openssl:         1.1.1b-h7b6447c_1                                  pandas:          0.24.2-py27he6710b0_0                              pip:             19.1.1-py27_0                                      pypeliner:       0.5.0-py27h1453be2_0    dranew                     pytables:        3.5.1-py27h71ec239_0                               python:          2.7.16-h9bab390_0                                  python-dateutil: 2.8.0-py27_0                                       pytz:            2019.1-py_0                                        readline:        7.0-h7b6447c_5                                     scikit-learn:    0.20.3-py27hd81dba3_0                              scipy:           1.2.1-py27h7c811a0_0                               setuptools:      41.0.1-py27_0                                      six:             1.12.0-py27_0                                      snappy:          1.1.7-hbae5bb6_3                                   sqlite:          3.28.0-h7b6447c_0                                  tk:              8.6.8-hbc83047_0                                   wheel:           0.33.2-py27_0                                      zlib:            1.2.11-h7b6447c_3

关于输入文件获得：

输入文件需要准备两个文件：

突变位点在不同样本的突变频率，行是突变位点，列是样本；

突变的 cluster，只有单列，记录每个突变位点所在的 cluster 。

两个文件的突变位点顺序要一致。具体可以参考：https://shahlab.ca/projects/citup/

利用前面 pyclone 结果进行处理获得两个文件格式方法和格式，如下：

cat pyclone_analysis/tables/loci.tsv | cut -f 6 | sed '1d' | paste - - - -  >pyclone_analysis/freq.txt#####freq.txt1.65541e-08  4.36882e-01  5.52845e-09  1.61941e-01  1.20627e-08  3.61060e-01  4.01164e-021.65324e-08  3.98927e-01  8.02723e-09  1.88155e-01  1.23399e-08  3.70812e-01  4.21063e-022.72157e-08  4.00831e-09  2.54612e-01  9.32397e-03  1.86031e-01  2.41739e-01  3.08294e-011.32804e-08  3.02610e-09  2.69609e-01  7.53186e-09  1.98110e-01  2.30463e-01  3.01818e-01cat pyclone_analysis/tables/loci.tsv | cut -f 3 | sed '1d' | paste - - - - |cut -f 1 >pyclone_analysis/cluster.txt######cluster.txt0  11  22  33  44  50  6cat pyclone_analysis/tables/loci.tsv | cut -f 2 | sed '1d' | head -4 >pyclone_analysis/sample_id#####sample_idPMPM

关于运行问题

该软件使用起来还是蛮简单的，但是输入输出文件的获得稍微麻烦一点，尤其输出结果hdf5格式，不是普通的文本文件，需要进一步处理一下，好多都卡在这里，但是别担心，方式多种，选择适合自己的就行。

run_citup_qip.py ./pyclone_analysis/freq.txt ./pyclone_analysis/cluster.txt ./pyclone_analysis/results.h5

关于H5格式文件读取

首先看下H5格式的生产的数据模式，然后提取我们需要的数据，如下：

###The output of citup is pandas hdf5 format.####The results store contains the following pandas series, with an entry per tree solution:/results/bic/results/error_rate/results/likelihood/results/num_mutations/results/num_nodes/results/num_samples/results/objective_value/results/optimal  #######拟合最佳的进化树/results/tree_id/results/tree_index/results/tree_string###The store also contains the following pandas series, with one series per tree solution:/trees/{tree_solution}/cluster_assignment/trees/{tree_solution}/objective_value###Finally, the store contains the following pandas dataframes, with one frame per tree solution:、/trees/{tree_solution}/adjacency_list  #####进化树的节点/trees/{tree_solution}/clade_freq/trees/{tree_solution}/clone_freq  ####克隆组合/trees/{tree_solution}/gamma_matrix

python方式读取：

利用python 包h5py既可以读取结果文件results.h5，然后从上述结果中取最佳拟合树、进化树节点，克隆组成等信息，写段代码保存为ReadH5.py。

import sysimport h5pyhf=h5py.File(sys.argv[1],'r')hf.keys()opnum=hf["results/optimal/index"][0]cellfreq=hf["trees/" + str(opnum) + "/clone_freq/block0_values"][:]tree=hf["trees/" + str(opnum) + "/adjacency_list/block0_values"][:]print cellfreqprint tree

利用ReadH5.py脚本出来一下样本，得到的cellfreq.txt和tree.txt 这两个文件，如下：

python ReadH5.py pyclone_analysis/results.h5 | sed 's/^ \[//;s/\[//g;s/\]//g' | tr ' ' '\t'| grep '\.' >pyclone_analysis/cellfreq.txtpython ReadH5.py pyclone_analysis/results.h5 | sed 's/^ \[//;s/\[//g;s/\]//g' | tr ' ' '\t'| grep -v '\.' >pyclone_analysis/tree.txt

cellfreq是一个矩阵：行表示样本，列表示克隆 clusters，每个值代表克隆频率，CITUP软件中给出来的例子，如下：

0.2 0.10.4 0.30.5 0.1

后者表示的是进化树的克隆分支关系，如下：

关于R方式读取H5文件：

hdf5文件是一种大数据存储结构，除了目前介绍的hdf5r包之外，同时cran中的h5包，Bioconductor中的rhdf5也能够实现类似的功能。其中hdf5r这个包的读取，如下：

library(hdf5r)# 查看file.h5下的groupnames(file.h5)# [1] "flights" "mtcars"# 查看filght组中有什么数据names(flights.grp)## [1] "flights"    "weather"    "wind_dir"   "wind_speed"

最后利用三个文件既sample_id，cellfreq.txt和tree.txt，即可通过Timescape 做肿瘤进化鱼图，下期将继续讲解。

Reference:

Salem Malikic, Andrew W. McPherson, Nilgun Donmez, Cenk S. Sahinalp, Clonality inference in multiple tumor samples using phylogeny, Bioinformatics, Volume 31, Issue 9, 1 May 2015, Pages 1349–1356, https://doi.org/10.1093/bioinformatics/btv003

Topic 5. 克隆进化之 CITUP

猜你喜欢

热点阅读