生信分析流程

DeepVariant — step4 DeepVariant

2019-07-31  本文已影响0人  小灰灰不会飞那又怎样

Deepvariant是google 推出的基于CNN(convolutional neural network,卷积神经网络)算法,图像识别的变异检测软件。

参考的有关Deepvariant原理的博客:
Creating a universal SNP and small indel variant caller with deep neural networks(deepvariants文章)
Creating a universal SNP and small indel variant caller with deep neural networks理解(对deepvariants文章的解读)
2017-google-DeepVariant-机器学习(卷积神经网络)与基因组学
DeepVariant: 用卷积神经网络进行DNA序列变异位点检测
EVALUATING DEEPVARIANT: A NEW DEEP LEARNING VARIANT CALLER FROM THE GOOGLE BRAIN TEAM

附图一张:


DeepVariant 识别variant的过程

DeepVariant识别variant过程总共包括了三个步骤,一一说明:

一、make_examples

筛选候选位点

1.原理:

数据比对到参考基因组,考量候选区域支持的reads数,以及支持变异的read数与该区域总reads数之比),碱基质量值等,筛选潜在变异。
得到候选位点,将每个位点转化成RGB图像,这一步会得到多个multi-dimensional pileup images。

2.命令:
/opt/deepvariant/bin/make_examples --mode calling --ref "/opt/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta" --reads "/opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam" --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@2.gz" --regions "chr20:10,000,000-10,010,000" --gvcf "/tmp/deepvariant_tmp_output/gvcf.tfrecord@2.gz"
3.输入:

参考基因组;
bam文件;

4.得到:
root@bcffde6cb466:/deepvariant# ll /tmp/deepvariant_tmp_output/
total 468
drwxr-xr-x   2 root root   4096 Jul 30 09:54 ./
drwxrwxrwt. 13 root root   4096 Jul 30 09:54 ../
-rw-r--r--   1 root root   4496 Jul 30 09:54 gvcf.tfrecord-00000-of-00002.gz
-rw-r--r--   1 root root 303120 Jul 30 09:54 make_examples.tfrecord-00000-of-00002.gz
-rw-r--r--   1 root root 154859 Jul 30 09:54 make_examples.tfrecord-00000-of-00002.gz.run_info.pbtxt
5.LOG输出:
2019-07-30 09:54:06.077220: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
I0730 09:54:06.077377 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
I0730 09:54:06.079504 140332821325568 make_examples.py:1110] Preparing inputs
2019-07-30 09:54:06.079838: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
I0730 09:54:06.079907 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
I0730 09:54:06.080715 140332821325568 make_examples.py:1034] Common contigs are [u'chr20']
I0730 09:54:06.083332 140332821325568 make_examples.py:1116] Writing examples to /tmp/deepvariant_tmp_output/make_examples.tfrecord-00000-of-00002.gz
I0730 09:54:06.083600 140332821325568 make_examples.py:1120] Writing gvcf records to /tmp/deepvariant_tmp_output/gvcf.tfrecord-00000-of-00002.gz
2019-07-30 09:54:06.084198: I third_party/nucleus/io/sam_reader.cc:600] Setting HTS_OPT_BLOCK_SIZE to 134217728
2019-07-30 09:54:06.143938: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
I0730 09:54:06.144200 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
I0730 09:54:06.452500 140332821325568 make_examples.py:1149] Task 0: 6 candidates (6 examples) [0.37s elapsed]
I0730 09:54:06.900475 140332821325568 make_examples.py:1164] Writing MakeExamplesRunInfo to /tmp/deepvariant_tmp_output/make_examples.tfrecord-00000-of-00002.gz.run_info.pbtxt
I0730 09:54:07.027529 140332821325568 make_examples.py:1167] Found 40 candidate variants
I0730 09:54:07.028004 140332821325568 make_examples.py:1168] Created 46 examples
二、call_variants

运用TensorFlow框架构建模型,训练样本

1.原理:

这些候选位点被分为三类:hom-ref(与参考相同),het(杂合型),hom-alt(纯合变异),利用卷积神经网络进行训练和分类。

CNN训练(trained CNN):利用一个已知genotype的样本(genome in a bottle, NA12878, http://jimb.stanford.edu/giab/)的image训练CNN,不断优化参数,直到模型表现收敛。

2.命令:
/opt/deepvariant/bin/call_variants --outfile "/tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz" --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@2.gz" --checkpoint "/opt/models/wgs/model.ckpt"
3.输入:

参考基因组;
第一步得到的候选位点;
已知模型;

4.得到:
root@bcffde6cb466:/deepvariant# ll /tmp/deepvariant_tmp_output/
total 472
drwxr-xr-x   2 root root   4096 Jul 31 01:18 ./
drwxrwxrwt. 14 root root   4096 Jul 31 01:19 ../
-rw-r--r--   1 root root   2504 Jul 31 01:19 call_variants_output.tfrecord.gz
5.LOG输出:
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

I0731 01:18:53.880074 140390238451456 call_variants.py:313] Set KMP_BLOCKTIME to 0
2019-07-31 01:18:54.094641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 01:18:54.524745: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2019-07-31 01:18:54.525717: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x52c22f0 executing computations on platform Host. Devices:
2019-07-31 01:18:54.525743: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
I0731 01:18:55.435303 140390238451456 modeling.py:364] Initializing model with random parameters
W0731 01:18:55.437553 140390238451456 estimator.py:1760] Using temporary folder as model directory: /tmp/tmp4YBYF_
I0731 01:18:55.437961 140390238451456 estimator.py:201] Using config: {'_save_checkpoints_secs': 1000, '_session_config': , '_keep_checkpoint_max': 100000, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7faf05f5db10>, '_model_dir': '/tmp/tmp4YBYF_', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
I0731 01:18:55.438348 140390238451456 call_variants.py:381] Writing calls to /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz
W0731 01:18:55.445154 140390238451456 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
I0731 01:18:55.480995 140390238451456 data_providers.py:360] self.input_read_threads=8
W0731 01:18:55.481403 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/data_providers.py:365: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
I0731 01:18:55.596564 140390238451456 data_providers.py:366] self.input_map_threads=48
W0731 01:18:55.596837 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/data_providers.py:371: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
I0731 01:18:55.621093 140390238451456 estimator.py:1111] Calling model_fn.
W0731 01:18:55.621329 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/modeling.py:679: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0731 01:18:55.623230 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/modeling.py:681: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
I0731 01:18:59.044318 140390238451456 estimator.py:1113] Done calling model_fn.
I0731 01:19:02.425590 140390238451456 monitored_session.py:222] Graph was finalized.
W0731 01:19:02.426923 140390238451456 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0731 01:19:02.463726 140390238451456 saver.py:1270] Restoring parameters from /opt/models/wgs/model.ckpt
I0731 01:19:05.943398 140390238451456 session_manager.py:491] Running local_init_op.
I0731 01:19:06.011964 140390238451456 session_manager.py:493] Done running local_init_op.
I0731 01:19:06.536978 140390238451456 modeling.py:283] Reloading EMA...
I0731 01:19:06.538686 140390238451456 saver.py:1270] Restoring parameters from /opt/models/wgs/model.ckpt
2019-07-31 01:19:11.018226: W tensorflow/core/framework/allocator.cc:124] Allocation of 67891200 exceeds 10% of system memory.
2019-07-31 01:19:11.110581: W tensorflow/core/framework/allocator.cc:124] Allocation of 24398400 exceeds 10% of system memory.
2019-07-31 01:19:11.154066: W tensorflow/core/framework/allocator.cc:124] Allocation of 31736320 exceeds 10% of system memory.
2019-07-31 01:19:11.352648: W tensorflow/core/framework/allocator.cc:124] Allocation of 29887488 exceeds 10% of system memory.
2019-07-31 01:19:11.527675: W tensorflow/core/framework/allocator.cc:124] Allocation of 59774976 exceeds 10% of system memory.
I0731 01:19:14.422972 140390238451456 call_variants.py:399] Processed 1 examples in 1 batches [1898.435 sec per 100]
I0731 01:19:14.607350 140390238451456 call_variants.py:401] Done evaluating variants
三、postprocess_variants

用训练好的CNN模型来判断最可能的genotype

1.原理:

利用训练好的CNN模型,寻找与参考基因组不一样的位点(基于图像识别与参考基因组不同的位点,并由CNN算法(trained CNN)给出候选位点的genotype likehood值),将概率最大值作为该基因的基因型。

2.命令:
mkdir -p /deepvariant/quickstart-output/
/opt/deepvariant/bin/postprocess_variants --ref "/opt/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta" --infile "/tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz" --outfile "/deepvariant/quickstart-output/output.vcf.gz"
3.输入:

参考基因组;
第二步得到的训练模型;

4.得到:
root@bcffde6cb466:/deepvariant# ll /deepvariant/quickstart-output/
total 20
drwxr-xr-x 2 root root 4096 Jul 31 01:32 ./
drwxr-xr-x 3 root root 4096 Jul 31 01:22 ../
-rw-r--r-- 1 root root  533 Jul 31 01:22 output.g.vcf.gz
-rw-r--r-- 1 root root 1526 Jul 31 01:32 output.vcf.gz
-rw-r--r-- 1 root root  143 Jul 31 01:32 output.vcf.gz.tbi
5.LOG输出:
2019-07-31 01:32:06.448499: I deepvariant/postprocess_variants.cc:88] Read from: /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz
2019-07-31 01:32:06.449696: I deepvariant/postprocess_variants.cc:97] Done reading: /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz. #entries in single_site_calls = 46
2019-07-31 01:32:06.449717: I deepvariant/postprocess_variants.cc:101] Total #entries in single_site_calls = 46
2019-07-31 01:32:06.449724: I deepvariant/postprocess_variants.cc:103] Start SortSingleSiteCalls
2019-07-31 01:32:06.449744: I deepvariant/postprocess_variants.cc:105] Done SortSingleSiteCalls
I0731 01:32:06.450007 140146169325312 postprocess_variants.py:907] CVO sorting took 2.54511833191e-05 minutes
I0731 01:32:06.450489 140146169325312 postprocess_variants.py:909] Transforming call_variants_output to variants.
I0731 01:32:06.450607 140146169325312 postprocess_variants.py:921] Writing variants to VCF.
I0731 01:32:06.450683 140146169325312 postprocess_variants.py:628] Writing output to VCF file: /deepvariant/quickstart-output/output.vcf.gz
I0731 01:32:06.451236 140146169325312 genomics_writer.py:172] Writing /deepvariant/quickstart-output/output.vcf.gz with NativeVcfWriter
I0731 01:32:06.463268 140146169325312 postprocess_variants.py:633] 1 variants written.
I0731 01:32:06.487912 140146169325312 postprocess_variants.py:929] VCF creation took 0.000621068477631 minutes
上一篇下一篇

猜你喜欢

热点阅读