A good-looking R Markdown template

2022-10-29  小明的数据分析笔记本

The template comes from the paper

Removing unwanted variation from large-scale RNA sequencing data with PRPS

The paper provides a large amount of data and code.

Link: GitHub - RMolania/TCGA_PanCancer_UnwantedVariation

The template requires the rmdformats R package.
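
A minimal sketch of checking the template's dependencies before knitting. The list of packages is an assumption based on the header shown further down (rmdformats for `readthedown`, bookdown because of `use_bookdown: yes`), and `template.Rmd` is a hypothetical file name.

```r
# Packages the header appears to rely on (an assumption, see the text above).
required <- c("rmarkdown", "knitr", "rmdformats", "bookdown")

# Return the subset of packages that cannot be loaded in this R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}

# "template.Rmd" is a hypothetical file name; uncomment once the file exists.
# rmarkdown::render("template.Rmd")
```

Calling `rmarkdown::render()` without an `output_format` argument uses the first format listed under `output:`, which in this header is `rmdformats::readthedown`.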


R Markdown header content

---
title: "Removing tumour purity, library size and batch effects from the TCGA breast cancer RNA-seq data using RUV-III-PRPS"
author:
- name: Ramyar Molania
  affiliation: Papenfuss Lab, Bioinformatics, WEHI.
  url: https://www.wehi.edu.au/people/tony-papenfuss
date: "15-02-2020"
output:
  rmdformats::readthedown:
    code_folding: hide
    gallery: yes
    highlight: tango
    lightbox: yes
    self_contained: yes
    thumbnails: no
    number_sections: yes
    toc_depth: 3
    use_bookdown: yes
  html_document2:
    df_print: paged
  html_document:
    toc_depth: '3'
    df_print: paged
params:
  update_date: !r paste("Last updated on:", Sys.Date())
editor_options:
  chunk_output_type: console
---
`r params$update_date`

<style type="text/css">
h1.title {
  font-size: 28px;
  color: DarkRed;
}
h1 { /* Header 1 */
  font-size: 24px;
  color: DarkBlue;
}
h2 { /* Header 2 */
  font-size: 20px;
  color: DarkBlue;
}
h3 { /* Header 3 */
  font-size: 18px;
  color: DarkBlue;
}
h4 { /* Header 4 */
  font-size: 16px;
  color: DarkBlue;
}
</style>

<style>
p.caption {
  font-size: 46em;
  font-style: italic;
  color: black;
}
</style>



```{r setup, include=FALSE}
knitr::opts_chunk$set(
  tidy = FALSE,
  fig.width = 10,
  message = FALSE,
  warning = FALSE)
```
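
As a rough illustration of what the setup chunk does, the sketch below sets the same defaults from a plain R session and reads one of them back. The guard around knitr is an addition for safety and is not part of the template.

```r
# Only run if knitr is installed; the values mirror the setup chunk above.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    tidy = FALSE,      # keep code formatting exactly as typed
    fig.width = 10,    # default figure width in inches
    message = FALSE,   # hide message() output in the rendered document
    warning = FALSE    # hide warnings in the rendered document
  )
  # Query one default to confirm it was registered.
  print(knitr::opts_chunk$get("fig.width"))
}
```

These values become the defaults for every later chunk, and an individual chunk can still override them in its own header.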

# Introduction

Effective removal of unwanted variation is essential to derive meaningful biological results from RNA-seq data, particularly when the data come from large and complex studies. We have previously proposed a method, removing unwanted variation III (RUV-III), to normalize gene expression data [(R.Molania, NAR, 2019)](https://academic.oup.com/nar/article/47/12/6073/5494770?login=true). The RUV-III method requires well-designed technical replicates (well-distributed across sources of unwanted variation) and negative control genes to estimate known and unknown sources of unwanted variation and remove them from the data.\
We propose a novel strategy, pseudo-replicates of pseudo-samples (PRPS) [R.Molania, bioRxiv, 2021](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1), for deploying RUV-III to normalize RNA-seq data in situations when technical replicates are not available or are not well-designed. Our approach requires at least one **roughly** known biologically homogeneous subclass of samples present across the sources of unwanted variation. For example, in a cancer RNA-seq study, normal tissue samples may be present across all sources of unwanted variation; we can then use these samples to create PRPS.\
To create PRPS, we first need to identify the sources of unwanted variation, which we call batches in the data. Then the gene expression measurements of suitable biologically homogeneous sets of samples are averaged within batches, and the results are called pseudo-samples. Since the variation between pseudo-samples in different batches is mainly unwanted variation, by defining them as pseudo-replicates and using them in RUV-III as replicates, we can easily and effectively remove the unwanted variation. We refer the reader to our paper for more technical details [R.Molania, bioRxiv, 2021](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1).\
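
The averaging step can be sketched in base R as follows. This is a toy illustration under stated assumptions, not the code from the paper: the expression values are simulated on a log scale, and the names `expr`, `batch`, `biology`, `pseudo_samples` and `prps_set` are hypothetical.

```r
set.seed(1)

# Toy data: 5 genes x 12 samples (simulated values, assumed to be on a log scale).
n_genes <- 5
n_samples <- 12
expr <- matrix(rnorm(n_genes * n_samples, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("s", 1:n_samples)))

# Hypothetical annotations: one batch label and one biological label per sample.
batch   <- rep(c("plateA", "plateB", "plateC"), each = 4)
biology <- rep(c("normal", "tumour"), times = 6)

# A pseudo-sample is the gene-wise mean over all samples that share
# the same biology and the same batch.
group <- paste(biology, batch, sep = "_")
pseudo_samples <- sapply(split(seq_len(n_samples), group), function(idx) {
  rowMeans(expr[, idx, drop = FALSE])
})

# Pseudo-samples with the same biology but different batches form one
# pseudo-replicate set, to be treated as replicates by RUV-III.
prps_set <- sub("_.*$", "", colnames(pseudo_samples))
print(table(prps_set))
```

Here each biological group yields one pseudo-sample per plate, so differences within a pseudo-replicate set mostly reflect the plate effect rather than biology.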

Here, we use the TCGA invasive breast cancer (BRCA) RNA-seq data as an example to show how to remove tumour purity, flow cell chemistry, library size and batch effects (plate effects) from the data. We illustrate the value of our approach by comparing it to the standard TCGA normalizations on the TCGA BRCA RNA-seq data. Further, we demonstrate how unwanted variation can compromise several downstream analyses and can lead to wrong biological conclusions. We will also assess the performance of RUV-III with poorly chosen PRPS and in situations where biological labels are only partially known.\
Note that RUV-III with PRPS is not limited to TCGA data: it can be used for any large genomics project involving multiple labs, technicians, platforms, ...\

## Data preparation

The TCGA consortium aligned RNA sequencing reads to the hg38 reference genome using the STAR aligner and quantified the results at gene level using HTSeq and the GENCODE v22 gene annotation [Ref](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/). The TCGA RNA-seq data are publicly available in three formats: raw counts, FPKM and FPKM with upper-quartile normalization (FPKM.UQ). All these formats for individual cancer types (33 cancer types, ~11,000 samples) were downloaded using the R/Bioconductor package (version 2.16.1). The TCGA normalized microarray gene expression data were downloaded from the Broad GDAC [Firehose](https://gdac.broadinstitute.org) repository, data version 2016/01/28. Tissue source sites (TSS) and sequencing-plate batches were extracted from individual TCGA [patient barcodes](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/), and sample processing times were downloaded from the [MD Anderson Cancer Centre TCGA Batch Effects website](https://bioinformatics.mdanderson.org/public-software/tcga-batch-effects). Pathological features of cancer patients were downloaded from the Broad GDAC Firehose repository (https://gdac.broadinstitute.org). The details of processing the TCGA BRCA RNA-seq samples using two flow cell chemistries were received by personal communication from Dr. K Hoadley. The TCGA survival data reported by [Liu et al.](https://www.cell.com/cell/fulltext/S0092-8674(18)30229-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867418302290%3Fshowall%3Dtrue) were used in this paper. The consensus measurement of purity estimation (CPE) values were downloaded from the [Aran et al.](https://www.nature.com/articles/ncomms9971) study.\
We have generated SummarizedExperiment objects for all the TCGA RNA-seq datasets. These datasets can be found here: [TCGA_PanCancerRNAseq](https://zenodo.org/record/6326542#.YimR0C8Rquo). Unwanted variation in all the datasets can be explored using an R Shiny application published in [(R.Molania, bioRxiv, 2021)](https://www.biorxiv.org/content/10.1101/2021.11.01.466731v1.article-metrics).\
All datasets that are required for this vignette can be found here: [link](https://doi.org/10.5281/zenodo.6392171).
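
As a small sketch of how the tissue source site and plate can be read off a TCGA barcode, the code below splits two barcodes on dashes. The first is the worked example from the GDC barcode documentation; the second is included only to illustrate the format. The name `batch_info` is hypothetical.

```r
# First barcode: worked example from the GDC barcode documentation.
# Second barcode: same format, included only for illustration.
barcodes <- c("TCGA-02-0001-01C-01D-0182-01",
              "TCGA-A8-A08L-01A-11R-A00Z-07")

# Fields are dash-separated:
# project - TSS - participant - sample/vial - portion/analyte - plate - center
fields <- strsplit(barcodes, "-", fixed = TRUE)

batch_info <- data.frame(
  barcode = barcodes,
  tss     = vapply(fields, `[`, character(1), 2),  # tissue source site
  plate   = vapply(fields, `[`, character(1), 6),  # sequencing plate
  stringsAsFactors = FALSE
)
print(batch_info)
```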

# TCGA BRCA gene expression data

## RNA-seq data

We load the `TCGA_SummarizedExperiment_HTseq_BRCA.rds` file. This is a SummarizedExperiment object that contains:

**assays:**

- Raw counts
- FPKM
- FPKM.UQ

**colData:**

- Batch information
- Clinical information (collected from different resources)

**rowData:**

- Genes' details (GC, chromosome, ...)
- Several lists of housekeeping genes
  
The lists of housekeeping genes might be suitable to use as negative control genes (NCG) for the RUV-III normalization.
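
A hedged sketch of loading and inspecting the object. It assumes the `.rds` file has been downloaded into the working directory and that the Bioconductor package SummarizedExperiment is installed. The assay and column names are listed rather than hard-coded, because the exact names stored in the file are not given here.

```r
# Path is an assumption: adjust to wherever the downloaded file was saved.
se_path <- "TCGA_SummarizedExperiment_HTseq_BRCA.rds"

if (requireNamespace("SummarizedExperiment", quietly = TRUE) &&
    file.exists(se_path)) {
  brca_se <- readRDS(se_path)

  # Names of the stored assays (raw counts, FPKM, FPKM.UQ).
  print(SummarizedExperiment::assayNames(brca_se))

  # Sample-level annotation: batch and clinical information.
  print(colnames(SummarizedExperiment::colData(brca_se)))

  # Gene-level annotation: gene details and housekeeping gene lists.
  print(colnames(SummarizedExperiment::rowData(brca_se)))
} else {
  message("File or SummarizedExperiment package not found; nothing loaded.")
}
```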

Result

(screenshot of the rendered document)