SAIGE用户手册笔记2

2022-03-24 本文已影响0人 zd200572

基于集合的分析

SAIGE-GENE（现在称为SAIGE-GENE+）采取两个步骤来执行基于集合的关联测试

BURDEN, SKAT, and SKAT-O
MAC < = 10的超罕见变体在测试中被归结为伪marker
多个函数注释，例如仅 Lof、Lof+Missense、Lof+Missense+Synonymous
多个最大 MAF 阀值，例如 0.0001、0.001 和 0.01
对于每个基因，基于柯西分布组合对应于不同注释的多个p值和最大MAF截止值

步骤 1：拟合零逻辑/线性混合模型

当使用稀疏 GRM 拟合空模型（-useSparseGRMtoFitNULL=TRUE）且未建立方差比（–skipVarianceRatioEstimation=TRUE）时，基于集合的检验的步骤 1 与 SAIGE 中的单变量检验相同

Rscript step1_fitNULLGLMM.R     \
    --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
    --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
    --useSparseGRMtoFitNULL=TRUE    \
    --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
    --phenoCol=y_binary \
    --covarColList=x1,x2,a9,a10 \
    --qCovarColList=a9,a10  \
    --sampleIDColinphenoFile=IID \
    --traitType=binary        \
    --skipVarianceRatioEstimation=TRUE  \
    --outputPrefix=./output/example_binary_sparseGRM

当使用完整的 GRM 来拟合空模型时（GRM 是使用 PLINK 文件中的基因型动态构建的，–plinkfile=)
- 使用完整的 GRM 来拟合空模型（默认情况下，–useSparseGRMtoFitNULL=FALSE)
- 使用完整的 GRM（使用 PLINK 文件中的基因型进行构造）和带有 –useSparseGRMforVarRatio=TRUE 的稀疏 GRM 来估计方差比 ** 对包含稀疏 GRM 的文件使用 –稀疏GRMFile 对包含稀疏 GRM 中样本 ID 的文件使用 –稀疏GRMSampleIDFile
- 需要根据不同的次要等位基因计数类别估计多个方差比，其中 –isCateVarianceRatio=TRUE ** 默认情况下，10 <= MAC < 20 和 MAC >= 20 时，估计两个方差比。** 使用 –cateVarRatioMinMACVecExclude 和 –cateVarRatioMaxMACVecInclude 修改 MAC 类别 请注意，PLINK 文件需要包含至少 200 个 MAC 属于这些类别的变体 与 SAIGE 中用于 SAIGE 中单变量测试的步骤 1 不同，SAIGE 中仅估计单个方差比
```
 Rscript step1_fitNULLGLMM.R     \
     --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly_22chr  \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
     --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
     --phenoCol=y_binary \
     --covarColList=x1,x2,a9,a10 \
     --qCovarColList=a9,a10  \
     --sampleIDColinphenoFile=IID \
     --traitType=binary        \
     --outputPrefix=./output/example_binary_fullGRM \
     --nThreads=64    \
     --useSparseGRMtoFitNULL=FALSE    \
     --isCateVarianceRatio=TRUE      \
     --outputPrefix_varRatio=./output/example_binary_cate \
     --useSparseGRMforVarRatio=TRUE  \
     --IsOverwriteVarianceRatioFile=TRUE
```

输入文件（与 SAIGE 步骤 1 中的输入相同）

与 SAIGE 在步骤 1 中用于单变量关联测试的输入相同
对于稀有变体，PLINK 文件 –plinkFile= 特定于 SAIGE-GENE+，需要包含 MAC 低至 10 的稀有变体。

（必填）表型文件（包含协变量（如果有），如性别和年龄）文件可以是空格，也可以是用标题以制表符分隔的。该文件必须包含一列用于样本 ID，一列用于表型。它可能包含协变量列。

image

less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt

（可选）用于构建完整遗传关系矩阵（GRM）和估计方差比的基因型
文件 SAIGE 采用基因型的 PLINK 二进制文件，并假设文件前缀与 .bed， .bim 的文件前缀相同。和 .fam

    ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
    ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
    ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

（可选）稀疏 GRM 文件和稀疏 GRM 的示例 ID 文件。有关更多详细信息，请参阅步骤 0。

   ./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
   ./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

   ##the sparse matrix can be read and viewed using the R library Matrix
   library(Matrix)
   sparseGRM = readMM("./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx")

输出文件（（与 SAIGE 步骤 1 中用于单变量关联测试的输出相同））

与 SAIGE 在步骤 1 中用于单变量关联测试的输出相同
特定于稀有变异的 SAIGE-GENE+，方差比文件包含多个方差比，而不是单个方差比

模型文件（步骤 2 的输入）)

 ./output/example_binary.rda

 #load the model file in R
 load("./output/example_binary.rda")
 names(modglmm)
 modglmm$theta

image

方差比文件（如果在步骤 1 中估计了类别方差比，则将在步骤 2 中输入)
```
 less -S ./output/example_binary_cate.varianceRatio.txt
```
随机选择的标记子集的关联结果文件（如果在步骤 1 中估计了方差比，则将生成此文件。它是一个中间文件，后续步骤不需要它。)
```
 less -S ./output/example_binary_30markers.SAIGE.results.txt
```

第2步：执行基于区域或基因的关联测试

这些命令与单变体联合测试的步骤 2 相同，不同之处在于
- 指定了一个组文件 （–groupFile），其中包含要测试的每个集合的遗传标记 ID、注释和权重（如果有）
允许每组（基因或区域）使用多个掩码
- 使用 –annotation_in_groupTest 列出用逗号分隔的不同批注。在每个批注组合中，批注由"："分隔
  - 例如，"lof，missense:lof，missense:lof:synonymous"仅用于测试lof，missense +lof和missense+lof+synonymous
- 使用 –maxMAF_in_groupTest 表示以逗号分隔的不同最大MAF 截止值
  - 例如 0.0001，0.001，0.01（默认值）
- 使用 –maxMAC_in_groupTest 表示用逗号分隔的不同最大MAC 截止值
  - 例如，singletons和doubletons为 1，2
  - 默认情况下，不应用此选项
  - 最大 MAC 将转换为最大 MAF，并在分析中合并到 –maxMAF_in_groupTest
- 在示例中，每个集合将应用 9 个掩码，并且将基于柯西组合合并9个p值
默认情况下，将执行 SKAT-O 测试（同时输出 BURDEN 和 SKAT 测试结果）。使用 –r.corr=1 仅执行 BURDEN 检验
- 如果执行 SKAT-O 测试（–r.corr=0），则还会输出单变量联合测试结果
- 如果仅执行 BURDEN 检验（–r.corr=1），则默认情况下不执行单变量联合分析。使用 –is_single_in_groupTest=TRUE 输出单变量关联分析结果。
使用 –is_output_markerList_in_groupTest=TRUE 输出每个测试中包含的标记列表。
默认情况下，程序首先检查组文件中是否提供了每个标记权重。如果不是，程序将基于 Beta 分布中的 MAF 与 paraemters weights.beta 计算权重。使用 –is_no_weight_in_groupTest=TRUE 可不在分析中使用任何权重。
使用 –minMAF、–minMAC 和 –minInfo 指定的单个标记的截止值也应用于基于区域/集的分析中
与单变量关联检验相同，可以执行基于条件分析的汇总统计信息（–条件)

在步骤 1 中，如果使用稀疏 GRM 来拟合空模型，并且没有估计方差比，则在步骤 2 中，使用与输入相同的稀疏 GRM（–稀疏 GRMFile、–稀疏 GRMSampleIDFile）作为输入

 Rscript step2_SPAtests.R        \
     --bgenFile=./input/genotype_100markers.bgen    \
     --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
     --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out_sparseGRMforStep1.txt \
     --chrom=1 \
     --AlleleOrder=ref-first \
     --minMAF=0 \
     --minMAC=0.5 \
     --sampleFile=./input/samplelist.txt \
     --GMMATmodelFile=./output/example_binary_sparseGRM.rda \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
     --groupFile=./input/group_new_chrposa1a2.txt    \
     --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
     --maxMAF_in_groupTest=0.0001,0.001,0.01

如果在步骤 1 中使用了完整的 GRM 来拟合空模型，并且使用完整和稀疏的 GRM 估计方差比，则在步骤 2 中，稀疏 GRM（–稀疏 GRMFile、–稀疏GRMSampleIDFile）和方差比（–varianceRatioFile）将用作输入。使用 –is_output_markerList_in_groupTest=TRUE 输出用于每个测试的标记。

 Rscript step2_SPAtests.R        \
     --bgenFile=./input/genotype_100markers.bgen    \
     --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
     --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out.txt \
     --chrom=1 \
     --LOCO=TRUE    \
     --AlleleOrder=ref-first \
     --minMAF=0 \
     --minMAC=0.5 \
     --sampleFile=./input/samplelist.txt \
     --GMMATmodelFile=./output/example_binary_fullGRM.rda \
     --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt  \
     --groupFile=./input/group_new_chrposa1a2.txt   \
     --annotation_in_groupTest="lof,missense:lof,missense:lof:synonymous"        \
     --maxMAF_in_groupTest=0.0001,0.001,0.01    \
     --is_output_markerList_in_groupTest=TRUE

仅执行 –r.corr=1 的 BURDEN 检验。使用 –minGroupMAC_in_BurdenTest表示负担测试中测试"负载标记"的最小 MAC。

 Rscript step2_SPAtests.R        \
     --bgenFile=./input/genotype_100markers.bgen    \
     --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
     --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out_onlyBURDEN.txt \
     --chrom=1 \
     --LOCO=TRUE    \
     --AlleleOrder=ref-first \
     --minMAF=0 \
     --minMAC=0.5 \
     --sampleFile=./input/samplelist.txt \
     --GMMATmodelFile=./output/example_binary_fullGRM.rda \
     --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt      \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \
     --groupFile=./input/group_new_chrposa1a2.txt    \
     --annotation_in_groupTest="lof,missense;lof,missense;lof;synonymous"        \
     --maxMAF_in_groupTest=0.0001,0.001,0.01    \
 --r.corr=1

Use –condition= to perform conditioning analysis

 Rscript step2_SPAtests.R        \
     --bgenFile=./input/genotype_100markers.bgen    \
     --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
     --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out_cond.txt \
     --chrom=1 \
     --LOCO=TRUE    \
     --AlleleOrder=ref-first \
     --minMAF=0 \
     --minMAC=0.5 \
     --sampleFile=./input/samplelist.txt \
     --GMMATmodelFile=./output/example_binary_fullGRM.rda \
     --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt      \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
     --groupFile=./input/group_new_chrposa1a2.txt    \
     --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
     --maxMAF_in_groupTest=0.0001,0.001,0.01    \
     --condition=1:30:A:C,1:79:A:C

使用PLINK文件作为基因型/剂量的输入（–bedFile=， –bimFile=， –famFile=， –AlleleOrder=）

–等位基因顺序可以是 alt-first 或 ref-first。必须正确指定，否则，PLINK 文件中的变体将与 groupFile 中的标记不匹配

 Rscript step2_SPAtests.R        \
     --bedFile=./input/genotype_100markers.bed       \
     --bimFile=./input/genotype_100markers.bim       \
     --famFile=./input/genotype_100markers.fam       \
     --SAIGEOutputFile=./output/genotype_100markers_plink_groupTest_out.txt \
     --chrom=1 \
     --LOCO=TRUE    \
     --AlleleOrder=alt-first \
     --minMAF=0 \
     --minMAC=0.5 \
     --sampleFile=./input/samplelist.txt \
     --GMMATmodelFile=./output/example_binary_fullGRM.rda \
     --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt      \
     --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
     --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
     --groupFile=./input/group_new_chrposa1a2.txt    \
     --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
     --maxMAF_in_groupTest=0.0001,0.001,0.01

VCF 文件作为输入

–vcfFileIndex 将 .csi 索引文件作为输入，可以使用 tabix –csi -p vcf ./input/genotype_100markers.vcf.gz
–vcfField 是 GT 或 DS
–必须为 VCF 文件指定 chrom。–chrom必须与 VCF 文件中的CHROM字符串相同

Rscript step2_SPAtests.R        \
    --vcfFile=./input/genotype_100markers.vcf.gz    \
    --vcfFileIndex=./input/genotype_100markers.vcf.gz.csi     \
    --vcfField=GT   \
    --SAIGEOutputFile=./output/genotype_100markers_vcf_groupTest_out.txt \
    --LOCO=FALSE    \
    --chrom=1   \
    --minMAF=0 \
    --minMAC=0.5 \
    --sampleFile=./input/samplelist.txt \
    --GMMATmodelFile=./output/example_binary_fullGRM.rda \
    --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt      \
    --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
    --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
    --groupFile=./input/group_new_chrposa1a2.txt    \
    --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
    --maxMAF_in_groupTest=0.0001,0.001,0.01

输入文件

详情请参考 SAIGE (Single-variant test) Step 2 输入文件。此外，基于集合的测试还需要额外的输入文件

（必填。特定于基于集合的测试）组文件，其中包含每组（基因或区域）的标记ID，注释和/或权重。* 第一列包含设置的名称。* 组文件每组有 2 或 3 行。* 标记 ID 和注释是必需的，权重是可选的。* 第二列具有 var（指示该行用于标记 ID）、anno（指示该行用于注释）和权重（指示该行用于基于集的测试中使用的标记的权重）
- 如果权重未包含在组文件中，则默认情况下权重将计算为 beta（MAF，1，25）
- 不带权重的组文件
```
less -S ./input/group_new_chrposa1a2.txt
```
- 带权重的分组文件
```
less -S ./input/group_new_chrposa1a2_withWeights.txt
```

输出文件

包含基于区域或基因的关联测试结果的文件

  head -n 1  ./output/genotype_100markers_bgen_groupTest_out_cond.txt

image

Region: set name
Group: annotation mask
max_MAF: maximum MAF cutoff
Pvalue: p value for SKAT-O test
Pvalue_Burden: p value for BURDEN test
Pvalue_SKAT: p value for SKAT test
BETA_Burden: effect size of BURDEN test
SE_Burden: standard error of BETA_Burden

#if --condition= is used for conditioning analysis, the conditional analysis results will be output
Pvalue_cond, Pvalue_Burden_cond, Pvalue_SKAT_cond, BETA_Burden_cond, SE_Burden_cond

MAC: minor allele count in the set
MAC_case: minor allele count in cases
MAC_control: minor allele count in controls
Number_rare: number of markers that are not ultra-rare with MAC > MACCutoff_to_CollapseUltraRare (=10 by default)
Number_ultra_rare: number of markers that are ultra-rare with MAC <= MACCutoff_to_CollapseUltraRare (=10 by default)

包含基于集合的测试中单个标记的关联测试结果的文件 有关详细信息，请参阅 SAIGE（单变量测试）步骤 2 输出文件。 对于二元性状，通过设置 * –is_Firth_beta=TRUE 和 –pCutoffforFirth=0.01 * 注意：此选项目前仅适用于单变量协会检验和负担检验（当执行唯一的负荷检验（–r.corr=1）时），可以通过 Firth 的偏倚减少逻辑回归更准确地估计单个变异的效应大小。当执行 SKAT-O 检验（–r.corr=0）时，不应用 Firth 的偏倚减少逻辑回归。
** 如果仅执行 Burden 检验（–r.corr=1），请使用 –is_single_in_groupTest=TRUE 输出单变量 assoc 检验 ** 如果执行了 SKAT-O 检验（–r.corr=0），则会自动输出单变异体联合检验结果。
```
  less -S ./output/genotype_100markers_bgen_groupTest_out_cond.txt.singleAssoc.txt
```
包含基于区域/集的测试的标记列表的文件（–is_output_markerList_in_groupTest=TRUE）
```
  less -S ./output/genotype_100markers_bgen_groupTest_out.txt.markerList.txt
```

示例 1

使用稀疏 GRM 拟合空模型（–useSparseGRMtoFitNULL=TRUE）并且没有方差比被估计（–skipVarianceRatioEstimation=TRUE）
a9 和 a10 是分类协变量，将针对不同级别重写（–qCovarColList）
需要在步骤 1 和步骤 2 中使用相同的稀疏 GRM 文件（对于方差比方法）

为每组（基因或区域）使用多个掩码

应用了三个批注masks：仅 lof、missense+lof 和 missense+lof+同义词（–annotation_in_groupTest）
应用三个最大 MAF 截止值：0.0001，0.001，0.01
每个集合的 3x3 检验的 p 值基于柯西组合进行组合

  Rscript step1_fitNULLGLMM.R     \
      --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
      --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
      --useSparseGRMtoFitNULL=TRUE    \
      --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
      --phenoCol=y_binary \
      --covarColList=x1,x2,a9,a10 \
      --qCovarColList=a9,a10  \
      --sampleIDColinphenoFile=IID \
      --traitType=binary        \
      --skipVarianceRatioEstimation=TRUE      \
      --outputPrefix=./output/example_binary_sparseGRM

  Rscript step2_SPAtests.R        \
      --bgenFile=./input/genotype_100markers.bgen    \
      --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
      --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out_sparseGRMforStep1.txt \
      --chrom=1 \
      --AlleleOrder=ref-first \
      --minMAF=0 \
      --minMAC=0.5 \
      --sampleFile=./input/samplelist.txt \
      --GMMATmodelFile=./output/example_binary_sparseGRM.rda \
      --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
      --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
      --groupFile=./input/group_new_chrposa1a2.txt    \
      --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
      --maxMAF_in_groupTest=0.0001,0.001,0.01

示例 2

使用完整的 GRM 拟合空模型，该 GRM 将使用 plink 文件中的基因型动态计算（–plinkFile=， –useSparseGRMtoFitNULL=FALSE)
a9 和 a10 是分类协变量，针对不同级别重编码（–qCovarColList）
使用完整的 GRM（使用 PLINK 文件中的基因型进行构造）和sparseGRM 与 –useSparseGRMforVarRatio=TRUE *对包含稀疏 GRM 的文件使用 –sparseGRMFile 对包含稀疏 GRM 中样本 ID 的文件使用 –sparseGRMSampleIDFile
需要根据不同的次要等位基因计数类别估计多个方差比，其中 –isCateVarianceRatio=TRUE ** 默认情况下，10 <= MAC < 20 和 MAC >= 20 时，估计两个方差比。使用 –cateVarRatioMinMACVecExclude 和 –cateVarRatioMaxMACVecInclude 修改 MAC 类别 请注意，PLINK 文件需要包含至少 1000 个 MAC 属于这些类别的变体 ** 与 SAIGE 中用于 SAIGE 中单变量测试的步骤 1 不同，SAIGE 中仅估计单个方差比
需要在步骤 1 和步骤 2 中使用相同的稀疏 GRM 文件（对于方差比方法）

输出测试的标记列表 – is_output_markerList_in_groupTest=TRUE

  Rscript step1_fitNULLGLMM.R     \
      --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly_22chr  \
      --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
      --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
      --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
      --phenoCol=y_binary \
      --covarColList=x1,x2,a9,a10 \
      --qCovarColList=a9,a10  \
      --sampleIDColinphenoFile=IID \
      --traitType=binary        \
      --outputPrefix=./output/example_binary_fullGRM \
      --nThreads=64    \
      --useSparseGRMtoFitNULL=FALSE    \
      --isCateVarianceRatio=TRUE      \
      --outputPrefix_varRatio=./output/example_binary_cate \
      --useSparseGRMforVarRatio=TRUE  \
      --IsOverwriteVarianceRatioFile=TRUE

  Rscript step2_SPAtests.R        \
      --bgenFile=./input/genotype_100markers.bgen    \
      --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
      --SAIGEOutputFile=./output/genotype_100markers_bgen_groupTest_out.txt \
      --chrom=1 \
      --LOCO=TRUE    \
      --AlleleOrder=ref-first \
      --minMAF=0 \
      --minMAC=0.5 \
      --sampleFile=./input/samplelist.txt \
      --GMMATmodelFile=./output/example_binary_fullGRM.rda \
      --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt      \
      --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
      --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
      --groupFile=./input/group_new_chrposa1a2.txt    \
      --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous        \
      --maxMAF_in_groupTest=0.0001,0.001,0.01   \
  --is_output_markerList_in_groupTest=TRUE

创建稀疏GRM

步骤 0：创建稀疏 GRM

SAIGE 和 SAIGE-GENE 可以采用稀疏 GRM 来拟合零模型和关联分析

这种稀疏的GRM只需要为每个数据集创建一次，例如生物库，并且只要所有测试的样品都在稀疏GRM中，就可以用于所有不同的表型。
可以使用多个程序来生成稀疏 GRM

SAIGE 提供了一个用于创建稀疏 GRM 的脚本 *程序将输出一个以 sampleID 结尾的文件.txt其中包含稀疏 GRM 的示例 ID，以及一个以 .sparseGRM.mtx 结尾的文件，其中包含稀疏 GRM

然后，这两个文件可以直接用于后续步骤

 #For help information
 Rscript createSparseGRM.R --help

 Rscript createSparseGRM.R       \
     --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
     --nThreads=4  \
     --outputPrefix=./output/sparseGRM       \
     --numRandomMarkerforSparseKin=2000      \
     --relatednessCutoff=0.125

GCTA

 gcta64 \
     --bfile ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
     --out ./output/sparseGRM \
     --make-grm-part 3 1 \
     --maf 0.01 \
     --geno 0.15 \
     --thread-num 2

KING

同时运行单变异和集合的关联分析

如果 SAIGE 以前曾用于单变异关联测试。步骤 1 中的空模型拟合结果可重复用于基于区域/集的分析
首先，使用完整和稀疏的 GRM 估计分类方差比
使用 –sparseGRMFile 和 –sparseGRMSampleIDFile 作为稀疏 GRM 和 –useSparseGRMforVarRatio=TRUE
使用 –isCateVarianceRatio=TRUE 来构造分类方差比
使用 –skipModelFitting=FASLE 以避免重新拟合空模型。使用 –outputPrefix=prefix 时，prgoram 将从 prefix.rda 读取模型。如果尝试避免覆盖以前的方差比文件，请使用 –outputPrefix_varRatio为新的方差比结果指定单独的文件前缀，否则 –IsOverwriteVarianceRatioFile=TRUE 可用于覆盖以前的文件。
注意：在使用 –plinkFile 指定的 plink 文件中，至少需要添加约 200 个 10<= MAC < 20 的标记，这将用于估计较低 MAF 类别的方差比。

  Rscript step1_fitNULLGLMM.R     \
      --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly_22chr  \
      --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
      --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt     \
      --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
      --phenoCol=y_binary \
      --covarColList=x1,x2,a9,a10 \
      --qCovarColList=a9,a10  \
      --sampleIDColinphenoFile=IID \
      --traitType=binary        \
      --outputPrefix=./output/example_binary_fullGRM \
      --nThreads=64    \
      --useSparseGRMtoFitNULL=FALSE    \
      --isCateVarianceRatio=TRUE      \
  --skipModelFitting=TRUE   \
      --outputPrefix_varRatio=./output/example_binary_cate \
      --useSparseGRMforVarRatio=TRUE

如果之前未运行任何 SAIGE 作业，请按照步骤 1 和 2 进行基于集的测试。使用 –is_single_in_groupTest=TRUE 同时输出单变量联合测试

SAIGE用户手册笔记2

基于集合的分析

步骤 1：拟合零逻辑/线性混合模型

输入文件（与 SAIGE 步骤 1 中的输入相同）

输出文件（（与 SAIGE 步骤 1 中用于单变量关联测试的输出相同））

第2步：执行基于区域或基因的关联测试

输入文件

输出文件

示例 1

示例 2

创建稀疏GRM

步骤 0：创建稀疏 GRM

SAIGE 和 SAIGE-GENE 可以采用稀疏 GRM 来拟合零模型和关联分析

同时运行单变异和集合的关联分析

猜你喜欢

热点阅读