R包学习-pROC
- ROC曲线
我们在做诊断性实验的时候,最常用的选择最佳cutoff的方法是使用ROC曲线。本次主要介绍的是pROC包对于 ROC曲线的的绘制和分析
https://www.jianshu.com/p/04e3bb3bc990
> library(pROC)
> data("aSAH")
> head(aSAH)
gos6 outcome gender age wfns s100b ndka
29 5 Good Female 42 1 0.13 3.01
30 5 Good Female 37 1 0.14 8.54
31 5 Good Female 42 1 0.10 8.09
32 5 Good Female 27 1 0.04 10.42
33 1 Poor Female 42 3 0.13 17.40
34 1 Poor Male 48 2 0.10 12.75
-
roc
roc是这个包的主要函数。通过roc函数我们可以构建一个ROC曲线,并返回一个ROC的列表。这个列表可以被“打印”, 可以被“plot”,同时可以使用auc, ci, smooth.roc and coords等辅助函数调出结果。
> ##以下两种方式是等价的
> roc(aSAH$outcome ~ aSAH$s100b)
> roc(outcome ~ s100b, aSAH)
>roc(aSAH$outcome, aSAH$s100b,direction = ">") # Area under the curve: 0.2686
Setting levels: control = Good, case = Poor
Setting direction: controls < cases
Call:
roc.formula(formula = outcome ~ s100b, data = aSAH)
Data: s100b in 72 controls (outcome Good) < 41 cases (outcome Poor).
Area under the curve: 0.7314
在亚组中进行roc分析
roc(outcome ~ s100b, data=aSAH, subset=(gender == "Male"))
roc(outcome ~ s100b, data=aSAH, subset=(gender == "Female"))
改变case的因子水平
> roc(aSAH$outcome, aSAH$s100b,
+ levels=c("Poor", "Good"))
Setting direction: controls > cases
Call:
roc.default(response = aSAH$outcome, predictor = aSAH$s100b, levels = c("Poor", "Good"))
Data: aSAH$s100b in 41 controls (aSAH$outcome Poor) > 72 cases (aSAH$outcome Good).
Area under the curve: 0.7314
##AUC变成百分比
roc(aSAH$outcome, aSAH$s100b, percent = T)
> ##计算95%CI
> roc(aSAH$outcome, aSAH$s100b, ci = T)
Setting levels: control = Good, case = Poor
Setting direction: controls < cases
Call:
roc.default(response = aSAH$outcome, predictor = aSAH$s100b, ci = T)
Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
Area under the curve: 0.7314
95% CI: 0.6301-0.8326 (DeLong)
roc.test
roc.test用来比较两个或者多个roc曲线是否有差别。
###通过roc函数来比较两个roc的区别
roc1 <- roc(aSAH$outcome, aSAH$s100b)
roc2 <- roc(aSAH$outcome, aSAH$wfns)
roc.test(roc1, roc2)
##
## DeLong's test for two correlated ROC curves
##
## data: roc1 and roc2
## Z = -2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.8236789
###如果是一个数据框可以在一个公式中比较
roc.test(aSAH$outcome, aSAH$s100b, aSAH$wfns)
##
## DeLong's test for two correlated ROC curves
##
## data: aSAH$s100b and aSAH$wfns by aSAH$outcome (Good, Poor)
## Z = -2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.8236789
###改变比较的方法默认的的是delong法。可选bootstrap
roc.test(roc1, roc2, method = "bootstrap")
###比较两个ROC的灵敏度和特异度
roc.test(roc1, roc2, method="specificity", specificity=0.9)
##
## Specificity test for two correlated ROC curves
##
## data: roc1 and roc2
## D = -1.3119, boot.n = 2000, boot.stratified = 1, p-value = 0.1895
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.8236789
roc.test(roc1, roc2, method="sensitivity", sensitivity=0.9)
##
## Sensitivity test for two correlated ROC curves
##
## data: roc1 and roc2
## D = -3.1677, boot.n = 2000, boot.stratified = 1, p-value =
## 0.001537
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.8236789
###如果两个roc的样本量不同则需要进行配置
roc7 <- roc(aSAH$outcome, aSAH$s100b)
roc8 <- roc(aSAH$outcome[1:100], aSAH$s100b[1:100])
roc.test(roc7, roc8, paired=FALSE, method="delong")
##
## DeLong's test for two ROC curves
##
## data: roc7 and roc8
## D = 0.16859, df = 205.38, p-value = 0.8663
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.7183601
auc 通过AUC来计算ROC曲线下面积
rocobj <- roc(aSAH$outcome, aSAH$s100b)
auc(rocobj)
###同样的可以使用roc函数计算
roc(aSAH$outcome, aSAH$s100b)$auc
## Area under the curve: 0.7314
ci.auc 计算一个roc曲线的曲线下面积以及95%CI
rocobj <- roc(aSAH$outcome, aSAH$s100b)
result <- ci.auc(rocobj)
###结果其实包括三个变量分别是auc, lower, upper
result[1]; result[2]; result[3]
## [1] 0.6301182
## [1] 0.7313686
## [1] 0.8326189
coords 返回ROC曲线计算过程中的变量值
- 其中包括阈值、灵敏度、特异度、准确率、真阴性数、真阳性数、假阴性数、假阳性数、阳性预测值、阴性预测值…… coords(roc, x, input, ret,as.list, drop, best.method) 参数当中x可以是数字代表input的值,all代表ROC所有点的值。 best代表灵敏度和特异度和最大的点。local maximasROC曲线局部最大的点 input如果X是数字的需要制定input。参数时“threshold”, “specificity” or “sensitivity”这三个的一个。同时可以简写为“t”,“sp”和“se” ret返回值。可以是“threshold”, “specificity”, “sensitivity”, “accuracy”, “tn” (true negative count), “tp” (true positive count), “fn” (false negative count), “fp” (false positive count), “npv” (negative predictive value), “ppv” (positive predictive value), “precision”, “recall”. “1-specificity”, “1-sensitivity”, “1-accuracy”, “1-npv” and “1-ppv” drop以最简单的形式返回结果 as.list以列表的形式输出结果 best.method逻辑值。通过这个参数可以决定最佳的阈值
##制定X值
coords(roc=rocobj, x=0.5, input="sensitivity", ret="sensitivity")
## [1] 0.5
##得到所有灵敏度的值
sensitivities <- coords(rocobj, "all", ret="se")
##得到最佳的阈值的cutoff
coords(rocobj, "b", ret="t")
## [1] 0.205
##得到作用结果
coords(rocobj, "best", ret=c("threshold", "specificity", "sensitivity", "accuracy",
"tn", "tp", "fn", "fp", "npv", "ppv", "1-specificity",
"1-sensitivity", "1-accuracy", "1-npv", "1-ppv",
"precision", "recall"), drop = T)
## threshold specificity sensitivity accuracy tn
## 0.2050000 0.8055556 0.6341463 0.7433628 58.0000000
## tp fn fp npv ppv
## 26.0000000 15.0000000 14.0000000 0.7945205 0.6500000
## 1-specificity 1-sensitivity 1-accuracy 1-npv 1-ppv
## 0.1944444 0.3658537 0.2566372 0.2054795 0.3500000
## precision recall
## 0.6500000 0.6341463
plot.roc 用来绘制ROC曲线的最主要的函数。
- add如果是绘制两个或者多个图的话,使用这个可以绘制到同一个图上。 smooth逻辑值,曲线是否平滑 type选择线条的类型:“l”只是线条、“p”只是点、“b”既有线条又有点、“n”啥都没有 legacy.axes X轴的参数。如果为F则为sp,如果为T则为1-sp。一般而言是1-sp xlim,ylim,xlab,ylab,asp,mar,mgp基本画图参数。 col, lty, lwd线条的颜色、类型和宽度 print.thres 标注ROC曲线当中最佳的位置。 print.thres.pch, print.thres.adj, print.thres.col, print.thres.cex 修改thres位置点的类型、字符位置的横向调整、颜色、字体的大小。 print.thres.pattern自定义阈值的内容 print.auc打印曲线下面积。 print.auc.pattern自定义AUC的内容 print.auc.x/print.auc.y定义auc的位置 print.auc.adj/cex/col定义auc的横向位置、字体大小以及颜色
##基本图形
plot.roc(roc1)
image.png
<meta charset="utf-8">
image可以进行ROC分析和ROC 曲线展示的R包。
#1. 安装
> install.packages("pROC")
#2. 数据导入
> library(pROC)
> data(aSAH)
> head(aSAH)
gos6 outcome gender age wfns s100b ndka
29 5 Good Female 42 1 0.13 3.01
30 5 Good Female 37 1 0.14 8.54
31 5 Good Female 42 1 0.10 8.09
32 5 Good Female 27 1 0.04 10.42
33 1 Poor Female 42 3 0.13 17.40
34 1 Poor Male 48 2 0.10 12.75
#3. ROC分析
##3.1 使用roc()进行ROC分析
> roc(aSAH$outcome, aSAH$s100b,plot = T)
> roc(outcome ~ s100b, aSAH,plot=T,levels=c("Good", "Poor"))
> roc(controls=aSAH$s100b[aSAH$outcome=="Good"], cases=aSAH$s100b[aSAH$outcome=="Poor"])
Call:
roc.formula(formula = outcome ~ s100b, data = aSAH, plot = T)
Data: s100b in 72 controls (outcome Good) < 41 cases (outcome Poor).
Area under the curve: 0.7314
image
##3.2 roc()参数详细解释
roc(...)
# S3 method for formula
roc(formula, data, ...)
# S3 method for default
roc(response, predictor, controls, cases,
density.controls, density.cases,
levels=base::levels(as.factor(response)), percent=FALSE, na.rm=TRUE,
direction=c("auto", "<", "="">"), algorithm = 5, quiet = TRUE,
smooth=FALSE, auc=TRUE, ci=FALSE, plot=FALSE, smooth.method="binormal",
ci.method=NULL, density=NULL, ...)
roc1 <- roc(aSAH$outcome,
aSAH$s100b, percent=TRUE,
# 设置auc参数
partial.auc=c(100, 90), partial.auc.correct=TRUE,
partial.auc.focus="sens",
# 设置ci参数
ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,
# 设置画图参数
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
show.thres=TRUE
)
#在原有的图上加ROC曲线
roc2 <- roc(aSAH$outcome, aSAH$wfns,
plot=TRUE, add=TRUE, percent=roc1$percent)
-
response
:样本类型,一般只有两类( 0 (controls) 和 1 (cases)),可做ROC曲线。当类型过多时,需要使用levels参数指定那些值作为control ,那些作为 case。 -
predictor
:样本的预测值,可以是概率、排名之类。 -
controls, cases
: 直接提供controls和cases,可以是数值向量,也可以是排好序的向量。 -
formula, data
: 通过表达式传入数据框中的值 。 -
levels
: controls 和 cases 对应的值,默认为levels(as.factor(response)),剩下的忽略;当输入数据为0、1,默认0为controls。 -
percent
:sensitivities, specificities 和 AUC返回形式为百分数。 -
direction
:根据两组数据中位数大小确定;“>”: control组中位数值大于cases组;“<”:control组中位数值小于或等于cases组。 -
algorithm
:1,也是默认,数量较少;2,数量大于1000时,速度更快;3,C++实现算法,快3-5x; 4 (debug only, slow): 三个算法都运行一遍,确认返回值知否一样。5,迅速选择2或3。 -
smooth
:ROC 曲线修饰为平滑曲线。 -
auc
:计算AUC,默认TRUE。 -
ci
:是否计算置信区间。 -
plot
:是否作图。 -
smooth.method, ci.method
:smooth 算法。
##3.3 返回ROC计算对象
coords(roc1, "best", ret=c("threshold", "specificity", "1-npv"))
coords(roc2, "local maximas", ret=c("threshold", "sens", "spec", "ppv", "npv"))
##3.4 置信区间
# Of the AUC
ci(roc2)
# Of the curve
sens.ci <- ci.se(roc1, specificities=seq(0, 100, 5))
plot(sens.ci, type="shape", col="lightblue")
plot(sens.ci, type="bars")
# need to re-add roc2 over the shape
plot(roc2, add=TRUE)
# CI of thresholds
plot(ci.thresholds(roc2))
##3.5 roc.test()对ROC进行统计检验
# Test on the whole AUC
> roc.test(roc1, roc2, reuse.auc=FALSE)
DeLong's test for two correlated ROC curves
data: roc1 and roc2
Z = -2.209, p-value = 0.02718
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
73.13686 82.36789
# Test on a portion of the whole AUC
> roc.test(roc1, roc2, reuse.auc=FALSE, partial.auc=c(100, 90),
partial.auc.focus="se", partial.auc.correct=TRUE)
# With modified bootstrap parameters
> roc.test(roc1, roc2, reuse.auc=FALSE, partial.auc=c(100, 90),
partial.auc.correct=TRUE, boot.n=1000, boot.stratified=FALSE)
##3.6 roc.test()参数
alternative
:“two.sided”, “less” ,“greater”。对于method="venkatraman",只能使用 “two.sided”
paired
: 是否时配对,会自动检测。
boot.n
:指定method="bootstrap"中自检举重复次数和 method="venkatraman"中置换次数;默认,2000。
boot.stratified
:每次自检举过程中,cases/controls 比例与原始样本中比例一致。
method:
- “delong”,默认,不适用于partial AUC, smoothed curves,curves with different direction。
- “bootstrap” :
- “venkatraman”:不适用于smoothed ROC curves, or curves with partial AUC specifications
##3.7 roc.test()中统计学方法
- method="bootstrap"
-
自检举抽样,通过
boot.n
指定自检举次数。 -
计算AUC。
-
计算标准偏差。
[图片上传失败...(image-eabcc6-1596545977177)] -
D与正态分布进行比较
-
method="delong"
method="delong"
在文章DeLong *et al.* (1988)
中提到用于配对ROC 曲线的检验,实际利用是在文章Sun and Xu (2014)
;现在已经延伸用于非配对的 ROC 曲线研究中,
[图片上传失败...(image-8901d1-1596545977177)] -
method="venkatrama"
method="venkatraman"
在文章Venkatraman and Begg (1996) (for paired ROC curves)
和Venkatraman (2000) (for unpaired ROC curves)
对样本排名进行置换检验。通过boot.n
指定置换次数。
##3.7 样本量
# Two ROC curves
power.roc.test(roc1, roc2, reuse.auc=FALSE)
power.roc.test(roc1, roc2, power=0.9, reuse.auc=FALSE)
# One ROC curve
power.roc.test(auc=0.8, ncases=41, ncontrols=72)
power.roc.test(auc=0.8, power=0.9)
power.roc.test(auc=0.8, ncases=41, ncontrols=72, sig.level=0.01)
power.roc.test(ncases=41, ncontrols=72, power=0.9)
#4 pROC使用更多实例
EXPASY基于R代码上给出了pROC的6个示例,见pROC: Screenshots,下面看一个例子:
library(pROC)
data(aSAH)
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b, percent = TRUE, main="Smoothing")
lines(smooth(rocobj), # smoothing (default: binormal)
col = "#1c61b6")
lines(smooth(rocobj, method = "density"), # density smoothing
col = "#008600")
lines(smooth(rocobj, method = "fitdistr", # fit a distribution
density = "lognormal"), # let the distribution be log-normal
col = "#840000")
legend("bottomright", legend = c("Empirical", "Binormal", "Density", "Fitdistr\n(Log-normal)"), col = c("black", "#1c61b6", "#008600", "#840000"),lwd = 2)
image