网络数据统计分析笔记|| 网络图的统计模型

2020-10-01 本文已影响0人周运来就是我

前情回顾：
Gephi网络图极简教程
 Network在单细胞转录组数据分析中的应用
 Gephi网络图极简教程
 Network在单细胞转录组数据分析中的应用
 网络数据统计分析笔记|| 为什么研究网络
 网络数据统计分析笔记|| 操作网络数据
 网络数据统计分析笔记|| 网络数据可视化
 网络数据统计分析笔记|| 网络数据的描述性分析
 网络数据统计分析笔记||网络图的数学模型

学过统计的我们知道，要执行推断只靠几个分布模型是不够的，还要有模型来做推断，这就是统计建模。那么，上一章的经典随机图模型，伯努利随机图模型，广义随机图模型等是我们网络图世界的正态分布，卡方分布等等等。本章介绍的内容，作类比的话，就像之前我们学过的线性模型，非线性模型，广义线性模型之流。

为什么要学习模型，一个模型就像一个思考框架，在我们需要描述问题的时候，有个思考的方向。模型，质言之，有模具的形状。之所有模型，是因为这世界是可归纳的。

当前主要的三类模型与用于非网络数据的统计模型很相似。指数随机图模型类似标准的回归模型，尤其是广义线性模型。类似地，随机块模型受到混合模型的启发，其基本形式本质上是一些经典随机图模型的混合。最后，潜变量网络模型是一种基于网络的变体模型，同时使用观测变量和未观测变量（即，潜变量）对结果（此处为网络中边的缺失与否）进行建模。

指数随机图模型
- 一般形式
- 模型界定
- 模型拟合
- 拟合优度
网络块模型
- 模型界定
- 模型拟合
潜变量模型
- 一般形式
- 模型界定
- 模型拟合
- 拟合优度

指数随机图模型

指数随机图模型（exponential random graph models，ERGM）直接类比经典的广义线性模型（GLM）而建立。这当然是为了在模型建立、拟合与比较方面利用和扩展业已成熟的统计原理与方法。然而，指数随机模型的界定和拟合显然比标准的广义线性模型更为微妙。在我们使用模型的时候，一定要知道自己使用的是什么。应用模型或者算法，就像吃饭，要知道自己吃的是什么。应用模型不可以离开具体的应用场景，不然代码会显得非常简单乏味枯燥。

一般形式

考虑一个随机图

模型界定

模型界定是在给定模型框架下，进一步具体化相关函数，建立更加具体的模型。

my.ergm.bern <- formula(lazega.s ~ edges)
my.ergm.bern

lazega.s ~ edges

summary(my.ergm.bern)
edges 
  115

my.ergm <- formula(lazega.s ~ edges + kstar(2)
   + kstar(3) + triangle)
summary(my.ergm)

 edges   kstar2   kstar3 triangle 
     115      926     2681      120

my.ergm <- formula(lazega.s ~ edges
   + gwesp(1, fixed=TRUE))
summary(my.ergm)

       edges gwesp.fixed.1 
     115.0000      213.1753

lazega.ergm <- formula(lazega.s ~ edges
   + gwesp(log(3), fixed=TRUE)
   + nodemain("Seniority")
   + nodemain("Practice")
   + match("Practice")
   + match("Gender")
   + match("Office"))

这个公式像不像我们线性模型里面的R软件提供的拟合计算广义线性模型的函数glm()

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)

glm.fit(x, y, weights = rep.int(1, nobs),
        start = NULL, etastart = NULL, mustart = NULL,
        offset = rep.int(0, nobs), family = gaussian(),
        control = list(), intercept = TRUE, singular.ok = TRUE)

模型拟合

set.seed(42)
lazega.ergm.fit <- ergm(lazega.ergm)

Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 20:
Optimizing with step length 1.
The log-likelihood improved by 0.5253.
Step length converged once. Increasing MCMC sample size.
Iteration 2 of at most 20:
Optimizing with step length 1.
The log-likelihood improved by 0.07175.
Step length converged twice. Stopping.
Finished MCMLE.
Evaluating log-likelihood at the estimate. Using 20 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .
This model was fit using MCMC.  To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
Warning message:
`set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.

方差分析

anova(lazega.ergm.fit)
Analysis of Variance Table

Model 1: lazega.s ~ edges + gwesp(log(3), fixed = TRUE) + nodemain("Seniority") + 
    nodemain("Practice") + match("Practice") + match("Gender") + 
    match("Office")
         Df Deviance Resid. Df Resid. Dev Pr(>|Chisq|)    
NULL                       630     873.37                 
Model 1:  7   413.74       623     459.63    < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

在图模型的框架里面也要考虑哪些变量纳入到模型中，这时候就需要看看：
变量贡献度

summary(lazega.ergm.fit)

==========================
Summary of model fit
==========================

Formula:   lazega.s ~ edges + gwesp(log(3), fixed = TRUE) + nodemain("Seniority") + 
    nodemain("Practice") + match("Practice") + match("Gender") + 
    match("Office")

Iterations:  2 out of 20 

Monte Carlo MLE Results:
                             Estimate Std. Error MCMC % z value Pr(>|z|)    
edges                        -7.00655    0.67114      0 -10.440  < 1e-04 ***
gwesp.fixed.1.09861228866811  0.59166    0.08554      0   6.917  < 1e-04 ***
nodecov.Seniority             0.02456    0.00620      0   3.962  < 1e-04 ***
nodecov.Practice              0.39455    0.10218      0   3.861 0.000113 ***
nodematch.Practice            0.76966    0.19060      0   4.038  < 1e-04 ***
nodematch.Gender              0.73767    0.24362      0   3.028 0.002463 ** 
nodematch.Office              1.16439    0.18753      0   6.209  < 1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 873.4  on 630  degrees of freedom
 Residual Deviance: 459.6  on 623  degrees of freedom
 
AIC: 473.6    BIC: 504.7    (Smaller is better.)

拟合优度

> gof.lazega.ergm <- gof(lazega.ergm.fit)
> # CHUNK 12
> plot(gof.lazega.ergm)
> gof.lazega.ergm

Goodness-of-fit for degree 

   obs min mean max MC p-value
0    2   0 2.61   8       1.00
1    3   0 2.81   7       1.00
2    2   0 2.67   8       1.00
3    4   0 2.94   9       0.72
4    2   0 2.95   8       0.92
5    4   0 2.73   7       0.62
6    4   0 2.93   8       0.70
7    1   0 3.11   7       0.36
8    1   0 2.65   7       0.46
9    5   0 2.35   7       0.14
10   1   0 1.97  10       0.90
11   1   0 1.80   8       0.94
12   2   0 1.39   7       0.68
13   3   0 1.11   5       0.26
14   0   0 0.69   4       0.92
15   1   0 0.60   4       0.84
16   0   0 0.30   2       1.00
17   0   0 0.24   2       1.00
18   0   0 0.06   1       1.00
19   0   0 0.06   1       1.00
20   0   0 0.02   1       1.00
23   0   0 0.01   1       1.00

Goodness-of-fit for edgewise shared partner 

      obs min  mean max MC p-value
esp0    5   2  8.52  21       0.46
esp1   16   6 15.29  27       0.98
esp2   29   6 20.86  38       0.20
esp3   17   3 21.81  44       0.50
esp4   23   2 18.66  35       0.52
esp5   11   0 12.72  32       0.94
esp6   10   0  7.46  26       0.54
esp7    4   0  4.04  18       0.96
esp8    0   0  1.95   8       0.62
esp9    0   0  0.96  12       1.00
esp10   0   0  0.19   3       1.00
esp11   0   0  0.10   2       1.00

Goodness-of-fit for minimum geodesic distance 

    obs min   mean max MC p-value
1   115  60 112.56 163       1.00
2   275 100 259.30 354       0.96
3   148  45 128.59 198       0.56
4    21   0  26.73 120       1.00
5     2   0   5.06  63       0.76
6     0   0   1.00  27       1.00
7     0   0   0.29  21       1.00
8     0   0   0.06   6       1.00
Inf  69   0  96.41 304       0.96

Goodness-of-fit for model statistics 

                                   obs        min      mean       max MC p-value
edges                         115.0000   60.00000  112.5600  163.0000       1.00
gwesp.fixed.1.09861228866811  222.9108   80.38272  215.2664  383.0266       0.96
nodecov.Seniority            4687.0000 2494.00000 4552.9400 6505.0000       0.82
nodecov.Practice              359.0000  181.00000  351.0100  506.0000       0.98
nodematch.Practice             72.0000   40.00000   70.7500  107.0000       1.00
nodematch.Gender               99.0000   48.00000   97.4500  140.0000       0.94
nodematch.Office               85.0000   45.00000   83.5000  133.0000       0.98

网络块模型

块模型是一类带有种植簇的随机图。“母模型”是随机块模型(random block model, SBM)，它被广泛用作社区检测(community detection)的典型模型。它可以说是带有社区的图的最简单模型。由于SBM是一种生成模型，它从社区的基础事实中获益，这允许在正式的环境中考虑背景问题。

SBM的历史很长。Asmentioned早些时候,模型出现在多个独立科学同一片蓝天下：座的术语。近年来似乎已经占主导地位的,来自于机器学习和统计文献，虽然这些模型通常被称为种植分区模型（planted partition model ）在理论电脑科学[BCLS87、DF89 Bop87],和非均匀随机图模型在数学文献[BJR07]

在某种意义上，SBM的作用与信息论中的离散无记忆通道(DMC)相似。虽然建模外部噪音的任务可能更容易简化真实数据集，SBM捕获了社区检测的一些关键瓶颈现象，并承认许多可能的改进，以提高对真实数据的适合度。在这里，我们的重点将放在对核心SBM的基本理解上，而不是深入到改进后的扩展上。

#install.packages('blockmodels')
library(blockmodels)
set.seed(42)
fblog
IGRAPH 3e87bca UN-- 192 1431 -- 
+ attr: name (v/c), PolParty (v/c)
+ edges from 3e87bca (vertex names):
 [1]  jeunesverts.org/bordeaux-- bix.enix.org/                      jeunesverts.org/bordeaux-- dominiquevoynet.net/blog         
 [3]  bix.enix.org/           -- www.arnaudcaron.net/               bix.enix.org/           -- dominiquevoynet.net/blog         
 [5]  bix.enix.org/           -- blogs.lesverts.fr/                 bix.enix.org/           -- emilien.net/                     
 [7]  bix.enix.org/           -- lipietz.net/blog.php3?id_breve=63  bix.enix.org/           -- democratiesansfrontiere.org/     
 [9]  bix.enix.org/           -- www.dsk2007.net                    bix.enix.org/           -- www.parti-socialiste.fr          
[11]  bix.enix.org/           -- puteaux.typepad.com                bix.enix.org/           -- www.monputeaux.com/              
[13]  bix.enix.org/           -- www.marielauremeyer.com/           bix.enix.org/           -- remibazillier.blogspot.com/      
[15]  bix.enix.org/           -- www.temps-reels.net/blogs/paris/   bix.enix.org/           -- www.monneuilly.com/              
+ ... omitted several edges

A.fblog <- as.matrix(as_adjacency_matrix(fblog))
# 对块模型进行伯努利概率分布估计描述
fblog.sbm <- BM_bernoulli("SBM_sym", A.fblog, 
                           verbosity=0, plotting='')

blockmodels object
    model: bernoulli 
    membership: SBM_sym 
    network: 192 x 192 scalar network 
    maximum of ICL: 10 groups 
    Most usefull fields and methods:
        The following fields are indexed by the number of groups:
            $ICL : vector of ICL
            $PL : vector of pseudo log liklihood
            $memberships : list of memberships founds by estimation
                           each membership is represented object
            $model_parameters : models parameters founds by estimation
        Estimation methods:
            $estimate(reinitalization_effort=1) : to run again estimation with a
                                                  higher reinitalization effort
        Plotting methods:
            $plot_obs_pred(Q) : to plot the obeserved and predicted network for Q groups
            $plot_parameters(Q) : to plot the model_parameters for Q groups
            Please note that each membership object have a plotting pethod

fblog.sbm$estimate()

这里用积分似然准则（Integrated ClassificationLikelihood，ICL）选择网络拟合的类别数量。

块聚类(或共聚类)旨在同时对数据表的行和列进行分区，以揭示同构块结构。这种结构可以源于潜在块模型，该模型提供了数据表的概率建模，其块模式由行和列类定义。对于连续数据，每个表条目通常假定遵循高斯分布。对于给定的数据表，通常会检查几个候选模型:它们可能在集群数量或offree参数数量上有所不同。然后，模型选择就变成了一个关键的问题，为基于模型的单向聚类而派生的工具需要进行调整。在单向聚类中，大多数选择标准是基于渐近的考虑，由于行和列的双重性质，这些考虑很难在块聚类中呈现。我们通过基于综合分类可能性开发非渐进标准（non-asymptotic criterion）来规避这一问题。在参数上定义适当的先前分布后，可以以封闭形式计算此标准。实验结果表明，具有分离良好和中分聚类的大中型数据表性能稳定。

# CHUNK 14
ICLs <- fblog.sbm$ICL
Q <- which.max(ICLs)
Q
# ---
## [1] 10
# ---

(Z <- fblog.sbm$memberships[[Q]]$Z)
(cl.labs <- apply(Z,1,which.max)) # 可以看做对节点的划分（聚类）
 [1]  1  1  1  1  1  1  1  3  3  6  6  6  9  6  6  6  6  6  2  6  6  6  3  6  3  6  6  1  6  3  3  3  6  6  6  6  3  3  6  3  3  6  6  6  6  9  6  2  8  2  8  2  2  2  2  2  2  8  8  9  2
 [62]  2  2  8  2  2  2  2  2  2  2  2  2  2  8  2  2  2  2 10  1  4 10 10  4  7  4  4 10 10  1  4  4  4  4  7 10 10 10 10 10 10  9 10  4  4  4  7  7  4  7 10 10 10 10  4  4  4  4  1  1  9
[123]  4 10  7 10 10  4  1  1 10 10  4 10  4  4  1  1  1  1  1  1 10  9  1  1  1  1  1  1  1  1  1  5  5  1  5  5  5  5  5  5  5  1  1  9  2  9  8  9  9  9  1  9  9  1  1  5  5  5  5  5  5
[184]  5  5  5  5  5  5  5  5  5
nv <- vcount(fblog)
summary(Z[cbind(1:nv,cl.labs)])
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8586  0.9953  0.9953  0.9938  0.9953  0.9953

# CHUNK 18
cl.cnts <- as.vector(table(cl.labs))
alpha <- cl.cnts/nv
alpha
# ---
##  [1] 0.18229167 0.14062500 0.05729167 0.10937500 
##  [5] 0.12500000 0.13020833 0.03125000 0.03645833 
##  [9] 0.06770833 0.11979167
# ---

# CHUNK 19
Pi.mat <- fblog.sbm$model_parameters[[Q]]$pi
Pi.mat[3,]
# ---
##  [1] 0.0030340287 0.0073657690 0.9102251927 0.0009221811 
##  [5] 0.0009170384 0.0364593875 0.0177621832 0.0024976022 
##  [9] 0.0431732528 0.0012495852
# ---

# CHUNK 20
ntrials <- 1000
Pi.mat <- (t(Pi.mat)+Pi.mat)/2
deg.summ <- list(ntrials)
for(i in (1:ntrials)){
   blk.sz <- rmultinom(1,nv,alpha)
   g.sbm <- sample_sbm(nv,pref.matrix=Pi.mat,
                       block.sizes=blk.sz,
                       directed=FALSE)
   deg.summ[[i]] <- summary(degree(g.sbm))
 }
Reduce('+',deg.summ)/ntrials
# ---
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.931   9.165  13.127  15.183  18.896  49.484 
# ---
summary(degree(fblog))
# ---
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    8.00   13.00   14.91   18.00   56.00
# ---

拟合优度

评估随机块模型的拟合优度

plot(fblog.sbm$ICL,xlab="Q",ylab="ICL",type="b")
lines(c(Q,Q),c(min(ICLs),max(ICLs)),col="red",lty=2)

edges <- as_edgelist(fblog,names=FALSE)
neworder<-order(cl.labs)
m<-t(matrix(order(neworder)[as.numeric(edges)],2))
plot(1, 1, xlim = c(0, nv + 1), ylim = c(nv + 1, 0), 
      type = "n", axes= FALSE, xlab="Classes",
      ylab="Classes",
      main="Reorganized Adjacency matrix")
rect(m[,2]-0.5,m[,1]-0.5,m[,2]+0.5,m[,1]+0.5,col=1)
rect(m[,1]-0.5,m[,2]-0.5,m[,1]+0.5,m[,2]+0.5,col=1)
cl.lim <- cl.cnts 
cl.lim <- cumsum(cl.lim)[1:(length(cl.lim)-1)]+0.5
clip(0,nv+1,nv+1,0)
abline(v=c(0.5,cl.lim,nv+0.5),
        h=c(0.5,cl.lim,nv+0.5),col="red")

对这10个分群（划分）的节点属性可视化。

g.cl <- graph_from_adjacency_matrix(Pi.mat,
                                     mode="undirected",
                                     weighted=TRUE)
# Set necessary parameters
vsize <- 100*sqrt(alpha)
ewidth <- 10*E(g.cl)$weight
PolP <- V(fblog)$PolParty
class.by.PolP <- as.matrix(table(cl.labs,PolP))
pie.vals <- lapply(1:Q, function(i) 
                    as.vector(class.by.PolP[i,]))
my.cols <- topo.colors(length(unique(PolP)))
# Plot 
plot(g.cl, edge.width=ewidth, 
      vertex.shape="pie", vertex.pie=pie.vals, 
      vertex.pie.color=list(my.cols),
      vertex.size=vsize, vertex.label.dist=0.1*vsize,
      vertex.label.degree=pi)
# Add a legend
my.names <- names(table(PolP))
my.names[2] <- "Comm. Anal."
my.names[5] <- "PR de G"
legend(x="topleft", my.names,
        fill=my.cols, bty="n")

其中饼图是博客对应政党所占的比例，边的粗细正比于两组博客间相互连接的估计概率。

潜变量模型

随机块模型及其关键创新点在于以节点类别的形式引入了潜变量（latent variable）。换言之，模型使用了未被观测到的变量，而这些变量在决定节点对之间的链接起作用。已知了我们的数据有n个t，那么这个时候我们需要一组神奇的变量x,，他看不见摸不到，但是却实际的决定了每个t的状态，所以，我们把这个变量称为潜变量，潜在的变量，看不见的变量，潜水的变量（^_）。这样，我们就可以得到一个潜变量和原始变量的联合分布。

模型界定
如何描述潜在变量是一个概率函数，如何定义这个函数反映了应用者希望看到哪些潜在变量，目前主要有三个函数：

潜类别模型（latent class）
潜在距离模型（latent distance）
特征模型( eigenmodel)

模型拟合

构建得到的潜变量模型为分层形式，所以和自然地采用贝叶斯推断方法。

summary(lazega)
IGRAPH 3e8b2bf UN-- 36 115 -- 
+ attr: name (v/c), Seniority (v/n), Status (v/n), Gender (v/n), Office (v/n), Years (v/n), Age (v/n), Practice (v/n),
| School (v/n)

 print_all(lazega)
IGRAPH 3e8b2bf UN-- 36 115 -- 
+ attr: name (v/c), Seniority (v/n), Status (v/n), Gender (v/n), Office (v/n), Years (v/n), Age (v/n), Practice (v/n),
| School (v/n)
+ edges (vertex names):
V1 -- V17
V2 -- V7, V16, V17, V22, V26, V29
V3 -- V18, V25, V28
V4 -- V12, V17, V19, V20, V22, V26, V28, V29, V31
V5 -- V18, V24, V28, V31, V32, V33
V6 -- V24, V28, V30, V31, V32
V7 -- V2, V18
V8 --
V9 -- V12, V16, V29
V10 -- V24, V26, V29, V31, V34
V11 -- V17
V12 -- V4, V9, V15, V16, V17, V19, V26, V29, V34
V13 -- V31, V33
V14 -- V16, V17, V25, V28, V30, V32
V15 -- V12, V16, V19, V20, V22, V24, V26, V29, V32, V35, V36
V16 -- V2, V9, V12, V14, V15, V17, V22, V26, V27, V29, V32, V34, V36
V17 -- V1, V2, V4, V11, V12, V14, V16, V19, V22, V24, V25, V26, V28, V29, V34
V18 -- V3, V5, V7, V28, V31, V32, V33, V35
V19 -- V4, V12, V15, V17, V22, V24, V26, V28, V34, V35
V20 -- V4, V15, V22, V26
V21 -- V27
V22 -- V2, V4, V15, V16, V17, V19, V20, V31, V32
V23 --
V24 -- V5, V6, V10, V15, V17, V19, V26, V31, V36
V25 -- V3, V14, V17, V28, V35
V26 -- V2, V4, V10, V12, V15, V16, V17, V19, V20, V24, V27, V32
V27 -- V16, V21, V26
V28 -- V3, V4, V5, V6, V14, V17, V18, V19, V25, V30, V31, V32, V35
V29 -- V2, V4, V9, V10, V12, V15, V16, V17, V34
V30 -- V6, V14, V28, V31
V31 -- V4, V5, V6, V10, V13, V18, V22, V24, V28, V30, V32, V33, V35
V32 -- V5, V6, V14, V15, V16, V18, V22, V26, V28, V31, V33, V35
V33 -- V5, V13, V18, V31, V32
V34 -- V10, V12, V16, V17, V19, V29
V35 -- V15, V18, V19, V25, V28, V31, V32
V36 -- V15, V16, V24

由于潜变量网络模型的特征模型形式能够同时反映距离和同质性因素，对于我们的数据，我们可以分别拟合并比较三种不同的特征模型：

没有成对协变量
共同执业协变量
共同办公地点协变量

不包含成对的协变量，潜在特征空间维度为2

library(eigenmodel)
set.seed(42)
A <- as_adjacency_matrix(lazega, sparse=FALSE)
lazega.leig.fit1 <- eigenmodel_mcmc(A, R=2, S=11000,
   burn=10000) # Approximate the posterior distribution of parameters in an eigenmodel

5 percent done (iteration 1050) Thu Oct 01 08:59:26 2020
10 percent done (iteration 2100) Thu Oct 01 08:59:31 2020
15 percent done (iteration 3150) Thu Oct 01 08:59:37 2020
20 percent done (iteration 4200) Thu Oct 01 08:59:41 2020
25 percent done (iteration 5250) Thu Oct 01 08:59:45 2020
30 percent done (iteration 6300) Thu Oct 01 08:59:49 2020
35 percent done (iteration 7350) Thu Oct 01 08:59:53 2020
40 percent done (iteration 8400) Thu Oct 01 08:59:57 2020
45 percent done (iteration 9450) Thu Oct 01 09:00:00 2020
50 percent done (iteration 10500) Thu Oct 01 09:00:05 2020
55 percent done (iteration 11550) Thu Oct 01 09:00:09 2020
60 percent done (iteration 12600) Thu Oct 01 09:00:14 2020
65 percent done (iteration 13650) Thu Oct 01 09:00:18 2020
70 percent done (iteration 14700) Thu Oct 01 09:00:23 2020
75 percent done (iteration 15750) Thu Oct 01 09:00:27 2020
80 percent done (iteration 16800) Thu Oct 01 09:00:31 2020
85 percent done (iteration 17850) Thu Oct 01 09:00:35 2020
90 percent done (iteration 18900) Thu Oct 01 09:00:39 2020
95 percent done (iteration 19950) Thu Oct 01 09:00:43 2020
100 percent done (iteration 21000) Thu Oct 01 09:00:47 2020

定义分组

same.prac.op <- v.attr.lazega$Practice %o%
   v.attr.lazega$Practice
same.prac <- matrix(as.numeric(same.prac.op
   %in% c(1, 4, 9)), 36, 36)
same.prac <- array(same.prac,dim=c(36, 36, 1))

使用分组信息拟合

lazega.leig.fit2 <- eigenmodel_mcmc(A, same.prac, R=2,
+    S=11000,burn=10000)
5 percent done (iteration 1050) Thu Oct 01 09:08:47 2020
10 percent done (iteration 2100) Thu Oct 01 09:08:53 2020
15 percent done (iteration 3150) Thu Oct 01 09:08:58 2020
20 percent done (iteration 4200) Thu Oct 01 09:09:05 2020
25 percent done (iteration 5250) Thu Oct 01 09:09:10 2020
30 percent done (iteration 6300) Thu Oct 01 09:09:15 2020
35 percent done (iteration 7350) Thu Oct 01 09:09:20 2020
40 percent done (iteration 8400) Thu Oct 01 09:09:26 2020
45 percent done (iteration 9450) Thu Oct 01 09:09:31 2020
50 percent done (iteration 10500) Thu Oct 01 09:09:36 2020
55 percent done (iteration 11550) Thu Oct 01 09:09:41 2020
60 percent done (iteration 12600) Thu Oct 01 09:09:47 2020
65 percent done (iteration 13650) Thu Oct 01 09:09:52 2020
70 percent done (iteration 14700) Thu Oct 01 09:09:58 2020
75 percent done (iteration 15750) Thu Oct 01 09:10:03 2020
80 percent done (iteration 16800) Thu Oct 01 09:10:09 2020
85 percent done (iteration 17850) Thu Oct 01 09:10:16 2020
90 percent done (iteration 18900) Thu Oct 01 09:10:22 2020
95 percent done (iteration 19950) Thu Oct 01 09:10:28 2020
100 percent done (iteration 21000) Thu Oct 01 09:10:35 2020

最后，我们将共同办公地点加到模型里面

same.off.op <- v.attr.lazega$Office %o%
   v.attr.lazega$Office
same.off <- matrix(as.numeric(same.off.op %in%
   c(1, 4, 9)), 36, 36)
same.off <- array(same.off,dim=c(36, 36, 1))
lazega.leig.fit3 <- eigenmodel_mcmc(A, same.off,
    R=2, S=11000, burn=10000)

为了比较网络我们在每个模型推断得到的二维潜在变量空间表示有何不同，我们提取了每个拟合模型的特征向量。

lat.sp.1 <-
   eigen(lazega.leig.fit1$ULU_postmean)$vec[, 1:2]
lat.sp.2 <-
   eigen(lazega.leig.fit2$ULU_postmean)$vec[, 1:2]
lat.sp.3 <-
   eigen(lazega.leig.fit3$ULU_postmean)$vec[, 1:2]

并用这些坐标作为布局在igraph中绘制这一网络。

colbar <- c("red", "dodgerblue", "goldenrod")
v.colors <- colbar[V(lazega)$Office]
v.shapes <- c("circle", "square")[V(lazega)$Practice]
v.size <- 3.5*sqrt(V(lazega)$Years)
v.label <- V(lazega)$Seniority
plot(lazega, layout=lat.sp.1, vertex.color=v.colors,
      vertex.shape=v.shapes, vertex.size=v.size,
      vertex.label=(1:36))

拟合优度

使用交叉验证

perm.index <- sample(1:630)
nfolds <- 5
nmiss <- 630/nfolds
Avec <- A[lower.tri(A)]
Avec.pred1 <- numeric(length(Avec))

# 交叉验证过程
for(i in seq(1,nfolds)){
  # Index of missing values.
  miss.index <- seq(((i-1) * nmiss + 1),
     (i*nmiss), 1)
  A.miss.index <- perm.index[miss.index]

  # Fill a new Atemp appropriately with NA's.
  Avec.temp <- Avec
  Avec.temp[A.miss.index] <-
     rep("NA", length(A.miss.index))
  Avec.temp <- as.numeric(Avec.temp)
  Atemp <- matrix(0, 36, 36)
  Atemp[lower.tri(Atemp)] <- Avec.temp
  Atemp <- Atemp + t(Atemp)

  # Now fit model and predict.
  Y <- Atemp

  model1.fit <- eigenmodel_mcmc(Y, R=2,
     S=11000, burn=10000)
  model1.pred <- model1.fit$Y_postmean
  model1.pred.vec <-
     model1.pred[lower.tri(model1.pred)]
  Avec.pred1[A.miss.index] <-
     model1.pred.vec[A.miss.index]
}

使用ROC曲线评估模型

ROC曲线：接收者操作特征曲线（receiver operating characteristic curve），是反映敏感性和特异性连续变量的综合指标，roc曲线上每个点反映着对同一信号刺激的感受性。

对于分类器，或者说分类算法，评价指标主要有precision，recall，F-score等，以及这里要讨论的ROC和AUC

library(ROCR)
pred1 <- prediction(Avec.pred1, Avec)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, col="blue", lwd=3)

# CHUNK 34
perf1.auc <- performance(pred1, "auc")
slot(perf1.auc, "y.values")

[[1]]
[1] 0.819181

http://www.bu.edu/cs/files/2014/05/Kolaczyk.pdf
https://www.r-bloggers.com/2016/05/ergm-tutorial/
https://baike.baidu.com/item/%E7%A4%BE%E4%BC%9A%E7%BD%91%E7%BB%9C%E6%8C%87%E6%95%B0%E9%9A%8F%E6%9C%BA%E5%9B%BE%E6%A8%A1%E5%9E%8B%EF%BC%9A%E7%90%86%E8%AE%BA%E3%80%81%E6%96%B9%E6%B3%95%E4%B8%8E%E5%BA%94%E7%94%A8/20351978
https://blog.csdn.net/weixin_40895857/article/details/105184024
https://scholar.princeton.edu/sites/default/files/bstewart/files/soc-stats-ergms.pdf
https://www.cs.du.edu/~meiyin/Talk.pdf
http://www.princeton.edu/~eabbe/publications/abbe_FNT_2.pdf
https://hal.archives-ouvertes.fr/hal-00730829/file/model_selection_in_block_clustering_by_the_icl.pdf
https://cloud.tencent.com/developer/article/1595157
https://blog.csdn.net/g8015108/article/details/69367602
https://psych-networks.com/meaning-model-equivalence-network-models-latent-variables-theoretical-space/#:~:text=The%20paper%20constructs%20elegant%20representations%20of%20the%20Ising,the%20latent%20variable%20acts%20as%20a%20common%20effect%29.
http://www.stat.cmu.edu/~brian/905-2009/all-papers/hoff-raftery-handcock-2002-jasa.pdf
https://www.ijcai.org/Proceedings/11/Papers/286.pdf
https://www.r-bloggers.com/2011/01/example-8-21-latent-class-analysis/

网络数据统计分析笔记|| 网络图的统计模型

指数随机图模型

网络块模型

潜变量模型

猜你喜欢

热点阅读