网络数据统计分析笔记|| 网络流数据分析
前情回顾:
Gephi网络图极简教程
Network在单细胞转录组数据分析中的应用
网络数据统计分析笔记|| 为什么研究网络
网络数据统计分析笔记|| 操作网络数据
网络数据统计分析笔记|| 网络数据可视化
网络数据统计分析笔记|| 网络数据的描述性分析
网络数据统计分析笔记||网络图的数学模型
网络数据统计分析笔记|| 网络图的统计模型
网络数据统计分析笔记|| 网络拓扑结构推断
网络流(network-flows)是一种类比水流的解决问题方法,与线性规划密切相关。网络流的理论和应用在不断发展,出现了具有增益的流、多终端流、多商品流以及网络流的分解与合成等新课题。网络流的应用已遍及通讯、运输、电力、工程规划、任务分派、设备更新以及计算机辅助设计等众多领域。
网络流(Network Flow)的相关定义:
源点:有n个点,有m条有向边,有一个点很特殊,只出不进,叫做源点。
汇点:另一个点也很特殊,只进不出,叫做汇点。
容量和流量:每条有向边上有两个量,容量和流量,从i到j的容量通常用c[i,j]表示,流量则通常是f[i,j].
通常可以把这些边想象成道路,流量就是这条道路的车流量,容量就是道路可承受的最大的车流量。很显然的,流量<=容量。而对于每个不是源点和汇点的点来说,可以类比的想象成没有存储功能的货物的中转站,所有“进入”他们的流量和等于所有从他本身“出去”的流量。
最大流:把源点比作工厂的话,问题就是求从工厂最大可以发出多少货物,是不至于超过道路的容量限制,也就是,最大流。先认识一下S(source)和T(sink)的概念。S就是常说的源点,T就是汇点(也就是起点和重点,这个跟最短路的概念是一样的)。我们有一张图,要求从源点流向汇点的最大流量(可以有很多条路到达汇点),就是我们的最大流问题(max flow)。
library(sand)
data(calldata)
names(calldata)
# ---
## [1] "Orig" "Dest" "DistEuc" "DistRd" "O.GRP"
## [6] "D.GRP" "Flow"
# ---
head(calldata)
Orig Dest DistEuc DistRd O.GRP D.GRP Flow
1 Bruck/Le Wien 36.14 37 7828880 285193984 57.61270
2 Bruck/Le Mistelbach 61.46 81 7828880 7685130 1.39963
3 Bruck/Le Wr.Neustadt 46.50 56 7828880 27369776 19.90791
4 Bruck/Le St.P<U+00BA>lten 88.60 96 7828880 24601360 3.90906
5 Bruck/Le Zwettl 135.13 145 7828880 8143690 1.38712
6 Bruck/Le Hollabrunn 78.75 85 7828880 6385840 1.49779
min.call <- min(calldata$Flow)
calldata$FlowCnt <- round(5 * calldata$Flow / min.call)
W <- xtabs(FlowCnt ~ Orig + Dest, calldata)
head(W)
Dest
Orig Amstetten BadIschl Bischofsh. Bruck/Le Bruck/Mur Feldkirch Freistadt Graz Hartberg
Amstetten 0 4222 6731 640 7235 6342 7160 10389 3950
BadIschl 2514 0 3695 1184 2981 2192 990 13802 1194
Bischofsh. 3113 3812 0 1095 5807 4926 649 17626 2259
Bruck/Le 2812 1137 1902 0 2715 3865 261 8717 7369
Bruck/Mur 7485 5337 7184 842 0 14447 529 38500 7360
Feldkirch 3304 850 774 7593 71570 0 235 127013 31020
Dest
Orig Hollabrunn Innsbruck Judenburg Kirch/Krems Klagenfurt Landeck Leibnitz Lienz/Os Liezen
Amstetten 522 4346 2610 4186 6767 715 1883 1452 9408
BadIschl 620 3363 2232 3083 2640 512 1095 715 5157
Bischofsh. 1195 6711 6011 1165 11946 1172 2378 1826 11635
Bruck/Le 2925 2246 1486 836 4012 773 1339 1795 1006
Bruck/Mur 494 8665 9391 2129 23218 1875 7887 1841 60
Feldkirch 1217 7470 10397 1073 6040 1570 11627 3904 19740
Dest
Orig Linz Mistelbach Reutte Ried/Inn Salzburg Sp/Drau St.P�lten V�cklabruck
Amstetten 92097 4455 1054 8286 6356 2365 2761 10190
BadIschl 28618 338 196 6619 16401 653 3132 42022
Bischofsh. 14742 844 556 7667 43725 6740 4770 7043
Bruck/Le 7268 2734 413 1890 2363 750 7635 2559
Bruck/Mur 18372 458 1639 2731 6052 3964 5402 4311
Feldkirch 15229 6164 1423 4688 1110 806 9586 4748
Dest
Orig Wien Wolfsberg W�rgl Wr.Neustadt ZellamSee Zwettl
Amstetten 24920 692 2937 6439 3218 2393
BadIschl 43925 264 1681 2968 1771 666
Bischofsh. 12277 1915 5585 4153 811 1647
Bruck/Le 112525 453 3502 38883 795 2709
Bruck/Mur 9941 4789 6385 6326 3855 992
Feldkirch 169658 348 3360 10088 6953 1498
g.cd <- graph_from_adjacency_matrix(W, weighted=TRUE)
in.flow <- strength(g.cd, mode="in")
out.flow <- strength(g.cd, mode="out")
vsize <- sqrt(in.flow + out.flow) / 100
pie.vals <- lapply((1:vcount(g.cd)),
function(i) c(in.flow[i], out.flow[i]))
ewidth <- E(g.cd)$weight / 10^5
set.seed(42)
plot(g.cd, vertex.size=vsize, vertex.shape="pie",
vertex.pie=pie.vals, edge.width=ewidth,
edge.arrow.size=0.1)
calldata$lFlowCnt <- log(calldata$FlowCnt, 10)
calldata$lO.GRP <- log(calldata$O.GRP, 10)
calldata$lD.GRP <- log(calldata$D.GRP, 10)
calldata$lDistRd <- log(calldata$DistRd, 10)
library(car)
scatterplotMatrix( ~ lFlowCnt + lO.GRP + lD.GRP +
lDistRd, data=calldata, regLine=list(col="red"),
smooth=list(spread=FALSE,col.smooth="goldenrod"),
col="powderblue")
以上可以看成是特征选择
formula.s <- FlowCnt ~ lO.GRP + lD.GRP + lDistRd
formula.g <- FlowCnt ~ Orig + Dest + lDistRd
gm.s <- glm(formula.s, family="poisson", data=calldata)
gm.g <- glm(formula.g, family="poisson", data=calldata)
summary(gm.s)
Call:
glm(formula = formula.s, family = "poisson", data = calldata)
Deviance Residuals:
Min 1Q Median 3Q Max
-475.06 -54.16 -29.20 -2.09 1149.93
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.149e+01 5.394e-03 -2131 <2e-16 ***
lO.GRP 1.885e+00 4.306e-04 4376 <2e-16 ***
lD.GRP 1.670e+00 4.401e-04 3794 <2e-16 ***
lDistRd -2.191e+00 7.909e-04 -2770 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 45490237 on 991 degrees of freedom
Residual deviance: 10260808 on 988 degrees of freedom
AIC: 10270760
Number of Fisher Scoring iterations: 5
gm.g$aic
# ---
## [1] 5466814
# ---
gm.s$aic
# ---
## [1] 10270760
# ---
# CHUNK 11
plot(calldata$lFlowCnt,log(gm.g$fitted.values,10),
cex.lab=1.5,
xlab=expression(Log[10](paste("Flow Volume"))),
col="green", cex.axis=1.5, ylab="", ylim=c(2, 5.75))
mtext(expression(Log[10](paste("Fitted Value"))), 2,
outer=T, cex=1.5, padj=1)
clip(0.5,5.75,2,5.75)
abline(0, 1, lwd=2, col="darkgoldenrod1")
res <- residuals.glm(gm.g, type="response")
relres <- res/calldata$FlowCnt
lrelres <- log(abs(relres),10)
res.sgn <- (relres>=0)
plot(calldata$lFlowCnt[res.sgn], lrelres[res.sgn],
xlim=c(0.5,5.75), ylim=c(-3.5,3.5),
xlab=expression(Log[10](paste("Flow Volume"))),
cex.lab=1.5, cex.axis=1.5, ylab="", col="lightgreen")
mtext(expression(Log[10](paste("Relative Error"))), 2,
outer=T, cex=1.5, padj=1)
par(new=T)
plot(calldata$lFlowCnt[!res.sgn], lrelres[!res.sgn],
xlim=c(0.5,5.75), ylim=c(-3.5, 3.5),
xlab=expression(Log[10](paste("Flow Volume"))),
cex.lab=1.5, cex.axis=1.5, ylab="", col="darkgreen")
mtext(expression(Log[10](paste("Relative Error"))), 2,
outer=T, cex=1.5, padj=1)
clip(0.5,5.75,-3.5,3.5)
abline(h=0, lwd=2, col="darkgoldenrod2")
网络流的预测:流量矩阵估计
有了网络链路上的流量测量,再加上力量在起讫点间链路上分布方式的知识,对于我们准确预测起讫点间的流量就足够了。这被称为流量矩阵预测(traffic matrix estimation )。
流量矩阵是一个二维矩阵与其邻接矩阵元素,tij决定的流量采购从节点我和退出节点j。tij价值也被称为交通需求和每个需求代表的数量每一对网络节点之间的数据传输。在由4个节点组成的网络中,每个节点是一个流量源或一个流量汇聚,流量矩阵包含12个需求(图5a)。当节点1、2代表流量源,节点3、4代表流量目的地时,只有4个需求(图5b)。
library(networkTomography)
data(bell.labs)
g.bl <- graph_from_literal(fddi:switch:local:corp
++ Router)
plot(g.bl)
B <- bell.labs$A
Z <- bell.labs$X
x <- bell.labs$Y
# CHUNK 16
library(lattice)
traffic.in <- c("dst fddi","dst switch",
"dst local","dst corp")
traffic.out <- c("src fddi","src switch",
"src local","src corp")
my.df <- bell.labs$df
my.df$t <- unlist(lapply(my.df$time, function(x) {
hrs <- as.numeric(substring(x, 11, 12))
mins <- as.numeric(substring(x, 14, 15))
t <- hrs + mins/60
return(t)}))
# Separate according to whether data
# are incoming or outgoing.
my.df.in <- subset(my.df, nme %in% traffic.in)
my.df.out <- subset(my.df, nme %in% traffic.out)
# Set up trellis plots for each case.
p.in <- xyplot(value / 2^10 ~ t | nme, data=my.df.in,
type="l", col.line="goldenrod",
lwd=2, layout=c(1,4),
xlab="Hour of Day", ylab="Kbytes/sec")
p.out <- xyplot(value / 2^10 ~ t | nme, data=my.df.out,
type="l", col.line="red",
lwd=2, layout=c(1,4),
xlab="Hour of Day", ylab="Kbytes/sec")
# Generate trellis plots.
print(p.in, position=c(0,0.5,1,1), more=TRUE)
print(p.out, position=c(0,0,1,0.5))
B.full <- rbind(B, 2 - colSums(B))
write.table(format(B.full),
row.names=F, col.names=F, quote=F)
# ---
## 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
## 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0
## 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
## 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
# ---
层析引力方法(tomogravity algorithm )
计算机通信网络的建模和分析产生了各种有趣的统计问题。本文讨论了网络的tomog- raphy问题,tomog- raphy用于描述两类大容量逆问题。第一个处理被动层析成像,即在单个路由器/节点级别收集聚集数据,其目标是恢复p级信息。这里的主要问题是估计出发地-目的地转换矩阵。第二种方法是主动层析成像,它利用重构链路级信息fr的方法进行数据分析
x.full <- Z %*% t(B.full)
tomo.fit <- tomogravity(x.full, B.full, 0.01)
zhat <- tomo.fit$Xhat
# CHUNK 19
nt <- nrow(Z); nf <- ncol(Z)
t.dat <- data.frame(z = as.vector(c(Z) / 2^10),
zhat = as.vector(c(zhat) / 2^10),
t <- c(rep(as.vector(bell.labs$tvec), nf)))
od.names <- c(rep("fddi->fddi", nt),
rep("fddi->local", nt),
rep("fddi->switch", nt), rep("fddi->corp",nt),
rep("local->fddi", nt), rep("local->local",nt),
rep("local->switch", nt), rep("local->corp",nt),
rep("switch->fddi", nt), rep("switch->local",nt),
rep("switch->switch", nt), rep("switch->corp",nt),
rep("corp->fddi", nt), rep("corp->local",nt),
rep("corp->switch", nt), rep("corp->corp",nt))
t.dat <- transform(t.dat, OD = od.names)
xyplot(z~t | OD, data=t.dat,
panel=function(x, y, subscripts){
panel.xyplot(x, y, type="l", col.line="blue")
panel.xyplot(t.dat$t[subscripts],
t.dat$zhat[subscripts],
type="l", col.line="green")
}, as.table=T, subscripts=T, xlim=c(0,24),
xlab="Hour of Day", ylab="Kbytes/sec",
layout=c(4,4))
https://blog.csdn.net/BigFatSheep/article/details/78771897
https://blog.csdn.net/dengtiaolu0407/article/details/102199361?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
https://www.cnblogs.com/FibonacciHeap/articles/9691400.html
基于引力模型的区域物流需求预测研究
https://sites.cs.ucsb.edu/~suri/cs231/Flows.pdf
http://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/y2h.pdf
https://www.noction.com/knowledge-base/network-capacity-planning
Traffic matrix estimation: A neural network approach with extended input and expectation maximization iteration
https://rdrr.io/cran/networkTomography/
https://arxiv.org/pdf/0708.0945.pdf