TinkerPop中的GraphComputer组件

2020-03-11  本文已影响0人  天之見證

TinkerPop中对于图的计算,底层的计算引擎是通过GraphComputer来提供的

1. 计算的分类

TinkerPop提供的2种和图数据交互的方式:

OLTP OLAP
驱动计算的数据量 1个点/很少的几个点 点很多,计算过程中会用到全图
数据时效性 毫秒/秒级 数分钟/数小时
数据访问方式 随机 顺序
数据处理方式 串行 并行
oltp vs olap

2. 计算的分解

TinkerPop使用的计算模型,在我看来即是一个 大的MapReduce=BSP+MapReduce

其中:

  1. BSP: VertexProgram
  2. MapReduce: MapReduce (可能会没有)

1. 计算的入口/并行的开始VertexProgram

图中的涉及到的点,都会执行VertexProgram对应的代码, 一个抽象的执行机器称为worker, 这些worker可以并行执行这些代码, 即调用VertexProgram.execute() (这里采用了BSP计算模型), 节点之间通过2种消息进行通信

  1. MessageScope.Local: 相邻节点之间通信
  2. MessageScope.Global: 图中任一节点通信

VertexProgram执行完成之后, 会接着执行其对应的MapReduce任务, 可以通过VertexProgram.getMapReducers() 获得

BSP的执行示例如下:

  1. 本地计算
  2. 计算结果通信
  3. 阻塞直到所有的本次计算完成
  4. 转到步骤1
bsp diagram

PageRankVertexProgram 的实现为例子, 可以看到具体的实现过程

public class PageRankVertexProgram implements VertexProgram<Double> { //1

    public static final String PAGE_RANK = "gremlin.pageRankVertexProgram.pageRank";
    private static final String EDGE_COUNT = "gremlin.pageRankVertexProgram.edgeCount";
    private static final String PROPERTY = "gremlin.pageRankVertexProgram.property";
    private static final String VERTEX_COUNT = "gremlin.pageRankVertexProgram.vertexCount";
    private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
    private static final String EPSILON = "gremlin.pageRankVertexProgram.epsilon";
    private static final String MAX_ITERATIONS = "gremlin.pageRankVertexProgram.maxIterations";
    private static final String EDGE_TRAVERSAL = "gremlin.pageRankVertexProgram.edgeTraversal";
    private static final String INITIAL_RANK_TRAVERSAL = "gremlin.pageRankVertexProgram.initialRankTraversal";
    private static final String TELEPORTATION_ENERGY = "gremlin.pageRankVertexProgram.teleportationEnergy";
    private static final String CONVERGENCE_ERROR = "gremlin.pageRankVertexProgram.convergenceError";

    private MessageScope.Local<Double> incidentMessageScope = MessageScope.Local.of(__::outE); //2
    private MessageScope.Local<Double> countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
    private PureTraversal<Vertex, Edge> edgeTraversal = null;
    private PureTraversal<Vertex, ? extends Number> initialRankTraversal = null;
    private double alpha = 0.85d;
    private double epsilon = 0.00001d;
    private int maxIterations = 20;
    private String property = PAGE_RANK; //3
    private Set<VertexComputeKey> vertexComputeKeys;
    private Set<MemoryComputeKey> memoryComputeKeys;

    private PageRankVertexProgram() {    }

    @Override
    public void loadState(final Graph graph, final Configuration configuration) { //4
        if (configuration.containsKey(INITIAL_RANK_TRAVERSAL))
            this.initialRankTraversal = PureTraversal.loadState(configuration, INITIAL_RANK_TRAVERSAL, graph);
        if (configuration.containsKey(EDGE_TRAVERSAL)) {
            this.edgeTraversal = PureTraversal.loadState(configuration, EDGE_TRAVERSAL, graph);
            this.incidentMessageScope = MessageScope.Local.of(() -> this.edgeTraversal.get().clone());
            this.countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
        }
        this.alpha = configuration.getDouble(ALPHA, this.alpha);
        this.epsilon = configuration.getDouble(EPSILON, this.epsilon);
        this.maxIterations = configuration.getInt(MAX_ITERATIONS, 20);
        this.property = configuration.getString(PROPERTY, PAGE_RANK);
        this.vertexComputeKeys = new HashSet<>(Arrays.asList(
                VertexComputeKey.of(this.property, false),
                VertexComputeKey.of(EDGE_COUNT, true))); //5
        this.memoryComputeKeys = new HashSet<>(Arrays.asList(
                MemoryComputeKey.of(TELEPORTATION_ENERGY, Operator.sum, true, true),
                MemoryComputeKey.of(VERTEX_COUNT, Operator.sum, true, true),
                MemoryComputeKey.of(CONVERGENCE_ERROR, Operator.sum, false, true)));
    }

    @Override
    public void storeState(final Configuration configuration) {
        VertexProgram.super.storeState(configuration);
        configuration.setProperty(ALPHA, this.alpha);
        configuration.setProperty(EPSILON, this.epsilon);
        configuration.setProperty(PROPERTY, this.property);
        configuration.setProperty(MAX_ITERATIONS, this.maxIterations);
        if (null != this.edgeTraversal)
            this.edgeTraversal.storeState(configuration, EDGE_TRAVERSAL);
        if (null != this.initialRankTraversal)
            this.initialRankTraversal.storeState(configuration, INITIAL_RANK_TRAVERSAL);
    }

    @Override
    public void setup(final Memory memory) {
        memory.set(TELEPORTATION_ENERGY, null == this.initialRankTraversal ? 1.0d : 0.0d);
        memory.set(VERTEX_COUNT, 0.0d);
        memory.set(CONVERGENCE_ERROR, 1.0d);
    }

    @Override
    public void execute(final Vertex vertex, Messenger<Double> messenger, final Memory memory) { //7
        if (memory.isInitialIteration()) {
            messenger.sendMessage(this.countMessageScope, 1.0d);  //8
            memory.add(VERTEX_COUNT, 1.0d);
        } else {
            final double vertexCount = memory.<Double>get(VERTEX_COUNT);
            final double edgeCount;
            double pageRank;
            if (1 == memory.getIteration()) {
                edgeCount = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b);
                vertex.property(VertexProperty.Cardinality.single, EDGE_COUNT, edgeCount);
                pageRank = null == this.initialRankTraversal ?
                        0.0d :
                        TraversalUtil.apply(vertex, this.initialRankTraversal.get()).doubleValue(); //9
            } else {
                edgeCount = vertex.value(EDGE_COUNT);
                pageRank = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b); //10
            }
            //////////////////////////
            final double teleporationEnergy = memory.get(TELEPORTATION_ENERGY);
            if (teleporationEnergy > 0.0d) {
                final double localTerminalEnergy = teleporationEnergy / vertexCount;
                pageRank = pageRank + localTerminalEnergy;
                memory.add(TELEPORTATION_ENERGY, -localTerminalEnergy);
            }
            final double previousPageRank = vertex.<Double>property(this.property).orElse(0.0d);
            memory.add(CONVERGENCE_ERROR, Math.abs(pageRank - previousPageRank));
            vertex.property(VertexProperty.Cardinality.single, this.property, pageRank);
            memory.add(TELEPORTATION_ENERGY, (1.0d - this.alpha) * pageRank);
            pageRank = this.alpha * pageRank;
            if (edgeCount > 0.0d)
                messenger.sendMessage(this.incidentMessageScope, pageRank / edgeCount);
            else
                memory.add(TELEPORTATION_ENERGY, pageRank);
        }
    }

    @Override
    public boolean terminate(final Memory memory) { //11
        boolean terminate = memory.<Double>get(CONVERGENCE_ERROR) < this.epsilon || memory.getIteration() >= this.maxIterations;
        memory.set(CONVERGENCE_ERROR, 0.0d);
        return terminate;
    }
}
  1. loadStatestoreState 都是让代码在其他机器上执行的时候, 需要对实例/配置的状态进行保存和恢复
  2. execute里面可以看到消息之间的通信
  3. terminate 表示任务的结束条件

2. 计算的延伸/收敛MapReduce

一般来讲BSP模型计算完成之后, 它的结果是像图的属性一样分布在图的各个节点上

BSP计算的结束,并不是所有结果都收敛在了一个值上面, 而是计算迭代到了一定步数或者该次要进行的计算已经为空了

所以当我们需要对一些全局的问题作答的时候, 就需要对BSP的计算结果进行再加工

例如:

  1. 图聚类结束之后, 每个类簇下有多少节点
  2. 图聚类结束之后一共有多少个簇

3. Spark对这部分计算的实现

这里Spark对图计算的实现没有依赖GraphX, 而是按照上面的2个计算步骤来实现的, 具体实现

VertexProgram的执行

while (true) {
    if (Thread.interrupted()) {
        sparkContext.cancelAllJobs();
        throw new TraversalInterruptedException();
    }
    memory.setInExecute(true);
    viewIncomingRDD = SparkExecutor.executeVertexProgramIteration(loadedGraphRDD, viewIncomingRDD, memory, graphComputerConfiguration, vertexProgramConfiguration);
    memory.setInExecute(false);
    if (this.vertexProgram.terminate(memory))
        break;
    else {
        memory.incrIteration();
        memory.broadcastMemory(sparkContext);
    }
}

MapReduce的执行

for (final MapReduce mapReduce : this.mapReducers) {
    // execute the map reduce job
    final HadoopConfiguration newApacheConfiguration = new HadoopConfiguration(graphComputerConfiguration);
    mapReduce.storeState(newApacheConfiguration);
    // map
    final JavaPairRDD mapRDD = SparkExecutor.executeMap((JavaPairRDD) mapReduceRDD, mapReduce, newApacheConfiguration);
    // combine
    final JavaPairRDD combineRDD = mapReduce.doStage(MapReduce.Stage.COMBINE) ? SparkExecutor.executeCombine(mapRDD, newApacheConfiguration) : mapRDD;
    // reduce
    final JavaPairRDD reduceRDD = mapReduce.doStage(MapReduce.Stage.REDUCE) ? SparkExecutor.executeReduce(combineRDD, mapReduce, newApacheConfiguration) : combineRDD;
    // write the map reduce output back to disk and computer result memory
    if (null != outputRDD)
        mapReduce.addResultToMemory(finalMemory, outputRDD.writeMemoryRDD(graphComputerConfiguration, mapReduce.getMemoryKey(), reduceRDD));
}

ref:

  1. http://tinkerpop.apache.org/docs/3.4.6/reference/#graphcomputer
  2. https://github.com/apache/tinkerpop/tree/master/spark-gremlin
上一篇 下一篇

猜你喜欢

热点阅读