18. The HugeGraph Search Index

2021-11-01 NazgulSun

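As things stand, there are basically two kinds of index design:
1) Like nebula: attach a third-party search engine, elasticsearch, directly.
This keeps the graph engine itself simpler. It can focus on graph structure while ES handles text search and similar queries.
The cost is that the data now lives in two separate systems, and keeping both sides consistent is a significant problem.
For example, when a vertex is deleted or added, do we need a distributed transaction?
2) Like hugegraph: build the indexes in, for example secondary indexes, fuzzy-match (search) indexes, and range indexes.
Index data and graph data live in the same repo, and a graph update writes the index data synchronously inside the same transaction,
so there is no data consistency problem.
The built-in approach increases the system's complexity and coupling, and functionally it is not as powerful as a dedicated search engine such as ES.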
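Let's take hugegraph's search index as the example. At its core it still follows the secondary-index approach.
Say there is a vertex named 《珠海全志科技有限公司》. When the index is built, a tokenizer is called to segment the name into
a set of words such as 【珠海，全志科技，全志，有限公司】. Segmentation can use an NLP tool such as jieba.
Then an inverted list is built, for example 珠海 -> node, 全志科技 -> node: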

            case SEARCH:
                E.checkState(propValues.size() == 1,
                             "Expect only one property in search index");
                value = propValues.get(0);
                Set<String> words = this.segmentWords(value.toString());
                for (String word : words) {
                    this.updateIndex(indexLabel, word, element.id(), removed);
                }
                break;
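
To make the inverted-list idea concrete, here is a minimal in-memory sketch of what the SEARCH case above amounts to: one index entry per segmented word, added or removed together with the element. Everything in it is an assumption for illustration: `InvertedIndexSketch`, `update` and `lookup` are hypothetical names, the map is a stand-in for the real backend storage, and the whitespace splitter only stands in for a real word segmenter.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a search index; not HugeGraph's storage layout.
class InvertedIndexSketch {
    // word -> ids of the vertices whose indexed property contains that word
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Placeholder tokenizer: splits on whitespace. A real deployment would
    // call a proper word segmenter here instead.
    static Set<String> segmentWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Same shape as the loop in the SEARCH case: one index update per word.
    void update(String text, String vertexId, boolean removed) {
        for (String word : segmentWords(text)) {
            if (removed) {
                Set<String> ids = postings.get(word);
                if (ids != null) {
                    ids.remove(vertexId);
                    if (ids.isEmpty()) {
                        postings.remove(word);
                    }
                }
            } else {
                postings.computeIfAbsent(word, k -> new LinkedHashSet<>())
                        .add(vertexId);
            }
        }
    }

    // Read-only view of the vertex ids recorded under one word.
    Set<String> lookup(String word) {
        return Collections.unmodifiableSet(
                postings.getOrDefault(word, Collections.emptySet()));
    }
}
```

With this sketch, indexing a pre-segmented name under id v1 and then calling `lookup` with any one of its words returns a set containing v1, which is exactly the 珠海 -> node mapping described above.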

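At search time, g.V().has('name', Text.contains('珠海全志科技有限公司')) again segments 珠海全志科技有限公司
into 【珠海，全志科技，全志，有限公司】, then iterates over the words and runs a secondary-index lookup for each one, equivalent to g.V().has('name', '珠海').
For now this approach has two drawbacks: performance is not great, and result ranking is poor, with no way to sort on demand. Applications with demanding search requirements
need to attach a third-party search engine such as ES themselves and manage the index updates themselves.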

    @Watched(prefix = "index")
    private List<IdHolder> doSearchIndex(ConditionQuery query,
                                         MatchedIndex index) {
        query = this.constructSearchQuery(query, index);
        List<IdHolder> holders = new SortByCountIdHolderList(query.paging());
        // sorted by matched count
        for (ConditionQuery q : ConditionQueryFlatten.flatten(query)) {
            IndexQueries queries = index.constructIndexQueries(q);
            assert !query.paging() || queries.size() <= 1;
            IdHolder holder = this.doSingleOrJointIndex(queries);
            // NOTE: ids will be merged into one IdHolder if not in paging
            holders.add(holder);
        }
        return holders;
    }
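
Judging by the `sorted by matched count` comment and the `SortByCountIdHolderList` name in `doSearchIndex` above, ids that are hit by more of the query words come first. The sketch below shows that ranking idea on its own. It is an assumption about the behaviour, not a copy of HugeGraph's class, and `MatchCountRankingSketch` and `rankByMatchedCount` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: ranks ids by how many query words matched them.
class MatchCountRankingSketch {
    // idsPerWord holds one id set per query word (the result of one
    // secondary-index lookup each). Returns all ids, most matches first.
    static List<String> rankByMatchedCount(List<Set<String>> idsPerWord) {
        // LinkedHashMap keeps first-seen order, used to break ties
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Set<String> ids : idsPerWord) {
            for (String id : ids) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so ids with equal counts keep first-seen order
        ranked.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ranked;
    }
}
```

Note that this is only a count of matched words. There is no term weighting or relevance scoring of the kind a dedicated search engine provides, which fits the complaint about ranking quality.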

constructSearchQuery segments the query text:

    private ConditionQuery constructSearchQuery(ConditionQuery query,
                                                MatchedIndex index) {
        ConditionQuery originQuery = query;
        Set<Id> indexFields = new HashSet<>();
        // Convert has(key, text) to has(key, textContainsAny(word1, word2))
        for (IndexLabel il : index.indexLabels()) {
            if (il.indexType() != IndexType.SEARCH) {
                continue;
            }
            Id indexField = il.indexField();
            String fieldValue = (String) query.userpropValue(indexField);
            Set<String> words = this.segmentWords(fieldValue);
            indexFields.add(indexField);

            query = query.copy();
            query.unsetCondition(indexField);
            query.query(Condition.textContainsAny(indexField, words));
        }

        // Register results filter
        query.registerResultsFilter(elem -> {
            for (Condition cond : originQuery.conditions()) {
                Object key = cond.isRelation() ? ((Relation) cond).key() : null;
                if (key instanceof Id && indexFields.contains(key)) {
                    // This is an index field of search index
                    Id field = (Id) key;
                    String propValue = elem.<String>getPropertyValue(field);
                    String fvalue = (String) originQuery.userpropValue(field);
                    if (this.matchSearchIndexWords(propValue, fvalue)) {
                        continue;
                    }
                    return false;
                }
                if (!cond.test(elem)) {
                    return false;
                }
            }
            return true;
        });

        return query;
    }
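
The shape of `constructSearchQuery` is a two-phase lookup: the rewritten `textContainsAny` query fetches candidates from the index, and the registered results filter re-checks each candidate against the original query text. The sketch below illustrates only that shape. The body of `matchSearchIndexWords` is not shown in the excerpt, so the rule used here (the stored value and the query text share at least one segmented word) is an assumption, and `ResultsFilterSketch`, `matchWords` and `filter` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical post-filter over index candidates; not HugeGraph code.
class ResultsFilterSketch {
    // Placeholder tokenizer: whitespace split instead of a real segmenter.
    static Set<String> segmentWords(String text) {
        Set<String> words = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        words.remove("");
        return words;
    }

    // Assumed match rule: at least one word in common.
    static boolean matchWords(String propValue, String queryText) {
        Set<String> shared = segmentWords(propValue);
        shared.retainAll(segmentWords(queryText));
        return !shared.isEmpty();
    }

    // candidates: id -> current property value of the element the index
    // pointed at. Keeps only the ids whose current value still matches.
    static List<String> filter(Map<String, String> candidates, String queryText) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> candidate : candidates.entrySet()) {
            if (matchWords(candidate.getValue(), queryText)) {
                kept.add(candidate.getKey());
            }
        }
        return kept;
    }
}
```

A re-check like this is useful whenever the index can return an element whose current property value no longer contains the word, for instance if an old index entry has not been cleaned up yet.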

Other index tips

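Multiple has conditions can only be served by a composite index; has(A), has(B) cannot be answered as several separate index queries.
Conditions such as within/in are flattened and then executed one by one, with no batching.
The whole execution process is built around single-step queries whose results are merged afterwards. Because the server has to adapt to multiple storage backends, many compound queries are assembled on the server side.
If a single backend were fixed, a lot of optimization could probably be done at that layer. A specialized implementation generally outperforms a general-purpose one by a wide margin.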
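
The flatten-then-merge pattern can be sketched as follows: a within condition over N values becomes N single-value queries, each executed separately, with the results merged at the end. This is only an illustration of the pattern under my own assumptions. `FlattenWithinSketch` and `flattenAndRun` are hypothetical names, and the `singleValueQuery` function stands in for one round trip to the backend.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of flattening a within(...) condition.
class FlattenWithinSketch {
    // Runs one single-value query per element of withinValues and merges
    // the ids, dropping duplicates while keeping first-seen order.
    static <V> List<String> flattenAndRun(
            List<V> withinValues,
            Function<V, List<String>> singleValueQuery) {
        Set<String> merged = new LinkedHashSet<>();
        for (V value : withinValues) {
            // One backend call per value: this is the "no batch" cost
            merged.addAll(singleValueQuery.apply(value));
        }
        return new ArrayList<>(merged);
    }
}
```

The number of backend calls grows linearly with the number of values. A backend-specific implementation could send all values in one multi-get instead, which is the kind of optimization a single fixed backend would make possible.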

How nebula approaches third-party indexing

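See this article: https://blog.csdn.net/weixin_44324814/article/details/117999808
To handle data consistency, raft is responsible for getting the log into the graph as usual, and a separate process, the listener, is started alongside it.
The listener works like a synchronizer: it replays the leader's log into third-party storage such as elasticsearch.
This is clearly an eventual-consistency model. Under strict real-time requirements the two sides may be temporarily inconsistent. Full strong consistency would require distributed transactions, and system complexity would rise sharply.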
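
The eventual-consistency window is easy to see in a toy model: writes go to the log and the graph store immediately, while the search store only catches up when the listener replays the log. This is a sketch under my own assumptions and not Nebula's API. `LogListenerSketch` and all of its members are hypothetical, and plain in-memory maps stand in for the graph storage and for elasticsearch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of log replay into an external search store.
class LogListenerSketch {
    private static final class LogEntry {
        final String vertexId;
        final String text; // null means "delete this vertex"

        LogEntry(String vertexId, String text) {
            this.vertexId = vertexId;
            this.text = text;
        }
    }

    private final List<LogEntry> committedLog = new ArrayList<>();
    private final Map<String, String> graphStore = new HashMap<>();
    private final Map<String, String> searchStore = new HashMap<>();
    private int listenerOffset = 0; // next log position the listener replays

    // A write is committed to the log and applied to the graph right away.
    void put(String vertexId, String text) {
        committedLog.add(new LogEntry(vertexId, text));
        graphStore.put(vertexId, text);
    }

    void delete(String vertexId) {
        committedLog.add(new LogEntry(vertexId, null));
        graphStore.remove(vertexId);
    }

    // The listener replays every entry it has not seen yet, in log order.
    int listenerCatchUp() {
        int replayed = 0;
        while (listenerOffset < committedLog.size()) {
            LogEntry entry = committedLog.get(listenerOffset++);
            if (entry.text == null) {
                searchStore.remove(entry.vertexId);
            } else {
                searchStore.put(entry.vertexId, entry.text);
            }
            replayed++;
        }
        return replayed;
    }

    boolean searchable(String vertexId) {
        return searchStore.containsKey(vertexId);
    }

    boolean inSync() {
        return graphStore.equals(searchStore);
    }
}
```

Between a `put` and the next `listenerCatchUp`, the vertex exists in the graph but is not yet searchable. After a `delete`, it stays searchable until the listener catches up. That gap is the inconsistency a real-time workload would observe.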

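Either way, hugegraph's index module still has a lot of room for improvement, and if hugegraph committed to a single storage backend there would be plenty of room for optimization throughout.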
