Data import into the Titan graph database falls into three approaches: serial loading, batch loading, and HDFS/Spark-based bulk loading. The first two are simple and relatively easy to develop and configure, so this article walks through HDFS/Spark-based bulk loading with a concrete example.
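For contrast with the bulk path, here is a minimal sketch of the serial approach in the Gremlin console, assuming a local Titan-on-Cassandra install with the same conf/titan-cassandra.properties used later:

// Serial loading: add elements one at a time inside ordinary transactions.
graph = TitanFactory.open("conf/titan-cassandra.properties")
v1 = graph.addVertex("user")
v2 = graph.addVertex("user")
v1.addEdge("friend", v2)
graph.tx().commit()

This is fine for small graphs, but it funnels every write through a single client, which is exactly what the Spark-based loader below avoids.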
1) The vertex set, out_vertices.txt
root@cnic-1:~/titan# cat test/out_vertices.txt
1
2
3
4
5
root@cnic-1:~/titan#
2) The edge data set, out_edges.txt. Note that the edges are stored in adjacency-list form, one vertex per line; a snippet decoding the line format follows the listing below.
root@cnic-1:~/titan# cat test/out_edges.txt
1|2,3,4|5
2|5|1,3
3|2|1,4,5
4|3|1,5
5|1,4,3|2
root@cnic-1:~/titan#
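To make the format concrete, here is a stand-alone Groovy snippet (runnable without Titan or Hadoop) that splits one line into its three |-separated fields:

// Decode one adjacency line: "<vertex-id>|<comma-list>|<comma-list>".
def (id, first, second) = "1|2,3,4|5".split("\\|")
assert id == "1"
assert first.split(",").toList() == ["2", "3", "4"]
assert second.split(",").toList() == ["5"]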
3) The line-parser scripts that ScriptInputFormat uses to turn each input line into graph elements. First vertices.groovy:
root@cnic-1:~/titan# cat test/vertices.groovy
def parse(line, factory) {
    // Each input line is a single vertex id; create it with label "user".
    def idstr = line
    def v1 = factory.vertex(idstr, "user")
    return v1
}
root@cnic-1:~/titan# cat test/edges.groovy
def parse(line, factory) {
    // Each line is "<id>|<comma-list>|<comma-list>"; split into three fields.
    def (id, inv, outv) = line.split("\\|")
    def in_lst = inv.toString().split(",")
    def out_lst = outv.toString().split(",")
    def idstr = "${id}".toString()
    // Create (or reference) this line's vertex with label "user".
    def v1 = factory.vertex(idstr, "user")
    // First list: friend edges from v1 to each listed vertex.
    for (v_id in in_lst) {
        def v2 = factory.vertex(v_id)
        factory.edge(v1, v2, "friend")
    }
    // Second list: friend edges from each listed vertex to v1.
    for (v_id in out_lst) {
        def v2 = factory.vertex(v_id)
        factory.edge(v2, v1, "friend")
    }
    return v1
}
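As a sanity check of the two loops, this stand-alone sketch mirrors their split logic and prints the friend edges one line produces. It assumes factory.edge takes (outVertex, inVertex, label), the argument order of the factory that TinkerPop's ScriptInputFormat hands to parse:

// What the line "2|5|1,3" turns into, per the loops in edges.groovy:
def (id, inv, outv) = "2|5|1,3".split("\\|")
inv.split(",").each { println "friend edge: ${id} -> ${it}" }   // first loop: edge(v1, v2)
outv.split(",").each { println "friend edge: ${it} -> ${id}" }  // second loop: edge(v2, v1)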
4) Finally, hadoop-script-load-example.groovy, the Gremlin script that drives the import: it defines the Titan schema, then launches one Spark job to load the vertices and another to load the edges.
root@cnic-1:~/titan# cat hadoop-script-load-example.groovy
// Open Titan against Cassandra and define the schema up front.
cassandra_props = "conf/titan-cassandra.properties"
path = "/root/titan/test"
graph = TitanFactory.open(cassandra_props)
m = graph.openManagement()
user = m.makeVertexLabel("user").make()
friend = m.makeEdgeLabel("friend").make()
// bulkLoader.vertex.id is where BulkLoaderVertexProgram stores each original
// id; index it so the edge pass can look vertices up quickly.
blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
uid = m.makePropertyKey("uid").dataType(Long.class).make()
m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
m.commit()
// Pass 1: push the vertex file and parser to HDFS, then load the vertices.
hdfs.copyFromLocal("${path}/out_vertices.txt", "vertices.txt")
hdfs.copyFromLocal("${path}/vertices.groovy", "vertices.groovy")
graph = GraphFactory.open("conf/hadoop-graph/hadoop-script.properties")
graph.configuration().setInputLocation("vertices.txt")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "vertices.groovy")
blvp = BulkLoaderVertexProgram.build().writeGraph(cassandra_props).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
// Pass 2: load the edges, matching the vertices created in pass 1.
hdfs.copyFromLocal("${path}/out_edges.txt", "edges.txt")
hdfs.copyFromLocal("${path}/edges.groovy", "edges.groovy")
graph = GraphFactory.open("conf/hadoop-graph/hadoop-script.properties")
graph.configuration().setInputLocation("edges.txt")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "edges.groovy")
blvp = BulkLoaderVertexProgram.build().keepOriginalIds(false).writeGraph(cassandra_props).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
root@cnic-1:~/titan#
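The script reuses conf/hadoop-graph/hadoop-script.properties for both passes and only overrides the input location and parser script at runtime, so the rest of the Hadoop/Spark wiring lives in that file. As a hedged sketch, the version shipped with the Titan distribution looks roughly like this (the class names are the real TinkerPop ones; spark.master and the locations vary per installation):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
# Spark settings picked up by SparkGraphComputer
spark.master=local[4]
spark.serializer=org.apache.spark.serializer.KryoSerializer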
5) Run
From the titan directory, run:
./bin/gremlin.sh hadoop-script-load-example.groovy
out_edges.txt, out_vertices.txt, vertices.groovy, and edges.groovy live under titan/test; hadoop-script-load-example.groovy sits in the titan directory itself.
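After the two Spark jobs complete, a quick sanity check in the Gremlin console (a sketch, assuming the same Cassandra backend as above) is to count what was written; the five-line vertex file should yield 5 user vertices, while the friend-edge total depends on how the loader deduplicates adjacency entries that appear on both endpoints' lines:

graph = TitanFactory.open("conf/titan-cassandra.properties")
g = graph.traversal()
println g.V().hasLabel("user").count().next()    // expect 5
println g.E().hasLabel("friend").count().next()  // edge total, see note above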
Reference: https://groups.google.com/forum/#!topic/aureliusgraphs/fLPl7OlcXt0