Data import into the Titan graph database falls into three approaches: serial loading, batch loading, and HDFS/Spark-based bulk loading. The first two are simple and relatively easy to develop and configure, so this article walks through HDFS/Spark-based bulk loading with a concrete example.
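For contrast with the bulk path, here is a minimal sketch of the serial approach in the Gremlin console, assuming a local Titan-on-Cassandra install with the same conf/titan-cassandra.properties used later:

// Serial loading: add elements one at a time inside ordinary transactions.
graph = TitanFactory.open("conf/titan-cassandra.properties")
v1 = graph.addVertex("user")
v2 = graph.addVertex("user")
v1.addEdge("friend", v2)
graph.tx().commit()

This is fine for small graphs, but it funnels every write through a single client, which is exactly what the Spark-based loader below avoids.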
1) The vertex set, out_vertices.txt
root@cnic-1:~/titan# cat test/out_vertices.txt
1
2
3
4
5
root@cnic-1:~/titan#
2) The edge data set, out_edges.txt. Note that the edges are stored in adjacency-list form, one vertex per line; a snippet decoding the line format follows the listing below.
root@cnic-1:~/titan# cat test/out_edges.txt
1|2,3,4|5
2|5|1,3
3|2|1,4,5
4|3|1,5
5|1,4,3|2
root@cnic-1:~/titan#
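To make the format concrete, here is a stand-alone Groovy snippet (runnable without Titan or Hadoop) that splits one line into its three |-separated fields:

// Decode one adjacency line: "<vertex-id>|<comma-list>|<comma-list>".
def (id, first, second) = "1|2,3,4|5".split("\\|")
assert id == "1"
assert first.split(",").toList() == ["2", "3", "4"]
assert second.split(",").toList() == ["5"]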
3) The line-parser scripts that ScriptInputFormat uses to turn each input line into graph elements. First vertices.groovy:
root@cnic-1:~/titan# cat test/vertices.groovy
def parse(line, factory) {
    // Each input line is a single vertex id; create it with label "user".
    def idstr = line
    def v1 = factory.vertex(idstr, "user")
    return v1
}
root@cnic-1:~/titan# cat test/edges.groovy
def parse(line, factory) {
    // Each line is "<id>|<comma-list>|<comma-list>"; split into three fields.
    def (id, inv, outv) = line.split("\\|")
    def in_lst = inv.toString().split(",")
    def out_lst = outv.toString().split(",")
    def idstr = "${id}".toString()
    // Create (or reference) this line's vertex with label "user".
    def v1 = factory.vertex(idstr, "user")
    // First list: friend edges from v1 to each listed vertex.
    for (v_id in in_lst) {
        def v2 = factory.vertex(v_id)
        factory.edge(v1, v2, "friend")
    }
    // Second list: friend edges from each listed vertex to v1.
    for (v_id in out_lst) {
        def v2 = factory.vertex(v_id)
        factory.edge(v2, v1, "friend")
    }
    return v1
}
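As a sanity check of the two loops, this stand-alone sketch mirrors their split logic and prints the friend edges one line produces. It assumes factory.edge takes (outVertex, inVertex, label), the argument order of the factory that TinkerPop's ScriptInputFormat hands to parse:

// What the line "2|5|1,3" turns into, per the loops in edges.groovy:
def (id, inv, outv) = "2|5|1,3".split("\\|")
inv.split(",").each { println "friend edge: ${id} -> ${it}" }   // first loop: edge(v1, v2)
outv.split(",").each { println "friend edge: ${it} -> ${id}" }  // second loop: edge(v2, v1)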
4) Finally, hadoop-script-load-example.groovy, the Gremlin script that drives the import: it defines the Titan schema, then launches one Spark job to load the vertices and another to load the edges.
root@cnic-1:~/titan# cat hadoop-script-load-example.groovy
// Open Titan against Cassandra and define the schema up front.
cassandra_props = "conf/titan-cassandra.properties"
path = "/root/titan/test"
graph = TitanFactory.open(cassandra_props)
m = graph.openManagement()
user = m.makeVertexLabel("user").make()
friend = m.makeEdgeLabel("friend").make()
// bulkLoader.vertex.id is where BulkLoaderVertexProgram stores each original
// id; index it so the edge pass can look vertices up quickly.
blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
uid = m.makePropertyKey("uid").dataType(Long.class).make()
m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
m.commit()
// Pass 1: push the vertex file and parser to HDFS, then load the vertices.
hdfs.copyFromLocal("${path}/out_vertices.txt", "vertices.txt")
hdfs.copyFromLocal("${path}/vertices.groovy", "vertices.groovy")
graph = GraphFactory.open("conf/hadoop-graph/hadoop-script.properties")
graph.configuration().setInputLocation("vertices.txt")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "vertices.groovy")
blvp = BulkLoaderVertexProgram.build().writeGraph(cassandra_props).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
// Pass 2: load the edges, matching the vertices created in pass 1.
hdfs.copyFromLocal("${path}/out_edges.txt", "edges.txt")
hdfs.copyFromLocal("${path}/edges.groovy", "edges.groovy")
graph = GraphFactory.open("conf/hadoop-graph/hadoop-script.properties")
graph.configuration().setInputLocation("edges.txt")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "edges.groovy")
blvp = BulkLoaderVertexProgram.build().keepOriginalIds(false).writeGraph(cassandra_props).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
root@cnic-1:~/titan#
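The script reuses conf/hadoop-graph/hadoop-script.properties for both passes and only overrides the input location and parser script at runtime, so the rest of the Hadoop/Spark wiring lives in that file. As a hedged sketch, the version shipped with the Titan distribution looks roughly like this (the class names are the real TinkerPop ones; spark.master and the locations vary per installation):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
# Spark settings picked up by SparkGraphComputer
spark.master=local[4]
spark.serializer=org.apache.spark.serializer.KryoSerializer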
5) Run
From the titan directory, run:
./bin/gremlin.sh hadoop-script-load-example.groovy
out_edges.txt, out_vertices.txt, vertices.groovy, and edges.groovy live under titan/test; hadoop-script-load-example.groovy sits in the titan directory itself.
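After the two Spark jobs complete, a quick sanity check in the Gremlin console (a sketch, assuming the same Cassandra backend as above) is to count what was written; the five-line vertex file should yield 5 user vertices, while the friend-edge total depends on how the loader deduplicates adjacency entries that appear on both endpoints' lines:

graph = TitanFactory.open("conf/titan-cassandra.properties")
g = graph.traversal()
println g.V().hasLabel("user").count().next()    // expect 5
println g.E().hasLabel("friend").count().next()  // edge total, see note above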
Reference: https://groups.google.com/forum/#!topic/aureliusgraphs/fLPl7OlcXt0