Question:

Unable to load a large dataset with GraphDB

胡天佑

2023-03-14

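When I load this DBpedia dump (2015-10, en, roughly 1 billion triples) into GraphDB 9.1.1, the CPU load drops to 0% after about 13 million triples and the process sits idle. It does not terminate until I kill it manually.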

The machine has enough disk space and more than enough RAM relative to the 512 GB assigned to Java through the -Xmx command-line option.

The file I am trying to load is available here: https://hobbitdata.informatik.uni-leipzig.de/dbpedia_2015-10_en_wo-comments_c.nt.zst

It can be decompressed with:

zstd -d "dbpedia_2015-10_en_wo-comments_c.nt.zst" -o "dbpedia_2015-10_en_wo-comments_c.nt"

I load the data with the following command:

java -Xmx512G -cp "$HOME/graphdb/graphdb-free-9.1.1/lib/*" -Dgraphdb.dist=$HOME/graphdb/graphdb-free-9.1.1 -Dgraphdb.home.data=$HOME/dbpedia2015/data/ -Djdk.xml.entityExpansionLimit=0 com.ontotext.graphdb.loadrdf.LoadRDF -f -m parallel -p -c $HOME/graphdb/graphdb-dbpedia2015.ttl $HOME/dbpedia_2015-10_en_wo-comments_c.nt

$HOME/graphdb/graphdb-dbpedia2015.ttl looks like this:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "dbpedia2015" ;
    rdfs:label "Repository for dataset dbpedia2015" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

                        # ruleset to use
                        owlim:ruleset "rdfsplus-optimized" ;

                        # disable the context index (because my data does not use contexts)
                        owlim:enable-context-index "false" ;

                        # indexes to speed up the read queries
                        owlim:enablePredicateList "true" ;
                        owlim:enable-literal-index "true" ;
                        owlim:in-memory-literal-properties "true" ;
        ]
    ].

The log output is:

16:11:07.438 [main] INFO  com.ontotext.graphdb.loadrdf.Params - MODE: parallel
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - STOP ON FIRST ERROR: false
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - PARTIAL LOAD: true
16:11:07.439 [main] INFO  com.ontotext.graphdb.loadrdf.Params - CONFIG FILE: /home/me/graphdb-dbpedia2015.ttl
16:11:07.444 [main] INFO  com.ontotext.graphdb.loadrdf.LoadRDF - Attaching to location: /home/me/graphdb/dbpedia2015/data
16:11:07.618 [main] INFO  c.o.t.u.l.LimitedObjectCacheFactory - Using LRU cache type: synch
16:11:08.025 [main] WARN  com.ontotext.plugin.literals-index - Rebuilding literals indexes. Starting from id:1
16:11:08.029 [main] WARN  com.ontotext.plugin.literals-index - Complete in 0.004, num entries indexed:0
16:11:08.780 [main] INFO  c.o.rio.parallel.ParallelLoader - Data will be parsed + resolved + loaded.
16:11:08.788 [main] INFO  c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:09.984 [main] INFO  com.ontotext.graphdb.loadrdf.LoadRDF - Loading file: dbpedia_2015-10_en_wo-comments_c.nt
16:11:09.991 [main] INFO  c.o.rio.parallel.ParallelLoader - Using 128 threads for inference
16:11:19.987 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 2,111,690 stmts. Rate: 211,147 st/s. Statements overall: 2,111,690. Global average rate: 211,000 st/s. Now: Tue Mar 10 16:11:19 UTC 2020. Total memory: 22144M, Free memory: 4890M, Max memory: 524288M.
16:11:30.515 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 3,955,363 stmts. Rate: 192,662 st/s. Statements overall: 3,955,363. Global average rate: 192,596 st/s. Now: Tue Mar 10 16:11:30 UTC 2020. Total memory: 66432M, Free memory: 53925M, Max memory: 524288M.
16:11:40.515 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 6,889,662 stmts. Rate: 225,661 st/s. Statements overall: 6,889,662. Global average rate: 225,609 st/s. Now: Tue Mar 10 16:11:40 UTC 2020. Total memory: 199296M, Free memory: 177241M, Max memory: 524288M.
16:11:51.185 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 9,124,978 stmts. Rate: 221,474 st/s. Statements overall: 9,124,978. Global average rate: 221,437 st/s. Now: Tue Mar 10 16:11:51 UTC 2020. Total memory: 199296M, Free memory: 185106M, Max memory: 524288M.
16:12:02.877 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 11,083,153 stmts. Rate: 209,539 st/s. Statements overall: 11,083,153. Global average rate: 209,511 st/s. Now: Tue Mar 10 16:12:02 UTC 2020. Total memory: 199296M, Free memory: 184331M, Max memory: 524288M.
16:12:15.800 [main] INFO  c.o.rio.parallel.ParallelRDFInserter - Parsed 13,166,352 stmts. Rate: 200,047 st/s. Statements overall: 13,166,352. Global average rate: 200,026 st/s. Now: Tue Mar 10 16:12:15 UTC 2020. Total memory: 329312M, Free memory: 313496M, Max memory: 524288M.

Any idea why it gets stuck at around 13M triples?

1 answer

徐嘉勋

2023-03-14

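First, allocate a smaller Xmx to the process (around 38-42 GB should be enough). The database needs additional memory for off-heap storage, so make sure you do not give all of the machine's memory to the heap. If the dataset still fails to load, please send a jstack of the process or, if you are using Oracle JDK, you can use Java Flight Recorder: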

jcmd <pid> VM.unlock_commercial_features
jcmd <pid> JFR.start duration=60s name=production filename=production.jfr settings=profile

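Set the duration to a value that is long enough to trace the execution. You can send the results to support@ontotext.com, since they will contain information about your environment.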
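
For the jstack route, a single dump can be hard to interpret, so a common approach is to take a few dumps some seconds apart and compare which threads stay blocked. The following is only a sketch: the helper name dump_threads, its arguments and the output file names are made up for illustration, and it assumes the JDK's jstack is on the PATH and is run as the same user as the loader.

```shell
# dump_threads PID [COUNT] [INTERVAL_SECONDS] [OUT_DIR]
# Hypothetical helper: writes COUNT thread dumps of the JVM with the given PID,
# waiting INTERVAL_SECONDS between them, to OUT_DIR/jstack-PID-N.txt.
dump_threads() {
    dt_pid="$1"
    dt_count="${2:-3}"
    dt_interval="${3:-10}"
    dt_out="${4:-.}"
    dt_i=1
    while [ "$dt_i" -le "$dt_count" ]; do
        # -l also prints information about held locks, which helps spot deadlocks
        jstack -l "$dt_pid" > "$dt_out/jstack-$dt_pid-$dt_i.txt"
        dt_i=$((dt_i + 1))
        # Pause only between dumps, not after the last one
        if [ "$dt_i" -le "$dt_count" ]; then
            sleep "$dt_interval"
        fi
    done
}

# Example call (assumes exactly one running LoadRDF process):
#   dump_threads "$(pgrep -f loadrdf.LoadRDF)" 3 10 /tmp
```

The resulting text files can then be attached to the support request along with, or instead of, the flight recording.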

Another option is to use the Preload tool, which is intended for loading large datasets: http://graphdb.ontotext.com/documentation/enterprise/loading-data-using-preload.html
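
As a rough sketch of what a Preload run might look like with the paths from the question: the snippet below only prints the command so it can be reviewed before running. The bin/preload script location and the -f (overwrite an existing repository) and -c (repository config file) options reflect my reading of the linked documentation and should be checked against it for your GraphDB version. The helper name print_preload_cmd is made up for illustration. Note that this sketch does not carry over the -Dgraphdb.home.data setting from the original command, and that, as far as I know, Preload does not run inference, which matters if you rely on the rdfsplus-optimized ruleset.

```shell
# Assumed locations, reusing the paths from the question; override as needed.
GRAPHDB_DIST="${GRAPHDB_DIST:-$HOME/graphdb/graphdb-free-9.1.1}"
REPO_CONFIG="${REPO_CONFIG:-$HOME/graphdb/graphdb-dbpedia2015.ttl}"
DATA_FILE="${DATA_FILE:-$HOME/dbpedia_2015-10_en_wo-comments_c.nt}"

# Hypothetical helper: prints the Preload invocation instead of executing it.
# The -f and -c flags are assumptions to verify against the documentation.
print_preload_cmd() {
    printf '%s -f -c %s %s\n' \
        "$GRAPHDB_DIST/bin/preload" "$REPO_CONFIG" "$DATA_FILE"
}

print_preload_cmd
```

Once the printed command looks right for your installation, it can be pasted into the shell and run directly.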
