Chapter 21. Configuration and Tuning

小牛编辑
2023-12-01

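Neo4j has a number of parameters that can be tuned to get the best performance. The two main components that can be configured are the Neo4j caches and the JVM that Neo4j runs in. The following sections describe how to tune them.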

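21.1. Introduction

21.1.1. How to add configuration settings

To get good performance, these are the first things to look at:

- Make sure the JVM is not spending too much time on garbage collection. Monitoring the heap usage of an application that uses Neo4j can be a bit confusing, since Neo4j grows its caches when memory is plentiful and shrinks them when it is not. The goal is a heap large enough that heavy load does not cause GC thrashing (when it does, performance can drop by as much as two orders of magnitude).

- Start the JVM with the -server flag and a suitable heap size (see Section 21.6, "JVM Settings"). A heap that is too large also hurts performance, so you may have to try several heap sizes.

- Use a parallel/concurrent garbage collector (we have found that -XX:+UseConcMarkSweepGC works well in most use cases).

21.1.1. How to add configuration settings

When creating an embedded Neo4j instance, you can pass in parameters as a Map of key-value pairs.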

Map<String, String> config = new HashMap<String, String>();
config.put( "neostore.nodestore.db.mapped_memory", "10M" );
config.put( "string_block_size", "60" );
config.put( "array_block_size", "300" );
GraphDatabaseService db = new ImpermanentGraphDatabase( config );

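If no configuration is provided, the database kernel will try to work out suitable settings from the JVM configuration and by probing the operating system.

The JVM is configured by passing command-line flags when it is started. The most important parameters for Neo4j are those that control memory and garbage collection, but some of the parameters for just-in-time compilation are also of interest.

For example, to start the main class of your application on a 64-bit server JVM with 1 GB of heap: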


java -d64 -server -Xmx1024m -cp /path/to/neo4j-kernel.jar:/path/to/jta.jar:/path/to/your-application.jar com.example.yourapp.MainClass

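The example above also shows one of the most basic command-line parameters: the classpath. The classpath is where the JVM searches for your classes, and it is usually a list of jar files. It is specified with the -cp (or -classpath) flag followed by the classpath value. For a Neo4j application it should contain at least the Neo4j neo4j-kernel.jar, the Java Transaction API (jta.jar), and the paths from which your own application's classes are loaded.

Tip

On Linux, Unix and Mac OS X the elements of the path list are separated by a colon (:); on Windows they are separated by a semicolon (;).

When using the Neo4j REST server, see server-configuration for how to add configuration settings for the database to the server.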

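21.2. Performance Guide

21.2.1. Try this first

21.2.2. Neo4j primitives' lifecycle

21.2.3. Configuring Neo4j

This is the Neo4j performance guide. It describes how to use Neo4j to achieve the best performance.

21.2.1. Try this first

The first thing to do is to make sure the JVM is running well and not spending large amounts of time on garbage collection. Monitoring the heap usage of an application that uses Neo4j can be a bit confusing, since Neo4j grows its caches when memory is plentiful and shrinks them when it is not. The goal is a heap large enough that heavy load does not cause GC thrashing (when it does, performance can drop by as much as two orders of magnitude).

Start the JVM with the -server flag and -Xmx<good sized heap> (for example -Xmx512M for 512 MB of memory or -Xmx3G for 3 GB). A heap that is too large also hurts performance, so you may have to try several heap sizes. Use a parallel/concurrent garbage collector (we have found that -XX:+UseConcMarkSweepGC works well in most use cases).

Finally, make sure the operating system has some memory left to manage proper file system caches. If your server has 8 GB of RAM, do not give all of it to the heap (unless you have turned off memory-mapped buffers); leave a good part of it to the OS. For more information, see Chapter 21, Configuration and Tuning.

For Linux-specific tuning, see Section 21.10, "Linux Performance Guide".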

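21.2.2. Neo4j primitives' lifecycle

Neo4j manages its primitives (nodes, relationships and properties) differently depending on how you use Neo4j. For example, if you never read a property from a certain node or relationship, that node or relationship will not have its properties loaded into memory. The first time any property is accessed after the node or relationship has been loaded, all of its properties are loaded. If one of those properties contains an array larger than a few elements or a long string, that value is loaded on demand when it is requested individually. Similarly, the relationships of a node are only loaded the first time they are requested for that node.

Nodes and relationships are cached using LRU caches. If you (for some strange reason) only work with nodes, the relationship cache will become smaller and smaller while the node cache is allowed to grow as needed. Applications that use many relationships and few nodes will see the relationship cache grow while the node cache shrinks.

The Neo4j API specification does not say anything about the order of relationships, so invoking

Node.getRelationships()

may return the relationships in a different order than the previous invocation. This allows us to make even heavier optimizations, returning the relationships that are most commonly traversed.

All in all, Neo4j has been designed to adapt to how it is used. The (unachievable) overall goal is to be able to handle any incoming operation without having to go down and work with the file/disk I/O layer.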

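21.2.3. Configuring Neo4j

Chapter 21, Configuration and Tuning contains a lot of information about configuring Neo4j and the JVM. Many of these settings have an impact on performance.

Disks, RAM and other tips

As always with any persistence solution, performance depends a lot on the persistence media used. Better disks give better performance.

If you have multiple disks or other persistence media available, it can be a good idea to split the store files and the transaction logs across them. Keeping the store files on disks with low seek time can do wonders for non-cached read operations. Today a typical mechanical drive has an average seek time of about 5 ms, which can make a query or traversal very slow when available RAM is too small or the cache and memory-mapped settings are badly configured. A newer, better SATA SSD has an average seek time of under 100 microseconds, which is at least 50 times faster.

To avoid hitting disk you need more RAM. On a standard mechanical drive you can handle graphs with a few tens of millions of primitives with 1-2 GB of RAM. 4-8 GB of RAM can handle graphs with hundreds of millions of primitives, while you need about 16-32 GB to handle billions of primitives. However, if you invest in a good SSD you will be able to handle much larger graphs on less RAM.

Neo4j likes Java 1.6 JVMs running in server mode, so consider upgrading to that if you have not already, or at least use the -server flag. Use tools like vmstat to gather information while your application is running. If you see high I/O waits and not many blocks going in or out to disk while running write/read transactions, it is a sign that you need to tweak your Java heap, Neo4j cache and memory-mapped settings (and possibly get more RAM or better disks).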

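Write performance

If you experience poor write performance after writing some data (initially fast, then a massive slowdown), it may be the operating system writing out dirty pages from the memory-mapped regions of the store files. These regions do not need to be written out to maintain consistency, so to achieve the highest possible write speed that type of behavior should be avoided.

Another source of performance degradation during writes is transaction size. Many small transactions result in a lot of I/O writes to disk and should be avoided. Transactions that are too big can result in OutOfMemory errors, since the uncommitted transaction data is held on the Java heap in memory. For details about transaction management in Neo4j, see Chapter 12, Transaction Management.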
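The trade-off between many small transactions and a few very large ones can be sketched as a simple batching counter. This is only an illustration, not Neo4j API: `BatchCommitter` and its `commit` callback are hypothetical names, and in a real application the callback would finish the current transaction and begin a new one.

```java
// Hypothetical helper (not part of Neo4j): groups write operations into
// fixed-size batches so that each commit covers many operations.
class BatchCommitter {
    private final int batchSize;
    private final Runnable commit; // assumed to finish the current transaction
    private int pending = 0;
    private int commits = 0;

    BatchCommitter(int batchSize, Runnable commit) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.commit = commit;
    }

    // Call once after each write operation.
    void operationDone() {
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    // Commit whatever is pending; call once more at the end of the import.
    void flush() {
        if (pending == 0) {
            return;
        }
        commit.run();
        commits++;
        pending = 0;
    }

    int commitCount() {
        return commits;
    }
}
```

A suitable batch size depends on the heap available for uncommitted data, so it is probably best found by measurement rather than fixed in advance.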

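The Neo4j kernel uses a number of store files and a logical log file to store the graph on disk. The store files contain the actual graph data and the log contains the mutating operations. All writes to the logical log are append-only, and when a transaction is committed the logical log is forced (fdatasync) to disk. The store files, however, are not flushed to disk on commit, and their writes are not append-only either: they are written in a more or less random pattern (depending on the graph layout) and are not forced to disk until the log is rotated or the Neo4j kernel is shut down. It can be a good idea to increase the logical log target size for rotation, or to turn off log rotation if you experience write problems that coincide with log rotation. The following example shows how to change the log rotation settings at runtime: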

GraphDatabaseService graphDb; // ...

// Get the XaDataSource for the native store
TxModule txModule = ((EmbeddedGraphDatabase) graphDb).getConfig().getTxModule();
XaDataSourceManager xaDsMgr = txModule.getXaDataSourceManager();
XaDataSource xaDs = xaDsMgr.getXaDataSource( "nioneodb" );

// Turn off log rotation
xaDs.setAutoRotate( false );

// Or increase the log rotation target size to 100 MB (default: 10 MB)
xaDs.setLogicalLogTargetSize( 100 * 1024 * 1024L );

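Since random writes to the memory-mapped regions of the store files may happen, it is very important that the data does not get written out to disk unless needed. Some operating systems have very aggressive settings for when to write out dirty pages to disk. If the OS decides to start writing out dirty pages of these memory-mapped regions, the writes to disk stop being sequential and become random. That hurts performance a lot, so to get the highest write performance with Neo4j, make sure the OS is configured not to write out dirty pages caused by writes to the memory-mapped regions of the store files. As an example, if the machine has 8 GB of RAM and the store files total 4 GB (and can be fully memory-mapped), the OS has to be configured to accept at least 50% dirty pages in virtual memory to make sure we do not get random disk writes.

Note: for more information on these settings, see Section 21.10, "Linux Performance Guide".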
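The 8 GB / 4 GB example is plain arithmetic and can be written down as a small helper. The class and method names below are illustrative only, and the result is a starting point for the OS dirty-page setting rather than a guaranteed safe value.

```java
// Back-of-the-envelope helper; names are illustrative, not part of Neo4j.
class DirtyPageEstimate {
    // Smallest percentage of RAM the OS must tolerate as dirty pages so that
    // fully memory-mapped store files are not flushed early (rounded up).
    static int minDirtyPercent(long storeBytes, long ramBytes) {
        if (ramBytes <= 0 || storeBytes < 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long percent = (storeBytes * 100 + ramBytes - 1) / ramBytes;
        return (int) Math.min(100, percent);
    }
}
```

For a 4 GB store on an 8 GB machine, `minDirtyPercent(4L << 30, 8L << 30)` gives 50, matching the figure in the text.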

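Second-level caching

While normally building applications and "always assuming the graph is in memory", it is sometimes necessary to optimize certain performance-critical sections. Neo4j adds a small overhead compared to in-memory data structures, even when the node, relationship or property in question is in the cache. If this becomes an issue, use a profiler to find these hot spots and then add your own second-level caching. We believe second-level caching should be avoided to the greatest extent possible, since it forces you to take care of invalidation, which can sometimes be hard. But when everything else fails you have to use it, so here is an example of how it can be done.

We have some POJO that wraps a node and holds its state. In this particular POJO we have overridden the equals implementation.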

public boolean equals( Object obj )
{
    return underlyingNode.getProperty( "some_property" ).equals( obj );
}

public int hashCode()
{
    return underlyingNode.getProperty( "some_property" ).hashCode();
}

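This works fine in most scenarios, but in this particular scenario many instances of that POJO are used in nested calls to adding/removing/getting/finding on collection classes. Profiling the application shows that the equals implementation is called many times and can be considered a hot spot. Adding second-level caching for the equals override will increase performance in this particular scenario.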

private Object cachedProperty = null;

public boolean equals( Object obj )
{
    if ( cachedProperty == null )
    {
        cachedProperty = underlyingNode.getProperty( "some_property" );
    }
    return cachedProperty.equals( obj );
}

public int hashCode()
{
    if ( cachedProperty == null )
    {
        cachedProperty = underlyingNode.getProperty( "some_property" );
    }
    return cachedProperty.hashCode();
}

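The problem now is that we need to invalidate the cached property whenever some_property changes (this may not be a problem in this scenario, since the state picked for equals and hash code computation often will not change).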
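One way to handle the invalidation is to route every write to the property through the POJO itself and clear the cached value there. The sketch below assumes this design; `FakeNode` is a hypothetical stand-in for a real node (a plain map with a read counter), and unlike the listing above, `equals` here compares the cached properties of two POJOs.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a node (assumption for this sketch; a real application would
// wrap org.neo4j.graphdb.Node instead).
class FakeNode {
    private final Map<String, Object> props = new HashMap<String, Object>();
    int reads = 0; // counts property look-ups so the caching effect is visible

    Object getProperty(String key) {
        reads++;
        return props.get(key);
    }

    void setProperty(String key, Object value) {
        props.put(key, value);
    }
}

class CachedPojo {
    private final FakeNode underlyingNode;
    private Object cachedProperty = null;

    CachedPojo(FakeNode node) {
        this.underlyingNode = node;
    }

    private Object someProperty() {
        if (cachedProperty == null) {
            cachedProperty = underlyingNode.getProperty("some_property");
        }
        return cachedProperty;
    }

    // All writes go through the POJO so the cached value can be dropped.
    void setSomeProperty(Object value) {
        underlyingNode.setProperty("some_property", value);
        cachedProperty = null; // invalidate the second-level cache
    }

    @Override
    public int hashCode() {
        return someProperty().hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof CachedPojo
                && someProperty().equals(((CachedPojo) obj).someProperty());
    }
}
```

This only stays correct if nothing else modifies some_property behind the POJO's back, which is exactly the kind of care the text warns about.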

Tip

To sum up, avoid second-level caching if possible and only add it when you really need it.

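21.3. Kernel configuration

These are the configuration options you can pass to the Neo4j kernel. If you use the embedded database you can pass them in as a map, and in Neo4j server they are configured in the neo4j.properties file.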

Table 21.1. Allow store upgrade

allow_store_upgrade
Whether to allow a store upgrade in case the current version of the database starts against an older store version. Setting this to true does not guarantee successful upgrade, just that it allows an attempt at it.
Default value: false

Table 21.2. Array block size

array_block_size
Specifies the block size for storing arrays. This parameter is only honored when the store is created, otherwise it is ignored. The default block size is 120 bytes, and the overhead of each block is the same as for string blocks, i.e., 8 bytes.
Default value: 120
Limits: min 1

Table 21.3. Backup slave

backup_slave
Mark this database as a backup slave.
Default value: false

Table 21.4. Cache type

cache_type
The type of cache to use for nodes and relationships.
Default value: soft
Values:
soft: Provides optimal utilization of the available memory. Suitable for high performance traversal. May run into GC issues under high load if the frequently accessed parts of the graph do not fit in the cache.
weak: Use weak reference cache.
strong: Use strong references.
none: Don't use caching.

Table 21.5. Cypher parser version

cypher_parser_version
Enable this to specify a parser other than the default one.
Values:
1.5: Cypher v1.5 syntax.
1.6: Cypher v1.6 syntax.
1.7: Cypher v1.7 syntax.

Table 21.6. Dump configuration

dump_configuration
Print out the effective Neo4j configuration after startup.
Default value: false

Table 21.7. Forced kernel id

forced_kernel_id
An identifier that uniquely identifies this graph database instance within this JVM. Defaults to an auto-generated number depending on how many instances are started in this JVM.
Default value: none (auto-generated)

Table 21.8. Gc monitor threshold

gc_monitor_threshold
The amount of time in ms the monitor thread has to be blocked before logging a message that it was blocked.
Default value: 200ms

Table 21.9. Gc monitor wait time

gc_monitor_wait_time
Amount of time in ms the GC monitor thread will wait before taking another measurement.
Default value: 100ms

Table 21.10. Gcr cache min log interval

gcr_cache_min_log_interval
The minimal time that must pass in between logging statistics from the cache (when using the 'gcr' cache).
Default value: 60s

Table 21.11. Grab file lock

grab_file_lock
Whether to grab locks on files or not.
Default value: true

Table 21.12. Intercept committing transactions

intercept_committing_transactions
Determines whether any TransactionInterceptors loaded will intercept prepared transactions before they reach the logical log.
Default value: false

Table 21.13. Intercept deserialized transactions

intercept_deserialized_transactions
Determines whether any TransactionInterceptors loaded will intercept externally received transactions (e.g. in HA) before they reach the logical log and are applied to the store.
Default value: false

Table 21.14. Keep logical logs

keep_logical_logs
Make Neo4j keep the logical transaction logs for being able to back up the database. Can also be used to specify the threshold after which logical logs are pruned. For example, "10 days" will prune logical logs that only contain transactions older than 10 days from the current time, and "100k txs" will keep the 100k latest transactions and prune any older transactions.
Default value: true

Table 21.15. Logging.threshold for rotation

logging.threshold_for_rotation
Threshold in bytes for when database logs (text logs, for debugging, that is) are rotated.
Default value: 104857600
Limits: min 1

Table 21.16. Logical log

logical_log
The base name for the logical log files, either an absolute path or relative to the store_dir setting. This should generally not be changed.
Default value: nioneo_logical.log

Table 21.17. Lucene searcher cache size

lucene_searcher_cache_size
Integer value that sets the maximum number of open Lucene index searchers.
Default value: 2147483647
Limits: min 1

Table 21.18. Neo store

neo_store
The base name for the Neo4j Store files, either an absolute path or relative to the store_dir setting. This should generally not be changed.
Default value: neostore

Table 21.19. Neostore.nodestore.db.mapped memory

neostore.nodestore.db.mapped_memory
The size to allocate for memory mapping the node store.
Default value: 20M

Table 21.20. Neostore.propertystore.db.arrays.mapped memory

neostore.propertystore.db.arrays.mapped_memory
The size to allocate for memory mapping the array property store.
Default value: 130M

Table 21.21. Neostore.propertystore.db.index.keys.mapped memory

neostore.propertystore.db.index.keys.mapped_memory
The size to allocate for memory mapping the store for property key strings.
Default value: 1M

Table 21.22. Neostore.propertystore.db.index.mapped memory

neostore.propertystore.db.index.mapped_memory
The size to allocate for memory mapping the store for property key indexes.
Default value: 1M

Table 21.23. Neostore.propertystore.db.mapped memory

neostore.propertystore.db.mapped_memory
The size to allocate for memory mapping the property value store.
Default value: 90M

Table 21.24. Neostore.propertystore.db.strings.mapped memory

neostore.propertystore.db.strings.mapped_memory
The size to allocate for memory mapping the string property store.
Default value: 130M

Table 21.25. Neostore.relationshipstore.db.mapped memory

neostore.relationshipstore.db.mapped_memory
The size to allocate for memory mapping the relationship store.
Default value: 100M

Table 21.26. Node auto indexing

node_auto_indexing
Controls the auto indexing feature for nodes. Setting to false shuts it down unconditionally, while true enables it for every property, subject to restrictions in the configuration.
Default value: false

Table 21.27. Node cache array fraction

node_cache_array_fraction
The fraction of the heap (1%-10%) to use for the base array in the node cache (when using the 'gcr' cache).
Default value: 1.0
Limits: min 1.0, max 10.0

Table 21.28. Node cache size

node_cache_size
The amount of memory to use for the node cache (when using the 'gcr' cache).

Table 21.29. Node keys indexable

node_keys_indexable
A list of property names (comma separated) that will be indexed by default. This applies to Nodes only.

Table 21.30. Read only database

read_only
Only allow read operations from this Neo4j instance.
Default value: false

Table 21.31. Rebuild idgenerators fast

rebuild_idgenerators_fast
Use a quick approach for rebuilding the ID generators. This gives quicker recovery time, but will limit the ability to reuse the space of deleted entities.
Default value: true

Table 21.32. Relationship auto indexing

relationship_auto_indexing
Controls the auto indexing feature for relationships. Setting to false shuts it down unconditionally, while true enables it for every property, subject to restrictions in the configuration.
Default value: false

Table 21.33. Relationship cache array fraction

relationship_cache_array_fraction
The fraction of the heap (1%-10%) to use for the base array in the relationship cache (when using the 'gcr' cache).
Default value: 1.0
Limits: min 1.0, max 10.0

Table 21.34. Relationship cache size

relationship_cache_size
The amount of memory to use for the relationship cache (when using the 'gcr' cache).

Table 21.35. Relationship keys indexable

relationship_keys_indexable
A list of property names (comma separated) that will be indexed by default. This applies to Relationships only.

Table 21.36. Remote logging enabled

remote_logging_enabled
Whether to enable logging to a remote server or not.
Default value: false

Table 21.37. Remote logging host

remote_logging_host
Host for remote logging using LogBack SocketAppender.
Default value: 127.0.0.1

Table 21.38. Remote logging port

remote_logging_port
Port for remote logging using LogBack SocketAppender.
Default value: 4560
Limits: min 1, max 65535

Table 21.39. Store dir

store_dir
The directory where the database files are located.

Table 21.40. String block size

string_block_size
Specifies the block size for storing strings. This parameter is only honored when the store is created, otherwise it is ignored. Note that each character in a string occupies two bytes, meaning that a block size of 120 (the default size) will hold a 60 character long string before overflowing into a second block. Also note that each block carries an overhead of 8 bytes. This means that if the block size is 120, the size of the stored records will be 128 bytes.
Default value: 120
Limits: min 1

Table 21.41. Tx manager impl

tx_manager_impl
The name of the Transaction Manager service to use as defined in the TM service provider constructor, defaults to native.

Table 21.42. Use memory mapped buffers

use_memory_mapped_buffers
Tell Neo4j to use memory mapped buffers for accessing the native storage layer.

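21.4. Caches in Neo4j

21.4.1. File buffer cache

21.4.2. Object cache

For how to provide custom configuration to Neo4j, see Section 21.1, "Introduction".

Neo4j uses two different types of caches: a file buffer cache and an object cache. The file buffer cache caches the storage file data in the same format as it is stored on the durable storage media. The object cache caches the nodes, relationships and properties in a format that is optimized for high traversal speeds and transactional writes.

21.4.1. File buffer cache

Quick info:

- The file buffer cache is sometimes called the low-level cache or file system cache.
- It caches the Neo4j data as stored on the durable media.
- It uses the operating system's memory-mapping features when possible.
- Neo4j configures the cache automatically as long as the heap size of the JVM is configured properly.

The file buffer cache caches the storage file data in the same format as it is stored on the durable storage media. The purpose of this cache layer is to improve both read and write performance. It improves write performance by writing to the cache and deferring durable writes until the logical log is rotated. This behavior is safe since all transactions are always durably written to the logical log, which can be used to recover the store files in the event of a crash.

Since the operation of the cache is tightly related to the data it stores, a short description of the Neo4j durable representation format is necessary background. Neo4j stores data in multiple files and relies on the underlying file system to handle this efficiently. Each Neo4j storage file contains uniform fixed-size records of a particular type: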

Store file     Record size   Contents
nodestore      9 B           Nodes
relstore       33 B          Relationships
propstore      41 B          Properties for nodes and relationships
stringstore    128 B         Values of string properties
arraystore     128 B         Values of array properties

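For strings and arrays, where data can be of variable length, the data is stored in one or more 120 B chunks, with 8 B of record overhead each. The sizes of these blocks can be configured when the store is created, using the string_block_size and array_block_size parameters. The size of each record type can also be used to calculate the storage requirements of a Neo4j graph or the appropriate cache size for each file buffer cache. Note that some strings and arrays can be stored without using the string store or the array store respectively; see Section 21.7, "Compressed storage of short strings" and Section 21.8, "Compressed storage of short arrays".

Neo4j uses multiple file buffer caches, one for each storage file. Each file buffer cache divides its storage file into a number of equally sized windows, and each cache window contains an even number of storage records. The cache holds the most active cache windows in memory and tracks the hit vs. miss ratio for the windows. When the hit ratio of an uncached window gets higher than the miss ratio of a cached window, the cached window is evicted and the previously uncached window is cached instead.

Important

Note that the block sizes can only be configured at store creation time.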
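The record sizes above lend themselves to a quick storage estimate. The sketch below uses the record sizes from the table, two bytes per character and the 8 B block overhead from the text; the class and method names are illustrative, and it ignores the inlining of short strings described in Section 21.7, so it may overestimate.

```java
// Rough store-size arithmetic; names are illustrative, not part of Neo4j.
class StoreSizeEstimate {
    static final int NODE_RECORD = 9;   // bytes per node record
    static final int REL_RECORD = 33;   // bytes per relationship record
    static final int PROP_RECORD = 41;  // bytes per property record
    static final int BLOCK_OVERHEAD = 8;

    // Number of blocks a string needs at two bytes per character.
    static long stringBlocks(int chars, int blockSize) {
        long bytes = 2L * chars;
        return Math.max(1, (bytes + blockSize - 1) / blockSize);
    }

    // Bytes used in the string store for one string value.
    static long stringStoreBytes(int chars, int blockSize) {
        return stringBlocks(chars, blockSize) * (blockSize + BLOCK_OVERHEAD);
    }

    // Combined size of the node, relationship and property stores.
    static long coreStoreBytes(long nodes, long rels, long props) {
        return nodes * NODE_RECORD + rels * REL_RECORD + props * PROP_RECORD;
    }
}
```

With the default block size of 120, a 60-character string fits in one 128 B record, while a 61-character string spills into a second block.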

Configuration

For every *.mapped_memory parameter below, the possible value is the maximum amount of memory to use for memory mapped buffers for this file buffer cache. The default unit is MiB; for other units use any of the following suffixes: B, k, M or G. For string_block_size and array_block_size, the possible value is the number of bytes per block.

Parameter: use_memory_mapped_buffers
Possible values: true or false
Effect: If set to true, Neo4j will use the operating system's memory mapping functionality for the file buffer cache windows. If set to false, Neo4j will use its own buffer implementation. In this case the buffers will reside in the JVM heap, which needs to be increased accordingly. The default value for this parameter is true, except on Windows.

Parameter: neostore.nodestore.db.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the node storage file.

Parameter: neostore.relationshipstore.db.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the relationship store file.

Parameter: neostore.propertystore.db.index.keys.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the store file for property key strings.

Parameter: neostore.propertystore.db.index.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the store file for property key indexes.

Parameter: neostore.propertystore.db.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the property storage file.

Parameter: neostore.propertystore.db.strings.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the string property storage file.

Parameter: neostore.propertystore.db.arrays.mapped_memory
Effect: The maximum amount of memory to use for the file buffer cache of the array property storage file.

Parameter: string_block_size
Effect: Specifies the block size for storing strings. This parameter is only honored when the store is created, otherwise it is ignored. Note that each character in a string occupies two bytes, meaning that a block size of 120 (the default size) will hold a 60 character long string before overflowing into a second block. Also note that each block carries an overhead of 8 bytes. This means that if the block size is 120, the size of the stored records will be 128 bytes.

Parameter: array_block_size
Effect: Specifies the block size for storing arrays. This parameter is only honored when the store is created, otherwise it is ignored. The default block size is 120 bytes, and the overhead of each block is the same as for string blocks, i.e., 8 bytes.

Parameter: dump_configuration
Possible values: true or false
Effect: If set to true, the current configuration settings will be written to the default system output, mostly the console or the log files.

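When memory-mapped buffers are used (use_memory_mapped_buffers = true), the heap size of the JVM must be smaller than the total available memory of the computer, minus the total amount of memory used for the buffers. When heap buffers are used (use_memory_mapped_buffers = false), the heap size of the JVM must be large enough to contain all the buffers, plus the runtime heap memory requirements of the application and the object cache.

When reading the configuration parameters on startup, Neo4j automatically configures the parameters that are not specified. The cache sizes are configured based on the available memory on the computer, how much is used by the JVM heap, and how large the storage files are.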

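21.4.2. Object cache

Quick info:

- The object cache is sometimes called the high-level cache.
- It caches the Neo4j data in a form optimized for fast traversal.

The object cache caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph. There are two different categories of object caches in Neo4j.

The first is the reference caches. Here Neo4j uses as much of the allocated JVM heap as it can to hold nodes and relationships, relying on garbage collection to evict entries from the cache in an LRU manner. Note however that Neo4j is "competing" for the heap space with other objects in the same JVM, such as your application if deployed in embedded mode, and Neo4j will let the application "win" by using less memory if the application needs more.

Note

The GC resistant cache described below is only available in the Neo4j Enterprise Edition.

The other is the GC resistant cache, which is assigned a fixed amount of space in the JVM heap and purges objects whenever it grows bigger than that. It is assigned a maximum amount of memory which the sum of all cached objects in it will not exceed. Objects are evicted from the cache when the maximum size is about to be reached, instead of relying on garbage collection (GC) to make that decision. With the GC resistant cache, the competition with other objects on the heap, as well as GC pauses, can be better controlled, since the cache gets a fixed maximum amount of heap space. Compared with the reference caches, the GC resistant cache has a smaller overhead and faster insert/lookup times.

Tip

The use of heap memory is subject to the Java garbage collector: depending on the cache type, some tuning might be needed to play well with the GC at large heap sizes. Assigning a large heap for Neo4j's sake is therefore not always the best strategy, as it may lead to long GC pauses. Instead, leave some space for Neo4j's file system caches. These are outside of the heap and under the kernel's direct control, and thus more efficiently managed.

The content of this cache is objects with a representation geared towards supporting the Neo4j object API and graph traversals. Reading from this cache is 5 to 10 times faster than reading from the file buffer cache. This cache is contained in the heap of the JVM and its size is adapted to the current amount of available heap memory.

Nodes and relationships are added to the object cache as soon as they are accessed. The cached objects are, however, populated lazily. The properties of a node or relationship are not loaded until properties are accessed for that node or relationship. String (and array) properties are not loaded until that particular property is accessed. The relationships of a particular node are also not loaded until the relationships are accessed for that node.

Configuration

The main configuration parameter for the object cache is cache_type. It specifies which cache implementation to use for the object cache. Note that there will exist two cache instances, one for nodes and one for relationships. The available cache types are: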

cache_type: none
Description: Do not use a high level cache. No objects will be cached.

cache_type: soft
Description: Provides optimal utilization of the available memory. Suitable for high performance traversal. May run into GC issues under high load if the frequently accessed parts of the graph do not fit in the cache. This is the default cache implementation.

cache_type: weak
Description: Provides short life span for cached objects. Suitable for high throughput applications where a larger portion of the graph than what can fit into memory is frequently accessed.

cache_type: strong
Description: This cache will hold on to all data that gets loaded and never release it again. Provides good performance if your graph is small enough to fit in memory.

cache_type: gcr
Description: Provides means of assigning a specific amount of memory to dedicate to caching loaded nodes and relationships. Small footprint and fast insert/lookup. Should be the best option for most scenarios. See below on how to configure it. Note that this option is only available in the Neo4j Enterprise Edition.
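The difference between the reference-based cache types comes down to which kind of Java reference holds the cached object. The sketch below shows the idea for soft references using only the standard library; it is not Neo4j's cache implementation, and `SoftCache` is a made-up name.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Minimal soft-reference cache: the garbage collector may clear entries
// when memory runs low, so callers must be prepared to reload on a miss.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<K, SoftReference<V>>();

    void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    // Returns null if the entry was never cached or the GC has cleared it.
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // drop the stale entry
        }
        return value;
    }
}
```

Replacing `SoftReference` with `java.lang.ref.WeakReference` would give the shorter object lifetime described for the weak type, and storing the value directly would correspond to the strong type.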

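GC resistant cache configuration

Since the GC resistant cache operates with a maximum size in the JVM, it may be configured per use case for optimal performance. There are two aspects of the cache size.

One is the size of the array referencing the objects that are put in the cache. It is specified as a fraction of the heap; for example, specifying 5 lets that array itself take up 5% of the entire heap. Increasing this figure (up to a maximum of 10) reduces the chance of hash collisions at the expense of more heap used for it. More collisions mean more redundant loading of objects from the low-level cache.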

Configuration option: node_cache_array_fraction
Description (what it controls): Fraction of the heap to dedicate to the array holding the nodes in the cache (max 10).
Example value: 7

Configuration option: relationship_cache_array_fraction
Description (what it controls): Fraction of the heap to dedicate to the array holding the relationships in the cache (max 10).
Example value: 5

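The other aspect is the maximum size of all the objects in the cache. It is specified in bytes, for example 500M or 2G. Right before the maximum size is reached a purge is performed, in which random objects are evicted from the cache until the cache size drops below 90% of the maximum size. The optimal maximum size depends on the size of your graph. It should leave enough room for other objects to coexist in the same JVM, while being large enough to keep loading from the low-level cache to a minimum. The predicted load on the JVM and the layout of the domain-level objects should also be taken into consideration.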

Configuration option: node_cache_size
Description (what it controls): Maximum size of the heap memory to dedicate to the cached nodes.
Example value: 2G

Configuration option: relationship_cache_size
Description (what it controls): Maximum size of the heap memory to dedicate to the cached relationships.
Example value: 800M
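The purge behaviour of a size-bounded cache can be illustrated with a small model. This is a sketch under simplifying assumptions, not the Enterprise Edition implementation: it tracks only object sizes, triggers the purge once the maximum is exceeded rather than just before, and `BoundedCache` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy model of a cache with a fixed byte budget and random eviction.
class BoundedCache {
    private final long maxBytes;
    private final Map<Long, Integer> sizes = new HashMap<Long, Integer>(); // id -> size in bytes
    private final Random random;
    private long usedBytes = 0;

    BoundedCache(long maxBytes, long seed) {
        this.maxBytes = maxBytes;
        this.random = new Random(seed);
    }

    void put(long id, int sizeBytes) {
        Integer old = sizes.put(id, sizeBytes);
        usedBytes += sizeBytes - (old == null ? 0 : old);
        if (usedBytes > maxBytes) {
            purge();
        }
    }

    // Evict randomly chosen entries until usage is below 90% of the maximum.
    private void purge() {
        List<Long> ids = new ArrayList<Long>(sizes.keySet());
        while (usedBytes >= maxBytes * 9 / 10 && !ids.isEmpty()) {
            Long victim = ids.remove(random.nextInt(ids.size()));
            usedBytes -= sizes.remove(victim);
        }
    }

    long usedBytes() {
        return usedBytes;
    }

    int entryCount() {
        return sizes.size();
    }
}
```

Because eviction does not wait for the garbage collector, the cache's heap footprint stays within the configured budget regardless of GC timing.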

You can read about references and relevant JVM settings for Sun HotSpot here:

- Understanding soft/weak references

- How Hotspot Decides to Clear SoftReferences

- HotSpot FAQ

Heap memory usage

The following table can be used to calculate how much memory the data in the object cache will occupy on a 64-bit JVM:

Object          Size    Comment
Node            344 B   Size for each node (not counting its relationships or properties).
                48 B    Object overhead.
                136 B   Property storage (ArrayMap 48B, HashMap 88B).
                136 B   Relationship storage (ArrayMap 48B, HashMap 88B).
                24 B    Location of first / next set of relationships.
Relationship    208 B   Size for each relationship (not counting its properties).
                48 B    Object overhead.
                136 B   Property storage (ArrayMap 48B, HashMap 88B).
Property        116 B   Size for each property of a node or relationship.
                32 B    Data element: allows for transactional modification and keeps track of on-disk location.
                48 B    Entry in the hash table where it is stored.
                12 B    Space used in hash table, accounts for normal fill ratio.
                24 B    Property key index.
Relationships   108 B   Size for each relationship type for a node that has a relationship of that type.
                48 B    Collection of the relationships of this type.
                48 B    Entry in the hash table where it is stored.
                12 B    Space used in hash table, accounts for normal fill ratio.
Relationships   8 B     Space used by each relationship related to a particular node (both incoming and outgoing).
Primitive       24 B    Size of a primitive property value.
String          64+ B   Size of a string property value. 64 + 2*len(string) B (64 bytes, plus two bytes for each character in the string).
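As a worked example of using these figures, the sketch below adds them up for a graph in which every property holds a primitive value. The class and parameter names are illustrative, and counting the 8 B per-node relationship entry twice per relationship (once for each end node) is an assumption of this sketch.

```java
// Rough object-cache heap estimate from the figures in the table above.
class ObjectCacheEstimate {
    static final long NODE = 344;
    static final long RELATIONSHIP = 208;
    static final long PROPERTY = 116;
    static final long REL_TYPE_PER_NODE = 108;
    static final long REL_PER_NODE = 8;
    static final long PRIMITIVE_VALUE = 24;

    // Heap used by one string property value.
    static long stringValue(int chars) {
        return 64 + 2L * chars;
    }

    // Heap needed if every node, relationship and property were cached.
    // typesPerNode is the average number of distinct relationship types per node.
    static long estimate(long nodes, long rels, long primitiveProps, int typesPerNode) {
        return nodes * (NODE + typesPerNode * REL_TYPE_PER_NODE)
                + rels * (RELATIONSHIP + 2 * REL_PER_NODE) // assumed: one entry per end node
                + primitiveProps * (PROPERTY + PRIMITIVE_VALUE);
    }
}
```

Comparing the result with the configured heap gives a first indication of how much of the graph the object cache could hold.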

21.5. Logical logs

Logical logs in Neo4j are the journal of which operations happen, and are the source of truth in scenarios where the database needs to be recovered after a crash or similar. Logs are rotated every now and then (by default when they surpass 25 MB in size), and the amount of legacy logs to keep can be configured. The purposes of keeping a history of logical logs include being able to serve incremental backups as well as keeping an HA cluster running.

For any given configuration at least the latest non-empty logical log will be kept, but configuration can be supplied to control how much more to keep. There are several different means of controlling it, and the format in which configuration is supplied is:

keep_logical_logs=<true/false>
keep_logical_logs=<amount> <type>

For example:

# Will keep logical logs indefinitely
keep_logical_logs=true

# Will keep only the most recent non-empty log
keep_logical_logs=false

# Will keep logical logs which contain any transaction committed within 30 days
keep_logical_logs=30 days

# Will keep logical logs which contain any of the most recent 500 000 transactions
keep_logical_logs=500k txs
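To make the `<amount> <type>` format concrete, here is a small parser sketch. It is not Neo4j's own parser: `LogRetention` is a hypothetical class, and treating the k, M and G suffixes as decimal multipliers is an assumption based on the "500k txs" example.

```java
// Hypothetical parser for values such as "500k txs" or "10 files".
class LogRetention {
    final long amount;
    final String type;

    LogRetention(long amount, String type) {
        this.amount = amount;
        this.type = type;
    }

    // Parses "<amount> <type>", where amount may end in k, M or G.
    static LogRetention parse(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<amount> <type>': " + value);
        }
        String number = parts[0];
        long factor = 1;
        char last = number.charAt(number.length() - 1);
        if (last == 'k') {
            factor = 1000L;
        } else if (last == 'M') {
            factor = 1000L * 1000;
        } else if (last == 'G') {
            factor = 1000L * 1000 * 1000;
        }
        if (factor != 1) {
            number = number.substring(0, number.length() - 1);
        }
        return new LogRetention(Long.parseLong(number) * factor, parts[1]);
    }
}
```

In an embedded setup the same value would simply be passed as a string in the configuration map under the keep_logical_logs key.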

Full list:

Type    Description                                                                            Example
files   Number of most recent logical log files to keep                                        "10 files"
size    Max disk size to allow log files to occupy                                             "300M size" or "1G size"
txs     Number of latest transactions to keep                                                  "250k txs" or "5M txs"
hours   Keep logs which contain any transaction committed within N hours from current time     "10 hours"
days    Keep logs which contain any transaction committed within N days from current time      "50 days"

21.6. JVM Settings

21.6.1. Configuring heap size and GC

There are two main memory parameters for the JVM: one controls the heap space and the other controls the stack space. The heap space parameter is the most important one for Neo4j, since it governs how many objects you can allocate. The stack space parameter governs how deep the call stack of your application is allowed to get.

When it comes to heap space the general rule is: the larger heap space you have the better, but make sure the heap fits in the RAM of the computer. If the heap is paged out to disk, performance will degrade rapidly. Having a heap that is much larger than what your application needs is not good either, since it means the JVM will accumulate a lot of dead objects before the garbage collector runs, which leads to long garbage collection pauses and undesired performance behavior.

Having a larger heap space means that Neo4j can handle larger transactions and more concurrent transactions. A large heap space will also make Neo4j run faster, since Neo4j can fit a larger portion of the graph in its caches, meaning that the nodes and relationships your application uses frequently are always available quickly. The default heap size for a 32-bit JVM is 64MB (and 30% larger for 64-bit), which is too small for most real applications.

Neo4j works fine with the default stack space configuration, but if your application implements some recursive behavior it is a good idea to increase the stack size. Note that the stack size setting applies to every thread, so if your application runs a lot of concurrent threads, a larger stack size also increases total memory use.

- The heap size is set by specifying the -Xmx???m parameter to HotSpot, where ??? is the heap size in megabytes. Default heap size is 64MB for 32-bit JVMs, 30% larger (approx. 83MB) for 64-bit JVMs.

- The stack size is set by specifying the -Xss???m parameter to HotSpot, where ??? is the stack size in megabytes. Default stack size is 512kB for 32-bit JVMs on Solaris, 320kB for 32-bit JVMs on Linux (and Windows), and 1024kB for 64-bit JVMs.

Most modern CPUs implement a Non-Uniform Memory Access (NUMA) architecture, where different parts of the memory have different access speeds. Sun's HotSpot JVM is able to allocate objects with awareness of the NUMA structure as of version 1.6.0 update 18. When enabled this can give up to 40% performance improvements. To enable NUMA awareness, specify the -XX:+UseNUMA parameter (this works only when using the Parallel Scavenger garbage collector, which is the default or can be selected with -XX:+UseParallelGC, not the concurrent mark and sweep one).

Properly configuring memory utilization of the JVM is crucial for optimal performance. As an example, a poorly configured JVM could spend all CPU time performing garbage collection (blocking all threads from performing any work). Requirements such as latency, total throughput and available hardware have to be considered to find the right setup. In production, Neo4j should run on a multi core/CPU platform with the JVM in server mode.

21.6.1. Configuring heap size and GC

A large heap allows for larger node and relationship caches — which is a good thing — but large heaps can also lead to latency problems caused by full garbage collection. The different high level cache implementations available in Neo4j together with a suitable JVM configuration of heap size and garbage collection (GC) should be able to handle most workloads.

The default cache (soft reference based LRU cache) works best with a heap that never gets full: a graph where the most used nodes and relationships can be cached. If the heap gets too full there is a risk that a full GC will be triggered; the larger the heap, the longer it can take to determine what soft references should be cleared.

Using the strong reference cache means that all the nodes and relationships being used must fit in the available heap. Otherwise there is a risk of getting out-of-memory exceptions. The soft reference and strong reference caches are well suited for applications where the overall throughput is important.

The weak reference cache basically needs enough heap to handle the peak load of the application, that is, peak load multiplied by the average memory required per request. It is well suited for low latency requirements where GC interruptions are not acceptable.

Important

When running Neo4j on Windows, keep in mind that the memory mapped buffers are allocated on heap by default, so they need to be taken into account when determining heap size.

Table 21.43. Guidelines for heap size

Number of primitives   RAM size      Heap configuration   Reserved RAM for the OS
10M                    2GB           512MB                the rest
100M                   8GB+          1-4GB                1-2GB
1B+                    16GB-32GB+    4GB+                 1-2GB

Tip

The recommended garbage collector to use when running Neo4j in production is the Concurrent Mark and Sweep Compactor, turned on by supplying -XX:+UseConcMarkSweepGC as a JVM parameter.

Once the heap size is well configured, the second thing to tune for the garbage collector is the sizes of the different generations of the heap. The default settings are well tuned for "normal" applications and work quite well for most applications, but if you have an application with either a really high allocation rate or a lot of long-lived objects, you might want to consider tuning the sizes of the heap generations. The ratio between the young and tenured generation of the heap is specified by using the -XX:NewRatio=# command line option (where # is replaced by a number). The default ratio is 1:12 for client mode JVM, and 1:8 for server mode JVM. You can also specify the size of the young generation explicitly using the -Xmn command line option, which works just like the -Xmx option that specifies the total heap space.
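The effect of -XX:NewRatio on the generation sizes is simple arithmetic: with a ratio of N the tenured generation is N times the size of the young generation. A minimal sketch (the class and method names are illustrative, and survivor-space details are ignored):

```java
// Splits a heap size into young and tenured parts for a given NewRatio.
class GenerationSizes {
    // With -XX:NewRatio=N, young : tenured = 1 : N,
    // so the young generation gets heap / (N + 1).
    static long youngBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    static long tenuredBytes(long heapBytes, int newRatio) {
        return heapBytes - youngBytes(heapBytes, newRatio);
    }
}
```

For a 900 MB heap and a ratio of 8 this gives a young generation of 100 MB and a tenured generation of 800 MB.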

GC shortname          Generation   Command line parameter     Comment
Copy                  Young        -XX:+UseSerialGC           The Copying collector
MarkSweepCompact      Tenured      -XX:+UseSerialGC           The Mark and Sweep Compactor
ConcurrentMarkSweep   Tenured      -XX:+UseConcMarkSweepGC    The Concurrent Mark and Sweep Compactor
ParNew                Young        -XX:+UseParNewGC           The parallel Young Generation Collector; can only be used with the Concurrent Mark and Sweep Compactor.
PS Scavenge           Young        -XX:+UseParallelGC         The parallel object scavenger
PS MarkSweep          Tenured      -XX:+UseParallelGC         The parallel mark and sweep collector

These are the default configurations on some platforms according to our non-exhaustive research:

JVM: Mac OS X Snow Leopard, 64-bit, Hotspot 1.6.0_17
-d32 -client: ParNew and ConcurrentMarkSweep
-d32 -server: PS Scavenge and PS MarkSweep
-d64 -client: ParNew and ConcurrentMarkSweep
-d64 -server: PS Scavenge and PS MarkSweep

JVM: Ubuntu, 32-bit, Hotspot 1.6.0_16
-d32 -client: Copy and MarkSweepCompact
-d32 -server: Copy and MarkSweepCompact
-d64 -client: N/A
-d64 -server: N/A
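To check which collectors and heap limit a given JVM actually ended up with, the standard management API can be queried at runtime. The sketch below uses only java.lang.management; the names it prints are whatever the running JVM reports, and on HotSpot 1.6 they appear to correspond to the GC shortnames used above. `GcReport` is an illustrative class name.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Reports the active garbage collectors and the maximum heap size.
class GcReport {
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Upper bound on the heap as seen by the JVM (governed by -Xmx).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
        System.out.println("Collectors: " + collectorNames());
    }
}
```

Running it with different flags (for example with and without -XX:+UseConcMarkSweepGC on a JVM that supports it) shows whether a setting took effect.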

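21.7. Compressed storage of short strings

Neo4j will try to classify your strings into a short string class, and if it manages to do so it will treat the string accordingly. In that case the string is stored without indirection in the property store: it is inlined in the property record, so the dynamic string store is not involved in storing the value, which reduces the disk footprint. Additionally, since no string record is needed to store the property, it can be read and written in a single lookup, which improves performance.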

The various classes for short strings are:

- Numerical, consisting of digits 0..9 and the punctuation space, period, dash, plus, comma and apostrophe.

- Date, consisting of digits 0..9 and the punctuation space, dash, colon, slash, plus and comma.

- Uppercase, consisting of uppercase letters A..Z, and the punctuation space, underscore, period, dash, colon and slash.

- Lowercase, like Uppercase but with lowercase letters a..z instead of uppercase.

- E-mail, consisting of lowercase letters a..z and the punctuation comma, underscore, period, dash, plus and the at sign (@).

- URI, consisting of lowercase letters a..z, digits 0..9 and most punctuation available.

- Alphanumerical, consisting of both upper and lowercase letters a..zA..Z, digits 0..9 and the punctuation space and underscore.

- Alphasymbolical, consisting of both upper and lowercase letters a..zA..Z and the punctuation space, underscore, period, dash, colon, slash, plus, comma, apostrophe, at sign, pipe and semicolon.

- European, consisting of most accented European characters and digits plus the punctuation space, dash, underscore and period (like Latin 1 but with less punctuation).

- Latin 1.

- UTF-8.

In addition to the string’s contents, the number of characters also determines if the string can be inlined or not. Each class has its own character count limit:

Table 21.44. Character count limits

| String class | Character count limit |
|---|---|
| Numerical and Date | 54 |
| Uppercase, Lowercase and E-mail | 43 |
| URI, Alphanumerical and Alphasymbolical | 36 |
| European | 31 |
| Latin 1 | 27 |
| UTF-8 | 14 |

That means that the largest inline-able string is 54 characters long and must be of the Numerical class, and also that all strings of 14 characters or fewer will always be inlined.

Also note that the above limits are for the default 41 byte PropertyRecord layout — if that parameter is changed via editing the source and recompiling, the above have to be recalculated.
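The character sets and limits above can be combined into a small check. The following is only a simplified sketch, not Neo4j's actual classifier: it recognises just three of the classes (Numerical, Uppercase and Lowercase) using the character sets listed above, and conservatively falls back to the UTF-8 limit of 14 for everything else, so a false result means "unknown" rather than "not inlined". It also counts Java char values, which is an assumption about how characters are counted. All names are hypothetical.

```java
final class ShortStringLimits {
    static final int NUMERICAL_LIMIT = 54;
    static final int CASE_LIMIT = 43; // Uppercase and Lowercase
    static final int UTF8_LIMIT = 14; // every string this short is inlined

    /** True if every char is in [from, to] or in the allowed punctuation. */
    private static boolean allIn(String s, String punctuation, char from, char to) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean inRange = c >= from && c <= to;
            if (!inRange && punctuation.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Character count limit for the classes this sketch knows about. */
    static int characterLimit(String s) {
        if (allIn(s, " .-+,'", '0', '9')) return NUMERICAL_LIMIT;
        if (allIn(s, " _.-:/", 'A', 'Z')) return CASE_LIMIT;
        if (allIn(s, " _.-:/", 'a', 'z')) return CASE_LIMIT;
        return UTF8_LIMIT; // conservative fallback, may understate the real limit
    }

    /** True when the limits above guarantee that the string is inlined. */
    static boolean definitelyInlined(String s) {
        return s.length() <= characterLimit(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0123-456.789", "NODE_NAME", "hello world"}) {
            System.out.println(s + " -> limit " + characterLimit(s)
                    + ", inlined: " + definitelyInlined(s));
        }
    }
}
```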

21.8. Compressed storage of short arrays

Neo4j will try to store your primitive arrays in a compressed way, so as to save disk space and possibly an I/O operation. To do that, it employs a "bit-shaving" algorithm that tries to reduce the number of bits required for storing the members of the array. In particular:

1. For each member of the array, it determines the position of the leftmost set bit.

2. It determines the largest such position among all members of the array.

3. It reduces all members to that number of bits.

4. It stores those values, prefixed by a small header.

That means that when even a single negative value is included in the array, the natural size of the primitives will be used, because a negative value has its leftmost (sign) bit set.

There is a possibility that the result can be inlined in the property record if:

- It is less than 24 bytes after compression

- It has less than 64 members

For example, an array long[] {0L, 1L, 2L, 4L} will be inlined, as the largest entry (4) requires 3 bits to store, so the whole array is stored in 4*3 = 12 bits. The array long[] {-1L, 1L, 2L, 4L}, however, requires the whole 64 bits for the -1 entry, so it needs 64*4 = 256 bits = 32 bytes and ends up in the dynamic store.
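The bit counting in this example can be reproduced with a few lines of Java. This is a minimal sketch of the arithmetic only, not Neo4j's storage code: it ignores the small header, so the inlining check is approximate, and the class and method names are hypothetical.

```java
final class BitShaving {
    /** Bits needed per member: the highest position of a leftmost set bit. */
    static int bitsPerMember(long[] values) {
        int max = 0;
        for (long v : values) {
            // 64 minus the number of leading zeros is the 1-based position of
            // the leftmost set bit; a negative value has the sign bit set,
            // which gives the full 64.
            max = Math.max(max, 64 - Long.numberOfLeadingZeros(v));
        }
        return max;
    }

    /** Total payload size in bits after shaving (header not included). */
    static int compressedBits(long[] values) {
        return bitsPerMember(values) * values.length;
    }

    /** Approximate check of the two inlining conditions, ignoring the header. */
    static boolean mayBeInlined(long[] values) {
        int bytes = (compressedBits(values) + 7) / 8; // round up to whole bytes
        return bytes < 24 && values.length < 64;
    }

    public static void main(String[] args) {
        long[] positive = {0L, 1L, 2L, 4L};
        long[] withNegative = {-1L, 1L, 2L, 4L};
        System.out.println(compressedBits(positive) + " bits, inlined: " + mayBeInlined(positive));
        System.out.println(compressedBits(withNegative) + " bits, inlined: " + mayBeInlined(withNegative));
    }
}
```

The first array comes out at 12 bits and the second at 256 bits, matching the figures in the text.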

21.9. Memory mapped I/O settings

21.9.1. Optimizing for traversal speed example

21.9.2. Batch insert example

Each file in the Neo4j store can use memory mapped I/O for reading/writing. Best performance is achieved if the full file can be memory mapped but if there isn’t enough memory for that Neo4j will try and make the best use of the memory it gets (regions of the file that get accessed often will more likely be memory mapped).

Important

Neo4j makes heavy use of the java.nio package. Native I/O will result in memory being allocated outside the normal Java heap, so that memory usage needs to be taken into consideration. Other processes running on the OS will impact the availability of such memory. Neo4j will require all of the heap memory of the JVM plus the memory to be used for memory mapping to be available as physical memory. Other processes may thus not use more than what is available after the configured memory allocation is made for Neo4j.

A well configured OS with large disk caches will help a lot once we get cache misses in the node and relationship caches. Therefore it is not a good idea to use all available memory as Java heap.

If you look into the directory of your Neo4j database, you will find its store files, all prefixed by neostore:

- nodestore stores information about nodes

- relationshipstore holds all the relationships

- propertystore stores information of properties and all simple properties such as primitive types (both for relationships and nodes)

- propertystore strings stores all string properties

- propertystore arrays stores all array properties

There are other files there as well, but they are normally not interesting in this context.

This is how the default memory mapping configuration looks:

neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M

21.9.1. Optimizing for traversal speed example

To tune the memory mapping settings start by investigating the size of the different store files found in the directory of your Neo4j database. Here is an example of some of the files and sizes in a Neo4j database:

14M neostore.nodestore.db
510M neostore.propertystore.db
1.2G neostore.propertystore.db.strings
304M neostore.relationshipstore.db

In this example the application is running on a machine with 4GB of RAM. We’ve reserved about 2GB for the OS and other programs. The Java heap is set to 1.5GB, which leaves about 500MB of RAM that can be used for memory mapping.

Tip

If traversal speed is the highest priority it is good to memory map as much as possible of the node- and relationship stores.

An example configuration on the example machine focusing on traversal speed would then look something like:

neostore.nodestore.db.mapped_memory=15M
neostore.relationshipstore.db.mapped_memory=285M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
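It can be worth checking that the mapped memory settings add up to no more than the RAM left over after the OS reservation and the Java heap. The sketch below does that arithmetic for the example machine; it assumes only the M and G suffixes are used and treats 1G as 1024M. The class and helper names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class MappedMemoryBudget {
    /** Parses sizes such as "285M" or "3G" into megabytes. */
    static long toMb(String size) {
        String s = size.trim().toUpperCase(Locale.ROOT);
        long number = Long.parseLong(s.substring(0, s.length() - 1));
        switch (s.charAt(s.length() - 1)) {
            case 'M': return number;
            case 'G': return number * 1024;
            default: throw new IllegalArgumentException("Unsupported size: " + size);
        }
    }

    /** Sum of all mapped_memory settings in megabytes. */
    static long totalMb(Map<String, String> config) {
        long total = 0;
        for (String value : config.values()) {
            total += toMb(value);
        }
        return total;
    }

    /** RAM left for memory mapping after the OS reservation and the Java heap. */
    static long availableMb(long ramMb, long reservedForOsMb, long heapMb) {
        return ramMb - reservedForOsMb - heapMb;
    }

    /** The traversal-focused example configuration from the text. */
    static Map<String, String> traversalConfig() {
        Map<String, String> config = new LinkedHashMap<String, String>();
        config.put("neostore.nodestore.db.mapped_memory", "15M");
        config.put("neostore.relationshipstore.db.mapped_memory", "285M");
        config.put("neostore.propertystore.db.mapped_memory", "100M");
        config.put("neostore.propertystore.db.strings.mapped_memory", "100M");
        config.put("neostore.propertystore.db.arrays.mapped_memory", "0M");
        return config;
    }

    public static void main(String[] args) {
        long available = availableMb(4096, 2048, 1536); // 4GB RAM, 2GB OS, 1.5GB heap
        long total = totalMb(traversalConfig());
        System.out.println("mapped=" + total + "M available=" + available
                + "M fits=" + (total <= available));
    }
}
```

For the example machine the settings total 500M against roughly 512M available, so the configuration fits.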

21.9.2. Batch insert example

Read general information on batch insertion in batchinsert.

The configuration should suit the data set you are about to inject using BatchInsert. Let's say we have a random-like graph with 10M nodes and 100M relationships. Each node (and maybe some relationships) has different properties of string and Java primitive types (but no arrays). The important thing with a random graph is to give lots of memory to the relationship and node stores:

neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=50M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

The configuration above will fit the entire graph (with exception to properties) in memory.

A rough formula to calculate the memory needed for the nodes:

number_of_nodes * 9 bytes

and for relationships:

number_of_relationships * 33 bytes

Properties will typically only be injected once and never read so a few megabytes for the property store and string store is usually enough. If you have very large strings or arrays you may want to increase the amount of memory assigned to the string and array store files.

An important thing to remember is that the above configuration will need a Java heap of 3.3G+, since in batch inserter mode normal Java buffers allocated on the heap are used instead of memory mapped ones.
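Applying the rough formulas above to the example graph gives the figures behind that configuration. The sketch below simply multiplies the record counts by the 9 and 33 byte record sizes quoted in the text; the class and method names are hypothetical, and sizes are printed in decimal megabytes.

```java
final class BatchInsertMemory {
    static final long NODE_RECORD_BYTES = 9;          // from the rough formula above
    static final long RELATIONSHIP_RECORD_BYTES = 33; // from the rough formula above

    /** Approximate memory needed to hold all node records. */
    static long nodeStoreBytes(long numberOfNodes) {
        return numberOfNodes * NODE_RECORD_BYTES;
    }

    /** Approximate memory needed to hold all relationship records. */
    static long relationshipStoreBytes(long numberOfRelationships) {
        return numberOfRelationships * RELATIONSHIP_RECORD_BYTES;
    }

    public static void main(String[] args) {
        long nodes = 10000000L;          // 10M nodes
        long relationships = 100000000L; // 100M relationships
        System.out.println("nodes: " + nodeStoreBytes(nodes) / 1000000 + " MB");
        System.out.println("relationships: "
                + relationshipStoreBytes(relationships) / 1000000 + " MB");
    }
}
```

This yields 90 MB for the node store and 3300 MB for the relationship store, which is where the 3.3G+ heap requirement comes from.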

21.10. Linux Performance Guide

21.10.1. Setup

21.10.2. Running the benchmark

21.10.3. Fixing the problem

The key to achieving good performance on reads and writes is to have lots of RAM, since disks are so slow. This guide will focus on achieving good write performance on a Linux kernel based operating system.

If you have not already read the information available in Chapter 21, Configuration and Tuning, do that now to get some basic knowledge on memory mapping and store files with Neo4j.

This section will guide you through how to set up a file system benchmark and use it to configure your system in a better way.

21.10.1. Setup

Create a large file with random data. The file should fit in RAM so if your machine has 4GB of RAM a 1-2GB file with random data will be enough. After the file has been created we will read the file sequentially a few times to make sure it is cached.

$ dd if=/dev/urandom of=store bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 263.53 s, 4.0 MB/s
$
$ dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 38.6809 s, 27.1 MB/s
$
$ dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 1.52365 s, 688 MB/s
$ dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.776044 s, 1.4 GB/s

If you have a standard hard drive in the machine you may know that it is not capable of transfer speeds as high as 1.4GB/s. What is measured is how fast we can read a file that is cached for us by the operating system.

Next we will use a small utility that simulates the Neo4j kernel behavior to benchmark write speed of the system.

$ git clone git@github.com:neo4j/tooling.git
...
$ cd tooling/write-test/
$ mvn compile
[INFO] Scanning for projects...
...
$ ./run
Usage: <large file> <log file> <[record size] [min tx size] [max tx size] [tx count] <[--nosync | --nowritelog | --nowritestore | --noread | --nomemorymap]>>

The utility will be given a store file (large file we just created) and a name of a log file. Then a record size in bytes, min tx size, max tx size and transaction count must be set. When started the utility will map the large store file entirely in memory and read (transaction size) records from it randomly and then write them sequentially to the log file. The log file will then force changes to disk and finally the records will be written back to the store file.
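To make that sequence of steps concrete, here is a minimal sketch of the same access pattern using java.nio. It is not the code of the write-test utility; the class name, method signature and the seed parameter are assumptions made for illustration, and it assumes a store file smaller than 2GB that holds at least one record.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

final class WriteTestSketch {
    /** Runs txCount transactions and returns the number of records updated. */
    static long run(String storePath, String logPath, int recordSize,
                    int minTxSize, int maxTxSize, int txCount, long seed) throws IOException {
        Random random = new Random(seed);
        long records = 0;
        try (RandomAccessFile storeFile = new RandomAccessFile(storePath, "rw");
             RandomAccessFile logFile = new RandomAccessFile(logPath, "rw")) {
            FileChannel store = storeFile.getChannel();
            FileChannel log = logFile.getChannel();
            log.position(log.size()); // append to the log
            // Map the whole store file into memory (one mapping is limited to 2GB).
            MappedByteBuffer mapped = store.map(FileChannel.MapMode.READ_WRITE, 0, store.size());
            int recordCount = mapped.capacity() / recordSize;
            byte[] record = new byte[recordSize];
            for (int tx = 0; tx < txCount; tx++) {
                int txSize = minTxSize + random.nextInt(maxTxSize - minTxSize + 1);
                int[] positions = new int[txSize];
                ByteBuffer logBuffer = ByteBuffer.allocate(txSize * recordSize);
                // Read records from random positions in the mapped store.
                for (int i = 0; i < txSize; i++) {
                    positions[i] = random.nextInt(recordCount) * recordSize;
                    mapped.position(positions[i]);
                    mapped.get(record);
                    logBuffer.put(record);
                }
                // Write them sequentially to the log and force the log to disk.
                logBuffer.flip();
                while (logBuffer.hasRemaining()) {
                    log.write(logBuffer);
                }
                log.force(false); // data only, roughly comparable to fdatasync
                // Finally write the records back to the mapped store (no force).
                logBuffer.rewind();
                for (int i = 0; i < txSize; i++) {
                    logBuffer.get(record);
                    mapped.position(positions[i]);
                    mapped.put(record);
                }
                records += txSize;
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WriteTestSketch <large file> <log file>
        long records = run(args[0], args[1], 33, 100, 500, 100, System.nanoTime());
        System.out.println("records updated: " + records);
    }
}
```

The point of the pattern is that only the log is forced to disk; writes to the mapped store stay in memory as dirty pages until the operating system decides to write them out.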

21.10.2. Running the benchmark

Let's try to benchmark 100 transactions of size 100-500 with a record size of 33 bytes (same record size used by the relationship store).

$ ./run store logfile 33 100 500 100
tx_count[100] records[30759] fdatasyncs[100] read[0.96802425 MB] wrote[1.9360485 MB]
Time was: 4.973
20.108585 tx/s, 6185.2 records/s, 20.108585 fdatasyncs/s, 199.32773 kB/s on reads, 398.65546 kB/s on writes

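We see that we get about 6185 record updates/s and 20 transactions/s with the current transaction size. We can make the transactions larger, for example writing 10 transactions of 1000-5000 records each: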

$ ./run store logfile 33 1000 5000 10
tx_count[10] records[24511] fdatasyncs[10] read[0.77139187 MB] wrote[1.5427837 MB]
Time was: 0.792
12.626263 tx/s, 30948.232 records/s, 12.626263 fdatasyncs/s, 997.35516 kB/s on reads, 1994.7103 kB/s on writes

With larger transactions we will do fewer of them per second, but record throughput will increase. Let's see if it scales: if 10 transactions take under 1s, then 100 of them should execute in about 10s:

$ ./run store logfile 33 1000 5000 100
tx_count[100] records[308814] fdatasyncs[100] read[9.718763 MB] wrote[19.437527 MB]
Time was: 65.115
1.5357445 tx/s, 4742.594 records/s, 1.5357445 fdatasyncs/s, 152.83751 kB/s on reads, 305.67502 kB/s on writes

This is not very linear scaling. We modified a bit more than 10x records in total but the time jumped up almost 100x. Running the benchmark watching vmstat output will reveal that something is not as it should be:

$ vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so  bi    bo   in   cs us sy id wa
 0  1  47660 298884 136036 2650324   0   0   0 10239 1167 2268  5  7 46 42
 0  1  47660 302728 136044 2646060   0   0   0  7389 1267 2627  6  7 47 40
 0  1  47660 302408 136044 2646024   0   0   0 11707 1861 2016  8  5 48 39
 0  2  47660 302472 136060 2646432   0   0   0 10011 1704 1878  4  7 49 40
 0  1  47660 303420 136068 2645788   0   0   0 13807 1406 1601  4  5 44 47

There are a lot of blocks going out to IO, way more than expected for the write speed we are seeing in the benchmark. Another observation that can be made is that the Linux kernel has spawned a process called "flush-x:x" (run top) that seems to be consuming a lot of resources.

The problem here is that the Linux kernel is trying to be smart and write out dirty pages from virtual memory. As the benchmark memory maps a 1GB file and does random writes, it is likely that this results in 1/4 of the memory pages available on the system being marked as dirty. The Neo4j kernel is not sending any system calls to the Linux kernel to write out these pages to disk; the Linux kernel nevertheless decided to start doing so, and that is a very bad decision. The result is that instead of doing sequential-like writes down to disk (the logical log file) we are now doing random writes, writing regions of the memory mapped file to disk.

It is possible to observe this behavior in more detail by looking at /proc/vmstat "nr_dirty" and "nr_writeback" values. By default the Linux kernel will start writing out pages at a very low ratio of dirty pages (10%).

$ sync
$ watch grep -A 1 dirty /proc/vmstat
...
nr_dirty 22
nr_writeback 0

The "sync" command will write out all data (that needs writing) from memory to disk. The second command will watch the "nr_dirty" and "nr_writeback" count from vmstat. Now start the benchmark again and observe the numbers:

nr_dirty 124947
nr_writeback 232

The "nr_dirty" pages will quickly start to rise and after a while the "nr_writeback" will also increase meaning the Linux kernel is scheduling a lot of pages to write out to disk.
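To get a feel for what these page counts mean, they can be converted to megabytes. The sketch below parses /proc/vmstat style text and assumes a page size of 4096 bytes, which is common on Linux but not guaranteed; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class DirtyPages {
    static final int ASSUMED_PAGE_SIZE = 4096; // assumption, check your system

    /** Returns the value of a counter in /proc/vmstat style text, or -1 if absent. */
    static long counter(String vmstatText, String name) {
        for (String line : vmstatText.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(name)) {
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** Converts a page count to whole megabytes using the assumed page size. */
    static long pagesToMb(long pages) {
        return pages * ASSUMED_PAGE_SIZE / (1024L * 1024L);
    }

    public static void main(String[] args) throws IOException {
        String text = new String(
                Files.readAllBytes(Paths.get("/proc/vmstat")), StandardCharsets.US_ASCII);
        long dirty = counter(text, "nr_dirty");
        long writeback = counter(text, "nr_writeback");
        System.out.println("nr_dirty " + dirty + " (~" + pagesToMb(dirty) + " MB)");
        System.out.println("nr_writeback " + writeback);
    }
}
```

Under that assumption, the 124947 dirty pages observed above correspond to roughly 488 MB, close to half of the 1GB memory mapped file.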

21.10.3. Fixing the problem

As we have 4GB RAM on the machine and memory map a 1GB file that does not need its content written to disk (until we tell it to do so because of logical log rotation or Neo4j kernel shutdown) it should be possible to do endless random writes to that memory with high throughput. All we have to do is to tell the Linux kernel to stop trying to be smart. Edit the /etc/sysctl.conf (need root access) and add the following lines:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Then (as root) execute:

# sysctl -p

The "vm.dirty_background_ratio" setting tells the Linux kernel at what ratio of dirty pages it should start the background task of writing them out. We increased this from the default 10% to 50%, which should cover the 1GB memory mapped file. The "vm.dirty_ratio" setting tells at what ratio all IO writes become synchronous, meaning that we cannot do IO calls without waiting for the underlying device to complete them (which is something you never want to happen).

Rerun the benchmark:

$ ./run store logfile 33 1000 5000 100
tx_count[100] records[265624] fdatasyncs[100] read[8.35952 MB] wrote[16.71904 MB]
Time was: 6.781
14.7470875 tx/s, 39171.805 records/s, 14.7470875 fdatasyncs/s, 1262.3726 kB/s on reads, 2524.745 kB/s on writes

Results are now more in line with what can be expected, 10x more records modified results in 10x longer execution time. The vmstat utility will not report any absurd amount of IO blocks going out (it reports the ones caused by the fdatasync to the logical log) and Linux kernel will not spawn a "flush-x:x" background process writing out dirty pages caused by writes to the memory mapped store file.

21.11. Linux specific notes

21.11.1. File system tuning for high IO

21.11.2. Setting the number of open files

21.11.1. File system tuning for high IO

In order to support the high IO load of small transactions from a database, the underlying file system should be tuned. Symptoms for this are low CPU load with high iowait. In this case, there are a couple of tweaks possible on Linux systems:

- Disable access-time updates: noatime,nodiratime flags for the disk mount command or in /etc/fstab for the database disk volume mount.

- Tune the IO scheduler for high disk IO on the database disk.

21.11.2. Setting the number of open files

Linux platforms impose an upper limit on the number of concurrent files a user may have open. This number is reported for the current user and session with the command

user@localhost:~$ ulimit -n
1024

The usual default of 1024 is often not enough, especially when many indexes are used or a server installation sees too many connections (network sockets count against that limit as well). Users are therefore encouraged to increase that limit to a healthy value of 40000 or more, depending on usage patterns. Setting this value via the ulimit command is possible only for the root user, and it applies only to that session. To set the value system wide you have to follow the instructions for your platform.

What follows is the procedure to set the open file descriptor limit to 40k for user neo4j under Ubuntu 10.04 and later. If you opted to run the neo4j service as a different user, change the first field in step 2 accordingly.

1. Become root since all operations that follow require editing protected system files.

user@localhost:~$ sudo su -
Password:
root@localhost:~$

2. Edit /etc/security/limits.conf and add these two lines:

neo4j soft nofile 40000
neo4j hard nofile 40000

3. Edit /etc/pam.d/su and uncomment or add the following line:

session required pam_limits.so

4. A restart is required for the settings to take effect.

After the above procedure, the neo4j user will have a limit of 40000 simultaneous open files. If you continue experiencing exceptions on "Too many open files" or "Could not stat() directory" then you may have to raise that limit further.
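One way to confirm that the new limit is visible to the Java process itself is to query it from inside the JVM. The sketch below relies on com.sun.management.UnixOperatingSystemMXBean, which is available on HotSpot based JVMs on Unix-like systems but is not part of the standard Java API; on other JVMs the methods return -1. The class name OpenFileLimit is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class OpenFileLimit {
    /** Maximum number of file descriptors for this process, or -1 if unknown. */
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1; // not a Unix-like platform or not a supporting JVM
    }

    /** Number of file descriptors currently open, or -1 if unknown. */
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFileDescriptors());
        System.out.println("max file descriptors:  " + maxFileDescriptors());
    }
}
```

If the reported maximum is still 1024 when run as the neo4j user, the limits.conf change has not taken effect for that session.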