注:本专栏所有分析以函数为主线,必要数据结构会带入讲解;数据库版本为Postgresql10.16。
注:如有讨论的需要请email to jackgo73@outlook.com
Postgresql中常见的的sharebuffer配置为内存的25%,而mysql的bp常见配置为内存的75%,原因和刷盘方式不同有关。
常见模式(和配置有关)
这里着重分析PG的几种sync参数的不同
决定是否同步刷xlog,默认打开。
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)
If this parameter is on, the PostgreSQL server will try to make sure that updates are physically written to disk, by issuing
fsync()
system calls or various equivalent methods (see wal_sync_method). This ensures that the database cluster can recover to a consistent state after an operating system or hardware crash.
默认打开。
使用wal_sync_method配置的刷盘函数保证数据落盘。
While turning off
fsync
is often a performance benefit, this can result in unrecoverable data corruption in the event of a power failure or system crash. Thus it is only advisable to turn offfsync
if you can easily recreate your entire database from external data.Examples of safe circumstances for turning off
fsync
include the initial loading of a new database cluster from a backup file, using a database cluster for processing a batch of data after which the database will be thrown away and recreated, or for a read-only database clone which gets recreated frequently and is not used for failover. High quality hardware alone is not a sufficient justification for turning offfsync
.For reliable recovery when changing
fsync
off to on, it is necessary to force all modified buffers in the kernel to durable storage. This can be done while the cluster is shutdown or whilefsync
is on by runninginitdb --sync-only
, runningsync
, unmounting the file system, or rebooting the server.In many situations, turning off synchronous_commit for noncritical transactions can provide much of the potential performance benefit of turning off
fsync
, without the attendant risks of data corruption.
fsync
can only be set in thepostgresql.conf
file or on the server command line. If you turn this parameter off, also consider turning off full_page_writes.
上述参数打开后,使用什么方式同步刷xlog。默认fdatasync。
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync (default on Linux)
# fsync
# fsync_writethrough
# open_sync
Method used for forcing WAL updates out to disk. If
fsync
is off then this setting is irrelevant, since WAL file updates will not be forced out at all. Possible values are:
open_datasync
(write WAL files withopen()
optionO_DSYNC
)
fdatasync
(callfdatasync()
at each commit)
fsync
(callfsync()
at each commit)
fsync_writethrough
(callfsync()
at each commit, forcing write-through of any disk write cache)
open_sync
(write WAL files withopen()
optionO_SYNC
)The
open_
* options also useO_DIRECT
if available. Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform, except thatfdatasync
is the default on Linux and FreeBSD. The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed in Section 29.1. This parameter can only be set in thepostgresql.conf
file or on the server command line.
如果系统支持,open_*会使用O_DIRECT;Linux会默认使用fdatasync;
open_datasync:open() O_DSYNC
fdatasync:fdatasync()
fsync:fsync()
fsync_writethrough:fsync()
open_sync:open() O_SYNC
源码位置:src/include/access/xlog.h
/* Sync methods */
#define SYNC_METHOD_FSYNC 0
#define SYNC_METHOD_FDATASYNC 1
#define SYNC_METHOD_OPEN 2 /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH 3
#define SYNC_METHOD_OPEN_DSYNC 4 /* for O_DSYNC */
extern int sync_method;
默认SYNC_METHOD_FDATASYNC
3 synchronous_commit
同步提交约束配置,默认on。
Specifies how much WAL processing must complete before the database server returns a “success” indication to the client. Valid values are
remote_apply
,on
(the default),remote_write
,local
, andoff
.If
synchronous_standby_names
is empty, the only meaningful settings areon
andoff
;remote_apply
,remote_write
andlocal
all provide the same local synchronization level ason
. The local behavior of all non-off
modes is to wait for local flush of WAL to disk.
所有非off的配置,都需要等fsync成功,事务才能返回。
In
off
mode, there is no waiting, so there can be a delay between when success is reported to the client and when the transaction is later guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter tooff
does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turningsynchronous_commit
off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3.
事务真正提交 和 事务成功返回客户端 不是一致的! 中间可能最多差三倍的wal_writer_delay。
一般把这个参数关了可以提升性能,为什么不关fsync呢?
因为这个参数关了之后,系统crash后最近的几条成功提交的事务会直接丢失,不会造成数据不一致。
而fsync关了之后,日志落盘完全没有保障了,提交了的事物可能一部分刷盘,一部分没有刷盘造成数据不一致。
If synchronous_standby_names is non-empty,
synchronous_commit
also controls whether transaction commits will wait for their WAL records to be processed on the standby server(s).When set to
remote_apply
, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s), and also written to durable storage on the standbys. This will cause much larger commit delays than previous settings since it waits for WAL replay. When set toon
, commits wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to durable storage. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage. When set toremote_write
, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it to their file systems. This setting ensures data preservation if a standby instance of PostgreSQL crashes, but not if the standby suffers an operating-system-level crash because the data has not necessarily reached durable storage on the standby. The settinglocal
causes commits to wait for local flush to disk, but not for replication. This is usually not desirable when synchronous replication is in use, but is provided for completeness.This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue
SET LOCAL synchronous_commit TO OFF
within the transaction.Table 19.1 summarizes the capabilities of the
synchronous_commit
settings.Table 19.1. synchronous_commit Modes
synchronous_commit setting local durable commit standby durable commit after PG crash standby durable commit after OS crash standby query consistency remote_apply • • • • on • • • remote_write • • local • off
常用配置
wal_writer_delay = 10ms
wal_writer_flush_after = 0 # IO很好的机器,不需要考虑平滑调度, 否则建议128~256kB
Specifies how often the WAL writer flushes WAL, in time terms. After flushing WAL the writer sleeps for the length of time given by
wal_writer_delay
, unless woken up sooner by an asynchronously committing transaction. If the last flush happened less thanwal_writer_delay
ago and less thanwal_writer_flush_after
worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms
). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; settingwal_writer_delay
to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in thepostgresql.conf
file or on the server command line.
不满足10ms的事务只写磁盘,不做flush。
wal_writer_flush_after
(integer
)Specifies how often the WAL writer flushes WAL, in volume terms. If the last flush happened less than
wal_writer_delay
ago and less thanwal_writer_flush_after
worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. Ifwal_writer_flush_after
is set to0
then WAL data is always flushed immediately. If this value is specified without units, it is taken as WAL blocks, that isXLOG_BLCKSZ
bytes, typically 8kB. The default is1MB
. This parameter can only be set in thepostgresql.conf
file or on the server command line.
同上,从空间上触发刷盘。一般盘的io写日志无瓶颈的话,不需要使用这个参数。
除非发现刷xlog周期性的打满IO,配这个参数有奇效。
xlog文件创建。
int
XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
{
...
fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
S_IRUSR | S_IWUSR);
...
}
get_sync_bit(sync_method) sync_method = 1
...
/* Sync methods */
#define SYNC_METHOD_FSYNC 0
#define SYNC_METHOD_FDATASYNC 1
#define SYNC_METHOD_OPEN 2 /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH 3
#define SYNC_METHOD_OPEN_DSYNC 4 /* for O_DSYNC */
extern int sync_method;
...
static int
get_sync_bit(int method)
{
....
if (!XLogIsNeeded() && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
....
switch (method)
{
/*
* enum values for all sync options are defined even if they are
* not supported on the current platform. But if not, they are
* not included in the enum option array, and therefore will never
* be seen here.
*/
case SYNC_METHOD_FSYNC:
case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
return 0;
#ifdef OPEN_SYNC_FLAG
case SYNC_METHOD_OPEN:
return OPEN_SYNC_FLAG | o_direct_flag;
#endif
#ifdef OPEN_DATASYNC_FLAG
case SYNC_METHOD_OPEN_DSYNC:
return OPEN_DATASYNC_FLAG | o_direct_flag;
#endif
default:
/* can't happen (unless we are out of sync with option array) */
elog(ERROR, "unrecognized wal_sync_method: %d", method);
return 0; /* silence warning */
}
}
默认函数返回0。
xlog文件打开
fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
S_IRUSR | S_IWUSR);
触发一次同步刷盘
/*
* Issue appropriate kind of fsync (if any) for an XLOG output file.
*
* 'fd' is a file descriptor for the XLOG file to be fsync'd.
* 'log' and 'seg' are for error reporting purposes.
*/
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
if (pg_fsync_no_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync log file %s: %m",
XLogFileNameP(ThisTimeLineID, segno))));
break;
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
if (pg_fsync_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync write-through log file %s: %m",
XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
if (pg_fdatasync(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fdatasync log file %s: %m",
XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;
default:
elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
break;
}
}
默认走fdatasync
/*
* pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
*
* Not all platforms have fdatasync; treat as fsync if not available.
*/
int
pg_fdatasync(int fd)
{
if (enableFsync)
{
#ifdef HAVE_FDATASYNC
return fdatasync(fd);
#else
return fsync(fd);
#endif
}
else
return 0;
}
HAVE_FDATASYNC宏由configure时配置。
上述过程可以看到,默认情况open参数如下:
O_RDWR | PG_BINARY | 0
S_IRUSR | S_IWUSR