Postgresql中的sync相关参数源码分析

徐承载

2023-12-01

注：本专栏所有分析以函数为主线，必要数据结构会带入讲解；数据库版本为Postgresql10.16。

注：如有讨论的需要请email to jackgo73@outlook.com

一、问题

Postgresql中常见的的sharebuffer配置为内存的25%，而mysql的bp常见配置为内存的75%，原因和刷盘方式不同有关。

常见模式（和配置有关）

pgsql：数据依赖OS CACHE，日志依赖OS CACHE
mysql：数据不依赖OS CACHE使用O_DIRECT直接刷盘，日志依赖OS CACHE

这里着重分析PG的几种sync参数的不同

二、参数

1 fsync

决定是否同步刷xlog，默认打开。

#fsync = on             # flush data to disk for crash safety
                    # (turning this off can cause
                    # unrecoverable data corruption)

fsync (boolean)

If this parameter is on, the PostgreSQL server will try to make sure that updates are physically written to disk, by issuing fsync() system calls or various equivalent methods (see wal_sync_method). This ensures that the database cluster can recover to a consistent state after an operating system or hardware crash.

默认打开。

使用wal_sync_method配置的刷盘函数保证数据落盘。

While turning off fsync is often a performance benefit, this can result in unrecoverable data corruption in the event of a power failure or system crash. Thus it is only advisable to turn off fsync if you can easily recreate your entire database from external data.

Examples of safe circumstances for turning off fsync include the initial loading of a new database cluster from a backup file, using a database cluster for processing a batch of data after which the database will be thrown away and recreated, or for a read-only database clone which gets recreated frequently and is not used for failover. High quality hardware alone is not a sufficient justification for turning off fsync.

For reliable recovery when changing fsync off to on, it is necessary to force all modified buffers in the kernel to durable storage. This can be done while the cluster is shutdown or while fsync is on by running initdb --sync-only, running sync, unmounting the file system, or rebooting the server.

In many situations, turning off synchronous_commit for noncritical transactions can provide much of the potential performance benefit of turning off fsync, without the attendant risks of data corruption.

fsync can only be set in the postgresql.conf file or on the server command line. If you turn this parameter off, also consider turning off full_page_writes.

2 wal_sync_method

上述参数打开后，使用什么方式同步刷xlog。默认fdatasync。

#wal_sync_method = fsync        # the default is the first option
                    # supported by the operating system:
                    #   open_datasync
                    #   fdatasync (default on Linux)
                    #   fsync
                    #   fsync_writethrough
                    #   open_sync

wal_sync_method (enum)

Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant, since WAL file updates will not be forced out at all. Possible values are:

open_datasync (write WAL files with open() option O_DSYNC)

fdatasync (call fdatasync() at each commit)

fsync (call fsync() at each commit)

fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

open_sync (write WAL files with open() option O_SYNC)

The open_* options also use O_DIRECT if available. Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform, except that fdatasync is the default on Linux and FreeBSD. The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed in Section 29.1. This parameter can only be set in the postgresql.conf file or on the server command line.

如果系统支持，open_*会使用O_DIRECT；Linux会默认使用fdatasync；

open_datasync：open() O_DSYNC

fdatasync：fdatasync()

fsync：fsync()

fsync_writethrough：fsync()

open_sync：open() O_SYNC

源码位置：src/include/access/xlog.h

/* Sync methods */
#define SYNC_METHOD_FSYNC       0
#define SYNC_METHOD_FDATASYNC   1
#define SYNC_METHOD_OPEN        2   /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH  3
#define SYNC_METHOD_OPEN_DSYNC  4   /* for O_DSYNC */
extern int  sync_method;

默认SYNC_METHOD_FDATASYNC

`3 synchronous_commit`

同步提交约束配置，默认on。

synchronous_commit (enum)

Specifies how much WAL processing must complete before the database server returns a “success” indication to the client. Valid values are remote_apply, on (the default), remote_write, local, and off.

If synchronous_standby_names is empty, the only meaningful settings are on and off; remote_apply, remote_write and local all provide the same local synchronization level as on. The local behavior of all non-off modes is to wait for local flush of WAL to disk.

所有非off的配置，都需要等fsync成功，事务才能返回。

In off mode, there is no waiting, so there can be a delay between when success is reported to the client and when the transaction is later guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3.

事务真正提交和事务成功返回客户端不是一致的！中间可能最多差三倍的wal_writer_delay。

一般把这个参数关了可以提升性能，为什么不关fsync呢？

因为这个参数关了之后，系统crash后最近的几条成功提交的事务会直接丢失，不会造成数据不一致。

而fsync关了之后，日志落盘完全没有保障了，提交了的事物可能一部分刷盘，一部分没有刷盘造成数据不一致。

If synchronous_standby_names is non-empty, synchronous_commit also controls whether transaction commits will wait for their WAL records to be processed on the standby server(s).

When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s), and also written to durable storage on the standbys. This will cause much larger commit delays than previous settings since it waits for WAL replay. When set to on, commits wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to durable storage. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage. When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it to their file systems. This setting ensures data preservation if a standby instance of PostgreSQL crashes, but not if the standby suffers an operating-system-level crash because the data has not necessarily reached durable storage on the standby. The setting local causes commits to wait for local flush to disk, but not for replication. This is usually not desirable when synchronous replication is in use, but is provided for completeness.

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.

Table 19.1 summarizes the capabilities of the synchronous_commit settings.

Table 19.1. synchronous_commit Modes

synchronous_commit setting local durable commit standby durable commit after PG crash standby durable commit after OS crash standby query consistency
remote_apply • • • •
on • • •
remote_write • •
local •
off

synchronous_commit setting	local durable commit	standby durable commit after PG crash	standby durable commit after OS crash	standby query consistency
remote_apply	•	•	•	•
on	•	•	•
remote_write	•	•
local	•
off

wal_writer_delay/wal_writer_flush_after

常用配置

wal_writer_delay = 10ms

wal_writer_flush_after = 0 # IO很好的机器，不需要考虑平滑调度, 否则建议128~256kB

wal_writer_delay (integer)

Specifies how often the WAL writer flushes WAL, in time terms. After flushing WAL the writer sleeps for the length of time given by wal_writer_delay, unless woken up sooner by an asynchronously committing transaction. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

不满足10ms的事务只写磁盘，不做flush。

wal_writer_flush_after (integer)

Specifies how often the WAL writer flushes WAL, in volume terms. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If wal_writer_flush_after is set to 0 then WAL data is always flushed immediately. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. The default is 1MB. This parameter can only be set in the postgresql.conf file or on the server command line.

同上，从空间上触发刷盘。一般盘的io写日志无瓶颈的话，不需要使用这个参数。

除非发现刷xlog周期性的打满IO，配这个参数有奇效。

三、fsync相关源码

xlog文件创建。

XLogFileInit

int
XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
{
    ...
        fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                           S_IRUSR | S_IWUSR);
    ...
}

get_sync_bit(sync_method) sync_method = 1

...

/* Sync methods */
#define SYNC_METHOD_FSYNC       0
#define SYNC_METHOD_FDATASYNC   1
#define SYNC_METHOD_OPEN        2   /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH  3
#define SYNC_METHOD_OPEN_DSYNC  4   /* for O_DSYNC */
extern int  sync_method;


...

static int
get_sync_bit(int method)
{
    ....
    if (!XLogIsNeeded() && !AmWalReceiverProcess())
        o_direct_flag = PG_O_DIRECT;
    ....

    switch (method)
    {
            /*
             * enum values for all sync options are defined even if they are
             * not supported on the current platform.  But if not, they are
             * not included in the enum option array, and therefore will never
             * be seen here.
             */
        case SYNC_METHOD_FSYNC:
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
        case SYNC_METHOD_FDATASYNC:
            return 0;
#ifdef OPEN_SYNC_FLAG
        case SYNC_METHOD_OPEN:
            return OPEN_SYNC_FLAG | o_direct_flag;
#endif
#ifdef OPEN_DATASYNC_FLAG
        case SYNC_METHOD_OPEN_DSYNC:
            return OPEN_DATASYNC_FLAG | o_direct_flag;
#endif
        default:
            /* can't happen (unless we are out of sync with option array) */
            elog(ERROR, "unrecognized wal_sync_method: %d", method);
            return 0;           /* silence warning */
    }
}

默认函数返回0。

XLogFileOpen

xlog文件打开

    fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                       S_IRUSR | S_IWUSR);

issue_xlog_fsync

触发一次同步刷盘

/*
 * Issue appropriate kind of fsync (if any) for an XLOG output file.
 *
 * 'fd' is a file descriptor for the XLOG file to be fsync'd.
 * 'log' and 'seg' are for error reporting purposes.
 */
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
    switch (sync_method)
    {
        case SYNC_METHOD_FSYNC:
            if (pg_fsync_no_writethrough(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fsync log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#ifdef HAVE_FSYNC_WRITETHROUGH
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
            if (pg_fsync_writethrough(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fsync write-through log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#endif
#ifdef HAVE_FDATASYNC
        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#endif
        case SYNC_METHOD_OPEN:
        case SYNC_METHOD_OPEN_DSYNC:
            /* write synced it already */
            break;
        default:
            elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
            break;
    }
}

默认走fdatasync

/*
 * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
 *
 * Not all platforms have fdatasync; treat as fsync if not available.
 */
int
pg_fdatasync(int fd)
{
    if (enableFsync)
    {
#ifdef HAVE_FDATASYNC
        return fdatasync(fd);
#else
        return fsync(fd);
#endif
    }
    else
        return 0;
}

HAVE_FDATASYNC宏由configure时配置。

OPEN参数含义

上述过程可以看到，默认情况open参数如下：

O_RDWR | PG_BINARY | 0

O_RDWR：打开方式 read/write
PG_BINARY：适配windows，默认为0。

S_IRUSR | S_IWUSR

S_IRUSR：读权限
S_IWUSR：写权限