Optimizing NexentaStor 3.1.4 for SSDs
鲁钱明
2023-12-01
Introduction
SSDs are the fastest data storage devices commonly available for computers today, and they have a few characteristics that differ from HDDs. With a few minor modifications to sd.conf, you can get optimal performance from most SSDs. This paper outlines the steps necessary to achieve that performance.
4096 Sector Size
Most SSDs available today have 4K physical sectors instead of the old standard 512-byte sectors. Many SSD vendors use a 4K indirection table that groups eight 512B sectors, or eight Logical Block Addresses (LBAs), and maps them into one physical 4K block of non-volatile memory. LBAs are still assigned to 512B sectors, so for all intents and purposes eight logical 512B sectors are grouped into one physical 4K sector. This relationship is invisible to the operating system, and an operating system performing many 512B accesses will see reduced performance: a 512B write results in a read-modify-write of the entire 4K block. For optimal performance, the OS should be configured to use a minimum transfer size of 4K, so it is important to ensure that the drive is partitioned correctly and that data is written in 4K blocks. NexentaStor 3.1.4 adds support for this sector size.
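As a rough illustration of the penalty, consider an 8 KiB sequential write issued as sixteen 512B operations to a 4K-native drive. Ignoring any coalescing in the drive's cache, each 512B write forces the drive to read a 4K block, merge in the new 512 bytes, and write the 4K block back, so the drive moves roughly 64 KiB of internal reads plus 64 KiB of writes, versus two plain 4K writes when the host issues aligned 4K I/O. This is why the 4K configuration described in this paper matters.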
This would happen automatically if a drive using 4K sectors reported that fact to the operating system. Unfortunately, there is no reliable way to identify such a drive automatically. The process discussed below assumes NexentaStor release 3.1.4 or later. There are workarounds applicable to releases prior to 3.1.4, but this document does not address them.
There are a number of ways to find out whether an SSD uses 4K sectors. The easiest is to look at the manufacturer's documentation for the drive. If the drive uses a larger sector size, the manufacturer usually lists it in the specifications, typically showing a 512B size for compatibility and a "native sector size" of 4096. If the documentation shows a native sector size of 4096, the drive should be configured with a 4096-byte sector size.
The other method is to look at the performance tests that were run when the drive was certified, specifically the sequential write tests at 512B and 4096B transfer sizes. (These tests are run as part of the certification suite.) For a drive with a native 512-byte sector, the 512B test should show an I/O rate (IOPS) equal to or slightly higher than the 4096B test, and the 512B MB/sec result should therefore be roughly one-seventh to one-eighth of the 4096B result, since each I/O moves only one-eighth as much data. For a drive with a native 4096-byte sector, the 512B MB/sec result will be far lower than that, well under one-eighth of the 4096B result, with a corresponding drop in IOPS. What you are observing is the effect of the drive performing an extra internal I/O for each read-modify-write operation.
Below is an example of two such runs (Figures 1 and 2). The first result is from a drive that has a 512-byte native sector size and the second is from a device that has a 4096-byte native sector size.
Figure 1: Drive A (native 512-byte sectors)

23:19:40.001 Starting RD=run_seqWrites; I/O rate: Uncontrolled MAX; elapsed=20; For loops: threads=1.0 xfersize=512.0
Feb 22, 2013 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
23:19:46.131 1 16004.20 7.81 512 0.00 0.057 0.000 0.057 13.587 0.051 0.9 3.9 3.1
23:19:50.027 2 16178.60 7.90 512 0.00 0.058 0.000 0.058 1.036 0.015 0.9 3.8 3.2
23:19:55.027 3 16267.00 7.94 512 0.00 0.058 0.000 0.058 1.040 0.022 0.9 3.6 3.0
23:20:00.022 4 16311.20 7.96 512 0.00 0.058 0.000 0.058 1.035 0.014 0.9 3.5 3.0
23:20:00.036 avg_2-4 16252.27 7.94 512 0.00 0.058 0.000 0.058 1.040 0.017 0.9 3.6 3.1
23:20:06.001 Starting RD=run_seqWrites; I/O rate: Uncontrolled MAX; elapsed=20; For loops: threads=1.0 xfersize=4k
Feb 22, 2013 interval i/o MB/sec bytes read resp read write resp resp queue cpu% cpu%
rate 1024**2 i/o pct time resp resp max stddev depth sys+u sys
23:20:11.027 1 13738.40 53.67 4096 0.00 0.067 0.000 0.067 14.579 0.056 0.9 3.4 2.8
23:20:16.025 2 14333.20 55.99 4096 0.00 0.065 0.000 0.065 1.052 0.012 0.9 3.5 2.9
23:20:21.028 3 14073.60 54.98 4096 0.00 0.065 0.000 0.065 0.368 0.009 0.9 3.4 2.7
23:20:26.025 4 14214.40 55.53 4096 0.00 0.066 0.000 0.066 1.041 0.010 0.9 4.1 3.2
23:20:26.041 avg_2-4 14207.07 55.50 4096 0.00 0.065 0.000 0.065 1.052 0.011 0.9 3.6 2.9

Figure 2: Drive B (native 4096-byte sectors)

16:09:58.000 Starting RD=run_seqWrites; I/O rate: Uncontrolled MAX; elapsed=30; For loops: threads=1.0 xfersize=512.0
Feb 08, 2013 interval i/o MB/sec bytes read resp resp resp cpu% cpu%
rate 1024**2 i/o pct time max stddev sys+usr sys
16:10:03.064 1 928.60 0.45 512 0.00 1.056 4.220 0.857 0.8 0.4
16:10:08.019 2 938.20 0.46 512 0.00 1.063 4.128 0.860 0.4 0.3
16:10:13.017 3 948.20 0.46 512 0.00 1.052 4.173 0.858 0.4 0.3
16:10:18.016 4 986.00 0.48 512 0.00 1.012 4.190 0.841 0.8 0.6
16:10:23.016 5 993.40 0.49 512 0.00 1.005 4.411 0.831 0.4 0.4
16:10:28.014 6 974.00 0.48 512 0.00 1.024 4.432 0.841 0.4 0.3
16:10:28.023 avg_2-6 967.96 0.47 512 0.00 1.031 4.432 0.846 0.5 0.4
16:10:34.000 Starting RD=run_seqWrites; I/O rate: Uncontrolled MAX; elapsed=30; For loops: threads=1.0 xfersize=4k
Feb 08, 2013 interval i/o MB/sec bytes read resp resp resp cpu% cpu%
rate 1024**2 i/o pct time max stddev sys+usr sys
16:10:39.010 1 14332.40 55.99 4096 0.00 0.067 0.638 0.006 2.4 2.0
16:10:44.009 2 14291.60 55.83 4096 0.00 0.068 0.666 0.006 2.1 1.8
16:10:49.009 3 14101.40 55.08 4096 0.00 0.069 0.662 0.005 2.0 1.8
16:10:54.009 4 14463.40 56.50 4096 0.00 0.067 0.660 0.006 2.0 1.8
16:10:59.009 5 14792.80 57.78 4096 0.00 0.066 0.658 0.006 2.4 2.0
16:11:04.008 6 14456.80 56.47 4096 0.00 0.068 0.658 0.006 2.0 1.8
16:11:04.010 avg_2-6 14421.20 56.33 4096 0.00 0.068 0.666 0.006 2.1 1.8
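To make the comparison concrete, using the averaged rows above: Drive A delivers roughly 16,252 IOPS x 512 B = 7.9 MB/sec at 512B versus 14,207 IOPS x 4096 B = 55.5 MB/sec at 4K, about a 1:7 throughput ratio with nearly identical IOPS, which is what a native 512-byte drive looks like. Drive B delivers only about 968 IOPS x 512 B = 0.47 MB/sec at 512B versus 14,421 IOPS x 4096 B = 56.3 MB/sec at 4K; the collapse in both IOPS and MB/sec at 512B is the read-modify-write penalty of a native 4096-byte drive.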
Once you have identified a drive that has a native sector size of 4096, we need to manually tell the operating system that it is dealing with a drive that has a larger sector size. We do this by modifying /kernel/drv/sd.conf. We only have to go through this process once per drive model (VID/PID). The steps to modify that file are as follows:
1) First, you will need the correct vendor ID (VID) and product ID (PID). You can get those directly from
the OS by issuing the following command:
echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | ::print struct scsi_device sd_inq |
::print struct scsi_inquiry inq_vid inq_pid" | mdb -k
The results should look similar to the following:
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST1000NM0001 " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST1000NM0001 " ]
inq_vid = [ "STEC " ]
inq_pid = [ "ZeusRAM " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST3600057SS " ]
... etc.
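If you prefer not to use mdb, the same vendor and product strings are normally also visible in the device error summary; for example (assuming iostat -En is available, as it is on illumos-based releases):
iostat -En | grep Vendor
Each device prints a line such as "Vendor: STEC Product: ZeusRAM ...", which provides the same VID/PID information.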
2) The next step is to build a valid VID/PID tuple, which uses the following format:
"012345670123456789012345"
"|---VID---| |---------PID-----------|"
We have eight characters for the vendor ID and 16 characters for the product ID, so you will need to construct a string of exactly 24 characters from the results of the query, padding each field with spaces. Using the example above, the string would be "STEC    ZeusRAM         " (STEC padded to eight characters, ZeusRAM padded to sixteen).
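A quick way to generate the padded 24-character string is shown below; this printf helper is a convenience sketch assuming a POSIX shell, not part of the procedure itself:
printf '"%-8s%-16s"\n' "STEC" "ZeusRAM"
This prints "STEC    ZeusRAM         " with the vendor ID left-justified in an 8-character field and the product ID left-justified in a 16-character field.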
3) Now that you have the strings that properly identify the disk, use them to construct the entry for /kernel/drv/sd.conf. A single line will be appended to the file. Using the example, it should look as follows:
sd-config-list="STEC    ZeusRAM         ","physical-block-size:4096";
Substitute your own string for the one in the example. If you need more than one device in sd.conf, every entry except the last ends with a comma, and the last entry ends with a semicolon. For example:
sd-config-list=
"STEC    ZeusIOPS        ","physical-block-size:4096",
"STEC    ZeusRAM         ","physical-block-size:4096";
4) After modifying the file, you need to have the OS re-read it. That is done with the command:
update_drv -vf sd
You should get the following in response:
Cannot unload module: sd
Will be unloaded upon reboot.
Forcing update of sd.conf
Updated in the kernel.
Don't worry about the errors—we only care that the sd.conf was re-read by the kernel.
* Please note: This new setting takes effect only for drives subsequently added to a new or existing zpool; it will not affect a drive that is already part of an active zpool.
5) To verify that the disk was updated to a 4096-byte physical block size, issue the following command:
echo ::sd_state | mdb -k | grep phy_blocksize
You should see something like the following:
un_phy_blocksize = 0x200
un_phy_blocksize = 0x200
un_phy_blocksize = 0x200
un_phy_blocksize = 0x200
un_phy_blocksize = 0x1000
un_phy_blocksize = 0x1000
un_phy_blocksize = 0x1000
... etc.
A value of 0x200 indicates a 512-byte sector and 0x1000 indicates a 4096-byte sector.
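As a quick sanity check, a small variation on the same command counts how many sd instances now report a 4096-byte physical block size:
echo ::sd_state | mdb -k | grep phy_blocksize | grep -c 0x1000
The count should match the number of 4K-native drives you added to sd.conf.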
Caching
NexentaStor ensures data integrity even when disk-level caches are in use. It does this by issuing a SYNCHRONIZE CACHE command to the drive so that the data is safely placed on stable storage. For HDD storage, this works as designed and without problems. In the case of newer enterprise SSDs, the SYNCHRONIZE CACHE may not be necessary, but the drive may still receive and act upon the command. Under normal circumstances, ZFS issues infrequent flushes after each uberblock update; that flushing is infrequent enough that no tuning is necessary. ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, etc.). This type of flush is waited upon by the application and impacts performance, and the drop in performance can be large enough to neutralize the benefit of having an SSD device.
If the SSD has a non-volatile cache, disabling cache flushes also helps performance when the device is used as a log device. When all of the LUNs exposed to ZFS are SSDs with non-volatile caches, we can disable all flush requests system-wide by setting zfs_nocacheflush=1 in the /etc/system file. However, if one or more of the LUNs exposed to ZFS is not protected by a non-volatile cache (as is the case with most HDDs), then setting zfs_nocacheflush can lead to data loss, application-level corruption, or, worse, pool corruption. With some SSDs, the cache flush command is a no-op, and disabling it will make little to no performance difference.
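For reference, the system-wide setting mentioned above is a single line added to /etc/system (it takes effect at the next boot), and it should only be used when every LUN in every pool sits behind a non-volatile cache:
set zfs:zfs_nocacheflush = 1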
If you want to disable the SYNCHRONIZE CACHE operation for only certain SSD devices, there is a way to do this on a per-device basis. The /kernel/drv/sd.conf file contains detailed, customized information about the capabilities of each device. In addition to the block size discussed above, there is an entry named 'cache-nonvolatile'. All that needs to be done is to set that variable to 'true', and the OS takes care of the rest. Using the same example as above, the entry for the device would look as follows:
sd-config-list=" STEC ZeusRAM ", "physical-block-size:4096, cache-nonvolatile:true”;
Queue Length
To take full advantage of an SSD, you may need to queue up 8 to 16 requests for each device. Manufacturers quote their maximum IOPS numbers using a high queue depth.
There are two OS parameters that need to be set. The first determines the maximum queue length for the device; this parameter is 'throttle-max'. SSD devices generally support a queue depth of 32, so 32 should be used.
Second, we do not want the OS to sort requests by seek distance, an optimization that only helps rotating media. This parameter is 'disksort', and it should be set to 'false'. Building on the same example as above, the entry for the device with these two parameters added would look as follows:
sd-config-list=" STEC ZeusRAM ", "physical-block-size:4096, cache-nonvolatile:true, throttle-max:32, disksort:false ”
Conclusion
This paper has described a few relatively simple sd.conf changes that optimize a system for SSDs. With these changes in place, you can get optimal performance from your SSDs.