I used the insert benchmark on servers that use disks in my quest to learn more
about MongoDB internals. The insert benchmark is interesting for a few reasons.
First, while inserting a lot of data isn't something I do all of the time, it is something
for which performance matters some of the time. Second, it subjects secondary
indexes to fragmentation, and excessive fragmentation leads to wasted IO and
wasted disk space. Finally, it allows for useful optimizations including write-optimized
algorithms (the fractal tree via TokuMX
[http://www.tokutek.com/2013/07/tokumx-fractal-treer-indexes-what-are-they/], the LSM via
RocksDB [http://rocksdb.org/] and WiredTiger [http://wiredtiger.com/]) or the
InnoDB insert buffer [https://www.google.com/search?q=innodb+insert+buffer].
Hopefully I can move on to other workloads after this week.
This test is interesting for another reason that I don't really explore here but will
in a future post. While caching all or most of the database in RAM works great at
eliminating reads, it might not do much for avoiding random writes. So a write-heavy
workload with a cached database can still be limited by random write IO,
and this will be more of an issue as RAM capacity grows on commodity servers
while people try to reuse their favorite update-in-place b-tree for cached
workloads. Some of the impact from that can be viewed in the results for
MongoDB when the database is smaller than 72G. I wonder whether InnoDB can
be improved in this case. The traditional solution is to use snapshots (sequential
IO) and a redo log.
The test server has 72G of RAM and at least 8 10K RPM SAS disks with HW
RAID and a battery-backed write cache so it can do a few thousand random IOPs
given many pending requests if we trade latency for throughput. The insert
benchmark was used with 1 client thread and the test was started with an empty
collection/table. I used the Java client [https://github.com/tmcallaghan/iibench-mongodb]
for MongoDB and TokuMX and the Python client
[http://bazaar.launchpad.net/~mdcallag/mysql-patch/mytools/files/head:/bench/ibench/]
for InnoDB. The MongoDB inserts are done with w:1,j:1 and
journalCommitInterval=2 (or logFlushPeriod=0 with TokuMX). So there is a wait
for fsync but with all of the tests I have done to this point there is not much
difference between j:0 and j:1 as the journal sync does not have much impact
when inserting 1000 documents per insert request. The InnoDB inserts are done
with innodb_flush_log_at_trx_commit=1 so it also waits for fsync. I also used 8kb
pages for InnoDB and disabled the doublewrite buffer. Compression was not
used for InnoDB. Fsync is fast on the test hardware given the RAID write cache.
The clients run on the same host as the server to reduce network latency. The
oplog/binlog was disabled.
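The insert pattern can be sketched in a few lines of Python. This is only an illustration: the field names (price, customerid, cashregisterid) are assumptions modeled on the iibench schema, not taken from the clients linked above, and the pymongo call in the comment is one way to get the w:1,j:1 write concern described here.

```python
import random
import string

BATCH_SIZE = 1000  # the benchmark sends 1000 documents per insert request

def make_doc(i):
    """Build one synthetic document: a sequential primary key plus the
    fields that are covered by the three secondary indexes."""
    return {
        "_id": i,
        "price": random.uniform(0, 500),
        "customerid": random.randrange(100000),
        "cashregisterid": random.randrange(1000),
        "data": "".join(random.choices(string.ascii_letters, k=10)),
    }

def make_batch(start):
    """One insert request: BATCH_SIZE documents with sequential keys."""
    return [make_doc(i) for i in range(start, start + BATCH_SIZE)]

# With pymongo the batch would be written with sync-on-commit, e.g.:
#   coll.with_options(write_concern=WriteConcern(w=1, j=True)) \
#       .insert_many(make_batch(0))
```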
TokuMX, MongoDB and InnoDB versus the insert benchmark with disks

March 25, 2014 - Small Datum [http://smalldatum.blogspot.com/]
I usually have feature requests listed in a post but not this time. I think that
MongoDB needs much more in the way of per-collection and per-index statistics.
That shouldn't be a surprise given my work on the same for MySQL. But that will
wait for another post.
The performance summary isn't a surprise. TokuMX does better than InnoDB
because fractal trees greatly reduce the random IOPs demand. InnoDB does
better than MongoDB. There are a few reasons why InnoDB does better than
MongoDB even though they both use an update-in-place b-tree:
1. Databases with MongoDB are larger than with InnoDB so cache hit rates are
lower when the database is larger than RAM. I don't understand all of the
reasons for the size differences. Including attribute names in every
document is not the majority of the problem. I think there is more secondary
index fragmentation with MongoDB. I have results with and without the
powerOf2Sizes option
[http://docs.mongodb.org/manual/reference/command/collMod/] enabled and that
doesn't explain the difference.
2. The InnoDB insert buffer [http://dev.mysql.com/doc/innodb/1.1/en/innodb-performance-change_buffering.html]
is the primary reason that InnoDB does
better. This is true when comparing InnoDB to many products that use an
update-in-place b-tree, not just MongoDB. Because of the insert buffer
InnoDB is unlikely to stall on disk reads to leaf pages of secondary indexes
during index maintenance. Those reads can be done in the background
using many concurrent IO requests. MongoDB doesn't have this feature. It
blocks on disk reads during secondary index maintenance and won't benefit
from concurrent IO for reads despite the RAID array used by the
server. This note has performance results
[https://www.facebook.com/note.php?note_id=492969385932] for the insert
benchmark and InnoDB when the insert buffer is disabled to show the
benefit from that feature. I have also written about problems since fixed in
InnoDB that prevented the insert buffer from being useful because it
became full [http://mysqlha.blogspot.com/2008/12/other-performance-problem.html].
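The effect of buffered index maintenance can be sketched with a toy model. The index size and merge batch below are made-up numbers, and the real insert buffer also gains from doing the deferred leaf reads in the background with many concurrent IO requests, which this single-threaded sketch ignores.

```python
import random

random.seed(0)
LEAF_PAGES = 1000   # leaf pages in one secondary index (assumed, tiny)
MERGE_BATCH = 1000  # buffered changes applied per merge pass (assumed)

def reads_update_in_place(n_inserts):
    """Plain update-in-place b-tree: every insert reads the target leaf
    page synchronously before modifying it."""
    return n_inserts

def reads_with_insert_buffer(n_inserts):
    """Insert-buffer style: changes are queued, and a merge pass reads
    each distinct leaf page once, so multiple changes to the same leaf
    share one read."""
    reads, pending = 0, []
    for _ in range(n_inserts):
        pending.append(random.randrange(LEAF_PAGES))
        if len(pending) == MERGE_BATCH:
            reads += len(set(pending))  # one read per distinct leaf
            pending.clear()
    return reads + len(set(pending))
```

With random keys and batches larger than the working set of leaves, the buffered variant does noticeably fewer leaf reads than one per insert.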
For the test the client inserts up to 2B rows. But I wasn't willing to wait for
MongoDB and stopped it after less than 600M rows. InnoDB was stopped after
1.8B rows. The columns used for the result table are listed below. There are a lot
more details on these columns in a previous post
[http://smalldatum.blogspot.com/2014/03/redo-logs-in-mongodb-and-innodb.html]. Each
of the sections that follow describes the performance to insert the next 100M
documents/rows.
sizeGB - the database size in GB
bpd - bytes per document/row computed from sizeGB / #docs (or #rows)
MB/s - the average rate for bytes written per second computed from iostat.
This has IO for the database file and the journal/redo logs
GBw - the total number of GB written to the database including journal/redo
logs
secs - the number of seconds to insert data
irate - the rate of documents or rows inserted per second
notes - more details on the configuration

Results
From 0 to 100M rows

This has results from inserting 100M documents/rows to an empty
collection/table. Things that interest me that I have previously reported include 1)
MongoDB databases are much larger and 2) MongoDB does much more disk IO
for the same workload and the increase in bytes written isn't explained by the
database being larger. One of the reasons for the high bytes written rate is that
the test takes longer to complete with MongoDB and a hard checkpoint is done
every syncdelay seconds. InnoDB is better at delaying writeback for dirty pages.
The interesting result that I have seen in a few cases with both MongoDB 2.4.9
and 2.6.0 is that results are worse with powerOf2Sizes enabled. I have not taken
the time to debug this problem. That is on my TODO list. At first I thought I had a
few bad servers (flaky HW, etc) but I haven't seen the opposite for this workload
(powerOf2Sizes enabled getting better insertion rates). The problem appears to
be intermittent. Note that 2.6 has a fix for SERVER-12216
[https://jira.mongodb.org/browse/SERVER-12216] so that allocation of new files is
not blocked while msync is in progress, and 2.6 should be somewhat faster than 2.4.
config   sizeGB  bpd  MB/s  GBw   secs   irate  notes
innodb   16      171  28.9  124   4290   23308
tokumx   9.2     98   11.1  79    7127   14030
mongo24  43      461  46.4  1539  33230  3009   powerOf2Sizes=0
mongo24  44      472  30.0  1545  51634  1937   powerOf2Sizes=1
mongo26  42      450  47.9  1446  30199  3311   powerOf2Sizes=1
From 100M to 200M rows

TokuMX and fractal trees are starting to show a benefit relative to InnoDB.
config   sizeGB  bpd  MB/s  GBw   secs   irate  notes
innodb   31      166  24.3  238   9781   10224
tokumx   17      91   12.3  90    7328   13646
mongo24  72      386  37.4  1768  47329  2113   powerOf2Sizes=0
mongo24  79      424  24.6  1731  70325  1422   powerOf2Sizes=1
mongo26  76      408  39.3  1611  40992  2439   powerOf2Sizes=1
From 200M to 300M rows

More of the same as TokuMX gets better relative to others.
config   sizeGB  bpd  MB/s  GBw   secs   irate  notes
innodb   45      161  21.7  350   16136  6198
tokumx   25      89   12.0  84    7071   14142
mongo24  98      350  30.7  2008  65514  1526   powerOf2Sizes=0
mongo24  106     379  19.9  1917  96351  1038   powerOf2Sizes=1
mongo26  108     386  24.9  1933  77677  1287   powerOf2Sizes=1
From 300M to 400M rows

TokuMX begins to get slower. MongoDB gets a lot slower as the database is
much larger than RAM. Problems unrelated to MongoDB cost me two of the long
running test servers (for 2.4.9 and 2.6.0 with powerOf2Sizes=1).
config   sizeGB  bpd  MB/s  GBw   secs    irate  notes
innodb   61      163  21.1  376   17825   5610
tokumx   31      83   12.1  86    7172    13941
mongo24  130     348  14.7  2313  157395  635    powerOf2Sizes=0
From 400M to 500M rows

MongoDB is getting significantly slower as the database is larger than RAM.
More on this in the next section.
config   sizeGB  bpd  MB/s  GBw   secs    irate  notes
innodb   75      161  19.8  462   23337   4285
tokumx   39      83   11.1  84    7584    13186
mongo24  160     344  4.7   2105  441534  202    powerOf2Sizes=0
From 500M to 600M rows

I wasn't willing to wait for MongoDB to make it to 600M. I stopped the test when it
reached ~540M inserts. The insert rate continues to drop dramatically. InnoDB
does better because of the insert buffer. I assume that for MongoDB it would
drop to ~50/second were I willing to wait. That would happen when there is a
disk read for every secondary index per inserted document (there are 3 secondary
indexes) and the disk array can do ~150 disk reads/second when requests are
submitted serially.
InnoDB was slightly faster compared to the previous 100M inserts, but it will get
slower in the long run.
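The predicted floor is back-of-envelope arithmetic from the numbers above:

```python
# With one serial disk read per secondary index per inserted document,
# the array's serial read rate divided by the index count bounds the
# insert rate. Both inputs are the estimates from the text above.
SERIAL_READS_PER_SEC = 150   # disk array, requests submitted serially
SECONDARY_INDEXES = 3

predicted_floor = SERIAL_READS_PER_SEC / SECONDARY_INDEXES  # 50.0 inserts/second
```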
I looked at iostat output and the MongoDB host was doing ~260 disk
reads/second and ~375 disk writes/second at test end. For both reads and
writes the average request size was ~8kb. The write stats include writes to
journal and database files. From PMP stack traces [http://poormansprofiler.org/] I
see a single thread busy walking b-tree indexes most of the time. Note that the
write rate for MongoDB has fallen in line with the reduction in the insert rate.
Database pages aren't being dirtied as quickly as before because
MongoDB is stalled on secondary index leaf node reads.
config   sizeGB  bpd  MB/s  GBw   secs   irate  notes

... at 600M docs/rows
innodb   89      159  20.1  392   19465  5137
tokumx   46      82   11.6  90    7741   12917

... at 540M documents
mongo24  168     340  2.9   1235  X      123    powerOf2Sizes=0
From 900M to 1B rows

Alas InnoDB has begun to degrade faster. Even the insert buffer eventually is no
match for a write-optimized algorithm.
config   sizeGB  bpd  MB/s  GBw   secs   irate  notes
innodb   148     158  15.4  1515  98413  1016
tokumx   74      79   10.9  92    8436   11853
From 1.4B to 1.5B rows

More of the same.
config   sizeGB  bpd  MB/s  GBw   secs    irate  notes
innodb   221     158  12.5  1745  140274  713
tokumx   104     74   11.0  96    8722    11464
From 1.5B to 2B rows

TokuMX is all alone.
config   sizeGB  bpd  MB/s  GBw   secs  irate  notes
tokumx   142     76   12.6  99    7868  12709
Posted by Mark Callaghan
Labels: mongodb, mysql
Redo logs in MongoDB and InnoDB

Both MongoDB and InnoDB support ACID. For MongoDB this is limited to single
document changes [http://docs.mongodb.org/manual/faq/fundamentals/] while
InnoDB extends that to multi-statement and possibly long-lived transactions. My
goal in this post is to explain how the MongoDB journal is implemented and used
to support ACID. Hopefully this will help to understand performance. I include
comparisons to InnoDB.
What is ACID?

There are a few interesting constraints on the support for ACID with MongoDB. It
uses a per-database reader-writer lock
[http://docs.mongodb.org/manual/faq/concurrency/]. When a write is in progress all
other uses of that database (writes & reads) are blocked. Reads can be done
concurrent with other reads but block writes. The manual states that the lock is
writer greedy [http://docs.mongodb.org/manual/faq/concurrency/] so that a pending
writer gets the lock before a pending reader. I am curious if this also means that
a pending writer prevents additional readers when the lock is currently in read
mode and added that question to my TODO list. The reader-writer lock code is
kind of complex. By in progress I mean updating in-memory structures from
mmap'd files like the document in the heap storage and any b-tree indexes. For
a multi-document write the lock can be yielded (released) in between documents
so the write might be done from multiple points in time and readers can see the
write in progress. There are no dirty reads. The $isolated option
[http://docs.mongodb.org/manual/reference/operator/update/isolated/] can be used to
avoid yielding the per-db lock for multi-document writes. Even with that option an
error half-way into a multi-document operation results in a half done change as
there is no undo. MyISAM users are familiar with this problem. The cursor
isolation provided by MongoDB isn't that different from READ COMMITTED on a
system that doesn't do MVCC snapshots (see older versions of BerkeleyDB and
maybe IBM DB2 today).
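The writer-greedy semantics can be sketched as a toy lock. This assumes that a pending writer does block new readers (one of my open questions above) and is in no way MongoDB's implementation; the class name and structure are made up for illustration.

```python
import threading

class WriterGreedyRWLock:
    """Toy writer-greedy reader-writer lock: a pending writer blocks
    new readers, so writers aren't starved by a stream of readers."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # New readers wait while a writer holds OR waits for the lock.
            while self._writer or self._waiting_writers > 0:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a pending writer

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```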
It is bad for an in progress write to stall on a disk read (page fault from an
mmap'd file) while holding the per-database lock. MongoDB has support to yield
the per-db lock [http://docs.mongodb.org/manual/faq/concurrency/] on a page fault
(real or predicted) but I have questions about this. Is the yield only done for the
document pages in the heap storage or extended to index pages? Is anything
done to guarantee that most or all pages (document in heap store, all index
pages to be read) are in memory before applying any writes? Note that if the
document is updated in memory and then a page fault occurs during index
maintenance then I doubt the lock can be yielded. This is another question on
my TODO list. I am not the only person with that question
[http://stackoverflow.com/questions/22256776/does-mongodb-yield-lock-on-index-page-fault]
[https://groups.google.com/forum/#!topic/mongodb-user/UEhi7Y74DvE].
MongoDB has something like an LRU to predict whether there will be a
page fault on a read and understanding the performance overhead from that is
also on my TODO list. I have seen a lot of CPU overhead from that code on
some benchmarks.
MongoDB doesn't have row locks. The change to a document is visible as soon
as the per-db lock is released. Not only are some writes from a multi-document
change visible before all documents have been modified, but all changes are
visible before the change is durable [http://smalldatum.blogspot.com/2014/03/when-does-mongodb-make-transaction.html]
via the journal. This behavior is different
from what you get from most DBMS products and users should be aware
[https://jira.mongodb.org/browse/DOCS-2908] of that.
Redo logs

InnoDB has a redo log and uses the system tablespace for undo. The changes
written to support undo are made durable via the redo log just like changes to
database tables. The undo information enables consistent reads for long
running transactions. The InnoDB redo log uses buffered IO by default and is
configured via the innodb_flush_method option
[https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_method].
Redo log pages are 512 bytes and
that might need to change when the conversion from 512 to 4096 byte disk
sectors is complete. Each log page has a checksum.
The innodb_flush_log_at_trx_commit option
[https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
determines whether the
log is forced on commit or once per second. There are a fixed number of redo
log files and they are preallocated. With buffered IO and 512 byte aligned writes
the first write on a 4kb boundary can incur a disk read to get the page into the
OS cache before applying the change. This is a waste of disk IO and the
workaround seems to be [http://dom.as/2010/11/18/logs-memory-pressure/] some
form of padding to 4kb for some writes. But note that the padding will be
reused/reclaimed on the next log write. An alternative is to use direct IO but there
might be several calls to write or pwrite before the log must be forced and
making each write synchronous will delay log writing. With buffered IO the
filesystem can coalesce the data from multiple writes as these are adjacent in the
log file.
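The padding workaround is easy to model. pad_to_page is a hypothetical helper, not an InnoDB function:

```python
FS_PAGE = 4096  # 4kb filesystem/disk page

def pad_to_page(nbytes, page=FS_PAGE):
    """Round a log write up to a page boundary so the first write to a
    new page covers it fully and the OS needn't read the page into
    cache first (avoiding the read-modify-write described above)."""
    return -(-nbytes // page) * page  # ceiling division

# A 700-byte commit record becomes one full 4kb write; per the text,
# InnoDB can reuse/reclaim the padded space on the next log write.
```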
MongoDB doesn't do undo. It does have a redo log called the journal
[http://docs.mongodb.org/manual/core/journaling/] . This uses direct IO on Linux. The
log page size is 8kb and is protected by a checksum. The in-memory log buffer is
compressed via Snappy before the log write is done and the compressed result
is padded to 8kb. The space taken by padding isn't reused/reclaimed for the
next write, so a sequence of small inserts with j:1 writes at least 8kb per
commit to the journal. Writes to the journal are done by a background thread (see durThread
in dur.cpp). Note that the background thread iterates over a list of redo log
entries that must be written to the journal, copies them to a string buffer, then
uses Snappy to compress that data, then pads the compressed output to the
next multiple of 8kb, then writes the padded output to the journal file. The dur
section in serverStatus output [http://docs.mongodb.org/manual/reference/serverstatus/]
has counters for the amount of data written to the journal which includes
the padding (journaledMB). The size of the data prior to padding is the
journaledMB counter divided by the compression counter. Note that these
counters are computed over the last few seconds.
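The durThread write path described above can be sketched as follows, using zlib as a stand-in for Snappy so the sketch is self-contained; journal_write_size is a hypothetical helper, not MongoDB code:

```python
import zlib

JOURNAL_PAGE = 8192  # MongoDB journal writes are padded to 8kb

def journal_write_size(log_entries):
    """Model of the journal write path: concatenate the pending redo
    entries, compress (zlib here as a stand-in for Snappy), then pad
    the result to the next 8kb boundary.
    Returns (uncompressed_bytes, bytes_written_to_journal)."""
    raw = b"".join(log_entries)
    compressed = zlib.compress(raw)
    written = -(-len(compressed) // JOURNAL_PAGE) * JOURNAL_PAGE
    return len(raw), written

# Even a tiny single-insert commit costs one full 8kb journal write,
# which matches the padding behavior described above.
raw, written = journal_write_size([b"x" * 100])
```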
MongoDB optionally recycles log files; recycling is used when journal
preallocation (preallocj) is enabled. With preallocj, three journal files are created at
process start and this can delay startup for the time required to create three 1GB
journal files (see preallocateFiles in dur_journal.cpp). This writes data to the files
so real IO is done including an fsync. In steady state, after process start, old log
files are recycled with preallocj (see removeOldJournalFile in dur_journal.cpp).
Without preallocj the journal files are not preallocated at process start and old
journal files are not reused. There is an undocumented option, --nopreallocj, that
can disable preallocj. There is no option to force preallocj. That is determined by
a short performance test done at process start (see preallocateIsFaster in
dur_journal.cpp). One way to determine whether preallocj is in use is to check
the journal directory for the preallocated files.
Preallocation for both database and journal doesn't mean that files are written an
extra time (once during preallocation and at least once during regular use). I was
happy to learn this. Database file preallocation uses posix_fallocate rather than
write/pwrite on Linux (see the run and ensureLength methods in the FileAllocator
class). Journal file preallocation uses write append, but that should only be done
at process start and then the files are recycled (see preallocateFile and
removeOldJournalFile in dur_journal.cpp).
Using strace is a great way to understand complex DBMS software. This shows
the sequence of 8kb writes to the journal during the insert benchmark with a
client that uses j:1 and 1 document per insert:
strace -f -p $( pidof mongod ) -ewrite
write(5, "g\n\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224'L\376\377\377\377iiben"..., 8192) = 8192
write(5, "\t\v\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\344)L\376\377\377\377iiben"..., 8192) = 8192
write(5, "m\6\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\264\26L\376\377\377\377iiben"..., 8192) = 8192
write(5, "R\4\0\0\264\346\0\0\0\0\0\0\216\f/S\374=8\22\224\16L\376\377\377\377iiben"..., 8192) = 8192
Group commit

Proper group commit is now supported for InnoDB but I will skip the details. It is
done directly by a thread handling the COMMIT operation for a user's
connection and there is no wait unless another thread is already forcing the log.
My team did the first implementation of group commit but MariaDB and MySQL
did something better. We were thrilled to remove that change from our big patch
for MySQL.
MongoDB has group commit. The journal is forced to disk every
journalCommitInterval [http://docs.mongodb.org/manual/tutorial/manage-journaling/]
milliseconds. When a thread is blocked waiting for the journal to be forced the
interval is reduced to 1/3 of that value. The minimum value for
journalCommitInterval is 2, so the maximum wait in that case should be 1 (2/3
rounded up). This means that MongoDB will do at most 1000 log forces per
second. Some hardware can do 5000+ fast fsyncs courtesy of battery backed
write cache in HW RAID or flash so there are some workloads that will want
MongoDB to force the log faster than it can today. Group commit is done by a
background thread (see durThread, durThreadGroupCommit, and
_groupCommit in dur.cpp). Forcing the journal at journalCommitInterval/3
milliseconds is also done when there is too much data ready to be written to it.
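The cap on journal forces follows from the interval arithmetic above; a sketch, with max_log_forces_per_sec as a hypothetical helper:

```python
from math import ceil

def max_log_forces_per_sec(journal_commit_interval_ms):
    """When a thread blocks waiting for the journal the interval drops
    to 1/3 of journalCommitInterval (rounded up to at least 1ms), so
    the minimum setting of 2ms caps MongoDB at ~1000 forces/second."""
    effective_ms = max(1, ceil(journal_commit_interval_ms / 3))
    return 1000 // effective_ms
```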
Performance

I used the insert benchmark to understand redo log performance. The test used
1 client thread to insert 10M documents/rows into an empty collection/table with 1
document/row per insert. The test was repeated in several configurations to
understand what limited performance. I did this to collect data for several
questions: how fast can a single-threaded workload sync the log and how much
data is written to the log per small transaction. For the InnoDB tests I used
MySQL 5.6.12, disabled the doublewrite buffer and used an 8k page. The
MongoDB tests used 2.6.0 rc0. The TokuMX tests use 1.4.0. The following
configurations were tested:
inno-sync - fsync per insert with innodb_flush_log_at_trx_commit=1.
inno-lazy - ~1 fsync per second with innodb_flush_log_at_trx_commit=2
toku-sync - fsync per insert with logFlushPeriod=0
mongo-sync - fsync per insert, journalCommitInterval=2, inserts used j:1
mongo-lazy - a few fsyncs/second, journalCommitInterval=300, inserts used
w:1, j:0
mongo-nojournal - journal disabled, inserts used w:1
The following metrics are reported for this test.
bpd - bytes per document (or row). This is the size of the database at test
end divided by the number of documents (or rows) in the database. As I
previously reported [http://smalldatum.blogspot.com/2014/03/insert-benchmark-for-innodb-mongodb-and.html],
MongoDB uses much more space than InnoDB
whether or not powerOf2Sizes
[http://docs.mongodb.org/manual/reference/command/collMod/] is enabled. They
are both update-in-place b-trees so I don't understand why MongoDB does
so much worse when subject to a workload that causes fragmentation.
Storing attribute names in every document doesn't explain the difference.
But in this case the results overstate the MongoDB overhead because of
database file preallocation.
MB/s - the average disk write rate during the test in megabytes per second
GBw - the total number of bytes written to disk during the test in GB. This
includes writes to the database files and (when enabled) the redo logs. The
difference between inno-sync and inno-lazy is the overhead of a 4kb redo
log write per insert. The same is not true between mongo-sync and mongo-lazy.
My educated guess to explain why MongoDB and InnoDB are different
is that for mongo-sync the test takes much longer to finish than mongo-lazy
so there are many more hard checkpoints (write all dirty pages each 60
seconds). InnoDB is much better at keeping pages dirty in the buffer pool
without writeback. In all cases MongoDB is writing much more data to disk. In
the lazy mode it writes ~15X more and in the sync mode it writes ~6X more. I
don't know if MongoDB does hard checkpoints (force all dirty pages to disk
every syncdelay seconds) when the journal is disabled. Perhaps I was too
lazy to read more code.
secs - the number of seconds to insert 10M documents/rows.
bwpi - bytes written per inserted document/row. This is GBw divided by the
number of documents/rows inserted. The per row overhead for inno-sync is
4kb because a redo log force is a 4kb write. The per document overhead for
mongo-sync is 8kb because a redo log force is an 8kb write. So most of the
difference in the write rates between MongoDB and InnoDB is not from the
redo log force overhead.
irate - the rate of documents/rows inserted per second. MongoDB does
fewer than 1000 per second as expected given the use of
journalCommitInterval. This makes for a simple implementation of group
commit but is not good for some workloads (single-threaded with j:1).
logb - this is the total bytes written to the redo log as reported by the DBMS.
Only MongoDB reports this accurately when sync-on-commit is used
because it pads the result to a multiple of the filesystem page size. For
MongoDB the data comes from the dur section of the serverStatus output.
But I changed MongoDB to not reset the counters as upstream code resets
them every few seconds. InnoDB pads to a multiple of 512 bytes and I used
the os_log_written counter to measure it. AFAIK TokuMX doesn't pad and
the counter is LOGGER_BYTES_WRITTEN. So both TokuMX and InnoDB
don't account for the fact that the write is done using the filesystem page
size (multiple of 4kb).
logbpi - log bytes per insert. This is logb divided by the number of
documents/rows inserted. There are two numbers for MongoDB. The first is
the size after padding and compression. It is a bit larger than 8kb. As the
minimum value is 8kb given this workload this isn't a surprise. The second
number is the size prior to compression and padding. This value can be
compared to InnoDB and TokuMX and I am surprised that it is so much
larger for MongoDB. I assume MongoDB doesn't log page images. This is
something for my TODO list.
Test Results

iibench, 1 doc/insert, fsync, 10M rows

config           bpd  MB/s  GBw    secs   bwpi   irate  logb   logbpi
inno-sync        146  18.9  54.3   3071   5690   3257   7.5G   785
inno-lazy        146  2.8   5.8    2251   613    4442   7.5G   785
toku-sync        125  31.0  86.8   2794   9104   3579   2.3G   251
mongo-sync       492  23.1  312.0  13535  32712  739    83.3G  8733/4772
mongo-lazy       429  40.5  79.8   1969   8365   5078   21.9G  2294/4498
mongo-nojournal  440  34.1  42.0   1226   4401   8154   NA     NA
From all of this I have a few feature requests:
1. Don't try to compress the journal buffer when it is already less than 8kb.
That makes commit processing slower and doesn't reduce the amount of
data written to the journal as it will be padded to 8kb.
2. Provide an option to disable journal compression. For some configurations
of the insert benchmark I get 10% more inserts/second with compression
disabled. Compression is currently done by the background thread before
writing to the log. This adds latency for many workloads. When compression
is enabled it is possible to be more clever and begin compressing the
journal buffer early. Compression requires 3 copies of data -- once to the