Deduplication in NexentaStor

松骏俊

2023-12-01

Q: How do I use deduplication?

Deduplication is a technique that improves storage space efficiency by not storing redundant data. Deduplication occurs at the volume level -- all folders and zvols in a volume share the same deduplicated space. This can be very useful when multiple folders or zvols contain the same data, but it has a number of potential drawbacks that must be considered before implementation. The wins from deduplication in the proper environment are great, but the drawbacks can be prohibitive in the wrong environment.

Q: Should I use deduplication?

As a number of the entries in this FAQ mention, performance can be severely impacted by deduplication, and the data you're attempting to deduplicate may not lend itself well to deduplication (or may do so only under certain non-default conditions). Deduplication can be a serious win in certain environments, but in many common setups it provides no benefit and carries penalties. Nexenta wants you to use deduplication where it fits, and not where it would hurt you for no reward. We highly recommend that you discuss deduplication with a Nexenta Sales, Support, or Professional Services engineer before enabling it, especially if the storage will be put directly and immediately into production use. If you are able to put your live production data on a test system prior to deployment, you can find out how well it deduplicates and how big the DDT (dedupe table, explained later) is going to be (and thus how much RAM will be necessary), and make a decision on deduplication accordingly.

Q: Is deduplication performed at file, byte or block level?
ZFS deduplication is performed at the block level. This is the best level of granularity for most storage systems.
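
As an illustration of what "block level" means, the following minimal sketch splits data into fixed-size blocks, fingerprints each block, and counts how many distinct blocks would actually need to be stored. It is a toy model, not the ZFS implementation: the 8-byte block size, the use of SHA-256 as the fingerprint, and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 8  # toy block size in bytes (assumption); real record sizes are far larger


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into consecutive fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def dedup_stats(data: bytes, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (total_blocks, unique_blocks) for the given data."""
    blocks = split_blocks(data, block_size)
    # Two blocks are treated as duplicates when their fingerprints match.
    fingerprints = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks), len(fingerprints)


data = b"AAAAAAAA" * 3 + b"BBBBBBBB"
total, unique = dedup_stats(data)
print(total, unique)  # 4 2
```

Here four blocks are written but only two distinct blocks need to be kept, a 2:1 ratio. Two files that differ in even one byte would not match at the file level, yet most of their blocks could still be shared.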

Q: I hear deduplication can be a great space saver. Why isn't it on by default?
Deduplication can save space, but how much depends on block size (the smaller the block, the better the deduplication ratio) and on data type (how deduplicatable the data really is at the block level). Most customer data does not actually lend itself to deduplication, sometimes in surprising situations. A common example is backups: the customer expects a lot of duplication because many identical systems are being backed up, but the backup software compresses or modifies the data before writing it in such a way that deduplication cannot detect duplicate blocks.

Also, deduplication works best with small block sizes (smaller sets of data mean a higher chance of a duplicate set), and small block sizes often do not give the best performance. Add the deduplication logic itself (which adds a slight overhead) and the very large RAM requirement of deduplication, and you can often end up in a situation where enabling deduplication is either:

- not useful (little to no dedup ratio) and severely impacts performance (typically due to insufficient RAM/L2ARC), or
- useful (decent dedup ratio) and severely impacts performance (typically due to insufficient RAM/L2ARC).

For systems with limited RAM, no cache devices and slow disks, the performance impact can be very noticeable.

Deduplication can provide very decent ratios without severely impacting performance, but only when the environment is right: suitable data, a suitable block size, and enough RAM or cache devices. For a discussion on this and a review of your potential use case, please contact Nexenta. If you have a test environment with real data, you can try this out for yourself (be sure to do plenty of performance testing as close to real-world requirements as possible, on real-world data); if not, again, please contact us.

Q: What is the appropriate amount of RAM to have?

This depends on a number of factors, including the block size of the data to deduplicate and how well the data actually deduplicates. RAM usage for deduplication is driven by the 'deduplication table' (DDT). The DDT holds an entry for every distinct block, whether that block is stored once or referenced many times, and the rule of thumb is to assume 320 bytes per entry in that table. So block size becomes very important. Per OpenSolaris.org, "20 TB of unique data stored in 128K records or more than 1TB of unique data in 8K records would require about 32 GB of physical memory".

In essence, you can need as much as 30-40 GB of RAM for just 1 TB of deduplicated data, if the data ends up not deduplicating well and uses an 8 KB block size. Or you could need far less, if the data deduplicates extremely well or your block size is larger (though a larger block size might affect dedup ratios). If you have an existing ZFS pool with the data you plan to deduplicate, there is a method to determine how much RAM it would really take; contact Nexenta Support or Sales for more information on that. If the data you plan to deduplicate is not yet in a ZFS pool, or this is for a new project, please contact Nexenta to discuss the possible ramifications of deduplication.

The DDT needs to remain in RAM or on SSD for system performance not to be severely impacted. If the projected DDT size is greater than the amount of RAM in the system, either the RAM must be increased to cover the difference, or SSDs must be added to the pool as 'cache' (L2ARC) devices to give the DDT an alternative place to live. Almost all customer complaints that even remotely involve performance on systems using deduplication turn out to be caused by a lack of RAM or SSD for the DDT to fully reside in.
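
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes one 320-byte entry per unique block, uses binary units (1 TB = 2^40 bytes), and the function name `ddt_size_bytes` is made up for the example. It is not a substitute for measuring a real pool.

```python
DDT_ENTRY_BYTES = 320  # rule-of-thumb size of one DDT entry, as quoted in the text

KIB = 2**10
GIB = 2**30
TIB = 2**40


def ddt_size_bytes(unique_data_bytes: int, block_size_bytes: int,
                   entry_bytes: int = DDT_ENTRY_BYTES) -> int:
    """Estimate the DDT size: one entry per unique block of stored data."""
    entries = -(-unique_data_bytes // block_size_bytes)  # ceiling division
    return entries * entry_bytes


# 1 TB of unique data at an 8 KB block size
print(ddt_size_bytes(1 * TIB, 8 * KIB) / GIB)    # 40.0
# The same data at a 128 KB block size
print(ddt_size_bytes(1 * TIB, 128 * KIB) / GIB)  # 2.5
```

The sixteen-fold difference between the two results comes purely from block size, which is why the block size of the data matters so much when sizing RAM or L2ARC for the DDT.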

Q: I manage the volumes and I bill the users for their folder usage. With deduplication on, will they see any changes?

Folder usage statistics are independent of deduplication: each folder is charged for the data written to it, whether or not those blocks were deduplicated. If you are not using quotas, folders can grow arbitrarily large. In particular, a folder can now grow larger than its containing volume. Without quotas on the folders, the usage bills can therefore grow arbitrarily large as well if the data being stored is being deduplicated.
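
A toy accounting model may make the billing effect easier to see. It is a sketch under stated assumptions, not how NexentaStor computes its statistics: folders are modelled as lists of block fingerprints, every block has the same size, and the names `usage_report` and `block_bytes` are hypothetical.

```python
def usage_report(folders: dict[str, list[str]],
                 block_bytes: int) -> tuple[dict[str, int], int]:
    """Return (per-folder usage, physical usage of the volume) in bytes.

    Each folder is charged for every block it references; the volume
    only stores each distinct block once.
    """
    per_folder = {name: len(blocks) * block_bytes
                  for name, blocks in folders.items()}
    distinct_blocks = {fp for blocks in folders.values() for fp in blocks}
    return per_folder, len(distinct_blocks) * block_bytes


# Three folders holding the same two blocks (for example, three copies of one image).
folders = {"a": ["x", "y"], "b": ["x", "y"], "c": ["x", "y"]}
per_folder, physical = usage_report(folders, block_bytes=1)
print(sum(per_folder.values()), physical)  # 6 2
```

The folders together report three times what the volume physically stores, so per-folder billing is unchanged by deduplication while total billed usage can exceed the size of the volume.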

Q: Deduplication is hurting performance but I am afraid to turn it off because I have a number of deduplicated files and may run out of space. What should I do?

Turning deduplication off affects only new data writes; it has no effect on data already on the system, so existing deduplicated data stays deduplicated and does not suddenly consume more space. As a general rule of thumb, once dedup has ever been enabled on a pool, it is there forever -- the only way to really get rid of it is to move the data off the pool, destroy the pool, create it again (without dedup on) and put the data back.

Q: I made a small change to a large file and now I don't have space to save the file. What happened?

If this was a deduplicated file, the change may have caused the data to be aligned at different block boundaries. Its blocks then no longer match the other copies of the same data, so they have to be stored separately.
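
The alignment effect can be reproduced with a few lines of code. The sketch below is a toy model with assumed parameters (64-byte blocks, SHA-256 fingerprints, synthetic file content) and hypothetical names; it compares an in-place one-byte overwrite with a one-byte insertion at the front of the file.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> set[bytes]:
    """Fingerprint every fixed-size block of data."""
    return {hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)}


BLOCK = 64  # toy block size (assumption)

# 1024 bytes of sample content in which every 64-byte block is distinct.
original = b"".join(i.to_bytes(4, "big") for i in range(256))

overwritten = original[:100] + b"!" + original[101:]  # same length, one byte changed
shifted = b"!" + original                             # one byte inserted at the front

base = block_hashes(original, BLOCK)
print(len(base))                                   # 16
print(len(base & block_hashes(overwritten, BLOCK)))  # 15
print(len(base & block_hashes(shifted, BLOCK)))      # 0
```

Overwriting a byte in place changes only the block that contains it, so 15 of 16 blocks still match. Inserting a byte shifts everything after it, so no block matches and the whole file needs new space.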

Q: Deduplication is on for the folder but off for the volume. What behavior should I expect?

Only data written into the folder will be deduplicated. So if a volume has folders 'a', 'b' and 'c', and only 'a' has deduplication set, then data written to 'b' and 'c' will not be deduplicated, even if a block written there would be a duplicate of a block written already for 'a'.

Q: I think the data I have could be deduplicated. Can I turn on deduplication on a folder and gain space?

No, deduplication is performed only on data written after the property is set. You can, however, create a different folder with deduplication set, copy the data to the new folder, and destroy the original folder. The data in the new folder will be deduplicated.

Q: When to use compression instead of deduplication?

Because of the almost complete lack of penalty for using compression, it is recommended that it always be enabled (this has been the default in NexentaStor since 3.1, but it is also recommended on older versions when creating new folders and zvols).

Deduplication is recommended only in situations where multiple copies of the same data are saved on the volume (the data itself may be non-text). Contact Nexenta Support or Sales for a discussion on whether or not to enable deduplication if you do not have a test lab to try it out for yourself first. Also, compression and deduplication are complementary features -- one should typically enable compression if using dedup.

Q: I like to use the 'copies' property to ensure there are multiple copies of my data. What will happen if I turn deduplication on?

The copies property allows you to specify the number of physical copies of each block of your data that are kept on disk. When copies is set for a deduplicated folder, each deduplicated block follows the copies property and the required number of copies is created. So a deduplicated folder that uses 1 GB of space with copies=1 will use 2 GB of space when copies=2 is set. In other words, 'copies' takes effect AFTER dedup, and continues to function exactly as it does when dedup is off.
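
The arithmetic can be sketched as follows. This is an illustration only: `dedup_ratio` is a hypothetical input (logical size divided by deduplicated size), the function name is made up, and the model simply applies copies after deduplication as described above.

```python
def physical_usage_gb(logical_gb: float, dedup_ratio: float, copies: int) -> float:
    """Space used on disk when 'copies' is applied after deduplication."""
    if copies not in (1, 2, 3):
        raise ValueError("copies must be 1, 2, or 3")
    if dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 1")
    deduplicated_gb = logical_gb / dedup_ratio  # space after dedup, before copies
    return deduplicated_gb * copies


# 3 GB of logical data that deduplicates 3:1 down to 1 GB
print(physical_usage_gb(3.0, 3.0, copies=1))  # 1.0
print(physical_usage_gb(3.0, 3.0, copies=2))  # 2.0
```

With copies=2 the folder uses 2 GB rather than 6 GB, because the extra copies are made of the deduplicated blocks, not of every logical block.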
