2018/10/22 16:37

# Ten Easy Ways to Improve ZFS Performance

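ZFS has very strong self-healing features, and the inherent properties of its background algorithms help you achieve better performance than most RAID controllers and RAID arrays, without the need for expensive hardware controllers. That is why we say ZFS is the industry's first true RAID (Redundant Array of Inexpensive Disks) solution.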

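• File System Performance Basics
• Performance Expectations, Goals and Strategy
• #1: Add Enough RAM
• #2: Add More RAM
• #3: Add Even More RAM to Boost Deduplication Performance
• #4: Use SSDs to Improve Read Performance
• #5: Use SSDs to Improve Write Performance
• #6: Use Mirroring
• #7: Add More Disks
• #8: Leave Enough Free Space
• #9: Hire A ZFS Expert
• #10: Be An Evil Tuner - But Know What You Do
• Bonus: Some Miscellaneous Settings
• Your Turn
• Related Posts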

### File System Performance Basics

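• Bandwidth: Measured in MB/s (or GB/s if you're lucky), this tells you how much data passes through the file system (read or written) per unit of time.
• IOPS: The number of I/O operations per second.

• Sequential: Predictable access to data blocks that are stored contiguously (adjacent to each other).
• Random: Unpredictable, unordered access to data blocks that are hard to read or write contiguously.

• Synchronous writes: The write is considered complete only after the data has been written to stable storage (such as disk). In ZFS, these writes go through the ZFS Intent Log (ZIL). They occur most often on file servers and database servers, and they are the most sensitive to disk latency and IOPS performance.
• Asynchronous writes: The write call returns as soon as the data has been cached in RAM, before it has been committed to disk, so the caller can carry on with other work. This is an easy performance win, but it comes at the expense of reliability. If the system loses power before the data has actually been written to disk, data may be lost, or worse problems can occur, such as the RAID-5 write hole (which leaves the parity of a whole stripe inconsistent, which is why setups with high reliability requirements need expensive battery-backed solutions).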
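
To make the distinction concrete, here is a minimal Python sketch. It is not ZFS-specific, and the function and file names are placeholders chosen for illustration. The first function performs a buffered (asynchronous) write, the second one returns only after `os.fsync` reports that the data has been committed, which is the kind of write that goes through the ZIL on ZFS.

```python
import os
import tempfile


def write_async(path: str, payload: bytes) -> int:
    """Buffered write: returns once the data has been handed to the OS cache."""
    with open(path, "wb") as f:
        return f.write(payload)


def write_sync(path: str, payload: bytes) -> int:
    """Synchronous write: returns only after fsync reports the data as committed."""
    with open(path, "wb") as f:
        written = f.write(payload)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to commit the data to stable storage
        return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = b"x" * 4096
        print(write_async(os.path.join(tmp, "async.bin"), data))
        print(write_sync(os.path.join(tmp, "sync.bin"), data))
```

Timing many small calls to each function on the same pool is a simple way to see how latency-bound synchronous writes are compared to asynchronous ones.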

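• Set realistic expectations: ZFS is great, yes. But you still have to obey the laws of physics. A single 10,000 rpm disk cannot deliver more than 166 random IOPS, because 10,000 revolutions per minute divided by 60 seconds per minute is 166. This means the head can position itself over a random block only 166 times per second. Any more seeks than that, and your reads or writes are not really random. This is how the theoretical maximum random IOPS of a disk is calculated.

• Set performance goals:

• Be systematic: We try this, then we try that, and we measure with cp(1) even though our application is actually a database. Then we tweak knobs here and there, and usually, before we know it, we realize that we really know nothing.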
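
The arithmetic behind the 166 IOPS figure can be written down as a small Python sketch. It uses only the rule of thumb stated above (one random positioning per revolution), and the function name is made up for illustration.

```python
def max_random_iops(rpm: int) -> int:
    """Upper bound from the text: revolutions per minute divided by 60 seconds."""
    return rpm // 60


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000, 15000):
        print(f"{rpm} rpm -> at most {max_random_iops(rpm)} random IOPS")
```

This is only an estimate of the ceiling for a single spindle. It is useful as a sanity check when a benchmark reports far more "random" IOPS than the disks could physically deliver.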

#1: Add Enough RAM

#2: Add More RAM

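ZFS will use every bit of RAM it can find to cache data. It has a very sophisticated caching algorithm that tries to cache both the most recently used and the most frequently used data, adaptively balancing between the two depending on how the data is used. ZFS also has advanced prefetch capabilities that can greatly improve sequential read performance for different kinds of data.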

If you want to influence the balance between user data and metadata in the ZFS ARC cache, check out the primarycache filesystem property that you can set using the zfs(1M) command. For RAM-starved servers with a lot of random reads, it may make sense to restrict the precious RAM cache to metadata and use an L2ARC, explained in tip #4 below.

#3: Add Even More RAM to Boost Deduplication Performance

#4: Use SSDs to Improve Read Performance

SSDs can deliver two orders of magnitude better IOPS than traditional hard disks, and they're much cheaper on a per-GB basis than RAM.
They form an excellent layer of cache between the ZFS RAM-based ARC and the actual disk storage.

You don't need to observe any reliability requirements when configuring L2ARC devices: If they fail, no data is lost because it can always be retrieved from disk.

This means that L2ARC devices can be cheap, but before you start putting USB sticks into your server, you should make sure they deliver a good performance benefit over your rotating disks :).

SSDs come in various sizes: From drop-in-replacements for existing SATA disks in the range of 32GB to the Oracle Sun F20 PCI card with 96GB of flash and built-in SAS controllers (which is one of the secrets behind Oracle Exadata V2's breakthrough performance), to the mighty fast Oracle Sun F5100 flash array (which is the secret behind Oracle's current TPC-C and other world records) with a whopping 1.96TB of pure flash memory and over a million IOPS. Nice!

And since the dedup table is stored in the ZFS ARC and consequently spills off into the L2ARC if available, using SSDs as cache devices will also benefit deduplication performance.
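
As a rough sketch of what adding an L2ARC device looks like, the Python helper below only builds the `zpool add ... cache ...` command line and prints it, without running anything. The pool name `tank` and the device name `c4t0d0` are placeholders, not values from this article.

```python
import shlex
from typing import List


def zpool_add_cache_cmd(pool: str, devices: List[str]) -> List[str]:
    """Build (but do not run) the command that adds cache (L2ARC) devices to a pool."""
    if not devices:
        raise ValueError("at least one cache device is required")
    # Cache devices need no redundancy: if one fails, data is re-read from disk.
    return ["zpool", "add", pool, "cache", *devices]


if __name__ == "__main__":
    # "tank" and "c4t0d0" are hypothetical names used for illustration only.
    print(shlex.join(zpool_add_cache_cmd("tank", ["c4t0d0"])))
```

Review the printed command before running it by hand. Device naming differs between operating systems.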

#5: Use SSDs to Improve Write Performance

Most write performance problems are related to synchronous writes. These are mostly found in file servers and database servers.

With synchronous writes, ZFS needs to wait until each particular I/O is written to stable storage. If that's your disk, it'll need to wait until the rotating rust has spun into the right place, the hard disk's arm has moved to the right position, and finally, until the block has been written. This is mechanical, it's latency-bound, it's slow.

See Roch's excellent article on ZFS NFS performance for a more detailed discussion on this.

SSDs can change the whole game for synchronous writes because they have 100x better latency: No moving parts, no waiting, instant writes, instant performance.

So if you're suffering from a high load in synchronous writes, add SSDs as dedicated ZFS log devices (separate ZIL devices, aka Logzillas) and watch your synchronous writes fly. Check out the zpool(1M) man page under the "Intent Log" section for more details.

Make sure you mirror your ZIL devices: They are there to guarantee the POSIX requirement for "stable storage" so they must function reliably, otherwise data may be lost on power or system failure.

Also, make sure you use high quality SLC Flash Memory devices, because they can give you reliable write transactions. Cheaper MLC cells can damage existing data if the power fails during write operations, something you really don't want.
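
The mirroring advice above can be encoded in a small Python sketch that builds, but does not execute, a `zpool add ... log mirror ...` command and refuses to produce an unmirrored log. The pool and device names are hypothetical placeholders.

```python
import shlex
from typing import List


def zpool_add_mirrored_log_cmd(pool: str, devices: List[str]) -> List[str]:
    """Build (but do not run) the command that adds a mirrored log device to a pool."""
    if len(devices) < 2:
        # The text recommends mirroring log devices, so a single device is rejected.
        raise ValueError("a mirrored log needs at least two devices")
    return ["zpool", "add", pool, "log", "mirror", *devices]


if __name__ == "__main__":
    # "tank", "c4t0d0" and "c5t0d0" are placeholder names for illustration only.
    print(shlex.join(zpool_add_mirrored_log_cmd("tank", ["c4t0d0", "c5t0d0"])))
```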

#6: Use Mirroring

Many people configure their storage for maximum capacity. They just look at how many TB they can get out of their system. After all, storage is expensive, isn't it?

Wrong. Storage capacity is cheap. Every 18 months or so, the same disk only costs half as much, or you can buy double the capacity for the same price, depending on how you view it.

But storage performance can be precious. So why squeeze the last GB out of your storage if capacity is cheap anyway? Wouldn't it make more sense to trade in capacity for speed?

This is what mirroring disks offer as opposed to RAID-Z or RAID-Z2:

• RAID-Z(2) groups several disks into a RAID group, called a vdev. This means that every I/O operation at the file system level is going to be translated into a parallel group of I/O operations to all of the disks in the same vdev.
The result: Each RAID group can only deliver the IOPS performance of a single disk, because the transaction always has to wait until all of the disks in the same vdev are finished.
This is true for both reads and writes: The whole pool can only deliver as many IOPS as the total number of striped vdevs times the IOPS of a single disk.
There are cases where the total bandwidth of RAID-Z can take advantage of the aggregate performance of all drives in parallel, but if you're reading this, you're probably not seeing such a case.
• Mirroring behaves differently: For writes, the rules are the same: Each mirrored pair of disks will deliver the write IOPS of a single disk, because each write transaction will need to wait until it has completed on both disks. But a mirrored pair of disks is a much smaller granularity than your typical RAID-Z set (with up to 10 disks per vdev). For 20 disks, this could be the difference between 10x the IOPS of a disk in the mirror case vs. only 2x the IOPS of a disk in a wide-stripe RAID-Z2 scenario (8+2 disks per RAID-Z2 vdev). A 5x performance difference!
For reads, the difference is even bigger: ZFS will round-robin across all of the disks when reading from mirrors. This will give you 20x the IOPS of a single disk in a 20 disk scenario, but still only 2x if you use wide stripes of the 8+2 kind.
Of course, the numbers can change when using smaller RAID-Z stripes, but the basic rules are the same and the best performance is always achieved with mirroring.
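
The 20-disk comparison above can be reproduced with a short Python sketch. It models only the rules of thumb stated in this tip (write IOPS scale with the number of vdevs, mirror reads scale with the number of disks), and the function name is made up for illustration.

```python
from typing import Tuple


def pool_iops(disks: int, disks_per_vdev: int, disk_iops: int,
              mirrored: bool) -> Tuple[int, int]:
    """Return estimated (read_iops, write_iops) for a pool of identical vdevs."""
    vdevs = disks // disks_per_vdev
    # Writes: each vdev delivers the IOPS of one disk, mirror or RAID-Z alike.
    write_iops = vdevs * disk_iops
    if mirrored:
        # Reads from mirrors are spread round-robin across every disk.
        read_iops = vdevs * disks_per_vdev * disk_iops
    else:
        # RAID-Z reads involve all disks of a vdev, so each vdev acts like one disk.
        read_iops = vdevs * disk_iops
    return read_iops, write_iops


if __name__ == "__main__":
    # disk_iops=1 expresses the result as a multiple of a single disk.
    print("2-way mirrors:", pool_iops(20, 2, 1, mirrored=True))
    print("RAID-Z2 8+2:  ", pool_iops(20, 10, 1, mirrored=False))
```

Passing a measured per-disk figure instead of 1 turns the multiples into absolute estimates. Real pools will deviate from this simple model.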

For a more detailed discussion on this, I highly recommend Richard Elling's post on ZFS RAID recommendations: Space, performance and MTTDL.

Also, there's some more discussion on this in my earlier RAID-GREED-article.

Bottom line: If you want performance, use mirroring.

#7: Add More Disks

Our next tip was already buried inside tip #6: Add more disks. The more vdevs ZFS has to play with, the more shoulders it can place its load on and the faster your storage performance will become.

This works both for increasing IOPS and for increasing bandwidth, and it'll also add to your storage space, so there's nothing to lose by adding more disks to your pool.

But keep in mind that the performance benefit of adding more disks (and of using mirrors instead of RAID-Z(2)) only accelerates aggregate performance. The performance of every single I/O operation is still confined to that of a single disk's I/O performance.

So, adding more disks does not substitute for adding SSDs or RAM, but it'll certainly help aggregate IOPS and bandwidth for the cases where lots of concurrent IOPS and bigger overall bandwidth are needed.

#8: Leave Enough Free Space

Don't wait until your pool is full before adding new disks, though.

ZFS uses copy-on-write, which means that it writes new data into free blocks, and only when the uberblock has been updated does the new state become valid.

This is great for performance because it gives ZFS the opportunity to turn random writes into sequential writes - by choosing the right blocks out of the list of free blocks so they're nicely in order and thus can be written to quickly.

That is, when there are enough blocks.

Because if you don't have enough free blocks in your pool, ZFS will be limited in its choice, and that means it won't be able to choose enough blocks that are in order, and hence it won't be able to create an optimal set of sequential writes, which will impact write performance.

As a rule of thumb, don't let your pool become more full than about 80% of its capacity. Once it reaches that point, you should start adding more disks so ZFS has enough free blocks to choose from in sequential write order.
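
A minimal Python sketch of this rule of thumb follows. The 80% threshold comes from the text. The function names and the sample sizes are assumptions made for illustration, and the allocated and total sizes would have to come from your own monitoring.

```python
TIB = 1024 ** 4


def pool_fill_ratio(allocated_bytes: int, size_bytes: int) -> float:
    """Fraction of the pool that is already allocated."""
    if size_bytes <= 0:
        raise ValueError("pool size must be positive")
    return allocated_bytes / size_bytes


def needs_more_disks(allocated_bytes: int, size_bytes: int,
                     threshold: float = 0.80) -> bool:
    """True once the pool has reached the fill level at which to add disks."""
    return pool_fill_ratio(allocated_bytes, size_bytes) >= threshold


if __name__ == "__main__":
    # Hypothetical 10 TiB pool at two different fill levels.
    for used in (7 * TIB, 9 * TIB):
        print(f"{used / TIB:.0f} TiB used:", needs_more_disks(used, 10 * TIB))
```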

#9: Hire A ZFS Expert

There's a reason why this point comes up almost last: In the vast majority of ZFS performance cases, one or more of #1-#8 above is the solution.

And they're cheaper than hiring a ZFS performance expert who will likely tell you to add more RAM, or add SSDs or switch from RAID-Z to mirroring after looking at your configuration for a couple of minutes anyway!

But sometimes, a performance problem can be really tricky. You may think it's a storage performance problem, but instead your application may be suffering from an entirely different effect.

Or maybe there are some complex dependencies going on, or some other unusual interaction between CPUs, memory, networking, I/O and storage.

Or perhaps you're hitting a bug or some other strange phenomenon?

So, if all else fails and none of the above options seem to help, contact your favorite Oracle/Sun representative (or send me a mail) and ask for a performance workshop quote.
If your performance problem is really that hard, we want to know about it.

#10: Be An Evil Tuner - But Know What You Do

If you don't want to go for option #9 and if you know what you're doing, you can check out the ZFS Evil Tuning Guide.

There's a reason it's called "evil": ZFS is not supposed to be tuned. The default values are almost always the right values, and most of the time, changing them won't help, unless you really know what you're doing. So, handle with care.

Still, when people encounter a ZFS performance problem, they tend to Google "ZFS tuning", then they'll find the Evil Tuning Guide, then think that performance is just a matter of setting that magic variable in /etc/system.

This is simply not true.

Measuring performance in a standardized way, setting goals, then sticking to them helps. Adding RAM helps. Using SSDs helps. Thinking about the right number and RAID level of disks helps. Letting ZFS breathe helps.

But tuning kernel parameters is reserved for very special cases, and then you're probably much better off hiring an expert to help you do that correctly.

Bonus: Some Miscellaneous Settings

If you look through the zfs(1M) man page, you'll notice a few performance related properties you can set.
They're not general cures for all performance problems (otherwise they'd be set by default), but they can help in specific situations. Here are a few:

• atime: This property controls whether ZFS records the time of last access for reads. Switching this to off will save you extra write IOs when reading data. This can have a big impact if your application doesn't care about the time of last access for a file and if you have a lot of small files that need to be read frequently.
• checksum and compression can be double-edged swords: The stronger the checksum, the better your data is protected against corruption (and this is even more important when using dedup). But a stronger checksum method will incur some more load on the CPU for both reading and writing.
Similarly, using compression may save a lot of IOPS if the data can be compressed well, but may be in the way for data that isn't easily compressed. Again, compression costs some extra CPU time.
Keep an eye on CPU load while running tests and if you find that your CPU is under heavy load, you might need to tweak one of these.
• recordsize: Don't change this property unless you're running a database in this filesystem. ZFS automatically figures out what the best blocksize is for your filesystems.
In case you're running a database (where the file may be big, but the access pattern is always in fixed-size chunks), setting this property to your database record size may help performance a lot.
• primarycache and secondarycache: We already introduced the primarycache property in tip #2 above. It controls whether your precious RAM cache should be used for metadata or for both metadata and user data. In cases where you have an SSD configured as a cache device and if you're using a large filesystem, it may help to set primarycache=metadata so the RAM is used for metadata only.
secondarycache does the same for cache devices, but it should be used to cache metadata only in cases where you have really big file systems and almost no real benefit from caching data.
• logbias: When executing synchronous writes, there's a tradeoff to be made: Do you want to wait a little, so you can accumulate more synchronous write requests to be written into the log at once, or do you want to service each individual synchronous write as fast as possible, at the expense of throughput?
This property lets you decide which side of the tradeoff you want to favor.
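
As an illustration of how these properties are set, the Python sketch below builds `zfs set property=value dataset` command lines without running them. The dataset names (`tank/home`, `tank/db`) and the 8K record size are hypothetical examples, and the value check covers only the four properties whose accepted values are listed in the table.

```python
import shlex
from typing import Dict, List, Set

# Accepted values for the properties discussed above that take a fixed set of values.
ALLOWED_VALUES: Dict[str, Set[str]] = {
    "atime": {"on", "off"},
    "primarycache": {"all", "none", "metadata"},
    "secondarycache": {"all", "none", "metadata"},
    "logbias": {"latency", "throughput"},
}


def zfs_set_cmd(dataset: str, prop: str, value: str) -> List[str]:
    """Build (but do not run) a 'zfs set' command for one property."""
    allowed = ALLOWED_VALUES.get(prop)
    if allowed is not None and value not in allowed:
        raise ValueError(f"{prop} must be one of {sorted(allowed)}")
    # Properties not in the table (e.g. recordsize) are passed through unchecked.
    return ["zfs", "set", f"{prop}={value}", dataset]


if __name__ == "__main__":
    # Dataset names and the record size are placeholders for illustration only.
    print(shlex.join(zfs_set_cmd("tank/home", "atime", "off")))
    print(shlex.join(zfs_set_cmd("tank/db", "recordsize", "8K")))
    print(shlex.join(zfs_set_cmd("tank/db", "logbias", "throughput")))
```

Changing one property at a time and measuring before and after keeps it clear which setting actually made a difference.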

Sorry for the long article. I hope the table of contents at the beginning makes it more digestible, and I hope it's useful to you as a little checklist for ZFS performance planning and for dealing with ZFS performance problems.

Let me know if you want me to split up longer articles like these (though this one is really meant to remain together).

Now it's your turn: What is your experience with ZFS performance? What options from the above list did you implement for what kind of application/problem and what were your results? What helped and what didn't and what are your own ZFS performance secrets?

Related Posts
