~miguelbernadi

OpenZFS Dev Summit 2020

October 6-7, 2020, originally planned for San Francisco but held via Zoom. This is a conference about the development of OpenZFS, the Open Source implementation of ZFS, which was originally developed by Sun Microsystems and whose proprietary descendant is now owned by Oracle.

EDIT 2020-10-13: Official (trimmed) videos and slides published

I have been interested in OpenZFS for a while. It is a very advanced filesystem with many interesting properties. I haven’t had to work with it, but as I learned more about Illumos and related systems my interest grew.

I’ve recently started using it on home systems: I currently have a small NAS with a 2-disk mirror, used mostly for backups and light read-only NFS access to media, plus my laptop’s main drive. It has been a pretty comfortable experience, and since I don’t ask much of these systems I haven’t had any problems yet.
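
For reference, a setup like that small mirror can be sketched in a couple of commands; the pool, dataset and device names here are hypothetical:

```sh
# Create a 2-disk mirrored pool and a read-only dataset shared over NFS
zpool create tank mirror /dev/sda /dev/sdb
zfs create -o sharenfs=on -o readonly=on tank/media
```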

As a next step into this world I’m trying to keep up with some of the events, and though I won’t attend a conference (yet), I’m inclined to review the talks :-).

Here is a brief introductory overview of what it is and a bit of its history, in case you hadn’t heard of it before. Otherwise, I found this collection of blog posts very informative and easy to read: Aaron Toponce’s Guide to ZFS

The conference spanned two days, the first full of talks and the second mostly an open hackathon; the talks on the second day were hackathon topic suggestions. Here is the conference agenda (and slides).

First day video (~7h, youtube):

  • State of OpenZFS - Matt Ahrens (52:09 - 1:03:38)
  • ZFS Caching: How Big Is the ARC? - George Wilson (1:03:45 - 1:19:15 + Q&A)
  • Persistent L2ARC - George Amanakis (1:28:26 - 1:39:23 + Q&A)
  • ZIL Performance Improvements for Fast Media - Saji Nair (2:09:44 - 2:47:43 + Q&A)
  • Sequential Reconstruction - Mark Maybee (3:06:53 - 3:21:28 + Q&A)
  • dRAID, Finally! - Mark Maybee (3:26:32 - 3:59:06 + Q&A)
  • Send/Receive Performance Enhancements - Matt Ahrens (5:07:40 - 5:42:32 + Q&A)
  • Improving “zfs diff” performance with reverse-name lookup - Sanjeev Bagewadi & David Chen (6:05:42 - 6:25:56)
  • Performance Troubleshooting Tools - Gaurav Kumar (6:27:34 - 6:48:56)

Second day video (~2h, hackathon pitches, youtube):

  • File Cloning with Block Reference Table - Pawel Dawidek (50:42 - 1:08:33 + Q&A)
  • ZFS + S3 Layer - Muhammad Ahmad (1:26:40 - 1:32:51)

Talks #

Here I have a brief summary of the talks and any interesting points worth sharing. If you want more details, go watch the videos.

State of OpenZFS - Matt Ahrens #

This was the welcome to the conference, including details on how the sessions would be conducted. At the beginning it also goes briefly over the current state of the project and the plans for the near future.

OpenZFS 2.0 is still scheduled for release in 2020, possibly by the end of October. This version unifies the codebase for Linux and FreeBSD and brings several improvements over ZoL 0.8 (a couple of them are sketched after the list):

  • Fast clone deletion
  • Log Spacemap
  • Metaslab performance
  • send/receive performance improvements
  • Redacted send/receive
  • zpool wait
  • Persistent L2ARC
  • Sequential resilvering
  • zstd compression
  • sectional zfs/zpool manpages
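
To give a flavour of two of the user-facing additions, zstd compression and zpool wait can be used roughly as below; the pool and dataset names are made up.

```sh
# Enable zstd compression on a dataset (new in OpenZFS 2.0)
zfs set compression=zstd tank/data

# Block until any resilver running on the pool completes (also new in 2.0)
zpool wait -t resilver tank
```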

In the future OpenZFS expects to make a release every year starting with OpenZFS 3.0, and there’s ongoing work to add OSX support to the unified codebase.

ZFS Caching: How big is the ARC? - George Wilson #

The Adaptive Replacement Cache (ARC) is the in-memory caching layer of ZFS. Delphix has migrated its product from Illumos to Linux and found that on Linux the ARC is limited to half the RAM by default, while on Illumos it can take up most of the RAM. This resulted in a smaller cache on Linux systems.

Increasing the size of the ARC to a setting closer to Illumos or FreeBSD exposed some issues around the ARC and memory reclamation that are specific to Linux. The result has been a set of changes that make the ARC behave better on Linux, but there is still more work to do before the ARC size is a non-issue for any user.
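
On Linux the current ARC size and its cap can be inspected and tuned through the zfs kernel module; a quick sketch (the 16 GiB value is only an example):

```sh
# Current ARC size and maximum target, in bytes (raw kstat lines)
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# Raise the ARC cap to 16 GiB (value is in bytes; tune it to your system's RAM)
echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```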

Persistent L2ARC - George Amanakis #

The audio changes volume continuously, making it very hard to follow.

The L2ARC (Level-2 ARC) is a mechanism to spill the ARC onto a dedicated device when it no longer fits in memory, so cached data can be kept around for longer. Until this effort, the contents of an L2ARC device were lost on every reboot. The talk goes into some detail on the design and performance characteristics of the new feature.
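
In practice an L2ARC is a cache vdev added to the pool, and with this work its contents now survive a reboot; a small sketch with a hypothetical device name:

```sh
# Add an NVMe device as an L2ARC (cache vdev) to an existing pool
zpool add tank cache /dev/nvme0n1

# Persistence across reboots is governed by a module parameter (enabled by default in 2.0)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
```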

ZIL Performance Improvements for Fast Media - Saji Nair #

The ZFS Intent Log (ZIL) is a logging mechanism that records synchronous writes so they can be acknowledged before the data reaches its final place on disk as part of a transaction group. It can be stored together with the data (the default) or on a dedicated device (Separate Intent Log, SLOG).
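
For context, a SLOG is simply a log vdev attached to the pool; a minimal sketch with made-up device and dataset names:

```sh
# Dedicate a (mirrored) pair of fast devices to the ZIL as a SLOG
zpool add tank log mirror /dev/nvme0n2 /dev/nvme0n3

# Only synchronous writes go through the ZIL; this forces all writes on a dataset through it
zfs set sync=always tank/db
```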

This talk goes over how the ZIL works and the bottlenecks that appear when fast devices (such as NVMe disks) are used for the SLOG. It also explains a proposed redesign of this component that improves performance on such systems for some workloads. The implementation of the new design is still incomplete.

If you want some background on the ZIL, the SLOG and the differences between both, I recommend you check this article.

Sequential Reconstruction - Mark Maybee #

From the title alone I didn’t really know what to expect, but the first 5 minutes are devoted to explaining the mixed terminology ZFS uses to refer to similar but different ways of recovering redundancy after a disk has been lost.

They have introduced two new terms to differentiate the approaches: the traditional resilver is called a Healing Resilver from now on, and the new method a Sequential Resilver.

A Healing Resilver works on any redundant layout (mirrors and raidz) and is very efficient for small outages, as it traverses the disks based on block timestamps, but it becomes very slow on aged pools. A Sequential Resilver instead traverses the disk based on the allocated regions recorded in the space allocation maps, which makes it much faster, but it cannot verify the contents of the blocks along the way and does not work with raidz setups.
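
In OpenZFS 2.0 the sequential behaviour is requested with the -s flag when replacing or attaching a device; a sketch with hypothetical pool and device names:

```sh
# Replace a failed disk using a sequential (non-healing) resilver
zpool replace -s tank /dev/sdb /dev/sdc

# Checksums are not verified during the rebuild, so keep an eye on the follow-up scrub
zpool status tank
```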

dRAID, Finally! - Mark Maybee #

dRAID is something I hadn’t heard about either, but the idea seems to have been going around since 2015.

This feature introduces a new kind of vdev, called draid, that changes how data is distributed among the disks, allowing all spindles (including spare disks) to be used in most operations most of the time. Another benefit is that it allows using Sequential Resilver for these devices, which is much faster.
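
Creating such a pool uses the new top-level vdev type; the layout string below (double parity, 4 data disks per redundancy group, 1 distributed spare across 11 children) is only an illustration following the syntax documented for OpenZFS 2.1:

```sh
# Hypothetical 11-disk dRAID layout: draid<parity>:<data>d:<spares>s:<children>c
zpool create tank draid2:4d:1s:11c /dev/sd[b-l]
```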

The talk goes into detail on what had to change in the initial implementation to get it upstreamed, the improvements made along the way, and the limitations of the feature.

It’s targeted for release with OpenZFS 2.1, either at the very end of 2020 or early 2021.

Send/Receive Performance Enhancements - Matt Ahrens #

The send/receive mechanism allows sending snapshots of a filesystem over the network to a separate system. At the Dev Summit in 2015 there was a talk on how this works and why it’s fast, but it turns out it only saturates the hardware for big block sizes. The goal here was to double the throughput when transmitting 4k blocks (a typical size after compression).
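
As a refresher on the mechanism itself (not on the new optimizations), incremental replication looks roughly like this, with made-up dataset and host names:

```sh
# Take a new snapshot and send only the delta since the previous one to another machine
zfs snapshot tank/data@monday
zfs send -i tank/data@sunday tank/data@monday | ssh backuphost zfs receive -u backup/data
```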

The talk walks through the performance investigation of how send and receive actually work and where the improvements were made. The results are impressive, almost doubling the throughput for the 4k block size and also improving other block sizes.

Part of the changes have already been merged into OpenZFS 2.0 and the rest is currently being upstreamed. One of the last changes on the receive side requires a change in the stream format, so it will take longer to land, but it should be fairly minor.

Improving “zfs diff” performance with reverse-name lookup - Sanjeev Bagewadi & David Chen #

zfs diff is used to show which files or paths have changed between snapshots. When many files have changed, the diff can take a long time to compute due to a huge number of repeated linear lookups.
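
For reference, the command compares two snapshots (or a snapshot and the live filesystem); the dataset and snapshot names below are made up:

```sh
# Output lines are prefixed with -, +, M or R for removed, created, modified or renamed paths
zfs diff tank/home@before tank/home@after
```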

The effort described in the talk introduces some metadata and caching to replace the linear searches with constant-time lookups and to reduce duplicate work.

Performance Troubleshooting Tools - Gaurav Kumar #

This talk goes over the ZFS metrics and logs available to system administrators for determining whether there are issues in the system and identifying them. It starts with the tools provided by ZFS itself, with a few pointers on how to read their output and what it means, and then moves on to case studies showing full investigations and their outcomes.
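
The stock tooling such investigations start from is roughly the usual set of ZFS commands; a few examples with a hypothetical pool name (these are the standard tools, not necessarily the exact ones shown in the talk):

```sh
# Per-vdev bandwidth and IOPS, refreshed every 5 seconds
zpool iostat -v tank 5

# ARC hit/miss rates over time (arcstat ships with OpenZFS)
arcstat 5

# Device errors and scrub/resilver state
zpool status -v tank
```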

The last case study goes over some issues with an NFS-shared pool, which I found very interesting, as the topic is quite complex.

Hackathon pitches #

There were several hackathon pitches; I have gathered only the ones that used slides. If you are interested in the results you can check the hackathon presentations.

File Cloning with Block Reference Table - Pawel Dawidek #

This pitch presents a new feature proposal, discussing the design and benefits, as there is only a prototype so far. The goal is to provide file cloning, so that modifying one copy doesn’t modify the other.

A benefit of this feature is the space saved when cloning files or recovering them from snapshots, and clone operations are much faster since the data itself is never touched. Its outcome is similar to the deduplication feature, just in a manual fashion that has fewer requirements, but the clones also can’t be replicated to other systems.

ZFS + S3 Layer - Muhammad Ahmad #

This pitch presents the idea of using ZFS as the backend for MinIO servers, an object storage service with the same API as S3. The goal is to set up ZFS as a backend and compare its performance against two different setups to determine what, if anything, needs tuning on ZFS’s side.
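
As an illustration of what such a setup might look like (the dataset name, mountpoint and options are my own guesses, not from the talk):

```sh
# Back a MinIO server with a dedicated ZFS dataset
zfs create -o mountpoint=/srv/minio -o compression=lz4 tank/minio
minio server /srv/minio
```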

COMMENTS

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~miguelbernadi/public-inbox@lists.sr.ht [mailing list etiquette], or see existing discussions.