How memory works in ONTAP: Write Allocation, Tetris, Summary (Part 3)

Write Allocation

NetApp has changed this part of the WAFL architecture significantly since its first release. Growing performance demands, multi-core CPUs and new types of media were the forces that drove continued improvement of Write Allocation.

Each thread can be processed separately, depending on data type, read/write pattern, data block size, the media it will be written to and other characteristics. Based on that, WAFL decides how and where data will be placed and what type of media it should use.

For example, with flash media it is important to write data in units the media can operate on natively, so that cells do not wear out prematurely and are utilized evenly. Another example is separating data from metadata and storing them on different tiers of storage. In reality, how ONTAP optimizes data placement in this module is a very big topic that deserves a separate discussion.
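To make the idea more concrete, here is a minimal sketch, not ONTAP code, of what a placement decision like this could look like. All names and thresholds are hypothetical; they only illustrate that the allocator looks at data characteristics and picks a target tier and an allocation unit suited to that media.

from dataclasses import dataclass

@dataclass
class WriteRequest:
    is_metadata: bool
    size_bytes: int
    is_sequential: bool

def choose_placement(req: WriteRequest, has_flash_tier: bool) -> dict:
    """Pick a media tier and allocation unit for an incoming write (illustrative only)."""
    if has_flash_tier and (req.is_metadata or not req.is_sequential):
        # Small/random data and metadata go to flash; write in units the
        # flash media handles efficiently so cell wear stays even.
        return {"tier": "flash", "allocation_unit": 4096}
    # Large sequential data is laid out on capacity media in bigger chunks.
    return {"tier": "hdd", "allocation_unit": 64 * 1024}

print(choose_placement(WriteRequest(is_metadata=False, size_bytes=8192,
                                    is_sequential=False), has_flash_tier=True))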

Affinities

The growing number of CPU cores forced NetApp to develop new algorithms for process parallelization over time. Volume affinities (and other affinities) are algorithms that allow multiple processes to execute in parallel without blocking each other, though sometimes a serial process that blocks the others is still necessary. For example, when two hosts work with two different volumes on the same storage system, they can write and modify data in parallel. If those two hosts start writing to a single volume, ONTAP acts as a broker: it serializes access to that volume and gives access to one host at a time, and only when that host is done does it give access to the other. Write operations are always executed on a single core; otherwise, because each core can be loaded unequally, you could end up corrupting your data. Each generation of these affinities reduced the locking of other processes, increased parallelization and improved multi-core CPU utilization (see the sketch after the list below). Volume affinities are called "Waffinity" because each volume is a separate WAFL file system, so the words WAFL and affinity were combined:

  • Classical Waffinity (2006)
  • Hierarchical Waffinity (2011)
  • Hybrid Waffinity (2016)
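As a rough analogy only (assuming nothing about ONTAP internals), the core affinity idea can be sketched with per-volume serialization: writes to different volumes may run in parallel, while writes to the same volume are serialized.

import threading

class Volume:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()   # serializes writers to this volume
        self.blocks = {}

    def write(self, offset, data):
        with self.lock:                # one writer per volume at a time
            self.blocks[offset] = data

vol_a, vol_b = Volume("vol_a"), Volume("vol_b")

# Two hosts writing to two different volumes do not block each other.
t1 = threading.Thread(target=vol_a.write, args=(0, b"host1"))
t2 = threading.Thread(target=vol_b.write, args=(0, b"host2"))
t1.start(); t2.start(); t1.join(); t2.join()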

FlexGroup

If, for instance, ONTAP needs to work at the aggregate level, the whole set of volumes living on that aggregate stops receiving write operations for some time, because aggregate operations lock volume operations. This is just one example of the locking mechanisms NetApp has been improving throughout ONTAP's lifespan. What would be the most natural solution in such a case? To have multiple volumes, multiple aggregates and even multiple nodes; but then, instead of a single bucket of space, you end up with many volumes spread across multiple nodes and aggregates. That is where FlexGroup enters the picture: it joins all the constituent volumes into a single logical space visible to clients as a single volume or file share. Before FlexGroup, ONTAP was already well optimized for workloads with small random blocks and even for sequential reads; now, thanks to FlexGroup, ONTAP is also optimized for sequential operations and benefits mainly from sequential writes.
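Here is a minimal sketch of the FlexGroup idea only: many constituent volumes spread across nodes and aggregates are presented as one namespace, and each new file is placed on one of the constituents. The names and the simple hash-based spread below are hypothetical; the real placement heuristics are more sophisticated.

import hashlib

class FlexGroup:
    def __init__(self, constituents):
        self.constituents = constituents       # e.g. member volumes on different nodes/aggregates
        self.files = {}                        # path -> constituent volume

    def place_file(self, path):
        """Pick a constituent for a new file; here a simple hash spread for illustration."""
        idx = int(hashlib.md5(path.encode()).hexdigest(), 16) % len(self.constituents)
        self.files[path] = self.constituents[idx]
        return self.files[path]

fg = FlexGroup(["node1_aggr1_vol", "node1_aggr2_vol",
                "node2_aggr1_vol", "node2_aggr2_vol"])
print(fg.place_file("/dir/file1"), fg.place_file("/dir/file2"))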


RAID

From the WAFL module, data is handed to the RAID module, which processes it and writes it to the disks in single transactions (full stripes), including parity data written to the parity drives.

Because data is written to the disks in full stripes, there is no need to read data back to recalculate parity: the stripe, including parity, is prepared in memory by the RAID module. That is also why, in practice, the parity drives are always less utilized than the data drives, unlike with traditional RAID-4 and RAID-6. Placing new data in new, empty space and simply moving the metadata pointers avoids re-writing data in place, so the system does not have to read a stripe back into memory and recalculate parity after a single block change, and system memory is used more rationally. More about RAID-DP can be found in TR-3298.
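A minimal sketch of why full-stripe writes make parity cheap: with the whole stripe in memory, parity is just an XOR over the data blocks, so nothing has to be read back from disk, unlike the read-modify-write of a single block in a traditional RAID-4/6 update. This shows only single (row) parity; RAID-DP additionally computes diagonal parity for the second parity drive.

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks into one parity block."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

stripe = [b"\x01" * 4096, b"\x02" * 4096, b"\x03" * 4096]  # data blocks of one stripe
parity = xor_blocks(stripe)   # computed entirely in memory before the write
assert xor_blocks(stripe + [parity]) == b"\x00" * 4096     # parity checks out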


Tetris and IO redirection

Tetris is a WAFL mechanism for write and read optimization. Tetris collects data during each CP and assembles it into chains, so blocks from a single host are written as bigger contiguous chunks (this process is also known as IO redirection). On the other hand, this simple logic also enables predictive reads, because there is practically no difference between reading, say, 5KB or 8KB of data, or 13KB or 16KB. A predictive read is a form of read caching, and when the time comes to decide what data should be read in advance, the answer comes naturally: the data most likely to be read right after the first part is the data that was written together with it. In other words, the decision about which extra data to read ahead has effectively already been made at write time.
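A minimal sketch of the chaining idea (not ONTAP's actual structures): block numbers written during a CP are grouped into runs of consecutive blocks, so they can go to disk as larger sequential chunks and later be read back ahead as one range.

def build_chains(block_numbers):
    """Group a set of 4KB block numbers into runs of consecutive blocks."""
    chains, run = [], []
    for bn in sorted(set(block_numbers)):
        if run and bn != run[-1] + 1:
            chains.append((run[0], len(run)))   # (start block, chain length)
            run = []
        run.append(bn)
    if run:
        chains.append((run[0], len(run)))
    return chains

# Blocks written by one host during a CP, arriving out of order:
print(build_chains([10, 12, 11, 30, 31, 7]))   # [(7, 1), (10, 3), (30, 2)]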


Read Cache

MBUF is used for both read and write caching, and all reads, without exception, are copied into the cache. From the same cache, hosts can read data that has just been written. When the CPU cannot find the data a host needs in the system cache, it looks for it in a second-level cache, if available, and only then on disk. Once data is found on disk, the system places it into MBUF. If that piece of data is not accessed for a while, the CPU can de-stage it to a second memory tier such as Flash Pool or Flash Cache.

It is important to note that the system evicts unaccessed data from the cache very granularly, at the level of individual 4KB blocks. Another important point is that the cache in ONTAP systems is deduplication-aware: if a block already exists in MBUF or in the second-tier cache, it will not be copied there again.
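Here is a minimal sketch of a deduplication-aware cache, purely as an illustration of the concept rather than ONTAP's actual implementation: blocks are keyed by a content fingerprint, so a block that is already cached is never stored twice, and the least-recently-used 4KB entries are evicted first.

import hashlib
from collections import OrderedDict

class DedupCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()            # fingerprint -> 4KB block

    def insert(self, block: bytes):
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.blocks:
            self.blocks.move_to_end(fp)        # already cached: just mark it as recently used
            return fp
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)    # evict the least-recently-used block
        self.blocks[fp] = block
        return fp

cache = DedupCache(capacity_blocks=2)
a = cache.insert(b"\x00" * 4096)
b = cache.insert(b"\x00" * 4096)               # duplicate block: no second copy is stored
assert a == b and len(cache.blocks) == 1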

Why is NVRAM/NVMEM size not always important?

In NetApp FAS/AFF and other ONTAP-based systems, NVRAM/NVMEM is used exclusively for NVLOGs, not as a write buffer, so it does not have to be as big as the system-memory-sized write caches found in other storage systems.

ONTAP NVRAM vs competitors

Battery and boot flash drive

As I mentioned before, hardware systems like FAS & AFF have a battery installed in each controller to power NVRAM/NVMEM and keep the data in memory for up to 72 hours. Right after a controller loses power, data from that memory is destaged to the boot flash drive. Once ONTAP boots again, the data is restored.
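As a purely illustrative sketch of that destage/restore sequence (the file name, format and helpers below are made up, not ONTAP's): on power loss the battery keeps the memory alive long enough to copy its contents to the boot flash device, and on the next boot the saved log is read back so it can be replayed.

import json, pathlib

DESTAGE_FILE = pathlib.Path("/tmp/nvlog_destage.json")   # hypothetical location on "boot flash"

def destage(nvlog_entries):
    """Copy the in-memory NVLOG to persistent boot flash on power loss."""
    DESTAGE_FILE.write_text(json.dumps(nvlog_entries))

def restore():
    """Read the saved NVLOG back after boot so it can be replayed."""
    return json.loads(DESTAGE_FILE.read_text()) if DESTAGE_FILE.exists() else []

destage([{"op": "WRITE", "volume": "vol1", "offset": 0, "len": 4096}])
print(restore())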

Flash media and WAFL

As I mentioned in previous articles in this series, ONTAP always writes to new space for a number of architectural reasons. It turns out that flash media needs precisely that. Though some people predicted WAFL's death because of flash media, WAFL works on that media quite well: the "always write to new space" technique not only optimizes garbage collection and wears flash memory cells evenly, but also delivers quite competitive performance.
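A minimal sketch of the effect "always write to new space" has on flash wear: new data lands on free blocks instead of being overwritten in place, so program/erase cycles spread across cells rather than hammering the same ones. The class and counters below are hypothetical and only illustrate the idea.

class FlashLayout:
    def __init__(self, n_blocks):
        self.writes_per_block = [0] * n_blocks
        self.next_free = 0

    def write_anywhere(self, data):
        """Append-style allocation: each write lands on the next free block."""
        blk = self.next_free % len(self.writes_per_block)
        self.writes_per_block[blk] += 1
        self.next_free += 1
        return blk

layout = FlashLayout(n_blocks=8)
for i in range(16):                      # 16 logical overwrites of "the same" data
    layout.write_anywhere(b"x")
print(layout.writes_per_block)           # wear is spread evenly: [2, 2, ..., 2]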

Summary

System memory is architecturally built not just to optimize performance, but also to offer high data protection, availability and write optimization. Rich ONTAP functionality, the unique way NVRAM/NVMEM is used and a rich software integration ecosystem qualitatively differentiate NetApp storage systems from others.

Continue to read

NVRAM/NVMEM, MBUF & CP (Part 1)
NVLOG & Write-Through (Part 2)

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. This information might therefore be outdated, or I might simply be wrong in some aspects and details. I would greatly appreciate any contribution to make this article better; please leave your ideas and suggestions on this topic in the comments below.