In this article, I’d like to describe how NVRAM, cache, and system memory work in ONTAP systems.
System memory in any AFF/FAS system (and other OEM systems running ONTAP) consists of two types of memory: ordinary (volatile) memory and non-volatile memory (hence the “NV” prefix) backed by a battery on the controller. Some systems use NVRAM and others use NVMEM; both serve the same purpose. The only difference is that NVMEM is installed on the memory bus, while NVRAM is a small board with memory chips on it, connected to the PCIe bus. ONTAP uses either NVRAM or NVMEM to store NVLOGs. Ordinary memory is used for system needs and mostly as the Memory Buffer (MBUF), in other words, the cache.
In a normally functioning storage system, data de-staging happens from the MBUF, not from NVRAM/NVMEM. However, the trigger for de-staging might be NVRAM/NVMEM becoming half full, among others; see the CP triggers section below.
NVRAM/NVMEM & NVLOGs
Data from hosts is always placed into the MBUF first. From the MBUF, the data is then copied to NVMEM/NVRAM with a Direct Memory Access (DMA) request, exactly as the host sent it, in the unchanged form of logs, much like a database log or a journaling file system. DMA does not consume CPU cycles, which is why NetApp calls this a hardware-journaled file system. As soon as the data is confirmed to be in NVRAM/NVMEM, the system sends a write acknowledgment to the host. Once a Consistency Point (CP) occurs, the data is de-staged from the MBUF to the disks, and the system clears NVRAM/NVMEM without using the NVLOGs at all. In other words, ONTAP does not use NVLOGs in a normally functioning High Availability (HA) system: it erases the logs each time a CP occurs and then writes new NVLOGs. NVLOGs are used only in special events or configurations; for instance, ONTAP uses them to restore data back into the MBUF exactly “as it was” after an unexpected reboot.
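The write path described above can be sketched in a few lines of Python. This is purely illustrative (not real ONTAP code, and all names here are hypothetical): a write lands in the MBUF, is journaled unchanged as an NVLOG entry, and only then acknowledged; at a CP the MBUF contents are destaged to disk and the NVLOG is cleared without ever being replayed in the normal case.

```python
class Node:
    def __init__(self):
        self.mbuf = []      # volatile memory buffer (cache)
        self.nvlog = []     # battery-backed NVRAM/NVMEM journal
        self.disk = []      # persistent storage

    def host_write(self, data):
        self.mbuf.append(data)      # 1. write lands in MBUF first
        self.nvlog.append(data)     # 2. DMA copy to NVLOG, in unchanged form
        return "ACK"                # 3. ack sent only after the NVLOG copy

    def consistency_point(self):
        self.disk.extend(self.mbuf)  # 4. destage from MBUF, not from NVLOG
        self.mbuf.clear()
        self.nvlog.clear()           # 5. journal discarded, never replayed

    def recover_after_crash(self):
        # Only after an unexpected reboot are NVLOGs replayed back into MBUF.
        self.mbuf = list(self.nvlog)

node = Node()
assert node.host_write("block-A") == "ACK"
node.consistency_point()
print(node.disk)  # the block reached disk via the MBUF; NVLOG stayed unused
```

Note that in the happy path the journal is written and erased without ever being read, exactly as the text describes.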
Memory Buffer: Write operations
The first place all writes land is always the MBUF. From the MBUF, the data is copied to NVRAM/NVMEM with a DMA call; after that, the WAFL module allocates a range of blocks where the data from the MBUF will be written. This is simply called Write Allocation. It might sound simple, but it is a big deal, and it is constantly being optimized by NetApp. However, just before the system allocates space for your data, it plays Tetris! Yes, the same kind of puzzle-matching Tetris game you might have played in your childhood. One of Write Allocation’s jobs is to make sure all the Tetris data is written to the disks in one unbroken sequence whenever possible.
WAFL also performs additional data optimizations depending on the type of the data, where it goes, what type of media it will be written to, and so on. After the WAFL module gets acknowledgment from NVRAM/NVMEM that the data is secured, the RAID module processes the data from the MBUF, adds checksums to each block (known as block/zone checksums), and calculates and writes parity data to the parity disks. It is important to note that some data in the MBUF contains commands which can be “expanded” before they are delivered to the RAID module; for example, a command can ask the storage system to format some space with a predefined repeating pattern, or to move chunks of data around. Such commands might consume only a small amount of space in NVRAM/NVMEM but generate a large amount of data when executed.
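To make the parity step above concrete, here is a minimal sketch of what a RAID module conceptually does with a stripe: compute a parity block over the data blocks so that any single lost block can be rebuilt. Real ONTAP RAID 4 / RAID-DP is far more involved (double parity, checksum zones); this only illustrates the single-parity idea.

```python
def xor_parity(blocks):
    """XOR all data blocks together to produce the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild(surviving_blocks, parity):
    """Recover a lost block: XOR of the parity and the surviving blocks."""
    return xor_parity(list(surviving_blocks) + [parity])

stripe = [b"\x01\x02", b"\x0f\x00", b"\xff\xaa"]
p = xor_parity(stripe)
# Lose stripe[1], then rebuild it from the rest of the stripe plus parity:
assert rebuild([stripe[0], stripe[2]], p) == stripe[1]
```

The same XOR trick underlies why a single parity disk protects a whole RAID 4 group against one drive failure.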
NVMEM/NVRAM in HA
Each HA pair running ONTAP consists of two nodes (controllers), and each node holds a copy of the data from its HA partner. This architecture allows hosts to be switched to the surviving node and continue to be served without noticeable disruption.
To be more precise, each NVRAM/NVMEM is divided into two pieces: one to store NVLOGs from the local controller and another to store a copy of the NVLOGs from the HA partner controller. Each piece is in turn divided into halves. Each time the system fills the first local half of NVRAM/NVMEM, a CP event is generated; while the CP runs, the local controller uses the second local half for new operations, and after the second half is filled with logs, the system switches back to the first, already emptied half and repeats the cycle.
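The half-filling cycle just described can be sketched as follows (the capacity value and class names are made up for illustration): filling the active half fires a CP, and new entries immediately go to the other half while the first one is drained and reused.

```python
HALF_CAPACITY = 4  # illustrative number of log entries per half

class NvlogRegion:
    def __init__(self):
        self.halves = [[], []]
        self.active = 0        # index of the half taking new entries
        self.cp_count = 0

    def log(self, entry):
        self.halves[self.active].append(entry)
        if len(self.halves[self.active]) == HALF_CAPACITY:
            self.cp_count += 1                # CP triggered by a full half
            self.active ^= 1                  # switch to the other half
            self.halves[self.active].clear()  # reuse the emptied half

region = NvlogRegion()
for i in range(10):
    region.log(i)
print(region.cp_count)  # two halves filled along the way -> two CPs
```

The ping-pong between halves is what lets the controller keep accepting writes while a CP is still destaging the previous batch.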
Like many modern file systems, WAFL is a journaling FS, and the journal is used to maintain consistency and protect data. However, unlike general-purpose journaling file systems, WAFL does not need time or special FS checks to roll back and make sure the FS is consistent. If a controller reboots unexpectedly, the last unfinished CP transaction is simply not confirmed: similarly to snapshots, the last incomplete state is discarded, and the data from the NVLOGs is used to create a new consistency point once the controller boots again. A CP transaction is confirmed only once the entire transaction has been written to the disks and the root inode has been updated with pointers to the new data blocks.
It turns out NetApp’s snapshot technology was so successful that it is used almost everywhere in ONTAP. Let me remind you that each CP contains data already processed by the WAFL and then the RAID module. A CP is itself a kind of snapshot: before data processed by WAFL & RAID is destaged from the MBUF to disks, ONTAP creates a system snapshot of the aggregate it is going to write to. To be more precise, it just copies the root inode pointers, capturing the last active file system state. Then ONTAP writes the new data to the aggregate. Once the data has been successfully written as part of the CP transaction, ONTAP updates the root inode pointers and clears the NVLOGs. In case of a failure, even if both controllers reboot simultaneously, the last system snapshot is rolled back, the data is restored from the NVLOGs, processed again by the WAFL and RAID modules, and destaged back to disks at the next CP as soon as the controllers come back online.
If only one controller suddenly switches off or reboots, the surviving controller restores the data from its own copy of the partner’s NVLOGs and finishes the earlier unsuccessful CP; applications are transparently switched to the surviving controller and continue to run after a small pause, as if there had been no disruption at all. Once a CP succeeds, as part of the CP transaction, ONTAP updates the root inode with pointers to the new data and creates a new system snapshot capturing the newly added data and the pointers to the old data. In this way, ONTAP always maintains data consistency on the storage system without ever needing to switch to write-through mode.
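The reason a root-inode pointer switch keeps the file system consistent can be shown with a toy model (names hypothetical, not WAFL internals): all new blocks are written first, and only then is a single pointer flipped. Until the flip, readers still see the previous, fully consistent tree, so a crash mid-CP needs no fsck.

```python
class Filesystem:
    def __init__(self):
        self.blocks = {"tree-v1": "old consistent data"}
        self.root = "tree-v1"         # root inode points at the live tree

    def consistency_point(self, new_tree, data, crash_before_commit=False):
        self.blocks[new_tree] = data  # 1. write new blocks to free space
        if crash_before_commit:
            return                    # crash: root still points at old tree
        self.root = new_tree          # 2. single atomic pointer switch

fs = Filesystem()
fs.consistency_point("tree-v2", "new data", crash_before_commit=True)
print(fs.blocks[fs.root])  # still the old consistent data, no check needed
fs.consistency_point("tree-v2", "new data")
print(fs.blocks[fs.root])  # the CP committed; the new data is now visible
```

Note how the interrupted CP leaves nothing to clean up: the half-written tree is simply unreachable, which is exactly the behavior the text describes for NVLOG replay after an unexpected reboot.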
ONTAP 9 performs CPs separately for each aggregate, whereas previously they were controller-wide. With per-aggregate CPs, slow aggregates no longer influence other aggregates in the system.
An inode contains information about a file, known as metadata, but an inode can store a small amount of data too. Inodes have a hierarchical structure, and each inode can store up to 4 KB of information. If a file is small enough to fit into an inode alongside its metadata, then only a single 4 KB block is used for it. A directory is actually also a file on the WAFL file system, so a real-world example of an inode storing both the metadata and the data itself is an empty directory or an empty file. But what if the file does not fit into an inode? Then the inode stores pointers to other inodes, and those inodes in turn store pointers to further inodes or addresses of data blocks. Currently, WAFL has a 5-level hierarchy limit; in deep-dive technical documentation about WAFL, inodes and data blocks are sometimes themselves referred to as files. As a result, each file on a FlexVol file system can be no larger than 16 TiB. Each FlexVol volume is a separate WAFL file system and has its own volume root inode.
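As a back-of-the-envelope illustration (not exact WAFL internals; the pointers-per-block figure below is an assumed round number), we can check that a few levels of indirection comfortably address a 16 TiB file built from 4 KiB blocks:

```python
BLOCK_SIZE = 4 * 1024       # 4 KiB per block
MAX_FILE = 16 * 1024 ** 4   # the 16 TiB per-file limit mentioned above

data_blocks = MAX_FILE // BLOCK_SIZE
print(data_blocks)          # 4294967296 == 2**32 data blocks

PTRS_PER_BLOCK = 1024       # assumed pointers per indirect block
level = 0
reachable = 1
while reachable < data_blocks:
    reachable *= PTRS_PER_BLOCK
    level += 1
print(level)                # levels of indirection needed under these assumptions
```

Under these assumptions four levels of indirection suffice, which fits within the 5-level hierarchy limit mentioned above.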
The reason Write Anywhere File Layout got the word “Anywhere” in its name is that metadata can be anywhere in the FS
An interesting nuance: the reason Write Anywhere File Layout probably got the word “Anywhere” in its name is that metadata can be anywhere in the FS, mixed in with data blocks, while traditional file systems usually store their metadata in a dedicated area of the disk, typically of fixed size. Here is the list of metadata that can be stored alongside the data:
- Volume where the inode resides
- Locking information
- Mode and type of file
- Number of links to the file
- Owner’s user and group ids
- Number of bytes in the file
- Access and modification times
- Time the inode itself was last modified
- Addresses of the file’s blocks on disk
- Permission: UNIX bits or Windows Access Control List (ACLs)
- Qtree ID.
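The metadata fields listed above can be sketched as a simple data structure (illustrative only; WAFL’s actual on-disk inode layout is not public, and all field names here are my own):

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    volume: str                  # volume where the inode resides
    file_type: str               # mode and type of file
    link_count: int              # number of links to the file
    uid: int                     # owner's user id
    gid: int                     # owner's group id
    size_bytes: int              # number of bytes in the file
    atime: float                 # access time
    mtime: float                 # modification time
    ctime: float                 # time the inode itself was last modified
    block_addresses: list = field(default_factory=list)  # file's blocks on disk
    permissions: str = ""        # UNIX bits or Windows ACLs
    qtree_id: int = 0

# An empty directory: metadata fits in the inode and there are no data blocks.
root = Inode(volume="vol0", file_type="dir", link_count=2,
             uid=0, gid=0, size_bytes=4096,
             atime=0.0, mtime=0.0, ctime=0.0)
print(root.volume)
```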
Events generating CP
A CP is an event generated by one of the following conditions:
- 10 seconds passed by since the last CP
- The system filled the first local half of NVRAM
- The local MBUF filled up (known as the High Watermark). This rarely happens, because the MBUF is usually far bigger than NVMEM/NVRAM; it can occur when commands in the MBUF generate a large amount of new data once processed by the WAFL/RAID modules
- The halt command was executed on the controller to stop it
A CP condition might indirectly point to a system problem; for example, when there are not enough HDD drives to maintain performance, you will see back-to-back (“B” or “b”) CPs. See also the Knowledge Base FAQ: Consistency Point.
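The trigger list above can be condensed into a single check, sketched here with illustrative threshold values and priorities (ONTAP’s actual evaluation order is not documented in this article):

```python
def cp_trigger(seconds_since_cp, nvram_half_full, mbuf_high_watermark, halted):
    """Return the reason a Consistency Point fires, or None if no trigger."""
    if halted:
        return "halt"                 # halt command stops the controller
    if nvram_half_full:
        return "nvram half full"      # first local half of NVRAM filled
    if mbuf_high_watermark:
        return "mbuf high watermark"  # local MBUF filled up
    if seconds_since_cp >= 10:
        return "timer"                # 10 seconds since the last CP
    return None

print(cp_trigger(11, False, False, False))  # timer
print(cp_trigger(2, True, False, False))    # nvram half full
```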
NVRAM/NVMEM and MetroCluster
To protect data from a split-brain scenario in MetroCluster (MCC), a host writing data to the system gets an acknowledgment only after the data has been acknowledged by both the local HA partner and the remote MCC partner (in the case of a 4-node MCC).
Data synchronization between the partners of a local HA pair happens over the HA interconnect. If the two controllers of an HA pair are located in two separate chassis, the HA interconnect is an external connection (in some models over InfiniBand or Ethernet). If the two controllers are placed in a single HA chassis, the HA interconnect is internal and there are no visible connections. Some controllers can be used in both configurations: HA in a single chassis, or HA with each controller in its own chassis. Such controllers have dedicated HA interconnect ports, often named cNx (i.e., c0a, c0b, etc., for example on FAS3240 systems), but when such a controller is used in a single-chassis configuration those ports are not used (and cannot be used for other purposes), and communication is established internally through the controller’s backplane.
Controllers vs Nodes
A storage system is formed out of one or more HA pairs. Each HA pair consists of two controllers, sometimes called nodes. Controllers and nodes are very similar and often interchangeable terms; the difference is that controllers are physical devices, while nodes are the OS instances running on those controllers. The controllers in an HA pair are connected by the HA interconnect. With hardware appliances like AFF and FAS systems, each disk is connected simultaneously to both controllers in an HA pair; in tech documents the controllers are often referred to as “controller A” and “controller B.” Even though the hard drives in AFF & FAS systems physically have a single connector, that connector comprises two ports, and each port of each drive is connected to one of the two controllers. So if you ever dig deep into the node shell console and enter the disk show command, you will see disks named like 0c.00.XX, where 0c is the current port through which the disk is connected to the controller that “owns” it, 00 is the ID of the disk shelf, and XX is the position of the drive in the shelf. At any time, only one controller owns a disk or a partition on a disk, which means that controller serves data to hosts from that disk or partition. The HA partner steps in only if the owner of the disk or partition dies; thus each controller in ONTAP has its own drives/partitions and serves only those, an architecture known as “share nothing.” There are two types of HA policies: SFO (storage failover) and CFO (controller failover). CFO is used for root aggregates and SFO for data aggregates; CFO does not change disk ownership in the aggregate, while SFO does.
ToasterA*> disk show -v
DISK      OWNER                 POOL   SERIAL NUMBER
--------- --------------------- -----  ----------------
0c.00.1   unowned               -      WD-WCAW30485556
0c.00.2   ToasterA (142222203)  Pool0  WD-WCAW30535000
0c.00.11  ToasterB (142222204)  Pool0  WD-WCAW30485887
0c.00.6   ToasterB (142222204)  Pool0  WD-WCAW30481983
But since each drive in a hardware appliance like a FAS or AFF system is connected to both controllers, each controller can address every disk. If you manually change 0c from this example to 0d, the port through which the drive is available on the partner, the system will be able to address the drive.
ToasterB*> disk assign 0d.00.1 -s 142222203
Thu Mar 1 09:18:09 EST [ToasterB:diskown.changingOwner:info]: changing ownership for disk 0d.00.1 (S/N WD-WCAW30485556) from (ID 1234) to ToasterA (ID 142222203)
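The disk naming convention used in the output above can be unpacked with a tiny parser, purely for illustration (0c.00.11 breaks into port 0c, shelf 00, bay 11):

```python
def parse_disk_name(name):
    """Split an ONTAP-style disk name 'port.shelf.bay' into its parts."""
    port, shelf, bay = name.split(".")
    return {"port": port, "shelf": shelf, "bay": bay}

print(parse_disk_name("0c.00.11"))
# {'port': '0c', 'shelf': '00', 'bay': '11'}
```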
Software-defined ONTAP storage (ONTAP Select & Cloud Volumes ONTAP) works much like MetroCluster, because by definition it runs on commodity equipment: it has no special dual-ported drives connected to both servers (nodes). So instead of both nodes in an HA pair connecting to a single drive, ONTAP Select (and Cloud Volumes ONTAP) copies data from one controller to the other and keeps a copy of the data on each node. That is the price, the other side, of commodity equipment.
Technically, it would be possible to connect a single external storage, for instance over iSCSI, to both storage nodes and avoid the unnecessary data duplication, but that option is not available in SDS ONTAP at the moment.
Mailbox disks
While it sounds like a disk, a mailbox is really not a disk but rather a tiny special area on a disk which consumes a few KB. That mailbox area is used to send messages from one HA partner to the other, giving ONTAP an additional level of assurance for its HA capabilities. Mailbox disks are used to determine the state of the HA partner, somewhat like email: each controller periodically posts a message to its own (local) mailbox disks saying it is alive, healthy, and well, while reading from the partner’s mailboxes. If the timestamp of the last message from the partner is too old, the surviving node takes over. In this way, if the HA interconnect is unavailable for some reason or a controller freezes, the partner can determine the state of the second controller through the mailbox disks and perform the takeover. If a disk containing a mailbox dies, ONTAP simply chooses a new disk.
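The takeover decision just described boils down to a staleness check on the partner’s last mailbox timestamp. A minimal sketch, with an assumed timeout value (not ONTAP’s real one):

```python
HEARTBEAT_TIMEOUT = 5.0  # seconds; illustrative, not ONTAP's actual value

def should_take_over(now, partner_last_heartbeat):
    """Partner is presumed dead if its mailbox timestamp is too stale."""
    return (now - partner_last_heartbeat) > HEARTBEAT_TIMEOUT

print(should_take_over(now=100.0, partner_last_heartbeat=99.0))  # False
print(should_take_over(now=100.0, partner_last_heartbeat=90.0))  # True
```

Because the check goes through shared disks rather than the HA interconnect, it still works when the interconnect itself is down, which is the whole point of the mechanism.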
By default, mailboxes reside on two disks per node: one data and one parity disk for RAID 4, or one parity and one double-parity disk for RAID-DP, usually in the first aggregate, which is typically the node’s root aggregate.
Cluster1::*> storage failover mailbox-disk show -node node1
Node    Location  Index Disk Name     Physical Location  Disk UUID
------- --------- ----- ------------- ------------------ -------------------
node1   local     0     1.0.4         local              20000000:8777E9D6:[...]
        local     1     1.0.6         partner            20000000:8777E9DE:[...]
        partner   0     1.0.1         local              20000000:877BA634:[...]
        partner   1     1.0.2         partner            20000000:8777C1F2:[...]
When a node in an HA pair, whether software-defined or hardware, dies, the surviving one “takes over” and continues to serve the data of the offline node to the hosts. With a hardware appliance, the surviving node also changes the disk ownership from the HA partner to itself.
Active/Active and Active/Passive configurations
In an Active/Active configuration with the ONTAP architecture, each controller has its own drives and serves data to hosts; in this case, each controller has at least one data aggregate. In an Active/Passive configuration, the passive node does not serve data to hosts and has disk drives only for its root aggregate (for internal system needs). In both Active/Active and Active/Passive configurations, each node needs one root aggregate to function properly. Aggregates are formed out of one or more RAID groups, and each RAID group consists of several disk drives or partitions. All the drives or partitions in an aggregate have to be owned by a single node.
Continue to read
Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. This information might therefore be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better; please leave your ideas and suggestions about this topic in the comments below.