How does the ONTAP cluster work? (part 1)

In my previous series of articles I explained how ONTAP system memory works and talked about:
NVRAM/NVMEM, NVLOGs, Memory Buffer, HA & HA interconnects, Consistency Points, WAFL iNodes, MetroCluster data availability, Mailbox disks, Takeover, Active/Active and Active/Passive configurations, Write-Through, Write Allocation, Volume Affinities (waffinity), FlexGroup, RAID & WAFL interaction, Tetris and IO redirection, Read Cache, NVRAM size, the role of the system battery and boot flash drive, Flash media and WAFL compatibility. Those articles are a good addition to this one and will help you understand it, so go & check them too.

First, I need to point out that clustering is a very broad term, and for many vendors and technologies it has a different meaning. ONTAP uses three different types of clustering, and one of the primary purposes of this article is to explain each of them, how they differ from one another, how they can complement each other, and what additional benefits ONTAP can get out of them. Before we get to clustering, we need to go deeper and explore other ONTAP components to understand how it all works.

When someone speaks about an ONTAP cluster, they most probably mean the horizontal-scaling clustering used to scale out storage (the third type of clustering).

Storage

There are a few platforms that ONTAP supports: FAS appliances, AFF appliances, ASA appliances, and SDS virtual appliances. NetApp FAS storage systems that contain only SSD drives and run SSD-optimized ONTAP are called All-Flash FAS (AFF). ASA is based on the AFF platform and provides access to storage only over SAN data protocols; therefore, the rest of the ONTAP-based systems may be called "Unified ONTAP," meaning the unification of SAN and NAS protocols. There is also the Lenovo DM line of products using Unified ONTAP. NetApp hardware appliances use SATA, Fibre Channel, SAS, or SSD disk drives, or LUNs from third-party storage arrays, which ONTAP groups into RAID groups and then into aggregates to combine disk performance and capacity into a single pool of storage resources. SDS appliances can use space from the hypervisor as virtual disks and join that space into aggregates, or they can use physical disk drives passed through to ONTAP, build RAID out of them, and then form aggregates.

FlexVol

A FlexVol volume is a logical space placed on top of an aggregate; each volume can expand or shrink in size, and we can apply and change performance limits on each volume. A FlexVol helps to carve performance & capacity out of an aggregate's pool of resources and flexibly distribute them as needed. Some volumes need to be big but slow, and some very fast but small; volumes can be resized and performance rebalanced, and FlexVol is the technology that achieves this goal in the ONTAP architecture. Clients access storage in FlexVol volumes over SAN & NAS protocols. Each volume lives on a single aggregate and is served by a single storage node (controller).

If two FlexVol volumes are created, one on each of two aggregates, and those aggregates are owned by two different controllers, and the system admin needs to use space from these volumes over a NAS protocol, then the admin will create two file shares, one on each volume. In this case, the admin will most probably even create different IP addresses, each used to access a dedicated file share. Each volume will have a single write affinity (waffinity), and there will be two buckets of space. Even if both volumes reside on a single controller, and for example on a single aggregate (so if a second aggregate exists, it will not be used in this case), and both volumes are accessed through a single IP address, there will still be two write affinities, one per volume, and still two separate buckets of space. So the more volumes you have, the more write affinities you'll have (better parallelization and thus more even CPU utilization, which is good), but then you'll have multiple volumes (and multiple buckets of space, and thus multiple file shares).
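
As a rough sketch of how this looks in practice (syntax from memory; the SVM, aggregate, and volume names such as svm1, aggr1, and vol_db are made up for illustration, so verify the exact parameters against the documentation for your ONTAP version):

# create a 1 TB volume on aggr1 and mount it into the NAS namespace at /vol_db
Cluster1::> volume create -vserver svm1 -volume vol_db -aggregate aggr1 -size 1TB -junction-path /vol_db
# grow the same volume online to 2 TB
Cluster1::> volume modify -vserver svm1 -volume vol_db -size 2TB
# cap the volume at 5000 IOPS with a QoS policy group
Cluster1::> qos policy-group create -policy-group pg_db -vserver svm1 -max-throughput 5000iops
Cluster1::> volume modify -vserver svm1 -volume vol_db -qos-policy-group pg_db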

FlexGroup

FlexGroup is a free feature introduced in ONTAP 9 that builds on the clustered architecture of the ONTAP operating system. FlexGroup provides cluster-wide scalable NAS access with the NFS and CIFS protocols. A FlexGroup creates multiple write affinities and, unlike a FlexVol, combines space and performance from all the volumes underneath it (thus from multiple aggregates and nodes). A FlexGroup volume is a collection of constituent FlexVol volumes, called "constituents," distributed across the nodes of the cluster and transparently joined into a single space. A FlexGroup volume combines performance and capacity from all the constituent volumes and thus from all the nodes of the cluster where they are located. To the end user, each FlexGroup volume is represented as a single, ordinary file share with a single space equal to the sum of the space of all the constituent volumes (which are not visible to clients) and multiple read & write affinities.
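
For illustration, here is a sketch of provisioning a FlexGroup volume either automatically or by explicitly listing the aggregates to spread the constituents over (names are made up; the -auto-provision-as option appeared around ONTAP 9.2, so check the syntax for your version):

# let ONTAP place the constituent volumes across the cluster automatically
Cluster1::> volume create -vserver svm1 -volume fg1 -auto-provision-as flexgroup -size 100TB -junction-path /fg1
# or spread 8 constituents manually over two aggregates (4 per aggregate)
Cluster1::> volume create -vserver svm1 -volume fg2 -aggr-list aggr1,aggr2 -aggr-list-multiplier 4 -size 100TB -junction-path /fg2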

NetApp will reveal the full potential of FlexGroup with technologies like NFS multipathing, SMB multichannel, pNFS, SMB CA, and VIP.

Technology Name                            FlexVol   FlexGroup
1. NFS multipathing (session trunking)     No        No
2. SMB multichannel                        No        No
3. pNFS                                    Yes       No
4. VIP (BGP)                               Yes       Yes
5. SMB Continuous Availability (SMB CA)    Yes       Yes*

*Added in ONTAP 9.6

The FlexGroup feature in ONTAP 9 allows a single namespace to scale massively to over 20 PB and over 400 billion files, while evenly spreading the performance across the cluster. Starting with ONTAP 9.5, FabricPool is supported with FlexGroup; in this case, it is recommended to back all the constituent volumes with a single S3 object storage bucket. FlexGroup supports SMB features such as native file auditing, FPolicy, Storage-Level Access Guard (SLAG), copy offload (ODX), and inherited watches for change notifications, as well as quotas and qtrees. SMB Continuous Availability (CA), supported with FlexGroup since ONTAP 9.6, allows running MS SQL & Hyper-V. FlexGroup is also supported on MetroCluster.

Clustered ONTAP

Today's OS for the NetApp AFF, FAS, Lenovo DM line and cloud appliances is known simply as ONTAP 9, but before version 9 there were Clustered ONTAP (also called Cluster-Mode, Clustered Data ONTAP, or cDOT) and 7-Mode ONTAP. 7-Mode is the old firmware, which had the first and second types of clustering (High Availability and MetroCluster), while Clustered ONTAP 9 has all three (HA, MCC, plus horizontal-scaling clustering). The reason why the two existed in parallel was that Clustered ONTAP 8 didn't have all the rich functionality of 7-Mode, so for a while it was possible to run both modes (one at a time) on the same NetApp hardware. NetApp spent some time bringing all the functionality to Cluster-Mode; once the transition was finished, 7-Mode was deprecated, and with that milestone Clustered ONTAP was updated to the next version and became "just" ONTAP 9. ONTAP 8.2.4 was the last version of 7-Mode. 7-Mode & Cluster-Mode share a lot of similarities (for example, the WAFL file system was used in both), but for the most part they were not compatible with one another; in the WAFL example, the WAFL versions and functionality were different and thus incompatible, and only limited compatibility was introduced, mostly for migration purposes from 7-Mode to Cluster-Mode. The last version of 7-Mode, ONTAP 8.2.4, contains a WAFL version compatible with Cluster-Mode to allow a fast but offline in-place upgrade to the newest versions of ONTAP.

In version 9, nearly all the features from 7-Mode were successfully implemented in (Clustered) ONTAP, including SnapLock, FlexCache, MetroCluster, and SnapMirror Synchronous, while many new features not available in 7-Mode were introduced, such as FlexGroup, FabricPool, NDAS, data compaction, AFF support, and new capabilities like fast workload provisioning and Flash optimization, among many others. The uniqueness of NetApp's Clustered ONTAP is the ability to add heterogeneous systems (where all systems in a single cluster do not have to be of the same model or generation) to a single (third-type) cluster. This provides a single pane of glass for managing all the nodes in a cluster, and non-disruptive operations such as adding new models to the cluster, removing old nodes, and online migration of volumes and LUNs while data remains continuously available to its clients.
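
As an example of such a non-disruptive operation, a volume can be moved online between aggregates (and therefore between nodes) of the same cluster; a sketch with made-up names:

# move vol_db to aggr2, which may live on another node; data stays online during the move
Cluster1::> volume move start -vserver svm1 -volume vol_db -destination-aggregate aggr2
# watch the progress of the move
Cluster1::> volume move show -vserver svm1 -volume vol_db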

Node and controller (head, or physical appliance) are very similar terms and are often used interchangeably. The difference between them is that a controller is a physical server with a CPU, main board, memory, and NVRAM, while a node is an instance of the ONTAP OS running on top of a controller. A node can migrate, for example when a controller is replaced with a new one. An ONTAP (third-type) cluster consists of nodes.

FAS appliances

FAS appliances are NetApp custom-built & OEM hardware. Controllers in FAS systems are computers running the ONTAP OS. FAS systems are used with HDD and SSD drives. SSDs are often used for caching, but can be used in all-SSD aggregates as well. FAS systems can use NetApp disk shelves to add capacity to the storage, or third-party arrays. Each disk shelf is connected to one storage system, which consists of one or two controllers (an HA pair).

AFF appliances

All-Flash FAS appliances are also known as AFF. Usually, NetApp All-Flash systems are based on the same hardware as FAS, but the ONTAP OS on the former is optimized for and works only with SSD media on the back end, while a FAS appliance can use HDDs, SSDs, and SSDs as cache. Here are pairs of appliances which use the same hardware: AFF A700 & FAS9000, A300 & FAS8200, A200 & FAS2600, A220 & FAS2700; however, AFF systems do not include FlashCache cards, since there is no sense in caching operations from flash media on flash media. Also, AFF systems do not support the FlexArray third-party storage array virtualization functionality. Both AFF & FAS use the same firmware image, and nearly all functionality noticeable to the end user is the same for both. However, internally data is processed and handled differently in ONTAP on AFF systems; for example, different Write Allocation algorithms are used than on FAS systems. Because AFF systems have faster underlying SSD drives, inline data deduplication in ONTAP is nearly unnoticeable (no more than a 2% performance impact on low-end systems).

ASA appliances

ASA systems are based on the AFF platform and provide access over SAN protocols only; therefore, to differentiate them from ASA, the rest of the ONTAP-based systems are called Unified systems, meaning unification of SAN & NAS data protocols. ASA systems provide symmetric access to the storage nodes over the network: each block device (i.e., LUN or NVMe namespace) is accessed over paths from both controllers of the HA pair, while the rest of the ONTAP-based Unified (non-ASA) systems with SAN protocols normally use optimized paths only through the controller which owns the LUN and switch to the non-optimized paths only when the optimized paths are not available. See the announcement of ASA in October 2019.

Unified ONTAP

Unified ONTAP is the name for all the systems capable of both SAN & NAS protocols. ASA has a burned-in All SAN personality and can serve only SAN data protocols. Thus FAS, AFF, ONTAP Select, Cloud Volumes ONTAP, and Cloud Volumes Service are Unified and continue to use ALUA & ANA with SAN protocols.

ONTAP Select

ONTAP Select is a software-only solution available in the form of a virtual machine. It uses third-party disk drives and can form its own RAID or use server-based hardware RAID. If server-based hardware RAID is used, aggregates are typically built out of a single block device. ONTAP Select can run as a single node, as an HA pair of two nodes, or as a cluster of multiple HA pairs. Previously, ONTAP Select was known as Data ONTAP Edge. ONTAP Select is used as the platform for the Cloud Volumes ONTAP (CVO) offering at public cloud providers.

Disks

FAS and AFF systems use enterprise-level HDD and SSD (i.e., NVMe SSD) physical drives with two ports, each port connected to one of the two controllers in an HA pair. HDD and SSD drives can only be bought from NetApp and installed in NetApp's disk shelves for the FAS/AFF platform. Physical HDD and SSD drives, partitions on disk drives, and even LUNs imported from third-party arrays with FlexArray functionality are all considered disks in ONTAP. In SDS systems like ONTAP Select & ONTAP Cloud, logical block storage like a virtual disk or an RDM is also considered a disk inside ONTAP. Do not confuse the general term "disk drive" with the term "disk" as used in an ONTAP system, because in ONTAP it can be an entire physical HDD or SSD drive, a LUN, or a partition on a physical HDD or SSD drive. A LUN imported from a third-party array with FlexArray functionality in an HA pair configuration must be accessible from both nodes of the HA pair, just like an HDD or SSD drive. Each ONTAP disk has an ownership mark on it to show which controller owns and serves the disk. An aggregate can include only disks owned by a single node; therefore each aggregate is owned by one node, and any objects on top of it, such as FlexVol volumes, LUNs, and file shares, are served by a single controller. Each node has its own disks and aggregates and serves them, so both nodes can be utilized simultaneously, even though they do not serve the same data.
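
In the clustershell, disk ownership can be checked and, for unowned disks, assigned roughly like this (a sketch; the disk and node names are made up):

# list disks together with their current owner and container type
Cluster1::> storage disk show -fields owner,container-type
# assign an unowned disk to node1
Cluster1::> storage disk assign -disk 1.0.12 -owner node1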

ADP

Advanced Drive Partitioning (ADP) can be used in AFF & FAS systems, depending on the platform and use case. FlexArray technology does not support ADP. This technique is mainly used to overcome some architectural requirements and to reduce the number of disk drives in NetApp FAS & AFF storage systems. There are three types of ADP:

  • Root-Data partitioning
  • Root-Data-Data partitioning (RD2 also known as ADPv2)
  • Storage Pool

Root-Data partitioning is used in FAS & AFF systems to create small root partitions on drives, which are used to create the system root aggregates, so the system does not have to spend two entire physical disk drives for that purpose, while the bigger portion of each disk drive is used for the data aggregate. Root-Data-Data partitioning is used in All-Flash systems only; it is used for the same reason as Root-Data partitioning, with the only difference being that the bigger portion of the drive left after root partitioning is divided equally into two partitions, each assigned to one of the two nodes, thus reducing the minimum number of drives required for an All-Flash system and reducing the waste of expensive SSD space. The Storage Pool partitioning technology is used in FAS systems to divide each SSD drive equally into four pieces which can later be used for FlashPool cache acceleration (only); with Storage Pool, just a few SSD drives can be shared among up to 4 data aggregates which will benefit from the FlashPool caching technology, reducing the minimum number of SSD drives required for FlashPool.
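
Partitioned (ADP) drives are reported by ONTAP with the container type "shared"; a rough way to see them and how the root aggregate is laid out (command names from memory, the aggregate name is made up):

# partitioned drives show up with container type "shared"
Cluster1::> storage disk show -container-type shared
# show which disks and partitions make up the root aggregate of node1
Cluster1::> storage aggregate show-status -aggregate aggr0_node1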

NetApp RAID in ONTAP

In NetApp ONTAP systems, RAID and WAFL are tightly integrated. There are several RAID types available within NetApp FAS and AFF systems:

  • RAID-4 with 1 dedicated parity disk allowing any 1 drive to fail in a RAID group.
  • RAID-DP with 2 dedicated parity disks allowing any 2 drives to fail simultaneously in a RAID group.
  • RAID-TEC (US patent 7640484) with 3 dedicated parity drives, allowing any 3 drives to fail simultaneously in a RAID group.

RAID-DP's double parity leads to a disk-loss resiliency similar to that of RAID-6. NetApp overcomes the write performance penalty of traditional RAID-4-style dedicated parity disks via WAFL and the innovative use of its nonvolatile memory (NVRAM) within each storage system. Each aggregate consists of one or two plexes, and a plex consists of one or more RAID groups. A typical NetApp FAS or AFF storage system has only one plex in each aggregate; two plexes are used in local SyncMirror or MetroCluster configurations. Therefore, in systems without MetroCluster or local SyncMirror, engineers might say "aggregates consist of RAID groups" to simplify things a bit, because the plex does not play a vital role in such configurations, while in reality an aggregate always has one or two plexes and a plex consists of one or more RAID groups (see the picture with the aggregate diagram). Each RAID group usually consists of disk drives of the same type, speed, geometry, and capacity, though NetApp Support can allow a user to install a drive of the same or bigger size but different type, speed, or geometry into a RAID group on a temporary basis. RAID can be used with partitions too. Any data aggregate containing more than one RAID group must use the same RAID type across the aggregate; the same RAID group size is also recommended, but NetApp allows an exception for the last RAID group, which can be configured as small as half of the RAID group size used across the aggregate. For example, such an aggregate might consist of 3 RAID groups: RG0:16+2, RG1:16+2, RG2:7+2. Within aggregates, ONTAP sets up flexible volumes (FlexVol) to store data that users can access. The reason ONTAP has a "default" RAID group size, and that number is smaller than the maximum RAID group size, is to allow the admin to add only a few disk drives to existing RAID groups in the future instead of adding a new RAID group with a full set of drives.
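
A sketch of creating a data aggregate with an explicit RAID type and RAID group size (made-up names; verify the parameters for your ONTAP version):

# create a RAID-DP aggregate from 20 disks on node1, with RAID groups of up to 18 disks each
Cluster1::> storage aggregate create -aggregate aggr1_node1 -node node1 -diskcount 20 -raidtype raid_dp -maxraidsize 18
# check the resulting RAID group layout
Cluster1::> storage aggregate show-status -aggregate aggr1_node1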

Aggregates enabled as FlashPool consist of both HDD and SSD drives, are called hybrid aggregates, and are used in FAS systems. In FlashPool aggregates the same rules apply as to ordinary aggregates, but separately to the HDD and SSD drives, so it is allowed to have two different RAID types: one RAID type for all the HDD drives and one RAID type for all the SSD drives in a single hybrid aggregate. For example, SAS HDDs with RAID-TEC (RG0:18+3, RG1:18+3) and SSDs with RAID-DP (RG3:6+2). NetApp storage systems running ONTAP combine the underlying RAID groups similarly to RAID-0 in plexes and aggregates, while in hybrid aggregates the SSD portion is used for cache and therefore the capacity from the flash media does not contribute to the overall aggregate space. Also, in NetApp FAS systems with the FlexArray feature, third-party LUNs can be combined in a plex/aggregate similarly to RAID-0. NetApp storage systems running ONTAP can be deployed in MetroCluster and local SyncMirror configurations, which use a technique comparable to RAID-1, mirroring data between two plexes in an aggregate.

Note that ADPv2 does not support RAID-4. RAID-TEC is recommended if the size of the disks used in an aggregate is greater than 4 TiB. The RAID type of a storage pool cannot be changed. The RAID minimums for the root aggregate (with force-small-aggregate set to true) are:

  • RAID-4 is 2 drives (1d + 1p)
  • RAID-DP is 3 drives (1d + 2p)
  • RAID-TEC is 5 drives (2d + 3p)

Aggregates

One or multiple RAID groups form an "aggregate," and within aggregates the ONTAP operating system sets up "flexible volumes" (FlexVol) to store data that hosts can access.

Similarly to RAID-0, each aggregate merges space from the underlying protected RAID groups to provide one logical piece of storage for flexible volumes; therefore, an aggregate does not provide data protection mechanisms but rather another layer of abstraction. Alongside aggregates built out of disks and RAID groups, other aggregates can consist of LUNs already protected by third-party storage systems and connected to ONTAP with FlexArray, and it works in a similar way in ONTAP Select and Cloud Volumes ONTAP. Each aggregate can consist of either LUNs or NetApp RAID groups. Flexible volumes offer the advantage that many of them can be created on a single aggregate and resized at any time. Smaller volumes can then share all the space & disk performance available to the underlying aggregate, and QoS allows changing the performance of flexible volumes on the fly. Aggregates can only be expanded, never downsized. The current maximum physical usable space in an aggregate is 800 TiB for All-Flash FAS systems; the limit applies to the space in the aggregate rather than the number of disk drives and may be different on AFF & FAS systems.
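
Expanding an aggregate (remember, it can never be shrunk) is done by adding disks to it, roughly like this (made-up names):

# add 6 more disks to an existing aggregate; ONTAP grows the existing RAID groups or starts a new one
Cluster1::> storage aggregate add-disks -aggregate aggr1_node1 -diskcount 6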

FlashPool

NetApp FlashPool is a feature on hybrid NetApp FAS systems which allows creating a hybrid aggregate with HDD drives and SSD drives in a single data aggregate. The HDD and SSD drives form separate RAID groups. Since the SSDs are also used for write operations, they require RAID redundancy, contrary to FlashCache, which accelerates only read operations. In a hybrid aggregate the system allows different RAID types for the HDDs and SSDs; for example, it is possible to have 20 8 TB HDDs in RAID-TEC and 4 960 GB SSDs in RAID-DP (or even RAID-4) in a single aggregate. The SSD RAID is used as a cache and improves performance for read-write operations on the FlexVol volumes in the aggregate where the SSDs were added as cache. FlashPool caching, similarly to FlashCache, has policies for read operations, but it also includes write operations, and the system administrator can apply those policies to each FlexVol volume located on the hybrid aggregate; therefore caching can be disabled on some volumes while others benefit from the SSD cache. Both FlashCache & FlashPool can be used simultaneously to cache data from a single FlexVol. To enable an aggregate with FlashPool technology, a minimum of 4 SSD disks is required (2 data, 1 parity, and 1 hot spare); it is also possible to use ADP technology to partition SSDs into 4 pieces (Storage Pool) and distribute those pieces between two controllers so that each controller's aggregates can benefit from SSD cache when there is only a small number of SSDs. FlashPool is not available with FlexArray and is available only with NetApp FAS native disk drives in NetApp's disk shelves.
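
A hybrid (FlashPool) aggregate is typically created by marking an existing HDD aggregate as hybrid and then adding an SSD RAID group to it; a rough sketch with made-up names (parameter names are from memory, so double-check them against the docs for your release):

# allow the HDD aggregate to accept an SSD cache tier
Cluster1::> storage aggregate modify -aggregate aggr1_sas -hybrid-enabled true
# add 4 SSDs as a separate RAID-DP RAID group used only as cache
Cluster1::> storage aggregate add-disks -aggregate aggr1_sas -disktype SSD -diskcount 4 -raidtype raid_dp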

FabricPool

FabricPool technology is available for all-SSD aggregates in FAS/AFF systems and in Cloud Volumes ONTAP on SSD media. Starting with ONTAP 9.4, FabricPool is supported on the ONTAP Select platform. Cloud Volumes ONTAP also supports an HDD + S3 FabricPool configuration. FabricPool provides automatic storage tiering of cold data blocks from fast media (the hot tier) on ONTAP storage to S3 object storage (the cold tier) and back. Each FlexVol volume on a FabricPool-enabled all-SSD aggregate can have one of four policies (see the sketch after the list):

  • None – Does not tier data from a volume
  • Snapshot – Migrates cold data blocks captured in snapshots to the cold tier
  • Auto – Migrates cold data blocks from both the active file system and snapshots to the cold tier
  • All – Tiers all the data, writing it through directly to the S3 object storage; metadata, though, always stays on the SSD hot tier.
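
A rough sketch of attaching an already configured object store to an all-SSD aggregate and setting the tiering policy on a volume (the object-store, aggregate, and volume names are made up; note that in the CLI the Snapshot policy is spelled snapshot-only):

# attach a previously configured S3 object store to the all-SSD aggregate
Cluster1::> storage aggregate object-store attach -aggregate ssd_aggr1 -object-store-name my_s3_store
# tier cold blocks of this volume (active file system and snapshots) to the object store
Cluster1::> volume modify -vserver svm1 -volume vol_db -tiering-policy auto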

FabricPool preserves offline deduplication & offline compression savings. FabricPool tiers off blocks from the active file system (by default, data not accessed for 31 days) & supports data compaction savings. The trigger for tiering from the hot tier can be adjusted. The recommended ratio of inodes to data files is 1:10. For clients connected to the ONTAP storage system, all the FabricPool data-tiering operations are completely transparent, and in case data blocks become hot again, they are copied back to the fast media on ONTAP. FabricPool is compatible with the following object storage services:

  • NetApp StorageGRID
  • Amazon S3 and Amazon Commercial Cloud Services (C2S)
  • Google Cloud
  • Alibaba object storage services
  • Azure Blob storage
  • IBM Cloud Object Storage (ICOS) in the cloud
  • IBM Cleversafe (on-prem object storage)

Other object-based software & services can be used if requested by the customer; such a service will then be validated by NetApp. The FabricPool feature in FAS/AFF systems is free to use with NetApp StorageGRID external object storage. For other object storage such as Amazon S3 & Azure Blob, FabricPool must be licensed per TB to function (alongside the costs of FabricPool licensing, the customer also needs to pay for the consumed object space). With the Cloud Volumes ONTAP storage system, FabricPool does not require licensing; costs apply only for the consumed space on the object storage. FlexGroup volumes and SVM-DR are supported with FabricPool, and SVM-DR is also supported with FlexGroup volumes.

FlashCache

NetApp storage systems running ONTAP can have FlashCache cards, which reduce read latency and allow the storage system to handle more read-intensive work without adding any additional disk drives to the underlying RAID. Usually, one FlashCache module is installed per controller; no mirroring is performed between nodes, and the entire space of a FlashCache card is used by a single node only, since read operations do not require redundancy in case of a FlashCache failure, though chip-level data protection is available in FlashCache. If the system unexpectedly reboots, the read cache will be lost, but it will be repopulated over time during regular node operation. FlashCache works at the node level and by default accelerates any volume on that node, and only read operations. FlashCache caching policies are applied at the FlexVol level: the system administrator can set a cache policy on each individual volume on the controller or disable the read cache altogether. FlashCache technology is compatible with the FlexArray feature. Starting with ONTAP 9.1, a single FlexVol volume can benefit from both FlashPool & FlashCache caching simultaneously.

FlexArray

FlexArray is NetApp FAS functionality that allows virtualizing third-party storage systems, and other NetApp storage systems, over SAN protocols and using them instead of NetApp's disk shelves. With FlexArray functionality, RAID protection must be provided by the third-party storage array, so NetApp's RAID-4, RAID-DP, and RAID-TEC are not used in such configurations. One or many LUNs from third-party arrays can be added to a single aggregate, similarly to RAID-0. FlexArray is a licensed feature.

NetApp Storage Encryption

NetApp Storage Encryption (NSE) uses specialized purpose-built disks with a low-level hardware-based full disk encryption (FDE/SED) chip; some disks are FIPS-certified self-encrypting drives. NSE & FIPS drives are compatible with nearly all NetApp ONTAP features and protocols, except for MetroCluster. The NSE feature has nearly zero overall performance impact on the storage system. NSE, similarly to NetApp Volume Encryption (NVE) in storage systems running ONTAP, can store the encryption key locally in the Onboard Key Manager, which keeps keys in the onboard TPM module, or via the KMIP protocol on dedicated key manager systems like IBM Security Key Lifecycle Manager and SafeNet KeySecure. NSE is data-at-rest encryption, which means it protects only against physical disk theft and does not give an additional level of data security protection. In a normally operating and running ONTAP system, this feature does not encrypt data over the wire. When the OS shuts the disks down, they lose the encryption key and become locked; if the key manager is not available or is locked, ONTAP cannot boot. NetApp has passed the NIST Cryptographic Module Validation Program for its NetApp CryptoMod (TPM) with ONTAP 9.2.

Continue to read

How ONTAP memory works

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution you make to improve this article; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

NCDA auto extension with NCIE

Did you know that if you pass an NCIE exam (DP or SAN) AND your NCDA has not yet expired, you automatically get a certification extension (from the higher-level certification) for the NCDA with the same dates as the NCIE?

To check:

In my case, both NCDA ONTAP & NCIE SAN ONTAP got the same date, 2018-Oct-23:

[Screenshot: NCIE SAN ONTAP certification]

[Screenshot: NCDA ONTAP certification]

  • NetApp Certified Data Administrator, ONTAP (NCDA ONTAP)
  • NetApp Certified Implementation Engineer – Data Protection Specialist (NCIE DP)
  • NetApp Certified Implementation Engineer – SAN Specialist, ONTAP (NCIE SAN)

How memory works in ONTAP: NVRAM/NVMEM, MBUF and CP (Part 1)

In this article, I'd like to describe how NVRAM, the cache, and system memory work in ONTAP systems.

System Memory

System memory of any AFF/FAS system (and of other OEM platforms running ONTAP) consists of two types of memory: ordinary (volatile) memory and non-volatile memory (that's why we use the "NV" prefix) connected to a battery on the controller. Some systems use NVRAM and others use NVMEM; both are used for the same purpose, and the only difference is that NVMEM is installed on the memory bus, while NVRAM is a small board with memory installed on it, and that board is connected to the PCIe bus. Both NVRAM and NVMEM are used by ONTAP to store NVLOGs. Ordinary memory is used for system needs, mostly as the Memory Buffer (MBUF), in other words as cache.

Data de-staging in a normally functioning storage system happens from the MBUF, not from NVRAM/NVMEM. However, the trigger for data de-staging might be the fact that NVRAM/NVMEM became half-full, among others. See the CP triggers section below.


NVRAM/NVMEM & NVLOGs

Data from hosts is always placed into the MBUF first. Then the data is copied from the MBUF to NVMEM/NVRAM with a Direct Memory Access (DMA) request, exactly as the host sent it, in the unchanged form of logs, just like a DB log or a journaling file system. DMA does not consume CPU cycles; that's why NetApp says this system has a hardware-journaled file system. As soon as the data is acknowledged to be in NVRAM/NVMEM, the system sends a data-received acknowledgment to the host. After a Consistency Point (CP) occurs, data from the MBUF is de-staged to the disks, and the system clears NVRAM/NVMEM without using the NVLOGs at all; in a normally functioning system, NVLOGs are used only in special events or configurations. So, ONTAP does not use NVLOGs in a normally functioning High Availability (HA) system: it erases the logs each time a CP occurs and then writes new NVLOGs. ONTAP will use NVLOGs, for instance, to restore data "as it was" back into the MBUF in case of an unexpected reboot.

Memory Buffer: Write operations

The first place where all writes land is always the MBUF. Then the data is copied from the MBUF to NVRAM/NVMEM with a DMA call; after that the WAFL module allocates a range of blocks where the data from the MBUF will be written, which is simply called Write Allocation. It might sound simple, but it's kind of a big deal and is constantly being optimized by NetApp. However, just before it allocates space for your data, the system will assemble a Tetris! Yes, I'm talking about the same kind of puzzle-matching Tetris game you might have played in your childhood. One of Write Allocation's jobs is to make sure all the Tetris data is written to the disks in one unbreakable sequence whenever possible.

WAFL also does some additional data optimizations depending on the type of data, where it goes, what type of media it is going to be written to, etc. After the WAFL module gets an acknowledgment from NVRAM/NVMEM that the data is secured, the RAID module processes the data from the MBUF, adds checksums to each block (known as Block/Zone checksums), and calculates and writes parity data on the parity disks. It is important to note that some data in the MBUF contains commands which can be "extracted" before they are delivered to the RAID module; for example, some commands ask the storage system to format some space with a preassigned repeating pattern, or to move chunks of data. Such commands might consume a small amount of space in NVRAM/NVMEM but might generate a big amount of data when executed.


NVMEM/NVRAM in HA

Each HA pair with ONTAP consists of two nodes (controllers), and each node has a copy of the data from its neighboring HA partner. This architecture allows switching hosts to the surviving node and continuing to serve them without noticeable disruption for the hosts.

To be more precise, each NVRAM/NVMEM is divided into two pieces: one to store NVLOGs from the local controller and another to store the copy of NVLOGs from the HA partner controller, while each piece is also divided into halves. So, each time the system fills the first local half of NVRAM/NVMEM, a CP event is generated, and while it happens the local controller uses the second local half for new operations; after the second half is filled with logs, the system switches back to the first, already emptied half and repeats the cycle.


Consistency Points

Like many modern file systems, WAFL is a journaling FS, and journaling is used to keep consistency and protect data. However, unlike general-purpose journaling file systems, WAFL does not need time or special FS checks to roll back & make sure the FS is consistent. When a controller unexpectedly reboots, the last unfinished CP transaction is simply not confirmed: similarly to a snapshot, the last unconfirmed state is just discarded, and the data from the NVLOGs is used to create a new consistency point once the controller boots after the unexpected reboot. A CP transaction is confirmed only once the entire transaction has been written to the disks and the root inode has been changed with new pointers to the newly added data blocks.

It turns out that NetApp snapshot technology was so successful that it is used literally almost everywhere in ONTAP. Let me remind you that each CP contains data already processed by the WAFL and then the RAID module. A CP is also a snapshot: before data that has already been processed by WAFL & RAID is de-staged from the MBUF to disks, ONTAP creates a system snapshot of the aggregate where it is going to write the data. Then ONTAP writes the new data to the aggregate. Once the data has been successfully written as part of the CP transaction, ONTAP changes the root inode pointers and clears the NVLOGs. In other words, before the data from a CP transaction is written to disks, ONTAP creates a snapshot which represents the last active file system state; to be more precise, it just copies the root inode pointers. In case of a failure, even if both controllers reboot simultaneously, the last system snapshot will be rolled back, the data will be restored from the NVLOGs, processed again by the WAFL and RAID modules, and de-staged back to disks on the next CP as soon as the controllers get back online.

In case only one controller suddenly switches off or reboots, the second, surviving controller will restore the data from its own NVRAM/NVLOGs and finish the previously unsuccessful CP; applications will be transparently switched to the surviving controller, and they will continue to run after a small pause as if there was no disruption at all. Once a CP is successful, as part of the CP transaction, ONTAP changes the root inode with pointers to the new data and creates a new system snapshot which captures the newly added data and the pointers to the old data. In this way, ONTAP always maintains data consistency on the storage system without ever needing to switch to Write-Through mode.

ONTAP 9 performs CPs separately for each aggregate, while previously they were controller-wide. With per-aggregate CPs, slow aggregates no longer influence other aggregates in the system.

iNodes

An inode contains information about a file, known as metadata, but inodes can store a small amount of data too. Inodes have a hierarchical structure. Each inode can store up to 4 KB of information. If a file is small enough to fit into an inode together with its metadata, then only that single 4 KB block is used for the inode; a directory is actually also a file on the WAFL file system, so a real-world example where an inode stores both the metadata and the data itself is an empty directory or an empty file. However, what if the file does not fit into an inode? Then the inode stores pointers to other inodes, and those inodes can store pointers to further inodes or addresses of data blocks. Currently, WAFL has a 5-level hierarchy limit. Sometimes inodes and data blocks are referred to as files in deep-dive technical documentation about WAFL. As a result, each file on a FlexVol file system can be no bigger than 16 TiB. Each FlexVol volume is a separate WAFL file system and has its own volume root inode.

The reason why the Write Anywhere File Layout got the word "Anywhere" in its name is that metadata can be anywhere in the FS

An interesting nuance: the reason why the Write Anywhere File Layout probably got the word "Anywhere" in its name is that metadata can live anywhere in the FS, mixed in with data blocks, while other, traditional file systems usually store their metadata in a dedicated area of the disk which usually has a fixed size. Here is the list of metadata information which can be stored alongside the data:

  • Volume where the inode resides
  • Locking information
  • Mode and type of file
  • Number of links to the file
  • Owner’s user and group ids
  • Number of bytes in the file
  • Access and modification times
  • Time the inode itself was last modified
  • Addresses of the file’s blocks on disk
  • Permissions: UNIX bits or Windows Access Control Lists (ACLs)
  • Qtree ID.

[Figure: Write Anywhere File Layout inodes]

Events generating CP

A CP is an event generated by one of the following conditions:

  • 10 seconds passed by since the last CP
  • The system filled one half of the local piece of NVRAM/NVMEM
  • The local MBUF filled up (known as the high watermark). This happens rarely, because the MBUF is usually way bigger than NVMEM/NVRAM; it can happen when commands in the MBUF generate a lot of new data when they are executed by the WAFL/RAID modules.
  • You executed the halt command on the controller to stop it
  • Others

The CP type can indirectly point at some system problems; for example, when there are not enough HDD drives to maintain performance, you will see back-to-back ("B" or "b") CPs. See also the Knowledge Base FAQ: Consistency Point.
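
One place to watch CPs in real time is the nodeshell sysstat command, which prints a "CP ty" column with the trigger type of each CP (for example T for timer, F for a full NVLOG half, B/b for back-to-back); a sketch, assuming a node named node1:

# run the nodeshell sysstat on node1 with a 1-second interval
Cluster1::> node run -node node1 -command "sysstat -x 1"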

NVRAM/NVMEM and MetroCluster

To protect data from a split-brain scenario in MetroCluster (MCC), hosts which write data to the system get an acknowledgment only after the data has been acknowledged by the local HA partner and by the remote MCC partner (in case the MCC comprises 4 nodes).

[Figure: MetroCluster local and DR pair memory replication]

HA interconnect

Data synchronization between local HA partners happens over the HA interconnect. If the two controllers of an HA pair are located in two separate chassis, then the HA interconnect is an external connection (in some models it can be over InfiniBand or Ethernet connections, usually named cNx, i.e., c0a, c0b, etc., for example in FAS3240 systems). If the two controllers of an HA pair are placed in a single HA chassis, the HA interconnect is internal, and there are no visible connections. Some controllers can be used in both configurations, HA in a single chassis or HA with each controller in its own chassis; in this case such controllers have dedicated HA interconnect ports, often named cNx (i.e., c0a, c0b, etc., for example in FAS3240 systems), but when such a controller is used in a single-chassis configuration those ports are not used (and cannot be used for other purposes) and the communication is established internally through the controller's backplane.

Controllers vs Nodes

A storage system is formed out of one or a few HA pairs. Each HA pair consists of two controllers; sometimes they are called nodes. Controller and node are very similar and often interchangeable terms. The difference between them is that controllers are physical devices and nodes are the ephemeral OS instances running on the controllers. Controllers in an HA pair are connected with the HA interconnect. With hardware appliances like AFF and FAS systems, each hard disk is connected simultaneously to both controllers in an HA pair. Controllers are often referred to in tech documents as "controller A" and "controller B." Even though hard drives in AFF & FAS systems physically have a single connector, that connector comprises two ports, and each port of each drive is connected to one of the two controllers. So if you ever dig deep into the node shell console and enter the disk show command, you'll see disks named like 0c.00.XX, where 0c is the port through which that disk is connected to the controller which "owns" it, XX is the position of the drive in the disk shelf, and 00 is the ID of the disk shelf. At any given time only one controller owns a disk or a partition on a disk. When a controller owns a disk or a partition on a disk, it means that the controller serves data to hosts from that disk or partition. The HA partner is used only when the owner of the disk or partition dies; therefore each controller in ONTAP has its own drives/partitions and serves only its own drives/partitions, an architecture known as "shared nothing". There are two types of HA policies: SFO (storage failover) and CFO (controller failover). CFO is used for root aggregates and SFO for data aggregates. CFO does not change disk ownership in the aggregate, while SFO changes disk ownership in the aggregates.

ToasterA*> disk show -v
  DISK       OWNER                  POOL   SERIAL NUMBER
------------ -------------          -----  ----------------
0c.00.1      unowned                  -     WD-WCAW30485556
0c.00.2      ToasterA  (142222203)  Pool0   WD-WCAW30535000
0c.00.11     ToasterB  (142222204)  Pool0   WD-WCAW30485887
0c.00.6      ToasterB  (142222204)  Pool0   WD-WCAW30481983

But since each drive in a hardware appliance like a FAS & AFF system is connected to both controllers, each controller can address each disk. And if you manually change 0c in this example to 0d, the port through which the drive is also available, the system will be able to address the drive.

ToasterB*> disk assign 0d.00.1 -s 142222203
Thu Mar 1 09:18:09 EST [ToasterB:diskown.changingOwner:info]: 
changing ownership for disk 0d.00.1 (S/N WD-WCAW30485556) from (ID 1234) to ToasterA (ID 142222203)

Software-defined ONTAP storage (ONTAP Select & Cloud Volumes ONTAP), on the other hand, works very much like MetroCluster, because by definition it has no "special" equipment; in this case, it doesn't have special two-port drives connected to both servers (nodes). So instead of connecting both nodes of an HA pair to a single drive, ONTAP Select (and Cloud Volumes ONTAP) copies data from one controller to the second controller and keeps a copy of the data on each node. That is the price, the flip side, of commodity equipment.

Technically, it would be possible to connect a single external storage, for instance over iSCSI, to each storage node, avoiding unnecessary data duplication, but that option is not available in SDS ONTAP at the moment.

Mailbox “disk”

While it sounds like a disk, it is really not a disk but rather a tiny special area on a disk which consumes a few KB. That mailbox area is used to send messages from one HA partner to another. The mailbox disk is a mechanism which gives ONTAP an additional level of security for its HA capabilities. Mailbox disks are used to determine the state of the HA partner, in a way similar to email: each controller from time to time posts a message to its own (local) mailbox disks stating that it is alive, healthy, and well, while it reads from the partner's mailbox. If, on the other hand, the timestamp of the last message from the partner is too old, the surviving node will take over. In this way, if the HA interconnect is not available for some reason, or a controller freezes, the partner will determine the state of the second controller using the mailbox disks and will perform the takeover. If a disk with a mailbox dies, ONTAP chooses a new disk.
By default, mailbox disks reside on two disks: one data and one parity disk for RAID-4, or one parity and one double-parity disk for RAID-DP, and they usually reside in the first aggregate, which is usually the system root aggregate.

Cluster1::*> storage failover mailbox-disk show -node node1
Node    Location  Index Disk Name     Physical Location   Disk UUID
------- --------- ----- ------------- ------------------ -------------------
node1    local    0      1.0.4         local        20000000:8777E9D6:[...]
         local    1      1.0.6         partner      20000000:8777E9DE:[...]
         partner  0      1.0.1         local        20000000:877BA634:[...]
         partner  1      1.0.2         partner      20000000:8777C1F2:[...]

Takeover

When a node in an HA pair, whether software-defined or hardware, dies, the surviving one will "take over" and continue to serve the data of the offline node to the hosts. With a hardware appliance, the surviving node will also change disk ownership from the HA partner to itself.
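
Takeover and giveback can also be invoked manually, for example for planned maintenance; a sketch (node names are made up):

# check whether the HA pair is healthy and takeover is possible
Cluster1::> storage failover show
# take over the storage of node1 onto its HA partner
Cluster1::> storage failover takeover -ofnode node1
# once node1 is back, return its storage to it
Cluster1::> storage failover giveback -ofnode node1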

Active/Active and Active/Passive configurations

In an Active/Active configuration with the ONTAP architecture, each controller has its own drives and serves data to hosts; in this case, each controller has at least one data aggregate. In an Active/Passive configuration, the passive node does not serve data to hosts and has disk drives only for the root aggregate (for internal system needs). Both Active/Active and Active/Passive configurations need one root aggregate per node for the node to function properly. Aggregates are formed out of one or a few RAID groups. Each RAID group consists of a few disk drives or partitions. All the drives or partitions in an aggregate have to be owned by a single node.

Continue to read

How does the ONTAP cluster work?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution you make to improve this article; please leave your ideas and suggestions about this topic in the comments below.