ONTAP improvements in version 9.6 (Part 2)

Starting with ONTAP 9.6, all releases are long-term support (LTS) releases. Cluster setup now supports network auto-discovery from a computer, so there is no need to connect to the console to set up an IP address. All bug fixes are delivered in P-releases (9.xPy), where “x” is the minor ONTAP version and “y” is the P-release containing a batch of bug fixes. P-releases are published every 4 weeks.

New OnCommand System Manager based on APIs

First, System Manager no longer carries the OnCommand last name; it is now ONTAP System Manager. ONTAP System Manager shows the position of a failed disk in a disk shelf as well as the network topology. Like some other All-Flash vendors, the new dashboard shows storage efficiency as a single number that includes clones and snapshots, but you can still find the savings for each efficiency mechanism separately.

Two System Managers are available simultaneously in ONTAP 9.6:

  • The old one
  • The new API-based one
    • Press the “Try the new experience” button in the “old” System Manager to switch to it

NetApp will base System Manager and all new Ansible modules on REST APIs only, which shows NetApp is taking this direction seriously. With ONTAP 9.6, NetApp exposed the proprietary ZAPI functionality for cluster management through REST APIs (see more here & here). ONTAP System Manager shows the list of ONTAP REST API calls invoked for the operations you performed, which helps you understand how it works and use the APIs on a day-to-day basis. The REST APIs are available through the System Manager web interface at https://ONTAP_ClusterIP_or_Name/docs/API; the page includes:

  • Try it out feature
  • Generate the API token to authorize external use
  • And built-in documentation with examples.

Cluster management resources available through REST APIs in ONTAP 9.6 (a minimal usage example follows this list):

  • Cloud (object storage) targets
  • Cluster, nodes, jobs and cluster software
  • Physical and logical network
  • Storage virtual machines
  • SVM name services such as LDAP, NIS, and DNS
  • Resources of the storage area network (SAN)
  • Resources of Non-Volatile Memory Express.
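
To give a sense of how these endpoints are consumed, here is a minimal Python sketch that queries the cluster resource. The cluster address, credentials and the disabled certificate verification are illustrative assumptions; the /api/cluster endpoint itself is part of the documented ONTAP 9.6 REST API.

```python
import requests

# Assumed values for illustration only; replace with your cluster and credentials.
CLUSTER = "https://ONTAP_ClusterIP_or_Name"
AUTH = ("admin", "password")

# Query basic cluster information via the ONTAP REST API.
# verify=False is used here only because many lab clusters run self-signed certificates.
response = requests.get(
    f"{CLUSTER}/api/cluster",
    auth=AUTH,
    verify=False,
)
response.raise_for_status()

cluster = response.json()
print(cluster.get("name"), cluster.get("version", {}).get("full"))
```

The same pattern applies to the other resource areas listed above; the built-in /docs/api page on the cluster documents the exact schemas.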

APIs will help service providers and companies that deploy many ONTAP instances in an automated fashion. System Manager now saves historical performance information, whereas before 9.6 you could only see data from the moment you opened the statistics window, and the statistics were lost once you closed it. See the ONTAP guide for developers.

Automation is the big thing now

All new Ansible modules will use only REST APIs. A Python SDK will be available soon, with SDKs for some other languages to follow.

OCUM now AUM

OnCommand Unified Manager has been renamed to ActiveIQ Unified Manager. The renaming shows that Unified Manager is going to work more tightly with ActiveIQ in the NetApp cloud.

  • In this tandem, Unified Manager gives detailed, real-time analytics and simplifies key performance indicators and metrics so IT generalists can understand what’s going on; it allows you to troubleshoot and to automate and customize monitoring and management
  • ActiveIQ, meanwhile, is a cloud-based intelligence engine that provides predictive analytics and actionable intelligence, and gives recommendations to protect and optimize the NetApp environment.

Unified Manager 9.6 provides REST APIs and not only proactively identifies risks but, most importantly, now provides remediation recommendations. It also gives recommendations to optimize workload performance and storage resource utilization:

  • Pattern recognition eliminates manual efforts
  • QoS monitoring and management
  • Real-time events and maps of key components
  • Built-in analytics for storage performance optimizations

SnapMirror

SnapMirror Synchronous (SM-S) does not have automatic switchover yet, unlike MetroCluster (MCC), and this is the key difference that still keeps SM-S a DR solution rather than an HA solution.

  • New configuration supported: SM-S and then cascade SnapMirror Async (SM-A)
  • Automatic TLS encryption over the wire between ONTAP 9.6 and higher systems
  • Workloads with excessive file creation, directory creation, file permission changes, or directory permission changes (referred to as high-metadata workloads) are now suitable for SM-S
  • SM-S now supports additional protocols:
    • SMB v2 & SMB v3
    • NFS v4
  • SM-S now supports qtrees & FPolicy.

FlexGroup

Nearly all important FlexGroup limitations compared to FlexVols are now removed:

  • SMB Continuous Availability (CA) support allows running MS SQL & Hyper-V on FlexGroup
  • Constituent volume (auto-size) Elastic sizing & FlexGroup resize
    • If one constituent runs out of space, the system automatically takes space from other constituent volumes and provisions it to the one that needs it most. Previously this could result in an out-of-space error while space was still available in other constituent volumes. It does mean you are probably running short on space overall, so it might be a good time to add some more 😉
  • FlexGroup on MCC (FC & IP)
  • FlexGroup rename & re-size in GUI & CLI

FabricPool

FabricPool now supports Alibaba and Google Cloud object storage, and in the GUI you can now see the cloud latency of a volume.

Another exciting piece of news for me is the new “All” policy in FabricPool. It is exciting for me because I was one of those who insisted many times that a write-through directly to the cold tier is a must-have feature for secondary systems. The whole idea of joining SnapMirror & FabricPool on the secondary system was about space savings, so the secondary system can also be All-Flash but with many times less space for the hot tier. We want to use the secondary system in the role of DR, not as backup, because who wants to pay for a backup system as if it were flash, right? And if it is a DR system, the assumption is that someday the secondary system might become primary; once you try to run production on the secondary, you most probably will not have enough space on that system for the hot tier, which means your DR no longer works. Now with this new “All” policy, the idea of joining FabricPool with SnapMirror while getting space savings and a fully functional DR is going to work.

The new “All” policy replaces the “backup” policy in ONTAP 9.6, and you can apply it on primary storage, whereas the backup policy was available only on a SnapMirror secondary storage system. With the All policy enabled, all data written to a FabricPool-enabled volume is written directly to object storage, while metadata remains on the performance tier of the storage system.
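
As a rough illustration of how this could be automated, the sketch below patches a volume’s tiering policy through the REST API discussed earlier. The cluster address, credentials and volume name are placeholders, and the field layout is an assumption based on the /storage/volumes endpoint; verify the exact schema and release coverage on your cluster’s /docs/api page.

```python
import requests

# Assumed values for illustration only.
CLUSTER = "https://ONTAP_ClusterIP_or_Name"
AUTH = ("admin", "password")

# Look up the UUID of the volume we want to change ("dr_vol1" is a hypothetical name).
vols = requests.get(
    f"{CLUSTER}/api/storage/volumes",
    params={"name": "dr_vol1", "fields": "uuid,tiering.policy"},
    auth=AUTH,
    verify=False,
).json()
uuid = vols["records"][0]["uuid"]

# Set the FabricPool tiering policy of the volume to "all".
resp = requests.patch(
    f"{CLUSTER}/api/storage/volumes/{uuid}",
    json={"tiering": {"policy": "all"}},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
```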

SVM-DR is now supported with FabricPool too.

No more fixed ratio of max object storage compared to hot tier in FabricPool

FabricPool is a technology for tiering cold data to object storage, either in the cloud or on-prem, while hot data remains on flash media. When I say hot “data,” I mean data and metadata, where metadata is ALWAYS HOT, i.e. it always stays on flash. Metadata is stored in the inode structure, which is the source of WAFL black magic. From the introduction of FabricPool up to ONTAP 9.5, NetApp assumed that the hot tier (and in this context they were mostly thinking not about hot data itself but rather about metadata inodes) would always need at least 5% on-prem, which means a 1:20 ratio of hot tier to object storage. However, it turns out that is not always the case and most customers do not need that much space for metadata, so NetApp re-thought it, removed the hard-coded 1:20 ratio and instead introduced a 98% aggregate consumption model, which gives more flexibility. For instance, if the storage needs only 2% for metadata, then we can have a 1:50 ratio; this of course will only be the case in low-file-count environments & SAN. That means if you have an 800 TiB aggregate, you can store 39.2 PiB in cold object storage.
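
A back-of-the-envelope sketch of that arithmetic, under the assumption that the aggregate can be filled to 98% and that metadata consumes about 2% of the tiered data size, reproduces the figure quoted above:

```python
# Rough FabricPool sizing estimate; the 98% and 2% figures come from the text above.
aggregate_tib = 800          # usable hot-tier aggregate size
usable_fraction = 0.98       # 98% aggregate consumption model
metadata_fraction = 0.02     # assumed metadata overhead per unit of cold data

# Cold capacity whose metadata still fits into the usable part of the aggregate.
cold_capacity_tib = aggregate_tib * usable_fraction / metadata_fraction
print(f"{cold_capacity_tib:,.0f} TiB of cold data")  # -> 39,200 TiB for an 800 TiB aggregate
```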

Additional:

  • Aggregate-level encryption (NAE), which helps cross-volume deduplication gain savings
  • Multi-tenant key management allows managing encryption keys within an SVM (only external key managers are supported); previously this was available only at the cluster admin level. That is great news for service providers. Requires a key-manager license on ONTAP
  • Premium XL licenses for ONTAP Select allow ONTAP to consume more CPU & memory, which results in approximately 2x more performance
  • NetApp supports the 8000 series and 2500 series with ONTAP 9.6
  • Automatic Inactive Data Reporting for SSD aggregates
  • MetroCluster switchover and switchback operations from GUI
  • Trace File Access in the GUI allows tracing NAS files accessed by users
  • Encrypted SnapMirror by default: Primary & Secondary 9.6 or newer
  • FlexCache volumes now managed through GUI: create, edit, view, and delete
  • DP_Optimized (DPO) license: Increases max FlexVol number on a system
  • QoS minimum for ONTAP Select Premium (All-Flash)
  • QoS max available for namespaces
  • NVMe drives with encryption, which, unlike NSE drives, you can mix in a system
  • FlashCache with Cloud Volumes ONTAP (CVO)
  • Cryptographic Data Sanitization
  • Volume move now available with NVMe namespaces.

ONTAP 9.6 implements the SMB 3.0 CA witness protocol by using a node’s HA (SFO) partner LIF, which improves switchover time.

If two FabricPool aggregates share a single S3 bucket, a volume move between them will not rehydrate data and will move only the hot tier.

We expect 9.6RC1 around the second half of May 2019, with GA about six weeks later.


Disclaimer

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. No one is sponsoring this article.

Which kind of Data Protection is SnapMirror? (Part 2)

I keep facing this question over and over again in different forms. To answer it, we need to understand what kinds of data protection exist. The first part of this article is “How to make a Metro-HA from DR? (Part 1)”.

High Availability

This type of data protection tries its best to keep your data available all the time. If you have an HA service, it will continue to work even if one or even a few components fail, which means your Recovery Point Objective (RPO) is always 0 with HA, and your Recovery Time Objective (RTO) is near 0. Whatever the RTO number is, we assume that our service and the applications using that service will survive the failure (perhaps with a small pause), continue to function and not return an error to their clients. An essential part of any HA solution is automatic switchover between two or more components, so your applications transparently switch to the surviving elements and continue to interact with them instead of the failed one. With HA, your application timeouts (typically up to 180 seconds) should be set so that the RTO is equal to or lower than them. HA solutions are built not to reach those application timeouts, to make sure the applications do not return an error to upstream services but only experience a short pause. Whenever the RPO is not 0, it instantly means the data protection is not an HA solution. The biggest problem with HA solutions is that they are limited by the distance over which the components can communicate; the bigger the gap between them, the more time they need to keep all your data fully synchronous across all of them and be ready to take over from the failed part.

In the context of NetApp FAS/AFF/ONTAP systems, HA can be a local HA pair or a MetroCluster stretched between two sites up to 700 km apart.


Disaster Recovery

The second type of data protection is DR. What is the difference between DR and HA; they are both for data protection, right? By definition, DR is the kind of data protection that starts with the assumption that you have already gotten into a situation where your data is not available and your HA solution has failed for whatever reason. Why does DR assume your data is not available and you have a disruption in your infrastructure service? The answer is “by definition.” With DR you might or might not have RPO 0, but your RTO is never 0, which means you will get an error accessing your data; there will be a disruption in your service. DR assumes by definition that there is no fully automatic and transparent switchover.

Because HA and DR are both data protection techniques, people often confuse them, mix them up and do not see the difference, or vice versa, they try to contrast them and choose between them. But now, after this explanation of what they are and how they differ, you might already guess that you cannot replace one with the other; they do not compete but rather complement each other.

In the context of NetApp systems, SnapMirror technology is strongly associated with DR capabilities.


Backup & Archive data protection

Backup is another type of data protection. Backup is an even lower level of data protection than DR; it allows you to access your data at any time from the backup site in order to restore it to a production site. An essential property of backup data is that it is never altered. Therefore, with backup, we assume we restore data back to the original or another place but do not alter the backed-up data, which means we do not run DR on backup data. In the context of NetApp AFF/FAS/ONTAP systems, backup solutions are local Snapshots (in a sense) and the SnapVault D2D replication technology. In ONTAP Cluster-Mode (version 8.3 and newer), SnapVault became XDP, just another engine for SnapMirror. With XDP, SnapMirror is capable of Unified Replication for both DR and backup. With archives, you do not have online access to your backups, so you need some time to bring them online before you can restore them back to the source or another location. A tape library or NetApp Cloud Backup are examples of archive solutions.


Is SnapMirror an HA or a DR data protection technology?

There is no straightforward answer to that, and to answer the question we have to consider the details.

SnapMirror comes in two flavors. Asynchronous SnapMirror transfers data to a secondary site from time to time; it is obviously a DR technology because you cannot switch to the DR site automatically, since you do not have the latest version of your data. That means that before you start your applications, you might need to prepare them first. For instance, you might need to apply DB logs to your database so that your “not the latest version of the data” becomes the latest one. Alternatively, you might need to choose which of the last few snapshots to restore, because the latest one might contain corrupted data, a virus for instance. Again, by definition a DR scenario assumes that you will not switch to the DR site instantly, it assumes you already have downtime, and it assumes you might need manual interaction, a script, or some modifications before you’re able to start & run your services, which requires some downtime.

Synchronous SnapMirror (SM-S) also has two modes: Strict (full synchronous) mode and Relaxed synchronous mode. The problem with synchronous replication, similarly to HA solutions, is that the longer the distance between the two sites, the more time is needed to replicate the data. And the longer it takes for the data to be transferred and acknowledged back to the first system, the longer your application waits for the confirmation from your storage.

Relaxed mode allows for lags and network outages and auto-syncs again after network communication is restored, which means it is also a DR solution because it allows the RPO to be non-zero.

Strict mode, by definition, does not tolerate network outages, which means it ensures your RPO is always 0, which makes it somewhat closer to HA.

Does it mean Synchronous SnapMirror in Strict mode is an HA solution?

Well, not precisely. Synchronous SnapMirror in Strict mode can also be part of a DR solution. For instance, you might have a DB with all the data replicated asynchronously to a DR site and only the DB logs replicated synchronously; this way we reduce network traffic between the two locations, keep the overall RPO small and, by replaying the synchronously replicated logs, restore the DB so that the entire DB effectively has RPO 0. In such a scenario the RTO will not be that big, yet it allows the two sites to be located very far from each other. See the scenarios of how SnapMirror Sync can be combined with SnapMirror Async to build a more robust and beneficial DR solution.

To comply with the HA definition, you need not only an RPO of 0 but also the ability to automatically switch over with an RTO no higher than the timeouts of your applications & services.

Can SM-S Strict mode switchover between sites automatically?

The answer is “not YET.” For automatic switchover between sites, NetApp has an entirely different technology called MetroCluster, which is a Metro-HA technology. Any MetroCluster or local HA system should be accompanied by DR, Backup & Archive technologies to provide the best data protection possible.

Will SM-S become HA?

I personally believe that NetApp will make it possible in the future to automatically switch over between two sites with SM-S. Most probably it will revolve around the SVM-DR feature, to replicate not only data but also network interfaces and configuration, and for that SM-S will need some kind of tiebreaker like in MCC; those pieces are not there yet. In my personal opinion, this kind of technology most probably is going to (and should) be positioned as an online data migration technology across the NetApp Data Fabric rather than as a (Metro-)HA solution.

Why should SM-S not be positioned as an HA?

A few reasons:

1) NetApp already has MetroCluster (MCC) technology, and for many, many years it was and still is a superior Metro-HA technology proven to be stable, reliable and performant.

2) MCC has now become easier, simpler and smaller, and the only reasons you would want HA on top of SnapMirror are basically those three. Since we already have MCC over IP (MC-IP), it is theoretically possible to run it even on the smallest AFF systems someday.

My own sense is that, in some cases, SM-S might be used as an HA solution someday.

How are HA, DR & Backup solutions applied in practice?

As you remember, HA, DR & Backup solutions do not compete with but rather complement each other to provide full data protection. In a perfect world with no budget constraints, where you need to provide the highest possible and fully covered data protection, you would need HA, DR, Backups, and Archive: HA located in one place or geo-distributed as far as possible (up to 700 km), and on top of that DR and Backups. For backups, you would probably place your site as far away as possible, for instance, on the other side of the country or even on another continent. In these circumstances, you can run Synchronous SnapMirror only for some of your data, like DB logs, and Async for the rest to an intermediate DR site (up to 10 ms network RTT latency), and from that intermediate site replicate all the data asynchronously, or as backup protection, to another continent. And from the DR and/or backup sites we can archive to a tape library, NetApp Cloud Backup or another archive solution.


Summary

HA, DR, Backup and Archive are different types of data protection which complement each other. In the best case, a company should have not only an HA solution for its data but also DR, Backup, and Archive, or at the very least an HA solution & Backup; but it always depends on business needs, the business’s willingness to pay for some level of protection, and its understanding of the risks involved in not protecting the data properly.

How to make a Metro-HA from DR? (Part 1)

This is indeed a frequently asked question that comes in many different forms, like: can NetApp’s DR solution automatically switch sites on a DR event with a FAS2000/A200 system?

As you might guess, in the NetApp world Metro-HA is called MetroCluster (or MCC) and DR is called Asynchronous SnapMirror. (Read about SnapMirror Synchronous in Part 2.)

It is the same sort of question as asking “Can you build a MetroCluster-like solution based on A200/FAS2000 with async SnapMirror, without buying a MetroCluster; is there an out-of-the-box solution?”. The short answer is no, you cannot do that. There are a few quite good reasons for that:

  • First of all: DR & HA (or Metro-HA) protect from different kinds of failures and are therefore designed to behave & work quite differently, though both are data protection technologies. MetroCluster is basically an HA solution stretched between two sites (up to 300 km for HW MCC or up to 10 km for MetroCluster SDS); it is not a DR solution
  • MetroCluster is based on another technology called SyncMirror; it requires additional PCI cards, models higher than the A200/FAS2000, and there are some other requirements too.

Data Protection technologies comparison

Async SnapMirror, on the other hand, is designed to provide Disaster Recovery, not Metro-HA. When you say DR, it means you store point-in-time data (snapshots) for cases like (logical) data corruption, so you have the ability to choose which snapshot to restore. Moreover, that ability also means responsibility, because you or another human must decide which one to select & restore. So, there is no “automatic, out-of-the-box” switchover to the DR site with Async SnapMirror as there is with MCC. Once you have many snapshots, you have many options, which means it is not easy for a program or a system to decide which one it should switch to. Also, SnapMirror provides many options for backup & restore:

  • Different platforms on main & DR sites (in MCC both systems must be the same model)
  • Different number & types of drives (in MCC mirrored aggregates must be the same size & drive type)
  • Fan-Out & Cascade replicas (MCC has only two sites)
  • Replication can be done over L3, no L2 requirements (MCC only for L2)
  • You can replicate separate Volumes or entire SVM (with exclusions for some of the volumes if necessary). With MCC you replicate entire storage system config and selected aggregates
  • Many snapshots (though MCC can contain snapshots, it switches only between the active file systems on both sites).

All these options give async SnapMirror a lot of flexibility and mean your storage system would need very complex logic to switch between sites automatically. Long story short, it is impossible to have a single solution whose logic satisfies every customer, all possible configurations & all applications at once. In other words, with a solution as flexible as async SnapMirror, switchover is in many cases done manually.

At the end of the day, an automatic or semi-automatic switchover is possible

Automatic or semi-automatic switchover is possible, but it must be designed very carefully with knowledge of the environment and an understanding of the precise customer situation, and customized for:

  • Different environments
  • Different protocols
  • Different applications.

MetroCluster, on the other hand, can automatically switch over between sites if one site fails, but it operates only with the active file system and solves only the data availability problem, not data corruption. That means if your data has been (logically) corrupted by, let’s say, a virus, a MetroCluster switchover is not going to help, but Snapshots & SnapMirror will. Unlike SnapMirror, MetroCluster has strict, deterministic environmental requirements and only two sites between which the system can switch, plus it works only with the active file system (no snapshots are used); in this deterministic environment it is possible to determine the surviving site and switch to it automatically with a tiebreaker. A tiebreaker is software with built-in logic that makes the decision for site switchover.

SVM DR

SVM DR does not replicate some of the SVM’s configuration to the DR site, so you must configure it manually or prepare a script that will do it for you in case of a disaster.

Do not mix up Metro-HA (MetroCluster) & DR; those are two separate and not mutually exclusive data protection technologies. You can have both MetroCluster & DR, and big companies usually have both MetroCluster & SnapMirror because they have the budgets, business requirements & approval for that. The same logic applies not only to NetApp systems but to all storage vendors.

The solution

In this particular case, a customer with a FAS2000/A200 & async SnapMirror can have only DR, so after a disaster event on the primary site the volumes must be mounted to the hosts manually on the DR site. It is, however, possible to set up & configure your own script, with logic suitable for your environment, which switches between sites automatically or semi-automatically; see the sketch below. For this purpose, tools like NetApp Workflow Automation & a PowerShell script to backup/restore ONTAP SMB shares can help do the job. Also, you might be interested in a VMware SRM + NetApp SRM plugin configuration, which can give you a relatively easy way to switch between sites.
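
As a very rough illustration of what such a script might do, here is a minimal Python sketch that breaks a SnapMirror relationship on the DR cluster through the REST API described in the other post. The cluster address, credentials and volume path are placeholders, and endpoint availability should be verified against your cluster’s /docs/api page; on systems without REST access the same step would be done with the snapmirror CLI or ZAPI. A real failover script would also handle quiescing, host mounts/mappings, LIF or DNS changes and application startup.

```python
import requests

# Placeholder values for illustration only; adapt to your environment.
DR_CLUSTER = "https://dr-cluster.example.com"
AUTH = ("admin", "password")
DESTINATION_PATH = "svm_dr:dr_vol1"   # hypothetical destination volume

# Find the SnapMirror relationship for the destination volume on the DR cluster.
rels = requests.get(
    f"{DR_CLUSTER}/api/snapmirror/relationships",
    params={"destination.path": DESTINATION_PATH},
    auth=AUTH,
    verify=False,
).json()
uuid = rels["records"][0]["uuid"]

# Break the relationship so the DR volume becomes writable.
# After this step you still need to map LUNs / mount shares and start applications.
resp = requests.patch(
    f"{DR_CLUSTER}/api/snapmirror/relationships/{uuid}",
    json={"state": "broken_off"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
```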

The second part of this article is “Which kind of Data Protection is SnapMirror? (Part 2)”.