Why use NetApp snapshots even when you do not have Premium bundle software?

If you do not want to read any further, the short answer is: use snapshots to improve RPO, use ndmpcopy to restore files and LUNs, and use SnapCreator for app-consistent snapshots.

Premium bundle includes a good deal of software besides Base software in each ONTAP system, like:

  • SnapCenter
  • SnapRestore
  • FlexClone
  • And others.

So, without Premium bundle, with only Basic software we have two issues:

  • You can create snapshots, but without SnapRestore or FlexClone you cannot restore them quickly
  • And without SnapCenter you cannot make application consistent snapshot.

And some people asking, “Do I need to use NetApp snapshots in such circumstances?”

And my answer is: Yes, you can, and you should use ONTAP snapshots.

Here is the explanation of why and how:

Snapshots without SnapRestore

Why use NetApp storage hardware snapshots? Because they carry no performance penalty, and there is no such thing as snapshot consolidation (which, on other platforms, causes a performance impact). NetApp snapshots work very well and have other advantages too. Even though restoring data captured in snapshots is not as fast as it would be with SnapRestore or FlexClone, you can create snapshots very quickly. And since restores are usually needed quite rarely, fast snapshot creation with slow restoration still gives you a better RPO compared to a full backup. Of course, I have to admit that the RPO improves only for cases where your data was logically corrupted and the storage was not physically damaged; if your storage is physically damaged, snapshots will not help. With ONTAP you can have up to 1023 snapshots per volume, and you can create them as often as you need with no performance degradation whatsoever, which is pretty awesome.
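
For illustration, here is what the scheduling side looks like in the ONTAP CLI. This is only a sketch: the SVM, volume and policy names are made up, and the schedule counts are arbitrary examples.

cluster1::> volume snapshot policy create -vserver svm1 -policy hourly_plus_daily -enabled true -schedule1 hourly -count1 24 -schedule2 daily -count2 14
cluster1::> volume modify -vserver svm1 -volume vol1 -snapshot-policy hourly_plus_daily
cluster1::> volume snapshot create -vserver svm1 -volume vol1 -snapshot before_upgrade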

Snapshots with NAS 

If we are speaking about a NAS environment without a SnapRestore license, you can always go to the .snapshot folder and copy back any previous version of a file you need to restore. You can also use the ndmpcopy command to perform file, folder, or even whole-volume restoration inside the storage without involving a host (see the example below).
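
A hedged example of such an in-storage restore with ndmpcopy, run from the nodeshell (the SVM, volume, snapshot and path names are made up; double-check the path convention for your ONTAP release):

cluster1::> system node run -node node1 ndmpcopy /svm1/vol1/.snapshot/daily.2019-05-20_0010/projects /svm1/vol1/projects_restored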

Snapshots with SAN 

If we are speaking about a SAN environment without a SnapRestore license, you cannot simply copy a single file out of your LUN and restore it. There are two stages if you need to restore something on a LUN:

  1. You copy the entire LUN from a snapshot
  2. And then you can either:
    • Restore the entire LUN in place of the last active version of your LUN
    • Or copy the required data from the copied LUN to the active LUN.

For the first stage you can use either the ndmpcopy or the lun copy command. If you want to restore only some files from an old version of the LUN captured in a snapshot, you need to map the copied LUN to a host and copy the required data back to the active LUN (a hedged sketch follows below).
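
Here is a sketch of both stages with lun copy. All names are made up, and the exact way to point the source at a snapshot can differ between ONTAP releases, so treat the first command as an assumption to verify (for example with lun copy start ?) rather than confirmed syntax:

cluster1::> lun copy start -vserver svm1 -source-path /vol/vol1/.snapshot/daily.2019-05-20_0010/lun1 -destination-path /vol/vol1/lun1_restored
cluster1::> lun mapping create -vserver svm1 -path /vol/vol1/lun1_restored -igroup esx_hosts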

Application consistent storage snapshots 

Why do you need application consistency in the first place? Sometimes, in an environment like a NAS file share with documents, you do not need it at all. But if you are running applications like Oracle DB, MS SQL or VMware, you'd better have application consistency. Imagine you have a Windows machine and you pull the hard drive out while Windows is running. Let's forget for a moment that Windows will stop working; that is not the point here. Let's focus on the data protection side. The same thing happens when you create a storage snapshot: the data captured in that snapshot will be similarly incomplete. Would the pulled-out hard drive be a proper copy of your data? Only sort of, because some of the data will be lost from host memory, and the file system will probably not be consistent. Even though a journaled file system can usually be recovered, your application data may be damaged in a way that is hard to repair, precisely because of the data lost from host memory. A crash-consistent snapshot likewise contains a potentially damaged file system: if you restore from such a copy, Windows might not start, or it might start after a file system check, but your applications, and especially databases, will definitely not like such a backup. Why? Because the applications and the OS running on your machine did not have a chance to destage data from memory to the drive. So, you need something that prepares your OS & applications before the snapshot is taken. As you may know, application-consistent storage hardware snapshots can be created by backup software like Veeam, Commvault, and many others, or you can even trigger storage snapshot creation yourself with a relatively simple Ansible or PowerShell script (see the sketch below). You can also create application-consistent snapshots with the free NetApp SnapCreator framework. Unlike SnapCenter, it does not have simple, straightforward GUI wizards that walk you through integrating it with your application; most of the time you have to write a small script for your application to benefit from online, application-consistent snapshots. Another downside is that SnapCreator is not officially supported software. But at the end of the day, it is a relatively easy setup, and it will definitely pay off once you finish it.
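
As a minimal sketch of such a script (everything here is an assumption: passwordless SSH to the cluster management LIF as user "snapadmin", SVM "svm1", volume "db_vol", and application data under /data that can tolerate a short freeze):

#!/bin/sh
# Quiesce the filesystem so application data is flushed and frozen for a moment
fsfreeze -f /data
# Take the hardware snapshot on the ONTAP side over SSH
ssh snapadmin@cluster1 "volume snapshot create -vserver svm1 -volume db_vol -snapshot app_$(date +%Y%m%d_%H%M)"
# Resume I/O; the whole pause is typically a second or two
fsfreeze -u /data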

List of other software features available in Basic software

This Basic ONTAP functionality also might be useful: 

  • Horizontal scaling, non-disruptive operations such as online volume & LUN migration, non-disruptive upgrades and adding new nodes to the cluster
  • API automation
  • FPolicy file screening
  • Create snapshots to improve RPO
  • Storage efficiencies: Deduplication, Compression, Compaction
  • By default, ONTAP deduplicates data across the active file system and all the snapshots on the volume. Savings from snapshot data sharing scale with the number of snapshots: the more snapshots you have, the more savings you'll get
  • Storage Multi-Tenancy
  • QoS Maximum
  • External key manager for Encryption
  • Host-based MAX Data software which works with ONTAP & SAN protocols
  • You can buy FlexArray license to virtualize 3rd party storage systems
  • If you have an All-Flash system, you can purchase an additional FabricPool license, which is useful especially with snapshots, because it destages cold data to cheap object storage like AWS S3, Google Cloud, Azure Blob, IBM Cloud, Alibaba Cloud or an on-premises StorageGRID system, etc.

Summary

Even the Basic software on your ONTAP system has rich functionality; you definitely should use NetApp snapshots and set up application integration to make your snapshots application consistent. With NetApp hardware storage snapshots you can have 1023 snapshots per volume and create them as often as you need without sacrificing storage performance, so snapshots will improve your RPO. Application consistency with SnapCreator or any other 3rd-party backup software will give you confidence that the snapshots will be restorable when needed.

ONTAP improvements in version 9.6 (Part 2)

Starting with ONTAP 9.6, all releases are long-term support (LTS). Cluster setup now supports network auto-discovery from a computer, so there is no need to connect to the console to set up an IP address. All bug fixes are delivered in P-releases (9.xPy), where "x" is the minor ONTAP version and "y" is the P-release containing a batch of bug fixes; P-releases will ship every 4 weeks.

New OnCommand System Manager based on APIs

First, System Manager no longer carries the OnCommand last name; it is now ONTAP System Manager. ONTAP System Manager shows the position of a failed disk in a disk shelf and the network topology. Like some other all-flash vendors, the new dashboard shows storage efficiency as a single number that includes clones and snapshots, but you can still find the information for each efficiency mechanism separately.

Two system managers available simultaneously for ONTAP 9.6:

  • The old one
  • New API-based one (on the image below)
    • Press “Try the new experience” button from the “old” system manager

NetApp will base System Manager and all new Ansible modules on REST APIs only, which shows NetApp is taking them rather seriously. With ONTAP 9.6, NetApp exposed the functionality of the proprietary ZAPI via REST API access for cluster management (see more here & here). ONTAP System Manager shows the list of ONTAP REST APIs invoked for the operations it performs, which helps you understand how it works and use the APIs on a day-to-day basis (a curl sketch follows the list below). The REST APIs are available through the System Manager web interface at https://ONTAP_ClusterIP_or_Name/docs/API; the page includes:

  • Try it out feature
  • Generate the API token to authorize external use
  • And built-in documentation with examples.
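
To get a feel for the API outside of System Manager, you can call it with plain curl; a minimal sketch (cluster name and credentials are obviously examples):

# List SVMs and volumes through the ONTAP 9.6 REST API
curl -k -u admin:password "https://cluster1.example.com/api/svm/svms?fields=name,state"
curl -k -u admin:password "https://cluster1.example.com/api/storage/volumes?fields=name,state,size"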

List of cluster management available through REST APIs in ONTAP 9.6:

  • Cloud (object storage) targets
  • Cluster, nodes, jobs and cluster software
  • Physical and logical network
  • Storage virtual machines
  • SVM name services such as LDAP, NIS, and DNS
  • Resources of the storage area network (SAN)
  • Resources of Non-Volatile Memory Express.

The APIs will help service providers and companies that deploy many ONTAP instances in an automated fashion. System Manager now keeps historical performance data, while before 9.6 you could only see data from the moment you opened the statistics window, and the statistics were lost once you closed it. See the ONTAP guide for developers.

Automation is the big thing now

All new Ansible modules will use only REST APIs. A Python SDK will be available soon, as will SDKs for some other languages.

OCUM now AUM

OnCommand Unified Manager has been renamed to ActiveIQ Unified Manager. The renaming shows that Unified Manager is going to work more tightly with ActiveIQ in the NetApp cloud.

  • In this tandem, Unified Manager gives detailed, real-time analytics and simplifies key performance indicators and metrics so IT generalists can understand what's going on; it allows you to troubleshoot and to automate and customize monitoring and management
  • While ActiveIQ is a cloud-based intelligence engine that provides predictive analytics and actionable intelligence and gives recommendations to protect and optimize your NetApp environment.

Unified Manager 9.6 provides REST APIs and not only proactively identifies risks but, most importantly, now provides remediation recommendations. It also gives recommendations to optimize workload performance and storage resource utilization:

  • Pattern recognition eliminates manual efforts
  • QoS monitoring and management
  • Realtime events and maps key components
  • Built-in analytics for storage performance optimizations

SnapMirror

SnapMirror Synchronous (SM-S) does not yet have automatic switchover like MetroCluster (MCC), and this is the key difference that still makes SM-S a DR solution rather than HA.

  • New configuration supported: SM-S and then cascade SnapMirror Async (SM-A)
  • Automatic TLS encryption over the wire between ONTAP 9.6 and higher systems
  • Workloads that have excessive file creation, directory creation, file permission changes, or directory permission changes (referred to as high-metadata workloads) are suitable for SM-S
  • SM-S now supports additional protocols:
    • SMB v2 & SMB v3
    • NFS v4
  • SM-S now supports qtrees & FPolicy.

FlexGroup

Nearly all important FlexGroup limitations compared to FlexVols are now removed:

  • SMB Continuous Availability (CA) support allows running MS SQL & Hyper-V on FlexGroup
  • Constituent volume (auto-size) Elastic sizing & FlexGroup resize
    • If one constituent runs out of space, the system automatically takes space from other constituent volumes and provisions it to the one that needs it the most. Previously this could result in an out-of-space error while space was still available in other constituents. It also means you are probably running short on space overall, so it might be a good time to add some more 😉
  • FlexGroup on MCC (FC & IP)
  • FlexGroup rename & re-size in GUI & CLI

FabricPool

FabricPool now supports Alibaba and Google Cloud object storage, and in the GUI you can now see the cloud latency of a volume.

Another piece of news that excites me is the new "All" policy in FabricPool. It excites me because I was one of those who insisted many times that writing through directly to the cold tier is a must-have feature for secondary systems. The whole idea of joining SnapMirror & FabricPool on the secondary system was about space savings, so the secondary system can also be All Flash but with far less space for the hot tier. We should use the secondary system in the role of DR, not as backup, because who wants to pay for a backup system as for flash, right? And if it is a DR system, it is assumed that someday the secondary system might become primary, and once you try to run production on the secondary, you most probably will not have enough hot-tier space on that system, which means your DR no longer works. Now that we have the new "All" policy, the idea of joining FabricPool with SnapMirror while getting space savings and a fully functional DR is going to work.

The new "All" policy replaces the "backup" policy in ONTAP 9.6, and you can apply it on primary storage, while the backup policy was available only on a SnapMirror secondary storage system. With the All policy enabled, all data written to a FabricPool-enabled volume is written directly to object storage, while metadata remains on the performance tier of the storage system.
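
Applying the policy is a one-liner; a hedged example (the SVM and volume names are made up):

cluster1::> volume modify -vserver svm1 -volume backup_vol -tiering-policy all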

SVM-DR now supported with FabricPool too.

No more fixed ratio of max object storage compare to hot tier in FabricPool

FabricPool is a technology for tiering cold data to object storage, either in the cloud or on-prem, while hot data remains on flash media. When I say hot "data," I mean data and metadata, where metadata is ALWAYS HOT and always stays on flash. Metadata is stored in the inode structure, which is the source of WAFL black magic. Since FabricPool was introduced and up to ONTAP 9.5, NetApp assumed that the hot tier (and here they were mostly thinking not about hot data itself but rather about the metadata inodes) would always need at least 5% on-prem, which means a 1:20 ratio of hot tier to object storage. However, it turns out this is not always the case and most customers do not need that much space for metadata, so NetApp rethought it, removed the hard-coded 1:20 ratio, and instead introduced a 98% aggregate consumption model, which gives more flexibility. For instance, if the storage needs only 2% for metadata, we can have a 1:50 ratio; this, of course, will only be the case in low-file-count environments & SAN. That means that with an 800 TiB aggregate you can store about 39.2 PiB in cold object storage.

Additional:

  • Aggregate-level encryption (NAE) helps cross-volume deduplication gain savings
  • Multi-tenant key management allows managing encryption keys within an SVM (only external key managers are supported; previously this was available only at the cluster admin level). Great news for service providers. Requires a key-manager license on ONTAP
  • Premium XL licenses for ONTAP Select allow ONTAP to consume more CPU & memory, which results in approximately 2x more performance
  • NetApp supports the 8000 series and 2500 series with ONTAP 9.6
  • Automatic Inactive Data Reporting for SSD aggregates
  • MetroCluster switchover and switchback operations from GUI
  • Trace File Access in the GUI allows tracing which files on NAS are accessed by users
  • Encrypted SnapMirror by default: Primary & Secondary 9.6 or newer
  • FlexCache volumes now managed through GUI: create, edit, view, and delete
  • DP_Optimized (DPO) license: Increases max FlexVol number on a system
  • QoS minimum for ONTAP Select Premium (All-Flash)
  • QoS max available for namespaces
  • NVMe drives with encryption, which, unlike NSE drives, you can mix in a system
  • FlashCache with Cloud Volumes ONTAP (CVO)
  • Cryptographic Data Sanitization
  • Volume move now available with NVMe namespaces.

Implemented the SMB 3.0 CA witness protocol using a node's HA (SFO) partner LIF, which improves switchover time.

If two FabricPool aggregates share a single S3 bucket, volume migration will not rehydrate data and will move only the hot tier.

We expect 9.6RC1 around the second half of May 2019, and GA comes about six weeks later.

Read more

Disclaimer

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. No one is sponsoring this article.

ONTAP and ESXi 6.x tuning

This article will be useful to those who own an ONTAP system and ESXi environment.

ESXi tuning can be divided into the next parts:

  • SAN network configuration optimization
  • NFS network configuration optimization
  • Hypervisor optimization
  • Guest OS optimization
  • Compatibility for software, firmware, and hardware

There are a few documents which you should use when tuning ESXi for NetApp ONTAP:

TR-4597 VMware vSphere with ONTAP

SAN network

In this section, we will describe configurations for iSCSI, FC, and FCoE SAN protocols

ALUA

ONTAP 9 always has ALUA enabled for the FC, FCoE and iSCSI protocols. If the ESXi host correctly detects ALUA, the Storage Array Type plug-in will show VMW_SATP_ALUA. With ONTAP you can use either the Most Recently Used or the Round Robin load balancing algorithm.

Round Robin will show better results if you have more than one path to a controller. In the case of Microsoft Cluster with RDM drives it is recommended to use the Most Recently Used algorithm. Read more about Zoning for ONTAP clusters.

Storage   ALUA      Protocols        ESXi policy      Algorithm
ONTAP 9   Enabled   FC/FCoE/iSCSI    VMW_SATP_ALUA    Most Recently Used or Round Robin

Let’s check policy and algorithm applied to a Datastore:

# esxcli storage nmp device list
naa.60a980004434766d452445797451376b
Device Display Name: NETAPP Fibre Channel Disk (naa.60a980004434766d452445797451376b)
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on;explicit_support=off; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=0,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba2:C0:T6:L119, vmhba1:C0:T7:L119
Is Local SAS Device: false
Is USB: false
Is Boot USB Device: false
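
If a device came up with a different policy, you can switch a single device to Round Robin or make Round Robin the default for everything claimed by VMW_SATP_ALUA. A hedged example, reusing the device ID from the output above:

# Make Round Robin the default PSP for all devices claimed by VMW_SATP_ALUA
esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_RR
# Or change just one device
esxcli storage nmp device set --device naa.60a980004434766d452445797451376b --psp VMW_PSP_RR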

Ethernet Network

Ethernet network can be used for NFS and iSCSI protocols.

Jumbo frames

Whether you are using iSCSI or NFS, it is recommended to use jumbo frames with link speeds of 1Gbps or greater. When you set up jumbo frames, configure MTU 9000 end to end: on the storage ports, on the switch ports, and on the ESXi virtual switches and VMkernel ports.

ESXi & MTU9000

When you are setting up a virtual machine for the best network performance, you'd better use the VMXNET3 virtual adapter, since it supports both speeds greater than 1Gbps and MTU 9000, while the E1000e virtual adapter supports MTU 9000 but only speeds up to 1Gbps. Also, E1000e is the default adapter for all VMs except Linux ones. Flexible virtual adapters support only MTU 1500.

To achieve maximum network throughput, connect your VM to a virtual switch that also has MTU 9000 set.
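
A sketch of the ESXi side of this; the vSwitch name, VMkernel interface and storage LIF IP below are assumptions, not taken from this article:

# Set MTU 9000 on a standard vSwitch and on the NFS/iSCSI VMkernel port
esxcli network vswitch standard set -v vSwitch1 -m 9000
esxcli network ip interface set -i vmk1 -m 9000
# Verify jumbo frames pass end to end without fragmentation
vmkping -d -s 8972 192.168.100.11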

NAS & VAAI

ONTAP storage systems support VAAI (vSphere Storage APIs – Array Integration). VAAI, also known as the hardware acceleration or hardware offload APIs, is a set of APIs that enables communication between VMware vSphere ESXi hosts and storage devices. Instead of the ESXi host copying data from storage, modifying it in host memory and putting it back to storage over the network, with VAAI some of those operations can be done by the storage itself, driven by API calls from the ESXi host. VAAI is enabled by default for SAN protocols but not for NAS. For NAS VAAI to work, you need to install a VIB kernel module called NetAppNFSVAAI on each ESXi host (a hedged install example is shown after the export-policy output below). Do not expect VAAI to solve all your problems, but some operations will definitely get faster. NetApp VSC can also help with the NetAppNFSVAAI installation. For NFS VAAI to function, you have to set up your NFS share on the storage properly and meet a few criteria:

  1. On the ONTAP storage, set the NFS export policy so the ESXi servers can access it
  2. The RO, RW and Superuser fields must be set to SYS or ANY in the export policy for your volume
  3. You have to enable both the NFSv3 AND NFSv4 protocols, even if NFSv4 will not be used
  4. Parent volumes in your junction path have to be readable. In most cases this means the root volume (vsroot) of your SVM needs at least the Superuser field set to SYS. Moreover, it is recommended to prohibit write access to the SVM root volume.
  5. The vStorage feature has to be enabled for your SVM (vserver)

Example:

#cma320c-rtp::> export-policy rule show -vserver svm01 -policyname vmware_access -ruleindex 2
(vserver export-policy rule show)
Vserver: svm01
Policy Name: vmware_access <--- Applied to Exported Volume
Rule Index: 2
Access Protocol: nfs3 <---- needs to be 'nfs' or 'nfs3,nfs4'
Client Match Spec: 192.168.1.7
RO Access Rule: sys
RW Access Rule: sys
User ID To Which Anonymous Users Are Mapped: 65534
Superuser Security Flavors: sys
Honor SetUID Bits In SETATTR: true

#cma320c-rtp::> export-policy rule show -vserver svm01 -policyname root_policy -ruleindex 1
(vserver export-policy rule show)
Vserver: svm01
Policy Name: root_policy <--- Applied to SVM root volume
Rule Index: 1
Access Protocol: nfs <--- like requirement 1, set to nfs or nfs3,nfs4
Client Match Spec: 192.168.1.5
RO Access Rule: sys
RW Access Rule: never <--- this can be never for security reasons
User ID To Which Anonymous Users Are Mapped: 65534
Superuser Security Flavors: sys <--- this is required for VAAI to be set, even in the parent volumes like vsroot
Honor SetUID Bits In SETATTR: true
Allow Creation of Devices: true

#cma320c-rtp::> nfs modify -vserver svm01 -vstorage enabled
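
On the ESXi side, the NFS VAAI plug-in mentioned above can be installed manually. A hedged example (the VIB file name and path are assumptions; VSC can push the plug-in for you, and the host needs a reboot afterwards):

esxcli software vib install -v /tmp/NetAppNasPlugin.vib
esxcli software vib list | grep -i netapp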

ESXi host

First of all, let's not forget that it is a good idea to leave 4GB of memory for the hypervisor itself. We also need to tune some network-related values on ESXi:

Parameter                          Protocols       Value for ESXi 6.x with ONTAP 9.x
Net.TcpipHeapSize                  iSCSI/NFS       32
Net.TcpipHeapMax                   iSCSI/NFS       1536
NFS.MaxVolumes                     NFS             256
NFS41.MaxVolumes                   NFS 4.1         256
NFS.HeartbeatMaxFailures           NFS             10
NFS.HeartbeatFrequency             NFS             12
NFS.HeartbeatTimeout               NFS             5
NFS.MaxQueueDepth                  NFS             64 (if you have only AFF, then 128 or even 256)
Disk.QFullSampleSize               iSCSI/FC/FCoE   32
Disk.QFullThreshold                iSCSI/FC/FCoE   8
VMFS3.HardwareAcceleratedLocking   iSCSI/FC/FCoE   1
VMFS3.EnableBlockDelete            iSCSI/FC/FCoE   0

We can do it in a few ways:

  • The easiest way, again, to use VSC which will configure these values for you
  • Command Line Interface (CLI) on ESXi hosts
  • With the GUI interface of vSphere Client/vCenter Server
  • Remote CLI tool from VMware.
  • VMware Management Appliance (VMA)
  • Applying Host Profile

Let’s set up these values manually in command line:

# For Ethernet-based protocols like iSCSI/NFS
esxcfg-advcfg -s 32 /Net/TcpipHeapSize
esxcfg-advcfg -s 1536 /Net/TcpipHeapMax

# For NFS protocol
esxcfg-advcfg -s 256 /NFS/MaxVolumes
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency
esxcfg-advcfg -s 5 /NFS/HeartbeatTimeout
esxcfg-advcfg -s 64 /NFS/MaxQueueDepth

# For NFS v4.1 protocol
esxcfg-advcfg -s 256 /NFS41/MaxVolumes

# For iSCSI/FC/FCoE SAN protocols
esxcfg-advcfg -s 32 /Disk/QFullSampleSize
esxcfg-advcfg -s 8 /Disk/QFullThreshold

And now let’s check those settings:

# For Ethernet-based protocols like iSCSI/NFS
esxcfg-advcfg -g /Net/TcpipHeapSize
esxcfg-advcfg -g /Net/TcpipHeapMax

# For NFS protocol
esxcfg-advcfg -g /NFS/MaxVolumes
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures
esxcfg-advcfg -g /NFS/HeartbeatFrequency
esxcfg-advcfg -g /NFS/HeartbeatTimeout
esxcfg-advcfg -g /NFS/MaxQueueDepth

# For NFS v4.1 protocol
esxcfg-advcfg -g /NFS41/MaxVolumes

# For iSCSI/FC/FCoE SAN protocols
esxcfg-advcfg -g /Disk/QFullSampleSize
esxcfg-advcfg -g /Disk/QFullThreshold

HBA

NetApp usually recommends using the default HBA settings. However, in some cases VMware, NetApp or an application vendor can ask you to modify those settings. Read more in the VMware KB. Example:

# Set value for Qlogic on 6.0
esxcli system module parameters set -p qlfxmaxqdepth=64 -m qlnativefc
# View value for Qlogic on ESXi 6.0
esxcli system module list | grep qln

VSC

NetApp Virtual Storage Console (VSC) is free software which helps you set the recommended values on ESXi hosts and Guest OSes. VSC also helps with basic storage management, like datastore creation from vCenter. VSC is a mandatory tool for VVOLs with ONTAP. VSC is available only for the vCenter web client and supports vCenter 6.0 and newer.

VASA Provider

VASA Provider is free software which lets your vCenter learn about storage specifics and capabilities, like disk types (SAS/SATA/SSD), storage thin provisioning, and whether storage caching, deduplication and compression are enabled or disabled. VASA Provider integrates with VSC and allows you to create storage profiles. VASA Provider is also a mandatory tool for VVOLs. NetApp VASA Provider, VSC and the Storage Replication Adapter for SRM are bundled in a single virtual appliance available to all NetApp customers.

Space Reservation — UNMAP

UNMAP functionality allows freeing space on the datastore and the storage system after data has been deleted from VMFS or inside the Guest OS; this process is known as space reclamation. There are two independent processes:

  1. First space reclamation form: ESXi sends UNMAP to the storage system when data has been deleted from a VMFS datastore. For this type of reclamation to work, the storage LUN has to be thin provisioned and the space-allocation functionality must be enabled on the NetApp LUN. Reclamation of this type can happen in two cases:
    • A VM or VMDK has been deleted
    • Data was deleted from the Guest OS file system and the space was then reclaimed on VMFS; in other words, after the UNMAP from the Guest OS (the second form) has already happened.
  2. Second space reclamation form: UNMAP from the Guest OS when data is deleted on the Guest OS file system, to free space on the VMware datastore (either NFS or SAN). This type of reclamation has nothing to do with the underlying storage system and does not require any storage tuning or setup, but it does need Guest OS tuning and has some additional requirements to function.

The two space reclamation forms are not tied to one another, and you can set up only one of them, but for the best space efficiency you want both.

First space reclamation form: From ESXi host to storage system

Historically, VMware introduced only the first space reclamation form (from VMFS to the storage LUN) in ESXi 5.0, where space reclamation happened automatically and nearly online. That turned out not to be the best idea, because it immediately hit storage performance, so in 5.x/6.0 VMware disabled automatic space reclamation and you had to run it manually. With VVOLs on ESXi 6.x space reclamation works automatically, and with ESXi 6.5 and VMFS6 it also works automatically, but in both cases asynchronously (it is not an online process).

On ONTAP, space reclamation (space allocation) is disabled by default. To enable it, take the LUN offline, change the setting, and bring it back online:

lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -state offline
lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -space-allocation enabled
lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -state online
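
To verify the setting afterwards, a quick check of the LUN field:

lun show -vserver vsm01 -volume vol2 -lun lun1_myDatastore -fields space-allocation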

If you are using an NFS datastore, this space reclamation form is not needed, because with NAS the functionality is available by design. UNMAP is needed only in a SAN environment, where the lack of it was definitely one of the disadvantages compared to NAS.

In ESXi 6.5 this type of reclamation occurs automatically within up to 12 hours and can also be initiated manually.

esxcli storage vmfs reclaim config get -l DataStoreOnNetAppLUN
   Reclaim Granularity: 248670 Bytes
   Reclaim Priority: low
esxcli storage vmfs reclaim config set -l DataStoreOnNetAppLUN -p high
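
A manual reclaim run can be started from the ESXi shell as well (same example datastore name; on VMFS5 this is the only way, while on VMFS6 it just forces what the background process does anyway):

esxcli storage vmfs unmap -l DataStoreOnNetAppLUN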

Second space reclamation form: UNMAP from Guest OS

Since a VMDK file is basically a block device for the VM, you can apply the UNMAP mechanism there too. VMware introduced this capability starting with ESXi 6.0. It started with Windows in a VVOL environment, with automatic space reclamation from the Guest OS, plus manual space reclamation for Windows machines on ordinary datastores. Later, ESXi 6.5 introduced automatic space reclamation from the Guest OS (Windows and Linux) on ordinary datastores.

Setting it up to function properly might be trickier than you think. The hardest part of making this UNMAP work is simply complying with the requirements; once you do, it is easy. So, you need to have:

  • Virtual Hardware Version 11
  • vSphere 6.0*/6.5
  • VMDK disks must be thin provisioned
  • The file system of the Guest OS must support UNMAP
    • Linux with SPC-4 support or Windows Server 2012 and later

* If you have ESXi 6.0, then CBT must be disabled, which means that in a real production environment you are not going to have Guest OS UNMAP, since no production can live without proper backups (backup software relies on CBT to function)

Moreover, if we add the ESXi-to-storage UNMAP on top of this, a few more requirements need to be honored:

  • LUN on the storage system must be thinly provisioned (in ONTAP it can be enabled/disabled on the fly)
  • Enable UNMAP in ONTAP
  • Enable UNMAP on Hypervisor

Never use Thin virtual disks on Thin LUN

For many years all storage vendors said not to use thin virtual disks on thin LUNs, and now it is a requirement for space reclamation from the Guest OS to work.

Windows

UNMAP is supported in Windows starting with Windows Server 2012. To make Windows reclaim space from a VMDK, NTFS must use an allocation unit of 64KB. To check the UNMAP setting, issue the following command:

fsutil behavior query disabledeletenotify

DisableDeleteNotify = 0 (Disabled) means the delete notifications are not disabled, so the Guest OS will tell the hypervisor to reclaim space.
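
A hedged Windows example for re-enabling the notifications and reclaiming space that was freed before they were enabled (the drive letter is hypothetical):

fsutil behavior set disabledeletenotify 0
powershell -Command "Optimize-Volume -DriveLetter D -ReTrim -Verbose"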

Linux Guest OS SPC-4 support

Let’s check first is our virtual disk thin or thick:

sg_vpd -p lbpv
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 1

1 means we have a thin virtual disk. If you got 0, then your virtual disk is thick (lazy zeroed or eager zeroed), and neither is supported with UNMAP. Let's go further and check that we have SPC-4:

sg_inq -d /dev/sda
standard INQUIRY:
PQual=0 Device_type=0 RMB=0 version=0x06 [SPC-4]
Vendor identification: VMware
Product identification: Virtual disk
Product revision level: 2.0

We need SPC-4 for UNMAP to work automatically. Let's check that the Guest OS can notify the SCSI layer about reclaimed blocks:

grep . /sys/block/sdb/queue/discard_max_bytes
1

A non-zero value means we are good. Now let's try sending a discard to the device and see whether the space gets freed:

sg_unmap --lba=0 --num=2048 /dev/sda
# or
blkdiscard --offset 0 --length=2048 /dev/sda

If you get "blkdiscard: /dev/sda: BLKDISCARD ioctl failed: Operation not supported", then UNMAP is not working properly. If there is no error, we can remount our filesystem with the "-o discard" option to make UNMAP automatic.

mount /dev/sda /mnt/netapp_unmap -o discard
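
If you prefer not to keep the discard option mounted permanently (continuous UNMAP adds a little overhead), a periodic trim achieves the same result; the mount point is the same example as above:

fstrim -v /mnt/netapp_unmap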

Guest OS

You need at least to check your Guest OS configurations for two reasons:

  1. To gain max performance
  2. To make sure that if one controller goes down, your Guest OS survives the takeover timeout

Disk alignment: to make sure you get max performance

Disk misalignment is an infrequent situation, but you may still run into it. There are two levels where you can get this type of problem:

  1. When you created a LUN in ONTAP with one OS geometry, for example Windows 2003, and then used it with Linux. This type of problem can occur only in a SAN environment. It is very simple to avoid: when you create a LUN in ONTAP, make sure you choose the proper LUN geometry (OS type). This misalignment happens between the storage and the hypervisor
  2. Inside a virtual machine. It can happen in both SAN and NAS environments.

To understand how it works let’s take a look on a properly aligned configuration

Fully aligned configuration

In this image the upper block belongs to the Guest OS, the block in the middle belongs to ESXi, and the lower block represents the ONTAP storage system.

First case: Misalignment with VMFS

This is when your VMFS file system is misaligned with your storage system. It will happen if you create a LUN in ONTAP with a geometry (OS type) other than VMware. It is very easy to fix: just create a new LUN in ONTAP with the VMware geometry, create a new VMFS datastore on it, move your VMs to the new datastore, and destroy the old one.

Second case: Misalignment inside your guest OS

This is also a very rare problem, because you can only get it with very old Linux distributions and Windows 2003 or older. However, we are here to discuss all the possible problems to better understand how things work, right? This type of problem can occur on NFS datastores and on VMFS datastores using SAN protocols, as well as with RDM and VVOLs. It usually happens with virtual machines using a non-optimally aligned MBR in the Guest OS, or with Guest OSes that were previously converted from physical machines. How to identify and fix misalignment in the Guest OS can be found in the NetApp KB.

Misalignment on two levels simultaneously

Of course, if you are very lucky, you can get both simultaneously: on VMFS level and Guest OS level. Later in this article, we will discuss how to identify such a problem from the storage system side.

Takeover/Giveback

NetApp ONTAP storage systems consist of one or more building blocks called HA pairs. Each HA pair consists of two controllers, and in the event of a failure of one controller, the second one takes over and continues to serve clients. Takeover is a relatively fast process in ONTAP; on new All-Flash FAS (AFF) configurations it takes 2 to 15 seconds. With hybrid FAS systems this time can be longer, up to 60 seconds. 60 seconds is the absolute maximum within which NetApp guarantees failover to complete on FAS systems, and it usually finishes in 2-15 seconds. These numbers should not scare you, because your VMs will survive this window as long as your guest timeouts are set equal to or greater than 60 seconds, and the default VMware Tools value for VMs is 180 seconds anyway. Moreover, since your ONTAP cluster can contain systems of different models, generations and disk types, it is a good idea to plan for the worst-case scenario, which is 60 seconds.

Guest OS   Updated Guest OS tuning for SAN (ESXi 5 and newer, or ONTAP 8.1 and newer)
Windows    disk timeout = 60
Linux      disk timeout = 60

Default Guest OS values on NFS datastores are tolerable, and there is no need to change them. However, I would recommend testing a takeover anyway, to be sure how it behaves during such events.

You can configure these values manually or with the NetApp Virtual Storage Console (VSC) utility, which provides scripts to reduce the effort involved in manually updating the Guest OS tunings.

Windows

You can change the Windows registry and reboot the Guest OS. The timeout in Windows is set in seconds, in hexadecimal format (0x3c = 60):

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Disk] “TimeOutValue”=dword:0000003c

Linux

The timeout in Linux is set in seconds, in decimal format. To change it, you need to add a udev rule in your Linux OS. The location of udev rules may vary from one Linux distribution to another.

# RedHat systems
ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

# Debian systems
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware " , ATTRS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

# Ubuntu and SUSE systems
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware, " , ATTRS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

VMware tools automatically sets udev rule with timeout 180 seconds. You should go and double check it with:

lsscsi
find /sys/class/scsi_device/*/device/timeout -exec grep -H . '{}' \;

Compatibility

NetApp has the Interoperability Matrix Tool (IMT), and VMware has the Hardware Compatibility List (HCL). You need to check both and stick to the versions both vendors support to reduce potential problems in your environment. Before any upgrade, make sure you are still within the compatibility lists.

Summary

Most of this article is theoretical knowledge which you will probably never need, because nowadays nearly all parameters are either assigned automatically or set as defaults, and it is quite possible you will never see misalignment in new installations in real life. But if something does go wrong, the information in this article will shed light on some under-the-hood aspects of the ONTAP architecture and help you deep-dive and figure out the reason for your problem.

Correct settings for your VMware environment with NetApp ONTAP give not just better performance but also ensure you will not get into trouble in the event of a storage failover or a network link outage. Make sure you are following NetApp, VMware and application vendor recommendations. When you set up your infrastructure from scratch, always test performance and availability: simulate a storage failover and a network link outage. Testing will help you understand your infrastructure's baseline performance and its behavior in critical situations, and help you find weak points. Stay within the compatibility lists; it will not guarantee you never run into trouble, but it reduces risks and keeps you supported by both vendors.

Continue to read

AI battle: NetApp & Pure

An exciting story about a Pure & NetApp comparison in ResNet AI tests which, according to The Register, shows NetApp having… smaller numbers:
https://lnkd.in/e7tr5JQ

There are a few small BUTS. Wrong numbers, hasty conclusions, and no price, you see:

  1. In the NetApp wp-7267 document there are two graphs, with image distortion enabled and disabled. The FlashBlade tests show numbers only with image distortion disabled, so The Register compared apples with oranges: NetApp A700 with image distortion *enabled* against FlashBlade with image distortion *disabled*, which misleads readers into thinking NetApp is 7% to 19% worse than Pure while it is actually 1% to 6%
  2. The comparison was for A700, not the latest High-End A800
  3. The comparison didn’t include price, which might be way more than 6% difference
  4. The A700 also has both SAN & NAS, while FlashBlade is a NAS-only solution
  5. While a single FlashBlade chassis full of 15 blades (52TB each) can scale up to 535TB usable capacity (1.6 PB effective, assuming a 1:3 reduction ratio from usable space) and scale out up to 5 chassis, meaning 2.6PB (8PB effective, assuming a 1:3 reduction ratio from usable space), a single HA A700 (2 nodes) can scale up to 5PiB (15PiB effective, assuming a 1:3 reduction ratio from usable space) with 480x 15TB drives and can then scale out to 24 nodes (12 HA pairs) for NAS, meaning 60PiB (180PiB effective, assuming a 1:3 reduction ratio from usable space). That is a 20x difference (in either usable or effective space) in capacity & scalability, which might be essential for things like AI, DL, etc.
    • Note that with incompressible data like images, video, audio, pre-compressed or encrypted data, etc., no noticeable storage efficiencies will be gained on either NetApp or Pure systems, therefore the effective capacity will be equal to the usable capacity
  6. Moreover, you can scale performance. The tests show us the performance of one fully populated FlashBlade chassis vs. one HA pair of A700. So how will performance scale? Let's assume 5 times when scaling out to 5 chassis with FlashBlade, or 12 times with the A700.

That’s how you get a badge of yellow journalism.

 

AI NetApp vs Pure

Ethernet port aggregation and load balancing with ONTAP

Abstract

For a small company it is quite common to have two to four servers, two switches (which often support Multi-chassis EtherChannel), and a low-end storage system. It is vital for such companies to fully utilize their infrastructure, and thus all available technologies, and this article describes one aspect of how to do this with ONTAP systems. Usually there is no need to dig too deep into LACP technology, but for those who want to, welcome to this post.

It is essential not just to tune and optimize one part of your infrastructure but the whole stack to achieve the best performance. For instance, if you optimize only the network, the storage system might become a bottleneck in your environment and vice versa.

Majority of modern servers have on-board 1 Gbps or even 10 Gbps Ethernet ports.

Some of the older ONTAP storage systems like the FAS255X, as well as the more modern FAS26XX, have 10Gbps onboard ports. In this article I am going to focus on an example with a FAS26XX system with 4x 10Gbps ports on each node, two servers with 2x 10Gbps ports each, and Cisco switches with 10Gbps ports and support for Multi-chassis EtherChannel. However, this article applies to any small configuration.

Scope

So, we would like to fully utilize the network bandwidth of the storage system and the servers and prevent any bottlenecks. One way to do this is to use the iSCSI or FCP protocols, which have built-in load balancing and redundancy; in this article, however, we are going to look at protocols which do not have such an ability, like CIFS and NFS. Why would users be interested in NAS protocols that lack built-in load balancing and redundancy? Because NAS protocols have file granularity and file visibility from the ONTAP perspective, and in many cases give more agility than SAN protocols, while the missing network "features" of NAS protocols can easily be compensated by functionality built into nearly any network switch. Of course, the technologies do not work magically, and each approach has its nuances and considerations.

In many cases users would like to use both SAN and NAS on top of a single pair of Ethernet ports with ONTAP systems; for this reason, the first thing you should design is NAS protocols with load balancing and redundancy, and only then adapt the SAN connectivity to it. NAS protocols with SAN on top of the same Ethernet ports are a frequent case for customers with smaller ONTAP systems, where the number of Ethernet ports is limited.

Also, in this article I am going to avoid technologies like VVols over NAS, pNFS, dNFS and SMB Multichannel. I would like to write about VVols in a dedicated article: they are not related to NAS or SAN protocols directly but can be part of a solution which, on one hand, provides file granularity and, on the other hand, can use NFS or iSCSI, where iSCSI can natively load-balance traffic across all available network paths. pNFS is unfortunately currently supported only with RedHat/CentOS systems in enterprise environments, is not widespread, and does not provide native load balancing because NFS trunking is currently only a draft, while SMB Multichannel is not supported with ONTAP 9.3 itself.

In this situation, we have few configurations left.

  • One is to use NAS protocols solely with Ethernet port aggregation
  • Another one is to use NAS protocols with Ethernet port aggregation and SAN on top of aggregated ports, which could be divided into two subgroups:
    • Where you are using iSCSI as SAN protocol
    • Where you are using FCoE as SAN protocol
    • Native FC protocol require dedicated ports and could not work over Ethernet ports

Even though FCoE on top of aggregated Ethernet ports together with NAS is a possible networking configuration with an ONTAP system, I am not going to discuss it in this article, because FCoE is supported only with expensive converged network switches like the Nexus 5000 or 7000 and is thus outside the scope of interest of small companies. With the right configuration, NAS with ONTAP systems can provide performance, load balancing, and redundancy entirely comparable to FC & FCoE, so there is no reason to pay more.

NAS protocols with Ethernet port aggregation

Both variants, NAS protocols with Ethernet port aggregation and the same with iSCSI on top of the aggregated ports, have quite similar network configuration and topology. This is the configuration I am going to describe in this article.

Theoretical part

Unfortunately, Ethernet load balancing is not as sophisticated as in SAN protocols and works in quite a simple way. I personally would even call it load distribution instead of load balancing. In fact, Ethernet does not pay much attention to the "balancing" part and does not actually try to distribute the workload evenly across links; instead, it just distributes flows, hoping there will be plenty of network nodes generating read and write threads, so that purely due to probability theory the workload ends up more or less evenly spread. The fewer nodes in the network, the fewer network threads, and the lower the probability that each network link will be equally loaded, and vice versa.

The most straightforward algorithm for Ethernet load balancing is to sequentially pick one of the network links for each new thread, one by one. Another algorithm uses a hash of the sender's and recipient's network addresses to pick one network link in the aggregate. The network address could be an IP address, a MAC address, or something else. This small nuance plays a role in this article and in your infrastructure: if the hash of one source-destination address pair is the same as another's, the algorithm will use the same link in the aggregate for both. In other words, it is essential to understand how the load balancing algorithm works, so that the combinations of network addresses give you not only redundant network connectivity but also utilization of all the network links. This becomes especially vital for small companies with few participants in their network.

Quite often 4 servers cannot fully utilize 10Gbps links, but during peak utilization it is essential to distribute network threads between links evenly.

Typical network topology and configuration for small companies

In my example, we have 2 servers, 2 switches, and one storage system with two storage nodes running ONTAP 8.3 or higher with the following configuration, and also keep in mind:

  • From each storage node, two links go to the first switch and two to the second switch
  • Switches configured with technologies like vPC (or similar) or switches are stacked
  • Switches configured with Multi-chassis EtherChannel/PortChannel technology, so two links from the server connected to two switches aggregated in a single EtherChannel/PortChannel. Links from a storage node connected to two switches aggregated in a single EtherChannel/PortChannel.
  • LACP with IP load balancing configured over EtherChannel
  • 10Gbps switch ports connected to servers and storage set with Flow control = disable
  • Storage system ports and server ports set with Flow control = disable (none)
  • 4 links on first storage node aggregated in a single EtherChannel (ifgroup) with configured LACP (multimode_lacp), same with second storage node. In total two ifgroup, one on each storage node
  • The same NFS VLAN is created on top of each ifgroup: one on the first storage node, a second on the second storage node
  • On each of the two NFS VLANs, 2x IP addresses are created, 4 in total across the two storage nodes
  • Storage nodes each have at least one data aggregate created out of an equal number of disks, for example, each aggregate could be:
    • 9 data + 2 parity disks and 1 hot spare
    • 20 data + 3 parity disks and 1 hot spare
  • Volumes on top of data aggregates configured as:
    • Either one FlexGroup spanned on all aggregates
    • Alternatively, 2 volumes on each storage node – 4 total, which is minimal and sufficient
  • Each server has two 10Gbps ports, one port connected to one switch, the second port to the second switch
  • On each server 2x 10Gbps links aggregated in EtherChannel with LACP
  • Jumbo frame enabled on all components: storage system ports, server ports, and switch ports
  • Each volume mounted on each server as a file share, so each server is going to be able to use all 4 volumes.

The minimum number of volumes for even traffic distribution is pretty much determined by the largest number of links from either a storage node or a server; in this example we have 4 ports on each storage node, which means we need 4 volumes in total. Even if you have only 2 network links from each server and two from each storage node, I would still suggest keeping at least 4 volumes, which is useful not only for network load balancing but also for storage node CPU load balancing. In the case of FlexGroup it is enough to have only one such group, but keep in mind that it is currently not optimized for high-metadata workloads like virtual machines and databases.

One IP address per storage node (with two or four links on each node), in configurations with two or more hosts each with two or four links and one IP address per host, is almost always enough to provide even network distribution. However, with one IP address per storage node and one per host, even distribution is achieved only in the perfect scenario where each host accesses each IP address evenly, which in practice is hard to achieve, hard to predict, and can change over time. So, to increase the probability of more even network load distribution, we need to divide traffic into more flows, and the only way to do this with LACP is to increase the number of IP addresses. Thus, for small configurations with two to four hosts and two storage nodes, having 2x IP addresses per node instead of one increases the probability of more even traffic distribution across all network links.

Unfortunately, conventional NAS protocols do not allow hosts to recognize a file share mounted via different IP addresses as a single entity. For example, if we mount an NFS file share to VMware ESXi with two different IP addresses, the hypervisor will see it as two different datastores. If you want to re-balance network links, a VM needs to be migrated to the datastore with the other IP, but to move that VM, Storage vMotion gets involved, even though it is the same network file share (volume).

Network Design

Here is recommended and well-known network design often used with NAS protocols.

(1)

1 LACP Network design

However, merely cabling and configuring switches with LACP doesn't guarantee that network traffic will be balanced across all the links in the most efficient way; it depends, and even if it is, that can change after a while. To get the maximum from both the network and the storage system, we need to tune them a bit, and to do so we need to understand how LACP and the storage system work. For more network designs, including bad ones, see the slides here.

 

(2)

2) Link Selection for Next Hop with LACP

LACP protocol & algorithm

In the ONTAP world, the nodes in a storage system serving NAS protocols work separately from each other, so you can think of them as separate servers; this architecture is called share-nothing. The only difference is that if one storage node dies, the second takes over its disks, workloads and IP addresses, so hosts continue to work with their data as if nothing happened; this is called a takeover in a High Availability pair. With ONTAP you can also move IP addresses and volumes online between storage nodes, but let's not focus on that. Since storage nodes behave as independent servers, the LACP protocol can aggregate Ethernet ports only within a single node; it does not allow you to aggregate ports from multiple storage nodes. On the switch side, however, we can configure Multi-Chassis EtherChannel, so LACP can span ports from a few switches.

The LACP algorithm selects a link only for the next hop, one step at a time, so the full path from sender to recipient is neither established nor handled by the initiator, as it is in SAN. Communication between the same two network nodes can be sent through one path while the response comes back through another. The algorithm uses a hash of the source and destination addresses to select the path. The only way to ensure your traffic goes over the expected paths with LACP is to enable load balancing by IP or MAC address hash and then calculate the hash result, or test it on your equipment. With the right combination of source and destination addresses, you can ensure the algorithm selects your preferred path.

The hashing algorithm can be implemented differently on a server, a switch, and a storage system; that is why traffic from the server to the storage and from the storage to the server can go through different paths.

There are a few additional circumstances which influence how you partition data on your storage system and how you select source & destination IP addresses. There are applications which can share volumes, like VMware vSphere, where each ESXi host can work with multiple volumes, and there are configurations where volumes are not shared by your applications.

One volume & one IP per node

Since we have two ONTAP nodes with a share-nothing architecture, and we want to utilize the storage system fully, we need to create volumes on each node and thus at least one IP address on each node, on top of an aggregated Ethernet interface. Each aggregated interface consists of two Ethernet ports. In the following network designs some of the objects (such as network links and servers) are not displayed, to focus on particular aspects; note that all the following designs are based on the very first image, "LACP network design."

(3A)

3A) 1x Shared Vol per st node =1x Network Folder. Each host loads each storage node.png

 

Let’s see the same example but from the storage perspective. Let me remind you that in the next network designs some of the objects were not displayed (such as network links and server) to focus on some of the aspects, note that all the next network designs are based on the very first image “LACP network design.”

 

(3B)

3B) 1x Shared Vol per st node =1x Network Folder. Each host loads each storage node2.png

 

Two volumes & one IP per node

However, some configurations do not share volumes between the applications running on your servers. So, to utilize all the network links, we need to create two volumes on each storage node: one used only by host1, the second used only by host2. Volumes and connections to the second node are not displayed to keep the image simple; in reality they exist and are symmetrical to the first storage node.

(4A)

4A) If there are no shared network folders between hosts, more volumes are needed

Let's see the same configuration from the storage perspective. As in the previous images, the symmetrical part of the connections is not displayed to simplify the picture: in this case the symmetrical connections to the blue buckets on each storage node are not shown, but in a real configuration they exist.

(4B)

4B) If there no shared network folders between hosts than more volumes needed2.png

 

 

Two volumes & two IPs per node

Now, if we increase the number of IP addresses, we can mount each volume over two different IPs. In such a scenario each mount is perceived by the hosts as a separate volume, even though it is physically the same volume with the same data set. In this situation it often makes sense to also increase the number of volumes, so that each volume is mounted with its own IP. Thus, we achieve more even network load distribution across all the links, for either shared or non-shared application configurations.

(5A)

5A) 2x Shared Vols = 2x network folders with 2x IP on each storage node.png

In a non-shared volume configuration, each volume is used by only one host. Designs 5A & 5B are quite similar and differ only in how the volumes are mounted on the hosts.

(5B)

5B) 2x non-Shared Vols = 2x network folders with 2x IP on each storage node.png

 

Four volumes & two IPs per node

Now, if we add more volumes and IP addresses to our configuration with two applications which do not share volumes, we can achieve even better network load balancing across the links with the right combination of network share mounts. The same design can be used with an application which shares volumes, similar to the design in image 5.

(6)

6) Network Load distribution. 2x IP & 4x non-Shared Vol on each storage node.png

For more network designs, including bad designs, see slides here.

 

Which design is better?

Whether your applications using shared volumes or not, I would recommend:

  • Design #3 for environments where you have multiple independent applications, so with multiple apps you are going to have in total at least 4 or more volumes on each storage node.
  • Alternatively, Design #6 if you are running only one application like VMware vSphere and not planning to add new applications and volumes. Use 4 volumes per node minimum whether you have shared or non-shared volumes.

How to ensure network traffic goes by expected path?

This is the more complex and geeky part. In the real world, you can run into a situation where your switch decides to push your traffic through an additional hop, or where the hash sums of two or more source and destination address pairs overlap. To ensure your network traffic takes the expected path, you need to calculate the hash sums. Usually, in big enough environments with many volumes, file shares, and IP addresses, you do not need to care about this: the more IP addresses you have, the higher the probability that the workload will be distributed evenly over your links, simply because of probability theory. However, if you do care and you have a small environment, you can brute-force the IP addresses for your servers and storage.
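
If you want to play with this idea before assigning addresses, a tiny brute-force script is enough. Below is a minimal sketch in Python, assuming both hosts and switches use the conventional (source XOR destination) % number_of_links policy described later in the SuperFastHash section; ONTAP itself hashes with SuperFastHash, so this only models the host/switch side, and the host octets and candidate range simply mirror the variables from the worked example further down.

from itertools import combinations

N_LINKS = 2                      # ports in each LACP bundle
HOST_OCTETS = [21, 22]           # last octets of the server IPs
LIF_CANDIDATES = range(30, 51)   # candidate last octets for two LIFs on one storage node

def link(src: int, dst: int) -> int:
    """Physical link chosen by the simple XOR hashing policy."""
    return (src ^ dst) % N_LINKS

# For every candidate pair of LIF addresses, check that each host reaches
# the two LIFs over two different links, i.e. no link stays idle.
for lif_a, lif_b in combinations(LIF_CANDIDATES, 2):
    balanced = all(link(h, lif_a) != link(h, lif_b) for h in HOST_OCTETS)
    if balanced:
        print(f"LIF pair .{lif_a}/.{lif_b} uses both links from every host")

For the real calculation, use the hash code mentioned in the SuperFastHash section below, since it also models the ONTAP side of the connection.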

 

Configuring ONTAP

Create data aggregate

cluster1::*> aggr create -aggregate aggr -diskcount 13

Create SVM

cluster1::*> vserver create -vserver vsm_NAS -subtype default -rootvolume svm_root -rootvolume-security-style mixed -language C.UTF-8 -snapshot-policy default -is-repository false -foreground true -aggregate aggr -ipspace Default

Create aggregated ports

cluster1::*> ifgrp create -node cluster1-01 -ifgrp a0a -distr-func ip -mode multimode_lacp
cluster1::*> ifgrp create -node cluster1-02 -ifgrp a0a -distr-func ip -mode multimode_lacp

Here -mode multimode_lacp corresponds to LACP (“mode active” on the switch side), and -distr-func ip selects IP-based load distribution.

Create VLANs for each protocol-mtu

cluster1::*> vlan create -node * -vlan-name a0a-100

I would recommend creating dedicated broadcast domains for each protocol-MTU combination. For example:

  • Client-SMB-1500
  • Server-SMB-9000
  • NFS-9000
  • iSCSI-9000

cluster1::*> broadcast-domain create -broadcast-domain Client-SMB-1500 -mtu 1500 -ipspace Default -ports cluster1-01:a0a-100,cluster1-02:a0a-100

Create the remaining broadcast domains the same way for each protocol-MTU combination and its VLAN ports.

Create interfaces with IP addresses

cluster1::*> network interface create -vserver vsm_NAS -lif nfs01_1 -role data -data-protocol nfs -home-node cluster1-01 -home-port a0a-100 -address 192.168.53.30 -netmask 255.255.255.0

(The LIF name, address, and netmask above are illustrative; create one LIF per node and per IP address according to the design you have chosen.)

If you have not created dedicated broadcast domains, then create a failover group for each protocol and assign it to the LIFs:

cluster1::*> network interface failover-groups create -vserver vsm_NAS -failover-group FG_NFS-9000 -targets cluster1-01:a0a-100,cluster1-02:a0a-100
cluster1::*> network interface modify -vserver vsm_NAS -lif nfs01_1 -failover-group FG_NFS-9000

Configuring Switches

This is the part where 90% of human errors are made. People often forget to add the word “active” or add it in the wrong place, and so on.

Example of Switch configuration

Cisco Catalyst 3850 in a stack with 1Gb/s ports

Note that “mode active” on the switch corresponds to “multimode_lacp” in ONTAP, so each physical interface (not the Port-channel) must be configured with “channel-group X mode active.” The “flowcontrol receive on” setting depends on port speed: if the storage sends flow control, the other side must receive it. It is recommended to use RSTP (in our case with VLANs, Rapid-PVST+) and to configure the switch ports connected to storage and servers with spanning-tree portfast.

system mtu 9198
!
spanning-tree mode rapid-pvst
!
interface Port-channel1
 description N1A-1G-e0a-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 spanning-tree guard loop
!
interface Port-channel2
 description N1B-1G-e0a-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 spanning-tree guard loop
!
interface GigabitEthernet1/0/1
 description NetApp-A-e0a
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 1 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet2/0/1
 description NetApp-A-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 1 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet1/0/2
 description NetApp-B-e0a
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 2 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet2/0/2
 description NetApp-B-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 2 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk

 

Cisco Catalyst 6509 in a stack with 1Gb/s ports

The same notes apply as for the Catalyst 3850 above: “mode active” corresponds to “multimode_lacp” in ONTAP and must be set on each physical interface (not on the Port-channel), flow control settings must match on both sides, and Rapid-PVST+ with spanning-tree portfast is recommended on the ports connected to storage and servers.

system mtu 9198
!
spanning-tree mode rapid-pvst
!
interface Port-channel11
 description NetApp-A-e0a-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface Port-channel12
 description NetApp-B-e0a-e0b
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet1/0/1
 description NetApp-A-e0a
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 11 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet2/0/1
 description NetApp-A-e0b
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 11 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet1/0/2
 description NetApp-B-e0a
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 12 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk
!
interface GigabitEthernet2/0/2
 description NetApp-B-e0b
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 1
 switchport trunk allowed vlan 53
 switchport mode trunk
 flowcontrol receive on
 cdp enable
 channel-group 12 mode active
 spanning-tree guard loop
 spanning-tree portfast trunk

 

Cisco Small Business SG500 in a stack with 10Gb/s ports

Note that “mode active” corresponds to “multimode_lacp” in ONTAP, so each physical interface must be configured with “channel-group X mode active,” not the Port-channel. The “flowcontrol off” setting depends on port speed: if the storage is not using flow control (flowcontrol none), flow control must also be disabled on the other side. It is recommended to use RSTP and to configure the switch ports connected to storage and servers with spanning-tree portfast.

interface Port-channel1
 description N1A-10G-e1a-e1b
 spanning-tree ddportfast
 switchport trunk allowed vlan add 53
 macro description host
 !next command is internal.
 macro auto smartport dynamic_type host
 flowcontrol off
!
interface Port-channel2
 description N1B-10G-e1a-e1b
 spanning-tree ddportfast
 switchport trunk allowed vlan add 53
 macro description host
 !next command is internal.
 macro auto smartport dynamic_type host
 flowcontrol off
!
port jumbo-frame
!
interface tengigabitEthernet1/1/1
 description NetApp-A-e1a
 channel-group 1 mode active
 flowcontrol off
!
interface tengigabitEthernet2/1/1
 description NetApp-A-e1b
 channel-group 1 mode active
 flowcontrol off
!
interface tengigabitEthernet1/1/2
 description NetApp-B-e1a
 channel-group 2 mode active
 flowcontrol off
!
interface tengigabitEthernet2/1/2
 description NetApp-B-e1b
 channel-group 2 mode active
 flowcontrol off

 

HP 6120XG switch in blade chassis HP c7000 and 10Gb/s ports

Note that the “trunk <ports> TrkN LACP” lines correspond to “multimode_lacp” in ONTAP. No “flowcontrol” setting is present here, which means it is set to “auto” by default: if the network node connected to the switch has flow control disabled, the switch will not use it either. Flow control depends on port speed, so if the storage is not using flow control (flowcontrol none), it must also be disabled on the other side. It is recommended to use RSTP and to configure the switch ports connected to storage and servers with spanning-tree portfast.

# HP 6120XG from HP c7000 10Gb/s
trunk 11-12 Trk10 LACP
trunk 18-19 Trk20 LACP
vlan 201
   name "N1AB-10G-e1a-e1b-201"
   ip address 192.168.201.222 255.255.255.0
   tagged Trk10,Trk20
   jumbo
   exit
vlan 202
   name "N1AB-10G-e1a-e1b-202"
   tagged Trk10,Trk20
   no ip address
   jumbo
   exit
spanning-tree force-version rstp-operation

 

Switch troubleshooting

Let’s take a look at the switch output

 Port        Mode    |            Rx              |            Tx
                     | Kbits/sec  Pkts/sec  Util  | Kbits/sec  Pkts/sec  Util
 ----------- ------- + ---------- --------- ----- + ---------- --------- -----
 Storage
 1/11-Trk21  1000FDx | 5000       0         00.50 | 23088      7591      02.30
 1/12-Trk20  1000FDx | 814232     12453     81.42 | 19576      3979      01.95
 2/11-Trk21  1000FDx | 810920     12276     81.09 | 20528      3938      02.05
 2/12-Trk20  1000FDx | 811232     12280     81.12 | 23024      7596      02.30
 Server
 1/17-Trk11  1000FDx | 23000      7594      02.30 | 810848     12275     81.08
 1/18-Trk10  1000FDx | 23072      7592      02.30 | 410320     6242      41.03
 2/17-Trk11  1000FDx | 19504      3982      01.95 | 408952     6235      40.89
 2/18-Trk10  1000FDx | 20544      3940      02.05 | 811184     12281     81.11

We can clearly see that one of the links is barely utilized. Why does this happen? Because sometimes the algorithm that calculates the hash from a pair of source and destination addresses produces the same value for two (or more) pairs, so their traffic ends up on the same physical link. For example, with the simple XOR policy and two links, the address pairs .21/.30 and .21/.32 both give an odd XOR value, so both flows land on the same link.

SuperFastHash in ONTAP

Instead of the ordinary algorithm widely used by hosts and switches ((source_address XOR destination_address) % number_of_links), ONTAP starting with 7.3.2 uses an algorithm called SuperFastHash, which gives a more dynamic and more balanced load distribution for a large number of clients, while each TCP session is still associated with only one physical port.

The ONTAP LACP hashing code is available on GitHub under a BSD license. Though I did my best to make it precise and fully functional, I give no guarantees, so use it AS IS.

You can run the code in an online compiler. It shows which physical port will be picked depending on the source and destination addresses; you need to find the storage IP addresses with the biggest numbers in the “SUM Total Used” column.

Let’s build a table for network Design #4A using the output of this simple code. Here is an example of the output with the following variables:

    st_ports = 2;
    srv_ports = 2;
    subnet = 53;
    src_start = 21;
    src_end = 22;
    dst_start = 30;
    dst_end = 50;

Output:

       ¦NTAP       %  ¦NTAP       %  ¦Srv        %  ¦ SUM¦
       ¦OUT      |Path¦IN       |Path¦IN&O     |Path¦Totl¦
   IP  ¦  21|  22|Used¦  21|  22|Used¦  21|  22|Used¦Used¦
 53.30 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|
 53.31 ¦   1|   1|  37|   0|   1|  62|   0|   1| 100|  66|
 53.32 ¦   0|   1|  75|   1|   0|  75|   1|   0| 100|  83|
 53.33 ¦   0|   1|  75|   0|   1|  75|   0|   1| 100|  83|
 53.34 ¦   0|   1|  75|   1|   0|  75|   1|   0| 100|  83|
 53.35 ¦   0|   0|  37|   0|   1|  62|   0|   1| 100|  66|
 53.36 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|
 53.37 ¦   1|   0|  75|   0|   1|  75|   0|   1| 100|  83|
 53.38 ¦   0|   0|  37|   1|   0|  62|   1|   0| 100|  66|
 53.39 ¦   0|   1|  75|   0|   1|  75|   0|   1| 100|  83|
 53.40 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|
 53.41 ¦   1|   0|  75|   0|   1|  75|   0|   1| 100|  83|
 53.42 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|
 53.43 ¦   0|   1|  75|   0|   1|  75|   0|   1| 100|  83|
 53.44 ¦   0|   0|  37|   1|   0|  62|   1|   0| 100|  66|
 53.45 ¦   0|   1|  75|   0|   1|  75|   0|   1| 100|  83|
 53.46 ¦   1|   1|  37|   1|   0|  62|   1|   0| 100|  66|
 53.47 ¦   0|   0|  37|   0|   1|  62|   0|   1| 100|  66|
 53.48 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|
 53.49 ¦   1|   0|  75|   0|   1|  75|   0|   1| 100|  83|
 53.50 ¦   1|   0|  75|   1|   0|  75|   1|   0| 100|  83|

 

So, you can use IP addresses XXX.XXX.53.30 for your first storage node and XXX.XXX.53.32 for your second storage node at Design #4.

 

Disadvantages in conventional NAS protocols with Ethernet LACP

No technology works magically; each has its own advantages and disadvantages, and it is essential to know and understand them.

  • You cannot aggregate two network file shares into one logical space the way you can with LUNs
    • If a storage vendor does offer aggregation of several volumes for NAS on a storage system, data distribution is usually done with file-level granularity:
      • Load distribution based on files depends on their size and can be uneven
      • Load distribution is not well suited for metadata-heavy or rewrite-heavy workloads.
  • With Ethernet LACP, the full path between peers is neither established nor controlled by the initiators
    • Each next hop is chosen individually: the path there and the path back can be different
    • LACP does not allow you to aggregate ports from multiple storage nodes.
  • No SAN ALUA-like multipathing:
    • LACP aggregates only ports within a single server or a single storage node
    • Multi-Chassis EtherChannel requires special switch support, though nowadays it is available in nearly any switch
    • Only a few switches can be in an LACP stack, and entry-level stacked switches can be unstable, which limits scalability.

 

Because of these disadvantages, conventional NAS protocols with LACP usually cannot achieve full network link utilization on their own and must be tuned manually to do so. Still, though LACP is not ideal:

  • it has been available for years in nearly any Ethernet switch
  • it is the best solution we currently have for conventional NAS protocols
  • it is definitely better than conventional NAS without it.

 

Advantages of NAS protocols over Ethernet

LACP has its disadvantages and adds them to conventional NAS protocols, which have no built-in multipathing or load balancing of their own. Nevertheless, NAS protocols are still attractive with ONTAP because:

NAS:

  • NAS gives data visibility inside snapshots
  • More space efficient than SAN in many ways
  • File-granular access to data in snapshots
  • Individual files can be copied out of snapshots with no FlexClone or SnapRestore licenses needed
  • Individual files can be restored or cloned in place (FlexClone or SnapRestore licenses required)
  • Backup data mining for cataloging
  • Data is accessed directly on the storage, with no host mounting needed.

Ethernet & LACP:

  • Ethernet switches are cheaper than InfiniBand and FC switches
  • LACP and Multi-Chassis EtherChannel are available with nearly any switch
  • 1, 10, 25, 40, 50, and 100 Gb/s are available as a single pipe
  • Multi-purpose, multi-protocol, multi-tenant with VLANs
  • Cheaper multi-site connectivity: VPN, VXLAN
  • iSCSI, NFS, and CIFS traffic on top of Ethernet can be routed (FCoE runs over Ethernet but is not routable).

Looking to the future

Though NAS protocols have their disadvantages, because they lack built-in multipathing and load balancing and rely on LACP instead, they keep evolving and, bit by bit, copying abilities from other protocols.

For example, the SMB v3 protocol with the Continuous Availability feature can survive online IP movement between ports and nodes without disruption; this is available in ONTAP, so it can be used with MS SQL and Hyper-V. SMB v3 also supports Multichannel, which provides built-in link aggregation and load balancing without relying on LACP; this is currently not supported in ONTAP.

NFS was not a session-based protocol from the beginning, so when an IP address moves to another storage node, the application survives. NFS keeps evolving, and version 4.1 got a feature called pNFS, which can automatically and transparently switch between nodes and ports when data is moved, following the data similarly to SAN ALUA; this is also available in ONTAP. NFS version 4.1 also includes a session trunking feature which, similarly to SMB v3 Multichannel, allows aggregating links without relying on LACP; it is currently not supported in ONTAP. NetApp is driving the NFS v4 protocol with the IETF, SNIA, and the open-source community to get it adopted as soon as possible.

Conclusion

Though NAS protocols have disadvantages, mainly because of the underlying Ethernet and, more precisely, LACP, it is possible to tune LACP to utilize your network and storage efficiently. In big environments there is usually no need for tuning, but in small environments load balancing might become a bottleneck, especially if you are using 1 Gb/s ports. Though it is rare to fully utilize the network performance of 10 Gb/s ports in a small environment, it is better to do the tuning at the very beginning than later on a production system. NAS protocols are file-granular, and since the storage system runs the underlying file system, it can work with individual files and provide more capabilities for thin provisioning, cloning, self-service operations, and backup, in many ways more agile than SAN. NAS protocols keep evolving and absorbing abilities from other protocols, in particular SAN protocols like FC and iSCSI, to eliminate their disadvantages entirely, and they already provide additional capabilities to environments that can use the new versions of SMB and NFS.

 

Troubleshooting

90% of all problems come from the network configuration on the switch side, and the other 10% from the host side; in both cases it is human error.