VMware Primary Snapshot Recovery Use Cases
ECX 2.0
ECX Best Practices for Use Data

© Catalogic Software, Inc.™, 2015. All rights reserved.

This publication contains proprietary and confidential material, and is only for use by licensees of Catalogic DPX™, Catalogic BEX™, or Catalogic ECX™ proprietary software systems. This publication may not be reproduced in whole or in part, in any form, except with written permission from Catalogic Software.

Catalogic, Catalogic Software, DPX, BEX, ECX, and NSB are trademarks of Catalogic Software, Inc. Backup Express is a registered trademark of Catalogic Software, Inc. All other company and product names used herein may be the trademarks of their respective owners.

Table of Contents

Terminology and Concepts
Copy Data Management Principles and Practices
    Best Practices for Sites
    Best Practices for Usage
    Best Practices for Compartmental Boundaries
    Best Practices for Resources
    Best Practices for IV Modes
    Best Practices for Sizing
Use Data Workflow Use Cases
VMware Use Data Workflows
    Instant Access (IA) - VM Disks and Datastores
    Instant Virtualization (IV) - VMs, vApps, and VM folders
        Distinguishing the VMware Use Data IV modes
NetApp Use Data Workflows
Important Use Cases Not Currently Supported by ECX 2.0
    VMware Use Data
    NetApp Use Data

Terminology and Concepts

The terms Copy Data and Use Data are used throughout ECX 2.0. These two terms come from “Copy Data Management” concepts and they refer to the processes of making data copies (Copy Data) and of using those data copies (Use Data) to perform useful operations such as analytics, testing and recovery.

Instant Access (IA) provides instant access to Copy Data for use by the more traditional restore scenarios. IA is used to gain instant access to specific data and then restore files, volumes, or application data, as needed, from the instantly accessed data.
Instant Virtualization (IV) goes beyond IA to bring up the workloads (operating systems and applications) by connecting the instantly accessed Copy Data (which includes the OS/application disks) to test, clone, or production restored virtual machines. Copy Data Management makes heavy use of IV to address many use cases that traditional backup and restore products avoid.

A robust catalog is implemented within ECX’s dynamic platform. The catalog manages SAN/NAS RAID storage (such as NetApp) against virtualized workloads (such as VMware). With ECX, creating copies and using them can be automated, and the product can be used to protect and then bring up large numbers of virtual machines (VMs) from data copies. With this ability, the user needs to ensure that those large numbers of VMs do not conflict with the original VMs at the application level. Fenced networking ensures that they do not.

Fenced networks are simply private or contained networks that are separate from, and have no direct access to, production or other networks. Various IV use cases use fenced networks to fence off test or clone virtual machines so that they do not interact with their production counterparts.

Hypervisor environments refer to virtualization environments such as VMware vCenter and Microsoft Hyper-V. ECX 2.0 supports only VMware, so this document often uses VMware terms directly.

Site allows the user to identify and group the storage copies and vCenter resources by some criterion, such as geographical location. Site becomes important when you want to perform mass development/recovery testing on a regular basis. It is best to perform the testing against mirror disks at a remote site, because running it locally would double the load on your source RAID and source vCenter environment (CPU/memory/network) and impact the performance of the source production machines.
Site provides you with insight into where the resources (storage copies and vCenter) are located, as well as an easy way to configure site-based automation jobs.

Copy Data Management Principles and Practices

One of the basic premises of “Copy Data Management” is that data copies (snapshots/backups) are made for specific reasons, to specific storage devices, at specific locations (sites), to support a wide range of specific use cases. The location and storage attributes of each copy become strategic based on how the copy is used. For example, data copies that reside at a remote site would be used for site-based disaster recovery (DR) and Dev/Test. Data copies that reside locally would be used for quick local recovery. Further, local copies that reside on vaults would be used in preference to copies that reside on primary storage, as they typically have better retention policies, and using them avoids impacting the performance of the primary storage. Cheaper or more expensive storage copies could also be employed to adjust and control the cost vs. quality of the copies. Redundant copies can be removed. It is therefore very important to categorize, track, monitor, and control the various data copies in an enterprise, to ensure that the enterprise has the appropriate copies to address the use cases defined within its SLAs.

As a best practice, it is recommended that the user have multiple sites that each contain storage (such as NetApp) and hypervisor (such as VMware) resources. The sites should be independent of each other, at least with respect to power failure, so that they provide a level of protection for the user. It is better still to have sites at different geographical locations to provide protection against regional disasters such as power grid failure. Once the sites are determined, the user should place mirror copies of the source primary data at alternate sites.
The user should then configure vaults locally to contain snapshots with higher retentions for local recovery. If the budget allows, the user should also consider configuring vaults remotely off of the remote mirrors so that the remote sites also have higher fidelity snapshots. The local vaults can be used for local recovery and the remote vaults for offloaded archiving (for example, running tape backups from the remote vault).

ECX 2.0 treats these tasks of copy creation, placement, and control as a protection workflow, carried out through the ECX 2.0 Copy Data policies. Use Data can only be performed after the Copy Data protection is put in place. Use Data depends entirely on how and where Copy Data places the data, so it is critically important that the user get the Copy Data protection right. Nonetheless, the process of creating automated Use Data policies that can be quickly tested helps the user discover issues with the Copy Data workflows. It is not uncommon for a user to go through several iterations of Copy Data and Use Data policy tweaking on the way to optimizing the full solution to meet the SLAs they put in place.

The following list highlights some best practices for Copy Data Management:

Best Practices for Sites

• Configure multiple independent sites that have independent power, network, storage, and hypervisor resources.
• Place mirror copies at alternate sites.
• Place vault copies locally. If the budget allows, place a vault copy at a remote site off of the remote mirror for extra protection and archiving.

Best Practices for Usage

• Vault copies should typically have higher retentions and thus higher RPO fidelity.
• Use the local vault for local recoveries.
• Use the remote mirrors for disaster recovery (DR) testing, for snapshot validation, and for Dev/Test and clone mode operations, where those operations create cloned environments for data mining or Dev/Test.
• Place VMs that must be recovered together within the same Copy Data jobs.
  This ensures that common snapshots are used for these co-dependent VMs during Use Data operations.

Best Practices for Compartmental Boundaries

• It is recommended that the Copy Data and Use Data policies align on application or company compartmental boundaries. This way, each application group or compartment can manage and test its own Copy Data and Use Data policies.
• Common infrastructure components/applications such as DNS or AD should be in a separate set of Copy Data and Use Data jobs so that they can be leveraged by all the other application or compartmental jobs that depend upon them. The common component Copy Data jobs should have snapshot frequencies and retention periods equivalent to or better than those of the Copy Data jobs for the dependent applications they support.
• It is recommended that storage and datastores also align on application or company compartmental boundaries. This is related to the first recommendation in this list. If there are application group or compartmental boundaries for the VMs, then those VMs should reside on datastores that align the same way. Datastores should not be shared by applications across the boundaries; otherwise there will be inter-compartmental dependencies on common storage.

Best Practices for Resources

• To avoid consumption of local resources, run large test jobs against alternate sites. For example, a job that recovers all VMs in an environment might be something you do not want to run in test mode locally (at the source site), as that would double the demands on almost every resource (network, CPU, and storage) within the local site. It is better to run these large jobs in test mode at an alternate site that is not busy. These jobs would, of course, be run locally in production mode in the case of a real disaster.
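The doubling effect described under Best Practices for Resources can be illustrated with simple arithmetic. The following Python sketch is not part of ECX; the `VmDemand` record, the VM names, and the resource figures are all hypothetical, and it assumes that each test mode VM consumes the same resources as its production source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmDemand:
    """Hypothetical per-VM footprint, taken from production sizing data."""
    name: str
    vcpus: int
    memory_gb: int
    storage_iops: int


def total_demand(vms):
    """Sum the vCPU, memory, and IOPS demand of a collection of VMs."""
    return (
        sum(vm.vcpus for vm in vms),
        sum(vm.memory_gb for vm in vms),
        sum(vm.storage_iops for vm in vms),
    )


def site_load(production, test_copies_at_site):
    """Demand on a site hosting production VMs plus any test copies.

    Assumption: a test mode VM needs the same resources as its source VM.
    """
    prod = total_demand(production)
    test = total_demand(test_copies_at_site)
    return tuple(p + t for p, t in zip(prod, test))


# Illustrative numbers only.
production_vms = [
    VmDemand("dns01", vcpus=2, memory_gb=4, storage_iops=200),
    VmDemand("ad01", vcpus=4, memory_gb=8, storage_iops=500),
    VmDemand("sql01", vcpus=8, memory_gb=32, storage_iops=3000),
]

# Test job run locally: the source site carries production plus every test copy.
local_test_load = site_load(production_vms, production_vms)

# Test job run at an alternate site: the source site carries production only.
remote_test_load = site_load(production_vms, [])

print("source site load, local test: ", local_test_load)
print("source site load, remote test:", remote_test_load)
```

The same totals, computed for the set of VMs a job selects, give a first estimate of what the target site must be able to absorb when the job runs in test or clone mode.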
Best Practices for IV Modes

• Use test mode to perform testing, snapshot validation, and any operation within a fenced network that does not require the test mode VMs to run for an extended period of time. The test mode VMs have UUIDs independent of their sources, but they run off of snapshot volume clones, which are not intended for prolonged use. Clone mode is generally better suited to prolonged use. The user can always move test mode VMs back to production to replace the source machines, or clone them and leave them running in the test network. The existing Copy Data protection policies do not apply to the test VMs, as they are essentially different VMs (with different UUIDs) so that they can run concurrently with the real source VMs within the fenced test network.
• Use clone mode when extended use of the clone VMs is required and there is storage available to create the independent storage copies. As with test mode, the clone mode VMs have independent UUIDs and run within the fenced test network so that they do not collide with the concurrently running production machines. The difference is that the clone VMs are moved via Storage vMotion to permanent storage. If the user wishes to expose the clones via the production networks, the user can then modify the clone VMs by changing the host name, revising the Product ID (PID) and Security ID (SID) to gain a new OS identity, reconfiguring the identities of the contained applications, and applying any required OS/application licenses. The existing Copy Data protection policies do not apply to the clone VMs, as the clone VMs are essentially different VMs.
• Use recovery mode (also referred to as production mode or restore mode) to restore the protected machines themselves after a disaster. Recovery mode performs a restore, which is essentially a replace/overwrite of the original VM with the restored images.
  Therefore, recovery mode checks that the production VMs are not running before it restores the images for those VMs. The existing Copy Data protection policies will continue to operate after the restore is complete.

Best Practices for Sizing

When sizing the requirements for the VMware IV Use Data jobs, the user should understand at a conceptual level what the jobs are configured to do. The user can then apply the sizing/performance requirements of the existing production VMs to determine what a set of copies for test or clone mode would require at the target site.

Use Data Workflow Use Cases

The following sections describe the Use Data workflow use cases that were implemented for ECX 2.0. ECX 2.0 supports both top-down and bottom-up workflows to address the needs of various users such as application, hypervisor (VMware), and storage (NetApp) administrators.

At a high level, ECX 2.0 policies support two major Use Data workflows:

• VMware Use Data workflows
• NetApp Use Data workflows

The VMware Use Data workflows are top-down, dealing with VMware hypervisor objects such as VMs, vApps, VM folders, VM disks, and datastores. The NetApp Use Data workflows are bottom-up and storage-driven, dealing with storage objects such as NetApp volumes and files.

VMware Use Data Workflows

Instant Access (IA) - VM Disks and Datastores

The Instant Access workflows are “data focused” use cases that allow instant access to backed up data at a given time (snapshot) to be used for data recovery, such as item, file, volume, etc. Depending on your role, you might want to look at this from a bottom-up (hypervisor administrator) or top-down (application administrator) perspective.
Figure: ECX VMware IA Use Data Policy - VM disk source selection

Two IA workflows are described here:

IA Mount Datastores leveraging primary, vault, or mirror snapshot clone(s) of one or more datastores to an ESX Server or Cluster

Datastores are mounted from a volume copy (primary, vault, or mirror) to a target ESX host or cluster. The hypervisor administrator is responsible for any recovery action to production sources. Ending the job cleans up the resources. If the user has used the instantly accessed datastores and wants to keep them, there is also an option to make them permanent using split-clone.

IA Mount VM Disk(s) leveraging primary, vault, or mirror snapshot clone(s) of one or more VM disk(s) to a chosen VM

The application administrator specifies the disks (VMDKs) to recover from a VM disk list, and the software mounts the appropriate datastores from mounted snapshot copies. The VMDKs contained within the datastores are then assigned to the source VM or to any other VM, to be mounted at specified mount points. The mounted disks can be used from within the VM(s) for application item level recovery (such as Exchange mailboxes or SharePoint objects). Ending the job cleans up the resources. If the user has used the instantly accessed VM disks and wants to keep them, there is also an option to make them permanent using split-clone.

Note that, when using OS disks for file level recovery, many operating system files may be locked, which prevents file copy. IA may not sufficiently handle data copy from locked OS disks.

Instant Virtualization (IV) - VMs, vApps, and VM folders

Instant Virtualization (IV) is different from Instant Access (IA). With IV, the user actually instantiates running VMs and vApps as they looked at the time and state of the snapshot. This means the user is actually starting up operating systems with applications on them (workloads).
• Test, Clone and Recovery Modes - Different use cases call for different modes of operation. ECX supports three modes: test, clone, and recovery. The user selects the mode when running the job policy.
• Network Fencing - Private networks are leveraged to isolate test and clone mode copies from the production environment. This ensures that the operating systems and applications on the test/clone VMs do not interfere with their production counterparts. The user maps the source networks to test networks via the job policy.
• Storage vMotion is leveraged to move the instantly virtualized VMs from snapshot clones to more permanent storage during Clone or Recovery (RRP). The VMs are operational during Storage vMotion, so there is no downtime. Users select the target vMotion storage location (RRP datastore) via the job policy.
• Recovery Order - The order in which workloads are recovered needs to be addressed for multi-workload (multi-machine) applications. For instance, application server workloads that depend on Active Directory (AD) need to come up after AD comes up. AD requires DNS, so the DNS services need to come up before AD. SharePoint Web servers should come up after the configuration and content SQL Servers come up. So, recovery order may be important for VM recoveries. ECX 2.0 supports recovery order within a vApp, and also when the user selects groups of VMs individually and sorts them in recovery order via the UI. ECX 2.0 does not support recovery order for VMs contained within VM folders chosen for recovery, as folder membership is variable in nature. If the recovery order is important for the VMs contained within a VM folder, we recommend that users leverage VMware vApps instead of VM folders to manage the recovery order for those VMs.

Distinguishing the VMware Use Data IV modes

Figure: ECX VMware Use Data Policy - IV Mode selection

Test mode is used to test the recovery of a select set of VMs, vApps, etc.
• The test VMs are brought up within a fenced test network so that they do not collide with the concurrently running production VMs. The test VMs selected contain the same applications and operating systems as the running sources.
• The test VMs are run off of snapshot clones. When the user is finished with the test, they end the job and the job cleans up all the test resources.
• Test mode can be used to test DR, run Dev/Test, and validate snapshots.
• Test mode can be scheduled. This allows for continual or scheduled testing.
• Users can decide to run the test VMs locally (within the same site) or at a remote site. Users must be aware that the test VMs will consume the same CPU/memory/storage resources as the production VMs, so most users elect to perform their large scale testing at remote sites.
• Test mode supports three actions while active: “End IV (Cleanup)”, “RRP (vMotion)”, and “Clone (vMotion)”. RRP stands for Rapid Return to Production.
• Existing VMware Copy Data protection policies do not apply to test mode VMs.

Clone mode is used to create separate running copies of the original VMs within a fenced network for an extended period of time.

• The clone mode VMs are brought up within the fenced test/clone network so that they do not collide with the concurrently running production VMs.
• Clone mode first gets the clone VMs up and running off of snapshot clones and then moves those VMs off of the temporary snapshot clones (via Storage vMotion) to more permanent storage. Running off of snapshot clones is not recommended for long term use, as they are differential images from the original source disk. In addition, limits on the number of retained snapshots may cause a new snapshot to cycle back onto the one in use.
• Clone mode can be used for data mining, migration, creating new VMs using a set of source VMs as “templates”, etc.
• The user first runs a job in test mode and then decides to clone the test VMs to more permanent storage via the “Clone (vMotion)” action from test mode.
• Clone VMs intended for permanent use should be modified before they are exposed to the production network. Revise the Product ID (PID) and Security ID (SID), and rename and relicense the OS and applications running on the clone VMs.
• Existing VMware Copy Data protection policies do not apply to clone mode VMs. Users would need to create new policies for cloned VMs that they make permanent.

Recovery mode is used to restore the production VMs to the state contained within the selected snapshots. Recovery mode is also referred to as production mode or restore mode.

• Recovery mode performs a restore, which is essentially a replace/overwrite of the original VMs with the restored images.
• Recovery mode first gets the production VMs up and running off of snapshot clones (for speed) before moving them (while operational) via Storage vMotion to permanent storage.
• The restored VMs are brought up on the production network.
• Recovery mode does not proceed if it detects running instances of the production VMs it will replace. It is the user’s responsibility to shut down all production VMs that are being restored or replaced.
• The user can first run a job in test mode and then decide to use the test VMs to replace the production VMs via the “RRP (vMotion)” action from test mode. RRP stands for Rapid Return to Production.
• Existing VMware Copy Data protection policies still apply to the restored production VMs. There is no need to edit or change them.

NetApp Use Data Workflows

The following are examples of use cases that are bottom-up and storage-driven. These operations are fast and simple, but they do require that the user manage the application states and application specific recoveries for those applications that have data on the volumes or files being restored.
Figure: ECX NetApp Use Data Policy - NFS Destination Specification

Instant Access (IA) volumes from any primary, vault, or mirror snapshot copies and expose them via NFS and CIFS

The NetApp volume clone, NFS export, and CIFS share features are leveraged to provide this bottom-up storage feature. The IA job simply mounts a snapshot copy and then exposes it via NFS and CIFS. The user can define the machines and the users that will have access to the exposed volumes. In addition, the user may convert the IA disk running off the temporary volume clone into a permanent disk by executing the “RRP” action, which is available in the ECX user interface. The “RRP” action uses the NetApp split-clone feature to convert the volume clone into a permanent disk while the disk is in use (no downtime).

Restore volumes from any primary, vault, or mirror snapshot copy

This feature performs the NetApp SnapRestore, SnapVault Restore, or SnapMirror Restore under the hood, depending upon which copy's snapshots are chosen for the restore. Restore from primary snapshot (SnapRestore) was implemented by NetApp as a volume revert, which comes with two restrictions:

o Snapshots newer than the primary snapshot selected for restore are removed after the restore.
o Restore to the selected primary snapshot is blocked if a newer snapshot is in use for a SnapMirror or SnapVault.

For this reason, we recommend that volume restores use vault or mirror snapshots instead of primary snapshots. If the Use Data job has a choice (after site selection), it will choose vault snapshots over primary snapshots.

Restore files (from primary snapshot only)

Restore one or more files from a primary snapshot copy. The original source file is restored from the selected primary snapshot.

Note: File restore from vault or mirror is not supported in ECX 2.0.
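The two volume revert restrictions described above for restores from a primary snapshot, and the stated preference for vault snapshots, can be modeled in a few lines. This Python sketch is illustrative only: it does not call any NetApp or ECX API, and the `Snapshot` record, its fields, and the snapshot names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot record; all fields are illustrative."""
    name: str
    copy_type: str                    # "primary", "vault", or "mirror"
    sequence: int                     # larger value means a newer snapshot
    replication_in_use: bool = False  # referenced by SnapMirror or SnapVault


def primary_revert_impact(target, primary_snapshots):
    """Model the two restrictions of a volume revert to `target`.

    Returns (blocked, would_remove): `blocked` is True when a newer primary
    snapshot is in use by SnapMirror or SnapVault; `would_remove` names the
    newer snapshots that a successful revert would remove.
    """
    newer = [s for s in primary_snapshots if s.sequence > target.sequence]
    blocked = any(s.replication_in_use for s in newer)
    would_remove = [s.name for s in newer]
    return blocked, would_remove


def choose_restore_source(candidates):
    """Pick a vault snapshot when one is offered, else the first candidate."""
    for snap in candidates:
        if snap.copy_type == "vault":
            return snap
    return candidates[0] if candidates else None


primary = [
    Snapshot("snap_0900", "primary", 1),
    Snapshot("snap_1000", "primary", 2),
    Snapshot("snap_1100", "primary", 3, replication_in_use=True),
]
vault_copy = Snapshot("snap_0900", "vault", 1)

blocked, would_remove = primary_revert_impact(primary[0], primary)
print("revert blocked:", blocked)
print("newer snapshots at risk:", would_remove)

chosen = choose_restore_source([primary[0], vault_copy])
print("restore source:", chosen.copy_type)
```

In this example the revert to the oldest primary snapshot is blocked by the newest one, which is why the equivalent vault snapshot is the safer restore source.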
Restore volumes from secondary vault or mirror copy (SnapVault or SnapMirror)

The NetApp snapshot mirror volume restore feature is leveraged to restore one or more volumes to the state contained within the selected mirror or vault snapshot. Restore from vault or mirror is recommended over restore from primary snapshot because it does not result in the loss of snapshots, nor is it blocked by current mirror/vault operations based on snapshots newer than the one selected. See the description of volume restore from a primary snapshot earlier in this section.

Important Use Cases Not Currently Supported by ECX 2.0

VMware Use Data

• Mirror reverse use cases. Mirror reverse will require the introduction of protection groups to track which set of VMs must be operated on as a group (protection/recovery) because they reside on common or intersecting storage.
• RDM and directly assigned iSCSI disks are not addressed in ECX 2.0 by either the Copy Data or the Use Data workflows.

NetApp Use Data

• File restore from vault or mirror snapshots. Data ONTAP 8.3 introduced APIs to address this, but too late in the ECX 2.0 development cycle.
• Volume restore with added enhancements to also re-create the NFS/CIFS configuration that the original volume had, in case the source volume is lost. ECX 2.0 does preserve the original NFS/CIFS configurations if the source volumes exist, so this restriction is limited to the case where the source volumes are no longer present.
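As a closing illustration of the recovery order guidance given for Instant Virtualization, the dependencies between workloads (DNS before AD, AD before AD-dependent servers, SQL Servers before SharePoint Web servers) form a dependency graph, and a valid startup sequence is a topological ordering of that graph. The Python sketch below uses the standard library `graphlib` module (Python 3.9 or later). It does not reflect how ECX implements recovery order; the workload names are hypothetical, and treating the SQL Servers as AD-dependent is an assumption made for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key lists the workloads it depends on. Names are hypothetical, and
# the SQL-on-AD dependency is an assumption made for this example.
depends_on = {
    "dns": set(),
    "active_directory": {"dns"},
    "sql_config": {"active_directory"},
    "sql_content": {"active_directory"},
    "sharepoint_web": {"sql_config", "sql_content"},
}


def recovery_order(dependencies):
    """Return a startup sequence in which every workload follows its prerequisites.

    Raises ValueError if the dependencies contain a cycle, because no valid
    recovery order exists in that case.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = recovery_order(depends_on)
for position, workload in enumerate(order, start=1):
    print(position, workload)
```

A sequence produced this way could be used to arrange VMs inside a vApp, or to sort an individually selected group of VMs, before running the job.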