
DMS pMotion Zero-Downtime Migration Technical Overview

November 21, 2019


This is a technical overview of Cirrus Data Solutions’ zero-downtime migration solution, DMS. The objective of this overview is to demonstrate the end-to-end task of migrating the shared storage of a Microsoft Windows cluster from a Fibre Channel source storage array to new destination storage using the DMS pMotion feature. You will see that the cluster application continues to read and write to the shared clustered disk volume throughout all four steps of the migration process, with near-zero impact to the application’s performance. The phases of the migration are:

Step 1 – Zero-downtime deployment and insertion of the DMS appliances onto the SAN.
Step 2 – Setting up migration sessions, including automatic LUN creation on the destination storage and pairing of the existing source LUNs with their new matching destination LUNs.
Step 3 – Syncing up the data on the source and destination LUNs and transferring the workload from one to the other using pMotion.
Step 4 – Performing the final migration cutover so the DMS appliances can be removed.

All of these migration steps are performed while the cluster application is live and in production; again, no downtime is needed. For production environments where the source storage must be replaced immediately for one reason or another, but where there is simply no downtime window available because the storage and applications are mission-critical, pMotion fulfills the dream of zero-downtime end-to-end migration.
Here is the lab environment we’re using for this demonstration: a Windows cluster with two nodes, where each node has two initiators connected to a dual Fibre Channel SAN. The source storage is connected, with four shared LUNs presented to the cluster. Not shown here are the boot LUNs – one for each of the Windows nodes – which are also provisioned from the source storage for boot over SAN. The new destination storage is already connected as well, but no LUNs have been created on it yet. A Windows client – WinApp1 – runs the Intel IOmeter tool against mapped network resources provided by the cluster. DMS1A is connected to Fabric 1 and DMS1B is connected to Fabric 2.
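To keep the moving parts straight, here is the same lab expressed as a small Python data structure. This is only a note-taking aid for the reader; the WWPNs, port labels, and LUN-to-share mapping are invented placeholders, not values pulled from the actual lab.

```python
# The demo lab written down as plain Python data.
# WWPNs and port labels are placeholders; only the names and counts
# (two nodes, two fabrics, four shared LUNs, two boot LUNs) come from the demo.
lab = {
    "fabrics": ["Fabric 1", "Fabric 2"],
    "cluster_nodes": {
        "WinCL-A": {"initiators": ["10:00:00:00:aa:aa:aa:01", "10:00:00:00:aa:aa:aa:02"]},
        "WinCL-B": {"initiators": ["10:00:00:00:bb:bb:bb:01", "10:00:00:00:bb:bb:bb:02"]},
    },
    "source_storage": {
        "ports": {"SPA1": "Fabric 1", "SPB1": "Fabric 2"},
        "shared_luns": ["Quorum", "Dev", "QA", "PS"],
        "boot_luns": ["WinCL-A boot", "WinCL-B boot"],
    },
    "destination_storage": {
        "ports": {"DP1": "Fabric 1", "DP2": "Fabric 2"},
        "luns": [],  # nothing allocated yet; DMS will create these later
    },
    "dms_appliances": {"DMS1A": "Fabric 1", "DMS1B": "Fabric 2"},
    "client": {"WinApp1": "runs Intel IOmeter against the cluster's mapped shares"},
}
```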
Let’s take a look at the Windows cluster node WinCL-A. Opening the disk manager, we see the four shared LUNs from the source storage, plus a SAN boot LUN. The physical properties of the shared storage indicate that these are PowerPath devices. The four LUNs are assigned drive letters: one is the quorum disk, and each of the other three contains a shared folder in which an Intel IOmeter data file has already been created.

Let’s bring up the remote desktop for WinApp1, start Intel IOmeter, and launch a typical workload. The workload uses the three mapped cluster resources for Dev, QA, and PS, so the shared storage is continually accessed with a typical read/write mix. In this configuration, IOmeter is generating about 1200 IOPS. This is meant to represent a 24×7 mission-critical application that has no downtime window available for the foreseeable future. Our mission is to migrate the disks from the old storage to the new storage with zero downtime and near-zero impact.
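If you want to approximate this kind of load without IOmeter, a rough stand-in is sketched below. It simply issues a random 70/30 read/write mix of 4 KiB I/Os against a test file on each mapped cluster share; the UNC paths, file size, and ratios are assumptions for illustration, and buffered Python I/O will not reproduce IOmeter’s numbers exactly.

```python
# Rough stand-in for the IOmeter workload: random 4 KiB reads/writes against
# a test file on each mapped cluster share. Paths, sizes, and ratios are
# illustrative assumptions, not the demo's actual IOmeter configuration.
import os
import random
import time

SHARES = [r"\\WinCluster\Dev", r"\\WinCluster\QA", r"\\WinCluster\PS"]  # hypothetical UNC paths
BLOCK = 4096                 # 4 KiB I/O size
FILE_SIZE = 256 * 2**20      # 256 MiB test file per share
READ_RATIO = 0.7             # 70% reads, 30% writes

def run(duration_s=60):
    handles = []
    for share in SHARES:
        path = os.path.join(share, "iometer_standin.dat")
        f = open(path, "w+b")
        f.truncate(FILE_SIZE)           # pre-size the file so random seeks are valid
        handles.append(f)
    ops, deadline = 0, time.time() + duration_s
    while time.time() < deadline:
        f = random.choice(handles)
        f.seek(random.randrange(0, FILE_SIZE, BLOCK))
        if random.random() < READ_RATIO:
            f.read(BLOCK)
        else:
            f.write(os.urandom(BLOCK))
        ops += 1
    for f in handles:
        f.close()
    print(f"~{ops / duration_s:.0f} ops/s (buffered, so treat it as a rough number)")

if __name__ == "__main__":
    run()
```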
Let’s get started with the migration by taking a look at the DMS GUI. The Nexus module performs the datapath interception. As an overview, we’re about to intercept the two datapaths associated with SPA1 and SPB1, using the Nexus ports of DMS1A for SPA and DMS1B for SPB. All paths to the LUNs being migrated must be intercepted in order to perform the migration.

Let’s start the insertion process by opening the Nexus module and clicking the Insertion Wizard button; this begins intercepting the first path. The insertion wizard asks for the World Wide Names of the storage ports relevant to this path, entered separated by commas. Next we enter the IP address of the Cisco switch along with its login credentials, so the insertion wizard can proceed with minimal input from us. Clicking the big blue button prompts the switch to perform the steps automatically. We can also examine the set of switch commands that create the VSAN, then click the big blue button to execute; the VSAN is created. The next set of switch commands moves both the source and the destination ports into the VSAN. Click the blue button to continue.
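To give a sense of what the wizard is pushing to the switch, here is a minimal sketch assuming a Cisco MDS switch managed over SSH with the netmiko library. The VSAN number, interface names, address, and credentials are invented for illustration; in practice the insertion wizard generates and executes the equivalent commands for you, and moving a live port into a VSAN may prompt for confirmation, which the wizard handles.

```python
# Minimal sketch of the kind of NX-OS configuration the insertion wizard
# applies: create a VSAN for the intercepted path and move the relevant
# ports into it. VSAN id, interfaces, host, and credentials are placeholders.
from netmiko import ConnectHandler

switch = ConnectHandler(
    device_type="cisco_nxos",   # Cisco MDS fabric switches run NX-OS
    host="192.0.2.10",          # placeholder management address
    username="admin",
    password="example-password",
)

commands = [
    "vsan database",
    "vsan 100 name DMS_INSERTION_A",  # new VSAN for the first intercepted path
    "vsan 100 interface fc1/1",       # source storage port (e.g. SPA1)
    "vsan 100 interface fc1/2",       # DMS1A Nexus port
]

print(switch.send_config_set(commands))
switch.disconnect()
```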
And there it is – the wizard has finished, and the discovered entities are shown. A rescan is now performed by the insertion wizard to automatically discover all the hosts, LUNs, and storage paths, and the results are displayed. We can see that IOmeter is still running and shows zero errors. At this point we’ve only intercepted a single path, so let’s repeat the process and intercept the second path in Fabric 2. Clicking on the Nexus module, we can now use the Insertion Wizard button for DMS1B; we’ll fast-forward to after the second path is inserted.

Now, clicking on the SAN Config module, we see a very accurate picture of our SAN: a Windows host with four ports connected to Fabrics 1 and 2, and to the source storage with its four shared LUNs and two SAN boot LUNs. We can adjust the picture to show the individual nodes by separating the two nodes of the cluster with the Reassign Ports function. Now we see a two-node cluster, each node with two ports, connected to the LUNs. We are ready to migrate.

Click on the Migration module and the plus sign to create a new migration session. Click to open WinCL-A and WinCL-B and group-select all the LUNs, select local migration, and of course enable storage pMotion. Click Auto-Allocation to load the storage plugins, enter the IP address and login credentials of the destination storage, and click Continue. Here we can specify whether the new LUNs should be larger than the original source LUNs; let’s make them 50% larger. And it’s done – the new destination LUNs are allocated and paired with the original source LUNs. Let’s name this session Migration1, enable thin disk migration, and set a resync every 60 minutes.
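Conceptually, the auto-allocation step just walks the selected source LUNs, creates a destination LUN sized as requested (50% larger in this case), and pairs each one with its source. The toy sketch below illustrates that bookkeeping only; it is not the DMS storage-plugin API, and the LUN sizes are made up.

```python
# Toy illustration of auto-allocation with "make the new LUNs 50% larger":
# for every source LUN, a destination LUN 1.5x its size is created on the
# new array (modeled here as plain data) and the two are paired for migration.
from dataclasses import dataclass

@dataclass
class Lun:
    name: str
    size_gib: int

def allocate_and_pair(source_luns, growth=0.5):
    pairs = []
    for src in source_luns:
        dst = Lun(name=f"{src.name}-dest", size_gib=int(src.size_gib * (1 + growth)))
        pairs.append((src, dst))
    return pairs

# Hypothetical sizes for the four shared LUNs in the demo.
shared_luns = [Lun("Quorum", 2), Lun("Dev", 500), Lun("QA", 500), Lun("PS", 500)]
for src, dst in allocate_and_pair(shared_luns):
    print(f"{src.name} ({src.size_gib} GiB) -> {dst.name} ({dst.size_gib} GiB)")
```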
Now we’re all done, and IOmeter is running without any I/O gap. The intelligent QoS feature of DMS ensures the host has priority, so migration I/O is paused to yield to the host.
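As a rough mental model of that behavior (purely illustrative, not Cirrus code), think of the copy engine checking recent host activity on the intercepted path before copying each region and backing off whenever the production workload is busy:

```python
# Purely illustrative model of "host I/O has priority" QoS: copy the next
# region only when recent host I/O on the intercepted path is below a
# threshold, otherwise back off. Threshold and timing values are invented.
import time

HOST_IOPS_THRESHOLD = 200   # above this, the migration yields to the host
BACKOFF_SECONDS = 0.5

def migrate_with_qos(regions, sample_host_iops, copy_region):
    """regions: iterable of region ids to copy;
    sample_host_iops(): returns the current host IOPS on the path;
    copy_region(region): copies one region from source LUN to destination LUN."""
    for region in regions:
        while sample_host_iops() > HOST_IOPS_THRESHOLD:
            time.sleep(BACKOFF_SECONDS)   # yield to the production workload
        copy_region(region)
```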
So let’s pause IOmeter so that the volumes can sync up quickly and we can proceed to the next steps. Once the session reaches the “Pending Complete” state, meaning the source and destination LUNs are nearly in sync, we can swing the workload over using pMotion.

Let’s start IOmeter again to simulate a live production environment; it’s doing about 1100-1200 IOPS. Now let’s trigger pMotion. Once pMotion is completed, the workload has moved over to the new destination storage. Note that the new storage is showing four times the IOPS of the old storage. Throughout the migration and the pMotion cutover, Intel IOmeter never stopped, except for the time we paused it to accelerate the migration.

Let’s take a look at the clustered hosts to see what, if anything, has changed. Opening the disk manager and looking at the physical properties of the LUNs, the cluster node still believes it is accessing the original source storage and that nothing has changed. This is how we avoided an outage window: by creating the illusion that nothing has changed, even though the source storage has been completely bypassed and all I/O is going directly to the new storage. Thanks to the pMotion capability, the DMS is standing in for the decommissioned source storage, backed by the new storage. We’re now ready for the final cutover step.
For a cluster environment, the reconfiguration is simply performed one node at a time. Let’s start with WinCL-A. While IOmeter is still running, we take the first node offline, which moves all of its resources to the surviving node. Now we can power off WinCL-A, present the newly migrated LUNs to this offline cluster node, remove the zones between this node and the old storage, and replace the previous zones with new zones for the new storage. WinCL-A should now be seeing the new storage. We can power up the node and make any necessary boot LUN BIOS changes, since this node boots from the SAN. At this point we also need to reconfigure MPIO, which requires a reboot. After the reboot, we can see that MPIO is working.
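The cluster side of this node-by-node dance can also be scripted. The sketch below is my own illustration, not part of the DMS workflow: it drains and later resumes one node at a time by calling the standard Windows failover-cluster cmdlets from Python, while the zoning, LUN presentation, boot BIOS, and MPIO work still happens out of band as described above.

```python
# Hedged sketch: drain and later resume one cluster node at a time by invoking
# the standard Windows failover-cluster cmdlets through PowerShell.
# Node names come from the demo; everything else is an assumption.
import subprocess

def ps(command: str) -> None:
    """Run a single PowerShell command and raise if it fails."""
    subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

def drain_node(node: str) -> None:
    # Pause the node and drain its roles onto the surviving node.
    ps(f"Suspend-ClusterNode -Name {node} -Drain")

def resume_node(node: str) -> None:
    ps(f"Resume-ClusterNode -Name {node}")

# One node at a time, as in the walkthrough.
for node in ["WinCL-A", "WinCL-B"]:
    drain_node(node)
    # ... power off, rezone to the new storage, present the migrated LUNs,
    # adjust the boot-from-SAN BIOS entry, reconfigure MPIO, reboot ...
    resume_node(node)
```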
Let’s now move the cluster workload back to this node so that we can take the other node offline and reconfigure it just as we did this one. WinCL-A is now up and running, has taken back all of the workload, and its disks are active. IOmeter is still running, so we can now shut down the other node and repeat the same procedure to complete the final cutover.
The last step is to uninsert the DMS, since we no longer need it in the datapath. We first run the Uninsert Wizard for DMS1A, followed by DMS1B. When the process is complete, both DMS appliances can be removed. The migration is now complete – all without any outage to the cluster’s shared resources.

As demonstrated, we started with a clustered application, represented by Intel IOmeter, running on the old storage and delivering about 1200 IOPS. Without stopping the application, we deployed and inserted the DMS and used pMotion to migrate the data and move the workload to the new storage – all without downtime, and with intelligent QoS ensuring acceptable impact to the application throughout the migration. We then performed the final cutover node by node, all while Intel IOmeter was still running. The end result: IOmeter running three to four times faster than on the old storage, and the old storage ready to be turned off.
