Monday, January 19, 2015

Business Continuity with SCVMM and Azure Site Recovery

Business Continuity for the management stamp

Back in November, I wrote a blog post about the DR integration in Windows Azure Pack, where service providers can provide managed DR for their tenants.

I’ve been working with many service providers over the last months where both Azure Pack and Azure Site Recovery have been critical components.

However, given the relatively big footprint of the DR add-on in Update Rollup 4 for Windows Azure Pack, organizations have started at the other end in order to bring business continuity to their clouds.

For one of the larger service providers, we had to dive deep into the architecture of Hyper-V Replica, SCVMM and Azure Site Recovery before we knew how to design the optimal layout to ensure business continuity.

In each and every ASR design, you must look at your fabric and management stamp and work out the recovery design before you create the disaster design. Did I lose you there?

What I’m saying is that it’s relatively easy to perform the heavy lifting of the data, but once the shit hits the fan, you had better know what to expect.

In this particular case, we had a common goal:

We want to ensure business continuity for the entire management stamp with a single click, so that tenants can create, manage and operate their workloads without interruption. This should be achieved in an efficient way with a minimal footprint.

When we first saw the release of Azure Site Recovery, it was called “Hyper-V Recovery Manager” and required two SCVMM management stamps to perform DR between sites. The feedback from potential customers was loud and clear: people wanted to leverage their existing SCVMM investment and perform DR operations with a single SCVMM management stamp. Microsoft listened and now lets us perform DR between SCVMM Clouds, using the same SCVMM server.

Actually, it has been over a year since they made this available, and diving into my archive I managed to find the following blog post:

So IMHO, using a single SCVMM stamp is always preferred whenever possible, and that was also my recommendation for the initial design in this case.

In this blog post, I will share my findings and workaround for making this possible, ensuring business continuity for the entire management stamp.

The initial configuration

The first step we had to take when designing the management stamp was to plan and prepare for SQL AlwaysOn Availability Groups.
System Center 2012 R2 – Virtual Machine Manager, Service Manager, Operations Manager and Orchestrator all support AlwaysOn Availability Groups.

Why plan for SQL AlwaysOn Availability Groups when we have the traditional SQL Cluster solution available for High-Availability?

This is a really good question – and also very important as this is the key for realizing the big goal here. AlwaysOn is a high-availability and disaster recovery solution that provides an enterprise-level alternative to database mirroring. The solution maximizes the availability of a set of user databases and supports a failover environment for those selected databases.
Compared to a traditional SQL Cluster, which can also use shared VHDXs, this was a no-brainer. A shared VHDX would have given us a headache and increased the complexity with Hyper-V Replica.
SQL AlwaysOn Availability Groups let us use local storage for each VM within the cluster configuration, and enable synchronous replication between the nodes on the selected user databases.

Alright, the SQL discussion is now over, and we proceeded to the fabric design.
In total, we would have several Hyper-V Clusters for different kinds of workloads, such as:

·       Management
·       Edge
·       IaaS
·       DR

Since this was a Greenfield project, we had to deploy everything from scratch.
We started with the Hyper-V Management Cluster and from there we deployed two VM instances in a guest cluster configuration, installed with SQL Server for Always On Availability Groups. Our plan was to put the System Center databases – as well as WAP databases onto this database cluster.

Once we had deployed a Highly-Available SCVMM solution, including a HA library server, we performed the initial configuration on the management cluster nodes.
As stated earlier, this is really a chicken and egg scenario. Since we are working with a cluster here, it’s straightforward to configure the nodes one at a time: put one node in maintenance mode, move the workload and repeat the process on the remaining node(s). Our desired state configuration at this point is to deploy the logical switch with its profile settings to all nodes, and later provision more storage and define classifications within the fabric.
The description here is relatively high-level, but to summarize: we do the normal fabric stuff in VMM at this point, and prepare the infrastructure to deploy and configure the remaining hosts and clusters.

For more details about the design, see the following script that I have made available, which turns SCVMM into a fabric controller for Windows Azure Pack and Azure Site Recovery integration:

Once the initial configuration was done, we deployed the NVGRE gateway hosts, DR hosts, IaaS hosts, Windows Azure Pack and the remaining System Center components in order to provide service offerings through the tenant portal.

If you are keen to know more about this process, I recommend reading our whitepaper, which covers this end-to-end:

Here’s an overview of the design after the initial configuration:

If we look at this from a different – and perhaps a more traditional perspective, mapping the different layers with each other, we have the following architecture and design of SCVMM, Windows Azure Pack, SPF and our host groups:

So far so good. The design of the stamp was finished and we were ready to proceed with the Azure Site Recovery implementation.

Integrating Azure Site Recovery

To be honest, at this point we thought the hardest part of the job was done, such as ensuring HA for all the workloads as well as integrating NVGRE to the environment, spinning up complex VM roles just to improve the tenants and so on and so forth.
We added ASR to the solution and were quite confident that this would work like a charm since we had SQL AlwaysOn as part of the solution.

We soon found out that we had to do some engineering before we could celebrate.

Here’s a description of the issue we encountered.

In the Microsoft Azure portal, you configure ASR and perform the mapping between your management servers and clouds and also the VM networks.

As I described earlier in this blog post, the initial design of Azure Site Recovery in an “Enterprise 2 Enterprise” (on-prem 2 on-prem) scenario, was to leverage two SCVMM management servers. Then the administrator had the opportunity to duplicate the network artifacts (network sites, VLAN, IP pools etc) across sites, ensuring that each VM could be brought online on the secondary site with the same IP configuration as on the primary site.

Sounds quite obvious and really something you would expect, yeah?

Moving away from that design to a single SCVMM management server (a single management server that is highly available is not the same as two SCVMM management servers) gave us some challenges.

1)      We could (of course) not create the same networking artifacts twice within a single SCVMM management server
2)      We could not create an empty logical network and map the primary network with this one. This would throw an error
3)      We could not use the primary network as our secondary as well, as this would give the VMs a new IP address from the IP pool
4)      Although we could update IP addresses in DNS, the customer required the exact same IP configuration on the secondary site post failover

Ok, what do we do now?
At that time it felt a bit awkward to say that we were struggling to keep the same IP configuration across sites.

After a few more cups of coffee, it was time to dive into the recovery plans in ASR to look for new opportunities.

A recovery plan groups virtual machines together for the purposes of failover and recovery, and it specifies the order in which groups of VMs should fail over. We were going to create several recovery plans, so that we could easily and logically group different kinds of workloads together and perform DR in a trusted way.

Here’s what the recovery plan for the entire stamp looks like:

So this recovery plan would power off the VMs in a specific order, perform the failover to the secondary site and then power on the VMs again in a certain order specified by the administrator.

What was interesting for us to see was that we could leverage our PowerShell skills as part of these steps.

Each step can have an associated script and a manual task assigned.
We found out that the first thing we had to do, before even shutting down the VMs, was to run a PowerShell script that would verify that the VMs would be connected to the proper virtual switch in Hyper-V.

Ok, but why?

Another good question. Let me explain.

Once you are replicating a virtual machine using Hyper-V Replica, you have the option to assign an alternative IP address to the replica VM. This is very interesting when you have different networks across your sites so that the VMs can be online and available immediately after a failover.
In this specific customer case, the VLAN(s) were stretched and made available on the secondary site as well, hence the requirement to keep the exact network configuration. In addition, all of the VMs had assigned static IP addresses from the SCVMM IP Pools.

However, since we ended up not doing any network mapping in the portal, to avoid both the errors and the wrong outcome, we decided to handle this with PowerShell.

When enabling replication on a virtual machine in this environment, and not mapping to a specific VM network, the replica VM would have the following configuration:

As you can see, the VM was connected to a certain switch, but the “Failover TCP/IP” checkbox was enabled with no info. You probably know what this means? Yes, the VM will come up with an APIPA configuration. No good.

What we did

We created a PowerShell script that:

a)       Detected the active Replica hosts before failover (using the Hyper-V PowerShell API)
b)      Ensured that the VM(s) were connected to the right virtual switch on Hyper-V (using the Hyper-V PowerShell API)
c)       Disabled the Failover TCP/IP settings on every VM
a.       If all of the above were successful, the recovery plan could continue to perform the failover
b.       If any of the above failed, the recovery plan was aborted
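
The steps above can be sketched roughly as follows. This is not the script we used in production, just a minimal illustration built on cmdlets from the Hyper-V PowerShell module. The host names, the switch name and the VM names are hypothetical placeholders, and I am assuming that clearing the failover IPv4 settings is equivalent to disabling "Failover TCP/IP" and that a non-zero exit code makes the recovery plan step fail.

```powershell
# Sketch only. Host names, switch name and VM names are hypothetical placeholders.
# Requires the Hyper-V PowerShell module and admin rights on the replica hosts.
$replicaHosts   = @('drhost01.domain.local', 'drhost02.domain.local')
$expectedSwitch = 'LogicalSwitch'
$planVMs        = @('wap01', 'spf01', 'sma01')

$ErrorActionPreference = 'Stop'   # treat every cmdlet error as terminating
$checked = @()

try {
    foreach ($hostName in $replicaHosts) {
        # a) Find the replica copies this host currently holds
        $replicas = Get-VMReplication -ComputerName $hostName -ReplicationMode Replica |
            Where-Object { $planVMs -contains $_.VMName }

        foreach ($replica in $replicas) {
            $adapters = Get-VMNetworkAdapter -ComputerName $hostName -VMName $replica.VMName

            foreach ($adapter in $adapters) {
                # b) Connect the vNIC to the expected virtual switch if needed
                if ($adapter.SwitchName -ne $expectedSwitch) {
                    Connect-VMNetworkAdapter -VMNetworkAdapter $adapter -SwitchName $expectedSwitch
                }
                # c) Clear the failover IPv4 settings (assumed equivalent to
                #    unticking "Failover TCP/IP" on the replica VM)
                Set-VMNetworkAdapterFailoverConfiguration -VMNetworkAdapter $adapter -ClearFailoverIPv4Settings
            }
            $checked += $replica.VMName
        }
    }

    # Every VM in the plan must have been found on one of the replica hosts
    $missing = $planVMs | Where-Object { $checked -notcontains $_ }
    if ($missing) { throw "No replica found for: $($missing -join ', ')" }
}
catch {
    Write-Error -Message "Pre-failover check failed: $_" -ErrorAction Continue
    exit 1   # assumption: a non-zero exit code makes the recovery plan step fail
}
```

Looping over the hosts instead of the VMs keeps the number of remote calls down, and the final check catches the case where a VM in the plan has no replica on any of the DR hosts.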

For this to work, you have to ensure that the following pre-reqs are met:

·        Ensure that you have at least one library server in your SCVMM deployment
·        If you have a HA SCVMM server deployment as we had, you also have a remote library share (example: \\fileserver.domain.local\libraryshare). This is where you store your PowerShell script (nameofscript.ps1). Then you must configure the share as follows:
a.       Open the Registry editor
b.       Navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft System Center Virtual Machine Manager Server\DRAdapter\Registration
c.        Edit the value ScriptLibraryPath
d.       Place the value as \\fileserver.domain.local\libraryshare\. Specify the fully qualified domain name (FQDN).
e.       Provide permission to the share location

This registry setting will replicate across your SCVMM nodes, so you only have to do this once.
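
If you prefer to script the registry change, something along these lines should do. Treat it as a sketch: it assumes the key is spelled HKLM:\SOFTWARE\Microsoft\Microsoft System Center Virtual Machine Manager Server\DRAdapter\Registration and that it already exists on the VMM node, and the share path is just the example from above.

```powershell
# Sketch only: run in an elevated PowerShell session on a VMM node.
# The key path is an assumption; the share is the example path from the text.
$regPath = 'HKLM:\SOFTWARE\Microsoft\Microsoft System Center Virtual Machine Manager Server\DRAdapter\Registration'
$share   = '\\fileserver.domain.local\libraryshare\'

if (-not (Test-Path -Path $regPath)) {
    throw "Registry key not found: $regPath"
}

# Point the ScriptLibraryPath value at the remote library share (FQDN)
Set-ItemProperty -Path $regPath -Name 'ScriptLibraryPath' -Value $share

# Read the value back to confirm the change
(Get-ItemProperty -Path $regPath -Name 'ScriptLibraryPath').ScriptLibraryPath
```

Permissions on the share itself still have to be granted separately.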

Once the script has been placed in the library and the registry changes are implemented, you can associate the script with one or more tasks within a recovery plan as shown below.

Performing the recovery plan(s) now would ensure that every VM that was part of the plan, was brought up at the recovery site with the same IP configuration as on the primary site.

With this, we had a “single-button” DR solution for the entire management stamp, including Windows Azure Pack and its resource providers.


Thursday, January 1, 2015

Azure Site Recovery - Survey

Happy New Year!

Now, let us get back to work.

I have made a very short survey just to get a better understanding of the potential DR scenarios with Microsoft Azure Site Recovery.

As you already know, Azure can be your DR site today, where you can have ongoing replication from your private cloud(s) to Azure, which eliminates the need for a secondary site that you have to manage and operate yourself.

However, there are some limitations when using Azure, such as lack of support for Generation 2 VMs and advanced usage of VHDX.

Please take 30 seconds to complete this short survey - and I will be very grateful.


Tuesday, December 30, 2014

SCVMM Fabric Controller Script – Update

Some weeks ago, I wrote a blog post to let you know that my demo script for creating management stamps and turning SCVMM into a fabric controller is now available for download.

I’ve made some updates to the SCVMM Fabric Controller script during the holidays, and you can download the PowerShell script from TechNet Gallery:

In this update, you’ll get:

More flexibility
Error handling
3 locations – which is the level of abstraction for your host groups. Rename these to fit your environment.
Each location contains all the main function host groups, like DR, Edge, IaaS and Fabric Management
Each IaaS host group has its corresponding Cloud
Native Uplink Profile for the main location will be created
A global Logical Switch with Uplink Port profile and Virtual Port Profiles will be created with a default virtual port profile for VM Roles
Custom property for each cloud (CreateHighlyAvailableVMRoles = true) to ensure HA for VM roles deployed through Windows Azure Pack
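
To give you an idea of what this looks like in VMM PowerShell, here is a heavily reduced sketch for a single location. It is not an excerpt from the actual script. The names "Location1" and "Location1 IaaS Cloud" are made up for illustration, and it assumes the VMM console (and with it the virtualmachinemanager module) is installed and connected to your VMM server.

```powershell
# Sketch only: hypothetical names, meant for an empty lab VMM environment.
Import-Module virtualmachinemanager

# One location host group with the main function host groups underneath it
$location = New-SCVMHostGroup -Name 'Location1'
$groups = @{}
foreach ($function in 'DR', 'Edge', 'IaaS', 'Fabric Management') {
    $groups[$function] = New-SCVMHostGroup -Name $function -ParentHostGroup $location
}

# The IaaS host group gets its corresponding cloud
$cloud = New-SCCloud -Name 'Location1 IaaS Cloud' -VMHostGroup $groups['IaaS']

# Create the custom property once, then set it on the cloud
$propertyName = 'CreateHighlyAvailableVMRoles'
$property = Get-SCCustomProperty -Name $propertyName
if (-not $property) {
    $property = New-SCCustomProperty -Name $propertyName -AddMember @('Cloud')
}
Set-SCCustomPropertyValue -InputObject $cloud -CustomProperty $property -Value 'true'
```

Keeping the host groups in a hashtable makes it easy to look up the right parent later, for example when you add more clouds or associate logical networks.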

Please note that you have to add hosts to your host groups before you can associate logical networks with each cloud created in SCVMM, so this is considered a post-deployment task.

I’ve received some questions since the first draft was uploaded to TechNet Gallery, as well as from my colleagues who have tested the new version:

·         Is this best practice and recommendation from your side when it comes to production design for SCVMM as a fabric controller?

Yes, it is. Especially now that the script more or less creates the entire design.
If you have read our whitepaper on Hybrid Cloud with NVGRE (Cloud OS), then you can see that we are following the same principles there, which helped us to democratize software-defined networking for the community.

·         I don’t think I need all the host groups, such as “DR” and “Edge”. I am only using SCVMM for managing my fabric

Although SCVMM can be seen as the primary management tool for your fabric, and not only a fabric controller when adding Azure Pack to the mix, I would like to point out that things might change in your environment. It is always a good idea to have the artifacts in place in case you grow, scale or add more functionality as you move forward. This script will lay the foundation for you to use whatever fabric scenario you would like, and at the same time keep things structured according to access, intelligent placement and functionality. Changing an SCVMM design over time isn’t straightforward, and in many cases you will end up with a “legacy” SCVMM design that you can’t add into Windows Azure Pack for obvious reasons.

Have fun and let me know what you think.

Sunday, December 14, 2014

SCVMM Fabric Controller Script

We are reaching the holidays, and besides public speaking, I am trying to slow down a bit in order to prepare for the arrival of my baby girl early in January.

However, I haven’t been all that lazy, and in this blog post I would like to share a script with you.

During 2014, I have presented several times on subjects like “management stamp”, “Windows Azure Pack”, “SCVMM” and “Networking”.

All of these subjects have something in common, and that is a proper design of the fabric in SCVMM to leverage the cloud computing characteristics that Azure Pack is bringing to the table.
I have visited too many customers and partners over the last months only to see that the design of the fabric in VMM is not scalable, or not designed in a way that makes any sense at all.

As a result of this, I had to create a PowerShell script that could easily show how it should be designed, based on one criterion: turning SCVMM into a universal fabric controller for all your datacenters and locations.

This means that the relationship between the host groups and the logical networks and network definitions needs to be planned carefully.
If you don’t design this properly, you can potentially have no control over where the VMs are deployed. And that is not a good thing.

This is the first version of this script and the plan is to add more and more stuff to it once I have the time.

The script can be downloaded here:

Please note that this script should only be executed in an empty SCVMM environment (lab), and you should change the variables to fit your environment.

Once the script has completed, you can add more subnets and link these to the right host groups.

The idea with this version is really just to give you a better understanding of how it should be designed and how you can continue using this design. 

Wednesday, December 3, 2014

Setting Static IP Address on a VM Post Deployment

This short blog post is meant to show you how you can grab an IP address from a VMM IP pool for your virtual machines post deployment.

Recently, I found out that during specific DR scenarios with ASR (E2E), you have to use static IP addresses for some of your VMs, depending on the actual recovery plan you have created (but that is a different blog post).

In order to allocate an IP address from the VMM IP Pool, you can use the following lines of PowerShell:

# Get the VM and the IP pool to allocate from
$vm = Get-SCVirtualMachine -Name "NameOfVM"
$staticIPPool = Get-SCStaticIPAddressPool -Name "NameOfIPPool"
# Reserve an address from the pool for the VM's first virtual network adapter
Grant-SCIPAddress -GrantToObjectType "VirtualNetworkAdapter" -GrantToObjectID $vm.VirtualNetworkAdapters[0].ID -StaticIPAddressPool $staticIPPool
# Switch the adapter to static IPv4 addressing
Set-SCVirtualNetworkAdapter -VirtualNetworkAdapter $vm.VirtualNetworkAdapters[0] -IPv4AddressType Static

Check the job view in VMM to see which IP is allocated to the vNIC on the VM and ensure that these settings are reflected within the guest operating system as well.
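
If you would rather stay in PowerShell than open the job view, a quick sketch like the one below should list what VMM has granted to the adapter. "NameOfVM" is the same placeholder as above, and I am assuming you only care about the first virtual network adapter.

```powershell
# Sketch only: "NameOfVM" is a placeholder, and only the first vNIC is checked.
$vm      = Get-SCVirtualMachine -Name "NameOfVM"
$adapter = $vm.VirtualNetworkAdapters[0]

# List the IP address(es) VMM has allocated to this virtual network adapter
Get-SCIPAddress -GrantToObjectID $adapter.ID | Format-List
```

You still have to make sure the same address is configured inside the guest operating system.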

Wednesday, November 19, 2014

Windows Azure Pack with DR add-on (ASR)

One of the good things with Windows Azure Pack is that it is an extensible solution where we are able to customize, extend and integrate WAP to meet our desired configuration.

I have already covered the majority of the API’s we have available, both from an admin perspective and from a tenant perspective.

These blog posts can be found here:

The intention of this blog post is to drive awareness of the solution that Microsoft now has made available.

Offering managed DR for IaaS workload with ASR and Windows Azure Pack

Many people have requested that Windows Azure Pack should have an integration with Hyper-V Replica, or Azure Site Recovery.
If you are not familiar with Azure Site Recovery as a concept, you can think of it as the umbrella for all the DR capabilities that Microsoft provides, including storage replication that will be available in Update Rollup 5 for SCVMM (currently in preview). Azure Site Recovery lets you use Hyper-V Replica through SCVMM on-premises to either replicate to a secondary datacenter (on-premises) or use Microsoft Azure as your DR target.
No matter what and where you go, the experience will be the same and provide you with consistency.

I will not cover the setup or the actual workflow of the DR integration with WAP since it is explained in great detail at the URL above.
Instead, I would point out the high-level design of this solution and what you really need to think of.

After you have installed Update Rollup 4 for Windows Azure Pack, you will see some small changes in the UI when you drill into the Plan in WAP and explore the VM Cloud services.

This is where we will enable DR as an add-on, meaning that tenants are able to associate that add-on to an existing subscription they have.

The DR add-on will consist of several SMA runbooks that you will have to import into your WAP environment in the Admin Portal.

Once this is done from the tenant side, this will effectively trigger the SMA runbooks that will replicate all the virtual machines running in that subscription to the target environment.
The subscription ID itself will be replicated with all the mapping down towards each and every tenant VM.
However, virtual networks (if using NVGRE) are not replicated. This means the tenant will have to recreate the networking artifacts in the secondary environment, and you, the service provider, must perform the initial network mapping in ASR.

The SMA runbooks can be scheduled so that once a new VM is deployed into that particular subscription, the VM will be scheduled for initial replication and be protected.

Now, over to the delicate explanation of the initial design in order to implement this.

In Azure Site Recovery, when using DR between on-premises sites, we do the mapping at the VMM Cloud level. The Cloud in VMM should contain Hyper-V hosts/clusters within one or more host groups that will be the foundation of the virtual machines and the replica.

As you may be aware, when you create hosting plans in Windows Azure Pack, the hosting plans that contain VM Services will be bound to a VMM cloud and a VMM server.
In other words, we are not able to replicate with ASR using a single cloud, although we could have two different host groups (primary and replica) within that cloud.

So since we have to have two clouds, we also need two plans. Hence we have an isolation issue to deal with in order to provide DR with a good tenant experience.

The subscription each tenant creates will be unique in the environment, and we are not able to use the same subscription twice within an environment. But if we had two subscriptions, the tenant would have to know which one to use, which could easily lead to mistakes.

So in order to keep the subscription ID and its resources, we need to have another Azure Pack environment.
And since we need to have another Azure Pack environment, we also need another instance of Service Provider Foundation (SPF).

So from a tenant perspective during a failover process, they will be redirected to the WAP environment which is currently online, sign in with their credentials and get access to their resources. The only thing that has changed is the URL to the tenant portal itself.

I know it can be hard to absorb this information at first, especially if we are not familiar with the concept of stamp and the actual architecture of the multi-tenant IaaS cloud platform we are dealing with. So I have created some graphics to show each layer and the purpose of each layer.

High-level overview of management stamps with Windows Azure Pack and Azure Site Recovery

Overview of the different layers for the VM Cloud Resource provider in the context of WAP with DR add-on:

Hopefully this makes sense and gives you a better understanding of the design of Windows Azure Pack with DR add-on.

Please note that this is a managed DR solution, where the service provider has a very clear responsibility.
They need to perform the initial setup, perform all the processes and ensure that testing and planning are compliant with the actual SLA they provide for this solution.

Monday, November 17, 2014

Speaking at Campus Days in Copenhagen

The last week in November, I will present several sessions at "Campus Days" in Copenhagen in Denmark.

Together with my good friend Flemming Riis, I will show different sides of datacenter and cloud management using a real world fabric in order to hit the high notes together with the audience.

What's cool with this conference is that all sessions will be available on Channel 9 afterwards.

I will have two presentations:

Mastering Networking in VMM (level 400)

A very interesting topic (to say the least) where I will cover the design and implementation of networking in VMM.
This should give you a complete overview of how to implement fabric networks, software-defined networking (NVGRE with NVGRE gateways) and lay the foundation for automation with Windows Azure Pack. If you have any questions related to networking in VMM, this is your chance to speak up, ask questions and join the interactive session.

Virtual Machine Manager (level 300)

A somewhat vague title, but many interesting things can be placed underneath the umbrella of Virtual Machine Manager. This can be your day-to-day management tool, or your fabric controller that will be harnessed by Windows Azure Pack.
Here we will touch several aspects such as compute, networking and storage management, as well as service templates, cloud and much more.

I also recommend that you join Dr. Riis' sessions, which you will later find online here: