Multi-site (light) Disaster Recovery
I have created a 4-part video series that demonstrates how NSX-T multi-site disaster recovery works.
Part 4 is definitely the cherry on top, but make sure you watch parts 1, 2 and 3 as well to get a good understanding of the environment and to fully understand what is happening.
The full high-level steps
A bit of a spoiler here: recovering from a site failure is a lengthy process with a lot of (manual) steps, checks and specific prerequisites. The whole process took me around 45 minutes (if I subtract my own slowness)!
The full high-level steps that should be taken are described below and can be watched in Part 4 of the video series:
- Make sure the DC1 NSX-T Manager(s) uses the FQDN for component registration and backup
- This is not the case out of the box
- This can only be turned on (and off) with a REST API call (a sketch of that call follows after the step list)
- Verify that the backup completes correctly and that the FQDN is used in the backup folder name
- Verify that the FQDN is used by the Host Transport Nodes and the Edge Transport Nodes when they register with the controller
- Deploy (a) new NSX-T Manager(s) in DC2 with a new IP address in a different IP range than the one the DC1 NSX-T Manager was in
- SIMULATE A DISASTER IN DC1 + START CONTINUOUS PING FROM WEB01 (172.16.10.11) + START STOPWATCH
- Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around (a small ping/stopwatch helper follows after the step list)
- Repoint the DNS A record to the (new) NSX-T Manager(s) in DC2
- Make sure this new DC2 NSX-T Manager(s) uses the FQDN for component registration and backup
- This is basically the same as we did in step 1
- Restore the backup on the new DC2 NSX-T Manager(s)
- This may take around 20 minutes to finish
- Verify that the FQDN is used by the Host Transport Nodes and the Edge Transport Nodes when they register with the controller
- This is basically the same as we did in step 3
- Run the SRM Recovery Plan on the DC2 SRM Server and recover the Web, App and DB VMs of DC1
- Log in to the (newly restored from backup) NSX-T Manager(s)
- Move the T1 Gateway from the DC1-EN-CLUSTER (which is no longer available) to the DC2-EN-CLUSTER (a Policy API sketch of this move follows after the step list)
- Move the uplink from the DC1-T0 Gateway (which is no longer available) to the DC2-T0 Gateway
- Verify that the ping starts working again
- Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
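Here is a quick sketch of the REST API call from steps 1 and 6, written in Python. The /api/v1/configs/management endpoint with its publish_fqdns flag is the documented way to toggle FQDN publishing; the manager FQDN and credentials below are placeholders for my lab values, so replace them with your own (and note that certificate verification is disabled, which is only acceptable in a lab).

```python
import requests

# Placeholders -- replace with your own NSX-T Manager FQDN and credentials
NSX_MGR = "https://nsxtmanager.lab.local"
AUTH = ("admin", "VMware1!VMware1!")

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only: the manager uses a self-signed certificate

# Read the current management config; it contains publish_fqdns and the
# _revision number that the PUT below has to echo back
cfg = session.get(f"{NSX_MGR}/api/v1/configs/management").json()
print("publish_fqdns is currently:", cfg["publish_fqdns"])

# Turn FQDN publishing on so registration and backups use the FQDN instead of the IP
cfg["publish_fqdns"] = True
resp = session.put(f"{NSX_MGR}/api/v1/configs/management", json=cfg)
resp.raise_for_status()
print("publish_fqdns is now:", resp.json()["publish_fqdns"])
```

Running the same call with "publish_fqdns" set back to False turns it off again. For step 6 you simply point NSX_MGR at the new DC2 manager and repeat.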
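For the continuous ping and stopwatch in step 4, I simply eyeball the ping output in the video, but a small helper like the one below (run on WEB01, assuming a Linux guest with Python 3) does both jobs at once: it pings EXTERNAL once per second and prints how long connectivity was lost.

```python
import subprocess
import time

TARGET = "192.168.99.100"  # EXTERNAL, as used in the test; adjust as needed
outage_started = None

while True:
    # One ping with a 1-second timeout (Linux ping syntax)
    alive = subprocess.run(
        ["ping", "-c", "1", "-W", "1", TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

    now = time.time()
    if not alive and outage_started is None:
        outage_started = now
        print("Connectivity lost, stopwatch started")
    elif alive and outage_started is not None:
        print(f"Connectivity restored after {now - outage_started:.0f} seconds")
        outage_started = None

    time.sleep(1)
```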
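Step 11 (moving the T1 Gateway to the DC2 edge cluster) is done by hand from the NSX-T Manager in Part 4, but the same change can be scripted against the Policy API by patching the edge_cluster_path on the Tier-1's locale-services object. The sketch below leans on assumptions: the Tier-1 ID (T1-GW), the manager FQDN, the credentials and the DC2 edge cluster UUID are all placeholders you need to look up in your own environment.

```python
import requests

# Placeholders -- replace with your own values
NSX_MGR = "https://nsxtmanager.lab.local"
AUTH = ("admin", "VMware1!VMware1!")
T1_ID = "T1-GW"                                   # the Tier-1 gateway to move
DC2_EDGE_CLUSTER_ID = "<uuid-of-DC2-EN-CLUSTER>"  # System > Fabric > Nodes > Edge Clusters

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only

# The edge cluster association lives on the Tier-1's locale-services child
# object, so look up its ID first
ls = session.get(f"{NSX_MGR}/policy/api/v1/infra/tier-1s/{T1_ID}/locale-services").json()
ls_id = ls["results"][0]["id"]

# Re-point the Tier-1 from the failed DC1 edge cluster to the DC2 edge cluster
body = {
    "edge_cluster_path": (
        "/infra/sites/default/enforcement-points/default"
        f"/edge-clusters/{DC2_EDGE_CLUSTER_ID}"
    )
}
resp = session.patch(
    f"{NSX_MGR}/policy/api/v1/infra/tier-1s/{T1_ID}/locale-services/{ls_id}", json=body
)
resp.raise_for_status()
print(f"Tier-1 {T1_ID} now uses edge cluster {DC2_EDGE_CLUSTER_ID}")
```

The uplink move from the DC1-T0 Gateway to the DC2-T0 Gateway in step 12 is a different kind of change (interfaces and routing on the T0) and is shown in Part 4.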
Have fun testing this out! I will be writing an extensive blog post about this soon, so look out for that one, but for now you will have to watch the 4-part video series:
PART 1» Introduction to the Lab POC and Virtual Network environment
PART 2» Ping and Trace-route tests to demonstrate normal operation of the Active and Standby deployment
PART 3» Ping and Trace-route tests to demonstrate normal operation of the Active and Active deployment
PART 4» Simulate failure on DC1 and continue operations from DC2
I am always trying to improve the quality of my articles, so if you see any errors or mistakes in this article, or if you have suggestions for improvement, please contact me and I will fix them.