Multisite (light) Disaster recovery
Latest revision as of 13:04, 17 March 2024
I have created a 4-part video series that will demonstrate how NSX-T multi-site (Disaster Recovery) works.
Part 4 is definitely the icing on the cake, but make sure you watch parts 1, 2 and 3 as well so that you understand the environment and can fully follow what is happening.
The full high-level steps
A bit of a spoiler here: recovering from a site failure is a lengthy process with a lot of (manual) steps, checks and specific prerequisites. The whole process took me around 45 minutes (after subtracting my own slowness)!
The full high-level steps that should be taken are described below and can be watched in Part 4 of the video series:
1. Make sure the DC1 NSX-T Manager(s) use the FQDN for component registration and backup
   - This is not the case out of the box
   - This can only be turned on (and off) with a REST API call
2. Verify that the backup is done correctly, with the FQDN in the folder name
3. Verify that the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
4. Deploy (a) new NSX-T Manager(s) on DC2 with a new IP address in a different IP range than the one the DC1 NSX-T Manager was in
5. SIMULATE A DISASTER IN DC1 + START CONTINUOUS PING FROM WEB01 (172.16.10.11) + START STOPWATCH
   - Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
6. Repoint the DNS A record to the (new) NSX-T Manager(s) in DC2
7. Make sure the new DC2 NSX-T Manager(s) use the FQDN for component registration and backup
   - This is basically the same as we did in step 1
8. Restore the backup on the new DC2 NSX-T Manager(s)
   - This may take around 20 minutes to finish
9. Verify that the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
   - This is basically the same as we did in step 3
10. Run the SRM Recovery Plan on the DC2 SRM Server and recover the Web, App and DB VMs of DC1
11. Log in to the (newly restored from backup) NSX-T Manager(s)
12. Move the T1 Gateway from the DC1-EN-CLUSTER (that is no longer available) to the DC2-EN-CLUSTER
13. Move the uplink from the DC1-T0 Gateway (that is no longer available) to the DC2-T0 Gateway
14. Verify that ping starts working again
    - Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
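The list above says FQDN usage can only be switched on with a REST API call. As a rough sketch of what that call might look like, the Python snippet below only builds the request and does not send it. The endpoint path `/api/v1/configs/management` and the `publish_fqdns` / `_revision` fields are my recollection of the NSX-T Manager API and should be checked against the API guide for your version. The host name used in the usage note is hypothetical.

```python
import json


def build_publish_fqdns_request(manager_host, revision, enable=True):
    """Describe the REST call that toggles FQDN publishing on an NSX-T Manager.

    Assumption: the setting lives at /api/v1/configs/management and is changed
    with a PUT that carries the current _revision value (read it first with a
    GET on the same URL). Verify this against your NSX-T API documentation.
    """
    return {
        "method": "PUT",
        "url": f"https://{manager_host}/api/v1/configs/management",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"publish_fqdns": enable, "_revision": revision}),
    }
```

Calling `build_publish_fqdns_request("nsxmgr.example.local", 0)` returns a dictionary that can be handed to the HTTP client of your choice, together with admin credentials. Passing `enable=False` builds the request that turns the setting off again.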
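The continuous ping and stopwatch in the list above are there to measure how long traffic between WEB01 and EXTERNAL is interrupted. If you log the ping results, a small helper like the one below could work out the outage afterwards. This is only a sketch: the sample format, a list of `(timestamp_in_seconds, reply_received)` pairs, is an assumption of mine and not something taken from the videos.

```python
def longest_outage(samples):
    """Return the longest outage in seconds found in a ping log.

    samples: iterable of (timestamp_seconds, reply_received) pairs in time
    order. An outage runs from the first lost ping to the next successful
    one. An outage that is still ongoing at the end of the log is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, reply_received in samples:
        if not reply_received and down_since is None:
            down_since = timestamp  # first lost ping starts the stopwatch
        elif reply_received and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None  # connectivity is back, stop the stopwatch
    return longest
```

For example, a log where replies stop at second 1 and return at second 5 gives an outage of 4 seconds.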
Have fun testing this out! I will be writing an extensive blog post about this soon, so look out for that one, but for now you will have to watch the 4-part video series.
PART 1» Introduction to the Lab POC and Virtual Network environment
PART 2» Ping and Trace-route tests to demonstrate normal operation of the Active and Standby deployment
PART 3» Ping and Trace-route tests to demonstrate normal operation of the Active and Active deployment
PART 4» Simulate failure on DC1 and continue operations from DC2
I am always trying to improve the quality of my articles, so if you see any errors or mistakes in this article, or you have suggestions for improvement, please contact me and I will fix them.