Multisite (light) Disaster recovery
I have created a 4-part video series that demonstrates how NSX-T multi-site disaster recovery works.
Part 4 is definitely the cherry on top, but make sure you watch parts 1, 2, and 3 as well to get a good understanding of the environment and to fully understand what is happening.
The full high-level steps
A bit of a spoiler here: recovering from a site failure is a lengthy process with a lot of (manual) steps, checks, and specific prerequisites. The whole process took me around 45 minutes (if I subtract my own slowness)!
The full high-level steps that should be taken are described below and can be watched in Part 4 of the video series:
- Make sure the DC1 NSX-T Manager(s) are using the FQDN for component registration and backup
- This is not the case out of the box
- This can only be turned on (and off) with a REST API call (a sketch of that call follows this list)
- Verify that the backup completes correctly, with the FQDN in the backup folder name
- Verify that the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
- Deploy (a) new NSX-T Manager(s) on DC2 with a new IP address in a different IP range than the DC1 NSX-T Manager was in
- SIMULATE A DISASTER IN DC1 + START CONTINUOUS PING FROM WEB01 (172.16.10.11) + START STOPWATCH
- Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
- Repoint the DNS A record to the (new) NSX-T Manager(s) in DC2 (a hedged DNS-update example follows this list)
- Make sure this new DC2 NSX-T Manager(s) is using the FQDN for component registration and backup
- This is basically the same as we did in step 1
- Restore the backup on the new DC2 NSX-T Manager(s)
- This may take around 20 minutes to finish
- Verify that the FQDN is used in the registration process on the Host Transport Nodes and the Edge Transport Nodes towards the controller
- This is basically the same as we did in step 3
- Run the SRM Recovery Plan on the DC2 SRM Server and recover the Web, App, and DB VMs of DC1
- Log in to the (newly restored from backup) NSX-T Manager(s)
- Move the T1 Gateway from the DC1-EN-CLUSTER (which is no longer available) to the DC2-EN-CLUSTER (see the Policy API sketch after this list)
- Move the uplink from the DC1-T0 Gateway (which is no longer available) to the DC2-T0 Gateway
- Verify that the ping starts working again
- Ping is done from WEB01 (172.16.10.11) -> EXTERNAL (192.168.99.100) and the other way around
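
To give you an idea of what the REST API call from step 1 (and again for the new DC2 manager) looks like, here is a minimal Python sketch against the documented /api/v1/configs/management endpoint. The manager FQDN and the credentials are placeholders for my lab, so substitute your own.

# Minimal sketch: enable FQDN-based component registration and backup
# naming on an NSX-T Manager via PUT /api/v1/configs/management.
import requests

NSX_MGR = "https://nsxmanager.lab.local"  # placeholder manager FQDN
AUTH = ("admin", "VMware1!VMware1!")      # placeholder lab credentials

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only: NSX-T ships with a self-signed certificate

# Fetch the current config first; the PUT needs the current _revision value.
cfg = session.get(f"{NSX_MGR}/api/v1/configs/management").json()

cfg["publish_fqdns"] = True  # set to False to turn FQDN publishing off again
resp = session.put(f"{NSX_MGR}/api/v1/configs/management", json=cfg)
resp.raise_for_status()
print(resp.json())  # should echo back "publish_fqdns": true

The same call with publish_fqdns set to False reverts the change, which is what I mean by "turned on (and off)" above.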
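How you repoint the DNS A record depends entirely on your DNS setup. As one hedged example, if your lab DNS server accepts TSIG-signed dynamic updates (BIND-style), the dnspython package can flip the record; the zone, key, server, and IP addresses below are all made-up placeholders.

# Hedged sketch: repoint the NSX-T Manager A record to the DC2 manager
# with a dynamic DNS update (requires the dnspython package).
import dns.query
import dns.tsigkeyring
import dns.update

# Placeholder TSIG key for the lab zone (name and base64 secret are made up).
keyring = dns.tsigkeyring.from_text({"lab-key.": "bXlzZWNyZXQ="})

update = dns.update.Update("lab.local", keyring=keyring)
# Replace the A record so nsxmanager.lab.local resolves to the DC2 manager.
update.replace("nsxmanager", 300, "A", "10.2.0.10")  # placeholder DC2 IP

dns.query.tcp(update, "10.0.0.53")  # placeholder DNS server address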
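Moving the T1 Gateway to the DC2-EN-CLUSTER comes down to patching the gateway's locale-services with the DC2 edge cluster path via the Policy API. The sketch below assumes a T1 with the Policy ID T1-GW and a locale-services child with ID default; both IDs and the edge cluster UUID are placeholders you would look up in your own environment first.

# Hedged sketch: re-home a Tier-1 gateway from the failed DC1 edge cluster
# to the DC2 edge cluster by patching its locale-services (Policy API).
import requests

NSX_MGR = "https://nsxmanager.lab.local"  # placeholder manager FQDN
AUTH = ("admin", "VMware1!VMware1!")      # placeholder lab credentials

# Path of DC2-EN-CLUSTER; list the edge clusters first to find the real UUID:
#   GET /policy/api/v1/infra/sites/default/enforcement-points/default/edge-clusters
DC2_EDGE_CLUSTER_PATH = (
    "/infra/sites/default/enforcement-points/default/"
    "edge-clusters/11111111-2222-3333-4444-555555555555"  # placeholder UUID
)

resp = requests.patch(
    f"{NSX_MGR}/policy/api/v1/infra/tier-1s/T1-GW/locale-services/default",
    json={"edge_cluster_path": DC2_EDGE_CLUSTER_PATH},
    auth=AUTH,
    verify=False,  # lab only: self-signed certificate
)
resp.raise_for_status()

The uplink move from the DC1-T0 Gateway to the DC2-T0 Gateway is more environment-specific, so watch Part 4 of the video series for that step.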
Have fun testing this out! I will be writing an extensive blog about this soon, so look out for that one, but for now you have to watch the 4-part video series below.
PART 1: Introduction to the Lab / POC / Virtual Network environment
https://www.youtube.com/watch?v=rqmuTJeuAeA
PART 2: Ping + Trace-route tests to demonstrate normal operation of the Active/Standby deployment
https://www.youtube.com/watch?v=c-HkB2PCcas
PART 3: Ping + Trace-route tests to demonstrate normal operation of the Active/Active deployment
https://www.youtube.com/watch?v=MAp7BTDjfag
PART 4: Simulate failure on DC1 and continue operations from DC2
https://www.youtube.com/watch?v=auuNPPaQkV0
I am always trying to improve the quality of my articles, so if you see any errors or mistakes in this article, or you have suggestions for improvement, please contact me and I will fix them.