Communication Channel Health Check showing down / down
In NSX 6.2 you can run a "Communication Channel Health Check" to see whether the connections between the NSX Manager and a host's Control Plane Agent and Firewall Agent are healthy and up.
In my lab environment I ran into a situation where both the Control Plane Agent and the Firewall Agent connections were down.
Because of this I was also not able to push any firewall rules to that host.
So I started googling and came across this post. It suggested that the services that should be running might be down:
/etc/init.d/netcpad
/etc/init.d/vShield-Stateful-Firewall
So I verified the status of the services and stopped / started them again.
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall status
    root vShield-Stateful-Firewall is running
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad status
    root netCP agent service is running
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad stop
    watchdog-netcpa: Terminating watchdog process with PID 34973
    Memory reservation released for netcpa
    root netCP agent service is stopped
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall stop
    watchdog-vShield-Stateful-Firewall: Terminating watchdog process with PID 35483
    root vShield-Stateful-Firewall stopped
    watchdog-dfwpktlogs: Terminating watchdog process with PID 35463
    Resource pool 'host/vim/vmvisor/vsfwd' released.
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall start
    vShield-Stateful-Firewall is not running
    watchdog-dfwpktlogs: PID file /var/run/vmware/watchdog-dfwpktlogs.PID does not exist
    watchdog-dfwpktlogs: Unable to terminate watchdog: No running watchdog process for dfwpktlogs
    Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
    Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
    Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
    Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
    Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
    root vShield-Stateful-Firewall started
    [root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad start
    Memory reservation set for netcpa
    Reload security domains
    root netCP agent service starts
    [root@dc1-pod11-esx-a-03:~]
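If you need to repeat that sequence on more than one host, it could be wrapped in a small script. The following is only a sketch of the same status / stop / start steps, not an official VMware procedure. The function names and the `INIT_DIR` variable are my own inventions; `INIT_DIR` exists only so the script can be tried against stub scripts away from a real host.

```shell
#!/bin/sh
# Sketch only: repeats the status / stop / start sequence from the post.
# INIT_DIR is a hypothetical override (not an ESXi setting) so the function
# can be dry-run against stub scripts off-host.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Run one init script action, e.g. "run_agent netcpad status".
run_agent() {
    script="$INIT_DIR/$1"
    if [ ! -x "$script" ]; then
        echo "$1: no executable init script at $script" >&2
        return 1
    fi
    "$script" "$2"
}

restart_nsx_agents() {
    # Record the state before touching anything.
    run_agent vShield-Stateful-Firewall status
    run_agent netcpad status
    # Stop in the order used in the post, then start in reverse.
    run_agent netcpad stop
    run_agent vShield-Stateful-Firewall stop
    run_agent vShield-Stateful-Firewall start
    run_agent netcpad start
    # Confirm both agents report as running again.
    run_agent vShield-Stateful-Firewall status
    run_agent netcpad status
}
```

Calling `restart_nsx_agents` from the ESXi shell would run the whole sequence. In my case the restart itself completed and both services reported as started, as the transcript shows.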
This unfortunately still did not resolve the problem ...
Next I simply rebooted the host, but that did not fix the problem either.
I eventually fixed it with the following steps:
- Put host in maintenance mode
- Take it out of the cluster (drag and drop in DC object)
- Reboot twice
- Put it back into the cluster (drag and drop in cluster object)
- Take it OUT OF maintenance mode
- Force sync / resolve (in host preparation)
These actions caused a reinstall of the VIBs on the faulty hosts, and that eventually resolved the issue. I was trying to resolve this issue without a host reboot, but this was not possible ...
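To confirm that the VIBs really are back after the reinstall, you could check the output of `esxcli software vib list` on the host. The helper below is a rough sketch: the function name and the `NSX_VIBS` variable are made up for this example, and the two VIB names are my assumption for NSX 6.2 (other releases use different names), so adjust them to whatever your version installs.

```shell
#!/bin/sh
# Sketch: report which of the expected NSX VIBs appear in a VIB listing.
# The VIB names below are an assumption for NSX 6.2; adjust for your release.
NSX_VIBS="${NSX_VIBS:-esx-vsip esx-vxlan}"

# Reads "esxcli software vib list" output on stdin.
# Returns 0 if every expected VIB is listed, 1 otherwise.
check_nsx_vibs() {
    listing=$(cat)
    missing=0
    for vib in $NSX_VIBS; do
        # The VIB name is the first column of each listing line.
        if printf '%s\n' "$listing" | grep -q "^$vib[[:space:]]"; then
            echo "$vib: installed"
        else
            echo "$vib: MISSING"
            missing=1
        fi
    done
    return "$missing"
}
```

On the host this would be used as `esxcli software vib list | check_nsx_vibs`. Reading from stdin keeps the check separate from the esxcli call, so it can also be run against a saved listing.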