Communication Channel Health Check showing down / down

From Iwan

In NSX 6.2 it is possible to run a "Communication Channel Health Check" to see whether the connections from the NSX Manager to the Control Plane Agent and the Firewall Agent on a host are healthy and up and running.

I ran into a problem in my lab environment where both the Control Plane Agent and the Firewall Agent connections were down.

0001.png

Because of this I was also not able to push any firewall rules to that host.

0002.png

So I started googling, and I came across this post. It suggested that the following services, which should be running, might be down:

/etc/init.d/netcpad 
/etc/init.d/vShield-Stateful-Firewall

So I verified the status of the services and stopped / started them again.

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall status
root vShield-Stateful-Firewall is running

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad status
root netCP agent service is running

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad stop
watchdog-netcpa: Terminating watchdog process with PID 34973
Memory reservation released for netcpa
root netCP agent service is stopped

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall stop
watchdog-vShield-Stateful-Firewall: Terminating watchdog process with PID 35483
root vShield-Stateful-Firewall stopped
watchdog-dfwpktlogs: Terminating watchdog process with PID 35463
Resource pool 'host/vim/vmvisor/vsfwd' released.

[root@dc1-pod11-esx-a-03:~] /etc/init.d/vShield-Stateful-Firewall start
vShield-Stateful-Firewall is not running
watchdog-dfwpktlogs: PID file /var/run/vmware/watchdog-dfwpktlogs.PID does not exist
watchdog-dfwpktlogs: Unable to terminate watchdog: No running watchdog process for dfwpktlogs
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
Resource pool 'host/vim/vmvisor/vsfwd' release failed. retrying..
root vShield-Stateful-Firewall started

[root@dc1-pod11-esx-a-03:~] /etc/init.d/netcpad start
Memory reservation set for netcpa
Reload security domains
root netCP agent service starts
[root@dc1-pod11-esx-a-03:~]
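
The manual session above can be wrapped in a small script so it is easy to repeat on other hosts. This is only a rough sketch, not something from the post I found: the function names and the INIT_DIR override are my own (hypothetical) additions, and the only things taken from the host are the two init script names. It assumes both scripts accept status, stop and start, as they did in my session.

```shell
#!/bin/sh
# Sketch: check and restart the two NSX host agents named in this post.
# INIT_DIR is a hypothetical override so the functions can be pointed at
# another directory for a dry run; on the host it defaults to /etc/init.d.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Init script names as used in the session above.
NSX_SERVICES="netcpad vShield-Stateful-Firewall"

# Run one action (status, stop or start) against every service.
# Returns non-zero if a script is missing or an action fails.
nsx_services() {
    action="$1"
    rc=0
    for svc in $NSX_SERVICES; do
        script="$INIT_DIR/$svc"
        if [ ! -x "$script" ]; then
            echo "missing: $script" >&2
            rc=1
            continue
        fi
        echo "== $svc $action"
        "$script" "$action" || rc=1
    done
    return "$rc"
}

# Status, stop, start, then status again for both services.
nsx_restart() {
    nsx_services status
    nsx_services stop
    nsx_services start
    nsx_services status
}
```

After sourcing the file on the host, calling nsx_restart runs the whole sequence. Note that it stops and starts the services in the same order, whereas I started them by hand in the reverse order; I do not know whether the order matters.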

This unfortunately still did not resolve the problem ...

0001.png

My next step was to simply reboot the host, but that did not fix the problem either.

I eventually fixed it with the following steps:

  1. Put the host in maintenance mode
  2. Take it out of the cluster (drag and drop it onto the DC object)
  3. Reboot twice
  4. Put it back into the cluster (drag and drop it onto the cluster object)
  5. Take it OUT OF maintenance mode
  6. Force sync / resolve (in host preparation)

0003.png

These actions caused a reinstall of the VIBs on the faulty hosts, and that eventually resolved the issue. I was trying to resolve this issue without a host reboot, but this was not possible ...
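
To confirm that the VIBs are really back after the reinstall, the installed VIB list on the host can be filtered. The sketch below is an illustration under assumptions: the esxcli software vib list command is standard on ESXi, but the VIB name prefixes in the pattern (esx-vsip and esx-vxlan) are what I would expect for this NSX version and may differ on yours, and the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: show only the NSX-related VIBs from a VIB listing.
# Assumption: the NSX VIB names start with "esx-vsip" or "esx-vxlan";
# adjust NSX_VIB_PATTERN if your NSX version ships different names.
NSX_VIB_PATTERN='^esx-(vsip|vxlan)'

# Read `esxcli software vib list` style output on stdin and keep the
# lines whose first column matches the pattern.
filter_nsx_vibs() {
    grep -E "$NSX_VIB_PATTERN"
}

# On the host itself:
#   esxcli software vib list | filter_nsx_vibs
```

If the filter prints nothing, the VIBs are either not installed or named differently than assumed here.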