My PSODs are finally gone!!!
Hi everyone,
A few months ago I bought a powerful new lab computer. I was very happy with it and I wrote an article about it as well. I finished building this monster lab PC on the 16th of May 2016.
I was all happy and started to build my lab with some nested ESX hosts, and that's when my PSOD nightmare began... With nested host VMs turned on, I could not keep my physical lab PC (running VMware ESX 6) up for more than 4 days without a PSOD. For people who do not know what a PSOD is: it is a "Purple Screen Of Death", and it looks like this:
So I started googling to try to find a solution for this problem ... but could not find anything. I opened internal (VMware) Socialcast posts with very detailed information asking for help, still nothing. I asked around on various Slack groups / channels, but still nothing helpful.
I opened a case with Supermicro on the 30th of June 2016 and they basically told me that they could not give support because it was not a "barebone" machine, and they do not support these issues on self-built systems. I guess this is understandable, because they don't have control over people using incompatible parts. So I started testing and changing things... things I could find on the internet that might be the cause:
1) Faulty CPU - so I tested it with a utility called CPUburn, which proved my CPU was working fine
2) Faulty RAM - so I tested it with a utility called Memtest86, which proved my RAM was working fine
3) A number of BIOS changes - which did not help at all
4) BIOS update - I was already running the latest publicly available BIOS
Eventually someone gave me a tip that a guy on the internet named Paul Braran could help me out, so I contacted him. In the meantime I found another guy who lives in the same country as I do (The Netherlands), Rob Maas, and he was experiencing exactly the same issues, but with slightly different hardware... actually the hardware that Paul Braran is promoting on his blogs... So I was not alone anymore :-)
So after contacting Paul, and sending some mails back and forth, I came to the conclusion that Paul did not have any solution for me ...
At this point I was a bit sad, because I had just bought a $5000 (USD) piece of hardware that was unusable and I had no clue where to look for the solution.
In the meantime I got wind of 4 internal PRs within VMware investigating this problem as well. So now suddenly there were more than two of us with this problem!
Time passed, and I came up with a workaround myself: scheduling a reboot every night at 03:00 AM for my nested hosts. This worked for 10 days ... and after 10 days of uptime, yes, a PSOD on day 11. So I could start all over again ... but hey, 11 days is more than 4, right?
Suddenly I managed to pull some information out of one of the PRs, with a message that this problem is related to a bug in some CPU models and that Intel is aware of this and investigating the issue...
And then, on the 6th of September, Niels Hagoort sent me a message on Twitter:
And I started investigating this article.
I was actually happy with this article because I was "affected", and it felt like I had some more ground to stand on.
So I opened a case with Intel on 3 September 2016 (also on the recommendation of Frank Denneman, whom I spoke with during VMworld US 2016) ... and this was Intel's response:
They basically referred to the same link Niels sent me, and they told me to contact Supermicro again and ask for a BIOS update that fixed the "microcode" (whatever that is). So I started to do some investigation and tried to look up my microcode.
And this was done with this command (on the ESXi host):
vsish
get /hardware/cpu/cpuList/0
Output snippets:
CPU information {
Family:6
Model:79
Type:0
Stepping:1
...
1:CPUID leaf {
EAX:0x000406f1
EBX:0x00200800
ECX:0x77fefbf7
EDX:0xbfebfbff
}
...
Number of microcode updates:0
Original Revision:0x0b00001b
Current Revision:0x0b00001b
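As a cross-check, the Family, Model and Stepping values in the output can be recomputed from the EAX value of CPUID leaf 1. Below is a minimal Python sketch, assuming the standard Intel layout of that register; the function name decode_cpuid_signature is made up for this example and is not part of vsish.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used when the base family is 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# EAX value taken from the vsish output above.
family, model, stepping = decode_cpuid_signature(0x000406F1)
print(family, model, stepping)  # 6 79 1
```

The result matches the Family:6, Model:79, Stepping:1 lines that vsish reports for this CPU.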
So on the 7th of September I opened another case with Supermicro explaining the situation and adding the VMware KB article and giving them the above output. Within ONE hour I received an email back with a BIOS build that was not publicly available with a possible fix.
The next day I updated the BIOS, which was a very intense process, because if you want to do a BIOS update through the IPMI you need some kind of stupid licence (SFT-OOB-LIC) which costs around $30. Getting this licence would take more than 1 day and I could not wait, so I decided to follow this procedure.
So after fighting with the USB boot (I needed to change my boot method in the BIOS from UEFI to DUAL), I finally got the BIOS updated:
Before:
After:
I did another check on the microcode version, and this was indeed different.
Output snippets:
CPU information {
Family:6
Model:79
Type:0
Stepping:1
...
1:CPUID leaf {
EAX:0x000406f1
EBX:0x00200800
ECX:0x77fefbf7
EDX:0xbfebfbff
}
...
Number of microcode updates:0
Original Revision:0x0b000010
Current Revision:0x0b000010
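To avoid comparing the two revisions by eye, the values can be pulled out of saved vsish output with a few lines of Python. This is only a sketch: the helper name microcode_revisions and the idea of holding the output in a string are assumptions for this example, and the field names are copied from the snippets above.

```python
import re


def microcode_revisions(vsish_output):
    """Return (original, current) microcode revisions as integers."""
    found = {}
    for label in ("Original Revision", "Current Revision"):
        match = re.search(label + r"\s*:\s*(0x[0-9a-fA-F]+)", vsish_output)
        if match is None:
            raise ValueError("missing field: " + label)
        found[label] = int(match.group(1), 16)
    return found["Original Revision"], found["Current Revision"]


# Revision lines copied from the two output snippets in this post.
before = "Original Revision:0x0b00001b\nCurrent Revision:0x0b00001b\n"
after = "Original Revision:0x0b000010\nCurrent Revision:0x0b000010\n"

changed = microcode_revisions(before)[1] != microcode_revisions(after)[1]
print("microcode changed:", changed)
```

The sketch only reports whether the current revision differs between the two captures; it does not judge which revision contains the fix.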
The wait for the next PSOD began again ...
The good news is I am still waiting :-) I have executed the commands below every day now, with success!
[root@esx-03:~] esxcli system time get
2016-09-09T09:21:53Z
[root@esx-03:~] uptime
9:21:59 up 10:05:39, load average: 0.28, 0.25, 0.24
[root@esx-03:~]
[root@esx-03:~] esxcli system time get
2016-09-19T21:45:03Z
[root@esx-03:~] uptime
21:45:04 up 10 days, 22:28:44, load average: 0.22, 0.23, 0.23
[root@esx-03:~]
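The two checks above can also be verified against each other: if the host never went down in between, the second uptime should equal the first uptime plus the time elapsed between the two timestamps. A small Python sketch of that arithmetic, using only the values printed above; the one-minute tolerance is an arbitrary assumption to allow for the few seconds between running the two commands.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"

# Timestamps reported by "esxcli system time get".
first_check = datetime.strptime("2016-09-09T09:21:53Z", FMT)
second_check = datetime.strptime("2016-09-19T21:45:03Z", FMT)

# Uptimes reported by "uptime" right after each timestamp.
first_uptime = timedelta(hours=10, minutes=5, seconds=39)
second_uptime = timedelta(days=10, hours=22, minutes=28, seconds=44)

# Uptime we would expect at the second check if no reboot happened.
expected = first_uptime + (second_check - first_check)
drift = abs(expected - second_uptime)
no_reboot = drift < timedelta(minutes=1)

print("expected uptime:", expected)
print("no reboot in between:", no_reboot)
```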
This actually fixed my problem!
My timeline from detecting to fixing the problem:
- 16 May 2016 - Built my lab PC
- 20 May 2016 - First PSOD
- 30 June 2016 - Opened first case at Supermicro
- 20 July 2016 - Found out about the internal PRs at VMware for these problems
- 31 August 2016 - Frank Denneman advised me to open a case with Intel
- 6 September 2016 - Niels Hagoort pointed out the KB link
- 7 September 2016 - Opened case at Intel
- 7 September 2016 - Opened second case at Supermicro
- 7 September 2016 - Received new BIOS
- 8 September 2016 - Updated BIOS + FIXED the problem