Friday, July 30, 2010

Random server crashes R300 + IPMI = BAD

Ever since enabling IPMI on our Dell servers last week, we have been experiencing random hangs on our R300s. I suspected IPMI immediately, particularly IPMI over a VLAN. When I finally went to the data center to reboot a server myself, I noticed the following error on the front LCD.

E1410 CPU 1 IERR


Some googling suggests that this error points to a faulty CPU, and our Dell contact suggested that it was probably a memory or drive failure. However, further reading suggested that the error can also have non-hardware causes.

Going back to the original IPMI theory, I found that I could reproduce the hang quite easily by starting parallel iperf sessions between an R300 and another host to saturate the interface, then running constant ipmitool queries against the R300. This locked up the R300 within 10 minutes, consistently.
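For anyone who wants to try the same reproduction, here is a rough sketch of it in Python. It is not the exact procedure I used. It assumes the script runs on the other host, that the R300 is already running `iperf -s`, and that the BMC answers IPMI over LAN with the `lan` interface. The function names, stream count, addresses and credentials are all made up for illustration.

```python
import subprocess
import time


def iperf_cmd(target, streams=4, seconds=600):
    """Build an iperf client command that opens several parallel streams."""
    return ["iperf", "-c", target, "-P", str(streams), "-t", str(seconds)]


def ipmitool_cmd(bmc_host, user, password):
    """Build an ipmitool command that reads the sensor list over the LAN."""
    return ["ipmitool", "-I", "lan", "-H", bmc_host,
            "-U", user, "-P", password, "sdr", "list"]


def stress(target, bmc_host, user, password, seconds=600,
           popen=subprocess.Popen, run=subprocess.run):
    """Saturate the link with iperf while querying IPMI in a tight loop.

    Returns the number of IPMI queries that completed before the deadline
    or before the BMC stopped answering.
    """
    # Start the network load in the background.
    load = popen(iperf_cmd(target, seconds=seconds))
    deadline = time.monotonic() + seconds
    queries = 0
    try:
        while time.monotonic() < deadline:
            try:
                run(ipmitool_cmd(bmc_host, user, password),
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                    timeout=30)
            except subprocess.TimeoutExpired:
                # No answer from the BMC; the host may have locked up.
                break
            queries += 1
    finally:
        load.terminate()
    return queries
```

Calling something like `stress("192.0.2.10", "192.0.2.10", "root", "secret")` (placeholder address and credentials) runs the load and the query loop for ten minutes. With a shared NIC the OS and the BMC may well have different addresses, which is why the two are separate arguments.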

I resolved the issue by moving the primary network interface for the OS to NIC #2, leaving NIC #1 for exclusive use by IPMI. In this configuration I was not able to crash the server in 30 minutes and it has run all night without issue.

When I discussed the issue and resolution with one of the FreeBSD developers, he said that this is not just a Dell issue: sharing IPMI with the LAN on FreeBSD is really dodgy, depending on the particular NIC chipset in use (the Broadcom bge driver in this case). The VLAN tagging may have been the straw that broke the camel's back here. The server that caused the most trouble in this episode had previously run for over a year with IPMI enabled, but no VLAN tagging. To be fair, we were not previously doing any monitoring of this machine via IPMI, so the potential exposure was far less.
