Friday, May 3, 2013
Constant load on R300 running FreeBSD 9.1
Wednesday, July 27, 2011
I/O errors on zfs import?
Jul 25 13:59:54 leopard kernel: mfi0: I/O error, status= 12 scsi_status= 0
Jul 25 13:59:54 leopard kernel: mfi0: sense error 0, sense_key 0, asc 0, ascq 0
Jul 25 13:59:54 leopard kernel: mfid1: hard error cmd=read 0-255
It was suggested that I should update the firmware on the disks, so this morning I went and updated all the disks, and the PERC 6/E. Voila! No more I/O errors on import.
For reference, here is the link to the firmware download I used. It is a Windows executable that generates a bootable USB key containing the firmware updater for the disks. I also used the underlying DOS environment to apply the firmware update for the PERC 6/E, and for our brand new PERC H800 that came from Dell with ancient firmware.
Friday, April 1, 2011
Multiple Ambient Temp sensors in the Dell R610
tom@R610:~-> sudo ipmitool sdr type "Temperature" | grep -i ambien
Ambient Temp | 07h | ok | 10.1 | 22 degrees C
Ambient Temp | 08h | ok | 10.2 | 20 degrees C
Ambient Temp | 0Eh | ok | 7.1 | 25 degrees C
The three sensors appear to be the redundant PSUs (10.1 & 10.2), and the main chassis sensor (7.1). Doing some checking around, it appears that all our Dell boxes list the "main" ambient temp in category(?) 7.1, but the actual sensor address is not always 0Eh. Category 10.<1|2> seems to always refer to the PSUs on the 610s.
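To pick out the chassis reading automatically, the rows can be split on the pipe character and filtered by the entity column. The sketch below is a minimal Python example that assumes the five-column layout shown above and that entity 7.1 is the chassis sensor, as observed on our boxes; the function names are made up for illustration.

```python
import re

# Sample rows copied from the ipmitool output above.
SAMPLE = """\
Ambient Temp | 07h | ok | 10.1 | 22 degrees C
Ambient Temp | 08h | ok | 10.2 | 20 degrees C
Ambient Temp | 0Eh | ok | 7.1 | 25 degrees C
"""

def parse_sdr_lines(text):
    """Parse five-column SDR rows into dictionaries."""
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # not a name | sensor | status | entity | value row
        name, sensor, status, entity, value = fields
        match = re.match(r"(-?\d+)\s+degrees C", value)
        if not match:
            continue  # no numeric reading on this row
        readings.append({
            "name": name,
            "sensor": int(sensor.rstrip("h"), 16),  # "0Eh" -> 14
            "status": status,
            "entity": entity,
            "celsius": int(match.group(1)),
        })
    return readings

def chassis_ambient(readings, entity="7.1"):
    """Return the Ambient Temp row for the given entity (7.1 is an assumption)."""
    for reading in readings:
        if reading["name"] == "Ambient Temp" and reading["entity"] == entity:
            return reading
    return None

readings = parse_sdr_lines(SAMPLE)
main = chassis_ambient(readings)
print(hex(main["sensor"]), main["celsius"])  # 0xe 25
```

Matching on the entity rather than the sensor address is the point here, since the address is not always 0Eh.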
Thursday, December 2, 2010
Deciphering Dell IPMI SNMP Traps
Useful links
PET Specification
Dell PET Events (MIB)
SNMP Trap OIDs
SNMP Traps from the BMC arrive with the following base OID
.1.3.6.1.4.1.3183.1.1
.1.3.6.1.4.1.3183.1.1.0.x defines the Event type, per the Dell MIB above.
.1.3.6.1.4.1.3183.1.1.1 defines the PET spec information analyzed below.
Based on the Event type OID, you can determine much of what you need to know to generate a Nagios trap. In our case,
.1.3.6.1.4.1.3183.1.1.0.356096 indicates an Intrusion event.
.1.3.6.1.4.1.3183.1.1.0.356224 indicates an Intrusion event has been cleared.
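A trap handler can key off the trailing number of the event OID. The Python sketch below maps the two OIDs above to labels and also splits the number into bit fields. The bit layout (sensor type in bits 23:16, event type in bits 15:8, offset in bits 6:0, bit 7 set on deassertion) is my reading of the PET specification and should be checked against it; the function names are hypothetical.

```python
PET_BASE = ".1.3.6.1.4.1.3183.1.1"
EVENT_PREFIX = PET_BASE + ".0."

# Only the two events named in this post; extend from the Dell MIB as needed.
KNOWN_EVENTS = {
    356096: "Intrusion",
    356224: "Intrusion cleared",
}

def specific_trap(oid):
    """Return the trailing event number of a PET event OID, or None."""
    if not oid.startswith(EVENT_PREFIX):
        return None
    tail = oid[len(EVENT_PREFIX):]
    return int(tail) if tail.isdigit() else None

def split_specific_trap(number):
    """Split the event number into bit fields (assumed layout, see above)."""
    return {
        "sensor_type": (number >> 16) & 0xFF,
        "event_type": (number >> 8) & 0xFF,
        "offset": number & 0x7F,
        "deassert": bool(number & 0x80),
    }

for oid in (EVENT_PREFIX + "356096", EVENT_PREFIX + "356224"):
    number = specific_trap(oid)
    print(KNOWN_EVENTS.get(number, "unknown"), split_specific_trap(number))
```

Under that assumed layout, the two numbers differ only in the deassertion bit (356224 - 356096 = 128 = 0x80), which fits one being the "cleared" form of the other.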
PET Analysis
Bytes | Offset | Field | Value
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 | 1:16 | GUID (t3) |
00 01 | 17:18 | Seq# | 0001
18 4A 74 D5 | 19:22 | Timestamp (seconds from 0:00 1/1/98) | 407532757
FF FF | 23:24 | UTC offset, minutes (0xFFFF unspecified) | unspecified
20 | 25 | Trap Source Type | IPMI
20 | 26 | Event Source Type | IPMI
10 | 27 | Event Severity | Critical
20 | 28 | Sensor Device | 32
73 | 29 | Sensor Number | 115
18 | 30 | Entity | 24 (System Chassis)
00 | 31 | Entity Instance (0x0 unspecified) | unspecified
80 01 FF 00 00 00 00 00 | 32:39 | Event Data |
19 | 40 | Language Code | 25
00 00 02 A2 | 41:44 | Manufacturer ID | Dell
01 00 | 45:46 | System ID | 256?
6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1 | 47:(110) | OEM Custom Fields |
Example PET fields
Pipes denote field bounds
| | | | | | | | | | | | | | | |
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 01 18 4A 74 D5 FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 05 18 4A 74 EE FF FF 20 20 04 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 09 18 4A 8E CA FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 0D 18 4A 8E F2 FF FF 20 20 04 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
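The field boundaries can also be applied in code. The following Python sketch unpacks the first two example traps with `struct`, using the offsets from the PET Analysis listing (46 fixed bytes, big-endian, then the OEM custom fields). Treating the trailing 0xC1 as a terminator and the bytes before it as ASCII text is an assumption based on these samples, and `decode_pet` is a name invented here.

```python
import struct

# 16-byte GUID, 2-byte sequence, 4-byte timestamp, 2-byte UTC offset,
# seven 1-byte fields, 8 bytes of event data, 1-byte language code,
# 4-byte manufacturer ID, 2-byte system ID.
PET_FORMAT = ">16sHIH7B8sBIH"
FIELD_NAMES = (
    "guid", "sequence", "timestamp", "utc_offset",
    "trap_source", "event_source", "severity",
    "sensor_device", "sensor_number", "entity", "entity_instance",
    "event_data", "language", "manufacturer", "system_id",
)

def decode_pet(hex_string):
    """Decode a space-separated PET hex dump into a dict of fields."""
    raw = bytes.fromhex(hex_string)
    fixed = struct.calcsize(PET_FORMAT)  # 46 bytes
    fields = dict(zip(FIELD_NAMES, struct.unpack(PET_FORMAT, raw[:fixed])))
    fields["oem"] = raw[fixed:]  # variable-length OEM custom fields
    return fields

# First two example traps from above.
TRAPS = [
    "44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 01 18 4A 74 D5 "
    "FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 "
    "01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1",
    "44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 05 18 4A 74 EE "
    "FF FF 20 20 04 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 "
    "01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1",
]

for trap in TRAPS:
    f = decode_pet(trap)
    # Assumption: 0xC1 ends the OEM fields and the rest is ASCII.
    text = f["oem"].rstrip(b"\xc1").decode("ascii")
    print(f["sequence"], hex(f["severity"]), hex(f["sensor_number"]), text)
# 1 0x10 0x73 lion-4-ipmi
# 5 0x4 0x73 lion-4-ipmi
```

In these samples the OEM custom field reads as a short ASCII string, which makes it handy for identifying the sender in a trap handler.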
Decoding the PET fields
GUID (16-bytes)
Dell doesn't appear to follow the specification for this field. The first 4 characters are DELL, followed by a string that incorporates part of the Service Tag (CKJKML1). More clarity here could be useful.
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31
D E L L K J K O M L 1
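The ASCII view of the GUID is easy to reproduce. This small Python sketch prints each byte as a character when it is printable and as a dot otherwise. The closing remark about the 0xC3 byte is only an observation from this one sample, not a documented Dell format.

```python
GUID = bytes.fromhex("44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31")

def printable(data):
    """Render bytes as ASCII, replacing non-printable bytes with '.'."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in data)

print(printable(GUID))  # DELLK..J.K..OML1

# Observation only: clearing the high bit of 0xC3 gives 0x43, which is "C",
# the first character of the Service Tag quoted above.
print(chr(0xC3 & 0x7F))  # C
```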
Sequence number (2-bytes)
An increasing counter, though it does not appear to advance by exactly 1 between traps (the four examples above step by 4: 0001, 0005, 0009, 000D).
Timestamp (4-bytes)
Odd metric: the number of seconds elapsed since 0:00 1/1/1998 (883612800 in Unix time). Here is some Perl that will make a regular timestamp.
$time = localtime(883612800 + 407532757);
print "$time";
Tue Nov 30 13:32:37 2010
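The same conversion can start from the raw trap bytes. The Python sketch below turns the four timestamp bytes into the second count and then into a date. It treats the counter as UTC, which is an assumption on my part; the Perl above prints the host's local time, so the hour differs.

```python
from datetime import datetime, timedelta, timezone

# 0:00 1/1/1998, the PET epoch (883612800 in Unix time).
PET_EPOCH = datetime(1998, 1, 1, tzinfo=timezone.utc)

def pet_timestamp(hex_bytes):
    """Return (seconds since the PET epoch, datetime) for a 4-byte field."""
    seconds = int.from_bytes(bytes.fromhex(hex_bytes), "big")
    return seconds, PET_EPOCH + timedelta(seconds=seconds)

seconds, when = pet_timestamp("18 4A 74 D5")
print(seconds)           # 407532757
print(when.isoformat())  # 2010-11-30T19:32:37+00:00
```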
Trap Source Type (1-byte)
Table 3 (p.9) in PET spec defines.
Event Source Type (1-byte)
Table 3 (p.9) in PET spec defines.
Event Severity (1-byte)
Table 3 in PET spec defines. 0x10 == Critical, 0x4 == Normal.
Sensor Device (1-byte)
Device ID,
root-> ipmitool sdr list mcloc
BMC | Dynamic MC @ 20h | ok
DRAC 5 | Dynamic MC @ 26h | ok
Sensor Number (1-byte)
The actual sensor ID as known by the BMC. PET spec table 5 (p.13) defines Sensor Types. In the above example, value 0x73 (Chassis Intrusion) falls within the OEM RESERVED range (0xC0-0xFF), even though there is a Physical Security value defined (0x5). Stupid.
root-> ipmitool -v sensor
Sensor ID : Temp (0x1)
Entity ID : 3.1
Sensor Type (Analog) : Temperature
Sensor Reading : Unable to read sensor: Device Not Present
Event Status : Event Messages Disabled
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled :
Sensor ID : Planar Temp (0x7)
Entity ID : 7.1
Sensor Type (Analog) : Temperature
Sensor Reading : 21 (+/- 1) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3.000
Lower Non-Critical : 8.000
Upper Non-Critical : 53.000
Upper Critical : 58.000
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc- lcr- unc+ ucr+
Sensor ID : Ambient Temp (0x8)
Entity ID : 7.1
Sensor Type (Analog) : Temperature
Sensor Reading : 16 (+/- 1) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3.000
Lower Non-Critical : 8.000
Upper Non-Critical : 42.000
Upper Critical : 47.000
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc- lcr- unc+ ucr+
Sensor ID : CMOS Battery (0x10)
Entity ID : 7.1
Sensor Type (Discrete): Battery
Sensor ID : VCORE (0x12)
Entity ID : 3.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]
Sensor ID : CPU VTT (0x16)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]
Sensor ID : 1.5V PG (0x17)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]
Sensor ID : 1.8V PG (0x18)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]
Sensor ID : 1.5V Riser PG (0x19)
Entity ID : 16.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]
Sensor ID : FAN MOD 1A RPM (0x30)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6675 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 1B RPM (0x31)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6375 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 2A RPM (0x32)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6900 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 2B RPM (0x33)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6300 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 3A RPM (0x34)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6900 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 3B RPM (0x35)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6225 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 4A RPM (0x36)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6825 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 4B RPM (0x37)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6150 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 5A RPM (0x38)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 5B RPM (0x39)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 6A RPM (0x3a)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : FAN MOD 6B RPM (0x3b)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-
Sensor ID : Presence (0x50)
Entity ID : 3.1
Sensor Type (Discrete): Entity Presence
States Asserted : Entity Presence
[Present]
Sensor ID : Presence (0x54)
Entity ID : 10.1
Sensor Type (Discrete): Entity Presence
Unable to read sensor: Device Not Present
Sensor ID : Presence (0x55)
Entity ID : 10.2
Sensor Type (Discrete): Entity Presence
Unable to read sensor: Device Not Present
Sensor ID : Presence (0x56)
Entity ID : 26.1
Sensor Type (Discrete): Entity Presence
States Asserted : Entity Presence
[Absent]
Sensor ID : PFault Fail Safe (0x5f)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
Unable to read sensor: Device Not Present
Sensor ID : Status (0x60)
Entity ID : 3.1
Sensor Type (Discrete): Processor
States Asserted : Processor
[Presence detected]
Sensor ID : Status (0x64)
Entity ID : 10.1
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present
Sensor ID : Status (0x65)
Entity ID : 10.2
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present
Sensor ID : Status (0x66)
Entity ID : 16.1
Sensor Type (Discrete): Cable / Interconnect
States Asserted : Cable/Interconnect
[Connected]
Sensor ID : RAC Status (0x70)
Entity ID : 7.1
Sensor Type (Discrete): Module / Board
Sensor ID : OS Watchdog (0x71)
Entity ID : 7.1
Sensor Type (Discrete): Watchdog
Sensor ID : SEL (0x72)
Entity ID : 7.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present
Sensor ID : Intrusion (0x73)
Entity ID : 7.1
Sensor Type (Discrete): Physical Security
Sensor ID : PS Redundancy (0x74)
Entity ID : 7.1
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present
Sensor ID : Fan Redundancy (0x75)
Entity ID : 7.1
Sensor Type (Discrete): Fan
States Asserted : Redundancy State
[Fully Redundant]
Sensor ID : CPU Temp Interf (0x76)
Entity ID : 7.1
Sensor Type (Discrete): Temperature
Unable to read sensor: Device Not Present
Sensor ID : Drive (0x80)
Entity ID : 26.1
Sensor Type (Discrete): Drive Slot / Bay
Unable to read sensor: Device Not Present
Sensor ID : Cable SAS (0x90)
Entity ID : 26.1
Sensor Type (Discrete): Cable / Interconnect
Unable to read sensor: Device Not Present
Sensor ID : Cable PDB Ctrl (0x9b)
Entity ID : 7.1
Sensor Type (Discrete): Cable / Interconnect
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
e8 84 ff ff 02 07 d9 9d ea c3 e4 05 21 a5 5d d5
e8 bb f7 a3 c0 82 d0 e8 84 ff ff 02 07 d9 9d ea
c3 e4 05 21 a5 5d d5 e8 bb
Sensor ID : ECC Corr Err (0x1)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
b4 e5 ff ff 02 07 e1 05 e7 29 ae 0a 84 fb aa eb
3b b9 03 d1 bc 49 df b4 e5 ff ff 02 07 e1 05 e7
29 ae 0a 84 fb aa eb 3b b9
Sensor ID : ECC Uncorr Err (0x2)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
ed db ff ff 02 07 a4 9f 92 d0 b0 e2 f7 e1 7d 32
6a d1 4b 5d 2e b5 13 ed db ff ff 02 07 a4 9f 92
d0 b0 e2 f7 e1 7d 32 6a d1
Sensor ID : I/O Channel Chk (0x3)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
4a bb ff ff 02 07 08 02 91 61 be 82 9f 5a c2 fe
06 c7 dd 43 e1 e8 03 4a bb ff ff 02 07 08 02 91
61 be 82 9f 5a c2 fe 06 c7
Sensor ID : PCI Parity Err (0x4)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
2d 73 ff ff 02 07 ef 35 08 e3 a8 d0 20 24 06 f9
c7 8e d2 6b 6e bc de 2d 73 ff ff 02 07 ef 35 08
e3 a8 d0 20 24 06 f9 c7 8e
Sensor ID : PCI System Err (0x5)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
a3 0d ff ff 02 07 4f f0 04 c0 1a 99 af 12 46 1d
74 e9 bf 16 12 0c 13 a3 0d ff ff 02 07 4f f0 04
c0 1a 99 af 12 46 1d 74 e9
Sensor ID : SBE Log Disabled (0x6)
Entity ID : 34.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
cc b9 ff ff 02 07 87 67 27 84 5a b6 f3 f1 82 4a
8b 89 74 67 69 be 11 cc b9 ff ff 02 07 87 67 27
84 5a b6 f3 f1 82 4a 8b 89
Sensor ID : Logging Disabled (0x7)
Entity ID : 34.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
3b c3 ff ff 02 07 3e 85 9a 4c 8f 63 b7 53 73 e4
02 5a 3b 5d 4e 47 73 3b c3 ff ff 02 07 3e 85 9a
4c 8f 63 b7 53 73 e4 02 5a
Sensor ID : Unknown (0x8)
Entity ID : 34.1
Sensor Type (Discrete): System Event
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
f7 35 ff ff 02 07 f6 37 7a ef e8 74 61 e9 71 f9
fc b0 e1 89 d3 f5 a9 f7 35 ff ff 02 07 f6 37 7a
ef e8 74 61 e9 71 f9 fc b0
Sensor ID : CPU Protocol Err (0xa)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
23 f7 ff ff 02 07 3c 95 d4 23 3a a8 33 05 91 7e
ce 24 73 7c 99 10 8c 23 f7 ff ff 02 07 3c 95 d4
23 3a a8 33 05 91 7e ce 24
Sensor ID : CPU Bus PERR (0xb)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
09 66 ff ff 02 07 f7 cc ba bf c7 38 50 8b 2f 39
b3 fc 0c 00 72 77 aa 09 66 ff ff 02 07 f7 cc ba
bf c7 38 50 8b 2f 39 b3 fc
Sensor ID : CPU Init Err (0xc)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
30 89 ff ff 02 07 61 33 4c 17 c3 f1 2d 92 e1 10
57 b6 71 73 93 6a d7 30 89 ff ff 02 07 61 33 4c
17 c3 f1 2d 92 e1 10 57 b6
Sensor ID : CPU Machine Chk (0xd)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
02 a7 ff ff 02 07 21 d2 89 70 8b 5d 5d 17 8b bb
ae 82 dd 44 ae 4c 51 02 a7 ff ff 02 07 21 d2 89
70 8b 5d 5d 17 8b bb ae 82
Sensor ID : Memory Spared (0x11)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
32 e8 ff ff 02 07 b1 0d 1f 30 f1 92 fe 56 0d c0
4e 65 ea 72 f3 b1 5c 32 e8 ff ff 02 07 b1 0d 1f
30 f1 92 fe 56 0d c0 4e 65
Sensor ID : Memory Mirrored (0x12)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
21 6f ff ff 02 07 a8 30 d8 ec 71 02 00 a4 d1 3f
d3 c9 90 7e 8f 06 60 21 6f ff ff 02 07 a8 30 d8
ec 71 02 00 a4 d1 3f d3 c9
Sensor ID : Memory RAID (0x13)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
c3 d1 ff ff 02 07 e3 44 16 17 2a 90 0e 17 81 bf
e8 08 39 40 ad 72 a0 c3 d1 ff ff 02 07 e3 44 16
17 2a 90 0e 17 81 bf e8 08
Sensor ID : Memory Added (0x14)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
e8 84 ff ff 02 07 3f f7 41 58 00 29 a3 b9 e6 62
96 15 f7 a3 c0 82 d0 e8 84 ff ff 02 07 3f f7 41
58 00 29 a3 b9 e6 62 96 15
Sensor ID : Memory Removed (0x15)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
b4 e5 ff ff 02 07 41 0c a2 b5 d5 ac fa 16 a4 72
84 d7 03 d1 bc 49 df b4 e5 ff ff 02 07 41 0c a2
b5 d5 ac fa 16 a4 72 84 d7
Sensor ID : Memory Cfg Err (0x16)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
ed db ff ff 02 07 07 e8 5a fa 86 ce 6b 76 73 7c
5b aa 4b 5d 2e b5 13 ed db ff ff 02 07 07 e8 5a
fa 86 ce 6b 76 73 7c 5b aa
Sensor ID : Mem Redun Gain (0x17)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
4a bb ff ff 02 07 48 27 41 3c 46 00 c1 02 1a 34
e8 9c dd 43 e1 e8 03 4a bb ff ff 02 07 48 27 41
3c 46 00 c1 02 1a 34 e8 9c
Sensor ID : PCIE Fatal Err (0x18)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
2d 73 ff ff 02 07 39 32 87 7d 45 a2 db 02 9c c5
37 c9 d2 6b 6e bc de 2d 73 ff ff 02 07 39 32 87
7d 45 a2 db 02 9c c5 37 c9
Sensor ID : Chipset Err (0x19)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
a3 0d ff ff 02 07 d7 c6 5d b1 7f 62 43 47 5a 77
de bc bf 16 12 0c 13 a3 0d ff ff 02 07 d7 c6 5d
b1 7f 62 43 47 5a 77 de bc
Sensor ID : Err Reg Pointer (0x1a)
Entity ID : 34.1
Sensor Type (Discrete): Unknown (0xC1)
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
cc b9 ff ff 02 07 64 6e 71 6c 91 87 23 4a 6b fd
f7 68 74 67 69 be 11 cc b9 ff ff 02 07 64 6e 71
6c 91 87 23 4a 6b fd f7 68
Sensor ID : Mem ECC Warning (0x1b)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
3b c3 ff ff 02 07 05 79 96 82 45 b8 12 6c c3 5e
cf f2 3b 5d 4e 47 73 3b c3 ff ff 02 07 05 79 96
82 45 b8 12 6c c3 5e cf f2
Sensor ID : Mem CRC Err (0x1c)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
f7 35 ff ff 02 07 bd 82 5a 26 89 50 fb 7c ab db
db c2 e1 89 d3 f5 a9 f7 35 ff ff 02 07 bd 82 5a
26 89 50 fb 7c ab db db c2
Sensor ID : USB Over-current (0x1d)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
23 f7 ff ff 02 07 6a e7 7e 30 28 75 30 2c 64 c3
d4 5a 73 7c 99 10 8c 23 f7 ff ff 02 07 6a e7 7e
30 28 75 30 2c 64 c3 d4 5a
Sensor ID : POST Err (0x1e)
Entity ID : 34.1
Sensor Type (Discrete): System Firmwares
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
09 66 ff ff 02 07 07 f7 35 2f af 3c a0 41 6f 32
aa b1 0c 00 72 77 aa 09 66 ff ff 02 07 07 f7 35
2f af 3c a0 41 6f 32 aa b1
Sensor ID : Hdwr version err (0x1f)
Entity ID : 34.1
Sensor Type (Discrete): Version Change
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
30 89 ff ff 02 07 95 50 49 39 93 8e 61 72 fa 30
77 07 71 73 93 6a d7 30 89 ff ff 02 07 95 50 49
39 93 8e 61 72 fa 30 77 07
Sensor ID : Mem Overtemp (0x20)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
02 a7 ff ff 02 07 31 99 d7 13 76 96 49 e1 f0 58
dc 00 dd 44 ae 4c 51 02 a7 ff ff 02 07 31 99 d7
13 76 96 49 e1 f0 58 dc 00
Sensor ID : Mem Fatal SB CRC (0x21)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
bridge command response (41 bytes)
32 e8 ff ff 02 07 da 7a 46 57 4c f4 82 17 35 7f
63 8f ea 72 f3 b1 5c 32 e8 ff ff 02 07 da 7a 46
57 4c f4 82 17 35 7f 63 8f
Sensor ID : Mem Fatal NB CRC (0x22)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
Entity (1-byte)
PET spec table 6 (p.17) defines values
Entity Instance (1-byte)
0x0 unspecified
Event Data (8-bytes)
Additional information about the event, as defined in PET spec table 5 (p.13), or in our case, by the OEM.
80 01 FF 00 00 00 00 00
Language Code (1-byte)
Manufacturer ID (4-bytes)
0x2A2 = 674 = Dell
source
System ID (2-bytes)
0x100 = 256 = ???
OEM Custom Fields (<=64-bytes)
Custom fields defined by the OEM.
Friday, July 30, 2010
Random server crashes R300 + IPMI = BAD
E1410 CPU 1 IERR
Some googling suggests that this error indicates a faulty CPU, and our Dell contact suggested that it was probably a memory or drive failure. However, further reading suggested that it can also be caused by non-hardware failures.
Going back to the original IPMI theory, I found that I was able to reproduce it quite easily by starting parallel iperf sessions between an R300 and another host to saturate the interface. I then started running constant ipmitool queries. I found that I was able to lock the R300 within 10 minutes, consistently.
I resolved the issue by moving the primary network interface for the OS to NIC #2, leaving NIC #1 for exclusive use by IPMI. In this configuration I was not able to crash the server in 30 minutes and it has run all night without issue.
When I discussed the issue and its resolution with one of the FreeBSD developers, he said that this is not just a Dell issue: sharing IPMI with the LAN on FreeBSD is really dodgy, depending on the particular NIC chipset in use (the Broadcom bge driver in this case). VLAN tagging may have been the straw that broke the camel's back here. The server that caused the most trouble in this episode had previously run for over a year with IPMI enabled but no VLAN tagging. To be fair, we were not previously doing any monitoring of this machine via IPMI, so the potential exposure was far less.
Wednesday, July 21, 2010
Is FreeBSD clobbering your IPMI LAN access?
- Edit /boot/loader.conf, appending the following line:
hw.bge.allow_asf="1"
- Save the file and reboot.
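If the change has to go out to several machines, the edit can be made idempotent. The Python sketch below works on the text of loader.conf and sets the tunable, appending it only if it is not already present. `ensure_tunable` is a hypothetical helper, and reading or writing the real file (and rebooting) is left to the caller.

```python
def ensure_tunable(conf_text, name, value):
    """Return conf_text with name="value" set, appending it if absent."""
    wanted = f'{name}="{value}"'
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave commented-out settings alone
        if stripped.split("=", 1)[0].strip() == name:
            lines[i] = wanted  # replace an existing setting in place
            break
    else:
        lines.append(wanted)  # no existing setting found
    return "\n".join(lines) + "\n"

before = 'autoboot_delay="3"\n'
after = ensure_tunable(before, "hw.bge.allow_asf", "1")
print(after, end="")
```

Running it a second time on its own output leaves the text unchanged, so it is safe to apply repeatedly.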
On a side note, it is worth noting that IPMI != DRAC; IPMI == BMC. DRAC refers to the enhanced management tools provided by an add-in DRAC card or integrated into some higher-end Dell servers. This includes a web interface for configuration/monitoring and remote console (in the higher-end implementations). DRAC provides IPMI instrumentation and control, but IPMI does not provide DRAC functionality.
Update/Big Fat Warning: using IPMI on the same interface as your LAN can cause BIG problems with the bge driver. See this post.
