Thursday, December 2, 2010

Deciphering Dell IPMI SNMP Traps

Dell servers can be configured to send traps when a system event occurs. The following sections discuss how to decipher these SNMP traps. A PowerEdge R300 was used in the examples, and the event discussed is a Chassis Intrusion Alarm.

Useful links

PET Specification
Dell PET Events (MIB)

SNMP Trap OIDs

SNMP Traps from the BMC arrive with the following base OID
.1.3.6.1.4.1.3183.1.1

.1.3.6.1.4.1.3183.1.1.0.x defines the Event type, per the Dell MIB above.
.1.3.6.1.4.1.3183.1.1.1 defines the PET spec information analyzed below.

Based on the Event type OID, you can determine much of what you need to know to generate a Nagios alert. In our case,
.1.3.6.1.4.1.3183.1.1.0.356096 indicates an Intrusion event.
.1.3.6.1.4.1.3183.1.1.0.356224 indicates an Intrusion event has been cleared.
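
The number at the end of the event OID does not look arbitrary. As I read the PET spec, it is the trap's Specific Trap value, which packs the sensor type, event type and event offset into one integer. The Python sketch below unpacks it under that assumption; the function name and dictionary keys are my own, chosen for illustration.

```python
PET_EVENT_PREFIX = ".1.3.6.1.4.1.3183.1.1.0."


def decode_pet_event_oid(oid):
    """Split the trailing number of a PET event OID into its packed fields.

    Assumed layout (my reading of the PET spec's Specific Trap field):
    bits 23:16 sensor type, bits 15:8 event type,
    bit 7 set on deassertion, bits 3:0 event offset.
    """
    if not oid.startswith(PET_EVENT_PREFIX):
        raise ValueError("not a PET event OID: " + oid)
    specific = int(oid[len(PET_EVENT_PREFIX):])
    return {
        "sensor_type": (specific >> 16) & 0xFF,
        "event_type": (specific >> 8) & 0xFF,
        "deassertion": bool(specific & 0x80),
        "event_offset": specific & 0x0F,
    }


for oid in (".1.3.6.1.4.1.3183.1.1.0.356096",
            ".1.3.6.1.4.1.3183.1.1.0.356224"):
    print(oid, decode_pet_event_oid(oid))
```

Both OIDs decode to sensor type 0x05 (Physical Security) and event type 0x6F. They differ only in the deassertion bit, which fits the "intrusion" and "intrusion cleared" meanings given in the Dell MIB.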

PET Analysis

Bytes                                            Offset    Field                                    Value
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31  1:16      GUID (t3)
00 01                                            17:18     Seq#                                     0001
18 4A 74 D5                                      19:22     Timestamp (seconds from 0:00 1/1/98)     407532757
FF FF                                            23:24     UTC offset, minutes (0xFFFF unspecified) unspecified
20                                               25        Trap Source Type                         IPMI
20                                               26        Event Source Type                        IPMI
10                                               27        Event Severity                           Critical
20                                               28        Sensor Device                            32
73                                               29        Sensor Number                            115
18                                               30        Entity                                   24 (System Chassis)
00                                               31        Entity Instance (0x0 unspecified)        unspecified
80 01 FF 00 00 00 00 00                          32:39     Event Data
19                                               40        Language Code                            25
00 00 02 A2                                      41:44     Manufacturer ID                          Dell
01 00                                            45:46     System ID                                256?
6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1              47:(110)  OEM Custom Fields

Example PET fields

Pipes denote field bounds

                                               |     |           |     |  |  |  |  |  |  |  |                       |  |           |     |
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 01 18 4A 74 D5 FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 05 18 4A 74 EE FF FF 20 20 04 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 09 18 4A 8E CA FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 0D 18 4A 8E F2 FF FF 20 20 04 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1
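
The field bounds above map directly onto a fixed-size binary layout, so the whole varbind can be unpacked in one call. Here is a minimal Python sketch, assuming the field order and sizes listed under PET Analysis; the names PetTrap and parse_pet are mine, not from the spec.

```python
import struct
from collections import namedtuple

# Big-endian, no padding: 46 fixed bytes, followed by the OEM custom fields.
PET_FORMAT = ">16sHIH7B8sBIH"

PetTrap = namedtuple("PetTrap", [
    "guid", "sequence", "timestamp", "utc_offset",
    "trap_source", "event_source", "severity",
    "sensor_device", "sensor_number", "entity", "entity_instance",
    "event_data", "language", "manufacturer_id", "system_id", "oem",
])


def parse_pet(hex_dump):
    """Parse a space-separated hex dump of the PET varbind."""
    raw = bytes.fromhex(hex_dump)
    fixed_size = struct.calcsize(PET_FORMAT)
    if len(raw) < fixed_size:
        raise ValueError("PET data too short: %d bytes" % len(raw))
    fields = struct.unpack_from(PET_FORMAT, raw)
    # Everything after the fixed part is the OEM custom field area.
    return PetTrap(*fields, oem=raw[fixed_size:])


trap = parse_pet(
    "44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31 00 01 18 4A 74 D5 "
    "FF FF 20 20 10 20 73 18 00 80 01 FF 00 00 00 00 00 19 00 00 02 A2 "
    "01 00 6C 69 6F 6E 2D 34 2D 69 70 6D 69 C1"
)
print(hex(trap.severity), hex(trap.sensor_number), trap.manufacturer_id)
```

For the first example trap this gives severity 0x10, sensor number 0x73 and manufacturer ID 674, matching the hand-decoded values.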


Decoding the PET fields

GUID (16-bytes)

Dell doesn't appear to follow the specification for this field. The first 4 bytes are the ASCII characters DELL, followed by bytes that incorporate part of the Service Tag (CKJKML1). The exact layout is unclear to me. The printable characters are shown under their bytes:
44 45 4C 4C 4B 00 10 4A 80 4B C3 C0 4F 4D 4C 31
D  E  L  L  K        J     K        O  M  L  1

Sequence number (2-bytes)

An increasing counter, but it does not go up by exactly 1 per trap. In the examples above it advances by 4 each time (0x0001, 0x0005, 0x0009, 0x000D).

Timestamp (4-bytes)

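An odd metric: the number of seconds elapsed since 0:00 1/1/1998 (Unix time 883612800, taking that instant as UTC). Here is some Perl that turns it into a regular timestamp, printed in the local time zone.
my $time = localtime(883612800 + 407532757);  # 1998 epoch offset + PET timestamp
print "$time\n";
Tue Nov 30 13:32:37 2010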
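
The same conversion in Python, pinned to UTC so the result does not depend on the time zone of the machine running it. Like the Perl above, this assumes the BMC counts from midnight UTC on 1/1/1998; the constant and function names are only illustrative.

```python
from datetime import datetime, timezone

# Unix time of 1998-01-01 00:00:00 UTC, taken as the zero point of PET timestamps.
PET_EPOCH = 883612800


def pet_time_to_utc(pet_seconds):
    """Convert a PET timestamp (seconds since 1/1/1998) to a UTC datetime."""
    return datetime.fromtimestamp(PET_EPOCH + pet_seconds, tz=timezone.utc)


print(pet_time_to_utc(0x184A74D5).isoformat())
```

For the example trap this prints 2010-11-30T19:32:37+00:00, which is the 13:32:37 local time shown above for a machine six hours behind UTC.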

Trap Source Type (1-byte)

Defined in Table 3 (p.9) of the PET spec. 0x20 == IPMI.

Event Source Type (1-byte)

Defined in Table 3 (p.9) of the PET spec. 0x20 == IPMI.

Event Severity (1-byte)

Table 3 in PET spec defines. 0x10 == Critical, 0x4 == Normal.
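
Since the end goal here is a Nagios alert, the severity byte is the natural thing to map onto a Nagios state. The sketch below lists the severity codes as I read them from the PET spec table; which severity maps to which Nagios state is my own choice, not something either spec dictates.

```python
# Severity codes as I read them from the PET spec table.
PET_SEVERITY = {
    0x00: "unspecified",
    0x01: "monitor",
    0x02: "information",
    0x04: "ok",
    0x08: "non-critical",
    0x10: "critical",
    0x20: "non-recoverable",
}

# Nagios plugin return codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
# The assignment below is an assumption made for this example.
NAGIOS_STATE = {
    0x01: 0, 0x02: 0, 0x04: 0,
    0x08: 1,
    0x10: 2, 0x20: 2,
}


def severity_to_nagios(code):
    """Return (severity name, Nagios state) for a PET severity byte."""
    name = PET_SEVERITY.get(code, "unknown (0x%02x)" % code)
    # Anything not explicitly mapped, including "unspecified", becomes UNKNOWN.
    return name, NAGIOS_STATE.get(code, 3)


print(severity_to_nagios(0x10))
print(severity_to_nagios(0x04))
```

With the example traps, the intrusion event (0x10) becomes CRITICAL and the cleared event (0x4) becomes OK.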

Sensor Device (1-byte)

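The device ID of the management controller that holds the sensor. The value 0x20 matches the BMC in the ipmitool listing below.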

root-> ipmitool sdr list mcloc
BMC | Dynamic MC @ 20h | ok
DRAC 5 | Dynamic MC @ 26h | ok

Sensor Number (1-byte)

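The actual sensor number as known by the BMC. This is not the same thing as the Sensor Type defined in PET spec table 5 (p.13). In the above example, 0x73 is simply the number this BMC assigns to its Chassis Intrusion sensor, and ipmitool reports the type of that sensor as Physical Security (0x5 in table 5). Sensor numbers are platform specific, so they have to be looked up in the sensor list.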

root-> ipmitool -v sensor
Sensor ID : Temp (0x1)
Entity ID : 3.1
Sensor Type (Analog) : Temperature
Sensor Reading : Unable to read sensor: Device Not Present

Event Status : Event Messages Disabled
Assertion Events :
Event Enable : Event Messages Disabled
Assertions Enabled :

Sensor ID : Planar Temp (0x7)
Entity ID : 7.1
Sensor Type (Analog) : Temperature
Sensor Reading : 21 (+/- 1) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3.000
Lower Non-Critical : 8.000
Upper Non-Critical : 53.000
Upper Critical : 58.000
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc- lcr- unc+ ucr+

Sensor ID : Ambient Temp (0x8)
Entity ID : 7.1
Sensor Type (Analog) : Temperature
Sensor Reading : 16 (+/- 1) degrees C
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3.000
Lower Non-Critical : 8.000
Upper Non-Critical : 42.000
Upper Critical : 47.000
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lnc- lcr- unc+ ucr+
Deassertions Enabled : lnc- lcr- unc+ ucr+

Sensor ID : CMOS Battery (0x10)
Entity ID : 7.1
Sensor Type (Discrete): Battery

Sensor ID : VCORE (0x12)
Entity ID : 3.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]

Sensor ID : CPU VTT (0x16)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]

Sensor ID : 1.5V PG (0x17)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]

Sensor ID : 1.8V PG (0x18)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]

Sensor ID : 1.5V Riser PG (0x19)
Entity ID : 16.1
Sensor Type (Discrete): Voltage
States Asserted : Digital State
[State Deasserted]

Sensor ID : FAN MOD 1A RPM (0x30)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6675 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 1B RPM (0x31)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6375 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 2A RPM (0x32)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6900 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 2B RPM (0x33)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6300 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 3A RPM (0x34)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6900 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 3B RPM (0x35)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6225 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 4A RPM (0x36)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6825 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 3525.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 4B RPM (0x37)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : 6150 (+/- 75) RPM
Status : ok
Lower Non-Recoverable : na
Lower Critical : 2325.000
Lower Non-Critical : na
Upper Non-Critical : na
Upper Critical : na
Upper Non-Recoverable : na
Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 5A RPM (0x38)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present

Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 5B RPM (0x39)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present

Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 6A RPM (0x3a)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present

Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : FAN MOD 6B RPM (0x3b)
Entity ID : 7.1
Sensor Type (Analog) : Fan
Sensor Reading : Unable to read sensor: Device Not Present

Assertion Events :
Assertions Enabled : lcr-
Deassertions Enabled : lcr-

Sensor ID : Presence (0x50)
Entity ID : 3.1
Sensor Type (Discrete): Entity Presence
States Asserted : Entity Presence
[Present]

Sensor ID : Presence (0x54)
Entity ID : 10.1
Sensor Type (Discrete): Entity Presence
Unable to read sensor: Device Not Present

Sensor ID : Presence (0x55)
Entity ID : 10.2
Sensor Type (Discrete): Entity Presence
Unable to read sensor: Device Not Present

Sensor ID : Presence (0x56)
Entity ID : 26.1
Sensor Type (Discrete): Entity Presence
States Asserted : Entity Presence
[Absent]

Sensor ID : PFault Fail Safe (0x5f)
Entity ID : 7.1
Sensor Type (Discrete): Voltage
Unable to read sensor: Device Not Present

Sensor ID : Status (0x60)
Entity ID : 3.1
Sensor Type (Discrete): Processor
States Asserted : Processor
[Presence detected]

Sensor ID : Status (0x64)
Entity ID : 10.1
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present

Sensor ID : Status (0x65)
Entity ID : 10.2
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present

Sensor ID : Status (0x66)
Entity ID : 16.1
Sensor Type (Discrete): Cable / Interconnect
States Asserted : Cable/Interconnect
[Connected]

Sensor ID : RAC Status (0x70)
Entity ID : 7.1
Sensor Type (Discrete): Module / Board

Sensor ID : OS Watchdog (0x71)
Entity ID : 7.1
Sensor Type (Discrete): Watchdog

Sensor ID : SEL (0x72)
Entity ID : 7.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present

Sensor ID : Intrusion (0x73)
Entity ID : 7.1
Sensor Type (Discrete): Physical Security

Sensor ID : PS Redundancy (0x74)
Entity ID : 7.1
Sensor Type (Discrete): Power Supply
Unable to read sensor: Device Not Present

Sensor ID : Fan Redundancy (0x75)
Entity ID : 7.1
Sensor Type (Discrete): Fan
States Asserted : Redundancy State
[Fully Redundant]

Sensor ID : CPU Temp Interf (0x76)
Entity ID : 7.1
Sensor Type (Discrete): Temperature
Unable to read sensor: Device Not Present

Sensor ID : Drive (0x80)
Entity ID : 26.1
Sensor Type (Discrete): Drive Slot / Bay
Unable to read sensor: Device Not Present

Sensor ID : Cable SAS (0x90)
Entity ID : 26.1
Sensor Type (Discrete): Cable / Interconnect
Unable to read sensor: Device Not Present

Sensor ID : Cable PDB Ctrl (0x9b)
Entity ID : 7.1
Sensor Type (Discrete): Cable / Interconnect
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
e8 84 ff ff 02 07 d9 9d ea c3 e4 05 21 a5 5d d5
e8 bb f7 a3 c0 82 d0 e8 84 ff ff 02 07 d9 9d ea
c3 e4 05 21 a5 5d d5 e8 bb
Sensor ID : ECC Corr Err (0x1)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
b4 e5 ff ff 02 07 e1 05 e7 29 ae 0a 84 fb aa eb
3b b9 03 d1 bc 49 df b4 e5 ff ff 02 07 e1 05 e7
29 ae 0a 84 fb aa eb 3b b9
Sensor ID : ECC Uncorr Err (0x2)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
ed db ff ff 02 07 a4 9f 92 d0 b0 e2 f7 e1 7d 32
6a d1 4b 5d 2e b5 13 ed db ff ff 02 07 a4 9f 92
d0 b0 e2 f7 e1 7d 32 6a d1
Sensor ID : I/O Channel Chk (0x3)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
4a bb ff ff 02 07 08 02 91 61 be 82 9f 5a c2 fe
06 c7 dd 43 e1 e8 03 4a bb ff ff 02 07 08 02 91
61 be 82 9f 5a c2 fe 06 c7
Sensor ID : PCI Parity Err (0x4)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
2d 73 ff ff 02 07 ef 35 08 e3 a8 d0 20 24 06 f9
c7 8e d2 6b 6e bc de 2d 73 ff ff 02 07 ef 35 08
e3 a8 d0 20 24 06 f9 c7 8e
Sensor ID : PCI System Err (0x5)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
a3 0d ff ff 02 07 4f f0 04 c0 1a 99 af 12 46 1d
74 e9 bf 16 12 0c 13 a3 0d ff ff 02 07 4f f0 04
c0 1a 99 af 12 46 1d 74 e9
Sensor ID : SBE Log Disabled (0x6)
Entity ID : 34.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
cc b9 ff ff 02 07 87 67 27 84 5a b6 f3 f1 82 4a
8b 89 74 67 69 be 11 cc b9 ff ff 02 07 87 67 27
84 5a b6 f3 f1 82 4a 8b 89
Sensor ID : Logging Disabled (0x7)
Entity ID : 34.1
Sensor Type (Discrete): Event Logging Disabled
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
3b c3 ff ff 02 07 3e 85 9a 4c 8f 63 b7 53 73 e4
02 5a 3b 5d 4e 47 73 3b c3 ff ff 02 07 3e 85 9a
4c 8f 63 b7 53 73 e4 02 5a
Sensor ID : Unknown (0x8)
Entity ID : 34.1
Sensor Type (Discrete): System Event
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
f7 35 ff ff 02 07 f6 37 7a ef e8 74 61 e9 71 f9
fc b0 e1 89 d3 f5 a9 f7 35 ff ff 02 07 f6 37 7a
ef e8 74 61 e9 71 f9 fc b0
Sensor ID : CPU Protocol Err (0xa)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
23 f7 ff ff 02 07 3c 95 d4 23 3a a8 33 05 91 7e
ce 24 73 7c 99 10 8c 23 f7 ff ff 02 07 3c 95 d4
23 3a a8 33 05 91 7e ce 24
Sensor ID : CPU Bus PERR (0xb)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
09 66 ff ff 02 07 f7 cc ba bf c7 38 50 8b 2f 39
b3 fc 0c 00 72 77 aa 09 66 ff ff 02 07 f7 cc ba
bf c7 38 50 8b 2f 39 b3 fc
Sensor ID : CPU Init Err (0xc)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
30 89 ff ff 02 07 61 33 4c 17 c3 f1 2d 92 e1 10
57 b6 71 73 93 6a d7 30 89 ff ff 02 07 61 33 4c
17 c3 f1 2d 92 e1 10 57 b6
Sensor ID : CPU Machine Chk (0xd)
Entity ID : 34.1
Sensor Type (Discrete): Processor
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
02 a7 ff ff 02 07 21 d2 89 70 8b 5d 5d 17 8b bb
ae 82 dd 44 ae 4c 51 02 a7 ff ff 02 07 21 d2 89
70 8b 5d 5d 17 8b bb ae 82
Sensor ID : Memory Spared (0x11)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
32 e8 ff ff 02 07 b1 0d 1f 30 f1 92 fe 56 0d c0
4e 65 ea 72 f3 b1 5c 32 e8 ff ff 02 07 b1 0d 1f
30 f1 92 fe 56 0d c0 4e 65
Sensor ID : Memory Mirrored (0x12)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
21 6f ff ff 02 07 a8 30 d8 ec 71 02 00 a4 d1 3f
d3 c9 90 7e 8f 06 60 21 6f ff ff 02 07 a8 30 d8
ec 71 02 00 a4 d1 3f d3 c9
Sensor ID : Memory RAID (0x13)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
c3 d1 ff ff 02 07 e3 44 16 17 2a 90 0e 17 81 bf
e8 08 39 40 ad 72 a0 c3 d1 ff ff 02 07 e3 44 16
17 2a 90 0e 17 81 bf e8 08
Sensor ID : Memory Added (0x14)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
e8 84 ff ff 02 07 3f f7 41 58 00 29 a3 b9 e6 62
96 15 f7 a3 c0 82 d0 e8 84 ff ff 02 07 3f f7 41
58 00 29 a3 b9 e6 62 96 15
Sensor ID : Memory Removed (0x15)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
b4 e5 ff ff 02 07 41 0c a2 b5 d5 ac fa 16 a4 72
84 d7 03 d1 bc 49 df b4 e5 ff ff 02 07 41 0c a2
b5 d5 ac fa 16 a4 72 84 d7
Sensor ID : Memory Cfg Err (0x16)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
ed db ff ff 02 07 07 e8 5a fa 86 ce 6b 76 73 7c
5b aa 4b 5d 2e b5 13 ed db ff ff 02 07 07 e8 5a
fa 86 ce 6b 76 73 7c 5b aa
Sensor ID : Mem Redun Gain (0x17)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
4a bb ff ff 02 07 48 27 41 3c 46 00 c1 02 1a 34
e8 9c dd 43 e1 e8 03 4a bb ff ff 02 07 48 27 41
3c 46 00 c1 02 1a 34 e8 9c
Sensor ID : PCIE Fatal Err (0x18)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
2d 73 ff ff 02 07 39 32 87 7d 45 a2 db 02 9c c5
37 c9 d2 6b 6e bc de 2d 73 ff ff 02 07 39 32 87
7d 45 a2 db 02 9c c5 37 c9
Sensor ID : Chipset Err (0x19)
Entity ID : 34.1
Sensor Type (Discrete): Critical Interrupt
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
a3 0d ff ff 02 07 d7 c6 5d b1 7f 62 43 47 5a 77
de bc bf 16 12 0c 13 a3 0d ff ff 02 07 d7 c6 5d
b1 7f 62 43 47 5a 77 de bc
Sensor ID : Err Reg Pointer (0x1a)
Entity ID : 34.1
Sensor Type (Discrete): Unknown (0xC1)
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
cc b9 ff ff 02 07 64 6e 71 6c 91 87 23 4a 6b fd
f7 68 74 67 69 be 11 cc b9 ff ff 02 07 64 6e 71
6c 91 87 23 4a 6b fd f7 68
Sensor ID : Mem ECC Warning (0x1b)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
3b c3 ff ff 02 07 05 79 96 82 45 b8 12 6c c3 5e
cf f2 3b 5d 4e 47 73 3b c3 ff ff 02 07 05 79 96
82 45 b8 12 6c c3 5e cf f2
Sensor ID : Mem CRC Err (0x1c)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
f7 35 ff ff 02 07 bd 82 5a 26 89 50 fb 7c ab db
db c2 e1 89 d3 f5 a9 f7 35 ff ff 02 07 bd 82 5a
26 89 50 fb 7c ab db db c2
Sensor ID : USB Over-current (0x1d)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
23 f7 ff ff 02 07 6a e7 7e 30 28 75 30 2c 64 c3
d4 5a 73 7c 99 10 8c 23 f7 ff ff 02 07 6a e7 7e
30 28 75 30 2c 64 c3 d4 5a
Sensor ID : POST Err (0x1e)
Entity ID : 34.1
Sensor Type (Discrete): System Firmwares
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
09 66 ff ff 02 07 07 f7 35 2f af 3c a0 41 6f 32
aa b1 0c 00 72 77 aa 09 66 ff ff 02 07 07 f7 35
2f af 3c a0 41 6f 32 aa b1
Sensor ID : Hdwr version err (0x1f)
Entity ID : 34.1
Sensor Type (Discrete): Version Change
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
30 89 ff ff 02 07 95 50 49 39 93 8e 61 72 fa 30
77 07 71 73 93 6a d7 30 89 ff ff 02 07 95 50 49
39 93 8e 61 72 fa 30 77 07
Sensor ID : Mem Overtemp (0x20)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
02 a7 ff ff 02 07 31 99 d7 13 76 96 49 e1 f0 58
dc 00 dd 44 ae 4c 51 02 a7 ff ff 02 07 31 99 d7
13 76 96 49 e1 f0 58 dc 00
Sensor ID : Mem Fatal SB CRC (0x21)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present

bridge command response (41 bytes)
32 e8 ff ff 02 07 da 7a 46 57 4c f4 82 17 35 7f
63 8f ea 72 f3 b1 5c 32 e8 ff ff 02 07 da 7a 46
57 4c f4 82 17 35 7f 63 8f
Sensor ID : Mem Fatal NB CRC (0x22)
Entity ID : 34.1
Sensor Type (Discrete): Memory
Unable to read sensor: Device Not Present
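
To turn the Sensor Number from a trap back into a name, the listing above can be parsed once and kept as a lookup table. A minimal Python sketch, assuming the "Sensor ID : name (0xNN)" line format shown above; the function name is mine. Note that sensor numbers are not unique in this listing (0x1 is both Temp and ECC Corr Err), so each number maps to a list of names.

```python
import re
from collections import defaultdict

SENSOR_ID_RE = re.compile(r"^Sensor ID\s*:\s*(.+?)\s*\((0x[0-9a-fA-F]+)\)\s*$")


def sensor_names_by_number(ipmitool_output):
    """Map sensor number -> list of sensor names found in the output."""
    names = defaultdict(list)
    for line in ipmitool_output.splitlines():
        match = SENSOR_ID_RE.match(line.strip())
        if match:
            names[int(match.group(2), 16)].append(match.group(1))
    return dict(names)


sample = """\
Sensor ID : Temp (0x1)
Entity ID : 3.1
Sensor ID : Intrusion (0x73)
Entity ID : 7.1
Sensor ID : ECC Corr Err (0x1)
Entity ID : 34.1
"""
print(sensor_names_by_number(sample))
```

Looking up 0x73 in the resulting table gives Intrusion, which is the sensor behind the example traps.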

Entity (1-byte)

PET spec table 6 (p.17) defines the values.

Entity Instance (1-byte)

0x0 unspecified

Event Data (8-bytes)

Additional information about the event, as defined in PET spec table 5 (p.13) or, in our case, by the OEM.

80 01 FF 00 00 00 00 00

Language Code (1-byte)

Manufacturer ID (4-bytes)

0x2A2 = 674 = Dell (IANA enterprise number)

System ID (2-bytes)

0x100 = 256 = ???

OEM Custom Fields (<=64-bytes)

Custom fields defined by the OEM.

Tuesday, November 9, 2010

PAM LDAP error: unexpected return value 4?

Are you seeing this error in your logs, along with an inability to log in?

Nov 9 15:05:08 leoger sshd[41524]: in _openpam_check_error_code(): pam_sm_acct_mgmt(): unexpected return value 4
Nov 9 15:05:08 leoger kernel: Nov 9 15:05:08 leoger sshd[41524]: in _openpam_check_error_code(): pam_sm_acct_mgmt(): unexpected return value 4

I did, and I figured out that the problem was caused by having changed my system hostname in the config files and DNS without actually changing the hostname on the running server. Note that the hostname must also be resolvable, either through DNS or /etc/hosts.
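
A quick way to check the second condition is to ask the resolver about the name the system currently reports. This is a small Python sketch of that check, not a diagnosis of the PAM error itself; the function name and the injectable resolver argument are my own additions to keep it easy to test.

```python
import socket


def hostname_resolves(name=None, resolver=socket.gethostbyname):
    """Return (hostname, address); address is None if the name does not resolve."""
    if name is None:
        name = socket.gethostname()  # the hostname the running system reports
    try:
        return name, resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return name, None


host, address = hostname_resolves()
print(host, address if address else "does NOT resolve")
```

If the printed hostname is not the one in your config files, or it does not resolve, that matches the situation described above.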

Friday, October 15, 2010

Hot-swap SATA disks in FreeBSD

If you ever need to hot-swap a disk on a FreeBSD box, atacontrol(8) is your friend. Swap the disk, then use atacontrol list to retrieve the list of ATA channels on the system.

root@neutron:~-> atacontrol list
ATA channel 0:
Master: acd0 ATA/ATAPI revision 5
Slave: no device present
ATA channel 2:
Master: ad4 SATA revision 2.x
Slave: no device present
ATA channel 3:
Master: ad6 SATA revision 2.x
Slave: no device present
ATA channel 4:
Master: ad8 SATA revision 1.x
Slave: no device present
ATA channel 5:
Master: no device present
Slave: no device present


Find the appropriate channel, in this case ata5. Then simply perform a detach/attach operation on the channel and the disk should be found.

root@neutron:~-> atacontrol detach ata5
root@neutron:~-> atacontrol attach ata5
Master: ad10 SATA revision 2.x
Slave: no device present
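
Picking the channel by eye is easy on one box; across several machines it can be scripted. Below is a minimal Python sketch that lists the channels with no device on either slot, assuming the atacontrol list output format shown above (FreeBSD 8.1). The function name is my own.

```python
import re


def empty_ata_channels(atacontrol_output):
    """Return channel numbers where neither Master nor Slave has a device."""
    channels = {}
    current = None
    for line in atacontrol_output.splitlines():
        header = re.match(r"\s*ATA channel (\d+):", line)
        if header:
            current = int(header.group(1))
            channels[current] = []
            continue
        slot = re.match(r"\s*(Master|Slave):\s*(.*)", line)
        if slot and current is not None:
            channels[current].append(slot.group(2).strip())
    return [number for number, devices in sorted(channels.items())
            if devices and all(d == "no device present" for d in devices)]


sample = """\
ATA channel 0:
Master: acd0 ATA/ATAPI revision 5
Slave: no device present
ATA channel 4:
Master: ad8 SATA revision 1.x
Slave: no device present
ATA channel 5:
Master: no device present
Slave: no device present
"""
print(empty_ata_channels(sample))
```

For the listing in this post the only empty channel is 5, which is the ata5 used in the detach/attach step.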


This example was done on FreeBSD 8.1-RELEASE.

Friday, July 30, 2010

Random server crashes R300 + IPMI = BAD

Ever since enabling IPMI on our Dell servers last week, we have been experiencing problems with random hangs on our R300s. I suspected IPMI immediately, particularly IPMI over a VLAN. When I finally went to the data center to reboot a server myself, I noticed the following error on the front LCD display.

E1410 CPU 1 IERR


Some googling turned up claims that this error means a faulty CPU, and our Dell contact thought it was probably a memory or drive failure. However, further reading suggested that it can also have non-hardware causes.

Going back to the original IPMI theory, I found that I was able to reproduce it quite easily by starting parallel iperf sessions between an R300 and another host to saturate the interface. I then started running constant ipmitool queries. I found that I was able to lock the R300 within 10 minutes, consistently.

I resolved the issue by moving the primary network interface for the OS to NIC #2, leaving NIC #1 for exclusive use by IPMI. In this configuration I was not able to crash the server in 30 minutes and it has run all night without issue.

When I discussed the issue and resolution with one of the FreeBSD developers, he stated that this is not just a Dell issue: sharing IPMI with the LAN on FreeBSD is really dodgy, depending on the particular NIC chipset in use (the Broadcom bge driver in this case). The VLAN tagging may have been the straw that broke the camel's back here. The server that caused the most trouble in this episode had previously run for over a year with IPMI enabled, but without VLAN tagging. To be fair, we were not previously doing any monitoring of this machine via IPMI, so the potential exposure was far less.

Wednesday, July 21, 2010

IPMI on FreeBSD

Here are some notes regarding how to use IPMI on FreeBSD. This information is relevant to the Dell boxes we have at work, no guarantees otherwise.

To load the IPMI module into a running system, use

kldload ipmi

or add the following to loader.conf and reboot (if you want the changes to be persistent)
vi /boot/loader.conf
ipmi_load="YES"

shutdown -r now


The kernel log should show output similar to this

ipmi0: on isa0
ipmi0: KCS mode found at io 0xca8 alignment 0x4 on isa
ipmi0: KCS error: ff
ipmi0: IPMI device rev. 0, firmware rev. 2.2, version 2.0
ipmi0: Number of channels 4
ipmi0: Attached watchdog

Install the ipmitool package/port. You should now be able to talk to IPMI on the local machine (or remote machines, for that matter). Here are a couple of commands that I've found useful.

ipmitool lan print
Prints the current Ethernet configuration for the BMC.

ipmitool lan set
Prints the usage information for configuring the BMC LAN settings. A channel is required for setting these parameters. In my [limited] experience, the channel is always "1".

ipmitool sensor
ipmitool sdr list
Prints information about the sensors that can be monitored via ipmitool. Adding the -v parameter to ipmitool sensor prints the information organized in a list format.



Additional Information (sources):
FreeBSDwiki IPMI page
Linux IPMI notes (some FreeBSD info here)
Dell Linux IPMI page

Is FreeBSD clobbering your IPMI LAN access?

We've been enabling IPMI/BMC on our servers for environment monitoring, remote control, etc. Our newer Dell R300 servers share NIC #1 with IPMI and the Operating System. I noticed that IPMI works before FreeBSD starts the Ethernet drivers, then it stops responding. It turns out that this behavior can be stopped by adding a line to loader.conf. Here are the steps to do this (found on this page):
  1. Edit /boot/loader.conf, appending the following line:
    hw.bge.allow_asf="1"
  2. Save the file and reboot.
This also works if you have configured the BMC to use VLAN tagging.
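
If you manage more than a couple of servers, editing loader.conf by hand gets old. Here is a rough Python sketch of a helper that sets a loader.conf tunable exactly once, whether or not the key is already present. It only transforms text; reading and writing /boot/loader.conf is left out on purpose, and the function name is hypothetical.

```python
def ensure_loader_setting(conf_text, key, value):
    """Return loader.conf text with key="value" present exactly once."""
    wanted = '%s="%s"' % (key, value)
    # Drop any existing assignment of the same key; commented lines are kept.
    kept = [line for line in conf_text.splitlines()
            if line.split("=", 1)[0].strip() != key]
    kept.append(wanted)
    return "\n".join(kept) + "\n"


original = 'ipmi_load="YES"\n'
print(ensure_loader_setting(original, "hw.bge.allow_asf", "1"), end="")
```

Running it again on its own output leaves the text unchanged, so it is safe to apply repeatedly. A reboot is still needed for the setting to take effect.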

On a side note, it is worth noting that IPMI != DRAC; IPMI == BMC. DRAC refers to the enhanced management tools provided by an add-in DRAC card or integrated into some higher-end Dell servers. This includes a web interface for configuration/monitoring and remote console (in the higher-end implementations). DRAC provides IPMI instrumentation and control, but IPMI does not provide DRAC functionality.

Update/Big Fat Warning: using IPMI on the same interface as your LAN can cause BIG problems with the bge driver. See this post.

Thursday, July 15, 2010

Getting net-snmp to use Liebert MIBs...any MIBs for that matter.

After far too much screwing around trying to get this to work, I finally figured out how to get snmpwalk to display the text names of OIDs for our Liebert MPH rack PDUs. I downloaded the Liebert Global Products MIB from the Liebert website and placed the files in ~/.snmp/mibs/. The Readme file in the downloaded archive says that the MIBs need to be loaded in a specific order, which did nothing for me other than waste a lot of time. To use the extracted MIBs while walking the Emerson (Liebert) tree, use the -mall argument with snmpwalk.

snmpwalk -v 2c -c public -mall -OS 10.20.30.40 1.3.6.1.4.1.476

Tuesday, January 26, 2010

Using VLANs with Virtualbox

After a number of hours wasted trying to figure out why my VLANs stopped working, it's probably a good idea to make note of what I found.

I have a lot of different VirtualBox VMs that I use for testing different things. A while back I created a VM image to act as a PPPoE server (Debian squeeze guest running on a lenny host). For reasons that made sense to me at the time, I created the VLANs on the guest operating system rather than creating them on the host system and attaching virtual network adaptors to the guest VM. I hadn't used this VM in several months until today, when I fired it up for some PPPoE testing. Unsurprisingly, the VM didn't have any network connectivity on boot. I figured it would probably take some finagling to get things working exactly the way they were before. However, the topology was correct and the switch was set up properly, and there was still no PPPoE.

After several hours of screwing around with different VirtualBox versions, different combinations of host/guest VLAN configurations, etc., I identified the following behaviors:

  1. On a clean reboot of the host, the VLANs on the virtual machine work fine.
  2. As soon as I add a VLAN to the ethernet interface on the host, connectivity to the guest fails.
  3. If I remove the VLAN from the host, guest connectivity is restored.
The wheels started spinning and I remembered that at some point in the past I had a bridge interface configured on my host machine. I had recently removed the bridge in order to simplify my configuration. Some quick testing confirmed that when my host machine is configured with a virtual bridge interface, VLANs on the host can happily coexist with VLANs on the guest. The network configuration needed to set this up is as follows.
  1. Add a bridge interface br0 to the host.
  2. Add ethernet adaptor eth0 to bridge br0.
  3. Add the host VLAN interfaces to bridge br0.
  4. Configure appropriate IP addressing for the host to br0.
  5. Attach the VirtualBox VM network interface to eth0.
In this configuration, both the host and the guest are able to create tagged VLAN interfaces without conflicting.

Sunday, January 24, 2010

Did your Kerberos authenticated NFS mounts all just break?

Here's why. A recent update to the krb5 packages disabled weak ciphers, DES in particular. I'm all for stronger security, but when it breaks my system, I get a bit crabby. I was the dummy here: I saw the message during the update process about possible NFS breakage due to the weak ciphers being disabled, and I ignored it because I was in a hurry and figured it would be a two-minute fix. An hour and a half later I have it fixed. Here is a link to the bug report that helped me. The fix is to re-enable the weak ciphers in the libdefaults section of your /etc/krb5.conf file.

allow_weak_crypto = true