Spontaneous domain reboot on SunFire 6800

Because the corresponding case at Sun was closed, I want to add this note for future reference, just in case. So… One day I came to my desk and found that one of the domains on an SF6800 had been rebooted for no apparent reason, or at least that was the very first impression. A quick look at /var/adm/messages and the prtdiag output revealed no hardware or software issues. The next step was to log in to the SC and dig a bit deeper into the problem. showboards, showfru, showchs, showplatform: everything looked fine, but the output of showlogs, and especially showlogs -d C, put me on my guard:

May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 757768 local6.crit] 
                           ErrorMonitor: Domain C has a SYSTEM ERROR
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 894554 local6.error] 
/partition1/domain0/IB8/ar0: 
>>> L2CheckError[0x6150] : 0x06068606
             CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
             PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
          AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
                      FE [15:15] : 0x1 
          AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch

May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 612655 local6.error] 
/partition1/RP2/sdc0: 
>>> SafariPortError8[0x280] : 0x00088008
                      FE [15:15] : 0x1 
           AccParL2ErrDT [19:19] : 0x1 
              ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID

May 15 07:38:52 SF6900-1-sc0 Domain-C.SC: [ID 286372 local6.error] [AD] Event: SF6800.ASIC.SDC.PAR_L2_ERR_DT.60143038
     CSN: 0344MM204E DomainID: C ADInfo: 1.SCAPP.20.3
     Time: Fri May 15 07:38:52 MSD 2009
     FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 046286; FRU-LOC: /N0/IB8
                        FRU-PN: 5016418; FRU-SN: 004613; FRU-LOC: RP2
     Recommended-Action: Service action required

Does it look like a bunch of cryptic messages that only those initiated into Sun’s engineering secrets could decipher? Well, as always, the truth is somewhere in between: in our case we can only make an assumption about which part of our big system is faulty or just went off the beam for a jiffy. So, let’s go forward…
First, we see two errors that took place simultaneously:

May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8

Both errors carry FE (First Error) [15:15] : 0x1, so each one was the first error recorded by its ASIC, which fits with the two alerts being logged at the same time. But keep in mind that the FE bit does not tie them together: it is only valid within a single ASIC and has no relation to errors reported by other ASICs in the system. Next:

/partition1/domain0/IB8/ar0: 
>>> L2CheckError[0x6150] : 0x06068606
             CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
             PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
          AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
                      FE [15:15] : 0x1 
          AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch

It just tells us that the AR (Address Repeater) on IO board 8 reported CMDVSyncErr and PreqSyncErr on ports 6 through 9. More details can be found here.
The value 0x3 is a hint that RP2/RP3 were involved. Acc stands for “accumulated”, so the Acc[CMDVSyncErr|PreqSyncErr] lines just tell us that these errors occurred more than once.
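
The decoded fields in the dump can be cross-checked against the raw register value by hand. Below is a minimal Python sketch of that check. The bit ranges are copied from the showlogs output above; the helper names (bit_field, decode) are my own and not part of any Sun tool.

```python
def bit_field(value, high, low):
    """Return bits high..low (inclusive) of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Field layout as printed by showlogs for L2CheckError[0x6150].
L2_CHECK_ERROR_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "PreqSyncErr": (4, 1),
}


def decode(value, fields):
    """Map each field name to the value of its bit range."""
    return {name: bit_field(value, hi, lo) for name, (hi, lo) in fields.items()}


decoded = decode(0x06068606, L2_CHECK_ERROR_FIELDS)
for name, field_value in decoded.items():
    print(f"{name:>15} : {field_value:#x}")
```

The same helper works for the SafariPortError8 value 0x00088008 with the ranges [19:19], [15:15] and [03:03] from the log.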

Continue with the second error.

/partition1/RP2/sdc0: 
>>> SafariPortError8[0x280] : 0x00088008
                      FE [15:15] : 0x1 
           AccParL2ErrDT [19:19] : 0x1 
              ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID

This is a clear indication of a parity error on port 8 of the SDC (Serengeti Data Controller) on RP2. Consulting the “Sun Fire™ 6800/4800/4810/3800 Systems Troubleshooting Manual” revealed that port 8 connects to IB8.

In the end we have a list of suspected FRUs:

  1. RP2
  2. IB8
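
The same two locations appear in the FRU list of the [AD] event in the log. If you keep showlogs output in a file, a rough Python sketch like the one below could pull them out. It assumes the FRU-PN / FRU-SN / FRU-LOC layout seen in this particular log; the regex and the function name suspected_frus are mine, not anything provided by the SC.

```python
import re

# The FRU lines from the [AD] event, copied from the showlogs output.
EVENT_TEXT = """\
     FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 046286; FRU-LOC: /N0/IB8
                        FRU-PN: 5016418; FRU-SN: 004613; FRU-LOC: RP2
"""

# One match per FRU entry: part number, serial number, location.
FRU_RE = re.compile(
    r"FRU-PN:\s*([^;\s]+);\s*FRU-SN:\s*([^;\s]+);\s*FRU-LOC:\s*(\S+)"
)


def suspected_frus(text):
    """Return a list of dicts describing each FRU entry found in text."""
    return [
        {"pn": pn, "sn": sn, "loc": loc}
        for pn, sn, loc in FRU_RE.findall(text)
    ]


frus = suspected_frus(EVENT_TEXT)
for fru in frus:
    print(fru["loc"], fru["pn"], fru["sn"])
```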

What’s next? With 99% probability you will be advised to monitor your box for a couple of weeks; only if the same error knocks your server down again will one of those parts be replaced and the investigation taken to a deeper level.

Posted on June 1, 2009 at 1:42 pm by sergeyt
In: Sun
