Locating Hardware Problems with the IBM Hardware Information Center (original)

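This article walks through one hardware troubleshooting session on an AIX server to introduce a general approach to fault diagnosis. Feedback and criticism are welcome.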

# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
BFE4C025   0416192308 P H sysplanar0     UNDETERMINED ERROR
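
The summary line above packs several fields into one row. As a minimal sketch, the following Python splits such a line into named fields. It assumes the TIMESTAMP column uses the MMDDhhmmYY layout, which matches the Date/Time shown in the detailed report for this entry. The function name parse_errpt_summary is hypothetical and only used for illustration.

```python
from datetime import datetime


def parse_errpt_summary(line):
    """Split one row of `errpt` summary output into named fields."""
    # The description may contain spaces, so split at most five times.
    identifier, timestamp, err_type, err_class, resource, description = line.split(None, 5)
    return {
        "identifier": identifier,
        # Assumption: TIMESTAMP is MMDDhhmmYY (month, day, hour, minute, 2-digit year).
        "time": datetime.strptime(timestamp, "%m%d%H%M%y"),
        "type": err_type,        # P = permanent in this example
        "class": err_class,      # H = hardware in this example
        "resource": resource,
        "description": description.strip(),
    }


entry = parse_errpt_summary("BFE4C025   0416192308 P H sysplanar0     UNDETERMINED ERROR")
print(entry["time"])  # 2008-04-16 19:23:00
```

The identifier from the first column is what gets passed to errpt -aj to pull the detailed report.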

# errpt -aj BFE4C025
---------------------------------------------------------------------------
LABEL:          SCAN_ERROR_CHRP
IDENTIFIER:     BFE4C025

Date/Time:       Wed Apr 16 19:23:10 2008
Sequence Number: 120
Machine Id:      000599F6D700
Node Id:         PEKAX019
Class:           H
Type:            PERM
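Resource Name:   sysplanar0
Resource Class:  planar
Resource Type:   sysplanar_rspc
Location:
# Note: this is a system planar error. From experience, first review the
# related logs with diag -d sysplanar0 -v -e, then run lsmcode -A to check
# whether the microcode is outdated. If the microcode is fine, the cause is
# most likely a hardware fault.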

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

        Recommended Actions
        RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0644 00E0 0000 01B4 8E00 8E00 0000 0000 0000 0000 4942 4D00 5048 0030 0100 EA10

... (some lines omitted)
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Diagnostic Analysis
Diagnostic Log sequence number: 104
Resource tested:        sysplanar0
Resource Description:   System Planar
Location:             
SRC:                    B17CE433  
Description:            Surveillance Error Predictive Error, general. Refer to
                        the system service documentation for more information.
Additional Words:       2-030000F0 3-53B71510 4-C13920FF 5-400000FF
                        6-00000000 7-000007F7 8-00000800 9-00000000
Possible FRUs:
    Priority: H Maintainence Procedure: FSPSP33
    Location: n/a
    Priority: M Maintainence Procedure: FSPSP04
    Location: n/a
    Priority: L FRU: 32N1272 S/N: YL1126327097 CCIN: 293A
    Location: U787F.001.DPM2DCM-P1-C7

---------------------------------------------------------------------------
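
The Diagnostic Analysis section of the report above contains everything that needs to be looked up next: the SRC and the callouts under "Possible FRUs". The following Python is a rough sketch of how those values could be pulled out of saved errpt -a output and ordered by priority. The function name parse_callouts, the field patterns, and the H/M/L ordering are assumptions made for illustration, based only on the layout of the excerpt shown here.

```python
import re

# Assumption: H, M, L stand for high, medium, low and should be worked in that order.
PRIORITY_ORDER = {"H": 0, "M": 1, "L": 2}

# Patterns for the fields that can appear on a "Priority:" line in this excerpt.
FIELDS = {
    "procedure": r"Procedure:\s*(\S+)",
    "fru": r"\bFRU:\s*(\S+)",
    "serial": r"S/N:\s*(\S+)",
    "ccin": r"CCIN:\s*(\S+)",
}


def parse_callouts(report):
    """Return (src, callouts) extracted from a Diagnostic Analysis section."""
    src = re.search(r"^SRC:\s*(\S+)", report, re.MULTILINE)
    callouts = []
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("Priority:"):
            entry = {"priority": line.split()[1]}
            for key, pattern in FIELDS.items():
                found = re.search(pattern, line)
                if found:
                    entry[key] = found.group(1)
            callouts.append(entry)
        elif line.startswith("Location:") and callouts and "location" not in callouts[-1]:
            # A Location line belongs to the callout immediately before it.
            callouts[-1]["location"] = line.split(":", 1)[1].strip()
    callouts.sort(key=lambda c: PRIORITY_ORDER.get(c["priority"], 99))
    return (src.group(1) if src else None), callouts


REPORT = """\
SRC:                    B17CE433
Possible FRUs:
    Priority: H Maintainence Procedure: FSPSP33
    Location: n/a
    Priority: M Maintainence Procedure: FSPSP04
    Location: n/a
    Priority: L FRU: 32N1272 S/N: YL1126327097 CCIN: 293A
    Location: U787F.001.DPM2DCM-P1-C7
"""

src, callouts = parse_callouts(REPORT)
print("SRC:", src)
for callout in callouts:
    print(callout)
```

Each value printed this way (the SRC, the two procedure names, the FRU number, the CCIN, and the location) is a search term for the information center.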

Open the IBM Hardware Information Center:

http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp

Search for each of the following:

1) SRC  B17CE433

A System Reference Code (SRC) is a code that identifies a system error.

Explanation
This error log entry is generated when the HMC fails to send its heartbeat message within the allotted time. The reason could be network issues, or the Ethernet cable is disconnected.
Response
If this is a tracking event, no service actions are required. Otherwise, use the FRU and procedure callouts detailed with the SRC to determine service actions.


2)FSPSP33:
A problem has been detected in the connection with the HMC.
    Ensure that the cable connectors to the network from the HMC, managed system, managed system partitions, and other HMCs are securely connected. If the connections are not secure, plug the cables back into the proper spots and make sure that the connections are good.
    Check to see if the HMC is working correctly or if the HMC was disconnected incorrectly from the managed system, managed system partitions, and other HMCs. If either has happened, reboot the HMC. For more information, see Shutting down, rebooting, and logging off the HMC.
    Verify that the network connection between the HMC, managed system, managed system partitions, and other HMCs is working properly. If you have a high performance switch (HPS) network, verify that the network connection to the CSM Management Server is also working. If the connection is not working properly, contact the customer network support to correct the problems.
    If applicable, service the next FRU.
    If the problem continues to persist, contact your next level of support. This ends the procedure.


3)FSPSP04:
A problem has been detected in the service processor firmware.


4)FRU:32N1272

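Field Replaceable Unit (FRU)

FRUs are the replaceable parts of a machine. Mainly to save cost, vendors divide a device into multiple FRUs that are swapped out on site rather than repaired. (A search for this FRU number returned no results; sometimes that is just how it goes.)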


5)CCIN:293A

Customer Card Identification Number (CCIN), an identification number for the card.


6)Location: U787F.001.DPM2DCM-P1-C7

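This is the actual physical location. In IBM's physical location code format, U787F.001.DPM2DCM identifies the enclosure (unit type 787F, model 001, serial number DPM2DCM) rather than a logical partition, and P1-C7 identifies the position inside it (planar 1, card slot 7).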
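
To make the structure of a location code concrete, here is a minimal Python sketch that splits one into its parts. It assumes the common layout U<unit type>.<model>.<serial> followed by position labels separated by hyphens. The function name parse_location and the small prefix table are illustrative assumptions, not an IBM API, and the table covers only the two prefixes seen in this example.

```python
import re

# Assumed layout: U<unit type>.<model>.<serial>[-<label>-<label>...]
LOCATION_RE = re.compile(
    r"^U(?P<unit_type>[^.]+)\.(?P<model>[^.]+)\.(?P<serial>[^-]+)(?:-(?P<position>.+))?$"
)

# Assumption: only the prefixes appearing in this article are listed.
PREFIX_MEANING = {"P": "planar", "C": "card slot"}


def parse_location(code):
    """Split a physical location code into enclosure and position parts."""
    match = LOCATION_RE.match(code)
    if match is None:
        raise ValueError("unrecognized location code: %r" % code)
    position = match.group("position")
    labels = position.split("-") if position else []
    return {
        "unit_type": match.group("unit_type"),
        "model": match.group("model"),
        "serial": match.group("serial"),
        "labels": labels,
        # Human-readable form of each label, where the prefix is known.
        "described": [
            "%s %s" % (PREFIX_MEANING.get(label[0], "unknown"), label[1:])
            for label in labels
        ],
    }


location = parse_location("U787F.001.DPM2DCM-P1-C7")
print(location["unit_type"], location["model"], location["serial"])
print(location["described"])
```

For the code in this report, the enclosure part tells you which physical unit to open and the labels tell you where in that unit to look.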

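Combining the Location with the FRU and CCIN pinpoints the actual device. While doing so, cross-check against the maintenance procedures listed in the report (FSPSP33 and FSPSP04 here) to avoid identifying the wrong part.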

Result


Related notes


References: http://rocolex.blog.163.com/blog/static/68446410201062102627624/

           http://www.loveunix.net/archiver/tid-129933.html

           http://www-947.ibm.com/systems/support/i/probsolv/src/index.html

           http://baike.baidu.com/view/1511517.htm

           http://jingh3209.blog.163.com/blog/static/15696672009421113615882/


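This article is original work. Please credit the source and author when reposting.

If you find any errors, corrections are welcome.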



Reposted from czmmiao.iteye.com/blog/1171972