Android 系统(115)---死机问题分析

一、死机现象

1. 死机/phong hang 定义
当手机长时间无法再被用户控制操作时,我们称为死机或者hang 机。
在这里我们强调长时间,如果是短时间,我们归结为Perfomance 问题,即性能问题.

2. 死机表现
* 用户操作手机无任何响应, 如触摸屏幕,按键操作等.
* 手机屏幕黑屏, 无法点亮屏幕.
* 手机界面显示内容和用户输入风马牛不相及.

二、死机可能原因

1. 系统简图
当用户对手机进行操作时, 对应的数据流将是下面一个概括的流程图.
hang process

* HW 如传感器, 触摸屏(TP), 物理按键(KP)等感知到用户操作后,触发相关的中断(ISR) 传递给Kernel, Kernel 相关的driver 对这些ISR 进行处理后,转化成标准的InputEvent. 
* User Space 的System Server 中的Input System 则持续监听Kernel 传递上来的原始InputEvent, 对其进行进一步的处理后, 变成上层APP 可直接处理的Input Event, 如button 点击, 长按, 滑动等等.
* APP 对相关的事件进行处理后,请求更新相关的逻辑界面,而这个则由System Server 中的WMS 等来负责.
* 相关的逻辑界面更新后(Z-Window), 则会请求SurfaceFlinger 来产生FrameBuffer 数据, SurfaceFlinger 则会利用GPU 等来计算生成.
* Display System/Driver 则会将FrameBuffer 中的数据更新显示出来, 这样用户才能感知到他的操作行为.
 
2. 可能死机的原因
原则上上面流程中,每一步出现问题,都可能引发死机问题. 大的方面将,可以分成硬件HW 和 软件SW 两个层次, 硬件HW 不在我们的讲诉之内.
软件SW 上,死机的原因可以分成两种:
(1). 逻辑行为异常
 ** 逻辑判断错误
 ** 逻辑设计错误
 
(2). 逻辑卡顿(block)
 * 死循环 (Deadloop)
 * 死锁 (Deadlock)
 
从具体的原因上将,可以进一步分成:
(1). Input Driver
* 无法接收HW 的中ISR,产生原始的InputEvent, 或者产生的InputEvent 异常.
(2). Input System
* 无法监听Kernel 传递上来的原始InputEvent, 或者转换与传递异常.
(3). System Logic
* 无法正常响应Input System 传递过来的InputEvent, 或者响应出错.
(4). WMS/Surfaceflinger 行为异常
* WMS/SF 无法正确的对Z-Window 进行叠加转换
(5). Display System
* 无法更新Framebuffer 数据,或者填充的数据错误
(6). LCM Driver
* 无法将Framebuffer 数据显示在LCM 上
 
对应硬件HW hang, 经常见得的情况有:
* Power 
* Clock
* Memory & Memory Controller
* Fail IC

三、死机的数据分析

1. 死机分析数据
俗话说, 巧妇难为无米之炊, 死机分析, 同样需要获取第一手的资料, 方可分析问题. 那么哪些数据可以用来分析死机呢。
大概的讲,可以分成空间数据和时间数据。空间数据,即当时现场环境,如有哪些process 在运行,CPU 的执行情况,memory 的利用情况,以及具体的process 的memory 数据等。 时间数据,即行为上的连续数据,比如某个Process 在一段时间内执行了哪些操作,某段时间内CPU 利用率的变化等。通常时空都是交融的,对应我们抓取资讯时往往也是。
 
2. 哪些数据可以用来分析.
* Backtrace 
Backtrace 又分成Java backtrace, Native Backtrace, Kernel Backtrace. 它是分析死机的非常重要的手段,我们可以快速的知道,对应的process/thread 在当时正在执行哪些动作,卡住哪里等。可以非常直观的分析死机现场。
 
* 系统运行环境
客观的反应系统的执行环境,通常包括如CPU 利用率,Memory 使用情况, Storage 剩余情况等。这些资料也非常重要,比如可以快速的知道,当时是否有Process 在疯狂的执行,当时是不是处于严重的low memory 情况, Storage 是否有耗尽的情况发生等。
 
* 程序执行环境
客观的反应当时某个程序(Kernel 也可以看成一个程序)的执行现场, 此类资讯通常包括如process 的coredump, java heap prof, kernel 的memory dump 等。完整的执行环境,我们可以快速的知道当时具体的变量的值,寄存器值等,可以精细的分析问题。
 
* 其他的一些资讯
这些资讯相对来说,比较零散了,如通常的LOG, 一些debug 命令的结果, JTAG & CVD 的数据等。

四、Backtrace 分析

1. Java Backtrace
从Java Backtrace, 我们可以知道当时Process 的虚拟机执行状态. Java Backtrace 依靠SignalCatcher 来抓取.
Google default: SignalCatcher catchs SIGQUIT(3), and then print the java backtrace to /data/anr/trace.txt
MTK Enhance:  SignalCatcher catchs SIGSTKFLT(16), and then print the java backtrace to /data/anr/mtktrace.txt( After 6577.SP/ICS2.MP)
You can update system properties dalvik.vm.stack-trace-file to  Change the address,  default is /data/anr/traces.txt
 
1.1 抓取的方式
* 在ENG Build 中
adb remount
adb shell chmod 0777 data/anr
adb shell kill -3 pid
adb pull /data/anr
 
* 在User Build 中
没有root 权限的情况下,只能直接pull 已经存在的backtrace.
adb pull /data/anr
 
* 你可以尝试直接使用下面的脚本一次性抓取

adb remount
adb shell chmod 0777 data/anr
adb shell ps
@echo off
set processid=
set /p processid=Please Input process id:
@echo on
adb shell kill -3 %processid%
@echo off
ping -n 8 127.0.0.1>nul
@echo on
adb pull data/anr/traces.txt trace-%processid%.txt
pause
 
1.2 JavaBacktrace 解析
下面是一小段system server 的java backtrace 的开始
----- pid 682 at 2014-07-30 18:04:53 -----
Cmd line: system_server
JNI: CheckJNI is off; workarounds are off; pins=4; globals=1484 (plus 50 weak)
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
"main" prio=5 tid=1 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x4193fde0 self=0x418538f8
  | sysTid=682 nice=-2 sched=0/0 cgrp=apps handle=1074835940
  | state=S schedstat=( 47858718206 26265263191 44902 ) utm=4074 stm=711 core=0
  at android.os.MessageQueue.nativePollOnce(Native Method)
  at android.os.MessageQueue.next(MessageQueue.java:138)
  at android.os.Looper.loop(Looper.java:150)
  at com.android.server.ServerThread.initAndLoop(SystemServer.java:1468)
  at com.android.server.SystemServer.main(SystemServer.java:1563)
  at java.lang.reflect.Method.invokeNative(Native Method)
  at java.lang.reflect.Method.invoke(Method.java:515)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:829)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:645)
  at dalvik.system.NativeStart.main(Native Method)
  
我们一行一行来解析.
* 0# 最开始是 -----PID at Time  然后接着 Cmd line: process name
 
* 1#  the backtrace header: dvm thread :“DALVIK THREADS:”
 
* 2#  Global DVM mutex value: if 0 unlock,  else lock
 tll:  threadListLock, 
 tsl:  threadSuspendLock, 
 tscl: threadSuspendCountLock
 ghl:  gcHeapLock
 
 (mutexes: tll=0 tsl=0 tscl=0 ghl=0)
 
* 3# thread name, java thread Priority, [daemon], DVM thread id, DVM thread status.
 "main"  -> main thread -> activity thread
 prio  -> java thread priority default is 5, (nice =0, linux thread priority 120),  domain is [1,10]
 DVM thread id, NOT linux thread id
 DVM thread Status:
  ZOMBIE,  RUNNABLE, TIMED_WAIT, MONITOR, WAIT, INITALIZING,STARTING, NATIVE, VMWAIT, SUSPENDED,UNKNOWN
 
 "main" prio=5 tid=1 NATIVE
 
* 4# DVM thread status 
 group: default is “main”
   Compiler,JDWP,Signal Catcher,GC,FinalizerWatchdogDaemon,FinalizerDaemon,ReferenceQueueDaemon are system group
 sCount:  thread suspend count
 dsCount: thread dbg suspend count
 obj:     thread obj address
 Sef:  thread point address
 
 group="main" sCount=1 dsCount=0 obj=0x4193fde0 self=0x418538f8
 
* #5 Linux thread status
 sysTId:  linux thread tid
 Nice: linux thread nice value
 sched: cgroup policy/gourp id
 cgrp:  c group  
 handle:  handle address
 sysTid=682 nice=-2 sched=0/0 cgrp=apps handle=1074835940
 
* #6 CPU Sched stat
 Schedstat (Run CPU Clock/ns, Wait CPU Clock/ns,  Slice times)
 utm: utime, user space time(jiffies)
 stm: stime, kernel space time(jiffies)
 Core now running in cpu.
 state=S schedstat=( 47858718206 26265263191 44902 ) utm=4074 stm=711 core=0
 
* #7-more Call Stack
 
1.3 几种常见的java backtrace
 
1.3.1 ActivityThread 正常状态/ActivityThread Normal Case

Message Queue is empty, and thread wait for next message.
 "main" prio=5 tid=1 NATIVE
   | group="main" sCount=1 dsCount=0 obj=0x4193fde0 self=0x418538f8
   | sysTid=11559 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=1074835940
   | state=S schedstat=( 2397315020 9177261498 7975 ) utm=100 stm=139 core=1
   at android.os.MessageQueue.nativePollOnce(Native Method)
   at android.os.MessageQueue.next(MessageQueue.java:138)
   at android.os.Looper.loop(Looper.java:150)
   at android.app.ActivityThread.main(ActivityThread.java:5299)
   at java.lang.reflect.Method.invokeNative(Native Method)
   at java.lang.reflect.Method.invoke(Method.java:515)
   at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:829)
   at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:645)
   at dalvik.system.NativeStart.main(Native Method)
  
1.3.2 Java Backtrace Monitor case

Synchronized Lock: 等待同步锁时的backtrace.
 "AnrMonitorThread" prio=5 tid=24 MONITOR
   | group="main" sCount=1 dsCount=0 obj=0x41fd80c8 self=0x551ac808
   | sysTid=711 nice=0 sched=0/0 cgrp=apps handle=1356369328
   | state=S schedstat=( 8265377638 4744771625 6892 ) utm=160 stm=666 core=0
   at com.android.server.am.ANRManager$AnrDumpMgr.dumpAnrDebugInfoLocked(SourceFile:~832)
   - waiting to lock <0x42838968> (a com.android.server.am.ANRManager$AnrDumpRecord) held by tid=20 (ActivityManager)
   at com.android.server.am.ANRManager$AnrDumpMgr.dumpAnrDebugInfo(SourceFile:824)
   at com.android.server.am.ANRManager$AnrMonitorHandler.handleMessage(SourceFile:220)
   at android.os.Handler.dispatchMessage(Handler.java:110)
   at android.os.Looper.loop(Looper.java:193)
   at android.os.HandlerThread.run(HandlerThread.java:61)
   
1.3.3 执行JNI code 未返回,状态是native 的情况

 "WifiP2pService" prio=5 tid=37 NATIVE
   | group="main" sCount=1 dsCount=0 obj=0x427a9910 self=0x55f088d8
   | sysTid=734 nice=0 sched=0/0 cgrp=apps handle=1443230288
   | state=S schedstat=( 91121772 135245305 170 ) utm=7 stm=2 core=1
   #00  pc 00032700  /system/lib/libc.so (epoll_wait+12)
   #01  pc 000105e3  /system/lib/libutils.so (android::Looper::pollInner(int)+94)
   #02  pc 00010811  /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92)
   #03  pc 0006c96d  /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22)
   #04  pc 0001eacc  /system/lib/libdvm.so (dvmPlatformInvoke+112)
   #05  pc 0004fed9  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+484)
   #06  pc 00027ea8  /system/lib/libdvm.so
   #07  pc 0002f4b0  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
   #08  pc 0002c994  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+188)
   #09  pc 000632a5  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+340)
   #10  pc 000632c9  /system/lib/libdvm.so (dvmCallMethod(Thread*, Method const*, Object*, JValue*, ...)+20)
   #11  pc 00057961  /system/lib/libdvm.so
   #12  pc 0000dd40  /system/lib/libc.so (__thread_entry+72)
   at android.os.MessageQueue.nativePollOnce(Native Method)
   at android.os.MessageQueue.next(MessageQueue.java:138)
   at android.os.Looper.loop(Looper.java:150)
   at android.os.HandlerThread.run(HandlerThread.java:61)
   
1.3.4 执行object.wait 等待状态

 "AsyncTask #1" prio=5 tid=33 WAIT
   | group="main" sCount=1 dsCount=0 obj=0x427a8480 self=0x56036b40
   | sysTid=733 nice=10 sched=0/0 cgrp=apps/bg_non_interactive handle=1443076000
   | state=S schedstat=( 1941480839 10140523154 4229 ) utm=119 stm=75 core=0
   at java.lang.Object.wait(Native Method)
   - waiting on <0x427a8618> (a java.lang.VMThread) held by tid=33 (AsyncTask #1)
   at java.lang.Thread.parkFor(Thread.java:1212)
   at sun.misc.Unsafe.park(Unsafe.java:325)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2017)
   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:410)
   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1035)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1097)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
   at java.lang.Thread.run(Thread.java:848)
 
1.3.5 Suspend 状态,通常表明是抓取backtrace 时,当时还正在执行java code, 被强制suspend 的情况
 "FileObserver" prio=5 tid=23 SUSPENDED
   | group="main" sCount=1 dsCount=0 obj=0x41fd1dc8 self=0x551abda0
   | sysTid=710 nice=0 sched=0/0 cgrp=apps handle=1427817920
   | state=S schedstat=( 130152222 399783851 383 ) utm=9 stm=4 core=0
   #00  pc 000329f8  /system/lib/libc.so (__futex_syscall3+8)
   #01  pc 000108cc  /system/lib/libc.so (__pthread_cond_timedwait_relative+48)
   #02  pc 0001092c  /system/lib/libc.so (__pthread_cond_timedwait+64)
   #03  pc 00055a93  /system/lib/libdvm.so
   #04  pc 0005614d  /system/lib/libdvm.so (dvmChangeStatus(Thread*, ThreadStatus)+34)
   #05  pc 0004ae7f  /system/lib/libdvm.so
   #06  pc 0004e353  /system/lib/libdvm.so
   #07  pc 000518d5  /system/lib/libandroid_runtime.so
   #08  pc 0008af9f  /system/lib/libandroid_runtime.so
   #09  pc 0001eacc  /system/lib/libdvm.so (dvmPlatformInvoke+112)
   #10  pc 0004fed9  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+484)
   #11  pc 00027ea8  /system/lib/libdvm.so
   #12  pc 0002f4b0  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
   #13  pc 0002c994  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+188)
   #14  pc 000632a5  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+340)
   #15  pc 000632c9  /system/lib/libdvm.so (dvmCallMethod(Thread*, Method const*, Object*, JValue*, ...)+20)
   #16  pc 00057961  /system/lib/libdvm.so
   #17  pc 0000dd40  /system/lib/libc.so (__thread_entry+72)
   at android.os.FileObserver$ObserverThread.observe(Native Method)
   at android.os.FileObserver$ObserverThread.run(FileObserver.java:88)
 
2. Native Backtrace
 
2.1 Native Backtrace 抓取方式
 
2.1.1 利用debuggerd 抓取
MTK 已经制作了一个利用debuggerd 抓取Native backtrace 的tool RTT(Runtime Trace), 对应的执行命令是:
rtt built timestamp (Apr 18 2014 15:36:21)
USAGE : rtt [-h] -f function -p pid [-t tid]
  -f funcion : current support functions:
                 bt  (Backtrace function)
  -p pid     : pid to trace
  -t tid     : tid to trace
  -n name    : process name to trace
  -h         : help menu
  
2.1.2 添加代码直接抓取
Google 默认提供了CallStack API, 请参考
system/core/include/libutils/CallStack.h 
system/core/libutils/CallStack.cpp
可快速打印单个线程的backtrace.
 
2.2 解析Native Backtrace
你可以使用GDB, 或者addr2line 等 tool 来解析抓回的Native Backtrace, 从而知道当时正在执行的native 代码.
如addr2line 执行
arm-linux-androideabi-addr2line -f -C -e symbols address
 
3. Kernel Backtrace
 
3.1 Kernel Backtrace 抓取方式
 
3.1.1 运行时抓取
* AEE/RTT 工具

 
* Proc System
  cat proc/pid/task/tid/stack

 
* Sysrq-trigger
 adb shell cat proc/kmsg > kmsg.txt
 adb shell "echo 8 > proc/sys/kernel/printk“ //修改printk loglevel
 adb shell "echo t > /proc/sysrq-trigger“ //打印所有的backtrace
 adb shell "echo w > /proc/sysrq-trigger“//打印'-D' status 'D' 的 process
 
* KDB
 Long press volume UP and DOWN more then 10s
 btp             <pid>                
 Display stack for process <pid>
 bta             [DRSTCZEUIMA]        
 Display stack all processes
 btc                                  
 Backtrace current process on each cpu
 btt             <vaddr>              
 Backtrace process given its struct task add
 
3.1.2 添加代码直接抓取
* #include <linux/sched.h>
 当前thread:  dump_stack();
 其他thread:  show_stack(task, NULL);
 
3.2 Process/Thread 状态
 "R (running)",  /*   0 */
 "S (sleeping)",  /*   1 */
 "D (disk sleep)", /*   2 */
 "T (stopped)",  /*   4 */
 "t (tracing stop)", /*   8 */
 "Z (zombie)",  /*  16 */
 "X (dead)",  /*  32 */
 "x (dead)",  /*  64 */
 "K (wakekill)",  /* 128 */
 "W (waking)",  /* 256 */
 
通常一般的Process 处于的状态都是S (sleeping), 而如果一旦发现处于如D (disk sleep), T (stopped), Z (zombie) 等就要认真审查.
 
4. 几种典型的异常情况

4.1 Deadlock
下面这个case 可以看到PowerManagerService, ActivityManager, WindowManager 相互之间发生deadlock.
"PowerManagerService" prio=5 tid=25 MONITOR
  | group="main" sCount=1 dsCount=0 obj=0x42bae270 self=0x6525d5c0
  | sysTid=913 nice=0 sched=0/0 cgrp=apps handle=1696964440
  | state=S schedstat=( 5088939411 10237027338 34016 ) utm=232 stm=276 core=2
  at com.android.server.am.ActivityManagerService.bindService(ActivityManagerService.java:~14079)
  - waiting to lock <0x42aa0f78> (a com.android.server.am.ActivityManagerService) held by tid=16 (ActivityManager)
  at android.app.ContextImpl.bindServiceCommon(ContextImpl.java:1665)
  at android.app.ContextImpl.bindService(ContextImpl.java:1648)
  at com.android.server.power.PowerManagerService.bindSmartStandByService(PowerManagerService.java:4090)
  at com.android.server.power.PowerManagerService.handleSmartStandBySettingChangedLocked(PowerManagerService.java:4144)
  at com.android.server.power.PowerManagerService.access$5600(PowerManagerService.java:102)
  at com.android.server.power.PowerManagerService$SmartStandBySettingObserver.onChange(PowerManagerService.java:4132)
  at android.database.ContentObserver$NotificationRunnable.run(ContentObserver.java:181)
  at android.os.Handler.handleCallback(Handler.java:809)
  at android.os.Handler.dispatchMessage(Handler.java:102)
  at android.os.Looper.loop(Looper.java:139)
  at android.os.HandlerThread.run(HandlerThread.java:58)
  
  "ActivityManager" prio=5 tid=16 MONITOR
  | group="main" sCount=1 dsCount=0 obj=0x42aa0d08 self=0x649166b0
  | sysTid=902 nice=-2 sched=0/0 cgrp=apps handle=1687251744
  | state=S schedstat=( 39360881460 25703061063 69675 ) utm=1544 stm=2392 core=2
  at com.android.server.wm.WindowManagerService.setAppVisibility(WindowManagerService.java:~4783)
  - waiting to lock <0x42d17590> (a java.util.HashMap) held by tid=12 (WindowManager)
  at com.android.server.am.ActivityStack.stopActivityLocked(ActivityStack.java:2432)
  at com.android.server.am.ActivityStackSupervisor.activityIdleInternalLocked(ActivityStackSupervisor.java:2103)
  at com.android.server.am.ActivityStackSupervisor$ActivityStackSupervisorHandler.activityIdleInternal(ActivityStackSupervisor.java:2914)
  at com.android.server.am.ActivityStackSupervisor$ActivityStackSupervisorHandler.handleMessage(ActivityStackSupervisor.java:2921)
  at android.os.Handler.dispatchMessage(Handler.java:110)
  at android.os.Looper.loop(Looper.java:147)
  at com.android.server.am.ActivityManagerService$AThread.run(ActivityManagerService.java:2112)
  
  "WindowManager" prio=5 tid=12 MONITOR
  | group="main" sCount=1 dsCount=0 obj=0x42a92550 self=0x6491dd48
  | sysTid=898 nice=-4 sched=0/0 cgrp=apps handle=1687201104
  | state=S schedstat=( 60734070955 41987172579 219755 ) utm=4659 stm=1414 core=1
  at com.android.server.power.PowerManagerService.setScreenBrightnessOverrideFromWindowManagerInternal(PowerManagerService.java:~3207)
  - waiting to lock <0x42a95140> (a java.lang.Object) held by tid=25 (PowerManagerService)
  at com.android.server.power.PowerManagerService.setScreenBrightnessOverrideFromWindowManager(PowerManagerService.java:3196)
  at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLockedInner(WindowManagerService.java:9686)
  at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLockedLoop(WindowManagerService.java:8923)
  at com.android.server.wm.WindowManagerService.performLayoutAndPlaceSurfacesLocked(WindowManagerService.java:8879)
  at com.android.server.wm.WindowManagerService.access$500(WindowManagerService.java:170)
  at com.android.server.wm.WindowManagerService$H.handleMessage(WindowManagerService.java:7795)
  at android.os.Handler.dispatchMessage(Handler.java:110)
  at android.os.Looper.loop(Looper.java:147)
  at android.os.HandlerThread.run(HandlerThread.java:58)
 
 
4.2 执行JNI native code 后一去不复返
"main" prio=5 tid=1 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x41bb3d98 self=0x41ba2878
  | sysTid=814 nice=-2 sched=0/0 cgrp=apps handle=1074389380
  | state=D schedstat=( 22048890928 19526803112 32612 ) utm=1670 stm=534 core=0
  (native backtrace unavailable)
  at android.hardware.SystemSensorManager$BaseEventQueue.nativeDisableSensor(Native Method)
  at android.hardware.SystemSensorManager$BaseEventQueue.disableSensor(SystemSensorManager.java:399)
  at android.hardware.SystemSensorManager$BaseEventQueue.removeAllSensors(SystemSensorManager.java:325)
  at android.hardware.SystemSensorManager.unregisterListenerImpl(SystemSensorManager.java:194)
  at android.hardware.SensorManager.unregisterListener(SensorManager.java:560)
  at com.android.internal.policy.impl.WindowOrientationListener.disable(WindowOrientationListener.java:139)
  at com.android.internal.policy.impl.PhoneWindowManager.updateOrientationListenerLp(PhoneWindowManager.java:774)
  at com.android.internal.policy.impl.PhoneWindowManager.screenTurnedOff(PhoneWindowManager.java:4897)
  at com.android.server.power.Notifier.sendGoToSleepBroadcast(Notifier.java:518)
  at com.android.server.power.Notifier.sendNextBroadcast(Notifier.java:434)
  at com.android.server.power.Notifier.access$500(Notifier.java:63)
  at com.android.server.power.Notifier$NotifierHandler.handleMessage(Notifier.java:584)
  at android.os.Handler.dispatchMessage(Handler.java:110)
  at android.os.Looper.loop(Looper.java:193)
  at com.android.server.ServerThread.initAndLoop(SystemServer.java:1436)
  at com.android.server.SystemServer.main(SystemServer.java:1531)
  at java.lang.reflect.Method.invokeNative(Native Method)
  at java.lang.reflect.Method.invoke(Method.java:515)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:824)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:640)
  at dalvik.system.NativeStart.main(Native Method)
  
===>
KERNEL SPACE BACKTRACE, sysTid=814
 [<ffffffff>] 0xffffffff from [<c07e5140>] __schedule+0x3fc/0x950
 [<c07e4d50>] __schedule+0xc/0x950 from [<c07e57e4>] schedule+0x40/0x80
 [<c07e57b0>] schedule+0xc/0x80 from [<c07e5ae4>] schedule_preempt_disabled+0x20/0x2c
 [<c07e5ad0>] schedule_preempt_disabled+0xc/0x2c from [<c07e3c3c>] mutex_lock_nested+0x1b8/0x560
 [<c07e3a90>] mutex_lock_nested+0xc/0x560 from [<c05667d8>] gsensor_operate+0x1bc/0x2c0
 [<c0566628>] gsensor_operate+0xc/0x2c0 from [<c0495fa0>] hwmsen_enable+0xa8/0x30c
 [<c0495f04>] hwmsen_enable+0xc/0x30c from [<c0496500>] hwmsen_unlocked_ioctl+0x2fc/0x528
 [<c0496210>] hwmsen_unlocked_ioctl+0xc/0x528 from [<c018ad98>] do_vfs_ioctl+0x94/0x5bc
 [<c018ad10>] do_vfs_ioctl+0xc/0x5bc from [<c018b33c>] sys_ioctl+0x7c/0x8c
 [<c018b2cc>] sys_ioctl+0xc/0x8c from [<c000e480>] ret_fast_syscall+0x0/0x40
 [<ffffffff>]  from [<ffffffff>]

五、系统运行环境分析

1. 系统运行环境
客观的反应系统的执行环境,通常包括如CPU 利用率,Memory 使用情况, Storage 剩余情况等。这些资料也非常重要,比如可以快速的知道,当时是否有Process 在疯狂的执行,当时是不是处于严重的low memory 情况, Storage 是否有耗尽的情况发生等。
 
2. CPU Usage
追查CPU 利用率可大体的知道,当时机器是否有Process 在疯狂的运行,  当时系统运行是否繁忙。通常死机分析,只需要抓取基本的使用情况即可。下面说一下一般的抓取方式
 
2.1 top
top 可以简单的查询Cpu 的基本使用情况
Usage: top [ -m max_procs ] [ -n iterations ] [ -d delay ] [ -s sort_column ] [ -t ] [ -h ]
    -m num  Maximum number of processes to display.
    -n num  Updates to show before exiting.
    -d num  Seconds to wait between updates.
    -s col  Column to sort by (cpu,vss,rss,thr).
    -t      Show threads instead of processes.
    -h      Display this help screen.
 
注意的是top 的CPU% 是按全部CPU 来计算的,如果以单线程来计算,比如当时有开启4个核心,那么最多吃到25%. 
 
个人常见的操作方式如:  top -t -m 5 -n 2
I.E 正常情况
User 2%, System 12%, IOW 0%, IRQ 0%
User 14 + Nice 0 + Sys 67 + Idle 471 + IOW 0 + IRQ 0 + SIRQ 2 = 554
  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
 2423  2423  1   8% R   2316K   1128K     root     top             top
  270   270  0   1% S   2160K    924K     root     aee_resmon      /system/bin/aee_resmon
  159   159  0   0% D      0K      0K     root     bat_thread_kthr
    3     3  0   0% S      0K      0K     root     ksoftirqd/0
   57    57  0   0% D      0K      0K     root     hps_main
 
User 1%, System 7%, IOW 0%, IRQ 0%
User 10 + Nice 0 + Sys 41 + Idle 494 + IOW 0 + IRQ 0 + SIRQ 0 = 545
  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
 2423  2423  1   8% R   2324K   1152K     root     top             top
   57    57  0   0% D      0K      0K     root     hps_main
  242   419  0   0% S   8600K   4540K     shell    mobile_log_d    /system/bin/mobile_log_d
  982   991  0   0% S   4364K   1156K     media_rw sdcard          /system/bin/sdcard
  272   272  0   0% S  30680K   9048K     root     em_svr          /system/bin/em_svr
  
  从上面可以看出, 系统基本运行正常,没有很吃CPU 的进程.
 
I.E 异常情况
 
User 59%, System 4%, IOW 2%, IRQ 0%
User 1428 + Nice 0 + Sys 110 + Idle 811 + IOW 67 + IRQ 0 + SIRQ 1 = 2417
  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
16132 32195  3  14% R 997100K  53492K  bg u0_a60   Thread-1401     com.android.mms
16132 32190  1  14% R 997100K  53492K  bg u0_a60   Thread-1400     com.android.mms
16132 32188  2  14% R 997100K  53492K  bg u0_a60   Thread-1399     com.android.mms
16132 32187  0  14% R 997100K  53492K  bg u0_a60   Thread-1398     com.android.mms
18793 18793  4   1% R   2068K   1020K     shell    top             top
 
User 67%, System 3%, IOW 7%, IRQ 0%
User 1391 + Nice 0 + Sys 75 + Idle 435 + IOW 146 + IRQ 0 + SIRQ 1 = 2048
  PID   TID PR CPU% S     VSS     RSS PCY UID      Thread          Proc
16132 32195  3  16% R 997100K  53492K  bg u0_a60   Thread-1401     com.android.mms
16132 32188  2  16% R 997100K  53492K  bg u0_a60   Thread-1399     com.android.mms
16132 32190  0  16% R 997100K  53492K  bg u0_a60   Thread-1400     com.android.mms
16132 32187  1  16% R 997100K  53492K  bg u0_a60   Thread-1398     com.android.mms
18793 18793  4   2% R   2196K   1284K     shell    top             top
 
可以明显的看到,贵司的mms 的4个thread 都有进入了deadloop, 分别占用了一个cpu core.  同时可以快速抓取他们的java trace,  进一步可以看到当时MMS 的四个backtrace, 以便快速分析.
 
"Thread-1401" prio=5 tid=32 SUSPENDED JIT
  | group="main" sCount=1 dsCount=0 obj=0x4264f860 self=0x7b183558
  | sysTid=32195 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078705952
  | state=S schedstat=( 3284456714198 104216273858 383002 ) utm=324720 stm=3725 core=5
  at com.yulong.android.mms.c.f.d(MmsChatDataServer.java:~1095)
  at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:7582)
  at java.lang.Thread.run(Thread.java:841)
 
"Thread-1400" prio=5 tid=31 SUSPENDED JIT
  | group="main" sCount=1 dsCount=0 obj=0x41f5d8f0 self=0x7be2a8e8
  | sysTid=32190 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2078029504
  | state=S schedstat=( 3284905134412 105526230562 382946 ) utm=324805 stm=3685 core=5
  at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
  at java.lang.Thread.run(Thread.java:841)
 
"Thread-1399" prio=5 tid=30 SUSPENDED JIT
  | group="main" sCount=1 dsCount=0 obj=0x42564d28 self=0x7b0e6838
  | sysTid=32188 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2077662640
  | state=S schedstat=( 3288042313685 103203810616 375959 ) utm=325143 stm=3661 core=7
  at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
  at java.lang.Thread.run(Thread.java:841)
 
"Thread-1398" prio=5 tid=29 SUSPENDED
  | group="main" sCount=1 dsCount=0 obj=0x4248e5a8 self=0x7be0d128
  | sysTid=32187 nice=0 sched=0/0 cgrp=apps/bg_non_interactive handle=2079251904
  | state=S schedstat=( 3287248372432 105116936413 379634 ) utm=325055 stm=3669 core=6
  at com.yulong.android.mms.ui.MmsChatActivity$37.run(MmsChatActivity.java:~7586)
  at java.lang.Thread.run(Thread.java:841)
 
当时处于suspend, 即意味着当时这四个thread 正在执行java code, 而抓取backtrace 时强制将thread suspend 。看起来客户改动所致,并且客户有Proguard, 麻烦客户自己review 代码。
 
2.2 cputime
通常用cputime 来打印一段时间内, CPU 的利用率的统计情况, 资讯比top 更加详细.

cputime -h
Usage: cputime  [-start/-stop]  [-n iterations] [-d delay] [-e time] [ -m max_count ]  [-p] [-s sort_column] [-i id] [-h]
 -start       Start cpu time monitor.
 -stop        Stop cpu time monitor.
 -n  num      Updates to show before exiting.
 -d  num      Seconds to wait between updates.
 -m  num      Maximum number of information to display.
 -e  num      Enable CPU time monitor and stop monitor after num seconds. If no this parameter will
              show last cputime monitor data.
 -p           Show process instead of thread. If no this parameter default will show thread information.
 -s  col      Column to sort by time/user/kernel/id/isr_c/isr_t(cputime/user time/kernel time/id/isr_count/
              isr_time). If no this parameter default will sort by cputime.
 -i  id       show isr information of thread id.
 -h           Display this help screen.
*       Example     *
cputime -e 100: Enable cputime monitor and stop after 100 seconds. Then show thread cputime, and sort by cputime.
cputime -e 200  -s user: Enable cputime monitor and stop after 200 seconds. Then show thread cputime, and sort by user time.
cputime: Show thread cputime, and sort by cputime.
cputime -p -s id: Show process cputime, and sort by process id.
 
2.3 ftrace
ftrace 可以纪录CPU 最为详细的执行情况, 即linux scheduler 的执行情况. 通常默认只开启 sched_switch. 

 
2.4 Kernel core status
有的时候我们需要追查一下,当时Kernel 的基本调度情况,以及接收中断的情况,以判断当前CPU 执行的基本情况是否异常。比如有时候如果某个中断上来太过频繁,就容易导致系统运行缓慢,甚至死机。
* CPU Sched status
     adb shell cat proc/sched_debug
     Use sysrq-trigger
* CPU interrupts
     adb shell cat proc/interrupts
     adb shell cat proc/irq/irq_id/spurious
 
3. Memory Usage
Memory Usage, 我们通常会审查,系统当时memory 是否足够, 是否处于low memory 状态, 是否可能出现因无法申请到memory 而卡死的情况.

常见的一些基本信息如下:
* meminfo: basic memory status
adb shell cat proc/meminfo
adb shell cat proc/pid/maps
adb shell cat proc/pid/smaps
* procrank info: all process memory status
adb shell procrank
adb shell procmem pid
adb shell dumpsys meminfo pid
* zoneinfo:
adb shell cat proc/zoneinfo
* buddyinfo:
adb shell cat /proc/buddyinfo
 
4. Storage Usage
查看Storage 的情况,通常主要是查询data 分区是否已经刷满, sdcard 是否已经刷满, 剩余的空间是否足够。以及是否有产生超大文件等。
通常使用的命令如 df
df
Filesystem               Size     Used     Free   Blksize
/dev                   446.0M   128.0K   445.8M   4096
/sys/fs/cgroup         446.0M    12.0K   445.9M   4096
/mnt/secure            446.0M     0.0K   446.0M   4096
/mnt/asec              446.0M     0.0K   446.0M   4096
/mnt/obb               446.0M     0.0K   446.0M   4096
/system                  1.2G   915.3M   355.5M   4096
/data                    1.1G   136.7M  1010.1M   4096
/cache                 106.2M    48.0K   106.2M   4096
/protect_f               4.8M    52.0K     4.8M   4096
/protect_s               4.8M    48.0K     4.8M   4096
/mnt/cd-rom              1.2M     1.2M     0.0K   2048
/mnt/media_rw/sdcard0     4.6G     1.1G     3.4G   32768
/mnt/secure/asec         4.6G     1.1G     3.4G   32768
/storage/sdcard0         4.6G     1.1G     3.4G   32768
以及ls, du 等命令,如du

du -help
usage: du [-H | -L | -P] [-a | -d depth | -s] [-cgkmrx] [file ...]

du -LP -d 1
8       ./lost+found
88      ./local
384     ./misc
48      ./nativebenchmark
912     ./nativetest
8       ./dontpanic
13376   ./data
8       ./app-private
8       ./app-asec
129424  ./app-lib
8       ./app
136     ./property
16      ./ssh
116312  ./dalvik-cache
8       ./resource-cache
48      ./drm
8       ./mediadrm
8       ./security
3888    ./nvram
8       ./amit
8       ./acdapi
88       ./@btmtk
32      ./sec
8       ./user
16      ./media
16      ./agps_supl
8       ./anr
8       ./gps_mnl
8       ./nfc_socket
16      ./ccci_cfg
32      ./mdlog
1312    ./system
176     ./recovery
32      ./backup
274688  .


猜你喜欢

转载自blog.csdn.net/zhangbijun1230/article/details/80763403