Black screen caused by a systemui start timeout

I. Problem Description

1.1 Symptom

      The phone screen goes black, but long-pressing the power key still brings up the power-off dialog.

1.2 JIRA

      xxx

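1.3 Conclusion

      The systemui services were not restarted, which caused the black screen. This is a bug in stock Android: systemui starts in an unusual way, drawing its UI from a service, so if that service does not come up the screen stays black.

1.4 Fix link: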

      xxx

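II. Initial Analysis

2.1 Checking the system_server traces

      When we hit a black screen, we usually start with the system_server traces to see whether system_server is stuck. Normally we search the system_server traces for the keywords "held by" or "state=D", then check the bugreport for a watchdog. Unfortunately, none of these signs were present here: every system_server thread was in a normal state. So if system_server is not the problem, what should we look at next? In that case, check the components related to the display, such as surfaceflinger, window, and systemui.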
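
The keyword check described above can be automated with a few lines of Java. This is only a sketch: `TraceKeywordScan` and `suspiciousLines` are hypothetical names, not an Android tool, and the traces file path is whatever you pulled from the device.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Android tool: lists the trace lines that
// contain either of the two keywords mentioned above.
class TraceKeywordScan {
    static List<String> suspiciousLines(List<String> traceLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < traceLines.size(); i++) {
            String line = traceLines.get(i);
            // "held by" marks a lock owned by another thread;
            // "state=D" marks a thread in uninterruptible sleep.
            if (line.contains("held by") || line.contains("state=D")) {
                hits.add((i + 1) + ": " + line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a traces file pulled from the device.
        for (String hit : suspiciousLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

An empty result, as in this case, only says that system_server is probably not blocked; it does not clear the other processes.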

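2.2 Checking UI-related information

        After a long discussion with colleagues on the display team and a long look at the surfaceflinger information, we confirmed one thing: the screen was black because nothing was being drawn at that moment. Why was nothing being drawn? We went on to the systemui trace. Since we had never studied systemui in depth, on the first pass we only checked whether any systemui thread was stuck; none were, which misled us into thinking systemui was fine too. We then read the systemui logs in the bugreport, confirmed them with colleagues on the systemui team, and finally found that systemui had not restarted its services. The key logs are: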
07-06 11:48:42.890 1000 7903 7922 I am_kill : [0,23775,com.android.systemui,-800,bg anr]
07-06 11:48:49.472 1000 7903 17478 I am_proc_died: [0,23775,com.android.systemui,-800,0]
07-06 11:48:49.477 1000 7903 17478 I am_schedule_service_restart: [0,com.android.systemui/.fsgesture.FsGestureService,0]
07-06 11:48:49.484 1000 7903 17478 I am_schedule_service_restart: [0,com.android.systemui/.SystemUIService,0]
07-06 11:48:49.485 1000 7903 17478 I am_schedule_service_restart: [0,com.android.keyguard/.KeyguardService,0]
07-06 11:48:49.517 1000 7903 17478 I am_proc_start: [0,23058,1000,com.android.systemui,restart,com.android.systemui]
07-06 11:48:49.584 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,bg anr]
07-06 11:48:59.563 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,start timeout]
07-06 11:48:59.650 1000 7903 7922 I am_proc_start: [0,23278,1000,com.android.systemui,added application,com.android.systemui]
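Since we were not very familiar with the service restart code, the logs alone could not tell us the cause. We could only suspect that these services had been removed, so we added logs at every service.remove call in ActiveServices.java and built an image for the test team to run. After several days it had not reproduced. If testing cannot reproduce a problem, we have to find a way ourselves: first summarize the pattern in the logs, then reproduce that sequence. From the logs above, systemui first hits a bg anr, then am_proc_died is printed, then the services are scheduled for restart, then the systemui process is restarted. That process start times out, and no am_proc_died is printed for it. The process is then started once more, and this final start does not bring up the services.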
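
The pattern to look for, a process that is killed but never reported by am_proc_died, can be searched for mechanically. The following is a minimal sketch: `KillWithoutDeathScanner` is a hypothetical helper, and the regular expression assumes the event-log layout shown in the lines above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of AOSP: finds processes that were killed
// (am_kill) but never reported as dead (am_proc_died).
class KillWithoutDeathScanner {
    // Matches e.g. "am_kill : [0,23058,com.android.systemui,-800,start timeout]"
    // and captures the event name, the pid and the process name.
    private static final Pattern EVENT = Pattern.compile(
            "(am_kill|am_proc_died)\\s*:\\s*\\[\\d+,(\\d+),([^,]+),");

    static List<String> findKillsWithoutDeath(List<String> lines) {
        Map<String, String> killed = new LinkedHashMap<>(); // pid -> process name
        for (String line : lines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) {
                continue; // not a kill or death event
            }
            if (m.group(1).equals("am_kill")) {
                killed.put(m.group(2), m.group(3));
            } else {
                killed.remove(m.group(2)); // death was reported for this pid
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : killed.entrySet()) {
            result.add(e.getValue() + " pid=" + e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Kill and death events copied from the log excerpt above.
        List<String> log = Arrays.asList(
                "07-06 11:48:42.890 1000 7903 7922 I am_kill : [0,23775,com.android.systemui,-800,bg anr]",
                "07-06 11:48:49.472 1000 7903 17478 I am_proc_died: [0,23775,com.android.systemui,-800,0]",
                "07-06 11:48:49.584 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,bg anr]",
                "07-06 11:48:59.563 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,start timeout]");
        System.out.println(findKillsWithoutDeath(log)); // [com.android.systemui pid=23058]
    }
}
```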

III. Reproduction

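      A service or process start timeout ANR works by posting a delayed message when the start begins and removing it when the start completes; if the delayed message is not removed within a fixed time, an ANR is raised. We went straight to the code to find the two messages for the service and process ANRs:
SERVICE_TIMEOUT_MSG/PROC_START_TIMEOUT_MSG. We set conditional breakpoints at every place they are posted (stopping only when the process is systemui), set breakpoints where the messages are handled, and started reproducing according to the logs. Several days of attempts failed, because I had cut corners: I did not trigger the ANR the normal way, but manually changed the flow when the code reached the point that removes the ANR message, so that the message was not removed. That does produce an ANR, but every systemui timeout then still printed am_proc_died. Looking at the code again, I suspected that in the restart where systemui timed out, linkToDeath had never been reached, so binderDied never ran (am_proc_died is printed in the appDiedLocked method). After long breakpoint sessions I became familiar with this code and found a way to reproduce the problem every time.
Several important breakpoints:
1. Breakpoint at the processStartTimedOutLocked call in AMS (a process start timeout goes through here):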
case PROC_START_TIMEOUT_MSG: {
    ProcessRecord app = (ProcessRecord)msg.obj;
    synchronized (ActivityManagerService.this) {
        processStartTimedOutLocked(app);
    }
} break;
2. Breakpoint in the AMS startProcessLocked method at checkTime(startTime, "startProcess: done updating pids map"); (the PROC_START_TIMEOUT message is posted here after the process is started):
synchronized (mPidsSelfLocked) {
    this.mPidsSelfLocked.put(startResult.pid, app);
    if (isActivityProcess) {
        Message msg = mHandler.obtainMessage(PROC_START_TIMEOUT_MSG);
        msg.obj = app;
        mHandler.sendMessageDelayed(msg, startResult.usingWrapper
                ? PROC_START_TIMEOUT_WITH_WRAPPER : PROC_START_TIMEOUT);
    }
}
checkTime(startTime, "startProcess: done updating pids map");
3. Breakpoint at appDiedLocked in AMS:
@Override
public void binderDied() {
    if (DEBUG_ALL) Slog.v(
        TAG, "Death received in " + this
        + " for thread " + mAppThread.asBinder());
    synchronized(ActivityManagerService.this) {
        appDiedLocked(mApp, mPid, mAppThread, true);
    }
}
4. Breakpoint at performServiceRestartLocked in ActiveServices (this is where the service is restarted):
        public void run() {
            synchronized(mAm) {
                performServiceRestartLocked(mService);
            }
        }
5. Breakpoint at the attachApplicationLocked method in AMS:
@Override
public final void attachApplication(IApplicationThread thread) {
    synchronized (this) {
        int callingPid = Binder.getCallingPid();
        final long origId = Binder.clearCallingIdentity();
        attachApplicationLocked(thread, callingPid);
        Binder.restoreCallingIdentity(origId);
    }
}
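6. Connect the phone, kill systemui, wait 10 s at the second breakpoint (so that the process start times out), then resume. If execution proceeds in the following order, the problem reproduces every time:
appDiedLocked->performServiceRestartLocked->processStartTimedOutLocked->attachApplicationLocked (In stock Android, attachApplicationLocked is not reached, because stock Android does not restart the process in processStartTimedOutLocked. Also, waiting 10 s at the checkTime in step 2 does not always lead to processStartTimedOutLocked; I am not sure why, possibly a side effect of debugging, and I did not dig into it. Instead I added a 10 s sleep after checkTime, which reproduces the problem almost every time.)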
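
The start-timeout mechanism described at the beginning of this section (post a delayed message when the start begins, remove it when the start completes) can be sketched with a toy message queue driven by a virtual clock. This is not the Android Handler. The class and method names are hypothetical, the message name is borrowed from AMS with an arbitrary value, and the 10 s timeout is an assumption chosen to match the 10 s wait used in step 6.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model, NOT the Android Handler: delayed messages on a virtual clock.
class StartTimeoutSketch {
    static final int PROC_START_TIMEOUT_MSG = 1;    // name from AMS, value arbitrary
    static final long PROC_START_TIMEOUT = 10_000L; // assumed 10 s, as in step 6

    private static final class Msg {
        final int what;
        final String proc;
        final long when;

        Msg(int what, String proc, long when) {
            this.what = what;
            this.proc = proc;
            this.when = when;
        }
    }

    private final List<Msg> queue = new ArrayList<>();
    final List<String> timedOut = new ArrayList<>();
    private long now = 0;

    // Starting a process posts the delayed timeout message.
    void startProcess(String proc) {
        queue.add(new Msg(PROC_START_TIMEOUT_MSG, proc, now + PROC_START_TIMEOUT));
    }

    // A process that attaches in time removes its pending timeout message.
    void attachApplication(String proc) {
        queue.removeIf(m -> m.what == PROC_START_TIMEOUT_MSG && m.proc.equals(proc));
    }

    // Advance the virtual clock and deliver every message that is now due.
    void advance(long millis) {
        now += millis;
        Iterator<Msg> it = queue.iterator();
        while (it.hasNext()) {
            Msg m = it.next();
            if (m.when <= now) {
                it.remove();
                timedOut.add(m.proc); // stands in for processStartTimedOutLocked
            }
        }
    }

    public static void main(String[] args) {
        StartTimeoutSketch s = new StartTimeoutSketch();
        s.startProcess("com.android.systemui");
        s.advance(3_000);
        s.attachApplication("com.android.systemui"); // attached in time
        s.advance(10_000);
        System.out.println(s.timedOut); // []

        s.startProcess("com.android.systemui");
        s.advance(10_000); // never attached
        System.out.println(s.timedOut); // [com.android.systemui]
    }
}
```

Pausing at a breakpoint or sleeping after the message has been posted corresponds to calling advance before attachApplication, which is why the sleep after checkTime triggers the timeout.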

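IV. In-depth Analysis

      Having reproduced the problem, the next step is to analyze why the services were not restarted. Before that, we need a rough understanding of how a service gets restarted.

4.1 A brief look at how a service is restarted

    Every service start goes through the retrieveServiceLocked method, which records the current ServiceRecord in a ServiceRestarter object. The main code is: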
final ServiceRestarter res = new ServiceRestarter();
...
r = new ServiceRecord(mAm, ss, name, filter, sInfo, callingFromFg, res);
res.setService(r);

ServiceRecord(ActivityManagerService ams,
BatteryStatsImpl.Uid.Pkg.Serv servStats, ComponentName name,
Intent.FilterComparison intent, ServiceInfo sInfo, boolean callerIsFg,
Runnable restarter) {
    ...
    this.restarter = restarter;
    ...
}

private class ServiceRestarter implements Runnable {
    private ServiceRecord mService;

    void setService(ServiceRecord service) {
        mService = service;
    }

    public void run() {
        synchronized(mAm) {
            performServiceRestartLocked(mService);
        }
    }
}
performServiceRestartLocked is one of the key methods in a service restart. Here is the call stack that schedules a service restart:
at com.android.server.am.ActiveServices.scheduleServiceRestartLocked(ActiveServices.java:2078)
at com.android.server.am.ActiveServices.killServicesLocked(ActiveServices.java:3303)
at com.android.server.am.ActivityManagerService.cleanUpApplicationRecordLocked(ActivityManagerService.java:18579)
at com.android.server.am.ActivityManagerService.handleAppDiedLocked(ActivityManagerService.java:5527)
at com.android.server.am.ActivityManagerService.appDiedLocked(ActivityManagerService.java:5731)
at com.android.server.am.ActivityManagerService$AppDeathRecipient.binderDied(ActivityManagerService.java:1669)
- locked <0x2cc2> (a com.android.server.am.ActivityManagerService)
at android.os.BinderProxy.sendDeathNotice(Binder.java:849)
Here killServicesLocked decides whether each service needs to be restarted:
final void killServicesLocked(ProcessRecord app, boolean allowRestart) {
    ...
    // Now do remaining service cleanup.
    for (int i=app.services.size()-1; i>=0; i--) { // walk the services recorded in the ProcessRecord; every service the process has started is recorded here
        ...
    } else {
        boolean canceled = scheduleServiceRestartLocked(sr, true); // restart the service

        ...
    }
    ...
    if (!allowRestart) {
        app.services.clear(); // if restart is not allowed, clear the services
        ...
    }
    ...
}
Finally, scheduleServiceRestartLocked posts the Runnable that restarts the service:
private final boolean scheduleServiceRestartLocked(ServiceRecord r, boolean allowCancel) {
    boolean canceled = false;
    ...
    final long now = SystemClock.uptimeMillis();
    ...
    } else {
        // Persistent processes are immediately restarted, so there is no
        // reason to hold of on restarting their services.
        r.totalRestartCount++;
        r.restartCount = 0;
        r.restartDelay = 0;
        r.nextRestartTime = now; // the current time
    }
    // If mRestartingServices does not already contain this service, add it.
    // performServiceRestartLocked later restarts services based on mRestartingServices.
    if (!mRestartingServices.contains(r)) {
        r.createdFromFg = false;
        mRestartingServices.add(r); // put the current service into mRestartingServices
        r.makeRestarting(mAm.mProcessStats.getMemFactorLocked(), now);
    }

    cancelForegroundNotificationLocked(r);
    mAm.mHandler.removeCallbacks(r.restarter);
    mAm.mHandler.postAtTime(r.restarter, r.nextRestartTime);
    r.nextRestartTime = SystemClock.uptimeMillis() + r.restartDelay;
    Slog.w(TAG, "Scheduling restart of crashed service "
            + r.shortName + " in " + r.restartDelay + "ms");
    ....
    EventLog.writeEvent(EventLogTags.AM_SCHEDULE_SERVICE_RESTART,
            r.userId, r.shortName, r.restartDelay);

    return canceled;
}
After postAtTime, the Runnable runs performServiceRestartLocked, which calls bringUpServiceLocked:
private String bringUpServiceLocked(ServiceRecord r, int intentFlags, boolean execInFg,
boolean whileRestarting, boolean permissionsReviewRequired)
throws TransactionTooLargeException {
    ...

    // We are now bringing the service up, so no longer in the
    // restarting state.
    if (mRestartingServices.remove(r)) { // remove the current service from mRestartingServices
        clearRestartingIfNeededLocked(r);
    }
    ...
    if (!mPendingServices.contains(r)) {
        mPendingServices.add(r); // add the current service to mPendingServices
    }
    ...
    return null;
}
When our process starts, attachApplicationLocked also restarts the services recorded in mRestartingServices. The attachApplicationLocked method contains the following code:
// Find any services that should be running in this process...
if (!badApp) {
    try {
        didSomething |= mServices.attachApplicationLocked(app, processName);
        checkTime(startTime, "attachApplicationLocked: after mServices.attachApplicationLocked");
    } catch (Exception e) {
        Slog.wtf(TAG, "Exception thrown starting services in " + app, e);
        badApp = true;
    }
}
mServices.attachApplicationLocked likewise restarts the services recorded in mRestartingServices:
boolean attachApplicationLocked(ProcessRecord proc, String processName)
throws RemoteException {
    ...
    // Also, if there are any services that are waiting to restart and
    // would run in this process, now is a good time to start them. It would
    // be weird to bring up the process but arbitrarily not let the services
    // run at this point just because their restart time hasn't come up.
    if (mRestartingServices.size() > 0) {
        ServiceRecord sr;
        for (int i=0; i<mRestartingServices.size(); i++) {
            sr = mRestartingServices.get(i);
            if (proc != sr.isolatedProc && (proc.uid != sr.appInfo.uid
                    || !processName.equals(sr.processName))) {
                continue;
            }
            mAm.mHandler.removeCallbacks(sr.restarter);
            mAm.mHandler.post(sr.restarter);
        }
    }
    return didSomething;
}
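To summarize, a service restart can be triggered from two places: the appDiedLocked flow and the attachApplicationLocked flow. If performServiceRestartLocked runs first after the Runnable is posted in the appDiedLocked flow, then attachApplicationLocked will not post the Runnable again, because mRestartingServices.size() is 0. If attachApplicationLocked runs first, it removes the previously posted message and posts the Runnable again. That is enough about service restarts for now; it may be a little dizzying. mRestartingServices comes up again below, so just remember that a service is added to it in the appDiedLocked flow and removed from it in performServiceRestartLocked (through bringUpServiceLocked).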

4.2 Why the service was not restarted

    Let us recall the exact call order.
Abnormal flow:
appDiedLocked->performServiceRestartLocked->processStartTimedOutLocked->attachApplicationLocked
Normal flow:
appDiedLocked->performServiceRestartLocked->attachApplicationLocked, or appDiedLocked->attachApplicationLocked->performServiceRestartLocked
Analysis of the normal flow:
Case 1: appDiedLocked (mRestartingServices add, post the Runnable to restart the service, restart the process)->performServiceRestartLocked (mRestartingServices remove, service restarted)->attachApplicationLocked (process restarted)
Case 2: appDiedLocked (mRestartingServices add, post the Runnable to restart the service, restart the process)->attachApplicationLocked (process restarted)->performServiceRestartLocked (mRestartingServices remove, service restarted)

Analysis of the abnormal flow:
appDiedLocked (mRestartingServices add, post the Runnable to restart the service, restart the process)->performServiceRestartLocked (mRestartingServices remove, service restarted)->processStartTimedOutLocked (process killed, process restarted)->attachApplicationLocked (process restarted; because performServiceRestartLocked has already removed everything from mRestartingServices, the service is not restarted again)
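
The three orderings can be replayed in a small toy model. This is not AOSP code: the class and its fields are hypothetical stand-ins for mRestartingServices and mPendingServices, and the model assumes that a start request waiting for a process is dropped when that process start times out, which is the behaviour the analysis above implies.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the restart bookkeeping described above; not AOSP code.
class ServiceRestartRaceSketch {
    private final List<String> restarting = new ArrayList<>(); // stand-in for mRestartingServices
    private final List<String> pending = new ArrayList<>();    // stand-in for mPendingServices
    private final List<String> started = new ArrayList<>();    // services actually running
    private boolean attached = false;

    void appDied(String service) {
        attached = false;
        started.clear();
        restarting.add(service); // the restart Runnable is posted here
    }

    void performServiceRestart() {
        for (String s : restarting) {
            if (attached) {
                started.add(s); // process is up: start the service directly
            } else {
                pending.add(s); // otherwise wait for the process to attach
            }
        }
        restarting.clear();
    }

    void processStartTimedOut() {
        attached = false;
        pending.clear(); // model assumption: pending starts die with the process
    }

    void attachApplication() {
        attached = true;
        started.addAll(pending);
        pending.clear();
        // Anything still in `restarting` starts when performServiceRestart() runs.
    }

    static List<String> run(String... steps) {
        ServiceRestartRaceSketch m = new ServiceRestartRaceSketch();
        for (String step : steps) {
            switch (step) {
                case "appDied": m.appDied("SystemUIService"); break;
                case "performServiceRestart": m.performServiceRestart(); break;
                case "processStartTimedOut": m.processStartTimedOut(); break;
                case "attachApplication": m.attachApplication(); break;
                default: throw new IllegalArgumentException(step);
            }
        }
        return new ArrayList<>(m.started);
    }

    public static void main(String[] args) {
        // Normal case 1 prints [SystemUIService]
        System.out.println(run("appDied", "performServiceRestart", "attachApplication"));
        // Normal case 2 prints [SystemUIService]
        System.out.println(run("appDied", "attachApplication", "performServiceRestart"));
        // Abnormal case prints []
        System.out.println(run("appDied", "performServiceRestart",
                "processStartTimedOut", "attachApplication"));
    }
}
```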

V. Solutions

1. Restart the service in attachApplicationLocked: xxx
The idea is:
appDiedLocked (mRestartingServices add, no longer post the Runnable to restart the service, restart the process)->processStartTimedOutLocked (process killed, process restarted)->attachApplicationLocked (process restarted)->performServiceRestartLocked (mRestartingServices remove, service restarted)

2. In processStartTimedOutLocked, put the service back into mRestartingServices so that it is restarted: xxx

3. In processStartTimedOutLocked, special-case systemui and restart its services: xxx
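
The idea behind the second solution can be illustrated on a toy model. This sketch is not the actual patch behind the link above: `RequeueOnTimeoutSketch` and its `requeueOnTimeout` flag are hypothetical, and the lists only stand in for mRestartingServices and mPendingServices.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of solution 2; not the real fix and not AOSP code.
class RequeueOnTimeoutSketch {
    private final List<String> restarting = new ArrayList<>(); // stand-in for mRestartingServices
    private final List<String> pending = new ArrayList<>();    // stand-in for mPendingServices
    private final List<String> started = new ArrayList<>();
    private final boolean requeueOnTimeout;

    RequeueOnTimeoutSketch(boolean requeueOnTimeout) {
        this.requeueOnTimeout = requeueOnTimeout;
    }

    void appDied(String service) {
        restarting.add(service);
    }

    void performServiceRestart() {
        pending.addAll(restarting); // the service now waits for the new process
        restarting.clear();
    }

    void processStartTimedOut() {
        if (requeueOnTimeout) {
            restarting.addAll(pending); // solution 2: hand the service back
        }
        pending.clear();
    }

    void attachApplication() {
        started.addAll(pending);
        pending.clear();
        // attachApplicationLocked re-posts the restarter for services still in
        // mRestartingServices; the model simply runs that restart inline.
        started.addAll(restarting);
        restarting.clear();
    }

    // Replays the abnormal order from section 4.2.
    static List<String> abnormalOrder(boolean requeueOnTimeout) {
        RequeueOnTimeoutSketch m = new RequeueOnTimeoutSketch(requeueOnTimeout);
        m.appDied("SystemUIService");
        m.performServiceRestart();
        m.processStartTimedOut();
        m.attachApplication();
        return new ArrayList<>(m.started);
    }

    public static void main(String[] args) {
        System.out.println(abnormalOrder(false)); // []
        System.out.println(abnormalOrder(true));  // [SystemUIService]
    }
}
```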

Reposted from blog.csdn.net/aa787282301/article/details/81413242