All source analysis in this article is based on android-13.0.0_r31; see the Watchdog source for details.

Function

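Watchdog checks whether important system services or threads are blocked, so that the system does not stay hung: when it detects a hang, it kills its own process, which causes the system process to restart. It is effectively an "ANR" detector for the system itself. It also receives the reboot broadcast sent by system components and reboots the system in response.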

How it works

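Roughly speaking, Watchdog runs on a thread with an infinite loop and schedules check tasks inside the loop body. Checks of system services are carried out by one dedicated thread (FgThread); every other monitored thread checks itself. After scheduling a round of checks, Watchdog blocks for a fixed period, then inspects the result for every monitored object (service or thread). If any service or thread is still blocked once the timeout has elapsed, Watchdog restarts the system process.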
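The loop described above can be sketched in plain Java, outside Android. This is only an illustration under stated assumptions: a single-thread ExecutorService stands in for a Handler and its message queue, and the names MiniWatchdogDemo, Checker, and detectsHang are hypothetical, not AOSP code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MiniWatchdogDemo {
    /** Posts itself to the watched executor; running at all proves the queue is not stuck. */
    static final class Checker implements Runnable {
        private final ExecutorService mTarget;
        private volatile boolean mCompleted = true;

        Checker(ExecutorService target) { mTarget = target; }

        void schedule() {
            if (!mCompleted) return;   // a check is already in flight
            mCompleted = false;
            mTarget.execute(this);
        }

        boolean isCompleted() { return mCompleted; }

        @Override
        public void run() { mCompleted = true; }
    }

    /**
     * Schedules a check, sleeps for intervalMs, and inspects the result.
     * Returns true once the check has stayed incomplete for maxMissedRounds
     * intervals in a row. Unlike the real watchdog, it returns false as soon
     * as one round completes, so that the demo terminates.
     */
    static boolean detectsHang(ExecutorService target, long intervalMs, int maxMissedRounds) {
        Checker checker = new Checker(target);
        int missed = 0;
        while (missed < maxMissedRounds) {
            checker.schedule();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (checker.isCompleted()) return false;
            missed++;
        }
        return true;
    }

    /** Returns { result for an idle executor, result for a blocked executor }. */
    static boolean[] demo() {
        ExecutorService healthy = Executors.newSingleThreadExecutor();
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy the only worker thread so the checker can never run.
            stuck.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            return new boolean[] {
                detectsHang(healthy, 100, 2),
                detectsHang(stuck, 100, 2)
            };
        } finally {
            release.countDown();
            healthy.shutdownNow();
            stuck.shutdownNow();
        }
    }
}
```

Two missed intervals play the role of the two half-timeout waits in the real loop; the interval length here is arbitrary.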
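Three classes and interfaces matter here.
The first is the Monitor interface. It has a single method, monitor, and is implemented by the monitored objects (the individual system services). Take the InputManagerService implementation as an example.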

    // Called by the heartbeat to ensure locks are not held indefinitely (for deadlock detection).
    @Override
    public void monitor() {
        synchronized (mInputFilterLock) { }
        synchronized (mAssociationsLock) { /* Test if blocked by associations lock. */ }
        synchronized (mLidSwitchLock) { /* Test if blocked by lid switch lock. */ }
        synchronized (mInputMonitors) { /* Test if blocked by input monitor lock. */ }
        synchronized (mAdditionalDisplayInputPropertiesLock) { /* Test if blocked by props lock */ }
        mNative.monitor();
    }

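All it does is acquire each synchronization lock in turn; if every lock can be acquired, the service is running normally. Most Monitors are implemented this way. The one exception is BinderThreadMonitor, which checks whether the Binder thread pool is healthy by testing whether a Binder thread is available. The code is as follows: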

    /** Monitor for checking the availability of binder threads. The monitor will block until
     * there is a binder thread available to process in coming IPCs to make sure other processes
     * can still communicate with the service.
     */
    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }
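To see why simply acquiring a lock works as a deadlock probe, here is a small sketch in plain Java. It is an assumption-laden illustration, not AOSP code: LockProbeDemo, DemoService, and probe are hypothetical names, and a helper thread with a timed join stands in for the HandlerChecker and its timeout.

```java
import java.util.concurrent.CountDownLatch;

class LockProbeDemo {
    interface Monitor { void monitor(); }

    /** Hypothetical service whose monitor() only acquires and releases its lock. */
    static final class DemoService implements Monitor {
        final Object mLock = new Object();

        @Override
        public void monitor() {
            synchronized (mLock) { /* Test if blocked by mLock. */ }
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMs. */
    static boolean probe(Monitor m, long timeoutMs) {
        Runnable task = m::monitor;
        Thread t = new Thread(task, "probe");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    /** Returns { probe result with the lock free, probe result with the lock held }. */
    static boolean[] demo() {
        DemoService service = new DemoService();
        boolean freeOk = probe(service, 500);

        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Simulate a stuck service thread that never lets go of the lock.
        Thread holder = new Thread(() -> {
            synchronized (service.mLock) {
                held.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();

        boolean heldOk;
        try {
            held.await();
            heldOk = probe(service, 500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            heldOk = false;
        } finally {
            release.countDown();
        }
        return new boolean[] { freeOk, heldOk };
    }
}
```

With the lock free the probe returns almost immediately; while another thread holds the lock the probe stays parked inside monitor(), which is exactly the condition the watchdog interprets as a blocked service.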

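The second is the HandlerChecker class. A HandlerChecker holds two Monitor lists. A newly added Monitor first goes into the temporary list mMonitorQueue and is moved to the working list mMonitors only when the next round of checks is scheduled, which guarantees that the list never changes halfway through an iteration. HandlerChecker implements Runnable: its run method calls Monitor.monitor() on each entry in turn and then sets the mCompleted flag to mark the round as finished. It does not start a thread of its own; instead it posts itself to the Handler passed to its constructor. That Handler runs the code that checks whether the services (Monitors) are blocked, and at the same time it serves as a check on its own message queue: if the HandlerChecker message gets processed at all, the queue is not stuck. This is exactly how threads are checked. In the code below, the constructor takes the Handler and a name; the blocking timeout is passed to scheduleCheckLocked on each round, and the system default is 60 seconds. The two methods to focus on are scheduleCheckLocked and run.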

    /**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>(); // Monitors being checked
        private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>(); // Monitors waiting to be added
        private long mWaitMax;
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;
        private int mPauseCount;

        HandlerChecker(Handler handler, String name) {
            mHandler = handler;
            mName = name;
            mCompleted = true;
        }

        void addMonitorLocked(Monitor monitor) {
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            mMonitorQueue.add(monitor);
        }

        public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
            mWaitMax = handlerCheckerTimeoutMillis;
            if (mCompleted) {
                // Safe to update monitors in queue, Handler is not in the middle of work
                mMonitors.addAll(mMonitorQueue);
                mMonitorQueue.clear();
            }
            if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                    || (mPauseCount > 0)) {
                // Don't schedule until after resume OR
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread. Note that we
                // only do this if we have no monitors since those would need to
                // be executed at this point.
                mCompleted = true;
                return;
            }
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
        ...
        @Override
        public void run() {
            // Once we get here, we ensure that mMonitors does not change even if we call
            // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
            // move them to mMonitors on the next schedule when mCompleted is true, at which
            // point we have completed execution of this method.
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (mLock) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (mLock) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

        /** Pause the HandlerChecker. */
        public void pauseLocked(String reason) {
            mPauseCount++;
            // Mark as completed, because there's a chance we called this after the watchog
            // thread loop called Object#wait after 'WAITED_HALF'. In that case we want to ensure
            // the next call to #getCompletionStateLocked for this checker returns 'COMPLETED'
            mCompleted = true;
            Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
                    + reason + ". Pause count: " + mPauseCount);
        }

        /** Resume the HandlerChecker from the last {@link #pauseLocked}. */
        public void resumeLocked(String reason) {
            if (mPauseCount > 0) {
                mPauseCount--;
                Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                        + reason + ". Pause count: " + mPauseCount);
            } else {
                Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
            }
        }
    }
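The time thresholds in getCompletionStateLocked are easier to see with concrete numbers. The sketch below restates the same branching as a pure function so it can be run outside Android; CompletionStateDemo and completionState are hypothetical names, and the latency is passed in directly instead of being read from SystemClock.

```java
class CompletionStateDemo {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /**
     * Mirrors the branching of getCompletionStateLocked.
     * latencyMs is the time since the check was posted; waitMaxMs is the timeout.
     */
    static int completionState(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;       // still within the first half of the timeout
        } else if (latencyMs < waitMaxMs) {
            return WAITED_HALF;   // past half the timeout, not yet expired
        }
        return OVERDUE;           // timeout reached or exceeded
    }
}
```

With the default 60-second timeout, an unfinished check is WAITING for the first 30 seconds, WAITED_HALF from 30 up to 60 seconds, and OVERDUE from 60 seconds on.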

The third is the Watchdog class itself. Two of its member fields matter here:

    private final ArrayList<HandlerCheckerAndTimeout> mHandlerCheckers = new ArrayList<>();
    private final HandlerChecker mMonitorChecker;

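mMonitorChecker is the HandlerChecker for system services. mHandlerCheckers holds every HandlerChecker, including mMonitorChecker as well as the others. All of them are instantiated in the Watchdog constructor.
The method to focus on is run. A condensed version follows: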

    private void run() {
        boolean waitedHalf = false;
        while (true) {
            ...
            boolean doWaitedHalfDump = false;
            final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
            final long checkIntervalMillis = watchdogTimeoutMillis / 2;
            synchronized (mLock) {
                long timeout = checkIntervalMillis;
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                    hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                            .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
                }
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    try {
                        mLock.wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
                }

                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        waitedHalf = true;
                        subject = describeCheckersLocked(blockedCheckers);
                        pids = new ArrayList<>(mInterestingJavaPids);
                        doWaitedHalfDump = true;
                    } else {
                        continue;
                    }
                } else {
                    // something is overdue!
                    blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            } // END synchronized (mLock)
            logWatchog(doWaitedHalfDump, subject, pids);
            if (doWaitedHalfDump) {
                continue;
            }
            IActivityController controller;
            synchronized (mLock) {
                controller = mController;
            }
            if (controller != null) {
                try {
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }
            ...
            Process.killProcess(Process.myPid());
            System.exit(10);
            ...
        }
    }

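Each iteration of the loop calls HandlerChecker.scheduleCheckLocked() on every checker, waits for half the timeout (30 seconds with the default), and then calls evaluateCheckerCompletionLocked to inspect the results of all HandlerCheckers. There are four possible states: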

    // These are temporally ordered: larger values as lateness increases
    private static final int COMPLETED = 0;
    private static final int WAITING = 1;
    private static final int WAITED_HALF = 2;
    private static final int OVERDUE = 3;

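The overall result is the largest value among all HandlerCheckers. COMPLETED or WAITING means everything is normal, and a new round of checks begins. WAITED_HALF means at least one HandlerChecker has been waiting for more than half the timeout without finishing; Watchdog prints some logs and then moves on to the next round. OVERDUE means at least one HandlerChecker has timed out, i.e. the system is hung. Execution then falls through to the code that kills its own process (SystemServer), which restarts the system, unless an attached IActivityController returns a non-negative value from systemNotResponding, in which case Watchdog keeps waiting. One point to note is that each HandlerChecker moves through these states independently: when HandlerChecker.scheduleCheckLocked runs, some HandlerCheckers may be in the COMPLETED state while others are in WAITED_HALF.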
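The "take the largest state" rule and the resulting decision can be sketched as follows. This is a simplified stand-in, not the AOSP implementation: EvaluateDemo, evaluate, and nextStep are hypothetical names, the states are passed in as plain ints, and the IActivityController veto is left out.

```java
class EvaluateDemo {
    // These are temporally ordered: larger values as lateness increases.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** The overall state is the worst (largest) state among all checkers. */
    static int evaluate(int... checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** What the watchdog loop does next for a given overall state. */
    static String nextStep(int waitState, boolean waitedHalfAlready) {
        switch (waitState) {
            case COMPLETED:
            case WAITING:
                return "next round";
            case WAITED_HALF:
                // Logs are dumped only the first time WAITED_HALF is seen.
                return waitedHalfAlready ? "next round" : "dump logs, then next round";
            default:
                return "kill system_server";
        }
    }
}
```

Because the states are ordered by lateness, a single slow checker is enough to push the overall result to WAITED_HALF or OVERDUE even when every other checker has completed.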

Flow

The key parts of SystemServer are as follows:

    /**
     * The main entry point from zygote.
     */
    public static void main(String[] args) {
        new SystemServer().run();
    }

    private void run() {
        ...
        // Start services.
        try {
            t.traceBegin("StartServices");
            startBootstrapServices(t);
            startCoreServices(t);
            startOtherServices(t);
            startApexServices(t);
        } catch (Throwable ex) {
            throw ex;
        } finally {
            t.traceEnd(); // StartServices
        }
        ...
    }

    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        ...
        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        mDumper.addDumpable(watchdog);
        t.traceEnd();
        ...

        mActivityManagerService = ActivityManagerService.Lifecycle.startService(
                mSystemServiceManager, atm);
        ...

        watchdog.init(mSystemContext, mActivityManagerService);
    }

As you can see, Watchdog is a singleton, and it is instantiated and started early in SystemServer's service startup. Let's look at the getInstance and start methods.

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }

    private Watchdog() {
        mThread = new Thread(this::run, "watchdog"); // Create the Watchdog worker thread

        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.
        //
        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread"); // Create mMonitorChecker
        mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread")));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(UiThread.getHandler(), "ui thread")));
        // And also check IO thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(IoThread.getHandler(), "i/o thread")));
        // And the display thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(DisplayThread.getHandler(), "display thread")));
        // And the animation thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(AnimationThread.getHandler(), "animation thread")));
        // And the surface animation thread.
        mHandlerCheckers.add(withDefaultTimeout(
                new HandlerChecker(SurfaceAnimationThread.getHandler(),
                    "surface animation thread")));
        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
        ...
    }

    /**
     * Called by SystemServer to cause the internal thread to begin execution.
     */
    public void start() {
        mThread.start();
    }
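getInstance is straightforward, so let's turn to the Watchdog constructor. It instantiates mMonitorChecker and adds it to the mHandlerCheckers list, along with a number of other HandlerCheckers. From the code above, Watchdog monitors these threads: FgThread, "main thread", UiThread, IoThread, DisplayThread, AnimationThread, and SurfaceAnimationThread. Services that implement the Monitor interface generally register themselves with Watchdog through addMonitor during their own initialization. The start method is also simple: it starts Watchdog's worker thread. From this point on, Watchdog can monitor whether important system services are blocked. Now let's look at Watchdog.init: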

    public void init(Context context, ActivityManagerService activity) {
        mActivity = activity;
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
        }
    }

    void rebootSystem(String reason) {
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

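This is simple as well: it registers a broadcast receiver for reboot requests, which receives the reboot broadcast from system components and reboots the system (in the code above, only when the intent carries a non-zero "nowait" extra). Because it has to register a broadcast receiver, init is called only after ActivityManagerService has started.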
