Android Watchdog Mechanism

纪正德

2023-12-01

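Early phone platforms usually added a hardware watchdog (WatchDog) to the device. The software had to write a value to the watchdog hardware at regular intervals to show that it was still healthy (commonly called "feeding the dog"); if it missed the deadline, the watchdog rebooted the device. In outline: once the system is running, the watchdog counter starts and counts automatically. If the counter is not cleared within the allotted time, it overflows and raises a watchdog interrupt, which resets the system.

A phone is in effect a vastly more powerful microcontroller: it runs many times faster, has many times more storage, runs many threads, and has all sorts of hardware and software working together. Android's SystemServer is a very complex process that hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running in SystemServer need to be monitored.

Using the hardware watchdog approach, with every thread feeding the dog at intervals, would waste CPU and complicate program design. Android therefore provides the Watchdog class as a software watchdog that monitors the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.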

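What Watchdog does

Watchdog mainly detects two kinds of problems:

Blocked in monitor: the monitor() implementation of a monitored service is blocked
Blocked in handler: the message queue of a monitored thread is not processing messages

How it decides whether a thread is stuck

MessageQueue.isPolling
Monitor.monitor
---
HandlerChecker: checks whether a Looper is blocked
monitor: checks for deadlock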

How Watchdog works

Figure: how Watchdog works https://img-blog.csdnimg.cn/img_convert/e5c8133c7f86583251c775de4ceae9c0.jpeg

Starting Watchdog

Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the Android services, and that includes initializing and starting Watchdog:

final Watchdog watchdog = Watchdog.getInstance();//line: 864
watchdog.init(context, mActivityManagerService);

In the second half of startOtherServices() in SystemServer, Watchdog is started inside the callback passed to the systemReady interface of AMS (ActivityManagerService):

Watchdog.getInstance().start();//line: 1852

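The Watchdog constructor

super("watchdog");
// Initialize a handler checker for each common thread we want to check.
// The background thread is not checked here.
// The shared foreground thread is the main checker; monitor checks are also dispatched on it.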
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                                     "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add a checker for the main thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                                        "main thread", DEFAULT_TIMEOUT));
// Add a checker for the shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                                        "ui thread", DEFAULT_TIMEOUT));
// Add a checker for the shared I/O thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                                        "i/o thread", DEFAULT_TIMEOUT));
// Add a checker for the shared display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                                        "display thread", DEFAULT_TIMEOUT));

// Initialize the monitor for Binder threads.
addMonitor(new BinderThreadMonitor());

mOpenFdMonitor = OpenFdMonitor.create();

// See the notes on DEFAULT_TIMEOUT.
assert DB ||
    DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

The constructor creates several HandlerChecker objects and adds them to Watchdog's own list of checkers.

Handlers monitored by Watchdog

Thread name	Handler	Description	Timeout
foreground thread	FgThread.getHandler()	Foreground thread	60s
main thread	new Handler(Looper.getMainLooper())	Main thread	60s
ui thread	UiThread.getHandler()	UI thread	60s
i/o thread	IoThread.getHandler()	I/O thread	60s
display thread	DisplayThread.getHandler()	Display thread	60s

PackageManager	addThread(mHandler, time)	Thread added by PackageManagerService	10min
PackageManager	addThread(mHandler, time)	Thread added by PermissionManagerService	60s
PowerManagerService	addThread(mHandler, time)	Thread added by PowerManagerService	60s
ActivityManagerService	addThread(mHandler, time)	Thread added by ActivityManagerService	60s

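Monitors registered with Watchdog

Monitor name	Description	Timeout
BinderThreadMonitor	Checks the Binder threads	60s
OpenFdMonitor	Checks whether the open file descriptor high-water mark has been reached	none (checked on every loop iteration)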

TvRemoteService	addMonitor(this) mLock
ActivityManagerService	addMonitor(this) this
MediaProjectionManagerService	addMonitor(this) mLock
MediaRouterService	addMonitor(this) mLock
MediaSessionService	addMonitor(this) mLock
InputManagerService	addMonitor(this) mInputFilterLock nativeMonitor(mPtr);
PowerManagerService	addMonitor(this) mLock
NetworkManagementService	addMonitor(this) mConnector
StorageManagerService	addMonitor(this) mVold
WindowManagerService	addMonitor(this) mWindowMap

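HandlerChecker

public final class HandlerChecker implements Runnable

HandlerChecker checks the state of handler threads and schedules monitor callbacks. It works by inspecting the MessageQueue of each Handler's Looper to decide whether the thread is stuck. The threads in question all run inside the SystemServer process.

Watchdog builds many HandlerCheckers, which fall into two categories:

Monitor Checker: checks for possible deadlocks on Monitor objects. Core system services such as AMS and WMS are Monitor objects.
Looper Checker: checks whether a thread's message queue has been busy for too long. The queue Watchdog itself uses for monitor checks and the global ui, i/o and display queues are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added as Looper Checkers when the corresponding objects are initialized.

The two categories have different emphases:

Monitor Checker warns against holding the object lock of a core system service for a long time, because that blocks many functions.
Looper Checker warns against hogging a message queue for a long time, because other messages then cannot be processed.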

The HandlerChecker constructor

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler; // Handler of the monitored thread
        mName = name; // name of the checker
        mWaitMax = waitMaxMillis; // timeout for a check
        mCompleted = true; // no check is in flight initially
    }
}

HandlerChecker::scheduleCheckLocked

Watchdog's run() method calls this. It is HandlerChecker's core method and starts a check of whether the thread behind the HandlerChecker is blocked or deadlocked.

public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread.  Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

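isPolling() is the key test of whether the target thread's Looper is idle. If it returns true and this checker has no monitors, the Looper is polling for events and is therefore healthy, so the checker is marked completed and the method returns without posting anything.
If mCompleted is false, a check is already in flight, so nothing more is scheduled.
Otherwise mHandler.postAtFrontOfQueue(this) posts the checker to the front of the target queue; once the thread picks it up, run() executes.

In practice scheduleCheckLocked mostly does real work for mMonitorChecker. A HandlerChecker with no registered monitors whose Looper is polling is not probed at all; UiThread, for example, spends most of its time polling.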
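The probe idea behind scheduleCheckLocked can be tried outside Android. The following is a minimal sketch, assuming a single-thread ExecutorService stands in for the monitored Looper thread; `ProbeChecker` and its methods are hypothetical names, not AOSP code. Unlike the real method it has no isPolling() shortcut, and an executor appends the probe to the end of the queue rather than the front.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for HandlerChecker: posts a probe task to a worker
// thread and records whether the thread got around to running it.
class ProbeChecker implements Runnable {
    private final ExecutorService worker; // assumption: stands in for the Handler/Looper
    private boolean completed = true;     // no probe in flight initially
    private long startNanos;

    ProbeChecker(ExecutorService worker) {
        this.worker = worker;
    }

    // Analogous to scheduleCheckLocked(): start a probe unless one is pending.
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // a probe is already in flight
        }
        completed = false;
        startNanos = System.nanoTime();
        worker.execute(this);
    }

    // Runs on the worker thread; reaching this point proves the queue is moving.
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // How long the current probe has been pending, or 0 if none is pending.
    synchronized long pendingMillis() {
        return completed ? 0 : TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }
}
```

If the worker is stuck in a long task, the probe stays pending and pendingMillis() keeps growing, which is the signal a watchdog loop would compare against its timeout.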

MessageQueue::isPolling

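mHandler.getLooper().getQueue().isPolling() tells whether the target thread is stuck.
true: the looper is currently polling for events.

The method is implemented in MessageQueue. Its doc comment says that it returns whether the looper's thread is currently polling for more work to do, which is a good signal that the loop is still alive.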

frameworks/base/core/java/android/os/MessageQueue.java

/**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
public boolean isPolling() {
    synchronized (this) {
        return isPollingLocked();
    }
}

HandlerChecker::run

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

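run() iterates over the checker's own monitors and calls monitor() on each. If one of them blocks, mCompleted stays false.
The for loop detects blocking among the registered monitors, and only mMonitorChecker ever enters it.
Every other HandlerChecker has an empty mMonitors, so it skips the loop.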

HandlerChecker::getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

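This returns the completion state. mStartTime is set in scheduleCheckLocked.
When the state is queried while a check is still incomplete, the else branch computes the elapsed time and returns the matching state code.

Thread states

State	Description
COMPLETED	The message has been handled; the thread is not blocked
WAITING	Handling has taken 0 to 29 seconds; keep waiting
WAITED_HALF	Handling has taken 30 to 59 seconds; the thread may be blocked, so dump the current AMS stack traces and keep watching
OVERDUE	Handling has taken 60 seconds or more; prepare to kill the process. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout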
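The thresholds in the table follow directly from mWaitMax. The sketch below restates getCompletionStateLocked as a pure function so the boundaries can be checked; `CompletionState` is a hypothetical class, and the numeric values of WAITING, WAITED_HALF and OVERDUE are assumptions chosen so that a larger value means a worse state.

```java
// Hypothetical standalone version of the state computation shown above.
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;      // assumed value
    static final int WAITED_HALF = 2;  // assumed value
    static final int OVERDUE = 3;      // assumed value

    private CompletionState() {}

    // completed:      whether the probe has already run
    // latencyMillis:  time since the probe was scheduled
    // waitMaxMillis:  timeout configured for this checker
    static int of(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }
}
```

With the default 60-second timeout, a latency of 30 000 ms is already WAITED_HALF and 60 000 ms is OVERDUE. With the 10-minute timeout registered for the PackageManager thread, the same 60 000 ms is still WAITING.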

HandlerThread inheritance

Apart from the main thread, the Handlers passed to these HandlerCheckers belong to threads built on HandlerThread:

java.lang.Object
  ↳ Thread implements Runnable
    ↳ HandlerThread extends Thread
      ↳ ServiceThread extends HandlerThread
        ↳ FgThread extends ServiceThread

Threads behind the initial HandlerCheckers

public ServiceThread(String name, int priority, boolean allowIo)

private FgThread() {
    super("android.fg", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private UiThread() {
    super("android.ui", Process.THREAD_PRIORITY_FOREGROUND, false /*allowIo*/);
}

private IoThread() {
    super("android.io", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private DisplayThread() {
    // DisplayThread runs important work, but it is less important than the work in AnimationThread.
    // The priority is therefore set one step lower.
    super("android.display", Process.THREAD_PRIORITY_DISPLAY + 1, false /*allowIo*/);
}

Android thread priorities

frameworks/base/core/java/android/os/Process.java

public static final int THREAD_PRIORITY_DEFAULT = 0; // default thread priority
public static final int THREAD_PRIORITY_LOWEST = 19; // lowest thread priority
public static final int THREAD_PRIORITY_BACKGROUND = 10; // recommended for background threads
public static final int THREAD_PRIORITY_FOREGROUND = -2; // UI threads the user is interacting with; code cannot set this, the system moves threads to it as needed
public static final int THREAD_PRIORITY_DISPLAY = -4; // also for UI-related work, but ahead of THREAD_PRIORITY_FOREGROUND
public static final int THREAD_PRIORITY_URGENT_DISPLAY = -8; // highest display priority, for drawing frames and retrieving input events
public static final int THREAD_PRIORITY_AUDIO = -16; // standard priority for audio threads
public static final int THREAD_PRIORITY_URGENT_AUDIO = -19; // highest audio priority, ahead of THREAD_PRIORITY_AUDIO
public static final int THREAD_PRIORITY_MORE_FAVORABLE = -1; // slightly ahead of THREAD_PRIORITY_DEFAULT
public static final int THREAD_PRIORITY_LESS_FAVORABLE = 1; // slightly behind THREAD_PRIORITY_DEFAULT

An application sets a thread priority as shown below. Some levels cannot be set by applications and are assigned by the system.

Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND +
                Process.THREAD_PRIORITY_LESS_FAVORABLE);

describeBlockedStateLocked

public String describeBlockedStateLocked() {
    if (mCurrentMonitor == null) {
        return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    } else {
        return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
            + " on " + mName + " (" + getThread().getName() + ")";
    }
}

This builds the message that says which handler or monitor is blocked.

Monitor

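Monitor is an interface that a service implements so that Watchdog can call monitor() and check whether the service's lock can still be acquired.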

public interface Monitor {
    void monitor();
}

Classes that implement Watchdog.Monitor

ActivityManagerService
WindowManagerService
PowerManagerService
InputManagerService
MediaSessionService
MediaRouterService
StorageManagerService
NetworkManagementService
NativeDaemonConnector
MediaProjectionManagerService
TvRemoteService

BinderThreadMonitor
OpenFdMonitor

Several classes implement the Monitor interface; the list above comes from searching the Android 9.0 source.

Using Watchdog

All of these classes implement the interface and register themselves with Watchdog. AMS, for example:

public class ActivityManagerService extends IActivityManager.Stub
    implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ......
    public ActivityManagerService(Context systemContext) {
        ......
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ......
    }
    ......
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
    ......
}
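Why an empty synchronized block is enough to detect a stuck lock can be seen in plain Java. This is a minimal sketch, not the AOSP mechanism: `LockedService` and `MonitorProbe` are hypothetical, and a helper thread with a latch replaces the HandlerChecker that would normally call monitor() on the foreground thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical service whose monitor() only tries to take the service lock,
// in the same spirit as ActivityManagerService.monitor() above.
class LockedService {
    private final Object lock = new Object();

    // Simulates a buggy code path that holds the lock until told to release it.
    void holdLockUntil(CountDownLatch acquired, CountDownLatch release)
            throws InterruptedException {
        synchronized (lock) {
            acquired.countDown();
            release.await();
        }
    }

    // Returns only once the lock can be acquired.
    void monitor() {
        synchronized (lock) { }
    }
}

class MonitorProbe {
    // Runs service.monitor() on a helper thread and reports whether it
    // returned within the timeout. A false result suggests the lock is stuck.
    static boolean monitorReturnsWithin(LockedService service, long timeoutMillis)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "monitor-probe");
        probe.setDaemon(true);
        probe.start();
        return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

While another thread sits inside holdLockUntil, monitorReturnsWithin returns false; once the lock is released, the same call returns true.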

Watchdog::addThread

public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT); //60s
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

addThread passes a thread's Handler to Watchdog, which creates a new HandlerChecker for that Handler
and adds it to the list of checkers.

Watchdog::addMonitor

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}

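addMonitor hands a monitor to Watchdog, which later calls its monitor() method to determine whether it blocks.
All monitors are added to mMonitorChecker, so mMonitorChecker is the only checker that holds any monitors.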
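The difference between the two registration paths can be sketched as follows. `CheckerRegistry` is a hypothetical simplification: it stores names and Runnables instead of Handlers and Monitor objects, and a boolean `started` flag replaces the isAlive() check of the real Thread subclass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical registry: addThread creates a new checker, while addMonitor
// appends to the single shared monitor checker.
class CheckerRegistry {
    static final class Checker {
        final String name;
        final List<Runnable> monitors = new ArrayList<>();

        Checker(String name) {
            this.name = name;
        }
    }

    private final List<Checker> checkers = new ArrayList<>();
    private final Checker monitorChecker = new Checker("foreground thread");
    private boolean started;

    CheckerRegistry() {
        checkers.add(monitorChecker);
    }

    synchronized void addThread(String threadName) {
        ensureNotStarted("Threads");
        checkers.add(new Checker(threadName));
    }

    synchronized void addMonitor(Runnable monitor) {
        ensureNotStarted("Monitors");
        monitorChecker.monitors.add(monitor);
    }

    synchronized void start() {
        started = true;
    }

    synchronized int checkerCount() {
        return checkers.size();
    }

    synchronized int monitorCount() {
        return monitorChecker.monitors.size();
    }

    private void ensureNotStarted(String what) {
        if (started) {
            throw new RuntimeException(what + " can't be added once the Watchdog is running");
        }
    }
}
```

Registering two monitors and one thread leaves two checkers in total, with both monitors held by the foreground checker; any registration after start() is rejected.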

Watchdog::run()

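This is the core method of Watchdog: it checks for deadlocked threads and blocked Loopers, collects diagnostics, and kills the system_server process so that the system restarts.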

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                // Call scheduleCheckLocked() on every HandlerChecker
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis(); 
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }

            if (!fdLimitTriggered) {
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) { // all threads are healthy; start the next round
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) { // blocked, but for less than 30 s; keep watching
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) { // blocked for more than 30 s; dump some system state, then watch for another 30 s
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                                               getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
            !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                    "watchdog", null, "system_server", null, null,
                    subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

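run() is an infinite loop: it schedules a check on every HandlerChecker, waits 30 seconds, and then evaluates the states.

First it iterates over all HandlerCheckers and calls scheduleCheckLocked on each, which records the start time: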

for (int i=0; i<mHandlerCheckers.size(); i++) {
    HandlerChecker hc = mHandlerCheckers.get(i);
    hc.scheduleCheckLocked();
}

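Wait 30 seconds

// Wait 30 seconds (CHECK_INTERVAL)
// uptimeMillis() is used so that time spent asleep is not counted; while the device sleeps, the system services sleep too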
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
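The inner while loop exists because wait(timeout) may return early, so the remaining time is recomputed after every wakeup. A minimal sketch of the same pattern in plain Java follows; `IntervalWaiter` is a hypothetical helper, and System.nanoTime() is used here as a stand-in for SystemClock.uptimeMillis().

```java
// Hypothetical helper showing the "recompute the remaining time" wait loop.
final class IntervalWaiter {
    private final Object lock = new Object();

    // Blocks for at least intervalMillis, even if wait() returns early because
    // of a notify or a spurious wakeup. Returns the elapsed time in milliseconds.
    long waitFullInterval(long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the loop above, keep waiting after an interrupt.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Wakes the waiter early; the loop above then goes back to waiting.
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}
```

A poke() from another thread wakes the waiter, but the loop sees that time remains and waits again, so the full interval is always honored.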

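Next it evaluates the checkers: it iterates over all HandlerCheckers and takes the largest state returned.
The largest state is one of four values:

COMPLETED: the message has been handled; the thread is not blocked
WAITING: handling has taken 0 to 29 seconds; keep waiting
WAITED_HALF: handling has taken 30 to 59 seconds; the thread may be blocked, so dump the current AMS stack traces and keep watching
OVERDUE: handling has taken 60 seconds or more; prepare to kill the process. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout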

boolean fdLimitTriggered = false;
if (mOpenFdMonitor != null) {
    fdLimitTriggered = mOpenFdMonitor.monitor();
}
if (!fdLimitTriggered) {
    final int waitState = evaluateCheckerCompletionLocked();
    if (waitState == COMPLETED) {
        // The monitors have returned; reset
        waitedHalf = false;
        continue;
    } else if (waitState == WAITING) {
        // still waiting but within their configured intervals; back off and recheck
        continue;
    } else if (waitState == WAITED_HALF) {
        if (!waitedHalf) {
            // We've waited half the deadlock-detection interval.  Pull a stack
            // trace and wait another half.
            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                                   getInterestingNativePids());
            waitedHalf = true;
        }
        continue;
    }

    // something is overdue!
    blockedCheckers = getBlockedCheckersLocked();
    subject = describeCheckersLocked(blockedCheckers);
} else {
    blockedCheckers = Collections.emptyList();
    subject = "Open FD high water mark reached";
}

fdMonitor

public boolean monitor() {
    if (mFdHighWaterMark.exists()) {
        dumpOpenDescriptors();
        return true;
    }
    return false;
}

Collect information
Kill the system process

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);

HandlerChecker::scheduleCheckLocked

HandlerChecker::run

Watchdog::evaluateCheckerCompletionLocked

This evaluates the checkers: it iterates over all HandlerCheckers and takes the largest state returned.

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;// COMPLETED = 0
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
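As a worked example of the max rule, the sketch below evaluates a hypothetical set of checker states. `WorstState` is illustrative only; the constant values assume the ordering COMPLETED < WAITING < WAITED_HALF < OVERDUE, of which only COMPLETED = 0 appears in the code above.

```java
import java.util.List;

// Hypothetical restatement of evaluateCheckerCompletionLocked over plain ints.
final class WorstState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;      // assumed value
    static final int WAITED_HALF = 2;  // assumed value
    static final int OVERDUE = 3;      // assumed value

    private WorstState() {}

    // The overall state is the worst (largest) state of any single checker.
    static int evaluate(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

If the foreground checker is COMPLETED, the main thread checker is WAITING and the ui thread checker is WAITED_HALF, the overall result is WAITED_HALF: a single slow thread is enough to move the whole watchdog to the next stage.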

HandlerChecker::getCompletionStateLocked

Watchdog::getBlockedCheckersLocked

Watchdog::describeCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

private String describeCheckersLocked(List<HandlerChecker> checkers) {
    StringBuilder builder = new StringBuilder(128);
    for (int i=0; i<checkers.size(); i++) {
        if (builder.length() > 0) {
            builder.append(", ");
        }
        builder.append(checkers.get(i).describeBlockedStateLocked());
    }
    return builder.toString();
}

These two methods build the description of the blocked or deadlocked threads.
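How the subject string is assembled from the overdue checkers can be sketched without Android classes. `CheckerInfo` and `SubjectBuilder` are hypothetical simplifications that keep only the fields needed for the message; the message format is copied from describeBlockedStateLocked.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified checker record used only for this sketch.
final class CheckerInfo {
    final String name;           // e.g. "main thread"
    final String threadName;     // e.g. "main"
    final String blockedMonitor; // class name of the stuck monitor, or null
    final boolean overdue;

    CheckerInfo(String name, String threadName, String blockedMonitor, boolean overdue) {
        this.name = name;
        this.threadName = threadName;
        this.blockedMonitor = blockedMonitor;
        this.overdue = overdue;
    }

    // Same two message shapes as describeBlockedStateLocked().
    String describe() {
        if (blockedMonitor == null) {
            return "Blocked in handler on " + name + " (" + threadName + ")";
        }
        return "Blocked in monitor " + blockedMonitor + " on " + name + " (" + threadName + ")";
    }
}

final class SubjectBuilder {
    private SubjectBuilder() {}

    // Keeps only overdue checkers and joins their descriptions with ", ".
    static String subjectFor(List<CheckerInfo> all) {
        List<String> parts = new ArrayList<>();
        for (CheckerInfo c : all) {
            if (c.overdue) {
                parts.add(c.describe());
            }
        }
        return String.join(", ", parts);
    }
}
```

With the main and ui checkers overdue and the display checker healthy, the result is "Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)", which is the text that follows "WATCHDOG KILLING SYSTEM PROCESS" in the log.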

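Note

monitor() checks for deadlocks, which involve several threads contending for locks, whereas the handler check looks at whether one particular thread, such as a service's main thread, is blocked. Sending a message to a thread can therefore only show whether that thread's message loop is blocked; it cannot detect a deadlock. If another thread holds a lock for a long time without releasing it (for example inside a synchronized block) and the main thread is not currently waiting for that lock, a Message sent to the main thread is still handled by its Handler.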

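How to trigger it

Blocked in monitor
Hold the lock that a Monitor implementation acquires and never release it.
Blocked in handler
Make a monitored thread stall, for example by blocking for a long time in a Service's onCreate that runs on it; once the timeout expires, SystemServer is restarted.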

Logs when triggered

Two kinds of log are common: Blocked in handler and Blocked in monitor.

Blocked in handler

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......

Blocked in monitor

10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
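When triaging many such logs it can help to extract the blocked checkers mechanically. The following is a sketch under the assumption that the kill line always has the shape seen in the two samples above; `WatchdogLogParser` is a hypothetical helper, not an Android tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "WATCHDOG KILLING SYSTEM PROCESS" log line.
final class WatchdogLogParser {
    private static final String MARKER = "*** WATCHDOG KILLING SYSTEM PROCESS: ";

    // group(2): monitor class (null for handler blocks)
    // group(3): checker name, group(4): thread name
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor (\\S+)) on (.+?) \\(([^)]+)\\)");

    private WatchdogLogParser() {}

    // Returns one "kind | checker | thread" string per blocked checker,
    // or an empty list if the line is not a watchdog kill line.
    static List<String> parse(String logLine) {
        List<String> result = new ArrayList<>();
        int at = logLine.indexOf(MARKER);
        if (at < 0) {
            return result;
        }
        Matcher m = ENTRY.matcher(logLine.substring(at + MARKER.length()));
        while (m.find()) {
            String kind = m.group(2) == null ? "handler" : "monitor " + m.group(2);
            result.add(kind + " | " + m.group(3) + " | " + m.group(4));
        }
        return result;
    }
}
```

For the first sample this yields two entries, one for the main thread and one for the ui thread; for the second it yields a single monitor entry naming BinderThreadMonitor on the foreground thread.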

Watchdog log analysis

Once Watchdog decides that a SystemServer thread is deadlocked, it collects and prints diagnostics. The code is in run():

while (true) {
    // Execution reaches this point when a deadlock or a blocked message queue has been detected

    // If we got here, that means that the system is most likely hung.
    // First collect stack traces from all threads of the system process.
    // Then kill this process so that the system will restart.
    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

    ArrayList<Integer> pids = new ArrayList<>();
    pids.add(Process.myPid());
    if (mPhonePid > 0) pids.add(mPhonePid);
    // Pass !waitedHalf so that just in case we somehow wind up here without having
    // dumped the halfway stacks, we properly re-initialize the trace file.
    final File stack = ActivityManagerService.dumpStackTraces(
        !waitedHalf, pids, null, null, getInterestingNativePids());

    // Give some extra time to make sure the stack traces get written.
    // The system's been hanging for a minute, another second or two won't hurt much.
    SystemClock.sleep(2000);

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox(
                "watchdog", null, "system_server", null, null,
                subject, null, stack, null);
        }
    };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

    IActivityController controller;
    synchronized (this) {
        controller = mController;
    }
    if (controller != null) {
        Slog.i(TAG, "Reporting stuck state to activity controller");
        try {
            Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
            // 1 = keep waiting, -1 = kill system
            int res = controller.systemNotResponding(subject);
            if (res >= 0) {
                Slog.i(TAG, "Activity controller requested to coninue to wait");
                waitedHalf = false;
                continue;
            }
        } catch (RemoteException e) {
        }
    }

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }

    waitedHalf = false;
}

Write the event log

EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

Dump stack traces

ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
    !waitedHalf, pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);

Dump kernel info

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');

Collect DropBox information

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked.  (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
    public void run() {
        mActivity.addErrorToDropBox(
            "watchdog", null, "system_server", null, null,
            subject, null, stack, null);
    }
};
dropboxThread.start();
try {
    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}

Kill the system process: unless a debugger is or recently was attached, or restarting is not allowed, Watchdog kills its own process

// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
    debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
    Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
    Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
    Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
    Slog.w(TAG, "*** GOODBYE!");
    Process.killProcess(Process.myPid());
    System.exit(10);
}
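The if/else ladder above is easier to reason about as a pure function. This sketch mirrors its order of checks; `KillDecision` and its `Action` enum are hypothetical names introduced only for illustration.

```java
// Hypothetical restatement of the final decision in Watchdog.run().
final class KillDecision {
    enum Action {
        SKIP_DEBUGGER_CONNECTED,
        SKIP_DEBUGGER_WAS_CONNECTED,
        SKIP_RESTART_NOT_ALLOWED,
        KILL
    }

    private KillDecision() {}

    // debuggerConnectedNow: result of a fresh debugger check
    // debuggerWasConnected: counter carried over from the wait loop
    // allowRestart:         whether restarting system_server is permitted
    static Action decide(boolean debuggerConnectedNow, int debuggerWasConnected,
                         boolean allowRestart) {
        if (debuggerConnectedNow) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            return Action.SKIP_DEBUGGER_CONNECTED;
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER_WAS_CONNECTED;
        }
        if (!allowRestart) {
            return Action.SKIP_RESTART_NOT_ALLOWED;
        }
        return Action.KILL;
    }
}
```

The debugger checks come first, so an attached debugger prevents the kill even when restarting is allowed; the process is killed only when no debugger is involved and allowRestart is true.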

prop dalvik.vm.stack-trace-dir

This property refers to /data/anr

final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
