
Android Watchdog Mechanism

纪正德
2023-12-01


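Early mobile phone platforms usually included a hardware watchdog (WatchDog). The software had to write a value to the watchdog hardware at regular intervals to show that it had not failed (commonly called "feeding the dog"); otherwise, once the allotted time had passed, the watchdog restarted the device. Roughly, the watchdog counter starts when the system starts running and then counts on its own. If the counter is not cleared within a certain time, it overflows and raises a watchdog interrupt, which resets the system.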
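The counting scheme described above can be sketched in plain Java. This is only an illustrative model, not real hardware or Android code: the class name CountdownWatchdog, the tick-based counter, and the limit value are all assumptions made for the example.

```java
/** Hypothetical model of a hardware watchdog counter; not Android code. */
final class CountdownWatchdog {
    private final int limit; // ticks allowed between two feeds
    private int counter;     // ticks elapsed since the last feed

    CountdownWatchdog(int limit) {
        this.limit = limit;
    }

    /** "Feeding the dog": the software proves it is alive and clears the counter. */
    void feed() {
        counter = 0;
    }

    /** One timer tick. Returns true when the counter overflows, i.e. the device would reset. */
    boolean tick() {
        counter++;
        return counter > limit;
    }

    public static void main(String[] args) {
        CountdownWatchdog dog = new CountdownWatchdog(3);
        boolean reset = false;
        for (int t = 1; t <= 10 && !reset; t++) {
            if (t <= 4) {
                dog.feed(); // healthy software feeds before every tick
            }
            reset = dog.tick(); // from t = 5 on, the software "hangs" and stops feeding
            System.out.println("tick " + t + " reset=" + reset);
        }
    }
}
```

As long as feed() is called often enough the counter never exceeds the limit; once the feeds stop, a few more ticks trigger the reset.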

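A phone is really just an extremely powerful microcontroller: it runs many times faster, has many times more storage, runs many threads, and has all kinds of software and hardware working together. Android's SystemServer is a very complex process that hosts more than fifty services, which makes it the process most likely to run into trouble, so the threads running in SystemServer need to be monitored.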

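Using the hardware watchdog approach here, with every thread feeding the dog at intervals, would waste a lot of CPU and complicate program design. Android therefore provides the Watchdog class as a software watchdog that monitors the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.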

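What Watchdog does

Watchdog mainly detects two kinds of problem:

  1. Blocked in monitor: the monitor() implementation of a monitored service blocks
  2. Blocked in handler: the message queue of a monitored thread stops processing messages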

How it decides whether a thread is stuck

MessageQueue.isPolling
Monitor.monitor
---
HandlerChecker checks whether a looper is blocked
monitor checks for deadlock

How Watchdog works

Figure: how Watchdog works https://img-blog.csdnimg.cn/img_convert/e5c8133c7f86583251c775de4ceae9c0.jpeg

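Starting Watchdog

Watchdog is initialized and started in the SystemServer process. SystemServer's run method registers and starts the Android services, and this includes initializing and starting Watchdog. The code is as follows: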

final Watchdog watchdog = Watchdog.getInstance();//line: 864
watchdog.init(context, mActivityManagerService);

In the second half of startOtherServices() in SystemServer, Watchdog is started from the callback passed to the systemReady interface of AMS (ActivityManagerService):

Watchdog.getInstance().start();//line: 1852

The Watchdog constructor

super("watchdog");
// Initialize a checker for each thread we want to monitor.
// The background thread is not checked here.
// The shared foreground thread is the main checker; the registered monitors are also run on it.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                                     "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add a checker for the main thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                                        "main thread", DEFAULT_TIMEOUT));
// Add a checker for the shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                                        "ui thread", DEFAULT_TIMEOUT));
// Add a checker for the shared IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                                        "i/o thread", DEFAULT_TIMEOUT));
// Add a checker for the shared display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                                        "display thread", DEFAULT_TIMEOUT));

// Register the monitor for the binder threads.
addMonitor(new BinderThreadMonitor());

mOpenFdMonitor = OpenFdMonitor.create();

// See the notes on DEFAULT_TIMEOUT.
assert DB ||
    DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

The Watchdog constructor creates several HandlerChecker objects and adds them to its own list of checkers.

Handlers monitored by Watchdog

| Thread name | Handler | Description | Timeout |
| --- | --- | --- | --- |
| foreground thread | FgThread.getHandler() | Foreground thread | 60s |
| main thread | new Handler(Looper.getMainLooper()) | Main thread | 60s |
| ui thread | UiThread.getHandler() | UI thread | 60s |
| i/o thread | IoThread.getHandler() | IO thread | 60s |
| display thread | DisplayThread.getHandler() | Display thread | 60s |
| PackageManager | addThread(mHandler, time) | Thread added by PackageManagerService | 10min |
| PackageManager | addThread(mHandler, time) | Thread added by PermissionManagerService | 60s |
| PowerManagerService | addThread(mHandler, time) | Thread added by PowerManagerService | 60s |
| ActivityManagerService | addThread(mHandler, time) | Thread added by ActivityManagerService | 60s |

Monitors registered with Watchdog

| Monitor | Description | Checked in monitor() | Timeout |
| --- | --- | --- | --- |
| BinderThreadMonitor | Checks the Binder threads | | 60s |
| OpenFdMonitor | Checks file descriptors (fd) | | 60s |
| TvRemoteService | addMonitor(this) | mLock | |
| ActivityManagerService | addMonitor(this) | this | |
| MediaProjectionManagerService | addMonitor(this) | mLock | |
| MediaRouterService | addMonitor(this) | mLock | |
| MediaSessionService | addMonitor(this) | mLock | |
| InputManagerService | addMonitor(this) | mInputFilterLock, nativeMonitor(mPtr) | |
| PowerManagerService | addMonitor(this) | mLock | |
| NetworkManagementService | addMonitor(this) | mConnector | |
| StorageManagerService | addMonitor(this) | mVold | |
| WindowManagerService | addMonitor(this) | mWindowMap | |

HandlerChecker

public final class HandlerChecker implements Runnable

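HandlerChecker checks the state of a handler thread and schedules the monitor callbacks. It works by looking at the MessageQueue of each Handler's looper to decide whether that thread is stuck. The threads in question all run inside the SystemServer process.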

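Watchdog builds many HandlerCheckers, which fall into two groups:

  • Monitor Checker: checks for deadlocks on Monitor objects. Core system services such as AMS, PKMS, and WMS are Monitor objects.
  • Looper Checker: checks whether a thread's message queue has been busy for too long. Watchdog's own message queue and the global ui, io, and display message queues are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added to the Looper Checkers when the corresponding objects are initialized.

The two kinds of HandlerChecker focus on different things:

  • Monitor Checker warns that the object locks of core system services must not be held for long, otherwise many functions are blocked
  • Looper Checker warns that a message queue must not be hogged for long, otherwise other messages are not handled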

The HandlerChecker constructor

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
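        mHandler = handler; // handler of the monitored thread
        mName = name; // name
        mWaitMax = waitMaxMillis; // timeout to wait before the thread counts as blocked
        mCompleted = true; // whether the last check has completed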
    }
}

HandlerChecker::scheduleCheckLocked

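Watchdog's run method calls this method. It is the core of HandlerChecker: it schedules a check that shows whether the thread behind the HandlerChecker is blocked or deadlocked.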

public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread.  Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}
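  1. isPolling() is the key call for deciding whether the thread's Looper is idle. If it returns true and no monitors are registered, the looper is polling for events and running normally, so mCompleted is set to true and the method returns without posting anything.
  2. If mCompleted is false, a check is already in flight, so there is nothing more to do.
  3. Otherwise mHandler.postAtFrontOfQueue(this) posts the checker to the front of the queue, and its run method executes once the thread gets to it.

scheduleCheckLocked therefore mainly matters for mMonitorChecker. Other HandlerCheckers that have no monitors registered and are in the polling state are not checked at all; UiThread, for example, is normally in the polling state.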
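The idea of posting a probe to the front of a thread's queue can be tried outside Android with a small sketch. Everything here is an assumption made for illustration: the Worker class stands in for a Looper thread, a BlockingDeque stands in for the MessageQueue, and probe() and occupy() are hypothetical helpers, not Watchdog APIs.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

final class FrontOfQueueProbe {
    /** Stand-in for a Looper thread: runs tasks taken from the head of a deque. */
    static final class Worker extends Thread {
        final BlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

        Worker() {
            super("worker");
            setDaemon(true); // let the JVM exit even though the loop never ends
        }

        @Override
        public void run() {
            try {
                while (true) {
                    queue.takeFirst().run();
                }
            } catch (InterruptedException e) {
                // interrupted: stop the loop
            }
        }
    }

    /** Posts a probe at the front of the queue; true if the worker ran it within timeoutMs. */
    static boolean probe(Worker worker, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        worker.queue.addFirst(done::countDown);
        return awaitQuietly(done, timeoutMs);
    }

    /** Makes the worker block inside a task until the returned latch is counted down. */
    static CountDownLatch occupy(Worker worker) {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        worker.queue.addLast(() -> {
            started.countDown();
            awaitQuietly(release, Long.MAX_VALUE);
        });
        awaitQuietly(started, 5_000); // wait until the worker is really inside the task
        return release;
    }

    private static boolean awaitQuietly(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.start();
        System.out.println("idle worker responds: " + probe(worker, 1_000));
        CountDownLatch release = occupy(worker);
        System.out.println("busy worker responds: " + probe(worker, 200));
        release.countDown();
    }
}
```

Putting the probe at the head of the queue means it only measures whether the thread is stuck in its current task, not how long the backlog behind it is.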

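MessageQueue::isPolling

mHandler.getLooper().getQueue().isPolling() can be used to tell whether the thread is stuck.
true: the looper is currently polling for events.

The method is implemented in MessageQueue. Its comment explains that it returns whether the looper's thread is currently polling for more work to do, which is a good signal that the loop is still alive.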

frameworks/base/core/java/android/os/MessageQueue.java

/**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
public boolean isPolling() {
    synchronized (this) {
        return isPollingLocked();
    }
}

HandlerChecker::run

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
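  1. The method iterates over its own monitors and calls monitor() on each. If one of them blocks, mCompleted stays false.
  2. The for loop detects whether anything in the monitor list is blocked; only mMonitorChecker enters this loop
  3. All other HandlerCheckers have an empty mMonitors list, so they never execute the loop body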

HandlerChecker::getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
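  1. Returns the completion state. mStartTime is set in scheduleCheckLocked
  2. When the watchdog asks for the state of a check that has not completed, the else branch computes the elapsed time and returns the matching state code.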

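Thread states

| State | Description |
| --- | --- |
| COMPLETED | The message has been handled; the thread is not blocked |
| WAITING | Handling has taken 0~29 seconds; keep going |
| WAITED_HALF | Handling has taken 30~59 seconds; the thread may be blocked, so the current stack traces are dumped through AMS and monitoring continues |
| OVERDUE | Handling has taken 60 seconds or more; the process is about to be killed. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout |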
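The mapping from elapsed time to state can be checked with a standalone sketch of the same comparison. The class name CompletionStateDemo and the classify() helper are hypothetical, and the numeric values of WAITING, WAITED_HALF, and OVERDUE are assumed here; only their ordering matters for the example.

```java
final class CompletionStateDemo {
    static final int COMPLETED = 0;
    static final int WAITING = 1;     // assumed value
    static final int WAITED_HALF = 2; // assumed value
    static final int OVERDUE = 3;     // assumed value

    /** Same comparisons as getCompletionStateLocked, with the fields passed in as arguments. */
    static int classify(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the default 60-second timeout
        long[] samples = {10_000, 30_000, 59_999, 60_000};
        for (long latency : samples) {
            System.out.println(latency + " ms -> state " + classify(false, latency, waitMax));
        }
    }
}
```

With the default timeout, a latency of exactly 30 seconds already counts as WAITED_HALF and exactly 60 seconds already counts as OVERDUE, because both comparisons are strict.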

HandlerThread class hierarchy

The handlers passed to the HandlerCheckers here all belong to HandlerThread threads.

java.lang.Object
  ↳ Thread implements Runnable
    ↳ HandlerThread extends Thread
      ↳ ServiceThread extends HandlerThread
        ↳ FgThread extends ServiceThread

Threads behind the initial HandlerCheckers

public ServiceThread(String name, int priority, boolean allowIo)

private FgThread() {
    super("android.fg", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private UiThread() {
    super("android.ui", Process.THREAD_PRIORITY_FOREGROUND, false /*allowIo*/);
}

private IoThread() {
    super("android.io", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private DisplayThread() {
    // DisplayThread runs important stuff, but it is less important than what runs in AnimationThread.
    // The priority is therefore set one step lower.
    super("android.display", Process.THREAD_PRIORITY_DISPLAY + 1, false /*allowIo*/);
}

Android thread priorities

frameworks/base/core/java/android/os/Process.java

public static final int THREAD_PRIORITY_DEFAULT = 0; // default thread priority
public static final int THREAD_PRIORITY_LOWEST = 19; // lowest thread priority
public static final int THREAD_PRIORITY_BACKGROUND = 10; // recommended for background threads
public static final int THREAD_PRIORITY_FOREGROUND = -2; // UI threads the user is interacting with; code cannot set this priority, the system adjusts threads to it as needed
public static final int THREAD_PRIORITY_DISPLAY = -4; // also related to UI interaction, but higher than THREAD_PRIORITY_FOREGROUND
public static final int THREAD_PRIORITY_URGENT_DISPLAY = -8; // highest level for display threads, used for drawing the screen and retrieving input events
public static final int THREAD_PRIORITY_AUDIO = -16; // standard level for audio threads
public static final int THREAD_PRIORITY_URGENT_AUDIO = -19; // highest level for audio threads, higher than THREAD_PRIORITY_AUDIO
public static final int THREAD_PRIORITY_MORE_FAVORABLE = -1; // slightly higher than THREAD_PRIORITY_DEFAULT
public static final int THREAD_PRIORITY_LESS_FAVORABLE = 1; // slightly lower than THREAD_PRIORITY_DEFAULT

An app sets a thread's priority as shown below. Some levels cannot be set by apps; the system assigns them.

Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND +
                Process.THREAD_PRIORITY_LESS_FAVORABLE);

describeBlockedStateLocked

public String describeBlockedStateLocked() {
    if (mCurrentMonitor == null) {
        return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    } else {
        return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
            + " on " + mName + " (" + getThread().getName() + ")";
    }
}

This method builds the description of the blocked handler or monitor that is printed in the log.

Monitor

Monitor is an interface used to check whether a service is deadlocked.

public interface Monitor {
    void monitor();
}

Classes that implement the Watchdog.Monitor interface

ActivityManagerService
WindowManagerService
PowerManagerService
InputManagerService
MediaSessionService
MediaRouterService
StorageManagerService
NetworkManagementService
NativeDaemonConnector
MediaProjectionManagerService
TvRemoteService

BinderThreadMonitor
OpenFdMonitor

Monitor is an interface with quite a few implementing classes; the list above is the result of searching the Android 9.0 source.

Using Watchdog

All of these classes implement the interface and register themselves with Watchdog, for example in AMS:

public class ActivityManagerService extends IActivityManager.Stub
    implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ......
    public ActivityManagerService(Context systemContext) {
        ......
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ......
    }
    ......
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
    ......
}

Watchdog::addThread

public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT); //60s
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}
  1. addThread passes a thread's Handler to Watchdog, which creates a new HandlerChecker for that Handler,
  2. and adds the new HandlerChecker to the list of checkers

Watchdog::addMonitor

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}
  1. A monitor is passed in; Watchdog later calls its monitor() method to decide whether it is blocked
  2. All Monitors are added to mMonitorChecker, so mMonitorChecker is the only checker that holds any Monitors
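How an empty synchronized block reveals a held lock can be shown with a small standalone sketch. The names MonitorDemo, Service, Checker, and check() are hypothetical, and running the checker on its own short-lived thread is a simplification made for the example; Watchdog itself runs the monitors on the foreground thread.

```java
import java.util.ArrayList;
import java.util.List;

final class MonitorDemo {
    interface Monitor {
        void monitor();
    }

    /** A service whose monitor() only tries to take the service's own lock. */
    static final class Service implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) { } // returns at once unless someone else holds the lock
        }
    }

    /** Calls every registered monitor in turn, then marks itself completed. */
    static final class Checker implements Runnable {
        private final List<Monitor> monitors = new ArrayList<>();
        private volatile boolean completed;

        void addMonitor(Monitor m) {
            monitors.add(m);
        }

        boolean isCompleted() {
            return completed;
        }

        @Override
        public void run() {
            for (Monitor m : monitors) {
                m.monitor();
            }
            completed = true;
        }
    }

    /** Runs the checker on its own thread; true if it finished within timeoutMs. */
    static boolean check(Checker checker, long timeoutMs) {
        Thread t = new Thread(checker, "checker");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return checker.isCompleted();
    }

    public static void main(String[] args) {
        Service service = new Service();

        Checker first = new Checker();
        first.addMonitor(service);
        System.out.println("lock free, check completed: " + check(first, 1_000));

        Checker second = new Checker();
        second.addMonitor(service);
        synchronized (service.lock) { // simulate a thread that never releases the lock
            System.out.println("lock held, check completed: " + check(second, 200));
        }
    }
}
```

The monitor does no work of its own; the time it takes to enter the synchronized block is the whole measurement.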

Watchdog::run()

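This is Watchdog's core method: it checks for thread deadlocks and blocked loopers, collects information, and kills the system_server process so that the system restarts.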

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                // Call scheduleCheckLocked() on every HandlerChecker
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis(); 
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }

            if (!fdLimitTriggered) {
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) { // threads are healthy; start the next round
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) { // blocked, but for less than 30s; keep monitoring
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) { // blocked for more than 30s; dump some system information, then keep monitoring for another 30s
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                                               getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
            !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                    "watchdog", null, "system_server", null, null,
                    subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
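  1. run() is an endless loop: it keeps iterating over all HandlerCheckers, schedules their checks, waits thirty seconds, and evaluates the result.

  2. Iterate over all HandlerCheckers and call scheduleCheckLocked on each, which records the start time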

    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        hc.scheduleCheckLocked();
    }
    
  3. Wait 30 seconds

    // Wait 30 seconds
    // uptimeMillis is used so that time spent asleep is not counted; while the phone sleeps, the system services sleep too
    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }
    
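  4. Evaluate the checkers' states: iterate over all HandlerCheckers and take the largest return value.
    The largest return value is one of four states:

    • COMPLETED: the message has been handled; the thread is not blocked
    • WAITING: handling has taken 0~29 seconds; keep going
    • WAITED_HALF: handling has taken 30~59 seconds; the thread may be blocked, so the current stack traces are dumped through AMS and monitoring continues
    • OVERDUE: handling has taken 60 seconds or more; the process is about to be killed. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout
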
    boolean fdLimitTriggered = false;
    if (mOpenFdMonitor != null) {
        fdLimitTriggered = mOpenFdMonitor.monitor();
    }
    if (!fdLimitTriggered) {
        final int waitState = evaluateCheckerCompletionLocked();
        if (waitState == COMPLETED) {
            // The monitors have returned; reset
            waitedHalf = false;
            continue;
        } else if (waitState == WAITING) {
            // still waiting but within their configured intervals; back off and recheck
            continue;
        } else if (waitState == WAITED_HALF) {
            if (!waitedHalf) {
                // We've waited half the deadlock-detection interval.  Pull a stack
                // trace and wait another half.
                ArrayList<Integer> pids = new ArrayList<Integer>();
                pids.add(Process.myPid());
                ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                                       getInterestingNativePids());
                waitedHalf = true;
            }
            continue;
        }
    
        // something is overdue!
        blockedCheckers = getBlockedCheckersLocked();
        subject = describeCheckersLocked(blockedCheckers);
    } else {
        blockedCheckers = Collections.emptyList();
        subject = "Open FD high water mark reached";
    }
    
  5. OpenFdMonitor

    public boolean monitor() {
        if (mFdHighWaterMark.exists()) {
            dumpOpenDescriptors();
            return true;
        }
        return false;
    }
    
  6. Collect information

  7. Kill the system process

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
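The wait in step 3 keeps recomputing the remaining time so that an early wake-up does not shorten the interval. A plain Java sketch of that pattern follows; the class name IntervalWait and the helper waitFullInterval() are hypothetical, and System.nanoTime() is used only as a stand-in for a monotonic clock such as SystemClock.uptimeMillis(), without the sleep-time semantics.

```java
final class IntervalWait {
    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    /**
     * Waits on the lock until intervalMs have elapsed on the monotonic clock,
     * going back to waiting after early wake-ups. Returns the elapsed time in ms.
     */
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long remaining = intervalMs;
        boolean interrupted = false;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    interrupted = true; // remember it, but keep waiting out the interval
                }
                remaining = intervalMs - elapsedMs(start);
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
        }
        return elapsedMs(start);
    }

    public static void main(String[] args) {
        final Object lock = new Object();
        Thread waker = new Thread(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ignored) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll(); // early wake-up; the loop goes back to waiting
            }
        });
        waker.setDaemon(true);
        waker.start();
        long elapsed = waitFullInterval(lock, 100);
        System.out.println("elapsed >= 100 ms: " + (elapsed >= 100));
    }
}
```

A single wait(timeout) call would return as soon as anything notified the lock, which is why the loop compares against the start time instead of trusting the call to last the full interval.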

Watchdog::evaluateCheckerCompletionLocked

Evaluates the checkers' states: it iterates over all HandlerCheckers and returns the largest state value.

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;// COMPLETED = 0
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
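Taking the maximum works because the state constants are ordered from healthy to overdue, so the worst checker decides the result. A minimal standalone sketch, assuming the values 1, 2, and 3 for WAITING, WAITED_HALF, and OVERDUE (the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

final class WorstStateDemo {
    static final int COMPLETED = 0;
    static final int WAITING = 1;     // assumed value
    static final int WAITED_HALF = 2; // assumed value
    static final int OVERDUE = 3;     // assumed value

    /** Returns the worst (largest) state among all checkers, COMPLETED if there are none. */
    static int evaluate(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        // One checker is halfway to its timeout, so the overall state is WAITED_HALF.
        System.out.println(evaluate(Arrays.asList(COMPLETED, WAITED_HALF, WAITING))); // prints 2
    }
}
```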

Watchdog::getBlockedCheckersLocked

Watchdog::describeCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

private String describeCheckersLocked(List<HandlerChecker> checkers) {
    StringBuilder builder = new StringBuilder(128);
    for (int i=0; i<checkers.size(); i++) {
        if (builder.length() > 0) {
            builder.append(", ");
        }
        builder.append(checkers.get(i).describeBlockedStateLocked());
    }
    return builder.toString();
}
  1. Builds the text that describes the blocked or deadlocked threads

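Note

Checking for deadlock with monitor() targets deadlocks between different threads, whereas checking whether a service's main thread is blocked targets that thread alone. Posting a message (the sendMessage() approach) can therefore only detect that the main thread is blocked; it cannot detect a deadlock. If another thread holds a service lock for a long time through synchronized and does not release it, a Message sent to the main thread can still be handled by the main thread's Handler as long as the main thread itself is not waiting for that lock.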

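How to trigger it

  1. Blocked in monitor
    Hold the lock that a Monitor implementation acquires in monitor() and never release it
  2. Blocked in handler
    Block for a long time in the onCreate of a Service running in SystemServer; once the timeout expires, SystemServer restarts.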

Logs when it triggers

There are two common kinds of log: Blocked in handler and Blocked in monitor.

Blocked in handler

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......

Blocked in monitor

10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
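When triaging such logs it can help to pull the blocked handlers and monitors out of the "WATCHDOG KILLING SYSTEM PROCESS" line. The following is a minimal sketch of one way to do that; the class name, the regular expression, and the "kind | checker | thread" output format are assumptions made for the example and only cover the two message shapes shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WatchdogSubjectParser {
    // Matches "Blocked in handler on <checker> (<thread>)" and
    // "Blocked in monitor <class> on <checker> (<thread>)".
    private static final Pattern BLOCKED = Pattern.compile(
            "Blocked in (handler|monitor (\\S+)) on ([^()]+?) \\(([^)]*)\\)");

    /** Returns one "kind | checker name | thread name" entry per blocked checker in the line. */
    static List<String> parse(String logLine) {
        List<String> result = new ArrayList<>();
        Matcher m = BLOCKED.matcher(logLine);
        while (m.find()) {
            String kind = (m.group(2) == null) ? "handler" : "monitor " + m.group(2);
            result.add(kind + " | " + m.group(3) + " | " + m.group(4));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: "
                + "Blocked in handler on main thread (main), "
                + "Blocked in handler on ui thread (android.ui)";
        for (String entry : parse(line)) {
            System.out.println(entry);
        }
    }
}
```

For a "Blocked in monitor" line, the class name in the first field identifies which monitor() call never returned, which is usually the starting point for finding the lock holder in the dumped stack traces.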

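Watchdog log analysis

After Watchdog detects that a SystemServer thread is deadlocked, it collects and prints diagnostic information. The code is in the run method: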

while (true) {
    // Execution reaches this point when a deadlock or a blocked message queue has been detected

    // If we got here, that means that the system is most likely hung.
    // First collect stack traces from all threads of the system process.
    // Then kill this process so that the system will restart.
    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

    ArrayList<Integer> pids = new ArrayList<>();
    pids.add(Process.myPid());
    if (mPhonePid > 0) pids.add(mPhonePid);
    // Pass !waitedHalf so that just in case we somehow wind up here without having
    // dumped the halfway stacks, we properly re-initialize the trace file.
    final File stack = ActivityManagerService.dumpStackTraces(
        !waitedHalf, pids, null, null, getInterestingNativePids());

    // Give some extra time to make sure the stack traces get written.
    // The system's been hanging for a minute, another second or two won't hurt much.
    SystemClock.sleep(2000);

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox(
                "watchdog", null, "system_server", null, null,
                subject, null, stack, null);
        }
    };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

    IActivityController controller;
    synchronized (this) {
        controller = mController;
    }
    if (controller != null) {
        Slog.i(TAG, "Reporting stuck state to activity controller");
        try {
            Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
            // 1 = keep waiting, -1 = kill system
            int res = controller.systemNotResponding(subject);
            if (res >= 0) {
                Slog.i(TAG, "Activity controller requested to coninue to wait");
                waitedHalf = false;
                continue;
            }
        } catch (RemoteException e) {
        }
    }

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }

    waitedHalf = false;
}
  1. Write the event log

    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
    
  2. Dump the stack traces

ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
    !waitedHalf, pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
  3. Dump kernel info

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');
    
  4. Collect dropbox information

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox(
                "watchdog", null, "system_server", null, null,
                subject, null, stack, null);
        }
    };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}
    
  5. Kill the system process: unless a debugger is attached or restart is not allowed, Watchdog kills its own process

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }
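The order of the checks in the last step can be summarised as a small pure function. This is only a sketch of the decision shown in the code above; the class name KillDecisionDemo, the Action enum, and decide() are hypothetical names introduced for illustration.

```java
final class KillDecisionDemo {
    enum Action {
        SPARE_DEBUGGER_CONNECTED,
        SPARE_DEBUGGER_WAS_CONNECTED,
        SPARE_RESTART_NOT_ALLOWED,
        KILL
    }

    /** Mirrors the if/else chain at the end of run(): the first matching condition wins. */
    static Action decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) {
            return Action.SPARE_DEBUGGER_CONNECTED;
        } else if (debuggerWasConnected > 0) {
            return Action.SPARE_DEBUGGER_WAS_CONNECTED;
        } else if (!allowRestart) {
            return Action.SPARE_RESTART_NOT_ALLOWED;
        }
        return Action.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(2, true));  // debugger attached right now
        System.out.println(decide(1, true));  // debugger was attached one round ago
        System.out.println(decide(0, false)); // restart disabled
        System.out.println(decide(0, true));  // nothing protects the process
    }
}
```

The debugger counter is checked before allowRestart, so a recently attached debugger protects the process even when restarts are allowed.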
    

prop dalvik.vm.stack-trace-dir

This refers to /data/anr

final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
