当前位置: 首页 > 工具软件 > Watchdog > 使用案例 >

WatchDog工作原理

勾渝
2023-12-01

一、概述

Android系统中,有硬件WatchDog用于定时检测关键硬件是否正常工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件。WatchDog功能主要是分析系统核心服务和重要线程是否处于Blocked状态。

  • 监视reboot广播;
  • 监视mMonitors关键系统服务是否死锁。

二、WatchDog初始化

2.1 startOtherServices

[-> SystemServer.java]

private void startOtherServices() {
    ...
    //创建watchdog【见小节2.2】
    final Watchdog watchdog = Watchdog.getInstance();
    //注册reboot广播【见小节2.3】
    watchdog.init(context, mActivityManagerService);
    ...
    mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY); //480
    ...
    mActivityManagerService.systemReady(new Runnable() {
 
       public void run() {
           mSystemServiceManager.startBootPhase(
                   SystemService.PHASE_ACTIVITY_MANAGER_READY);
           ...
           // watchdog启动【见小节3.1】
           Watchdog.getInstance().start();
           mSystemServiceManager.startBootPhase(
                   SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
        }
        
    }
}

system_server进程启动的过程中初始化WatchDog,主要有:

  • 创建watchdog对象,该对象本身继承于Thread;
  • 注册reboot广播;
  • 调用start()开始工作。

2.2 getInstance

[-> Watchdog.java]

public static Watchdog getInstance() {
    if (sWatchdog == null) {
        //单例模式,创建实例对象【见小节2.3 】
        sWatchdog = new Watchdog();
    }
    return sWatchdog;

2.3 创建Watchdog

[-> Watchdog.java]

public class Watchdog extends Thread {
    //所有的HandlerChecker对象组成的列表,HandlerChecker对象类型【见小节2.3.1】
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    ...
 
    private Watchdog() {
        super("watchdog");
        //将前台线程加入队列
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        //将主线程加入队列
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        //将ui线程加入队列
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        //将i/o线程加入队列
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        //将display线程加入队列
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        //【见小节2.3.2】
        addMonitor(new BinderThreadMonitor());
    }
}

Watchdog继承于Thread,创建的线程名为”watchdog”。mHandlerCheckers队列包括、 主线程,fg, ui, io, display线程的HandlerChecker对象。

2.3.1 HandlerChecker

[-> Watchdog.java]

public final class HandlerChecker implements Runnable {
    private final Handler mHandler; //Handler对象
    private final String mName; //线程描述名
    private final long mWaitMax; //最长等待时间
    //记录着监控的服务
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted; //开始检查时先设置成false
    private Monitor mCurrentMonitor; 
    private long mStartTime; //开始准备检查的时间点
 
    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis; 
        mCompleted = true;
    }
}

2.3.2 addMonitor

public class Watchdog extends Thread {
    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            ...
            //此处mMonitorChecker数据类型为HandlerChecker
            mMonitorChecker.addMonitor(monitor);
        }
    }
 
    public final class HandlerChecker implements Runnable {
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
 
        public void addMonitor(Monitor monitor) {
            //将上面的BinderThreadMonitor添加到mMonitors队列
            mMonitors.add(monitor);
        }
        ...
    }
}

监控Binder线程, 将monitor添加到HandlerChecker的成员变量mMonitors列表中。 在这里是将BinderThreadMonitor对象加入该线程。

private static final class BinderThreadMonitor implements Watchdog.Monitor {
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

blockUntilThreadAvailable最终调用的是IPCThreadState,等待有空闲的binder线程

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        //等待正在执行的binder线程小于进程最大binder线程上限(16个)
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

可见addMonitor(new BinderThreadMonitor())是将Binder线程添加到android.fg线程的handler(mMonitorChecker)来检查是否工作正常。

2.3 init

[-> Watchdog.java]

public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    //注册reboot广播接收者【见小节2.3.1】
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

2.3.1 RebootRequestReceiver

[-> Watchdog.java]

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            //【见小节2.3.2】
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

2.3.2 rebootSystem

[-> Watchdog.java]

void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        //通过PowerManager执行reboot操作
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

最终是通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述。

三、Watchdog检测机制

当调用Watchdog.getInstance().start()时,则进入线程“watchdog”的run()方法, 该方法分成两部分:

  • 前半部 [小节3.1] 用于监测是否触发超时;
  • 后半部 [小节4.1], 当触发超时则输出各种信息。

3.1 run

[-> Watchdog.java]

public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL; //CHECK_INTERVAL=30s
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                //执行所有的Checker的监控方法, 每个Checker记录当前的mStartTime[见小节3.2]
                hc.scheduleCheckLocked();
            }
 
            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }
 
            long start = SystemClock.uptimeMillis();
            //通过循环,保证执行30s才会继续往下执行
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout); //触发中断,直接捕获异常,继续等待.
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            
            //评估Checker状态【见小节3.3】
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    //首次进入等待时间过半的状态
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    //输出system_server和3个native进程的traces【见小节4.2】
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            ... //进入这里,意味着Watchdog已超时【见小节4.1】
        }
        ...
    }
}
 
public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
    "/system/bin/mediaserver",
    "/system/bin/sdcard",
    "/system/bin/surfaceflinger"
};

该方法主要功能:

  1. 执行所有的Checker的监控方法scheduleCheckLocked()
    • 当mMonitor个数为0(除了android.fg线程之外都为0)且处于poll状态,则设置mCompleted = true;
    • 当上次check还没有完成, 则直接返回.
  2. 等待30s后, 再调用evaluateCheckerCompletionLocked来评估Checker状态;
  3. 根据waitState状态来执行不同的操作:
    • 当COMPLETED或WAITING,则相安无事;
    • 当WAITED_HALF(超过30s)且为首次, 则输出system_server和3个Native进程的traces;
    • 当OVERDUE, 则输出更多信息.

由此,可见当触发一次Watchdog, 则必然会调用两次AMS.dumpStackTraces, 也就是说system_server和3个Native进程的traces 的traces信息会输出两遍,且时间间隔超过30s.

3.2 scheduleCheckLocked

public final class HandlerChecker implements Runnable {
    ...
    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            mCompleted = true; //当目标looper正在轮询状态则返回。
            return;
        }
 
        if (!mCompleted) {
            return; //有一个check正在处理中,则无需重复发送
        }
        mCompleted = false;
        
        mCurrentMonitor = null;
        // 记录当下的时间
        mStartTime = SystemClock.uptimeMillis();
        //发送消息,插入消息队列最开头, 见下方的run()方法
        mHandler.postAtFrontOfQueue(this);
    }
    
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            //回调具体服务的monitor方法
            mCurrentMonitor.monitor();
        }
 
        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}

该方法主要功能: 向Watchdog的监控线程的Looper池的最头部执行该HandlerChecker.run()方法, 在该方法中调用monitor(),执行完成后会设置mCompleted = true. 那么当handler消息池当前的消息, 导致迟迟没有机会执行monitor()方法, 则会触发watchdog.

其中postAtFrontOfQueue(this),该方法输入参数为Runnable对象,根据消息机制(具体参考链接 消息机制), 最终会回调HandlerChecker中的run方法,该方法会循环遍历所有的Monitor接口,具体的服务实现该接口的monitor()方法。

这里简单展开以下,为什么调用 postAtFrontOfQueue(this)后,HandlerChecker 会被调用:
frameworks/base/services/core/java/com/android/server/Watchdog.java
public final class HandlerChecker implements Runnable {
  public void scheduleCheckLocked() {
    mHandler.postAtFrontOfQueue(this);//将当前 HandlerChecker传入
  }
}

frameworks/base/core/java/android/os/Handler.java
    public Handler(Callback callback, boolean async) {
        mLooper = Looper.myLooper();
        //mQueue 实际是 Looper的 mQueue
        mQueue = mLooper.mQueue;
        mCallback = callback;
        mAsynchronous = async;
    }

    public final boolean postAtFrontOfQueue(Runnable r)
    {
        return sendMessageAtFrontOfQueue(getPostMessage(r));
    }

    private static Message getPostMessage(Runnable r) {
        Message m = Message.obtain();
        //重要,将传入的Runnable当前代码代码为 HandlerChecker指向 callback
        m.callback = r;
        return m;
    }

    public final boolean sendMessageAtFrontOfQueue(Message msg) {
        MessageQueue queue = mQueue;
        return enqueueMessage(queue, msg, 0);
    }
    private boolean enqueueMessage(MessageQueue queue, Message msg, long uptimeMillis) {
        msg.target = this;//设置消息的 target 为当前 Handler
        //uptimeMillis = 0
        return queue.enqueueMessage(msg, uptimeMillis);
    }

frameworks/base/core/java/android/os/MessageQueue.java
    boolean enqueueMessage(Message msg, long when) {
        msg.when = when;//when = 0
        Message p = mMessages;
        if (p == null || when == 0 || when < p.when) {
            //插入消息队列头中,优先执行
            // New head, wake up the event queue if blocked.
            msg.next = p;
            mMessages = msg;
            needWake = mBlocked;
        }
    }

frameworks/base/core/java/android/os/Looper.java
    public static void loop() {
        final Looper me = myLooper();
        final MessageQueue queue = me.mQueue;
        for (;;) {
            Message msg = queue.next(); //取出消息
            //重要,消息分发出去
            try {
                //msg.target 在 Handler.java的enqueueMessage中设置为对应Handler
                //所以实际调用 Handler的 dispatchMessage()方法
                msg.target.dispatchMessage(msg);
                dispatchEnd = needEndTime ? SystemClock.uptimeMillis() : 0;
            }
        }
    }

frameworks/base/core/java/android/os/Handler.java
    //上面的 sendMessageAtFrontOfQueue,经过多层调用会调用到 dispatchMessage()
    public void dispatchMessage(Message msg) {
        if (msg.callback != null) {
            //重要,之前Message的 callback 不为null,且指向 HandlerChecker
            //所以会调用 HandlerChecker的 run()方法
            handleCallback(msg);
        } else {
            if (mCallback != null) {
                if (mCallback.handleMessage(msg)) {
                    return;
                }
            }
            handleMessage(msg);
        }
    }

    //调用 callback 的run() 方法,即 HandlerChecker 的run() 方法
    private static void handleCallback(Message message) {
        message.callback.run();
    }

可能的问题,如果有其他消息不断地调用postAtFrontOfQueue()也可能导致watchdog没有机会执行;或者是每个monitor消耗一些时间,雷加起来超过1分钟造成的watchdog. 这些都是非常规的Watchdog.

3.3 evaluateCheckerCompletionLocked

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        //【见小节3.4】
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

获取mHandlerCheckers列表中等待状态值最大的state.

3.4 getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        // mWaitMax默认是60s
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
  • COMPLETED = 0:等待完成;
  • WAITING = 1:等待时间小于DEFAULT_TIMEOUT的一半,即30s;
  • WAITED_HALF = 2:等待时间处于30s~60s之间;
  • OVERDUE = 3:等待时间大于或等于60s。

四. Watchdog处理流程

4.1 run

[-> Watchdog.java]

public void run() {
    while (true) {
        synchronized (this) {
            ...
            //获取被阻塞的checkers 【见小节4.1.1】
            blockedCheckers = getBlockedCheckersLocked();
            // 获取描述信息 【见小节4.1.2】
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
 
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
 
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        //第二次以追加的方式,输出system_server和3个native进程的栈信息【见小节4.2】
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
                
        //系统已被阻塞1分钟,也不在乎多等待2s,来确保stack trace信息输出
        SystemClock.sleep(2000);
 
        if (RECORD_KERNEL_THREADS) {
            //输出kernel栈信息【见小节4.3】
            dumpKernelStackTraces();
        }
 
        //触发kernel来dump所有阻塞线程【见小节4.4】
        doSysRq('l');
        
        //输出dropbox信息【见小节4.5】
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        
        try {
            dropboxThread.join(2000); //等待dropbox线程工作2s
        } catch (InterruptedException ignored) {
        }
 
        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            //将阻塞状态报告给activity controller,
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                //返回值为1表示继续等待,-1表示杀死系统
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    waitedHalf = false; 
                    continue; //设置ActivityController的某些情况下,可以让发生Watchdog时继续等待
                }
            } catch (RemoteException e) {
            }
        }
 
        //当debugger没有attach时,才杀死进程
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            //遍历输出阻塞线程的栈信息
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            //杀死进程system_server【见小节4.6】
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
        waitedHalf = false;
    }
}

Watchdog检测到异常的信息收集工作:

  • AMS.dumpStackTraces:输出Java和Native进程的栈信息;
  • WD.dumpKernelStackTraces:输出Kernel栈信息;
  • doSysRq
  • dropBox

收集完信息后便会杀死system_server进程。此处allowRestart默认值为true, 当执行am hang操作则设置不允许重启(allowRestart =false), 则不会杀死system_server进程.

4.1.1 getBlockedCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    //遍历所有的Checker
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        //将所有没有完成,且超时的checker加入队列
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

4.1.2 describeCheckersLocked

private String describeCheckersLocked(ArrayList<HandlerChecker> checkers) {
     StringBuilder builder = new StringBuilder(128);
     for (int i=0; i<checkers.size(); i++) {
         if (builder.length() > 0) {
             builder.append(", ");
         }
         // 输出所有的checker信息
         builder.append(checkers.get(i).describeBlockedStateLocked());
     }
     return builder.toString();
 }
 
 
 public String describeBlockedStateLocked() {
     //非前台线程进入该分支
     if (mCurrentMonitor == null) {
         return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
     //前台线程进入该分支
     } else {
         return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                 + " on " + mName + " (" + getThread().getName() + ")";
     }
 }

将所有执行时间超过1分钟的handler线程或者monitor都记录下来.

  • 当输出的信息是Blocked in handler,意味着相应的线程处理当前消息时间超过1分钟;
  • 当输出的信息是Blocked in monitor,意味着相应的线程处理当前消息时间超过1分钟,或者monitor迟迟拿不到锁;

4.2 AMS.dumpStackTraces

public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
        ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    //默认为 data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
 
    File tracesFile = new File(tracesPath);
    try {
        //当clearTraces,则删除已存在的traces文件
        if (clearTraces && tracesFile.exists()) tracesFile.delete();
        //创建traces文件
        tracesFile.createNewFile();
        // -rw-rw-rw-
        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
    } catch (IOException e) {
        return null;
    }
    //输出trace内容
    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
    return tracesFile;
}

输出system_server和mediaserver,/sdcard,surfaceflinger这3个native进程的traces信息。

4.3 WD.dumpKernelStackTraces

private File dumpKernelStackTraces() {
    // 路径为data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
    // [见小节4.3.1]
    native_dumpKernelStacks(tracesPath);
    return new File(tracesPath);
}

native_dumpKernelStacks,经过JNI调用到android_server_Watchdog.cpp文件中的dumpKernelStacks()方法。

4.3.1 dumpKernelStacks

[-> android_server_Watchdog.cpp]

static void dumpKernelStacks(JNIEnv* env, jobject clazz, jstring pathStr) {
    char buf[128];
    DIR* taskdir;
    
    const char *path = env->GetStringUTFChars(pathStr, NULL);
    // 打开traces文件
    int outFd = open(path, O_WRONLY | O_APPEND | O_CREAT,
        S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH);
    ...
 
    snprintf(buf, sizeof(buf), "\n----- begin pid %d kernel stacks -----\n", getpid());
    write(outFd, buf, strlen(buf));
 
    //读取该进程内的所有线程
    snprintf(buf, sizeof(buf), "/proc/%d/task", getpid());
    taskdir = opendir(buf);
    if (taskdir != NULL) {
        struct dirent * ent;
        while ((ent = readdir(taskdir)) != NULL) {
            int tid = atoi(ent->d_name);
            if (tid > 0 && tid <= 65535) {
                //输出每个线程的traces 【4.3.2】
                dumpOneStack(tid, outFd);
            }
        }
        closedir(taskdir);
    }
 
    snprintf(buf, sizeof(buf), "----- end pid %d kernel stacks -----\n", getpid());
    write(outFd, buf, strlen(buf));
 
    close(outFd);
done:
    env->ReleaseStringUTFChars(pathStr, path);
}

通过读取该节点/proc/%d/task获取当前进程中的所有线程信息。

4.3.2 dumpOneStack

[-> android_server_Watchdog.cpp]

static void dumpOneStack(int tid, int outFd) {
    char buf[64];
    //通过读取节点/proc/%d/stack
    snprintf(buf, sizeof(buf), "/proc/%d/stack", tid);
    int stackFd = open(buf, O_RDONLY);
    if (stackFd >= 0) {
        //头部
        strncat(buf, ":\n", sizeof(buf) - strlen(buf) - 1);
        write(outFd, buf, strlen(buf));
 
        //拷贝stack信息
        int nBytes;
        while ((nBytes = read(stackFd, buf, sizeof(buf))) > 0) {
            write(outFd, buf, nBytes);
        }
 
        //尾部
        write(outFd, "\n", 1);
        close(stackFd);
    } else {
        ALOGE("Unable to open stack of tid %d : %d (%s)", tid, errno, strerror(errno));
    }
}

4.4 WD.doSysRq

private void doSysRq(char c) {
    try {
        FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
        sysrq_trigger.write(c);
        sysrq_trigger.close();
    } catch (IOException e) {
        Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
    }
}

通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log。

4.5 dropBox

关于dropbox已在dropBox源码篇详细讲解过,输出文件到/data/system/dropbox。对于触发watchdog时,生成的dropbox文件的tag是system_server_watchdog,内容是traces以及相应的blocked信息。

4.6 killProcess

Process.killProcess已经在文章理解杀进程的实现原理已详细讲解,通过发送信号9给目标进程来完成杀进程的过程。

当杀死system_server进程,从而导致zygote进程自杀,进而触发init执行重启Zygote进程,这便出现了手机framework重启的现象。

五. 总结

Watchdog是一个运行在system_server进程的名为”watchdog”的线程::

  • Watchdog运作过程,当阻塞时间超过1分钟则触发一次watchdog,会杀死system_server,触发上层重启;
  • mHandlerCheckers记录所有的HandlerChecker对象的列表,包括foreground, main, ui, i/o, display线程的handler;
  • mHandlerChecker.mMonitors记录所有Watchdog目前正在监控Monitor,所有的这些monitors都运行在foreground线程。
  • 有两种方式加入Watchdog的监控:
    • addThread():用于监测Handler对象,默认超时时长为60s.这种超时往往是所对应的handler线程消息处理得慢;
    • addMonitor(): 用于监控实现了Watchdog.Monitor接口的服务.这种超时可能是”android.fg”线程消息处理得慢,也可能是monitor迟迟拿不到锁;

以下情况,即使触发了Watchdog,也不会杀掉system_server进程:

  • monkey: 设置IActivityController,拦截systemNotResponding事件, 比如monkey.
  • hang: 执行am hang命令,不重启;
  • debugger: 连接debugger的情况, 不重启;

5.1 输出信息

watchdog在check过程中出现阻塞1分钟的情况,则会输出:

  1. AMS.dumpStackTraces:输出system_server和3个native进程的traces
    • 该方法会输出两次,第一次在超时30s的地方;第二次在超时1min;
  2. WD.dumpKernelStackTraces,输出system_server进程中所有线程的kernel stack;
    • 节点/proc/%d/task获取进程内所有的线程列表
    • 节点/proc/%d/stack获取kernel的栈
  3. doSysRq, 触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log;
    • 节点/proc/sysrq-trigger
  4. dropBox,输出文件到/data/system/dropbox,内容是trace + blocked信息
  5. 杀掉system_server,进而触发zygote进程自杀,从而重启上层framework。

5.2 Handler方式

Watchdog监控的线程有:默认地DEFAULT_TIMEOUT=60s,调试时才为10s方便找出潜在的ANR问题。

线程名对应handler说明
system_servernew Handler(Looper.getMainLooper())当前主线程
android.fgFgThread.getHandler前台线程
android.uiUiThread.getHandlerUI线程
android.ioIoThread.getHandlerI/O线程
android.displayDisplayThread.getHandlerdisplay线程
ActivityManagerAMS.MainHandlerAMS构造函数中使用
PowerManagerServicePMS.PowerManagerHandlerPMS.onStart()中使用

目前watchdog会监控system_server进程中的以上7个线程,必须保证这些线程的Looper消息处理时间不得超过1分钟。

5.3 Monitor方式

能够被Watchdog监控的系统服务都实现了Watchdog.Monitor接口,并实现其中的monitor()方法。运行在android.fg线程, 系统中实现该接口类主要有:

  • ActivityManagerService
  • WindowManagerService
  • InputManagerService
  • PowerManagerService
  • NetworkManagementService
  • MountService
  • NativeDaemonConnector
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService
  • BinderThreadMonitor

原文地址: http://gityuan.com/2016/06/21/watchdog/

 类似资料: