Android 系统中WatchDog 日志分析

东方森
2023-12-01

来源:

https://blog.csdn.net/shift_wwx/article/details/81030749

源码基于:Android O 

0. 前言

Android 系统中的 WatchDog 详解 一文中通过source code详细的分析了 WatchDog 的软件设计原理和监测过程,这一篇结合实例对WatchDog 做进一步的分析。

1. 源码引路

frameworks/base/services/core/java/com/android/server/Watchdog.java

    public void run() {
            ...

            final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
            File stackFd = finalStack;
            if (tracesDirProp.isEmpty()) {
                String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
                String traceFileNameAmendment = "_SystemServer_WDT" +
                        mTraceDateFormat.format(new Date());

                if (tracesPath != null && tracesPath.length() != 0) {
                    File traceRenameFile = new File(tracesPath);
                    String newTracesPath;
                    int lpos = tracesPath.lastIndexOf (".");
                    if (-1 != lpos) {
                        newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment
                                + tracesPath.substring(lpos);
                    } else {
                        newTracesPath = tracesPath + traceFileNameAmendment;
                    }
                    traceRenameFile.renameTo(new File(newTracesPath));
                    stackFd = new File(newTracesPath);
                } else {
                    Slog.w(TAG, "dump WDT Traces: no trace path configured");
                }
            } else {
                // the new trace dumpsing mechanism
                String newTracesPath = "traces_SystemServer_WDT"
                        + mTraceDateFormat.format(new Date()) + "_pid"
                        + String.valueOf(Process.myPid());
                File tracesDir = new File(tracesDirProp);
                File watchdogTraces = new File(tracesDir, newTracesPath);

                try {
                    if (watchdogTraces.createNewFile()) {
                        FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),
                                0600, -1, -1); // -rw------- permissions

                        // Append both traces from the first and second half
                        // to a new file, making it easier to debug Watchdog timeouts
                        //dumpStackTraces() can return a null instance, so check the same
                        if (initialStack != null) {
                            // check the last-modified time of this file.
                            // we are interested in this only it was written to in the
                            // last minute or so
                            final long age = System.currentTimeMillis()
                                    - initialStack.lastModified();
                            final long MINUTE_IN_MILLIS = 1000 * 60;
                            if (age < MINUTE_IN_MILLIS) {
                                Slog.e(TAG, "First set of traces taken from "
                                        + initialStack.getAbsolutePath());
                                appendFile(watchdogTraces, initialStack);
                            }
                        } else {
                            Slog.e(TAG, "First set of traces are empty!");
                        }

                        if (finalStack != null) {
                            Slog.e(TAG, "Second set of traces taken from "
                                    + finalStack.getAbsolutePath());
                            appendFile(watchdogTraces, finalStack);
                        } else {
                             Slog.e(TAG, "Second set of traces are empty!");
                        }
                    } else {
                        Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed");
                    }
                } catch (IOException ioe) {
                    Slog.e(TAG, "Exception creating Watchdog dump file:", ioe);
                }
            }
            final File newFd = stackFd;

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, newFd, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}


            // At times, when user space watchdog traces don't give an indication on
            // which component held a lock, because of which other threads are blocked,
            // (thereby causing Watchdog), crash the device to analyze RAM dumps
            boolean crashOnWatchdog = SystemProperties
                                        .getBoolean("persist.sys.crashOnWatchdog", false);
            if (crashOnWatchdog) {
                // Trigger the kernel to dump all blocked threads, and backtraces
                // on all CPUs to the kernel log
                Slog.e(TAG, "Triggering SysRq for system_server watchdog");
                doSysRq('w');
                doSysRq('l');

                // wait until the above blocked threads be dumped into kernel log
                SystemClock.sleep(3000);

                // now try to crash the target
                doSysRq('c');
            }

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) { //一般monkey 的时候会打印
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

1.1 prop dalvik.vm.stack-trace-dir

一帮指的是 /data/anr

1.2 创建wd trace文件

String newTracesPath = "traces_SystemServer_WDT"
           + mTraceDateFormat.format(new Date()) + "_pid"
           + String.valueOf(Process.myPid());
File tracesDir = new File(tracesDirProp);
File watchdogTraces = new File(tracesDir, newTracesPath);

最后在平台上的/data/anr/目录下会看到:

-rw-------  1 system system 227332 2007-11-06 00:05 BinderTraces.txt
-rw-------  1 system system 179568 2007-11-06 00:04 anr_2007-11-06-00-04-41-133
-rw-------  1 system system 230506 2007-11-06 00:05 anr_2007-11-06-00-05-14-301
-rw-------  1 system system 180175 2007-11-06 02:57 anr_2007-11-06-02-57-39-825
-rw-------  1 system system 229772 2007-11-06 02:58 anr_2007-11-06-02-58-13-023
-rw-------  1 system system 198777 2007-11-06 19:26 anr_2007-11-06-19-26-00-611
-rw-------  1 system system 249791 2007-11-06 19:26 anr_2007-11-06-19-26-33-888
-rw-------  1 system system 410074 2007-11-06 00:05 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197
-rw-------  1 system system 229772 2007-11-06 02:58 traces_SystemServer_WDT06_Nov_02_58_55.714_pid2726
-rw-------  1 system system 249791 2007-11-06 19:27 traces_SystemServer_WDT06_Nov_19_27_16.140_pid2473

这里文件比较多啊,手机好久没有清理过的缘故,没事的后面会知道到底是哪个文件是我们需要的。

1.3 将AMS中日志append 到trace中

Android 系统中的 WatchDog 详解  的 run() 函数分析中我们知道,当 monitor 超过30秒的时候,check 返回的状态是WAITED_HALF,WatchDog 会给第二次机会,但是同时会将 AMS 的日志记录dump 出来。

appendFile(watchdogTraces, initialStack);

如果第二次还是无法正常返回,会将第二次的log 也添加到watchdogTraces 中:

appendFile(watchdogTraces, finalStack);

最后的 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197 的log 就最终产生了。

1.4 在 /data/system/dropbox 下创建watchdog 文件

    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox(
                      "watchdog", null, "system_server", null, null,
                       subject, null, newFd, null);
            }
        };
    dropboxThread.start();

(1)subject 是blockedChecker 描述:

        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }

会在 /data/system/dropbox/ 下的watchdog 日志中看到:

Subject: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)

就是这里来的。

(2)newFd 其实指的是上面一点中提到的finalStack 的fd,所以最后在dropbox 里面的traces 没有initialStack 的log

具体的 addErrorToDropBox() 函数暂时不分析,下面会结合log 进一步解释,在平台上的 /data/system/dropbox/ 目录下可以看到:

-rw-------  1 system system   329 2007-11-06 00:38 SYSTEM_BOOT@1194309534851.txt
-rw-------  1 system system   329 2007-11-06 00:43 SYSTEM_BOOT@1194309834605.txt
-rw-------  1 system system   329 2007-11-06 00:45 SYSTEM_RESTART@1194309831228.txt
-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309853600.txt
-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309854775.txt
-rw-------  1 system system  4302 2007-11-06 00:43 system_app_strictmode@1194309835666.txt.gz
-rw-------  1 system system  1832 2007-11-06 00:43 system_app_strictmode@1194309837786.txt.gz
-rw-------  1 system system  2009 2007-11-06 00:44 system_app_strictmode@1194309840686.txt.gz
-rw-------  1 system system  4035 2007-11-06 00:45 system_server_lowmem@1194309831226.txt.gz
-rw-------  1 system system 19587 2007-11-06 00:05 system_server_watchdog@1194307528602.txt.gz

一般出问题找最新的traces 文件即可。

1.5 logcat 中打印堆栈信息

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    for (int i=0; i<blockedCheckers.size(); i++) {
        Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
        StackTraceElement[] stackTrace
                 = blockedCheckers.get(i).getThread().getStackTrace();
        for (StackTraceElement element: stackTrace) {
            Slog.w(TAG, "    at " + element);
        }
    }
    Slog.w(TAG, "*** GOODBYE!");
    Process.killProcess(Process.myPid());
    System.exit(10);

在logcat 中会看到这样的打印:

11-06 00:05:28.603  2197  2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)
11-06 00:05:28.603  2197  2441 W Watchdog: foreground thread stack trace:
11-06 00:05:28.603  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)
11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:790)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604  2197  2441 W Watchdog: ActivityManager stack trace:
11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)
11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:106)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604  2197  2441 W Watchdog: *** GOODBYE!
11-06 00:05:28.605  2197  2441 I Process : Sending signal. PID: 2197 SIG: 9

从log 中看到:

  • monitor是在 fg thread 中,也就是 WatchDog 中的 mMonitorChecker 中的 monitor() 触发了
  • 除了 monitor(),AMS 中的 Handler 也触发了WatchDog
  • 最后会把出现问题的进程kill 掉,这里是 pid 2197

2. 问题分析

结合上面的log,我们来分析一下。

2.1 logcat

上面的第 1.5 点中已经说明,通过打印可以大概知道大概是哪个线程或者是哪个线程的monitor 出现了问题

2.2 分析traces 文件

先来看下 /data/anr/ 的 watchdog 文件,根据上面 logcat 可以知道最后的pid 是2197,那么就要找结尾是 2197 字符串的traces文件,也就是 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197

(1)根据上面第 1.5 点,先找到monitor 出现的地方,也就是 fg thread:

"android.fg" prio=5 tid=17 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380dc8 self=0x9e3b3200
  | sysTid=2225 nice=0 cgrp=default sched=0/0 handle=0x9159d970
  | state=S schedstat=( 0 0 0 ) utm=5 stm=2 core=2 HZ=100
  | stack=0x9149b000-0x9149d000 stackSize=1038KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)
  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
  at android.os.Handler.handleCallback(Handler.java:790)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:164)
  at android.os.HandlerThread.run(HandlerThread.java:65)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

在等待锁,这个锁被tid 为 81 的线程持有,下面来看下tid 为81 干了什么:

"Binder:2197_4" prio=5 tid=81 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14041c20 self=0x8d713200
  | sysTid=2435 nice=-10 cgrp=default sched=0/0 handle=0x8c749970
  | state=S schedstat=( 0 0 0 ) utm=9 stm=6 core=2 HZ=100
  | stack=0x8c64f000-0x8c651000 stackSize=1006KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)
  at com.android.server.PermissionDialogReqQueue$PermissionDialogReq.get(PermissionDialogReqQueue.java:100)
  - locked <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)
  at com.android.server.AppOpsService.noteOperationUnchecked(AppOpsService.java:1157)
  at com.android.server.AppOpsService.noteOperation(AppOpsService.java:1086)
  at com.android.server.am.ActivityManagerService.isStartAllowed(ActivityManagerService.java:24937)
  at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:369)
  at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:18450)
  - locked <0x0316e620> (a com.android.server.am.ActivityManagerService)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:528)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2971)
  at android.os.Binder.execTransact(Binder.java:697)

是在 startService() 的时候占用了AMS 这把锁,但是后面又在等 PermissionDialogReqQueue 的锁,这个锁是自己锁上的并没有放出来,导致了一连串的死锁。

(2)同样的,根据上面第 1.5 点,来看下 AMS 中的Handler 也触发了  WatchDog

"ActivityManager" prio=5 tid=12 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380710 self=0x9e0fae00
  | sysTid=2216 nice=-2 cgrp=default sched=0/0 handle=0x91bb4970
  | state=S schedstat=( 0 0 0 ) utm=60 stm=23 core=3 HZ=100
  | stack=0x91ab2000-0x91ab4000 stackSize=1038KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)
  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
  at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)
  at android.os.Handler.dispatchMessage(Handler.java:106)
  at android.os.Looper.loop(Looper.java:164)
  at android.os.HandlerThread.run(HandlerThread.java:65)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

等待同样的一个锁

结合上面的 traces 分析,大概能看出来应该是 AMS 的锁被持有一直没有放掉,再来看下dropbox 下的 traces(一般到第二次的traces时候已经出现了ANR),这个时候主线程应该会出现ANR,来看下main:

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x723919f0 self=0xa7a28000
  | sysTid=2197 nice=-2 cgrp=default sched=0/0 handle=0xabb1e4a4
  | state=S schedstat=( 0 0 0 ) utm=320 stm=99 core=2 HZ=100
  | stack=0xbe447000-0xbe449000 stackSize=8MB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20018)
  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4175)
  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4165)
  at com.android.server.BatteryService$9.run(BatteryService.java:578)
  at android.os.Handler.handleCallback(Handler.java:790)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:164)
  at com.android.server.SystemServer.run(SystemServer.java:435)
  at com.android.server.SystemServer.main(SystemServer.java:278)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:787)

ANR 也是应该这个锁,根据两个traces 分析,可以确定是因为 PermissionDialogReqQueue  自身的锁没有释放,导致了timeout 触发watch dog。

但是实际上,这个第二次的 log 意义并不大,因为这个时候系统 ANR,需要检查的是系统为什么出现 ANR,也就是说第一次 traces 的log 可能在后面的 debug 中起到关键因素。

3、寻找原因

来看下 PermissionDialogReqQueue.java 100行 get() 函数:

        public int get() {
            synchronized (this) {
                try {
                    wait();
                } catch (InterruptedException e) {
                    //ignore
                }
            }
            Log.d(TAG, "result from permission dialog is " + mResult);
            return mResult;
        }

从逻辑上看,这里会通过wait 等待dialog 返回结果,在set 函数的时候notify:

        public void set(int res) {
            synchronized (this) {
                mResult = res;
                notifyAll();
            }
        }

逻辑上处理应该是没有问题的,如果dialog 在30秒内点击,之后notify() 就会放掉之前持有的锁。但是从现象上看,是在dialog 弹出之后,AMS 接口调用导致了dialog 卡死,这样无法正常的释放锁,最终导致了WatchDog 重启system。 

这个问题弄的我很疑惑,逻辑上都是ok 就是在AMS 中要锁的时候导致了dialog 卡住,后来怀疑可能是dialog 的同时调用了AMS 的接口导致了主线程卡死,几经挫折想到将 dialog 放到子线程中show(Android 子线程中更新UI 详解),结果居然成功了,还在继续寻找原因中!

 类似资料: