来源:
https://blog.csdn.net/shift_wwx/article/details/81030749
源码基于:Android O
Android 系统中的 WatchDog 详解 一文中通过source code详细的分析了 WatchDog 的软件设计原理和监测过程,这一篇结合实例对WatchDog 做进一步的分析。
frameworks/base/services/core/java/com/android/server/Watchdog.java
public void run() {
...
final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
File stackFd = finalStack;
if (tracesDirProp.isEmpty()) {
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
String traceFileNameAmendment = "_SystemServer_WDT" +
mTraceDateFormat.format(new Date());
if (tracesPath != null && tracesPath.length() != 0) {
File traceRenameFile = new File(tracesPath);
String newTracesPath;
int lpos = tracesPath.lastIndexOf (".");
if (-1 != lpos) {
newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment
+ tracesPath.substring(lpos);
} else {
newTracesPath = tracesPath + traceFileNameAmendment;
}
traceRenameFile.renameTo(new File(newTracesPath));
stackFd = new File(newTracesPath);
} else {
Slog.w(TAG, "dump WDT Traces: no trace path configured");
}
} else {
// the new trace dumpsing mechanism
String newTracesPath = "traces_SystemServer_WDT"
+ mTraceDateFormat.format(new Date()) + "_pid"
+ String.valueOf(Process.myPid());
File tracesDir = new File(tracesDirProp);
File watchdogTraces = new File(tracesDir, newTracesPath);
try {
if (watchdogTraces.createNewFile()) {
FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),
0600, -1, -1); // -rw------- permissions
// Append both traces from the first and second half
// to a new file, making it easier to debug Watchdog timeouts
//dumpStackTraces() can return a null instance, so check the same
if (initialStack != null) {
// check the last-modified time of this file.
// we are interested in this only it was written to in the
// last minute or so
final long age = System.currentTimeMillis()
- initialStack.lastModified();
final long MINUTE_IN_MILLIS = 1000 * 60;
if (age < MINUTE_IN_MILLIS) {
Slog.e(TAG, "First set of traces taken from "
+ initialStack.getAbsolutePath());
appendFile(watchdogTraces, initialStack);
}
} else {
Slog.e(TAG, "First set of traces are empty!");
}
if (finalStack != null) {
Slog.e(TAG, "Second set of traces taken from "
+ finalStack.getAbsolutePath());
appendFile(watchdogTraces, finalStack);
} else {
Slog.e(TAG, "Second set of traces are empty!");
}
} else {
Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed");
}
} catch (IOException ioe) {
Slog.e(TAG, "Exception creating Watchdog dump file:", ioe);
}
}
final File newFd = stackFd;
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, newFd, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
// At times, when user space watchdog traces don't give an indication on
// which component held a lock, because of which other threads are blocked,
// (thereby causing Watchdog), crash the device to analyze RAM dumps
boolean crashOnWatchdog = SystemProperties
.getBoolean("persist.sys.crashOnWatchdog", false);
if (crashOnWatchdog) {
// Trigger the kernel to dump all blocked threads, and backtraces
// on all CPUs to the kernel log
Slog.e(TAG, "Triggering SysRq for system_server watchdog");
doSysRq('w');
doSysRq('l');
// wait until the above blocked threads be dumped into kernel log
SystemClock.sleep(3000);
// now try to crash the target
doSysRq('c');
}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) { //一般monkey 的时候会打印
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
一帮指的是 /data/anr
String newTracesPath = "traces_SystemServer_WDT"
+ mTraceDateFormat.format(new Date()) + "_pid"
+ String.valueOf(Process.myPid());
File tracesDir = new File(tracesDirProp);
File watchdogTraces = new File(tracesDir, newTracesPath);
最后在平台上的/data/anr/目录下会看到:
-rw------- 1 system system 227332 2007-11-06 00:05 BinderTraces.txt
-rw------- 1 system system 179568 2007-11-06 00:04 anr_2007-11-06-00-04-41-133
-rw------- 1 system system 230506 2007-11-06 00:05 anr_2007-11-06-00-05-14-301
-rw------- 1 system system 180175 2007-11-06 02:57 anr_2007-11-06-02-57-39-825
-rw------- 1 system system 229772 2007-11-06 02:58 anr_2007-11-06-02-58-13-023
-rw------- 1 system system 198777 2007-11-06 19:26 anr_2007-11-06-19-26-00-611
-rw------- 1 system system 249791 2007-11-06 19:26 anr_2007-11-06-19-26-33-888
-rw------- 1 system system 410074 2007-11-06 00:05 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197
-rw------- 1 system system 229772 2007-11-06 02:58 traces_SystemServer_WDT06_Nov_02_58_55.714_pid2726
-rw------- 1 system system 249791 2007-11-06 19:27 traces_SystemServer_WDT06_Nov_19_27_16.140_pid2473
这里文件比较多啊,手机好久没有清理过的缘故,没事的后面会知道到底是哪个文件是我们需要的。
Android 系统中的 WatchDog 详解 的 run() 函数分析中我们知道,当 monitor 超过30秒的时候,check 返回的状态是WAITED_HALF,WatchDog 会给第二次机会,但是同时会将 AMS 的日志记录dump 出来。
appendFile(watchdogTraces, initialStack);
如果第二次还是无法正常返回,会将第二次的log 也添加到watchdogTraces 中:
appendFile(watchdogTraces, finalStack);
最后的 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197 的log 就最终产生了。
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, newFd, null);
}
};
dropboxThread.start();
(1)subject 是blockedChecker 描述:
public String describeBlockedStateLocked() {
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
会在 /data/system/dropbox/ 下的watchdog 日志中看到:
Subject: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)
就是这里来的。
(2)newFd 其实指的是上面一点中提到的finalStack 的fd,所以最后在dropbox 里面的traces 没有initialStack 的log
具体的 addErrorToDropBox() 函数暂时不分析,下面会结合log 进一步解释,在平台上的 /data/system/dropbox/ 目录下可以看到:
-rw------- 1 system system 329 2007-11-06 00:38 SYSTEM_BOOT@1194309534851.txt
-rw------- 1 system system 329 2007-11-06 00:43 SYSTEM_BOOT@1194309834605.txt
-rw------- 1 system system 329 2007-11-06 00:45 SYSTEM_RESTART@1194309831228.txt
-rw------- 1 system system 1414 2007-11-06 00:44 data_app_crash@1194309853600.txt
-rw------- 1 system system 1414 2007-11-06 00:44 data_app_crash@1194309854775.txt
-rw------- 1 system system 4302 2007-11-06 00:43 system_app_strictmode@1194309835666.txt.gz
-rw------- 1 system system 1832 2007-11-06 00:43 system_app_strictmode@1194309837786.txt.gz
-rw------- 1 system system 2009 2007-11-06 00:44 system_app_strictmode@1194309840686.txt.gz
-rw------- 1 system system 4035 2007-11-06 00:45 system_server_lowmem@1194309831226.txt.gz
-rw------- 1 system system 19587 2007-11-06 00:05 system_server_watchdog@1194307528602.txt.gz
一般出问题找最新的traces 文件即可。
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
在logcat 中会看到这样的打印:
11-06 00:05:28.603 2197 2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)
11-06 00:05:28.603 2197 2441 W Watchdog: foreground thread stack trace:
11-06 00:05:28.603 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.handleCallback(Handler.java:790)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604 2197 2441 W Watchdog: ActivityManager stack trace:
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:106)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)
11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
11-06 00:05:28.604 2197 2441 W Watchdog: *** GOODBYE!
11-06 00:05:28.605 2197 2441 I Process : Sending signal. PID: 2197 SIG: 9
从log 中看到:
结合上面的log,我们来分析一下。
上面的第 1.5 点中已经说明,通过打印可以大概知道大概是哪个线程或者是哪个线程的monitor 出现了问题
先来看下 /data/anr/ 的 watchdog 文件,根据上面 logcat 可以知道最后的pid 是2197,那么就要找结尾是 2197 字符串的traces文件,也就是 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197
(1)根据上面第 1.5 点,先找到monitor 出现的地方,也就是 fg thread:
"android.fg" prio=5 tid=17 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13380dc8 self=0x9e3b3200
| sysTid=2225 nice=0 cgrp=default sched=0/0 handle=0x9159d970
| state=S schedstat=( 0 0 0 ) utm=5 stm=2 core=2 HZ=100
| stack=0x9149b000-0x9149d000 stackSize=1038KB
| held mutexes=
at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)
- waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
at android.os.Handler.handleCallback(Handler.java:790)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:164)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
在等待锁,这个锁被tid 为 81 的线程持有,下面来看下tid 为81 干了什么:
"Binder:2197_4" prio=5 tid=81 Waiting
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14041c20 self=0x8d713200
| sysTid=2435 nice=-10 cgrp=default sched=0/0 handle=0x8c749970
| state=S schedstat=( 0 0 0 ) utm=9 stm=6 core=2 HZ=100
| stack=0x8c64f000-0x8c651000 stackSize=1006KB
| held mutexes=
at java.lang.Object.wait(Native method)
- waiting on <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)
at com.android.server.PermissionDialogReqQueue$PermissionDialogReq.get(PermissionDialogReqQueue.java:100)
- locked <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)
at com.android.server.AppOpsService.noteOperationUnchecked(AppOpsService.java:1157)
at com.android.server.AppOpsService.noteOperation(AppOpsService.java:1086)
at com.android.server.am.ActivityManagerService.isStartAllowed(ActivityManagerService.java:24937)
at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:369)
at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:18450)
- locked <0x0316e620> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:528)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2971)
at android.os.Binder.execTransact(Binder.java:697)
是在 startService() 的时候占用了AMS 这把锁,但是后面又在等 PermissionDialogReqQueue 的锁,这个锁是自己锁上的并没有放出来,导致了一连串的死锁。
(2)同样的,根据上面第 1.5 点,来看下 AMS 中的Handler 也触发了 WatchDog
"ActivityManager" prio=5 tid=12 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13380710 self=0x9e0fae00
| sysTid=2216 nice=-2 cgrp=default sched=0/0 handle=0x91bb4970
| state=S schedstat=( 0 0 0 ) utm=60 stm=23 core=3 HZ=100
| stack=0x91ab2000-0x91ab4000 stackSize=1038KB
| held mutexes=
at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)
- waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:164)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
等待同样的一个锁
结合上面的 traces 分析,大概能看出来应该是 AMS 的锁被持有一直没有放掉,再来看下dropbox 下的 traces(一般到第二次的traces时候已经出现了ANR),这个时候主线程应该会出现ANR,来看下main:
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x723919f0 self=0xa7a28000
| sysTid=2197 nice=-2 cgrp=default sched=0/0 handle=0xabb1e4a4
| state=S schedstat=( 0 0 0 ) utm=320 stm=99 core=2 HZ=100
| stack=0xbe447000-0xbe449000 stackSize=8MB
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20018)
- waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81
at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4175)
at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4165)
at com.android.server.BatteryService$9.run(BatteryService.java:578)
at android.os.Handler.handleCallback(Handler.java:790)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:164)
at com.android.server.SystemServer.run(SystemServer.java:435)
at com.android.server.SystemServer.main(SystemServer.java:278)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:787)
ANR 也是应该这个锁,根据两个traces 分析,可以确定是因为 PermissionDialogReqQueue 自身的锁没有释放,导致了timeout 触发watch dog。
但是实际上,这个第二次的 log 意义并不大,因为这个时候系统 ANR,需要检查的是系统为什么出现 ANR,也就是说第一次 traces 的log 可能在后面的 debug 中起到关键因素。
来看下 PermissionDialogReqQueue.java 100行 get() 函数:
public int get() {
synchronized (this) {
try {
wait();
} catch (InterruptedException e) {
//ignore
}
}
Log.d(TAG, "result from permission dialog is " + mResult);
return mResult;
}
从逻辑上看,这里会通过wait 等待dialog 返回结果,在set 函数的时候notify:
public void set(int res) {
synchronized (this) {
mResult = res;
notifyAll();
}
}
逻辑上处理应该是没有问题的,如果dialog 在30秒内点击,之后notify() 就会放掉之前持有的锁。但是从现象上看,是在dialog 弹出之后,AMS 接口调用导致了dialog 卡死,这样无法正常的释放锁,最终导致了WatchDog 重启system。
这个问题弄的我很疑惑,逻辑上都是ok 就是在AMS 中要锁的时候导致了dialog 卡住,后来怀疑可能是dialog 的同时调用了AMS 的接口导致了主线程卡死,几经挫折想到将 dialog 放到子线程中show(Android 子线程中更新UI 详解),结果居然成功了,还在继续寻找原因中!