当 timeline server 使用LeveldbTimelineStore 时,hive 任务有时卡住。
根本原因,在删除过期数据时,对于每种 Entiry 类型,拿到 deleteLock.writeLock()
, 在put 操作,需要拿到 deleteLock.readLock
。当长时间做清理工作时,put 需要一直等待。
在 yarn-site.xml 配置 yarn.timeline-service.enabled=false
影响: 此影响比较大,不仅 tez-ui 无法打开,其他服务如 spark history 也受影响。但是比较彻底,timeline server 不会出现问题。
在 hive-site.xml 配置 yarn.timeline-service.enabled=false
。因为现在 tez 产生的数据最多,所以关闭 hive 任务能解决此问题。
影响:tez-ui 没有用。历史任务如果知道 application id,查看日志没有影响。当其他任务发送大量数据时,还是会有问题。
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>
默认 7 天,减少 timeline ttl 时间可以缓解,但不能彻底解决此问题。
从以下日志分析,当 timelineserver 没有运行,任务卡住一会(5秒),然后继续运行
22/09/21 15:38:38 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:39 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:40 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:41 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:42 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
level db(92G) 数据类型统计。手动 leveldb 切换时间 9月1日21:26.
type:d, count: 180466
type:e, count: 205866053
type:i, count: 800383696
type:k, count: 9511102
type:o, count: 180466
d: DOMAIN_ENTRY_PREFIX
e: ENTITY_ENTRY_PREFIX
i: INDEXED_ENTRY_PREFIX
k: START_TIME_LOOKUP_PREFIX
o: OWNER_LOOKUP_PREFIX
2022-09-01 15:28:57,711 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 12662 entities for timestamp 1661412494248 and earlier in 43.463 seconds
2022-09-01 15:34:48,522 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 15766 entities for timestamp 1661412837711 and earlier in 50.811 seconds
2022-09-01 15:41:31,624 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 32860 entities for timestamp 1661413188523 and earlier in 103.101 seconds
2022-09-01 15:52:43,284 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 128077 entities for timestamp 1661413591624 and earlier in 371.66 seconds
2022-09-01 16:11:08,737 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 279761 entities for timestamp 1661414263284 and earlier in 805.453 seconds
2022-09-01 16:46:48,740 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 630752 entities for timestamp 1661415368738 and earlier in 1840.002 seconds
2022-09-01 17:45:03,624 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1147831 entities for timestamp 1661417508740 and earlier in 3194.884 seconds
2022-09-01 19:30:44,539 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1664671 entities for timestamp 1661421003624 and earlier in 6040.915 seconds
2022-09-01 21:04:49,154 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1855544 entities for timestamp 1661427344539 and earlier in 5344.615 seconds
2022-09-01 22:26:20,494 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 456813 entities for timestamp 1661436231351 and earlier in 1349.129 seconds
说明在切换leveldb 之前,每次 delete 过期数据耗费太长时间。导致 leveldb 响应变慢。
如现在保留7天的日志,如果改为1小时,那么就会删除大部分日志。并且在1小时后,基本没有日志。
3942 Sep 16
7073 Sep 17
3218 Sep 18
648 Sep 19
2630 Sep 20
1004 Sep 21
[root@bmr-master-40d309e ~]# grep -e Discarded -e "Deleted" /mnt/bmr/log/hadoop-yarn/yarn/hadoop-yarn-timelineserver-bmr-master-40d309e.log | grep -v "Discarded 0 entities" | grep -v "Deleted 0 "
2022-09-21 09:19:05,341 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION
2022-09-21 09:19:15,403 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 09:19:31,542 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 15392 entities of type TEZ_CONTAINER_ID
2022-09-21 09:19:32,657 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 09:19:38,116 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_ID
2022-09-21 10:28:04,133 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1909751 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 11:26:06,381 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1915052 entities of type TEZ_TASK_ID
2022-09-21 11:29:58,285 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 21668 entities of type TEZ_VERTEX_ID
2022-09-21 11:30:00,194 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2262 entities of type YARN_APPLICATION
2022-09-21 11:30:01,634 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2265 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 11:33:38,043 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2264 entities of type YARN_CONTAINER
2022-09-21 11:33:38,043 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 3876248 entities for timestamp 1663719541882 and earlier in 8076.019 seconds
过两个小时后,查看数据量,为1.2G。数据量有变少, 说明清理清除历史数据, 但是可能比较慢。
[root@bmr-master-40d309e ~]# du -s -h /mnt/hadoop/yarn/timeline/leveldb-timeline-store.ldb/
1.2G /mnt/hadoop/yarn/timeline/leveldb-timeline-store.ldb/
再进行一次测试,在测试的时候,执行 tpcds-3 query。
从以下日志分析,结束时间是 2022-09-21 16:31:48,531
,开始时间是 2022-09-21 16:16:34,262
。每种类型的删除是获取锁的。
2022-09-21 16:16:36,468 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type HIVE_QUERY_ID
2022-09-21 16:16:36,795 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION
2022-09-21 16:16:37,178 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 16:16:39,607 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2117 entities of type TEZ_CONTAINER_ID
2022-09-21 16:16:39,749 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 16:16:40,497 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_ID
2022-09-21 16:23:59,988 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234974 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 16:31:29,118 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234823 entities of type TEZ_TASK_ID
2022-09-21 16:31:42,539 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 3334 entities of type TEZ_VERTEX_ID
2022-09-21 16:31:43,628 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1271 entities of type YARN_APPLICATION
2022-09-21 16:31:44,056 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 682 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 16:31:48,531 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 668 entities of type YARN_CONTAINER
2022-09-21 16:31:48,531 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 479182 entities for timestamp 1663744594222 and earlier in 914.279 seconds
可以看到,query3 卡住很长时间。
Begin exec query1 at 20220921-16:16.58
Run DAG 17.07s
End exec query1 successfully at 20220921-16:18.21
Begin exec query2 at 20220921-16:18.21
Run DAG 24.78s
End exec query2 successfully at 20220921-16:19.04
Begin exec query3 at 20220921-16:19.04
Run DAG 13.19s
End exec query3 successfully at 20220921-16:32.07
Begin exec query4 at 20220921-16:32.07
Run DAG 39.89s
End exec query4 successfully at 20220921-16:33.09
Begin exec query5 at 20220921-16:33.09
@Override
public void run() {
while (true) {
long timestamp = System.currentTimeMillis() - ttl;
try {
discardOldEntities(timestamp);
Thread.sleep(ttlInterval);
} catch (IOException e) {
LOG.error(e.toString());
} catch (InterruptedException e) {
LOG.info("Deletion thread received interrupt, exiting");
break;
}
}
}
在处理每种类型时,先获取 writeLock,处理后释放。
void discardOldEntities(long timestamp)
throws IOException, InterruptedException {
byte[] reverseTimestamp = writeReverseOrderedLong(timestamp);
long totalCount = 0;
long t1 = System.currentTimeMillis();
try {
List<String> entityTypes = getEntityTypes();
for (String entityType : entityTypes) {
LeveldbIterator iterator = null;
LeveldbIterator pfIterator = null;
long typeCount = 0;
try {
deleteLock.writeLock().lock();
iterator = getDbIterator(false);
pfIterator = getDbIterator(false);
if (deletionThread != null && deletionThread.isInterrupted()) {
throw new InterruptedException();
}
boolean seeked = false;
while (deleteNextEntity(entityType, reverseTimestamp, iterator,
pfIterator, seeked)) {
typeCount++;
totalCount++;
seeked = true;
if (deletionThread != null && deletionThread.isInterrupted()) {
throw new InterruptedException();
}
}
} catch (IOException e) {
LOG.error("Got IOException while deleting entities for type " +
entityType + ", continuing to next type", e);
} finally {
IOUtils.cleanupWithLogger(LOG, iterator, pfIterator);
deleteLock.writeLock().unlock();
if (typeCount > 0) {
LOG.info("Deleted " + typeCount + " entities of type " +
entityType);
}
}
}
} finally {
long t2 = System.currentTimeMillis();
LOG.info("Discarded " + totalCount + " entities for timestamp " +
timestamp + " and earlier in " + (t2 - t1) / 1000.0 + " seconds");
}
}
在 put 时,获取读锁,如果正在删除,则此方法一直等待
@Override
public TimelinePutResponse put(TimelineEntities entities) {
try {
deleteLock.readLock().lock();
TimelinePutResponse response = new TimelinePutResponse();
for (TimelineEntity entity : entities.getEntities()) {
put(entity, response, false);
}
return response;
} finally {
deleteLock.readLock().unlock();
}
}