当前位置: 首页 > 工具软件 > log2timeline > 使用案例 >

Timeline-server 做清理时卡住 hive 任务分析

富波光
2023-12-01

当 timeline server 使用LeveldbTimelineStore 时,hive 任务有时卡住。

1. 根本原因

根本原因,在删除过期数据时,对于每种 Entiry 类型,拿到 deleteLock.writeLock(), 在put 操作,需要拿到 deleteLock.readLock。当长时间做清理工作时,put 需要一直等待。

2. 推荐方案

2.1 关闭 timeline server

在 yarn-site.xml 配置 yarn.timeline-service.enabled=false

影响: 此影响比较大,不仅 tez-ui 无法打开,其他服务如 spark history 也受影响。但是比较彻底,timeline server 不会出现问题。

2.2 关闭 hive 不向 timeline server 发送数据

在 hive-site.xml 配置 yarn.timeline-service.enabled=false。因为现在 tez 产生的数据最多,所以关闭 hive 任务能解决此问题。
影响:tez-ui 没有用。历史任务如果知道 application id,查看日志没有影响。当其他任务发送大量数据时,还是会有问题。

2.3 减少 timeline ttl 时间

   <property>
      <name>yarn.timeline-service.ttl-ms</name>
      <value>604800000</value>
    </property>

默认 7 天,减少 timeline ttl 时间可以缓解,但不能彻底解决此问题。

3. 分析过程

3.1 测试当 timelineserver 没有运行,任务卡住一会还是能继续运行?

从以下日志分析,当 timelineserver 没有运行,任务卡住一会(5秒),然后继续运行

22/09/21 15:38:38 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:39 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:40 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:41 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:42 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)

3.2 测试 timeline 各数据类型的数据量

level db(92G) 数据类型统计。手动 leveldb 切换时间 9月1日21:26.
type:d, count: 180466
type:e, count: 205866053
type:i, count: 800383696
type:k, count: 9511102
type:o, count: 180466

d: DOMAIN_ENTRY_PREFIX
e: ENTITY_ENTRY_PREFIX
i: INDEXED_ENTRY_PREFIX
k: START_TIME_LOOKUP_PREFIX
o: OWNER_LOOKUP_PREFIX

3.3 2022-09-01部分清理日志

2022-09-01 15:28:57,711 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 12662 entities for timestamp 1661412494248 and earlier in 43.463 seconds
2022-09-01 15:34:48,522 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 15766 entities for timestamp 1661412837711 and earlier in 50.811 seconds
2022-09-01 15:41:31,624 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 32860 entities for timestamp 1661413188523 and earlier in 103.101 seconds
2022-09-01 15:52:43,284 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 128077 entities for timestamp 1661413591624 and earlier in 371.66 seconds
2022-09-01 16:11:08,737 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 279761 entities for timestamp 1661414263284 and earlier in 805.453 seconds
2022-09-01 16:46:48,740 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 630752 entities for timestamp 1661415368738 and earlier in 1840.002 seconds
2022-09-01 17:45:03,624 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1147831 entities for timestamp 1661417508740 and earlier in 3194.884 seconds
2022-09-01 19:30:44,539 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1664671 entities for timestamp 1661421003624 and earlier in 6040.915 seconds
2022-09-01 21:04:49,154 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1855544 entities for timestamp 1661427344539 and earlier in 5344.615 seconds
2022-09-01 22:26:20,494 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 456813 entities for timestamp 1661436231351 and earlier in 1349.129 seconds

说明在切换leveldb 之前,每次 delete 过期数据耗费太长时间。导致 leveldb 响应变慢。

3.4 测试是否不能删除彻底。

如现在保留7天的日志,如果改为1小时,那么就会删除大部分日志。并且在1小时后,基本没有日志。

3.4.1 某集群,重启timeline-server 之前个日期的文件数量,第1列为文件数量,总数据量为 30G

   3942 Sep 16
   7073 Sep 17
   3218 Sep 18
    648 Sep 19
   2630 Sep 20
   1004 Sep 21

3.4.2 重启 timeline-server 时间 09:18

[root@bmr-master-40d309e ~]# grep -e Discarded -e "Deleted" /mnt/bmr/log/hadoop-yarn/yarn/hadoop-yarn-timelineserver-bmr-master-40d309e.log | grep -v  "Discarded 0 entities" | grep -v "Deleted 0 "
2022-09-21 09:19:05,341 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION
2022-09-21 09:19:15,403 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 09:19:31,542 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 15392 entities of type TEZ_CONTAINER_ID
2022-09-21 09:19:32,657 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 09:19:38,116 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_ID
2022-09-21 10:28:04,133 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1909751 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 11:26:06,381 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1915052 entities of type TEZ_TASK_ID
2022-09-21 11:29:58,285 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 21668 entities of type TEZ_VERTEX_ID
2022-09-21 11:30:00,194 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2262 entities of type YARN_APPLICATION
2022-09-21 11:30:01,634 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2265 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 11:33:38,043 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2264 entities of type YARN_CONTAINER
2022-09-21 11:33:38,043 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 3876248 entities for timestamp 1663719541882 and earlier in 8076.019 seconds

3.4.3 查看 leveldb数据量

过两个小时后,查看数据量,为1.2G。数据量有变少, 说明清理清除历史数据, 但是可能比较慢。

[root@bmr-master-40d309e ~]# du -s -h /mnt/hadoop/yarn/timeline/leveldb-timeline-store.ldb/
1.2G  /mnt/hadoop/yarn/timeline/leveldb-timeline-store.ldb/

3.5 测试 delete 过去 entries 卡住put

再进行一次测试,在测试的时候,执行 tpcds-3 query。

3.5.1 timeline 相关日志

从以下日志分析,结束时间是 2022-09-21 16:31:48,531,开始时间是 2022-09-21 16:16:34,262。每种类型的删除是获取锁的。

2022-09-21 16:16:36,468 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type HIVE_QUERY_ID
2022-09-21 16:16:36,795 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION
2022-09-21 16:16:37,178 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 16:16:39,607 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2117 entities of type TEZ_CONTAINER_ID
2022-09-21 16:16:39,749 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 16:16:40,497 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_ID
2022-09-21 16:23:59,988 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234974 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 16:31:29,118 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234823 entities of type TEZ_TASK_ID
2022-09-21 16:31:42,539 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 3334 entities of type TEZ_VERTEX_ID
2022-09-21 16:31:43,628 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1271 entities of type YARN_APPLICATION
2022-09-21 16:31:44,056 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 682 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 16:31:48,531 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 668 entities of type YARN_CONTAINER
2022-09-21 16:31:48,531 INFO  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 479182 entities for timestamp 1663744594222 and earlier in 914.279 seconds

3.5.2 tpcds测试结果

可以看到,query3 卡住很长时间。

Begin exec query1 at 20220921-16:16.58
Run DAG                                17.07s
End exec query1 successfully at 20220921-16:18.21
Begin exec query2 at 20220921-16:18.21
Run DAG                                24.78s
End exec query2 successfully at 20220921-16:19.04
Begin exec query3 at 20220921-16:19.04
Run DAG                                13.19s
End exec query3 successfully at 20220921-16:32.07
Begin exec query4 at 20220921-16:32.07
Run DAG                                39.89s
End exec query4 successfully at 20220921-16:33.09
Begin exec query5 at 20220921-16:33.09

2. 代码分析

2.1 删除过期数据线程

@Override
    public void run() {
      while (true) {
        long timestamp = System.currentTimeMillis() - ttl;
        try {
          discardOldEntities(timestamp);
          Thread.sleep(ttlInterval);
        } catch (IOException e) {
          LOG.error(e.toString());
        } catch (InterruptedException e) {
          LOG.info("Deletion thread received interrupt, exiting");
          break;
        }
      }
    }

2.2 discardOldEntities 方法

在处理每种类型时,先获取 writeLock,处理后释放。

void discardOldEntities(long timestamp)
      throws IOException, InterruptedException {
    byte[] reverseTimestamp = writeReverseOrderedLong(timestamp);
    long totalCount = 0;
    long t1 = System.currentTimeMillis();
    try {
      List<String> entityTypes = getEntityTypes();
      for (String entityType : entityTypes) {
        LeveldbIterator iterator = null;
        LeveldbIterator pfIterator = null;
        long typeCount = 0;
        try {
          deleteLock.writeLock().lock();
          iterator = getDbIterator(false);
          pfIterator = getDbIterator(false);

          if (deletionThread != null && deletionThread.isInterrupted()) {
            throw new InterruptedException();
          }
          boolean seeked = false;
          while (deleteNextEntity(entityType, reverseTimestamp, iterator,
              pfIterator, seeked)) {
            typeCount++;
            totalCount++;
            seeked = true;
            if (deletionThread != null && deletionThread.isInterrupted()) {
              throw new InterruptedException();
            }
          }
        } catch (IOException e) {
          LOG.error("Got IOException while deleting entities for type " +
              entityType + ", continuing to next type", e);
        } finally {
          IOUtils.cleanupWithLogger(LOG, iterator, pfIterator);
          deleteLock.writeLock().unlock();
          if (typeCount > 0) {
            LOG.info("Deleted " + typeCount + " entities of type " +
                entityType);
          }
        }
      }
    } finally {
      long t2 = System.currentTimeMillis();
      LOG.info("Discarded " + totalCount + " entities for timestamp " +
          timestamp + " and earlier in " + (t2 - t1) / 1000.0 + " seconds");
    }
  }

2.3 put 方法

在 put 时,获取读锁,如果正在删除,则此方法一直等待

@Override
  public TimelinePutResponse put(TimelineEntities entities) {
    try {
      deleteLock.readLock().lock();
      TimelinePutResponse response = new TimelinePutResponse();
      for (TimelineEntity entity : entities.getEntities()) {
        put(entity, response, false);
      }
      return response;
    } finally {
      deleteLock.readLock().unlock();
    }
  }
 类似资料: