The following warning often shows up in ceph -s:
1 clients failing to respond to cache pressure
Roughly speaking, it means the CephFS MDS has asked a client to release part of its metadata cache, and the client is not releasing it promptly, so the MDS reports this warning to the monitor.
Below we walk through the code to find the detailed cause and how to avoid or fix it.
The place in the MDS code that raises this warning is the following:
void Beacon::notify_health(MDSRank const *mds)
{
  // The function is long; only the part that raises this warning is shown.
  set<Session*> sessions;
  mds->sessionmap.get_client_session_set(sessions);

  const auto recall_warning_threshold = g_conf().get_val<Option::size_t>("mds_recall_warning_threshold");
  const auto max_completed_requests = g_conf()->mds_max_completed_requests;
  const auto max_completed_flushes = g_conf()->mds_max_completed_flushes;
  std::vector<MDSHealthMetric> late_recall_metrics;
  std::vector<MDSHealthMetric> large_completed_requests_metrics;
  for (auto& session : sessions) {
    const uint64_t recall_caps = session->get_recall_caps(); // the recall caps counter for this session
    if (recall_caps > recall_warning_threshold) {
      dout(2) << "Session " << *session <<
           " is not releasing caps fast enough. Recalled caps at " << recall_caps
          << " > " << recall_warning_threshold << " (mds_recall_warning_threshold)." << dendl;
      std::ostringstream oss;
      oss << "Client " << session->get_human_name() << " failing to respond to cache pressure";
      MDSHealthMetric m(MDS_HEALTH_CLIENT_RECALL, HEALTH_WARN, oss.str());
      m.metadata["client_id"] = stringify(session->get_client());
      late_recall_metrics.emplace_back(std::move(m));
    }
  }
It is easy to see that the warning is reported as soon as a session's recall_caps value exceeds recall_warning_threshold, which is read from configuration (mds_recall_warning_threshold) and defaults to 32K.
So why does recall_caps grow past 32K? Let's keep digging.
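Before going further into the code, note that the per-session counters can be inspected at runtime with ceph daemon mds.<name> session ls; in recent releases the session dump includes decay counters such as recall_caps and release_caps (the exact fields depend on the Ceph version), which makes it easy to see which client the warning refers to.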
recall_caps is modified in the following places:
uint64_t Session::notify_recall_sent(size_t new_limit)
{
  const auto num_caps = caps.size();
  ceph_assert(new_limit < num_caps);  // Behaviour of Server::recall_client_state
  const auto count = num_caps-new_limit;
  uint64_t new_change;
  if (recall_limit != new_limit) {
    new_change = count;
  } else {
    new_change = 0; /* no change! */
  }

  /* Always hit the session counter as a RECALL message is still sent to the
   * client and we do not want the MDS to burn its global counter tokens on a
   * session that is not releasing caps (i.e. allow the session counter to
   * throttle future RECALL messages).
   */
  recall_caps_throttle.hit(count);
  recall_caps_throttle2o.hit(count);
  recall_caps.hit(count); // bump the per-session recall counter
  return new_change;
}

void Session::notify_cap_release(size_t n_caps)
{
  recall_caps.hit(-(double)n_caps); // decrease the recall counter
  release_caps.hit(n_caps);
}
Clearly, notify_recall_sent increases recall_caps and notify_cap_release decreases it.
notify_recall_sent is called when the MDS asks a client to release cache;
notify_cap_release is called when the MDS receives the client's cap-release reply.
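Also note that recall_caps, release_caps and the two throttles are decay counters: their value decays exponentially over time (the half-life for recall_caps comes from mds_recall_warning_decay_rate, 60s by default in recent releases; check your version), so the counter only keeps climbing when recalls are sent faster than the client acknowledges releases. Below is a minimal, self-contained model of that behaviour, written purely for illustration; it is not Ceph's actual DecayCounter implementation.

#include <chrono>
#include <cmath>
#include <iostream>

// Simplified stand-in for Ceph's DecayCounter: the value decays exponentially
// with the configured half-life, and hit() adds to (or subtracts from) it.
struct SimpleDecayCounter {
  double value = 0;
  double half_life;  // seconds, e.g. mds_recall_warning_decay_rate
  std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();

  explicit SimpleDecayCounter(double hl) : half_life(hl) {}

  void decay() {
    auto now = std::chrono::steady_clock::now();
    double dt = std::chrono::duration<double>(now - last).count();
    value *= std::exp2(-dt / half_life);  // value halves every half_life seconds
    last = now;
  }
  void hit(double d) { decay(); value += d; }
  double get()       { decay(); return value; }
};

int main() {
  // Assume a 60s half-life (illustrative; matches the upstream default for
  // mds_recall_warning_decay_rate in recent releases).
  SimpleDecayCounter recall_caps(60);

  // The MDS recalls 5000 caps per pass while the client releases nothing:
  // the counter just keeps climbing toward the 32K warning threshold.
  // (Time is compressed here, so the decay between iterations is negligible.)
  for (int pass = 1; pass <= 10; ++pass) {
    recall_caps.hit(5000);  // the Session::notify_recall_sent path
    std::cout << "after recall pass " << pass << ": " << recall_caps.get() << "\n";
  }

  // If the client had acknowledged releases, notify_cap_release would have
  // subtracted them, e.g. recall_caps.hit(-5000), keeping the value low.
  return 0;
}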
Next, let's look at which scenarios trigger the MDS to send a recall message to a client.
notify_recall_sent is called from only one place in the MDS: Server::recall_client_state.
recall_client_state itself is called from two places:
1. when the MDS is told to drop its cache via ceph daemon or ceph tell (not analyzed here);
2. the upkeeper thread in MDCache, which calls recall_client_state periodically.
The upkeeper thread is implemented as follows:
upkeeper = std::thread([this]() {
  std::unique_lock lock(upkeep_mutex);
  while (!upkeep_trim_shutdown.load()) {
    auto now = clock::now();
    auto since = now-upkeep_last_trim;
    auto trim_interval = clock::duration(g_conf().get_val<std::chrono::seconds>("mds_cache_trim_interval")); // default 1s
    if (since >= trim_interval*.90) {
      lock.unlock(); /* mds_lock -> upkeep_mutex */
      std::scoped_lock mds_lock(mds->mds_lock);
      lock.lock();
      if (upkeep_trim_shutdown.load())
        return;
      if (mds->is_cache_trimmable()) {
        dout(20) << "upkeep thread trimming cache; last trim " << since << " ago" << dendl;
        trim_client_leases();
        trim();
        check_memory_usage();
        auto flags = Server::RecallFlags::ENFORCE_MAX|Server::RecallFlags::ENFORCE_LIVENESS;
        mds->server->recall_client_state(nullptr, flags);
        upkeep_last_trim = now = clock::now();
      } else {
        dout(10) << "cache not ready for trimming" << dendl;
      }
    } else {
      trim_interval -= since;
    }
    since = now-upkeep_last_release;
    auto release_interval = clock::duration(g_conf().get_val<std::chrono::seconds>("mds_cache_release_free_interval"));
    if (since >= release_interval) {
      /* XXX not necessary once MDCache uses PriorityCache */
      dout(10) << "releasing free memory" << dendl;
      ceph_heap_release_free_memory();
      upkeep_last_release = clock::now();
    } else {
      release_interval -= since;
    }
    auto interval = std::min(release_interval, trim_interval);
    dout(20) << "upkeep thread waiting interval " << interval << dendl;
    upkeep_cvar.wait_for(lock, interval);
  }
});
The main job of the upkeeper thread is:
every mds_cache_trim_interval (1s by default), trim client leases, trim the cache, and check memory usage.
When recall_client_state is called:
if memory has exceeded its limit (the check_memory_usage path), the flag passed in is Server::RecallFlags::TRIM;
if it is the periodic upkeeper call, the flags are Server::RecallFlags::ENFORCE_MAX|Server::RecallFlags::ENFORCE_LIVENESS.
std::pair<bool, uint64_t> Server::recall_client_state(MDSGatherBuilder* gather, RecallFlags flags)
{
  const auto now = clock::now();
  const bool steady = !!(flags&RecallFlags::STEADY);
  const bool enforce_max = !!(flags&RecallFlags::ENFORCE_MAX);
  const bool enforce_liveness = !!(flags&RecallFlags::ENFORCE_LIVENESS);
  const bool trim = !!(flags&RecallFlags::TRIM);
  const auto max_caps_per_client = g_conf().get_val<uint64_t>("mds_max_caps_per_client");
  const auto min_caps_per_client = g_conf().get_val<uint64_t>("mds_min_caps_per_client");
  const auto recall_global_max_decay_threshold = g_conf().get_val<Option::size_t>("mds_recall_global_max_decay_threshold");
  const auto recall_max_caps = g_conf().get_val<Option::size_t>("mds_recall_max_caps");
  const auto recall_max_decay_threshold = g_conf().get_val<Option::size_t>("mds_recall_max_decay_threshold");
  const auto cache_liveness_magnitude = g_conf().get_val<Option::size_t>("mds_session_cache_liveness_magnitude");

  dout(7) << __func__ << ":"
          << " min=" << min_caps_per_client
          << " max=" << max_caps_per_client
          << " total=" << Capability::count()
          << " flags=" << flags
          << dendl;

  /* trim caps of sessions with the most caps first */
  std::multimap<uint64_t, Session*> caps_session;
  // pick the sessions that qualify for recall into caps_session
  auto f = [&caps_session, enforce_max, enforce_liveness, trim, max_caps_per_client, cache_liveness_magnitude](auto& s) {
    auto num_caps = s->caps.size();
    auto cache_liveness = s->get_session_cache_liveness();
    if (trim || (enforce_max && num_caps > max_caps_per_client) || (enforce_liveness && cache_liveness < (num_caps>>cache_liveness_magnitude))) {
      caps_session.emplace(std::piecewise_construct, std::forward_as_tuple(num_caps), std::forward_as_tuple(s));
    }
  };
  mds->sessionmap.get_client_sessions(std::move(f));

  std::pair<bool, uint64_t> result = {false, 0};
  auto& [throttled, caps_recalled] = result;
  last_recall_state = now;
  // walk the selected sessions, largest cap count first
  for (const auto& [num_caps, session] : boost::adaptors::reverse(caps_session)) {
    if (!session->is_open() ||
        !session->get_connection() ||
        !session->info.inst.name.is_client()) // only client-type sessions are handled
      continue;

    dout(10) << __func__ << ":"
             << " session " << session->info.inst
             << " caps " << num_caps
             << ", leases " << session->leases.size()
             << dendl;

    uint64_t newlim;
    // if the cap count is below recall_max_caps + min_caps_per_client, clamp the new limit to min_caps_per_client
    if (num_caps < recall_max_caps || (num_caps-recall_max_caps) < min_caps_per_client) {
      newlim = min_caps_per_client;
    } else {
      newlim = num_caps-recall_max_caps;
    }
    if (num_caps > newlim) {
      /* now limit the number of caps we recall at a time to prevent overloading ourselves */
      uint64_t recall = std::min<uint64_t>(recall_max_caps, num_caps-newlim);
      newlim = num_caps-recall;
      const uint64_t session_recall_throttle = session->get_recall_caps_throttle();
      const uint64_t session_recall_throttle2o = session->get_recall_caps_throttle2o();
      const uint64_t global_recall_throttle = recall_throttle.get();
      if (session_recall_throttle+recall > recall_max_decay_threshold) {
        dout(15) << " session recall threshold (" << recall_max_decay_threshold << ") hit at " << session_recall_throttle << "; skipping!" << dendl;
        throttled = true;
        continue;
      } else if (session_recall_throttle2o+recall > recall_max_caps*2) {
        dout(15) << " session recall 2nd-order threshold (" << 2*recall_max_caps << ") hit at " << session_recall_throttle2o << "; skipping!" << dendl;
        throttled = true;
        continue;
      } else if (global_recall_throttle+recall > recall_global_max_decay_threshold) {
        dout(15) << " global recall threshold (" << recall_global_max_decay_threshold << ") hit at " << global_recall_throttle << "; skipping!" << dendl;
        throttled = true;
        break;
      }
      // some code omitted ...
      dout(7) << " recalling " << recall << " caps; session_recall_throttle = " << session_recall_throttle << "; global_recall_throttle = " << global_recall_throttle << dendl;

      auto m = make_message<MClientSession>(CEPH_SESSION_RECALL_STATE);
      m->head.max_caps = newlim;
      mds->send_message_client(m, session);
      if (gather) {
        flush_session(session, gather);
      }
      caps_recalled += session->notify_recall_sent(newlim);
      recall_throttle.hit(recall);
    }
  }

  dout(7) << "recalled" << (throttled ? " (throttled)" : "") << " " << caps_recalled << " client caps." << dendl;

  return result;
}
The heart of it is these three selection conditions:
1. trim
Forced trimming only happens when MDS memory usage reaches 95% of its limit. In our environment the limit is 40G and the MDS was only using a bit over 800MB, so this is clearly not the branch being taken.
2. enforce_max && num_caps > max_caps_per_client
The client's cap count exceeds max_caps_per_client (1M by default). Our environment does not exceed that either (it can be checked with session ls).
3. enforce_liveness && cache_liveness < (num_caps>>cache_liveness_magnitude)
liveness is tracked with a decay counter. If a client is not actively using its cache, its cache_liveness value keeps decaying, and if that client also holds a fairly large number of caps, it easily falls into this branch.
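A quick worked example for condition 3, assuming the upstream default mds_session_cache_liveness_magnitude of 10: a session holding 409600 caps is selected for recall whenever its cache_liveness counter falls below 409600 >> 10 = 400, i.e. roughly fewer than 400 recent cache hits within the decay window. A mostly idle client that simply keeps a large working set cached will therefore be picked on every upkeeper pass.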
At this point we know why the MDS keeps sending recalls to this client. But if the client is in fact able to release its cache, why does the warning still show up?
The upkeeper thread runs once per second by default. If the client has not responded within that second, the MDS keeps sending recalls to the same client on every pass,
so recall_caps keeps getting hit upward; once it crosses recall_warning_threshold, the warning is raised.
So one way to mitigate this is to increase mds_cache_trim_interval, lowering the frequency of the upkeeper thread and giving the client more time to drop its cache.
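A minimal sketch of that mitigation (the value 5 is purely illustrative; choose an interval that fits your workload, since a longer interval also delays normal cache trimming):

ceph config set mds mds_cache_trim_interval 5

Raising mds_recall_warning_threshold is another knob, but it only hides the warning rather than changing how aggressively caps are recalled.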