The following warning often shows up in ceph -s:
1 clients failing to respond to cache pressure
Roughly speaking, it means the CephFS MDS has asked a client to release part of its metadata cache, and the client is not releasing it promptly, so the MDS reports this warning to the monitor.
Below we walk through the code to find the detailed cause and how to avoid or fix it.
The place in the MDS code that raises this warning is the following:
void Beacon::notify_health(MDSRank const *mds)
{
  // The function is long; only the part that raises this warning is shown.
  set<Session*> sessions;
  mds->sessionmap.get_client_session_set(sessions);

  const auto recall_warning_threshold = g_conf().get_val<Option::size_t>("mds_recall_warning_threshold");
  const auto max_completed_requests = g_conf()->mds_max_completed_requests;
  const auto max_completed_flushes = g_conf()->mds_max_completed_flushes;
  std::vector<MDSHealthMetric> late_recall_metrics;
  std::vector<MDSHealthMetric> large_completed_requests_metrics;
  for (auto& session : sessions) {
    const uint64_t recall_caps = session->get_recall_caps(); // the recall caps counter for this session
    if (recall_caps > recall_warning_threshold) {
      dout(2) << "Session " << *session <<
           " is not releasing caps fast enough. Recalled caps at " << recall_caps
          << " > " << recall_warning_threshold << " (mds_recall_warning_threshold)." << dendl;
      std::ostringstream oss;
      oss << "Client " << session->get_human_name() << " failing to respond to cache pressure";
      MDSHealthMetric m(MDS_HEALTH_CLIENT_RECALL, HEALTH_WARN, oss.str());
      m.metadata["client_id"] = stringify(session->get_client());
      late_recall_metrics.emplace_back(std::move(m));
    }
  }
It is easy to see that the warning is reported as soon as a session's recall_caps value exceeds recall_warning_threshold, which is read from configuration (mds_recall_warning_threshold) and defaults to 32K.
So why does recall_caps grow past 32K? Let's keep digging.
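Before going further into the code, note that the per-session counters can be inspected at runtime with ceph daemon mds.<name> session ls; in recent releases the session dump includes decay counters such as recall_caps and release_caps (the exact fields depend on the Ceph version), which makes it easy to see which client the warning refers to.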
recall_caps is modified in the following places:
uint64_t Session::notify_recall_sent(size_t new_limit)
{
  const auto num_caps = caps.size();
  ceph_assert(new_limit < num_caps);  // Behaviour of Server::recall_client_state
  const auto count = num_caps-new_limit;
  uint64_t new_change;
  if (recall_limit != new_limit) {
    new_change = count;
  } else {
    new_change = 0; /* no change! */
  }

  /* Always hit the session counter as a RECALL message is still sent to the
   * client and we do not want the MDS to burn its global counter tokens on a
   * session that is not releasing caps (i.e. allow the session counter to
   * throttle future RECALL messages).
   */
  recall_caps_throttle.hit(count);
  recall_caps_throttle2o.hit(count);
  recall_caps.hit(count); // bump the per-session recall counter
  return new_change;
}

void Session::notify_cap_release(size_t n_caps)
{
  recall_caps.hit(-(double)n_caps); // decrease the recall counter
  release_caps.hit(n_caps);
}
Clearly, notify_recall_sent increases recall_caps and notify_cap_release decreases it.
notify_recall_sent is called when the MDS asks a client to release cache;
notify_cap_release is called when the MDS receives the client's cap-release reply.
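Also note that recall_caps, release_caps and the two throttles are decay counters: their value decays exponentially over time (the half-life for recall_caps comes from mds_recall_warning_decay_rate, 60s by default in recent releases; check your version), so the counter only keeps climbing when recalls are sent faster than the client acknowledges releases. Below is a minimal, self-contained model of that behaviour, written purely for illustration; it is not Ceph's actual DecayCounter implementation.

#include <chrono>
#include <cmath>
#include <iostream>

// Simplified stand-in for Ceph's DecayCounter: the value decays exponentially
// with the configured half-life, and hit() adds to (or subtracts from) it.
struct SimpleDecayCounter {
  double value = 0;
  double half_life;  // seconds, e.g. mds_recall_warning_decay_rate
  std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();

  explicit SimpleDecayCounter(double hl) : half_life(hl) {}

  void decay() {
    auto now = std::chrono::steady_clock::now();
    double dt = std::chrono::duration<double>(now - last).count();
    value *= std::exp2(-dt / half_life);  // value halves every half_life seconds
    last = now;
  }
  void hit(double d) { decay(); value += d; }
  double get()       { decay(); return value; }
};

int main() {
  // Assume a 60s half-life (illustrative; matches the upstream default for
  // mds_recall_warning_decay_rate in recent releases).
  SimpleDecayCounter recall_caps(60);

  // The MDS recalls 5000 caps per pass while the client releases nothing:
  // the counter just keeps climbing toward the 32K warning threshold.
  // (Time is compressed here, so the decay between iterations is negligible.)
  for (int pass = 1; pass <= 10; ++pass) {
    recall_caps.hit(5000);  // the Session::notify_recall_sent path
    std::cout << "after recall pass " << pass << ": " << recall_caps.get() << "\n";
  }

  // If the client had acknowledged releases, notify_cap_release would have
  // subtracted them, e.g. recall_caps.hit(-5000), keeping the value low.
  return 0;
}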
Next, let's look at which scenarios trigger the MDS to send a recall message to a client.
notify_recall_sent is called from only one place in the MDS: Server::recall_client_state.
recall_client_state itself is called from two places:
1. when the MDS is told to drop its cache via ceph daemon or ceph tell (not analyzed here);
2. the upkeeper thread in MDCache, which calls recall_client_state periodically.
The upkeeper thread is implemented as follows:
upkeeper = std::thread([this]() {
  std::unique_lock lock(upkeep_mutex);
  while (!upkeep_trim_shutdown.load()) {
    auto now = clock::now();
    auto since = now-upkeep_last_trim;
    auto trim_interval = clock::duration(g_conf().get_val<std::chrono::seconds>("mds_cache_trim_interval")); // default 1s
    if (since >= trim_interval*.90) {
      lock.unlock(); /* mds_lock -> upkeep_mutex */
      std::scoped_lock mds_lock(mds->mds_lock);
      lock.lock();
      if (upkeep_trim_shutdown.load())
        return;
      if (mds->is_cache_trimmable()) {
        dout(20) << "upkeep thread trimming cache; last trim " << since << " ago" << dendl;
        trim_client_leases();
        trim();
        check_memory_usage();
        auto flags = Server::RecallFlags::ENFORCE_MAX|Server::RecallFlags::ENFORCE_LIVENESS;
        mds->server->recall_client_state(nullptr, flags);
        upkeep_last_trim = now = clock::now();
      } else {
        dout(10) << "cache not ready for trimming" << dendl;
      }
    } else {
      trim_interval -= since;
    }
    since = now-upkeep_last_release;
    auto release_interval = clock::duration(g_conf().get_val<std::chrono::seconds>("mds_cache_release_free_interval"));
    if (since >= release_interval) {
      /* XXX not necessary once MDCache uses PriorityCache */
      dout(10) << "releasing free memory" << dendl;
      ceph_heap_release_free_memory();
      upkeep_last_release = clock::now();
    } else {
      release_interval -= since;
    }
    auto interval = std::min(release_interval, trim_interval);
    dout(20) << "upkeep thread waiting interval " << interval << dendl;
    upkeep_cvar.wait_for(lock, interval);
  }
});
The main job of the upkeeper thread is:
every mds_cache_trim_interval (1s by default), trim client leases, trim the cache, and check memory usage.
When recall_client_state is called:
if memory has exceeded its limit (the check_memory_usage path), the flag passed in is Server::RecallFlags::TRIM;
if it is the periodic upkeeper call, the flags are Server::RecallFlags::ENFORCE_MAX|Server::RecallFlags::ENFORCE_LIVENESS.
std::pair<bool, uint64_t> Server::recall_client_state(MDSGatherBuilder* gather, RecallFlags flags)
{
  const auto now = clock::now();
  const bool steady = !!(flags&RecallFlags::STEADY);
  const bool enforce_max = !!(flags&RecallFlags::ENFORCE_MAX);
  const bool enforce_liveness = !!(flags&RecallFlags::ENFORCE_LIVENESS);
  const bool trim = !!(flags&RecallFlags::TRIM);
  const auto max_caps_per_client = g_conf().get_val<uint64_t>("mds_max_caps_per_client");
  const auto min_caps_per_client = g_conf().get_val<uint64_t>("mds_min_caps_per_client");
  const auto recall_global_max_decay_threshold = g_conf().get_val<Option::size_t>("mds_recall_global_max_decay_threshold");
  const auto recall_max_caps = g_conf().get_val<Option::size_t>("mds_recall_max_caps");
  const auto recall_max_decay_threshold = g_conf().get_val<Option::size_t>("mds_recall_max_decay_threshold");
  const auto cache_liveness_magnitude = g_conf().get_val<Option::size_t>("mds_session_cache_liveness_magnitude");

  dout(7) << __func__ << ":"
          << " min=" << min_caps_per_client
          << " max=" << max_caps_per_client
          << " total=" << Capability::count()
          << " flags=" << flags
          << dendl;

  /* trim caps of sessions with the most caps first */
  std::multimap<uint64_t, Session*> caps_session;
  // pick the sessions that qualify for recall into caps_session
  auto f = [&caps_session, enforce_max, enforce_liveness, trim, max_caps_per_client, cache_liveness_magnitude](auto& s) {
    auto num_caps = s->caps.size();
    auto cache_liveness = s->get_session_cache_liveness();
    if (trim || (enforce_max && num_caps > max_caps_per_client) || (enforce_liveness && cache_liveness < (num_caps>>cache_liveness_magnitude))) {
      caps_session.emplace(std::piecewise_construct, std::forward_as_tuple(num_caps), std::forward_as_tuple(s));
    }
  };
  mds->sessionmap.get_client_sessions(std::move(f));

  std::pair<bool, uint64_t> result = {false, 0};
  auto& [throttled, caps_recalled] = result;
  last_recall_state = now;
  // walk the selected sessions, largest cap count first
  for (const auto& [num_caps, session] : boost::adaptors::reverse(caps_session)) {
    if (!session->is_open() ||
        !session->get_connection() ||
        !session->info.inst.name.is_client()) // only client-type sessions are handled
      continue;

    dout(10) << __func__ << ":"
             << " session " << session->info.inst
             << " caps " << num_caps
             << ", leases " << session->leases.size()
             << dendl;

    uint64_t newlim;
    // if the cap count is below recall_max_caps + min_caps_per_client, clamp the new limit to min_caps_per_client
    if (num_caps < recall_max_caps || (num_caps-recall_max_caps) < min_caps_per_client) {
      newlim = min_caps_per_client;
    } else {
      newlim = num_caps-recall_max_caps;
    }
    if (num_caps > newlim) {
      /* now limit the number of caps we recall at a time to prevent overloading ourselves */
      uint64_t recall = std::min<uint64_t>(recall_max_caps, num_caps-newlim);
      newlim = num_caps-recall;
      const uint64_t session_recall_throttle = session->get_recall_caps_throttle();
      const uint64_t session_recall_throttle2o = session->get_recall_caps_throttle2o();
      const uint64_t global_recall_throttle = recall_throttle.get();
      if (session_recall_throttle+recall > recall_max_decay_threshold) {
        dout(15) << " session recall threshold (" << recall_max_decay_threshold << ") hit at " << session_recall_throttle << "; skipping!" << dendl;
        throttled = true;
        continue;
      } else if (session_recall_throttle2o+recall > recall_max_caps*2) {
        dout(15) << " session recall 2nd-order threshold (" << 2*recall_max_caps << ") hit at " << session_recall_throttle2o << "; skipping!" << dendl;
        throttled = true;
        continue;
      } else if (global_recall_throttle+recall > recall_global_max_decay_threshold) {
        dout(15) << " global recall threshold (" << recall_global_max_decay_threshold << ") hit at " << global_recall_throttle << "; skipping!" << dendl;
        throttled = true;
        break;
      }
      // some code omitted ...
      dout(7) << " recalling " << recall << " caps; session_recall_throttle = " << session_recall_throttle << "; global_recall_throttle = " << global_recall_throttle << dendl;

      auto m = make_message<MClientSession>(CEPH_SESSION_RECALL_STATE);
      m->head.max_caps = newlim;
      mds->send_message_client(m, session);
      if (gather) {
        flush_session(session, gather);
      }
      caps_recalled += session->notify_recall_sent(newlim);
      recall_throttle.hit(recall);
    }
  }

  dout(7) << "recalled" << (throttled ? " (throttled)" : "") << " " << caps_recalled << " client caps." << dendl;

  return result;
}
The heart of it is these three selection conditions:
1. trim
Forced trimming only happens when MDS memory usage reaches 95% of its limit. In our environment the limit is 40G and the MDS was only using a bit over 800MB, so this is clearly not the branch being taken.
2. enforce_max && num_caps > max_caps_per_client
The client's cap count exceeds max_caps_per_client (1M by default). Our environment does not exceed that either (it can be checked with session ls).
3. enforce_liveness && cache_liveness < (num_caps>>cache_liveness_magnitude)
liveness is tracked with a decay counter. If a client is not actively using its cache, its cache_liveness value keeps decaying, and if that client also holds a fairly large number of caps, it easily falls into this branch.
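A quick worked example for condition 3, assuming the upstream default mds_session_cache_liveness_magnitude of 10: a session holding 409600 caps is selected for recall whenever its cache_liveness counter falls below 409600 >> 10 = 400, i.e. roughly fewer than 400 recent cache hits within the decay window. A mostly idle client that simply keeps a large working set cached will therefore be picked on every upkeeper pass.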
At this point we know why the MDS keeps sending recalls to this client. But if the client is in fact able to release its cache, why does the warning still show up?
The upkeeper thread runs once per second by default. If the client has not responded within that second, the MDS keeps sending recalls to the same client on every pass,
so recall_caps keeps getting hit upward; once it crosses recall_warning_threshold, the warning is raised.
So one way to mitigate this is to increase mds_cache_trim_interval, lowering the frequency of the upkeeper thread and giving the client more time to drop its cache.
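A minimal sketch of that mitigation (the value 5 is purely illustrative; choose an interval that fits your workload, since a longer interval also delays normal cache trimming):

ceph config set mds mds_cache_trim_interval 5

Raising mds_recall_warning_threshold is another knob, but it only hides the warning rather than changing how aggressively caps are recalled.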