Question:

Upgrading from Cassandra 2.1.4 to 2.1.5

湛鸿雪

2023-03-14

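Hi all,

A few days ago, I upgraded our 6-node EC2 cluster from Cassandra 2.1.4 to 2.1.5.

Since then, CPU usage on all of my nodes has "exploded": they sit at 100% CPU most of the time, and their load average is between 100 and 300 (!!!).

This did not start immediately after the upgrade. It began a few hours later on one of the nodes, and slowly more and more nodes started showing the same behavior. It seems to correlate with the compaction of our largest column family, and once that compaction completes (about 24 hours after it starts), the node appears to return to normal. It has only been about 2 days, so I'm hoping it won't happen again, but I am still monitoring this.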

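Here are my questions:

Is this a bug or expected behavior?

If it is expected behavior:

What is the explanation for this issue?
Is it documented somewhere that I have missed?
Should I have done the upgrade differently? Maybe 1 or 2 nodes at a time, every 24 hours or so? What is the best practice?

If it is a bug:

Is it a known one?
Where should I report it? What data should I add?
Would downgrading back to 2.1.4 work?

Any feedback on this would be great.

Thanks

Amir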

Update:

This is the structure of the table in question.

CREATE TABLE tbl1 (
    key text PRIMARY KEY,
    created_at timestamp,
    customer_id bigint,
    device_id bigint,
    event text,
    fail_count bigint,
    generation bigint,
    gr_id text,
    imei text,
    raw_post text,
    "timestamp" timestamp
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

The logs don't reveal much (at least to me). Here is a snippet of the log:

INFO [WRITE-/10.0.1.142] 2015-05-23 05:43:42,577 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml
INFO [WRITE-/10.0.1.142] 2015-05-23 05:43:42,580 YamlConfigurationLoader.java:135 - Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_rpc_address=10.0.2.145; cas_contention_timeout_in_ms=1000; client_encryption_options=; cluster_name=Gryphonet21 cluster; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/data/cassandra/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_[...]; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/data/cassandra/data]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=16; partitioner=RandomPartitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=10000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=0.0.0.0; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/data/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=10.0.1.141,10.0.2.145,10.0.3.149}]}]; server_encryption_options=; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_[...]; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=2000]
INFO [HANDSHAKE-/10.0.1.142] 2015-05-23 05:43:42,591 OutboundTcpConnection.java:494 - Cannot handshake version with /10.0.1.142
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,713 MessagingService.java:887 - 135 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,713 StatusLogger.java:51 - Pool Name Active Pending Completed Blocked All Time Blocked
2015-05-23 05:43:42,714 StatusLogger.java:66 - CounterMutationStage 0 0 0 0 0
StatusLogger.java:66 - ReadStage 5 1 5702809 0
2015-05-23 05:43:42,715 StatusLogger.java:66 - RequestResponseStage 0 45 29528010 0 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - ReadRepairStage 0 0 997 0 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - MutationStage 0 31 43404309 0 0
StatusLogger.java:66 - GossipStage 0 0 569931 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,716 StatusLogger.java:66 - AntiEntropyStage 0 0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,716 StatusLogger.java:66 - CacheCleanupExecutor 0 0 0 0
StatusLogger.java:66 - MigrationStage 0 0 9 0 0
2015-05-23 05:43:42,829 StatusLogger.java:66 - ValidationExecutor 0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,830 StatusLogger.java:66 - Sampler 0 0 0 0
2015-05-23 05:43:42,830 StatusLogger.java:66 - MiscStage 0 0 0 0 0
2015-05-23 05:43:42,831 StatusLogger.java:66 - CommitLogArchiver 0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,831 StatusLogger.java:66 - MemtableFlushWriter 1 1 1756 0 0
StatusLogger.java:66 - PendingRangeCalculator 0 0 11 0
StatusLogger.java:66 - MemtableReclaimMemory
StatusLogger.java:66 - MemtablePostFlush 1 2 3819 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,832 StatusLogger.java:66 - CompactionExecutor 2 32 742 0 0
2015-05-23 05:43:42,833 StatusLogger.java:66 - InternalResponseStage 0 0 0 0 0 0
INFO [HANDSHAKE-/10.0.1.142] 2015-05-23 05:43:45,086 OutboundTcpConnection.java:485 - Handshaking version with /10.0.1.142
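For anyone reading a snippet like this, the StatusLogger lines are the most informative part: each one reports a thread pool's counters, and a large pending count marks a pool that is falling behind. Below is a minimal Python sketch that pulls out such pools. The column order (Active, Pending, Completed, ...) is taken from the header line of the log; the helper names and the threshold of 10 are arbitrary assumptions for illustration, not anything Cassandra defines.

```python
import re

# Matches the tail of a StatusLogger.java line: a pool name followed by counters.
# Column order follows the log's header line: Active, Pending, Completed, ...
LINE_RE = re.compile(
    r"StatusLogger\.java:\d+\s*-\s*(?P<pool>[A-Za-z]+)\s+(?P<nums>\d+(?:\s+\d+)*)\s*$"
)


def parse_pool_line(line):
    """Return (pool, active, pending, completed), or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    nums = [int(n) for n in match.group("nums").split()]
    if len(nums) < 3:
        return None
    return match.group("pool"), nums[0], nums[1], nums[2]


def backlogged_pools(lines, pending_threshold=10):
    """List (pool, pending) pairs whose pending count is at or above the threshold."""
    result = []
    for line in lines:
        parsed = parse_pool_line(line)
        if parsed is not None and parsed[2] >= pending_threshold:
            result.append((parsed[0], parsed[2]))
    return result


# Three lines copied from the log snippet in the question.
sample = [
    "INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - ReadRepairStage 0 0 997 0 0",
    "INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - MutationStage 0 31 43404309 0 0",
    "INFO [ScheduledTasks:1] 2015-05-23 05:43:42,832 StatusLogger.java:66 - CompactionExecutor 2 32 742 0 0",
]
print(backlogged_pools(sample))
```

On these three lines the sketch flags MutationStage and CompactionExecutor, which fits the dropped MUTATION messages reported just before the pool table.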

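Update:

The problem persists. I thought that after one compaction on each node the nodes would return to normal, but that is not the case. After a few hours, CPU jumps to 100% and the load average is in the 100-300 range.

I am downgrading back to 2.1.4.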
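To put load averages like these in context, it helps to divide them by the number of CPU cores on the node. The following is a rough Python sketch of that check. The function names and the cutoff factor of 2.0 are assumptions made up for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def is_overloaded(load_average, cores, factor=2.0):
    """True when the load average exceeds `factor` times the core count.

    The default factor is an assumed cutoff, not an established rule.
    """
    return load_per_core(load_average, cores) > factor


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    print(f"load per core: {load_per_core(one_minute, cores):.2f}")
    print("overloaded:", is_overloaded(one_minute, cores))
```

As an illustration, on a hypothetical 8-core instance (the question does not state the instance type), a load average of 200 works out to 25 per core.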

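Update:

I used phact's dumpThreads script to get stack traces. I also tried jvmtop, but it just seemed to hang.

The output is too large to paste here, but you can find it at http://downloads.gryphonet.com/cassandra/

Username: cassandra Password: cassandra

2 answers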

太叔小云

2023-03-14

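Answering my own question:

We were using a very specific Thrift API, describe_splits_ex, and that seems to be what causes the problem. It is obvious when you look at the stack traces of all the different threads while CPU usage is at 100%. For us it was easy to fix, because we used this API as an optimization rather than a necessity, so we simply stopped using it and the problem went away.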
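One way to make a pattern like this stand out in a large thread dump is to count how many threads have each frame on their stack. Here is a minimal Python sketch of that idea, assuming a jstack-style dump in which thread entries are separated by blank lines and frames start with "at ". The sample dump and its class names are made up for illustration and are not taken from the actual traces.

```python
from collections import Counter


def threads_per_frame(dump_text):
    """Count, for each stack frame, how many threads have it on their stack."""
    counts = Counter()
    # Thread entries in a jstack-style dump are separated by blank lines.
    for block in dump_text.split("\n\n"):
        frames = set()
        for line in block.splitlines():
            line = line.strip()
            if line.startswith("at "):
                # Keep "package.Class.method", drop the "(File.java:NN)" suffix.
                frames.add(line[3:].split("(", 1)[0])
        # A frame is counted once per thread, even if it recurs in that stack.
        counts.update(frames)
    return counts


# Made-up dump for illustration; the class names are hypothetical.
sample_dump = """\
"Thrift:1" daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at com.example.Splits.compute(Splits.java:10)
    at com.example.ThriftServer.describe_splits_ex(ThriftServer.java:20)

"Thrift:2" daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at com.example.Splits.compute(Splits.java:10)
    at com.example.ThriftServer.describe_splits_ex(ThriftServer.java:20)

"GossipStage:1" daemon prio=5 waiting
   java.lang.Thread.State: WAITING
    at com.example.Queue.take(Queue.java:5)
"""

for frame, thread_count in threads_per_frame(sample_dump).most_common(3):
    print(thread_count, frame)
```

If a single API call dominates the busy threads, its frame will sit at or near the top of this list.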

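However, this API is also used by the Cassandra Hadoop connector (at least it was in earlier versions), so if you use the connector, I would test it before upgrading to 2.1.5.

I don't know what change in 2.1.5 caused the problem, but I do know that it did not happen in 2.1.4 and that it happened consistently in 2.1.5.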

李烨

2023-03-14

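Try using jvmtop to see what the Cassandra process is doing. It has two modes: one shows the currently running threads, and the other shows the CPU distribution per class and method (--profile). Paste both outputs here.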
