How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

严宏旷
2023-12-01

Yao Yue has worked on Twitter’s Cache team since 2010. She recently gave a really great talk:Scaling Redis at Twitter. It’s about Redis of course, but it's not just about Redis.

Yao has worked at Twitter for a few years. She's seen some things. She’s watched the growth of the cache service at Twitter explode from it being used by just one project to nearly a hundred projects using it. That's many thousands of machines, many clusters, and many terabytes of RAM.

It's clear from her talk that's she's coming from a place of real personal experience and that shines through in the practical way she explores issues. It's a talk well worth watching.

As you might expect, Twitter has a lot of cache.

Timeline Service for one datacenter using Hybrid List:

  • ~40TB allocated heap
  • ~30MM qps
  • > 6,000 instances


Use of BTree in one datacenter:

  • ~65TB allocated heap
  • ~9MM qps
  • >4,000 instances

You'll learn more about BTree and Hybrid List later in the post.

A couple of points stood out:

  • Redis is a brilliant idea because it takes underutilized resources on servers and turns them into valuable service.
  • Twitter specialized Redis with two new data types that fit their use cases perfectly. So they got the performance they needed, but it locked them into an older code based and made it hard to merge in new features. I have to wonder, why use Redis for this sort of thing? Just create a timeline service using your own datastructures. Does Redis really add anything to the party?
  • Summarize large chunks of log data on the node, using your local CPU power, before saturating the network.
  • If you want something that’s high performance separate the fast path, which is the data path, away from the slow path, which is the command and control path. 
  • Twitter is moving towards a container environment with Mesos as the job scheduler. This is still a new approach so it's interesting to hear about how it works. One issue is the Mesos wastage problem that stems from requirement to specify hard resource usage limits in a complicated runtime world.
  • A central cluster manager is really important to keep a cluster in a state that’s easy to understand.
  • The JVM is slow and C is fast. Their cache proxy layer is moving back to C/C++.

With that in mind, let's learn more about how Redis is used at Twitter:

Why Redis?

  • Redis drives Timeline, Twitter’s most important service. Timeline is an index of tweets indexed by an id. Chaining tweets together in a list produces the Home Timeline. The User Timeline, which consists of tweets the user has tweeted, is just another list. 
  • Why consider Redis instead of Memcache? The Network Bandwidth Problem and The Long Common Prefix Problem.
  • The Network Bandwidth Problem.
    • Memcache didn’t work as well as Redis for the timeline. The problem was dealing with fanout. 
    • Twitter read and writes happen incrementally and they are fairly small, but the timelines themselves are fairly large.
    • When a tweet is generated it needs to be written to all relevant timelines. The tweet is a small piece of data that is attached to some data structure. On read it’s desirable to load a small batch of tweets. On a scroll down another batch is loaded.
    • The hometime line can be largish, what is reasonable for a viewer to read in one set. Maybe 3000 entries, for example. Which means for performance reasons accessing the databases should be avoided. 
    • A read-modify-write cycle  for incremental writes, and small reads, on large objects (the timeline), is too expensive and creates a network bottleneck.
    • On a gigalink at 100K+ reads and writes per second, if the average object size is more than 1K, the network becomes the bottleneck. 
  • The Long Common Prefix Problem (really two problems)
    • A flexible schema approach is used for data formats. An object has certain attributes that may or may not exist. A separate key can be created for each individual attribute. This requires sending out a separate request for each individual attribute and not all attributes may be in the cache. 
    • Metrics that are observed over time have the same name with each sample having a different time stamp. If storing each metric individually the long common prefix is being stored many many times. 
    • To be more space efficient in both scenarios, for metrics and a flexible schema, it is desirable to have a hierarchical key space.
  • A dedicated caching cluster under utilizes CPUs. For simple cases, in-memory key-value stores are CPU light. 1% of CPU time  on a box can handle more than 1K requests per second for small key values. Though for different data structures the result can be different. 
  • Redis is a brilliant idea. It sees what the server can do, but is not doing. For simple key-value stores, there’s a lot of CPU headroom on the server side for a service like Redis.
  • Redis was first used within Twitter in 2010 for the Timeline service. It is also used in the Ads service.
  • The on disk features of Redis are not used. Partly this is because inside Twitter the Cache and Storage services are in different teams so they use whatever mechanisms they think best. Partly this may be because the Storage team thinks another service fits their goals better than Redis.
  • Twitter forked Redis 2.4 and added some features to it, so they are stuck at 2.4 (2.8.14 is the latest stable version). Changes were: two data structure features within Redis; in-house cluster management features; in-house logging and data insight.
  • Hotkeys are a problem so they are a building a tiered caching solution with client side caching that will automatically cache hotkeys.

Hybrid List

  • Added Hybrid List to Redis formore predictable memory performance.
  • Timeline is a list of Tweet IDs, so it’s a list of integers. Each ID is small. 
  • Redis supports two list types: ziplist and linklist. Ziplist is space efficient. Linked list is flexible, but as a doubly linked list has the overhead of two pointers per key, which given the size of the ID is very high overhead. 
  • To use memory efficiently ziplists are used exclusively.
  • A Redis ziplist threshold is set to the max size of a Timeline. Never store a bigger Timeline than can be stored in a ziplist. This means a product decision, how many tweets can be in a Timeline, are linked to a low level component (Redis). Generally not desirable.
  • Adding to and deleting from a ziplist is inefficient, especially with a very large list. Deleting from a ziplist uses memmove to move data around, to make sure the list is still contiguous. Adding to a ziplist requires a memory realloc call to make enough space for the new entry. 
  • Potential high latency for write operations due to Timeline size. Timelines vary a lot in size. Most users don’t tweet very much, so their User Timeline is small. Home Timelines, especially those involving celebreties can be huge. When updating a large timeline and the cache runs out of heap, which is often the case when using a cache, a very large number of verysmall timelines will be evicted before there’s enough contiguous RAM to handle one big ziplist. As all this cache management takes time, a write operation can have a high latency.
  • Since writes are fanned out to a lot of timelines there’s a higher chance to be caught in awrite latency trap as memory is used for expanding the timelines.
  • It’s hard to create a SLA for write operations given the high variability of write latencies.
  • Hybrid List is a linked list of ziplists. A threshold is set of how big each ziplist can be in bytes. In bytes because tomemory efficient it helps to allocate and deallocate blocks of the same size. When a list goes over it is spilled into the next ziplist. A ziplist is not recycled until the list is empty, which means it is possible, through deletion, to have each ziplist have only one entry. In practice, tweets aren’t deleted all that often.
  • Before Hybrid List a workaround was to expire larger timelines more quickly, which freed up memory for other timelines, but was expensive when a user went to view their timeline.

Btree

  • Added BTree to Redis to support range queries on hierarchical keys to return a list of results.
  • In Redis the way to deal with secondary keys or fields is a hash map. To have sorted data in order to perform a range query a sorted set is used. Sorted set orders by a score which is a double, so an arbitrary secondary key or an arbitrary name can’t be used for the sorting. Since hash map uses a linear search it’s not great if there are a lot of secondary keys or fields.
  • BTree is the attempt fix the shortcomings of hash map and sorted set. It’s better to just haveone data structure that does what you want. It’s easier to understand and reason about.
  • Borrowed the BSD implementation of BTree and added it to Redis to create a BTree. Supports key lookup as well as range query. Has good lookup performance. The code is relatively simple. The downside isBTree is not memory efficient. It has a lot of meta data overhead due to the pointers.

Cluster Management

  • A cluster is using more than one instance of Redis for a single purpose. If a data set is larger than a single Redis instance can handle or throughput is higher than what a single instance can handle, the key space will need to be partitioned so the data can be stored in more than one shard, across a set of instances. Routing is taking a key and figuring out which shard the data for the key is on.
  • Thinks cluster management is the number one reason Redis adoption hasn’t exploded. When a cluster is available there’s no reason not tomigrate all cache use cased to Redis.
  • Tricky to get Redis cluster right. People use Redis because as a data structure server the idea is to perform frequent updates. But a lot of Redis operations are not idempotent. If there’s a network glitch a retry is required and the data can be corrupted. 
  • Redis cluster favors having a centralized manager dictating the global view. With memcache a lot clusters use a client side approach based on consistent hashing. If there’s inconsistent data, so be it. To provide really good services, a cluster needs features like detecting which shard is down and then replaying operations to get back in sync. After a long enough period spent down cache state should be cleaned up. Corrupted data in Redis is hard to detect. When there’s a list and it’s missing a chunk, it’s hard to tell.
  • Twitter has multiple attempts at building a Redis cluster.Twemproxy which is not used by Twitter internally, it was built forTwemcache and Redis support was added. Two more solutions were based on proxy style routing. One was associated with the Timeline service and not meant to be general. The second was a generalization of the Timeline solution that provided cluster management, replication, and shard repairing. 
  • Three options in a cluster: servers talk to each other to reach agreement of what a cluster looks like; use a proxy; or do client side cluster management where the clients form a quorum.
  • Didn’t go with a server approach because the philosophy is tokeep servers simple, dumb and fast.
  • Didn’t go with the client because changes are hard to propagate. Approximately 100 projects in Twitter use a cache cluster. Changing anything in the client would have to be pushed to 100 clients it could take years for changes to propagate. Quick iteration means it’s almost impossible to put code in the client.
  • Went with a proxy style routing approach and partitioning for two reasons. A cache service is a high performance service. If you want something that’shigh performance separate the fast path, which is the data path, away from the slow path, which is the command and control path. If cluster management is merged into the server it complicates the code for Redis, which is a stateful service, any time you want to fix a bug or provide an upgrade to the cluster management code, the stateful Redis service must be restarted too, which will potentially throw away a bunch of data. A rolling restart of a cluster is painful.
  • There was a concern using the proxy approach that another network hop is inserted between the client and the server.Profiling showed the extra hop is a myth. At least in their ecosystem. Latency to through the Redis server was less than .5 milliseconds. At Twitter most of the backend services are Java based and use Finagle to talk to each other. When going through the Finagle path the latency was close to 10 milliseconds. So the extra hop isn’t the problem.Inside the JVM is the problem. Outside the JVM you can do pretty much whatever you want, unless of course you go through another JVM.
  • Failure of a proxy doesn’t matter much. On the data path introducing a proxy layer isn’t so bad. The client doesn’t care which proxy they talk to. If a proxy fails after a timeout the client goes to another proxy. No sharding is happening at the proxy level, they are all stateless. To scale throughput simply add more proxies. The tradeoff is additional cost. The proxy layer is allocated resources just to do the forwarding. Cluster management, sharding, and doing the view of the cluster happens outside the proxies. The proxies don’t have to agree with each other. 
  • Twitter has instances that have 100K open connections and it works fine. There’s just overhead to pay. There’s no reason to close connections. Just keep them open, it improves latency.
  • Cache clusters are used as a look-aside cache. The caches themselves are not responsible for data replenishment. The client is responsible for fetching a missing key from storage then caching it. If a node goes down the shard is moved to another node. The failed machine is flushed when it comes back so no data is left around. All this is done by thecluster leader. A central viewpoint is really important to keep a cluster in a state that’s easy to understand.
  • Did an experiment with a proxy written in C++. TheC++ proxy saw a significant performance increase (no number given). The proxy tier is being moved back to C and C++.

Data Insight

  • When there’s a call saying the cache system is misbehavingmost of the time the cache is fine. Usually the clients are configured wrong. Or they are abusing the cache system by requesting way too many keys. Or requesting the same key over and over again and saturating the server or the link.
  • When you tell someone they are abusing your system they want proof. Which key? Which shard is bad? What kind of traffic leads to this behaviour? Proof requires metrics and analysis that can be shown to customers.
  • An SOA architecture doesn’t give you problem isolation or make debugging easier automatically. You have to havegood visibility into every component that makes up the system.
  • Decided to build Insight into caching. The cache is written in C and is fast, so it can provide data that other components can’t. Other compents can't handle the load of providing data for every request.
  • Logging every single command is possible. The cache can log everything at 100K qps. Only meta data is logged, values are not logged (Good joke about the NSA).
  • Avoid locking and blocking. Especially don’t block on disk writes.
  • At 100 qps and a 100 bytes per log message, each box will log 10MB of data per second. That’s a lot of data to move off the box. 10% of network bandwidth would be used just in case something went bad.Economically not feasible.
  • Precompute logs on the box to reduce costs. Assumption is that it is already knows what will be computed. A process reads the logs and generates a summary and periodically sends this view of the box. The view is tiny compared to the original data.
  • View data is aggregated by Storm, stored, and there’s a visualization system sitting on top. You can get data like here are your top 20 keys; here’s your traffic by second and there’s a peak which means the traffic pattern is spiky; here’s are the number of unique keys, which helps with capacity planning. A lot can be done when every single log is captured. 
  • Insight is very valuable for operations. If there are packet drops often that can be linked to either a hot key or spiky traffic behaviour.

Wish List For Redis

  • Explicit memory management. 
  • Deployable (Lua) Scripts. Talked about near the start.
  • Multi-threading. Would make cluster management easier. Twitter has a lot of “tall boxes,” where a host has 100+ GB of memory and a lot of CPUs. To use the full capabilities of a server a lot of Redis instances need to be started on a physical machine. With multi-threading fewer instances would need to be started which is much easier to manage.

Lessons Learned

  • Scale demands predictability. The larger the cluster, the more customers, the more predictable and deterministic you want your service to be. When there’s one customer and there’s a problem you can dig into a problem and it’s intriguing. When you have 70 customers you can’t keep up. 
  • Tail latencies matter. When you do fanouts to a lot of shards, when one is slow your entire query will be slow. 
  • Deterministic configuration is operationally important. Twitter is moving towards a container environment. Mesos is used as the job scheduler. The scheduler fulfills the request for the amount of CPU, memory etc. A monitor kills any job that goes over its resource requirement. Redis causes a problem in a container environment. Redis introduces external fragmentation, meaning you use more memory to store the same amount of data. If you don’t want to be killed you have to compensate for that with oversupply. You have to think my memory fragmentation ratio won’t go over 5%, but I’ll allocate 10% more as a buffer space. Maybe even 20%. Or I think I’ll get 5000 connections per host, but just in case let me allocate memory for 10,000 connections. The result is a huge potential for waste. Super low latency services don’t play well with Mesos today, so these jobs are isolated from other jobs.
  • Knowing your resource usage at runtime is really helpful. In a large cluster bad stuff happens. You think you are safe but things happen and behaviour is unexpected. Most services today can’t degrade gracefully. For example, when a limit of 10GB of RAM is reached then requests are rejected until there’s free RAM. This only fails a small percentage of traffic that’s proportional to the resource that they require. That's graceful. Garbage collection problems are not graceful, traffic just gets dropped on the floor, this problem affects a lot of teams in a lot of companies every day. 
  • Push computation to the data. If you look at relative network speeds, CPU speeds, and disk speeds, it makes sense to do computation before going to disk and do computation before going to the network. An example is summarizing logs on a node before they are pushed to a centralized monitoring service. LUA in Redis another way to apply computation close to the data.
  • LUA is not production ready in Redis today. On demand scripting means service providers can’t guarantee their SLA. A loaded script can do anything. What service provider would want to take the risk of blowing their SLA because of someone elses code? A deployment model would be better. It would allow for code review and benchmarking, so resource usage and performance could be properly calculated. 
  • Redis as the next high performance stream processing platform. It has pub-sub and scripting. Why not?

Related Articles



From: http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html?utm_source=tuicool


=============================================================================


以下为译文:

自2010年,Yao Yu已经效力于Twitter的缓存团队。而本文主要基于她近日发表的“Scaling Redis at Twitter”演讲,主要谈Twitter的Redis扩展,同时也不局限于Redis。从演讲中不难发现,Twitter在缓存服务打造上积累了相当丰富的经验,就如你所想,Twitter使用了大量的缓存。

Timeline服务(一个数据中心)Hybrid List使用情况:

  • 分配40TB左右的内存堆栈
  • 3000万QPS(query per second)
  • 超过6000个实例

BTtree(一个数据中心)使用状态:

  • 分配65TB的内存堆栈
  • 900万QPS
  • 超过4000个实例

下文将会带你详细的学习BTree和Hybrid。

几个值得关注的点:

  • Redis的表现非常不错,因为它将大量服务器中未使用的资源整合成有价值的服务。
  • Twitter通过两个专为其业务设计的新数据模型来定制化Redis,因此他们获得了理想的性能,但是也因此受限于旧的代码,无法迅速添加新特性。因此我(Todd)一直在想,为什么他们会使用Redis来做这样的事情。只是想基于自己数据结构建立一个Timeline服务?Redis真的适合干这样的事情?
  • 在网络饱和之前,分析大量节点上的日志数据,使用本地的CPU。
  • 如果想获得非常高的性能,那么必须做快慢分离,数据永远都是快的代言,而命令和控制则代表了慢。
  • 当下,Twitter正在使用Mesos作为作业调度程序以迁移到一个容器环境,这个做法很新颖,因此如何实现是一大看点。当然这个途径也存在弊端,比如在复杂的运行时环境指定硬件资源的使用限制。
  • 中央集群管理器来监控集群。
  • 缓慢的JVM,快速的C,因此他们使用C/C++重写缓存代理。

重点关注以上技术点,下面一起来看Twitter的Redis使用之道:

为什么使用Redis?

  • 使用Redis驱动Timeline,也是Twitter系统内最重要的服务。Timeline是Tweet基于id的一个索引,Home Timeline则是所有Tweet连接起来形成的一个列表。User Timeline则由用户生成的Tweet组成,它们形成了另外一个列表。
  • 为什么使用Redis代替Memcache?主要因为网络带宽问题(Network Bandwidth Problem)和长通用前缀问题(Long Common Prefix Problem)。
  • 网络带宽问题
  1. Memcache在Timeline上的表现并没有Redis好,最大问题发生在fanout(推送)上。
  2. Twitter的读和写往往以增量方式进行,虽然每次的更新很少,但是Timeline本身的体积很大。
  3. 当一个Tweet产生时,它会被写入对应的Timeline中。对于某些数据结构来说,Tweet是组成它的一小部分数据。在读的时候,小批量Tweet被加载,而在滚轮向下滚动时,则会加载另外的一小批量。
  4. Home Timeline可能会非常大,但是用户仍然希望在同一个集合中读取,它可能会包含3000个实体。因此,在性能的优化上,避免数据库读取将非常重要。
  5. 增量读写使用了一个“读-改-写”的过程,因此对Timeline这样的大型对象进行small reads开销是非常昂贵的,通常会造成网络瓶颈。
  6. 在每秒10万+读和写的gigalink上,如果对象的平均大小超过1K,网络将成为瓶颈。
  • 长通用前缀问题(其实是两个问题)
  1. 在数据格式上使用了一个灵活的模式,每个对象都有不同的属性组成。对每个属性都建独立的键,这需求给每个属性单独发送请求,而不是对所有的属性都进行缓存。
  2. 使用不同的时间戳跟踪度量。如果独立缓存每个度量样本,那么长通用前缀会被不停的重复缓存。
  3. 平衡度量和灵活的数据模式,使用分层的key space更令人满意。
  • 通过CPU使用情况配置专门的集群。举个例子,内存键值存储对CPU的利用很低。服务器上1%的CPU时间也许能支撑1K+的RPS,基于不同数据模式会得出不同的结果。
  • Reids的表现非常不错。它能看到服务器可以做什么,而不是正在做什么。对于简单的键值存储,像Redis这样的服务在服务器上可能会存在很多的CPU动态余量。
  • Twitter首次为Timeline服务配置Redis是在2010年,同时搭载Redis的还有Ads服务。
  • Twitter并没有使用Redis的磁盘特性。这很大程度因为在Twitter的系统中,缓存和存储都在不同的团队完成,他们会根据自己的使用来定制。也就是,对比Redis,存储团队有更好的服务。
  • 热键是一个必须解决的问题,因此Twitter建立一个分层式缓存,客户缓存会自动的缓存热键。

Hybrid List

  • 为Redis添加Hybrid List以获得更可预期的内存性能。
  • Timeline是Tweet ID组成的列表,因此是一堆的整型,每个ID都很小。
  • Redis支持两个列表类型:ziplist和linklist。Ziplist更具备空间效率,linklist则更加灵活,在双向链表下每个建会占用两个指针,对比ID的体积来说,这个开销非常大。
  • 如果从内存的使用效率上看,ziplists是唯一之选。
  • Ziplist的最大阈值被设置为Timeline的大小。因此在调整Ziplist的阈值之前,永远都不会储存比它更大的Timeline。这同样意味着一个产品决策,在Timeline中储存多少个Tweet。
  • 对ziplist做添加和删除操作效率是非常低的,特别在列表非常大的情况下。从ziplist中使用memmove将数据删除,这里必须保证列表仍然是连续的。向ziplist中添加则需要一个内存realloc调用,以保证新实体有足够的空间。
  • 鉴于Timeline的体积,写入操作存在潜在的高延时。Timeline在体积上的变化很大,大部分的用户不会频繁发送Tweet,因此他们的User Timeline都很小。Home Timeline,特别涉及到那些名流人物则可能很大。当修改一个巨大的Timeline,而内存又被缓存占满,通常会进行这个过程:大量小体积timeline将会被驱逐,直到有足够的连续内存来储存这个巨大的ziplist。这些内存管理操作将花费很多的时间,因此写入操作可能存在非常高的潜在延时。
  • 因为写入会造成很多Timeline的修改操作,基于需要在内存中扩展Timeline,这里有很大的可能会产生写延时陷阱。
  • 基于写延时带来的高可变性,很难为写入建立SLA。
  • Hybrid List是ziplists组成的linklist,因此存在一个基于所有ziplist最大体积的阈值(以字节为单位)。以字节为单位最大程度上是为了内存效率,它可以帮助分配和解除分配相同体积的内存。当一个列表结束后,空间就被分配给下一个Ziplist。Ziplist并不可回收,直到这个列表为空。这就意味,通过删除,你可以让每个ziplist只包含一个实体。通常情况下,Tweet并不会被全部删除。
  • 在Hybrid List之前,解决方案是尽快的让一个大的Timeline过期,这会给其他Timeline节约出内存,但是如果用户再查看这个Timeline时,开销是非常大的。

BTree

  • 将BTree添加到Redis是为了支持分层键上的范围查询,从而得到一个结果列表。
  • 在Redis中,增加次关键字或字段一般通过Hash Map处理,排序数据以执行一个范围查询时,sorted set被使用。因为sorted set只能使用一个double类型的score来排序,所以这里不能任意指定次级键或者名称来排序。同时,鉴于Hash Map使用的线性搜索,因此它并不适用于存在太多次关键字或字段的情况。
  • 通过BSD实现BTree,并将之添加到Redis中。支持键查询和范围查询,同时还具备了良好的查询性能及简单的代码。唯一的缺点是BTree并不具备一个良好的内存利用效率,因为指针将造成大量的元数据开销。

集群管理

  • 为每个目的建立单独的集群,每个集群部署了1个以上的Redis实例。如果一个数据集的大小大于单Redis实例可以支撑的极限,或者单Redis实例并不能提供足够的吞吐量,key space需要被分割,数据则会横跨一组实例在多个分片上保存,路由器将会为key选择应该保存的数据分片。
  • 集群管理是Redis不会被撑爆的首要原因。既然可以使用集群,那么没理由不将所有的用例转移到Redis上。
  • Redis集群运营起来并不轻松。基于频繁的更新操作,人们使用Redis,但是许多Redis运营并不是幂等的。比如,网络故障可能会引起重试,从而在一定程度上破坏数据的完整性。
  • Redis集群支持集中管理者操控全局。使用memcache,许多集群使用了一个基于一致性哈希的客户端解决方案。如果出现非一致性数据,只能顺其自然。为了提供更好的服务,集群需要一个功能负责检测哪个分片发生问题,并且重新进行操作来保证同步。在足够长的时间后,缓存应该被清理。在Redis中破坏数据是不容易被发现的,当有一个列表,它缺失了一个部分,这个问题很难被描述清楚。
  • 在建立Redis集群上,Twitter做过了很多尝试。Twemproxy,最初并不是在Twitter内部使用,它被建立用于服务Twemcache,随后添加了对Redis的支持。同时,还针对代理类型路由建立了两个额外的解决方案,其中一个与Timeline服务有关,但不是通用的;另一个则是专为Timeline设计的通用解决方案,提供了集群管理、复制和分片修复。
  • 集群中存在3个选择,服务器相互间通信以达成一个协议:集群状态;使用一个代理;或者是当客户端数量达到阈值时做一个客户端方面的集群管理。
  • 没有做服务器方面的优化,因为一直以保持服务器简单、透明和快速为理念。
  • 并没有通过客户端,因为改变不容易被推广。在Twitter,1个缓存集群大约为100个项目使用。如果在一个客户端中做改变必须推进到100个客户端,这花费的时间可能以年计算。快速迭代意味着客户端不能放任何代码。
  • 使用一个代理模式路由途径以及分片主要基于两个原因。首先,缓存服务必须是个高性能服务。如果你想获得高性能,就必须分离快快慢路径,快对应着数据路径,而慢则对应了命令行和控制路径。其次,如果将集群管理混合到服务器中,将增加Redis编码的复杂度,从而造成一个有状态的服务,如果你想给管理代码打补丁或者升级,有状态的服务你必须要重启,从而可能会导致丢失一部分数据,滚动重启集群是非常痛苦的。
  • 使用代理途径的另一个原因是在客户端和服务器之间插入一个新的网络跃点。关于在添加额外的网络跃点上,Profiling 揭露了业内普遍存在的一个谣言。最起码在Twitter的系统中,Redis服务器产生的延时不会超过0.5毫秒。在Twiiter,大部分的后端系统都基于Java,并使用Finagle做相互间的交互。在通过Finagle后,延时增加到10毫秒。因此附加的网络跃点并不是问题,问题的本身在于JVM。除了JVM之外,你可以做任何的事情,当然除了添加又一个JVM。
  • 代理的失败并不会增加困扰。在数据路径上,引入一个代理层并不意味着带来额外开销。客户端不用去关心需要连接的代理。如果因为代理故障而造成的超时,客户端只需要随便选择一个其他的代理。在代理等级不会发生任何分片,它们同样是无状态的。为了扩展吞吐量可以添加更多的代理,代价是额外的成本。代理层只会因为转发被分配资源。集群管理、分片以及集群状态监视都不是代理负责的范围。代理之间并不需要保持一致。
  • Twitter中存在实例同时打开10万个连接的情况,服务器也并不会因此发生故障。在Twitter,服务器没理由去关闭一个连接,让它一直打开有助于改善延时。
  • 缓存集群被用作后备缓存(look-aside cache)。缓存本身不会负责数据的补充,客户端负责取得一个丢失的键并进行缓存。如果一个节点发生故障,分片会被转移到另一个节点。对故障恢复后的服务器进行同步以保证数据的一致性,所有这些都通过集群管理者完成。在让一个集群更容易理解上,中央观点确实非常重要。
  • 测试使用C++来编写代理。C++代理带来了一个显著的性能提升,随后代理层都使用了C和C++。

数据洞察

  • 当有调用显示缓存系统失效,而大多数缓存都是正常时,这通常因为客户端的配置错误,或者它们请求的键过多从而导致滥用缓存,当然也可能因为对同一个键多次请求造成服务器或链接饱和。
  • 如果通知某个人正在滥用系统,他们很可能会要求出示证据。哪个键?哪个分片的问题?什么样的流量导致了这个问题?因此你需要做足够的度量和分析,从而将证据展示给客户。
  • 面向服务的架构并不会给带来问题隔离或者是自动debug,必须对组成系统的每个组件保持足够的能见度。
  • 决定在缓存上获得洞察力。缓存使用C编写所以足够快速,因此它可以能其他组件所不能,提供足够的数据,而其他服务不能为每个请求都提供数据。
  • 可以实现为每条命令单独建立日志。在10万QPS时,缓存可以记录下所有发生的事情。
  • 避免锁和阻塞,特别不能因为磁盘写入而造成阻塞。
  • 在每秒100请求和每条日志消息100字节的情况下,每台服务器每秒会记录10MB的数据。当问题发生时,这些数据传输将造成很大的网络开销,大约占10%的带宽,这种开销完全不允许。
  • 预计算服务器上的日志以减少开销。设想是已经清楚需要被计算的内容,一个进程负责读取日志,计算并生成一个摘要信息,然后只定期发送主机的摘要信息。对比原始数据,这个体积将微不足道。
  • 这些云计算数据通过Storm聚合和存储,并在上面建立可视化系统,你可以基于这些建立容量规划。因为每条日志都可以被捕获,所以你可以做许多事情。
  • 对于运营来说,洞察力非常重要。如果出现丢包现象,通常情况下是热键或者是流量峰值导致。

对Redis的希望清单

  • 显式的内存管理。
  • Deployable(Lua)Scripts。
  • 多线程,可以简化集群管理。Twitter有很多高性能服务器,每个主机都拥有100GB以上的内存以及大量的CPU。为了使用一个服务器的所有能力,需要在实体主机商开启许多Redis实例。通过多线程减少需要启动实例数量,从而更容易的进行管理。

学到的知识

  • 可预见的规模需求。集群越大、客户越多,你越希望你的服务更可预知和确定。如果只有一个用户和一个问题,你可以迅速的找到并解决这个问题。但是如果你有70个客户,你完全忙不过来。
  • 如果你需要在许多分片上推广某个更新,一个分片的速度问题就可能导致全局问题。
  • Twitter正在向container环境发展,使用了Mesos作为作业调度器,调度器用来给请求分配CPU、内存等资源数量。当作业占用的资源高于请求时,监视器会直接将它终止。在容器的环境下,Redis会产生一个问题。Redis引入了外部存储碎片,这意味着你要使用更多内存来存储同样的数据。如果不想作业被终止,必须设计一个缓冲的区间。你可能会认为内存碎片率设定在5%就足矣,但是我更愿意多分配10%,甚至是20%的空间作为缓冲。或者在我认为每台主机连接数可能会达到5000时,我将给系统分配支撑1万个连接数的内存,结果会造成很大的浪费。对于当下大多数低延时服务来说,Mesos都不太适合,因此这些作业会与其他作业隔离。
  • 在运行时清楚资源使用是非常重要的在一个大型集群中,总会有问题产生。虽然你一直认为系统很安全,但是事情还是以意料之外的方式发生了。当下很少出现某台机器完全崩溃的情况,比如,在达到10GB的内存上限后,在有空闲内存之前,请求都会被拒绝,造成的后果仅是一小部分请求不能获得自己所需的内存资源,无伤大雅。而垃圾回收问题则是非常麻烦的,它可能降低系统的流量,当下这个问题已经困扰了很多机构。
  • 用数据说话。在计算到磁盘和计算到网络之前,查看相对网络速度、CPU速度计磁盘速度是非常有意义的,比如,节点被推送到中央监视服务之前查看的日志综述。除此之外,Redis中的LUA也是给数据提供计算的途径。
  • LUA当下还没有在Redis生产环境中实现响应式脚本意味着服务提供商不能保证他们的SLA,一个被加载的脚本可以做任何事情,因此没有服务提供商会因为添加一些代码铤而走险去破坏SLA。但是对于部署模型来说,意义重大,它将允许代码评审和基准测试,以清晰的计算资源使用和性能。
  • Redis作为下一个高性能流处理平台。当下已经拥有pub-sub和scripting,没什么不可以的。

From: http://www.csdn.net/article/2014-09-10/2821615-how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins

 类似资料:

相关阅读

相关文章

相关问答