External caches are great for reducing latency, but they often cause more problems than they solve. Here's how to fix that.
Translated from Why and How Teams Are Replacing External Database Caches by Felipe Cardeneti Mendes.
Teams often consider external caching when an existing database cannot meet the required service level agreement (SLA). This is a decidedly performance-oriented decision. Placing an external cache in front of the database is often done to compensate for sub-optimal latency caused by various factors (e.g. inefficient database internals, driver usage, infrastructure choices, traffic spikes, etc.).
Caching appears to be a quick and easy solution because it can be deployed without huge hassle and without incurring the significant costs of database scaling, database schema redesign, or even deeper technology changes. However, external caching is not as simple as it is often made out to be: it can be one of the more problematic components of a distributed application architecture.
In some cases, this is a necessary evil, such as when you need frequent access to transformed data resulting from lengthy and expensive computations, and you've tried every other way to reduce latency. But in many cases, the performance gain simply isn't worth it. You solve one problem, but create others.
Here are the often-overlooked risks associated with external caching, along with how three teams achieved performance gains and cost savings by replacing their core database plus external cache with a single solution. Spoiler: they used ScyllaDB, a high-performance database that achieves improved long-tail latency by leveraging a specialized internal cache.
Why not cache?
At ScyllaDB, we work with countless teams that grapple with the cost, hassle, and limitations of traditional database performance improvement attempts. Here are the main difficulties we see teams encounter when putting external caching in front of their databases.
External caching increases latency
A separate cache means one more hop along the way. When a cache sits in front of the database, the first access occurs at the cache layer. If the data is not in the cache, the request is sent to the database. This adds latency to the already slow path for uncached data. One might claim that when the entire data set fits in the cache, the additional latency never comes into play. However, unless your data set is fairly small, storing it all in memory increases costs significantly, making it prohibitively expensive for most organizations.
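To make the extra hop concrete, here is a minimal sketch of the read path with a cache in front of the database; the `Cache` and `Database` classes are in-memory stand-ins for a real cache client and database driver, hypothetical and for illustration only:

```python
# Each method call stands in for a network round trip in a real deployment.

class Cache:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)      # in reality: one network round trip

    def set(self, key, value):
        self._data[key] = value         # in reality: another network call


class Database:
    def __init__(self, rows):
        self._rows = rows

    def query(self, key):
        return self._rows.get(key)      # in reality: the slowest hop


def read(key, cache, db):
    value = cache.get(key)              # hop 1: always paid, hit or miss
    if value is not None:
        return value                    # hit: one round trip total
    value = db.query(key)               # hop 2: a miss pays both round trips
    cache.set(key, value)               # hop 3: populate the cache
    return value


db = Database({"user:1": {"name": "Ada"}})
cache = Cache()
read("user:1", cache, db)   # cold read: cache hop + database hop + cache write
read("user:1", cache, db)   # warm read: cache hop only
```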
External caching is an additional cost
Caching means expensive DRAM, whose cost per gigabyte is far higher than that of solid-state disks. (For more details, see the P99 CONF talk by Grafana's Danny Kopping.) Rather than provisioning completely separate infrastructure for caching, it is often better to use the existing database memory, or even increase it, for internal caching. When sized correctly, a modern database's cache can be as efficient as a traditional in-memory caching solution. When the working set is too large to fit in memory, databases often do a good job of optimizing I/O access to flash storage, making a standalone database (with no external cache) the preferred, cheaper option.
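To see why, compare the two media with some illustrative arithmetic; the prices below are assumptions chosen only to show the order-of-magnitude gap, not real quotes:

```python
# Illustrative cost comparison. All prices are assumptions for the sake of
# the arithmetic; the point is the gap between DRAM and flash, not the
# absolute numbers.

working_set_gb = 2_000
dram_cost_per_gb = 3.00     # assumed $/GB for server DRAM
flash_cost_per_gb = 0.10    # assumed $/GB for NVMe SSD

print(f"Holding {working_set_gb} GB in an in-memory cache: "
      f"~${working_set_gb * dram_cost_per_gb:,.0f}")
print(f"Holding the same {working_set_gb} GB on flash: "
      f"~${working_set_gb * flash_cost_per_gb:,.0f}")
```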
External caching reduces availability
No caching high-availability solution can rival that of the database itself. Modern distributed databases maintain multiple replicas; they are also topology-aware and speed-aware and can withstand multiple failures without losing data.
For example, a common replication pattern is three local replicas, which often allows reads to be balanced across those replicas to make efficient use of the database's internal caching mechanism. Consider a nine-node cluster with a replication factor of three: each node holds roughly one-third of your total dataset. Because requests are balanced across different replicas, this gives you more room to cache data, which can eliminate the need for an external cache. Conversely, if an external cache happens to invalidate an entry just before a flood of cold requests, availability may suffer for a while, because the database does not have that data in its internal cache (more on this below).
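As a rough sketch of that arithmetic, here is the nine-node example worked out; only the node count and replication factor come from the paragraph above, while the dataset and memory sizes are assumptions for illustration:

```python
# Back-of-the-envelope numbers for the nine-node example. The dataset and
# per-node cache sizes are assumed values, not measurements.

nodes = 9
replication_factor = 3
dataset_gb = 3_000            # assumed total (logical) dataset size
cache_per_node_gb = 64        # assumed memory reserved for caching per node

# With RF=3 on 9 nodes, each node stores 3/9 = 1/3 of the dataset.
per_node_share_gb = dataset_gb * replication_factor / nodes
total_cache_gb = nodes * cache_per_node_gb

print(f"Each node stores ~{per_node_share_gb:,.0f} GB "
      f"({replication_factor}/{nodes} of the dataset)")
print(f"Combined internal cache across the cluster: {total_cache_gb} GB")
# Because reads are balanced across the three replicas of each key, hot
# data can live in any replica's cache, spreading the working set over
# the cluster's combined memory.
```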
Caches often lack high-availability properties and can easily fail or invalidate records depending on their heuristics. Partial failures are more common and even worse in terms of consistency. When the cache inevitably fails, the database is hit with a flood of unmitigated queries, which can break your SLA. Furthermore, even if the cache itself has some high-availability features, it cannot coordinate handling of such failures with the persistent database it sits in front of. Bottom line: rely on the database, rather than making your latency SLA depend on the cache.
Application complexity – your application needs to handle more situations
External caching introduces application and operational complexity. Once you have an external cache, it is your responsibility to keep it up to date with the database. Regardless of your caching strategy (e.g. write-through, cache-aside, etc.), there will be edge cases where the cache can get out of sync with the database, and you must account for these during application development. Your client settings (such as failover, retry, and timeout policies) need to match the properties of both the cache and the database in order to function when the cache is unavailable or goes cold. Such scenarios are typically difficult to test and implement.
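As a concrete example, here is a sketch of the extra failure handling that a cache-aside strategy pushes into the application; the `cache` and `db` clients and `CacheError` below are hypothetical placeholders, not any specific library's API:

```python
# Sketch of cache-aside reads and writes with explicit cache-failure
# handling. The cache/db objects are hypothetical stand-ins.

class CacheError(Exception):
    """Raised when the cache is unreachable or times out."""


def read_cache_aside(key, cache, db):
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except CacheError:
        # Cache down: fall through to the database. Every request now hits
        # the database directly, so timeout and retry policies tuned for
        # cached latencies may no longer hold.
        pass
    value = db.query(key)
    try:
        cache.set(key, value)
    except CacheError:
        pass                # best effort; the next reader pays the miss again
    return value


def write_cache_aside(key, value, cache, db):
    db.write(key, value)    # the database remains the source of truth
    try:
        cache.delete(key)   # invalidate rather than update the cached entry
    except CacheError:
        # If invalidation fails, readers may see stale data until a TTL
        # expires: one of the out-of-sync edge cases noted above.
        pass
```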
External caching defeats the database's own cache
Modern databases have embedded caches and sophisticated strategies for managing them. When you put a cache in front of the database, most read requests only reach the external cache, so the database never holds those objects in memory. As a result, the database's cache is rendered ineffective. When a request finally does reach the database, the cache is cold and the response must come mostly from disk. The round trip from cache to database and back to the application then increases latency even further.
External caching may increase security risks
External caching adds a whole new attack surface to your infrastructure. Encryption, isolation, and access controls for data placed in the cache may differ from those at the database layer itself.
External caching ignores database knowledge and database resources
Databases are complex systems built for specialized I/O workloads. Many queries access the same data, and some portion of the working set can be cached in memory to save disk accesses. A good database has sophisticated logic for deciding which objects, indexes, and accesses it should cache, along with an eviction policy that determines when new data should replace existing (older) cached objects.
Scan-resistant caching is one example. When scanning a large dataset, such as a large range scan or a full table scan, many objects are read from disk. The database can recognize that this is a scan (rather than a regular query) and choose to keep those objects out of its internal cache. An external cache (following a read-through policy), however, treats the result set like any other and attempts to cache it; a toy illustration follows the list below. The database also automatically synchronizes cached content with disk based on the incoming request rate, so users and developers don't need to do anything to ensure performance and consistency when looking up recently written data. So if for some reason your database isn't responding fast enough, it means one of the following:
- The cache is misconfigured.
- There isn't enough RAM for caching.
- The working set size and request pattern are not a good fit for caching.
- The database's cache implementation is poor.
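Below is a toy illustration of scan resistance (assumed behavior for illustration, not ScyllaDB's actual implementation): the cache admits point reads but declines to cache results flagged as coming from a scan, so a single full table scan cannot evict the hot working set. A naive read-through cache would admit both alike.

```python
from collections import OrderedDict


class ScanResistantCache:
    """Toy LRU cache that refuses to admit entries produced by scans."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()          # insertion order == LRU order

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)     # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value, from_scan=False):
        if from_scan:
            return                          # scans must not pollute the cache
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


cache = ScanResistantCache(capacity=2)
cache.put("hot:1", "a")                     # regular point read: admitted
for i in range(10_000):
    cache.put(f"scan:{i}", i, from_scan=True)   # full scan: not admitted
assert cache.get("hot:1") == "a"            # the hot entry survives the scan
```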
Better option: let the database handle it
How can you meet your SLA without the risks of external database caching? Many teams have found that by migrating to a faster database (such as ScyllaDB) with a specialized internal cache, they can meet their latency SLAs with less hassle and lower cost. Results will of course vary with workload characteristics and technical requirements. But to get a sense of what's possible, consider what these teams achieved.
SecurityScorecard achieves 90% latency reduction with $1M annual savings
SecurityScorecard aims to make the world a more secure place by changing the way thousands of organizations understand, mitigate and communicate about cybersecurity. Its ratings platform is an objective, data-driven and quantifiable measure of an organization's overall cybersecurity and cyber risk exposure.
The team's previous data architecture served them well for a while but couldn't keep up with their growth. Their platform API queried one of three data stores: Redis (for faster lookups of 12 million scorecards), Aurora (for storing 4 billion measurement stats across nodes), or a Presto cluster on the Hadoop Distributed File System (for complex SQL queries on historical results).
As data and requests grew, challenges arose. Aurora and Presto experienced latency spikes at high throughput. Even the largest available Redis instance was not enough, and the team didn't want the complexity of Redis Cluster.
To reduce latency at the new scale required by rapid business growth, the team moved to ScyllaDB Cloud and developed a new scoring API that routes less latency-sensitive requests to Presto and S3 storage. The resulting architecture is fairly simple.
This move resulted in:
- 90% lower latency for most service endpoints
- 80% reduction in production incidents related to Presto/Aurora performance
- $1 million in annual infrastructure cost savings
- Data pipeline processing speed increased by 30%
- Dramatically improved customer experience
[Read more about SecurityScorecard use cases]
IMVU cuts Redis costs at 100X scale
IMVU is a popular social community that enables people around the world to interact using 3D avatars on desktops, tablets, and mobile devices. To meet growing scale requirements, IMVU decided it needed a higher-performance solution than its previous database architecture (MySQL and Memcached in front of Redis). The team looked for something easier to configure, easier to scale, and, if successful, cheaper to scale.
"Redis was great for prototyping capabilities, but once we actually rolled it out, the expense became hard to justify," said Ken Rudy, senior software engineer at IMVU. "ScyllaDB is optimized to keep the required data in memory and everything else on disk. ScyllaDB allows us to maintain the same responsiveness for hundreds of times the scale that Redis can handle."
Comcast reduces long-tail latency by 95% with $2.5M in annual savings
Comcast is a global media and technology company with three primary businesses: Comcast Cable, one of the largest providers of video, high-speed internet, and phone service to residential customers in the United States; NBCUniversal; and Sky. Comcast's Xfinity service serves 15 million homes, with more than 2 billion API calls (reads/writes) and more than 200 million new objects every day. Over seven years, the program expanded from supporting 30,000 devices to more than 31 million.
Cassandra's long-tail latency proved unacceptable at the company's rapidly growing scale. To hide Cassandra's latency issues from users, the team placed 60 cache servers in front of the database. Keeping this caching layer consistent with the database created major headaches for administrators. Because the cache and related infrastructure had to be replicated across data centers, Comcast needed to keep the caches warm: they implemented a cache warmer that examined write volumes and then replicated the data between data centers.
After struggling with the overhead of this approach, Comcast quickly moved to ScyllaDB. ScyllaDB is designed to minimize latency spikes through its internal caching mechanism, which allowed Comcast to eliminate the external caching layer and adopt a simple framework in which data services connect directly to the data store. Comcast was able to replace 962 Cassandra nodes with just 78 ScyllaDB nodes. They improved overall availability and performance while completely eliminating the 60 cache servers. The result: P99, P999, and P9999 latencies dropped by 95%, and the platform could handle twice as many requests at 60% of the operating cost. This ultimately saved them $2.5 million per year in infrastructure costs and staffing overhead.
Conclusion
While external caches are great companions for reducing latency in cases such as serving static content and personalized data that don't require any level of persistence, they often create more problems than benefits when placed in front of a database.
The top trade-offs include increased cost, increased application complexity, additional round trips to the database, and additional security surfaces. By rethinking existing caching strategies and switching to a modern database that delivers predictable low latency at scale, teams can simplify their infrastructure and minimize costs. At the same time, they can still meet their SLAs without the extra hassle and complexity of external caching.