How Reddit Achieved Post View Counts at Scale

We wanted to better communicate the scale of Reddit to our users. Vote score and number of comments are by far the main indicators of activity on a particular post, but Reddit has many visitors who read content without voting or commenting. We wanted to build a system that captures the number of views on a post. That number is then shown to content creators and moderators so they can better understand the activity on a particular post.

In this post, we discuss how we implemented counting at scale.

Counting Methodology

There are four main requirements for view counts:

  • Counting must be real-time or near real-time, not a daily or hourly aggregate.
  • Each user can be counted only once within a short time window.
  • The displayed count must be within a few percent of the actual tally.
  • The system must be able to run at production scale and process events within seconds of their occurrence.

Meeting these four requirements is more complicated than it sounds. To keep an accurate count in real time, we need to know whether a particular user has viewed the post before. To know that, we need to store the set of users who have previously viewed each post, and check that set every time we process a new view of that post. A naive implementation would store this set of unique users in memory in a hash table keyed by post ID, as in the sketch below.
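
As a rough illustration (this is not Reddit's actual code), the naive approach might look like the following in Java: a map from post ID to the set of user IDs that have already been counted.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Naive exact counter: one in-memory set of user IDs per post.
// Memory grows linearly with the number of unique viewers of each post.
public class NaiveViewCounter {
    private final Map<String, Set<Long>> viewersByPost = new ConcurrentHashMap<>();

    // Records a view and returns true if this user had not been counted yet.
    public boolean recordView(String postId, long userId) {
        Set<Long> viewers = viewersByPost.computeIfAbsent(
                postId, id -> ConcurrentHashMap.newKeySet());
        return viewers.add(userId);
    }

    public long uniqueViews(String postId) {
        Set<Long> viewers = viewersByPost.get(postId);
        return viewers == null ? 0 : viewers.size();
    }
}
```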

This approach works for posts with a small number of views, but it is hard to scale once a post becomes popular and the number of viewers grows quickly. Several popular posts have more than a million unique viewers! For posts like that, storing all the IDs and repeatedly checking the set to see whether a user has already been counted becomes a serious memory and CPU hit.

Since we could not provide exact counts at this scale, we investigated several different cardinality-estimation algorithms and settled on two options that fit our requirements well:

  1. Linear probabilistic counting, which is very accurate but requires linearly more memory as the counted set grows.
  2. Counting based on HyperLogLog (HLL). HLLs grow sub-linearly with set size, but do not provide the same accuracy as linear counters.

To see how much space HLL really saves, consider the r/pics post included at the top of this article, which has over 1 million unique viewers. If we stored 1 million unique user IDs at 8 bytes each, we would need 8 megabytes of memory to count unique viewers of a single post! In contrast, an HLL takes far less memory. The exact amount differs between implementations, but with the implementation we use we can count over a million IDs using only 12 kilobytes, which is 0.15% of the original space!
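
As a quick sanity check of those numbers (the 12 KB figure corresponds to a dense HLL of 2^14 six-bit registers, which is what Redis uses; the exact layout is an implementation detail):

```java
// Back-of-the-envelope comparison of exact vs. HLL storage for one post.
public class MemoryEstimate {
    public static void main(String[] args) {
        long uniqueUsers = 1_000_000L;
        long exactBytes = uniqueUsers * 8;       // 8-byte user IDs -> ~8 MB
        long hllBytes = (1 << 14) * 6 / 8;       // 16384 registers * 6 bits ~= 12 KB
        System.out.printf("exact: %d bytes, HLL: %d bytes (%.2f%%)%n",
                exactBytes, hllBytes, 100.0 * hllBytes / exactBytes);
    }
}
```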

(This High Scalability article provides a good overview of the two algorithms above.)

Many HLL implementations use a combination of the two approaches above, starting with linear counting for small sets and switching to HLL once the size reaches a certain point. The former is often referred to as the "sparse" HLL representation, while the latter is referred to as the "dense" HLL representation. This hybrid approach is very beneficial, since it provides accurate results while retaining a modest memory footprint. It is described in more detail in Google's HyperLogLog++ paper.
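
To illustrate the hybrid approach, here is a minimal sketch using stream-lib's HyperLogLogPlus (one of the libraries discussed below): when constructed with a sparse precision it starts out in the sparse representation and converts itself to the dense one as the set grows. The parameter values here are only for illustration.

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class HybridHllExample {
    public static void main(String[] args) {
        // p = 14 controls dense accuracy/size; sp = 25 enables the sparse representation.
        HyperLogLogPlus hll = new HyperLogLogPlus(14, 25);
        for (long userId = 0; userId < 1_000_000; userId++) {
            hll.offer(userId);               // add a user ID to the sketch
        }
        System.out.println("estimated unique users: " + hll.cardinality());
    }
}
```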

While the HLL algorithm is fairly standard, we considered three variations for our implementation. Note that for in-memory HLL implementations we only looked at Java and Scala, since we primarily use Java and Scala on the data engineering team.

  1. Twitter's Algebird library, implemented in Scala. Algebird has good usage documentation, but the implementation details of the sparse and dense HLL representations were not easy to understand.
  2. The HyperLogLog++ implementation in stream-lib, implemented in Java. The code in stream-lib is well documented, but it was somewhat difficult to understand how to use the library properly and tune it to our needs.
  3. Redis's HLL implementation (our choice). In our opinion, Redis's HLL implementation is well documented and easy to configure, and the HLL-related APIs it provides are simple to integrate (see the sketch after this list). As an added bonus, using Redis alleviated many of our performance concerns by offloading the CPU- and memory-intensive part of the counting application (the HLL computations) onto a dedicated server.
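
A minimal sketch of the Redis HLL commands involved, using the Jedis client; the host, port, and key naming are illustrative rather than Reddit's actual scheme.

```java
import redis.clients.jedis.Jedis;

public class RedisHllExample {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String key = "views:t3_example";          // hypothetical per-post key
            redis.pfadd(key, "user:1", "user:2", "user:3");
            redis.pfadd(key, "user:2");               // duplicate, does not change the estimate
            System.out.println("estimated unique views: " + redis.pfcount(key));
        }
    }
}
```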

Reddit's data pipeline revolves around Apache Kafka. When a user views a post, an event fires and is sent to an event collector server, which batches the events and saves them to Kafka.

From here, the view-counting system has two components that operate in sequence. The first part of our counting architecture is a Kafka consumer called Nazar, which reads every event from Kafka and passes it through a set of rules we have programmed to decide whether the event should be counted. We gave it this name because a nazar is an eye-shaped amulet that protects against evil, and the Nazar system is the "eye" that protects us from bad actors. Nazar uses Redis to maintain state and track potential reasons why a view should not be counted; one such reason is repeated views by the same user within a short period of time. Nazar then alters the event, adding a Boolean flag indicating whether it should be counted, before sending the event back to Kafka.
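
A heavily simplified sketch of what such a consumer could look like. The topic names, the ten-minute dedup window, and the Redis key scheme are assumptions for illustration; the real rule set is considerably richer.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class NazarSketch {
    public static void main(String[] args) {
        // Shared config for brevity; the consumer ignores serializer keys (and vice versa).
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "nazar");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Jedis redis = new Jedis("localhost", 6379)) {
            consumer.subscribe(List.of("view-events"));          // hypothetical input topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Assume key = "postId:userId"; real events carry a richer payload.
                    String dedupKey = "seen:" + record.key();
                    // SET NX EX succeeds only if this user has not viewed the post recently.
                    boolean firstView = redis.set(dedupKey, "1",
                            SetParams.setParams().nx().ex(600)) != null;
                    String flagged = record.value() + ",countable=" + firstView;
                    producer.send(new ProducerRecord<>("flagged-view-events", record.key(), flagged));
                }
            }
        }
    }
}
```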

The second part of the project is a second Kafka consumer called Abacus, which actually counts the views and makes the counts visible on the site and in the clients. Abacus reads the Kafka events output by Nazar and, depending on Nazar's decision, either counts or skips the view. If the event is marked as countable, Abacus first checks whether an HLL counter already exists in Redis for the post the event refers to. If the counter is already in Redis, Abacus sends a PFADD request to Redis. If the counter is not in Redis, Abacus makes a request to the Cassandra cluster, which we use to persist the HLL counters and raw counts, and then makes a SET request to Redis to add the filter. This usually happens when people view older posts whose counters have already been evicted from Redis.
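
A sketch of that per-event counting step. The key scheme and the Cassandra restore helper are hypothetical, since the schema isn't described here; Redis HLLs are stored as ordinary strings, which is why a persisted filter can be put back with a plain SET.

```java
import redis.clients.jedis.Jedis;

public class AbacusSketch {
    private final Jedis redis;

    public AbacusSketch(Jedis redis) {
        this.redis = redis;
    }

    public void countView(String postId, String userId) {
        String key = "views:" + postId;                      // hypothetical key scheme
        if (!redis.exists(key)) {
            // Counter was evicted (or never existed): restore the persisted HLL filter, if any.
            byte[] persistedFilter = loadFilterFromCassandra(postId);  // hypothetical helper
            if (persistedFilter != null) {
                redis.set(key.getBytes(), persistedFilter);  // SET the raw HLL string back into Redis
            }
        }
        redis.pfadd(key, userId);                            // PFADD is idempotent per element
    }

    private byte[] loadFilterFromCassandra(String postId) {
        // Placeholder: in the real system this reads the serialized HLL from Cassandra.
        return null;
    }
}
```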

To allow counts to be maintained on older posts that may have been evicted from Redis, Abacus also periodically writes the full HLL filter from Redis, along with the count for each post, out to the Cassandra cluster. Writes to Cassandra are batched in 10-second groups per post to avoid overloading the cluster. Below is a high-level diagram of the event flow.
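
And a sketch of that periodic flush. The scheduling, key scan, and Cassandra write here are assumptions for illustration; in particular, scanning with KEYS is only acceptable in a toy example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;

public class HllPersistenceSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Jedis redis = new Jedis("localhost", 6379);

        scheduler.scheduleAtFixedRate(() -> {
            // The real system tracks which posts were recently updated; this scan is illustrative.
            for (String key : redis.keys("views:*")) {
                byte[] filter = redis.get(key.getBytes());   // raw HLL encoding (a Redis string)
                long count = redis.pfcount(key);
                persistToCassandra(key, filter, count);      // hypothetical batched write
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private static void persistToCassandra(String key, byte[] filter, long count) {
        // Placeholder for the batched Cassandra write of the filter and the current count.
    }
}
```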

Summary

We hope the view counters will help content creators better understand what's going on with each post, and help moderators quickly identify which posts in their communities are receiving a lot of traffic. In the future, we plan to use the real-time potential of our data pipeline to provide more useful feedback to more people.
