Krishnan Chandra (u/shrink_and_an_arch)
Senior Software Engineer
We want to better communicate the scale of Reddit to our users. Up to this point, vote score and number of comments were the main indicators of activity on a given post. However, Reddit has many visitors that consume content without voting or commenting. We wanted to build a system that could capture this activity by counting the number of views a post received. This number is then shown to content creators and moderators to provide them better insight into the activity on specific posts.
In this post, we’re going to talk about how we implemented counting at scale.
We had four main requirements for counting views:
- Counts must be real time or near-real time. No daily or hourly aggregates.
- Each user must only be counted once within a short time window.
- The displayed count must be within a few percentage points of the actual tally.
- The system must be able to run at production scale and process events within a few seconds of their occurrence.
Satisfying all four of these requirements is trickier than it sounds. In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.
This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.
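To make the naive approach concrete, here is a minimal sketch of exact counting with an in-memory hash table of sets. The names `viewers_by_post` and `record_view` are hypothetical and used only for illustration; this is not the production code.

```python
from collections import defaultdict

# Hypothetical in-memory store: post ID -> set of user IDs that viewed the post.
viewers_by_post = defaultdict(set)


def record_view(post_id, user_id):
    """Record a view and return the exact number of unique viewers for the post."""
    viewers_by_post[post_id].add(user_id)  # adding an existing ID is a no-op
    return len(viewers_by_post[post_id])


record_view("post_a", 1)
record_view("post_a", 2)
print(record_view("post_a", 1))  # prints 2: the repeat view is not counted again
```

Every unique viewer adds one more ID to the set, so memory grows linearly with the audience of each post, which is the scaling problem described above.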
Since we could not provide exact counts, we looked at a few different cardinality estimation algorithms. We considered two options that closely matched what we were looking to accomplish:
- A linear probabilistic counting approach, which is very accurate, but requires linearly more memory as the set being counted gets larger.
- A HyperLogLog (HLL)-based counting approach. HLLs grow sub-linearly with set size, but do not provide the same level of accuracy as linear counters.
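For illustration, a linear probabilistic counter can be sketched as a bitmap plus a hash function: each item sets one bit, and the count is estimated from the fraction of bits that are still zero. The class name, the bitmap size, and the choice of SHA-1 as the hash are assumptions made for this sketch, not details of any library we evaluated.

```python
import hashlib
import math


class LinearCounter:
    """Minimal linear probabilistic counter backed by a fixed-size bitmap."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def add(self, item):
        # Hash the item to a single bit position and set that bit.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index // 8] |= 1 << (index % 8)

    def estimate(self):
        # Estimate n = -m * ln(V), where V is the fraction of zero bits.
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        zero_bits = self.num_bits - set_bits
        if zero_bits == 0:
            return float("inf")  # bitmap is saturated; no estimate is possible
        return -self.num_bits * math.log(zero_bits / self.num_bits)


counter = LinearCounter(num_bits=100_000)
for user_id in range(10_000):
    counter.add(user_id)
    counter.add(user_id)  # repeat views do not change the bitmap
print(round(counter.estimate()))
```

The bitmap has to be sized in proportion to the largest set you expect to count, which is why memory use grows linearly with the set.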
For an understanding of just how much space HLLs really save, consider the r/pics post included at the top of this blog post. It received over 1 million unique users. If we had to store 1 million unique user IDs, and each user ID is an 8-byte long, then we would require 8 megabytes of memory just to count the unique users for a single post! In contrast, using an HLL for counting would take significantly less memory. The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!
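The arithmetic behind those figures can be checked in a few lines. The last part assumes the 16384 six-bit registers that Redis documents for its dense HLL representation, together with the usual HLL standard error formula of 1.04 / sqrt(number of registers); treat it as a back-of-the-envelope check rather than a measurement.

```python
import math

unique_viewers = 1_000_000
bytes_per_user_id = 8
exact_set_bytes = unique_viewers * bytes_per_user_id  # 8,000,000 bytes = 8 MB

hll_bytes = 12_000  # roughly 12 KB per counter
percent_of_exact = 100 * hll_bytes / exact_set_bytes
print(f"{percent_of_exact:.2f}%")  # prints 0.15%

# Assumption: 16384 registers of 6 bits each (Redis dense representation).
registers = 16_384
dense_bytes = registers * 6 // 8  # 12288 bytes, i.e. about 12 KB
standard_error = 1.04 / math.sqrt(registers)  # about 0.008, i.e. under 1%
```

An expected error of under one percent fits comfortably within the "few percentage points" requirement listed earlier.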
(This article on High Scalability has a good overview of both of the above algorithms.)
Many HLL implementations use a combination of the above two approaches, by starting with linear counting for small sets and switching over to HLL once the size reaches a certain point. The former is frequently referred to as a “sparse” HLL representation, while the latter is referred to as a “dense” HLL representation. The hybrid approach is very advantageous, because it can provide accurate results for both small sets and large sets while retaining a modest memory footprint. This approach is described in more detail in Google’s HyperLogLog++ paper.
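To show the idea in miniature, here is a sketch of a dense HLL that falls back to linear counting while most of its registers are still empty. It follows the small-range correction from the original HyperLogLog algorithm rather than the sparse representation in HyperLogLog++, and the class name, the precision of 14, and the SHA-1 hash are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal dense HyperLogLog with a linear-counting fallback for small sets."""

    def __init__(self, precision=14):
        self.p = precision
        self.m = 1 << precision  # number of registers
        self.registers = bytearray(self.m)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = h >> (64 - self.p)  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small set: linear counting over the empty registers is more accurate.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for user_id in range(100_000):
    hll.add(user_id)
print(round(hll.count()))
```

The register array stays the same size whether it has seen a hundred IDs or a hundred million, which is where the memory savings come from.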
While the HLL algorithm is fairly standard, there were three variants we considered using in our implementation. Note that for in-memory HLL implementations, we only looked at Java and Scala implementations as we primarily use Java and Scala on the data engineering team.
- Twitter’s Algebird library, implemented in Scala. Algebird has good usage docs, but the implementation details of the sparse and dense HLL representations were not easily understandable.
- An implementation of HyperLogLog++ located in stream-lib, implemented in Java. The code in stream-lib is very well-documented, but it was somewhat difficult to understand how to use the library properly and tune it to our needs.
- Redis’s HLL implementation (which we chose). We felt that the Redis implementation of HLLs was well-documented and easily configurable, and that the HLL-related APIs provided were easy to integrate. As an added benefit, using Redis alleviated many of our performance concerns by taking the CPU- and memory-intensive portion of the counting application (HLL computations) out and moving it onto a dedicated server.
Reddit’s data pipeline is primarily oriented around Apache Kafka. When a user views a post, an event gets fired and sent to an event collector server, which batches the events and persists them into Kafka.
From here, the view-counting system has two components which operate sequentially. The first part of our counting architecture is a Kafka consumer called Nazar, which will read each event from Kafka and pass it through a set of rules we’ve concocted to determine whether or not an event should be counted. We gave it this name because just as a nazar is an eye-shaped amulet protecting you from evil, the Nazar system is an “eye” that protects us from bad actors trying to game the system. Nazar uses Redis to maintain state and keep track of potential reasons why a view should not be counted. One reason we may not count an event is if it’s the result of repeat views from the same user over a short period of time. Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.
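As an illustration of the kind of rule described above, the sketch below flags repeat views from the same user on the same post within a short window. Everything here is hypothetical: the class name, the event fields (`post_id`, `user_id`, `should_count`), the 10-minute window, and the in-memory dict standing in for the state that Nazar keeps in Redis.

```python
import time


class RepeatViewRule:
    """Flags a view as countable unless the same user viewed the post recently."""

    def __init__(self, window_seconds=600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.last_counted = {}  # (post_id, user_id) -> time of last counted view

    def annotate(self, event):
        key = (event["post_id"], event["user_id"])
        now = self.clock()
        last = self.last_counted.get(key)
        should_count = last is None or now - last >= self.window_seconds
        if should_count:
            self.last_counted[key] = now
        # Return a copy of the event with the Boolean flag added.
        return {**event, "should_count": should_count}


rule = RepeatViewRule()
first = rule.annotate({"post_id": "p1", "user_id": 42})
second = rule.annotate({"post_id": "p1", "user_id": 42})
print(first["should_count"], second["should_count"])  # prints True False
```

The rule only annotates the event; deciding what to do with the flag is left to the downstream consumer.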
This is where the second part of the project picks up. We have a second Kafka consumer called Abacus, which does the actual counting of views and makes the counts available for the site or clients to display. Abacus reads the events from Kafka that were output by Nazar; then, depending on Nazar’s determination, it either counts or skips over the view. If the event is marked for counting, then Abacus first checks if there is an HLL counter already existing in Redis for the post corresponding to the event. If the counter is already in Redis, then Abacus makes a PFADD request to Redis for that post. If the counter is not already in Redis, then Abacus makes a request to a Cassandra cluster, which we use to persist both the HLL counters and the raw count numbers, and makes a SET request into Redis to add the filter. This usually happens when people view older posts whose counters have been evicted from Redis.
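The decision flow above can be sketched as follows. To keep the example runnable without a server, `InMemoryRedis` is a stand-in that mimics only the `EXISTS`, `SET`, `PFADD`, and `PFCOUNT` commands and uses a plain Python set in place of a real HLL; a dict stands in for Cassandra. The key format, the event fields, and the final `PFADD` after restoring a counter are assumptions for illustration.

```python
class InMemoryRedis:
    """Stand-in for the few Redis commands used below; a set plays the HLL."""

    def __init__(self):
        self.data = {}

    def exists(self, key):
        return int(key in self.data)

    def set(self, key, value):
        self.data[key] = set(value)

    def pfadd(self, key, *elements):
        counter = self.data.setdefault(key, set())
        before = len(counter)
        counter.update(elements)
        return int(len(counter) != before)

    def pfcount(self, key):
        return len(self.data.get(key, ()))


def handle_event(event, redis, cassandra):
    """Count one annotated view event and return the post's current view count."""
    if not event["should_count"]:
        return None  # the upstream rules marked this view as not countable
    key = "views:" + event["post_id"]  # hypothetical key format
    if not redis.exists(key):
        saved = cassandra.get(event["post_id"])  # persisted counter, if any
        if saved is not None:
            redis.set(key, saved)  # restore the evicted counter
    redis.pfadd(key, event["user_id"])
    return redis.pfcount(key)


redis = InMemoryRedis()
cassandra = {"old_post": {1, 2}}  # counter persisted before eviction
event = {"post_id": "old_post", "user_id": 3, "should_count": True}
print(handle_event(event, redis, cassandra))  # prints 3
```

With a real Redis client, the value written by `SET` would be the serialized HLL read back from Cassandra rather than a Python set.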
In order to allow for maintaining counts on older posts that might have been evicted from Redis, Abacus periodically writes out both the full HLL filter from Redis along with the count for each post to a Cassandra cluster. Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster. Below is a diagram outlining this event flow at a high level.
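One way to picture the 10-second batching described above is a small buffer that keeps only the newest snapshot per post and writes it out once the post's window has elapsed. The class, its method names, and the explicit `now` argument are assumptions for this sketch; the source only states that writes are grouped in 10-second batches per post.

```python
class BatchedWriter:
    """Buffers per-post snapshots and writes each post at most once per interval."""

    def __init__(self, write_fn, interval_seconds=10.0):
        self.write_fn = write_fn  # called as write_fn(post_id, snapshot)
        self.interval = interval_seconds
        self.pending = {}  # post_id -> newest snapshot not yet written
        self.window_start = {}  # post_id -> time the current batch opened

    def submit(self, post_id, snapshot, now):
        self.pending[post_id] = snapshot  # newer snapshots replace older ones
        self.window_start.setdefault(post_id, now)  # first update opens the window
        self.flush_due(now)

    def flush_due(self, now):
        for post_id, started in list(self.window_start.items()):
            if now - started >= self.interval:
                self.write_fn(post_id, self.pending.pop(post_id))
                del self.window_start[post_id]


writes = []
writer = BatchedWriter(lambda post_id, snapshot: writes.append((post_id, snapshot)))
writer.submit("p1", {"count": 1}, now=0.0)
writer.submit("p1", {"count": 2}, now=5.0)
writer.submit("p1", {"count": 3}, now=10.0)
print(writes)  # prints [('p1', {'count': 3})]
```

Three updates to the same post within ten seconds result in a single write carrying the latest state, which is what keeps the write load on the cluster bounded for popular posts.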
We hope that view counts will better enable content creators to understand the full reach of their posts, and help moderators to quickly identify which posts are receiving large amounts of traffic in their communities. In the future, we plan to leverage the real-time potential of our data pipeline to deliver more useful feedback to redditors.
If you’re interested in solving problems like this at scale, check out our Careers page.
Special thanks to u/d3fect, u/powerlanguage, and u/gooeyblob for their contributions on this project, and to u/Kaitaan, u/bsimpson, u/spladug, u/mart2d2, and u/KeyserSosa for reviewing and editing this post.
This Thursday, The Dallas Morning News brings its AMA series to a close with Dallas Mayor Mike Rawlings, who will be chatting with redditors in r/politics at 2 P.M. CST.
Rawlings’s AMA marks the third installment of DMN’s series on Reddit, which kicked off last month with editor Mike Wilson’s AMA in r/IAmA, followed by publisher Jim Moroney.
Among the topics the two discussed with redditors, Wilson shared his thoughts on clickbait, how he keeps personal bias in check, and the future of outlets like DMN that foster original reporting.
“I think local journalism has value, and we need income to keep providing that information,” he wrote. “So I’m encouraging people to subscribe rather than assuming they should get valuable information for free — an assumption my industry unfortunately encouraged for a long time.”
He also delved into the less serious subjects like pet names and a hypothetical competition with other local journalists:
Meanwhile, Moroney addressed fake news, how he believes media outlets can restore their credibility with readers, and social media echo chambers.
DMN homepage and social producer Nicholas Friedman explained the intent behind the series and the outlet’s recently launched profile on Reddit.
“As a longtime Reddit user, I’ve been thinking for a while now about ways to bring together the community and readership we have at The Dallas Morning News with the open community of a platform like Reddit,” he wrote.
“With the marketing team, members of the Audience Team here and reps from Reddit, we were able to make that a reality, proving that the platform could be key in how we look at and interact with news.”
You can read Wilson’s full AMA here and Moroney’s here. Check out DMN’s blog post about their ongoing Reddit engagement on dallasnews.com.
Argonauts of the Internet: Anthropology and Community Management
Anthropologist / Community Manager
This blog post was adapted from “The Digital and the Applied: Digital Anthropology and Business,” a paper presented at the 2017 Southwestern Anthropological Association Conference.
As a Community Manager at Reddit (one of the many roles lumped into the “admin” title on the site), I get asked a lot of questions about my job. Most of the time, it’s questions about specific communities (usually ones getting attention for causing trouble), but I also often get asked what my day-to-day is like, and what a Community Manager actually does. Once people find out that my background is in anthropology, the very next question is almost always “Why anthropology?” During a recent radio interview, I was asked a new question: “Are there other anthropologists working in your field?”
This gave me pause. After all, I’d spent the prior few minutes (and months, and years) expounding on why anthropology and community management are like two peas in a pod! And now, I had to admit that most community managers I’ve met (with the exception of the ones here at Reddit and a few other community content-driven sites) came from marketing backgrounds. Few had training in social sciences, and I had yet to meet any that had a background similar to mine.
Anthropology is often thought of as a stodgy discipline: leather armchairs, sepia-toned pictures (without the need for an Instagram filter), old men with fancy facial hair, and tromps through faraway lands to study an exotic culture, writing a book that will sit on a library shelf collecting dust until some unassuming undergrad needs just one more source for an assignment. But honestly, anthropology is now, and has always been, so much more. We’re studying Anonymous, NASCAR, and World of Warcraft. We’ve worked in big tech and market research, among other fields. If you enjoy your yogurt in a portable container, you can thank an anthropologist for that (look me up if you want the story; it’s amazing). Anthropology is endlessly flexible and adaptable to any field.
So if anthropology is so nifty and keen, and isn’t all those other things, what is it? The tl;dr is that it’s the study of humans (anthro meaning “human” and –ology meaning “study of”). Each field of anthropology has its focus; cultural (or social) anthropology, which is my main area, focuses on human culture in all its forms. For some of us, it means the traditional idea of going to a different country and living with the people there. For some of us, though, it’s communities closer to home.
“Community” is a term often tossed about with great abandon, especially in startup culture. It’s sometimes interchanged with “network,” but in the end, everyone means the same thing: a group of individuals gathered together who share certain characteristics or interests. Those interests could be a specific fandom, an affection for driving others around in a personal car, completing tasks for a set fee, or any other of the many types of communities on the internet right now. With the growth in the so-called “gig economy,” groups of people forming together online will continue to be big business in this sector, for good or for ill.
Companies in the “gig economy,” organizations with more “traditional” types of communities like Wikipedia and Reddit, and sometimes brands that utilize social media in their advertising and marketing campaigns generally employ community managers to keep track of their communities or networks. This work entails interfacing with the community, ensuring that their needs are met, providing support, and voicing the issues and concerns of that community to the company at large. These jobs require a fairly unique blend of skills: part PR, part customer service, part marketing, not to mention part community confidante, among others. One that isn’t often brought up, though, is researcher. Online community managers are well served by being the #1 experts in their community, how it functions, and even how it forms itself into a working collective. To do this, online community managers must be willing to do in-depth research into their respective communities.
Most community managers I have met in my 15+ years in the field have come from a background in marketing (for those who manage communities on social media, this is key), customer service, or another business-related field. While all of those fields can provide the bulk of the skills needed to be a good community manager, it is anthropology that holds a key to tying all of these threads together for effective community management.
Anthropology helps give you the tools needed for in-depth research into a community (or, dare I say, culture) and make sense of what you’re seeing. While this may not be “fieldwork” in the traditional sense of going into a foreign culture and living in it for long periods of time, it is still fieldwork and the work that comes out of it can be ethnography (basically, this is kind of the end result of fieldwork; the write-up of what was observed and concluded). Without good, grounded ideas as to who your community is, what they want, and how they function, you can’t be an effective community manager.
For instance, recently those of us on the Community Team here at Reddit became interested in what makes a community “healthy” versus “unhealthy.” It was something we’d talked to mods about, and we wanted to get some appreciation for how it looked on the ground. When we first started kicking the project around, a lot of quantitative measures (aka hard numbers) were getting tossed about: number of pageviews, number of subscribers, number of comments, number of posts… you get the idea. Things that were easy to count, but really had very little to do with health. You can have a community getting a ton of pageviews and activity, but it’s because there’s a lot of drama and trolling festering inside and making the community an unpleasant place to be. Or, you could have a very low-volume community that’s chugging along peacefully and very healthily. Of course, the converse of both of these could also be true, hence why these numbers really aren’t the end-all-be-all in figuring out health.
So, several of us sat down and brainstormed the things that we’d seen in healthy and unhealthy communities we had experience with (both on and off Reddit), as well as other ideas our mods had told us. Before long, we had a rubric of things to look for in communities. This was not a “healthy community checklist”; instead, it was a list of items to observe. For instance: Are the mods active in the community? What did that look like? Did the community and the mod team seem to get along? Did they stand up for each other? How do they deal with trolls? Was the community there truly to engage with content and/or each other, or just to consume content?
I want to stress again, none of these items was seen as a strict marker of health (or lack of health). Instead, we wanted to get a broad, holistic picture of each community and see where this led us. In other words, we were conducting a rapid ethnographic assessment (I like to call them “flash ethnographies” because it sounds more exciting): a quick gathering of ethnographic data. From this, we can start to look at our data and see patterns of healthy communities, so we can help new mods (and existing ones, too!) create and maintain healthy communities on Reddit.
Those of you who are reading this and are versed in anthropology may be scratching your heads at this point. After all, anthropology isn’t about defining health or function; it’s about observation and study. However, I would argue that this piece is what takes anthropology and makes it applied. As applied anthropologists, we can take our observations to our managers and higher-ups, but those observations on their faces aren’t going to mean much to them. Imagine if I went to u/spez with a list of things we’d observed in all those subreddits. He’d tell me, “That’s great. What does it mean for us?” That’s where application comes in: I can tell him how these are markers of health, what that means, and what we can do to foster that in more communities.
While this sounds very cut and dried, that doesn’t mean I’m not utilizing a lot of anthropological theory in my day-to-day, so if you’re an anthropology student out there thinking you can ditch your theory classes, think again. While I, personally, may not trade in cultural hegemony or structural functionalism on a regular basis (thank heavens), I do have several ideas that form the basis of what I do, including social capital and communitas. If you’re into community management and/or anthropology, I highly recommend you get familiar with those terms and learn how they can work for you.
None of this is to say that if you don’t have this same background you can’t do this type of research or understand your results. But having this background has certainly helped me formulate research plans and make sense of results quicker and, I feel, with a more holistic view of the entire project overall than I did before when I was doing this work without anthropology. Having ideas to ground and focus in, along with experience in creating research plans, is a boost for my productivity and helps my end results be clearer.
I’ve spent this whole post making the case for anthropology in community management, but there are a host of other places where anthropology is a huge boon. UX design/research is huge for anthropology and the social sciences in general. Beyond trying to predict how users will interface with a product, the research piece involves talking to users about how they would actually use it. That matters because the way a user uses a function and the way it was designed to be used are sometimes entirely different things. Business development could use more anthropologists, too. After all, how can you make the best deals and partnerships for your communities if you don’t understand them? Even developers and engineers can find it useful, since anthropology is about the humans they are creating features and tools for.
In the end, I really see my job as being the person at the table who always remembers the human, seeing the user on the other side of the screen. That person could be coming to Reddit to fangirl over their favorite TV show, to celebrate their team’s victory, or to find comfort after a rough time. Reddit is this and more to so many people, and preserving that is very important to all of us. We have to remember their needs, YOUR needs, at the end of the day, otherwise all those features and tools, designs and uses, deals and partnerships, are all for naught.
Interested in working for the largest community on the internet? Check out our Careers page for a list of open positions.