September 15th, 2017: Thanks to everyone who came out to Carleton Wednesday night! I had a great time and I was COMPLETELY SURPRISED by the cookies they had for all of us. Here's hoping that all future events have free cookies with my characters on them forever. Yes please!
Long ago, in a Reddit blog series easily searchable in the top-right search bar, we shared small slices of what it’s like to work at Reddit HQ in our “Ask an Admin” posts. We continue to get a lot of questions from redditors and job applicants alike asking who exactly makes up the team behind the front page of the internet, so we’re kicking off a new series called “Snoo Spotlights,” starting with Rohaina (better known as u/khaleesi on Reddit) from our Los Angeles Sales team.
Q: What’s a typical day in the life of Rohaina Hassan?
My role sits in the middle of advocating for our users, partners, and internal teams. A typical day involves expectation setting, updates, and finding solutions amongst various internal and external teams to keep our Advertising partnerships and campaigns running smoothly. You’ll see a variety of meetings on my calendar, from brainstorm sessions to kick-off calls with clients to interviews with candidates.
Most days are sprinkled with spontaneous coffee/pastry runs with people from different teams or finding new ways to troll people on my team. The company is in growth mode, so if I’m walking around the office I’ll also try to make sure I meet new folks and make them feel welcome.
Q: And outside of work?
I do a few things! I help produce a podcast hosted by two of my friends. It’s called Black Girl Book Club, and on each episode Lauren and Jordan talk about books they’ve read that were written by women of color. (Check it out if you’re curious!) I’m also working on a zine as part of the Undertone Collective, and I host a radio show that highlights artists of color, with an emphasis on South Asian musicians.
Aside from that, you can usually find me scoping out food spots, watching waaaay too much TV, trying to knock off cities on my list, running around with my camera, or volunteering with local organizations.
Q: What’s one thing you didn’t know about Reddit that you learned after joining?
The depth of our communities. I knew how many we had and the volume of interactions and the substantive quality of those interactions, but I didn’t truly know until I joined the team. I’ve discovered so many obscure subreddits and interesting users’ stories. I’m lucky my role entails knowing about our communities, because I have an excuse for all the perusing I do!
Q: What are you most excited about working on right now?
Well, I just moved down to the LA office, so I think a bunch of us on the Sales team are excited to get that up and running. We’ve got some really cool upcoming campaigns with some of our TV & Film partners, so keep an eye out for those. On the technical side, our Ads Engineering and Product teams are incredible, and they are working on some really exciting projects that will definitely be pivotal for the business.
Q: What are some of your favorite communities on Reddit?
Q: What advice would you give to someone looking to join Reddit’s Sales team?
Embrace yourself, be ready to grind, ask questions, think critically, and always remember the human. Also, while you don’t necessarily need to be a redditor, I would make sure you’re genuinely interested in the Reddit story and mission.
Interested in joining our growing Sales team? Check out our list of open positions on our Careers page.
We are proud that Reddit is home to some of the most authentic conversations anywhere online. While the vast majority of those conversations are informative, uplifting, or delightfully silly, the full reality is that sometimes they can also be uncomfortable.
Dialogue is central to what makes Reddit special—whether it’s r/changemyview, r/ExplainBothSides, or r/AskTrumpSupporters, Reddit is a place where people can step outside their own bubbles to learn from those different from them. This type of communication is critical to fostering tolerance, which the United Nations has defined as “respect, acceptance and appreciation of …[our] ways of being human.” Sounds a lot like “Remember the human,” doesn’t it?
To take these types of conversations to the next level, we are excited to announce the launch of a special AMA series, leading up to the International Day for Tolerance on November 16th. In the coming weeks, subreddits around the site will host thought-provoking AMAs with individuals whose work is dedicated to constructively exploring the most difficult issues in our society, from anti-semitism to how we treat refugees, and from racism to the place of controversial monuments in public spaces. Participants confirmed at time of writing (with more to be added) include:
Daryl Davis, the subject of the documentary film Accidental Courtesy, is a black actor and musician who has convinced more than 200 people to leave the Ku Klux Klan by genuinely befriending them. His AMA is on September 18, 11am ET, in r/IAmA.
Jim Grossman is the Executive Director of the American Historical Association. His research specializes in the place of history in public culture. Date TBC.
Janis Shinwari and Matthew Zeller
Janis Shinwari is an Afghan interpreter and Matthew Zeller is the American soldier he saved from the Taliban. Now, Shinwari and Zeller are the Co-Founders of No One Left Behind, an organization that helps resettle Afghan and Iraqi refugees who served as interpreters with US forces and are on ISIS kill lists for doing so. Date TBC.
We hope you’ll join us for these fascinating engagements, and stay tuned for more updates on the Day for Tolerance AMA series in the weeks ahead.
Chris Slowe, Nick Caldwell, & Luis Bitencourt-Emilio CTO, VP of Engineering, Director of Engineering
What’s the Fuss?
A common question we get from newbie engineering team members here at Reddit is “When are we going to fix search?” Until this year, the answer was always “Go ask the search team on the 5th floor.” Which was great fun because a) the elevator button to the 5th floor didn’t work and b) there was no search team.
But the times, they are a-changin’. We’re happy to announce that we’re launching a new search engine at Reddit. Actually, it’s been launched to 50% of traffic for the past couple weeks and has already served up nearly half a billion queries. Now that we’re confident in our system, we’re pushing it to 100% of traffic. We hope you enjoy faster and more reliable results!
More importantly, we’ve also started an entire product unit dedicated to search and relevance here at Reddit, led by our Director of Engineering Luis. We recognize that these technologies are critical to Reddit’s future. Our platform contains one of the world’s most interesting collections of content, currently indexing over a quarter billion posts for search, and it gets bigger every day. But we know this content is hard to find. Improving search and relevance will allow Reddit to sift through millions of posts, comments, and communities to create a custom-fit stream of great content straight to your home feed.
That’s the future. For now, we thought it’d be fun to take a trip down memory lane.
Needless to say, search is not an easy challenge to solve. We’ve been on a bit of a roller coaster when it comes to search at Reddit, but now that we’re on our sixth search stack, we’re no strangers to the struggles of doing search at scale. Below is a rough outline of the 12-year history, along with a few select quotes from the team as we’ve iterated to scale our infra to Reddit’s needs:
2005 – Steve Huffman (u/spez), co-founder and now CEO, turns on postgres 7.4’s contrib/tsearch2. This was a simpler time, when the statement “Oh, we can just have Postgres do it!” was greeted with “Sounds good to me! What can’t Postgres do!?” We also really liked TRIGGERs back then (“No, it’s cool. The database does all the work and it’s guaranteed to be accurate” is something we no doubt said). It worked well, but it wasn’t very tunable, and we quickly discovered we were bogging down the majority of Postgres queries with a small minority (~2%) of search traffic:
“We fixed a bug in the search results ordering.” —Steve
“We updated the search system this morning to help alleviate some load problems.” —Steve
“Jeremy is working on search! It’s not a complicated fix (basically, the sorting is whacky).” —Steve
2007 – Chris Slowe (u/KeyserSosa), founding engineer (and now CTO), re-implements with PyLucene. This was actually implemented just over 10 years ago in July 2007. It consisted of a single Python process which was set up as a threaded RPC server over TCP. In the initial version, we had actually supported searching for both post titles and comments, and the Lucene index files were comfortably stored on a single box. This was also before we moved to AWS, and at the time we had seriously considered getting a Google Search Appliance, which would have made a nice addition to our single rack. This version was flexible, but we didn’t set it up in a way to make it easily scalable:
“Search works much better, tagging and user-controlled subreddits are right around the corner” —Steve
“Search is better, but not quite where we’d like it.” —Steve
“Stats and search are temporarily disabled, but will be coming back as soon as we can get them repaired.” —Steve
“We were hoping to include an upgraded search, which, unlike the last version, was actually useful and helped you find what you were looking for. Unfortunately, the version we settled on didn’t quite load test as nicely” —Steve
“I made a quick fix to search that I hope helps until we get a chance to really fix it.” —Steve
2008 – David King (u/ketralnis), third employee and now search engineer, implements Solr. In fact, he implemented a home-built pysolr, which was capable of shipping update documents to Solr in XML and wrapping the response in such a way as to emulate our existing Query models enough to drop it into any sort or listing. It was actually pretty sweet. The initial version didn’t support comments, but that did come later.
“[David]’s been fixing search and hacking mystery projects in Erlang.” —Alexis Ohanian
“I’ve totally replaced the reddit search function.” —David King
2010 – David replaces Solr with IndexTank, a third-party search provider. When you love something, outsource it… said no one ever. As the site continued to grow and we first cracked a billion pageviews in a month with an engineering team of four, we put all of our effort into 503 mitigation, continuing to add Postgres read slaves, adding more cache, starting to take advantage of a very early version of Cassandra (which was followed shortly thereafter by a memorable 24-hour, thundering-herd-related outage), and generally ignoring how bad search was getting. We had an intrepid startup approach us and offer to take search off of our hands forever for less than we were paying to keep Solr running, so we signed on!
“We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt before.” —David King
2012 – Keith Mitchell (u/kemitche) implements CloudSearch after LinkedIn shut down IndexTank. Clearly, it was one of the shorter forevers, but IndexTank served us well until the company was acquired. When we found out they were shutting down, we had to wean ourselves off of IndexTank and make a quick transition to AWS CloudSearch. Continuing our long-standing tradition of ‘Let the new guy take care of it,’ that task fell to Keith, and over the next several years we scaled and stretched CloudSearch to bursting:
“Today we moved from the old Amazon CloudSearch domain to a new Amazon CloudSearch domain. The old search domain had significant performance issues: roughly 33% of queries took over 5 seconds to complete and would result in the search error page.” —u/bsimpson
TODAY – Lucidworks Fusion! This time around, we wanted to ensure that search would meet three criteria: it needed to be fast, it needed to scale well with Reddit’s growth, and most importantly, it needed to be relevant. Ultimately, this led us to partner with the search experts at Lucidworks, leveraging Fusion and their unique search expertise from a team comprised of multiple Solr committers. Below, we’ll explain how we went about this in more detail.
Earlier this year, search on Reddit had become truly abysmal. Simple queries could be expected to succeed only half of the time. Want to search with two keywords? Get out of here!
After looking at several options, we partnered with Lucidworks to revitalize Reddit’s search system. Lucidworks is the creator of Fusion, a Solr-based search stack that supports huge document scale and high query throughput.
First Things First: Ingesting at Reddit Scale
The biggest challenge in moving to a new search system was that our indexing pipeline needed to be updated. The first attempt was a bit of a beast. In the interest of speed, we hastily put it together on our legacy ETL system comprised of Jenkins and Azkaban orchestrating numerous Hive queries. As you can see in the diagram below, pulling together data from several sources into one cohesive canonical view to be indexed proved to be more complex than originally expected.
Our second attempt was both simpler and produced significantly better results. We managed to trim the entire pipeline to just four simpler and more accurate Hive queries, which led to a 33% increase in posts indexed. Another great improvement is that we not only index new post creations but also update their relevance signals in real time as votes, comments, and other signals flow in throughout the day.
Make it Relevant
Search results don’t mean much if they’re not relevant. For our initial rollout the primary goal was to avoid degrading the overall relevance of results returned.
To monitor this, we measured clicks on the search results page and compared the rank of results being clicked across old and new search systems. A perfect search engine would yield 100% of clicks on the top result being returned, which is another way of saying you want the most relevant result at the top. Since we know a perfect search engine isn’t an achievable goal, we use measures like Mean Reciprocal Rank and Discounted Cumulative Gain to compare the quality of our results.
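For readers unfamiliar with these two measures, here is a minimal sketch of how they are commonly computed. It is not Reddit's actual evaluation code: the function names, the use of the first clicked rank as the relevance signal, and the sample numbers are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(clicked_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries.

    Each entry is the 1-based rank of the first clicked result for one
    query, or None when nothing was clicked (contributes 0).
    """
    if not clicked_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in clicked_ranks if rank)
    return total / len(clicked_ranks)


def dcg(relevances: Sequence[float]) -> float:
    """Discounted Cumulative Gain for one ranked result list.

    relevances[i] is the relevance of the result at position i (0-based);
    gains lower in the list are discounted by log2(position + 1).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical click data: four queries, one with no click at all.
print(mean_reciprocal_rank([1, 2, None, 4]))
# Hypothetical relevance labels for the top three results of one query.
print(dcg([1, 0, 1]), ndcg([1, 0, 1]))
```

A perfect engine in the sense described above would score an MRR of 1.0, since every click would land on rank 1. Comparing the same metric across the old and new stacks gives a single number per stack rather than a full click distribution.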
While it’s still early in our experiments, the data so far points towards very comparable relevancy measurements between our old vs. new stacks, with Fusion having a slight edge. The promising part of this is that we haven’t done much relevancy tuning yet — something that our new system actually supports. Advancements like personalization, machine learning models, and query intent and rewriting are now low-hanging fruit.
As we overcame the data ingestion challenges and monitored relevance, we continued to ramp up usage to more and more redditors. The feedback from this early group was invaluable, and we owe the community a huge thank-you for helping us surface bugs and less common use cases. We started out with just 1% of users on the new stack, working through issues reported and improving the ingestion pipeline as we increased rollout percentages to 5, 10, 25 and ultimately 50% of traffic prior to GA. Throughout this time, we sent all search queries as dark traffic to our new search cluster to ensure it would be ready for full scale as we increased rollout percentages.
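A staged rollout with dark traffic can be sketched roughly as follows. This is an illustration under stated assumptions, not the mechanism Reddit used: the hash-based bucketing, the `salt` value, and the names `rollout_bucket`, `use_new_search`, and `handle_query` are all hypothetical.

```python
import hashlib
from typing import Any, Callable


def rollout_bucket(user_id: str, salt: str = "new-search") -> int:
    """Map a user to a stable bucket in [0, 100) by hashing their ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_search(user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the current rollout.

    Because the bucket is stable, a user included at 5% stays included
    at 10%, 25%, and so on.
    """
    return rollout_bucket(user_id) < rollout_percent


def handle_query(
    query: str,
    user_id: str,
    rollout_percent: int,
    old_search: Callable[[str], Any],
    new_search: Callable[[str], Any],
) -> Any:
    """Serve from the new stack for rolled-out users.

    Everyone else is served by the old stack, but the query is still
    sent to the new stack as dark traffic and its result is discarded.
    """
    if use_new_search(user_id, rollout_percent):
        return new_search(query)
    try:
        new_search(query)  # dark traffic: exercise the cluster, ignore output
    except Exception:
        pass  # a failure on the dark path must never affect the user
    return old_search(query)
```

In a production system the dark request would normally be issued asynchronously so it cannot add latency to the user-facing response; it is synchronous here only to keep the sketch short.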
We’re proud to say that Reddit Search is better than ever! A full reindex of all Reddit content now completes in about 5 hours (down from around 11 hours), and we’re constantly streaming live updates to the index. The error rate is down by two orders of magnitude with 99% of search results served in under 500ms. The number of machines needed to run search dropped from ~200 earlier this year down to ~30 so we even managed to get some cost savings.
Faster, more reliable, more relevant, and lower cost! Certainly this shall be the last time we ever need to change our search stack!
In all seriousness, we think you’ll love this update. It’s our hope that the new search stack will be a foundation for improvements that make it easier to discover all the great content on Reddit. More importantly: we’re not done. Fixing search is just the first step in a series of new capabilities that will make Reddit feel more personalized and relevant to your interests. Reddit finally has a Search & Relevance team, and we are hiring like crazy. If you’re excited about working with one of the world’s most interesting datasets on a search and relevance platform used by hundreds of millions of people, then check out our job listings: