Sweet Reddit

======================================

Personalized real-time reddit topic suggestion

Sweet Reddit is a tool that suggests personalized, real-time reddit posts to users. Suggestions are based on each user's commenting history. It is built with the following technologies:

Amazon S3
Apache Kafka 0.8.2.1
Apache Spark
Apache Spark Streaming
Apache Cassandra
Redis
Flask with D3, Bootstrap and Ajax

What Sweet Reddit Provides:

Sweet Reddit lets users find real-time reddit posts that best fit their personal preferences. Based on a user's historical comments, a list of suggestions is provided to that user. A user relationship graph centered on the user's ID is also displayed.

Sweet Reddit Approach

Sweet Reddit uses Dec. 2015 comments in JSON format. The user graph is processed in the batch layer. The real-time data stream is synthetically generated, and it affects a user's final suggestions.

Data Synthesis

The real-time data stream is synthesized from the Dec. 2015 user pool (2.8 million distinct users), with random posts from the same pool. The post-comment ratio is 30, which is 3 times the average reddit post-comment ratio. A higher post-comment ratio leads to a more complex graph structure, which is used to stress the performance of the real-time layer.
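A minimal sketch of how such a stream could be synthesized, assuming a flat list of users and a flat list of post URLs. The function name `synthesize_stream`, its parameters, and the `example.com` URLs are hypothetical and not taken from the project's scripts; the sketch only illustrates how limiting the number of active posts yields roughly 30 comments per post.

```python
import random

def synthesize_stream(users, posts, n_messages, comments_per_post=30, seed=0):
    """Yield (user, post) comment events with about `comments_per_post` comments per post."""
    rng = random.Random(seed)
    # Limit the number of active posts so that n_messages / n_active is close to the target ratio.
    n_active = max(1, n_messages // comments_per_post)
    active_posts = rng.sample(posts, min(n_active, len(posts)))
    for _ in range(n_messages):
        # Each event pairs a random user from the pool with a random active post.
        yield rng.choice(users), rng.choice(active_posts)

# Hypothetical pools, far smaller than the 2.8 million users mentioned above.
users = [f"user_{i}" for i in range(1000)]
posts = [f"https://example.com/post/{i}" for i in range(500)]

events = list(synthesize_stream(users, posts, n_messages=3000))
average = len(events) / len({post for _, post in events})
print(f"average comments per post: {average:.1f}")
```

Raising `comments_per_post` concentrates more commenters on each post, which increases the number of user pairs that share a post and therefore the number of graph edges.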

JSON message fields:

url: URL of the original post
title: title of the original post
author: user ID
created_utc: timestamp of the comment in epoch second
subreddit: the sub-reddit the comment belongs to
ups: promotions
name: name of the comment
id: id of the comment
subreddit_id: sub-reddit ID
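As an illustration of the fields above, the following sketch builds one message and serializes it to JSON. The helper `make_message` and every sample value are made up for this example; the field name `author` is assumed from the list above.

```python
import json

# Field names as listed above.
FIELDS = ("url", "title", "author", "created_utc", "subreddit",
          "ups", "name", "id", "subreddit_id")

def make_message(**values):
    """Return a JSON string holding exactly the listed fields."""
    missing = [name for name in FIELDS if name not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps({name: values[name] for name in FIELDS})

# All values below are illustrative placeholders, not real data.
sample = make_message(
    url="https://example.com/r/python/post/1",
    title="An example post",
    author="user_42",
    created_utc=1449000000,  # epoch seconds
    subreddit="python",
    ups=3,
    name="t1_example",
    id="example",
    subreddit_id="t5_example",
)
print(sample)
```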

~ 30 GB of historical data
~ 2.8 million distinct users in Dec. 2015
~ 1300 messages streamed per second

Data Ingestion

JSON messages were produced and consumed by Python scripts using the kafka-python package from https://github.com/mumrah/kafka-python.git. Messages were published to a single topic, with Spark Streaming as the consumer. The batch job operates on a one-month time window.
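A rough sketch of the producer side, assuming the `KafkaProducer` class found in newer kafka-python releases; the project's own scripts, written against Kafka 0.8.2.1, may use a different producer class. The topic name `reddit_comments`, the broker address, and the helper names are assumptions made for this example.

```python
import json

def encode(message):
    """Serialize one message dict to UTF-8 JSON bytes."""
    return json.dumps(message).encode("utf-8")

def publish(producer, topic, messages):
    """Send every message to a single topic and return how many were sent."""
    count = 0
    for message in messages:
        producer.send(topic, encode(message))
        count += 1
    # Block until buffered messages have been handed to the broker.
    producer.flush()
    return count

def make_producer(bootstrap_servers="localhost:9092"):
    """Create a kafka-python producer (address is an assumption)."""
    # Imported lazily so the helpers above can be used without kafka-python installed.
    from kafka import KafkaProducer
    return KafkaProducer(bootstrap_servers=bootstrap_servers)
```

With a broker running, `publish(make_producer(), "reddit_comments", messages)` would push an iterable of message dicts to the single topic.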

Batch Processing

The batch job has four tasks:

Construct the user graph table and store it in Cassandra and Redis
Build the "find users by a post" table and save it in Cassandra and Redis
Build the "find posts by a user" table and save it in Cassandra and Redis
Build the "find title and body by url" table and save it in Cassandra only

Batch views were written directly into Cassandra with the spark-cassandra connector, and written into Redis as a caching layer.
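To make the shape of the first three batch views concrete, here is a small in-memory sketch in plain Python rather than Spark. It assumes the edge score between two users is the number of posts both commented on, and the function name `build_batch_views` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_batch_views(comments):
    """Build the lookup tables and the user graph from (user, url) pairs."""
    posts_by_user = defaultdict(set)   # "find posts by a user"
    users_by_post = defaultdict(set)   # "find users by a post"
    for user, url in comments:
        posts_by_user[user].add(url)
        users_by_post[url].add(user)

    # User graph: graph[a][b] is the number of posts that a and b both commented on.
    graph = defaultdict(lambda: defaultdict(int))
    for commenters in users_by_post.values():
        for a, b in combinations(sorted(commenters), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return posts_by_user, users_by_post, graph

# Tiny made-up input: users a and b share two posts, c shares one post with each.
comments = [("a", "p1"), ("b", "p1"), ("a", "p2"), ("b", "p2"), ("c", "p2")]
posts_by_user, users_by_post, graph = build_batch_views(comments)
print(dict(graph["a"]))
```

The pairwise loop grows quadratically with the number of commenters per post, which is one reason a distributed batch job suits this step.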

sbt library dependencies:

"com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-alpha1"
"org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

A full batch process is available for when the entire batch view needs to be rebuilt. Typically, the incremental batch process runs daily from the cached folder on Amazon S3, which is the source of truth.

Real-time Processing

Four stream processing steps are performed for the real-time views:

1. Find the list of posts that a new user commented on, in the batch layer and the real-time layer
2. For all the posts found in step 1, find all users that commented on them, in the batch layer and the real-time layer
3. Add new edges between the new user and the users found in step 2
4. Aggregate the edges into the graph table in Redis as the real-time layer user graph

Messages were streamed into Spark Streaming with the spark-kafka connector. Real-time views were written into the Redis caching layer for fast edge updates.
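The four steps can be sketched in plain Python, with dictionaries standing in for Redis and Cassandra. This is an illustration under stated assumptions, not the project's Spark Streaming code: the `Layer` container and `update_realtime_graph` are hypothetical names, and the sketch assumes the real-time graph stores only the edge weight that the batch graph does not already hold, so that the two layers can be summed later.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class Layer:
    """In-memory stand-in for one layer's tables."""
    posts_by_user: dict = field(default_factory=lambda: defaultdict(set))
    users_by_post: dict = field(default_factory=lambda: defaultdict(set))
    graph: dict = field(default_factory=lambda: defaultdict(dict))

def update_realtime_graph(user, url, batch, realtime):
    """Apply one incoming comment (user, url) to the real-time views."""
    realtime.posts_by_user[user].add(url)
    realtime.users_by_post[url].add(user)

    # Step 1: posts the user commented on, in both layers.
    posts = batch.posts_by_user.get(user, set()) | realtime.posts_by_user[user]

    # Steps 2 and 3: count common posts with every other commenter on those posts.
    common = Counter()
    for post in posts:
        commenters = (batch.users_by_post.get(post, set())
                      | realtime.users_by_post.get(post, set()))
        for other in commenters - {user}:
            common[other] += 1

    # Step 4: store only the weight not already present in the batch graph.
    for other, total in common.items():
        extra = total - batch.graph.get(user, {}).get(other, 0)
        if extra > 0:
            realtime.graph[user][other] = extra
            realtime.graph[other][user] = extra
```

Keeping the real-time graph as a delta over the batch graph is a design choice of this sketch; it lets a reader combine the layers by simple addition without counting a shared post twice.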

sbt library dependencies:

"com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-alpha1"
"org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
"org.apache.spark" % "spark-streaming_2.10" % "1.2.0" % "provided"
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.2.0"

Front end query

Steps performed when the front end queries suggestions for a user ID:

1. Find the top 6 closest fellow users (highest edge scores) from the user graph, aggregating both the batch layer and the real-time layer.
2. For each fellow user, find all posts that user commented on during the past month.
3. Send back the posts that the querying user ID has not commented on yet.
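The query steps above can be sketched as follows, again with dictionaries in place of the Redis and Cassandra lookups. The function `suggest_posts` and its arguments are hypothetical; the sketch assumes edge scores from the two layers are summed and that the tables already contain only the past month of comments.

```python
from collections import Counter

def suggest_posts(user, batch_graph, rt_graph,
                  batch_posts_by_user, rt_posts_by_user, top_n=6):
    """Return posts from the closest fellow users that `user` has not commented on."""
    # Aggregate edge scores from the batch layer and the real-time layer.
    scores = Counter(batch_graph.get(user, {}))
    scores.update(rt_graph.get(user, {}))
    fellows = [fellow for fellow, _ in scores.most_common(top_n)]

    def posts_of(name):
        return batch_posts_by_user.get(name, set()) | rt_posts_by_user.get(name, set())

    seen = posts_of(user)
    suggestions = []
    for fellow in fellows:
        # Sorted only to make the output order deterministic.
        for post in sorted(posts_of(fellow) - seen):
            if post not in suggestions:
                suggestions.append(post)
    return suggestions

# Made-up data: after aggregation the scores are c=5, a=3, b=2.
batch_graph = {"me": {"a": 3, "b": 1}}
rt_graph = {"me": {"b": 1, "c": 5}}
batch_posts = {"me": {"p1"}, "a": {"p1", "p2"}, "b": {"p4"}}
rt_posts = {"c": {"p3"}}
print(suggest_posts("me", batch_graph, rt_graph, batch_posts, rt_posts, top_n=2))
```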

Database Schema

Tables:

find_post_by_user_realtime: table populated by Spark Streaming (real-time) representing real time posts of a user
find_post_by_user_batch: table populated by Spark batch job representing batch layer posts of a user
find_user_by_post_realtime: table populated by Spark Streaming (real-time) representing real time users of a post
find_user_by_post_batch: table populated by Spark batch job representing batch layer users of a post
user_graph_table_realtime: table populated by Spark Streaming (real-time) representing real time user graph
user_graph_table_batch: table populated by Spark batch job representing batch layer user graph

-- use user id to find posts, clustered by creation time
CREATE TABLE user_post_table(user text, created_utc bigint, url text, subreddit text, title text, year_month text, body text, PRIMARY KEY ((user), created_utc));
-- use post url to find user
CREATE TABLE post_user_table(url text, user text, created_utc bigint, subreddit text, title text, year_month text, body text, PRIMARY KEY ((url), user));
-- use user1 to find the other users who have the most posts in common, batch layer
CREATE TABLE user_graph(user1 text, nCommonPosts int, user2 text, PRIMARY KEY ((user1), nCommonPosts)) WITH CLUSTERING ORDER BY (nCommonPosts DESC);

Startup Protocol

Kafka server
Spark Streaming
Spark Streaming Kafka consumer
Kafka message producer
