Real-Time Reddit Data Orchestration and Trend Analysis Engine

This project demonstrates a real-time data pipeline that extracts data from Reddit, processes it, and performs trend analysis using AWS services. The system combines Lambda functions, AWS Glue for ETL jobs, Amazon S3 for storage, Kafka for data streaming, API Gateway for user interaction, and EC2 for compute.

Architecture Overview

AWS Architecture Diagram

Components:

  • EC2 Instance with Flask App:

    • An Amazon EC2 instance hosts a Flask web application.
    • Users interact with the application through its web interface.
    • Open ports: 22 (SSH), 5000 (Flask), 80 (HTTP)
  • API Gateway:

    • Exposes endpoints for user interaction.
    • Connected to the EC2 instance for processing user requests.
  • Lambda Functions:

    • RedditToKafkaLambda1:
      • Fetches live data from the Reddit API.
      • Publishes it to the Kafka topic for downstream consumers (a producer sketch appears after this list).
    • RedditKafkaConsumer1:
      • Consumes data from the Kafka topic.
      • Stores the records in the "reddit-data" S3 bucket (a consumer sketch appears after this list).
    • RedditTrendAnalysis1:
      • Performs trend analysis on the processed data.
      • Stores results in the "reddit-trend-out" S3 bucket (an analysis sketch appears after this list).
  • Kafka (CloudKarafka):

    • Used as the message broker for data streaming.
    • Kafka topic: "REDDIT_ETL_TOPIC".
  • Amazon S3:

    • Buckets:
      • "reddit-data": Stores raw CSV data consumed from Kafka.
      • "reddit-data-parquet": Stores the converted Parquet files.
      • "reddit-trend-out": Stores the trend-analysis results.
  • AWS Glue:

    • Crawler:
      • Gathers metadata from the "reddit-data" S3 bucket.
      • Stores it in the "reddit-data" Glue Data Catalog database.
    • ETL Job (RedditCsvMappingParquetJob):
      • Converts the raw CSV data to Parquet format.
      • Applies field mapping and type casting (a Glue sketch appears after this list).
      • Stores the output in the "reddit-data-parquet" S3 bucket.
    • Workflow (RedditWorkflow):
      • Orchestrates the Glue Crawler and ETL Job for automation.

Getting Started

Prerequisites

  • An AWS account with the necessary permissions.
  • Python and pip installed locally for development.
  • Reddit API credentials for data extraction.
  • A Kafka cluster provisioned on CloudKarafka.

Installation

  1. Clone the repository:

    git clone https://github.com/Kaushikdhola/RealTime-Reddit-Data-Orchestration.git
    
  2. Install required Python packages:

    pip install -r requirements.txt
    

Usage

  • Deploy the Flask application on an EC2 instance.
  • Configure the API Gateway endpoints.
  • Set up the necessary AWS services (Lambda functions, S3 buckets, Glue jobs, etc.).
  • Run the Lambda functions to start the data pipeline.
  • Interact with the Flask web app to trigger the data processing (a minimal sketch follows).
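
As a rough illustration of that last step, the Flask app on EC2 might expose a route that invokes the producer Lambda through boto3; the route name, region, and asynchronous invocation style are assumptions:

    import boto3
    from flask import Flask, jsonify

    app = Flask(__name__)
    lambda_client = boto3.client("lambda", region_name="us-east-1")  # assumed region

    @app.route("/ingest", methods=["POST"])
    def trigger_pipeline():
        # Fire the producer Lambda asynchronously; the Kafka consumer,
        # Glue workflow, and trend analysis take it from there.
        lambda_client.invoke(
            FunctionName="RedditToKafkaLambda1",
            InvocationType="Event",  # async: do not block the web request
        )
        return jsonify({"status": "pipeline triggered"}), 202

    if __name__ == "__main__":
        # Port 5000 matches the ports listed in the architecture overview.
        app.run(host="0.0.0.0", port=5000)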

AWS Services Used

  • Lambda
  • EC2
  • EventBridge
  • Glue (Crawler, Workflow, ETL Job)
  • Amazon S3
  • API Gateway

Project Details

For a detailed overview and a step-by-step guide to the data streaming and ETL orchestration, please refer to my Medium article:

Unleashing Real-time Insights: Building a Dynamic Reddit Data Pipeline with Kafka on AWS

Credits

Author: Kaushik Chanabhai Dhola

License

This project is licensed under the MIT License - see the LICENSE file for details.
