GitHub - Kartik-Banga/Automated-ETL-Pipeline-for-Playstore-Data: Implemented ETL pipeline on AWS for Playstore data using Lambda, Glue Crawlers, and Glue ETL Jobs. Orchestrated workflow with Step Functions and achieved seamless integration, optimal data merging, and enhanced data quality/accessibility.

Automated-ETL-Pipeline-for-Playstore-Data

I recently completed an end-to-end data pipeline project that combines AWS services (Lambda, Glue, Step Functions, and S3) with Power BI for analytics. The project works on Play Store data and is organized into the phases described below.

Data Cleaning with AWS Lambda: An AWS Lambda function runs a Python script (using Pandas and NumPy) to clean the 'playstore_review' dataset. The cleaned dataset is then stored in an S3 bucket, where the later stages of the pipeline read it.
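The cleaning step could look roughly like the sketch below. It is not the repository's Lambda script: the column names (`App`, `Translated_Review`, `Sentiment_Polarity`), the cleaning rules, and the event keys (`bucket`, `raw_key`, `clean_key`) are all assumptions chosen for illustration.

```python
import io

import numpy as np
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the reviews table.

    Column names here are assumptions, not taken from the project's script.
    """
    out = df.copy()
    # Turn empty, whitespace-only, or literal "nan" review text into real NaN.
    out["Translated_Review"] = out["Translated_Review"].replace(
        r"^\s*(nan)?\s*$", np.nan, regex=True
    )
    # Drop rows that lack an app name or review text, then exact duplicates.
    out = out.dropna(subset=["App", "Translated_Review"]).drop_duplicates()
    # Coerce polarity to a number; unparseable values become NaN.
    out["Sentiment_Polarity"] = pd.to_numeric(
        out["Sentiment_Polarity"], errors="coerce"
    )
    return out.reset_index(drop=True)


def lambda_handler(event, context):
    """Read the raw CSV from S3, clean it, and write the result back."""
    import boto3  # imported lazily so clean_reviews can be used without AWS

    s3 = boto3.client("s3")
    # Hypothetical event keys; the real trigger payload is not documented.
    bucket = event["bucket"]
    raw = s3.get_object(Bucket=bucket, Key=event["raw_key"])["Body"].read()
    cleaned = clean_reviews(pd.read_csv(io.BytesIO(raw)))
    s3.put_object(
        Bucket=bucket,
        Key=event["clean_key"],
        Body=cleaned.to_csv(index=False).encode("utf-8"),
    )
    return {"rows": len(cleaned)}
```

Keeping `clean_reviews` separate from the handler means the cleaning rules can be checked locally on a small DataFrame before the function is deployed.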
Metadata Extraction with AWS Glue Crawlers: An AWS Glue Crawler scans the cleaned review dataset in S3 and extracts its metadata (the schema), giving the later Glue job a structured table definition to work from.
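For readers who prefer to set up the crawler from code rather than the console, a minimal sketch with the boto3 Glue client follows. The crawler name, IAM role ARN, database name, and S3 path are placeholders, and the repository does not say how its crawler was actually created.

```python
def create_and_start_crawler(glue, name, role_arn, database, s3_path):
    """Create a Glue crawler over one S3 path and start it.

    `glue` is expected to be a boto3 Glue client. All argument values are
    supplied by the caller; nothing here comes from the project itself.
    """
    # create_crawler raises AlreadyExistsException if the name is taken.
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # start_crawler returns immediately; the crawl itself runs asynchronously.
    glue.start_crawler(Name=name)


# Example call (hypothetical names, requires AWS credentials):
#   import boto3
#   create_and_start_crawler(
#       boto3.client("glue"),
#       name="playstore-reviews-crawler",
#       role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
#       database="playstore_db",
#       s3_path="s3://example-bucket/cleaned/reviews/",
#   )
```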
ETL with PySpark in AWS Glue Job: A PySpark script running in an AWS Glue Job combines SQL queries issued through 'spark.sql' with Spark functions to perform an inner join between the 'playstore_apps' and cleaned 'playstore_review' datasets. The same script also applies SQL-based cleaning, so its output is a single merged and cleaned dataset.
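The inner join through `spark.sql` might be sketched as follows. The join key `App` and the selected review columns are assumptions about the datasets, and the function expects an existing `SparkSession` and two DataFrames to be passed in, as they would be inside a Glue job; it is not the repository's Glue script.

```python
# Inner join: only apps that have at least one cleaned review are kept.
# Column names are assumed for illustration.
JOIN_QUERY = """
SELECT a.*, r.Translated_Review, r.Sentiment, r.Sentiment_Polarity
FROM playstore_apps AS a
INNER JOIN playstore_review AS r
  ON a.App = r.App
"""


def merge_apps_and_reviews(spark, apps_df, reviews_df):
    """Register both DataFrames as temp views and join them with Spark SQL."""
    # Temp views make the DataFrames addressable by name inside spark.sql.
    apps_df.createOrReplaceTempView("playstore_apps")
    reviews_df.createOrReplaceTempView("playstore_review")
    return spark.sql(JOIN_QUERY)
```

Because the join is an inner join, apps without reviews and reviews without a matching app are both dropped from the merged output.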
Automated Workflow with AWS Step Functions: AWS Step Functions orchestrates the pipeline, running the Lambda function, the Glue jobs, and the other steps in order so that the whole workflow executes end to end without manual intervention.
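A state machine for this kind of workflow can be described in Amazon States Language. The sketch below builds one as a Python dictionary and prints it as JSON. The state names and the function, crawler, and job names are hypothetical, and the actual definition in the repository's Step Functions script may differ.

```python
import json


def build_state_machine(lambda_name, crawler_name, glue_job_name):
    """Return a three-step Amazon States Language definition as a dict."""
    return {
        "Comment": "Clean reviews, crawl the result, then merge with apps",
        "StartAt": "CleanReviews",
        "States": {
            "CleanReviews": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": lambda_name},
                "Next": "CrawlCleanedReviews",
            },
            "CrawlCleanedReviews": {
                "Type": "Task",
                # startCrawler only starts the crawl; it does not wait for it.
                # A production workflow would poll the crawler state first.
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "MergeAppsAndReviews",
            },
            "MergeAppsAndReviews": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


definition = build_state_machine(
    "clean-playstore-reviews",    # hypothetical Lambda function name
    "playstore-reviews-crawler",  # hypothetical crawler name
    "merge-playstore-data",       # hypothetical Glue job name
)
print(json.dumps(definition, indent=2))
```

Generating the definition from code keeps resource names in one place, which helps when the same workflow is deployed to more than one environment.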
Storage and Accessibility on S3: The final cleaned and merged dataset is stored in an S3 bucket in CSV format, which gives a scalable and accessible storage solution.
- Amazon Redshift was not used, primarily for cost reasons. Given the scale and scope of the project, Amazon S3 was the more cost-effective storage option and still met the analytical requirements.
Power BI Analysis for Actionable Insights: Power BI was used to analyze the final dataset and to build dashboards and reports that give insight into Play Store apps and support decision-making.

Repository files (3 commits):
- Analytics_2.pbix
- Flow chart of entire process.png
- Glue script for cleaning googleplaystore_apps data and merging apps and reviews.txt
- Lambda script for cleaning googleplaystore_user_reviews.txt
- Playstore Analytics Dashboard (PowerBI).pdf
- README.md
- Step function script for automation of tasks.txt
- StepFunction flowchart.png
- googleplaystore_apps.csv
- googleplaystore_user_reviews.csv
