
Data Engineering ND #2: Data Modeling With Apache Cassandra

Overview

Data Modeling With Apache Cassandra is the second project of the Udacity Data Engineering Nanodegree. It requires creating an Apache Cassandra database for a music streaming app and writing an ETL pipeline that transfers data from a directory of CSV files into this database using Python and CQL.

The project has the following goals:

  • Model a NoSQL database with Apache Cassandra
  • Perform insert and select queries using CQL
  • Create ETL pipeline with Python

Dependencies

Repo Contents

The repo contains the event_data directory with CSV files on user activity in the app, the images directory with the image_event_datafile_new.jpg reference image showing the columns used for the Apache Cassandra tables, and the following files:

  • cql_queries.py: contains all CQL queries.
  • create_tables.py: drops existing tables and creates new ones.
  • data_modeling_with_cassandra.ipynb: reads and processes CSV files from event_data, loads the data into tables, and runs SELECT queries.
  • etl.py: creates event_datafile_full.csv file from CSV files in event_data, loads the data from this file into tables, and runs SELECT queries.

Apache Cassandra Database

Database Purpose

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.

The music_app_data database is designed to answer the following questions:

  1. Find the artist, song title, and song length in the music app history that were heard during sessionId = 338 and itemInSession = 4.
  2. Find only the following: name of artist, song (sorted by itemInSession), and user (first and last name) for userId = 10 and sessionId = 182.
  3. Find every user name (first and last) in the music app history who listened to the song 'All Hands Against His Own'.
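
As a rough sketch, these questions map onto partition-key lookups against the tables described in the Database Design section below. The exact statements live in cql_queries.py and may differ in detail:

# Hypothetical CQL lookups for the three questions above (the real queries are in cql_queries.py).
query_1 = """SELECT artist, song, length FROM song_in_session
             WHERE sessionId = 338 AND itemInSession = 4"""

query_2 = """SELECT artist, song, firstName, lastName FROM user_songs
             WHERE userId = 10 AND sessionId = 182"""

query_3 = """SELECT firstName, lastName FROM song_listeners
             WHERE song = 'All Hands Against His Own'"""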

Data Description

The event_data directory contains CSV files partitioned by date. Here are example filepaths for two files in the dataset:

event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv

Each CSV file in the event_data directory contains the following columns:

  • artist
  • auth
  • firstName
  • gender
  • itemInSession
  • lastName
  • length
  • level
  • location
  • method
  • page
  • registration
  • sessionId
  • song
  • status
  • ts
  • userId

The event_datafile_new.csv file that will be used to populate the Apache Cassandra tables has the following columns:

  • artist
  • firstName
  • gender
  • itemInSession
  • lastName
  • length
  • level
  • location
  • sessionId
  • song
  • userId

Below is an example of what the data in event_datafile_new.csv looks like (see images/image_event_datafile_new.jpg).
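
The denormalization step performed by etl.py and the notebook roughly amounts to the following. This is a minimal sketch using the standard library csv module; the actual implementation in the repo may differ:

import csv
import glob

# Collect the per-day event files (how etl.py walks the directory is an assumption).
file_paths = sorted(glob.glob('event_data/*-events.csv'))

# Columns kept in the denormalized file, as listed above.
columns = ['artist', 'firstName', 'gender', 'itemInSession', 'lastName',
           'length', 'level', 'location', 'sessionId', 'song', 'userId']

with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as out_file:
    writer = csv.DictWriter(out_file, fieldnames=columns, extrasaction='ignore')
    writer.writeheader()
    for path in file_paths:
        with open(path, newline='', encoding='utf8') as in_file:
            for row in csv.DictReader(in_file):
                # Rows without an artist are not song plays, so they are skipped.
                if row.get('artist'):
                    writer.writerow(row)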

Database Design

The music_app_data database is designed so that each table answers one question. It contains the following tables.

song_in_session

The song_in_session table has the following fields:

  • sessionId: int
  • itemInSession: int
  • artist: text
  • song: text
  • length: float
  • PRIMARY KEY (sessionId, itemInSession)
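
A minimal sketch of the corresponding CREATE TABLE statement, written as it might appear in cql_queries.py (that file holds the authoritative version):

song_in_session_create = """
    CREATE TABLE IF NOT EXISTS song_in_session (
        sessionId int,
        itemInSession int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (sessionId, itemInSession)
    )
"""

With sessionId as the partition key and itemInSession as the clustering column, query 1 reads a single partition without any filtering.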

user_songs

The user_songs table has the following fields:

  • userId: int
  • sessionId: int
  • itemInSession: int
  • song: text
  • artist: text
  • firstName: text
  • lastName: text
  • PRIMARY KEY (userId, sessionId, itemInSession)
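
A sketch of the matching CREATE TABLE statement (again, cql_queries.py holds the actual version):

user_songs_create = """
    CREATE TABLE IF NOT EXISTS user_songs (
        userId int,
        sessionId int,
        itemInSession int,
        song text,
        artist text,
        firstName text,
        lastName text,
        PRIMARY KEY (userId, sessionId, itemInSession)
    )
"""

Here userId is the partition key while sessionId and itemInSession are clustering columns, so the rows for a user's session come back ordered by itemInSession, which is exactly what query 2 needs.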

song_listeners

The song_listeners table has the following fields:

  • song: text
  • userId: int
  • firstName: text
  • lastName: text
  • PRIMARY KEY (song, userId)
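
And a sketch of its CREATE TABLE statement:

song_listeners_create = """
    CREATE TABLE IF NOT EXISTS song_listeners (
        song text,
        userId int,
        firstName text,
        lastName text,
        PRIMARY KEY (song, userId)
    )
"""

Including userId as a clustering column keeps one row per listener of a song instead of overwriting a single row per song partition.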

Getting Started

To run the ETL pipeline locally:

  1. Clone this repo.
  2. cd into the project directory.
  3. Run create_tables.py to reset the tables:
root@8fa691392031:/home/workspace# python create_tables.py
Tables are dropped.
Tables are created.

Remember to run create_tables.py every time before running etl.py to reset the tables.

  4. Run etl.py. Once the tables are populated, query results are displayed.
root@8fa691392031:/home/workspace# python etl.py
Total number of rows: 8056
Inserting data in song_in_session table...
Completed: [==============================] 100%

Inserting data in user_songs table...
Completed: [==============================] 100%

Inserting data in song_listeners table...
Completed: [==============================] 100%

Query 1:
Find the artist, song title and song's length in the music app history
that was heard during sessionId = 338, and itemInSession = 4.
Faithless Music Matters (Mark Knight Dub) 495.30731201171875


Query 2:
Find only the following: name of artist, song (sorted by itemInSession) and
user (first and last name) for userid = 10, sessionid = 182.
0 Down To The Bone Sylvie Cruz
1 Three Drives Sylvie Cruz
2 Sebastien Tellier Sylvie Cruz
3 Lonnie Gordon Sylvie Cruz


Query 3:
Find every user name (first and last) in my music app history who listened
to the song 'All Hands Against His Own'.
Jacqueline Lynch
Tegan Levine
Sara Johnson

Authors

Alexandra Baturina

Acknowledgments

The implementation of the Python progress bar is based on this solution.
