Data Modeling With Apache Cassandra is the second project of Udacity Data Engineering Nanodegree. It requires to create an Apache Cassandra database for a music streaming app and write an ETL pipeline that transfers data from directory of CSV files into this database using Python and CQL.
The project has the following goals:
- Model a NoSQL database with Apache Cassandra
- Perform insert and select queries using CQL
- Create ETL pipeline with Python
The repo contains the event_data
directory of CSV files on user activity on the app, the images
directory with image_event_datafile_new.jpg
reference file showing columns that will be used for Apache Cassandra tables, as well as the following files:
cql_queries.py
: contains all CQL queries.create_tables.py
: drops existing tables and creates new ones.data_modeling_with_cassandra.ipynb
: reads and processes CSV files fromevent_data
, loads the data into tables, and runsSELECT
queries.etl.py
: createsevent_datafile_full.csv
file from CSV files inevent_data
, loads the data from this file into tables, and runsSELECT
queries.
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
The music_app_data database is designed to answer the following questions:
- Find the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4.
- Find only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182.
- Find every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'.
The event_data
directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset:
event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
Each CSV file in event_data
directory contains data within the following columns:
- artist
- auth
- firstName
- gender
- itemInSession
- lastName
- length
- level
- location
- method
- page
- registration
- sessionId
- song
- status
- ts
- userId
The event_datafile_new.csv
file that will be used for Apache Cassandra tables has the following columns:
- artist
- firstName
- gender
- itemInSession
- lastName
- length
- level
- location
- sessionId
- song
- userId
Below is an example of how the data in event_datafile_new.csv
looks like.
The music_app_data database is designed so that each table answers one question. It contains the following tables.
The song_in_session
table has the following fields:
- sessionId: int
- itemInSession: int
- artist: text
- song: text
- length: float
- PRIMARY KEY (sessionId, itemInSession)
The user_songs
table has the following fields:
- userId: int
- sessionId: int
- itemInSession: int
- song: text
- artist: text
- firstName: text
- lastName: text
- PRIMARY KEY (userId, sessionId, itemInSession))
The song_listeners
table has the following fields:
- song: text
- userId: int
- firstName: text
- lastName: text
- PRIMARY KEY (song, userId))
To run ETL pipeline locally,
- Clone this repo.
cd
into project directory.- Run
create_tables.py
to reset tables:
root@8fa691392031:/home/workspace# python create_tables.py
Tables are dropped.
Tables are created.
Remember to run
create_tables.py
every time before runningetl.py
to reset tables.
- Run
etl.py
. Once tables are creates, query results are displayed.
root@8fa691392031:/home/workspace# python etl.py
Total number of rows: 8056
Inserting data in song_in_session table...
Completed: [==============================] 100%
Inserting data in user_songs table...
Completed: [==============================] 100%
Inserting data in song_listeners table...
Completed: [==============================] 100%
Query 1:
Find the artist, song title and song's length in the music app history
that was heard during sessionId = 338, and itemInSession = 4.
Faithless Music Matters (Mark Knight Dub) 495.30731201171875
Query 2:
Find only the following: name of artist, song (sorted by itemInSession) and
user (first and last name) for userid = 10, sessionid = 182.
0 Down To The Bone Sylvie Cruz
1 Three Drives Sylvie Cruz
2 Sebastien Tellier Sylvie Cruz
3 Lonnie Gordon Sylvie Cruz
Query 3:
Find every user name (first and last) in my music app history who listened
to the song 'All Hands Against His Own'.
Jacqueline Lynch
Tegan Levine
Sara Johnson
Implementation of python progress bar is based on this solution.