
Data Engineering ND #2: Data Modeling With Apache Cassandra

Overview

Data Modeling With Apache Cassandra is the second project of the Udacity Data Engineering Nanodegree. It requires creating an Apache Cassandra database for a music streaming app and writing an ETL pipeline that transfers data from a directory of CSV files into this database using Python and CQL.

The project has the following goals:

  • Model a NoSQL database with Apache Cassandra
  • Perform insert and select queries using CQL
  • Create ETL pipeline with Python

Dependencies

Repo Contents

The repo contains the event_data directory with CSV files on user activity in the app, the images directory with the image_event_datafile_new.jpg reference image showing the columns used for the Apache Cassandra tables, and the following files:

  • cql_queries.py: contains all CQL queries.
  • create_tables.py: drops existing tables and creates new ones.
  • data_modeling_with_cassandra.ipynb: reads and processes CSV files from event_data, loads the data into tables, and runs SELECT queries.
  • etl.py: creates event_datafile_full.csv file from CSV files in event_data, loads the data from this file into tables, and runs SELECT queries.

Apache Cassandra Database

Database Purpose

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.

The music_app_data database is designed to answer the following questions:

  1. Find the artist, song title, and song length in the music app history that were heard during sessionId = 338 and itemInSession = 4.
  2. Find only the following: name of artist, song (sorted by itemInSession), and user (first and last name) for userId = 10 and sessionId = 182.
  3. Find every user name (first and last) in the music app history who listened to the song 'All Hands Against His Own'.
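
As a rough sketch, these questions map onto partition-key lookups against the tables described in the Database Design section below. The exact statements live in cql_queries.py and may differ in detail:

# Hypothetical CQL lookups for the three questions above (the real queries are in cql_queries.py).
query_1 = """SELECT artist, song, length FROM song_in_session
             WHERE sessionId = 338 AND itemInSession = 4"""

query_2 = """SELECT artist, song, firstName, lastName FROM user_songs
             WHERE userId = 10 AND sessionId = 182"""

query_3 = """SELECT firstName, lastName FROM song_listeners
             WHERE song = 'All Hands Against His Own'"""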

Data Description

The event_data directory contains CSV files partitioned by date. Here are example filepaths for two files in the dataset:

event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv

Each CSV file in the event_data directory contains the following columns:

  • artist
  • auth
  • firstName
  • gender
  • itemInSession
  • lastName
  • length
  • level
  • location
  • method
  • page
  • registration
  • sessionId
  • song
  • status
  • ts
  • userId

The event_datafile_new.csv file that will be used to populate the Apache Cassandra tables has the following columns:

  • artist
  • firstName
  • gender
  • itemInSession
  • lastName
  • length
  • level
  • location
  • sessionId
  • song
  • userId

Below is an example of what the data in event_datafile_new.csv looks like (see images/image_event_datafile_new.jpg).
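
The denormalization step performed by etl.py and the notebook roughly amounts to the following. This is a minimal sketch using the standard library csv module; the actual implementation in the repo may differ:

import csv
import glob

# Collect the per-day event files (how etl.py walks the directory is an assumption).
file_paths = sorted(glob.glob('event_data/*-events.csv'))

# Columns kept in the denormalized file, as listed above.
columns = ['artist', 'firstName', 'gender', 'itemInSession', 'lastName',
           'length', 'level', 'location', 'sessionId', 'song', 'userId']

with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as out_file:
    writer = csv.DictWriter(out_file, fieldnames=columns, extrasaction='ignore')
    writer.writeheader()
    for path in file_paths:
        with open(path, newline='', encoding='utf8') as in_file:
            for row in csv.DictReader(in_file):
                # Rows without an artist are not song plays, so they are skipped.
                if row.get('artist'):
                    writer.writerow(row)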

Database Design

The music_app_data database is designed so that each table answers one question. It contains the following tables.

song_in_session

The song_in_session table has the following fields:

  • sessionId: int
  • itemInSession: int
  • artist: text
  • song: text
  • length: float
  • PRIMARY KEY (sessionId, itemInSession)
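
A minimal sketch of the corresponding CREATE TABLE statement, written as it might appear in cql_queries.py (that file holds the authoritative version):

song_in_session_create = """
    CREATE TABLE IF NOT EXISTS song_in_session (
        sessionId int,
        itemInSession int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (sessionId, itemInSession)
    )
"""

With sessionId as the partition key and itemInSession as the clustering column, query 1 reads a single partition without any filtering.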

user_songs

The user_songs table has the following fields:

  • userId: int
  • sessionId: int
  • itemInSession: int
  • song: text
  • artist: text
  • firstName: text
  • lastName: text
  • PRIMARY KEY (userId, sessionId, itemInSession)
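
A sketch of the matching CREATE TABLE statement (again, cql_queries.py holds the actual version):

user_songs_create = """
    CREATE TABLE IF NOT EXISTS user_songs (
        userId int,
        sessionId int,
        itemInSession int,
        song text,
        artist text,
        firstName text,
        lastName text,
        PRIMARY KEY (userId, sessionId, itemInSession)
    )
"""

Here userId is the partition key while sessionId and itemInSession are clustering columns, so the rows for a user's session come back ordered by itemInSession, which is exactly what query 2 needs.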

song_listeners

The song_listeners table has the following fields:

  • song: text
  • userId: int
  • firstName: text
  • lastName: text
  • PRIMARY KEY (song, userId)
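
And a sketch of its CREATE TABLE statement:

song_listeners_create = """
    CREATE TABLE IF NOT EXISTS song_listeners (
        song text,
        userId int,
        firstName text,
        lastName text,
        PRIMARY KEY (song, userId)
    )
"""

Including userId as a clustering column keeps one row per listener of a song instead of overwriting a single row per song partition.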

Getting Started

To run the ETL pipeline locally:

  1. Clone this repo.
  2. cd into the project directory.
  3. Run create_tables.py to reset the tables:
root@8fa691392031:/home/workspace# python create_tables.py
Tables are dropped.
Tables are created.

Remember to run create_tables.py every time before running etl.py to reset the tables.

  4. Run etl.py. Once the tables are populated, query results are displayed.
root@8fa691392031:/home/workspace# python etl.py
Total number of rows: 8056
Inserting data in song_in_session table...
Completed: [==============================] 100%

Inserting data in user_songs table...
Completed: [==============================] 100%

Inserting data in song_listeners table...
Completed: [==============================] 100%

Query 1:
Find the artist, song title and song's length in the music app history
that was heard during sessionId = 338, and itemInSession = 4.
Faithless Music Matters (Mark Knight Dub) 495.30731201171875


Query 2:
Find only the following: name of artist, song (sorted by itemInSession) and
user (first and last name) for userid = 10, sessionid = 182.
0 Down To The Bone Sylvie Cruz
1 Three Drives Sylvie Cruz
2 Sebastien Tellier Sylvie Cruz
3 Lonnie Gordon Sylvie Cruz


Query 3:
Find every user name (first and last) in my music app history who listened
to the song 'All Hands Against His Own'.
Jacqueline Lynch
Tegan Levine
Sara Johnson

Authors

Alexandra Baturina

Acknowledgments

The implementation of the Python progress bar is based on this solution.
