Skip to content

I found the musiXmatch data intriguing and wanted to learn a bit of PySpark so I did some language clustering

License

Notifications You must be signed in to change notification settings

bjornper/Examining-musiXmatch-languages-clustering-with-PySpark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

I found the musiXmatch data (the lyrics part of "millionsongdataset" at http://millionsongdataset.com/) intriguing and wanted to learn a bit of PySpark so I did some language clustering

I noticed just by looking at the first few hundred words in its dictionary that there are non-English words - SOMETIMES it helps to be a native speaker of something else than English. Then I also wanted to see if the underlying structure of this dataset made up of songs and the collection of words in each song would be especially conductive to clustering. I thought to myself that the words that make up the lyrics of a song will come predominantly from one language giving me an opportunity to determine if the clustering falls mostly along the lines of language boundaries and also to try out an idea I had for semi-supervised K-means.

About

I found the musiXmatch data intriguing and wanted to learn a bit of PySpark so I did some language clustering

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published