CID-ITBA/similarity-lab

Similarity Lab

Similarity Lab provides a set of tools and trained embedding matrices for historical language analysis.

With this toolset you can track how the meaning of words changes across several decades.

What's inside?

You'll find two sets of word embedding matrices along with their corresponding vocabularies. These matrices were obtained from two major text corpora, built from thousands of news articles published by The New York Times and The Guardian.
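
As a rough illustration of how an embedding matrix and its vocabulary fit together, the sketch below pairs a word list with a NumPy array so that row i of the matrix is the vector of word i. The toy words, the values and the helper `vector_for` are hypothetical and only serve the example; the file formats and loading functions actually used by this repository are not shown here.

```python
import numpy as np

# Hypothetical toy data: a real vocabulary and matrix would come from the
# files in this repository, whose format is not assumed here.
vocabulary = ["king", "queen", "man", "woman"]
embeddings = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.9, 0.1, 0.8],   # queen
    [0.1, 0.9, 0.1],   # man
    [0.1, 0.1, 0.9],   # woman
])

# Map each word to the index of its row in the matrix.
word_to_row = {word: row for row, word in enumerate(vocabulary)}


def vector_for(word):
    """Return the embedding row for `word` (raises KeyError if unknown)."""
    return embeddings[word_to_row[word]]
```

Keeping the vocabulary and the matrix in the same order is the only convention this sketch relies on: looking up a word is then a dictionary access followed by a row selection.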

Use cases

  • Track changes in word meaning across the years
  • Measure cosine distance between words
  • Perform analogy tests
  • Analyse trends of change
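
The use cases above can be sketched with plain NumPy. This is a minimal illustration, not the SimiLab API: the function names (`cosine_distance`, `analogy`, `drift_from_first`), the toy vectors and the decade labels are all assumptions made for the example. The drift function also assumes that the vectors of a word from different periods live in the same, already aligned vector space.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding matrix (row i belongs to word i).
vocabulary = ["king", "queen", "man", "woman", "apple"]
embeddings = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.9, 0.1, 0.8],   # queen
    [0.1, 0.9, 0.1],   # man
    [0.1, 0.1, 0.9],   # woman
    [0.0, 0.5, 0.5],   # apple
])
index = {word: i for i, word in enumerate(vocabulary)}


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = embeddings[index[b]] - embeddings[index[a]] + embeddings[index[c]]
    # Cosine similarity between the target and every row of the matrix.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target)
    scores = embeddings @ target / norms
    # The query words themselves are not valid answers.
    for word in (a, b, c):
        scores[index[word]] = -np.inf
    return vocabulary[int(np.argmax(scores))]


def drift_from_first(vectors_by_period):
    """Cosine distance of each period's vector from the earliest period.

    Assumes all vectors are expressed in the same (aligned) space.
    """
    periods = sorted(vectors_by_period)
    reference = vectors_by_period[periods[0]]
    return {p: cosine_distance(reference, vectors_by_period[p]) for p in periods}


# Hypothetical vectors of one word in three different decades.
word_over_time = {
    1990: [1.0, 0.0, 0.0],
    2000: [0.8, 0.6, 0.0],
    2010: [0.0, 1.0, 0.0],
}
drift = drift_from_first(word_over_time)
```

On this toy data, `analogy("man", "king", "woman")` selects "queen", and the drift values grow from one decade to the next, which is the kind of trend the last use case refers to. With real matrices the same functions would apply once the vectors are loaded, provided the matrices of different periods are comparable.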

Downloading

You can get all the files mentioned above by cloning the repo. It may take a while because of the size of the matrices, so be patient.

 git clone https://github.com/CID-ITBA/similarity-lab.git

We have also made a Python package, available via pip, to interface with the matrices. It's an active project, so make sure to check for upcoming updates.

 pip install SimiLab

Docs and guides

You can find examples and documentation at our Read the Docs site.

Contributing

We are looking to expand our collection of text corpora, so any good reference to a new source will be appreciated.

License

This project is licensed under the MIT License.

Meet the Team

  • @cselmo: Member of @CID-ITBA and @CoNexDat for the OpLaDyn project
  • @MT2321: Member of @CID-ITBA
  • @PabloSML: Member of @CID-ITBA