t-SNE

An implementation of the t-SNE paper for our SMAI course project in Monsoon'21

Team members

Kunal Jain (2019111037)
Samay Kothari (2019113017)
Aaditya Sharma (2019113009)
Adwait Raste (2019111027)

The paper is available here

Problem Statement

Visualization of high-dimensional data is an important problem in many domains and involves data of widely varying dimensionality. The goal is to convey the distribution and likely properties of a dataset in a form that humans can easily comprehend.

Many techniques have been developed for this task, such as UMAP and pixel-based techniques. However, most earlier techniques simply display the high-dimensional data in two dimensions without considering how interpretable the resulting visualisation is to a human. This creates a need for a method that represents the data in an interpretable way.

The problem is therefore to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional visualisation.

Goals and Approach

t-Distributed Stochastic Neighbour Embedding (t-SNE) is an unsupervised, non-linear technique used for data exploration and for visualising high-dimensional data. In simpler terms, t-SNE gives us a feel for how data is arranged in high-dimensional space, using only two or three dimensions.

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities.

The t-SNE algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then makes the two sets of similarities agree as closely as possible by minimising a cost function.

The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. It is given by:

$p_{j|i} = \frac{\exp(-\lVert x_i-x_j\rVert^2/2\sigma_i^2)}{\sum_{k \neq i}\exp(-\lVert x_i-x_k\rVert^2/2\sigma_i^2)}$
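The conditional probabilities above can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: the function name `conditional_probabilities` and the fixed bandwidths in the usage lines are assumptions made for the example, and the convention `P[i, j]` for the probability that $x_i$ picks $x_j$ is chosen here for convenience.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return P where P[i, j] is the probability that x_i picks x_j.

    X      : (n, d) array of datapoints.
    sigmas : (n,) array, one Gaussian bandwidth per datapoint (assumed given).
    """
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off

    # Row i uses sigma_i, matching the Gaussian centered at x_i
    logits = -sq_dists / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point never picks itself (k != i)

    # Subtract the row maximum before exponentiating for numerical stability
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return P

# Illustrative usage on random data with a hypothetical bandwidth of 1 everywhere
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
P = conditional_probabilities(X, sigmas=np.ones(5))
```

Each row of `P` is a probability distribution over the other points, so the rows sum to one and the diagonal is zero.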

t-SNE can be explained in three major steps:

In the first step we measure the similarities between points in the high-dimensional space. As an illustration, picture a set of points scattered in 2-D. For each datapoint $x_i$ we center a Gaussian distribution on that point, measure the density of every other point $x_j$ under that Gaussian, and renormalize so that the values sum to one. This gives a set of probabilities $p_{j|i}$ for all pairs of points, which are proportional to the similarities: if two points $x_1$ and $x_2$ have equal density under the Gaussian centered at $x_i$, they are equally similar to $x_i$. In this way the probabilities capture the local structure of the high-dimensional space.

The width of each Gaussian is controlled by the perplexity, which sets the variance $\sigma_i^2$ of the distribution and can be interpreted as the effective number of nearest neighbors. For each point, a binary search finds the $\sigma_i$ that produces a distribution $P_i$ with the perplexity specified by the user. Perplexity is defined as $Perp(P_i)=2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits:

$H(P_i) = -\sum_j p_{j|i}\log_2 p_{j|i}$
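The binary search over $\sigma_i$ can be sketched as below, under stated assumptions. The helper names `row_perplexity` and `find_sigma`, the search bracket, the tolerance, and the choice to bisect on a logarithmic scale are all illustrative and not taken from the project code. The search relies on the perplexity growing as $\sigma_i$ grows.

```python
import numpy as np

def row_perplexity(sq_dists_i, sigma):
    """Perplexity 2^{H(P_i)} for one point, given squared distances to all other points."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    nonzero = p > 0                     # treat 0 * log2(0) as 0
    entropy_bits = -np.sum(p[nonzero] * np.log2(p[nonzero]))
    return 2.0 ** entropy_bits

def find_sigma(sq_dists_i, target_perplexity, tol=1e-5, max_iter=100):
    """Bisect for the sigma whose perplexity matches the target (assumed bracket)."""
    lo, hi = 1e-10, 1e10                # hypothetical search bracket
    sigma = 1.0
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)        # midpoint on a log scale
        perp = row_perplexity(sq_dists_i, sigma)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma                  # Gaussian too wide, shrink it
        else:
            lo = sigma                  # Gaussian too narrow, widen it
    return sigma

# Illustrative usage: bandwidth for the first of 50 random points
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)
sigma_0 = find_sigma(sq_dists_0, target_perplexity=10.0)
```

Repeating `find_sigma` for every point yields one bandwidth per datapoint, so dense regions end up with narrower Gaussians than sparse ones.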

In the second step we do the same in the low-dimensional space, but instead of a Gaussian we use a Student t-distribution with one degree of freedom, also known as a Cauchy distribution. This gives a second set of probabilities $q_{j|i}$:

$q_{j|i} = \frac{(1+\lVert y_i-y_j\rVert^2)^{-1}}{\sum_{k \neq i}(1+\lVert y_i-y_k\rVert^2)^{-1}}$

where the $y_i$ are the low-dimensional counterparts of the $x_i$. The t-distribution has heavier tails than the normal distribution, so it models large distances between points better. The conditional form is written here to match the notation of the first step; the t-SNE paper normalizes over all pairs of map points to obtain a joint probability $q_{ij}$.
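A minimal sketch of the heavy-tailed low-dimensional similarities follows, assuming the Student-t kernel $(1+\lVert y_i-y_j\rVert^2)^{-1}$ with one degree of freedom. The function name `student_t_similarities` is hypothetical. It returns both a row-normalized version, which matches the conditional notation used in this README, and a version normalized over all pairs, which is the joint form used in the t-SNE paper.

```python
import numpy as np

def student_t_similarities(Y):
    """Return (Q_cond, Q_joint) for map points Y of shape (n, 2) or (n, 3)."""
    # Pairwise squared Euclidean distances between map points
    sq_norms = np.sum(Y ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    sq_dists = np.maximum(sq_dists, 0.0)

    # Student-t kernel with one degree of freedom (Cauchy)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)       # exclude self-similarity

    Q_cond = kernel / kernel.sum(axis=1, keepdims=True)  # each row sums to 1
    Q_joint = kernel / kernel.sum()                      # all entries sum to 1
    return Q_cond, Q_joint

# Illustrative usage on a small random 2-D map
rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(6, 2))
Q_cond, Q_joint = student_t_similarities(Y)
```

Because the kernel decays polynomially rather than exponentially, distant map points keep a non-negligible similarity, which is the heavy-tail property the text refers to.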

In the third step we want the low-dimensional probabilities ($q_{j|i}$) to reflect the high-dimensional probabilities ($p_{j|i}$) as closely as possible, so that the two structures are similar. We measure the difference between the probability distributions of the two spaces using the Kullback-Leibler (KL) divergence and minimise this cost function using gradient descent. The KL cost function is given by:

$C = \sum_i KL(P_i \parallel Q_i) = \sum_i\sum_j p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}$

where $P_i$ represents the conditional probability distribution over all other datapoints given datapoint $x_i$ , and $Q_i$ represents the conditional probability distribution over all other map points given map point $y_i$ .
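The cost function can be evaluated directly from two matrices of probabilities, as in the sketch below. The function name `kl_cost`, the small floor `eps` on $q$, and the toy matrices are assumptions made for illustration; rows index the conditioning point $i$ and columns index $j$.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """Sum over i of KL(P_i || Q_i) for row-stochastic matrices P and Q."""
    mask = P > 0                        # terms with p = 0 contribute nothing
    q_safe = np.maximum(Q[mask], eps)   # guard against log of zero
    return float(np.sum(P[mask] * np.log(P[mask] / q_safe)))

# Toy example: three points, zero diagonal, each row sums to one
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.2, 0.8, 0.0]])
Q = np.array([[0.0, 0.5, 0.5],
              [0.4, 0.0, 0.6],
              [0.3, 0.7, 0.0]])

cost_mismatch = kl_cost(P, Q)   # positive, since the rows differ
cost_match = kl_cost(P, P)      # zero when the two distributions agree
```

The cost is zero only when every $Q_i$ equals $P_i$. Because each term is weighted by $p_{j|i}$, placing nearby points far apart in the map is penalized more heavily than placing distant points close together.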

Dataset

We plan to test our implementation on a number of datasets to ensure generalisability of the technique. We will use the following datasets:

MNIST : This consists of 60,000 grayscale images of handwritten digits. Each image is 28 X 28 = 784 pixels (dimensions).
Olivetti faces : This is a dataset of 400 images of 40 individuals who change their expressions across the images, with small variations in viewpoint. Each image is 92 X 112 = 10,304 pixels (dimensions) and is labelled with the individual's identity.
COIL-20 : There are 1,440 images of 20 objects, each taken from 72 equally spaced orientations. Each image is 32 X 32 = 1,024 pixels (dimensions).
Animals10 : There are about 55,000 images of animals from 10 classes. Each image is 64 X 64 = 4,096 pixels (dimensions).

Expected Deliverables

PCA for the datasets
Visualisations for the datasets
Implementation of tSNE
Comparison with other visualisation methods

Rough timeline

1 November - 7 November : Paper review
7 November - 10 November : Prepare dataset and create basic pipeline for visualisation, run and compare PCAs
10 November - 12 November : Test standard implementations of the method using pre-built libraries on the given datasets, test on newer datasets
12 November - 20 November : Initial implementation of t-SNE
20 November - 1 December : Testing and improvements based on mid-evaluation.
1 December - 4 December : Final report and presentation

The above timeline is approximate and may change as the project progresses.

Work distribution

Kunal Jain - paper review, Animals10
Samay Kothari - paper review, Olivetti
Aaditya Sharma - paper review, MNIST
Adwait Raste - paper review, COIL-20

The above will be updated accordingly as the project progresses.
