e-commerce

This deep learning demo project classifies products into different categories based on image and text data. The categories serve as labels for the supervised learning approach.

The entire data set contains 99,943 images of shop items belonging to 365 different categories. The data set is imbalanced among these categories:

image of imbalanced data

To shorten the training time, the data set is restricted as follows:

  • only 10 % of the images
  • only categories with at least 10 items

This reduced approach does not capture the whole complexity of the task but provides a first proof of concept.
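
Such a restriction step could be sketched as follows (a minimal pandas sketch; the file name items.csv and the column name category are assumptions, the project's own routine lives in src/data/make_dataset.py):

    import pandas as pd

    # Hypothetical sketch of the restriction described above; the actual file
    # and column names used by src/data/make_dataset.py may differ.
    items = pd.read_csv("data/raw/items.csv")

    # keep only categories with at least 10 items
    counts = items["category"].value_counts()
    items = items[items["category"].isin(counts[counts >= 10].index)]

    # keep only 10 % of the remaining images
    subset = items.sample(frac=0.1, random_state=42)

    print(f"{len(subset)} items in {subset['category'].nunique()} categories")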

The remaining data set then contains 9,171 images distributed among 141 categories and still reflects the class imbalance reasonably well:

image of imbalanced data

Labels

Each category can be decomposed into a triple of subcategories, e.g.

category = product category > product type > product details.
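
Assuming the category strings are stored in exactly this 'a > b > c' form, the decomposition can be sketched as follows (column names and the example value are illustrative only, not taken from the project code):

    import pandas as pd

    # Hypothetical sketch: split the combined category string into its three
    # subcategories. The actual column names in the .csv may differ.
    df = pd.DataFrame({"category": ["Schmuck > Ring > Silberring"]})
    df[["product_category", "product_type", "product_details"]] = (
        df["category"].str.split(" > ", expand=True)
    )
    print(df.iloc[0].to_dict())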

With this, one can choose between

  1. multiclass classification based on the item's

     category
    
  2. multilabel (and still multiclass) classification based on the category decomposition into

     product category, product type, product details
    

The product category contains three classes distributed as

image of imbalanced data

while the product types among these three product categories are distributed as follows

image of imbalanced data

One could now further break down each product type into its product details. However, to keep it simple, the product detail distributions within each product type are not shown here.

Text data

The text data used for training are the webshop product titles. In the original data set, 110 items are missing a product title. However, the subset considered in the following is assembled such that each product has a non-empty title.

The number of words per title varies slightly among the product categories, and the corresponding distributions are slightly skewed:

image of title length distributions

We find that the Kunst (art) product category has both the largest median and the largest average title length, followed by Schmuck (jewelry) and finally Taschen (bags).

However, after filtering out stop words, digits, and other obstacles, and stemming the words contained in the titles, we find a more even distribution of title lengths among the product categories:

image of cleaned title length distributions

These 'cleaned' titles are then used as data for the classifier.
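
The cleaning itself could, for instance, be sketched with NLTK as follows (an illustrative pipeline for the German titles; the project's actual implementation may differ in its stop word list, stemmer, and tokenization):

    import re
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    # Hypothetical cleaning step: lowercase, keep only alphabetic tokens,
    # drop German stop words, and stem what remains.
    # Requires the NLTK stop word corpus: nltk.download("stopwords")
    stop_words = set(stopwords.words("german"))
    stemmer = SnowballStemmer("german")

    def clean_title(title):
        tokens = re.findall(r"[a-zäöüß]+", title.lower())
        return [stemmer.stem(t) for t in tokens if t not in stop_words]

    print(clean_title("3 silberne Ringe mit kleinen Steinen"))
    # roughly: ['silbern', 'ring', 'klein', 'stein']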

A more detailed investigation of the statistics of the data set can be found in the notebook

notebooks/dataset_statistics.ipynb

Hands-on product classification

  1. Clone the repository

    git clone git@github.com:L2Data/e-commerce.git
    cd e-commerce
    
  2. Load the sample_images folder containing all images, together with the descriptive .csv file, into

     data/raw
    
  3. Load the pre-trained models (if available) into

     models/
    
  4. Create the subset of images with cleaned item titles by running

    make data
    

    Next to the original .csv, a 'cleaned' version of the .csv is created. This 'cleaned' version contains additional columns, such as a tokenized, i.e. cleaned, title for each item, as well as the subcategories described above. Creating the 'cleaned' .csv is useful because the title tokenization takes some time: once it exists, it is used for all further steps, so the tokenization does not have to be repeated when a new subset of the entire data set is created later on.

    Furthermore, the data folder now shows the following structure

     .
     ├── external
     ├── interim
     ├── processed
     │   ├── test
     │   ├── train
     │   └── validate
     └── raw
         └── sample_images
    

    where processed/train, for instance, contains the training images copied from the full set of images.

    The copying routine takes care of empty images and excludes them from further consideration.

    Along with each image set (train, validate, test) comes a descriptive .csv as well as a comprehensive one.

    The creation of the training data can be altered by several options such as

    • train-test-split ratio
    • subsample ratio
    • minimum number of items per category

    All of these parameters carry default values stored in

     utils/utils.py
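
    The exact parameter names are not reproduced here; conceptually, the defaults might look like the following (hypothetical names, see utils/utils.py for the real ones):

     # hypothetical defaults; the actual names and values live in utils/utils.py
     TRAIN_TEST_SPLIT = 0.8        # train-test-split ratio (assumed value)
     SUBSAMPLE_RATIO = 0.1         # only 10 % of the images
     MIN_ITEMS_PER_CATEGORY = 10   # only categories with at least 10 items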
    

    Now, everything is ready to run the pretrained models.

  5. To do so, simply execute

    make model_predict
    

from the root directory of the project.

This will execute the default pre-trained classifier, which is a combination of a deep learning image classifier and a text classifier.
The image classifier itself is a pre-trained VGG16 model, implemented in Keras, with a customized top to accomplish this particular classification task.

The default classification is multilabel.
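
For orientation, such a combined image and text model could be sketched roughly as follows (a minimal illustration using tensorflow.keras; the input shapes, layer sizes, and the text branch are assumptions and not taken from the project code):

    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import VGG16

    NUM_LABELS = 141      # number of output labels (assumption, depends on the chosen mode)
    VOCAB_SIZE = 10000    # assumed vocabulary size of the cleaned titles
    TITLE_LEN = 20        # assumed (padded) title length in tokens

    # image branch: pre-trained VGG16 base with a customized top
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False
    img_in = layers.Input(shape=(224, 224, 3))
    x = base(img_in)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu")(x)

    # text branch: embedding of the tokenized, cleaned titles
    txt_in = layers.Input(shape=(TITLE_LEN,))
    t = layers.Embedding(VOCAB_SIZE, 64)(txt_in)
    t = layers.GlobalAveragePooling1D()(t)
    t = layers.Dense(64, activation="relu")(t)

    # combined head; sigmoid + binary cross-entropy for the multilabel case
    # (softmax + categorical cross-entropy would be the plain multiclass choice)
    out = layers.Dense(NUM_LABELS, activation="sigmoid")(layers.concatenate([x, t]))

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")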

Again, in

    utils/utils.py

these default settings can be altered.

Furthermore, one can also alter two particular default settings by running

make model_predict CLASSMODE=<classmode> MODELMODE=<modelmode>

where <classmode> is

    a. multilabel (**default**)
    b. multiclass

and <modelmode>

    a. combined (**default**, run a combined image and text classifier simultaneously)
    b. image (run only the image classifier)
    c. text (run only the text classifier)

Before altering these default settings, make sure that the corresponding pre-trained model is available in models/.
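
For example, to run only the pre-trained image classifier in multiclass mode (using the variable names from above):

    make model_predict CLASSMODE=multiclass MODELMODE=image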

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience
