Introduction

Predict Click-Through-Rate with different Machine Learning models.

This project benchmarks several algorithms on a Click-Through-Rate (CTR) dataset.

This project is useful in two ways:

As a quick check of common techniques used in CTR prediction. If you have a dataset similar to a CTR dataset and want a baseline score, you can run this project on it.
As a base project that can be extended with more advanced algorithms (such as Deep Learning models) and additional CTR datasets. This first version uses only 1 dataset and 4 basic algorithms.

Datasets

Some common properties of CTR datasets are:

The number of rows is large: several million or more.
Almost all features are categorical.
Categorical features have high cardinality: the number of unique values is large (typically thousands).
The target is highly imbalanced.

For each dataset, we report summary statistics that illustrate these characteristics. If your dataset has similar characteristics, you can try the techniques in this project as well.

Avazu Click-Through Rate

The dataset contains online ad impressions and their labels (click or no click) over 10 days. Algorithms have to predict the probability that an ad is clicked. The data is ordered chronologically and can be downloaded from Kaggle. In this project, we use only the train data to benchmark algorithms.
- Total rows: about 40 million (40,428,967 rows).
- Date range: 10 days, from 2014/10/21 to 2014/10/30.
- Average number of rows per day: about 4 million (4,042,896/day).
- Click rate: 0.1698
Each row has the following fields:
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C1: anonymized categorical variable
- C14-C21: anonymized categorical variables
Sample rows:

	id	hour	C1	site_id	site_domain	site_category	app_id	app_domain	app_category	device_id	device_ip	device_model	device_type	device_conn_type	C14	C15	C16	C17	C19	C20	C21
0	1000009418151094273	14102100	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	ddd2926e	44956a24	1	2	15706	320	50	1722	35	-1	79
1	10000169349117863715	14102100	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	96809ac8	711ee120	1	0	15704	320	50	1722	35	100084	79
2	10000371904215119486	14102100	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	b3cf8def	8a4875bd	1	0	15704	320	50	1722	35	100084	79

Summary statistics:

	Column	#Unique
0	C1	7
1	banner_pos	7
2	site_id	4737
3	site_domain	7745
4	site_category	26
5	app_id	8552
6	app_domain	559
7	app_category	36
8	device_id	2686408
9	device_ip	6729486
10	device_model	8251
11	device_type	5
12	device_conn_type	4
13	C14	2626
14	C15	8
15	C16	9
16	C17	435
17	C18	4
18	C19	68
19	C20	172
20	C21	60
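
The unique-value counts above and the click rate can be computed in a single pass over the data. The following is a minimal sketch using only the standard library, not the project's own code; the `summarize` helper and the tiny sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize(rows, target='click'):
    """Return (unique-value count per column, click rate) for dict rows."""
    uniques = defaultdict(set)
    n_rows = 0
    n_clicks = 0
    for row in rows:
        n_rows += 1
        n_clicks += int(row[target])
        for column, value in row.items():
            if column != target:
                uniques[column].add(value)
    n_unique = {column: len(values) for column, values in uniques.items()}
    click_rate = n_clicks / n_rows if n_rows else 0.0
    return n_unique, click_rate


# Made-up sample rows using a few of the Avazu column names.
SAMPLE_CSV = '\n'.join([
    'id,click,hour,C1,device_id,device_ip',
    '1,0,14102100,1005,a99f214a,ddd2926e',
    '2,1,14102100,1005,a99f214a,96809ac8',
    '3,0,14102101,1002,a99f214a,b3cf8def',
    '4,0,14102101,1005,0f7c61dc,b3cf8def',
])

n_unique, click_rate = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(n_unique)
print(f'click rate: {click_rate:.4f}')
```

On the full 40-million-row file the per-column sets can become large (device_ip has millions of unique values), so memory use is worth watching.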

Cross Validation

Avazu Click-Through Rate

We use data from 2014/10/21 to 2014/10/29 as training data and the last day, 2014/10/30, for validation.
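
Because the data is ordered chronologically, the split can be made on the date part of the hour field. This is a minimal sketch under that assumption; `parse_hour` and `split_by_date` are hypothetical helper names, not functions from this project.

```python
from datetime import datetime

VALIDATION_DATE = datetime(2014, 10, 30).date()


def parse_hour(hour):
    """Parse the YYMMDDHH hour field into a datetime."""
    return datetime.strptime(str(hour), '%y%m%d%H')


def split_by_date(rows, validation_date=VALIDATION_DATE):
    """Send rows before validation_date to train and rows on it to validation."""
    train, valid = [], []
    for row in rows:
        day = parse_hour(row['hour']).date()
        if day < validation_date:
            train.append(row)
        elif day == validation_date:
            valid.append(row)
    return train, valid


# Made-up rows: two from the training period, one from the validation day.
rows = [
    {'hour': '14102100', 'click': 0},
    {'hour': '14102923', 'click': 1},
    {'hour': '14103000', 'click': 0},
]
train, valid = split_by_date(rows)
print(len(train), len(valid))
```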
The metrics used to report the validation score are AUC and LogLoss.
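
For reference, both metrics can be written in a few lines. This is a dependency-free sketch of the standard definitions; it is not the project's metrics code, and in practice a library implementation would normally be used. The labels and probabilities are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """ROC AUC from the rank-sum formula, using average ranks for tied scores.

    Requires at least one positive and one negative label.
    """
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    ranks = [0.0] * len(y_prob)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_prob[order[j + 1]] == y_prob[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(f'AUC: {auc(y_true, y_prob):.4f}')
print(f'LogLoss: {log_loss(y_true, y_prob):.4f}')
```

AUC depends only on how the predictions are ranked, while LogLoss also penalizes poorly calibrated probabilities, so the two metrics can order models differently.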

Data Processing and Feature Engineering

Avazu Click-Through Rate

Feature Extraction:
- Extract the hour of day from the hour column. For example: 14102100 -> 00
- Extract count features for each of the columns below. For example: how many times have we seen device_id == a99f214a in the data? Count features are estimated on the training data only; for validation data, the training counts are mapped onto the corresponding column.
  - device_id
  - device_ip
  - device_id + device_ip (shown as user_id_count in the sample below)
  - hour
- Sample data after feature extraction:

	id	C1	site_id	site_domain	site_category	app_id	app_domain	app_category	device_id	device_ip	device_model	device_type	device_conn_type	C14	C15	C16	C17	C19	C20	C21	device_id_count	device_ip_count	user_id_count	hour_count
0	1000009418151094273	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	ddd2926e	44956a24	1	2	15706	320	50	1722	35	-1	79	869	5	1	999
1	10000169349117863715	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	96809ac8	711ee120	1	0	15704	320	50	1722	35	100084	79	869	1	1	999
2	10000371904215119486	1005	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	07d7df22	a99f214a	b3cf8def	8a4875bd	1	0	15704	320	50	1722	35	100084	79	869	1	1	999
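
The feature extraction steps above can be sketched as follows. This is an illustration with made-up rows, not the project's `build_features.py`; the helper names are hypothetical, and giving unseen validation values a count of 0 is an assumption. The example covers device_id, device_ip, and their combination; a count on the hour column would work the same way.

```python
from collections import Counter


def hour_of_day(hour):
    """Return the last two characters of a YYMMDDHH value, e.g. '14102100' -> '00'."""
    return str(hour)[-2:]


def fit_counts(train_rows, columns):
    """Count how often each value (or combination of values) occurs in training rows."""
    return Counter(tuple(row[c] for c in columns) for row in train_rows)


def add_count_feature(rows, counts, columns, name):
    """Attach a count looked up from training counts; unseen values get 0 (assumption)."""
    for row in rows:
        row[name] = counts.get(tuple(row[c] for c in columns), 0)


train = [
    {'hour': '14102100', 'device_id': 'a99f214a', 'device_ip': 'ddd2926e'},
    {'hour': '14102100', 'device_id': 'a99f214a', 'device_ip': '96809ac8'},
    {'hour': '14102101', 'device_id': '0f7c61dc', 'device_ip': '96809ac8'},
]
valid = [{'hour': '14103000', 'device_id': 'a99f214a', 'device_ip': 'ffffffff'}]

COUNT_FEATURES = {
    'device_id_count': ['device_id'],
    'device_ip_count': ['device_ip'],
    'user_id_count': ['device_id', 'device_ip'],
}
for name, columns in COUNT_FEATURES.items():
    counts = fit_counts(train, columns)  # estimated on training rows only
    add_count_feature(train, counts, columns, name)
    add_count_feature(valid, counts, columns, name)

print(hour_of_day(valid[0]['hour']))
print(valid[0])
```

Estimating the counts on training rows only keeps information from the validation day out of the features.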

Preprocessing: A common practice in preprocessing CTR data is the hashing trick. Each feature's value is hashed into a fixed-size vector; in the current processing step, the size is 2^16. If the data has n features, preprocessing expands the input into a larger output space of dimension n * 2^16. For space efficiency, we store only the index of each feature's value. For example, if a site ID value hashes to index 123, we store site_id: 123.
- Logistic Regression and Factorization Machine: before training and prediction, the indices are one-hot encoded.
- FTRL and GBM: no further processing.
- Example data after preprocessing:

	C1	banner_pos	site_id	site_domain	site_category	app_id	app_domain	app_category	device_id	device_ip	device_model	device_type	device_conn_type	C14	C15	C16	C17	C18	C19	C20	C21	device_id_count	device_ip_count	user_id_count	hour_count
0	49542	38901	3703	60146	37267	4859	931	52646	2603	57603	65226	16939	52576	59061	26495	29876	35552	50048	42447	51572	26673	64040	26097	14350	8625
1	49542	38901	3703	60146	37267	4859	931	52646	2603	25930	60692	16939	15778	35267	26495	29876	35552	50048	42447	15865	26673	64040	61530	14350	8625
2	49542	38901	3703	60146	37267	4859	931	52646	2603	10179	56745	16939	15778	35267	26495	29876	35552	50048	42447	15865	26673	64040	61530	14350	8625
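
The hashing step and the one-hot expansion can be sketched as below. The text does not say which hash function the project uses, so CRC32 here is an assumption chosen because it is stable across runs (Python's built-in `hash` is randomized for strings); the function names are hypothetical.

```python
import zlib

N_BUCKETS = 2 ** 16  # size of each feature's hashed space, as in the text


def hash_value(value, n_buckets=N_BUCKETS):
    """Map a feature value to a stable index in [0, n_buckets)."""
    return zlib.crc32(str(value).encode('utf-8')) % n_buckets


def hash_row(row, columns):
    """Store only the hashed index per feature, e.g. {'site_id': 123}."""
    return {column: hash_value(row[column]) for column in columns}


def one_hot_indices(hashed_row, columns, n_buckets=N_BUCKETS):
    """Positions of the non-zero entries in the n * 2^16 one-hot vector."""
    return [i * n_buckets + hashed_row[column] for i, column in enumerate(columns)]


columns = ['site_id', 'device_id', 'device_ip']
row = {'site_id': '1fbe01fe', 'device_id': 'a99f214a', 'device_ip': 'ddd2926e'}
hashed = hash_row(row, columns)
print(hashed)
print(one_hot_indices(hashed, columns))
```

Each feature gets its own block of 2^16 positions, so a row with n features has exactly n non-zero entries in the expanded vector. Different values of the same feature can still collide in one bucket.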

Models

Below are the models used in this project. (This list is tentative; we may add or remove models later.)

Logistic Regression (LR): A linear model and a strong baseline.
Gradient Boosting Machine (GBM): A non-linear model that works well on a wide variety of datasets. We want to check how boosting performs on this data. It is also a strong baseline.
Factorization Machine (FM): A go-to algorithm for CTR problems. For details of the algorithm, see docs/Introduction-of-Factorization-Machine.docx.
FTRL-Proximal online learning algorithm (FTRL): An algorithm from Google research on CTR prediction. It is based on Logistic Regression but supports online learning and has a low memory footprint. For details of the algorithm, see docs/Introduction-of-FTRL.docx.
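
To make the online-learning idea behind FTRL concrete, here is a minimal sketch of per-coordinate FTRL-Proximal for logistic regression on hashed binary features, following the commonly published form of the update. It is not the project's implementation; the class name, the hyperparameter defaults, and the toy data are illustrative assumptions.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on binary features."""

    def __init__(self, n_features, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_features  # accumulated adjusted gradients
        self.n = [0.0] * n_features  # accumulated squared gradients

    def _weight(self, i):
        """Derive weight i lazily from z and n; L1 keeps small coordinates at exactly 0."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / step

    @staticmethod
    def _sigmoid(score):
        score = max(min(score, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def predict(self, indices):
        """Click probability for a row, given the indices of its active features."""
        return self._sigmoid(sum(self._weight(i) for i in indices))

    def update(self, indices, y):
        """One online step: predict, then update z and n for the active features."""
        weights = {i: self._weight(i) for i in indices}  # weights before the update
        p = self._sigmoid(sum(weights[i] for i in indices))
        g = p - y  # log-loss gradient for each active feature (feature value is 1)
        for i in indices:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * weights[i]
            self.n[i] += g * g
        return p


# Toy stream: rows with feature 0 are clicks, rows with feature 1 are not.
model = FTRLProximal(n_features=4)
for _ in range(200):
    model.update([0, 2], 1)
    model.update([1, 2], 0)
print(round(model.predict([0, 2]), 3), round(model.predict([1, 2]), 3))
```

Only the two arrays `z` and `n` are stored, and each row is seen once and then discarded, which is where the online-learning property and the low memory footprint come from.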

Results

Models	AUC	LogLoss	Computation Time	Size (MB)
LR	0.7059	0.5394	4h57m15s	1.26
GBM	0.7049	0.4163	1h08m37s	0.45
FM	0.7239	0.4206	2h31m58s	3.79
FTRL	0.7182	0.4121	0h56m36s	3.87

Note: LR and FM take longer because they require converting indices to one-hot vectors before training, while GBM and FTRL do not.

Project Structure


    ├── LICENSE
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- All non-code documents.
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention: <id>-<author>-<desc>
    │                         - id: has 2 digits.
    │                         - author: name of developer.
    │                         - desc: a short description of notebook.
    │                         - Use hyphen "-" as separator.
    │                         - For example: `08-nguyentp2-initial-data-exploration`.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- Makes the project pip installable (pip install -e .) so ajctr can be imported
    │
    ├── ajctr              <- Source code for use in this project.
    │   ├── __init__.py    <- Makes ajctr a Python package
    │   │
    │   ├── data           <- Scripts to download or generate data
    │   │   └── make_dataset.py
    │   │
    │   ├── features       <- Scripts to turn raw data into features for modeling
    │   │   └── build_features.py
    │   │
    │   ├── models         <- Scripts to train models and then use trained models to make
    │   │   │                 predictions
    │   │   ├── predict_model.py
    │   │   └── train_model.py
    │   │
    │   └── reports        <- Scripts to create exploratory and results-oriented visualizations
    │       └── metrics.py
    ├── main.py            <- Program Entry point
    │
    ├── test               <- Test code

How to run?

We use an AWS EC2 instance (instance type: r5.2xlarge) to run this project.

Install Anaconda and Python 3.6.
Set up the Conda environment:

conda env create -f environment.yml
conda activate ctr

Download the datasets:

chmod 755 download-data.sh
./download-data.sh

Install the project in development mode:

pip install -e .

Run the main program. Results are written to reports/YYYYMMDD_HHMMSS.log.

python ./main.py

Project Developer must know

Coding Convention:
- We use the Google Python Style Guide.
- The guide is long. If you are short on time, read only the Decision sections, or just the parts that interest you.
The project structure follows the Cookiecutter Data Science template.
Use Jupyter Notebooks wisely! Notebooks are good for exploration and communication.

Some key naming conventions:

Encode source files in UTF-8.
Use single quotes ' for strings and double quotes " for docstrings.
Use 4 spaces for indentation.
Use lowercase with underscores _ for function, variable, and module names, and CamelCase for class names.
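
As a quick illustration of these conventions, here is a short made-up snippet; `ClickCounter` is not a class from this project.

```python
class ClickCounter:
    """Count clicks per device (CamelCase class name, double-quoted docstring)."""

    def __init__(self):
        self.click_counts = {}

    def add_click(self, device_id):
        """Record one click for device_id (lowercase names with underscores)."""
        self.click_counts[device_id] = self.click_counts.get(device_id, 0) + 1
        return self.click_counts[device_id]


click_counter = ClickCounter()
click_counter.add_click('a99f214a')  # single quotes for ordinary strings
print(click_counter.click_counts)
```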

Reference

Microsoft Recommenders repo.


nguyentp/aajp-ctr-prediction
