


To run the scripts locally, download the dataset and change the path variables in the scripts to point to your copy. However, I suggest running the scripts in Kaggle notebooks instead: the dataset is over 100 MB, and on Kaggle you don't have to download it. I developed the project on Kaggle too, so the dataset paths in the scripts already work there; you only have to add the data to your notebook.
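The choice between Kaggle and a local copy can also be made in one place, so the path does not have to be edited by hand in each script. This is a minimal sketch, assuming Kaggle's usual `/kaggle/input` mount point; the helper name `resolve_dataset_path` and its arguments are hypothetical and not taken from the project's scripts:

```python
from pathlib import Path

# Kaggle notebooks mount attached datasets under this directory.
KAGGLE_INPUT = Path("/kaggle/input")


def resolve_dataset_path(filename, local_dir, kaggle_root=KAGGLE_INPUT):
    """Return the dataset path on Kaggle if it is found there, else a local path."""
    kaggle_root = Path(kaggle_root)
    if kaggle_root.is_dir():
        # Attached datasets live in subfolders, so search recursively.
        matches = sorted(kaggle_root.rglob(filename))
        if matches:
            return matches[0]
    # Fall back to a locally downloaded copy.
    return Path(local_dir) / filename
```

Calling it with the CSV file name and your local download folder returns the Kaggle path when the notebook has the data attached, and the local path otherwise.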

Link for the dataset:

My Kaggle Profile:

My GitHub Profile:

My LinkedIn Profile:

Feel free to reach out! ✌

Project Presentation

I would like to introduce my most comprehensive project to date! I have completed the project "Drinking or Smoking Classification" which I developed as the Midterm Project during the 7th week of the Machine Learning Zoomcamp organized by DataTalksClub and presented by Alexey Grigorev. I had the opportunity to apply all the techniques I learned, and I gave my all to this project. I look forward to your feedback and hope for positive reviews!

Problem Description

Smoking and alcohol consumption pose significant health risks. However, these risks are not limited to regular users alone. Passive smokers and non-regular consumers are also at risk. My project aims to determine the harm caused by passive smoking and help non-regular users understand if they are at a similar risk level to regular users. Additionally, it can be used to identify whether children have been exposed to smoking or alcohol, although the dataset contains no records for people under the age of 20.

Problem Solution

I used the dataset to develop two models, one for alcohol drinking prediction and one for smoking prediction, so that the user can input a single record and get both a drinking and a smoking probability. The target variable for the drinking model is "DRK_YN"; for the smoking model it is "SMK_stat_type_cd". I extracted the target variables from the dataset and trained the models separately. The Smoking Model is a multiclass classification model which predicts 0/1/2 (never smoked/smoked but quit/smoker). It can be reduced to a binary model by merging predictions 0 and 1 into a single class 0 (not a smoker) and treating prediction 2 as smoker. The Drinking Model is a binary classification model which predicts 0/1 for "DRK_YN" (drinker or not).
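The target handling described above can be sketched as follows. The encodings are assumptions rather than something confirmed by the project's scripts: the dataset codes 1/2/3 for "SMK_stat_type_cd" are shifted to classes 0/1/2, and "DRK_YN" is taken to hold "Y"/"N" with 1 meaning drinker. The function names are hypothetical.

```python
# Assumed encodings (not confirmed by the project's scripts):
#   SMK_stat_type_cd: 1 = never, 2 = quit, 3 = still smokes (dataset codes)
#   DRK_YN:           "Y" = drinker, "N" = not a drinker
SMOKING_CODE_TO_CLASS = {1: 0, 2: 1, 3: 2}
TARGET_COLUMNS = ("SMK_stat_type_cd", "DRK_YN")


def split_targets(record):
    """Separate one raw record into shared features and the two model targets."""
    features = {k: v for k, v in record.items() if k not in TARGET_COLUMNS}
    smoking_class = SMOKING_CODE_TO_CLASS[record["SMK_stat_type_cd"]]
    drinking_class = 1 if record["DRK_YN"] == "Y" else 0
    return features, smoking_class, drinking_class


def smoking_to_binary(smoking_class):
    """Collapse the 3-class label: 0 and 1 -> 0 (not a smoker), 2 -> 1 (smoker)."""
    return 1 if smoking_class == 2 else 0
```

Both models then see the same feature dictionary, which is what lets a single input record produce both predictions.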


This dataset was collected from the National Health Insurance Service in Korea. All personal information and sensitive data were excluded. The dataset is intended for:

  • Analysis of body signals
  • Classification of smoker or drinker

Details of the dataset:

  • Sex male, female

  • age rounded up to 5 years

  • height rounded up to 5 cm [cm]

  • weight [kg]

  • sight_left eyesight(left)

  • sight_right eyesight(right)

  • hear_left hearing left, 1(normal), 2(abnormal)

  • hear_right hearing right, 1(normal), 2(abnormal)

  • SBP Systolic blood pressure[mmHg]

  • DBP Diastolic blood pressure[mmHg]

  • BLDS BLDS or FSG(fasting blood glucose)[mg/dL]

  • tot_chole total cholesterol[mg/dL]

  • HDL_chole HDL cholesterol[mg/dL]

  • LDL_chole LDL cholesterol[mg/dL]

  • triglyceride triglyceride[mg/dL]

  • hemoglobin hemoglobin[g/dL]

  • urine_protein protein in urine, 1(-), 2(+/-), 3(+1), 4(+2), 5(+3), 6(+4)

  • serum_creatinine serum(blood) creatinine[mg/dL]

  • SGOT_AST SGOT(Glutamate-oxaloacetate transaminase) AST(Aspartate transaminase)[IU/L]

  • SGOT_ALT ALT(Alanine transaminase)[IU/L]

  • gamma_GTP γ-glutamyl transpeptidase[IU/L]

  • SMK_stat_type_cd Smoking state, 1(never), 2(used to smoke but quit), 3(still smoke)

  • DRK_YN Drinker or Not
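Before training or sending a record to the models, it can help to check that all of the columns listed above are present. A minimal sketch: the column names are copied from the list above, the comparison ignores case because the exact capitalization in the CSV may differ, and the helper name `missing_columns` is hypothetical.

```python
# Column names as given in the dataset description above.
EXPECTED_COLUMNS = [
    "Sex", "age", "height", "weight",
    "sight_left", "sight_right", "hear_left", "hear_right",
    "SBP", "DBP", "BLDS",
    "tot_chole", "HDL_chole", "LDL_chole", "triglyceride",
    "hemoglobin", "urine_protein", "serum_creatinine",
    "SGOT_AST", "SGOT_ALT", "gamma_GTP",
    "SMK_stat_type_cd", "DRK_YN",
]


def missing_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `header`."""
    # Compare case-insensitively and ignore surrounding whitespace.
    present = {name.strip().lower() for name in header}
    return [name for name in expected if name.lower() not in present]
```

With pandas, `missing_columns(df.columns)` would return an empty list when the loaded file matches the description.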

Midterm Project Requirements (Evaluation Criteria)

  • Problem description
  • EDA
  • Model training
  • Exporting notebook to script
  • Model deployment
  • Reproducibility
  • Dependency and environment management
  • Containerization
  • Cloud deployment

Dependency and environment management guide

You can install the dependencies from requirements.txt inside a virtual environment:

  • pip install pipenv

  • pipenv shell

  • pip install -r requirements.txt

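If you can't or don't know how to, here are the needed packages, just run the command below (warnings, pickle and json are part of the Python standard library, so they don't need to be installed):

  • pip install pipenv waitress flask pandas numpy scikit-learn==1.2.2 lightgbm xgboost requests seaborn matplotlib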
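To confirm that the environment is ready, you can check that each package is importable. This is a minimal sketch using only the standard library; the module list mirrors the pip packages above (scikit-learn is imported as `sklearn`), and the helper name `find_missing` is hypothetical.

```python
import importlib.util

# Import names for the pip packages listed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "flask", "waitress", "pandas", "numpy", "sklearn",
    "lightgbm", "xgboost", "requests", "seaborn", "matplotlib",
]


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

This only checks that the modules can be located; it does not verify versions such as the pinned scikit-learn==1.2.2.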


Containerization

To push your image to Docker Hub:

  1. docker tag image_name YOUR_DOCKERHUB_NAME/image_name

  2. docker push YOUR_DOCKERHUB_NAME/image_name


Useful link if Docker reports that your Docker Hub credentials are incorrect:

Deployment guide

To run it locally:

  • Run python on a terminal

  • Open a new terminal and run python

To run it on docker:

  • Download and run Docker Desktop:

  • Open a terminal

  • docker build -t midterm_project .

  • docker run -it --rm -p 7860:7860 midterm_project

  • Open a new terminal and run python
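Once the container is running, a client script can send one record to the service on port 7860. The sketch below uses only the standard library; the `/predict` endpoint, the JSON request shape and the function names are assumptions for illustration, so adjust them to whatever the project's prediction script actually exposes.

```python
import json
import urllib.request


def build_prediction_request(record, host="localhost:7860", endpoint="/predict"):
    """Build a JSON POST request for one record (endpoint name is hypothetical)."""
    url = f"http://{host}{endpoint}"
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response from the service."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For a cloud deployment, the same request can be built by passing the service's host name as `host` instead of `localhost:7860`.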

To run it on cloud:

  • Push your image to docker hub like in the Containerization section

  • Open and create a web service

  • Select "Deploy an existing image from a registry"

  • Enter the image URL of your image, "YOUR_DOCKERHUB_NAME/image_name"

  • Finalize the setup and run your web service


  • Set the host in

  • Open up a new terminal and run python

  • Here is the final result



Midterm Project for Machine Learning Zoomcamp DataTalksClub.







