Materials for malware
mariolpantunes committed Nov 10, 2023
1 parent 423f596 commit 380dc61
Showing 10 changed files with 5,630 additions and 0 deletions.
11 changes: 11 additions & 0 deletions datasets/sample.csv
@@ -0,0 +1,11 @@
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
443 changes: 443 additions & 0 deletions notebooks/02-malware/notebook00.ipynb

Large diffs are not rendered by default.

1,874 changes: 1,874 additions & 0 deletions notebooks/02-malware/notebook01.ipynb

Large diffs are not rendered by default.

1,515 changes: 1,515 additions & 0 deletions notebooks/02-malware/notebook02.ipynb

Large diffs are not rendered by default.

1,374 changes: 1,374 additions & 0 deletions notebooks/02-malware/notebook03.ipynb

Large diffs are not rendered by default.

187 changes: 187 additions & 0 deletions notebooks/02-malware/notebook04.ipynb
@@ -0,0 +1,187 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# ML 101\n",
"\n",
"## Comparing two classifiers\n",
"\n",
"The choice of a [statistical hypothesis test](https://machinelearningmastery.com/statistical-hypothesis-tests/) is a challenging open problem for interpreting machine learning results.\n",
"\n",
"Model evaluation is a key part of the model development process. It is the phase in which we decide whether a model performs well enough. It is therefore worth examining the model's outcomes with more than one evaluation method, since different methods offer different perspectives.\n",
"\n",
"A common mistake when evaluating a classification model is to consider only the correct predictions, that is, to look only at how often the model labels cases correctly. When the results are unsatisfactory, people then try different methods or variations until the numbers look acceptable, without examining why the result was poor. Accuracy depends on the false predictions as much as on the true ones, so the false predictions, which we want to keep to a minimum, must also be examined before reaching a verdict. Precision and recall partly capture this by accounting for false positives and false negatives, respectively. However, the two kinds of error should also be compared with each other directly. This is where McNemar’s test is useful: it gives a p-value for whether the difference between two paired error counts (here, false negatives versus false positives) is larger than chance alone would explain.\n",
"\n",
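"McNemar’s test is applied to $2\\times 2$ contingency tables of paired samples to test whether the row and column marginal frequencies are equal. For a confusion matrix, this amounts to comparing the number of false positives with the number of false negatives. When comparing two classifiers on the same test set, as in the cells below, the table instead counts the samples each model classified correctly or incorrectly, and the test compares the two off-diagonal cells (samples that only one of the models got right). Under the null hypothesis the test statistic approximately follows a chi-squared distribution with one degree of freedom, from which the p-value is computed; for small counts an exact binomial test is used instead.\n",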
"\n",
"![mcnemar](https://media.githubusercontent.com/media/mariolpantunes/ml101/main/figs/mcnemar.png)\n"
],
"metadata": {
"id": "LsyWEfBiiV3P"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CDE8CTnLiThN"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"\n",
"from mlxtend.evaluate import mcnemar_table\n",
"from mlxtend.evaluate import mcnemar\n",
"\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"from sklearn.metrics import matthews_corrcoef\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.svm import SVC"
]
},
{
"cell_type": "code",
"source": [
"iris = datasets.load_iris()\n",
"X = iris.data\n",
"y = iris.target\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
"\n",
"\n",
"# Logistic Regression\n",
"clf = LogisticRegression().fit(X_train, y_train)\n",
"y_pred_lr = clf.predict(X_test)\n",
"m = matthews_corrcoef(y_test, y_pred_lr)\n",
"print(f'LR MCC {m}')\n",
"\n",
"# Naive Bayes\n",
"clf = GaussianNB().fit(X_train, y_train)\n",
"y_pred_nb = clf.predict(X_test)\n",
"m = matthews_corrcoef(y_test, y_pred_nb)\n",
"print(f'NB MCC {m}')\n",
"\n",
"# SVM\n",
"clf = SVC(kernel='linear').fit(X_train, y_train)\n",
"y_pred_svm = clf.predict(X_test)\n",
"m = matthews_corrcoef(y_test, y_pred_svm)\n",
"print(f'SVM MCC {m}')"
],
"metadata": {
"id": "p_Xd9Hsui4c8",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a98723e5-f850-4360-8445-792ba8bb27d4"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"LR MCC 0.9515873026942034\n",
"NB MCC 0.9515873026942034\n",
"SVM MCC 1.0\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"tb = mcnemar_table(y_target=y_test, y_model1=y_pred_nb, y_model2=y_pred_svm)\n",
"sns.heatmap(tb, annot=True)"
],
"metadata": {
"id": "DGzbSjrqkUEB",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"outputId": "7f492ae9-c9cd-4a8c-97df-2f806b169df9"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7fbcd7f79710>"
]
},
"metadata": {},
"execution_count": 3
},
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAVoAAAD4CAYAAACt8i4nAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAOs0lEQVR4nO3dfbBU9X3H8c/nIplpMX+gNgSQgg+MDZk02EGTapLq+ISmKZh2iGZqGcP0Oh1NJZNJ4phOTKtNzTTxoZ2M02slkJSHMhVHSp00DHVKiDaRpIzhwYaIELlcwAc6OmhH7u63f9yFbuFyd/fe89tz+PF+Mb9h95zd3/k6rl+/8z2/c44jQgCAdHrKDgAAckeiBYDESLQAkBiJFgASI9ECQGJnpD7AkVd3sawBJzh7+tVlh4AKeuPwLo91jk5yzvhzzh/z8dpBRQsAiSWvaAGgq+q1siM4AYkWQF5qg2VHcAISLYCsRNTLDuEEJFoAeamTaAEgLSpaAEiMk2EAkBgVLQCkFaw6AIDEOBkGAInROgCAxDgZBgCJUdECQGKcDAOAxDgZBgBpRdCjBYC06NECQGK0DgAgMSpaAEisdqTsCE5AogWQF1oHAJAYrQMASIyKFgASI9ECQFrByTAASKyCPdqesgMAgELV6+2PEdieZvtp29ttb7N9Z2P7V233297SGDe0ComKFkBeiqtoByV9PiJ+avvdkn5ie31j34MR8Y12JyLRAshLQSfDImJA0kDj9Zu2d0iaOpq5aB0AyEvU2x62e21vbhq9w01pe4akiyX9qLHpDtvP215ie2KrkKhoAeRlsP0bf0dEn6S+kT5j+0xJj0taHBFv2H5E0r2SovH3NyV9ZqQ5SLQA8lLgqgPb4zWUZJdHxBpJiogDTfsflbSu1TwkWgB5KahHa9uSHpO0IyIeaNo+udG/laQbJW1tNReJFkBeiqtoL5d0i6Sf2d7S2Ha3pJttz9ZQ62C3pNtaTUSiBZCX4lYdbJLkYXY91elcJFoAeanglWEkWgB56WDVQbeQaAHkJaLsCE5AogWQF26TCACJkWgBIDFOhgFAYrVa2RGcgEQLIC+0DgAgMRItACRGjxYA0oo662gBIC1aBwCQGKsOACAxKloASIxEe/oYOPCK7r73G3rt0CFZ1h/Mu163LJivF3bu0r1//bd66+3/0ZTJ79HX7/mizpwwoexwUYJvPfJ1zb3+Sr3yymv68CXXlx1OPip4UxmegpvIGePG6Quf/WOtXd6nFX0PatWadXrxpT265/6HtPhPbtUT331EV33sMn17+eNlh4qSLP+Hf9In599adhj5qdfbH11Cok3k1845S7MuulCSNGHCr+r86dN04JXXtOflfs2Z/QFJ0m9f8lta/++bygwTJXrmh8/p0Ov/XXYY+alH+6NLWrYObP+GpHmSpjY29UtaGxE7UgaWk/6BA9qx80X95vsv0gXnTde//eBZXfWxy/T9p3+g/QdeLTs8IC8VXHUwYkVr+0uSVmnouTk/bgxLWmn7rhG+12t7s+3Nf/+dlUXGe8p566239bkv36cv/eltOnPCBN179+e0as06LfjMZ3X4rbc1fjxtcqBIUa+3Pbql1X/liyS9PyKONG+0/YCkbZLuH+5LEdEnqU+Sjry6q3qd6S45MjioxV++Tx+/9kpdc8XlkqTzp0/Tow99TZK0+5d7tfGZH5cZIpCfCl4Z1qpHW5c0ZZjtkxv7cBIRoa/81UM6f/o0Lbzpk8e2v3ZoqCdXr9f1d8tWacH8G8oKEchT1NsfXdKqol0saYPtnZJebmz7dUkXSrojZWCnuv98fpv++XsbNPOCGfr9hbdLku68baH27N2nVWvWSZKu/p3LdOPHry0zTJRoydKH9ZGPfkhnnz1RO37+Q33tvof13e+sLjusU18FK1pHizVntnskXar/fzLsuYhoq+N8OrcOcHJnT7+67BBQQW8c3u
WxznH4Kze1nXMm/MWqMR+vHS3PxEREXdJ/dCEWABg7bpMIAIlVsHVAogWQlW4u22oXiRZAXqhoASCxCiZa7nUAIC+1WvtjBLan2X7a9nbb22zf2dh+lu31tnc2/p7YKiQSLYCsRD3aHi0MSvp8RMyS9GFJt9ueJekuSRsiYqakDY33IyLRAshLQXfvioiBiPhp4/WbknZo6HqCeZKWNT62TNL8ViHRowWQlw5WHdjuldTbtKmvca+W4z83Q9LFkn4kaVJEDDR27Zc0qdVxSLQA8tLBybDmG2CdjO0zJT0uaXFEvGH/38VkERG2Wx6QRAsgLwWuOrA9XkNJdnlErGlsPmB7ckQM2J4s6WCreejRAshK1Optj5F4qHR9TNKOiHigaddaSQsbrxdKerJVTFS0APJSXEV7uaRbJP3M9pbGtrs1dB/u1bYXSdojaUGriUi0ALLSxrKt9uaJ2KShJ8oM56pO5iLRAshLBa8MI9ECyEv17ilDogWQlxisXqYl0QLIS/XyLIkWQF6KOhlWJBItgLxQ0QJAWlS0AJAaFS0ApBWDZUdwIhItgKxU8GnjJFoAmSHRAkBaVLQAkBiJFgASi9rJbrhVHhItgKxQ0QJAYlGnogWApKhoASCxCCpaAEiKihYAEquz6gAA0uJkGAAkRqIFgMSierejJdECyAsVLQAkxvIuAEisxqoDAEiLihYAEqtij7an7AAAoEgR7Y9WbC+xfdD21qZtX7Xdb3tLY9zQah4SLYCsRN1tjzYslTR3mO0PRsTsxniq1SS0DgBkpVYvrn6MiI22Z4x1HipaAFnppHVgu9f25qbR2+Zh7rD9fKO1MLHVh0m0ALJSD7c9IqIvIuY0jb42DvGIpAskzZY0IOmbrb5A6wBAVlIv74qIA0df235U0rpW36GiBZCVIlcdDMf25Ka3N0raerLPHpW8ov2VKR9NfQgAOKZeYEVre6WkKySdY3uvpHskXWF7tqSQtFvSba3moXUAICsFrzq4eZjNj3U6D4kWQFYqeJdEEi2AvBTZOigKiRZAVripDAAkVsGH4JJoAeQlREULAEkN0joAgLSoaAEgMXq0AJAYFS0AJEZFCwCJ1ahoASCtCj6bkUQLIC91KloASIubygBAYpwMA4DE6qZ1AABJ1coOYBgkWgBZYdUBACTGqgMASIxVBwCQGK0DAEiM5V0AkFiNihYA0qKiBYDESLQAkFgFHxlGogWQFypaAEiMS3ABILEqrqPtKTsAAChSvYPRiu0ltg/a3tq07Szb623vbPw9sdU8JFoAWSky0UpaKmnucdvukrQhImZK2tB4PyISLYCsRAej5VwRGyW9ftzmeZKWNV4vkzS/1TwkWgBZqbv9YbvX9uam0dvGISZFxEDj9X5Jk1p9gZNhALLSyaqDiOiT1DfaY0VE2G5ZHJNoAWSlnv5GiQdsT46IAduTJR1s9QVaBwCyUvDJsOGslbSw8XqhpCdbfYFECyArRZ4Ms71S0rOSLrK91/YiSfdLusb2TklXN96PiNYBgKwUeQluRNx8kl1XdTIPiRZAVgZbn5vqOhItgKxUL82SaAFkhrt3AUBiXVje1TESLYCsVC/NkmgBZIbWAQAkVqtgTUuiBZAVKloASCyoaAEgrSpWtNzroEuuu/YKbdu6US9s36QvfuH2ssNBRfC7KF5d0fboFhJtF/T09OhvHv5L/e4n/lAf+OCV+tSn5ut975tZdlgoGb+LNIq8qUxRSLRdcOklF+vFF3frpZd+qSNHjmj16if1e5+4ruywUDJ+F2kMKtoe3UKi7YIpU9+rl/fuO/Z+b/+Apkx5b4kRoQr4XaQRHfzpllEnWtu3jrDv2HN46vXDoz0EAHSsCzf+7thYKto/P9mOiOiLiDkRMaenZ8IYDpGHff37Ne3cKcfenzt1svbt219iRKgCfhdpVLGiHXF5l+3nT7ZLbTz5EUOe27xFF154nmbMmKb+/v1asGCebvkjzjCf7vhdpFHF5V2t1t
FOknSdpEPHbbekZ5JElKFaraY7F/+ZnvqXFRrX06Oly/5R27f/vOywUDJ+F2nUonoXLDhGCMr2Y5K+HRGbhtm3IiI+3eoAZ7xravX+qQFU0uA7/R7rHJ+efmPbOWfFnifGfLx2jFjRRsSiEfa1TLIA0G1cggsAiZ2KPVoAOKXwhAUASIzWAQAkVsVVByRaAFmhdQAAiXEyDAASo0cLAInROgCAxEa62rVTtndLelNSTdJgRMwZzTwkWgBZSfC48Ssj4tWxTECiBZCVKrYOeMICgKxERNuj+SEFjdF7/HSSvm/7J8PsaxsVLYCsdFLRRkSfpL4RPvKRiOi3/R5J622/EBEbO42JihZAVop8wkJE9Df+PijpCUmXjiYmEi2ArNQi2h4jsT3B9ruPvpZ0raSto4mJ1gGArBR4MmySpCdsS0O5ckVEfG80E5FoAWSlqEQbEbskfbCIuUi0ALJS5AULRSHRAshKFdfRkmgBZIWbygBAYrWo3o0SSbQAskKPFgASo0cLAInRowWAxOq0DgAgLSpaAEiMVQcAkBitAwBIjNYBACRGRQsAiVHRAkBitaiVHcIJSLQAssIluACQGJfgAkBiVLQAkBirDgAgMVYdAEBiXIILAInRowWAxOjRAkBiVLQAkBjraAEgMSpaAEiMVQcAkBgnwwAgsSq2DnrKDgAAihQd/GnF9lzb/2X7F7bvGm1MVLQAslJURWt7nKRvSbpG0l5Jz9leGxHbO52LRAsgKwX2aC+V9IuI2CVJtldJmiepeol28J1+pz7GqcJ2b0T0lR0HqoXfRbE6yTm2eyX1Nm3qa/p3MVXSy0379kr60GhiokfbXb2tP4LTEL+LkkREX0TMaRpJ/odHogWA4fVLmtb0/tzGto6RaAFgeM9Jmmn7PNvvknSTpLWjmYiTYd1FHw7D4XdRQRExaPsOSf8qaZykJRGxbTRzuYqLewEgJ7QOACAxEi0AJEai7ZKiLuVDPmwvsX3Q9tayY0FaJNouaLqU73pJsyTdbHtWuVGhApZKmlt2EEiPRNsdxy7li4h3JB29lA+nsYjYKOn1suNAeiTa7hjuUr6pJcUCoMtItACQGIm2Owq7lA/AqYdE2x2FXcoH4NRDou2CiBiUdPRSvh2SVo/2Uj7kw/ZKSc9Kusj2XtuLyo4JaXAJLgAkRkULAImRaAEgMRItACRGogWAxEi0AJAYiRYAEiPRAkBi/wtANhHB3PDSfAAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"source": [
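"# With exact=True an exact binomial test is used instead of the\n",
"# chi-squared approximation, which suits small discordant counts.\n",
"chi2, p = mcnemar(ary=tb, exact=True)\n",
"print('chi-squared:', chi2)\n",
"print('p-value:', p)\n",
"\n",
"alpha = 0.05\n",
"\n",
"# A p-value >= alpha means no significant difference was detected;\n",
"# it does not prove that the models are equivalent.\n",
"if p < alpha:\n",
"    print('The models are significantly different')\n",
"else:\n",
"    print('The models are similar')"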
],
"metadata": {
"id": "Nu6m4wkzkt1i",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "8028a540-b6f6-4484-c04b-7734fb8466cd"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"chi-squared: 0\n",
"p-value: 1.0\n",
"The models are similar\n"
]
}
]
}
]
}
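
To make the McNemar procedure used in notebook04 more concrete, the following is a minimal standard-library sketch of the computation. It is not mlxtend's implementation: the function names `paired_outcome_table`, `mcnemar_chi2`, and `mcnemar_exact` are hypothetical, and the table layout (rows for model A correct/wrong, columns for model B correct/wrong) is an assumption made for this sketch. The chi-squared variant uses the usual continuity correction, and the exact variant uses a two-sided binomial test on the discordant counts.

```python
import math


def paired_outcome_table(y_true, pred_a, pred_b):
    """Count paired outcomes of two classifiers on the same samples.

    Assumed layout: [[both right, only A right], [only B right, both wrong]].
    """
    both = only_a = only_b = neither = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == t), (b == t)
        if a_ok and b_ok:
            both += 1
        elif a_ok:
            only_a += 1
        elif b_ok:
            only_b += 1
        else:
            neither += 1
    return [[both, only_a], [only_b, neither]]


def mcnemar_chi2(table, corrected=True):
    """Chi-squared McNemar test on the two discordant cells b and c."""
    b, c = table[0][1], table[1][0]
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to compare
    if corrected:
        stat = (abs(b - c) - 1) ** 2 / n  # continuity-corrected statistic
    else:
        stat = (b - c) ** 2 / n
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


def mcnemar_exact(table):
    """Two-sided exact binomial p-value for the discordant cells."""
    b, c = table[0][1], table[1][0]
    n = b + c
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    y = [0, 1, 1, 0]
    table = paired_outcome_table(y, [0, 1, 0, 1], [0, 0, 1, 1])
    print(table)
    print(mcnemar_chi2(table), mcnemar_exact(table))
```

Only the off-diagonal cells enter either statistic, which is why two models with the same accuracy can still be told apart (or not) by this test. With very few discordant samples, as on a 30-sample Iris test split, the exact variant is the more appropriate choice.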
Binary file added slides/figures/malware00.png
Binary file added slides/figures/malware01.png
Binary file added slides/figures/multiclass.png