General improvements to the notebooks
mariolpantunes committed Sep 22, 2023
1 parent 240d4d5 commit 0469af8
Showing 3 changed files with 200 additions and 39 deletions.
2 changes: 1 addition & 1 deletion notebooks/notebook_00.ipynb
@@ -22,7 +22,7 @@
"### Probabilities\n",
"\n",
"A set of probability values for an experiment with sample space $S = \\\\{ O_1, O_2, \\cdots, O_n \\\\}$ consists of some probabilities that satisfy: $$ 0 \\leq p_i \\leq 1, \\hspace{0.5cm} i= 1,2, \\cdots, n $$ and\n",
"$$ p_1 +p_2 + \\cdots +p_n = 1 $$\n",
"$$ p_1 + p_2 + \\cdots +p_n = 1 $$\n",
"\n",
"The probability of outcome $O_i$ occurring is said to be $p_i$ and it is written:\n",
"\n",
76 changes: 75 additions & 1 deletion notebooks/notebook_01.ipynb
@@ -7,6 +7,61 @@
"# SPAM or HAM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing a proper Classifier\n",
"\n",
"This notebook builds upon the previous one by:\n",
"\n",
"1. Providing proper text mining processing using NLTK.\n",
"2. Implementing a class for the classifier.\n",
"3. Employing a trick to improve numerical stability.\n",
"4. Evaluating the classifier against a proper SPAM dataset.\n",
"\n",
"### Numerical stability \n",
"\n",
"Numerical instability is the tendency of an algorithm or computational\n",
"procedure to produce inaccurate results because of round-off errors, truncation errors, or other computational issues.\n",
"\n",
"These errors may be small at first but can accumulate over many iterations, leading to \n",
"results that are far from the exact value.\n",
"\n",
"The previous implementation used a reduce to multiply several likelihood probabilities together.\n",
"\n",
"Each likelihood is below 1, so the product shrinks with every factor and, through rounding error or underflow to zero, the program can produce a bad result.\n",
"\n",
"Smoothing helps with a related problem: it gives a small non-zero probability to unseen words, so a single unseen word does not force the whole product to zero.\n",
"\n",
"Furthermore, we can replace the multiplications with additions by working with logarithms:\n",
"\n",
"$$ \\log ab = \\log a + \\log b $$\n",
"\n",
"$$ \\exp \\log x = x$$\n"
]
},
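As a rough illustration of the numerical stability point in the cell above, the sketch below compares a plain product of likelihoods with a sum of their logarithms. The likelihood values are hypothetical and chosen only to force underflow; this is not taken from the notebook's code.

```python
import math
from functools import reduce

# Hypothetical per-word likelihoods: 400 words, each with probability 0.01.
likelihoods = [0.01] * 400

# Plain product: the exact value is 1e-800, far below the smallest positive
# double, so the running product underflows to 0.0.
product = reduce(lambda a, b: a * b, likelihoods, 1.0)

# Log-space: log(a * b) = log(a) + log(b), so the product becomes a sum
# that stays comfortably within floating-point range.
log_score = sum(math.log(p) for p in likelihoods)

print(product)    # 0.0
print(log_score)  # roughly -1842.07
```

Since the logarithm is monotonic, comparing log scores between classes gives the same ranking as comparing the original products, so the final `exp` is only needed when the actual probability value is required.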
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Libraries\n",
"\n",
"Import NLTK and download the additional data for the tokenizer and lemmatizer.\n",
"\n",
"**Tokenization** is the process of splitting a string of text into a list of tokens.\n",
"A token is a unit of a larger text: a word is a token in a sentence, and a sentence is a token in a paragraph.\n",
"\n",
"**Lemmatization** is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. \n",
"Lemmatization is similar to stemming, but it takes the context of a word into account, so it links words with similar meanings to a single word.\n",
"\n",
"Examples of lemmatization:\n",
"\n",
"- rocks : rock\n",
"- corpora : corpus\n",
"- better : good"
]
},
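To make the two definitions above concrete, here is a minimal standard-library sketch. It is a toy stand-in, not the NLTK pipeline the notebook uses: `tokenize`, `lemmatize`, and `TOY_LEMMAS` are hypothetical names, and the lemma table covers only the three examples listed in the cell.

```python
import re

# Toy lemma table covering only the examples from the cell above. A real
# lemmatizer relies on a full dictionary and part-of-speech information.
TOY_LEMMAS = {"rocks": "rock", "corpora": "corpus", "better": "good"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    """Return the lemma from the toy table, or the token unchanged."""
    return TOY_LEMMAS.get(token, token)


tokens = tokenize("Better corpora, fewer rocks!")
lemmas = [lemmatize(t) for t in tokens]
print(tokens)  # ['better', 'corpora', 'fewer', 'rocks']
print(lemmas)  # ['good', 'corpus', 'fewer', 'rock']
```

Note that `better` maps to `good`, which no suffix-stripping stemmer could produce; that is the practical difference between lemmatization and stemming.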
{
"cell_type": "code",
"execution_count": 2,
@@ -46,6 +101,15 @@
"nltk.download('omw-1.4')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Toy Dataset\n",
"\n",
"The same toy dataset as in the previous notebook, but in a traditional format."
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -82,6 +146,13 @@
"print(f'{dataset_test}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NB Classifier"
]
},
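The notebook's own classifier code is collapsed in this diff, so the following is only a sketch of what a class-based Naive Bayes classifier could look like. It assumes a multinomial model with add-one (Laplace) smoothing and log-space scoring; the class name `ToyNaiveBayes`, its methods, and the training messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class ToyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, scored in log-space."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_score(self, tokens, label):
        total_docs = sum(self.label_counts.values())
        total_words = sum(self.word_counts[label].values())
        # Log prior of the class.
        score = math.log(self.label_counts[label] / total_docs)
        for token in tokens:
            count = self.word_counts[label][token]
            # Add-one smoothing keeps unseen words at a non-zero probability.
            score += math.log((count + 1) / (total_words + len(self.vocabulary)))
        return score

    def predict(self, tokens):
        # Pick the class with the highest log score.
        return max(self.label_counts, key=lambda label: self.log_score(tokens, label))


# Made-up training messages, already tokenized.
train_docs = [
    ["win", "money", "now"],
    ["free", "money"],
    ["lunch", "at", "noon"],
    ["see", "you", "at", "lunch"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = ToyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "money", "now"]))  # spam
print(model.predict(["lunch", "at", "noon"]))   # ham
```

Scoring with sums of logarithms rather than a product of probabilities is what keeps long messages from underflowing to zero.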
{
"cell_type": "code",
"execution_count": 4,
@@ -248,7 +319,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Real Dataset"
"## Real Dataset\n",
"\n",
"The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS Spam research.\n",
"It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam. "
]
},
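The loading code is not visible in this diff. As a sketch, assuming the dataset is stored as one message per line in a tab-separated `label<TAB>message` layout, it could be parsed as below. The helper name `load_sms` is hypothetical and the two sample lines are made up for illustration, not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the assumed "label<TAB>message" layout.
sample = "ham\tSee you at lunch\nspam\tWIN a FREE prize now\n"


def load_sms(handle):
    """Read (label, message) pairs from a tab-separated text stream."""
    # QUOTE_NONE treats quotation marks inside messages as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


dataset = load_sms(io.StringIO(sample))
print(dataset)
# [('ham', 'See you at lunch'), ('spam', 'WIN a FREE prize now')]
```

For a file on disk, the same function can be given a handle from `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the dataset was saved.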
{
161 changes: 124 additions & 37 deletions notebooks/notebook_02.ipynb

Large diffs are not rendered by default.
