Commit

Added more chapters
pgolding committed Mar 11, 2017
1 parent 24e860c commit bb6d021
Showing 3 changed files with 261 additions and 2 deletions.
11 changes: 10 additions & 1 deletion Dealing with Human Language/Identifying Words.ipynb
@@ -309,8 +309,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: I cheated here because the above method call returned an illegal exception that I was unable to debug (related to passing in the char_filter param). So I created the index using the above params "
"NOTE (and TO_DO): I cheated here because the above method call returned an illegal exception that I was unable to debug (related to passing in the char_filter param). So I created the index using the above params via the Kibana developer console before making the call."
]
},
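{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible workaround (not part of the original notebook): instead of passing char_filter as a keyword argument, the whole analysis configuration can be sent as a request body to `es.indices.create`. The index and analyzer names below are illustrative placeholders, not the exact settings used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Hedged sketch: create an index with the analysis settings passed as a body.\n",
"# 'workaround_index' and 'my_html_analyzer' are placeholder names for illustration.\n",
"settings = {\n",
"    'settings': {\n",
"        'analysis': {\n",
"            'analyzer': {\n",
"                'my_html_analyzer': {\n",
"                    'tokenizer': 'standard',\n",
"                    'char_filter': ['html_strip']\n",
"                }\n",
"            }\n",
"        }\n",
"    }\n",
"}\n",
"if not es.indices.exists('workaround_index'):\n",
"    es.indices.create(index='workaround_index', body=settings)"
]
},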
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
126 changes: 125 additions & 1 deletion Dealing with Human Language/Normalizing Tokens.ipynb
@@ -83,7 +83,131 @@
"\n",
"Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word. Would you search for déjà vu, or just for deja vu?\n",
"\n",
"This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job. Each receives the new token stream as output by the token filter before it."
"This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job. Each receives the new token stream as output by the token filter before it.\n",
"\n",
"#### In That Case\n",
"\n",
"The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:\n",
"\n",
"```\n",
"GET /_analyze?tokenizer=standard&filters=lowercase\n",
"The QUICK Brown FOX! \n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the,quick,brown,fox\n"
]
}
],
"source": [
"text = 'The QUICK Brown FOX!'# contains some uppercase words\n",
"analyzed_text = [x['token'] for x in es.indices.analyze\\\n",
" (tokenizer='standard', filter=['lowercase'], text=text)['tokens']]\n",
"print(','.join(analyzed_text))"
]
},
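{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch (not in the original notebook), token filters can also be chained; each filter sees the stream produced by the previous one. For example, combining `lowercase` with the built-in `asciifolding` filter strips diacritics, so that déjà becomes deja, as discussed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sketch: chain two token filters (lowercase, then asciifolding);\n",
"# each filter receives the tokens emitted by the one before it.\n",
"accented_text = 'Déjà Vu'\n",
"tokens = es.indices.analyze(tokenizer='standard',\n",
"                            filter=['lowercase', 'asciifolding'],\n",
"                            text=accented_text)['tokens']\n",
"print(','.join(t['token'] for t in tokens))  # expected: deja,vu"
]
},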
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this automatic as part of the analysis process, we can create a custom analyzer:\n",
"```\n",
"PUT /my_index\n",
"{\n",
" \"settings\": {\n",
" \"analysis\": {\n",
" \"analyzer\": {\n",
" \"my_lowercaser\": {\n",
" \"tokenizer\": \"standard\",\n",
" \"filter\": [ \"lowercase\" ]\n",
" }\n",
" }\n",
" }\n",
" }\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# first delete the index from previous chapters, if it exists\n",
"if es.indices.exists('my_index'): \n",
" es.indices.delete('my_index')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#es.indices.create('my_index')\n",
"from elasticsearch_dsl import analyzer, Index\n",
"my_custom_analyzer = analyzer('my_lowercaser',\n",
" tokenizer='standard',\n",
" filter='lowercase')\n",
"i = Index('my_index')\n",
"i.analyzer(my_custom_analyzer)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'tokens': [{'end_offset': 3,\n",
" 'position': 0,\n",
" 'start_offset': 0,\n",
" 'token': 'the',\n",
" 'type': '<ALPHANUM>'},\n",
" {'end_offset': 9,\n",
" 'position': 1,\n",
" 'start_offset': 4,\n",
" 'token': 'quick',\n",
" 'type': '<ALPHANUM>'},\n",
" {'end_offset': 15,\n",
" 'position': 2,\n",
" 'start_offset': 10,\n",
" 'token': 'brown',\n",
" 'type': '<ALPHANUM>'},\n",
" {'end_offset': 19,\n",
" 'position': 3,\n",
" 'start_offset': 16,\n",
" 'token': 'fox',\n",
" 'type': '<ALPHANUM>'}]}"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es.indices.analyze(index='my_index', analyzer='my_lowercaser', text=text)"
]
},
{
126 changes: 126 additions & 0 deletions Dealing with Human Language/Reducing Words to Their Root Form.ipynb
@@ -0,0 +1,126 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Elasticsearch: The Definitive Guide - Python\n",
"\n",
"Following the examples in the book, here are Python snippets that achieve the same effect.\n",
"\n",
"Documentation for the Python libs:\n",
"\n",
"Low-level API:\n",
"\n",
"https://elasticsearch-py.readthedocs.io/en/master/index.html\n",
"\n",
"Expressive DSL API (more \"Pythonic\")\n",
"\n",
"http://elasticsearch-dsl.readthedocs.io/en/latest/index.html\n",
"\n",
"Github repo for DSL API:\n",
"\n",
"https://github.com/elastic/elasticsearch-dsl-py\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys, os\n",
"sys.path.insert(1, os.path.join(sys.path[0], '..'))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14 items created\n"
]
}
],
"source": [
"import index\n",
"from elasticsearch import Elasticsearch\n",
"from elasticsearch_dsl import Search, Q\n",
"from pprint import pprint\n",
"\n",
"es = Elasticsearch(\n",
" 'localhost',\n",
" # sniff before doing anything\n",
" sniff_on_start=True,\n",
" # refresh nodes after a node fails to respond\n",
" sniff_on_connection_fail=True,\n",
" # and also every 60 seconds\n",
" sniffer_timeout=60\n",
")\n",
"\n",
"r = index.populate()\n",
"print('{} items created'.format(len(r['items'])))\n",
"\n",
"# Let's repopulate the index as we deleted 'gb' in earlier chapters:\n",
"# Run the script: populate.ipynb"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Reducing Words to Their Root Form\n",
"\n",
"Most languages of the world are inflected, meaning that words can change their form to express differences in the following:\n",
"\n",
"* Number: fox, foxes\n",
"* Tense: pay, paid, paying\n",
"* Gender: waiter, waitress\n",
"* Person: hear, hears\n",
"* Case: I, me, my\n",
"* Aspect: ate, eaten\n",
"* Mood: so be it, were it so"
]
},
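{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hedged illustration (not from the book's example code at this point): an algorithmic stemming token filter such as the built-in `porter_stem` collapses some of these regular variations at analysis time, e.g. foxes to fox and paying to pay, while irregular forms such as paid are left untouched. The call below reuses the `es` client set up earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sketch: apply the porter_stem token filter to a few inflected forms.\n",
"# Expected tokens: fox, pay, paid (the irregular 'paid' is not reduced).\n",
"inflected_text = 'foxes paying paid'\n",
"tokens = es.indices.analyze(tokenizer='standard',\n",
"                            filter=['lowercase', 'porter_stem'],\n",
"                            text=inflected_text)['tokens']\n",
"print(','.join(t['token'] for t in tokens))"
]
},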
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
