Add the quick start notebook

nrpardo · Feb 13, 2018 · c7efdee · c7efdee
1 parent 6c7e307
commit c7efdee
Show file tree

Hide file tree

Showing 2 changed files with 310 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -40,6 +40,7 @@ pip-delete-this-directory.txt
 *.geojson
 missingno.ipynb
 *.ipynb
+!QuickStart.ipynb
 _map.html
 
 # Test cache

diff --git a/QuickStart.ipynb b/QuickStart.ipynb
@@ -0,0 +1,309 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Quick Start\n",
+    "This quickstart uses a sample of the [NYPD Motor Vehicle Collisions Dataset](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95) \n",
+    "dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -r requirements.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import quilt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "quilt.install('ResidentMario/missingno_data', force=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load the data into memory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from quilt.data.ResidentMario import missingno_data\n",
+    "\n",
+    "collisions = missingno_data.nyc_collision_factors()\n",
+    "collisions = collisions.replace(\"nan\", np.nan)\n",
+    "collisions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The rest of this walkthrough will draw from this `collisions` dataset. I additionally define **nullity** to mean \n",
+    "whether a particular variable is filled in or not.\n",
+    "\n",
+    "### Matrix\n",
+    "\n",
+    "The `msno.matrix` nullity matrix is a data-dense display which lets you quickly visually pick out patterns in\n",
+    " data completion."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import missingno as msno\n",
+    "%matplotlib inline\n",
+    "msno.matrix(collisions.sample(250))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At a glance, date, time, the distribution of injuries, and the contribution factor of the first vehicle appear to be\n",
+    "completely populated, while geographic information seems mostly complete, but spottier.\n",
+    "\n",
+    "The sparkline at right summarizes the general shape of the data completeness and points out the maximum and minimum\n",
+    "rows.\n",
+    "\n",
+    "This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap\n",
+    "or become unreadable, and by default large displays omit them.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you are working with time-series data, you can [specify a periodicity](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)\n",
+    "using the `freq` keyword parameter:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)\n",
+    "null_pattern = pd.DataFrame(null_pattern).replace({False: None})\n",
+    "msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')) , freq='BQ')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Bar Chart\n",
+    "\n",
+    "`msno.bar` is a simple visualization of nullity by column:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "msno.bar(collisions.sample(1000))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can switch to a logarithmic scale by specifying `log=True`. `bar` provides the same information as `matrix`, but in \n",
+    "a simpler format."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Heatmap\n",
+    "\n",
+    "The `missingno` correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "msno.heatmap(collisions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this example, it seems that reports which are filed with an `OFF STREET NAME` variable are less likely to have complete\n",
+    "geographic data.\n",
+    "\n",
+    "Nullity correlation ranges from `-1` (if one variable appears the other definitely does not) to `0` (variables appearing\n",
+    "or not appearing have no effect on one another) to `1` (if one variable appears the other definitely also does).\n",
+    "\n",
+    "Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization&mdash;in this case for instance the datetime and injury number columns, which are completely filled, are not included.\n",
+    "\n",
+    "Entries marked `<1` or `>-1` are have a correlation that is close to being exactingly negative or positive, but is\n",
+    "still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For\n",
+    "example, in this dataset the correlation between `VEHICLE CODE TYPE 3` and `CONTRIBUTING FACTOR VEHICLE 3` is `<1`,\n",
+    "indicating that, contrary to our expectation, there are a few records which have one or the other, but not both.\n",
+    "These cases will require special attention.\n",
+    "\n",
+    "The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power\n",
+    "is limited when it comes t"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Dendrogram\n",
+    "\n",
+    "The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise\n",
+    "ones visible in the correlation heatmap:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "msno.dendrogram(collisions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The dendrogram uses a [hierarchical clustering algorithm](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)\n",
+    "(courtesy of `scipy`) to bin variables against one another by their nullity correlation (measured in terms of\n",
+    "binary distance). At each step of the tree the variables are split up based on which combination minimizes the\n",
+    "distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to\n",
+    "zero, and the closer their average distance (the y-axis) is to zero.\n",
+    "\n",
+    "To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of\n",
+    "zero fully predict one another's presence&mdash;one variable might always be empty when another is filled, or they\n",
+    "might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the\n",
+    "variables which are required and therefore present in every record.\n",
+    "\n",
+    "Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If\n",
+    "your own interpretation of the dataset is that these columns actually *are* or *ought to be* match each other in\n",
+    "nullity (for example, as `CONTRIBUTING FACTOR VEHICLE 2` and `VEHICLE TYPE CODE 2` ought to), then the height of the\n",
+    "cluster leaf tells you, in absolute terms, how often the records are \"mismatched\" or incorrectly filed&mdash;that is,\n",
+    " how many values you would have to fill in or drop, if you are so inclined.\n",
+    "\n",
+    "As with `matrix`, only up to 50 labeled columns will comfortably display in this configuration. However the\n",
+    "`dendrogram` more elegantly handles extremely large datasets by simply flipping to a horizontal configuration."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Geoplot\n",
+    "\n",
+    "One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. `missingno`\n",
+    "supports visualizing geospatial data nullity patterns with a geoplot visualization. This is an experimental data \n",
+    "visualization type, and requires the [`geoplot`](https://github.com/ResidentMario/geoplot) and [`geopandas`](http://geopandas.org/) \n",
+    "libraries. These are optional dependencies are must be installed separately from the rest of `missingno`. Once you \n",
+    "have them you can run:\n",
+    "\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If no geographical context can be provided, `geoplot` will compute a\n",
+    "[quadtree](https://en.wikipedia.org/wiki/Quadtree) nullity distribution, as above, which splits the dataset into\n",
+    "statistically significant chunks and colorizes them based on the average nullity of data points within them. In this\n",
+    "case there is good evidence that the distribution of data nullity is mostly random: the number of values left blank \n",
+    "varies across the space by only 5 percent, and the differences look randomly distributed.\n",
+    "\n",
+    "Quadtrees have the advantage that they don't require any information about the space besides latitude/longitude \n",
+    "pairs. Given enough data (hundreds of thousands of records), \n",
+    "[a geoplot can even reconstruct the space being mapped](https://i.imgur.com/4dtGhig.png). It works less well for \n",
+    "small datasets like this sample one.\n",
+    "\n",
+    "If you can specify a geographic grouping within the dataset, you can plot your data as a set of minimum-enclosure \n",
+    "[convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead:\n",
+    "\n",
+    "    >>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE', by='ZIP CODE')\n",
+    "\n",
+    "![alt-text][hull-geoplot]\n",
+    "\n",
+    "[hull-geoplot]: https://i.imgur.com/osnPwEE.png\n",
+    "\n",
+    "Convex hulls are usually more interpretable than the quadtree, especially when the underlying dataset is relatively \n",
+    "small (as this one is). We again see a data nullity distribution that's seemingly at random, giving us confidence \n",
+    "that the nullity of collision records is not geographically variable.\n",
+    "\n",
+    "The `msno.geoplot` chart type extends the `aggplot` function in the `geoplot` package, and accepts keyword arguments \n",
+    "to the latter as parameters. [The `geoplot` documentation provides further details](https://residentmario.github.io/geoplot/index.html) \n",
+    "(including how to pick [a better map projection](https://i.imgur.com/0aaNa9Q.png)). For more advanced configuration \n",
+    "details for the rest of the plot types, refer to the `CONFIGURATION.md` file in this repository.\n",
+    "\n",
+    "That concludes our tour of `missingno`!\n",
+    "\n",
+    "## Contributing\n",
+    "\n",
+    "For thoughts on features or bug reports see [Issues](https://github.com/ResidentMario/missingno/issues). If \n",
+    "you're interested in contributing to this library, see details on doing so in the `CONTRIBUTING.md` file in this \n",
+    "repository."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}