Commit

Add the quick start notebook
Kevin Moore committed Feb 13, 2018
1 parent 6c7e307 commit c7efdee
Showing 2 changed files with 310 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -40,6 +40,7 @@ pip-delete-this-directory.txt
*.geojson
missingno.ipynb
*.ipynb
!QuickStart.ipynb
_map.html

# Test cache
309 changes: 309 additions & 0 deletions QuickStart.ipynb
@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start\n",
"This quickstart uses a sample of the [NYPD Motor Vehicle Collisions Dataset](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -r requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import quilt\n",
"import numpy as np\n",
"import pandas as pd\n",
"quilt.install('ResidentMario/missingno_data', force=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data into memory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from quilt.data.ResidentMario import missingno_data\n",
"\n",
"collisions = missingno_data.nyc_collision_factors()\n",
"collisions = collisions.replace(\"nan\", np.nan)\n",
"collisions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest of this walkthrough will draw from this `collisions` dataset. I additionally define **nullity** to mean \n",
"whether a particular variable is filled in or not.\n",
"\n",
"### Matrix\n",
"\n",
"The `msno.matrix` nullity matrix is a data-dense display which lets you quickly visually pick out patterns in\n",
" data completion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import missingno as msno\n",
"%matplotlib inline\n",
"msno.matrix(collisions.sample(250))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At a glance, date, time, the distribution of injuries, and the contributing factor of the first vehicle appear to be\n",
"completely populated, while geographic information seems mostly complete, but spottier.\n",
"\n",
"The sparkline at right summarizes the general shape of the data completeness and points out the maximum and minimum\n",
"rows.\n",
"\n",
"This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap\n",
"or become unreadable, and by default large displays omit them.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are working with time-series data, you can [specify a periodicity](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)\n",
"using the `freq` keyword parameter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
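"import pandas as pd\n",
"\n",
"# Random 50x20 boolean frame; every False is then turned into a missing value\n",
"null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)\n",
"null_pattern = pd.DataFrame(null_pattern).replace({False: None})\n",
"# Index the 50 rows by month so that `freq` can label the time axis\n",
"monthly_index = pd.period_range('1/1/2011', '2/1/2015', freq='M')\n",
"msno.matrix(null_pattern.set_index(monthly_index), freq='BQ')"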
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bar Chart\n",
"\n",
"`msno.bar` is a simple visualization of nullity by column:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msno.bar(collisions.sample(1000))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can switch to a logarithmic scale by specifying `log=True`. `bar` provides the same information as `matrix`, but in \n",
"a simpler format."
]
},
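{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts that `msno.bar` draws can also be computed directly with `pandas`. The cell below is a minimal sketch of\n",
"that idea on a small toy frame. The `toy` frame and its column names are hypothetical and are not part of the\n",
"collisions data, and this is not `missingno`'s own implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame; np.nan marks a missing value\n",
"toy = pd.DataFrame({\n",
"    'a': [1.0, np.nan, 3.0, 4.0],\n",
"    'b': [np.nan, np.nan, 'x', 'y'],\n",
"    'c': [1, 2, 3, 4],\n",
"})\n",
"\n",
"# Number and fraction of non-null entries in each column\n",
"filled = toy.notnull().sum()\n",
"completeness = toy.notnull().mean()\n",
"pd.DataFrame({'filled': filled, 'completeness': completeness})"
]
},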
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Heatmap\n",
"\n",
"The `missingno` correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msno.heatmap(collisions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, it seems that reports which are filed with an `OFF STREET NAME` variable are less likely to have complete\n",
"geographic data.\n",
"\n",
"Nullity correlation ranges from `-1` (if one variable appears the other definitely does not) to `0` (variables appearing\n",
"or not appearing have no effect on one another) to `1` (if one variable appears the other definitely also does).\n",
"\n",
"Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization—in this case for instance the datetime and injury number columns, which are completely filled, are not included.\n",
"\n",
"Entries marked `<1` or `>-1` have a correlation that is close to being exactly positive or negative, but is\n",
"still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For\n",
"example, in this dataset the correlation between `VEHICLE TYPE CODE 3` and `CONTRIBUTING FACTOR VEHICLE 3` is `<1`,\n",
"indicating that, contrary to our expectation, there are a few records which have one or the other, but not both.\n",
"These cases will require special attention.\n",
"\n",
"The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power\n",
"is limited when it comes to relationships involving more than two variables."
]
},
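{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of nullity correlation concrete, the cell below sketches one way to compute it with plain `pandas`:\n",
"take the missing/present indicator of each column, drop columns whose indicator never changes, and correlate the\n",
"rest. The `toy` frame and its column names are hypothetical, and this sketch is an illustration under those\n",
"assumptions rather than a statement of how `msno.heatmap` is implemented."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame; the `full` column is never missing\n",
"toy = pd.DataFrame({\n",
"    'x': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],\n",
"    'y': [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],\n",
"    'z': [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],\n",
"    'full': [1, 2, 3, 4, 5, 6],\n",
"})\n",
"\n",
"# True where a value is missing\n",
"nullity = toy.isnull()\n",
"\n",
"# Columns that are always full or always empty have zero variance,\n",
"# so their correlation is undefined; drop them first\n",
"varying = nullity.loc[:, nullity.nunique() > 1]\n",
"\n",
"# Pearson correlation of the 0/1 nullity indicators\n",
"varying.astype(int).corr()"
]
},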
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dendrogram\n",
"\n",
"The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise\n",
"ones visible in the correlation heatmap:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msno.dendrogram(collisions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dendrogram uses a [hierarchical clustering algorithm](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)\n",
"(courtesy of `scipy`) to bin variables against one another by their nullity correlation (measured in terms of\n",
"binary distance). At each step of the tree the variables are split up based on which combination minimizes the\n",
"distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to\n",
"zero, and the closer their average distance (the y-axis) is to zero.\n",
"\n",
"To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of\n",
"zero fully predict one another's presence&mdash;one variable might always be empty when another is filled, or they\n",
"might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the\n",
"variables which are required and therefore present in every record.\n",
"\n",
"Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If\n",
"your own interpretation of the dataset is that these columns actually *do* or *ought to* match each other in\n",
"nullity (for example, as `CONTRIBUTING FACTOR VEHICLE 2` and `VEHICLE TYPE CODE 2` ought to), then the height of the\n",
"cluster leaf tells you, in absolute terms, how often the records are \"mismatched\" or incorrectly filed&mdash;that is,\n",
" how many values you would have to fill in or drop, if you are so inclined.\n",
"\n",
"As with `matrix`, only up to 50 labeled columns will comfortably display in this configuration. However the\n",
"`dendrogram` more elegantly handles extremely large datasets by simply flipping to a horizontal configuration."
]
},
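{
"cell_type": "markdown",
"metadata": {},
"source": [
"The clustering step can be sketched directly with `scipy`. The cell below clusters the columns of a small toy frame\n",
"by their nullity pattern. The `toy` frame is hypothetical, and the choice of average linkage with Hamming distance\n",
"(the fraction of records on which two columns disagree) is an assumption made for this sketch, standing in for the\n",
"binary distance described above; `msno.dendrogram` may be configured differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from scipy.cluster import hierarchy\n",
"\n",
"# Hypothetical toy frame; the `full` column is never missing\n",
"toy = pd.DataFrame({\n",
"    'x': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],\n",
"    'y': [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],\n",
"    'z': [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],\n",
"    'full': [1, 2, 3, 4, 5, 6],\n",
"})\n",
"\n",
"# One row per variable, one column per record; 1 where the value is missing\n",
"nullity = toy.isnull().astype(int).values.T\n",
"\n",
"# Average-linkage clustering; the Hamming metric is an assumption of this sketch\n",
"links = hierarchy.linkage(nullity, method='average', metric='hamming')\n",
"\n",
"# no_plot=True returns the tree layout without drawing it\n",
"tree = hierarchy.dendrogram(links, labels=list(toy.columns), no_plot=True)\n",
"tree['ivl']  # leaf labels in the order they would be drawn"
]
},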
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Geoplot\n",
"\n",
"One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. `missingno`\n",
"supports visualizing geospatial data nullity patterns with a geoplot visualization. This is an experimental data \n",
"visualization type, and requires the [`geoplot`](https://github.com/ResidentMario/geoplot) and [`geopandas`](http://geopandas.org/) \n",
"libraries. These are optional dependencies and must be installed separately from the rest of `missingno`. Once you \n",
"have them you can run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If no geographical context can be provided, `geoplot` will compute a\n",
"[quadtree](https://en.wikipedia.org/wiki/Quadtree) nullity distribution, as above, which splits the dataset into\n",
"statistically significant chunks and colorizes them based on the average nullity of data points within them. In this\n",
"case there is good evidence that the distribution of data nullity is mostly random: the number of values left blank \n",
"varies across the space by only 5 percent, and the differences look randomly distributed.\n",
"\n",
"Quadtrees have the advantage that they don't require any information about the space besides latitude/longitude \n",
"pairs. Given enough data (hundreds of thousands of records), \n",
"[a geoplot can even reconstruct the space being mapped](https://i.imgur.com/4dtGhig.png). It works less well for \n",
"small datasets like this sample one.\n",
"\n",
"If you can specify a geographic grouping within the dataset, you can plot your data as a set of minimum-enclosure \n",
"[convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead:\n",
"\n",
" >>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE', by='ZIP CODE')\n",
"\n",
"![alt-text][hull-geoplot]\n",
"\n",
"[hull-geoplot]: https://i.imgur.com/osnPwEE.png\n",
"\n",
"Convex hulls are usually more interpretable than the quadtree, especially when the underlying dataset is relatively \n",
"small (as this one is). We again see a data nullity distribution that appears random, giving us confidence \n",
"that the nullity of collision records is not geographically variable.\n",
"\n",
"The `msno.geoplot` chart type extends the `aggplot` function in the `geoplot` package, and accepts keyword arguments \n",
"to the latter as parameters. [The `geoplot` documentation provides further details](https://residentmario.github.io/geoplot/index.html) \n",
"(including how to pick [a better map projection](https://i.imgur.com/0aaNa9Q.png)). For more advanced configuration \n",
"details for the rest of the plot types, refer to the `CONFIGURATION.md` file in this repository.\n",
"\n",
"That concludes our tour of `missingno`!\n",
"\n",
"## Contributing\n",
"\n",
"For thoughts on features or bug reports see [Issues](https://github.com/ResidentMario/missingno/issues). If \n",
"you're interested in contributing to this library, see details on doing so in the `CONTRIBUTING.md` file in this \n",
"repository."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}