Finish geoplot rewrite.

nrpardo · Feb 4, 2018 · 6c1d5a6
1 parent c6cedbd
commit 6c1d5a6
Showing 3 changed files with 64 additions and 50 deletions.
diff --git a/README.md b/README.md
@@ -146,53 +146,49 @@ As with `matrix`, only up to 50 labeled columns will comfortably display in this
 
 ### Geoplot
 
-One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. The geoplot
- makes this easy:
+One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. `missingno`
+supports visualizing geospatial data nullity patterns with a geoplot visualization. This is an experimental data 
+visualization type, and requires the [`geoplot`](https://github.com/ResidentMario/geoplot) and [`geopandas`](http://geopandas.org/) 
+libraries. These are optional dependencies and must be installed separately from the rest of `missingno`. Once you 
+have them you can run:
 
-    >>> msno.geoplot(collisions.sample(100000), x='LONGITUDE', y='LATITUDE')
+    >>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE')
 
 ![alt-text][large-geoplot]
 
-[large-geoplot]: http://i.imgur.com/4dtGhig.png
+[large-geoplot]: https://i.imgur.com/glZonpD.png
 
-If no geographical context can be provided, `geoplot` can be used to compute a
+If no geographical context can be provided, `geoplot` will compute a
 [quadtree](https://en.wikipedia.org/wiki/Quadtree) nullity distribution, as above, which splits the dataset into
 statistically significant chunks and colorizes them based on the average nullity of data points within them. In this
-case (fortunately for analysis, but unfortunately for the purposes of demonstration) it appears that our dataset's
-data nullity is unaffected by geography.
+case there is good evidence that the distribution of data nullity is mostly random: the number of values left blank 
+varies across the space by only 5 percent, and the differences look randomly distributed.
 
-A quadtree analysis works remarkably well in most cases, but will not always be what you want. If you can specify a
-geographic grouping within the dataset (using the `by` keyword argument), you can plot your data as a set of
-minimum-enclosure [convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead (the following example also
-demonstrates adding a histogram to the display, using the `histogram=True` argument):
+Quadtrees have the advantage that they don't require any information about the space besides latitude/longitude 
+pairs. Given enough data (hundreds of thousands of records), 
+[a geoplot can even reconstruct the space being mapped](https://i.imgur.com/4dtGhig.png). It works less well for 
+small datasets like this sample one.
 
-    >>> msno.geoplot(collisions.sample(100000), x='LONGITUDE', y='LATITUDE', by='ZIP CODE', histogram=True)
+If you can specify a geographic grouping within the dataset, you can plot your data as a set of minimum-enclosure 
+[convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead:
 
-![alt-text][hull-geoplot]
-
-[hull-geoplot]: http://i.imgur.com/3kfKMJO.png
-
-Finally, if you have the *actual* geometries of your grouping (in the form of a `dict` or `pandas` `Series` of
-`shapely.Geometry` or `shapely.MultiPolygon` objects), you can dispense with all of this approximation and just plot
-*exactly* what you mean:
+    # msno.geoplot will fail if some groups have only one or two values; you must remove these yourself.
+    >>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE', by='ZIP CODE')
 
-    >>> msno.geoplot(collisions.sample(1000), x='LONGITUDE', y='LATITUDE', by='BOROUGH', geometry=geom)
-
-![alt-text][true-geoplot]
-
-[true-geoplot]: http://i.imgur.com/fAyxqnk.png
+![alt-text][hull-geoplot]
 
-In this case this is the least interesting result of all.
+[hull-geoplot]: https://i.imgur.com/RALL9d9.png
 
-Two technical notes:
-* For the geographically inclined, this a [plat carre](https://en.wikipedia.org/wiki/Equirectangular_projection)
-projection&mdash;that is, none at all. Not pretty, but functional.
-* `geoplot` requires the [`shapely`](https://github.com/Toblerity/Shapely) and [`descartes`](https://pypi.python.org/pypi/descartes) libraries, which are
-ancillary to the rest of this package and are thus optional dependencies.
+Convex hulls are usually more interpretable than the quadtree, especially when the underlying dataset is relatively 
+small (as this one is). We again see a data nullity distribution that's seemingly random, giving us confidence 
+that the nullity of collision records is not geographically variable.
 
-That concludes our tour of `missingno`.
+The `msno.geoplot` chart type extends the `aggplot` function in the `geoplot` package, and accepts keyword arguments 
+to the latter as parameters. [The `geoplot` documentation provides further details](https://residentmario.github.io/geoplot/index.html) 
+(including how to pick [a better map projection](https://i.imgur.com/KSryo6o.png)). For more advanced configuration 
+details for the rest of the plot types, refer to the `CONFIGURATION.md` file in this repository.
 
-For more advanced configuration details, refer to the `CONFIGURATION.md` file in this repository.
+That concludes our tour of `missingno`!
 
 ## Contributing
 

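The quadtree mode described in the README hunk above can be illustrated with a small standalone sketch. This is not the `geoplot` implementation: the function name `quadtree_nullity`, the `max_points` and `max_depth` thresholds, and the sample points are all assumptions made for illustration. It splits a bounding box into quadrants until each leaf holds few enough points, then reports the average nullity per leaf.

```python
from statistics import mean


def quadtree_nullity(points, bounds, max_points=2, depth=0, max_depth=8):
    """Hypothetical helper: average nullity per quadtree leaf.

    points: list of (x, y, nullity) tuples, with nullity in [0, 1]
    bounds: (xmin, ymin, xmax, ymax) of the current cell
    Returns a list of (bounds, mean_nullity) pairs for non-empty leaves.
    """
    if not points:
        return []
    # Stop splitting once the cell is small enough (assumed stopping rule).
    if len(points) <= max_points or depth >= max_depth:
        return [(bounds, mean(p[2] for p in points))]

    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    # Keys are (is_right_half, is_upper_half).
    quadrants = {
        (False, False): (xmin, ymin, xmid, ymid),
        (True, False): (xmid, ymin, xmax, ymid),
        (False, True): (xmin, ymid, xmid, ymax),
        (True, True): (xmid, ymid, xmax, ymax),
    }
    buckets = {key: [] for key in quadrants}
    for x, y, nullity in points:
        buckets[(x >= xmid, y >= ymid)].append((x, y, nullity))

    leaves = []
    for key, quad_bounds in quadrants.items():
        leaves.extend(
            quadtree_nullity(buckets[key], quad_bounds, max_points, depth + 1, max_depth)
        )
    return leaves


# Made-up sample points: (x, y, fraction of missing values in that record).
sample_points = [
    (0.1, 0.1, 0.0),
    (0.2, 0.2, 0.5),
    (0.8, 0.8, 1.0),
    (0.9, 0.9, 0.5),
    (0.9, 0.1, 0.0),
]
leaves = quadtree_nullity(sample_points, bounds=(0.0, 0.0, 1.0, 1.0))
for cell, nullity in leaves:
    print(cell, nullity)
```

Each leaf would then be colored by its average nullity. How the real quadtree decides when a chunk is "statistically significant" is governed by the `geoplot` package, not by this sketch.
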
diff --git a/missingno/missingno.py b/missingno/missingno.py
@@ -6,6 +6,7 @@
 import seaborn as sns
 import pandas as pd
 from .utils import nullity_filter, nullity_sort
+import warnings
 
 
 def matrix(df,
@@ -435,7 +436,7 @@ def _calculate_geographic_nullity(geo_group, x_col, y_col):
     entries = point_groups.size()
     width = len(geo_group.columns)
     # Remove empty (NaN, NaN) points.
-    if len(entries) > 0:  # explicit check to avoid a Runtime Warning
+    if len(entries) > 0:  # explicit check to avoid a RuntimeWarning
         geographic_nullity = np.average(1 - counts / width / entries)
         return points, geographic_nullity
     else:
@@ -444,14 +445,30 @@ def _calculate_geographic_nullity(geo_group, x_col, y_col):
 
 def geoplot(df,
             filter=None, n=0, p=0, sort=None,
-            x=None, y=None, coordinates=None, figsize=(25, 10), inline=False,
-            by=None, cmap='Reds', vmin=0, vmax=1, **kwargs):
+            x=None, y=None, figsize=(25, 10), inline=False,
+            by=None, cmap='YlGn_r', **kwargs):
     """
     Generates a geographical data nullity heatmap, which shows the distribution of missing data across geographic
     regions. The precise output depends on the inputs provided. If no geographical context is provided, a quadtree
     is computed and nullities are rendered as abstract geographic squares. If geographical context is provided in the
     form of a column of geographies (region, borough, ZIP code, etc.) in the `DataFrame`, convex hulls are computed
     for each of the point groups and the heatmap is generated within them.
+
+    :param df: The DataFrame whose completeness is being geoplotted.
+    :param filter: The filter to apply to the heatmap. Should be one of "top", "bottom", or None (default).
+    :param sort: The sort to apply to the heatmap. Should be one of "ascending", "descending", or None.
+    :param n: The cap on the number of columns to include in the filtered DataFrame.
+    :param p: The cap on the percentage fill of the columns in the filtered DataFrame.
+    :param figsize: The size of the figure to display. This is a `matplotlib` parameter which defaults to `(25, 10)`.
+    :param x: The variable in the dataset containing the x-coordinates of the dataset.
+    :param y: The variable in the dataset containing the y-coordinates of the dataset.
+    :param by: If specified, plot in convex hull mode, using the given column to cluster points in the same area. If
+    not specified, plot in quadtree mode.
+    :param cmap: The colormap to display the data with. Defaults to `YlGn_r`.
+    :param inline: Whether or not the figure is inline. If it is not, then instead of being plotted, this method
+    will return its figure.
+    :param kwargs: Additional keyword arguments are passed to the underlying `geoplot` function.
+    :return: If `inline` is False, the underlying `matplotlib.figure` object. Else, nothing.
     """
     import geoplot as gplt
     import geopandas as gpd
@@ -461,20 +478,27 @@ def geoplot(df,
     df = nullity_sort(df, sort=sort)
 
     nullity = df.isnull().sum(axis='columns') / df.shape[1]
-    if x and y and not coordinates:
+    if x and y:
         gdf = gpd.GeoDataFrame(nullity, columns=['nullity'],
                                geometry=df.apply(lambda srs: Point(srs[x], srs[y]), axis='columns'))
-    elif coordinates and not x and not y:
-        gdf = gpd.GeoDataFrame(nullity, columns=['nullity'],
-                               geometry=df.apply(lambda srs: Point(*srs[coordinates]), axis='columns'))
     else:
-        raise ValueError("One of 'x' and 'y' OR 'coordinates' must be specified, and they cannot be specified "
-                         "simultaneously.")
+        raise ValueError("The 'x' and 'y' parameters must be specified.")
 
     if by:
+        if df[by].isnull().any():
+            warnings.warn('The "{0}" column included null values. The offending records were dropped.'.format(by))
+            df = df.dropna(subset=[by])
+            gdf = gdf.loc[df.index]
+
+        vc = df[by].value_counts()
+        if (vc < 3).any():
+            warnings.warn('Grouping by "{0}" included clusters with fewer than three points, which cannot be made '
+                          'polygonal. The offending records were dropped.'.format(by))
+            where = df[by].isin(vc[vc > 2].index)
+            gdf = gdf.loc[where]
         gdf[by] = df[by]
 
-    gplt.aggplot(gdf, figsize=figsize, hue='nullity', agg=np.average, cmap=cmap, vmin=vmin, vmax=vmax, by=by, **kwargs)
+    gplt.aggplot(gdf, figsize=figsize, hue='nullity', agg=np.average, cmap=cmap, by=by, edgecolor='None', **kwargs)
     ax = plt.gca()
 
     if inline:

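The group clean-up added to the `by` branch above can be tried in isolation. The following is a minimal sketch using only `pandas`. The helper name `drop_unhullable_groups`, its `min_points` parameter, and the sample records are hypothetical, and unlike the commit it filters a plain `DataFrame` rather than a `GeoDataFrame`.

```python
import warnings

import pandas as pd


def drop_unhullable_groups(df, by, min_points=3):
    """Hypothetical helper: keep only rows whose group can form a polygon.

    Rows with a null `by` value are dropped, then rows belonging to groups
    with fewer than `min_points` members are dropped.
    """
    if df[by].isnull().any():
        warnings.warn('The "{0}" column included null values. '
                      'The offending records were dropped.'.format(by))
        df = df.dropna(subset=[by])

    counts = df[by].value_counts()
    too_small = counts[counts < min_points].index
    if len(too_small) > 0:
        warnings.warn('Grouping by "{0}" included clusters with fewer than {1} '
                      'points. The offending records were dropped.'.format(by, min_points))
        df = df[~df[by].isin(too_small)]
    return df


# Made-up records: one group of three, one of two, one null key, one singleton.
records = pd.DataFrame({
    "ZIP CODE": ["10001", "10001", "10001", "10002", "10002", None, "10003"],
    "LONGITUDE": [-73.99, -73.98, -73.97, -73.96, -73.95, -73.94, -73.93],
    "LATITUDE": [40.75, 40.76, 40.74, 40.71, 40.72, 40.73, 40.70],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_unhullable_groups(records, by="ZIP CODE")

print(kept)
```

Only the three-point group survives, since a convex hull with area needs at least three points.
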
diff --git a/tests/viz_tests.py b/tests/viz_tests.py
@@ -125,22 +125,16 @@ def test_method_dendrogram(self):
 
 
 class TestGeoplot(unittest.TestCase):
-    """Smoke tests only. The main function operations are handled by and tested in the `geoplot` package."""
+    """Integration tests only. The main function operations are handled by and tested in the `geoplot` package."""
     # TODO: Add more tests.
 
     def setUp(self):
         np.random.seed(42)
         simple_df = pd.DataFrame((np.random.random((20, 10))), columns=range(0, 10))
         simple_df = simple_df.add_prefix("r")
         self.x_y_df = simple_df
-        self.coord_df = simple_df.assign(coords=simple_df.apply(lambda srs: (srs['r0'], srs['r1']), axis='columns'))
 
     @pytest.mark.mpl_image_compare
-    def test_x_y_geoplot(self):
+    def test_geoplot_quadtree(self):
         msno.geoplot(self.x_y_df, x='r0', y='r1')
         return plt.gcf()
-
-    @pytest.mark.mpl_image_compare
-    def test_coordinates_geoplot(self):
-        msno.geoplot(self.coord_df, coordinates='coords')
-        return plt.gcf()
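
The `TODO: Add more tests` note above could be addressed with a fixture for the convex hull mode. Below is one possible sketch, assuming a hypothetical helper `make_grouped_df` and a hypothetical `group` column (neither is part of this commit). It follows the seeded random data pattern of `setUp` and guarantees at least three points per group, which the `by` branch needs.

```python
import numpy as np
import pandas as pd


def make_grouped_df(n_groups=4, points_per_group=5, n_cols=10, seed=42):
    """Hypothetical test helper: seeded random data plus a `group` label column."""
    rng = np.random.RandomState(seed)
    n_rows = n_groups * points_per_group
    df = pd.DataFrame(rng.random_sample((n_rows, n_cols)), columns=range(n_cols))
    df = df.add_prefix("r")
    # Assign consecutive blocks of rows to the same group label.
    df["group"] = np.repeat(np.arange(n_groups), points_per_group)
    return df


grouped_df = make_grouped_df()
# A convex hull test could then mirror test_geoplot_quadtree:
#     msno.geoplot(grouped_df, x='r0', y='r1', by='group')
```
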