Finish geoplot rewrite.
ResidentMario committed Feb 4, 2018
1 parent c6cedbd commit 6c1d5a6
Showing 3 changed files with 64 additions and 50 deletions.
60 changes: 28 additions & 32 deletions README.md
@@ -146,53 +146,49 @@ As with `matrix`, only up to 50 labeled columns will comfortably display in this

### Geoplot

One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. The geoplot
makes this easy:
One kind of pattern that's particularly difficult to check, where it appears, is geographic distribution. `missingno`
supports visualizing geospatial data nullity patterns with a geoplot visualization. This is an experimental data
visualization type, and requires the [`geoplot`](https://github.com/ResidentMario/geoplot) and [`geopandas`](http://geopandas.org/)
libraries. These are optional dependencies and must be installed separately from the rest of `missingno`. Once you
have them you can run:

>>> msno.geoplot(collisions.sample(100000), x='LONGITUDE', y='LATITUDE')
>>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE')

![alt-text][large-geoplot]

[large-geoplot]: http://i.imgur.com/4dtGhig.png
[large-geoplot]: https://i.imgur.com/glZonpD.png

If no geographical context can be provided, `geoplot` can be used to compute a
If no geographical context can be provided, `geoplot` will compute a
[quadtree](https://en.wikipedia.org/wiki/Quadtree) nullity distribution, as above, which splits the dataset into
statistically significant chunks and colorizes them based on the average nullity of data points within them. In this
case (fortunately for analysis, but unfortunately for the purposes of demonstration) it appears that our dataset's
data nullity is unaffected by geography.
case there is good evidence that the distribution of data nullity is mostly random: the number of values left blank
varies across the space by only 5 percent, and the differences look randomly distributed.
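
For context, the value averaged within each quadtree cell is a per-record nullity fraction. The following is a minimal sketch of that calculation, mirroring the `df.isnull().sum(axis='columns') / df.shape[1]` line in this commit's `missingno.py`. The toy `DataFrame` and its values are made up for illustration and are not taken from the collisions dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records standing in for the collisions dataset.
toy = pd.DataFrame({
    "LATITUDE": [40.7, 40.8, np.nan, 40.6],
    "LONGITUDE": [-73.9, -74.0, np.nan, -73.8],
    "BOROUGH": ["QUEENS", None, None, "BROOKLYN"],
    "ZIP CODE": [11101, None, None, None],
})

# Fraction of columns left blank in each row: 0.0 means fully populated,
# 1.0 means every column is missing.
row_nullity = toy.isnull().sum(axis="columns") / toy.shape[1]
print(row_nullity.tolist())
```

Each cell (or hull) is then colored by the average of these per-record values for the points that fall inside it.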

A quadtree analysis works remarkably well in most cases, but will not always be what you want. If you can specify a
geographic grouping within the dataset (using the `by` keyword argument), you can plot your data as a set of
minimum-enclosure [convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead (the following example also
demonstrates adding a histogram to the display, using the `histogram=True` argument):
Quadtrees have the advantage that they don't require any information about the space besides latitude/longitude
pairs. Given enough data (hundreds of thousands of records),
[a geoplot can even reconstruct the space being mapped](https://i.imgur.com/4dtGhig.png). It works less well for
small datasets like this sample one.

>>> msno.geoplot(collisions.sample(100000), x='LONGITUDE', y='LATITUDE', by='ZIP CODE', histogram=True)
If you can specify a geographic grouping within the dataset, you can plot your data as a set of minimum-enclosure
[convex hulls](https://en.wikipedia.org/wiki/Convex_hull) instead:

![alt-text][hull-geoplot]

[hull-geoplot]: http://i.imgur.com/3kfKMJO.png

Finally, if you have the *actual* geometries of your grouping (in the form of a `dict` or `pandas` `Series` of
`shapely.Geometry` or `shapely.MultiPolygon` objects), you can dispense with all of this approximation and just plot
*exactly* what you mean:
# msno.geoplot will fail if some groups have only one or two values; you must remove these yourself.
>>> msno.geoplot(collisions, x='LONGITUDE', y='LATITUDE', by='ZIP CODE')

>>> msno.geoplot(collisions.sample(1000), x='LONGITUDE', y='LATITUDE', by='BOROUGH', geometry=geom)

![alt-text][true-geoplot]

[true-geoplot]: http://i.imgur.com/fAyxqnk.png
![alt-text][hull-geoplot]

In this case this is the least interesting result of all.
[hull-geoplot]: https://i.imgur.com/RALL9d9.png

Two technical notes:
* For the geographically inclined, this is a [plate carrée](https://en.wikipedia.org/wiki/Equirectangular_projection)
projection—that is, none at all. Not pretty, but functional.
* `geoplot` requires the [`shapely`](https://github.com/Toblerity/Shapely) and [`descartes`](https://pypi.python.org/pypi/descartes) libraries, which are
ancillary to the rest of this package and are thus optional dependencies.
Convex hulls are usually more interpretable than the quadtree, especially when the underlying dataset is relatively
small (as this one is). We again see a data nullity distribution that appears random, giving us confidence
that the nullity of collision records does not vary geographically.
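
The code comment in the convex hull example notes that groups with only one or two values must be removed before plotting, since a hull needs at least three points. One way to do that ahead of time is sketched below. `drop_small_groups` is a hypothetical helper, not part of `missingno`, and the toy data is invented for illustration:

```python
import pandas as pd


def drop_small_groups(df, by, min_points=3):
    """Hypothetical helper: keep rows whose `by` group has at least `min_points` records."""
    # Rows with no group label cannot be assigned to a hull, so drop them first.
    labeled = df.dropna(subset=[by])
    # Map each row to the size of the group it belongs to.
    group_sizes = labeled[by].map(labeled[by].value_counts())
    return labeled[group_sizes >= min_points]


# Invented sample: one ZIP code with three points, one with two, one unlabeled row.
toy = pd.DataFrame({
    "LATITUDE": [40.70, 40.71, 40.72, 40.80, 40.81, 40.90],
    "LONGITUDE": [-73.90, -73.91, -73.92, -74.00, -74.01, -73.80],
    "ZIP CODE": ["10001", "10001", "10001", "10002", "10002", None],
})

filtered = drop_small_groups(toy, by="ZIP CODE")
print(filtered["ZIP CODE"].tolist())
```

The filtered frame could then be passed to `msno.geoplot(filtered, x='LONGITUDE', y='LATITUDE', by='ZIP CODE')`. The `geoplot` function in this commit also tries to drop such groups itself and emits a warning when it does, so pre-filtering mainly makes the choice explicit.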

That concludes our tour of `missingno`.
The `msno.geoplot` chart type extends the `aggplot` function in the `geoplot` package, and accepts keyword arguments
to the latter as parameters. [The `geoplot` documentation provides further details](https://residentmario.github.io/geoplot/index.html)
(including how to pick [a better map projection](https://i.imgur.com/KSryo6o.png)). For more advanced configuration
details for the rest of the plot types, refer to the `CONFIGURATION.md` file in this repository.

For more advanced configuration details, refer to the `CONFIGURATION.md` file in this repository.
That concludes our tour of `missingno`!

## Contributing

44 changes: 34 additions & 10 deletions missingno/missingno.py
@@ -6,6 +6,7 @@
import seaborn as sns
import pandas as pd
from .utils import nullity_filter, nullity_sort
import warnings


def matrix(df,
@@ -435,7 +436,7 @@ def _calculate_geographic_nullity(geo_group, x_col, y_col):
entries = point_groups.size()
width = len(geo_group.columns)
# Remove empty (NaN, NaN) points.
if len(entries) > 0: # explicit check to avoid a Runtime Warning
if len(entries) > 0: # explicit check to avoid a RuntimeWarning
geographic_nullity = np.average(1 - counts / width / entries)
return points, geographic_nullity
else:
@@ -444,14 +445,30 @@ def _calculate_geographic_nullity(geo_group, x_col, y_col):

def geoplot(df,
filter=None, n=0, p=0, sort=None,
x=None, y=None, coordinates=None, figsize=(25, 10), inline=False,
by=None, cmap='Reds', vmin=0, vmax=1, **kwargs):
x=None, y=None, figsize=(25, 10), inline=False,
by=None, cmap='YlGn_r', **kwargs):
"""
Generates a geographical data nullity heatmap, which shows the distribution of missing data across geographic
regions. The precise output depends on the inputs provided. If no geographical context is provided, a quadtree
is computed and nullities are rendered as abstract geographic squares. If geographical context is provided in the
form of a column of geographies (region, borough, ZIP code, etc.) in the `DataFrame`, convex hulls are computed
for each of the point groups and the heatmap is generated within them.
:param df: The DataFrame whose completeness is being geoplotted.
:param filter: The filter to apply to the heatmap. Should be one of "top", "bottom", or None (default).
:param sort: The sort to apply to the heatmap. Should be one of "ascending", "descending", or None.
:param n: The cap on the number of columns to include in the filtered DataFrame.
:param p: The cap on the percentage fill of the columns in the filtered DataFrame.
:param figsize: The size of the figure to display. This is a `matplotlib` parameter which defaults to `(25, 10)`.
:param x: The variable in the dataset containing the x-coordinates of the dataset.
:param y: The variable in the dataset containing the y-coordinates of the dataset.
:param by: If specified, plot in convex hull mode, using the given column to cluster points in the same area. If
not specified, plot in quadtree mode.
:param cmap: The colormap to display the data with. Defaults to `YlGn_r`.
:param inline: Whether or not the figure is inline. If it's not then instead of getting plotted, this method will
return its figure.
:param kwargs: Additional keyword arguments are passed to the underlying `geoplot` function.
:return: If `inline` is False, the underlying `matplotlib.figure` object. Else, nothing.
"""
import geoplot as gplt
import geopandas as gpd
@@ -461,20 +478,27 @@ def geoplot(df,
df = nullity_sort(df, sort=sort)

nullity = df.isnull().sum(axis='columns') / df.shape[1]
if x and y and not coordinates:
if x and y:
gdf = gpd.GeoDataFrame(nullity, columns=['nullity'],
geometry=df.apply(lambda srs: Point(srs[x], srs[y]), axis='columns'))
elif coordinates and not x and not y:
gdf = gpd.GeoDataFrame(nullity, columns=['nullity'],
geometry=df.apply(lambda srs: Point(*srs[coordinates]), axis='columns'))
else:
raise ValueError("One of 'x' and 'y' OR 'coordinates' must be specified, and they cannot be specified "
"simultaneously.")
raise ValueError("The 'x' and 'y' parameters must be specified.")

if by:
if df[by].isnull().any():
warnings.warn('The "{0}" column included null values. The offending records were dropped.'.format(by))
df = df.dropna(subset=[by])
gdf = gdf.loc[df.index]

vc = df[by].value_counts()
if (vc < 3).any():
warnings.warn('Grouping by "{0}" included clusters with fewer than three points, which cannot be made '
'polygonal. The offending records were dropped.'.format(by))
where = df[by].isin(vc[vc > 2].index)  # keep only groups with at least three points
gdf = gdf.loc[where]
gdf[by] = df[by]

gplt.aggplot(gdf, figsize=figsize, hue='nullity', agg=np.average, cmap=cmap, vmin=vmin, vmax=vmax, by=by, **kwargs)
gplt.aggplot(gdf, figsize=figsize, hue='nullity', agg=np.average, cmap=cmap, by=by, edgecolor='None', **kwargs)
ax = plt.gca()

if inline:
10 changes: 2 additions & 8 deletions tests/viz_tests.py
@@ -125,22 +125,16 @@ def test_method_dendrogram(self):


class TestGeoplot(unittest.TestCase):
"""Smoke tests only. The main function operations are handled by and tested in the `geoplot` package."""
"""Integration tests only. The main function operations are handled by and tested in the `geoplot` package."""
# TODO: Add more tests.

def setUp(self):
np.random.seed(42)
simple_df = pd.DataFrame((np.random.random((20, 10))), columns=range(0, 10))
simple_df = simple_df.add_prefix("r")
self.x_y_df = simple_df
self.coord_df = simple_df.assign(coords=simple_df.apply(lambda srs: (srs['r0'], srs['r1']), axis='columns'))

@pytest.mark.mpl_image_compare
def test_x_y_geoplot(self):
def test_geoplot_quadtree(self):
msno.geoplot(self.x_y_df, x='r0', y='r1')
return plt.gcf()

@pytest.mark.mpl_image_compare
def test_coordinates_geoplot(self):
msno.geoplot(self.coord_df, coordinates='coords')
return plt.gcf()
