Move configuration details to a separate file.
ResidentMario committed Feb 3, 2018
1 parent 2951728 commit 7832e2d
Showing 2 changed files with 99 additions and 109 deletions.
96 changes: 96 additions & 0 deletions CONFIGURATION.md
@@ -0,0 +1,96 @@
# Advanced Configuration

## Sorting and filtering

`missingno` also provides utility functions for filtering records in your dataset based on completion. These are
useful in particular for filtering through and drilling down into particularly large datasets whose data nullity
issues might otherwise be very hard to visualize or understand.

Let's first apply a `nullity_filter()` to the data. The `filter` parameter controls which result set we
want: either `filter='top'` or `filter='bottom'`. The `n` parameter controls the maximum number of columns that you want:
so for example `n=5` makes sure we get *at most* five results. Finally, `p` controls the completeness cutoff, expressed
as a fraction. If `filter='bottom'`, then `p=0.9` makes sure that our columns are *at most* 90% complete; if
`filter='top'` we get columns which are *at least* 90% complete.

For example, the following query filters the data down to at most 15 columns which are not completely filled:

    >>> filtered_data = msno.nullity_filter(data, filter='bottom', n=15, p=0.999) # or filter='top'
    >>> msno.matrix(filtered_data.sample(250))

![alt text][matrix_filtered]

[matrix_filtered]: http://i.imgur.com/UF6hmL8.png
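
To make the `top`/`bottom`, `n` and `p` semantics concrete, here is a minimal sketch of the same kind of column filtering written directly against `pandas`. It is not the `missingno` implementation: the function name `filter_columns_by_completeness`, the `mode` argument and the toy `data` frame are all hypothetical, chosen only for illustration.

```python
import pandas as pd


def filter_columns_by_completeness(df, mode="top", p=0.0, n=0):
    """Hypothetical stand-in for the column filtering described above."""
    # Fraction of non-null values in each column, between 0.0 and 1.0.
    completeness = df.notna().mean()
    if mode == "top":
        # Keep columns that are at least p complete, then the n most complete.
        keep = completeness[completeness >= p] if p else completeness
        if n:
            keep = keep.nlargest(n)
    elif mode == "bottom":
        # Keep columns that are at most p complete, then the n least complete.
        keep = completeness[completeness <= p] if p else completeness
        if n:
            keep = keep.nsmallest(n)
    else:
        raise ValueError("mode must be 'top' or 'bottom'")
    # Return the surviving columns in their original order.
    return df.loc[:, [c for c in df.columns if c in keep.index]]


# Toy data: column completeness is a=1.0, b=0.25, c=0.5, d=0.75.
data = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, None, None, None],
    "c": [1, 2, None, None],
    "d": [1, 2, 3, None],
})
print(list(filter_columns_by_completeness(data, "top", p=0.7, n=5).columns))
print(list(filter_columns_by_completeness(data, "bottom", p=0.9, n=2).columns))
```

Note that `n` is only an upper bound in this sketch: if fewer columns pass the `p` cutoff, fewer are returned.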

`nullity_sort()` simply reshuffles your rows by completeness, in either `ascending` or `descending` order. Since it
doesn't affect the underlying data it's mainly useful for `matrix` visualization:


    >>> sorted_data = msno.nullity_sort(data, sort='descending') # or sort='ascending'
    >>> msno.matrix(sorted_data.sample(250))

![alt text][matrix_sorted]

[matrix_sorted]: http://i.imgur.com/qL6zNQj.png
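
Sorting rows by completeness can likewise be sketched in plain `pandas` and `numpy`. This is an illustrative approximation, not the library's code; `sort_rows_by_completeness`, its `order` argument and the `frame` example are hypothetical names.

```python
import numpy as np
import pandas as pd


def sort_rows_by_completeness(df, order="ascending"):
    """Hypothetical stand-in: reorder rows by their count of non-null values."""
    # Number of non-null cells in each row.
    counts = df.notna().sum(axis=1)
    # Row positions ordered from least to most complete.
    positions = np.argsort(counts.values, kind="stable")
    if order == "descending":
        positions = positions[::-1]
    elif order != "ascending":
        raise ValueError("order must be 'ascending' or 'descending'")
    # Only the row order changes; the values themselves are untouched.
    return df.iloc[positions]


# Rows 0, 1 and 2 hold 1, 3 and 2 non-null values respectively.
frame = pd.DataFrame({
    "x": [1.0, 1.0, 1.0],
    "y": [np.nan, 2.0, 2.0],
    "z": [np.nan, 3.0, np.nan],
})
print(sort_rows_by_completeness(frame, "descending").index.tolist())
```

Because the original index travels with each row, calling `sort_index()` on the result recovers the input order.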

These methods work inline within the visualization methods themselves. For instance, the following is perfectly valid:

    >>> msno.matrix(data.sample(250), filter='top', n=5, p=0.9, sort='ascending')

## Visual configuration
### Lesser parameters

Each of the visualizations provides a further set of lesser configuration parameters for visually tweaking the display.

`matrix`, `bar`, `heatmap`, `dendrogram`, and `geoplot` all provide:

* `figsize`: The size of the figure to display. This is a `matplotlib` parameter which defaults to `(20, 12)`, except
for large `dendrogram` visualizations, which compute a height on the fly based on the number of variables to display.
* `fontsize`: The figure's font size. The default is `16`.
* `labels`: Whether or not to display the column names. For `matrix` this defaults to `True` for `<=50` variables and
`False` for `>50`. It always defaults to `True` for `dendrogram` and `heatmap`.
* `inline`: Defaults to `True`, in which case the chart is plotted and nothing is returned. If this is set to `False`
the methods omit plotting and return their visualizations instead.

`matrix` also provides:
* `sparkline`: Set this to `False` to not draw the sparkline.
* `freq`: If you are working with timeseries data (a `pandas` `DataFrame` with a `PeriodIndex` or `DatetimeIndex`)
you can specify and display a [choice of offset](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases).
* `width_ratios`: The ratio of the width of the matrix to the width of the sparkline. Defaults to `(15, 1)`.
Does nothing if `sparkline=False`.
* `color`: The color of the filled columns. Defaults to `(0.25, 0.25, 0.25)`.

`bar` also provides:
* `log`: Set this to `True` to use a logarithmic scale.
* `color`: The color of the filled columns. Defaults to `(0.25, 0.25, 0.25)`.


`heatmap` also provides:
* `cmap`: What `matplotlib` [colormap](http://matplotlib.org/users/colormaps.html) to use. Defaults to `RdBu`.


`dendrogram` also provides:
* `orientation`: The [orientation](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram)
of the dendrogram. Defaults to `top` if there are `<=50` columns and `left` if there are more.
* `method`: The [linkage method](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage) `scipy.cluster.hierarchy` uses for clustering.
`average` is the default argument.

`geoplot` also provides:
* `x` AND `y` OR `coordinates`: A column of points (in either two columns or one) to plot. These are required.
* `by`: A column of values to group points by.
* `geometry`: A hash table (generally a `dict` or `pd.Series`) of the geometries of the groups being aggregated, if available.
* `cutoff`: The minimum number of observations per rectangle in the quadtree display. No effect if a different
display is used. Defaults to `min([50, 0.05*len(df)])`.
* `histogram`: Whether or not to plot the histogram. Defaults to `False`.

### Manipulation with matplotlib
If you are not satisfied with these admittedly basic configuration parameters, the display can be further manipulated
in any way you like using `matplotlib` post-facto.

The best way to do this is to specify `inline=False`, which will cause `missingno` to return the underlying
`matplotlib.figure.Figure` object. Anyone with sufficient knowledge of `matplotlib` operations and [the missingno source code](https://github.com/ResidentMario/missingno/blob/master/missingno/missingno.py)
can then tweak the display to their liking. For example, the following code will bump the size of the dendrogram
visualization's y-axis labels up from `20` to `30`:

    >>> mat = msno.dendrogram(collisions, inline=False)
    >>> mat.axes[0].tick_params(axis='y', labelsize=30)
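
The same kind of post-hoc tweak can be tried on any `matplotlib` figure without `missingno` installed. The sketch below builds a stand-in figure by hand (the bar chart and its labels are made up for illustration) and then adjusts it through `fig.axes[0]`, which is the access pattern used above on the returned object.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt

# Stand-in for a returned figure: a simple horizontal bar chart with
# y-axis labels drawn at size 20.
fig, ax = plt.subplots(figsize=(20, 12))
ax.barh(["col_a", "col_b", "col_c"], [3, 2, 1])
ax.tick_params(axis="y", labelsize=20)

# Post-hoc manipulation: reach the axes through the figure and bump the
# y-axis label size from 20 to 30.
fig.axes[0].tick_params(axis="y", labelsize=30)

sizes = [label.get_fontsize() for label in fig.axes[0].get_yticklabels()]
print(sizes)
```

Anything else reachable from the `Figure` object (titles, axis limits, colors) can be adjusted in the same way before saving or showing the figure.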
112 changes: 3 additions & 109 deletions README.md
@@ -190,118 +190,12 @@ projection&mdash;that is, none at all. Not pretty, but functional.
* `geoplot` requires the [`shapely`](https://github.com/Toblerity/Shapely) and [`descartes`](https://pypi.python.org/pypi/descartes) libraries, which are
ancillary to the rest of this package and are thus optional dependencies.

That concludes our tour of `missingno`!
That concludes our tour of `missingno`.

For further details take a look at [this blog post](http://www.residentmar.io/2016/06/12/null-and-missing-data-python.html).

## Other

<!--
Note that if you are running `matplotlib` in [inline plotting mode](http://www.scipy-lecture.org/intro/matplotlib/matplotlib.html#ipython-and-the-matplotlib-mode)
(as was done above) it will always plot at the end of the cell anyway, so if you do not want to plot the same
visualization multiple times you will want to do all of your manipulations in a single cell!
Note that this may not be as well-behaved as I would like it to be. I'm still testing configuration&mdash;if you have
any issues be sure to [file them](https://github.com/ResidentMario/missingno/issues).
-->
For more advanced configuration details, refer to the `CONFIGURATION.md` file in this repository.

## Contributing

For thoughts on features or bug reports see the [bug tracker](https://github.com/ResidentMario/missingno/issues). If
For thoughts on features or bug reports see [Issues](https://github.com/ResidentMario/missingno/issues). If
you're interested in contributing to this library, see details on doing so in the `CONTRIBUTING.md` file in this
repository.
