Refresh docs.
ResidentMario committed Jan 29, 2018
1 parent e2cc2a0 commit d938c44
Showing 4 changed files with 78 additions and 21 deletions.
36 changes: 36 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,36 @@
## Development

### Cloning

To work on `missingno` locally, you will need to clone it.

```bash
git clone https://github.com/ResidentMario/missingno.git
```

You can then create your own branch of the code, and
work on your changes for a pull request from there.

```bash
cd missingno
git checkout -B new-branch-name
```

### Environment

I strongly recommend creating a new virtual environment when working on `missingno` (i.e. not using the base system
Python). You can do so with either [`conda`](https://conda.io/) or `virtualenv`. Once you have a virtual environment
ready, I recommend running `pip install -e .` from the root folder of the repository on your local machine.
This will create an [editable install](https://pip.pypa.io/en/latest/reference/pip_install/#editable-installs) of
`missingno` suitable for tweaking and further development.

### Testing

`missingno` is a data visualization package, and test suites for data visualization in Python are still rather
finicky. An explicit test suite (likely using [`pytest-mpl`](https://github.com/matplotlib/pytest-mpl)) is still a TODO.

## Documentation

The Quickstart section of `README.md` is the principal documentation for this package. To edit the documentation I
recommend editing that file directly on GitHub, which will handle generating a fork and pull request for you once
your changes are made.
26 changes: 14 additions & 12 deletions README.md
@@ -190,7 +190,13 @@ projection—that is, none at all. Not pretty, but functional.
* `geoplot` requires the [`shapely`](https://github.com/Toblerity/Shapely) and [`descartes`](https://pypi.python.org/pypi/descartes) libraries, which are
ancillary to the rest of this package and are thus optional dependencies.

## Sorting and filtering
That concludes our tour of `missingno`!

For further details take a look at [this blog post](http://www.residentmar.io/2016/06/12/null-and-missing-data-python.html).

## Other

### Sorting and filtering

`missingno` also provides utility functions for filtering records in your dataset based on completion. These are
useful in particular for filtering through and drilling down into particularly large datasets whose data nullity
@@ -226,9 +232,8 @@ These methods work inline within the visualization methods themselves. For insta

>>> msno.matrix(data.sample(250), filter='top', n=5, p=0.9, sort='ascending')
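
To make the idea of filtering and sorting "based on completion" concrete, here is a rough sketch in plain `pandas`. It is not `missingno`'s implementation: the helper names `keep_most_complete` and `sort_rows_by_completeness` are hypothetical, and the meanings given to `n` (keep at most this many columns) and `p` (minimum fraction of filled entries) are my assumptions for the sake of illustration.

```python
import numpy as np
import pandas as pd


def keep_most_complete(df, n=0, p=0.0):
    """Hypothetical helper: keep columns that are at least `p` complete."""
    # Fraction of non-null entries in each column.
    completeness = df.notnull().mean()
    kept = completeness[completeness >= p]
    # Optionally keep only the `n` most complete of those columns.
    if n:
        kept = kept.sort_values(ascending=False, kind="mergesort").head(n)
    # Preserve the original column order.
    return df.loc[:, [col in kept.index for col in df.columns]]


def sort_rows_by_completeness(df, ascending=True):
    """Hypothetical helper: order rows by how many entries are filled in."""
    filled = df.count(axis="columns")  # non-null entries per row
    order = filled.sort_values(ascending=ascending, kind="mergesort").index
    return df.loc[order]


data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

filtered = keep_most_complete(data, n=2, p=0.5)  # drops the sparse column "c"
ordered = sort_rows_by_completeness(data)        # least complete rows first
print(list(filtered.columns))
print(list(ordered.index))
```

A stable sort is used so that rows (or columns) with equal completeness stay in their original relative order, which keeps the resulting display predictable.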

## Visual configuration

### Lesser parameters
### Visual configuration
#### Lesser parameters

Each of the visualizations provides a further set of lesser configuration parameters for visually tweaking the display.

@@ -274,7 +279,7 @@ of the dendrogram. Defaults to `top` if `<=50` columns and
display is used. Defaults to `min([50, 0.05*len(df)])`.
* `histogram`: Whether or not to plot the histogram. Defaults to `False`.

### Advanced configuration
#### Advanced configuration
If you are not satisfied with these admittedly basic configuration parameters, the display can be further manipulated
in any way you like using `matplotlib` post-facto.

@@ -295,13 +300,10 @@ Note that this may not be as well-behaved as I would like it to be. I'm still te
any issues be sure to [file them](https://github.com/ResidentMario/missingno/issues).
-->

## Further reading
* The way that `numpy` and `pandas` represent null data informs how one goes about working with such data in Python. It's an interesting subject and an important design limitation that's good to keep in mind in your work higher up the stack, so I wrote [an exploratory blog post on the subject](http://www.residentmar.io/2016/06/12/null-and-missing-data-python.html) that's well worth reading if you're into this sort of thing.
* For public reviews and third-party test drives of this library's capacities check out [this post](http://www.ultravioletanalytics.com/2016/05/20/investigating-missing-data-with-missingno/) and [this one](https://blog.modeanalytics.com/python-data-visualization-libraries/).
* For slightly more details on this module's ideation check out [this post on my personal blog](http://www.residentmar.io/2016/03/28/missingno.html).

## Contributing

Bugs? Thoughts? Feature requests? [Throw them at the bug tracker and I'll take a look](https://github.com/ResidentMario/missingno/issues).
For thoughts on features or bug reports see the [bug tracker](https://github.com/ResidentMario/missingno/issues). If
you're interested in contributing to this library, see details on doing so in the `CONTRIBUTING.md` file in this
repository.

As always I'm very interested in hearing feedback&mdash;reach out to me at `aleksey@residentmar.io`.
I'm keen on hearing feedback&mdash;reach out to me at `aleksey@residentmar.io` if you have it.
6 changes: 6 additions & 0 deletions paper.bib
@@ -0,0 +1,6 @@
@misc{missingno_archive,
author = {{missingno Archive}},
title = {Missingno: a missing data visualization suite},
doi = {},
howpublished = {\url{}}
}
31 changes: 22 additions & 9 deletions paper.md
@@ -10,20 +10,33 @@ authors:
affiliations:
- name: Independent
index: 1
date: 27 October 2017
date: 28 January 2018
bibliography: paper.bib
---

# Summary

- missingno is a Python package providing a suite of tools for visualizing and
understanding missing data in a dataset.
- This software is intended to be used for quickly and easily understanding the pattern of
missing entries in a dataset. This makes it useful and applicable to a wide variety of
problem domains, as datasets are rarely clean enough to not need at least some validating.
- The package is hosted [on GitHub](https://github.com/ResidentMario/missingno).
- I initialized this request due to an email inquiry asking me how best to cite this software
in an academic paper.
Algorithmic models and outputs are only as good as the data they are computed on. As the popular saying goes: garbage
in, garbage out. In tabular datasets, it is usually relatively easy to understand, at a glance, patterns of
missing data (or nullity) in individual rows, columns, and entries. However, it is far harder to see patterns in the
missingness of data that extend between them. Understanding such patterns in data is beneficial, if not outright
critical, to most applications.

missingno is a Python package for visualizing missing data. It works by converting tabular data matrices into boolean
masks based on whether individual entries contain data (which evaluates to true) or are left empty (which evaluates to
false). This "nullity matrix" is then exposed to user assessment through a variety of special-purpose data
visualizations. The simplest tools, the bar chart and matrix display, are literal translations of a data table's
nullity matrix, and are effective for snapshotting general patterns. A heatmap provides a method for examining
relationships within pairs of variables. Higher-cardinality data nullity correlations can be understood using a
hierarchically clustered dendrogram. Finally, geospatial data dependencies are viewable using an approach based on
the quadtree or convex hull algorithm.
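
The nullity matrix described above can be sketched directly in `pandas`. The example below is illustrative only and makes some assumptions: the toy table is invented, and Pearson correlation between mask columns is used as one plausible way to quantify the pairwise relationships a heatmap would display, which may differ from what missingno computes internally.

```python
import numpy as np
import pandas as pd

# Invented toy table: "x" and "y" are always missing together,
# "z" is missing exactly when they are present, "w" is fully populated.
data = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, np.nan, 3.0, np.nan],
    "z": [np.nan, 2.0, np.nan, 4.0],
    "w": [1.0, 2.0, 3.0, 4.0],
})

# Boolean nullity matrix: True where an entry contains data, False where empty.
nullity = data.notnull()

# Per-column counts of present entries (the information a bar chart conveys).
present_counts = nullity.sum()

# Columns that are always present (or always missing) have zero variance,
# so they are dropped before computing pairwise correlations.
varying = nullity.loc[:, nullity.nunique() > 1]
nullity_corr = varying.astype(int).corr()

print(nullity)
print(present_counts)
print(nullity_corr)
```

In this toy case the correlation between `x` and `y` is 1 (they go missing together) and between `x` and `z` is -1 (one is present exactly when the other is not), which is the kind of cross-column pattern that is hard to spot by eye in a large table.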

The visualizations are consciously designed to be as effective as possible
at uncovering missing data patterns between columns of data, and hence to help users build more effective
data models and pipelines. At the same time, the package is designed to be easy to use. The underlying packages involved
(numpy, pandas, scipy, matplotlib, and seaborn) are familiar parts of the core scientific Python ecosystem, and
hence very learnable and extensible. missingno works "out of the box" with a variety of data types and formats, and
provides an extremely compact API.

# References
