Skip to content

Commit

Permalink
Include calculations in README (ResidentMario#151)
Browse files Browse the repository at this point in the history
  • Loading branch information
ResidentMario authored Feb 20, 2022
1 parent 79feb2f commit 1ea4611
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,16 @@ In this example, it seems that reports which are filed with an `OFF STREET NAME`

Nullity correlation ranges from `-1` (if one variable appears the other definitely does not) to `0` (variables appearing or not appearing have no effect on one another) to `1` (if one variable appears the other definitely also does).

The exact algorithm used is:

```python
import numpy as np

# df is a pandas.DataFrame instance
df = df.iloc[:, [i for i, n in enumerate(np.var(df.isnull(), axis='rows')) if n > 0]]
corr_mat = df.isnull().corr()
```

Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization—in this case for instance the datetime and injury number columns, which are completely filled, are not included.

Entries marked `<1` or `>-1` have a correlation that is close to being exactingly negative or positive, but is still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between `VEHICLE CODE TYPE 3` and `CONTRIBUTING FACTOR VEHICLE 3` is `<1`, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.
Expand All @@ -98,6 +108,17 @@ The dendrogram uses a [hierarchical clustering algorithm](http://docs.scipy.org/
(courtesy of `scipy`) to bin variables against one another by their nullity correlation (measured in terms of
binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

The exact algorithm used is:

```python
from scipy.cluster import hierarchy
import numpy as np

# df is a pandas.DataFrame instance
x = np.transpose(df.isnull().astype(int).values)
z = hierarchy.linkage(x, method)
```

To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence&mdash;one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually *are* or *ought to be* match each other in nullity (for example, as `CONTRIBUTING FACTOR VEHICLE 2` and `VEHICLE TYPE CODE 2` ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed&mdash;that is, how many values you would have to fill in or drop, if you are so inclined.
Expand Down

0 comments on commit 1ea4611

Please sign in to comment.