
Commit

fixing horrible typos down to cell 16
jphall663 committed Nov 15, 2019
1 parent 0ed444b commit fa81320
Showing 1 changed file with 9 additions and 8 deletions.
17 changes: 9 additions & 8 deletions debugging_resid_analysis_redux.ipynb
@@ -234,10 +234,11 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The shorthand name `y` is assigned to the prediction target. `X` is assigned to all other input variables in the credit card default data except the row indentifier, `ID`."
"The shorthand name `y` is assigned to the prediction target. `X` is assigned to all other input variables in the credit card default data except the row identifier, `ID`."
]
},
{
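The cell above only describes the assignment in prose; a minimal sketch of what it could look like, assuming the credit card default data is already loaded into a Pandas DataFrame named `data` (the frame name is an assumption; the column names `ID` and `DEFAULT_NEXT_MONTH` come from the notebook text):

```python
# sketch only: `data` is assumed to be the credit card default DataFrame
y = 'DEFAULT_NEXT_MONTH'                                    # prediction target
X = [col for col in data.columns if col not in ['ID', y]]  # all inputs except the row identifier
```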
@@ -624,9 +625,9 @@
"\n",
"Monotonic relationships are much easier to explain to colleagues, bosses, customers, and regulators than more complex, non-monotonic relationships and monotonic relationships may also prevent overfitting and excess error due to variance for new data.\n",
"\n",
"To train a transparent monotonic classifier, contraints must be supplied to XGBoost that determine whether the learned relationship between an input variable and the prediction target `DEFAULT_NEXT_MONTH` will be increasing for increases in an input variable or decreasing for increases in an input variable. Pearson correlation provides a linear measure of the direction of the relationship between each input variable and the target. If the pair-wise Pearson correlation between an input and `DEFAULT_NEXT_MONTH` is positive, it will be constrained to have an increasing relationship with the predictions for `DEFAULT_NEXT_MONTH`. If the pair-wise Pearson correlation is negative, the input will be constrained to have a decreasing relationship with the predictions for `DEFAULT_NEXT_MONTH`. \n",
"To train a transparent monotonic classifier, constraints must be supplied to XGBoost that determine whether the learned relationship between an input variable and the prediction target `DEFAULT_NEXT_MONTH` will be increasing for increases in an input variable or decreasing for increases in an input variable. Pearson correlation provides a linear measure of the direction of the relationship between each input variable and the target. If the pair-wise Pearson correlation between an input and `DEFAULT_NEXT_MONTH` is positive, it will be constrained to have an increasing relationship with the predictions for `DEFAULT_NEXT_MONTH`. If the pair-wise Pearson correlation is negative, the input will be constrained to have a decreasing relationship with the predictions for `DEFAULT_NEXT_MONTH`. \n",
"\n",
"Constrainsts are supplied to XGBoost in the form of a Python tuple with length equal to the number of inputs. Each item in the tuple is associated with an input variable based on its index in the tuple. The first constraint in the tuple is associated with the first variable in the training data, the second constraint in the tuple is associated with the second variable in the training data, and so on. The constraints themselves take the form of a 1 for a positive relationship and a -1 for a negative relationship."
"Constraints are supplied to XGBoost in the form of a Python tuple with length equal to the number of inputs. Each item in the tuple is associated with an input variable based on its index in the tuple. The first constraint in the tuple is associated with the first variable in the training data, the second constraint in the tuple is associated with the second variable in the training data, and so on. The constraints themselves take the form of a 1 for a positive relationship and a -1 for a negative relationship."
]
},
{
@@ -819,7 +820,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The last column of the Pearson correlation matrix is transformed from a numeric column in a Pandas DataFrame into a Python tuple of `1`s and `-1`s that will be used to specifiy monotonicity constraints for each input variable in XGBoost. If the Pearson correlation between an input variable and `DEFAULT_NEXT_MONTH` is positive, a positive montonic relationship constraint is specified for that variable using `1`. If the correlation is negative, a negative monotonic constraint is specified using `-1`. (Specifying `0` indicates that no constraints should be used.) The resulting tuple will be passed to XGBoost when the GBM model is trained."
"The last column of the Pearson correlation matrix is transformed from a numeric column in a Pandas DataFrame into a Python tuple of `1`s and `-1`s that will be used to specify monotonicity constraints for each input variable in XGBoost. If the Pearson correlation between an input variable and `DEFAULT_NEXT_MONTH` is positive, a positive monotonic relationship constraint is specified for that variable using `1`. If the correlation is negative, a negative monotonic constraint is specified using `-1`. (Specifying `0` indicates that no constraints should be used.) The resulting tuple will be passed to XGBoost when the GBM model is trained."
]
},
{
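A minimal sketch of that transformation, assuming `data` is the training DataFrame, `y` the target name, and `X` the list of input column names defined earlier (the exact variable names are assumptions):

```python
import numpy as np

# pair-wise Pearson correlations between each input and the target
corr = data[X + [y]].corr()
# the sign of each correlation becomes the monotonicity constraint: 1, -1, or 0 for no constraint
mono_constraints = tuple(int(np.sign(corr.loc[col, y])) for col in X)
```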
@@ -892,7 +893,7 @@
"metadata": {},
"source": [
"#### Train XGBoost GBM classifier\n",
"To train an XGBoost classifier, the training and test data must be converted from Pandas DataFrames into SVMLight format. The `DMatrix()` function in the XGBoost package is used to convert the data. Many XGBoost tuning parameters must be specified as well. Typically a grid search would be performed to identify the best parameters for a given modeling task. For brevity's sake, a previously-discovered set of good tuning parameters are specified here. Notice that the monotonicity constraints are passed to XGBoost using the `monotone_constraints` parameter. Because gradient boosting methods typically resample training data, an additional random seed is also specified for XGBoost using the `seed` paramter to create reproducible predictions, error rates, and variable importance values. To calculate Shapley contributions to logloss in later sections of the notebook, early stopping is not used because it may be unsupported for calculating accurate contributions to logloss. Hence, a previously discovered best number of trees is used in this notebook."
"To train an XGBoost classifier, the training and test data must be converted from Pandas DataFrames into SVMLight format. The `DMatrix()` function in the XGBoost package is used to convert the data. Many XGBoost tuning parameters must be specified as well. Typically a grid search would be performed to identify the best parameters for a given modeling task. For brevity's sake, a previously-discovered set of good tuning parameters are specified here. Notice that the monotonicity constraints are passed to XGBoost using the `monotone_constraints` parameter. Because gradient boosting methods typically resample training data, an additional random seed is also specified for XGBoost using the `seed` parameter to create reproducible predictions, error rates, and variable importance values. To calculate Shapley contributions to logloss in later sections of the notebook, early stopping is not used because it may be unsupported for calculating accurate contributions to logloss. Hence, a previously discovered best number of trees is used in this notebook."
]
},
{
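A hedged sketch of the training call described above; the parameter values are illustrative rather than the notebook's tuned settings, and the frame names `train` and `test` are assumptions:

```python
import xgboost as xgb

dtrain = xgb.DMatrix(train[X], label=train[y])
dtest = xgb.DMatrix(test[X], label=test[y])

params = {
    'objective': 'binary:logistic',
    'eta': 0.05,                               # illustrative learning rate
    'max_depth': 5,                            # illustrative tree depth
    'monotone_constraints': mono_constraints,  # tuple built from the Pearson correlations
    'seed': 12345                              # reproducible resampling
}

# fixed number of trees rather than early stopping, as the text explains
model = xgb.train(params, dtrain, num_boost_round=100,
                  evals=[(dtrain, 'train'), (dtest, 'eval')])
```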
@@ -1768,7 +1769,7 @@
"metadata": {},
"source": [
"#### Plot P-R AUC with best F1 and cutoff\n",
"A P-R AUC curve is a curve formed by plotting precision and recall across different cutoffs. (P-R AUC curves are often a bit more robust to imbalanced target classes than ROC curves.) This curve shows some odd behavior for high precision and low recall, potentially indicating instability in model predictions, but is smooth and mostly well-behaved where the cutoff is selected. Generally a smooth tradeoff bewteen precision and recall is preferred as different cutoffs are considered."
"A P-R AUC curve is a curve formed by plotting precision and recall across different cutoffs. (P-R AUC curves are often a bit more robust to imbalanced target classes than ROC curves.) This curve shows some odd behavior for high precision and low recall, potentially indicating instability in model predictions, but is smooth and mostly well-behaved where the cutoff is selected. Generally a smooth tradeoff between precision and recall is preferred as different cutoffs are considered."
]
},
{
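A minimal sketch of how a best-F1 cutoff could be read off a P-R curve with scikit-learn, assuming `y_true` holds the test labels and `y_prob` the GBM's predicted probabilities (both names are assumptions):

```python
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # guard against divide-by-zero
best = f1[:-1].argmax()                                      # the last precision/recall pair has no threshold
print('best F1 = %.3f at cutoff = %.3f' % (f1[best], thresholds[best]))
```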
@@ -1894,7 +1895,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The confusion matrix shows that the GBM model is more accurate than not because the true positive and true negative cells contain the largest values by far. But the GBM model seems to make a large number of type I errors, or false positive predictions. False positives are a known disparity issue, because for complex reasons, many credit scoring and other models tend to overestimate the likelihood of non-reference groups - typically not white males - to default. The large amount of false positves could indicate that a lot of derserving people will not recieve loans that they could actually pay back according to this model. This is a potential pathology that should definitely be investigated further."
"The confusion matrix shows that the GBM model is more accurate than not because the true positive and true negative cells contain the largest values by far. But the GBM model seems to make a large number of type I errors, or false positive predictions. False positives are a known disparity issue, because for complex reasons, many credit scoring and other models tend to overestimate the likelihood of non-reference groups - typically not white males - to default. The large amount of false positives could indicate that a lot of deserving people will not receive loans that they could actually pay back according to this model. This is a potential pathology that should definitely be investigated further."
]
},
{
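For reference, a confusion matrix like the one discussed above could be tabulated as follows; `y_true`, `y_prob`, and `cutoff` are assumed to come from the earlier cells:

```python
import numpy as np
import pandas as pd

y_hat = (np.asarray(y_prob) > cutoff).astype(int)          # apply the selected cutoff
cm = pd.crosstab(pd.Series(np.asarray(y_true), name='actual'),
                 pd.Series(y_hat, name='predicted'))
print(cm)  # off-diagonal cells are the type I (false positive) and type II (false negative) errors
```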
@@ -1987,7 +1988,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Some high-magnitude outlying residuals are visible. Who are these customers? Why is the model so wrong about them? And are they somehow excerting undue influence on other predictions? The model could be retrained without these individuals and restested as a potentially remediation strategy. From a security standpoint it can be interesting to examine insider attacks and false negatives. These are people who where issued a loan, but did not pay it back. Are any of these employees, contractors, consultants, or their associates? Could they have poisoned the model's training data to receive the loan? Could they have changed the model's production scoring code to receive the loan?"
"Some high-magnitude outlying residuals are visible. Who are these customers? Why is the model so wrong about them? And are they somehow exerting undue influence on other predictions? The model could be retrained without these individuals and retested as a potentially remediation strategy. From a security standpoint it can be interesting to examine insider attacks and false negatives. These are people who where issued a loan, but did not pay it back. Are any of these employees, contractors, consultants, or their associates? Could they have poisoned the model's training data to receive the loan? Could they have changed the model's production scoring code to receive the loan?"
]
},
{
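A hedged sketch of how those outlying rows and false negatives could be pulled out for inspection; `test`, `y`, `y_prob`, and `cutoff` are assumed from earlier cells, and the per-row logloss residual is one reasonable choice of residual:

```python
import numpy as np

eps = 1e-12
# per-row logloss residual: large values flag the most badly-predicted customers
resid = -(test[y] * np.log(y_prob + eps) + (1 - test[y]) * np.log(1 - y_prob + eps))
outliers = test.assign(resid=resid).sort_values('resid', ascending=False).head(10)

# false negatives: predicted not to default (loan issued) but actually defaulted
false_negatives = test[(test[y] == 1) & (y_prob <= cutoff)]
```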
