Commit: Update notebooks

lesteve committed Oct 18, 2022
1 parent c254e90 commit aaeecc0

Showing 3 changed files with 98 additions and 42 deletions.
2 changes: 1 addition & 1 deletion notebooks/03_categorical_pipeline_sol_02.ipynb

@@ -250,7 +250,7 @@
 "<div class=\"admonition important alert alert-info\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Important</p>\n",
 "<p>Which encoder should I use?</p>\n",
-"<table border=\"1\" class=\"colwidths-auto docutils\">\n",
+"<table border=\"1\" class=\"docutils\">\n",
 "<thead valign=\"bottom\">\n",
 "<tr><th class=\"head\"></th>\n",
 "<th class=\"head\">Meaningful order</th>\n",
8 changes: 4 additions & 4 deletions notebooks/03_categorical_pipeline_visualization.ipynb

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# How to define a scikit-learn pipeline and visualize it"
+"# Visualizing scikit-learn pipelines in Jupyter"
 ]
 },
 {
@@ -22,7 +22,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### First we load the dataset"
+"## First we load the dataset"
 ]
 },
 {
@@ -86,7 +86,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Then we create the pipeline"
+"## Then we create the pipeline"
 ]
 },
 {
@@ -176,7 +176,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Finally we score the model"
+"## Finally we score the model"
 ]
 },
 {
130 changes: 93 additions & 37 deletions notebooks/ensemble_hyperparameters.ipynb

@@ -17,28 +17,12 @@
 "<div class=\"admonition caution alert alert-warning\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
 "<p class=\"last\">For the sake of clarity, no cross-validation will be used to estimate the\n",
-"testing error. We are only showing the effect of the parameters\n",
-"on the validation set of what should be the inner cross-validation.</p>\n",
+"variability of the testing error. We are only showing the effect of the\n",
+"parameters on the validation set of what should be the inner loop of a nested\n",
+"cross-validation.</p>\n",
 "</div>\n",
 "\n",
-"## Random forest\n",
-"\n",
-"The main parameter to tune for random forest is the `n_estimators` parameter.\n",
-"In general, the more trees in the forest, the better the generalization\n",
-"performance will be. However, it will slow down the fitting and prediction\n",
-"time. The goal is to balance computing time and generalization performance when\n",
-"setting the number of estimators when putting such learner in production.\n",
-"\n",
-"Then, we could also tune a parameter that controls the depth of each tree in\n",
-"the forest. Two parameters are important for this: `max_depth` and\n",
-"`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
-"Indeed, `max_depth` will enforce to have a more symmetric tree, while\n",
-"`max_leaf_nodes` does not impose such constraint.\n",
-"\n",
-"Be aware that with random forest, trees are generally deep since we are\n",
-"seeking to overfit each tree on each bootstrap sample because this will be\n",
-"mitigated by combining them altogether. Assembling underfitted trees (i.e.\n",
-"shallow trees) might also lead to an underfitted forest."
+"We will start by loading the California housing dataset."
 ]
 },
 {
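For readers skimming the diff without the notebook, the loading cell referenced above only appears as context (its tail, `data, target, random_state=0)`, is visible in the next hunk). Here is a minimal sketch of that pattern; the use of `fetch_california_housing` and the `data_train`/`data_test`/`target_train`/`target_test` names are assumptions for illustration, not shown in this diff:

```python
# Hedged reconstruction of the loading cell; everything except
# `random_state=0` is an assumption for illustration.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the California housing data as a DataFrame and a target Series.
data, target = fetch_california_housing(return_X_y=True, as_frame=True)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)
```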
@@ -56,6 +40,71 @@
 "    data, target, random_state=0)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Random forest\n",
+"\n",
+"The main parameter to select in random forest is `n_estimators`.\n",
+"In general, the more trees in the forest, the better the generalization\n",
+"performance will be. However, more trees also slow down the fitting and\n",
+"prediction time. The goal is to balance computing time and generalization\n",
+"performance when setting the number of estimators. Here, we fix\n",
+"`n_estimators=100`, which is already the default value.\n",
+"\n",
+"<div class=\"admonition caution alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
+"<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally results in a waste of\n",
+"computing power. We just need to ensure that it is large enough so that doubling\n",
+"its value does not lead to a significant improvement of the validation error.</p>\n",
+"</div>\n",
+"\n",
+"Instead, we can tune the hyperparameter `max_features`, which controls the\n",
+"size of the random subset of features to consider when looking for the best\n",
+"split when growing the trees: smaller values for `max_features` lead to\n",
+"more random trees with hopefully more uncorrelated prediction errors. However,\n",
+"if `max_features` is too small, predictions can be too random, even after\n",
+"averaging over the trees in the ensemble.\n",
+"\n",
+"If `max_features` is set to `None`, then this is equivalent to setting\n",
+"`max_features=n_features`, which means that the only source of randomness in\n",
+"the random forest is the bagging procedure."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"print(f\"In this case, n_features={len(data.columns)}\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can also tune the parameters that control the depth of each tree\n",
+"in the forest. Two parameters are important for this: `max_depth` and\n",
+"`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
+"Indeed, `max_depth` enforces the growth of a more symmetric tree, while\n",
+"`max_leaf_nodes` does not impose such a constraint. If `max_leaf_nodes=None`,\n",
+"then the number of leaf nodes is unlimited.\n",
+"\n",
+"The hyperparameter `min_samples_leaf` controls the minimum number of samples\n",
+"required to be at a leaf node. This means that a split point (at any depth) is\n",
+"only done if it leaves at least `min_samples_leaf` training samples in each of\n",
+"the left and right branches. A small value for `min_samples_leaf` means that\n",
+"some samples can become isolated when a tree is deep, promoting overfitting. A\n",
+"large value would prevent deep trees, which can lead to underfitting.\n",
+"\n",
+"Be aware that with random forest, trees are expected to be deep since we are\n",
+"seeking to overfit each tree on each bootstrap sample. Overfitting is\n",
+"mitigated when combining the trees together, whereas assembling underfitted\n",
+"trees (i.e. shallow trees) might lead to an underfitted forest."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -67,8 +116,9 @@
 "from sklearn.ensemble import RandomForestRegressor\n",
 "\n",
 "param_distributions = {\n",
-"    \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n",
-"    \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n",
+"    \"max_features\": [1, 2, 3, 5, None],\n",
+"    \"max_leaf_nodes\": [10, 100, 1000, None],\n",
+"    \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n",
 "}\n",
 "search_cv = RandomizedSearchCV(\n",
 "    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n",
Expand All @@ -88,15 +138,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can observe in our search that we are required to have a large\n",
"number of leaves and thus deep trees. This parameter seems particularly\n",
"impactful in comparison to the number of trees for this particular dataset:\n",
"with at least 50 trees, the generalization performance will be driven by the\n",
"number of leaves.\n",
"\n",
"Now we will estimate the generalization performance of the best model by\n",
"refitting it with the full training set and using the test set for scoring on\n",
"unseen data. This is done by default when calling the `.fit` method."
"We can observe in our search that we are required to have a large number of\n",
"`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n",
"impactful with respect to the other tuning parameters, but large values of\n",
"`min_samples_leaf` seem to reduce the performance of the model.\n",
"\n",
"In practice, more iterations of random search would be necessary to precisely\n",
"assert the role of each parameters. Using `n_iter=10` is good enough to\n",
"quickly inspect the hyperparameter combinations that yield models that work\n",
"well enough without spending too much computational resources. Feel free to\n",
"try more interations on your own.\n",
"\n",
"Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n",
"uses them to refit the model using the full training set. To estimate the\n",
"generalization performance of the best model it suffices to call `.score` on\n",
"the unseen data."
]
},
{
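The refit-then-score behavior described in the new markdown is worth spelling out. A minimal sketch, reusing the `search_cv` and split names assumed in the earlier sketches; `refit=True` is the scikit-learn default, so no extra step is needed:

```python
# RandomizedSearchCV with refit=True (the default) refits the best model on
# the whole training set at the end of fit, so evaluation is a single call.
test_score = search_cv.score(data_test, target_test)
print(f"Test R2 of the refitted best model: {test_score:.3f}")
```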
@@ -180,8 +236,8 @@
 "\n",
 "<div class=\"admonition caution alert alert-warning\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
-"<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that using early-stopping as\n",
-"in the previous exercise will be better.</p>\n",
+"<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that it is better\n",
+"to use <tt class=\"docutils literal\">early_stopping</tt> as done in Exercise M6.04.</p>\n",
 "</div>\n",
 "\n",
 "In this search, we see that the `learning_rate` is required to be large\n",
@@ -196,8 +252,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we estimate the generalization performance of the best model\n",
-"using the test set."
+"Now we estimate the generalization performance of the best model using the\n",
+"test set."
 ]
 },
 {
@@ -216,8 +272,8 @@
 "source": [
 "The mean test score in the held-out test set is slightly better than the score\n",
 "of the best model. The reason is that the final model is refitted on the whole\n",
-"training set and therefore, on more data than the inner cross-validated models\n",
-"of the grid search procedure."
+"training set and therefore on more data than the cross-validated models of\n",
+"the grid search procedure."
 ]
 }
 ],