Predefined "blind" well data as a measure of performance leads to overfitting #2

LukasMosser opened this issue Oct 18, 2016 · 12 comments

@LukasMosser
Contributor

The contest outlines that the same well as in the publication will be used to judge the performance of the proposed solutions. This can lead to overfitting, since prediction accuracy on that well effectively becomes the loss function that entries optimize.

Should another performance measure be used to compensate for overfitting?

@kwinkunks
Member

Thank you for raising this, Lukas.

You mean because, even without using the well explicitly in the model, parameters, features, etc., will be chosen to fit that well?

Any ideas for other measures, given that there are no more wells? Perhaps a
combination of 2 or 3 measures?

I guess part of the problem is that any public measure could be trained
towards. Is the only objective test a secret one?


@LukasMosser
Contributor Author

One idea could be to have a secret test well.

Another could be to hold out one well for testing, train on the remaining wells, and then average the scores over the ensemble of training and testing runs.

E.g.:
Available wells: A, B, C
First run: train on A, B => test on C => Score 1
Second run: train on B, C => test on A => Score 2
Third run: train on A, C => test on B => Score 3

Final score = arithmetic average of Score 1, Score 2, Score 3

I'm by no means an expert on this, but I'd guess this gives a better estimate of how well a method generalizes, rather than rewarding one set of parameters that is tuned specifically to perform well on the proposed blind test well X.
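
For concreteness, here is a minimal sketch of that leave-one-well-out scheme using scikit-learn. It assumes a training table with a 'Well Name' column, a 'Facies' label column, and numeric feature columns; the file name, feature list, and classifier choice are illustrative assumptions rather than anything prescribed by the contest.

```python
# Minimal sketch of leave-one-well-out cross-validation (assumptions noted above).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

df = pd.read_csv('facies_vectors.csv')  # assumed path to the labelled training data

feature_cols = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'NM_M', 'RELPOS']  # illustrative
X = df[feature_cols].values
y = df['Facies'].values
groups = df['Well Name'].values  # one group per well

scores = []
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("Per-well scores:", np.round(scores, 3))
print("Mean leave-one-well-out accuracy: {:.3f}".format(np.mean(scores)))
```

The final number is the arithmetic average described above; the per-well scores also show how much the result swings from well to well.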

@lperozzi
Contributor

Actually, I think Lukas is right. This is called k-fold cross-validation, and it is well explained by this figure:

https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png

When computing the accuracy on the blind well alone, the score variation is pretty high, around ±5%. K-fold cross-validation can help stabilize the score.
Moreover, I think precision/recall/F1 scores are more appropriate for this kind of problem.
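
Following up on the precision/recall/F1 suggestion, here is a small sketch of how those could be reported for a single held-out well with scikit-learn. The variable names (clf, X, y, test_idx) reuse the leave-one-well-out sketch above and are assumptions, not contest code.

```python
# Per-class precision/recall/F1 for one held-out well, reusing the variables
# from the leave-one-well-out sketch above (clf was trained on the other wells).
from sklearn.metrics import classification_report, f1_score

y_true = y[test_idx]               # true facies of the held-out well
y_pred = clf.predict(X[test_idx])  # predictions from the trained classifier

print(classification_report(y_true, y_pred))

# One summary number that weights classes by their support:
print("Weighted F1: {:.3f}".format(f1_score(y_true, y_pred, average='weighted')))
```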


@LukasMosser
Contributor Author

Excellent visualization!

I agree; it seems I've seen even higher score variation when running my own code.

@kwinkunks
Member

Thank you again @LukasMosser for raising this issue of meta-overfitting, and @lperozzi for chiming in.

We can (and expect to) learn as we go, so we'll look at k-fold CV as a score. That seems like a sensible approach.

I also have another plan: some fully blind data that we will not release. I'll post back to this issue if and when that's a reality.

I guess all this is just something we have to live with in what is necessarily a limited universe of data. And I'm reading that it is (therefore) a common problem in machine learning contests. It makes me glad that there isn't $1 million at stake in this contest. Because we nearly went with that!

(no, we didn't)

@LukasMosser
Contributor Author

LukasMosser commented Oct 26, 2016

@kwinkunks @lperozzi
It's a bit of rough code, I'd say, but here's my take at leave-one-well-out (one well per fold) training.

Gist

@kwinkunks
Member

Hi everyone... A quick update on this issue of meta-overfitting.

We have some labels for the STUART and CRAWFORD wells. These labels were not part of the data package, so they are truly blind. We will therefore be testing entries against them.

We may also keep some type of k-fold CV in the mix. Brendon and I aim to propose a scoring strategy before next week. The goal is to be both analytically rigorous and fair to everyone.

Either way, this means that all of the wells with labels in the data package can be used in training.

@mycarta
Contributor

mycarta commented Nov 12, 2016

Hi Matt, Brendon

I am just starting on the contest today. I am trying to catch up on this issue, so to be sure I understand: we are to use all the wells for training, and in our submission we only include training (cross-validated) scores? There's no validation on a blind well on our side. I imagine entries can still include visual results (e.g. plots of the predicted facies) for the STUART and CRAWFORD wells, but it is you and Brendon who will be doing the validation against STUART and CRAWFORD separately. Correct?
This also means that Bryan / CannedGeo should have an opportunity to re-submit after throwing SHANKLE into the training set.

@kwinkunks
Member

Hey @mycarta ... Yes, use all the wells in training. And yes, we will validate against STUART and CRAWFORD, and it's that score that will count. I'm not sure yet if the k-fold CV score will count for anything, but I think it's probably the most useful thing to aim at.

And yes, @CannedGeo should probably retrain his model, although I can do that too with his existing hyperparameters... it just might not be optimal. I suspect he's working on it in any case.

@CannedGeo
Contributor

@mycarta @kwinkunks If by retraining my model you mean banging my head against a wall, then yes, that is exactly what I've been doing!! This is maybe better presented as a separate question, but would it be totally against the rules to try to generate a PE for Alexander D and Kimzey A? I mean, this is a machine learning exercise, is it not? ... And then include all wells in the model?

@kwinkunks
Member

@CannedGeo Yes, please do put that in an Issue of its own so other people will be more likely to see it. Short answer: you can do anything you like with the features, including generating new ones. Definitely.
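
One possible way to generate the missing PE values, sketched below under the same assumed DataFrame layout as the earlier cross-validation sketch: train a regressor on the wells that do have PE and predict PE where it is absent. The regressor and feature list are illustrative choices, not a recommended recipe.

```python
# Sketch: impute the missing PE log with a regression model trained on the
# wells that have PE (column names assumed to match the contest data).
from sklearn.ensemble import RandomForestRegressor

pe_features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'NM_M', 'RELPOS']
has_pe = df['PE'].notnull()

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(df.loc[has_pe, pe_features], df.loc[has_pe, 'PE'])

# Fill the gaps (e.g. the wells missing PE) with predicted values,
# then train the facies classifier on all wells including the new PE.
if (~has_pe).any():
    df.loc[~has_pe, 'PE'] = reg.predict(df.loc[~has_pe, pe_features])
```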

@kwinkunks
Member

On the k-fold CV issue...

I just made a small demo of stepwise dropout. Please have a look and see if you agree that it does what we've been talking about.

cc @lperozzi @LukasMosser (I guess you will get notifications anyway, just making sure :)

da-wad pushed a commit to da-wad/2016-ml-contest that referenced this issue Jan 29, 2017
kwinkunks added a commit that referenced this issue Jan 30, 2017
kwinkunks pushed a commit that referenced this issue Jan 31, 2017