transcripts/en/Lex_Fridman/vtt/CLOAswsxudo.vtt

﻿WEBVTT

1
00:00:00.450 --> 00:00:05.450
<v Speaker 1>Today we'll talk about how to make </v>
<v Speaker 1>machines see computer vision and we will</v>

2
00:00:06.331 --> 00:00:09.150
<v Speaker 1>present thank you.</v>
<v Speaker 1>Whoever said yes,</v>

3
00:00:11.250 --> 00:00:16.250
<v Speaker 1>and today we will present a competition </v>
<v Speaker 1>that unlike deep traffic which is </v>

4
00:00:17.911 --> 00:00:22.911
<v Speaker 1>designed to explore ideas,</v>
<v Speaker 1>teach you about concepts so deeper,</v>

5
00:00:23.490 --> 00:00:28.490
<v Speaker 1>deeper enforcement learning SegFuse,</v>
<v Speaker 1>the deep dynamic driving scene </v>

6
00:00:28.861 --> 00:00:32.700
<v Speaker 1>segmentation,</v>
<v Speaker 1>competition that are present today is at</v>

7
00:00:32.701 --> 00:00:37.650
<v Speaker 1>the very cutting edge.</v>
<v Speaker 1>Whoever does well in this competition is</v>

8
00:00:37.651 --> 00:00:42.651
<v Speaker 1>likely to produce a publication or ideas</v>
<v Speaker 1>that would lead the world in the area of</v>

9
00:00:44.221 --> 00:00:48.390
<v Speaker 1>perception,</v>
<v Speaker 1>perhaps together with the people running</v>

10
00:00:48.391 --> 00:00:50.580
<v Speaker 1>this class,</v>
<v Speaker 1>perhaps in your own.</v>

11
00:00:51.200 --> 00:00:56.200
<v Speaker 1>I encourage you to do so even more cast </v>
<v Speaker 1>today.</v>

12
00:00:57.870 --> 00:01:02.870
<v Speaker 1>Computer Vision today as it stands is </v>
<v Speaker 1>deep learning majority of the successes </v>

13
00:01:07.230 --> 00:01:10.080
<v Speaker 1>in how we interpret form </v>
<v Speaker 1>representations,</v>

14
00:01:10.230 --> 00:01:15.230
<v Speaker 1>understand images and videos utilize to </v>
<v Speaker 1>a significant degree and you're on that </v>

15
00:01:16.181 --> 00:01:21.181
<v Speaker 1>works.</v>
<v Speaker 1>The very ideas we've been talking about </v>

16
00:01:21.181 --> 00:01:24.560
<v Speaker 1>that applies for supervised,</v>
<v Speaker 1>unsupervised and reinforcement learning </v>

17
00:01:26.290 --> 00:01:29.810
<v Speaker 1>and for the supervised case is the focus</v>
<v Speaker 1>of today.</v>

18
00:01:30.740 --> 00:01:34.850
<v Speaker 1>The process is the same.</v>
<v Speaker 1>The data is essential.</v>

19
00:01:34.940 --> 00:01:39.940
<v Speaker 1>There's annotated data where the human </v>
<v Speaker 1>provides the labels that serves as the </v>

20
00:01:39.940 --> 00:01:44.141
<v Speaker 1>ground truth and the training process.</v>
<v Speaker 1>Then the neural network goes through </v>

21
00:01:45.531 --> 00:01:50.531
<v Speaker 1>that data,</v>
<v Speaker 1>learning to map from the raw sensory </v>

22
00:01:50.531 --> 00:01:55.031
<v Speaker 1>input to the ground truth labels and </v>
<v Speaker 1>then generalize or the testing data set</v>

23
00:01:57.320 --> 00:02:00.200
<v Speaker 1>and the kind of raw senses we're dealing</v>
<v Speaker 1>with are numbers.</v>

24
00:02:01.280 --> 00:02:05.900
<v Speaker 1>I'll say this again and again that for </v>
<v Speaker 1>human vision for us here,</v>

25
00:02:05.930 --> 00:02:10.930
<v Speaker 1>we take for granted this particular </v>
<v Speaker 1>aspect of our ability is to take in raw </v>

26
00:02:10.930 --> 00:02:13.130
<v Speaker 1>sensor information through our eyes and </v>
<v Speaker 1>interpret it,</v>

27
00:02:13.880 --> 00:02:18.880
<v Speaker 1>but it's just numbers.</v>
<v Speaker 1>That's something whether you're an </v>

28
00:02:18.880 --> 00:02:20.960
<v Speaker 1>expert in computer vision person or new </v>
<v Speaker 1>to the field,</v>

29
00:02:21.020 --> 00:02:26.020
<v Speaker 1>you have to always go back to meditate </v>
<v Speaker 1>on is what kind of things the machine is</v>

30
00:02:27.381 --> 00:02:28.610
<v Speaker 1>given,</v>
<v Speaker 1>what,</v>

31
00:02:28.640 --> 00:02:33.640
<v Speaker 1>what?</v>
<v Speaker 1>What is the data that is tasked to work </v>

32
00:02:33.640 --> 00:02:35.150
<v Speaker 1>with in order to perform the task you're</v>
<v Speaker 1>asking it to do?</v>

33
00:02:35.750 --> 00:02:40.750
<v Speaker 1>Perhaps the data is given is highly </v>
<v Speaker 1>insufficient to do what you wanted to </v>

34
00:02:40.971 --> 00:02:45.971
<v Speaker 1>do.</v>
<v Speaker 1>That's the question I'll come up again </v>

35
00:02:45.971 --> 00:02:48.071
<v Speaker 1>and again our images and enough to </v>
<v Speaker 1>understand the world around you and </v>

36
00:02:51.710 --> 00:02:54.830
<v Speaker 1>given these numbers,</v>
<v Speaker 1>these set of numbers,</v>

37
00:02:54.831 --> 00:02:59.831
<v Speaker 1>sometimes with one channel,</v>
<v Speaker 1>sometimes with three rgb where every </v>

38
00:02:59.831 --> 00:03:03.761
<v Speaker 1>single have three different colors.</v>
<v Speaker 1>The task is to classify or regress </v>

39
00:03:07.440 --> 00:03:12.440
<v Speaker 1>producing continuous variable or one of </v>
<v Speaker 1>a set of class labels as before,</v>

40
00:03:16.550 --> 00:03:21.550
<v Speaker 1>we must be careful about our intuition </v>
<v Speaker 1>of what is hard,</v>

41
00:03:21.990 --> 00:03:23.600
<v Speaker 1>what is easy and computer vision.</v>

42
00:03:28.210 --> 00:03:33.210
<v Speaker 1>Let's take a step back to the </v>
<v Speaker 1>inspiration for an year old networks,</v>

43
00:03:34.420 --> 00:03:39.420
<v Speaker 1>our own biological neural networks </v>
<v Speaker 1>because the human vision system and the </v>

44
00:03:40.061 --> 00:03:44.050
<v Speaker 1>computer vision system is a little bit </v>
<v Speaker 1>more similar in these regards.</v>

45
00:03:52.360 --> 00:03:57.360
<v Speaker 1>The structure of the human visual Cortex</v>
<v Speaker 1>is in layers and his information passes </v>

46
00:03:58.480 --> 00:04:03.480
<v Speaker 1>from the eyes of the to the parts of the</v>
<v Speaker 1>brain that makes sense of the influence,</v>

47
00:04:03.700 --> 00:04:07.750
<v Speaker 1>the raw sensor information hiring higher</v>
<v Speaker 1>order representations of formed.</v>

48
00:04:08.830 --> 00:04:13.830
<v Speaker 1>This is the inspiration,</v>
<v Speaker 1>the idea behind using deep neural </v>

49
00:04:13.830 --> 00:04:17.821
<v Speaker 1>networks for images higher and higher </v>
<v Speaker 1>order representation is a form for the </v>

50
00:04:17.821 --> 00:04:18.190
<v Speaker 1>layers,</v>

51
00:04:19.980 --> 00:04:24.980
<v Speaker 1>the early layers taking in the very raw </v>
<v Speaker 1>in sensory information that extracting </v>

52
00:04:25.830 --> 00:04:28.830
<v Speaker 1>edges,</v>
<v Speaker 1>connecting those edges,</v>

53
00:04:28.831 --> 00:04:33.831
<v Speaker 1>forming those edges to form more complex</v>
<v Speaker 1>features and finally into the higher </v>

54
00:04:33.831 --> 00:04:38.511
<v Speaker 1>order semantic meaning that we hope to </v>
<v Speaker 1>get from these images and computer </v>

55
00:04:39.241 --> 00:04:41.160
<v Speaker 1>vision.</v>
<v Speaker 1>Deep learning is hard.</v>

56
00:04:42.180 --> 00:04:47.180
<v Speaker 1>I'll say this again.</v>
<v Speaker 1>The illumination variability is the </v>

57
00:04:47.180 --> 00:04:48.030
<v Speaker 1>biggest challenge,</v>
<v Speaker 1>or at least one of the,</v>

58
00:04:48.120 --> 00:04:53.120
<v Speaker 1>one of the biggest challenges in driving</v>
<v Speaker 1>for visible light cameras pose </v>

59
00:04:55.351 --> 00:04:58.110
<v Speaker 1>variability.</v>
<v Speaker 1>The objects,</v>

60
00:04:59.010 --> 00:05:04.010
<v Speaker 1>as I'll also discuss about some of the </v>
<v Speaker 1>advances from Geoff Hinton and the </v>

61
00:05:04.010 --> 00:05:07.521
<v Speaker 1>capsule networks.</v>
<v Speaker 1>The idea with neural networks as they </v>

62
00:05:07.521 --> 00:05:12.341
<v Speaker 1>are currently used for computer vision </v>
<v Speaker 1>are not good with representing variable </v>

63
00:05:12.571 --> 00:05:17.571
<v Speaker 1>pose.</v>
<v Speaker 1>These objects in images and it's too </v>

64
00:05:17.571 --> 00:05:21.891
<v Speaker 1>deep.</v>
<v Speaker 1>Plane of color and texture look very </v>

65
00:05:21.891 --> 00:05:25.641
<v Speaker 1>different numerically when the object is</v>
<v Speaker 1>rotated and the object is mangled and </v>

66
00:05:27.681 --> 00:05:32.681
<v Speaker 1>shaped in different ways.</v>
<v Speaker 1>The deformable will truncate a cat </v>

67
00:05:32.681 --> 00:05:36.690
<v Speaker 1>interclass variability.</v>
<v Speaker 1>The classification task,</v>

68
00:05:36.691 --> 00:05:41.691
<v Speaker 1>which would be an example today </v>
<v Speaker 1>throughout to introduce some of the </v>

69
00:05:41.691 --> 00:05:46.011
<v Speaker 1>networks over the past decade that have </v>
<v Speaker 1>received success and some of the </v>

70
00:05:46.011 --> 00:05:47.370
<v Speaker 1>intuition and insight that made those </v>
<v Speaker 1>networks work.</v>

71
00:05:47.670 --> 00:05:52.110
<v Speaker 1>Classification,</v>
<v Speaker 1>there is a lot of variability inside the</v>

72
00:05:52.111 --> 00:05:55.470
<v Speaker 1>classes and very little variability </v>
<v Speaker 1>between the classes.</v>

73
00:05:57.070 --> 00:05:58.350
<v Speaker 1>All of these cats</v>

74
00:05:58.390 --> 00:06:00.610
<v Speaker 2>at top,</v>
<v Speaker 2>all those are dogs a bottom.</v>

75
00:06:01.060 --> 00:06:06.060
<v Speaker 2>They look very different and the other,</v>
<v Speaker 2>I would say the second biggest problem </v>

76
00:06:06.060 --> 00:06:08.760
<v Speaker 2>in driving perception,</v>
<v Speaker 2>visible light camera perceptions,</v>

77
00:06:08.810 --> 00:06:13.810
<v Speaker 2>occlusion when part of the object is </v>
<v Speaker 2>occluded due to the three dimensional</v>

78
00:06:15.220 --> 00:06:20.220
<v Speaker 1>nature of our world,</v>
<v Speaker 1>some objects in front of others and they</v>

79
00:06:20.441 --> 00:06:25.441
<v Speaker 1>occlude the background object.</v>
<v Speaker 1>And yet we're still tasked with </v>

80
00:06:25.441 --> 00:06:28.720
<v Speaker 1>identifying the object when only part of</v>
<v Speaker 1>it is visible.</v>

81
00:06:29.200 --> 00:06:34.200
<v Speaker 1>And sometimes that part I told you </v>
<v Speaker 1>there's cats is very hardly visible </v>

82
00:06:34.690 --> 00:06:37.510
<v Speaker 1>here.</v>
<v Speaker 1>We're tasked with classifying a cat with</v>

83
00:06:37.511 --> 00:06:42.511
<v Speaker 1>just an ears visible,</v>
<v Speaker 1>just the leg and on a philosophical </v>

84
00:06:46.121 --> 00:06:50.110
<v Speaker 1>level as we'll talk about the motivation</v>
<v Speaker 1>for our competition here.</v>

85
00:06:50.530 --> 00:06:51.550
<v Speaker 1>Here's a,</v>
<v Speaker 1>a,</v>

86
00:06:51.620 --> 00:06:53.380
<v Speaker 1>a,</v>
<v Speaker 1>a cat dressed as a monk,</v>

87
00:06:53.381 --> 00:06:56.920
<v Speaker 1>eating a banana on a philosophical </v>
<v Speaker 1>level.</v>

88
00:06:58.240 --> 00:07:00.520
<v Speaker 1>Most of us,</v>
<v Speaker 1>uh,</v>

89
00:07:00.940 --> 00:07:05.140
<v Speaker 1>understand what's going on in the scene.</v>
<v Speaker 1>In fact,</v>

90
00:07:05.280 --> 00:07:10.280
<v Speaker 1>a neural network to today successfully </v>
<v Speaker 1>classify this,</v>

91
00:07:12.460 --> 00:07:16.930
<v Speaker 1>uh,</v>
<v Speaker 1>image this video as a cat,</v>

92
00:07:18.010 --> 00:07:21.820
<v Speaker 1>but the context,</v>
<v Speaker 1>the humor of the situation,</v>

93
00:07:21.821 --> 00:07:26.680
<v Speaker 1>and in fact you could argue it's a </v>
<v Speaker 1>monkey is missing.</v>

94
00:07:27.250 --> 00:07:30.640
<v Speaker 1>And what else is missing is the dynamic </v>
<v Speaker 1>information,</v>

95
00:07:30.820 --> 00:07:32.530
<v Speaker 1>the temporal dynamics of the scene.</v>

96
00:07:34.990 --> 00:07:39.990
<v Speaker 1>That's what's missing in a lot of the </v>
<v Speaker 1>perception work that has been done to </v>

97
00:07:39.990 --> 00:07:42.460
<v Speaker 1>date in the autonomous vehicle space,</v>
<v Speaker 1>uh,</v>

98
00:07:42.670 --> 00:07:47.020
<v Speaker 1>in terms of visible light cameras and </v>
<v Speaker 1>we're looking to expand on that.</v>

99
00:07:47.470 --> 00:07:49.600
<v Speaker 1>That's what segue.</v>
<v Speaker 1>Fuse is all about.</v>

100
00:07:50.380 --> 00:07:54.550
<v Speaker 1>Image classification pipeline.</v>
<v Speaker 1>There's a been with different categories</v>

101
00:07:54.551 --> 00:07:56.770
<v Speaker 1>inside each class.</v>
<v Speaker 1>Cat,</v>

102
00:07:56.771 --> 00:07:58.120
<v Speaker 1>dog Mug,</v>
<v Speaker 1>hat,</v>

103
00:07:58.840 --> 00:08:03.840
<v Speaker 1>those bins.</v>
<v Speaker 1>There's a lot of examples of each and </v>

104
00:08:03.840 --> 00:08:07.141
<v Speaker 1>your task with when a new example comes </v>
<v Speaker 1>along you've never seen before to put </v>

105
00:08:07.141 --> 00:08:11.161
<v Speaker 1>that image in a bin.</v>
<v Speaker 1>It's the same as the machine learning </v>

106
00:08:11.161 --> 00:08:15.091
<v Speaker 1>task before and everything relies on the</v>
<v Speaker 1>data that has been ground truth,</v>

107
00:08:16.480 --> 00:08:21.480
<v Speaker 1>that have been labeled by human beings.</v>
<v Speaker 1>Amnesty as a toy data set of handwritten</v>

108
00:08:22.691 --> 00:08:27.130
<v Speaker 1>digits,</v>
<v Speaker 1>often using as examples and Coco psyfari</v>

109
00:08:27.160 --> 00:08:30.550
<v Speaker 1>image net places,</v>
<v Speaker 1>and a lot of other incredible datasets.</v>

110
00:08:30.580 --> 00:08:35.580
<v Speaker 1>Rich data sets of 100 thousands,</v>
<v Speaker 1>millions of images out there represent </v>

111
00:08:35.721 --> 00:08:39.310
<v Speaker 1>scenes,</v>
<v Speaker 1>people's faces and different objects.</v>

112
00:08:39.670 --> 00:08:44.670
<v Speaker 1>Those are all ground truth data for </v>
<v Speaker 1>testing algorithms and for competing </v>

113
00:08:46.271 --> 00:08:48.940
<v Speaker 1>architectures to be evaluated against </v>
<v Speaker 1>each other.</v>

114
00:08:49.720 --> 00:08:51.880
<v Speaker 1>Cfr Ten,</v>
<v Speaker 1>one of the simplest,</v>

115
00:08:52.780 --> 00:08:57.380
<v Speaker 1>almost toy datasets of tiny with 10 </v>
<v Speaker 1>categories of airplane,</v>

116
00:08:57.390 --> 00:08:58.830
<v Speaker 1>automobile,</v>
<v Speaker 1>Bird,</v>

117
00:08:58.831 --> 00:08:59.430
<v Speaker 1>cat,</v>
<v Speaker 1>deere,</v>

118
00:08:59.431 --> 00:09:00.150
<v Speaker 1>dog,</v>
<v Speaker 1>frog,</v>

119
00:09:00.151 --> 00:09:05.151
<v Speaker 1>course,</v>
<v Speaker 1>ship and truck is commonly used to </v>

120
00:09:05.151 --> 00:09:08.031
<v Speaker 1>explore.</v>
<v Speaker 1>Some of the basic convolutional neural </v>

121
00:09:08.031 --> 00:09:08.190
<v Speaker 1>networks will discuss,</v>
<v Speaker 1>so let's come up with a very trivial,</v>

122
00:09:08.191 --> 00:09:11.880
<v Speaker 1>classify it to explain the concept of </v>
<v Speaker 1>how we could go about it.</v>

123
00:09:12.600 --> 00:09:17.600
<v Speaker 1>In fact,</v>
<v Speaker 1>this is maybe if you start to think </v>

124
00:09:17.600 --> 00:09:20.121
<v Speaker 1>about how to classify an image.</v>
<v Speaker 1>If you don't know any of these </v>

125
00:09:20.121 --> 00:09:23.241
<v Speaker 1>techniques,</v>
<v Speaker 1>this is perhaps the approach you would </v>

126
00:09:23.241 --> 00:09:25.641
<v Speaker 1>take is you would subtract images.</v>
<v Speaker 1>So in order to know that an image of a </v>

127
00:09:25.771 --> 00:09:30.771
<v Speaker 1>cat is different than image of a dog,</v>
<v Speaker 1>you have to compare them when given </v>

128
00:09:30.771 --> 00:09:30.860
<v Speaker 1>those two images,</v>
<v Speaker 1>what?</v>

129
00:09:30.890 --> 00:09:33.090
<v Speaker 1>What's the what's the way you compare </v>
<v Speaker 1>them?</v>

130
00:09:33.900 --> 00:09:38.900
<v Speaker 1>One way you could do it is you just </v>
<v Speaker 1>subtract it and then some all the pixel </v>

131
00:09:38.900 --> 00:09:42.840
<v Speaker 1>wise differences in the image.</v>
<v Speaker 1>Just subtract the intensity of the image</v>

132
00:09:42.870 --> 00:09:46.530
<v Speaker 1>pixel by Pixel.</v>
<v Speaker 1>Sum It up if that intent,</v>

133
00:09:46.560 --> 00:09:51.560
<v Speaker 1>if that difference is really high,</v>
<v Speaker 1>that means the image is a very </v>

134
00:09:51.560 --> 00:09:51.560
<v Speaker 1>different.</v>

135
00:09:51.560 --> 00:09:56.180
<v Speaker 1>Using that metric,</v>
<v Speaker 1>we can look at cfr 10 and use it as a </v>

136
00:09:56.180 --> 00:10:00.120
<v Speaker 1>classifier saying,</v>
<v Speaker 1>based on this difference function,</v>

137
00:10:00.390 --> 00:10:05.390
<v Speaker 1>I'm going to find one of the 10 bins for</v>
<v Speaker 1>a new image that that is cool,</v>

138
00:10:07.240 --> 00:10:12.240
<v Speaker 1>that has the lowest difference.</v>
<v Speaker 1>Find an image in this dataset that is </v>

139
00:10:13.511 --> 00:10:16.540
<v Speaker 1>most like the image I have and put it in</v>
<v Speaker 1>the same bin.</v>

140
00:10:16.541 --> 00:10:21.520
<v Speaker 1>Is that images in?</v>
<v Speaker 1>So there's 10 classes.</v>

141
00:10:21.521 --> 00:10:26.521
<v Speaker 1>If we just flip a coin,</v>
<v Speaker 1>the accuracy of our classifier will be </v>

142
00:10:26.521 --> 00:10:28.420
<v Speaker 1>10 percent.</v>
<v Speaker 1>Using our image difference classify,</v>

143
00:10:28.421 --> 00:10:33.421
<v Speaker 1>we can actually do pretty good.</v>
<v Speaker 1>Much better than random was better than </v>

144
00:10:33.421 --> 00:10:34.780
<v Speaker 1>10 percent.</v>
<v Speaker 1>We can do 35,</v>

145
00:10:34.781 --> 00:10:39.781
<v Speaker 1>38 percent accuracy.</v>
<v Speaker 1>That's a classifier wherever first </v>

146
00:10:40.750 --> 00:10:45.750
<v Speaker 1>classifier,</v>
<v Speaker 1>k nearest neighbors.</v>

147
00:10:46.530 --> 00:10:51.530
<v Speaker 1>Let's take our classifier to a whole new</v>
<v Speaker 1>level instead of comparing it to just </v>

148
00:10:51.960 --> 00:10:56.960
<v Speaker 1>fight.</v>
<v Speaker 1>Trying to find one image that's the </v>

149
00:10:56.960 --> 00:10:59.511
<v Speaker 1>closest in our dataset.</v>
<v Speaker 1>We tried to find k closest and say what </v>

150
00:10:59.791 --> 00:11:02.820
<v Speaker 1>is what class do the majority of them </v>
<v Speaker 1>belong to?</v>

151
00:11:03.330 --> 00:11:06.210
<v Speaker 1>And we take that K and increase it for </v>
<v Speaker 1>one to two,</v>

152
00:11:06.211 --> 00:11:07.560
<v Speaker 1>to three,</v>
<v Speaker 1>to four to five,</v>

153
00:11:08.790 --> 00:11:13.790
<v Speaker 1>and see how that changes the problem </v>
<v Speaker 1>with seven years neighbors,</v>

154
00:11:14.541 --> 00:11:17.450
<v Speaker 1>which is the optimal under this approach</v>
<v Speaker 1>for cfr 10,</v>

155
00:11:20.610 --> 00:11:25.610
<v Speaker 1>we achieved 30 percent accuracy.</v>
<v Speaker 1>Human level is 95 percent accuracy and </v>

156
00:11:28.390 --> 00:11:31.760
<v Speaker 1>convolutional neural networks will get </v>
<v Speaker 1>very close to a 100 percent.</v>

157
00:11:34.260 --> 00:11:39.260
<v Speaker 1>That's were you on.</v>
<v Speaker 1>That works shine this very task of </v>

158
00:11:41.691 --> 00:11:46.691
<v Speaker 1>binning images.</v>
<v Speaker 1>It all starts with this basic </v>

159
00:11:46.691 --> 00:11:49.490
<v Speaker 1>computational unit signal in each of the</v>
<v Speaker 1>signals are wade summed,</v>

160
00:11:51.980 --> 00:11:53.150
<v Speaker 1>bias added</v>

161
00:11:55.140 --> 00:12:00.140
<v Speaker 1>and put an input into a nonlinear </v>
<v Speaker 1>activation function that produces an </v>

162
00:12:00.140 --> 00:12:04.181
<v Speaker 1>output.</v>
<v Speaker 1>The nonlinear activation function is </v>

163
00:12:04.181 --> 00:12:07.811
<v Speaker 1>key.</v>
<v Speaker 1>All of these put together and more and </v>

164
00:12:07.971 --> 00:12:12.560
<v Speaker 1>more hidden layers form a deep neural </v>
<v Speaker 1>network,</v>

165
00:12:12.650 --> 00:12:17.650
<v Speaker 1>and that deep neural network is trained </v>
<v Speaker 1>as we've discussed by taking a forward </v>

166
00:12:17.661 --> 00:12:20.600
<v Speaker 1>pass on examples,</v>
<v Speaker 1>have ground truth labels.</v>

167
00:12:20.690 --> 00:12:24.050
<v Speaker 1>Seeing how close those labels are too,</v>
<v Speaker 1>the real ground truth,</v>

168
00:12:24.350 --> 00:12:29.350
<v Speaker 1>and then punishing the weights that </v>
<v Speaker 1>resulted in the incorrect decisions and </v>

169
00:12:29.871 --> 00:12:32.480
<v Speaker 1>rewarding the weights that results in </v>
<v Speaker 1>incorrect decisions.</v>

170
00:12:33.800 --> 00:12:38.800
<v Speaker 1>For the case of 10 examples,</v>
<v Speaker 1>the output of the network is 10 </v>

171
00:12:40.041 --> 00:12:45.041
<v Speaker 1>different values.</v>
<v Speaker 1>The input being handwritten digits from </v>

172
00:12:46.651 --> 00:12:51.651
<v Speaker 1>zero to nine,</v>
<v Speaker 1>10 of those and one of our network to </v>

173
00:12:52.441 --> 00:12:57.441
<v Speaker 1>classify what is in this image of a </v>
<v Speaker 1>handwritten digit is at one zero,</v>

174
00:12:58.201 --> 00:12:58.620
<v Speaker 1>one,</v>
<v Speaker 1>two,</v>

175
00:12:58.621 --> 00:13:03.621
<v Speaker 1>three through nine.</v>
<v Speaker 1>The way it's often done is there's 10 </v>

176
00:13:03.811 --> 00:13:08.811
<v Speaker 1>outputs of the network and each of the </v>
<v Speaker 1>neurons and the output is responsible </v>

177
00:13:12.061 --> 00:13:17.061
<v Speaker 1>for getting really excited when it's </v>
<v Speaker 1>number is called and everybody else is </v>

178
00:13:18.811 --> 00:13:23.811
<v Speaker 1>supposed to be not excited.</v>
<v Speaker 1>Therefore the number of classes is the </v>

179
00:13:24.301 --> 00:13:29.301
<v Speaker 1>number of outputs.</v>
<v Speaker 1>That's how it's commonly done and you </v>

180
00:13:29.301 --> 00:13:32.460
<v Speaker 1>assign a class to the input image based </v>
<v Speaker 1>on the highest,</v>

181
00:13:32.760 --> 00:13:35.250
<v Speaker 1>the neuron which produces the highest </v>
<v Speaker 1>output,</v>

182
00:13:36.870 --> 00:13:40.530
<v Speaker 1>but that's for a fully connected network</v>
<v Speaker 1>that we've discussed on Monday.</v>

183
00:13:42.320 --> 00:13:47.320
<v Speaker 1>There is in deep learning a lot of </v>
<v Speaker 1>tricks that make things work that make </v>

184
00:13:47.721 --> 00:13:52.721
<v Speaker 1>training much more efficient on large </v>
<v Speaker 1>class problems where there's a lot of </v>

185
00:13:54.051 --> 00:13:59.051
<v Speaker 1>classes on large data sets.</v>
<v Speaker 1>When the representation that the neural </v>

186
00:13:59.051 --> 00:14:03.581
<v Speaker 1>network is tasked with learning is </v>
<v Speaker 1>extremely complex and that's where </v>

187
00:14:03.581 --> 00:14:05.090
<v Speaker 1>convolutional neural network step in </v>
<v Speaker 1>that trick.</v>

188
00:14:05.091 --> 00:14:10.091
<v Speaker 1>They use a spatial invariance.</v>
<v Speaker 1>They use the idea that a cat in the top </v>

189
00:14:12.261 --> 00:14:17.261
<v Speaker 1>left corner of an image is the same as a</v>
<v Speaker 1>cat in the bottom right corner of an </v>

190
00:14:17.261 --> 00:14:20.330
<v Speaker 1>image,</v>
<v Speaker 1>so we can learn the same features across</v>

191
00:14:20.331 --> 00:14:25.331
<v Speaker 1>the image.</v>
<v Speaker 1>That's where the convolution operation </v>

192
00:14:25.331 --> 00:14:29.850
<v Speaker 1>steps in.</v>
<v Speaker 1>Instead of the fully connected networks </v>

193
00:14:29.850 --> 00:14:32.840
<v Speaker 1>here,</v>
<v Speaker 1>there's a third dimension of depth,</v>

194
00:14:33.530 --> 00:14:38.530
<v Speaker 1>so the blocks in this neural network </v>
<v Speaker 1>that as input take three d volumes in </v>

195
00:14:39.131 --> 00:14:41.180
<v Speaker 1>this output produced three d volumes.</v>

196
00:14:46.890 --> 00:14:51.890
<v Speaker 1>They take a slice of the image,</v>
<v Speaker 1>a window and it across applying this </v>

197
00:14:53.491 --> 00:14:56.030
<v Speaker 1>same exact weights and we'll go through </v>
<v Speaker 1>an example,</v>

198
00:14:56.330 --> 00:15:01.330
<v Speaker 1>the same exact weights as in the fully </v>
<v Speaker 1>connected network on the edges that are </v>

199
00:15:01.330 --> 00:15:06.251
<v Speaker 1>used to map the input to the output.</v>
<v Speaker 1>Here are used to map the slice of an </v>

200
00:15:08.001 --> 00:15:10.880
<v Speaker 1>image,</v>
<v Speaker 1>this window of an image to the output,</v>

201
00:15:12.350 --> 00:15:17.350
<v Speaker 1>and you can make several,</v>
<v Speaker 1>many of such convolutional filters,</v>

202
00:15:17.870 --> 00:15:22.870
<v Speaker 1>many layers,</v>
<v Speaker 1>many different options of what kind of </v>

203
00:15:22.870 --> 00:15:27.530
<v Speaker 1>features do you look for in an image or </v>
<v Speaker 1>kind of window you slide across in order</v>

204
00:15:27.531 --> 00:15:31.430
<v Speaker 1>to extract all kinds of things.</v>
<v Speaker 1>All kinds of edges,</v>

205
00:15:31.730 --> 00:15:34.490
<v Speaker 1>all kinds of higher order patterns and </v>
<v Speaker 1>the images.</v>

206
00:15:36.530 --> 00:15:40.580
<v Speaker 1>The very important thing is the </v>
<v Speaker 1>parameters on each of these filters,</v>

207
00:15:40.730 --> 00:15:43.850
<v Speaker 1>these subset of the image,</v>
<v Speaker 1>these windows are shared.</v>

208
00:15:44.780 --> 00:15:49.780
<v Speaker 1>If the feature,</v>
<v Speaker 1>the that defines a cat is useful in the </v>

209
00:15:49.780 --> 00:15:50.960
<v Speaker 1>top left corner,</v>
<v Speaker 1>it's useful on the top right corner.</v>

210
00:15:50.961 --> 00:15:53.390
<v Speaker 1>It's useful in every aspect of the </v>
<v Speaker 1>image.</v>

211
00:15:53.690 --> 00:15:58.690
<v Speaker 1>This is the trick that makes </v>
<v Speaker 1>convolutional neural network save a lot </v>

212
00:15:58.690 --> 00:16:02.771
<v Speaker 1>of a lot of parameters.</v>
<v Speaker 1>Reduced parameter significantly is the </v>

213
00:16:04.121 --> 00:16:09.121
<v Speaker 1>reuse,</v>
<v Speaker 1>the spatial sharing of features across </v>

214
00:16:09.121 --> 00:16:09.490
<v Speaker 1>the space of the image.</v>

215
00:16:12.870 --> 00:16:16.500
<v Speaker 1>The depth of these three d volumes is </v>
<v Speaker 1>the number of filters.</v>

216
00:16:17.820 --> 00:16:21.960
<v Speaker 1>The stride is the skip of the filter,</v>
<v Speaker 1>the step size,</v>

217
00:16:22.260 --> 00:16:27.240
<v Speaker 1>how many pixels you skip when you apply </v>
<v Speaker 1>the filter to the input,</v>

218
00:16:28.110 --> 00:16:33.110
<v Speaker 1>and the padding is the padding,</v>
<v Speaker 1>the zero padding on the outside of the </v>

219
00:16:34.591 --> 00:16:39.330
<v Speaker 1>input to a convolutional layer.</v>
<v Speaker 1>Let's go through an example,</v>

220
00:16:41.160 --> 00:16:45.600
<v Speaker 1>so on the left here and the slides are </v>
<v Speaker 1>available online.</v>

221
00:16:45.601 --> 00:16:49.380
<v Speaker 1>You can follow them along and I'll step </v>
<v Speaker 1>through this example.</v>

222
00:16:50.070 --> 00:16:53.940
<v Speaker 1>On the left here is input volume of </v>
<v Speaker 1>three channels.</v>

223
00:16:55.140 --> 00:16:58.430
<v Speaker 1>The left column is the input.</v>
<v Speaker 1>The three blocks,</v>

224
00:16:58.460 --> 00:17:02.460
<v Speaker 1>the three squares.</v>
<v Speaker 1>There are the three channels and there's</v>

225
00:17:02.461 --> 00:17:07.461
<v Speaker 1>numbers inside those channels,</v>
<v Speaker 1>and then we have a filter in red,</v>

226
00:17:12.490 --> 00:17:16.990
<v Speaker 1>two of them,</v>
<v Speaker 1>two channels of filters with a bias,</v>

227
00:17:17.440 --> 00:17:22.440
<v Speaker 1>and we those filters are three by three.</v>
<v Speaker 1>Each one of them is size three by three,</v>

228
00:17:25.090 --> 00:17:30.090
<v Speaker 1>and what we do is we take those three by</v>
<v Speaker 1>three filters that are to be learned.</v>

229
00:17:30.310 --> 00:17:33.100
<v Speaker 1>These are variables are weights that </v>
<v Speaker 1>will have to learn,</v>

230
00:17:34.120 --> 00:17:38.740
<v Speaker 1>and then we slide it across an image to </v>
<v Speaker 1>produce the output on the right,</v>

231
00:17:38.800 --> 00:17:42.850
<v Speaker 1>the green,</v>
<v Speaker 1>so by applying the filters in the red,</v>

232
00:17:43.150 --> 00:17:48.150
<v Speaker 1>there's two of them,</v>
<v Speaker 1>and within each one there's one for </v>

233
00:17:48.150 --> 00:17:51.210
<v Speaker 1>every input channel,</v>
<v Speaker 1>we go from the left to the right,</v>

234
00:17:51.720 --> 00:17:56.310
<v Speaker 1>from the input volume in the left to the</v>
<v Speaker 1>output volume green on the right,</v>

235
00:17:57.780 --> 00:18:02.070
<v Speaker 1>and you can look at it.</v>
<v Speaker 1>You can pull up the slides yourself.</v>

236
00:18:02.071 --> 00:18:04.350
<v Speaker 1>Now if you can't see the numbers on the </v>
<v Speaker 1>screen,</v>

237
00:18:04.740 --> 00:18:09.740
<v Speaker 1>but the operations are performed on the </v>
<v Speaker 1>input to produce the single value that's</v>

238
00:18:13.081 --> 00:18:15.120
<v Speaker 1>highlighted there in the green and the </v>
<v Speaker 1>output,</v>

239
00:18:15.840 --> 00:18:20.840
<v Speaker 1>and we slide this convolution.</v>
<v Speaker 1>No filter along the image with the </v>

240
00:18:22.591 --> 00:18:27.591
<v Speaker 1>stride in this case,</v>
<v Speaker 1>have to skipping,</v>

241
00:18:28.380 --> 00:18:33.380
<v Speaker 1>skipping along.</v>
<v Speaker 1>They some to the to the right,</v>

242
00:18:33.540 --> 00:18:38.100
<v Speaker 1>the two channel I'll put in green,</v>

243
00:18:39.240 --> 00:18:41.550
<v Speaker 1>that's it.</v>
<v Speaker 1>That's the convolutional operation.</v>

244
00:18:42.330 --> 00:18:46.830
<v Speaker 1>That's what's called a convolutional </v>
<v Speaker 1>layer neural networks and the parameters</v>

245
00:18:46.831 --> 00:18:51.831
<v Speaker 1>here,</v>
<v Speaker 1>besides the bias of the red values in </v>

246
00:18:51.831 --> 00:18:56.181
<v Speaker 1>the middle,</v>
<v Speaker 1>that's what we're trying to learn and </v>

247
00:18:56.181 --> 00:18:58.830
<v Speaker 1>there's a lot of interesting tricks </v>
<v Speaker 1>we'll discuss today on top of those,</v>

248
00:18:58.980 --> 00:19:01.680
<v Speaker 1>but this is at the core.</v>
<v Speaker 1>This is the specialty.</v>

249
00:19:01.681 --> 00:19:06.681
<v Speaker 1>Invariant sharing of parameters that may</v>
<v Speaker 1>convolution you're on that works.</v>

250
00:19:07.390 --> 00:19:12.390
<v Speaker 1>I'm able to efficiently learn and find </v>
<v Speaker 1>patterns and images to build your </v>

251
00:19:14.311 --> 00:19:18.180
<v Speaker 1>intuition a little bit more about </v>
<v Speaker 1>convolution.</v>

252
00:19:18.181 --> 00:19:21.240
<v Speaker 1>Here's an input image on the left and on</v>
<v Speaker 1>the right,</v>

253
00:19:22.260 --> 00:19:26.370
<v Speaker 1>the identity filter produces the output </v>
<v Speaker 1>you see on the right,</v>

254
00:19:26.430 --> 00:19:28.710
<v Speaker 1>and then there is different ways you </v>
<v Speaker 1>can,</v>

255
00:19:28.711 --> 00:19:33.711
<v Speaker 1>different kinds of edges.</v>
<v Speaker 1>You can extract with the activate where </v>

256
00:19:34.171 --> 00:19:36.780
<v Speaker 1>the resulting activation maps scene on </v>
<v Speaker 1>the right,</v>

257
00:19:37.260 --> 00:19:42.260
<v Speaker 1>so when applying the filters,</v>
<v Speaker 1>those edge detection filters to the </v>

258
00:19:43.021 --> 00:19:48.021
<v Speaker 1>image on the left you produce in white </v>
<v Speaker 1>are the parts that activate the </v>

259
00:19:48.210 --> 00:19:51.570
<v Speaker 1>convolution.</v>
<v Speaker 1>The results of these filters,</v>

260
00:19:55.310 --> 00:19:58.430
<v Speaker 1>and so you can do any kind of filter.</v>
<v Speaker 1>That's what we're trying to learn.</v>

261
00:19:58.610 --> 00:20:03.610
<v Speaker 1>Any kind of edge,</v>
<v Speaker 1>any kind of any kind of pattern you can </v>

262
00:20:04.101 --> 00:20:06.920
<v Speaker 1>move along in this window and this way </v>
<v Speaker 1>that's shown here.</v>

263
00:20:06.921 --> 00:20:11.921
<v Speaker 1>You slide around the image and you </v>
<v Speaker 1>produce the output you see on the right </v>

264
00:20:12.080 --> 00:20:14.570
<v Speaker 1>and depending on how many filters you </v>
<v Speaker 1>have in every level,</v>

265
00:20:14.571 --> 00:20:17.510
<v Speaker 1>you have many of the slices,</v>
<v Speaker 1>the on the right,</v>

266
00:20:17.750 --> 00:20:20.570
<v Speaker 1>the input on the left,</v>
<v Speaker 1>I'll put it on the right.</v>

267
00:20:20.900 --> 00:20:25.900
<v Speaker 1>If you have dozens of filters,</v>
<v Speaker 1>you would have dozens of images on the </v>

268
00:20:25.900 --> 00:20:30.101
<v Speaker 1>right,</v>
<v Speaker 1>each with different results that show </v>

269
00:20:30.410 --> 00:20:35.410
<v Speaker 1>where each of the individual filter </v>
<v Speaker 1>patterns were found and we learn what </v>

270
00:20:35.451 --> 00:20:40.160
<v Speaker 1>patterns are useful to look for in order</v>
<v Speaker 1>to perform the classification task.</v>

271
00:20:40.850 --> 00:20:43.760
<v Speaker 1>That's the task for the neural network </v>
<v Speaker 1>to learn.</v>

272
00:20:43.790 --> 00:20:48.790
<v Speaker 1>These filters and the filters have </v>
<v Speaker 1>higher and higher order of </v>

273
00:20:50.070 --> 00:20:55.070
<v Speaker 1>representation.</v>
<v Speaker 1>Going from the very basic edges to the </v>

274
00:20:56.001 --> 00:21:01.001
<v Speaker 1>high semantic meaning.</v>
<v Speaker 1>That spans entire images and the ability</v>

275
00:21:03.251 --> 00:21:06.430
<v Speaker 1>to span images can be done in several </v>
<v Speaker 1>ways,</v>

276
00:21:06.730 --> 00:21:09.760
<v Speaker 1>but traditionally has been successfully </v>
<v Speaker 1>done through Max pooling,</v>

277
00:21:09.761 --> 00:21:14.761
<v Speaker 1>through pooling of taking the output of </v>
<v Speaker 1>a convolutional operation and reducing </v>

278
00:21:18.981 --> 00:21:23.981
<v Speaker 1>the resolution of that by by condensing </v>
<v Speaker 1>that information,</v>

279
00:21:24.321 --> 00:21:26.570
<v Speaker 1>but for example,</v>
<v Speaker 1>taking the maximum values,</v>

280
00:21:26.571 --> 00:21:27.830
<v Speaker 1>the maximum activations,</v>

281
00:21:31.920 --> 00:21:36.920
<v Speaker 1>therefore reducing the spatial </v>
<v Speaker 1>resolution which has detrimental effects</v>

282
00:21:36.991 --> 00:21:39.240
<v Speaker 1>as we'll talk about in scene </v>
<v Speaker 1>segmentation,</v>

283
00:21:39.480 --> 00:21:44.480
<v Speaker 1>but it's beneficial for finding higher </v>
<v Speaker 1>order representations and the images </v>

284
00:21:44.670 --> 00:21:49.670
<v Speaker 1>that bring images together that bring </v>
<v Speaker 1>features together to form an entity that</v>

285
00:21:49.831 --> 00:21:53.790
<v Speaker 1>we're trying to identify and classify.</v>
<v Speaker 1>Okay,</v>

286
00:21:54.210 --> 00:21:59.210
<v Speaker 1>so that forms a convolution.</v>
<v Speaker 1>You'll networks such convolutional </v>

287
00:21:59.210 --> 00:22:03.651
<v Speaker 1>layers stacked on top of each other is </v>
<v Speaker 1>the only addition to a neural network </v>

288
00:22:03.651 --> 00:22:07.701
<v Speaker 1>that makes for a convolutional neural </v>
<v Speaker 1>network and then at the end the fully </v>

289
00:22:07.701 --> 00:22:12.531
<v Speaker 1>connected layers or any kind of other </v>
<v Speaker 1>architectures allow us to apply </v>

290
00:22:13.840 --> 00:22:18.840
<v Speaker 1>particular domains.</v>
<v Speaker 1>Let's take image net as a case study in </v>

291
00:22:21.330 --> 00:22:26.330
<v Speaker 1>image net dataset and image net.</v>
<v Speaker 1>The challenge,</v>

292
00:22:27.120 --> 00:22:30.780
<v Speaker 1>the task is classification.</v>
<v Speaker 1>As I mentioned,</v>

293
00:22:30.781 --> 00:22:33.840
<v Speaker 1>the first lecture,</v>
<v Speaker 1>imaging as a data set,</v>

294
00:22:34.830 --> 00:22:36.900
<v Speaker 1>one of the largest in the world of </v>
<v Speaker 1>images.</v>

295
00:22:37.480 --> 00:22:40.100
<v Speaker 1>The $14 million images,</v>
<v Speaker 1>21,000</v>

296
00:22:40.110 --> 00:22:45.110
<v Speaker 1>categories and a lot of depth to many of</v>
<v Speaker 1>the categories,</v>

297
00:22:46.590 --> 00:22:49.200
<v Speaker 1>as I mentioned,</v>
<v Speaker 1>1200 granny Smith apples.</v>

298
00:22:53.190 --> 00:22:58.190
<v Speaker 1>These allow to these allow them to learn</v>
<v Speaker 1>the rich representations in both posed </v>

299
00:23:00.091 --> 00:23:05.091
<v Speaker 1>lighting variability and intercluster </v>
<v Speaker 1>class variation for the particular </v>

300
00:23:05.091 --> 00:23:08.571
<v Speaker 1>things.</v>
<v Speaker 1>Particular classes like Granny Smith </v>

301
00:23:08.571 --> 00:23:12.351
<v Speaker 1>apples,</v>
<v Speaker 1>so let's look through the various </v>

302
00:23:12.351 --> 00:23:12.540
<v Speaker 1>networks.</v>
<v Speaker 1>Let's discuss them.</v>

303
00:23:12.660 --> 00:23:15.950
<v Speaker 1>Let's see the insights.</v>
<v Speaker 1>It started with Alex Net,</v>

304
00:23:15.990 --> 00:23:20.990
<v Speaker 1>the first really big successful GPU </v>
<v Speaker 1>train neural network on image net that's</v>

305
00:23:21.421 --> 00:23:26.421
<v Speaker 1>achieved a significant boost over the </v>
<v Speaker 1>previous year and moved onto vgg net,</v>

306
00:23:28.530 --> 00:23:33.530
<v Speaker 1>Google net,</v>
<v Speaker 1>a Goo lanette resonate yet cua image and</v>

307
00:23:35.490 --> 00:23:38.280
<v Speaker 1>as ynet in 2017.</v>

308
00:23:40.780 --> 00:23:45.780
<v Speaker 1>Again,</v>
<v Speaker 1>the numbers will show for the accuracy </v>

309
00:23:45.780 --> 00:23:48.871
<v Speaker 1>or based on the top five error rate,</v>
<v Speaker 1>we get five guesses and it's a one or </v>

310
00:23:49.731 --> 00:23:51.260
<v Speaker 1>zero.</v>
<v Speaker 1>If you get guests,</v>

311
00:23:51.290 --> 00:23:54.640
<v Speaker 1>if one of the five is correct,</v>
<v Speaker 1>you get a one for that particular guest.</v>

312
00:23:55.010 --> 00:23:56.150
<v Speaker 1>Otherwise it's a zero</v>

313
00:24:01.780 --> 00:24:06.780
<v Speaker 1>and human error is five point one.</v>
<v Speaker 1>When a human tries to achieve the same </v>

314
00:24:06.780 --> 00:24:09.930
<v Speaker 1>tries to perform the same task.</v>
<v Speaker 1>As the machinist asks,</v>

315
00:24:09.931 --> 00:24:14.380
<v Speaker 1>we're doing the air is five point one,</v>
<v Speaker 1>the human adaptation that's performed on</v>

316
00:24:14.381 --> 00:24:17.110
<v Speaker 1>the images based on binary </v>
<v Speaker 1>classification,</v>

317
00:24:17.230 --> 00:24:22.180
<v Speaker 1>Granny Smith,</v>
<v Speaker 1>apple are not cat or not the actual task</v>

318
00:24:22.270 --> 00:24:26.620
<v Speaker 1>that the machine has to perform and that</v>
<v Speaker 1>the human competing has to perform,</v>

319
00:24:26.800 --> 00:24:31.480
<v Speaker 1>is given an image,</v>
<v Speaker 1>is provide one of the many classes under</v>

320
00:24:31.481 --> 00:24:34.000
<v Speaker 1>that human error is five point one </v>
<v Speaker 1>percent,</v>

321
00:24:34.060 --> 00:24:37.690
<v Speaker 1>which was surpassed in 2015 by residents</v>
<v Speaker 1>yet,</v>

322
00:24:38.320 --> 00:24:40.840
<v Speaker 1>uh,</v>
<v Speaker 1>to achieve four percent error.</v>

323
00:24:44.250 --> 00:24:48.240
<v Speaker 1>So let's start with Alex Net.</v>
<v Speaker 1>I'll zoom in on the later networks.</v>

324
00:24:48.241 --> 00:24:53.241
<v Speaker 1>They have some interesting insights,</v>
<v Speaker 1>but Alex Net and Vgg net both followed a</v>

325
00:24:54.391 --> 00:24:58.680
<v Speaker 1>very similar architecture,</v>
<v Speaker 1>very uniform throughout its depth.</v>

326
00:25:02.030 --> 00:25:07.030
<v Speaker 1>Vgg Net in 2014 is convolution </v>
<v Speaker 1>convolution pooling,</v>

327
00:25:08.360 --> 00:25:10.580
<v Speaker 1>convolution pooling,</v>
<v Speaker 1>convolution pooling,</v>

328
00:25:10.760 --> 00:25:15.760
<v Speaker 1>and fully connected layers at the end </v>
<v Speaker 1>does a certain kind of beautiful </v>

329
00:25:15.760 --> 00:25:20.711
<v Speaker 1>simplicity uniformity to these </v>
<v Speaker 1>architectures because you can just make </v>

330
00:25:20.711 --> 00:25:24.441
<v Speaker 1>a deeper and deeper and makes it very </v>
<v Speaker 1>amenable to a implementation in the </v>

331
00:25:24.501 --> 00:25:29.501
<v Speaker 1>layer stack kind of way.</v>
<v Speaker 1>And in any of the deep learning </v>

332
00:25:29.501 --> 00:25:30.980
<v Speaker 1>frameworks,</v>
<v Speaker 1>it's clean and beautiful to understand.</v>

333
00:25:31.460 --> 00:25:36.380
<v Speaker 1>In the case of egg net 16 or 19 layers </v>
<v Speaker 1>with 138 million parameters,</v>

334
00:25:36.680 --> 00:25:38.840
<v Speaker 1>not many optimizations on these </v>
<v Speaker 1>parameters.</v>

335
00:25:38.841 --> 00:25:43.841
<v Speaker 1>Therefore the number of parameters is </v>
<v Speaker 1>much higher than the networks that </v>

336
00:25:43.841 --> 00:25:45.500
<v Speaker 1>followed it.</v>
<v Speaker 1>Despite the layers not being that large.</v>

337
00:25:48.070 --> 00:25:50.590
<v Speaker 1>Google net introduced the inception </v>
<v Speaker 1>module,</v>

338
00:25:51.100 --> 00:25:56.100
<v Speaker 1>starting to do some interesting things </v>
<v Speaker 1>with the small modules within these </v>

339
00:25:57.191 --> 00:26:01.690
<v Speaker 1>networks which allow for the training to</v>
<v Speaker 1>be more efficient and effective.</v>

340
00:26:03.760 --> 00:26:08.760
<v Speaker 1>The idea behind the inception module </v>
<v Speaker 1>shown here with the previous layer on </v>

341
00:26:08.771 --> 00:26:13.771
<v Speaker 1>bottom and the convolutional layer here </v>
<v Speaker 1>with the inception module</v>

342
00:26:16.690 --> 00:26:21.690
<v Speaker 1>on top producer on top is it used the </v>
<v Speaker 1>idea that different size convolutions </v>

343
00:26:25.541 --> 00:26:30.541
<v Speaker 1>provide different value for the network.</v>
<v Speaker 1>Smaller convolutions are able to capture</v>

344
00:26:31.630 --> 00:26:36.630
<v Speaker 1>or propagate forward features that are </v>
<v Speaker 1>very local in high resolution and in,</v>

345
00:26:39.430 --> 00:26:40.750
<v Speaker 1>in,</v>
<v Speaker 1>in texture.</v>

346
00:26:41.530 --> 00:26:46.530
<v Speaker 1>Larger convolutions are better able to </v>
<v Speaker 1>to represent and capture and catch </v>

347
00:26:47.850 --> 00:26:51.030
<v Speaker 1>highly abstracted features,</v>
<v Speaker 1>higher order features.</v>

348
00:26:51.690 --> 00:26:54.600
<v Speaker 1>So the idea behind the inception module </v>
<v Speaker 1>is to say,</v>

349
00:26:54.601 --> 00:26:59.601
<v Speaker 1>well,</v>
<v Speaker 1>as opposed to choosing an in a hyper </v>

350
00:26:59.601 --> 00:27:01.710
<v Speaker 1>parameter tuning process or architecture</v>
<v Speaker 1>design process,</v>

351
00:27:01.711 --> 00:27:05.010
<v Speaker 1>choosing which convolution size we want </v>
<v Speaker 1>to go with,</v>

352
00:27:05.430 --> 00:27:09.570
<v Speaker 1>why not do all of them?</v>
<v Speaker 1>To get well several together in the case</v>

353
00:27:09.571 --> 00:27:13.200
<v Speaker 1>of the Google net model,</v>
<v Speaker 1>there's the one by one,</v>

354
00:27:13.230 --> 00:27:18.230
<v Speaker 1>three by three and five by five </v>
<v Speaker 1>convolutions with the old trusty friend </v>

355
00:27:18.230 --> 00:27:19.980
<v Speaker 1>of Max pooling still left in there as </v>
<v Speaker 1>well,</v>

356
00:27:21.540 --> 00:27:25.800
<v Speaker 1>which has lost favor more and more over </v>
<v Speaker 1>time for the image classification task,</v>

357
00:27:27.450 --> 00:27:31.170
<v Speaker 1>and the result is there's fewer </v>
<v Speaker 1>parameters are required.</v>

358
00:27:31.200 --> 00:27:36.200
<v Speaker 1>If you pick the placing of these </v>
<v Speaker 1>inception modules correctly,</v>

359
00:27:37.350 --> 00:27:42.350
<v Speaker 1>the number of parameters required to </v>
<v Speaker 1>achieve a higher performance is much </v>

360
00:27:42.350 --> 00:27:42.780
<v Speaker 1>lower</v>

361
00:27:46.220 --> 00:27:51.220
<v Speaker 1>radnet,</v>
<v Speaker 1>one of the most popular still to date </v>

362
00:27:53.770 --> 00:27:58.770
<v Speaker 1>architectures that we'll discuss in and </v>
<v Speaker 1>scene segmentation as well came up and </v>

363
00:28:01.751 --> 00:28:06.751
<v Speaker 1>use the idea of a residual block.</v>
<v Speaker 1>The initial inspiring observation,</v>

364
00:28:08.381 --> 00:28:11.230
<v Speaker 1>which doesn't necessarily hold true as </v>
<v Speaker 1>it turns out,</v>

365
00:28:11.560 --> 00:28:16.560
<v Speaker 1>but that network depth increases </v>
<v Speaker 1>representation power,</v>

366
00:28:16.960 --> 00:28:21.520
<v Speaker 1>so these residual blocks allow you to </v>
<v Speaker 1>have much deeper networks,</v>

367
00:28:21.521 --> 00:28:26.521
<v Speaker 1>and I'll explain why in a second here,</v>
<v Speaker 1>but the thought was they worked so well </v>

368
00:28:28.121 --> 00:28:33.121
<v Speaker 1>because the networks how much deeper.</v>
<v Speaker 1>The key thing that makes these blocks so</v>

369
00:28:33.941 --> 00:28:38.941
<v Speaker 1>effective is the same idea.</v>
<v Speaker 1>It's very reminiscent of recurrent </v>

370
00:28:38.941 --> 00:28:43.741
<v Speaker 1>neural networks that I hope we get a </v>
<v Speaker 1>chance to talk about the training of </v>

371
00:28:45.011 --> 00:28:50.011
<v Speaker 1>them as much easier.</v>
<v Speaker 1>They will take a simple block repeated </v>

372
00:28:50.921 --> 00:28:55.010
<v Speaker 1>over and over,</v>
<v Speaker 1>and they passed the input along with our</v>

373
00:28:55.011 --> 00:29:00.011
<v Speaker 1>transformation along with the ability to</v>
<v Speaker 1>transform and to learn to learn the </v>

374
00:29:01.421 --> 00:29:05.980
<v Speaker 1>filters,</v>
<v Speaker 1>learn the weights so you're allowed to.</v>

375
00:29:07.030 --> 00:29:12.030
<v Speaker 1>You're allowed every layer to not only </v>
<v Speaker 1>take on the processing of previous </v>

376
00:29:12.941 --> 00:29:17.941
<v Speaker 1>layers,</v>
<v Speaker 1>but to take in the wrong transform data </v>

377
00:29:17.941 --> 00:29:21.750
<v Speaker 1>and learn something new.</v>
<v Speaker 1>The ability to learn something new </v>

378
00:29:21.750 --> 00:29:26.521
<v Speaker 1>allows you to have much deeper networks </v>
<v Speaker 1>and the simplicity of this block allows </v>

379
00:29:27.761 --> 00:29:30.010
<v Speaker 1>for more effective training.</v>

380
00:29:33.330 --> 00:29:38.330
<v Speaker 1>These state of the art in 2017,</v>
<v Speaker 1>the winner is squeezing excitation </v>

381
00:29:38.330 --> 00:29:42.770
<v Speaker 1>networks.</v>
<v Speaker 1>That unlike the previous year was cu </v>

382
00:29:42.770 --> 00:29:47.430
<v Speaker 1>image with shimply took ensemble methods</v>
<v Speaker 1>and combined a lot of successful </v>

383
00:29:47.430 --> 00:29:50.530
<v Speaker 1>approaches to take a marginal </v>
<v Speaker 1>improvement as seen.</v>

384
00:29:50.531 --> 00:29:55.531
<v Speaker 1>Net got a significant improvement,</v>
<v Speaker 1>at least in percentages,</v>

385
00:29:55.751 --> 00:29:58.810
<v Speaker 1>I think as a 25 percent reduction in </v>
<v Speaker 1>error</v>

386
00:30:00.640 --> 00:30:05.320
<v Speaker 1>from four percent to three percent.</v>
<v Speaker 1>Something like that.</v>

387
00:30:06.550 --> 00:30:11.550
<v Speaker 1>By using a very simple idea that I think</v>
<v Speaker 1>is important to mention as simple </v>

388
00:30:11.550 --> 00:30:16.261
<v Speaker 1>insight.</v>
<v Speaker 1>It added a parameter to each channel and</v>

389
00:30:16.331 --> 00:30:20.560
<v Speaker 1>the convolutional layer and the </v>
<v Speaker 1>convolutional block,</v>

390
00:30:21.400 --> 00:30:26.400
<v Speaker 1>so the network can now adjust the </v>
<v Speaker 1>weighting and each channel based for for</v>

391
00:30:28.180 --> 00:30:30.010
<v Speaker 1>each feature map,</v>
<v Speaker 1>based on the content,</v>

392
00:30:30.011 --> 00:30:35.011
<v Speaker 1>based on the input to the network.</v>
<v Speaker 1>This is kind of a takeaway to think </v>

393
00:30:35.011 --> 00:30:39.721
<v Speaker 1>about about any of the networks who talk</v>
<v Speaker 1>about any of the architectures is a lot </v>

394
00:30:40.721 --> 00:30:44.980
<v Speaker 1>of times you were recurrent neural </v>
<v Speaker 1>networks and convolution neural networks</v>

395
00:30:45.250 --> 00:30:49.390
<v Speaker 1>have tricks that significantly reduced </v>
<v Speaker 1>the number of parameters,</v>

396
00:30:50.320 --> 00:30:52.690
<v Speaker 1>the bulk,</v>
<v Speaker 1>the sort of low hanging fruit.</v>

397
00:30:52.930 --> 00:30:57.930
<v Speaker 1>They use spatial invariants of temporal </v>
<v Speaker 1>and to reduce the number of parameters </v>

398
00:30:57.930 --> 00:31:02.311
<v Speaker 1>to represent the input data,</v>
<v Speaker 1>but they also lead certain things not </v>

399
00:31:02.771 --> 00:31:07.771
<v Speaker 1>parametrized.</v>
<v Speaker 1>They don't allow the network to learn </v>

400
00:31:07.771 --> 00:31:10.471
<v Speaker 1>it.</v>
<v Speaker 1>Allowing this case to network to learn </v>

401
00:31:10.471 --> 00:31:10.720
<v Speaker 1>the waiting on each of the individual </v>
<v Speaker 1>channels.</v>

402
00:31:10.810 --> 00:31:15.810
<v Speaker 1>So each of the individual filters is </v>
<v Speaker 1>something that you learn as a long with </v>

403
00:31:15.810 --> 00:31:18.310
<v Speaker 1>the filters takes.</v>
<v Speaker 1>It makes a huge boost.</v>

404
00:31:18.490 --> 00:31:22.000
<v Speaker 1>The cool thing about this is it's </v>
<v Speaker 1>applicable to any architecture.</v>

405
00:31:22.300 --> 00:31:27.300
<v Speaker 1>This kind of block,</v>
<v Speaker 1>this kind of what the squeeze and </v>

406
00:31:27.300 --> 00:31:29.140
<v Speaker 1>excitation block is applicable to any </v>
<v Speaker 1>architecture.</v>

407
00:31:30.830 --> 00:31:32.450
<v Speaker 1>And,</v>
<v Speaker 1>uh,</v>

408
00:31:32.510 --> 00:31:37.510
<v Speaker 1>because obviously it's a,</v>
<v Speaker 1>it's just simply a parametrized is the </v>

409
00:31:37.510 --> 00:31:39.830
<v Speaker 1>ability to choose which filter you go </v>
<v Speaker 1>with based on the content.</v>

410
00:31:40.220 --> 00:31:43.370
<v Speaker 1>It's a subtle but crucial thing I think </v>
<v Speaker 1>is pretty cool.</v>

411
00:31:43.490 --> 00:31:47.210
<v Speaker 1>And for future research it inspires to </v>
<v Speaker 1>think about,</v>

412
00:31:47.650 --> 00:31:49.520
<v Speaker 1>uh,</v>
<v Speaker 1>what else can be parametrized and neural</v>

413
00:31:49.521 --> 00:31:54.521
<v Speaker 1>networks,</v>
<v Speaker 1>what else can be controlled as part of </v>

414
00:31:54.521 --> 00:31:57.251
<v Speaker 1>the learning process,</v>
<v Speaker 1>including hiring higher order hyper </v>

415
00:31:57.251 --> 00:31:57.530
<v Speaker 1>parameters,</v>
<v Speaker 1>which,</v>

416
00:31:57.620 --> 00:32:02.620
<v Speaker 1>which aspects of the training and the </v>
<v Speaker 1>architecture of the network can be part </v>

417
00:32:02.620 --> 00:32:05.750
<v Speaker 1>of the learning.</v>
<v Speaker 1>This is what this network inspires.</v>

418
00:32:11.830 --> 00:32:16.830
<v Speaker 1>Another network has been in development </v>
<v Speaker 1>since the nineties ideas but Jeff </v>

419
00:32:19.000 --> 00:32:20.200
<v Speaker 1>Hinton,</v>
<v Speaker 1>but really received,</v>

420
00:32:20.260 --> 00:32:22.270
<v Speaker 1>has been published on received </v>
<v Speaker 1>significant attention,</v>

421
00:32:22.271 --> 00:32:27.271
<v Speaker 1>2017 that I won't go into detail here.</v>
<v Speaker 1>Uh,</v>

422
00:32:27.810 --> 00:32:32.810
<v Speaker 1>we are going to release an online only </v>
<v Speaker 1>video about capsule networks.</v>

423
00:32:35.230 --> 00:32:40.230
<v Speaker 1>It's a little bit too technical,</v>
<v Speaker 1>but they inspire very important point </v>

424
00:32:41.210 --> 00:32:43.520
<v Speaker 1>that we should always think about deep </v>
<v Speaker 1>learning.</v>

425
00:32:44.060 --> 00:32:49.060
<v Speaker 1>Uh,</v>
<v Speaker 1>whenever it's successful is to think </v>

426
00:32:49.060 --> 00:32:51.371
<v Speaker 1>about what,</v>
<v Speaker 1>as I mentioned with the cat eating a </v>

427
00:32:51.371 --> 00:32:53.970
<v Speaker 1>banana on a philosophical and the </v>
<v Speaker 1>mathematical level,</v>

428
00:32:53.971 --> 00:32:58.971
<v Speaker 1>we have to consider what assumptions </v>
<v Speaker 1>these networks make and what through </v>

429
00:33:00.171 --> 00:33:05.171
<v Speaker 1>those assumptions they throw away.</v>
<v Speaker 1>So neural networks due to the spacial </v>

430
00:33:05.171 --> 00:33:09.821
<v Speaker 1>with convolutional neural networks due </v>
<v Speaker 1>to their spatial invariance throwaway </v>

431
00:33:09.821 --> 00:33:14.560
<v Speaker 1>information about the relationship </v>
<v Speaker 1>between the hierarchies between the </v>

432
00:33:16.281 --> 00:33:21.281
<v Speaker 1>simple and the complex objects.</v>
<v Speaker 1>So the face on the left and the face on </v>

433
00:33:21.281 --> 00:33:23.270
<v Speaker 1>the right looks the same to accomplish </v>
<v Speaker 1>in neural network.</v>

434
00:33:23.690 --> 00:33:28.690
<v Speaker 1>The presence of eyes and nose and mouth </v>
<v Speaker 1>is the essential aspect of what makes </v>

435
00:33:31.941 --> 00:33:34.660
<v Speaker 1>the classification task work for </v>
<v Speaker 1>accomplished or network.</v>

436
00:33:34.700 --> 00:33:38.360
<v Speaker 1>It will what worry will fire and say </v>
<v Speaker 1>this is definitely a face,</v>

437
00:33:39.230 --> 00:33:43.400
<v Speaker 1>but the spatial relationship is lost,</v>
<v Speaker 1>is ignored.</v>

438
00:33:43.550 --> 00:33:47.000
<v Speaker 1>Which means there's a lot of </v>
<v Speaker 1>implications to this.</v>

439
00:33:47.090 --> 00:33:52.090
<v Speaker 1>But for things like pose variation,</v>
<v Speaker 1>that information is lost.</v>

440
00:33:54.050 --> 00:33:59.050
<v Speaker 1>We're throwing away that away completely</v>
<v Speaker 1>and hoping that the pooling operation </v>

441
00:33:59.720 --> 00:34:04.720
<v Speaker 1>that's performing these networks is able</v>
<v Speaker 1>to sort of Mesh everything together to </v>

442
00:34:04.820 --> 00:34:09.820
<v Speaker 1>come up with the features that are </v>
<v Speaker 1>firing of the different parts of the </v>

443
00:34:09.820 --> 00:34:13.751
<v Speaker 1>face.</v>
<v Speaker 1>That then come up with a total </v>

444
00:34:13.751 --> 00:34:15.221
<v Speaker 1>classification that is a face without </v>
<v Speaker 1>representing really the relationship </v>

445
00:34:15.221 --> 00:34:19.271
<v Speaker 1>between these features at the low level </v>
<v Speaker 1>and the high level at the low level of </v>

446
00:34:19.611 --> 00:34:22.700
<v Speaker 1>the hierarchy,</v>
<v Speaker 1>the simple and the complex level.</v>

447
00:34:23.840 --> 00:34:28.610
<v Speaker 1>This is super exciting field now that's </v>
<v Speaker 1>hopefully will spark developments of how</v>

448
00:34:28.611 --> 00:34:31.870
<v Speaker 1>we design your networks that are able to</v>
<v Speaker 1>learn this,</v>

449
00:34:32.240 --> 00:34:37.240
<v Speaker 1>the rotational,</v>
<v Speaker 1>the orientation and variance a as well.</v>

450
00:34:40.520 --> 00:34:42.500
<v Speaker 1>Okay,</v>
<v Speaker 1>so as I mentioned,</v>

451
00:34:43.470 --> 00:34:46.010
<v Speaker 1>you take these convolutional neural </v>
<v Speaker 1>networks,</v>

452
00:34:46.011 --> 00:34:50.600
<v Speaker 1>chop off the final layer in order to </v>
<v Speaker 1>apply to a particular domain,</v>

453
00:34:51.330 --> 00:34:54.380
<v Speaker 1>and that is what we'll do with fully </v>
<v Speaker 1>convolutional neural networks.</v>

454
00:34:54.500 --> 00:34:58.580
<v Speaker 1>The ones that we tasked to segment the </v>
<v Speaker 1>image at a pixel level.</v>

455
00:35:00.020 --> 00:35:04.400
<v Speaker 1>As a reminder,</v>
<v Speaker 1>these networks through the convolutional</v>

456
00:35:04.401 --> 00:35:09.401
<v Speaker 1>process are really producing a heat map,</v>
<v Speaker 1>different parts of the network and </v>

457
00:35:12.451 --> 00:35:17.451
<v Speaker 1>getting excited based on the different </v>
<v Speaker 1>aspects of the image and so it can be </v>

458
00:35:17.451 --> 00:35:18.720
<v Speaker 1>used to do the localization of </v>
<v Speaker 1>detecting,</v>

459
00:35:18.721 --> 00:35:21.980
<v Speaker 1>not just classifying the image but </v>
<v Speaker 1>localized in the object.</v>

460
00:35:22.820 --> 00:35:27.820
<v Speaker 1>And it could do so at a pixel level.</v>
<v Speaker 1>So the convolutional layers are doing </v>

461
00:35:29.661 --> 00:35:34.661
<v Speaker 1>the encoding process.</v>
<v Speaker 1>They're taking the rich raw sensory </v>

462
00:35:34.661 --> 00:35:38.040
<v Speaker 1>information in the image and encoding </v>
<v Speaker 1>them into</v>

463
00:35:38.040 --> 00:35:43.040
<v Speaker 1>an interpretable set of features,</v>
<v Speaker 1>a representation that can then be used </v>

464
00:35:43.040 --> 00:35:47.601
<v Speaker 1>for classification,</v>
<v Speaker 1>but we can also then use a decoder up </v>

465
00:35:47.601 --> 00:35:51.451
<v Speaker 1>sample that information and produce a </v>
<v Speaker 1>map like this fully convolution neural </v>

466
00:35:52.351 --> 00:35:55.600
<v Speaker 1>network segmentation,</v>
<v Speaker 1>semantic segmentation,</v>

467
00:35:55.610 --> 00:35:58.170
<v Speaker 1>image segmentation.</v>
<v Speaker 1>The goal is to,</v>

468
00:35:58.260 --> 00:36:02.430
<v Speaker 1>as opposed to classify the entire image,</v>
<v Speaker 1>it can classify every single pixel,</v>

469
00:36:02.600 --> 00:36:07.600
<v Speaker 1>this pixel level segmentation you call </v>
<v Speaker 1>her every single pixel with what that </v>

470
00:36:07.600 --> 00:36:12.591
<v Speaker 1>Pixel,</v>
<v Speaker 1>what object that pixel belongs to in </v>

471
00:36:12.591 --> 00:36:13.950
<v Speaker 1>this two d space of the image,</v>
<v Speaker 1>the two d projection,</v>

472
00:36:14.700 --> 00:36:15.700
<v Speaker 1>the,</v>
<v Speaker 1>uh,</v>

473
00:36:15.990 --> 00:36:18.210
<v Speaker 1>in the image of a three dimensional </v>
<v Speaker 1>world.</v>

474
00:36:19.110 --> 00:36:22.110
<v Speaker 1>So the thing is,</v>
<v Speaker 1>there's been a lot of advancement in the</v>

475
00:36:22.111 --> 00:36:27.111
<v Speaker 1>last three years,</v>
<v Speaker 1>but it's still an incredibly difficult </v>

476
00:36:29.061 --> 00:36:32.360
<v Speaker 1>problem if you think,</v>
<v Speaker 1>if you think about,</v>

477
00:36:34.000 --> 00:36:39.000
<v Speaker 1>uh,</v>
<v Speaker 1>the amount of data that's used for </v>

478
00:36:39.000 --> 00:36:40.400
<v Speaker 1>training and the task of Pixel level,</v>
<v Speaker 1>uh,</v>

479
00:36:40.401 --> 00:36:45.401
<v Speaker 1>of megapixels here of millions of pixels</v>
<v Speaker 1>that are tasked with having a,</v>

480
00:36:45.880 --> 00:36:48.470
<v Speaker 1>a single label,</v>
<v Speaker 1>it's an extremely difficult problem.</v>

481
00:36:50.990 --> 00:36:55.990
<v Speaker 1>Why is this interesting,</v>
<v Speaker 1>important problems to try to solve as </v>

482
00:36:55.990 --> 00:36:59.090
<v Speaker 1>opposed to Bonnie boxes around cats?</v>
<v Speaker 1>Well,</v>

483
00:36:59.150 --> 00:37:01.970
<v Speaker 1>it's whenever precise boundaries of </v>
<v Speaker 1>objects are important.</v>

484
00:37:02.270 --> 00:37:07.270
<v Speaker 1>Certainly medical applications when </v>
<v Speaker 1>looking at imaging and detecting </v>

485
00:37:07.270 --> 00:37:08.300
<v Speaker 1>particular,</v>
<v Speaker 1>for example,</v>

486
00:37:08.301 --> 00:37:10.950
<v Speaker 1>detecting tumors and,</v>
<v Speaker 1>uh,</v>

487
00:37:11.310 --> 00:37:13.650
<v Speaker 1>in,</v>
<v Speaker 1>in medical imaging of,</v>

488
00:37:13.700 --> 00:37:14.620
<v Speaker 1>of different,</v>
<v Speaker 1>uh,</v>

489
00:37:14.700 --> 00:37:15.890
<v Speaker 1>uh,</v>
<v Speaker 1>different organs</v>

490
00:37:17.490 --> 00:37:22.490
<v Speaker 1>and in driving and robotics,</v>
<v Speaker 1>when objects are involved as a dense </v>

491
00:37:23.391 --> 00:37:25.220
<v Speaker 1>scene of all the vehicles,</v>
<v Speaker 1>pedestrians,</v>

492
00:37:25.221 --> 00:37:30.221
<v Speaker 1>cyclists,</v>
<v Speaker 1>we need to be able to not just have a </v>

493
00:37:30.221 --> 00:37:34.031
<v Speaker 1>loose estimate of where objects are.</v>
<v Speaker 1>We need to be able to have the exact </v>

494
00:37:34.031 --> 00:37:37.580
<v Speaker 1>boundaries and then potentially through </v>
<v Speaker 1>data fusion,</v>

495
00:37:37.610 --> 00:37:42.610
<v Speaker 1>fusing sensors together,</v>
<v Speaker 1>fusing this rich textual information </v>

496
00:37:42.610 --> 00:37:43.650
<v Speaker 1>about pedestrians,</v>
<v Speaker 1>cyclists,</v>

497
00:37:43.651 --> 00:37:48.651
<v Speaker 1>and vehicles to light our data that's </v>
<v Speaker 1>providing us the three dimensional map </v>

498
00:37:48.651 --> 00:37:52.811
<v Speaker 1>of the world or have both these semantic</v>
<v Speaker 1>meaning of the different objects and </v>

499
00:37:52.811 --> 00:37:56.861
<v Speaker 1>their exact three mentioned location.</v>
<v Speaker 1>Um,</v>

500
00:38:00.040 --> 00:38:05.040
<v Speaker 1>a lot of this work successfully,</v>
<v Speaker 1>a lot of the work and the semantic </v>

501
00:38:05.040 --> 00:38:09.691
<v Speaker 1>segmentation started with fully </v>
<v Speaker 1>convolutional networks for semantic </v>

502
00:38:09.691 --> 00:38:10.420
<v Speaker 1>segmentation,</v>
<v Speaker 1>paper,</v>

503
00:38:10.720 --> 00:38:15.720
<v Speaker 1>fcn.</v>
<v Speaker 1>That's where the name fcn came from in </v>

504
00:38:15.720 --> 00:38:17.620
<v Speaker 1>November 2014.</v>
<v Speaker 1>Now go through a few papers here to give</v>

505
00:38:17.621 --> 00:38:22.621
<v Speaker 1>you some intuition where the field has </v>
<v Speaker 1>gone and how that takes us to segue </v>

506
00:38:22.781 --> 00:38:25.090
<v Speaker 1>views,</v>
<v Speaker 1>the segmentation competition.</v>

507
00:38:26.840 --> 00:38:29.060
<v Speaker 1>So fcn,</v>
<v Speaker 1>repurpose the image,</v>

508
00:38:29.061 --> 00:38:33.140
<v Speaker 1>net pretreating nets the nets that were </v>
<v Speaker 1>trained to classify what's in an image,</v>

509
00:38:33.980 --> 00:38:38.980
<v Speaker 1>entire image,</v>
<v Speaker 1>and chopped off the fully connected </v>

510
00:38:38.980 --> 00:38:43.201
<v Speaker 1>layers.</v>
<v Speaker 1>And then have added decoder parts that </v>

511
00:38:43.201 --> 00:38:47.630
<v Speaker 1>up,</v>
<v Speaker 1>sample the image to produce a heat map </v>

512
00:38:48.820 --> 00:38:49.800
<v Speaker 1>here,</v>
<v Speaker 1>showing a,</v>

513
00:38:49.801 --> 00:38:51.370
<v Speaker 1>uh,</v>
<v Speaker 1>with a tabby cat,</v>

514
00:38:51.460 --> 00:38:53.530
<v Speaker 1>a heat map of where the cat is in the </v>
<v Speaker 1>image.</v>

515
00:38:54.880 --> 00:38:59.880
<v Speaker 1>It's a much slower,</v>
<v Speaker 1>much coarser resolution than the input </v>

516
00:38:59.880 --> 00:38:59.880
<v Speaker 1>image</v>

517
00:39:00.170 --> 00:39:05.170
<v Speaker 1>one eighth at best.</v>
<v Speaker 1>Skip connections to improve courses of </v>

518
00:39:05.541 --> 00:39:09.470
<v Speaker 1>upsampling.</v>
<v Speaker 1>There's a few tricks.</v>

519
00:39:10.350 --> 00:39:15.350
<v Speaker 1>If you do the most naive approach,</v>
<v Speaker 1>the upsampling is going to be extremely </v>

520
00:39:15.350 --> 00:39:17.510
<v Speaker 1>course because that's the whole point of</v>
<v Speaker 1>the neural network.</v>

521
00:39:17.810 --> 00:39:21.890
<v Speaker 1>The encoding part is you throw away all </v>
<v Speaker 1>the useless data,</v>

522
00:39:22.410 --> 00:39:23.180
<v Speaker 1>uh,</v>
<v Speaker 1>the,</v>

523
00:39:23.210 --> 00:39:26.210
<v Speaker 1>you to the most essential aspects that </v>
<v Speaker 1>represent that image.</v>

524
00:39:26.690 --> 00:39:31.690
<v Speaker 1>So you're throwing away a lot of </v>
<v Speaker 1>information that's necessary to then </v>

525
00:39:31.690 --> 00:39:35.260
<v Speaker 1>form a high resolution image.</v>
<v Speaker 1>So there's a few tricks where you skip a</v>

526
00:39:36.021 --> 00:39:41.021
<v Speaker 1>few of the final pooling operations to </v>
<v Speaker 1>go in a similar way.</v>

527
00:39:41.871 --> 00:39:46.871
<v Speaker 1>And this is the residual block to a go </v>
<v Speaker 1>to go to the output produced higher and </v>

528
00:39:46.871 --> 00:39:51.341
<v Speaker 1>higher resolution heat map at the end </v>
<v Speaker 1>segment in 2015.</v>

529
00:39:52.870 --> 00:39:57.870
<v Speaker 1>I applied this to the driving context </v>
<v Speaker 1>and really taking it to Kitty Dataset </v>

530
00:39:58.791 --> 00:40:03.791
<v Speaker 1>and have have have shown a lot of </v>
<v Speaker 1>interesting results and really explored </v>

531
00:40:03.801 --> 00:40:06.890
<v Speaker 1>the encoder decoder or formulation of </v>
<v Speaker 1>the problem.</v>

532
00:40:09.590 --> 00:40:14.590
<v Speaker 1>Really solidifying this the place of the</v>
<v Speaker 1>encoder decoder framework for the </v>

533
00:40:14.590 --> 00:40:19.181
<v Speaker 1>segmentation task dilated convolution.</v>
<v Speaker 1>I'm taking you through a few components </v>

534
00:40:20.691 --> 00:40:25.400
<v Speaker 1>which are critical here to the state of </v>
<v Speaker 1>the art dilated convolutions,</v>

535
00:40:26.120 --> 00:40:31.120
<v Speaker 1>so the convolution operation as the </v>
<v Speaker 1>pooling operation reduces resolution </v>

536
00:40:33.870 --> 00:40:38.870
<v Speaker 1>significantly and dilated convolution </v>
<v Speaker 1>has a certain kind of gritting as </v>

537
00:40:40.551 --> 00:40:45.551
<v Speaker 1>visualized there that maintains the the </v>
<v Speaker 1>local high resolution textures while </v>

538
00:40:50.481 --> 00:40:55.481
<v Speaker 1>still capturing the spatial window </v>
<v Speaker 1>necessary.</v>

539
00:40:56.790 --> 00:41:01.790
<v Speaker 1>It's called dilated convolutional layer </v>
<v Speaker 1>and that's in a 2015 paper,</v>

540
00:41:03.680 --> 00:41:08.510
<v Speaker 1>proved to be much better at upsampling.</v>
<v Speaker 1>A high resolution image.</v>

541
00:41:11.540 --> 00:41:16.540
<v Speaker 1>The lab with a B v One v Two v three </v>
<v Speaker 1>added conditional random fields,</v>

542
00:41:21.290 --> 00:41:26.030
<v Speaker 1>which is the final piece of the of the </v>
<v Speaker 1>state of the art puzzle here.</v>

543
00:41:26.330 --> 00:41:30.350
<v Speaker 1>A lot of the successful networks today </v>
<v Speaker 1>that do segmentation,</v>

544
00:41:30.560 --> 00:41:35.560
<v Speaker 1>not all do post process using a crs </v>
<v Speaker 1>conditional random fields,</v>

545
00:41:38.090 --> 00:41:40.910
<v Speaker 1>and what they do is they smoothed the </v>
<v Speaker 1>segmentation,</v>

546
00:41:41.300 --> 00:41:46.300
<v Speaker 1>the ups sampled segmentation that </v>
<v Speaker 1>results from the fcn by looking at the </v>

547
00:41:46.300 --> 00:41:47.150
<v Speaker 1>underlying image intensities.</v>

548
00:41:50.890 --> 00:41:53.730
<v Speaker 1>So that's the key aspects of the </v>
<v Speaker 1>successful approaches.</v>

549
00:41:53.740 --> 00:41:58.740
<v Speaker 1>Today.</v>
<v Speaker 1>You have the encoder decoder framework </v>

550
00:41:58.740 --> 00:42:01.591
<v Speaker 1>of a fully accomplished in your network.</v>
<v Speaker 1>It replaces the fully connected layers </v>

551
00:42:01.591 --> 00:42:04.540
<v Speaker 1>with a convolutional layers,</v>
<v Speaker 1>dee dee convolutional layers,</v>

552
00:42:04.960 --> 00:42:09.960
<v Speaker 1>and as the years progressed from 2014 to</v>
<v Speaker 1>today as usual,</v>

553
00:42:11.470 --> 00:42:16.470
<v Speaker 1>the underlying networks from Alex Net to</v>
<v Speaker 1>Vgg net and to now Raza net have been </v>

554
00:42:19.900 --> 00:42:24.900
<v Speaker 1>one of the big reasons for the </v>
<v Speaker 1>improvements of these networks to be </v>

555
00:42:24.900 --> 00:42:27.340
<v Speaker 1>able to perform the segmentation.</v>
<v Speaker 1>So naturally they mirrored the image,</v>

556
00:42:27.341 --> 00:42:30.910
<v Speaker 1>net challenge performance and adapting </v>
<v Speaker 1>these networks.</v>

557
00:42:30.970 --> 00:42:35.970
<v Speaker 1>So the state of the art uses resonate or</v>
<v Speaker 1>similar networks conditional random </v>

558
00:42:35.970 --> 00:42:40.290
<v Speaker 1>fields for smoothing based on the input </v>
<v Speaker 1>image intensities and the dilated </v>

559
00:42:42.760 --> 00:42:46.450
<v Speaker 1>convolution that maintains the </v>
<v Speaker 1>computational cost,</v>

560
00:42:46.870 --> 00:42:51.870
<v Speaker 1>but increases the resolution of the </v>
<v Speaker 1>upsampling throughout the intermediate </v>

561
00:42:51.870 --> 00:42:56.640
<v Speaker 1>feature maps.</v>
<v Speaker 1>And that takes us to the state of the </v>

562
00:42:57.181 --> 00:43:02.181
<v Speaker 1>art that we used to produce the images </v>
<v Speaker 1>to produce the images for the </v>

563
00:43:04.261 --> 00:43:09.261
<v Speaker 1>competition.</v>
<v Speaker 1>Brezhnev do you see for dance upsampling</v>

564
00:43:09.421 --> 00:43:13.530
<v Speaker 1>convolution instead of by linear </v>
<v Speaker 1>upsampling,</v>

565
00:43:13.560 --> 00:43:16.170
<v Speaker 1>you make the upsampling learnable.</v>

566
00:43:18.210 --> 00:43:21.960
<v Speaker 1>You learned the upscaling filters that's</v>
<v Speaker 1>on the bottom.</v>

567
00:43:22.470 --> 00:43:24.210
<v Speaker 1>That's really the key part that made it </v>
<v Speaker 1>work.</v>

568
00:43:26.270 --> 00:43:30.120
<v Speaker 1>There should be a theme here.</v>
<v Speaker 1>Sometimes the biggest addition,</v>

569
00:43:30.121 --> 00:43:33.420
<v Speaker 1>they can be done.</v>
<v Speaker 1>This parametrizing one of the aspects of</v>

570
00:43:33.421 --> 00:43:35.040
<v Speaker 1>the network that you've taken for </v>
<v Speaker 1>granted.</v>

571
00:43:35.460 --> 00:43:39.480
<v Speaker 1>Letting the network learn that aspect </v>
<v Speaker 1>and the other.</v>

572
00:43:40.230 --> 00:43:42.780
<v Speaker 1>I'm not sure how important it is to the </v>
<v Speaker 1>success,</v>

573
00:43:42.781 --> 00:43:45.090
<v Speaker 1>but it's a.</v>
<v Speaker 1>it's a cool little addition,</v>

574
00:43:45.091 --> 00:43:50.091
<v Speaker 1>is a hybrid dilated convolution.</v>
<v Speaker 1>As I showed that visualization where the</v>

575
00:43:51.170 --> 00:43:56.170
<v Speaker 1>convolution is spread apart a little bit</v>
<v Speaker 1>at in the input from the input to the </v>

576
00:43:56.170 --> 00:44:01.071
<v Speaker 1>output,</v>
<v Speaker 1>the steps of that dilated convolution </v>

577
00:44:01.071 --> 00:44:01.760
<v Speaker 1>filter,</v>
<v Speaker 1>when they are changed,</v>

578
00:44:01.770 --> 00:44:04.620
<v Speaker 1>it produces a smoother results because </v>
<v Speaker 1>when it's kept the same,</v>

579
00:44:06.740 --> 00:44:11.740
<v Speaker 1>are there certain input?</v>
<v Speaker 1>Pixels get a lot more attention than </v>

580
00:44:11.740 --> 00:44:14.740
<v Speaker 1>others,</v>
<v Speaker 1>so losing that favoritism as well.</v>

581
00:44:14.741 --> 00:44:18.500
<v Speaker 1>It's achieved by using a variable </v>
<v Speaker 1>difference dilation rate.</v>

582
00:44:20.310 --> 00:44:25.310
<v Speaker 1>Those are the two tricks,</v>
<v Speaker 1>but really the biggest one is the </v>

583
00:44:25.310 --> 00:44:26.010
<v Speaker 1>parametric,</v>
<v Speaker 1>one of the upscaling filters.</v>

584
00:44:27.600 --> 00:44:29.010
<v Speaker 1>Okay,</v>
<v Speaker 1>so that's what we're.</v>

585
00:44:29.160 --> 00:44:34.160
<v Speaker 1>That's what we use to generate that data</v>
<v Speaker 1>and that's we provide you the code with </v>

586
00:44:34.160 --> 00:44:35.130
<v Speaker 1>if you're interested in competing in </v>
<v Speaker 1>psych fuse.</v>

587
00:44:36.030 --> 00:44:39.000
<v Speaker 1>The other aspect here that everything </v>
<v Speaker 1>we've talked about,</v>

588
00:44:39.001 --> 00:44:44.001
<v Speaker 1>ball from the classification to the </v>
<v Speaker 1>segmentation to making sense of images </v>

589
00:44:45.060 --> 00:44:50.060
<v Speaker 1>is there the information about a time </v>
<v Speaker 1>the temporal dynamics of the scene is </v>

590
00:44:50.971 --> 00:44:55.971
<v Speaker 1>thrown away and for the driving context </v>
<v Speaker 1>of the robotics contests and what we'd </v>

591
00:44:57.841 --> 00:45:02.841
<v Speaker 1>like to do is take fuse for the </v>
<v Speaker 1>segmentation dynamic scene segmentation </v>

592
00:45:02.841 --> 00:45:06.921
<v Speaker 1>context of when you try to interpret </v>
<v Speaker 1>what's going on in and seen over time </v>

593
00:45:06.921 --> 00:45:10.170
<v Speaker 1>and use that information.</v>
<v Speaker 1>Time is essential.</v>

594
00:45:11.310 --> 00:45:15.090
<v Speaker 1>The the movement of pixels is essential </v>
<v Speaker 1>through time.</v>

595
00:45:15.750 --> 00:45:20.750
<v Speaker 1>That that understanding how those </v>
<v Speaker 1>objects move in a three d space through </v>

596
00:45:23.521 --> 00:45:28.521
<v Speaker 1>the two d projection of an image is </v>
<v Speaker 1>fascinating and there's a lot of set of </v>

597
00:45:29.041 --> 00:45:34.041
<v Speaker 1>open problems there.</v>
<v Speaker 1>So flow is what's very helpful too as a </v>

598
00:45:36.651 --> 00:45:41.250
<v Speaker 1>starting point to help us understand how</v>
<v Speaker 1>these pixels move flow,</v>

599
00:45:41.970 --> 00:45:46.970
<v Speaker 1>optical flow,</v>
<v Speaker 1>dance optical computation that are best </v>

600
00:45:46.970 --> 00:45:51.680
<v Speaker 1>at our best approximation of where each </v>
<v Speaker 1>pixel in image one</v>

601
00:45:55.380 --> 00:46:00.120
<v Speaker 1>and moved in temporarily following </v>
<v Speaker 1>image.</v>

602
00:46:00.121 --> 00:46:03.210
<v Speaker 1>After that,</v>
<v Speaker 1>there's two images in 30 frames.</v>

603
00:46:03.211 --> 00:46:08.211
<v Speaker 1>A second is one image at time zero,</v>
<v Speaker 1>the other is 33 point three milliseconds</v>

604
00:46:08.371 --> 00:46:13.371
<v Speaker 1>later,</v>
<v Speaker 1>and the dense optical flow is our best </v>

605
00:46:13.371 --> 00:46:15.600
<v Speaker 1>estimate of how each pixel in the input </v>
<v Speaker 1>image moved to end.</v>

606
00:46:15.601 --> 00:46:20.601
<v Speaker 1>The output image and the optical flow </v>
<v Speaker 1>for every pixel produces a direction of </v>

607
00:46:21.571 --> 00:46:26.571
<v Speaker 1>where we think that pixel moved and the </v>
<v Speaker 1>magnitude of how far moved that allows </v>

608
00:46:26.571 --> 00:46:31.161
<v Speaker 1>us to take information that we detected </v>
<v Speaker 1>about the first frame and try to </v>

609
00:46:31.980 --> 00:46:36.980
<v Speaker 1>propagate it forward.</v>
<v Speaker 1>This is the competition is to try to </v>

610
00:46:36.980 --> 00:46:41.570
<v Speaker 1>segment and image and propagate that </v>
<v Speaker 1>information forward for manual </v>

611
00:46:43.071 --> 00:46:48.071
<v Speaker 1>annotation of a of an image.</v>
<v Speaker 1>So this kind of coloring book </v>

612
00:46:48.880 --> 00:46:53.880
<v Speaker 1>annotation,</v>
<v Speaker 1>will you call every single pixel in the </v>

613
00:46:53.880 --> 00:46:56.641
<v Speaker 1>state of the art data set for driving </v>
<v Speaker 1>city scapes that it takes one point </v>

614
00:46:59.451 --> 00:47:01.310
<v Speaker 1>five,</v>
<v Speaker 1>one point five hours,</v>

615
00:47:01.311 --> 00:47:05.450
<v Speaker 1>90 minutes to do that coloring,</v>
<v Speaker 1>that's 90 minutes per image.</v>

616
00:47:06.380 --> 00:47:10.490
<v Speaker 1>That's extremely long time.</v>
<v Speaker 1>That's why it doesn't exist today.</v>

617
00:47:10.500 --> 00:47:15.500
<v Speaker 1>Dataset.</v>
<v Speaker 1>And in this class we're going to create </v>

618
00:47:15.500 --> 00:47:15.500
<v Speaker 1>one</v>

619
00:47:15.610 --> 00:47:19.250
<v Speaker 1>of segmentation of these images is </v>
<v Speaker 1>through time,</v>

620
00:47:20.270 --> 00:47:25.270
<v Speaker 1>through video,</v>
<v Speaker 1>so long videos where every single frame </v>

621
00:47:25.340 --> 00:47:30.340
<v Speaker 1>is fully segmented.</v>
<v Speaker 1>That's still an open problem that we </v>

622
00:47:30.340 --> 00:47:33.130
<v Speaker 1>need to solve flows,</v>
<v Speaker 1>a piece of that.</v>

623
00:47:33.580 --> 00:47:38.580
<v Speaker 1>And we also provide you the this </v>
<v Speaker 1>computer state of the art flow using </v>

624
00:47:39.461 --> 00:47:40.850
<v Speaker 1>flow net two point.</v>
<v Speaker 1>Oh,</v>

625
00:47:41.710 --> 00:47:43.720
<v Speaker 1>so flow net one point,</v>
<v Speaker 1>oh,</v>

626
00:47:43.750 --> 00:47:48.750
<v Speaker 1>in May 2015,</v>
<v Speaker 1>use neural networks to learn the optical</v>

627
00:47:50.471 --> 00:47:52.120
<v Speaker 1>flow,</v>
<v Speaker 1>the dense optical flow.</v>

628
00:47:53.410 --> 00:47:56.260
<v Speaker 1>And it did so with two kinds of </v>
<v Speaker 1>architectures.</v>

629
00:47:56.740 --> 00:47:59.590
<v Speaker 1>Flow net s flown out,</v>
<v Speaker 1>simple and flown on.</v>

630
00:47:59.620 --> 00:48:04.620
<v Speaker 1>Core flow in that.</v>
<v Speaker 1>See the simple one is simply taking the </v>

631
00:48:04.620 --> 00:48:05.200
<v Speaker 1>two images.</v>
<v Speaker 1>So what's,</v>

632
00:48:05.230 --> 00:48:10.230
<v Speaker 1>what's the task here?</v>
<v Speaker 1>There's two images and you want to </v>

633
00:48:10.230 --> 00:48:11.500
<v Speaker 1>produce in those two images,</v>
<v Speaker 1>they follow each other in time,</v>

634
00:48:11.770 --> 00:48:15.010
<v Speaker 1>33 point three milliseconds apart,</v>
<v Speaker 1>and uh,</v>

635
00:48:15.011 --> 00:48:18.370
<v Speaker 1>your task is the Zl output to produce </v>
<v Speaker 1>the dense optical flow.</v>

636
00:48:18.880 --> 00:48:22.300
<v Speaker 1>So for the simple architecture,</v>
<v Speaker 1>you just stack them together,</v>

637
00:48:22.390 --> 00:48:27.390
<v Speaker 1>each our rgb,</v>
<v Speaker 1>so it produces a six channel input to </v>

638
00:48:27.390 --> 00:48:28.780
<v Speaker 1>the network.</v>
<v Speaker 1>There's a lot of convolution and finally</v>

639
00:48:28.781 --> 00:48:33.781
<v Speaker 1>it's the same kind of process as the </v>
<v Speaker 1>fully convolutional neural networks to </v>

640
00:48:33.781 --> 00:48:38.461
<v Speaker 1>produce the optical flow.</v>
<v Speaker 1>Then there is flown at correlation </v>

641
00:48:38.461 --> 00:48:43.380
<v Speaker 1>architecture where you performed some </v>
<v Speaker 1>convolutions separately before using a </v>

642
00:48:43.380 --> 00:48:45.970
<v Speaker 1>correlation later to combine the feature</v>
<v Speaker 1>maps.</v>

643
00:48:48.620 --> 00:48:53.620
<v Speaker 1>Both are effective and different </v>
<v Speaker 1>datasets and different applications.</v>

644
00:48:55.010 --> 00:49:00.010
<v Speaker 1>So flow net two point zero in December </v>
<v Speaker 1>2016 is one of the state of the art </v>

645
00:49:02.211 --> 00:49:06.500
<v Speaker 1>frameworks code bases that we use to </v>
<v Speaker 1>generate the data.</v>

646
00:49:06.501 --> 00:49:11.501
<v Speaker 1>I'll show combines the flown at sm,</v>
<v Speaker 1>flown as C and improves over the initial</v>

647
00:49:12.351 --> 00:49:17.351
<v Speaker 1>flow net.</v>
<v Speaker 1>Producing a smoother flow field </v>

648
00:49:17.351 --> 00:49:20.680
<v Speaker 1>preserves the fine motion detail along </v>
<v Speaker 1>the edges of the objects and it runs </v>

649
00:49:22.040 --> 00:49:26.540
<v Speaker 1>extremely efficiently depending on the </v>
<v Speaker 1>architecture as a few variants,</v>

650
00:49:26.810 --> 00:49:31.810
<v Speaker 1>either eight to 140 frames a second,</v>
<v Speaker 1>and the process there is essentially one</v>

651
00:49:34.941 --> 00:49:36.740
<v Speaker 1>that's common across various </v>
<v Speaker 1>applications.</v>

652
00:49:36.770 --> 00:49:39.560
<v Speaker 1>Deep learning is stacking these networks</v>
<v Speaker 1>together.</v>

653
00:49:40.700 --> 00:49:45.700
<v Speaker 1>The very interesting aspect here that </v>
<v Speaker 1>we're still exploring,</v>

654
00:49:47.271 --> 00:49:51.140
<v Speaker 1>and again applicable in all of deep </v>
<v Speaker 1>learning in this case,</v>

655
00:49:51.170 --> 00:49:56.000
<v Speaker 1>it seemed that there was a strong effect</v>
<v Speaker 1>in taking sparse small,</v>

656
00:49:56.030 --> 00:50:01.030
<v Speaker 1>multiple Dataset and doing the training,</v>
<v Speaker 1>the order of which those datasets were </v>

657
00:50:01.030 --> 00:50:03.140
<v Speaker 1>used for the training process mattered a</v>
<v Speaker 1>lot.</v>

658
00:50:04.060 --> 00:50:09.060
<v Speaker 1>That's very interesting.</v>
<v Speaker 1>So using flow net two point zero,</v>

659
00:50:10.420 --> 00:50:13.470
<v Speaker 1>here's the Dataset we're making </v>
<v Speaker 1>available for psych fuse.</v>

660
00:50:13.480 --> 00:50:18.480
<v Speaker 1>The competition cars that mit did,</v>
<v Speaker 1>you use less sex views first.</v>

661
00:50:18.961 --> 00:50:23.961
<v Speaker 1>The original video,</v>
<v Speaker 1>US driving in high definition 10 ADP and</v>

662
00:50:26.641 --> 00:50:29.390
<v Speaker 1>a eightK ,</v>
<v Speaker 1>three 60 video</v>

663
00:50:31.900 --> 00:50:35.580
<v Speaker 1>original video driving around a </v>
<v Speaker 1>cambridge.</v>

664
00:50:37.300 --> 00:50:42.300
<v Speaker 1>Then we're providing the ground truth </v>
<v Speaker 1>for a training set for that training set</v>

665
00:50:45.521 --> 00:50:48.010
<v Speaker 1>for every single frame.</v>
<v Speaker 1>30 frames a second,</v>

666
00:50:48.011 --> 00:50:53.011
<v Speaker 1>we're providing the segmentation frame </v>
<v Speaker 1>to frame to frame segmented on </v>

667
00:50:53.591 --> 00:50:58.591
<v Speaker 1>mechanical Turk.</v>
<v Speaker 1>We're also providing the output of the </v>

668
00:51:00.490 --> 00:51:02.050
<v Speaker 1>network that I mentioned,</v>
<v Speaker 1>the dad,</v>

669
00:51:02.051 --> 00:51:06.490
<v Speaker 1>there are segmentation network that's </v>
<v Speaker 1>pretty damn close to the ground truth,</v>

670
00:51:07.570 --> 00:51:11.690
<v Speaker 1>but still not.</v>
<v Speaker 1>And our task is.</v>

671
00:51:11.840 --> 00:51:16.840
<v Speaker 1>This is the interesting thing is our </v>
<v Speaker 1>task is to take the output of this </v>

672
00:51:17.571 --> 00:51:20.000
<v Speaker 1>network.</v>
<v Speaker 1>Well there's two options.</v>

673
00:51:20.150 --> 00:51:25.150
<v Speaker 1>One is to take the output of this </v>
<v Speaker 1>network and use CA use other networks to</v>

674
00:51:27.441 --> 00:51:29.780
<v Speaker 1>help you propagate the information </v>
<v Speaker 1>better.</v>

675
00:51:30.200 --> 00:51:35.200
<v Speaker 1>So what this segmentation,</v>
<v Speaker 1>the output of this network does is it </v>

676
00:51:35.901 --> 00:51:40.901
<v Speaker 1>only takes a frame by frame by frame.</v>
<v Speaker 1>It's not using the temporal information </v>

677
00:51:40.901 --> 00:51:45.341
<v Speaker 1>at all.</v>
<v Speaker 1>So the question is can we figure out a </v>

678
00:51:45.341 --> 00:51:46.190
<v Speaker 1>way,</v>
<v Speaker 1>can we figure out tricks to use temporal</v>

679
00:51:46.191 --> 00:51:49.280
<v Speaker 1>information to improve this </v>
<v Speaker 1>segmentation?</v>

680
00:51:49.460 --> 00:51:54.460
<v Speaker 1>So it looks more like this segmentation </v>
<v Speaker 1>and we're also providing the optical </v>

681
00:51:57.031 --> 00:52:02.031
<v Speaker 1>flow from frame to frame to frame.</v>
<v Speaker 1>So the optical flow based on flown at </v>

682
00:52:02.031 --> 00:52:04.380
<v Speaker 1>two point zero of how each of the pixels</v>
<v Speaker 1>moved.</v>

683
00:52:07.390 --> 00:52:12.390
<v Speaker 1>Okay.</v>
<v Speaker 1>And now forms the psych fuse </v>

684
00:52:12.390 --> 00:52:12.940
<v Speaker 1>competition.</v>
<v Speaker 1>10,000</v>

685
00:52:12.941 --> 00:52:17.530
<v Speaker 1>images.</v>
<v Speaker 1>And the task is to submit code.</v>

686
00:52:17.531 --> 00:52:22.531
<v Speaker 1>We have starter code in python and on </v>
<v Speaker 1>good hub to take in the original video,</v>

687
00:52:25.580 --> 00:52:28.060
<v Speaker 1>take in for the training,</v>
<v Speaker 1>set the ground truth,</v>

688
00:52:28.450 --> 00:52:31.900
<v Speaker 1>the segmentation from the state of the </v>
<v Speaker 1>art segmentation network,</v>

689
00:52:31.960 --> 00:52:36.370
<v Speaker 1>the optical flow from the state of the </v>
<v Speaker 1>art optical flow network,</v>

690
00:52:36.670 --> 00:52:40.690
<v Speaker 1>and taking that together to improve the.</v>
<v Speaker 1>The stuff on the bottom left,</v>

691
00:52:40.720 --> 00:52:44.800
<v Speaker 1>the segmentation to try to achieve the </v>
<v Speaker 1>ground truth on the on the top right.</v>

692
00:52:47.060 --> 00:52:48.980
<v Speaker 1>Okay.</v>
<v Speaker 1>With that,</v>

693
00:52:49.040 --> 00:52:54.040
<v Speaker 1>I'd like to thank you.</v>
<v Speaker 1>Tomorrow at one PM is Waymo in data 32,</v>

694
00:52:56.001 --> 00:52:56.420
<v Speaker 1>one,</v>
<v Speaker 1>two,</v>

695
00:52:56.421 --> 00:52:58.310
<v Speaker 1>three.</v>
<v Speaker 1>The next lecture,</v>

696
00:52:58.610 --> 00:53:03.610
<v Speaker 1>next week we'll be on deep learning for </v>
<v Speaker 1>a sensing the human understanding the </v>

697
00:53:03.610 --> 00:53:07.331
<v Speaker 1>human,</v>
<v Speaker 1>and we will release online only lecture </v>

698
00:53:07.331 --> 00:53:10.040
<v Speaker 1>on capsule networks and Gans generative </v>
<v Speaker 1>adversarial networks.</v>

699
00:53:10.250 --> 00:53:11.030
<v Speaker 1>Thank you very much.</v>