WEBVTT
1
00:00:00.450 --> 00:00:05.450
<v Speaker 1>Today we'll talk about how to make </v>
<v Speaker 1>machines see: computer vision, and we will</v>
2
00:00:06.331 --> 00:00:09.150
<v Speaker 1>present thank you.</v>
<v Speaker 1>Whoever said yes,</v>
3
00:00:11.250 --> 00:00:16.250
<v Speaker 1>and today we will present a competition </v>
<v Speaker 1>that, unlike DeepTraffic, which is </v>
4
00:00:17.911 --> 00:00:22.911
<v Speaker 1>designed to explore ideas,</v>
<v Speaker 1>teach you about concepts of deep,</v>
5
00:00:23.490 --> 00:00:28.490
<v Speaker 1>deep reinforcement learning, SegFuse,</v>
<v Speaker 1>the deep dynamic driving scene </v>
6
00:00:28.861 --> 00:00:32.700
<v Speaker 1>segmentation,</v>
<v Speaker 1>competition that we present today, is at</v>
7
00:00:32.701 --> 00:00:37.650
<v Speaker 1>the very cutting edge.</v>
<v Speaker 1>Whoever does well in this competition is</v>
8
00:00:37.651 --> 00:00:42.651
<v Speaker 1>likely to produce a publication or ideas</v>
<v Speaker 1>that would lead the world in the area of</v>
9
00:00:44.221 --> 00:00:48.390
<v Speaker 1>perception,</v>
<v Speaker 1>perhaps together with the people running</v>
10
00:00:48.391 --> 00:00:50.580
<v Speaker 1>this class,</v>
<v Speaker 1>perhaps on your own.</v>
11
00:00:51.200 --> 00:00:56.200
<v Speaker 1>I encourage you to do so even more so </v>
<v Speaker 1>today.</v>
12
00:00:57.870 --> 00:01:02.870
<v Speaker 1>Computer vision today, as it stands, is </v>
<v Speaker 1>deep learning. The majority of the successes </v>
13
00:01:07.230 --> 00:01:10.080
<v Speaker 1>in how we interpret, form </v>
<v Speaker 1>representations,</v>
14
00:01:10.230 --> 00:01:15.230
<v Speaker 1>understand images and videos utilize, to </v>
<v Speaker 1>a significant degree, neural networks.</v>
15
00:01:16.181 --> 00:01:21.181
<v Speaker 1>These are</v>
<v Speaker 1>the very ideas we've been talking about </v>
16
00:01:21.181 --> 00:01:24.560
<v Speaker 1>that apply to supervised,</v>
<v Speaker 1>unsupervised and reinforcement learning </v>
17
00:01:26.290 --> 00:01:29.810
<v Speaker 1>and the supervised case is the focus</v>
<v Speaker 1>of today.</v>
18
00:01:30.740 --> 00:01:34.850
<v Speaker 1>The process is the same.</v>
<v Speaker 1>The data is essential.</v>
19
00:01:34.940 --> 00:01:39.940
<v Speaker 1>There's annotated data where the human </v>
<v Speaker 1>provides the labels that serves as the </v>
20
00:01:39.940 --> 00:01:44.141
<v Speaker 1>ground truth in the training process.</v>
<v Speaker 1>Then the neural network goes through </v>
21
00:01:45.531 --> 00:01:50.531
<v Speaker 1>that data,</v>
<v Speaker 1>learning to map from the raw sensory </v>
22
00:01:50.531 --> 00:01:55.031
<v Speaker 1>input to the ground truth labels and </v>
<v Speaker 1>then generalize over the testing data set,</v>
23
00:01:57.320 --> 00:02:00.200
<v Speaker 1>and the kind of raw sensory data we're dealing</v>
<v Speaker 1>with are numbers.</v>
24
00:02:01.280 --> 00:02:05.900
<v Speaker 1>I'll say this again and again that for </v>
<v Speaker 1>human vision for us here,</v>
25
00:02:05.930 --> 00:02:10.930
<v Speaker 1>we take for granted this particular </v>
<v Speaker 1>aspect of our ability to take in raw </v>
26
00:02:10.930 --> 00:02:13.130
<v Speaker 1>sensor information through our eyes and </v>
<v Speaker 1>interpret it,</v>
27
00:02:13.880 --> 00:02:18.880
<v Speaker 1>but it's just numbers.</v>
<v Speaker 1>That's something whether you're an </v>
28
00:02:18.880 --> 00:02:20.960
<v Speaker 1>expert computer vision person or new </v>
<v Speaker 1>to the field,</v>
29
00:02:21.020 --> 00:02:26.020
<v Speaker 1>you have to always go back to meditate </v>
<v Speaker 1>on: what kind of things the machine is</v>
30
00:02:27.381 --> 00:02:28.610
<v Speaker 1>given,</v>
<v Speaker 1>what,</v>
31
00:02:28.640 --> 00:02:33.640
<v Speaker 1>what?</v>
<v Speaker 1>What is the data it is tasked to work </v>
32
00:02:33.640 --> 00:02:35.150
<v Speaker 1>with in order to perform the task you're</v>
<v Speaker 1>asking it to do?</v>
33
00:02:35.750 --> 00:02:40.750
<v Speaker 1>Perhaps the data that is given is highly </v>
<v Speaker 1>insufficient to do what you want it to </v>
34
00:02:40.971 --> 00:02:45.971
<v Speaker 1>do.</v>
<v Speaker 1>That's a question that will come up again </v>
35
00:02:45.971 --> 00:02:48.071
<v Speaker 1>and again: are images enough to </v>
<v Speaker 1>understand the world around you? And </v>
36
00:02:51.710 --> 00:02:54.830
<v Speaker 1>given these numbers,</v>
<v Speaker 1>these set of numbers,</v>
37
00:02:54.831 --> 00:02:59.831
<v Speaker 1>sometimes with one channel,</v>
<v Speaker 1>sometimes with three, RGB, where every </v>
38
00:02:59.831 --> 00:03:03.761
<v Speaker 1>single pixel has three different colors.</v>
<v Speaker 1>The task is to classify or regress, </v>
39
00:03:07.440 --> 00:03:12.440
<v Speaker 1>producing a continuous variable or one of </v>
<v Speaker 1>a set of class labels. As before,</v>
40
00:03:16.550 --> 00:03:21.550
<v Speaker 1>we must be careful about our intuition </v>
<v Speaker 1>of what is hard,</v>
41
00:03:21.990 --> 00:03:23.600
<v Speaker 1>what is easy in computer vision.</v>
42
00:03:28.210 --> 00:03:33.210
<v Speaker 1>Let's take a step back to the </v>
<v Speaker 1>inspiration for artificial neural networks,</v>
43
00:03:34.420 --> 00:03:39.420
<v Speaker 1>our own biological neural networks </v>
<v Speaker 1>because the human vision system and the </v>
44
00:03:40.061 --> 00:03:44.050
<v Speaker 1>computer vision system are a little bit </v>
<v Speaker 1>more similar in these regards.</v>
45
00:03:52.360 --> 00:03:57.360
<v Speaker 1>The structure of the human visual Cortex</v>
<v Speaker 1>is in layers, and as information passes </v>
46
00:03:58.480 --> 00:04:03.480
<v Speaker 1>from the eyes to the parts of the</v>
<v Speaker 1>brain that make sense of the input,</v>
47
00:04:03.700 --> 00:04:07.750
<v Speaker 1>the raw sensor information. Higher and higher</v>
<v Speaker 1>order representations are formed.</v>
48
00:04:08.830 --> 00:04:13.830
<v Speaker 1>This is the inspiration,</v>
<v Speaker 1>the idea behind using deep neural </v>
49
00:04:13.830 --> 00:04:17.821
<v Speaker 1>networks for images: higher and higher </v>
<v Speaker 1>order representations are formed through </v>
50
00:04:17.821 --> 00:04:18.190
<v Speaker 1>the layers,</v>
51
00:04:19.980 --> 00:04:24.980
<v Speaker 1>the early layers taking in the very raw </v>
<v Speaker 1>sensory information, then extracting </v>
52
00:04:25.830 --> 00:04:28.830
<v Speaker 1>edges,</v>
<v Speaker 1>connecting those edges,</v>
53
00:04:28.831 --> 00:04:33.831
<v Speaker 1>forming those edges to form more complex</v>
<v Speaker 1>features and finally into the higher </v>
54
00:04:33.831 --> 00:04:38.511
<v Speaker 1>order semantic meaning that we hope to </v>
<v Speaker 1>get from these images. In computer </v>
55
00:04:39.241 --> 00:04:41.160
<v Speaker 1>vision,</v>
<v Speaker 1>deep learning is hard.</v>
56
00:04:42.180 --> 00:04:47.180
<v Speaker 1>I'll say this again.</v>
<v Speaker 1>The illumination variability is the </v>
57
00:04:47.180 --> 00:04:48.030
<v Speaker 1>biggest challenge,</v>
<v Speaker 1>or at least one of the,</v>
58
00:04:48.120 --> 00:04:53.120
<v Speaker 1>one of the biggest challenges in driving</v>
<v Speaker 1>for visible light cameras. Pose </v>
59
00:04:55.351 --> 00:04:58.110
<v Speaker 1>variability of</v>
<v Speaker 1>the objects,</v>
60
00:04:59.010 --> 00:05:04.010
<v Speaker 1>as I'll also discuss, with some of the </v>
<v Speaker 1>advances from Geoff Hinton and the </v>
61
00:05:04.010 --> 00:05:07.521
<v Speaker 1>capsule networks.</v>
<v Speaker 1>The idea is that neural networks as they </v>
62
00:05:07.521 --> 00:05:12.341
<v Speaker 1>are currently used for computer vision </v>
<v Speaker 1>are not good at representing variable </v>
63
00:05:12.571 --> 00:05:17.571
<v Speaker 1>pose.</v>
<v Speaker 1>These objects in images, in this </v>
64
00:05:17.571 --> 00:05:21.891
<v Speaker 1>2D plane</v>
<v Speaker 1>of color and texture, look very </v>
65
00:05:21.891 --> 00:05:25.641
<v Speaker 1>different numerically when the object is</v>
<v Speaker 1>rotated or the object is mangled and </v>
66
00:05:27.681 --> 00:05:32.681
<v Speaker 1>shaped in different ways.</v>
<v Speaker 1>Deformable, truncated cats. </v>
67
00:05:32.681 --> 00:05:36.690
<v Speaker 1>Intra-class variability.</v>
<v Speaker 1>The classification task,</v>
68
00:05:36.691 --> 00:05:41.691
<v Speaker 1>which will be the example used today </v>
<v Speaker 1>throughout to introduce some of the </v>
69
00:05:41.691 --> 00:05:46.011
<v Speaker 1>networks over the past decade that have </v>
<v Speaker 1>achieved success, and some of the </v>
70
00:05:46.011 --> 00:05:47.370
<v Speaker 1>intuition and insight that made those </v>
<v Speaker 1>networks work.</v>
71
00:05:47.670 --> 00:05:52.110
<v Speaker 1>Classification,</v>
<v Speaker 1>there is a lot of variability inside the</v>
72
00:05:52.111 --> 00:05:55.470
<v Speaker 1>classes and very little variability </v>
<v Speaker 1>between the classes.</v>
73
00:05:57.070 --> 00:05:58.350
<v Speaker 1>All of these cats</v>
74
00:05:58.390 --> 00:06:00.610
<v Speaker 2>at the top,</v>
<v Speaker 2>all of those are dogs at the bottom.</v>
75
00:06:01.060 --> 00:06:06.060
<v Speaker 2>They look very different and the other,</v>
<v Speaker 2>I would say the second biggest problem </v>
76
00:06:06.060 --> 00:06:08.760
<v Speaker 2>in driving perception,</v>
<v Speaker 2>visible light camera perception:</v>
77
00:06:08.810 --> 00:06:13.810
<v Speaker 2>occlusion, when part of the object is </v>
<v Speaker 2>occluded due to the three dimensional</v>
78
00:06:15.220 --> 00:06:20.220
<v Speaker 1>nature of our world,</v>
<v Speaker 1>some objects are in front of others and they</v>
79
00:06:20.441 --> 00:06:25.441
<v Speaker 1>occlude the background object.</v>
<v Speaker 1>And yet we're still tasked with </v>
80
00:06:25.441 --> 00:06:28.720
<v Speaker 1>identifying the object when only part of</v>
<v Speaker 1>it is visible.</v>
81
00:06:29.200 --> 00:06:34.200
<v Speaker 1>And sometimes that part, like these </v>
<v Speaker 1>cats, is barely visible </v>
82
00:06:34.690 --> 00:06:37.510
<v Speaker 1>here.</v>
<v Speaker 1>We're tasked with classifying a cat with</v>
83
00:06:37.511 --> 00:06:42.511
<v Speaker 1>just the ears visible,</v>
<v Speaker 1>just the leg. And on a philosophical </v>
84
00:06:46.121 --> 00:06:50.110
<v Speaker 1>level as we'll talk about the motivation</v>
<v Speaker 1>for our competition here.</v>
85
00:06:50.530 --> 00:06:51.550
<v Speaker 1>Here's a</v>
<v Speaker 1>cat</v>
86
00:06:51.620 --> 00:06:53.380
<v Speaker 1>dressed</v>
<v Speaker 1>as a monk,</v>
87
00:06:53.381 --> 00:06:56.920
<v Speaker 1>eating a banana. On a philosophical </v>
<v Speaker 1>level,</v>
88
00:06:58.240 --> 00:07:00.520
<v Speaker 1>Most of us,</v>
<v Speaker 1>uh,</v>
89
00:07:00.940 --> 00:07:05.140
<v Speaker 1>understand what's going on in the scene.</v>
<v Speaker 1>In fact,</v>
90
00:07:05.280 --> 00:07:10.280
<v Speaker 1>a neural network today can successfully </v>
<v Speaker 1>classify this</v>
91
00:07:12.460 --> 00:07:16.930
<v Speaker 1>image,</v>
<v Speaker 1>this video, as a cat,</v>
92
00:07:18.010 --> 00:07:21.820
<v Speaker 1>but the context,</v>
<v Speaker 1>the humor of the situation,</v>
93
00:07:21.821 --> 00:07:26.680
<v Speaker 1>and in fact you could argue it's a </v>
<v Speaker 1>monkey, is missing.</v>
94
00:07:27.250 --> 00:07:30.640
<v Speaker 1>And what else is missing is the dynamic </v>
<v Speaker 1>information,</v>
95
00:07:30.820 --> 00:07:32.530
<v Speaker 1>the temporal dynamics of the scene.</v>
96
00:07:34.990 --> 00:07:39.990
<v Speaker 1>That's what's missing in a lot of the </v>
<v Speaker 1>perception work that has been done to </v>
97
00:07:39.990 --> 00:07:42.460
<v Speaker 1>date in the autonomous vehicle space,</v>
<v Speaker 1>uh,</v>
98
00:07:42.670 --> 00:07:47.020
<v Speaker 1>in terms of visible light cameras and </v>
<v Speaker 1>we're looking to expand on that.</v>
99
00:07:47.470 --> 00:07:49.600
<v Speaker 1>That's what SegFuse</v>
<v Speaker 1>is all about.</v>
100
00:07:50.380 --> 00:07:54.550
<v Speaker 1>Image classification pipeline.</v>
<v Speaker 1>There are bins with different categories</v>
101
00:07:54.551 --> 00:07:56.770
<v Speaker 1>inside each class.</v>
<v Speaker 1>Cat,</v>
102
00:07:56.771 --> 00:07:58.120
<v Speaker 1>dog, mug,</v>
<v Speaker 1>hat,</v>
103
00:07:58.840 --> 00:08:03.840
<v Speaker 1>those bins.</v>
<v Speaker 1>There's a lot of examples of each and </v>
104
00:08:03.840 --> 00:08:07.141
<v Speaker 1>you're tasked, when a new example comes </v>
<v Speaker 1>along that you've never seen before, to put </v>
105
00:08:07.141 --> 00:08:11.161
<v Speaker 1>that image in a bin.</v>
<v Speaker 1>It's the same as the machine learning </v>
106
00:08:11.161 --> 00:08:15.091
<v Speaker 1>task before and everything relies on the</v>
<v Speaker 1>data that serves as ground truth,</v>
107
00:08:16.480 --> 00:08:21.480
<v Speaker 1>that has been labeled by human beings.</v>
<v Speaker 1>MNIST is a toy data set of handwritten</v>
108
00:08:22.691 --> 00:08:27.130
<v Speaker 1>digits,</v>
<v Speaker 1>often used as an example, and COCO, CIFAR,</v>
109
00:08:27.160 --> 00:08:30.550
<v Speaker 1>ImageNet, Places,</v>
<v Speaker 1>and a lot of other incredible datasets.</v>
110
00:08:30.580 --> 00:08:35.580
<v Speaker 1>Rich data sets of hundreds of thousands,</v>
<v Speaker 1>millions of images out there that represent </v>
111
00:08:35.721 --> 00:08:39.310
<v Speaker 1>scenes,</v>
<v Speaker 1>people's faces and different objects.</v>
112
00:08:39.670 --> 00:08:44.670
<v Speaker 1>Those are all ground truth data for </v>
<v Speaker 1>testing algorithms and for competing </v>
113
00:08:46.271 --> 00:08:48.940
<v Speaker 1>architectures to be evaluated against </v>
<v Speaker 1>each other.</v>
114
00:08:49.720 --> 00:08:51.880
<v Speaker 1>CIFAR-10,</v>
<v Speaker 1>one of the simplest,</v>
115
00:08:52.780 --> 00:08:57.380
<v Speaker 1>almost toy datasets, of tiny images with 10 </v>
<v Speaker 1>categories: airplane,</v>
116
00:08:57.390 --> 00:08:58.830
<v Speaker 1>automobile,</v>
<v Speaker 1>Bird,</v>
117
00:08:58.831 --> 00:08:59.430
<v Speaker 1>cat,</v>
<v Speaker 1>deer,</v>
118
00:08:59.431 --> 00:09:00.150
<v Speaker 1>dog,</v>
<v Speaker 1>frog,</v>
119
00:09:00.151 --> 00:09:05.151
<v Speaker 1>horse,</v>
<v Speaker 1>ship, and truck. It is commonly used to </v>
120
00:09:05.151 --> 00:09:08.031
<v Speaker 1>explore</v>
<v Speaker 1>some of the basic convolutional neural </v>
121
00:09:08.031 --> 00:09:08.190
<v Speaker 1>networks we'll discuss.</v>
<v Speaker 1>So let's come up with a very trivial</v>
122
00:09:08.191 --> 00:09:11.880
<v Speaker 1>classifier to explain the concept of </v>
<v Speaker 1>how we could go about it.</v>
123
00:09:12.600 --> 00:09:17.600
<v Speaker 1>In fact,</v>
<v Speaker 1>this is maybe if you start to think </v>
124
00:09:17.600 --> 00:09:20.121
<v Speaker 1>about how to classify an image.</v>
<v Speaker 1>If you don't know any of these </v>
125
00:09:20.121 --> 00:09:23.241
<v Speaker 1>techniques,</v>
<v Speaker 1>this is perhaps the approach you would </v>
126
00:09:23.241 --> 00:09:25.641
<v Speaker 1>take: you would subtract images.</v>
<v Speaker 1>So in order to know that an image of a </v>
127
00:09:25.771 --> 00:09:30.771
<v Speaker 1>cat is different than image of a dog,</v>
<v Speaker 1>you have to compare them. When given </v>
128
00:09:30.771 --> 00:09:30.860
<v Speaker 1>those two images,</v>
<v Speaker 1>what?</v>
129
00:09:30.890 --> 00:09:33.090
<v Speaker 1>What's the way you compare </v>
<v Speaker 1>them?</v>
130
00:09:33.900 --> 00:09:38.900
<v Speaker 1>One way you could do it is you just </v>
<v Speaker 1>subtract them and then sum all the pixel-</v>
131
00:09:38.900 --> 00:09:42.840
<v Speaker 1>wise differences in the image.</v>
<v Speaker 1>Just subtract the intensity of the image</v>
132
00:09:42.870 --> 00:09:46.530
<v Speaker 1>pixel by pixel.</v>
<v Speaker 1>Sum it up, and</v>
133
00:09:46.560 --> 00:09:51.560
<v Speaker 1>if that difference is really high,</v>
<v Speaker 1>that means the images are very </v>
134
00:09:51.560 --> 00:09:51.560
<v Speaker 1>different.</v>
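The pixel-wise subtraction just described can be sketched in a few lines. This is a minimal illustration of the idea, not code from the lecture; the tiny arrays are made-up values:

```python
import numpy as np

def image_difference(a, b):
    """Sum of absolute pixel-wise intensity differences (L1 distance).

    A large value means the two images are very different."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

# Two tiny 2x2 grayscale "images" (made-up values for illustration)
img_a = np.array([[10, 20], [30, 40]], dtype=np.uint8)
img_b = np.array([[12, 18], [35, 40]], dtype=np.uint8)
print(image_difference(img_a, img_b))  # 2 + 2 + 5 + 0 = 9
```

Casting to a signed integer type before subtracting avoids the wrap-around that unsigned pixel arithmetic would cause.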
135
00:09:51.560 --> 00:09:56.180
<v Speaker 1>Using that metric,</v>
<v Speaker 1>we can look at CIFAR-10 and use it as a </v>
136
00:09:56.180 --> 00:10:00.120
<v Speaker 1>classifier saying,</v>
<v Speaker 1>based on this difference function,</v>
137
00:10:00.390 --> 00:10:05.390
<v Speaker 1>I'm going to find one of the 10 bins for</v>
<v Speaker 1>a new image, the bin</v>
138
00:10:07.240 --> 00:10:12.240
<v Speaker 1>that has the lowest difference.</v>
<v Speaker 1>Find an image in this dataset that is </v>
139
00:10:13.511 --> 00:10:16.540
<v Speaker 1>most like the image I have and put it in</v>
<v Speaker 1>the same bin.</v>
140
00:10:16.541 --> 00:10:21.520
<v Speaker 1>that that image is in.</v>
<v Speaker 1>So there's 10 classes.</v>
141
00:10:21.521 --> 00:10:26.521
<v Speaker 1>If we just flip a coin,</v>
<v Speaker 1>the accuracy of our classifier will be </v>
142
00:10:26.521 --> 00:10:28.420
<v Speaker 1>10 percent.</v>
<v Speaker 1>Using our image difference classifier,</v>
143
00:10:28.421 --> 00:10:33.421
<v Speaker 1>we can actually do pretty well,</v>
<v Speaker 1>much better than random, better than </v>
144
00:10:33.421 --> 00:10:34.780
<v Speaker 1>10 percent.</v>
<v Speaker 1>We can do 35,</v>
145
00:10:34.781 --> 00:10:39.781
<v Speaker 1>38 percent accuracy.</v>
<v Speaker 1>That's our very first </v>
146
00:10:40.750 --> 00:10:45.750
<v Speaker 1>classifier,</v>
<v Speaker 1>k-nearest neighbors.</v>
147
00:10:46.530 --> 00:10:51.530
<v Speaker 1>Let's take our classifier to a whole new</v>
<v Speaker 1>level. Instead of just</v>
148
00:10:51.960 --> 00:10:56.960
<v Speaker 1>trying</v>
<v Speaker 1>to find one image that's the </v>
149
00:10:56.960 --> 00:10:59.511
<v Speaker 1>closest in our dataset,</v>
<v Speaker 1>we try to find the k closest and ask: what </v>
150
00:10:59.791 --> 00:11:02.820
<v Speaker 1>class do the majority of them </v>
<v Speaker 1>belong to?</v>
151
00:11:03.330 --> 00:11:06.210
<v Speaker 1>And we take that k and increase it from </v>
<v Speaker 1>one to two,</v>
152
00:11:06.211 --> 00:11:07.560
<v Speaker 1>to three,</v>
<v Speaker 1>to four to five,</v>
153
00:11:08.790 --> 00:11:13.790
<v Speaker 1>and see how that changes things. </v>
<v Speaker 1>With seven nearest neighbors,</v>
154
00:11:14.541 --> 00:11:17.450
<v Speaker 1>which is optimal under this approach</v>
<v Speaker 1>for CIFAR-10,</v>
155
00:11:20.610 --> 00:11:25.610
<v Speaker 1>we achieved 30 percent accuracy.</v>
<v Speaker 1>Human level is 95 percent accuracy and </v>
156
00:11:28.390 --> 00:11:31.760
<v Speaker 1>convolutional neural networks will get </v>
<v Speaker 1>very close to a 100 percent.</v>
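The k-nearest-neighbor scheme described above, built on the same pixel-wise L1 difference, can be sketched like this. It is a toy illustration with made-up data, not CIFAR-10 itself:

```python
import numpy as np
from collections import Counter

def knn_classify(train_images, train_labels, query, k=7):
    """Find the k training images with the smallest L1 pixel-wise
    difference to `query`, then take a majority vote on their labels."""
    dists = [np.abs(x.astype(np.int64) - query.astype(np.int64)).sum()
             for x in train_images]
    nearest = np.argsort(dists)[:k]  # indices of the k closest images
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: "dark" images labeled 0, "bright" images labeled 1
train = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 5, 10, 200, 210, 220)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_classify(train, labels, np.full((4, 4), 8, dtype=np.uint8), k=3))  # 0
```

With k=1 this reduces to the single-nearest-image classifier described first; increasing k smooths out noisy neighbors by voting.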
157
00:11:34.260 --> 00:11:39.260
<v Speaker 1>That's where neural </v>
<v Speaker 1>networks shine: on this very task of </v>
158
00:11:41.691 --> 00:11:46.691
<v Speaker 1>binning images.</v>
<v Speaker 1>It all starts with this basic </v>
159
00:11:46.691 --> 00:11:49.490
<v Speaker 1>computational unit: signals come in, each of the</v>
<v Speaker 1>signals is weighted, summed,</v>
160
00:11:51.980 --> 00:11:53.150
<v Speaker 1>a bias is added,</v>
161
00:11:55.140 --> 00:12:00.140
<v Speaker 1>and the result is input into a nonlinear </v>
<v Speaker 1>activation function that produces an </v>
162
00:12:00.140 --> 00:12:04.181
<v Speaker 1>output.</v>
<v Speaker 1>The nonlinear activation function is </v>
163
00:12:04.181 --> 00:12:07.811
<v Speaker 1>key.</v>
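The computational unit just described, a weighted sum plus a bias passed through a nonlinearity, can be written directly. A sketch only; the sigmoid here is one common choice of activation, assumed for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """One computational unit: the inputs are weighted and summed,
    a bias is added, and the result goes through a nonlinear
    activation function (here a sigmoid)."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

# With zero weights and zero bias, the sigmoid outputs exactly 0.5
print(neuron(np.array([1.0, 2.0]), np.array([0.0, 0.0]), 0.0))  # 0.5
```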
<v Speaker 1>All of these put together and more and </v>
164
00:12:07.971 --> 00:12:12.560
<v Speaker 1>more hidden layers form a deep neural </v>
<v Speaker 1>network,</v>
165
00:12:12.650 --> 00:12:17.650
<v Speaker 1>and that deep neural network is trained </v>
<v Speaker 1>as we've discussed by taking a forward </v>
166
00:12:17.661 --> 00:12:20.600
<v Speaker 1>pass on examples,</v>
<v Speaker 1>that have ground truth labels,</v>
167
00:12:20.690 --> 00:12:24.050
<v Speaker 1>seeing how close those outputs are to </v>
<v Speaker 1>the real ground truth,</v>
168
00:12:24.350 --> 00:12:29.350
<v Speaker 1>and then punishing the weights that </v>
<v Speaker 1>resulted in the incorrect decisions and </v>
169
00:12:29.871 --> 00:12:32.480
<v Speaker 1>rewarding the weights that result in </v>
<v Speaker 1>correct decisions.</v>
170
00:12:33.800 --> 00:12:38.800
<v Speaker 1>For the case of 10 examples,</v>
<v Speaker 1>the output of the network is 10 </v>
171
00:12:40.041 --> 00:12:45.041
<v Speaker 1>different values.</v>
<v Speaker 1>The input being handwritten digits from </v>
172
00:12:46.651 --> 00:12:51.651
<v Speaker 1>zero to nine,</v>
<v Speaker 1>10 of those, and we want our network to </v>
173
00:12:52.441 --> 00:12:57.441
<v Speaker 1>classify what is in this image of a </v>
<v Speaker 1>handwritten digit: is it a zero,</v>
174
00:12:58.201 --> 00:12:58.620
<v Speaker 1>one,</v>
<v Speaker 1>two,</v>
175
00:12:58.621 --> 00:13:03.621
<v Speaker 1>three through nine.</v>
<v Speaker 1>The way it's often done is there's 10 </v>
176
00:13:03.811 --> 00:13:08.811
<v Speaker 1>outputs of the network and each of the </v>
<v Speaker 1>neurons in the output is responsible </v>
177
00:13:12.061 --> 00:13:17.061
<v Speaker 1>for getting really excited when its </v>
<v Speaker 1>number is called and everybody else is </v>
178
00:13:18.811 --> 00:13:23.811
<v Speaker 1>supposed to be not excited.</v>
<v Speaker 1>Therefore the number of classes is the </v>
179
00:13:24.301 --> 00:13:29.301
<v Speaker 1>number of outputs.</v>
<v Speaker 1>That's how it's commonly done and you </v>
180
00:13:29.301 --> 00:13:32.460
<v Speaker 1>assign a class to the input image based </v>
<v Speaker 1>on the highest,</v>
181
00:13:32.760 --> 00:13:35.250
<v Speaker 1>the neuron which produces the highest </v>
<v Speaker 1>output,</v>
182
00:13:36.870 --> 00:13:40.530
<v Speaker 1>but that's for a fully connected network</v>
<v Speaker 1>that we've discussed on Monday.</v>
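The one-output-neuron-per-class scheme just described amounts to an argmax over the output values. A minimal sketch, with made-up output numbers for illustration:

```python
import numpy as np

# Hypothetical raw outputs of a 10-way digit classifier,
# one value per digit class 0..9 (made-up numbers)
outputs = np.array([0.1, 0.3, 2.5, 0.0, -1.2, 0.4, 0.2, 0.1, 0.9, 0.0])

# The predicted digit is the index of the most "excited" neuron
predicted_digit = int(np.argmax(outputs))
print(predicted_digit)  # 2
```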
183
00:13:42.320 --> 00:13:47.320
<v Speaker 1>There is in deep learning a lot of </v>
<v Speaker 1>tricks that make things work that make </v>
184
00:13:47.721 --> 00:13:52.721
<v Speaker 1>training much more efficient on large </v>
<v Speaker 1>class problems where there's a lot of </v>
185
00:13:54.051 --> 00:13:59.051
<v Speaker 1>classes on large data sets.</v>
<v Speaker 1>When the representation that the neural </v>
186
00:13:59.051 --> 00:14:03.581
<v Speaker 1>network is tasked with learning is </v>
<v Speaker 1>extremely complex and that's where </v>
187
00:14:03.581 --> 00:14:05.090
<v Speaker 1>convolutional neural networks step in </v>
<v Speaker 1>with that trick.</v>
188
00:14:05.091 --> 00:14:10.091
<v Speaker 1>They use spatial invariance.</v>
<v Speaker 1>They use the idea that a cat in the top </v>
189
00:14:12.261 --> 00:14:17.261
<v Speaker 1>left corner of an image is the same as a</v>
<v Speaker 1>cat in the bottom right corner of an </v>
190
00:14:17.261 --> 00:14:20.330
<v Speaker 1>image,</v>
<v Speaker 1>so we can learn the same features across</v>
191
00:14:20.331 --> 00:14:25.331
<v Speaker 1>the image.</v>
<v Speaker 1>That's where the convolution operation </v>
192
00:14:25.331 --> 00:14:29.850
<v Speaker 1>steps in.</v>
<v Speaker 1>Instead of the fully connected networks </v>
193
00:14:29.850 --> 00:14:32.840
<v Speaker 1>here,</v>
<v Speaker 1>there's a third dimension of depth,</v>
194
00:14:33.530 --> 00:14:38.530
<v Speaker 1>so the blocks in this neural network </v>
<v Speaker 1>that take 3D volumes as input and </v>
195
00:14:39.131 --> 00:14:41.180
<v Speaker 1>as output produce 3D volumes.</v>
196
00:14:46.890 --> 00:14:51.890
<v Speaker 1>They take a slice of the image,</v>
<v Speaker 1>a window, and slide it across, applying the </v>
197
00:14:53.491 --> 00:14:56.030
<v Speaker 1>same exact weights and we'll go through </v>
<v Speaker 1>an example,</v>
198
00:14:56.330 --> 00:15:01.330
<v Speaker 1>the same exact weights as in the fully </v>
<v Speaker 1>connected network on the edges that are </v>
199
00:15:01.330 --> 00:15:06.251
<v Speaker 1>used to map the input to the output.</v>
<v Speaker 1>Here they are used to map the slice of an </v>
200
00:15:08.001 --> 00:15:10.880
<v Speaker 1>image,</v>
<v Speaker 1>this window of an image to the output,</v>
201
00:15:12.350 --> 00:15:17.350
<v Speaker 1>and you can make several,</v>
<v Speaker 1>many of such convolutional filters,</v>
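The sliding-window idea described above, one small set of shared weights applied at every image location, can be sketched as a naive 2D convolution (stride 1, no padding). A toy illustration only, ignoring the depth dimension and multiple filters:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared weight window across the image; the same
    weights are reused at every location, which is the spatial
    invariance trick described above."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the window under the kernel at (i, j)
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A 2x2 all-ones filter over a 3x3 all-ones image: every window sums to 4
print(conv2d(np.ones((3, 3)), np.ones((2, 2))))
```

Each additional filter would produce another 2D output slice, which is how the 3D output volumes mentioned above arise.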