Speaker 1: 00:00:01 Thank you everyone for braving the cold and the snow to be here. This is 6.S094: Deep Learning for Self-Driving Cars. It's a course where we cover the topics of deep learning, a set of techniques that have taken a leap in the last decade in our understanding of what artificial intelligence systems are capable of doing, and self-driving cars, systems that can take these techniques and integrate them in a meaningful, profound way into our daily lives, in a way that transforms society. That's why both of these topics are extremely important and extremely exciting. My name is Lex Fridman, and I'm joined by an amazing team of engineers: Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer, Li Ding, Spencer Dodd, and Benedikt Jenik, among many others. We build autonomous vehicles here at MIT, not just ones that perceive and move about the environment, but ones that interact, communicate, and earn the trust and understanding of the human beings inside the car, the drivers and the passengers, and the human beings outside the car, the pedestrians, other drivers, and cyclists.
Speaker 1: 00:01:39 The website for this course is selfdrivingcars.mit.edu. If you have questions, email deepcars@mit.edu. The Slack channel is deep-mit. For registered MIT students: you have to register on the website and, by midnight, Friday, January 19th, build your own network and submit it to the competition, achieving a speed of 65 miles per hour in the new DeepTraffic 2.0. It's much harder and much more interesting than last year's, for those of you who participated. There are three competitions in this class: DeepTraffic, SegFuse, and DeepCrash. There are guest speakers who come from Waymo, Google, Tesla, and those who are starting new autonomous vehicle startups: Voyage, nuTonomy, and Aurora, in the news a lot today from CES. And we have shirts for those of you who braved the snow and continue to do so; towards the end of the class there will be free shirts. Yes, I said free. You should be here. Okay. First, the DeepTraffic competition. There are a lot of updates, and we'll cover those on Wednesday. It's a deep reinforcement learning competition. Last year we received over 18,000 submissions. This year we're going to go bigger.
Speaker 1: 00:03:23 Not only can you control one car, as you could before; you can now control up to 10. This is multi-agent deep reinforcement learning. This is super cool. Second, SegFuse, a dynamic driving scene segmentation competition where you're given the raw video,
Speaker 1: 00:03:44 the kinematics, the movement, of the vehicle, and the state-of-the-art segmentation. For the training set, you're given ground-truth labels: pixel-level labels, scene segmentation, and optical flow. With those pieces of data, your task is to try to perform better than the state of the art in image-based segmentation. Why is this a critical, fascinating, open research problem? Because robots that act in this world, in physical space, must not only use these deep learning methods to interpret the spatial, visual characteristics of a scene; they must also interpret, understand, and track the temporal dynamics of the scene. This competition is about temporal propagation of information, not just scene segmentation. You must understand both space and time. And finally, DeepCrash, where we use deep reinforcement learning. We slam cars, thousands of times, here at MIT, at the gym. You're given data on a thousand runs of a car that, knowing nothing, is using a monocular camera as its single input, driving over 30 miles an hour through a scene in which it has very little control and very little capability to localize itself. It must act very quickly in that scene. You're given a thousand runs to learn anything.
Speaker 1: 00:05:21 We'll discuss this in the coming weeks. This competition will result in submissions that we evaluate in simulation for everyone, but the top submissions we put head to head at the gym, and until there is a winner declared, we'll keep slamming cars at 30 miles an hour. That's DeepCrash. Also on the website, from last year, and on GitHub, there's DeepTesla, which uses the large-scale naturalistic driving dataset we have to train a neural network to do end-to-end steering: it takes in monocular video from the forward roadway and produces steering commands for the car.
Speaker 1: 00:06:07 Lectures: today we'll talk about deep learning; tomorrow we'll talk about autonomous vehicles; deep reinforcement learning is on Wednesday; driving scene understanding, so segmentation, that's Thursday. On Friday we have Sacha Arnoud, the director of engineering at Waymo. Waymo is one of the companies that's truly taking huge strides in fully autonomous vehicles; they're taking the fully L4, L5 autonomous vehicle approach, and since he's also the head of perception for them, it's fascinating to learn from him what kind of problems they're facing and what kind of approach they're taking. We have Emilio Frazzoli, who one of last year's speakers, Sertac Karaman, said is the smartest person he knows. Emilio is the CTO of nuTonomy, an autonomous vehicle company that was just acquired by Delphi for a large sum of money, and they're doing a lot of incredible work in Singapore and here in Boston.
Speaker 1: 00:07:10 Next Wednesday we are going to talk about the topic of our research, and my personal fascination: deep learning for driver state sensing, understanding the human, perceiving everything about the human being inside the car and outside the car. One talk I'm really excited about is Oliver Cameron's, on Thursday. He is now the CEO of the autonomous vehicle startup Voyage; he was previously the director of the self-driving car program at Udacity. He will talk about how to start a self-driving car company, for those of you MIT folks and entrepreneurs who want to start one yourself; he'll tell you exactly how. It's super cool. And then Sterling Anderson, who was previously the director of the Tesla Autopilot team and is now the cofounder of Aurora, the self-driving car startup that I mentioned, which has now partnered with NVIDIA and many others. So, why self-driving cars? This class is about applying data-driven learning methods to the problem of autonomous vehicles. Why are self-driving cars a fascinating and interesting problem space? Quite possibly, in my opinion, this is the first wide-reaching and profound integration of personal robots into society. Wide-reaching because there are 1 billion cars on the road; even a fraction of that will change
Speaker 1: 00:08:47 the face of transportation and how we move about the world. Profound, and this is an important point that is not always understood, because there's an intimate connection between a human and a vehicle when there's a direct transfer of control: a transfer of control that puts his or her life into the hands of an artificial intelligence system. I'll show a few quick clips here; you can google "first time with Tesla Autopilot" on YouTube and watch people perform that transfer of control. There's something magical about a human and a robot working together that will transform what artificial intelligence is in the 21st century, and this particular autonomous system, this AI system, self-driving cars, is at a scale, and has a life-critical nature, that will truly test the capabilities of AI. There is a personal connection, and I will argue throughout these lectures that we cannot escape considering the human being: the autonomous vehicle must not only perceive and control its movement through the environment, it must also perceive everything about the human driver and the passenger, and interact, communicate, and build trust with that driver.
Speaker 1: 00:10:20 Because in my view, as I will argue throughout this course, an autonomous vehicle is more of a personal robot than it is a perfect perception-control system. Perfect perception and control in this world full of humans is extremely difficult, and full autonomy could be two, three, four decades away. Autonomous vehicles are going to be flawed; they're going to have flaws, and we'll have to design systems that effectively transfer control to human beings when they can't handle the situation. That transfer of control is a fascinating opportunity for AI, because perception of obstacles and obstacle avoidance is the easy problem; it's the safe problem. Going 30 miles an hour and navigating through the streets of Boston is easy. It's when you have to get to work and you're late, or you're sick of the person in front of you, that you want to go into the opposing lane and speed up.
Speaker 1: 00:11:41 That's human nature, and we can't escape it. Our artificial intelligence systems can't escape human nature; they must work with it. What's shown here is one of the algorithms we'll talk about next week, for cognitive load, where we take raw 3D convolutional neural networks over the eye region, the blinking, and the pupil movement to determine the cognitive load of the driver. We'll see how we can detect everything about the driver: where they're looking, emotion, cognitive load, body pose estimation, drowsiness. The movement towards full autonomy is so difficult that I would argue it almost requires human-level intelligence. The, as I said, two-, three-, four-decade-out journey for artificial intelligence researchers to achieve full autonomy will require solving some of the fundamental problems of creating intelligence, and that's something we'll discuss in much more depth, and with a broader view, in two weeks, in the Artificial General Intelligence course, where we have Andrej Karpathy from Tesla, Ray Kurzweil, and Marc Raibert from Boston Dynamics, who asked for the dimensions of the room because he's bringing robots. Nothing else was told to me; it'll be a surprise. So that is why I argue for the human-centered artificial intelligence approach, where every algorithm design considers the human.
Speaker 1: 00:13:26 For the autonomous vehicle, on the left, the perception, scene understanding, and control problem, as we'll explore through the competitions and assignments of this course, can handle 90 percent of the cases, and an increasing percentage, but it's the 10, the 1, the 0.1 percent of cases, as we get better and better, that these methods are not able to handle. And that's where perceiving the human is really important. This is the video from last year of the Arc de Triomphe. (Thank you; I didn't know what it was called last year. I know now.) That's one of millions of cases where human-to-human interaction is the dominant driver, not the basic perception-control problem. So why deep learning in this space? Because deep learning is a set of methods that do well with a lot of data, and to solve these problems where human life is at stake, we have to have techniques that learn from data, learn from real-world data. This is the fundamental reality of artificial intelligence systems that operate in the real world: they must learn from real-world data, whether that's, on the left, the perception and control side,
Speaker 1: 00:14:54 or, on the right, for the human: the perception, the communication, the interaction and collaboration with the human, the human-robot interaction. Okay, so what is deep learning?
Speaker 1: 00:15:13 It's a set of techniques. If you allow me the definition of intelligence as the ability to accomplish complex goals, then I would argue a definition of understanding, maybe of reasoning, is the ability to turn complex information into simple, useful, actionable information, and that is what deep learning does. Deep learning is representation learning, or feature learning, if you will. It's able to take raw, complicated information that's hard to do anything with and construct hierarchical representations of that information, to be able to do something interesting with it. It is the branch of artificial intelligence that is most capable of, and focused on, this task of forming representations from data, whether supervised or unsupervised, whether with the help of humans or not. It's able to find structure in the data such that you can extract simple, useful, actionable information. On the left, from Goodfellow's book, is the classic example of image classification: the input is the image on the bottom, with the raw pixels, and as we go up the layers, higher and higher order representations are formed, from edges to contours to corners to object parts, and then finally the full object, the semantic classification of what's in the image. This is representation learning. A favorite example for me
Speaker 1: 00:16:57 is one from four centuries ago: our place in the universe, and representing that place in the universe, whether it's relative to Earth or relative to the Sun. On the left is our current belief; on the right is the one that was held widely four centuries ago. Representation matters, because what's on the right is much more complicated than what's on the left.
Speaker 1: 00:17:34 You can think of a simple case here, where the task is to draw a line that separates green triangles and blue circles. In the Cartesian coordinate space on the left, the task is much more difficult, impossible to do well; on the right, in polar coordinates, it's trivial. This transformation is exactly what we need to learn. This is representation learning. You can take the same kind of task: having to draw a line that separates the blue curve and the red curve. On the left, if we draw a straight line, there is no way to do it with zero error, with 100 percent accuracy; shown on the right is our best attempt. But what we can do with deep learning, with a single-hidden-layer network, as done here, is deform the topology, the mapping of the space, in such a way, in the middle, that a straight line can be drawn that separates the blue curve and the red curve. The learning of the function in the middle is what we're able to achieve with deep learning. It's taking raw, complicated information and making it simple, actionable, useful. And the point is that this kind of ability to learn from raw sensory information means that we can do a lot more with a lot more data, so deep learning gets better with more data.
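A minimal sketch of this idea, not from the lecture: a single-hidden-layer network learns a warped representation in which two classes that no straight line can separate in the raw input space become separable. The dataset and layer sizes here are illustrative stand-ins, assuming scikit-learn is available.

```python
# Representation learning in miniature: one hidden layer makes
# concentric classes linearly separable.
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# A linear model cannot separate concentric circles, but a single
# hidden layer deforms the space so a straight line suffices.
net = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    max_iter=5000, random_state=0)
net.fit(X, y)
print('training accuracy:', net.score(X, y))  # close to 1.0
```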
Speaker 1: 00:19:09 And that's important for real-world applications,
Speaker 1: 00:19:14 where edge cases are everything. This is us driving two perception-control systems. One is a Tesla vehicle with the Autopilot version one system, which is using a monocular camera to perceive the external environment and produce control decisions, and the other is our own neural network running on [inaudible], taking in the same monocular camera feed and producing control decisions. The two systems argue, and when they disagree, they raise a flag to say that this is an edge case that needs human intervention. Covering such edge cases using machine learning is the main problem of artificial intelligence when applied to the real world; it is the main problem to solve. Okay, so what are neural networks? They're inspired, very loosely, and I'll discuss the key difference here between our own brains and artificial brains, because there's a lot of insight in that difference, but inspired loosely, by biological neural networks. Here is a simulation of a thalamocortical brain network, which is only 3 million neurons and 476 million synapses. The full human brain is a lot more than that: 100 billion neurons, 1,000 trillion synapses. There's inspirational music with this one that I didn't realize was here; it should make you think "artificial neural networks". Yeah, let's just let it play. The human neural network is 100 billion neurons, 1,000 trillion synapses. One of the state-of-the-art neural networks is ResNet-152, which has 60 million synapses.
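A minimal sketch of the disagreement idea mentioned above, assuming each of two independently trained systems outputs a per-frame steering angle; the function name, threshold, and the sample values are hypothetical.

```python
import numpy as np

def flag_edge_cases(steer_a, steer_b, threshold_deg=5.0):
    """Flag frames where two independently trained steering models
    disagree by more than a threshold; disagreement suggests an
    edge case that may need human intervention."""
    steer_a, steer_b = np.asarray(steer_a), np.asarray(steer_b)
    return np.abs(steer_a - steer_b) > threshold_deg

# Hypothetical per-frame steering angles (degrees) from two systems:
system_a = [0.1, -0.2, 3.0, 0.0]
system_b = [0.2, -0.1, -9.5, 0.1]
print(flag_edge_cases(system_a, system_b))  # [False False  True False]
```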
Speaker 1: 00:21:17 That's a difference of about seven orders of magnitude: the human brain has 10 million times more synapses than artificial neural networks, plus or minus one order of magnitude depending on the network. So what's the difference between a biological neuron and an artificial neuron? The topology: the human brain has no layers; neural networks are stacked in layers, and they're fixed for the most part. There is chaos, very little structure, in the human brain in terms of how neurons are connected; they're connected, often, to 10,000-plus other neurons; the number of synapses that feed into an individual neuron is huge. They're asynchronous: the human brain works asynchronously, while artificial neural networks work synchronously. The learning algorithm: for artificial neural networks, the only one, the best one, is backpropagation, and we don't know how human brains learn. Processing speed: this is one of the only benefits we have with artificial neural networks; artificial neurons are faster, but they're also extremely power-inefficient.
Speaker 1: 00:22:50 And there is a division into stages of training and testing with artificial neural networks, whereas biological neural networks, as you're sitting here today, are always learning. The only profound similarity, the inspiring one, the captivating one, is that both are distributed computation at scale. There is an emergent aspect to neural networks, where the basic element of computation, a neuron, is extremely simple, but when connected together, beautiful, amazing, powerful approximators can be formed. A neural network is built up from these computational units: there's a set of input edges with weights on them, the input signal is multiplied by those weights, a bias is added, and then a nonlinear function determines whether the neuron gets activated or not, as visualized here. These neurons can be combined in a number of ways: they can form a feedforward neural network, or they can feed back into themselves, to have state, memory, in recurrent neural networks. The ones on the left are the ones that are most successful for most applications in computer vision; the ones on the right are very popular when temporal dynamics, time series of any kind, are involved. In fact, the ones on the right are much closer to the way our human brains work than the ones on the left, but that's also why they're really hard to train.
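A minimal sketch, not from the lecture, of the single-neuron computation just described: weighted inputs, plus a bias, through a nonlinear activation. The numbers are arbitrary illustrations.

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a weighted sum of inputs, plus a bias,
    passed through a nonlinear activation (here a sigmoid)."""
    z = np.dot(w, x) + b             # weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))  # nonlinear activation

x = np.array([0.5, -1.0, 2.0])  # input signal
w = np.array([0.8, 0.2, -0.4])  # weights on the incoming edges
b = 0.1                         # bias
print(neuron(x, w, b))          # activation in (0, 1)
```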
Speaker 1: 00:24:45 One beautiful aspect of this emergent power, from multiple neurons being connected together, is the universality property: with a single hidden layer, these networks can learn to approximate any function. That's an important property to be aware of, because the limits here are not in the power of the networks; the limits are in the methods by which we construct and train them.
Speaker 1: 00:25:21 What kinds of machine learning, of deep learning, are there? We can separate them into two categories: memorizers, the approaches that essentially memorize patterns in the data, and approaches that we can loosely say are beginning to reason, to generalize over the data with minimal human input. At the top, on the left, are the quote-unquote "teachers": how much human input, shown in blue, is needed to make the method successful. For supervised learning, which is where most of the deep learning success has come from, most of the data is annotated by human beings; the human is at the core of the success. Most of the data that's part of the training needs to be annotated by human beings, with some additional successes coming from augmentation methods that extend the data on which these networks are trained, and from the semi-supervised, reinforcement learning, and unsupervised methods that we'll talk about later in the course. That's where we hope the near-term successes are, and the unsupervised learning approaches are where the true excitement about the possibilities of artificial intelligence lies: being able to make sense of our world with minimal input from humans.
Speaker 1: 00:26:53 So we can think of two kinds of impact spaces for deep learning. One is special-purpose intelligence: taking a problem, formalizing it, collecting enough data on it, and being able to solve a particular case that provides value. Of particular interest here is a network that estimates apartment costs in the Boston area: you take the number of bedrooms, the square feet, and the neighborhood, and it provides as output the estimated cost. On the right is the actual data on apartment costs. We're actually standing in an area that runs over $3,000 for a studio apartment.
Speaker 1: 00:27:44 Some of you may be feeling that pain. And then there's general-purpose intelligence, or something that feels like it's approaching general-purpose intelligence, which is reinforcement learning and unsupervised learning. Here, from Andrej Karpathy, is "Pong from pixels": a system that takes in an 80-by-80-pixel image and, with no other information, is able to win at this game. No information except a sequence of images, raw sensory information, the same kind of information that human beings take in from visual, audio, and touch sensory data, the very low-level data, and it's able to learn to win. This is a very simplistic, artificially constructed world, but nevertheless a world where no feature engineering is performed: only raw sensory information is used to win, with very sparse, minimal human input. We'll talk about that on Wednesday, with deep reinforcement learning. But for now, we'll focus on supervised learning, where there is input data, there is a network, a learning system we're trying to train, and there's a correct output that's labeled by human beings. That's the general training process for a neural network: input data, labels, and the training of that network, that model, so that in the testing stage, on new input data it has never seen before, it's tasked with producing guesses and is evaluated based on them. For autonomous vehicles, that means being released, either in simulation or in the real world, to operate.
Speaker 1: 00:29:32 And how do neural networks learn? In the training stage, there's a forward pass: taking the input data and producing a prediction. Then, given that there's ground truth in the training stage, we can have a measure of error, based on a loss function, that then punishes the synapses, the connections, the parameters that were involved in making that wrong prediction,
Speaker 1: 00:30:07 and backpropagates the error through those weights. We'll discuss that in a little more detail in a bit. So what can we do with deep learning? You can do one-to-one mappings. Really, you can think of the input as being anything: a single number, a vector of numbers, a sequence of numbers, a sequence of vectors of numbers, anything you can think of, from images to video to audio to text, represented in this way, and the output can likewise be a single number, or images, video, text, audio. One-to-one mapping on the bottom, one-to-many, many-to-one, many-to-many, and many-to-many with different starting points for the data, asynchronous.
Speaker 1: 00:30:53 Some quick terms that will come up. Deep learning is the same as neural networks; it's really deep neural networks, large neural networks. It's a subset of machine learning that has been extremely successful in the past decade. Multilayer perceptron, deep neural network, recurrent neural network, long short-term memory network (LSTM), convolutional neural network, and deep belief network: all of these will come up through the slides, and there are specific operations, layers, within these networks, convolution, pooling, activation, and backpropagation, concepts that we'll discuss in this class. Activation functions: there are a lot of variants. In the left column is the activation function; the x-axis is the input, the y-axis is the output. The sigmoid function's output, if the font is too small to read, is not centered at zero. The tanh function is centered at zero, but it still suffers from vanishing gradients: when the input is very low or very high, the derivative of the function, as you can see in the right column, is very low, so learning is very slow.
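A minimal numerical sketch, not from the lecture, of the activation functions and the vanishing-gradient behavior just described:

```python
import numpy as np

def sigmoid(x):  # output in (0, 1), not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):     # output in (-1, 1), zero-centered
    return np.tanh(x)

def relu(x):     # no vanishing gradient for x > 0, not zero-centered
    return np.maximum(0.0, x)

# The derivatives show the vanishing-gradient problem: for large |x|,
# sigmoid and tanh gradients go to zero, so learning stalls.
x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x) * (1 - sigmoid(x)))  # ~0, 0.25, ~0
print(1 - tanh(x) ** 2)               # ~0, 1.0, ~0
print((x > 0).astype(float))          # ReLU gradient: 0 or 1
```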
Speaker 1: 00:32:11 ReLU is also not zero-centered, but it does not suffer from vanishing gradients. Backpropagation is the process of learning. It's the way we go from the error, computed by the loss function in the bottom right of the slide, taking the actual output of the network from the forward pass, subtracting it from the ground truth, squaring, dividing by two, and using that loss function to backpropagate, to construct a gradient, to propagate the error back to the weights that were responsible for making either a correct or an incorrect decision. So there's a forward pass, there's a backward pass, and a fraction of the gradient is subtracted from the weights. That's it. That process is modular, local to each individual neuron, which is why we're able to distribute it, to parallelize it across GPUs. So, learning for a neural network: these computational units are extremely simple, and extremely simple to correct when they make an error as part of a larger network that makes an error. All of that boils down to an optimization problem, where the objective function, the utility function, is the loss function, and the goal is to minimize it: we have to update the parameters, the weights on the synapses and the biases, to decrease that loss function.
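A minimal sketch, not from the lecture, of that loop for a single sigmoid neuron with the squared-error loss just described, L = (y - yhat)^2 / 2; the data and learning rate are arbitrary illustrations.

```python
import numpy as np

# Forward pass, loss, backward pass, weight update, repeated.
rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0
x, y = np.array([1.0, 2.0]), 1.0   # one training example
lr = 0.5                           # learning rate

for step in range(100):
    z = np.dot(w, x) + b
    yhat = 1.0 / (1.0 + np.exp(-z))      # forward pass
    loss = 0.5 * (y - yhat) ** 2         # loss function
    dz = (yhat - y) * yhat * (1 - yhat)  # chain rule through the sigmoid
    w -= lr * dz * x                     # dL/dw = dz * x
    b -= lr * dz                         # dL/db = dz
print(round(loss, 6))  # loss shrinks toward 0
```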
Speaker 1: 00:34:01 And that loss function is highly nonlinear. Depending on the activation functions, different properties, different issues arise. There are vanishing gradients, for sigmoid, where learning can be slow. There are dying ReLUs, where the derivative is exactly zero for inputs less than zero. There are solutions to this, like leaky ReLUs, and a bunch of details that you may discover when you try to win the DeepTraffic competition, but for the most part these are the main activation functions, and it's the choice of the neural network designer which one works best. There are saddle points; all the problems that arise in numerical nonlinear optimization come up here. It's hard to break symmetry, and stochastic gradient descent without any tricks can take a very long time to arrive at the minimum. One of the biggest problems in all of machine learning, and certainly in deep learning, is overfitting. You can think of the blue dots plotted here as the data to which we want to fit a curve.
Speaker 1: 00:35:22 We want to design a learning system that approximates the regression of this data. In green is a sine curve: simple, fits well. Then there's a ninth-degree polynomial, which fits even better in terms of the error, but it clearly overfits this data: if there's other data it has not seen yet that it has to fit, it's likely to produce a high error. It's overfitting the training set. This is a big problem for small datasets, and we have to fix it with regularization. Regularization is a set of methodologies that prevent overfitting: learning the training set too well, and then not being able to generalize to the testing stage. The main symptom of overfitting is that the error decreases on the training set but increases on the test set,
Speaker 1: 00:36:22 so there are a lot of techniques in traditional machine learning that deal with this, cross-validation and so on, but because of the cost of training neural networks, it's traditional to use what's called a validation set. You create a subset of the training data, kept away, for which you have the ground truth, and use that as a representative of the test set. You perform early stopping, or, more realistically, just save a checkpoint often to see how the performance on the validation set changes as training evolves, and you can stop when the performance on the validation set is getting a lot worse: it means you're overtraining on the training set.
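A minimal sketch of that early-stopping idea, not from the lecture; `train_step`, `val_error`, and `model.copy()` are hypothetical stand-ins for whatever your framework provides.

```python
# Early stopping with a held-out validation set: checkpoint whenever
# validation error improves; stop once it has not improved for
# `patience` evaluations in a row (a sign of overtraining).
def train_with_early_stopping(model, train_step, val_error,
                              max_steps=10000, patience=5):
    best_err, best_state, bad_evals = float('inf'), None, 0
    for step in range(max_steps):
        train_step(model)              # one pass over training data
        err = val_error(model)         # error on the validation set
        if err < best_err:
            best_err = err
            best_state = model.copy()  # hypothetical checkpoint call
            bad_evals = 0
        else:
            bad_evals += 1             # validation error got worse
            if bad_evals >= patience:  # overtraining: stop here
                break
    return best_state, best_err
```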
Speaker 1: 00:37:11 In practice, of course, we run training much longer and see which is the best-performing snapshot, checkpoint, of the network. Dropout is another very powerful regularization technique, where you randomly remove part of the network: randomly remove some of the nodes in the network, along with their incoming and outgoing edges. What that really looks like is a probability of keeping a node, and in many deep learning frameworks today it comes as a dropout layer. So it's essentially a probability, usually greater than 0.5, that a node will be kept; for the input layer, the probability should be much higher, or, more effectively, what works well is just adding noise. What's the point here? You want to create enough diversity in the training such that the model generalizes to testing, as you'll see with the DeepTraffic competition. There are also the L2 and L1 penalties, weight decay, a weight penalty, where weights are penalized if they get too large. The L2 penalty keeps the weights small unless the error derivative is huge; it produces a smoother model, and when there are two similar inputs, it prefers to put half the weight on each, to distribute the weights, as opposed to putting all the weight on one of the edges, which makes the network more robust.
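A minimal sketch of those two regularizers, not from the lecture; the layer size and penalty strength are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5):
    """Inverted dropout: zero each unit with probability 1 - keep_prob
    during training, scaling survivors so the expected activation is
    unchanged at test time."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def l2_gradient(weights, lam=1e-4):
    """L2 weight decay adds lam * w to each weight's gradient,
    penalizing large weights and preferring to spread weight
    across similar inputs."""
    return lam * weights

h = rng.normal(size=8)            # a hidden layer's activations
print(dropout(h, keep_prob=0.5))  # roughly half the units zeroed
```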
Speaker 1: 00:38:45 makes the network more robust. Our one penalty has the one benefit that for really large weights, they're allowed to be to stay, so it's allows her a few ways to remain very large. These are the regularization techniques and I wanted to mention them because they're useful to some of the competitions here in the course and I recommend to go to a playground and tenser and tenser flow playground to play around with some of these parameters where you get to online, in the browser, play around with different inputs, different features, different number of layers and regularization techniques, uh, and to build your intuition about classification, regression problems given different input data sets. So what changed why over the past many decades, neural networks that have gone through two winters are now again dominating the artificial intelligence community CPU GPU, a six. So computational power has skyrocketed from Moore's law to gps. There is huge data set including image net and others.
Speaker 1: 00:39:58 There is research: backpropagation in the eighties, convolutional neural networks, LSTMs; there have been a lot of interesting breakthroughs in how to design these architectures, how to build them such that they're efficiently trainable using GPUs. There is the software infrastructure, from being able to share data, which we'll get to, to being able to train networks, share code, and effectively view neural networks as a stack of layers as opposed to having to start from scratch, with TensorFlow, PyTorch, and other deep learning frameworks. And there's huge financial backing from Google, Facebook, and so on. To understand why deep learning works so well and where its limitations are, we need to understand where our own intuition comes from about what is hard and what is easy. The important thing about computer vision, which is a lot of what this course is about, even in the deep reinforcement learning formulation, is that visual perception, for us human beings, was formed 540 million years ago. That's 540 million years' worth of data. Abstract thought was only formed about 100,000 years ago; that's several orders of magnitude less data. So neural networks can make wrong predictions on cases that seem trivial
Speaker 1: 00:41:40 to us human beings but are completely challenging for neural networks. Here on the left is shown a prediction of a dog; with a little bit of distortion and noise added to the image, producing the image on the right, the network confidently, with 99-plus percent confidence, predicts that it's an ostrich.
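A minimal sketch of one well-known way to construct such perturbations, the fast gradient sign method (Goodfellow et al.), not the specific method behind this slide; the image and gradient here are hypothetical stand-ins, since the real gradient would come from backpropagating the network's loss to the input pixels.

```python
import numpy as np

def fgsm_perturbation(image, grad_wrt_image, epsilon=0.007):
    """Nudge every pixel a tiny step in the direction that increases
    the classification loss; the result looks unchanged to a human
    but can flip the network's prediction."""
    return np.clip(image + epsilon * np.sign(grad_wrt_image), 0.0, 1.0)

# Hypothetical stand-ins: a 'dog' image and the loss gradient at it.
image = np.random.rand(224, 224, 3)
grad = np.random.randn(224, 224, 3)
adversarial = fgsm_perturbation(image, grad)
# To a human the two images look identical; to the network the
# adversarial one can be a confident 'ostrich'.
```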
Speaker 1: 00:42:05 And there are all these problems to deal with, whether it's in computer vision data, text data, or audio. All of this variation arises in vision. There's illumination variability: the set of pixels, the numbers, look completely different depending on the lighting conditions. The biggest problem in driving is lighting conditions, lighting variability. Pose variation: objects need to be learned from every different perspective. I'll discuss that for sensing the driver: most of the deep learning work that's done on the face, on the human, is done on the frontal or semi-frontal face. Very little work has been done on the full 360-degree pose variability that a human being can take on. And there's inter-class variability: for the classification problem, for the detection problem, there are a lot of different kinds of objects, cats, dogs, cars, bicyclists, pedestrians,
Speaker 1: 00:43:07 so that brings us to object classification, and I'd like to take you through where deep learning has taken big strides over the past several years, leading up to this year, 2018. Let's start with object classification: you take a single image, and you have to say the one class that's most likely to be in that image. The most famous variant of that is the ImageNet competition, the ImageNet challenge. The ImageNet dataset is a dataset of 14 million images with 21,000 categories. For, say, the category of fruit, there's a total of 188,000 images of fruit, and there are 1,200 images of Granny Smith apples; that gives you a sense of what we're talking about here. This has been the source of a lot of interesting breakthroughs in deep learning, and a lot of the excitement in deep learning. The first big successful network, at least one that became famous in deep learning, was AlexNet in 2012.
Speaker 1: 00:44:17 It took a significant leap in performance on the ImageNet challenge. It was one of the first neural networks successfully trained on the GPU, and it achieved an incredible performance boost over the previous year on the ImageNet challenge. The challenge, and I'll talk about some of these networks, is: given a single image, give five guesses, and one of those five guesses has to be correct. The question often comes up: so how do you know the ground truth? Human-level performance is 5.1 percent error on this task, and the way the annotation for ImageNet is performed is that there's a Google search where you pull images that are already labeled for you, and then the annotation, performed on Mechanical Turk by other humans, is just binary: is this a cat or not a cat? So they're not tasked with performing very high-resolution semantic labeling of the image. Okay. So we go from 2012, with AlexNet, to today, and the big transition in 2018 of the ImageNet challenge leaving Stanford and going to Kaggle. It's sort of a monumental step, because 2015, with the ResNet network, was the first time that human-level performance was exceeded.
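A minimal sketch, not from the lecture, of the five-guess scoring rule just described; the scores here are random stand-ins for a network's 1,000-class output.

```python
import numpy as np

def top5_correct(class_scores, true_label):
    """ImageNet-style scoring: a prediction counts as correct if the
    true label is among the network's five highest-scoring classes."""
    top5 = np.argsort(class_scores)[-5:]
    return true_label in top5

scores = np.random.rand(1000)  # hypothetical scores over 1000 classes
print(top5_correct(scores, true_label=42))
```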
Speaker 1: 00:45:51 map of where deep learning is for particular what I would argue as a toy example, despite the fact that it's 14 million images. So we're developing state of the art techniques here and the next stage as we are now exceeding human level performance on this task is how to take these methods into the real world to perform scene perception, to perform driver's state perception
Speaker 1: 00:46:18 In 2016 and 2017, CUImage and SENet made unique new additions to the previous formulations, achieving an error of 2.25 percent on the ImageNet classification challenge. This is an incredible result. Okay, so you have this image classification architecture that takes in a single image and takes it through convolution and pooling layers, with fully connected layers at the end, and performs the classification task or a regression task. You can swap out that final layer to perform other kinds of tasks, including, with recurrent neural networks, image captioning and so on, or localization with bounding boxes. Or you can do fully convolutional networks, which we'll talk about on Thursday, which take an image as input and produce an image as output.
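A minimal sketch of the architecture just described, stacked convolution and pooling followed by fully connected layers, assuming TensorFlow is installed; the layer sizes are illustrative, not any particular published network.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           input_shape=(224, 224, 3)),  # convolution
    tf.keras.layers.MaxPooling2D(),                     # pooling
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),      # fully connected
    # Swap out this final layer for other tasks: regression,
    # captioning (with an RNN), bounding-box localization, etc.
    tf.keras.layers.Dense(1000, activation='softmax'),  # 1000 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```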
Speaker 1: 00:47:18 Here the output image is a segmentation, where a color indicates the category of the object: every single pixel in the image is assigned a class, a category, to which that pixel belongs. This is the kind of task that's overlaid on top of other sensory information coming from the car in order to perceive the external environment. You can continue to extract information from images in this way to produce image-to-image mappings, for example to colorize images, taking grayscale images to color images, or you can use that kind of heat-map information to localize objects in the image. So as opposed to just classifying that this is an image of a cow, R-CNN, Fast and Faster R-CNN, and a lot of other localization networks allow you to propose different candidates for where exactly the cow is located in the image, and thereby to perform object detection, not just object classification. In 2017 there have been a lot of cool applications of these architectures, one of which is background removal: again, mapping from image to image, the ability to remove a background from selfies, from human or human-like
Speaker 1: 00:48:50 pictures of faces. The references, with some incredible animations, are at the bottom of the slide, and the slides are now available online.
Speaker 1: 00:49:08 There's been a lot of work on GANs, generative adversarial networks, in particular in driving. GANs have been used to generate examples from source data, whether that's from raw data or, in this case with pix2pixHD, taking coarse, pixel-level semantic labeling of the images and producing photorealistic, high-definition images of the forward roadway. This is an exciting possibility for generating a variety of cases for self-driving cars, for autonomous vehicles to learn from: to generate, to augment the data, to be able to change the way different roads look, the road conditions, to change the way vehicles, cyclists, and pedestrians look. Then we can move on to recurrent neural networks. Everything I've talked about was one-to-one mapping, from image to image or image to number, but recurrent neural networks work with sequences. We can use sequences to generate handwriting, or to generate text captions from an image based on localization of the various detections in that image. We can perform video description generation: taking a video and combining convolutional neural networks with recurrent neural networks, using the convolutional neural networks to extract features frame to frame, and using those extracted features as input to the RNNs to then generate a labeling, a description, of what's going on in the video.
Speaker 1: 00:50:54 There are a lot of exciting approaches for autonomous systems, especially in drones, where the time to make a decision is short, same with the RC car traveling 30 miles an hour. Attentional mechanisms, for steering the attention of the network, have been very popular for the localization task, and for limiting how much interpretation of the image, how many pixels, needs to be considered in the classification task. So we can steer: we can model the way a human being looks around an image to interpret it, and use the network to do the same, and we can use that kind of steering to draw images as well.
Speaker 1: 00:51:41 Finally, the big breakthroughs in 2017 came from this: the "Pong from pixels" style of reinforcement learning, using raw sensory data with deep RL methods, which we'll talk about on Wednesday. I'm really excited that the underlying methodology of DeepTraffic and DeepCrash is using neural networks as the approximators inside reinforcement learning approaches. So, AlphaGo in 2016 achieved a monumental task, one that, when I first started in artificial intelligence, I was told was impossible for any system to accomplish: to win at the game of Go against the top human player in the world. However, that method was trained on human expert positions: the AlphaGo system was trained on previous games played by human experts, an incredible accomplishment. AlphaGo Zero, in 2017, was able to beat AlphaGo and many of its variants by playing itself, from zero information: no knowledge of human experts, no games, no training data, very little human input. And what's more, it was able to generate moves that were surprising to human experts. I think it was Einstein who said that the key mark of intelligence is imagination. I think it's beautiful to see an artificial intelligence system come up with something that surprises human experts,
Speaker 1: 00:53:31 truly surprises.
Speaker 1: 00:53:36 For the gambling junkies: DeepStack and a few other variants were used in 2017 to win at heads-up poker. Again, another incredible result that I was always told, in artificial intelligence, would be impossible for any machine learning method to achieve, and it was able to beat a professional player. Several competitors have come along since, but we have yet to be able to win in a tournament setting, so with multiple players. For those of you familiar, heads-up poker is one-on-one; it's a much, much smaller, easier space to solve. There are a lot more human dynamics going on when there are multiple players, but that's the task for 2018.
Speaker 1: 00:54:22 And the drawbacks. One of my favorite videos, I show it often, is of Coast Runners, for these deep reinforcement learning approaches. The learning of the reward function, the definition of the reward function, controls how the actual system behaves, and this will be extremely important for us with autonomous vehicles. Here the boat is tasked with gaining the highest number of points, and it figures out that it does not need to race, which is the whole point of the game, in order to gain points, but can instead pick up green circles that regenerate themselves over and over. This is the counterintuitive behavior of a system, behavior that would not be expected when you first designed the reward function, and this is a very formal, simple system; nevertheless, it is extremely difficult to come up with a reward function that makes it operate the way you expect it to operate. Very applicable to autonomous vehicles. And of course, on the perception side, as I mentioned with the ostrich and the dog: with a little bit of noise, with 99.6 percent confidence, we can predict that the noise up top is a robin, a cheetah, an armadillo, a lesser panda. These are outputs from actual state-of-the-art neural networks,
Speaker 1: 00:55:51 taking in noise and producing a confident prediction. It should build our intuition to understand that the visual characteristics, the spatial characteristics, of an image do not necessarily convey the level of hierarchy necessary to function in this world. In a similar way, with the dog and the ostrich, a network, with a little bit of noise, can confidently make the wrong prediction, thinking the bus is an ostrich. They're easily fooled, but not really, because they perform the task that they were trained to do, and they do it well. So we have to make sure we keep our intuition
Speaker 1: 00:56:44 optimized to the way machines learn, not the way humans have learned over the 540 million years of data that we've gained through developing the eye, through evolution. The current challenges we're taking on: first, transfer learning. There's a lot of success in transfer learning between domains that are very close to each other, so image classification from one domain to the next. There's a lot of value in forming representations of the way natural scenes look in order to do scene segmentation, in the driving case for example, but we're not able to make any bigger leaps in the way we perform transfer learning. The biggest challenge for deep learning is to generalize, to generalize across domains. It lacks the ability to reason in the way that we've defined understanding previously, which is the ability to turn complex information into simple, useful information:
Speaker 1: 00:57:44 to convert domain-specific, complicated sensory information that doesn't relate to the initial training set. That's the open challenge for deep learning: train on very little data, and then go reason and operate in the real world. Right now, these networks are very inefficient. They require big data; they require supervised data, which means they incur the cost of human input; they're not fully automated. Despite the fact that the feature learning, the big breakthrough, is performed automatically, you still have to do a lot of design of the actual architecture of the network, and all the different hyperparameter tuning; human input, perhaps a little more educated human input, in the form of PhD students, postdocs, and faculty, is required to tune these hyperparameters. Nevertheless, human input is still necessary: for the most part, they cannot be left alone. The reward: defining the reward, as with Coast Runners, is extremely difficult for systems that operate in the real world. Transparency: quite possibly it's not an important one, but neural networks are currently black boxes for the most part. Except through a few successful visualization methods that visualize different aspects of the activations, they're not able to reveal to us humans why they work or where they fail.
Speaker 1: 00:59:17 This is a philosophical question for autonomous vehicles: we may not care, as human beings, if the system works well enough, but I would argue that it will be a long time before systems work that well, so we will care, and we'll have to work together with these systems. That's where transparency, communication, and collaboration are critical. And edge cases: it's all about edge cases in robotics and autonomous vehicles. 99.9 percent of driving is really boring; it's the same, especially highway driving, traffic driving. The obstacle avoidance, the car following, the lane centering: all of these problems are trivial. It's the edge cases, the trillions of edge cases, that need to be generalized over with a very small amount of training data. So again, I return to: why deep learning? I mentioned a bunch of challenges, and this is an opportunity: an opportunity to come up with techniques that operate successfully in this world. So I hope the competitions we present in this class, in the autonomous vehicle domain, will give you some insight and an opportunity to apply them. Some of these cases are open research problems: with SegFuse, semantic segmentation for external perception; with DeepTraffic, control of the vehicle; and with DeepCrash, control of the vehicle under underactuated, high-speed conditions, and driver state perception.
Speaker 2: 01:01:07 Okay,
Speaker 1: 01:01:10 So with that, I wanted to introduce deep learning to you today, before we get to the fun tomorrow of autonomous vehicles. We'd like to thank NVIDIA, Google, Autoliv, Toyota and, at the risk of setting off people's phones, Amazon Alexa Auto. But truly, I would like to say that I've been humbled over the past year by the thousands of messages we received, by the attention, by the 18,000 competition entries, by the many brilliant people across the world, not just here at MIT, that I got a chance to interact with, and I hope we go bigger and do some impressive stuff in 2018. Thank you very much, and tomorrow is self-driving cars.