All right. So, today's lecture: introduction to neural networks and deep learning. We'll start with a very quick intro to these topics, and then we'll switch and dive deep into neural networks. All right. So, the field of AI originated in 1956. Sadly, it didn't originate at MIT; it originated at Dartmouth, because all these people got together at Dartmouth. I guess it's got a nice quad or whatever. They got together and defined the field. But fortunately for us, MIT was very well represented. We have Marvin Minsky, who founded the MIT AI Lab; John McCarthy, who invented Lisp and later defected to the West Coast; and Claude Shannon, who invented information theory and was a professor at MIT. So MIT was well represented. These folks founded the field, and they were so bright that they thought AI was going to be "substantially solved," quote unquote, by that fall. Now, obviously, it turned out a bit differently than they expected. It's been, whatever, 67 or 68 years since its founding, and the field has gone through, essentially, in my opinion, three seminal breakthroughs: starting from the traditional approach, it moved to machine learning, then deep learning, and then generative AI. So let's take a very quick look at each of these breakthroughs and what motivated them. Let's start with the traditional approach to AI. And what is AI? Informally, AI is the ability to imbue computers with the ability to do things that typically only humans can do: cognitive tasks, thinking tasks, and things like that.
And the most common-sensical way to do that is to say: "Well, if I want the computer to do something complicated like play chess, I'm just going to sit down with a few chess grandmasters, show them a whole bunch of board positions, and ask them how they figure out how to respond, how to play the next move." I'll talk to all these people, and then I'll write down a whole bunch of rules: if this is the board position, move this; if this is the board position, move that; and so on. Or I might sit down with a cardiologist and ask, "Okay, how do you actually interpret an ECG?" They will similarly give me a bunch of if-then rules. I take all these rules, I put them into the computer, and boom, I have a system that can do what a human can do. Right? Now, this approach, even though it's common-sensical and kind of makes sense, had success in only a few areas. So the interesting question is: why was it not pervasively successful? It seems like a pretty good idea to me, right? And the people who came up with these things were smart people, not dumb people. They knew what they were doing. So why did it not work?

>> Because it's time-intensive — you'd have to run through all the scenarios that can ever exist, and still some new scenarios can come up that you didn't cater for initially.

>> Right. So there are two aspects to what you said. The first is that it's time-intensive. That, as it turns out, is not a big deal, because computers keep getting faster and faster.
The second thing is actually the key thing: it doesn't generalize to new situations very well. The problem is that there are an infinite number of things you're going to see when you deploy these systems in the real world. By definition, what you're training on is a small sample of rules, so these rules are very brittle. But there's actually an even more interesting reason, and that reason is that we know more than we can tell. This is called Polanyi's paradox. The idea is that if I come to you and say, "Hey, here's a picture — is it a dog or a cat?", you will tell me within — I believe they've measured it — something like 20 milliseconds whether it's a dog or a cat. And then, if I ask you to explain exactly how you figured that out, you'll come up with a bunch of alleged reasons: "Oh, you know, if it has whiskers, I think it's a cat," or whatever. But the problem is, first of all, you can't really articulate what's going on in your head, how you do these things. And second, even if you articulate it, oftentimes your articulation has no correspondence with how your brain actually does it. So you're incomplete and a liar. That's Polanyi's paradox. So, if you can't even tell me how you do something, how the heck am I supposed to take it and put it into a computer? It doesn't work. And the second reason is that we can't write down these rules for all possible situations — edge cases, corner cases, and so on. And the world is full of edge cases.
So, for these reasons, this approach didn't work. And so a different approach was developed, which basically said: "Hey, instead of explicitly telling the computer what to do, why don't we simply give it lots of examples of inputs and outputs — chess positions and next moves, ECGs and diagnoses — and then use statistical techniques to learn a mapping, a function, that goes from the input to the output?" That was the idea, and this idea is machine learning. So machine learning is basically just a fancy way of saying, "learn from input-output examples using statistical techniques." Good. Now, there are numerous ways to create machine learning models, and if you've ever done linear regression, congratulations, you've been doing machine learning. Only one of those methods happens to be something called neural networks. There are many other methods, and in fact you've probably used them if you've taken a course like The Analytics Edge or something similar. Okay. So machine learning has had tremendous impact around the world. At this point it's widely accepted; it's a very, very successful technology. In fact, whenever people talk about AI, chances are they're actually talking about machine learning. It's just that AI sounds cooler. The only problem is that, for machine learning to work really well, the input data has to be structured.
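To make the "learn a mapping from input-output examples" idea concrete, here is a minimal sketch of the linear-regression case mentioned above, using made-up experience-vs-salary numbers and the closed-form least-squares solution:

```python
# A tiny "machine learning" example: learning a linear mapping from
# input-output pairs via ordinary least squares. The data are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # inputs, e.g. years of experience
ys = [52.0, 61.0, 69.0, 81.0, 90.0]   # outputs, e.g. salary in $k

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least-squares solution for slope and intercept.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

def predict(x_new):
    """The learned function: maps a new input to a predicted output."""
    return slope * x_new + intercept

print(round(slope, 1), round(intercept, 1), round(predict(6.0), 1))
# 9.6 41.8 99.4
```

Nothing here was told explicit rules; the slope and intercept were learned entirely from the examples, which is the whole point.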
And what I mean by that is data that can essentially be numericalized and put into the rows and columns of a spreadsheet. So, for example, let's say I want to put together a dataset of patients — their symptoms and characteristics — and whether, in the year after they showed up at the doctor's office, they had a cardiac event or not. I might create a dataset with age, smoking status (yes/no), exercise, and so on. Either these values are numerical, or, if they're not numerical, they're categorical — yes/no for smoking, things like that — which means that if you have categorical variables, you can numericalize them pretty easily. You folks have done some machine learning before, so you know that things like one-hot encoding can be applied to make them all numerical. So the point is, you can render the data into the rows and columns of a spreadsheet pretty easily. That's what I mean by structured data. But the situation is very different if you have unstructured data. So here's an image of a cute puppy — this is my puppy, by the way, from many years ago. Sadly, he's no more. His name was Google. My DMD alums know Google well. So, this is Google.
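As a small illustration of the one-hot encoding just mentioned, here is a sketch on hypothetical patient rows (column names and values are made up); each categorical value becomes its own 0/1 column, leaving the whole row numerical and spreadsheet-like:

```python
# Hypothetical patient records with categorical columns.
rows = [
    {"age": 54, "smoker": "yes", "exercise": "low"},
    {"age": 41, "smoker": "no",  "exercise": "high"},
    {"age": 63, "smoker": "no",  "exercise": "low"},
]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    for r in rows:
        for c in categories:
            r[f"{column}_{c}"] = 1 if r[column] == c else 0
        del r[column]
    return rows

for col in ("smoker", "exercise"):
    one_hot(rows, col)

print(rows[0])
# {'age': 54, 'smoker_no': 0, 'smoker_yes': 1, 'exercise_high': 0, 'exercise_low': 1}
```

In practice you would use a library helper (e.g. pandas `get_dummies` or scikit-learn's `OneHotEncoder`) rather than hand-rolling this, but the transformation is the same.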
If you want to take Google — this picture — and figure out how to numericalize it, the first thing you need to understand is how the picture is represented digitally, inside the computer: every picture like this is represented using three tables of numbers. We'll get to what these numbers mean later on, but the point I'm making is that each number represents the amount of light, on a scale of 0 to 255, in that location, in that pixel. That's all — the amount of light. One table is the amount of red light, one the amount of green light, one the amount of blue light. Now, you'll agree with me that if you look at a value like 251 and say, "Okay, at this location there is a lot of blue light, because it's 251 out of a possible 255" — maybe a lot of blue somewhere here — whether that area is blue because of a piece of sky, some water, or a bunch of blue paint, it's going to say 251. So the underlying reality, the underlying object being described, has nothing to do with the 251. That's the whole problem: the raw form of the data has no intrinsic connection to the underlying thing it describes. And given that there's no connection between the number and what it's describing, how the heck can any algorithm do anything with it? It can't. So what you have to do is something called feature engineering, or feature extraction.
You have to manually take all these things and create, essentially, a spreadsheet from them. So let's say you have a bunch of birds, and you're trying to build a bird classifier to figure out what species each one is. You might have to take each picture and measure the beak length, the wingspan, the primary color, and so on. You're basically structuring the unstructured data manually. And for this process of structuring unstructured data, we use the word representation: we take the raw data and represent it in a different form. The reason I'm focusing on the word "representation" is that it becomes really, really important a bit later on, when we get to deep learning. So we have to represent the data in a different way for it to work — that's the basic idea. What that means is that, historically, researchers would manually develop these representations. And once you have representations, you can just use traditional linear regression or logistic regression to get the job done. So the whole name of the game is the representations. In fact, people doing PhDs in, say, computer vision would spend something like four years developing amazing representations for solving one particular little problem. Say we have a bunch of CAT scans, and we need to figure out whether there's evidence in the scan of a particular kind of stroke.
They might sit and develop all kinds of representations and test them, and so on. And then they'll finally declare victory: "Yay, I'm done with my PhD. Here is this amazing representation, and you can build a classifier with it to predict a particular kind of stroke with high accuracy." So that's where the world was. Now, as you can imagine, developing representations, because it's so manual, is a massive human bottleneck, and this sharply limited the reach and applicability of machine learning, as you would expect. To address this problem, a different approach came about, and that's deep learning. Deep learning sits inside machine learning. And deep learning can handle unstructured input data without upfront manual processing — meaning it will automatically learn the right representations from the raw input. "Automatically" is the keyword. It automatically learns representations, which means you can give it structured data, pictures, text — anything you want — and it just learns. And since the representations are extracted automatically, you can imagine a pipeline: the raw data comes in, there's a bunch of stuff in the middle learning these representations automatically without your help, and then boom — you just attach a little linear regression or logistic regression at the end, and the problem is solved. That, in a nutshell, is deep learning.
Input, a whole bunch of representations being learned, and then piped into a linear or logistic regression model. And the amazing thing is that this simple idea — this simple idea — is just incredibly powerful. That idea has led to ChatGPT, AlphaGo, AlphaFold, and so on. And I kid you not: I've been doing deep learning for about 10 years now, and every time I look at it, I literally get goosebumps every so often, that something so simple could be so powerful. It really boggles the mind. I'm just so lucky to be alive and working during this period. And coming from someone who has been in the industry a long time, this sort of breathless exclamation is rare — particularly because I'm not in marketing. I actually mean it. With apologies to various marketing folks — I just realized this is being taped. Okay. So, this has demolished the human bottleneck for using machine learning with unstructured data, and it comes from the confluence of three forces: new algorithmic ideas, a whole lot of data, and, very importantly, access to parallel computing hardware in the form of these things called GPUs — graphics processing units. These three forces came together and were applied to an old idea called neural networks, and that's basically deep learning. I'll go through it very quickly, because obviously we're going to spend half the semester looking into this in detail.
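The shape of that pipeline — raw input, representation layers in the middle, a logistic-regression head at the end — can be sketched as a forward pass. This is purely illustrative: the weights are random (untrained), and the sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=64)                  # raw input, e.g. flattened pixels

# "Stuff in the middle": layers that would learn representations.
W1, b1 = 0.1 * rng.normal(size=(32, 64)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(16, 32)), np.zeros(16)
h1 = relu(W1 @ x + b1)                   # representation, layer 1
h2 = relu(W2 @ h1 + b2)                  # representation, layer 2

# The "little logistic regression at the end."
w, b = 0.1 * rng.normal(size=16), 0.0
p = sigmoid(w @ h2 + b)

print(0.0 < p < 1.0)                     # always a valid probability
```

Training would adjust all of these weights jointly, so the middle layers end up learning representations that make the final logistic regression's job easy.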
So, what's the immediate application of the ability to automatically handle unstructured data? What is the no-brainer application? It's okay if it's obvious — tell me.

>> Image classification.

>> Right. So, image classification, yes. You can take an image — a good example of unstructured data — and do some classification on it. But more generally, what I'm getting at is that every sensor in the world can be given the ability to detect, recognize, and classify what it's sensing. Every sensor. Because remember, what does a sensor do? A sensor is just a receptacle for unstructured data. A camera is a receptacle for unstructured video, or unstructured still images. A microphone: unstructured audio. So for every sensor, you can imagine sticking a little deep learning system behind it. And now suddenly, what comes out of that deep learning system, you can count, you can classify, you can detect — you can do all kinds of stuff. In short, you can analyze. And you can predict. The way I'm describing it right now, you'll say, "Yeah, duh, obviously." But you know what, this "obvious" thing is actually not at all obvious in terms of whether it'll help you find interesting applications or not. So here's something I literally saw last week — actually, I have another slide before that, but we're coming to it. For instance, every time you use Face ID to unlock your phone, this is the basic principle at work.
The camera in the iPhone is the sensor, and they stuck a deep learning system behind it to do image classification — is it the owner's face or not? That's what it's classifying. And here you have a breast cancer detection system that works from a mammogram. By the way, this is a very interesting picture. There's a professor in EECS, Regina Barzilay, who's a very well-known expert in this field, and she has built a breast cancer detection system that has been deployed at Mass General Hospital. And it turns out she's a breast cancer survivor — she's all good now. But after she built her system, I heard that she ran it against her own mammograms from many years prior, from when she went for a mammogram and was told that everything was fine. She ran the system on that mammogram, and it came back and said, "Here is a problem." So it's a very interesting example of a deep learning system picking up something that a radiologist could not. These things can be quite powerful. Obviously, any self-driving system has numerous deep learning algorithms running under the hood: pedestrian detection, stoplight detection, zebra crossing detection, and so on. It's also very heavily used in visual inspection in manufacturing: instead of people looking at a part and saying, "Okay, there is a dent, or there's a scratch," you now have cameras and a little system that is a dent detector, a scratch detector, and so on.
That's going on right now. And now I come to the example I saw last week, which is an example of how you can create dramatically better products if you really internalize this idea. It's almost like you're looking at the world and saying, "Oh, there's a sensor. Can I attach a deep learning thing behind it?" That's the way you should be looking at the world for startup ideas. So here's an example: these, apparently, are the world's first smart binoculars, released two weeks ago. You look at the bird, and it tells you what kind of bird it is, right there. It's a simple idea, but imagine: if you're the first out of the gate with this feature, you'll have a little bit of an edge until everybody catches up, like three months later. Let's be very clear: there are no long-term monopoly windows in the world. There are only short-term windows, so the hunt is always on for a little monopoly window. So here's an example of that. I encourage you to always think about the world as: where are the sensors here, and can I attach something behind the sensor to do something useful with it? All right. Now let's turn our attention to the output. We've been talking about structured data, unstructured data, and how deep learning has unlocked the ability to work with unstructured data, but we've been neglecting the output side of the equation. Traditionally, we could predict single numbers, or a few numbers, pretty easily.
So, you've all done the canonical "should this person's loan application be approved" exercise in machine learning: you predict the probability that a borrower will repay a loan, based on a whole bunch of data. Or in supply chain, you predict the demand for a product next week. Or you could predict a bunch of numbers: given a picture, which of 10 kinds of furniture is it? You predict 10 numbers — 10 probabilities that add up to one. Or you can predict a bunch of numbers that don't have to add up to one, such as the GPS coordinates of an Uber ride. These are all simple structured outputs — just a few numbers. What we could not do very easily was generate pictures like this. We could not generate unstructured data; we could only consume it — text, pictures, audio, and so on. With generative AI, that problem is gone. Generative AI is the ability to actually create unstructured data, and therefore it sits within deep learning. It still runs on deep learning; it's just one kind of deep learning. There's plenty of stuff going on in deep learning that's got nothing to do with generative AI. Nowadays, of course, if you're a self-respecting entrepreneur who wants to ride this craze, you'll probably declare whatever you're doing to be generative AI. And some VCs may actually be ready to fund you — who knows?
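The "10 probabilities that add up to one" output is usually produced by a softmax over the model's raw class scores. Here is a minimal sketch with made-up scores for a hypothetical 10-way furniture classifier:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores, one per furniture class.
scores = [2.0, 0.5, -1.0, 0.0, 1.2, 0.3, -0.7, 0.9, 0.1, -0.2]
probs = softmax(scores)

print(round(sum(probs), 6))               # 1.0
print(probs.index(max(probs)))            # 0 -- the highest-scoring class
```

The unconstrained case — say, predicting two GPS coordinates — would simply skip the softmax and output the raw numbers.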
But the point is, there's plenty of stuff going on in deep learning that's got nothing to do with generative AI. That's the overall picture, though. Now we can produce unstructured outputs, like pictures. You can take an image and come up with a nice text description of it. This, by the way, is a very famous picture in the world of computer vision; we're actually going to analyze it a little later in the semester. You can obviously go the other way, from a very complicated caption to an image. You can go from text to music. — Can people hear it? Okay, good. — And of course, we can go from text to text, i.e., ChatGPT. And as of a few months ago, things have gotten even more interesting: you can send text and an image in, and get text out. In fact, as of a few weeks ago, you can send text, image, text, image in an arbitrary sequence into the system, and it will come back to you with text and images. So things are becoming multimodal. I just want to share a really fun example I saw recently. This person sends in this picture — can folks see it? It's a very complicated parking sign, apparently in San Francisco. And they ask: "It's Wednesday at 4:00 p.m. Can I park here? Tell me in one line" — because you really don't want GPT-4 giving you a big essay about this; you literally want to park. So GPT-4 comes back and says, "Yes, you can park here for up to 1 hour starting at 4:00 p.m."
And folks, I double-checked this — it's correct. We all know these things hallucinate, right? Can you imagine getting a parking ticket and telling the judge, "I'm sorry, I didn't realize it was hallucinating"? So you have to double-check. So yes, things are getting multimodal very quickly. And the picture here is that within gen AI, we used to have these separate circles: text to text, text to image, text to music, text to this, text to that, and so on. Those are all beginning to merge inside gen AI, because multimodal models are going to become the norm this year. We already have really good closed models, and we actually already have very good open-source multimodal models. My feeling is that by the end of the year, the idea of using a text-only model is going to feel like, "Really? You still do that?" It's going to become a quaint, old-fashioned thing. I think multimodality is going to become the norm. So that's where the world is, and this is the landscape. Any questions on the landscape, before we actually start doing some math?

>> [Audience question, inaudible]

>> You mean the evidence of that being a problem would have been smaller? Yeah. So the question is: in general, how do you train your models so that they give you the right answers, given that, over the passage of time, the amount of evidence in the data could be highly variable?
So, in this particular case — the professor I talked about — everything at that point was going through an expert radiologist. Five years ago, this mammogram was seen by a radiologist, and that person concluded there was no problem. So that was the training label — the wrong training label. Typically, what happens is that training labels can be wrong some small fraction of the time, so you need systems that are robust. Your data needs to be complete, it needs to be comprehensive, and it needs to have correct labels. If these requirements are not met, your systems are not going to be that good. But as it turns out, with neural networks, even with some amount of noise in the labels, they still do a pretty good job. So that's the general idea.

>> [Audience question, inaudible]

>> The verification comes from the human. Remember, when we look at radiology data, the input is an image — say, a mammogram — and then a human radiologist, or a set of radiologists, has said this has a problem or does not have a problem. That is called the ground truth. It is this combination of image and ground-truth label that's being used to train these models.

>> [Audience question about embodiment]

>> Embodiment? So, are we going to cover embodiment? Embodiment here refers to the fact that robots need to operate in the real world, and so robots are an example of what's called embodied intelligence.
Unfortunately, due to the constraints of time, we're not going to get into robotics at all. But I will say that a lot of the deep learning material we're going to talk about consists of fundamental building blocks in modern robotic systems. All right. So, in summary: X and Y can be anything, and they can be multimodal. I literally could not have put up this slide maybe two years ago. It's very simple in how it looks, but it's very profound: you can learn a mapping from anything to anything at this point, very easily, as long as you have enough data. Now, note that all this excitement we see around us stems from deep learning. Everything depends on deep learning. And so, if you understand deep learning, a lot of interesting things become possible. So let's get going. We'll start with the very basics: what's a neural network? Now, recall logistic regression from back in the day. What is logistic regression? You send in a bunch of numbers — a vector of numbers — and you usually get a probability out, between 0 and 1: the probability of something or other. This logistic regression model can also be represented in this form, if you recall. Basically, we take all these numbers and run them through a linear function.
[25:17] We run it through a linear function, you [25:19] get a number, and then we take that [25:20] thing and run it through 1/(1 + e^(−that)), [25:25] and that's guaranteed to give you a [25:26] number between 0 and 1, which can be [25:27] interpreted as a probability, and that's [25:29] logistic regression. Okay? And the [25:31] canonical, you know, [25:33] uh loan approvals, things like that, all [25:35] fall into this sort of convenient [25:36] bucket. [25:38] Okay? So, this should be super familiar. [25:44] All right. Now, we're going to actually [25:46] look at this, you know, simple, modest, [25:48] humble little operation [25:51] using the lens of a network of [25:53] mathematical operations, and the reason [25:55] why we do it will become clear a bit [25:56] later. [25:57] So, we'll take this very simple example [25:59] where we have, let's say, two [26:02] variables, GPA and experience, right? [26:05] This is the GPA of some graduates, uh [26:07] number of years of work experience, and [26:09] then this is the dependent variable, [26:11] which is either 0 or 1: 0 if they [26:14] don't get called for an interview, 1 if [26:16] they get called for an interview. Okay? [26:18] It's a two-input variable, one-output [26:20] variable problem. Okay? And it's a [26:22] classification problem, because we're [26:24] classifying people into will they get [26:25] called for an interview, yes or no. [26:27] Okay? [26:29] And so, that's the setup for this [26:31] problem. [26:33] And let's say that we [26:38] try to [26:40] fit a logistic regression model to it. So, if you're familiar with R, for [26:41] example, you would use something like [26:43] GLM to fit this model. [26:46] Um if you use something like statsmodels [26:48] in Python, there's a similar function [26:49] for it. Scikit-learn, there's another [26:52] function for it. You get the idea, [26:53] right?
This [26:55] You can use whatever favorite methods [26:57] you have for logistic regression [26:58] modeling to get this job done. And if [27:00] you do that with this little data set, [27:02] you're going to get these coefficients. [27:04] Right? The 0.4 is the intercept, 0.2 is [27:06] the coefficient for GPA, 0.5 for [27:08] experience. And that is the resulting [27:09] sigmoid function. [27:11] Okay? [27:12] All right. Cool. So, now let's actually [27:14] rewrite this formula as a network in the [27:17] following way. So, first, what we'll do [27:19] is we'll take GPA and experience and [27:20] stick them here on the left side, and [27:22] we'll put little circles next to them, [27:24] and we'll call them the input nodes. [27:26] Okay? And so, imagine that somebody [27:29] writes a GPA into the circle, 3.5, or, [27:32] you know, years of experience, 2.0, and [27:34] then it flows through this arrow, [27:36] and as it flows through, it gets [27:38] multiplied by its coefficient, 0.2. The [27:40] 0.2 is coming from here. [27:42] Similarly, experience gets multiplied by [27:44] 0.5, it comes in here, and this node, as [27:47] the plus indicates, is adding everything [27:49] that's coming into it. [27:50] So, it's adding 0.2 * GPA, 0.5 * [27:52] experience, plus the intercept, which is [27:54] the green arrow coming in on its own. [27:57] It comes through here, and what comes [27:58] out of this is just a single number, [28:01] and that number goes into this little [28:02] circle, [28:04] and then out pops a probability. [28:07] Okay? [28:08] So, I've [28:10] done this in a ridiculously [28:13] long-winded way of writing a simple [28:15] function. [28:16] Okay? And the reason why I'm doing it [28:18] will become clear in a second. [28:21] Okay? So, this is a little network of [28:23] operations for the simple function.
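As a concrete sketch, here is how that fit might look in Python with scikit-learn (the lecture also mentions R's GLM and statsmodels). Note: the rows below are invented for illustration, since the lecture's dataset isn't reproduced in the transcript, so the fitted coefficients won't match the slide's 0.4 (intercept), 0.2 (GPA), 0.5 (experience).

```python
# Hypothetical stand-in for the lecture's interview dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: GPA, years of experience. Target: 1 = called for an interview.
X = np.array([[3.9, 2.0], [3.5, 0.5], [2.8, 4.0], [2.1, 0.0],
              [3.2, 1.0], [3.8, 3.0], [2.5, 0.5], [3.0, 2.5]])
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print("bias (intercept):", model.intercept_)
print("weights (coefficients):", model.coef_)

# Predicted probability of an interview call for a new candidate.
p = model.predict_proba([[3.8, 1.2]])[0, 1]
print("P(interview):", p)
```

Any of the named libraries would do the same job; the point is only that the fit hands you one intercept and one coefficient per input.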
[28:25] And so, for instance, how you would use [28:27] it to make a prediction: [28:29] let's say someone has a 3.8 GPA and 1.2 [28:31] years experience. You just plug it in [28:33] here, [28:34] do the math, you get 0.76, same thing [28:36] here, it comes in here, add them all up, [28:38] you get 1.76, you run 1.76 through the [28:40] sigmoid, you get 0.85, and that is the [28:43] probability that that particular [28:44] individual may get called for an [28:45] interview. [28:46] Okay? At this point, we're just doing [28:48] logistic regression, nothing more [28:49] complicated. [28:51] Okay? So, um now, if you have many [28:54] variables, not two variables, like X1 [28:56] through XK, the same sort of [28:58] logic applies. Each one has some [28:59] coefficient, and then there's an [29:01] intercept, they all get added up here, [29:03] run through a sigmoid, and out pops this [29:04] number. Okay? Notice how the data flows [29:07] from left to right. [29:09] Okay? [29:10] All right. Any questions on this? [29:15] All right. Good. [29:16] So, now terminology. [29:18] Uh so, you'll discover [29:20] that the world of neural networks and [29:21] deep learning has its own terminology. [29:24] They have their own ways of referring to [29:25] things that the rest of the world has [29:26] been referring to using something else for [29:28] the longest time. [29:29] Right? It's kind of annoying sometimes, [29:31] but it's the way it is. So, um [29:35] remember in regression, we used to call [29:37] those numbers next to each variable [29:38] coefficients, [29:39] and the constant thing an intercept? [29:41] Well, guess what? In this world, [29:43] those coefficients are actually [29:44] called weights, [29:46] and the intercepts are called biases. [29:49] So, in the neural network world, [29:50] these are called weights and biases.
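The worked example above is easy to replay in a few lines of Python; the numbers are exactly the ones from the lecture:

```python
import math

def sigmoid(z):
    # 1 / (1 + e^(-z)): squashes any number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias from the fitted model in the lecture.
bias, w_gpa, w_exp = 0.4, 0.2, 0.5

gpa, years_exp = 3.8, 1.2
z = bias + w_gpa * gpa + w_exp * years_exp  # 0.4 + 0.76 + 0.60 = 1.76
p = sigmoid(z)                              # ~0.85
print(round(z, 2), round(p, 2))             # 1.76 0.85
```

That 0.85 is the predicted probability that this candidate gets called for an interview.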
[29:53] And sometimes, if you're a little lazy, [29:54] you may just call the whole thing [29:55] weights. [29:56] Okay? So, when you see in the newspaper [29:58] that, you know, "Oh my god, this amazing [30:00] model's weights have been leaked [30:03] on the internet or on BitTorrent or [30:05] something," that's what's going on, [30:06] right? All these coefficients have been [30:08] leaked. Because once you know what the [30:09] coefficients are and what the [30:11] architecture is, you can just [30:12] reconstruct the model. [30:15] All right. So, that's what's going on [30:16] here. [30:17] Now, why did we do this network [30:19] business? Why did we write it as a [30:20] network? [30:23] Yeah, what is the advantage? Any [30:24] guesses? [30:34] When you have multiple functions, [30:38] it's just easier to see it that way. [30:40] Right. If you have lots of things going [30:41] on, it's easier to see if you [30:43] actually write it in graphical form. [30:45] Yes, correct. [30:46] But, is it only like a usability [30:49] advantage? [30:51] I mean, the thing is, you want different [30:53] functions for different layers of that. [30:55] Uh-huh. [30:56] Okay. [30:57] So, maybe we want to use different [30:59] functions in different layers. But, I [31:00] think there's actually an even larger, [31:02] more basic point, which is [31:04] that [31:05] the moment you write it [31:07] down, you suddenly realize [31:09] that I could have lots of things in the [31:10] middle. [31:12] I don't have to go from the input to the [31:13] output directly. I can do lots of things [31:15] in the middle, right? That's sort of the [31:17] key idea. So, what you do is [31:20] So, remember the notion of learning [31:22] representations of unstructured data, [31:24] right? Where you take a picture and say [31:25] beak length and things like that, right?
[31:27] And remember, I said deep learning [31:29] actually automatically learns these [31:30] things. Where is that automatic learning [31:33] coming from? [31:34] Well, this is where it's coming from. [31:36] So, what we do is we take this thing, [31:38] right? It's just a logistic [31:39] regression model. Inputs [31:41] get multiplied and added up as a linear [31:43] function, run through a sigmoid. [31:45] And then [31:46] we are like, "Hmm, if we want to learn [31:48] representations of the raw input, we [31:51] better be doing something in the middle [31:53] here." [31:54] Because the output is the output. [31:56] That's not going to change. [31:58] You know, it's either a dog or a [32:00] cat. You don't have any choice [32:02] as to what it is. Okay? The only agency [32:05] you have at this point is you can take [32:07] the raw input and do things in the [32:09] middle with it. [32:11] You can do a lot of stuff in the middle [32:12] and then run it through something to get [32:14] the output. Okay? So, in [32:18] any mathematical discipline, [32:20] if someone comes to you and says, [32:22] "Here's a bunch of data. [32:23] I want you to do something with it," [32:25] what is the [32:27] most basic first thing you should do? [32:31] Run it through a linear function. [32:34] The most basic thing in math is a linear [32:36] function. So, given anything, just run [32:37] it through a linear function. See what [32:38] happens. [32:40] So, that's exactly what we can do. So, [32:42] the simplest thing we can do here is [32:44] insert a bunch of linear functions. [32:46] So, what we do is we take all this input and [32:49] we just run a linear [32:50] function on it. So, think of it as [32:52] X1 * 2 + X2 * 4, and all the way to XK * [32:56] 9, plus some intercept, and boom, it goes [32:58] out the other end. So, this little [33:00] circle here with a plus in it is just
[33:05] Uh [33:06] that is just a [33:08] shorthand for a linear function. [33:10] So, whenever you see a circle with a [33:11] plus, it's just a shorthand for a linear [33:13] function. Okay? So, you can take this [33:15] whole thing and run it through a linear [33:16] function, and when you do it, you'll get [33:17] some number right there. You'll get some [33:19] number. So, you've taken these K numbers [33:21] and you've sort of [33:23] compressed them in some way into one number. [33:25] Okay? [33:26] But, you don't have to stop at one [33:28] number. You can do more. [33:30] So, we can have a stack of linear [33:31] functions in the middle. [33:33] Right? There's a linear function here, [33:35] another one here, another one here. At [33:37] this point, the K numbers you have [33:40] K could be, for example, 1,000. [33:42] Right? It's just the size of your input [33:43] data. [33:44] You've taken these K things and you've [33:45] compressed them into three numbers at [33:47] this point. [33:48] Okay? [33:50] So, okay, maybe three is the right [33:52] number, maybe 10 is the right number. We [33:53] don't know. [33:54] And we'll get to how we know [33:55] what the right number is later on. [33:58] So, we can stack as many linear [33:59] functions as we want. [34:01] So, we have transformed these K things [34:02] into a three-dimensional vector, right? [34:04] K numbers become three numbers. [34:06] Um [34:07] and now we can flow these [34:10] three numbers through some other little [34:12] function. [34:13] Okay? [34:16] And as you will see in a few minutes, [34:18] that function is called an activation [34:19] function, [34:20] and it's chosen to be a non-linear [34:22] function, [34:23] because if you don't choose it to be a [34:24] non-linear function, all the effort we [34:26] are doing is going to be a total waste [34:28] of time. [34:30] Okay?
For now, just [34:32] take it on faith that you need to have [34:34] non-linear functions here. [34:36] But, note that the three numbers here [34:39] are still three numbers. They are three [34:41] different numbers, but they're still [34:42] three numbers. [34:43] And once we do this, we'll be like, "You [34:45] know what? This was fun. Let's do it [34:46] again." [34:48] Okay? So, you can do it again. [34:52] And you can keep on doing it. You can [34:53] do it 100 times if you want. [34:55] And the key thing is that every time you [34:57] do it, you're giving this network some [35:00] ability, some capacity, to learn [35:03] something interesting from the data. [35:05] To learn an interesting representation. [35:07] Now, of course, you're thinking, "Well, [35:09] how do we know it's interesting? How do [35:10] you know it's a useful thing?" And we'll [35:12] come to all that later on. [35:14] Right? We're just giving it the [35:14] capacity, the potential, to learn [35:16] interesting things from the data. [35:17] Whether it actually lives up to its [35:19] potential, we don't know yet. [35:21] Okay? We'll give it the potential. [35:23] Because the more transformations of the [35:24] input data you make, the more [35:26] opportunity you have to do interesting [35:27] things with it. [35:29] If I don't even give you the opportunity [35:30] to transform it once, you don't have any [35:31] opportunity, right? [35:32] If I give you 10 chances to transform [35:34] things, you have 10 shots at doing [35:36] something useful. [35:38] So, you can do this repeatedly, [35:40] and once we are done doing these [35:42] transformations, we just pipe it through [35:44] to our good old logistic regression [35:46] sigmoid here, and we are done. [35:50] Okay? [35:51] So, this is the basic idea.
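Here's a minimal numpy sketch of that flow, with made-up layer sizes and random, untrained weights (so the output itself is meaningless); the point is only the repeated transform-then-nonlinearity pattern ending in a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

K = 1000                                   # size of the raw input
x = rng.normal(size=K)

# First transformation: K numbers -> 3 numbers, then a non-linearity.
W1, b1 = 0.05 * rng.normal(size=(3, K)), np.zeros(3)
h1 = relu(W1 @ x + b1)

# "This was fun, let's do it again": 3 numbers -> 3 numbers.
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
h2 = relu(W2 @ h1 + b2)

# Finally, pipe the transformed input through good old logistic regression.
w3, b3 = rng.normal(size=3), 0.0
y = sigmoid(w3 @ h2 + b3)
print(h1.shape, h2.shape, float(y))
```

Each `W @ h + b` is one of the little plus-circles (a linear function), and each `relu` is the non-linear function we'll name in a moment.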
[35:53] And so, just to contrast it, this was [35:55] good old logistic regression, where we [35:57] take the input, [35:59] run it through a linear function, and out pops [36:02] a probability number. But, after we do [36:04] all this stuff, the input stays the [36:06] same, the output stays the same, but in [36:08] the middle you just run through a whole [36:09] bunch of these functions, you know, [36:11] these layers, boop boop boop boop, and [36:12] then we get the output. [36:14] Okay? [36:15] That's all we have done. [36:16] And this is a neural network. [36:19] A neural network is nothing more than [36:21] repeatedly transformed inputs which are [36:25] finally fed to a linear or logistic [36:27] regression model. [36:35] Any questions? [36:37] I have two questions. Could you use the [36:38] thing so that everyone can hear? Yeah. [36:41] I have two questions. Firstly, when [36:43] we say that there's a lack of [36:45] explainability, is it that we don't know [36:48] which arrow it went through? That's one. [36:51] Second, [36:53] who's controlling the number of [36:54] iterations or the number of functions? [36:57] Is that up to us, or how does that work? [36:59] Right. So, yeah, the first [37:01] question, um, explainability: we actually [37:03] know exactly, for any given input [37:06] data point, how [37:09] it flows through the network. So, there [37:10] is no problem there. [37:12] The problem is in ascribing, "Okay, [37:15] we think this person is going to [37:17] repay the loan because [37:20] of this particular attribute." We don't [37:21] know that, because those attributes all [37:24] get enmeshed together and go through [37:25] this complicated thing. So, we know [37:27] exactly what happens. We just can't give [37:29] credit to any one thing very easily.
[37:31] Again, I'm just standing on the [37:33] brink of this vast ocean of something [37:35] called explainability and [37:36] interpretability, uh which I'll get to a [37:38] bit later on in the semester. But, [37:39] that's sort of the quick, [37:42] kind of right-ish, kind of wrong answer. [37:44] Okay? Number two, um [37:47] we decide the number of layers. We [37:49] decide a whole bunch of things, and as [37:51] we'll see in a few minutes, uh there is [37:52] something that's given to us and [37:53] something we get to design, and I'll make [37:55] it very clear which is which. [37:59] Yeah. [38:02] Did I say your name right? Yeah. [38:04] So, which functions have to be linear, [38:06] and also, like, why does it have to be [38:08] linear? Yeah. So, these functions, uh the [38:11] f of x here, they have to be non-linear. [38:15] As to why they have to be non-linear, [38:16] we'll get to that in a few minutes. [38:19] Okay. So, these are called neurons. [38:22] Okay? [38:23] These things where there's [38:25] a linear function followed by a [38:27] little non-linear function, [38:29] right? Each one of these [38:31] things is called a neuron. [38:32] Um [38:34] By the way, you know, this is loosely [38:36] inspired by the way [38:39] neurons work in mammalian [38:41] brains. [38:42] But, the connections between [38:45] neuroscience and deep learning [38:47] are very heavily argued. [38:50] So, I'm going to stay away from it. [38:52] Okay? Uh suffice it to say that [38:55] for building practical deep [38:57] learning systems in industry, you don't [38:59] worry about this. Okay? [39:01] All right, let's move on. [39:04] Terminology. Uh this vertical stack of [39:06] linear functions, or neurons, [39:09] right? This vertical stack is called a [39:10] layer. [39:12] Right? This is a layer, that's a layer.
[39:14] Uh and these little non-linear [39:15] functions, which we haven't gotten to [39:17] yet, are called activation functions. [39:20] Uh and we'll get to why they are called [39:22] that in just a second. [39:25] And [39:26] the input [39:29] is called an input layer, and I have the [39:31] word layer in double quotes because [39:34] it's not really doing anything, right? [39:35] It's just the input. [39:36] But we call it an input layer. [39:39] And the very final thing, the one that [39:41] produces outputs, is called the output [39:42] layer, right? Obviously. And everything [39:45] in the middle is called a hidden layer. [39:48] Okay? [39:50] So, the final piece of terminology is [39:52] that when you have a layer like this, in [39:54] which say three numbers are coming out, [39:56] and there's another layer, [39:58] right? If every neuron in this layer is [40:00] connected to every neuron in that layer, [40:03] it's called a fully connected or dense [40:05] layer. So, for instance, here, [40:07] this arrow: [40:08] whatever number is coming [40:10] out let's say the number three is [40:11] coming out of this thing here. That [40:12] number three flows on this arrow to [40:15] this thing, flows on this arrow to this [40:17] neuron, and flows on this third arrow to [40:19] this neuron. That's what I mean. So, [40:21] every neuron's output is being sent [40:23] to every neuron in the following layer. [40:25] Okay? That's what we call fully connected [40:27] or dense. [40:29] And then, [40:30] if you look at logistic regression, [40:32] right? This is logistic regression. You [40:34] can see basically logistic regression is [40:36] a neural network with no hidden layers. [40:41] So, in some sense, logistic regression [40:42] is almost the simplest possible [40:43] network you can think of. [40:45] Like barely a neural network. [40:48] Right? It's got no hidden layers.
[40:50] That's what makes it logistic [40:51] regression. [40:52] And so, as you might have guessed by [40:54] now, deep learning is just neural [40:56] networks with lots and lots [40:58] of what? [41:00] Yes, layers. [41:02] So, here are a few. [41:04] Uh and by the way, these are not even [41:07] considered all that, you know, [41:08] impressive these days. [41:10] Okay? Uh but I put them up because this [41:13] thing here is called ResNet. [41:16] And it's famous because the ResNet [41:18] neural network was, I think, the first [41:20] network [41:21] to surpass human-level performance in [41:24] image classification. [41:26] It's sort of like the Skynet [41:28] of image classification. Okay? It [41:31] surpassed human-level performance. And [41:32] I'm putting it up here because we'll [41:34] actually work with ResNet next [41:36] Wednesday. We'll actually take [41:37] ResNet, we'll fine-tune it, and solve a [41:39] real problem in class. [41:41] All right. So, it's got lots and lots of [41:43] layers. Uh now, let's turn to these [41:46] activation functions. We've been [41:47] ignoring these little guys so [41:48] far, right? [41:49] So, the activation function at a node is, [41:52] first of all, a function that [41:54] receives a single number and outputs a [41:56] single number, right? It's not very [41:58] complicated. Basically, [42:00] this here is a linear function [42:03] which receives all these inputs. It [42:04] could be 10 inputs, 1,000 inputs, [42:06] runs them through a linear function, [42:07] outputs a number, and that single [42:09] number, a scalar, goes in here, and it [42:12] comes out as another single number. [42:14] Just remember that. [42:16] And so, these are some of the most [42:18] common activation functions.
In fact, [42:19] the sigmoid we saw, which we [42:21] actually use for the output, is a kind [42:23] of activation function, where a single [42:25] number comes in and it gets mapped onto [42:28] this curve because of this thing. So, [42:30] the single number that comes in is A, [42:31] and it gets transformed as 1/(1 [42:33] + e^(−A)), and you get a shape like this, [42:37] and it's called the sigmoid activation [42:38] function. And as you can see [42:40] here, [42:41] for very negative [42:44] values, [42:45] it's going to be pretty close to zero, [42:47] meaning it won't get activated. [42:50] And for very large values, it's [42:52] going to be [42:53] pretty close to one. [42:55] All the action happens in the middle. [42:57] When your values are [42:59] somewhere in this range, there's a [43:00] dramatic increase in what comes out. [43:03] Okay? So, that little thing in the [43:05] middle is a sweet spot for these [43:06] functions. [43:07] Uh [43:08] and this one, [43:10] you know, I'm almost embarrassed [43:11] to call it an activation function, [43:12] because it's literally not doing [43:13] anything. It's sort of getting a nice [43:15] label for free. [43:16] Um right? You just [43:18] get a number and pass it straight [43:19] along. [43:20] It's the linear activation function, but [43:22] just for completeness, I want to put it [43:23] here. [43:25] And then we come to the hero of deep [43:28] learning, which is the rectified linear [43:30] unit, [43:32] right? Rectified linear unit. It's [43:34] called ReLU. Uh and ReLU is going to [43:37] become part of your vocabulary very [43:38] quickly. Uh and so, ReLU is actually a [43:41] very interesting function. You write [43:43] it as max(whatever number, [43:44] 0), [43:46] which is another way of saying if the [43:48] number is positive, just send it along [43:50] unchanged.
If the number is negative, [43:53] send a zero instead. Squish it to zero. [43:56] So, which means if the number is [43:57] negative, nothing happens. If the number [43:59] is positive, it wakes up. [44:03] So, what happens is that you could have [44:04] a very complicated linear function with [44:07] millions of variables, and then it outputs [44:09] a single number, and that number [44:10] unfortunately happens to be negative. [44:12] The ReLU is not impressed. It's going to [44:13] send a zero out. [44:15] Okay? It's a very simple function. [44:17] And many folks who've been in deep [44:20] learning for a long time believe [44:22] that [44:23] the use of ReLUs is one of the key [44:25] factors [44:26] that led to the amazing success of deep [44:28] learning, because it's got some very [44:30] interesting properties, [44:32] uh which we'll get to hopefully on [44:33] Wednesday. [44:35] Okay. So, the shorthand here is that um [44:40] whenever you see this thing, it's just a [44:42] linear function [44:43] followed by sending it straight [44:44] out. If I put a [44:47] ReLU in here, I'm going to denote it [44:49] like that, which mimics how the graph [44:51] looks. And if I [44:53] put a sigmoid, I'm just going to use [44:54] this thing here. [44:55] Okay? [44:56] Just a visual shorthand. [45:00] There are many other [45:02] activation functions, by the way. [45:03] There's something called the tanh [45:05] function, the leaky ReLU, the GELU, the [45:07] Swish. I mean, it's like a menagerie of [45:10] activation functions, because very often [45:12] researchers will be like, "Well, I don't [45:14] like this activation function. Here's a [45:15] little modified version of the function [45:17] which is going to be better for certain [45:18] things." So, you know, people's research [45:20] creativity on this point has sort of [45:22] gone unhinged.
Um so, there's lots of [45:24] options. But if you just stick to the [45:26] ReLU [45:27] for your hidden layers, you can [45:29] basically get anything done practically, [45:31] right? You don't have to worry about [45:32] anything else. So, we'll only focus on [45:34] ReLUs for all the intermediate stuff. Uh [45:37] yeah. [45:38] Yeah, how do you gauge which activation [45:40] function is more suited for your use [45:41] case? [45:42] Yeah. So, the rule of thumb here is that [45:45] for your hidden layers, use ReLUs, [45:48] right? Because empirically we have seen [45:49] that they do an amazing job. [45:51] For your output layer, your very final [45:54] thing, you actually don't have a choice, [45:56] because what you have to use depends on [45:57] what kind of output you have to work [45:59] with. If it's an output which is a [46:01] probability number between zero and one, [46:02] you have to use a sigmoid. [46:04] Um if it is, [46:05] say, 10 numbers, all of which have to be [46:07] probabilities, and they have to add up [46:08] to one, [46:10] you've got to use something called the [46:10] softmax, which we'll get to on [46:12] Wednesday. So, it really depends on the [46:13] output, and the nature of the output [46:15] dictates what you use in the output [46:16] layer. [46:18] Okay. [46:19] So, coming back to this. So, if you want [46:22] to design a deep neural network, [46:24] uh the input is the input, [46:27] the output is the output, and you [46:29] get to choose everything else. You get [46:30] to choose the number of hidden layers, [46:32] the number of neurons in each layer, the [46:35] activation functions you're going to use [46:37] for the hidden layers, and then [46:39] you have to make sure that what you [46:41] choose for the output layer matches the [46:42] kind of output you want to generate. [46:44] Okay? So, this is [46:46] all in your hands. You decide what [46:48] happens.
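Those rules of thumb amount to just a few lines of numpy. The softmax definition here is the standard one (the lecture defers its details to Wednesday), and subtracting the max before exponentiating is a common numerical-stability trick, not something stated in the lecture:

```python
import numpy as np

def relu(z):
    # Default choice for hidden layers: max(z, 0), element-wise.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Output layer when you need a single probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Output layer when you need K probabilities that sum to 1.
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])))  # three probabilities summing to 1
```

Note how ReLU kills the negative input and passes the positive one through unchanged, exactly as described above.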
But [46:51] there's a lot of guidance [46:52] for how to do these things, which we'll [46:53] cover as we go along. [46:56] Did you have a question? [46:57] Kind of, but I guess I'll do it. [47:00] Is there also exploration in kind of [47:03] dynamically [47:05] setting up layers, so that your users [47:07] determine the number of layers? [47:12] Yeah. So, there's a whole field called [47:14] neural architecture search, NAS, [47:16] where we can actually try a whole bunch [47:18] of different architectures, [47:20] uh and then use some optimization, and in [47:22] fact reinforcement learning, which we [47:23] won't get to in this class, [47:25] as a way to figure out really good [47:27] architectures for any particular [47:28] problem. Uh but [47:32] the question of, okay, [47:33] when I'm training a model with a [47:34] particular kind of data, [47:36] the first pass through the training [47:37] data, I'm going to use two layers, the [47:39] second pass, I'm going to do seven [47:40] layers that is not done. [47:42] Uh and the reason it's not done is [47:44] because of certain other constraints we [47:45] have in how we can do the [47:47] optimization and the gradient descent [47:48] and stuff like that. But what you can [47:50] do, and we'll look at this thing [47:52] called dropout, is: [47:54] for certain layers, [47:56] each time you run it through the [47:58] network, you can decide, in this layer, [48:00] I'm not going to use all the nodes, I'm [48:02] going to drop out a few of the nodes [48:03] randomly. And it's a very effective [48:05] technique to prevent overfitting, and [48:07] we'll come to that a little later on. [48:09] Uh yeah. [48:11] So, one question regarding [48:13] neural networks is about the [48:15] coefficients. Is this something we [48:16] decide, [48:17] or do we [48:19] have to use a defined coefficient for [48:21] the weights?
No, the whole trick here, [48:23] the whole name of the game, is we use the [48:25] data, the training data, and something [48:29] called a loss function, which I'll get [48:30] to on Wednesday, [48:31] along with an optimization algorithm, so [48:33] that the network figures out by itself [48:36] what the weights need to be, what the [48:37] coefficients need to be, so as to [48:39] minimize prediction error. [48:42] And that's the whole thing. The magic [48:43] here is that we don't have to do [48:45] anything. We only have to set it up, sit [48:47] back, often for many hours, and watch it [48:49] do its thing. [48:51] Yeah. [48:52] Just one quick question. Um you [48:54] mentioned nodes just now when you were [48:56] answering Roland's question. Can you [48:58] just confirm exactly what a node is? I [49:00] have an idea that it's basically any [49:02] circle, but [49:03] >> Yeah, yeah. [49:04] Sure. When I'm [49:06] referring to a node, I'm literally [49:07] referring to something like this, which, [49:09] think of it as a linear function [49:12] followed by a non-linear activation. [49:14] So, it reads a bunch of inputs, runs [49:16] them through a linear function, and passes [49:18] the result through a ReLU or a sigmoid or [49:19] something, and out pops a number. [49:22] So, in general, a node will have [49:24] many numbers potentially coming in, but [49:26] only one number going out. [49:28] Uh now, that one number may get copied [49:30] to every node in the next layer, [49:32] but what comes out of that particular [49:33] node is just a single number. [49:36] All right. So, [49:38] let's use a DNN for our interview [49:41] example. So, in this problem, we had two [49:44] inputs, right? GPA and experience. The [49:46] output variable has to be between zero [49:48] and one, because you're trying to predict [49:48] the probability that someone will get [49:50] called for an interview.
So, the output [49:52] sorry, the input size is fixed, and the [49:55] output size is fixed. Uh [49:57] and since it's [49:59] the very first network we're actually [50:00] playing with, uh [50:02] let's just start simple, right? We'll [50:04] just have one hidden layer, and we'll [50:06] have three neurons, right? And as I [50:09] mentioned in answer to Tommaso's question from [50:11] before, if you are choosing activation [50:13] functions in the hidden layers, just go [50:15] with the ReLU as a default. It usually [50:17] works really well out of the box. So, [50:19] we'll just use a ReLU, and since the [50:21] output has to be between zero and one, [50:23] we don't have a choice. We have to use a [50:25] sigmoid for the output layer. [50:27] Okay? That's it. So, those [50:29] are the design choices, and when we do [50:31] that, this is how it looks, [50:32] right? We have two inputs, X1 and X2, GPA [50:34] and experience, and then it goes through [50:36] these three [50:38] ReLUs, and then out come these three [50:40] numbers, and they pass through a sigmoid, [50:42] and we get a probability Y at the end. [50:44] All right, quick question. Concept [50:46] check. [50:47] How many weights [50:49] how many parameters, both weights and [50:51] biases, does this network have? [50:53] Let's take a moment to count. [51:11] All right, any guesses? [51:15] Yeah. [51:16] 12. [51:18] I think you're almost there. [51:22] Um [51:23] are folks going to be doing a binary [51:25] search on this now? Okay. [51:29] Uh no. [51:31] Yes? 13. Yes, very good. [51:34] So, that's 13, [51:35] and my guess is that the reason you came [51:37] up with 12 and I made the same mistake, [51:39] that's why I know is you probably [51:41] forgot this green thing here. [51:45] Um so, what folks often forget is [51:48] the bias. [51:49] Right? We all count the weights, right? [51:50] Okay.
And the easy way to do it is, okay, [51:52] two things here, [51:54] three things here, so two times three [51:56] is six, [51:57] three times one is three, that's nine, [51:59] and then you have to add up all the [52:00] intercepts. [52:02] Right? So, you get 13. [52:04] And so, when we get to very complicated [52:05] networks, the first two or three [52:08] times you work with a very complex [52:09] network, [52:10] and we'll do it, you know, starting very [52:11] soon, just get into the habit of hand [52:14] calculating the number of parameters, [52:16] just to make sure you understand what's [52:17] going on. Once you get it right a couple [52:18] of times, you don't have to do [52:20] it anymore. Okay? The first couple of [52:21] times, hand calculate to make sure you [52:23] get it. [52:23] Okay. So, let's say that we [52:26] have trained this network using, you [52:28] know, techniques which we'll cover [52:30] on Wednesday, and it comes back to [52:32] you after training and says, "Okay, [52:34] these are the best values [52:36] for the weights and the biases that I [52:38] have found." So, now your network is [52:40] ready for action. [52:42] It's ready to be used. [52:43] And so, what you can do is, let's say [52:45] that you want to predict with this [52:47] network: [52:48] you know, [52:49] if you have X1 and X2, what comes out of [52:52] this top [52:54] neuron, right? Let's call it A1. It's [52:56] basically this. [52:58] Okay? That's what's coming out of this [53:00] thing. For any X1 and X2, this is what's [53:02] coming out. Similarly for A2 and A3. [53:05] Okay? [53:06] And then what comes out at the very end [53:08] is [53:09] basically A1 times that plus A2 times [53:11] that plus A3 times that plus 0.05, and [53:14] the whole thing gets run through the [53:15] sigmoid, and this is what you get. [53:18] Okay?
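That hand-counting recipe, the weights layer by layer plus all the intercepts, can be sketched in a few lines of Python; `count_params` is a hypothetical helper name:

```python
def count_params(layer_sizes):
    """Parameters of a fully connected network: for each pair of
    adjacent layers, (inputs x outputs) weights plus one bias
    (the "intercept") per output node."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

# 2 inputs -> 3 hidden ReLU nodes -> 1 sigmoid output:
# weights: 2*3 + 3*1 = 9, intercepts: 3 + 1 = 4
print(count_params([2, 3, 1]))  # 13
```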
So, this slide and the one before, [53:20] just make sure you look at them afterwards, [53:22] to make sure you totally understand [53:23] the mechanics, because [53:26] this is really important. [53:27] If you don't fully [53:28] internalize the mechanics, when we get [53:30] to things like transformers, it's going [53:31] to get hard. Okay? So, just make sure [53:33] it's like automatic at this point. It [53:35] should be reflexive. [53:37] Um, [53:38] okay. So, yeah. And so, when you [53:40] want to predict anything, you just run [53:41] some numbers through it, you get all [53:42] these things, [53:44] and boom, you calculate it. It turns out [53:45] to be 22.6%. That's the answer. [53:48] All right. So, [53:50] let's say that [53:51] you built this network, [53:53] and now we are like, "Hey, [53:55] given any X1 and X2, I can come up with [53:57] a Y." [53:58] But I'm feeling a little mathy. Can we [54:00] actually write down the function? Yeah, [54:02] you can write down the function. This is [54:03] what it looks like. [54:07] Super interpretable, right? [54:10] So, this goes to the comment that, Itai, [54:12] you made earlier on, where the act of [54:16] depicting something using this sort of [54:18] graphical layout makes it so much easier [54:21] to reason with [54:22] and to think about, compared to trying to [54:24] figure out what this function is doing. [54:26] Right? The other point I want to make is [54:28] this: [54:30] just contrast what we just saw with the [54:32] logistic regression thing we saw [54:33] earlier, which was this little function, [54:35] and so, here, [54:38] even this simple network, with just [54:40] three nodes in [54:42] that single hidden layer, [54:44] right? It's so much more complicated [54:46] than the logistic regression model. So [54:48] much more complicated, right?
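The full prediction just walked through, the hidden activations A1, A2, A3 followed by the sigmoid at the output, can be sketched like this. The weight values below are made up for illustration (they are not the trained values from the slide); only the 0.05 output intercept is taken from the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, W1, b1, W2, b2):
    """Forward pass of the 2 -> 3 (ReLU) -> 1 (sigmoid) network.
    W1 is 3x2, b1 has 3 entries; W2 is 1x3, b2 has 1 entry
    (6 + 3 + 3 + 1 = 13 parameters)."""
    a = relu(W1 @ x + b1)            # A1, A2, A3
    return sigmoid(W2 @ a + b2)[0]   # a probability between 0 and 1

# Made-up "trained" weights, for illustration only
W1 = np.array([[0.8, -0.3],
               [0.1,  0.5],
               [-0.4, 0.9]])
b1 = np.array([0.1, -0.2, 0.3])
W2 = np.array([[1.2, -0.7, 0.6]])
b2 = np.array([0.05])                # the 0.05 intercept from the slide

p = predict(np.array([3.5, 2.0]), W1, b1, W2, b2)
print(p)  # a probability, strictly between 0 and 1
```

Writing it out this way also makes the "super interpretable" joke concrete: fully expanded, Y is a sigmoid of a weighted sum of three ReLUs of weighted sums, which is far harder to read than the diagram.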
[54:50] And it is from this complexity that [54:52] springs the ability of these networks to [54:55] do basically magical things. [54:56] Right? That's where the complexity comes [54:58] from. That's where the magic comes from. [55:00] And here, in this case, the number of [55:02] variables hasn't even changed. It's [55:03] still only two. [55:05] But we can go from the two inputs to the [55:07] one output in very complicated ways, as [55:10] long as we know how to train these [55:11] networks the right way. That's sort of [55:13] the secret sauce, which we'll spend a lot [55:15] of time on. [55:16] So, yeah. To summarize, this is what we [55:19] have. It's a deep neural network. [55:20] By the way, this kind of network, where [55:22] things just flow from left to right, is [55:23] called a feedforward [55:25] neural network, [55:27] in contrast to some other kinds of [55:28] networks called recurrent networks, which [55:30] we won't get to [55:31] in this class, because [55:34] transformers have actually proven to be [55:36] much more capable than recurrent [55:38] networks and have become the norm, [55:40] so we'll just focus on those instead. [55:42] And so, this arrangement of neurons into [55:44] layers and activation functions and all [55:46] that stuff, this is called the architecture [55:48] of the neural network. And as you will [55:50] see later on, the transformer, the [55:51] famous transformer network, [55:53] [clears throat] is just an example of a [55:54] particular neural network architecture, [55:57] much like convolutional neural networks, [55:59] which we'll get to next week for computer [56:01] vision, are another example of a [56:03] particular network architecture. [56:05] So, we will focus on transformers. They [56:07] are a particular kind of architecture. [56:08] All right. So, in summary, this is what [56:10] we have.
[56:11] You know, you get to choose the hidden [56:13] layers, the neurons, activation [56:14] functions, stuff like that. [56:15] The inputs and outputs are what you have [56:17] to work with, and so, we will actually [56:19] take this idea and use it [56:22] to [56:23] solve a problem from start [56:25] to finish on Wednesday. So, I think I'm [56:28] done. I'll give you three minutes of [56:29] your day back. Thank you. [56:32] >> [applause]