Kylie Ying has worked at many interesting places such as MIT, CERN, and freeCodeCamp. She's a physicist, engineer, and basically a genius. And now she's going to teach you about machine learning in a way that is accessible to absolute beginners.

What's up, you guys? Welcome to Machine Learning for Everyone. If you're interested in machine learning and you consider yourself part of "everyone," then this video is for you. We'll talk about supervised and unsupervised learning models, go through a little bit of the logic and math behind them, and then see how we can program them in Google Colab. If I've gotten something wrong and you're somebody with more experience than me, please feel free to correct me in the comments so we can all learn from this together as a community. So with that, let's just dive right in.

Without wasting any time, let's go straight into the code, and I'll teach you the concepts as we go. This here is the UCI Machine Learning Repository, which has a ton of datasets we can access, and I found a really cool one called the MAGIC gamma telescope dataset. To summarize what I think is going on in this dataset: there's a gamma telescope, and high-energy particles hit it. A camera, a detector, records the patterns of how the light hits it, and we can use properties of those patterns to predict what type of particle caused that radiation: a gamma particle, or something else, like a hadron. Down here are all the attributes of those patterns that the camera collects; you can see there's some length, width, size, asymmetry, and so on. We're going to use all these properties to help us discriminate whether the patterns came from a gamma particle or a hadron.

To do this, we come up here, go to the Data Folder, click the magic04.data file, and download it. Now over here I have a Colab notebook open: you go to colab.research.google.com and start a new notebook. I'm going to call this the MAGIC dataset; actually, I'll call it "FreeCodeCamp magic example." Okay. So with that, I'm going to start with some imports: I always import NumPy, I always import pandas, and I always import Matplotlib.
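As a minimal sketch of that first cell (the np/pd/plt aliases are the usual conventions; the video doesn't spell them out explicitly):

```python
# Standard imports used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```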
And then we'll import other things as we go. To run a cell, you can either click the play button here or, on my computer, just hit Shift+Enter. And here I'm just letting you guys know where I found the dataset; I've copied and pasted the link.

To import the downloaded file, we go over here to the folder icon, and I am literally just going to drag and drop that file into here. Okay. Now, to take a look at what this file consists of, whether we have the labels or not, we could open it on our computer, but we can also just call pandas read_csv, pass in the name of the file, and see what it returns. It doesn't seem like we have the labels, so let's go back to the dataset page. I'm just going to make the column labels all of these attribute names over here; I'll take those values and make them the column names.

All right, how do I do that? Basically, I'll come back here and create a list called cols, and type in all of those names: fLength, fWidth, fSize, fConc, fConc1, fAsym, fM3Long, fM3Trans, fAlpha, fDist, and class.

Okay, great. Now, to attach those as the columns of our data frame: this read_csv command just reads some CSV file that you pass in (CSV stands for comma-separated values) and turns it into a pandas DataFrame object. If I also pass in names, it assigns these labels to the columns of the dataset. I'm going to set this data frame equal to df, and if we call head, which just means "give me the first five entries," you'll see that we now have labels for all of the columns.

All right, great. One thing you might notice is that the class labels over here are g and h. If I go down here and call unique on the class column, you'll see that I have either g's or h's, and these stand for gammas or hadrons. Our computer is not so good at understanding letters; it's really good at understanding numbers. So we're going to convert this column to numbers: I'll set it equal to whether or not the class equals "g", and then I'm just going to say astype(int).
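Collected into code, those cells might look roughly like this (a sketch: the file name assumes the UCI download is called magic04.data, and it assumes the class letters in the raw file are lowercase "g"/"h"):

```python
# Column names taken from the attribute list on the UCI dataset page
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# The raw file has no header row, so attach the column names ourselves
df = pd.read_csv("magic04.data", names=cols)
df.head()  # first five rows, now with labeled columns

df["class"].unique()  # expect two letters, e.g. 'g' and 'h'
df["class"] = (df["class"] == "g").astype(int)  # g -> 1, h -> 0
```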
What this should do is convert the entire column: if the value equals g, the comparison is true, so that becomes one; if it's h, it's false, so that becomes zero. I'm just converting g and h to one and zero; it doesn't really matter which way around, whether g is one and h is zero or vice versa.

Let me take a step back now and talk about this dataset. Here I have some data frame, and I have all of these different values for each entry. Each of these rows is one sample: it's one example, one item in our dataset, one data point; all of those terms mean roughly the same thing when I say "this is one example" or "this is one sample." Each sample has one value for each of these labels up here, and then it has the class. What we're going to do in this specific example is try to predict, for future samples, whether the class is g for gamma or h for hadron, and that is something known as classification. All of these columns up here are known as our features, and features are just things that we pass into our model in order to help it predict the label, which in this case is the class column. So for sample zero, I have 10 different features: 10 different values that I can pass into some model and have it spit out the class, the label. And I know the true label here is g, so this is actually supervised learning.

All right, before I move on, let me give you a quick crash course on what I just said. This is machine learning for everyone, so the first question is: what is machine learning? Machine learning is a subdomain of computer science that focuses on algorithms which help a computer learn from data, without a programmer being there telling the computer exactly what to do; that's what we call explicit programming.

You might have heard of AI and ML and data science; what is the difference between all of these? AI is artificial intelligence, an area of computer science where the goal is to enable computers and machines to perform human-like tasks and simulate human behavior. Machine learning is a subset of AI that tries to solve one specific problem and make predictions using data. And data science is a field that attempts to find patterns and draw insights from data, which might mean using machine learning. So all of these fields overlap, and all of them might use machine learning. Now, there are a few types of machine learning.
The first one is supervised learning. In supervised learning, we use labeled inputs, meaning every input we get has a corresponding output label, in order to train models and learn the outputs for different new inputs we might feed the model. For example, I might have these pictures. To a computer, all these pictures are just pixels, pixels with a certain color. In supervised learning, all of these inputs have a label associated with them; that's the output we might want the computer to be able to predict. For example, this picture is a cat, this picture is a dog, and this picture is a lizard.

There's also unsupervised learning. In unsupervised learning, we use unlabeled data to learn about patterns in the data. Here are my input data points; again, they're just images, just pixels. Let's say I have a bunch of these different pictures, and I feed them all to my computer. My computer is not going to be able to say "this is a cat, a dog, and a lizard" in terms of an output, but it might be able to cluster all these pictures: it might say, hey, all of these have something in common, all of those have something in common, and these down here have something in common. That's finding some sort of structure in our unlabeled data.

And finally, we have reinforcement learning. In reinforcement learning, there's usually an agent that is learning in some sort of interactive environment, based on rewards and penalties. Let's think of a dog: we can train our dog, but there's not necessarily any wrong or right output at any given moment. Now let's pretend that dog is a computer. Essentially, what we're doing is giving rewards to our computer and telling it, hey, this is probably something good that you want to keep doing; it's the same idea, just with "computer" and "agent" as the terminology.

But in this class today, we'll be focusing on supervised learning and unsupervised learning, and learning different models for each of those. All right, so let's talk about supervised learning first. This is roughly what a machine learning model looks like: you have a bunch of inputs going into some model, and the model spits out an output, which is our prediction. All these inputs together are what we call the feature vector. Now, there are different types of features that we can have; we might have qualitative features.
Qualitative means categorical data: there's either a finite number of categories or groups. One example of a qualitative feature might be gender, and in this case there are only two here; it's for the sake of the example, and I know this might be a little bit outdated. Here we have a girl and a boy: two genders, two different categories. That's a piece of qualitative data. Another example might be a bunch of different nationalities: a nationality, a nation, or a location can also be an example of categorical data.

Now, in both of these, there's no inherent order. It's not like we can rate the US as one, France as two, Japan as three, and so on; there's no inherent order built into either of these categorical datasets. That's why we call this nominal data. For nominal data, the way we want to feed it into our computer is using something called one-hot encoding. Let's say that in my dataset, some of the inputs are from the US, some from India, then Canada, then France. How do we get our computer to recognize that? We have to do something called one-hot encoding, which basically says: if the value matches a category, make that position a one, and if it doesn't, make it a zero. So for example, if your input were from the US, you might have [1, 0, 0, 0]; India would be [0, 1, 0, 0]; for Canada, the position representing Canada is one; and for France, the position representing France is one, and you can see that the rest are zeros. That's one-hot encoding.
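The video doesn't write a cell for this, but as a quick illustration, pandas can one-hot encode a nominal column like this (the column name and values here are just the example from the slide):

```python
import pandas as pd

# A toy nominal feature: country of origin
data = pd.DataFrame({"country": ["US", "India", "Canada", "France"]})

# Each category becomes its own 0/1 column; the matching column is 1, the rest are 0
print(pd.get_dummies(data, columns=["country"]))
```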
Now, there's also a different type of qualitative feature. Here on the left there are different age groups: babies, toddlers, teenagers, young adults, adults, and so on. And on the right-hand side we might have different ratings: bad, not so good, mediocre, good, and great. These are known as ordinal pieces of data, because they have some sort of inherent order. Being a toddler is a lot closer to being a baby than to being an elderly person, and "good" is closer to "great" than it is to "really bad." So these have an inherent ordering system, and for these types of data we can actually just mark them from one to five: for each of these, we give it a number. And this makes sense, because just as "good" is closer to "great" than it is to "not good at all," four is closer to five than four is to one. So this works out, and it will make sense for the computer as well.

All right, there are also quantitative pieces of data, which are numerical-valued. These could be discrete, which means integers, or continuous, which means all real numbers. For example, the length of something is a quantitative feature, the temperature of something is a quantitative feature, and the number of Easter eggs I collected in my basket during an Easter egg hunt is an example of a discrete quantitative feature. So length and temperature are continuous, and the egg count over here is discrete. Those are the things that go into our feature vector; those are the features we feed the model, because our computers are really, really good at understanding numbers, and not so good at understanding the things humans understand natively.

Well, what are the types of predictions that our model can output? In supervised learning, there are a few different tasks. One is classification, which basically means: predict discrete classes. That might mean this is a hot dog, this is pizza, and this is ice cream; there are three distinct classes, and any other picture of a hot dog, pizza, or ice cream can be put under one of those labels. This is something known as multiclass classification. But there's also binary classification, where you might have hot dog or not hot dog: there are only two categories you're working with, something is or isn't a thing; that's binary classification.

Other examples: whether something has positive or negative sentiment is binary classification; predicting whether pictures are of cats or dogs is binary classification; writing an email filter and trying to figure out whether an email is spam or not spam is also binary classification. For multiclass classification, you might have cat, dog, lizard, dolphin, shark, rabbit, and so on; different types of fruit like orange, apple, pear; or different plant species. Multiclass classification just means more than two classes, and binary means we're predicting between two things.
There's also something called regression when we talk about supervised learning, and this just means we're trying to predict continuous values. Instead of trying to predict different categories, we're trying to come up with a number that's on some sort of scale. Some examples might be the price of Ethereum tomorrow, what the temperature is going to be, or what the price of this house is. These things don't really fit into discrete classes; we're trying to predict a number that's as close to the true value as possible, using different features of our dataset. So that's what our model looks like in supervised learning.

Now let's talk about the model itself. How do we make this model learn, and how can we tell whether or not it's even learning? Before we talk about the models, let's talk about how we can actually evaluate them: how can we tell whether something is a good model or a bad model? Let's take a look at this dataset. This is from the Pima Indians diabetes dataset, and here we have the number of pregnancies, glucose level, blood pressure, skin thickness, insulin, BMI, age, and then the outcome, whether or not they have diabetes: one if they do, zero if they don't. All of these are quantitative features, because they're all on some scale.

Each row is a different sample in the data; it's a different example, it's one person's data, and each row represents one person in this dataset. Each column represents a different feature: this one here is some measure of blood pressure levels, and this one over here, as we mentioned, is the output label. And as I mentioned, this row is what we would call a feature vector, because it holds all of our features for one sample, and this value is what's known as the target, or the output, for that feature vector; that's what we're trying to predict. All of the feature vectors together form our features matrix X, and over here, this column is our labels or targets vector y.

I've condensed this to a chocolate bar to talk about some of the other concepts in machine learning. Over here we have X, our features matrix, and over here is our label vector y. Each row of this will be fed into our model, and our model will make some sort of prediction.
What we do is compare that prediction to the actual value of y that we have in our labeled dataset; that's the whole point of supervised learning. We can compare what our model outputs to what the truth actually is, and then go back and adjust some things, so that on the next iteration we get closer to the true value. That whole process, the tinkering of "what's the difference, where did we go wrong," is what's known as training the model.

All right, so take this whole chunk right here: do we really want to put our entire chocolate bar into the model to train it? Not really, because if we did, how would we know that our model can do well on new data it hasn't seen? If I were to create a model to predict whether someone has diabetes, and I just trained on all my data and saw that it does well on that training data, and then I went to some hospital and said, "here's my model, I think you can use this to predict if somebody has diabetes," would that be effective? Probably not, because we haven't assessed how well our model can generalize. It might do well after it has seen this data over and over and over again, but what about new data? Can our model handle new data?

How do we get our model to assess that? We break up our whole dataset into three different sets: the training dataset, the validation dataset, and the testing dataset. You might have 60%, 20%, and 20%, or 80, 10, and 10; it really depends on how much data you have, and I think either of those would be acceptable. What we do is feed the training dataset into our model, which produces a vector of predictions corresponding to each sample we put in. Then we figure out the difference between our predictions and the true values; this is something known as loss. Loss is that difference, expressed as some numerical quantity, of course. Then we make adjustments, and that's what we call training.

Once we've made a bunch of adjustments, we can put our validation set through the model. The validation set is used as a sort of reality check during or after training, to ensure that the model can handle unseen data. Every time we finish one training iteration, we might stick the validation set in and see what the loss is there.
And then after our training is over, we can assess the validation set and ask what the loss is there. One key difference here is that we don't have that training step: this loss never gets fed back into the model; that feedback loop is not closed.

All right, so let's talk about loss really quickly. Here I have four different models; some data is being fed into each model, and there's some output. This first output is pretty far from the truth that we want, so its loss is going to be high. In model B, again, the output is pretty far from what we want, so this loss is also going to be high; let's give it 1.5. Now this one here is pretty close, maybe not exact, but pretty close, so it might have a loss of 0.5. And this last one is maybe further off than that, but still better than the first two, so its loss might be 0.9. So which of these models performs the best? Well, model C has the smallest loss, so it's probably model C.

Okay, now let's take model C. After we've built all these models and seen that model C is probably the best one, we take model C and run our test set through it. This test set is used as a final check to see how generalizable the chosen model is. So if I finish training my diabetes model, I can run it on that held-out chunk of the data and say: this is how it performs on data it has never seen at any point during the training process. And that loss is the final reported performance of my model on the test set.

So let's talk about this thing called loss, because I think I kind of glossed over it. Loss is the difference between your prediction and the actual label. A prediction that's slightly off gives a slightly higher loss, and one that's further off gives an even higher loss. In computer science we like formulas, formulaic ways of describing things, so here are some examples of loss functions and how we can actually come up with numbers. This here is known as L1 loss. L1 loss just takes the real value, the real output label, subtracts the predicted value, and takes the absolute value of that difference. The absolute value is a function that looks something like a V: the further off you are, in either direction, the greater your loss.
So if your real value is off from your predicted value by 10, then your loss for that point would be 10. And the sum here just means we take all the points in our dataset and add up how far off each one is. We also have something called L2 loss. This loss function is quadratic, which means that if the prediction is close, the penalty is very minimal, and if it's off by a lot, the penalty is much, much higher. Instead of the absolute value, we square the difference between the two. There's also something called binary cross-entropy loss, which looks something like this; for binary classification, this might be the loss that we use. I'm not going to go through it in detail; you just need to know that loss decreases as the performance gets better.

There are some other measures of performance as well, for example accuracy. What is accuracy? Let's say these are pictures that I'm feeding my model, and its predictions are apple, orange, orange, apple, but the actual labels are apple, orange, apple, apple. Three of them were correct and one was incorrect, so the accuracy of this model is three quarters, or 75%.
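These aren't cells from the notebook; they're just the loss and accuracy formulas she describes, written out as small NumPy helpers so the shape of each computation is visible:

```python
import numpy as np

def l1_loss(y_real, y_pred):
    # Sum of absolute differences |y_real - y_pred|
    y_real, y_pred = np.asarray(y_real), np.asarray(y_pred)
    return np.sum(np.abs(y_real - y_pred))

def l2_loss(y_real, y_pred):
    # Sum of squared differences: small errors barely count, large errors count a lot
    y_real, y_pred = np.asarray(y_real), np.asarray(y_pred)
    return np.sum((y_real - y_pred) ** 2)

def accuracy(y_real, y_pred):
    # Fraction of predictions that exactly match the true labels
    return np.mean(np.asarray(y_real) == np.asarray(y_pred))
```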
All right, coming back to our Colab notebook, I'm going to close this panel a little bit. Again, we've imported everything up here, and we've already created our data frame right here; this is all of our data, and it's what we're going to use to train our models. Down here, if we take a look at the dataset again, you'll see that our classes are now zeros and ones, so this is all numerical, which is good, because our computer can now understand it. And it would probably be a good idea to plot these features and see whether they have anything to do with the class. So here I'm going to go through all the labels: for each label in the columns of this data frame. Actually, we already have the list; it's called cols, so let's just use that, which might be less confusing: everything up to the last entry, which is the class. I'm going to take these 10 different features and plot each of them as a histogram.

Basically, if I take the data frame and select everything where the class is equal to one, those are all of our gammas, remember. For that portion of the data frame, I then look at the current label. So this part here says: inside the data frame, get me everything where the class equals one, and then just look at that label's column. The first label would be fLength, which is this column. So this command gets me all the values of this specific feature that belong to class one, and that's exactly what I'm going to put into the histogram. Then I tell Matplotlib: make the color blue, label it "gamma," set alpha equal to 0.7, which is just the transparency, and set density equal to true, so that when we compare it to the hadrons we'll have a baseline for comparing them. Setting density to true basically normalizes the distributions: if you had 200 samples of one type and 50 of another and drew the raw histograms, it would be hard to compare them, because one would be a lot bigger than the other; by normalizing them, we spread them out over how many samples there are. Then I put a title on, make the y label "Probability," because it's a density, make the x label the feature name, include a legend, and call plt.show, which just means display the plot.

So if I run that, oops: the loop should go over everything up to the last item, because we want the list of features, not just the last one. And now we can see that we're plotting all of these. Here we have the length. Oh, and I labeled this one "gamma" when it should be "hadron." Okay, so the gammas are in blue and the hadrons are in red. Here we can already see that if the length is smaller, it's probably more likely to be a gamma. These all look somewhat similar, but here, clearly, if this asymmetry measure is larger, it's probably a hadron. Oh, this one's a good one: for fAlpha, the hadrons seem pretty evenly distributed, whereas if fAlpha is smaller, it looks like there are more gammas in that range. So this is the data we're working with; we can kind of see what's going on.
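A sketch of that plotting cell, assuming the df and cols defined earlier (the colors, alpha, and density settings are the ones described above):

```python
# Compare the distribution of each feature for gammas (class 1) vs. hadrons (class 0)
for label in cols[:-1]:  # every feature column, skipping "class"
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
             alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
             alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()
```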
Okay, so the next thing we're going to do is create our train, validation, and test datasets. I'm going to set train, valid, and test equal to this: np.split, because I'm just splitting up the data frame. If I call sample with a fraction of one, so I'm sampling everything, that basically shuffles my data. Then I pass in where exactly I want to split the dataset: the first split is going to be at maybe 60%, so I say 0.6 times the length of the data frame, cast to an integer; that's the first place where I cut it off, and everything before it will be my training data. Then, if the next cutoff is at 0.8, everything between 60% and 80% of the length of the dataset goes towards validation, and everything from 80% to 100% becomes my test data. So I can run that.
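In code, that split might look like this (a sketch; frac=1 shuffles the rows before the 60/20/20 cut):

```python
# Shuffle the rows, then cut at 60% and 80%: 60% train, 20% validation, 20% test
train, valid, test = np.split(df.sample(frac=1),
                              [int(0.6 * len(df)), int(0.8 * len(df))])
```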
Now, if we go up here and inspect this data, we'll see that some columns have values in the hundreds, whereas this one is 0.03. So the scale of all these numbers is way off, and sometimes that will affect our results. One thing we want to do is scale these values so that each one is relative to, say, the mean and the standard deviation of its specific column. I'm going to create a function called scale_dataset, and I'm going to pass in the data frame; that's all it takes for now. The x values are going to come from the data frame: let's assume the label will always be the last thing in the data frame, so I can take the data frame's columns up to the last item and get those values. For my y, well, it's the last column, so I can just index into that last column and get those values.

Now I'm going to import something known as the StandardScaler from scikit-learn: up here I go to sklearn.preprocessing, import StandardScaler, run that cell, and come back down. I create a scaler using StandardScaler, and with that scaler I can just fit and transform x: I say x equals scaler.fit_transform(x). What that's doing is saying, okay, fit the standard scaler to x and then transform all those values, and that becomes our new x.

Then I'm also going to create the whole data as one big 2D NumPy array, and to do that I call hstack. Hstack says: take one array and another array and horizontally stack them together; that's what the H stands for. Horizontally stacking just means putting them side by side, not on top of each other. So what am I stacking? I have to pass in a tuple so that it can stack x and y. Now, NumPy is very particular about dimensions: in this specific case, x is a two-dimensional object, but y is only one-dimensional, just a vector of values. So in order to reshape it into a 2D item, we call np.reshape and pass in the dimensions we want. If I pass in (-1, 1), that means make this a 2D array where the -1 says "infer what this dimension should be," which ends up being the length of y; it would be the same as literally passing the length of y, but the -1 is easier because we're making the computer do the hard work. So I stack that, and then I return the data, x, and y.

One more thing: if we go into our training dataset and get the length of the part where the class is one, remember those are the gammas, and then print that and do the same thing for zero, we'll see that there are around 7,000 gammas but only around 4,000 hadrons. That might actually become an issue. Instead, what we want to do is oversample our training dataset: we want to increase the number of samples in the smaller class so that the two classes match better. And surprise, surprise, there is something we can import that will help us do that: from imblearn.over_sampling I'm going to import this RandomOverSampler, run that cell, and come back down here. So I will actually add in a parameter called oversample and set it to False by default. If I do want to oversample, then I create this ros and set it equal to RandomOverSampler, and then for x and y I just say fit_resample(x, y). What that's doing is taking more of the smaller class.
It keeps sampling from the smaller class to increase its size in our dataset, so that the two classes now match. So now I call scale_dataset and pass in the training data with oversample set to true; let's call the outputs train, X_train, and y_train. Oops, what's going on? Ah, these should be cols. So basically, what I'm checking now is the length of y_train; okay, it's around 14,800 or so. Now let's take a look at how many of these are of type one; we can just sum that up. And if we instead switch the label and ask how many are the other type, it's the same value, so these have now been evenly rebalanced.

Okay, so here I'm just going to do the same thing for the validation dataset, and then the next one will be the test dataset, and for those we're actually going to switch oversample to false. The reason I'm switching it to false is that my validation and test sets are for the purpose of: if I have data that I haven't seen yet, how does my model perform on it? I don't want to oversample for that; I don't care about balancing those sets. I want to know, if I have a random set of unlabeled data, can I trust my model? So that's why I'm not oversampling there. I run that, and again something's wrong: oh, it's because we already overwrote train, so I have to come up here and split that data frame again. And now let's run these. Okay, so now we have our data properly formatted, and we're going to move on to different models.
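Putting the whole preprocessing step together, a sketch of the helper and the three splits might look like this (the imports and variable names mirror what's described above; the exact cell layout in the video may differ):

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

def scale_dataset(dataframe, oversample=False):
    # Features are every column except the last; the label is the last column
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    if oversample:
        # Resample the minority class until both classes have equal counts
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

    # Glue X and the (reshaped) y back together as one 2D array
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

# Oversample only the training split; leave validation and test untouched
train, X_train, y_train = scale_dataset(train, oversample=True)
valid, X_valid, y_valid = scale_dataset(valid, oversample=False)
test, X_test, y_test = scale_dataset(test, oversample=False)
```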
I'm going to tell you guys a little bit about each of these models, and then show you how we can use them in our code. The first model we're going to learn about is KNN, or k-nearest neighbors. Here I've already drawn a plot: on the y-axis I have the number of kids a family might have, and on the x-axis I have their income, in thousands per year. So if someone's making 40,000 a year, that's where this would be, and if somebody's making 320,000, that's where that would be; somebody with zero kids would be somewhere along this axis, and somebody with five would be somewhere over here. Now, I have these plus signs and minus signs on the plot: the plus sign means they own a car, and the minus sign represents no car.

Your initial thought should be: I think this is binary classification, because all of our points, all of our samples, have labels. This is a sample with the plus label, and this here is another sample with the minus label ("w/" is an abbreviation for "with" that I'll use). All right, so we have this entire dataset, and maybe around half the people own a car and maybe around half don't. Well, what if I had some new point? Let me choose a different color; I'll use this nice green. What if I have a new point over here: let's say somebody makes 40,000 a year and has two kids. What do we think that would be? Just logically looking at this plot, you might think it seems like they wouldn't have a car, because that kind of matches the pattern of everybody else around them. That's the whole concept of nearest neighbors: you look at what's around you, and you take the label of the majority of the points near you.

The first thing we have to do is define a distance function, and a lot of times in 2D plots like this, our distance function is something known as Euclidean distance. Euclidean distance is basically just the straight-line distance between two points: the length of this green line that I just drew. If we want to get technical, the exact formula is: the distance equals the square root of (x1 - x2) squared plus (y1 - y2) squared. So we're taking the differences between the two points in x and in y, squaring each of them, summing them up, and taking the square root.
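As a small illustration (not a cell from the notebook), the same formula written as a function that works just as well in two dimensions as in ten:

```python
import numpy as np

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

# Distance between two (income in thousands, number of kids) points
print(euclidean_distance([40, 2], [60, 1]))
```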
Okay, so I'm going to erase this so it doesn't clutter my drawing. Anyway, going back to this plot: in the nearest-neighbors algorithm, we see that there is a k, and this k is basically telling us how many neighbors we use in order to judge what the label is. Usually we use a k of maybe three or five; it depends on how big our dataset is, but here I would say a logical number would be three or five. So let's take k equal to three.

For this data point that I drew over here, let me highlight it in green: it looks like the three closest points are definitely this one and this one, and then this one has a distance of about four while that one seems a little further than four, so these would be our three points. Well, all of those points are blue, so chances are my prediction for this point is going to be blue: probably doesn't have a car. All right, now what if my point is somewhere over here: let's say a couple has four kids and they make 240,000 a year. Well, now my closest points are this one, probably that one, and then this one. Still all pluses, so this one is more than likely to be a plus. Now let me get rid of some of these just so it looks a little bit clearer.

All right, let's go through one more. What about a point that might be right here? Let's see: definitely this one is the closest, this one's also close, and then it's really close between these two. But if we actually do the math, zooming in, this one lands right here and that one lands in between these two, so this one here is actually shorter than that one. That means the top one is the one we take. So what is the majority of the points nearby? We have one plus here, one plus here, and one minus here, which means the pluses are the majority, and that means this label is probably somebody with a car.

So this is how k-nearest neighbors works; it's that simple. And it can be extrapolated to higher dimensions. Here we have two different features, the income and the number of kids, but let's say we have 10 different features: we can expand our distance function so that it includes all 10 of those dimensions, take the square root of everything, and figure out which points are closest to the one we want to classify. Okay, so that's k-nearest neighbors.

So now that we've learned about k-nearest neighbors, let's see how we'd do that in our code. Here, I'm going to label this section "K Nearest Neighbors," and we're actually going to use a package from scikit-learn. The reason we use these packages is so that we don't have to manually code all these things ourselves, because it would be really difficult.
And chances are, the way we would code it would either have bugs, or be really slow, or have a whole bunch of other issues, so what we're going to do is hand it off to the pros. From here I can say: from sklearn.neighbors, which is part of this package, import KNeighborsClassifier, because we're classifying. Okay, so I run that. Our KNN model is going to be this KNeighborsClassifier, and we can pass in a parameter for how many neighbors we want to use. First, let's see what happens if we just use one. Now I call fit on the KNN model and pass in my X training set and my y training data, and that effectively fits the model.

Let's get all the predictions: my y predictions are going to be the KNN model's predict called on the test set, X_test. If I print y_pred, you'll see the predictions, and if I get the truth values for that test set, you'll see what the labels actually are. Just looking at the first few, we got five out of six of them right. Okay, great. Let's actually take a look at something called the classification report that scikit-learn offers: from sklearn.metrics import classification_report, and then I can say, print out this classification report for me, giving it y_test and the y predictions. We run this and we get this whole chart.
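That cell, roughly (it assumes the X_train/y_train and X_test/y_test arrays produced by scale_dataset above):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Fit a k-nearest-neighbors classifier with a single neighbor to start
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)

# Predict on the held-out test set and print precision/recall/F1/accuracy
y_pred = knn_model.predict(X_test)
print(classification_report(y_test, y_pred))
```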
Whereas all these ones out here, well, they should have been positive, but they 457 00:56:18,840 --> 00:56:24,559 are labeled as negative. And in here, these are the ones that we've labeled positive, but they're 458 00:56:24,559 --> 00:56:33,000 actually negative. And out here, these are truly negative. So precision is saying, okay, out of all 459 00:56:33,000 --> 00:56:40,400 the ones we've labeled as positive, how many of them are true positives? And recall is saying, 460 00:56:40,400 --> 00:56:47,160 okay, out of all the ones that we know are truly positive, how many do we actually get right? Okay, 461 00:56:47,159 --> 00:56:55,480 so going back to this over here, our precision score, so again, precision, out of all the ones 462 00:56:55,480 --> 00:57:03,880 that we've labeled as the specific class, how many of them are actually that class, it's 77 and 84%. Now, 463 00:57:03,880 --> 00:57:09,400 recall: out of all the ones that are actually this class, how many of those did we get? This 464 00:57:09,400 --> 00:57:18,200 is 68% and 89%. Alright, so not too shabby, we can clearly see that this recall and precision for 465 00:57:18,199 --> 00:57:24,079 the class zero is worse than for class one. Right? So that means it's working worse for 466 00:57:24,079 --> 00:57:30,079 hadrons than for our gammas. This f1 score over here is kind of a combination of the precision and 467 00:57:30,079 --> 00:57:35,519 recall score. So we're actually going to mostly look at this one because we have an unbalanced 468 00:57:35,519 --> 00:57:43,000 test data set. So here we have a measure of 72 and 87, or point seven two and point eight seven, 469 00:57:43,000 --> 00:57:55,639 which is not too shabby. All right. Well, what if we, you know, made this three. So we actually see 470 00:57:55,639 --> 00:58:04,599 that, okay, so what was it originally with one? We see that our f1 score, you know, was 471 00:58:04,599 --> 00:58:10,360 point seven two and point eight seven. And then our accuracy was 82%. So if I change that to 472 00:58:10,360 --> 00:58:20,440 three. Alright, so we've kind of increased zero at the cost of one and then our overall accuracy 473 00:58:20,440 --> 00:58:28,159 is 81. So let's actually just make this five. Alright, so you know, again, very similar numbers, 474 00:58:28,159 --> 00:58:35,359 we have 82% accuracy, which is pretty decent for a model that's relatively simple. Okay, 475 00:58:35,360 --> 00:58:42,880 the next type of model that we're going to talk about is something known as naive Bayes. Now, 476 00:58:42,880 --> 00:58:48,400 in order to understand the concepts behind naive Bayes, we have to be able to understand 477 00:58:48,400 --> 00:58:55,800 conditional probability and Bayes rule. So let's say I have some sort of data set that's shown in 478 00:58:55,800 --> 00:59:03,720 this table right here. People who have COVID are over here in this red row. And people who do not 479 00:59:03,719 --> 00:59:09,039 have COVID are down here in this green row. Now, what about the COVID test? Well, people who have 480 00:59:09,039 --> 00:59:18,360 tested positive are over here in this column. And people who have tested negative are over here in 481 00:59:18,360 --> 00:59:25,840 this column. Okay.
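[For reference, the quantities being read off the classification report, written out in terms of the true/false positives and negatives from the diagram described above; this notation is an addition, not from the video.]

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```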
Yeah, so basically, our categories are people who have COVID and test positive, 482 00:59:25,840 --> 00:59:32,800 people who don't have COVID, but test positive, so a false false positive, people who have COVID 483 00:59:32,800 --> 00:59:38,560 and test negative, which is a false negative, and people who don't have COVID and test negative, 484 00:59:38,559 --> 00:59:48,159 which good means you don't have COVID. Okay, so let's make this slightly more legible. And here, 485 00:59:48,159 --> 00:59:55,359 in the margins, I've written down the sums of whatever it's referring to. So this here is the 486 00:59:55,360 --> 01:00:05,559 sum of this entire row. And this here might be the sum of this column over here. Okay. So the first 487 01:00:05,559 --> 01:00:11,559 question that I have is, what is the probability of having COVID given that you have a positive 488 01:00:11,559 --> 01:00:21,920 test? And in probability, we write that out like this. So the probability of COVID given, so this 489 01:00:21,920 --> 01:00:29,360 line, that vertical line means given that, you know, some condition, so given a positive test, 490 01:00:29,360 --> 01:00:39,440 okay, so what is the probability of having COVID given a positive test? So what this is asking is 491 01:00:39,440 --> 01:00:48,320 saying, okay, let's go into this condition. So the condition of having a positive test, that is this 492 01:00:48,320 --> 01:00:53,360 slice of the data, right? That means if you're in this slice of data, you have a positive test. So 493 01:00:53,360 --> 01:00:59,000 given that we have a positive test, given in this condition, in this circumstance, we have a positive 494 01:00:59,000 --> 01:01:05,679 test. So what's the probability that we have COVID? Well, if we're just using this data, the number 495 01:01:05,679 --> 01:01:15,440 of people that have COVID is 531. So I'm gonna say that there's 531 people that have COVID. And then 496 01:01:15,440 --> 01:01:24,599 now we divide that by the total number of people that have a positive test, which is 551. Okay, 497 01:01:24,599 --> 01:01:34,639 so that's the probability and doing a quick division, we get that this is equal to around 498 01:01:34,639 --> 01:01:43,239 96.4%. So according to this data set, which is data that I made up off the top of my head, so it's 499 01:01:43,239 --> 01:01:50,759 not actually real COVID data. But according to this data, the probability of having COVID given 500 01:01:50,760 --> 01:02:02,480 that you tested positive is 96.4%. Alright, now with that, let's talk about Bayes rule, which is 501 01:02:02,480 --> 01:02:10,440 this section here. Let's ignore this bottom part for now. So Bayes rule is asking, okay, what is 502 01:02:10,440 --> 01:02:18,000 the probability of some event A happening, given that B happened. So this, we already know has 503 01:02:18,000 --> 01:02:26,000 happened. This is our condition, right? Well, what if we don't have data for that, right? Like, what 504 01:02:26,000 --> 01:02:31,440 if we don't know what the probability of A given B is? Well, Bayes rule is saying, okay, well, you 505 01:02:31,440 --> 01:02:36,920 can actually go and calculate it, as long as you have a probability of B given A, the probability 506 01:02:36,920 --> 01:02:43,920 of A and the probability of B. Okay. And this is just a mathematical formula for that. Alright, 507 01:02:43,920 --> 01:02:51,320 so here we have Bayes rule. And let's actually see Bayes rule in action. Let's use it on an example. 
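[The formula on the slide being referenced, written out; this is standard Bayes' rule.]

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```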
508 01:02:51,320 --> 01:02:58,920 So here, let's say that we have some disease statistics, okay. So not COVID, a different disease. 509 01:02:58,920 --> 01:03:05,960 And we know that the probability of obtaining a false positive is 0.05, the probability of obtaining a 510 01:03:05,960 --> 01:03:12,800 false negative is 0.01. And the probability of the disease is 0.1. Okay, what is the probability of 511 01:03:12,800 --> 01:03:20,640 the disease given that we got a positive test? Hmm, how do we even go about solving this? So 512 01:03:20,639 --> 01:03:26,519 what do I mean by false positive? What's a different way to rewrite that? A false positive 513 01:03:26,519 --> 01:03:32,960 is when you test positive, but you don't actually have the disease. So this here is a probability 514 01:03:32,960 --> 01:03:42,480 that you have a positive test given no disease, right? And similarly for the false negative, 515 01:03:42,480 --> 01:03:47,599 it's a probability that you test negative given that you actually have the disease. So if I put 516 01:03:47,599 --> 01:03:58,119 that into a chart, for example, and this might be my positive and negative tests, and this might 517 01:03:58,119 --> 01:04:07,239 be my diseases, disease and no disease. Well, the probability that I test positive, but actually 518 01:04:07,239 --> 01:04:14,039 have no disease, okay, that's 0.05 over here. And then the false negative is up here, 0.01. So I'm 519 01:04:14,039 --> 01:04:20,880 testing negative, but I actually do have the disease. So the probability that you test 520 01:04:20,880 --> 01:04:25,480 positive given that you don't have the disease, plus the probability that you test negative given that you 521 01:04:25,480 --> 01:04:30,880 don't have the disease, that should sum up to one. Okay, because if you don't have the disease, 522 01:04:30,880 --> 01:04:34,360 then you should have some probability that you're testing positive and some probability that you're 523 01:04:34,360 --> 01:04:43,120 testing negative. But that probability, in total, should be one. So that means that the probability of a 524 01:04:43,119 --> 01:04:47,039 negative test given no disease, this should be the complement, this should be the opposite. So it 525 01:04:47,039 --> 01:04:57,360 should be 0.95 because it's one minus whatever this probability is. And then similarly, oops, 526 01:04:59,679 --> 01:05:06,319 up here, this should be 0.99 because the probability that we, you know, 527 01:05:06,320 --> 01:05:10,080 test negative given that we have the disease plus the probability that we test positive given that we have the 528 01:05:10,079 --> 01:05:16,799 disease should equal one. So this is our probability chart. And now, this probability of disease 529 01:05:16,800 --> 01:05:21,920 being 0.1 just means I have 10% probability of actually having the disease, right? Like, 530 01:05:23,199 --> 01:05:30,000 in the general population, the probability that I have the disease is 0.1. Okay, so what is the 531 01:05:30,000 --> 01:05:37,039 probability that I have the disease given that I got a positive test? Well, remember that we 532 01:05:37,039 --> 01:05:43,119 can write this out in terms of Bayes rule, right? So if I use this rule up here, this is the 533 01:05:43,119 --> 01:05:51,199 probability of a positive test given that I have the disease times the probability of the disease 534 01:05:52,880 --> 01:05:58,240 divided by the probability of the evidence, which is my positive test. 535 01:06:00,000 --> 01:06:05,679 Alright, now let's plug in some numbers for that.
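[A quick numeric check of the plug-in that follows, using only the numbers stated above; a sketch, not code from the notebook.]

```python
# stated assumptions: P(+ | no disease) = 0.05, P(- | disease) = 0.01, P(disease) = 0.1
p_pos_given_disease = 1 - 0.01        # 0.99, complement of the false negative rate
p_pos_given_no_disease = 0.05         # the false positive rate
p_disease = 0.1

# Bayes' rule: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos)  # 0.6875
```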
The probability of having a positive test given 536 01:06:05,679 --> 01:06:13,839 that I have the disease is 0.99. And then the probability that I have the disease is this value 537 01:06:13,840 --> 01:06:26,000 over here, 0.1. Okay. And then the probability that I have a positive test at all should be: okay, 538 01:06:26,000 --> 01:06:29,840 what is the probability that I have a positive test given that I actually have the disease, 539 01:06:29,840 --> 01:06:37,360 times the probability of having the disease. And then the other case, where the probability of me having a 540 01:06:37,360 --> 01:06:45,519 negative test given, or sorry, positive test given no disease, times the probability of not actually 541 01:06:45,519 --> 01:06:52,000 having a disease. Okay, so I can expand that probability of having a positive test out into 542 01:06:52,000 --> 01:06:58,480 these two different cases, I have a disease, and then I don't. And then what's the probability of 543 01:06:58,480 --> 01:07:08,240 having positive tests in either one of those cases. So that expression would become 0.99 times 0.1, 544 01:07:09,519 --> 01:07:16,159 plus 0.05, so that's the probability that I'm testing positive but don't have the disease, 545 01:07:16,960 --> 01:07:20,400 times the probability that I don't actually have the disease. So that's one minus 546 01:07:20,400 --> 01:07:29,840 0.1; the probability that the population doesn't have the disease is 90%, so 0.9. And let's do that 547 01:07:29,840 --> 01:07:48,720 multiplication. And I get an answer of 0.6875 or 68.75%. Okay. All right, so we can actually expand 548 01:07:48,719 --> 01:07:56,480 Bayes rule and apply it to classification. And this is what we call naive 549 01:07:56,480 --> 01:08:04,639 Bayes. So first, a little terminology. So the posterior is this over here, because it's asking, 550 01:08:04,639 --> 01:08:12,480 Hey, what is the probability of some class CK? So by CK, I just mean, you know, the different 551 01:08:12,480 --> 01:08:19,359 categories, so C for category or class or whatever. So category one might be cats, category two, 552 01:08:19,359 --> 01:08:26,639 dogs, category three, lizards, all the way, we have k categories, k is just some number. Okay. 553 01:08:27,520 --> 01:08:36,160 So what is the probability of having this specific sample x, so this is our feature vector 554 01:08:36,159 --> 01:08:44,079 of this one sample. What is the probability of x fitting into category 1, 2, 3, or whatever, right, 555 01:08:44,079 --> 01:08:49,119 so that's what this is asking, what is the probability that, you know, it's actually from 556 01:08:49,119 --> 01:08:59,920 this class, given all this evidence that we see, the x's. So the likelihood is this quantity over 557 01:08:59,920 --> 01:09:07,600 here, it's saying, Okay, well, assume that this 558 01:09:07,600 --> 01:09:13,760 class is class CK, okay, assume that this is the category. Well, what is the likelihood of 559 01:09:13,760 --> 01:09:21,280 actually seeing x, all these different features, from that category? And then this here is the 560 01:09:21,279 --> 01:09:26,880 prior. So like in the entire population of things, what are the probabilities? What is the 561 01:09:26,880 --> 01:09:32,640 probability of this class in general? Like if I have, you know, in my entire data set, what is the 562 01:09:32,640 --> 01:09:40,160 percentage? What is the chance that this image is a cat? How many cats do I have? Right.
And then this 563 01:09:40,159 --> 01:09:47,439 down here is called the evidence, because what we're trying to do is we're changing our prior, 564 01:09:47,439 --> 01:09:54,319 we're creating this new posterior probability built upon the prior by using some sort of evidence, 565 01:09:54,319 --> 01:10:02,239 right? And that evidence is the probability of x. So that's some vocab. And this here 566 01:10:05,439 --> 01:10:15,599 is the rule for naive Bayes. Whoa, okay, let's digest that a little bit. Okay. So, 567 01:10:15,600 --> 01:10:21,680 let me use a different color. What is this side of the equation asking? It's asking, 568 01:10:21,680 --> 01:10:28,320 what is the probability that we are in some class K, CK, given that, you know, this is my first 569 01:10:28,319 --> 01:10:33,920 input, this is my second input, this is, you know, my third, fourth, this is my nth input. So let's 570 01:10:33,920 --> 01:10:41,600 say that our classification is, do we play soccer today or not? Okay, and let's say our x's are, 571 01:10:41,600 --> 01:10:49,440 okay, how much wind is there? How much rain is there? And what day of the week is it? So let's 572 01:10:49,439 --> 01:10:54,399 say that it's raining, it's not windy, but it's Wednesday, do we play soccer? Do we not? 573 01:10:56,079 --> 01:10:59,680 So let's use Bayes rule on this. So this here 574 01:11:06,079 --> 01:11:13,840 is equal to the probability of x one, x two, all these joint probabilities, given class K, 575 01:11:13,840 --> 01:11:20,800 times the probability of that class, all over the probability of this evidence. 576 01:11:24,399 --> 01:11:31,839 Okay. So what is this fancy symbol over here? This means proportional to, 577 01:11:33,600 --> 01:11:38,560 so just how our equal sign means it's equal to, this little squiggly sign means that this is 578 01:11:38,560 --> 01:11:48,800 proportional to, okay, and this denominator over here, you might notice that it has no impact on 579 01:11:48,800 --> 01:11:53,840 the class, like, that number doesn't depend on the class, right? So this is going to be constant 580 01:11:53,840 --> 01:11:59,199 for all of our different classes. So what I'm going to do is make things simpler. So I'm just 581 01:11:59,199 --> 01:12:07,920 going to say that this probability of x one, x two, all the way to x n, this is going to be proportional 582 01:12:07,920 --> 01:12:10,800 to the numerator, I don't care about the denominator, because it's the same for every 583 01:12:10,800 --> 01:12:20,800 single class. So this is proportional to the probability of x one, x two, up to x n given class K, times the probability of 584 01:12:20,800 --> 01:12:31,920 that class. Okay. All right. So in naive Bayes, the point of it being naive is that, for 585 01:12:32,960 --> 01:12:36,319 this joint probability, we're just assuming that all of these different things 586 01:12:36,319 --> 01:12:42,719 are all independent. So in my soccer example, you know, the probability that we're playing soccer, 587 01:12:44,800 --> 01:12:50,720 or the probability that, you know, it's windy, and it's rainy, and it's Wednesday, all these 588 01:12:50,720 --> 01:12:56,800 things are independent, we're assuming that they're independent. So that means that I can 589 01:12:56,800 --> 01:13:06,560 actually write this part of the equation here as this. So each term in here, I can just multiply 590 01:13:07,119 --> 01:13:13,840 all of them together.
So the probability of the first feature, given that it's class K, 591 01:13:14,800 --> 01:13:20,159 times the probability of the second feature, given that it's class K, all the way up 592 01:13:20,159 --> 01:13:30,960 until, you know, the nth feature given that it's class K. So this expands to 593 01:13:30,960 --> 01:13:39,199 all of this. All right, which means that this here is now proportional to the thing that we just 594 01:13:39,199 --> 01:13:47,599 expanded times this. So I'm going to write that out. So the probability of that class. 595 01:13:47,600 --> 01:13:54,560 And I'm actually going to use this symbol. So what this means is it's a huge multiplication, 596 01:13:54,560 --> 01:14:04,000 it means multiply everything to the right of this. So this probability of x, given some class K, 597 01:14:04,720 --> 01:14:11,360 but do it for all the i's. So i, what is i, okay, we're going to go from the first 598 01:14:11,359 --> 01:14:18,639 x i all the way to the nth. So that means for every single i, we're just multiplying 599 01:14:19,359 --> 01:14:27,439 these probabilities together. And that's where this up here comes from. So to wrap this up, 600 01:14:27,439 --> 01:14:31,599 oops, this should be a line, to wrap this up in plain English. Basically, what this is saying 601 01:14:31,600 --> 01:14:37,520 is the probability that, you know, we're in some category, given that we have all these different 602 01:14:37,520 --> 01:14:44,960 features, is proportional to the probability of that class in general, times the probability of 603 01:14:44,960 --> 01:14:51,119 each of those features, given that we're in this one class that we're testing. So the probability 604 01:14:51,680 --> 01:14:59,600 of, you know, of us playing soccer today, given that it's rainy, not windy, and it's 605 01:14:59,600 --> 01:15:04,880 Wednesday, is proportional to: okay, well, what is the probability that we play soccer 606 01:15:04,880 --> 01:15:10,400 anyways, and then times the probability that it's rainy, given that we're playing soccer, 607 01:15:10,960 --> 01:15:15,439 times the probability that it's not windy, given that we're playing soccer. So how many times are 608 01:15:15,439 --> 01:15:21,199 we playing soccer when it's not windy, you know, and then what's the probability 609 01:15:21,199 --> 01:15:30,319 that it's Wednesday, given that we're playing soccer. Okay. So how do we use this in order to make a 610 01:15:30,319 --> 01:15:39,039 classification? So that's where this comes in: our y hat, our predicted y, is going to be equal to 611 01:15:39,039 --> 01:15:45,439 something called the arg max of this expression over here, because we want to take 612 01:15:45,439 --> 01:15:55,199 the arg max. Well, okay, if I write out this, again, this means the probability of 613 01:15:55,199 --> 01:16:05,840 being in some class CK given all of our evidence. Well, we're going to take the K that maximizes 614 01:16:06,640 --> 01:16:13,920 this expression on the right. That's what arg max means. So if K is in zero, oops, 615 01:16:14,720 --> 01:16:21,199 one through K, so this is how many categories there are, we're going to go through each K. And we're going 616 01:16:21,199 --> 01:16:32,319 to solve this expression over here and find the K that makes that the largest. Okay.
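[Collecting the verbal description above into one formula, with x_1 through x_n the features and C_k the candidate classes; the notation is an addition, not from the slides.]

```latex
\hat{y} = \underset{k \in \{1, \dots, K\}}{\arg\max} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```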
And remember 617 01:16:32,319 --> 01:16:39,439 that instead of writing this, we have now a formula, thanks to Bayes rule, for helping us 618 01:16:40,560 --> 01:16:47,440 approximate that with something that maybe we have the evidence for, 619 01:16:47,439 --> 01:16:54,479 that we have the answers for based on our training set. So this principle of going through each of 620 01:16:54,479 --> 01:17:00,559 these and finding whatever class, whatever category, maximizes this expression on the right, 621 01:17:00,560 --> 01:17:12,160 this is something known as MAP for short, or maximum a posteriori. 622 01:17:12,159 --> 01:17:20,159 Pick the hypothesis. So pick the K that is the most probable, so that we minimize the probability 623 01:17:20,159 --> 01:17:31,119 of misclassification. Right. So that is MAP. That is naive Bayes. Back to the notebook. So 624 01:17:31,760 --> 01:17:38,800 just like how I imported k nearest neighbors, the K neighbors classifier, up here, for naive Bayes, 625 01:17:38,800 --> 01:17:45,680 I can go to SK learn naive Bayes. And I can import Gaussian naive Bayes. 626 01:17:46,800 --> 01:17:52,720 Right. And here I'm going to say my naive Bayes model is equal to this. It's very similar to what we 627 01:17:52,720 --> 01:18:06,480 had above. And I'm just going to say with this model, we are going to fit x train and y train. 628 01:18:06,479 --> 01:18:17,359 All right, just like above. So this, I might actually, so I'm going to set that. And 629 01:18:19,199 --> 01:18:26,159 exactly, just like above, I'm going to make my prediction. So here, I'm going to instead use my 630 01:18:26,159 --> 01:18:35,279 naive Bayes model. And of course, I'm going to run the classification report again. So I'm actually 631 01:18:35,279 --> 01:18:40,719 just going to put these in the same cell. But here we have the new y prediction, and then y test 632 01:18:40,720 --> 01:18:49,520 is still our original test data set. So if I run this, you'll see that. Okay, what's going on here, 633 01:18:49,520 --> 01:18:58,640 we get worse scores, right? Our precision, for all of them, looks slightly worse. And our, 634 01:18:58,640 --> 01:19:04,160 you know, our precision, our recall, our f1 score, they look slightly worse for all the different 635 01:19:04,159 --> 01:19:11,439 categories. And our total accuracy, I mean, it's still 72%, which is not too shabby. But it's still 636 01:19:11,439 --> 01:19:22,000 72%. Okay. Which, you know, is not that great. Okay, so let's move on to logistic regression. 637 01:19:22,000 --> 01:19:29,760 Here, I've drawn a plot, I have y. So this is my label on one axis. And then this is maybe one of 638 01:19:29,760 --> 01:19:36,720 my features. So let's just say I only have one feature in this case, x zero, right? Well, 639 01:19:36,720 --> 01:19:44,079 we see that, you know, I have a few of one class type down here. And we know it's one class type 640 01:19:44,079 --> 01:19:51,279 because it's zero. And then we have our other class type, one, up here. And then we have our 641 01:19:51,279 --> 01:19:58,960 y. Okay. So many of you guys are familiar with regression. So let's start there. If I were to 642 01:19:58,960 --> 01:20:10,159 draw a regression line through this, it might look something like this. Right? Well, this 643 01:20:10,159 --> 01:20:16,239 doesn't seem to be a very good model. Like, why would we use this specific line to predict y? 644 01:20:16,239 --> 01:20:27,840 Right? It's, it's iffy. Okay.
For example, we might say, okay, well, it seems like, you know, 645 01:20:27,840 --> 01:20:33,520 everything from here downwards would be one class type, and here upwards would be another class type. 646 01:20:34,640 --> 01:20:41,520 But when you look at this, you can just visually tell, okay, like, that line doesn't 647 01:20:41,520 --> 01:20:46,240 make sense. Those dots are not along that line. And the reason is because we 648 01:20:46,239 --> 01:20:55,279 are doing classification, not regression. Okay. Well, first of all, let's start here, we know that 649 01:20:55,279 --> 01:21:04,639 this model, if we just use this line, it equals m x plus b, 650 01:21:04,640 --> 01:21:10,000 where b is the y intercept, right? And m is the slope. But when we use linear regression here, 651 01:21:10,000 --> 01:21:15,760 is it actually y hat? No, it's not, right? So when we're working with classification, 652 01:21:15,760 --> 01:21:20,720 what we're actually estimating in our model is a probability, a probability between zero 653 01:21:20,720 --> 01:21:30,240 and one, that it is class zero or class one. So here, let's rewrite this as p equals m x plus b. 654 01:21:32,720 --> 01:21:39,440 Okay, well, m x plus b, that can range, you know, from negative infinity to infinity, 655 01:21:39,439 --> 01:21:43,279 right? For any value of x, it goes from negative infinity to infinity. 656 01:21:44,159 --> 01:21:49,039 But probability, we know, one of the rules of probability is that probability has to stay 657 01:21:49,039 --> 01:21:57,039 between zero and one. So how do we fix this? Well, maybe instead of just setting the probability 658 01:21:57,039 --> 01:22:03,519 equal to that, we can set the odds equal to this. So by that, I mean, okay, let's do probability 659 01:22:03,520 --> 01:22:10,080 divided by one minus the probability. Okay, so now it becomes this ratio. Now this ratio is allowed to 660 01:22:10,079 --> 01:22:17,359 take on infinite values. But there's still one issue here. Let me move this over a bit. 661 01:22:18,079 --> 01:22:24,559 The one issue here is that m x plus b, that can still be negative, right? Like if, you know, 662 01:22:24,560 --> 01:22:28,800 I have a negative slope, if I have a negative b, if I have some negative x's in there, I don't know, 663 01:22:28,800 --> 01:22:36,400 but that's allowed to be negative. So how do we fix that? We do that by actually taking 664 01:22:36,399 --> 01:22:47,839 the log of the odds. Okay. So now I have the log of, you know, some probability divided by one minus 665 01:22:47,840 --> 01:22:54,319 the probability. And now that is on a range of negative infinity to infinity, which is good 666 01:22:54,319 --> 01:23:00,639 because the range of the log is negative infinity to infinity. Now how do I solve for P, 667 01:23:00,640 --> 01:23:08,400 the probability? Well, the first thing I can do is, you know, remove the log by taking 668 01:23:08,399 --> 01:23:16,479 e to the whatever is on both sides. So that gives me: the probability 669 01:23:16,479 --> 01:23:27,839 over one minus the probability is now equal to e to the m x plus b. Okay. So let's multiply 670 01:23:27,840 --> 01:23:39,039 that out. So the probability is equal to one minus the probability, times e to the m x plus b. So P is equal to 671 01:23:39,039 --> 01:23:49,279 e to the m x plus b minus P times e to the m x plus b.
And now we can move like terms to 672 01:23:49,279 --> 01:23:58,880 one side. So if I do P, so basically, I'm moving this over, so I'm adding P. So now P times one plus e 673 01:23:58,880 --> 01:24:11,440 to the m x plus b is equal to e to the m x plus b, and let me change these parentheses, make them a 674 01:24:11,439 --> 01:24:22,719 little bigger. So now my probability can be e to the m x plus b divided by one plus e to the m x plus b. 675 01:24:22,720 --> 01:24:32,880 Okay, well, let me just rewrite this really quickly, I want a numerator of one on top. 676 01:24:33,840 --> 01:24:39,920 Okay, so what I'm going to do is I'm going to multiply this by e to the negative m x plus b, 677 01:24:40,800 --> 01:24:45,119 and then also the bottom by e to the negative m x plus b, and I'm allowed to do that because 678 01:24:45,119 --> 01:24:52,640 this over this is one. So now my probability is equal to one over 679 01:24:54,640 --> 01:25:01,840 one plus e to the negative m x plus b. And now why did I rewrite it like that? 680 01:25:01,840 --> 01:25:07,600 It's because this is actually a form of a special function, which is called the sigmoid 681 01:25:07,600 --> 01:25:19,360 function. And for the sigmoid function, it looks something like this. So s of x, the sigmoid, you know, 682 01:25:20,159 --> 01:25:30,639 of some x, is equal to one over one plus e to the negative x. So essentially, what I just did up here 683 01:25:30,640 --> 01:25:38,000 is rewrite this in some sigmoid function, where the x value is actually m x plus b. 684 01:25:38,960 --> 01:25:42,880 So maybe I'll change this to y just to make that a bit more clear, it doesn't matter what 685 01:25:42,880 --> 01:25:50,319 the variable name is. But this is our sigmoid function. And visually, what our sigmoid function 686 01:25:50,319 --> 01:26:01,039 looks like is, it goes from zero. So this here is zero to one. And it looks something like this 687 01:26:01,039 --> 01:26:06,399 curved s, which I didn't draw too well. Let me try that again. It's hard to draw; 688 01:26:10,159 --> 01:26:19,119 let's see if I can draw this right. Like that. Okay, so it goes in between zero and one. 689 01:26:19,119 --> 01:26:25,760 And you might notice that this form fits our shape up here. 690 01:26:29,840 --> 01:26:36,159 Oops, let's draw it sharper. But it fits our shape up there a lot better, right? 691 01:26:37,439 --> 01:26:44,479 Alright, so that is what we call logistic regression, we're basically trying to fit our data 692 01:26:44,479 --> 01:26:56,239 to the sigmoid function. Okay. And when we only have, you know, one feature x, 693 01:26:56,239 --> 01:27:06,239 that's what we call simple logistic regression. But then if we have, you know, 694 01:27:06,239 --> 01:27:12,639 so that's only x zero, but then if we have x zero, x one, all the way to x n, we call this 695 01:27:12,640 --> 01:27:19,360 multiple logistic regression, because there are multiple features that we're considering 696 01:27:19,359 --> 01:27:26,079 when we're building our model, logistic regression. So I'm going to put that here. 697 01:27:26,079 --> 01:27:36,079 And again, from SK learn dot linear model, we can import logistic regression. All right. 698 01:27:36,079 --> 01:27:43,279 And just like how we did above, we can repeat all of this. So here, instead of NB, I'm going to call 699 01:27:43,279 --> 01:27:53,439 this log model, or LG, for logistic regression. I'm going to change this to logistic regression.
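[The derivation just walked through, written out in one place with the same m, x, and b; the sigmoid is written S here. The notation is an addition.]

```latex
\log\frac{p}{1-p} = mx + b
\;\;\Rightarrow\;\;
\frac{p}{1-p} = e^{mx+b}
\;\;\Rightarrow\;\;
p = \frac{e^{mx+b}}{1 + e^{mx+b}} = \frac{1}{1 + e^{-(mx+b)}} = S(mx + b)
```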
700 01:27:54,319 --> 01:27:59,119 So I'm just going to use the default logistic regression. But actually, if you look here, 701 01:27:59,119 --> 01:28:02,319 you see that you can use different penalties. So right now we're using 702 01:28:02,319 --> 01:28:08,880 an L2 penalty. But L2 is the quadratic penalty. Okay, so that means that for, 703 01:28:09,680 --> 01:28:16,079 you know, outliers, it would really penalize that. For all these other things, you know, 704 01:28:16,079 --> 01:28:22,319 you can toggle these different parameters, and you might get slightly different results. 705 01:28:22,319 --> 01:28:26,960 If I were building a production level logistic regression model, then I would want to go and 706 01:28:26,960 --> 01:28:31,439 figure out how to do that. So I would want to 707 01:28:31,439 --> 01:28:36,479 figure out, you know, what are the best parameters to pass into here, 708 01:28:36,479 --> 01:28:41,519 based on my validation data. But for now, we'll just use this out of the box. 709 01:28:42,720 --> 01:28:49,600 So again, I'm going to fit the X train and the Y train. And I'm just going to predict again, 710 01:28:49,600 --> 01:28:57,440 so I can just call this again. And instead of NB, I'm going to use LG. So here, this is decent: 711 01:28:57,439 --> 01:29:07,279 precision 65%, recall 71%, f1 of 68, or 82, total accuracy of 77%. Okay, so it performs slightly 712 01:29:07,279 --> 01:29:15,279 better than naive Bayes, but it's still not as good as KNN. Alright, so the last model for 713 01:29:15,279 --> 01:29:20,079 classification that I wanted to talk about is something called support vector machines, 714 01:29:20,079 --> 01:29:31,840 or SVMs for short. So what exactly is an SVM model? I have two different features, x zero and 715 01:29:31,840 --> 01:29:39,520 x one, on the axes. And then I've told you if it's, you know, class zero or class one based on the 716 01:29:39,520 --> 01:29:51,280 blue and red labels. My goal is to find some sort of line between these two labels that best divides 717 01:29:51,279 --> 01:30:00,559 the data. Alright, so this line is our SVM model. So I call it a line here because in 2d, it's a 718 01:30:00,560 --> 01:30:06,160 line, but in 3d, it would be a plane, and then you can also have more and more dimensions. So the 719 01:30:06,159 --> 01:30:11,599 proper term is actually, I want to find the hyperplane that best differentiates these two 720 01:30:11,600 --> 01:30:30,000 classes. Let's see a few examples. Okay, so first, between these three lines, let's say A, B, 721 01:30:30,000 --> 01:30:37,760 and C, which one is the best divider of the data, which one has, you know, all the data on one side 722 01:30:37,760 --> 01:30:42,880 or the other, or at least if it doesn't, which one divides it the most, right, like which one 723 01:30:42,880 --> 01:30:53,920 has the most defined boundary between the two different groups. So this question should be 724 01:30:53,920 --> 01:31:02,079 pretty straightforward. It should be A, right, because A has a clear, distinct line where, you 725 01:31:02,079 --> 01:31:09,039 know, everything on this side of A is one label, it's negative, and everything on this side of A 726 01:31:09,039 --> 01:31:16,399 is the other label, it's positive. So what if I keep A, but then what if I had drawn my B 727 01:31:16,399 --> 01:31:26,479 like this, and my C maybe like this? Sorry, the labels are kind of close together.
728 01:31:27,439 --> 01:31:38,559 But now which one is the best? So I would argue that it's still A, right? And why is it still A? 729 01:31:38,560 --> 01:31:47,840 Because in these other two, look at how close this is to that, 730 01:31:47,840 --> 01:31:57,119 to these points. Right? So if I had some new point that I wanted to estimate, okay, 731 01:31:57,119 --> 01:32:02,960 say I didn't have A or B. So let's say we're just working with C. Let's say I have some new point 732 01:32:02,960 --> 01:32:10,960 that's right here. Or maybe a new point that's right there. Well, it seems like, just logically 733 01:32:10,960 --> 01:32:19,600 looking at this, I mean, without the boundary, that would probably go under the positives, 734 01:32:19,600 --> 01:32:27,520 right? I mean, it's pretty close to that other positive. So one thing that we care about in SVM 735 01:32:27,520 --> 01:32:36,320 is something known as the margin. Okay, so not only do we want to separate the two classes really 736 01:32:36,319 --> 01:32:43,119 well, we also care about the boundary in between where the points in those classes in our data set 737 01:32:43,119 --> 01:32:53,279 are, and the line that we're drawing. So in a line like this, the closest values to this line 738 01:32:53,279 --> 01:33:10,000 might be like here. And I'm trying to draw these perpendicular. Right? And so this effectively, 739 01:33:10,000 --> 01:33:22,399 if I switch over to these dotted lines, if I can draw this right. So these effectively 740 01:33:22,399 --> 01:33:37,839 are what's known as the margins. Okay, so these both here, these are our margins in our SVMs. 741 01:33:38,479 --> 01:33:43,039 And our goal is to maximize those margins. So not only do we want the line that best separates the 742 01:33:43,039 --> 01:33:51,279 two different classes, we want the line that has the largest margin. And the data points that lie 743 01:33:51,279 --> 01:33:57,519 on the margin lines, so basically the data points that are helping us define our 744 01:33:57,520 --> 01:34:08,480 divider, these are what we call support vectors. Hence the name support vector machines. Okay, 745 01:34:08,479 --> 01:34:16,479 so the issue with SVMs sometimes is that they're not so robust to outliers. Right? So for example, 746 01:34:16,479 --> 01:34:25,839 if I had one outlier, like this up here, that would totally change where I want my support 747 01:34:25,840 --> 01:34:31,920 vector to be, even though that might be my only outlier. Okay. So that's just something to keep 748 01:34:31,920 --> 01:34:38,239 in mind, as, you know, when you're working with SVMs, it might not be the best model if there 749 01:34:38,239 --> 01:34:45,679 are outliers in your data set. Okay, so another example of SVMs might be, let's say that we have 750 01:34:45,680 --> 01:34:50,480 data like this, I'm just going to use a one dimensional data set for this example. Let's 751 01:34:50,479 --> 01:34:56,799 say we have a data set that looks like this. Well, our, you know, separators should be 752 01:34:56,800 --> 01:35:01,440 perpendicular to this line. But it should be somewhere along this line. So it could be 753 01:35:02,399 --> 01:35:09,119 anywhere like this. You might argue, okay, well, there's one here. And then you could also just 754 01:35:09,119 --> 01:35:13,840 draw another one over here, right? And then maybe you can have two SVMs. But that's not really how 755 01:35:13,840 --> 01:35:21,680 SVMs work.
But one thing that we can do is we can create some sort of projection. So I realize here 756 01:35:21,680 --> 01:35:29,440 that one thing I forgot to do was to label where zero was. So let's just say zero is here. 757 01:35:32,000 --> 01:35:36,800 Now, what I'm going to do is I'm going to say, okay, I'm going to have x, and then I'm going to 758 01:35:36,800 --> 01:35:44,560 have, sorry, x zero and x one. So x zero is just going to be my original x. But I'm going to make 759 01:35:44,560 --> 01:35:56,880 x one equal to, let's say, x squared. So whatever is this, squared, right? So now, my negatives would be, 760 01:35:56,880 --> 01:36:02,960 you know, maybe somewhere here, here, just pretend that it's somewhere up here. 761 01:36:02,960 --> 01:36:06,640 Right. And now my pluses might be something like 762 01:36:10,079 --> 01:36:16,079 that. And I'm going to run out of space over here. So I'm just going to draw these together, 763 01:36:16,079 --> 01:36:27,600 use your imagination. But once I draw it like this, well, it's a lot easier to apply a boundary, 764 01:36:27,600 --> 01:36:35,520 right? Now our SVM could be maybe something like this. And now you see that we've divided 765 01:36:35,520 --> 01:36:41,600 our data set. Now it's separable, where one class is this way. And the other class is that way. 766 01:36:42,800 --> 01:36:49,360 Okay, so that's known as SVMs. I do highly suggest that, you know, any of these models that we just 767 01:36:49,359 --> 01:36:54,399 mentioned, if you're interested in them, do go more in depth mathematically into them. Like how 768 01:36:54,399 --> 01:37:00,239 do we find this hyperplane? Right? I'm not going to go over that in this specific course, 769 01:37:00,239 --> 01:37:05,840 because you're just learning what an SVM is. But it's a good idea to know, oh, okay, this is the 770 01:37:05,840 --> 01:37:13,039 technique behind finding, you know, how exactly you define the hyperplane 771 01:37:13,039 --> 01:37:19,519 that we're going to use. So anyways, this transformation that we did down here, this is known 772 01:37:19,520 --> 01:37:26,560 as the kernel trick. So when we go from x to the coordinates x and then x squared, 773 01:37:27,119 --> 01:37:31,599 what we're doing is we are applying a kernel. So that's why it's called the kernel trick. 774 01:37:33,279 --> 01:37:40,159 So SVMs are actually really powerful. And you'll see that here. So from sk learn.svm, we are going 775 01:37:40,159 --> 01:37:48,800 to import SVC. And SVC is our support vector classifier. So with this, so with our SVM model, 776 01:37:49,600 --> 01:37:59,840 we are going to, you know, create the SVC model. And we are going to, again, fit this to X train, I 777 01:37:59,840 --> 01:38:06,560 could have just copied and pasted this, I should be able to do that. So we're going to create the SVC 778 01:38:06,560 --> 01:38:10,480 again, fit this to X train, I could have just copied and pasted this, I should have probably 779 01:38:10,479 --> 01:38:23,119 done that. Okay, taking a bit longer. All right. Let's predict using our SVM model. And here, 780 01:38:23,760 --> 01:38:28,880 let's see if I can hover over this. Right. So again, you see a lot of these different 781 01:38:28,880 --> 01:38:37,119 parameters here that you can go back and change if you were creating a production level model. Okay, 782 01:38:37,119 --> 01:38:46,319 but in this specific case, we'll just use it out of the box again.
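[A minimal sketch of the SVC steps being described, assuming the same X_train, y_train, X_test, and y_test split from earlier in the notebook; the same three lines work for the other scikit-learn classifiers above by swapping in a different model class.]

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

svm_model = SVC()                     # default parameters, "out of the box"
svm_model.fit(X_train, y_train)       # this is the step that can take a little while
y_pred = svm_model.predict(X_test)
print(classification_report(y_test, y_pred))
```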
So if I make predictions, 783 01:38:46,319 --> 01:38:53,119 you'll note that wow, the accuracy actually jumps to 87% with the SVM. And even with class zero, 784 01:38:53,119 --> 01:38:59,199 there's nothing less than, you know, point eight, which is great. And for class one, 785 01:38:59,199 --> 01:39:03,359 I mean, everything's at 0.9, which is higher than anything that we had seen to this point. 786 01:39:06,640 --> 01:39:11,360 So so far, we've gone over four different classification models, we've done SVM, 787 01:39:11,359 --> 01:39:17,039 logistic regression, naive Bayes, and KNN. And these are just simple ways on how to implement 788 01:39:17,039 --> 01:39:23,760 them. Each of these, they have different, you know, they have different hyper parameters that you can 789 01:39:23,760 --> 01:39:31,920 go and you can toggle. And you can try to see if that helps later on or not. But for the most part, 790 01:39:31,920 --> 01:39:40,800 they perform, they give us around 70 to 80% accuracy. Okay, with SVM being the best. Now, 791 01:39:40,800 --> 01:39:45,440 let's see if we can actually beat that using a neural net. Now the final type of model that 792 01:39:45,439 --> 01:39:51,839 I wanted to talk about is known as a neural net or neural network. And neural nets look something 793 01:39:51,840 --> 01:39:58,480 like this. So you have an input layer, this is where all your features would go. And they have 794 01:39:58,479 --> 01:40:03,199 all these arrows pointing to some sort of hidden layer. And then all these arrows point to some 795 01:40:03,199 --> 01:40:10,559 sort of output layer. So what does all this mean? Each of these nodes in here is 796 01:40:10,560 --> 01:40:18,160 something known as a neuron. Okay, so that's a neuron, in a neural net. These are all of our 797 01:40:18,159 --> 01:40:23,199 features that we're inputting into the neural net. So that might be x zero, x one, all the way through 798 01:40:23,840 --> 01:40:28,880 x n. Right. And these are the features that we talked about there, they might be, you know, 799 01:40:28,880 --> 01:40:38,720 the pregnancy, the BMI, the age, etc. Now all of these get weighted by some value. So they 800 01:40:38,720 --> 01:40:44,240 are multiplied by some w number that applies to that one specific category, that one specific 801 01:40:44,239 --> 01:40:51,840 feature. So these two get multiplied. And the sum of all of these goes into that neuron. Okay, 802 01:40:51,840 --> 01:40:58,400 so basically, I'm taking w zero times x zero. And then I'm adding x one times w one, and then 803 01:40:58,399 --> 01:41:05,359 I'm adding, you know, x two times w two, etc, all the way to x n times w n. And that's getting 804 01:41:05,359 --> 01:41:11,199 input into the neuron. Now I'm also adding this bias term, which just means, okay, I might want 805 01:41:11,199 --> 01:41:17,199 to shift this by a little bit. So I might add five or I might add 0.1 or I might subtract 100, 806 01:41:17,199 --> 01:41:24,960 I don't know. But we're going to add this bias term. And the output of all these things, so 807 01:41:24,960 --> 01:41:31,279 the sum of this, this, this and this, goes into something known as an activation function, 808 01:41:31,279 --> 01:41:38,960 okay. And then after applying this activation function, we get an output. And this is what a 809 01:41:38,960 --> 01:41:44,399 neuron would look like. Now a whole network of them would look something like this. 810 01:41:46,000 --> 01:41:53,760 So I kind of glossed over this activation function.
What exactly is that? This is how a neural net 811 01:41:53,760 --> 01:41:58,720 looks like if we have all our inputs here. And let's say all of these arrows represent some sort 812 01:41:58,720 --> 01:42:08,159 of addition, right? Then what's going on is we're just adding a bunch of times, right? We're adding 813 01:42:08,159 --> 01:42:13,840 the some sort of weight times these input layer a bunch of times. And then if we were to go back 814 01:42:13,840 --> 01:42:22,000 and factor that all out, then this entire neural net is just a linear combination of these input 815 01:42:22,000 --> 01:42:27,840 layers, which I don't know about you, but that just seems kind of useless, right? Because we could 816 01:42:27,840 --> 01:42:33,279 literally just write that out in a formula, why would we need to set up this entire neural network, 817 01:42:33,279 --> 01:42:40,000 we wouldn't. So the activation function is introduced, right? So without an activation 818 01:42:40,000 --> 01:42:46,880 function, this just becomes a linear model. An activation function might look something like 819 01:42:46,880 --> 01:42:52,880 this. And as you can tell, these are not linear. And the reason why we introduce these is so that 820 01:42:52,880 --> 01:42:58,480 our entire model doesn't collapse on itself and become a linear model. So over here, this is 821 01:42:58,479 --> 01:43:04,079 something known as a sigmoid function, it runs between zero and one, tanh runs between negative 822 01:43:04,079 --> 01:43:10,720 one all the way to one. And this is ReLU, which anything less than zero is zero, and then anything 823 01:43:10,720 --> 01:43:18,640 greater than zero is linear. So with these activation functions, every single output of a neuron 824 01:43:18,640 --> 01:43:24,160 is no longer just the linear combination of these, it's some sort of altered linear state, which means 825 01:43:24,159 --> 01:43:32,880 that the input into the next neuron is, you know, it doesn't it doesn't collapse on itself, it doesn't 826 01:43:32,880 --> 01:43:39,920 become linear, because we've introduced all these nonlinearities. So this is a training set, the 827 01:43:39,920 --> 01:43:45,440 model, the loss, right? And then we do this thing called training, where we have to feed the loss 828 01:43:45,439 --> 01:43:53,199 back into the model, and make certain adjustments to the model to improve this predicted output. 829 01:43:55,199 --> 01:43:59,359 Let's talk a little bit about the training, what exactly goes on during that step. 830 01:44:00,720 --> 01:44:07,600 Let's go back and take a look at our L2 loss function. This is what our L2 loss function 831 01:44:07,600 --> 01:44:15,840 looks like it's a quadratic formula, right? Well, up here, the error is really, really, really, really 832 01:44:15,840 --> 01:44:23,199 large. And our goal is to get somewhere down here, where the loss is decreased, right? Because that 833 01:44:23,199 --> 01:44:30,720 means that our predicted value is closer to our true value. So that means that we want to go 834 01:44:30,720 --> 01:44:39,680 this way. Okay. And thanks to a lot of properties of math, something that we can do is called 835 01:44:39,680 --> 01:44:53,680 gradient descent, in order to follow this slope down this way. This quadratic is, it has different 836 01:44:53,680 --> 01:45:02,560 different slopes with respect to some value. Okay, so the loss with respect to some weight 837 01:45:03,119 --> 01:45:12,479 w zero, versus w one versus w n, they might all be different. Right? 
So some way that I kind of 838 01:45:12,479 --> 01:45:18,319 think about it is, to what extent is this value contributing to our loss. And we can actually 839 01:45:18,319 --> 01:45:24,399 figure that out through some calculus, which we're not going to touch up on in this specific course. 840 01:45:24,399 --> 01:45:29,599 But if you want to learn more about neural nets, you should probably also learn some calculus 841 01:45:29,600 --> 01:45:35,360 and figure out what exactly back propagation is doing, in order to actually calculate, you know, 842 01:45:35,359 --> 01:45:41,759 how much do we have to backstep by. So the thing is here, you might notice that this follows 843 01:45:41,760 --> 01:45:48,480 this curve at all of these different points. And the closer we get to the bottom, the smaller 844 01:45:48,479 --> 01:45:57,839 this step becomes. Now stick with me here. So my new value, this is what we call a weight update, 845 01:45:57,840 --> 01:46:04,800 I'm going to take w zero, and I'm going to set some new value for w zero. And what I'm going to 846 01:46:04,800 --> 01:46:12,800 set for that is the old value of w zero, plus some factor, which I'll just call alpha for now, 847 01:46:13,680 --> 01:46:22,400 times whatever this arrow is. So that's basically saying, okay, take our old w zero, our old weight, 848 01:46:23,039 --> 01:46:30,000 and just decrease it this way. So I guess increase it in this direction, right, like take a step in 849 01:46:30,000 --> 01:46:34,640 this direction. But this alpha here is telling us, okay, don't don't take a huge step, right, 850 01:46:34,640 --> 01:46:38,800 just in case we're wrong, take a small step, take a small step in that direction, see if we get any 851 01:46:38,800 --> 01:46:45,760 closer. And for those of you who, you know, do want to look more into the mathematics of things, 852 01:46:45,760 --> 01:46:51,840 the reason why I use a plus here is because this here is the negative gradient, right, if this were 853 01:46:51,840 --> 01:46:54,720 just the if you were to use the actual gradient, this should be a minus. 854 01:46:54,720 --> 01:47:00,560 Now this alpha is something that we call the learning rate. Okay, and that adjusts how quickly 855 01:47:00,560 --> 01:47:07,280 we're taking steps. And that might, you know, tell our that that will ultimately control 856 01:47:07,840 --> 01:47:13,039 how long it takes for our neural net to converge. Or sometimes if you set it too high, it might even 857 01:47:13,039 --> 01:47:21,840 diverge. But with all of these weights, so here I have w zero, w one, and then w n. We make the same 858 01:47:21,840 --> 01:47:29,840 update to all of them after we calculate the loss, the gradient of the loss with respect to that 859 01:47:29,840 --> 01:47:37,680 weight. So that's how back propagation works. And that is everything that's going on here. After we 860 01:47:37,680 --> 01:47:42,880 calculate the loss, we're calculating gradients, making adjustments in the model. So we're setting 861 01:47:42,880 --> 01:47:50,480 all the all the weights to something adjusted slightly. And then we're going to calculate the 862 01:47:50,479 --> 01:47:55,119 gradient. And then we're saying, Okay, let's take the training set and run it through the model 863 01:47:55,119 --> 01:48:01,840 again, and go through this loop all over again. So for machine learning, we already have seen some 864 01:48:01,840 --> 01:48:09,039 libraries that we use, right, we've already seen SK learn. 
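[A toy sketch of the weight update just described, using an L2 loss and a single weight; alpha is the learning rate, and the gradient is worked out by hand for this particular loss. Illustrative only, not notebook code.]

```python
import numpy as np

# toy data: y is roughly 3 * x, and we want to learn the weight w
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0        # initial weight
alpha = 0.01   # learning rate: how big a step to take each update

for step in range(200):
    y_pred = w * x
    # L2 loss is mean((y_pred - y)^2); its gradient with respect to w is 2 * mean((y_pred - y) * x)
    grad = 2 * np.mean((y_pred - y) * x)
    # step in the direction of the negative gradient, i.e. downhill
    w = w - alpha * grad

print(w)  # ends up close to 3
```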
But when we start going into neural 865 01:48:09,039 --> 01:48:19,920 networks, this is kind of what we're trying to program. And it's not very fun to try to 866 01:48:19,920 --> 01:48:25,760 do this from scratch, because not only will we probably have a lot of bugs, but it's also probably 867 01:48:25,760 --> 01:48:30,159 not going to be fast enough, right? Wouldn't it be great if there were just some, you know, 868 01:48:30,800 --> 01:48:35,760 full time professionals that are dedicated to solving this problem, and they could literally 869 01:48:35,760 --> 01:48:43,360 just give us their code that's already running really fast? Well, the answer is, yes, that exists. 870 01:48:43,359 --> 01:48:49,359 And that's why we use TensorFlow. So TensorFlow makes it really easy to define these models. But 871 01:48:49,359 --> 01:48:55,599 we also have enough control over what exactly we're feeding into this model. So for example, 872 01:48:55,600 --> 01:49:02,640 this line here is basically saying, okay, let's create a sequential neural net. So sequential is 873 01:49:02,640 --> 01:49:08,000 just, you know, what we've seen here, it just goes one layer to the next. And 874 01:49:08,000 --> 01:49:13,359 a dense layer means that all of them are interconnected. So here, this is interconnected with all of these 875 01:49:13,359 --> 01:49:19,839 nodes, and this one's all these, and then this one gets connected to all of the next ones, and so on. 876 01:49:19,840 --> 01:49:26,800 So we're going to create 16 dense nodes with relu activation functions. And then we're going 877 01:49:26,800 --> 01:49:34,000 to create another layer of 16 dense nodes with relu activation. And then our output layer is going 878 01:49:34,000 --> 01:49:43,199 to be just one node. Okay. And that's how easy it is to define something in TensorFlow. So TensorFlow 879 01:49:43,199 --> 01:49:51,199 is an open source library that helps you develop and train your ML models. Let's implement this 880 01:49:51,199 --> 01:49:57,119 for a neural net. So we're using a neural net for classification now. So our neural net model, 881 01:49:58,239 --> 01:50:03,840 we are going to use TensorFlow, and I don't think I imported that up here. So we are going to import 882 01:50:03,840 --> 01:50:18,400 that down here. So I'm going to import TensorFlow as TF. And enter. Cool. So my neural net model 883 01:50:19,279 --> 01:50:28,159 is going to be, I'm going to use this. So essentially, this is saying: layer all these 884 01:50:28,159 --> 01:50:35,039 things that I'm about to pass in. So yeah, layer them as a linear stack of layers, layer them as a model. 885 01:50:35,760 --> 01:50:40,560 And what that means, nope, not that. So what that means is I can pass in 886 01:50:42,720 --> 01:50:46,560 some sort of layer, and I'm just going to use a dense layer. 887 01:50:46,560 --> 01:50:56,560 Oops, dot dense. And let's say we have 32 units. Okay, I will also 888 01:51:01,279 --> 01:51:09,599 set the activation as relu. And at first we have to specify the input shape. So here we have 10, 889 01:51:09,600 --> 01:51:19,680 comma. Alright. Alright, so that's our first layer. Now our next layer, I'm just going to have 890 01:51:19,680 --> 01:51:28,880 another dense layer of 32 units, all using relu. And that's it. So for the final layer, this is 891 01:51:28,880 --> 01:51:35,760 just going to be my output layer, it's going to just be one node. And the activation is going to 892 01:51:35,760 --> 01:51:43,119 be sigmoid.
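[The model as dictated above, collected in one place; a sketch assuming the 10 input features of this dataset.]

```python
import tensorflow as tf

nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),  # first hidden layer
    tf.keras.layers.Dense(32, activation='relu'),                     # second hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),                   # output: probability of class 1
])
```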
So if you recall from our logistic regression, what happened there was when we had 893 01:51:43,119 --> 01:51:49,599 a sigmoid, it looks something like this, right? So by creating a sigmoid activation on our last layer, 894 01:51:49,600 --> 01:51:56,720 we're essentially projecting our predictions to be zero or one, just like in logistic regression. 895 01:51:57,439 --> 01:52:03,279 And that's going to help us, you know, we can just round to zero or one and classify that way. 896 01:52:03,279 --> 01:52:12,000 Okay. So this is my neural net model. And I'm going to compile this. So in TensorFlow, 897 01:52:12,000 --> 01:52:17,520 we have to compile it. It's really cool, because I can just literally pass in what type of optimizer 898 01:52:17,520 --> 01:52:23,840 I want, and it'll do it. So here, if I go to optimizers, I'm actually going to use Adam. 899 01:52:24,720 --> 01:52:31,039 And you'll see that, you know, the learning rate is 0.001. So I'm just going to use that default. 900 01:52:31,039 --> 01:52:44,800 So 0.001. And my loss is going to be binary cross entropy. And the metrics that I'm also going to 901 01:52:44,800 --> 01:52:50,079 include on here, so it already will consider loss, but I'm also going to tack on accuracy. 902 01:52:50,079 --> 01:52:55,600 So we can actually see that in a plot later on. Alright, so I'm going to run this. 903 01:52:55,600 --> 01:53:01,760 And one thing that I'm going to also do is I'm going to define these plot definitions. So I'm 904 01:53:01,760 --> 01:53:06,800 actually copying and pasting this, I got these from TensorFlow. So if you go on to some TensorFlow 905 01:53:06,800 --> 01:53:13,119 tutorial, they actually have these, this like, defined. And that's exactly what I'm doing here. 906 01:53:13,119 --> 01:53:18,239 So I'm actually going to move this cell up, run that. So we're basically plotting the loss 907 01:53:18,239 --> 01:53:23,519 over all the different epochs. Epochs means like training cycles. And we're going to run that. 908 01:53:23,520 --> 01:53:27,680 And we're going to plot the accuracy over all the epochs. 909 01:53:28,960 --> 01:53:36,079 Alright, so we have our model. And now all that's left is, let's train it. Okay. 910 01:53:37,199 --> 01:53:42,720 So I'm going to say history. So TensorFlow is great, because it keeps track of the history 911 01:53:42,720 --> 01:53:47,680 of the training, which is why we can go and plot it later on. Now I'm going to set that equal to 912 01:53:47,680 --> 01:53:59,280 this neural net model, and fit that with x train, y train. I'm going to make the number of epochs 913 01:53:59,279 --> 01:54:06,159 equal to, let's say, let's just use 100 for now. And the batch size, I'm going to set equal to, 914 01:54:06,159 --> 01:54:18,159 let's say, 32. Alright. And the validation split. So what the validation split does, if it's down 915 01:54:18,159 --> 01:54:23,920 here somewhere. Okay, so yeah, this validation split is just the fraction of the training data 916 01:54:23,920 --> 01:54:31,119 to be used as validation data. So essentially, every single epoch, what's going on is TensorFlow is 917 01:54:31,119 --> 01:54:37,199 saying, leave certain data out; if this is point two, then leave 20% out. And we're going to test how the 918 01:54:37,199 --> 01:54:42,559 model performs on that 20% that we've left out. Okay, so it's basically like our validation data 919 01:54:42,560 --> 01:54:48,800 set. But TensorFlow does it on our training data set during the training.
So we have now a measure 920 01:54:48,800 --> 01:54:54,640 outside of just our validation data set to see, you know, what's going on. So validation split, 921 01:54:54,640 --> 01:55:05,760 I'm going to make that 0.2. And we can run this. So if I run that, all right, and I'm actually going 922 01:55:05,760 --> 01:55:13,760 to set verbose equal to zero, which means, okay, don't print anything, because printing something 923 01:55:13,760 --> 01:55:19,680 for 100 epochs might get kind of annoying. So I'm just going to let it run, let it train, 924 01:55:19,680 --> 01:55:31,039 and then we'll see what happens. Cool, so it finished training. And now what I can do is 925 01:55:31,039 --> 01:55:36,960 because you know, I've already defined these two functions, I can go ahead and I can plot the loss, 926 01:55:36,960 --> 01:55:45,199 oops, loss of that history. And I can also plot the accuracy throughout the training. 927 01:55:45,199 --> 01:55:52,239 So this is a little bit ish what we're looking for. We definitely are looking for a steadily 928 01:55:52,239 --> 01:55:59,119 decreasing loss and an increasing accuracy. So here we do see that, you know, our validation 929 01:55:59,119 --> 01:56:07,199 accuracy improves from around point seven, seven or something all the way up to somewhere around 930 01:56:07,199 --> 01:56:16,880 point, maybe eight one. And our loss is decreasing. So this is good. It is expected that the validation 931 01:56:16,880 --> 01:56:23,359 loss and accuracy is performing worse than the training loss or accuracy. And that's because 932 01:56:23,359 --> 01:56:28,479 our model is training on that data. So it's adapting to that data. Whereas the validation stuff is, 933 01:56:28,479 --> 01:56:35,759 you know, stuff that it hasn't seen yet. So, so that's why. So in machine learning, as we saw above, 934 01:56:35,760 --> 01:56:40,159 we could change a bunch of the parameters, right? Like I could change this to 64. So now it'd be 935 01:56:40,159 --> 01:56:46,960 a row of 64 nodes, and then 32, and then one. So I can change some of these parameters. 936 01:56:47,680 --> 01:56:53,039 And a lot of machine learning is trying to find, hey, what do we set these hyper parameters to? 937 01:56:54,399 --> 01:57:02,079 So what I'm actually going to do is I'm going to rewrite this so that we can do something what's 938 01:57:02,079 --> 01:57:08,079 known as a grid search. So we can search through an entire space of hey, what happens if, you know, 939 01:57:08,079 --> 01:57:19,199 we have 64 nodes and 64 nodes, or 16 nodes and 16 nodes, and so on. And then on top of all that, 940 01:57:19,199 --> 01:57:26,639 we can, you know, we can change this learning rate, we can change how many epochs we can change, 941 01:57:26,640 --> 01:57:33,039 you know, the batch size, all these things might affect our training. And just for kicks, 942 01:57:33,039 --> 01:57:42,000 I'm also going to add what's known as a dropout layer in here. And what dropout is doing is 943 01:57:42,000 --> 01:57:51,119 saying, hey, randomly choose with at this rate, certain nodes, and don't train them in, you know, 944 01:57:51,119 --> 01:57:59,760 in a certain iteration. So this helps prevent overfitting. Okay, so I'm actually going to 945 01:57:59,760 --> 01:58:06,720 define this as a function called train model, we're going to pass in x train, y train, 946 01:58:07,920 --> 01:58:15,760 the number of nodes, the dropout, you know, the probability that we just talked about 947 01:58:15,760 --> 01:58:27,199 learning rate. 
So I'm actually going to say lr, batch size. And we can also pass in number of epochs, 948 01:58:27,199 --> 01:58:34,319 right? I mentioned that as a parameter. So indent this, so it goes under here. And with these two, 949 01:58:34,319 --> 01:58:40,799 I'm going to set this equal to number of nodes. And now with the two dropout layers, I'm going 950 01:58:40,800 --> 01:58:48,720 to set dropout prob. So now you know, the probability of turning off a node during the training 951 01:58:48,720 --> 01:58:55,360 is equal to dropout prob. And I'm going to keep the output layer the same. Now I'm compiling it, 952 01:58:55,359 --> 01:59:00,479 but this here is now going to be my learning rate. And I still want binary cross entropy and 953 01:59:00,479 --> 01:59:12,639 accuracy. We are actually going to train our model inside of this function. But here we can do the 954 01:59:12,640 --> 01:59:19,200 epochs equal epochs, and this is equal to whatever, you know, we're passing in. x train, 955 01:59:19,199 --> 01:59:25,279 y train belong right here. Okay, so those are getting passed in as well. And finally, at the 956 01:59:25,279 --> 01:59:38,159 end, I'm going to return this model and the history of that model. Okay. So now what I'll do 957 01:59:40,399 --> 01:59:46,399 is let's just go through all of these. So let's say let's keep epochs at 100. And now what I can 958 01:59:46,399 --> 01:59:53,279 do is I can say, hey, for a number of nodes in, let's say, let's do 16, 32, and 64, to see what 959 01:59:53,279 --> 02:00:02,960 happens for the different dropout probabilities. And I mean, zero would be nothing. Let's use 0.2. 960 02:00:02,960 --> 02:00:17,199 Also, to see what happens. You know, for the learning rate in 0.005, 0.001. And you know, 961 02:00:17,199 --> 02:00:27,359 maybe we want to throw on 0.1 in there as well. And then for the batch size, let's do 16, 32, 962 02:00:27,359 --> 02:00:33,119 64 as well. Actually, and let's also throw in 128. Actually, let's get rid of 16. Sorry, 963 02:00:33,680 --> 02:00:44,079 so 128 in there. That should be 01. I'm going to record the model and history using this 964 02:00:44,079 --> 02:00:54,640 train model here. So we're going to do x train, y train, the number of nodes is going to be, 965 02:00:54,640 --> 02:01:04,240 you know, the number of nodes that we've defined here, dropout prob, lr, batch size, and epochs. 966 02:01:04,239 --> 02:01:10,479 Okay. And then now we have both the model and the history. And what I'm going to do is again, 967 02:01:10,479 --> 02:01:18,079 I want to plot the loss for the history. I'm also going to plot the accuracy. 968 02:01:19,840 --> 02:01:22,640 Probably should have done them side by side, that probably would have been easier. 969 02:01:26,319 --> 02:01:34,399 Okay, so what I'm going to do is split this up. And that will be 970 02:01:34,399 --> 02:01:41,039 the subplots. So now this is just saying, okay, I want one row and two columns in that row for my 971 02:01:41,039 --> 02:01:56,000 plots. Okay, so I'm going to plot on my axis one, the loss. I don't actually know if this is going to 972 02:01:56,000 --> 02:02:04,640 work. Okay, we don't care about the grid. Yeah, let's keep the grid. And then now my other. 973 02:02:09,199 --> 02:02:14,800 So now on here, I'm going to plot all the accuracies on the second plot. 974 02:02:20,159 --> 02:02:21,840 I might have to debug this a bit. 975 02:02:21,840 --> 02:02:27,680 We should be able to get rid of that.
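As a reference, here is a minimal sketch of the train_model helper being written in this stretch, assuming the same 10-feature input as before; exact details may differ slightly from the notebook:

```python
from tensorflow.keras import layers

def train_model(x_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs):
    # Two hidden layers of num_nodes relu units, each followed by a dropout
    # layer that randomly disables nodes during training to reduce overfitting.
    model = tf.keras.Sequential([
        layers.Dense(num_nodes, activation='relu', input_shape=(10,)),
        layers.Dropout(dropout_prob),
        layers.Dense(num_nodes, activation='relu'),
        layers.Dropout(dropout_prob),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    history = model.fit(
        x_train, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.2,
        verbose=0,
    )
    return model, history
```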
If we run this, we already have history saved as a variable 976 02:02:27,680 --> 02:02:36,800 in here. So if I just run it on this, okay, it has no attribute x label. Oh, I think it's because 977 02:02:36,800 --> 02:02:47,680 it's like set x label or something. Okay, yeah, so it's, it's set instead of just x label, y label. 978 02:02:47,680 --> 02:02:54,480 So let's see if that works. All right, cool. Um, and let's actually make this a bit larger. 979 02:02:55,439 --> 02:02:59,919 Okay, so we can actually change the figure size that I'm gonna set. Let's see what happens if I 980 02:02:59,920 --> 02:03:08,159 set that to. Oh, that's not the way I wanted it. Okay, so that looks reasonable. 981 02:03:08,159 --> 02:03:13,920 And that's just going to be my plot history function. So now I can plot them side by side. 982 02:03:15,279 --> 02:03:23,279 Here, I'm going to plot the history. And what I'm actually going to do is I so here, first, 983 02:03:23,279 --> 02:03:26,079 I'm going to print out all these parameters. So I'm going to print out 984 02:03:27,359 --> 02:03:34,960 the F string to print out all of this stuff. So here, I'm going to print out all these parameters. 985 02:03:34,960 --> 02:03:42,720 Uh, all of this stuff. So here, I'm printing out how many nodes, um, the dropout probability, 986 02:03:45,600 --> 02:03:46,880 uh, the learning rate. 987 02:03:55,199 --> 02:03:57,519 And we already know how many you found, so I'm not even going to bother with that. 988 02:03:57,520 --> 02:04:10,560 So once we plot this, uh, let's actually also figure out what the, um, what the validation 989 02:04:10,560 --> 02:04:15,680 losses on our validation set that we have that we created all the way back up here. 990 02:04:16,720 --> 02:04:23,760 Alright, so remember, we created three data sets. Let's call our model and evaluate what the 991 02:04:23,760 --> 02:04:32,640 validation data with the validation data sets loss would be. And I actually want to record, 992 02:04:33,520 --> 02:04:38,160 let's say I want to record whatever model has the least validation loss. So 993 02:04:40,640 --> 02:04:45,360 first, I'm going to initialize that to infinity so that you know, any model will beat that score. 994 02:04:45,359 --> 02:04:53,599 So if I do float infinity, that will set that to infinity. And maybe I'll keep 995 02:04:53,600 --> 02:04:58,640 track of the parameters. Actually, it doesn't really matter. I'm just going to keep track of 996 02:04:58,640 --> 02:05:06,480 the model. And I'm gonna set that to none. So now down here, if the validation loss is ever 997 02:05:06,479 --> 02:05:13,759 less than the least validation loss, then I am going to simply come down here and say, 998 02:05:13,760 --> 02:05:20,400 Hey, this validation for this least validation loss is now equal to the validation loss. 999 02:05:21,600 --> 02:05:30,480 And the least loss model is whatever this model is that just earned that validation loss. Okay. 1000 02:05:31,840 --> 02:05:40,319 So we are actually just going to let this run for a while. And then we're going to get our least 1001 02:05:40,319 --> 02:05:51,840 last model after that. So let's just run. All right, and now we wait. 1002 02:05:51,840 --> 02:06:12,079 All right, so we've finally finished training. And you'll notice that okay, down here, the loss 1003 02:06:12,079 --> 02:06:19,039 actually gets to like 0.29. The accuracy is around 88%, which is pretty good. So you might be wondering, 1004 02:06:19,039 --> 02:06:26,239 okay, why is this accuracy in this? 
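Here is a hedged sketch of the side-by-side plotting helper and the grid search with the least-validation-loss bookkeeping described here; X_valid and y_valid stand for the validation arrays created earlier, and the exact grid values follow the narration only loosely:

```python
import matplotlib.pyplot as plt

def plot_history(history):
    # One row, two columns: loss on the left, accuracy on the right.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history.history['loss'], label='loss')
    ax1.plot(history.history['val_loss'], label='val_loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Binary crossentropy')
    ax1.legend()
    ax1.grid(True)
    ax2.plot(history.history['accuracy'], label='accuracy')
    ax2.plot(history.history['val_accuracy'], label='val_accuracy')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.legend()
    ax2.grid(True)
    plt.show()

least_val_loss = float('inf')   # any model will beat infinity
least_loss_model = None
epochs = 100

for num_nodes in [16, 32, 64]:
    for dropout_prob in [0, 0.2]:
        for lr in [0.005, 0.001, 0.01]:          # learning rates, roughly as narrated
            for batch_size in [32, 64, 128]:
                print(f"{num_nodes} nodes, dropout {dropout_prob}, lr {lr}, batch size {batch_size}")
                model, history = train_model(x_train, y_train, num_nodes,
                                             dropout_prob, lr, batch_size, epochs)
                plot_history(history)
                # evaluate() returns [loss, accuracy]; keep the model with the lowest loss
                val_loss = model.evaluate(X_valid, y_valid, verbose=0)[0]
                if val_loss < least_val_loss:
                    least_val_loss = val_loss
                    least_loss_model = model
```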
Like, these are both the validation. So this accuracy here 1005 02:06:26,239 --> 02:06:30,319 is on the validation data set that we've defined at the beginning, right? And this one here, 1006 02:06:30,319 --> 02:06:35,840 this is actually taking 20% of our tests, our training set every time during the training, 1007 02:06:35,840 --> 02:06:41,199 and saying, Okay, how much of it do I get right now? You know, after this one step where I didn't 1008 02:06:41,199 --> 02:06:46,880 train with any of that. So they're slightly different. And actually, I realized later on 1009 02:06:46,880 --> 02:06:52,640 that I probably you know, probably what I should have done is over here, when we were defining 1010 02:06:54,640 --> 02:06:59,920 the model fit, instead of the validation split, you can define the validation data. 1011 02:07:00,479 --> 02:07:04,639 And you can pass in the validation data, I don't know if this is the proper syntax. But 1012 02:07:05,439 --> 02:07:09,439 that's probably what I should have done. But instead, you know, we'll just stick with what 1013 02:07:09,439 --> 02:07:16,719 we have here. So you'll see at the end, you know, with the 64 nodes, it seems like this is our best 1014 02:07:16,720 --> 02:07:24,880 performance 64 nodes with a dropout of 0.2, a learning rate of 0.001, and a batch size of 64. 1015 02:07:25,439 --> 02:07:31,439 And it does seem like yes, the validation, you know, the fake validation, but the validation 1016 02:07:34,000 --> 02:07:40,239 loss is decreasing, and then the accuracy is increasing, which is a good sign. Okay, 1017 02:07:40,239 --> 02:07:45,039 so finally, what I'm going to do is I'm actually just going to predict. So I'm going to take 1018 02:07:45,039 --> 02:07:50,960 this model, which we've called our least loss model, I'm going to take this model, 1019 02:07:50,960 --> 02:07:58,159 and I'm going to predict x test on that. And you'll see that it gives me some values that 1020 02:07:58,159 --> 02:08:02,159 are really close to zero and some that are really close to one. And that's because we have a sigmoid 1021 02:08:02,159 --> 02:08:11,920 output. So if I do this, and what I can do is I can cast them. So I'm going to say anything that's 1022 02:08:11,920 --> 02:08:20,239 greater than 0.5, set that to one. So if I actually, I think what happens if I do this? 1023 02:08:22,399 --> 02:08:29,759 Oh, okay, so I have to cast that as type. And so now you'll see that it's ones and zeros. And I'm 1024 02:08:29,760 --> 02:08:40,560 actually going to transform this into a column as well. So here I'm going to Oh, oops, I didn't 1025 02:08:40,560 --> 02:08:49,280 I didn't mean to do that. Okay, no, I wanted to just reshape it to that. So now it's one dimensional. 1026 02:08:49,279 --> 02:08:57,599 Okay. And using that we can actually just rerun the classification report based on these this 1027 02:08:57,600 --> 02:09:04,880 neural net output. And you'll see that okay, the the F ones are the accuracy gives us 87%. So it 1028 02:09:04,880 --> 02:09:12,560 seems like what happened here is the precision on class zero. So the hadrons has increased a bit, 1029 02:09:12,560 --> 02:09:19,840 but the recall decreased. But the F one score is still at a good point eight one. And for the other 1030 02:09:19,840 --> 02:09:24,480 class, it looked like the precision decreased a bit the recall increased for an overall F one score. 1031 02:09:25,039 --> 02:09:31,439 That's also been increased. I think I interpreted that properly. 
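A short sketch of the prediction step just described, assuming X_test and y_test are the held-out test arrays from earlier:

```python
from sklearn.metrics import classification_report

# Sigmoid outputs are values between 0 and 1; threshold at 0.5 and flatten
# to 1-D so they line up with the integer labels.
y_pred = least_loss_model.predict(X_test)
y_pred = (y_pred > 0.5).astype(int).reshape(-1,)

print(classification_report(y_test, y_pred))
```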
I mean, we went through all this 1032 02:09:31,439 --> 02:09:37,839 work and we got a model that performs actually very, very similarly to the SVM model that we 1033 02:09:37,840 --> 02:09:43,039 had earlier. And the whole point of this exercise was to demonstrate, okay, these are how you can 1034 02:09:43,039 --> 02:09:48,720 define your models. But it's also to say, hey, maybe, you know, neural nets are very, very 1035 02:09:48,720 --> 02:09:55,840 powerful, as you can tell. But sometimes, you know, an SVM or some other model might actually be more 1036 02:09:55,840 --> 02:10:03,360 appropriate. But in this case, I guess it didn't really matter which one we use at the end. An 87% 1037 02:10:04,399 --> 02:10:10,639 accuracy score is still pretty good. So yeah, let's now move on to regression. 1038 02:10:11,840 --> 02:10:17,039 We just saw a bunch of different classification models. Now let's shift gears into regression, 1039 02:10:17,039 --> 02:10:23,279 the other type of supervised learning. If we look at this plot over here, we see a bunch of scattered 1040 02:10:23,279 --> 02:10:31,439 data points. And here we have our x value for those data points. And then we have the corresponding y 1041 02:10:31,439 --> 02:10:40,079 value, which is now our label. And when we look at this plot, well, our goal in regression is to find 1042 02:10:40,079 --> 02:10:48,159 the line of best fit that best models this data. Essentially, we're trying to let's say we're given 1043 02:10:48,159 --> 02:10:54,159 some new value of x that we don't have in our sample, we're trying to say, okay, what would my 1044 02:10:54,159 --> 02:11:01,599 prediction for y be for that given x value. So that, you know, might be somewhere around there. 1045 02:11:03,279 --> 02:11:08,399 I don't know. But remember, in regression that, you know, given certain features, 1046 02:11:08,399 --> 02:11:12,079 we're trying to predict some continuous numerical value for y. 1047 02:11:12,079 --> 02:11:21,199 In linear regression, we want to take our data and fit a linear model to this data. So in this case, 1048 02:11:21,199 --> 02:11:30,079 our linear model might look something along the lines of here. Right. So this here would be 1049 02:11:30,079 --> 02:11:41,119 considered as maybe our line of best fit. And this line is modeled by the equation, I'm going to write 1050 02:11:41,119 --> 02:11:51,680 it down here, y equals b zero, plus b one x. Now b zero just means it's this y intercept. So if we 1051 02:11:51,680 --> 02:11:58,880 extend this y down here, this value here is b zero, and then b one defines the source of the 1052 02:11:58,880 --> 02:12:08,880 line, defines the slope of this line. Okay. All right. So that's the that's the formula 1053 02:12:09,680 --> 02:12:17,119 for linear regression. And how exactly do we come up with that formula? What are we trying to do 1054 02:12:17,119 --> 02:12:23,279 with this linear regression? You know, we could just eyeball where the line be, but humans are 1055 02:12:23,279 --> 02:12:29,279 not very good at eyeballing certain things like that. I mean, we can get close, but a computer is 1056 02:12:29,279 --> 02:12:37,519 better at giving us a precise value for b zero and b one. Well, let's introduce the concept of 1057 02:12:37,520 --> 02:12:47,200 something known as a residual. Okay, so residual, you might also hear this being called the error. 1058 02:12:47,199 --> 02:12:55,039 And what that means is, let's take some data point in our data set. 
And we're going to evaluate how 1059 02:12:55,039 --> 02:13:03,439 far off is our prediction from a data point that we already have. So this here is our y, let's say, 1060 02:13:04,000 --> 02:13:15,119 this is 1, 2, 3, 4, 5, 6, 7, 8. So this is y eight, let's call it, you'll see that I use this y i 1061 02:13:15,119 --> 02:13:23,039 in order to represent, hey, just one of these points. Okay. So this here is y eight, and this here 1062 02:13:23,039 --> 02:13:30,720 would be the prediction. Oops, this here would be the prediction for y eight, which I've labeled 1063 02:13:30,720 --> 02:13:35,199 with this hat. Okay, if it has a hat on it, that means, hey, this is my guess, this is 1064 02:13:35,199 --> 02:13:48,239 my prediction for, you know, this specific value of x. Okay. Now the residual would be this distance 1065 02:13:48,239 --> 02:13:58,719 here between y eight and y hat eight. So y eight minus y hat eight. All right, because that would 1066 02:13:58,720 --> 02:14:04,400 give us this here. And I'm just going to take the absolute value of this. Because what if it's below 1067 02:14:04,399 --> 02:14:08,879 the line, right, then you would get a negative value, but distance can't be negative. So we're 1068 02:14:08,880 --> 02:14:14,560 just going to put a little absolute value around this quantity. 1069 02:14:15,279 --> 02:14:23,519 And that gives us the residual or the error. So let me rewrite that. And you know, to generalize 1070 02:14:23,520 --> 02:14:32,960 to all the points, I'm going to say the residual can be calculated as y i minus y hat of i. Okay. 1071 02:14:32,960 --> 02:14:39,279 So this just means the distance between some given point, and its prediction, its corresponding 1072 02:14:39,279 --> 02:14:47,679 prediction on the line. So now, with this residual, this line of best fit is generally trying to 1073 02:14:47,680 --> 02:14:55,840 decrease these residuals as much as possible. So now that we have some value for the error, 1074 02:14:55,840 --> 02:15:00,640 our line of best fit is trying to decrease the error as much as possible for all of the different 1075 02:15:00,640 --> 02:15:07,840 data points. And that might mean, you know, minimizing the sum of all the residuals. So this 1076 02:15:07,840 --> 02:15:14,720 here, this is the sum symbol. And if I just stick the residual calculation in there, 1077 02:15:16,640 --> 02:15:21,200 it looks something like that, right. And I'm just going to say, okay, for all of the i's in our 1078 02:15:21,199 --> 02:15:27,679 data set, so for all the different points, we're going to sum up all the residuals. And I'm going 1079 02:15:27,680 --> 02:15:33,200 to try to decrease that with my line of best fit. So I'm going to find the b zero and b one which gives 1080 02:15:33,199 --> 02:15:41,679 me the lowest value of this. Okay. Now, you know, sometimes in different circumstances, 1081 02:15:41,680 --> 02:15:49,039 we might attach a squared to that. So we're trying to decrease the sum of the squared residuals. 1082 02:15:49,039 --> 02:16:03,519 And what that does is it just, you know, adds a higher penalty for, 1083 02:16:03,520 --> 02:16:07,920 you know, points that are further off.
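Written out as math, the residual and the objective being described here look roughly like this, for simple linear regression with intercept b0 and slope b1:

```latex
\hat{y}_i = b_0 + b_1 x_i, \qquad
\text{residual}_i = \left| y_i - \hat{y}_i \right|

\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\quad \text{or, with the squared version,} \quad
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```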
So that is linear regression, we're trying to find 1084 02:16:08,640 --> 02:16:15,520 this equation, some line of best fit that will help us decrease this measure of error 1085 02:16:15,520 --> 02:16:19,920 with respect to all the data points that we have in our data set, and try to come up with 1086 02:16:19,920 --> 02:16:27,760 the best prediction for all of them. This is known as simple linear regression. 1087 02:16:30,880 --> 02:16:39,520 And basically, that means, you know, our equation looks something like this. Now, there's also 1088 02:16:39,520 --> 02:16:52,479 multiple linear regression, which just means that hey, if we have more than one value for x, so like 1089 02:16:52,479 --> 02:16:58,559 think of our feature vectors, we have multiple values in our x vector, then our predictor might 1090 02:16:58,559 --> 02:17:11,199 look something more like this. Actually, I'm just going to say etc, plus b n, x n. So now I'm coming 1091 02:17:11,200 --> 02:17:18,960 up with some coefficient for all of the different x values that I have in my vector. Now you guys 1092 02:17:18,959 --> 02:17:23,039 might have noticed that I have some assumptions over here. And you might be asking, okay, Kylie, 1093 02:17:23,040 --> 02:17:26,560 what in the world do these assumptions mean? So let's go over them. 1094 02:17:26,559 --> 02:17:31,119 So let's go over them. The first one is linearity. 1095 02:17:33,840 --> 02:17:38,399 And what that means is, let's say I have a data set. Okay. 1096 02:17:43,760 --> 02:17:50,960 Linearity just means, okay, my does my data follow a linear pattern? Does y increase as x 1097 02:17:50,959 --> 02:17:59,279 increases? Or does y decrease at as x increases? Does so if y increases or decreases at a constant 1098 02:17:59,280 --> 02:18:04,720 rate as x increases, then you're probably looking at something linear. So what's the example of a 1099 02:18:04,719 --> 02:18:12,959 nonlinear data set? Let's say I had data that might look something like that. Okay. So now just 1100 02:18:12,959 --> 02:18:18,719 visually judging this, you might say, okay, seems like the line of best fit might actually be some 1101 02:18:18,719 --> 02:18:28,559 curve like this. Right. And in this case, we don't satisfy that linearity assumption anymore. 1102 02:18:29,680 --> 02:18:36,960 So with linearity, we basically just want our data set to follow some sort of linear trajectory. 1103 02:18:39,280 --> 02:18:42,640 And independence, our second assumption 1104 02:18:42,639 --> 02:18:50,079 just means this point over here, it should have no influence on this point over here, 1105 02:18:50,079 --> 02:18:55,039 or this point over here, or this point over here. So in other words, all the points, 1106 02:18:56,000 --> 02:19:03,440 all the samples in our data set should be independent. Okay, they should not rely on 1107 02:19:03,440 --> 02:19:05,840 one another, they should not affect one another. 1108 02:19:05,840 --> 02:19:17,120 Okay, now, normality and homoscedasticity, those are concepts which use this residual. Okay. So if 1109 02:19:17,120 --> 02:19:31,120 I have a plot that looks something like this, and I have a plot that looks like this. Okay, 1110 02:19:31,120 --> 02:19:45,680 something like this. And my line of best fit is somewhere here, maybe it's something like that. 1111 02:19:47,200 --> 02:19:52,000 In order to look at these normality and homoscedasticity assumptions, let's look at 1112 02:19:52,000 --> 02:20:03,440 the residual plot. Okay. 
And what that means is I'm going to keep my same x axis. But instead 1113 02:20:03,440 --> 02:20:09,360 of plotting now where they are relative to this y, I'm going to plot these errors. So now I'm 1114 02:20:09,360 --> 02:20:19,200 going to plot y minus y hat like this. Okay. And now you know, this one is slightly positive, 1115 02:20:19,200 --> 02:20:24,720 so it might be here, this one down here is negative, it might be here. So our residual plot, 1116 02:20:25,840 --> 02:20:30,079 it's literally just a plot of how you know, the values are distributed around our line of best 1117 02:20:30,079 --> 02:20:42,879 fit. So it looks like it might, you know, look something like this. Okay. So this might be our 1118 02:20:42,879 --> 02:20:55,279 residual plot. And what normality means, so our assumptions are normality and homoscedasticity, 1119 02:20:59,280 --> 02:21:05,120 I might have butchered that spelling, I don't really know. But what normality is saying is 1120 02:21:05,120 --> 02:21:12,960 saying, okay, these residuals should be normally distributed. Okay, around this line of best fit, 1121 02:21:12,959 --> 02:21:21,599 it should follow a normal distribution. And now what homoscedasticity says, okay, our variants 1122 02:21:21,600 --> 02:21:28,399 of these points should remain constant throughout. So this spread here should be approximately the 1123 02:21:28,399 --> 02:21:35,199 same as this spread over here. Now, what's an example of where you know, homoscedasticity is 1124 02:21:35,200 --> 02:21:43,920 not held? Well, let's say that our original plot actually looks something like this. 1125 02:21:46,479 --> 02:21:51,600 Okay, so now if we looked at the residuals for that, it might look something 1126 02:21:51,600 --> 02:22:03,600 like that. And now if we look at this spread of the points, it decreases, right? So now the spread 1127 02:22:03,600 --> 02:22:12,559 is not constant, which means that homoscedasticity, this assumption would not be fulfilled, and it 1128 02:22:12,559 --> 02:22:18,559 might not be appropriate to use linear regression. So that's just linear regression. Basically, 1129 02:22:18,559 --> 02:22:25,680 we have a bunch of data points, we want to predict some y value for those. And we're trying to come 1130 02:22:25,680 --> 02:22:32,639 up with this line of best fit that best describes, hey, given some value x, what would be my best 1131 02:22:32,639 --> 02:22:43,039 guess of what y is. So let's move on to how do we evaluate a linear regression model. So the first 1132 02:22:43,040 --> 02:22:49,600 measure that I'm going to talk about is known as mean absolute error, or MAE 1133 02:22:52,079 --> 02:22:59,039 for short, okay. And mean absolute error is basically saying, all right, let's take 1134 02:22:59,040 --> 02:23:06,080 all the errors. So all these residuals that we talked about, let's sum up the distance 1135 02:23:06,079 --> 02:23:11,440 for all of them, and then take the average. And then that can describe, you know, how far off are 1136 02:23:11,440 --> 02:23:18,319 we. So the mathematical formula for that would be, okay, let's take all the residuals. 1137 02:23:21,680 --> 02:23:27,440 Alright, so this is the distance. Actually, let me redraw a plot down here. So 1138 02:23:27,440 --> 02:23:41,440 suppose I have a data set, look like this. And here are all my data points, right. And now let's 1139 02:23:41,440 --> 02:23:52,319 say my line looks something like that. 
So my mean absolute error would be summing up all of these 1140 02:23:52,319 --> 02:24:01,600 values. This was a mistake. So summing up all of these, and then dividing by how many data points 1141 02:24:01,600 --> 02:24:07,760 I have. So what would be all the residuals, it would be y i, right, so every single point, 1142 02:24:08,639 --> 02:24:16,159 minus y hat i, so the prediction for that on here. And then we're going to sum over all of 1143 02:24:16,159 --> 02:24:24,319 all of the different i's in our data set. Right, so i, and then we divide by the number of points 1144 02:24:24,319 --> 02:24:29,119 we have. So actually, I'm going to rewrite this to make it a little clearer. So i is equal to 1145 02:24:29,120 --> 02:24:33,680 whatever the first data point is all the way through the nth data point. And then we divide 1146 02:24:33,680 --> 02:24:42,399 it by n, which is how many points there are. Okay, so this is our measure of mae. And this is basically 1147 02:24:42,399 --> 02:24:50,479 telling us, okay, in on average, this is the distance between our predicted value and the 1148 02:24:50,479 --> 02:25:01,359 actual value in our training set. Okay. And mae is good because it allows us to, you know, when we 1149 02:25:01,360 --> 02:25:08,720 get this value here, we can literally directly compare it to whatever units the y value is in. 1150 02:25:08,719 --> 02:25:17,920 So let's say y is we're talking, you know, the prediction of the price of a house, right, in 1151 02:25:17,920 --> 02:25:24,719 dollars. Once we have once we calculate the mae, we can literally say, oh, the average, you know, 1152 02:25:24,719 --> 02:25:34,319 price, the average, how much we're off by is literally this many dollars. Okay. So that's the 1153 02:25:34,319 --> 02:25:40,159 mean absolute error. An evaluation technique that's also closely related to that is called the mean 1154 02:25:40,159 --> 02:25:53,280 squared error. And this is MSE for short. Okay. Now, if I take this plot again, and I duplicated 1155 02:25:53,280 --> 02:25:59,360 and move it down here, well, the gist of mean squared error is kind of the same, but instead 1156 02:25:59,360 --> 02:26:06,159 of the absolute value, we're going to square. So now the MSE is something along the lines of, 1157 02:26:06,159 --> 02:26:11,920 okay, let's sum up something, right, so we're going to sum up all of our errors. 1158 02:26:13,280 --> 02:26:19,120 So now I'm going to do y i minus y hat i. But instead of absolute valuing them, 1159 02:26:19,120 --> 02:26:25,360 I'm going to square them all. And then I'm going to divide by n in order to find the mean. So 1160 02:26:25,360 --> 02:26:33,200 basically, now I'm taking all of these different values, and I'm squaring them first before I add 1161 02:26:33,200 --> 02:26:42,079 them to one another. And then I divide by n. And the reason why we like using mean squared error 1162 02:26:42,079 --> 02:26:49,680 is that it helps us punish large errors in the prediction. And later on, MSE might be important 1163 02:26:49,680 --> 02:26:55,760 because of differentiability, right? So a quadratic equation is differentiable, you know, 1164 02:26:55,760 --> 02:27:00,719 if you're familiar with calculus, a quadratic equation is differentiable, whereas the absolute 1165 02:27:00,719 --> 02:27:05,279 value function is not totally differentiable everywhere. But if you don't understand that, 1166 02:27:05,280 --> 02:27:10,560 don't worry about it, you won't really need it right now. 
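For reference, the two metrics written out on the board here are:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```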
And now one downside of mean squared 1167 02:27:10,559 --> 02:27:16,239 error is that once I calculate the mean squared error over here, and I go back over to y, and I 1168 02:27:16,239 --> 02:27:25,360 want to compare the values. Well, it gets a little bit trickier to do that because now my mean squared 1169 02:27:25,360 --> 02:27:33,280 error is in terms of y squared, right? This is now squared. So instead of just dollars, how, 1170 02:27:33,280 --> 02:27:40,079 you know, how many dollars off am I, I'm talking how many dollars squared off am I. And that, 1171 02:27:40,079 --> 02:27:45,440 you know, to humans, it doesn't really make that much sense. Which is why we have created 1172 02:27:45,440 --> 02:27:53,600 something known as the root mean squared error. And I'm just going to copy this diagram over here 1173 02:27:53,600 --> 02:28:02,559 because it's very, very similar to mean squared error. Except now we take a big square root. 1174 02:28:03,280 --> 02:28:10,640 Okay, so this is our MSE, and we take the square root of that mean squared error. And so now the 1175 02:28:10,639 --> 02:28:17,760 term in which, you know, we're defining our error is now in terms of that dollar sign symbol again. 1176 02:28:17,760 --> 02:28:23,280 So a pro of root mean squared error is that now we can say, okay, our error according 1177 02:28:23,280 --> 02:28:30,320 to this metric is this many dollar signs off from our predictor. Okay, so it's in the same unit, 1178 02:28:30,319 --> 02:28:37,680 which is one of the pros of root mean squared error. And now finally, there is the coefficient 1179 02:28:37,680 --> 02:28:43,200 of determination, or r squared. And this is the formula for r squared. So r squared is equal 1180 02:28:43,200 --> 02:28:55,200 to one minus RSS over TSS. Okay, so what does that mean? Basically, RSS stands for the sum 1181 02:28:56,639 --> 02:29:03,920 of the squared residuals. So maybe it should be SSR instead, but 1182 02:29:03,920 --> 02:29:14,079 RSS, the sum of the squared residuals, and this is equal to if I take the sum of all the values, 1183 02:29:14,799 --> 02:29:24,799 and I take y i minus y hat i, and square that, that is my RSS, right, it's a sum of the squared 1184 02:29:24,799 --> 02:29:30,639 residuals. Now TSS, let me actually use a different color for that. 1185 02:29:30,639 --> 02:29:38,479 So TSS is the total sum of squares. 1186 02:29:41,040 --> 02:29:46,640 And what that means is that instead of being with respect to this prediction, 1187 02:29:48,879 --> 02:29:52,079 we are instead going to 1188 02:29:52,079 --> 02:29:59,440 take each y value and just subtract the mean of all the y values, and square that. 1189 02:30:00,799 --> 02:30:03,119 Okay, so if I drew this out, 1190 02:30:13,520 --> 02:30:16,000 and if this were my 1191 02:30:16,000 --> 02:30:23,040 actually, let's use a different color. Let's use green. If this were my predictor, 1192 02:30:24,799 --> 02:30:33,039 so RSS is giving me this measure here, right? It's giving me some estimate of how far off we are from 1193 02:30:33,040 --> 02:30:41,840 our regressor that we predicted. Actually, I'm gonna take this one, and I'm gonna take this one, 1194 02:30:41,840 --> 02:30:52,639 and actually, I'm going to use red for that. Well, TSS, on the other hand, is saying, okay, 1195 02:30:52,639 --> 02:30:59,039 how far off are these values from the mean.
So if we literally didn't do any calculations for the 1196 02:30:59,040 --> 02:31:04,800 line of best fit, if we just took all the y values and average all of them, and said, hey, 1197 02:31:04,799 --> 02:31:10,159 this is the average value for every single x value, I'm just going to predict that average value 1198 02:31:10,159 --> 02:31:16,000 instead, then it's asking, okay, how far off are all these points from that line? 1199 02:31:19,120 --> 02:31:26,079 Okay, and remember that this square means that we're punishing larger errors, right? So even if 1200 02:31:26,079 --> 02:31:32,959 they look somewhat close in terms of distance, the further a few data points are, then the further 1201 02:31:32,959 --> 02:31:39,439 the larger our total sum of squares is going to be. Sorry, that was my dog. So the total sum of 1202 02:31:39,440 --> 02:31:44,960 squares is taking all of these values and saying, okay, what is the sum of squares, if I didn't do 1203 02:31:44,959 --> 02:31:51,119 any regressor, and I literally just calculated the average of all the y values in my data set, 1204 02:31:51,120 --> 02:31:55,440 and for every single x value, I'm just going to predict that average, which means that okay, 1205 02:31:55,440 --> 02:32:00,720 like, that means that maybe y and x aren't associated with each other at all. Like the 1206 02:32:00,719 --> 02:32:05,599 best thing that I can do for any new x value, just predict, hey, this is the average of my data set. 1207 02:32:05,600 --> 02:32:11,200 And this total sum of squares is saying, okay, well, with respect to that average, 1208 02:32:12,239 --> 02:32:19,920 what is our error? Right? So up here, the sum of the squared residuals, this is telling us what is 1209 02:32:19,920 --> 02:32:26,799 our what what is our error with respect to this line of best fit? Well, our total sum of squares 1210 02:32:26,799 --> 02:32:34,559 saying what is the error with respect to, you know, just the average y value. And if our line 1211 02:32:34,559 --> 02:32:44,639 of best fit is a better fit, then this total sum of squares, that means that you know, this numerator, 1212 02:32:46,079 --> 02:32:51,520 that means that this numerator is going to be smaller than this denominator, right? 1213 02:32:52,319 --> 02:32:59,600 And if our errors in our line of best fit are much smaller, then that means that this ratio 1214 02:32:59,600 --> 02:33:06,960 of the RSS over TSS is going to be very small, which means that R squared is going to go towards 1215 02:33:06,959 --> 02:33:14,319 one. And now when R squared is towards one, that means that that's usually a sign that we have a 1216 02:33:14,319 --> 02:33:24,719 good predictor. It's one of the signs, not the only one. So over here, I also have, you know, 1217 02:33:24,719 --> 02:33:29,840 that there's this adjusted R squared. And what that does, it just adjusts for the number of terms. 1218 02:33:29,840 --> 02:33:36,000 So x1, x2, x3, etc. It adjusts for how many extra terms we add, because usually when we, 1219 02:33:37,280 --> 02:33:42,480 you know, add an extra term, the R squared value will increase because that'll help us predict 1220 02:33:42,479 --> 02:33:48,879 y some more. But the value for the adjusted R squared increase if the new term actually 1221 02:33:48,879 --> 02:33:54,000 improves this model fit more than expected, you know, by chance. So that's what adjusted 1222 02:33:54,000 --> 02:33:58,159 R squared is. I'm not, you know, it's out of the scope of this one specific course. 
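Putting the last two metrics into symbols, with RSS the residual sum of squares and TSS the total sum of squares around the mean of the y values:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\qquad\qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
    = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```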
1223 02:33:58,159 --> 02:34:04,559 And now that's linear regression. Basically, I've covered the concept of residuals or errors. 1224 02:34:05,280 --> 02:34:11,040 And, you know, how do we use that in order to find the line of best fit? And you know, 1225 02:34:11,040 --> 02:34:15,200 our computer can do all the calculations for us, which is nice. But behind the scenes, 1226 02:34:15,200 --> 02:34:20,400 it's trying to minimize that error, right? And then we've gone through all the different 1227 02:34:20,399 --> 02:34:25,440 ways of actually evaluating a linear regression model and the pros and cons of each one. 1228 02:34:26,559 --> 02:34:31,760 So now let's look at an example. So we're still on supervised learning. But now we're just going to 1229 02:34:31,760 --> 02:34:37,120 talk about regression. So what happens when you don't just want to predict, you know, type 123? 1230 02:34:37,120 --> 02:34:43,840 What happens if you actually want to predict a certain value? So again, I'm on the UCI machine 1231 02:34:43,840 --> 02:34:54,399 learning repository. And here I found this data set about bike sharing in Seoul, South Korea. 1232 02:34:55,040 --> 02:35:01,520 So this data set is predicting rental bike count. And here it's the kind of bikes rented at each 1233 02:35:01,520 --> 02:35:08,159 hour. So what we're going to do, again, you're going to go into the data folder, and you're going 1234 02:35:08,159 --> 02:35:19,520 to download this CSV file. And we're going to move over to collab again. And here I'm going to name 1235 02:35:19,520 --> 02:35:29,680 this FCC bikes and regression. I don't remember what I called the last one. But yeah, FCC bikes 1236 02:35:29,680 --> 02:35:39,600 regression. Now I'm going to import a bunch of the same things that I did earlier. And, you know, 1237 02:35:39,600 --> 02:35:46,559 I'm going to also continue to import the oversampler and the standard scaler. And then I'm actually 1238 02:35:46,559 --> 02:35:52,799 also just going to let you guys know that I have a few more things I wanted import. So this is a 1239 02:35:52,799 --> 02:35:59,199 library that lets us copy things. Seaborn is a wrapper over a matplotlib. So it also allows us 1240 02:35:59,200 --> 02:36:03,280 to plot certain things. And then just letting you know that we're also going to be using 1241 02:36:03,280 --> 02:36:07,920 TensorFlow. Okay, so one more thing that we're also going to be using, we're going to use the 1242 02:36:07,920 --> 02:36:13,760 sklearn linear model library. Actually, let me make my screen a little bit bigger. So yeah, 1243 02:36:15,600 --> 02:36:25,120 awesome. Run this and that'll import all the things that we need. So again, I'm just going to, 1244 02:36:25,120 --> 02:36:34,960 you know, give some credit to where we got this data set. So let me copy and paste this UCI thing. 1245 02:36:38,000 --> 02:36:42,159 And I will also give credit to this here. 1246 02:36:46,559 --> 02:36:54,319 Okay, cool. All right, cool. So this is our data set. And again, it tells us all the different 1247 02:36:54,319 --> 02:37:01,520 attributes that we have right here. So I'm actually going to go ahead and paste this in here. 1248 02:37:05,280 --> 02:37:09,280 Feel free to copy and paste this if you want me to read it out loud, so you can type it. 1249 02:37:09,280 --> 02:37:18,960 It's byte count, hour, temp, humidity, wind, visibility, dew point, temp, radiation, rain, 1250 02:37:18,959 --> 02:37:27,279 snow, and functional, whatever that means. 
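A hedged sketch of the imports and column names being set up for this regression notebook; the exact column spellings are my own approximation of what is read out here:

```python
import copy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

# Column names for the Seoul bike sharing data, roughly as read out above.
dataset_cols = ["bike_count", "hour", "temp", "humidity", "wind", "visibility",
                "dew_pt_temp", "radiation", "rain", "snow", "functional"]
```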
Okay, so I'm going to come over here and import my data 1251 02:37:27,280 --> 02:37:34,800 by dragging and dropping. All right. Now, one thing that you guys might actually need to do is 1252 02:37:34,799 --> 02:37:41,359 you might actually have to open up the CSV because there were, at first, a few like forbidding 1253 02:37:41,360 --> 02:37:46,319 characters in mine, at least. So you might have to get rid of like, I think there was a degree here, 1254 02:37:46,319 --> 02:37:50,639 but my computer wasn't recognizing it. So I got rid of that. So you might have to go through 1255 02:37:50,639 --> 02:37:58,639 and get rid of some of those labels that are incorrect. I'm going to do this. Okay. But 1256 02:37:59,600 --> 02:38:07,040 after we've done that, we've imported in here, I'm going to create a data a data frame from that. So, 1257 02:38:07,040 --> 02:38:12,560 all right, so now what I can do is I can read that CSV file and I can get the data into here. 1258 02:38:12,559 --> 02:38:21,359 So so like data dot CSV. Okay, so now if I call data dot head, you'll see that I have all the 1259 02:38:21,360 --> 02:38:32,079 various labels, right? And then I have the data in there. So I'm going to from here, I'm actually 1260 02:38:32,079 --> 02:38:37,600 going to get rid of some of these columns that, you know, I don't really care about. So here, 1261 02:38:37,600 --> 02:38:44,159 I'm going to, when I when I type this in, I'm going to drop maybe the date, whether or not it's a 1262 02:38:44,159 --> 02:38:53,039 holiday, and the various seasons. So I'm just not going to care about these things. Access equals 1263 02:38:53,040 --> 02:38:59,120 one means drop it from the columns. So now you'll see that okay, we still have, I mean, 1264 02:38:59,120 --> 02:39:05,280 I guess you don't really notice it. But if I set the data frames columns equal to data set calls, 1265 02:39:05,280 --> 02:39:11,280 and I look at, you know, the first five things, then you'll see that this is now our data set. 1266 02:39:11,280 --> 02:39:17,520 It's a lot easier to read. So another thing is, I'm actually going to 1267 02:39:18,319 --> 02:39:24,239 df functional. And we're going to create this. So remember that our computers are not very good 1268 02:39:24,239 --> 02:39:30,000 at language, we want it to be in zeros and ones. So here, I will convert that. 1269 02:39:30,000 --> 02:39:39,920 Well, if this is equal to yes, then that that gets mapped as one. So then set type integer. All right. 1270 02:39:41,040 --> 02:39:48,560 Great. Cool. So the thing is, right now, these by counts are for whatever hour. So 1271 02:39:48,559 --> 02:39:52,559 to make this example simpler, I'm just going to index on an hour, and I'm gonna say, okay, 1272 02:39:52,559 --> 02:39:59,359 we're only going to use that specific hour. So I'm just going to index on an hour, and I'm 1273 02:39:59,360 --> 02:40:07,680 going to use an hour. So here, let's say. So this data frame is only going to be data frame where 1274 02:40:07,680 --> 02:40:17,600 the hour, let's say it equals 12. Okay, so it's noon. All right. So now you'll see that all the 1275 02:40:17,600 --> 02:40:31,120 equal to 12. And I'm actually going to now drop that column. Our access equals one. Alright, 1276 02:40:31,120 --> 02:40:38,480 so we run this cell. Okay, so now we got rid of the hour in here. And we just have the by count, 1277 02:40:38,479 --> 02:40:45,760 the temperature, humidity, wind, visibility, and yada, yada, yada. 
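A hedged sketch of the cleanup narrated here; the file name and the dropped column names are assumptions based on the dataset page, so check them against your own download:

```python
# Read the downloaded CSV (name is an assumption), drop the columns we don't
# care about, and relabel the rest with the shorter names above.
df = pd.read_csv("SeoulBikeData.csv").drop(["Date", "Holiday", "Seasons"], axis=1)
df.columns = dataset_cols

# "Yes"/"No" in the functional column becomes 1/0.
df["functional"] = (df["functional"] == "Yes").astype(int)

# Keep only the rows for noon, then drop the hour column.
df = df[df["hour"] == 12]
df = df.drop(["hour"], axis=1)
df.head()
```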
Alright, so what I want to do 1278 02:40:45,760 --> 02:40:54,639 is I'm going to actually plot all of these. So for i in all the columns, so the range, length of 1279 02:40:55,440 --> 02:40:59,280 whatever its data frame is, and all the columns, because I don't have by count as 1280 02:41:00,159 --> 02:41:05,760 actually, it's my first thing. So what I'm going to do is say for a label in data frame, 1281 02:41:06,559 --> 02:41:10,159 columns, everything after the first thing, so that would give me the temperature and 1282 02:41:10,159 --> 02:41:19,440 onwards. So these are all my features, right? I'm going to just scatter. So I want to see how that 1283 02:41:19,440 --> 02:41:29,680 label how that specific data, how that affects the by count. So I'm going to plot the bike count on 1284 02:41:29,680 --> 02:41:35,760 the y axis. And I'm going to plot, you know, whatever the specific label is on the x axis. 1285 02:41:35,760 --> 02:41:46,000 And I'm going to title this, whatever the label is. And, you know, make my y label, the bike count 1286 02:41:46,639 --> 02:41:58,079 at noon. And the x label as just the label. Okay, now, I guess we don't even need the legend. 1287 02:41:58,079 --> 02:42:10,000 We don't even need the legend. So just show that plot. All right. So it seems like functional is 1288 02:42:10,000 --> 02:42:21,920 not really doesn't really give us any utility. So then snow rain seems like this radiation, 1289 02:42:21,920 --> 02:42:31,040 you know, is fairly linear dew point temperature, visibility, wind doesn't really seem like it does 1290 02:42:31,040 --> 02:42:37,200 much humidity, kind of maybe like an inverse relationship. But the temperature definitely 1291 02:42:37,200 --> 02:42:41,680 looks like there's a relationship between that and the number of bikes, right. So what I'm actually 1292 02:42:41,680 --> 02:42:46,000 going to do is I'm going to drop some of the ones that don't don't seem like they really matter. So 1293 02:42:46,000 --> 02:42:56,959 maybe wind, you know, visibility. Yeah, so I'm going to get rid of when visibility and functional. 1294 02:42:59,280 --> 02:43:13,760 So now data frame, and I'm going to drop wind, visibility, and functional. All right. And the 1295 02:43:13,760 --> 02:43:21,200 axis again is the column. So that's one. So if I look at my data set, now, I have just the 1296 02:43:21,200 --> 02:43:27,200 temperature, the humidity, the dew point temperature, radiation, rain, and snow. So again, 1297 02:43:27,200 --> 02:43:33,760 what I want to do is I want to split this into my training, my validation and my test data set, 1298 02:43:34,319 --> 02:43:42,719 just as we talked before. Here, we can use the exact same thing that we just did. And we can say 1299 02:43:42,719 --> 02:43:51,359 numpy dot split, and sample, you know that the whole sample, and then create our splits 1300 02:43:54,000 --> 02:44:02,559 of the data frame. And we're going to do that. But now set this to eight. Okay. 1301 02:44:04,639 --> 02:44:10,159 So I don't really care about, you know, the the full grid, the full array. So I'm just going to 1302 02:44:10,159 --> 02:44:19,680 use an underscore for that variable. But I will get my training x and y's. And actually, I don't 1303 02:44:19,680 --> 02:44:29,600 have a function for getting the x and y's. So here, I'm going to write a function defined, 1304 02:44:30,159 --> 02:44:36,879 get x y. And I'm going to pass in the data frame. 
And I'm actually going to pass in what the name 1305 02:44:36,879 --> 02:44:47,039 of the y label is, and what specific x labels I want to look at. So here, if that's none, 1306 02:44:47,040 --> 02:44:51,520 then I'm just going to get everything from the data set. That's 1307 02:44:51,520 --> 02:45:00,560 not the y label. So here, I'm actually going to make first a deep copy of my data frame. 1308 02:45:00,559 --> 02:45:08,879 And that basically means I'm just copying everything over. If x labels is none, 1309 02:45:08,879 --> 02:45:14,559 so if not x labels, then all I'm going to do is say, all right, x is going to be whatever this 1310 02:45:14,559 --> 02:45:22,959 data frame is. And I'm just going to take all the columns. So c for c in data frame dot columns, 1311 02:45:22,959 --> 02:45:32,239 if c does not equal the y label, right, and I'm going to get the values from that. But if there 1312 02:45:32,239 --> 02:45:40,159 are x labels, well, okay, so in order to index only one thing, so like, let's say I pass in only 1313 02:45:40,159 --> 02:45:50,000 one thing in here, then my data frame is, so let me make a case for that. So if the length of x 1314 02:45:50,000 --> 02:46:00,319 labels is equal to one, then what I'm going to do is just say that this is going to be x labels, 1315 02:46:00,319 --> 02:46:07,600 and add just that label's values, and I actually need to reshape to make this 2d. 1316 02:46:08,159 --> 02:46:15,039 So I'm going to pass in negative one comma one there. Now, otherwise, if I have like a list of 1317 02:46:15,040 --> 02:46:20,000 specific x labels that I want to use, then I'm actually just going to say x is equal to data 1318 02:46:20,000 --> 02:46:28,719 frame of those x labels, dot values. And that should suffice. Alright, so now that's just me 1319 02:46:28,719 --> 02:46:36,159 extracting x. And in order to get my y, I'm going to do y equals data frame, and then pass in the y 1320 02:46:36,159 --> 02:46:45,440 label. And at the very end, I'm going to say data equals NP dot h stack. So I'm stacking them horizontally 1321 02:46:45,440 --> 02:46:54,960 one next to each other. And I'll take x and y, and return that. Oh, but this needs to be values. 1322 02:46:54,959 --> 02:46:59,119 And I'm actually going to reshape this to make it 2d as well so that we can do this h stack. 1323 02:46:59,120 --> 02:47:10,160 And I will return data, x, y. So now I should be able to say, okay, get x, y, and take that data 1324 02:47:10,159 --> 02:47:18,639 frame. And the y label, so my y label is bike count. And actually, so for the x label, I'm actually 1325 02:47:18,639 --> 02:47:24,399 going to let's just do like one dimension right now. And earlier, I got rid of the plots, but we 1326 02:47:24,399 --> 02:47:30,719 had seen that maybe, you know, the temperature dimension does really well. And we might be able 1327 02:47:30,719 --> 02:47:38,639 to use that to predict y. So I'm going to label this also so that, you know, it's just using the 1328 02:47:38,639 --> 02:47:48,559 temperature. And I am also going to do this again for, oh, this should be train. And this should be 1329 02:47:48,559 --> 02:48:00,239 validation. And this should be test. Because oh, that's val. Right. But here, it should be val. 1330 02:48:01,920 --> 02:48:08,079 And this should be test. Alright, so we run this and now we have our training, validation, and test 1331 02:48:08,639 --> 02:48:16,239 data sets for just the temperature.
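Putting this stretch together, here is a hedged sketch of the scatter plots, the 60/20/20 split, and the get_xy helper as narrated; the variable names are my own stand-ins:

```python
# Scatter each feature against the bike count at noon.
for label in df.columns[1:]:
    plt.scatter(df[label], df["bike_count"])
    plt.title(label)
    plt.ylabel("Bike count at noon")
    plt.xlabel(label)
    plt.show()

# Drop the features that showed little relationship with the bike count.
df = df.drop(["wind", "visibility", "functional"], axis=1)

# Shuffle, then split into 60% train / 20% validation / 20% test.
train, val, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])

def get_xy(dataframe, y_label, x_labels=None):
    dataframe = copy.deepcopy(dataframe)
    if x_labels is None:
        X = dataframe[[c for c in dataframe.columns if c != y_label]].values
    elif len(x_labels) == 1:
        X = dataframe[x_labels[0]].values.reshape(-1, 1)  # keep X two-dimensional
    else:
        X = dataframe[x_labels].values

    y = dataframe[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))  # stack X and y side by side
    return data, X, y

# Temperature-only versions of the three splits.
_, X_train_temp, y_train_temp = get_xy(train, "bike_count", x_labels=["temp"])
_, X_val_temp, y_val_temp = get_xy(val, "bike_count", x_labels=["temp"])
_, X_test_temp, y_test_temp = get_xy(test, "bike_count", x_labels=["temp"])
```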
So if I look at x train temp, it's literally just the temperature. 1332 02:48:16,239 --> 02:48:23,039 Okay, and I'm doing this first to show you simple linear regression. Alright, so right now I can 1333 02:48:23,040 --> 02:48:30,800 create a regressor. So I can say the temp regressor here. And then I'm going to, you know, make a 1334 02:48:30,799 --> 02:48:40,000 linear regression model. And just like before, I can simply fit my x train temp, y train temp 1335 02:48:40,000 --> 02:48:48,239 in order to train this linear regression model. Alright, and then I can also, I can print 1336 02:48:49,040 --> 02:49:02,160 this regressor's coefficients and the intercept. So if I do that, okay, this is the coefficient 1337 02:49:02,159 --> 02:49:11,039 for whatever the temperature is, and then the x intercept, okay, or the y intercept, sorry. All 1338 02:49:11,040 --> 02:49:25,920 right. And I can, you know, score, so I can get the r squared score. So I can score x test 1339 02:49:25,920 --> 02:49:35,520 and y test. All right, so it's an r squared of around point three eight, which is better than 1340 02:49:35,520 --> 02:49:40,880 zero, which would mean, hey, there's absolutely no association. But it's also not, you know, like, 1341 02:49:42,319 --> 02:49:47,520 good, it depends on the context. But, you know, the higher that number, the more that 1342 02:49:47,520 --> 02:49:53,680 the two variables would be correlated, right? Which here, it's all right. It just means there's 1343 02:49:53,680 --> 02:50:00,319 maybe some association between the two. But the reason why I want to do this in 1D was to show 1344 02:50:00,319 --> 02:50:06,799 you, you know, if we plotted this, this is what it would look like. So if I create a scatterplot, 1345 02:50:07,440 --> 02:50:22,480 and let's take the training. So this is our data. And then let's make it blue. And then if I 1346 02:50:22,479 --> 02:50:29,279 also plotted, so something that I can do is say, you know, the x range, I'm going to plot it, 1347 02:50:29,840 --> 02:50:36,399 is linspace, and this goes from negative 20 to 40, this piece of data. So I'm going to just say, 1348 02:50:36,399 --> 02:50:47,199 let's take 100 things from there. So I'm going to plot x, and I'm going to take this temp 1349 02:50:47,200 --> 02:50:55,840 regressor, and predict x with that. Okay, and this label, I'm going to label that 1350 02:50:57,200 --> 02:51:08,800 the fit. And this color, let's make this red. And let's actually set the line width, so I can, 1351 02:51:08,799 --> 02:51:20,719 I can change how thick that line is. Okay. Now at the very end, let's create a legend. And let's, 1352 02:51:21,920 --> 02:51:30,239 all right, let's also create, you know, title, all these things that matter, in some sense. So 1353 02:51:30,239 --> 02:51:39,360 here, let's just say, this would be the bikes, versus the temperature, right? And the y label 1354 02:51:39,360 --> 02:51:48,400 would be number of bikes. And the x label would be the temperature. So I actually think that this 1355 02:51:48,399 --> 02:51:57,920 might cause an error. Yeah. So it's expecting a 2d array. So we actually have to reshape this. 1356 02:51:57,920 --> 02:52:15,120 Okay, there we go. So I just had to make this an array and then reshape it. So it was 2d. Now, 1357 02:52:15,120 --> 02:52:20,960 we see that, all right, this increases.
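A hedged sketch of this simple linear regression and the fit-line plot, using the temperature-only arrays from above:

```python
# Fit a one-feature linear regression on temperature and report its R^2.
temp_reg = LinearRegression()
temp_reg.fit(X_train_temp, y_train_temp)
print(temp_reg.coef_, temp_reg.intercept_)
print(temp_reg.score(X_test_temp, y_test_temp))  # R^2 on the test split

# Scatter the training data and draw the fitted line from -20 to 40 degrees.
plt.scatter(X_train_temp, y_train_temp, label="Data", color="blue")
x = np.linspace(-20, 40, 100).reshape(-1, 1)  # predict() expects a 2-D array
plt.plot(x, temp_reg.predict(x), label="Fit", color="red", linewidth=3)
plt.legend()
plt.title("Bikes vs Temperature")
plt.ylabel("Number of bikes")
plt.xlabel("Temperature")
plt.show()
```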
But again, remember those assumptions that we had about 1358 02:52:20,959 --> 02:52:26,799 linear regression, like this, I don't really know if this fits those assumptions, right? I just 1359 02:52:26,799 --> 02:52:32,159 wanted to show you guys though, that like, all right, this is what a line of s fit through this 1360 02:52:32,159 --> 02:52:46,399 data would look like. Okay. Now, we can do multiple linear regression, right. So I'm going to go ahead 1361 02:52:46,399 --> 02:52:58,079 and do that as well. Now, if I take my data set, and instead of the labels, it's actually what's 1362 02:52:58,079 --> 02:53:09,600 my current data set right now. Alright, so let's just use all of these except for the byte count, 1363 02:53:09,600 --> 02:53:18,399 right. So I'm going to just say for the x labels, let's just take the data frames columns and just 1364 02:53:18,399 --> 02:53:30,559 remove the byte count. So does that work? So if this part should be of x labels is none. And then 1365 02:53:30,559 --> 02:53:39,039 this should work now. Oops, sorry. Okay, so I have Oh, but this here, because it's not just the 1366 02:53:39,040 --> 02:53:48,160 temperature anymore, we should actually do this, let's say all, right. So I'm just going to quickly 1367 02:53:48,159 --> 02:53:53,920 rerun this piece here so that we have our temperature only data set. And now we have our 1368 02:53:53,920 --> 02:54:02,000 all data set. Okay. And this regressor, I can do the same thing. So I can do the all regressor. 1369 02:54:02,000 --> 02:54:12,879 And I'm going to make this the linear regression. And I'm going to fit this to x train all and y 1370 02:54:12,879 --> 02:54:20,959 train all. Okay. Alright, so let's go ahead and also score this regressor. And let's see how the 1371 02:54:20,959 --> 02:54:30,159 R squared performs now. So if I test this on the test data set, what happens? Alright, so our R 1372 02:54:30,159 --> 02:54:37,200 square seems to improve it went from point four to point five, two, which is a good sign. Okay. 1373 02:54:38,319 --> 02:54:44,559 And I can't necessarily plot, you know, every single dimension. But this just this is just 1374 02:54:44,559 --> 02:54:49,680 to say, okay, this is this is improved, right? Alright, so one cool thing that you can do with 1375 02:54:49,680 --> 02:55:00,079 tensorflow is you can actually do regression, but with the neural net. So here, I'm going 1376 02:55:00,079 --> 02:55:08,879 to we already have our our training data for just the temperature and just, you know, for all the 1377 02:55:08,879 --> 02:55:13,839 different columns. So I'm not going to bother with splitting up the data again, I'm just going to go 1378 02:55:13,840 --> 02:55:20,639 ahead and start building the model. So in this linear regression model, typically, you know, 1379 02:55:20,639 --> 02:55:28,079 it does help if we normalize it. So that's very easy to do with tensorflow, I can just create some 1380 02:55:28,079 --> 02:55:36,719 normalizer layer. So I'm going to do tensorflow Keras layers, and get the normalization layer. 1381 02:55:37,440 --> 02:55:43,920 And the input shape for that will just be one because let's just do it again on just the 1382 02:55:43,920 --> 02:55:53,520 temperature and the access I will make none. Now for this temp normalizer, and I should have had 1383 02:55:53,520 --> 02:56:04,960 an equal sign there. I'm going to adapt this to X train temp, and reshape this to just a single vector. 1384 02:56:06,479 --> 02:56:14,799 So that should work great. 
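For the multiple regression step described just above, with all remaining features instead of temperature alone, a hedged sketch looks like this:

```python
# Use every remaining column except the target as a feature.
all_features = [c for c in df.columns if c != "bike_count"]

_, X_train_all, y_train_all = get_xy(train, "bike_count", x_labels=all_features)
_, X_val_all, y_val_all = get_xy(val, "bike_count", x_labels=all_features)
_, X_test_all, y_test_all = get_xy(test, "bike_count", x_labels=all_features)

all_reg = LinearRegression()
all_reg.fit(X_train_all, y_train_all)
print(all_reg.score(X_test_all, y_test_all))  # R^2 improves over temperature alone
```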
Now with this model, so temp neural net model, what I can do is I can do, 1385 02:56:14,799 --> 02:56:23,759 you know, tf dot keras sequential. And I'm going to pass in this normalizer layer. And then I'm 1386 02:56:23,760 --> 02:56:29,920 going to say, hey, just give me one single dense layer with one single unit. And what that's doing 1387 02:56:29,920 --> 02:56:37,120 is saying, all right, well, one single node just means that it's linear. And if you don't add any 1388 02:56:37,120 --> 02:56:43,360 sort of activation function to it, the output is also linear. So here, I'm going to have tensorflow 1389 02:56:43,360 --> 02:56:52,960 Keras layers dot dense. And I'm just going to have one unit. And that's going to be my model. Okay. 1390 02:56:54,479 --> 02:57:06,799 So with this model, let's compile. And for our optimizer, let's use, 1391 02:57:06,799 --> 02:57:16,399 let's use the Adam again, dot Adam, and we have to pass in the learning rate. So learning rate, 1392 02:57:16,399 --> 02:57:26,879 and our learning rate, let's do 0.01. Actually, let's make this one 0.1. And the 1393 02:57:26,879 --> 02:57:34,079 loss, I'm going to do mean squared error. Okay, so we run that, we've compiled it, okay, great. 1394 02:57:34,079 --> 02:57:41,440 And just like before, we can call history. And I'm going to fit this model. So here, 1395 02:57:41,440 --> 02:57:48,640 if I call fit, I can just fit it, and I'm going to take the x train with the temperature, 1396 02:57:49,280 --> 02:57:57,840 but reshape it, y train for the temperature. And I'm going to set verbose equal to zero so 1397 02:57:57,840 --> 02:58:04,479 that it doesn't, you know, display stuff. I'm actually going to set epochs equal to, let's do 1398 02:58:04,479 --> 02:58:13,760 1000. And the validation data, let's pass in the validation data set here 1399 02:58:16,319 --> 02:58:22,799 as a tuple. And I know I spelled that wrong. So let's just run this. 1400 02:58:22,799 --> 02:58:27,759 And up here, I've copied and pasted the plot loss from our previous example, but changed the y label 1401 02:58:27,760 --> 02:58:34,159 to MSE, because now we're dealing with mean squared error. And I'm going to plot 1402 02:58:34,159 --> 02:58:39,119 the loss of this history after it's done. So let's just wait for this to finish training and then to 1403 02:58:39,120 --> 02:58:50,320 plot. Okay, so this actually looks pretty good. We see that 1404 02:58:50,319 --> 02:58:56,479 the values are converging. So now what I can do is 1405 02:58:56,479 --> 02:59:05,520 I'm going to go back up and take this plot. And we are going to just run that plot again. So 1406 02:59:07,200 --> 02:59:14,400 here, instead of this temperature regressor, I'm going to use the neural net regressor. 1407 02:59:16,319 --> 02:59:17,360 This neural net model. 1408 02:59:17,360 --> 02:59:25,200 And if I run that, I can see that, you know, this also gives me a linear regressor, 1409 02:59:26,399 --> 02:59:30,079 you'll notice that this fit is not entirely the same as the one 1410 02:59:31,120 --> 02:59:38,800 up here. And that's due to the training process of, you know, of this neural net. So just two 1411 02:59:38,799 --> 02:59:45,279 different ways to try to find the best linear regressor. Okay, but here we're using back 1412 02:59:45,280 --> 02:59:50,960 propagation to train a neural net node, whereas in the other one, they probably are not doing that.
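Put together, that single-unit Keras model is roughly the following sketch; temp_normalizer is the adapted layer from above, plot_loss is the helper from earlier in the notebook, and x_val_temp, y_val_temp are assumed names for the validation split.

temp_nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(1)   # one unit and no activation, so the output stays linear
])
temp_nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss="mean_squared_error"
)
history = temp_nn_model.fit(
    x_train_temp.reshape(-1), y_train_temp,
    epochs=1000, verbose=0,
    validation_data=(x_val_temp, y_val_temp)
)
plot_loss(history)   # the loss plotted here is the MSE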
1413 02:59:50,959 --> 02:59:58,719 Okay, they're probably just trying to actually compute the line of best fit. So, okay, given this, 1414 02:59:59,600 --> 03:00:08,479 well, we can repeat the exact same exercise with our multiple linear regression. Okay, 1415 03:00:09,360 --> 03:00:14,560 but I'm actually going to skip that part. I will leave that as an exercise to the viewer. Okay, 1416 03:00:14,559 --> 03:00:19,039 so now what would happen if we use a neural net, a real neural net instead of just, you know, 1417 03:00:19,040 --> 03:00:24,960 one single node in order to predict this. So let's start on that code, we already have our 1418 03:00:24,959 --> 03:00:31,439 normalizer. So I'm actually going to take the same setup here. But instead of, you know, this 1419 03:00:31,440 --> 03:00:37,520 one dense layer, I'm going to set this equal to 32 units. And for my activation, I'm going to use 1420 03:00:37,520 --> 03:00:46,159 ReLU. And now let's duplicate that. And for the final output, I just want one answer. So I just 1421 03:00:46,159 --> 03:00:52,079 want one cell. And this activation is also going to be ReLU, because I can't ever have less than 1422 03:00:52,079 --> 03:00:57,039 zero bikes. So I'm just going to set that as ReLU. I'm just going to name this the neural net model. 1423 03:00:57,040 --> 03:01:04,640 Okay. And at the bottom, I'm going to take this neural 1424 03:01:04,639 --> 03:01:16,319 net model, and I'm going to compile. And I will actually use the same compile call here. But instead 1425 03:01:18,639 --> 03:01:27,279 of a learning rate of 0.01, I'll use 0.001. Okay. And I'm going to train this here. 1426 03:01:27,280 --> 03:01:39,920 So the history is this neural net model. And I'm going to fit that against x train temp, y train 1427 03:01:39,920 --> 03:01:54,479 temp, and the validation data, I'm going to set this again equal to x val temp and y val temp. 1428 03:01:54,479 --> 03:02:03,600 Now, for the verbose, I'm going to set it equal to zero; epochs, let's do 100. And here for the batch 1429 03:02:03,600 --> 03:02:08,559 size, actually, let's just not do a batch size right now. Let's just try it. Let's see what happens 1430 03:02:08,559 --> 03:02:18,319 here. And again, we can plot the loss of this history after it's done training. So let's just 1431 03:02:18,319 --> 03:02:26,879 run this. And that's not what we're supposed to get. So what is going on? Here is sequential, 1432 03:02:26,879 --> 03:02:39,679 we have our temperature normalizer, which I'm wondering now if we have to redo. 1433 03:02:39,680 --> 03:02:51,040 Okay, so we do see this decline, it's an interesting curve, but we do see it eventually. 1434 03:02:53,280 --> 03:02:57,280 So this is our loss, which, all right, if it's decreasing, that's a good sign. 1435 03:02:57,920 --> 03:03:04,079 And actually, what's interesting is, let's just plot this model again. So here, instead of that. 1436 03:03:04,079 --> 03:03:09,840 And you'll see that we actually have this like, curve that looks something like this. So actually, 1437 03:03:09,840 --> 03:03:19,600 what if I got rid of this activation? Let's train this again. And see what happens.
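A sketch of that real neural net, under the same assumptions as above (temp_normalizer adapted earlier, plot_loss and the validation arrays from earlier cells):

nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu")   # ReLU on the output since bike counts can't go below zero
])
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="mean_squared_error")
history = nn_model.fit(x_train_temp, y_train_temp,
                       validation_data=(x_val_temp, y_val_temp),
                       epochs=100, verbose=0)
plot_loss(history)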
1438 03:03:21,120 --> 03:03:27,600 Alright, so even when I got rid of that ReLU at the end, it kind of knows, hey. You know, 1439 03:03:27,600 --> 03:03:36,559 it's not the best model, if we had maybe one more layer in here, these are just things that you have 1440 03:03:36,559 --> 03:03:41,680 to play around with. When you're, you know, working with machine learning, it's like, you don't really 1441 03:03:41,680 --> 03:03:53,440 know what the best model is going to be. For example, this also is not brilliant. But I guess 1442 03:03:53,440 --> 03:04:00,399 it's okay. So my point is, though, that with a neural net, I mean, this is not brilliant, but also 1443 03:04:00,399 --> 03:04:04,959 there's like no data down here, right? So it's kind of hard for our model to predict. In fact, 1444 03:04:04,959 --> 03:04:09,439 we probably should have started the prediction somewhere around here. My point, though, is that 1445 03:04:09,440 --> 03:04:14,560 with this neural net model, you can see that this is no longer a linear predictor, but yet we still 1446 03:04:14,559 --> 03:04:21,600 get an estimate of the value, right? And we can repeat this exact same exercise, right? So let's 1447 03:04:21,600 --> 03:04:30,640 do that. Right. We can repeat this exact same exercise with the multiple inputs. So here, 1448 03:04:33,520 --> 03:04:40,720 if I now pass in all of the data, so this is my all normalizer, 1449 03:04:40,719 --> 03:04:54,479 and I should just be able to pass in that. So let's move this to the next cell. Here, 1450 03:04:54,479 --> 03:05:00,959 I'm going to pass in my all normalizer. And let's compile it. Yeah, those parameters look good. 1451 03:05:02,959 --> 03:05:10,479 Great. So here with the history, when we're trying to fit this model, instead of temp, 1452 03:05:10,479 --> 03:05:17,680 we're going to use our larger data set with all the features. And let's just train that. 1453 03:05:22,000 --> 03:05:23,680 And of course, we want to plot the loss. 1454 03:05:31,520 --> 03:05:37,760 Okay, so that's what our loss looks like. So an interesting curve, but it's decreasing. 1455 03:05:37,760 --> 03:05:44,479 So before we saw that our R squared score was around point five two. Well, we don't really have 1456 03:05:44,479 --> 03:05:49,680 that with a neural net anymore. But one thing that we can measure is, hey, what is the mean squared 1457 03:05:49,680 --> 03:05:59,600 error, right? So if I come down here, and I compare the two mean squared errors, 1458 03:05:59,600 --> 03:06:13,360 I can predict x test all, right. So these are my predictions using that linear regressor, 1459 03:06:14,079 --> 03:06:20,159 well, the multiple linear regressor. So these are my y predictions for the linear regression. 1460 03:06:20,159 --> 03:06:32,079 Okay. I'm actually going to do that at the bottom. So let me just copy and paste that cell and bring 1461 03:06:32,079 --> 03:06:41,760 it down here. So now I'm going to calculate the mean squared error for both the linear regressor 1462 03:06:41,760 --> 03:06:51,360 and the neural net. Okay, so this is my linear and this is my neural net. So if I do my neural net 1463 03:06:51,360 --> 03:07:03,760 model, and I predict x test all, I get my two, you know, different y predictions. And I can calculate 1464 03:07:03,760 --> 03:07:11,280 the mean squared error, right?
So if I want to get the mean squared error, and I have y prediction 1465 03:07:11,280 --> 03:07:19,200 and y real, I can do numpy dot square, and then I would need the y prediction minus, you know, the 1466 03:07:19,200 --> 03:07:31,840 real. So this is basically squaring everything. And this should be a vector. So if I just take 1467 03:07:31,840 --> 03:07:42,000 this entire thing and take the mean of that, that should give me the MSE. So let's just try that out. 1468 03:07:44,959 --> 03:07:52,639 And the y real is y test all, right? So that's my mean squared error for the linear regressor. 1469 03:07:52,639 --> 03:08:04,559 And this is my mean squared error for the neural net. So that's interesting. I will debug this live, 1470 03:08:04,559 --> 03:08:14,399 I guess. So my guess is that it's probably coming from this normalization layer. Because this input 1471 03:08:14,399 --> 03:08:33,279 shape is probably just six. And okay, so that works now. And the reason why is because, like, 1472 03:08:33,280 --> 03:08:39,040 my inputs, for every vector, it's only a one dimensional vector of length six. So I should 1473 03:08:39,040 --> 03:08:46,000 have just had six comma, which is a tuple of size six, from the start, or it's 1474 03:08:46,000 --> 03:08:54,079 a tuple containing one element, which is a six. Okay, so it's actually interesting that my neural 1475 03:08:54,079 --> 03:09:00,479 net results seem like they have a larger mean squared error than my linear regressor. 1476 03:09:00,479 --> 03:09:09,840 One thing that we can look at is, we can actually plot the real versus, you know, the actual 1477 03:09:09,840 --> 03:09:21,200 results versus what the predictions are. So if I make some axes, so I use plt dot axes, and make 1478 03:09:21,200 --> 03:09:31,280 these equal, then I can scatter the y, you know, the test. So what the actual 1479 03:09:31,280 --> 03:09:40,000 values are on the x axis, and then what the predictions are on the y axis. Okay. And I can 1480 03:09:40,000 --> 03:09:50,159 label this as the linear regression predictions. Okay, so then let me just label my axes. So the 1481 03:09:50,159 --> 03:09:59,360 x axis, I'm going to say is the true values. The y axis is going to be my linear regression predictions. 1482 03:10:04,319 --> 03:10:09,279 Or actually, let's just make this predictions. 1483 03:10:09,280 --> 03:10:19,200 And then at the end, I'm going to plot. Oh, let's set some limits. 1484 03:10:22,879 --> 03:10:26,159 Because I think that's like approximately the max number of bikes. 1485 03:10:28,639 --> 03:10:35,199 So I'm going to set my x limit to this and my y limit to this. 1486 03:10:35,200 --> 03:10:45,920 So here, I'm going to pass that in here too. And all right, this is what we actually get for our 1487 03:10:46,479 --> 03:10:54,719 linear regressor. You see that actually, they align quite well, I mean, to some extent. So 2000 is 1488 03:10:54,719 --> 03:11:03,359 probably too much, 2500. I mean, it looks like maybe 1800 would be enough here for our limits. 1489 03:11:03,360 --> 03:11:09,360 And I'm actually going to label something else, the neural net predictions. 1490 03:11:12,719 --> 03:11:22,000 Let's add a legend. So you can see that our neural net, for the larger values, it seems like 1491 03:11:22,000 --> 03:11:28,479 it's a little bit more spread out. And it seems like we tend to underestimate a little bit down 1492 03:11:28,479 --> 03:11:36,479 here in this area. Okay.
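A sketch of that comparison; all_reg is the multiple linear regression from above, nn_model_all is an assumed name for the neural net trained on all of the features, and the test arrays come from the earlier split.

import numpy as np
import matplotlib.pyplot as plt

def MSE(y_pred, y_real):
    # mean of the squared differences between predictions and true values
    return np.square(y_pred - y_real).mean()

y_pred_lr = all_reg.predict(x_test_all)
y_pred_nn = nn_model_all.predict(x_test_all).flatten()   # flatten so the shape matches y_test_all
print(MSE(y_pred_lr, y_test_all), MSE(y_pred_nn, y_test_all))

# plot the true values against the predictions from both models
ax = plt.axes(aspect="equal")
ax.scatter(y_test_all, y_pred_lr, label="Lin Reg Preds")
ax.scatter(y_test_all, y_pred_nn, label="NN Preds")
lims = [0, 1800]                  # roughly the max number of bikes (an assumption)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel("True Values")
ax.set_ylabel("Predictions")
plt.plot(lims, lims, c="red")     # reference line where prediction equals truth
plt.legend()
plt.show()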
And for some reason, these are way off as well. 1493 03:11:37,840 --> 03:11:44,479 But yeah, so we've basically used a linear regressor and a neural net. Honestly, there are 1494 03:11:44,479 --> 03:11:48,559 some times where a neural net is more appropriate and times where a linear regressor is more appropriate. 1495 03:11:49,120 --> 03:11:54,720 I think that it just comes with time and trying to figure out, you know, and just literally seeing 1496 03:11:54,719 --> 03:11:59,279 like, hey, what works better. Like here, a multiple linear regressor might actually work 1497 03:11:59,280 --> 03:12:05,760 better than a neural net. But for example, with the one dimensional case, a linear regressor would 1498 03:12:05,760 --> 03:12:12,880 never be able to see this curve. Okay. I mean, I'm not saying this is a great model either, but I'm 1499 03:12:12,879 --> 03:12:19,439 just saying like, hey, you know, sometimes it might be more appropriate to use something that's not 1500 03:12:19,440 --> 03:12:29,120 linear. So yeah, I will leave regression at that. Okay, so we just talked about supervised learning. 1501 03:12:29,840 --> 03:12:34,880 And in supervised learning, we have data, we have a bunch of features for a bunch of 1502 03:12:34,879 --> 03:12:39,759 different samples. But each of those samples has some sort of label on it, whether that's a number, 1503 03:12:39,760 --> 03:12:46,159 a category, a class, etc. Right, we were able to use that label in order to try to predict 1504 03:12:46,159 --> 03:12:51,840 new labels of other points that 1505 03:12:51,840 --> 03:12:59,520 we haven't seen yet. Well, now let's move on to unsupervised learning. So with unsupervised 1506 03:12:59,520 --> 03:13:05,600 learning, we have a bunch of unlabeled data. And what can we do with that? You know, can we learn 1507 03:13:05,600 --> 03:13:13,120 anything from this data? So the first algorithm that we're going to discuss is known as k means 1508 03:13:13,120 --> 03:13:22,720 clustering. What k means clustering is trying to do is compute k clusters from the data. 1509 03:13:25,760 --> 03:13:31,360 So in this example below, I have a bunch of scattered points. And you'll see that this 1510 03:13:31,360 --> 03:13:38,079 is x zero and x one on the two axes, which means I'm actually plotting two different features, 1511 03:13:38,079 --> 03:13:44,799 right, of each point, but we don't know what the y label is for those points. And now, just looking 1512 03:13:44,799 --> 03:13:51,439 at these scattered points, we can kind of see how there are different clusters in the data set, 1513 03:13:51,440 --> 03:14:00,319 right. So depending on what we pick for k, we might have different clusters. Let's say k equals two, 1514 03:14:00,319 --> 03:14:05,440 right, then we might pick, okay, this seems like it could be one cluster, but this here is also 1515 03:14:05,440 --> 03:14:12,399 another cluster. So those might be our two different clusters. If we have k equals three, 1516 03:14:13,120 --> 03:14:18,160 for example, then okay, this seems like it could be a cluster. This seems like it could be a 1517 03:14:18,159 --> 03:14:23,119 cluster. And maybe this could be a cluster, right. So we could have three different clusters in the 1518 03:14:23,120 --> 03:14:33,520 data set. Now, this k here is predefined, if I can spell that correctly, by the person who's running 1519 03:14:33,520 --> 03:14:42,479 the model. So that would be you. All right.
And let's discuss how, you know, the computer actually 1520 03:14:42,479 --> 03:14:49,199 goes through and computes the k clusters. So I'm going to write those steps down here. 1521 03:14:52,639 --> 03:15:01,279 Now, the first step that happens is we actually choose, well, the computer chooses, three random 1522 03:15:01,280 --> 03:15:11,280 points on this plot to be the centroids. And by centroids, I just mean the centers of the clusters. 1523 03:15:11,840 --> 03:15:16,799 Okay. So three random points, let's say we're doing k equals three, so we're choosing three 1524 03:15:16,799 --> 03:15:21,519 random points to be the centroids of the three clusters. If it were two, we'd be choosing two 1525 03:15:21,520 --> 03:15:27,760 random points. Okay. So maybe the three random points I'm choosing might be here. 1526 03:15:27,760 --> 03:15:41,680 Here, here, and here. All right. So we have three different points. And the second thing that we do 1527 03:15:44,399 --> 03:15:46,000 is we actually calculate 1528 03:15:46,000 --> 03:15:58,639 the distance for each point to those centroids. So between all the points and the centroids. 1529 03:16:01,360 --> 03:16:06,400 So basically, I'm saying, all right, this is this distance, this distance, this distance, 1530 03:16:07,600 --> 03:16:13,120 all of these distances, I'm computing between, oops, not those two, between the points and the 1531 03:16:13,120 --> 03:16:18,720 centroids, not the centroids themselves. So I'm computing the distances for all of these points to each of the centroids. 1532 03:16:20,079 --> 03:16:30,639 Okay. And that comes with also assigning those points to the closest centroid. 1533 03:16:34,799 --> 03:16:42,399 What do I mean by that? So let's take this point here, for example, so I'm computing 1534 03:16:42,399 --> 03:16:46,959 this distance, this distance, and this distance. And I'm saying, okay, it seems like the red one 1535 03:16:46,959 --> 03:16:54,399 is the closest. So I'm actually going to put this into the red centroid. So if I do that for 1536 03:16:54,399 --> 03:17:03,279 all of these points, this one seems slightly closer to red, and this one seems slightly closer to red, 1537 03:17:03,280 --> 03:17:13,040 right? Now for the blue, I actually wouldn't put any blue ones in here, but we would probably 1538 03:17:13,040 --> 03:17:21,200 actually, that first one is closer to red. And now it seems like the rest of them are probably 1539 03:17:21,200 --> 03:17:31,440 closer to green. So let's just put all of these into green here, like that. And cool. So now we 1540 03:17:31,440 --> 03:17:38,480 have, you know, our two, well, three technically, centroids. So there's this group here, there's 1541 03:17:38,479 --> 03:17:44,879 this group here. And then blue is kind of just this group here, it hasn't really touched any 1542 03:17:44,879 --> 03:17:54,559 of the points yet. So the next step, step three, that we do is we actually go and we recalculate the 1543 03:17:54,559 --> 03:18:02,799 centroids. So we compute new centroids based on the points that we have in all the clusters. 1544 03:18:04,000 --> 03:18:10,159 And by that, I just mean, okay, well, let's take the average of all these points. And where is that 1545 03:18:10,159 --> 03:18:15,680 new centroid? That's probably going to be somewhere around here, right? The blue one, we don't have 1546 03:18:15,680 --> 03:18:22,800 any points in there. So we won't touch it, and then the green one, we can put that probably somewhere 1547 03:18:22,799 --> 03:18:36,239 over here, oops, somewhere over here. Right.
So now if I erase all of the previously computed centroids, 1548 03:18:38,239 --> 03:18:44,239 I can go and I can actually redo step two over here, this calculation. 1549 03:18:45,280 --> 03:18:48,560 Alright, so I'm going to go back and I'm going to iterate through everything again, 1550 03:18:48,559 --> 03:18:55,199 and I'm going to recompute my three clusters. So let's see, we're going to take this red point, 1551 03:18:55,200 --> 03:19:01,840 these are definitely all red, right? This one still looks a bit red. Now, 1552 03:19:03,760 --> 03:19:06,800 this part, we actually start getting closer to the blues. 1553 03:19:08,159 --> 03:19:16,799 So this one still seems closer to a blue than a green, this one as well. And I think the rest 1554 03:19:16,799 --> 03:19:26,399 would belong to green. Okay, so now our three centroids are, sorry, our three clusters 1555 03:19:26,399 --> 03:19:39,840 would be this, this, and then this, right? Those are our three clusters. And so now we go back 1556 03:19:39,840 --> 03:19:44,079 and we compute the new centroids. So now we go back and we compute 1557 03:19:44,079 --> 03:19:50,639 the three centroids. So I'm going to get rid of this, this and this. And now where would this 1558 03:19:50,639 --> 03:19:57,680 red be centered? Probably closer, you know, to this point here. This blue might be closer to 1559 03:19:57,680 --> 03:20:05,520 up here. And then this green would probably be somewhere. It's pretty similar to what we had 1560 03:20:05,520 --> 03:20:10,880 before. But it seems like it'd be pulled down a bit. So probably somewhere around there for green. 1561 03:20:10,879 --> 03:20:20,239 All right. And now, again, we go back and we compute the distance between all the points 1562 03:20:20,239 --> 03:20:27,600 and the centroids. And then we assign them to the closest centroid. Okay. So the reds are all here, 1563 03:20:27,600 --> 03:20:36,000 it's very clear. Actually, let me just circle that. And it actually seems like this point is 1564 03:20:36,000 --> 03:20:43,440 closer to this blue now. So the blues seem like they would 1565 03:20:43,440 --> 03:20:49,440 be, maybe this point looks like it'd be blue, so all these look like they would be blue now. 1566 03:20:50,159 --> 03:20:58,000 And the greens would probably be this cluster right here. So we go back, we compute the centroids, 1567 03:20:58,000 --> 03:21:08,959 bam. This one probably like almost here, bam. And then the green looks like it would be probably 1568 03:21:10,959 --> 03:21:21,919 here-ish. Okay. And now we go back and we compute the clusters again. 1569 03:21:21,920 --> 03:21:32,879 So red is still this; blue, I would argue, is now this cluster here. And green is this cluster here. 1570 03:21:33,360 --> 03:21:48,079 Okay, so we go and we recompute the centroids, bam, bam. And, you know, bam. And now if I were 1571 03:21:48,079 --> 03:21:54,399 to go and assign all the points to clusters again, I would get the exact same thing. Right. And so 1572 03:21:54,399 --> 03:21:59,840 that's when we know that we can stop iterating between steps two and three: when we've 1573 03:21:59,840 --> 03:22:06,559 converged on some solution, when we've reached some stable point. And so now, because none of 1574 03:22:06,559 --> 03:22:10,399 these points are really changing out of their clusters anymore, we can go back to the user 1575 03:22:10,399 --> 03:22:19,199 and say, hey, these are our three clusters. Okay.
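To make those steps concrete, here is a small from-scratch sketch of that loop in numpy. It assumes the data is an (n, 2) array of points, and it is only illustrative, since later we'll just use scikit-learn's KMeans.

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2 (expectation): assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3 (maximization): recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break   # nothing moved, so the clusters are stable and we can stop
        centroids = new_centroids
    return labels, centroids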
And this process is something known as 1576 03:22:20,719 --> 03:22:33,279 expectation maximization. This part where we're assigning the points to the closest centroid, 1577 03:22:33,280 --> 03:22:41,840 this is our expectation step. And this part where we're computing the new 1578 03:22:41,840 --> 03:22:54,000 centroids, this is our maximization step. Okay, so that's expectation maximization. 1579 03:22:55,040 --> 03:23:02,720 And we use this in order to compute the centroids, assign all the points to clusters 1580 03:23:02,719 --> 03:23:07,519 according to those centroids, and then we recompute all of that over again, until we reach 1581 03:23:07,520 --> 03:23:13,760 some stable point where nothing is changing anymore. Alright, so that's our first example 1582 03:23:13,760 --> 03:23:19,200 of unsupervised learning. And basically, what this is doing is trying to find some structure, 1583 03:23:19,200 --> 03:23:25,520 some pattern in the data. So if I came up with another point, it might be somewhere here, 1584 03:23:25,520 --> 03:23:32,560 I can say, oh, if this is a, b, c, it looks like that's closest to 1585 03:23:32,559 --> 03:23:38,239 cluster B. And so I would probably put it in cluster B. Okay, so we can find some structure 1586 03:23:38,239 --> 03:23:46,239 in the data based on just how the points are scattered relative to one another. Now, 1587 03:23:46,239 --> 03:23:50,479 the second unsupervised learning technique that I'm going to discuss with you guys is something known as 1588 03:23:50,479 --> 03:23:57,439 principal component analysis. And the point of principal component analysis is very often it's 1589 03:23:57,440 --> 03:24:07,520 used as a dimensionality reduction technique. So let me write that down. It's used for dimensionality 1590 03:24:07,520 --> 03:24:15,520 reduction. And what I mean by dimensionality reduction is, if I have a bunch of features like 1591 03:24:15,520 --> 03:24:23,600 x1, x2, x3, x4, etc., can I just reduce that down to one dimension that gives me the most information 1592 03:24:23,600 --> 03:24:29,520 about how all these points are spread relative to one another? And that's what PCA is for. So PCA, 1593 03:24:29,520 --> 03:24:42,800 principal component analysis. Let's say I have some points in the x zero and x one feature space. 1594 03:24:42,799 --> 03:24:51,279 Okay, so these points might be spread, you know, something like this. 1595 03:24:59,680 --> 03:25:08,960 Okay. So for example, if this were something to do with housing prices, right, 1596 03:25:08,959 --> 03:25:19,599 this here, x zero, might be, hey, years since built, right, since the house was built, 1597 03:25:19,600 --> 03:25:29,920 and x one might be square footage of the house. Alright, so like years since built, I mean, like 1598 03:25:29,920 --> 03:25:36,960 right now it's been, you know, 22 years since a house in 2000 was built. Now principal component 1599 03:25:36,959 --> 03:25:40,799 analysis is just saying, alright, let's say we want to build a model, or let's say we want to, 1600 03:25:40,799 --> 03:25:48,639 you know, display something about our data, but we don't have two axes to show it on. 1601 03:25:49,520 --> 03:25:56,319 How do we display, you know, how do we demonstrate that this point is further away from 1602 03:25:56,319 --> 03:26:04,239 this point than this point? And we can do that using principal component analysis.
So 1603 03:26:04,239 --> 03:26:07,920 take what you know about linear regression and just forget about it for a second. Otherwise, 1604 03:26:07,920 --> 03:26:16,879 you might get confused. PCA is a way of trying to find the direction in the space with the largest 1605 03:26:16,879 --> 03:26:23,920 variance. So this principal component, what that means is basically the component, 1606 03:26:23,920 --> 03:26:38,879 so some direction in this space, with the largest variance. Okay, it tells us the most about our 1607 03:26:38,879 --> 03:26:42,639 data set without needing both of the different dimensions. Like, let's say we have these two different 1608 03:26:42,639 --> 03:26:47,359 dimensions, and somebody's telling us, hey, you only get one dimension in order to show your data set. 1609 03:26:48,079 --> 03:26:53,840 What dimension do you want to show us? Okay, so let's say we want to show our data set, 1610 03:26:53,840 --> 03:26:59,040 what dimension, like, what do we do? We want to project our data onto a single dimension. 1611 03:27:00,159 --> 03:27:05,520 Alright, so that in this case might be a dimension that looks something like 1612 03:27:06,399 --> 03:27:10,639 this. And you might say, okay, we're not going to talk about linear regression, okay. 1613 03:27:11,680 --> 03:27:16,800 We don't have a y value. So in linear regression, this would be y; this is not y, okay, we don't 1614 03:27:16,799 --> 03:27:23,199 have a label for that. Instead, what we're doing is we're taking the right angle projection. So 1615 03:27:23,200 --> 03:27:30,880 all of these, take, that's not very visible, but take this right angle projection onto this line. 1616 03:27:33,040 --> 03:27:38,960 And what PCA is doing is saying, okay, map all of these points onto this one dimensional space. 1617 03:27:39,520 --> 03:27:44,000 So the transformed data set would be here. 1618 03:27:44,000 --> 03:27:49,760 This one's already on the line, so we just put that there. But now this would be our 1619 03:27:49,760 --> 03:27:57,120 new one dimensional data set. Okay, it's not our prediction or anything. This is our new data set. 1620 03:27:57,120 --> 03:28:02,480 If somebody came to us and said, you only get one dimension, you only get one number to represent 1621 03:28:02,479 --> 03:28:06,879 each of these 2d points, what number would you give us? 1622 03:28:06,879 --> 03:28:13,039 This would be the number that we gave. It's not our prediction or anything, 1623 03:28:13,040 --> 03:28:23,360 it's our new one dimensional data set. Okay, and in this direction, 1624 03:28:24,159 --> 03:28:29,840 this is where our points are the most spread out. Right? If I took this plot, 1625 03:28:31,040 --> 03:28:36,320 and let me actually duplicate this so I don't have to rewrite anything. 1626 03:28:36,319 --> 03:28:43,840 Or so I don't have to erase and then redraw anything. Let me get rid of some of this stuff. 1627 03:28:47,440 --> 03:28:50,079 And I just got rid of a point there too. So let me draw that back. 1628 03:28:54,159 --> 03:29:01,039 Alright, so if these were my original data points, what if I had taken, you know, this to be 1629 03:29:01,040 --> 03:29:12,960 the PCA dimension? Okay, well, I then would have points that, let me actually do that in a different 1630 03:29:12,959 --> 03:29:24,319 color. So if I were to draw a right angle to this for every point, my points would look something 1631 03:29:24,319 --> 03:29:37,440 like this.
And so just intuitively looking at these two different plots, this top one and this one, 1632 03:29:37,440 --> 03:29:43,120 we can see that the points are squished a little bit closer together. Right? Which means that 1633 03:29:43,120 --> 03:29:48,800 that's not the direction with the largest variance. The thing about the largest variance 1634 03:29:48,799 --> 03:29:55,759 is that this will give us the most discrimination between all of these points. The larger the 1635 03:29:55,760 --> 03:30:01,520 variance, the further spread out these points will likely be. And so that's the 1636 03:30:01,520 --> 03:30:07,600 dimension that we should project onto. A different way to actually look at that, like, what is the 1637 03:30:07,600 --> 03:30:14,399 dimension with the largest variance? It also happens to be the dimension 1638 03:30:14,399 --> 03:30:25,279 that minimizes the residuals. So if we take all the points, and 1639 03:30:25,280 --> 03:30:33,520 we take the residual from that, the x-y residual, so in linear regression, 1640 03:30:33,520 --> 03:30:37,760 we were looking only at this residual, the differences between the predictions, right, between 1641 03:30:37,760 --> 03:30:44,800 y and y hat. It's not that here; in principal component analysis, we're taking the difference 1642 03:30:44,799 --> 03:30:52,319 from our current point in two dimensional space to its projected point. Okay, so we're 1643 03:30:52,319 --> 03:31:00,879 taking that dimension. And we're saying, alright, how much distance is there 1644 03:31:00,879 --> 03:31:08,719 in that projection residual, and we're trying to minimize that for all of these points. So that 1645 03:31:08,719 --> 03:31:21,119 actually equates to this largest variance dimension, this dimension here, the PCA dimension, 1646 03:31:21,120 --> 03:31:32,560 you can either look at it as minimizing, let me get rid of this, 1647 03:31:34,559 --> 03:31:38,319 the projection residuals. So that's the stuff in orange. 1648 03:31:42,079 --> 03:31:48,319 Or as maximizing the variance between the points. 1649 03:31:48,319 --> 03:31:55,279 Okay. And we're not really going to talk about, you know, the method that we need in order to 1650 03:31:55,280 --> 03:32:00,960 calculate the principal components, or like what that projection would be, because you will 1651 03:32:00,959 --> 03:32:06,799 need to understand linear algebra for that, especially eigenvectors and eigenvalues, which 1652 03:32:06,799 --> 03:32:12,079 I'm not going to cover in this class. But that's how you would find the principal components. Okay, 1653 03:32:12,079 --> 03:32:16,879 now, with this two dimensional data set here, sorry, this one dimensional data set, we started 1654 03:32:16,879 --> 03:32:22,159 from a 2d data set, and we've now boiled it down to one dimension. Well, we can go and take that 1655 03:32:22,159 --> 03:32:27,680 dimension, and we can do other things with it. Right, like if there were a y label, 1656 03:32:27,680 --> 03:32:35,040 then we can now show x versus y, rather than x zero and x one in different plots with that y. 1657 03:32:35,040 --> 03:32:38,480 Now we can just say, oh, this is a principal component. And we're going to plot that with 1658 03:32:38,479 --> 03:32:44,559 the y.
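Here's a compact numpy sketch of that idea, finding the direction of largest variance through an eigendecomposition of the covariance matrix and projecting the points onto it; it's only meant to illustrate what the scikit-learn call used later is doing, not the library's actual implementation.

import numpy as np

def pca_1d(X):
    # center the data so the principal component passes through the mean
    X_centered = X - X.mean(axis=0)
    # covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # eigenvectors of the covariance matrix; the one with the largest eigenvalue
    # points in the direction of largest variance (the principal component)
    eigvals, eigvecs = np.linalg.eigh(cov)
    pc = eigvecs[:, np.argmax(eigvals)]
    # right-angle projection of every point onto that direction
    return X_centered @ pc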
Or for example, if there were 100 different dimensions, and you only wanted to take five of 1659 03:32:44,559 --> 03:32:51,199 them, well, you could go and you could find the top five PCA dimensions. And that might be a lot 1660 03:32:51,200 --> 03:32:58,400 more useful to you than 100 different feature vector values. Right. So that's principal component 1661 03:32:58,399 --> 03:33:05,279 analysis. Again, we're taking, you know, certain data that's unlabeled, and we're trying to make 1662 03:33:05,280 --> 03:33:13,760 some sort of estimation, like some guess about its structure from that original data set. If we 1663 03:33:13,760 --> 03:33:20,159 wanted to take, you know, a 3d thing, so like a sphere, but we only have a 2d surface to draw it 1664 03:33:20,159 --> 03:33:26,079 on, well, what's the best approximation that we can make? Oh, it's a circle. Right? PCA is kind of 1665 03:33:26,079 --> 03:33:30,079 the same thing. It's saying if we have something with all these different dimensions, but we can't 1666 03:33:30,079 --> 03:33:35,920 show all of them, how do we boil it down to just one dimension? How do we extract the most 1667 03:33:35,920 --> 03:33:43,200 information from those multiple dimensions? And that is exactly it: either you minimize the projection 1668 03:33:43,200 --> 03:33:50,400 residuals, or you maximize the variance. And that is PCA. So we'll go through an example of that. 1669 03:33:50,399 --> 03:33:57,039 Now, finally, let's move on to implementing the unsupervised learning part of this class. 1670 03:33:57,040 --> 03:34:03,600 Here, again, I'm on the UCI machine learning repository. And I have a seeds data set where, 1671 03:34:04,399 --> 03:34:09,440 you know, I have a bunch of kernels that belong to three different types of wheat. So there's 1672 03:34:09,440 --> 03:34:17,120 Kama, Rosa and Canadian. And the different features that we have access to are, you know, 1673 03:34:17,120 --> 03:34:23,840 geometric parameters of those wheat kernels. So the area, perimeter, compactness, length, width, 1674 03:34:23,840 --> 03:34:30,639 asymmetry, and the length of the kernel groove. Okay, so all of these are real values, 1675 03:34:30,639 --> 03:34:35,119 which is easy to work with. And what we're going to do is we're going to try to predict, 1676 03:34:36,079 --> 03:34:40,479 or I guess we're going to try to cluster, the different varieties of the wheat. 1677 03:34:41,440 --> 03:34:46,960 So let's get started. I have a colab notebook open again. Oh, you're gonna have to, you know, 1678 03:34:46,959 --> 03:34:52,159 go to the data folder, download this. And so I'm going to go to the data folder, download this, 1679 03:34:52,159 --> 03:35:04,239 and let's get started. So the first thing to do is to import our seeds data set into our colab 1680 03:35:04,239 --> 03:35:11,920 notebook. So I've done that here. Okay, and then we're going to import all the classics again, 1681 03:35:11,920 --> 03:35:28,960 so pandas. And then I'm also going to import seaborn, because I'm going to want that for this 1682 03:35:28,959 --> 03:35:40,239 specific class. Okay. Great. So now our columns that we have in our seed data set are the area, 1683 03:35:40,239 --> 03:35:54,879 the perimeter, the compactness, the length, width, asymmetry, groove length, I mean, I'm just going 1684 03:35:54,879 --> 03:36:00,959 to call it groove. And then the class, right, the wheat kernel's class. So now we have to import this, 1685 03:36:00,959 --> 03:36:11,199 I'm going to do that using pandas read CSV.
And it's called seeds data.csv. So I'm going to turn 1686 03:36:11,200 --> 03:36:19,040 that into a data frame. And the names are equal to the columns over here. So what happens if I just 1687 03:36:19,040 --> 03:36:29,120 do that? Oops, what did I call this? Seeds data set, text. Alright, so if we actually look at our 1688 03:36:29,120 --> 03:36:36,800 data frame right now, you'll notice something funky. Okay. And here, you know, we have all the 1689 03:36:36,799 --> 03:36:42,239 stuff under area. And these are all our numbers with some backslash t. So the reason is because we 1690 03:36:42,239 --> 03:36:50,799 haven't actually told pandas what the separator is, which we can do like this. And this backslash t, that's 1691 03:36:50,799 --> 03:36:56,959 just a tab. So in order to ensure that like all whitespace gets recognized as a separator, 1692 03:36:56,959 --> 03:37:04,559 we can actually use this, this is for, like, a space. So any spaces are going to get recognized as data 1693 03:37:04,559 --> 03:37:13,279 separators. So if I run that, now this, you know, this is a lot better. Okay. 1694 03:37:14,559 --> 03:37:20,719 So now let's actually go and, like, visualize this data. So what I'm actually going to do is plot 1695 03:37:20,719 --> 03:37:26,479 each of these against one another. So in this case, pretend that we don't have access to the 1696 03:37:26,479 --> 03:37:31,279 class, right? Pretend that, so this class here, I'm just going to show you in this example, 1697 03:37:31,280 --> 03:37:36,159 that like, hey, we can predict our classes using unsupervised learning. But for this example, 1698 03:37:36,159 --> 03:37:41,440 in unsupervised learning, we don't actually have access to the class. So I'm going to just try to 1699 03:37:41,440 --> 03:37:49,920 plot these against one another and see what happens. So for some i in range, you know, 1700 03:37:49,920 --> 03:37:57,040 the columns minus one, because the class is in the columns. And I'm just going to say for j in range, 1701 03:37:57,040 --> 03:38:06,640 so take everything from i onwards, you know, so, like, the next thing after i until the end of this. 1702 03:38:06,639 --> 03:38:15,519 So this will give us basically a grid of all the different, like, combinations. And our x label is 1703 03:38:15,520 --> 03:38:24,399 going to be columns i, our y label is going to be columns j. So those are our labels up here. 1704 03:38:25,280 --> 03:38:34,000 And I'm going to use seaborn this time. And I'm going to say, scatterplot my data. So our x is going 1705 03:38:34,000 --> 03:38:46,399 to be our x label, our y is going to be our y label. And our data is going to be the data frame that 1706 03:38:46,399 --> 03:38:52,879 we're passing in. So what's interesting here is that we can say hue. And what this will do is say, 1707 03:38:53,520 --> 03:38:57,920 like, if I give this the class, it's going to separate the three different classes into three different 1708 03:38:57,920 --> 03:39:03,200 hues. So now what we're doing is we're basically comparing the area and the perimeter, or the area 1709 03:39:03,200 --> 03:39:10,880 and the compactness. But we're going to visualize, you know, what classes they're in. So let's go 1710 03:39:10,879 --> 03:39:22,399 ahead and, I might have to call show. So great. So basically, we can see, perimeter and area, we 1711 03:39:22,399 --> 03:39:31,760 get these three groups. The area and compactness, we get these three groups, and so on. So these all 1712 03:39:31,760 --> 03:39:40,639 kind of look, honestly, somewhat similar.
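Put together, that loading and plotting step looks roughly like the sketch below; the exact file name is an assumption (whatever the downloaded seeds file was called), and setting sep to a whitespace pattern makes tabs and spaces both act as separators.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "class"]
df = pd.read_csv("seeds_dataset.txt", names=cols, sep=r"\s+")   # file name assumed

# plot every pair of features against each other, colored by the (true) class
for i in range(len(cols) - 1):
    for j in range(i + 1, len(cols) - 1):
        x_label, y_label = cols[i], cols[j]
        sns.scatterplot(x=x_label, y=y_label, data=df, hue="class")
        plt.show()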
Right, so, wow, look at this one. So this one, 1713 03:39:40,639 --> 03:39:44,319 we have the compactness and the asymmetry. And it looks like there's not really, I mean, 1714 03:39:44,319 --> 03:39:48,799 it just looks like they're blobs, right? Sure, maybe class three is over here more, but 1715 03:39:50,000 --> 03:39:55,680 one and two kind of look like they're on top of each other. Okay. I mean, there are some that 1716 03:39:55,680 --> 03:40:00,720 might look slightly better in terms of clustering. But let's go through some of the 1717 03:40:00,719 --> 03:40:05,920 clustering examples that we talked about, and try to implement those. The first thing that we're 1718 03:40:05,920 --> 03:40:16,239 going to do is just straight up clustering. So what we learned about was k means clustering. 1719 03:40:16,239 --> 03:40:29,039 So from SK learn, I'm going to import k means. Okay. And just for the sake of being able to run, 1720 03:40:29,040 --> 03:40:38,640 you know, any x and any y, I'm just going to say, hey, let's use some x. What's a good one, maybe. 1721 03:40:40,959 --> 03:40:47,439 I mean, perimeter and asymmetry could be a good one. So x could be perimeter, y could be asymmetry. 1722 03:40:47,440 --> 03:40:58,159 Okay. And for this, the x values, I'm going to just extract those specific values. 1723 03:40:59,840 --> 03:41:08,639 Alright, well, let's make a k means algorithm, or let's, you know, define this. So k means, 1724 03:41:09,200 --> 03:41:15,760 and in this specific case, we know that the number of clusters is three. So let's just use that. And 1725 03:41:15,760 --> 03:41:27,120 I'm going to fit this against this x that I've just defined right here. Right. So, you know, if I 1726 03:41:27,120 --> 03:41:33,200 create this clusters variable, one cool thing is I can actually go to these clusters, and I can 1727 03:41:33,200 --> 03:41:43,200 say k means dot labels. And it'll give me, if I can type correctly, it'll give me what its 1728 03:41:43,200 --> 03:41:52,159 predictions for all the clusters are. And our actual, oops, not that. If we go to the data frame, 1729 03:41:52,159 --> 03:41:59,440 and we get the class, and the values from those, we can actually compare these two and say, hey, 1730 03:41:59,440 --> 03:42:05,200 like, you know, in general, most of the zeros that it's predicted are the ones, right. 1731 03:42:05,200 --> 03:42:11,360 And in general, the twos are the twos here. And then this third class, one, okay, that corresponds 1732 03:42:11,360 --> 03:42:16,560 to three. Now remember, these are separate classes. So the labels, what we actually call them, don't 1733 03:42:16,559 --> 03:42:23,760 really matter. We can say, map zero to one, map two to two, and map one to three. Okay, and our, 1734 03:42:23,760 --> 03:42:30,880 you know, our mapping would do fairly well. But we can actually visualize this. And in order to do 1735 03:42:30,879 --> 03:42:40,239 that, I'm going to create this cluster data frame. So I'm going to create a data frame. 1736 03:42:40,239 --> 03:42:50,559 And I'm going to pass in a horizontally stacked array with x, so my values for x and y. And then 1737 03:42:51,920 --> 03:42:58,159 the clusters that I have here, but I'm going to reshape them so it's 2d. 1738 03:42:58,159 --> 03:43:14,319 Okay. And the columns, the labels for that, are going to be x, y, and class. Okay. So I'm going 1739 03:43:14,319 --> 03:43:23,520 to go ahead and do that same seaborn scatter plot. Again, where x is x, y is y.
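Roughly, those cells look like the following sketch, reusing the df and column names from above (the exact variable names are assumptions).

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

x_col, y_col = "perimeter", "asymmetry"
X = df[[x_col, y_col]].values

kmeans = KMeans(n_clusters=3).fit(X)
print(kmeans.labels_)        # the cluster assigned to each sample
print(df["class"].values)    # the true varieties, for comparison

# put the points and their assigned clusters into one data frame for plotting
cluster_df = pd.DataFrame(np.hstack((X, kmeans.labels_.reshape(-1, 1))),
                          columns=[x_col, y_col, "class"])
sns.scatterplot(x=x_col, y=y_col, hue="class", data=cluster_df)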
And now, 1740 03:43:23,520 --> 03:43:32,159 the hue is again the class. And the data is now this cluster data frame. Alright, so this here, 1741 03:43:35,760 --> 03:43:42,639 this here is my k means, like, I guess, classes. 1742 03:43:42,639 --> 03:43:54,319 So k means kind of looks like this. If I come down here and I plot, you know, my original data frame, 1743 03:43:54,319 --> 03:44:01,760 this is my original classes with respect to this specific x and y. And you'll see that, honestly, 1744 03:44:01,760 --> 03:44:07,360 like, it doesn't do too poorly. Yeah, I mean, the colors are different, but that's fine. 1745 03:44:07,360 --> 03:44:16,000 For the most part, it gets the information of the clusters, right. And now we can do that with 1746 03:44:16,000 --> 03:44:25,680 higher dimensions. So with the higher dimensions, if we make x equal to, you know, all the columns, 1747 03:44:25,680 --> 03:44:31,680 except for the last one, which is our class, we can do the exact same thing. 1748 03:44:31,680 --> 03:44:38,720 So here, we can 1749 03:44:43,600 --> 03:44:55,360 predict this. But now, our columns are equal to our data frame columns all the way to the last one. 1750 03:44:55,360 --> 03:45:02,079 And then with this class, actually, we can literally just say data frame columns. 1751 03:45:02,079 --> 03:45:09,760 And we can fit all of this. And now, if I want to plot the k means classes. 1752 03:45:11,520 --> 03:45:20,079 Alright, so this is my clustered and my original. So actually, let me see if I can 1753 03:45:20,079 --> 03:45:27,360 get these on the same page. So yeah, I mean, pretty similar to what we just saw. But what's 1754 03:45:27,360 --> 03:45:36,159 actually really cool is, even something like, you know, if we change. So what's one of them 1755 03:45:36,159 --> 03:45:47,280 where they were, like, on top of each other? Okay, so compactness and asymmetry, this one's messy. 1756 03:45:47,280 --> 03:45:57,680 Right. So if I come down here, and I say compactness and asymmetry, and I'm trying to do this in 2d, 1757 03:45:58,959 --> 03:46:05,119 this is what my scatterplot looks like. So this is what, you know, my k means is telling me for these two 1758 03:46:05,120 --> 03:46:12,000 dimensions, for compactness and asymmetry, if we just look at those two, these are our three classes, 1759 03:46:12,000 --> 03:46:17,520 right? And we know that the original looks something like this. And are these two remotely 1760 03:46:18,239 --> 03:46:25,119 alike? No. Okay, so now if I come back down here, and I rerun this higher dimensions one, 1761 03:46:25,120 --> 03:46:31,280 but actually, for this clusters variable, I need to get the labels of the k means again. 1762 03:46:34,559 --> 03:46:38,399 Okay, so if I rerun this with higher dimensions, 1763 03:46:38,399 --> 03:46:45,600 well, if we zoom out, and we take a look at these two, sure, the colors are mixed up. But in general, 1764 03:46:45,600 --> 03:46:52,000 the three groups are there, right? This does a much better job at assessing, okay, 1765 03:46:52,000 --> 03:47:01,200 what group is what. So, for example, we could relabel the one in the original class to two. 1766 03:47:01,200 --> 03:47:08,400 And then we could make, sorry, okay, this is kind of confusing.
But for example, if this light pink 1767 03:47:08,399 --> 03:47:15,600 were projected onto this darker pink here, and then this dark one was actually the light pink, 1768 03:47:15,600 --> 03:47:21,280 and this light one was this dark one, then you kind of see like these correspond to one another, 1769 03:47:21,280 --> 03:47:26,159 right? Like even these two up here are the same class as all the other ones over here, which are 1770 03:47:26,159 --> 03:47:31,039 the same in the same color. So you don't want to compare the two colors between the plots, 1771 03:47:31,040 --> 03:47:37,680 you want to compare which points are in what colors in each of the plots. So that's one cool 1772 03:47:37,680 --> 03:47:44,079 application. So this is how k means functions, it's basically taking all the data sets and saying, 1773 03:47:44,079 --> 03:47:50,239 All right, where are my clusters given these pieces of data? And then the next thing that we 1774 03:47:50,239 --> 03:47:58,319 talked about is PCA. So PCA, we're reducing the dimension, but we're mapping all these like, 1775 03:47:58,319 --> 03:48:02,799 you know, seven dimensions. I don't know if there are seven, I made that number up, but we're 1776 03:48:02,799 --> 03:48:09,199 mapping multiple dimensions into a lower dimension number. Right. And so let's see how that works. 1777 03:48:10,079 --> 03:48:16,159 So from SK learn decomposition, I can import PCA and that will be my PCA model. 1778 03:48:16,159 --> 03:48:22,479 So if I do PCA component, so this is how many dimensions you want to map it into. 1779 03:48:22,479 --> 03:48:28,319 And you know, for this exercise, let's do two. Okay, so now I'm taking the top two dimensions. 1780 03:48:29,360 --> 03:48:39,600 And my transformed x is going to be PCA dot fit transform, and the same x that I had up here. 1781 03:48:39,600 --> 03:48:46,559 And the same x that I had up here. Okay, so all the other all the values basically, area, 1782 03:48:46,559 --> 03:48:54,799 perimeter, compactness, length, width, asymmetry, groove. Okay. So let's run that. And we've 1783 03:48:54,799 --> 03:49:02,399 transformed it. So let's look at what the shape of x used to be. So they're okay. So seven was right, 1784 03:49:02,399 --> 03:49:10,879 I had 210 samples, each seven, seven features long, basically. And now my transformed x 1785 03:49:14,639 --> 03:49:20,079 is 210 samples, but only of length two, which means that I only have two dimensions now that 1786 03:49:20,079 --> 03:49:26,159 I'm plotting. And we can actually even take a look at, you know, the first five things. 1787 03:49:27,200 --> 03:49:30,320 Okay, so now we see each each one is a two dimensional point, 1788 03:49:30,319 --> 03:49:37,600 each sample is now a two dimensional point in our new in our new dimensions. 1789 03:49:38,879 --> 03:49:42,959 So what's cool is I can actually scatter these 1790 03:49:46,639 --> 03:49:53,519 zero and transformed x. So I actually have to 1791 03:49:53,520 --> 03:49:59,280 take the columns here. And if I show that, 1792 03:50:01,920 --> 03:50:06,879 basically, we've just taken this like seven dimensional thing, and we've made it into a 1793 03:50:06,879 --> 03:50:12,079 single or I guess to a two dimensional representation. So that's a point of PCA. 1794 03:50:13,200 --> 03:50:20,800 And actually, let's go ahead and do the same clustering exercise as we did up here. If I take 1795 03:50:20,799 --> 03:50:29,840 the k means this PCA data frame, I can let's construct data frame out of that. 
And the data 1796 03:50:29,840 --> 03:50:40,399 frame is going to be H stack. I'm going to take this transformed x and the clusters that reshape. 1797 03:50:40,399 --> 03:50:46,559 So actually, instead of clusters, I'm going to use k means dot labels. And I need to reshape this. 1798 03:50:46,559 --> 03:50:58,799 So it's 2d. So we can do the H stack. And for the columns, I'm going to set this to PCA one PCA two, 1799 03:50:59,680 --> 03:51:07,200 and the class. All right. So now if I take this, I can also do the same for the truth. 1800 03:51:08,159 --> 03:51:13,200 But instead of the k means labels, I want from the data frame the original classes. 1801 03:51:13,200 --> 03:51:20,720 And I'm just going to take the values from that. And so now I have a data frame for the k means 1802 03:51:20,719 --> 03:51:27,199 with PCA and then a data frame for the truth with also the PCA. And I can now plot these similarly 1803 03:51:27,200 --> 03:51:32,320 to how I plotted these up here. So let me actually take these two. 1804 03:51:32,319 --> 03:51:41,279 Instead of the cluster data frame, I want the this is the k means PCA data frame. This is still going 1805 03:51:41,280 --> 03:51:51,200 to be class, but now x and y are going to be the two PCA dimensions. Okay. So these are my two PCA 1806 03:51:51,200 --> 03:51:58,159 dimensions. And you can see that the data frame is going to be the same as the cluster data frame. 1807 03:51:58,159 --> 03:52:05,760 So these are my two PCA dimensions. And you can see that, you know, they're, they're pretty spread 1808 03:52:05,760 --> 03:52:14,319 out. And then here, I'm going to go to my truth classes. Again, it's PCA one PCA two, but instead 1809 03:52:14,319 --> 03:52:22,000 of k means this should be truth PCA data frame. So you can see that like in the truth data frame 1810 03:52:22,000 --> 03:52:29,520 along these two dimensions, we actually are doing fairly well in terms of separation, right? It does 1811 03:52:29,520 --> 03:52:36,720 seem like this is slightly more separable than the other like dimensions that we had been looking at 1812 03:52:36,719 --> 03:52:45,359 up here. So that's a good sign. And up here, you can see that hey, some of these correspond to one 1813 03:52:45,360 --> 03:52:51,440 another. I mean, for the most part, our algorithm or unsupervised clustering algorithm is able to 1814 03:52:51,440 --> 03:52:59,680 to give us is able to spit out, you know, what the proper labels are. I mean, if you map these 1815 03:52:59,680 --> 03:53:05,200 specific labels to the different types of kernels. But for example, this one might all be the comma 1816 03:53:05,200 --> 03:53:09,360 kernel kernels and same here. And then these might all be the Canadian kernels. And these might all 1817 03:53:09,360 --> 03:53:14,960 be the Canadian kernels. So it does struggle a little bit with, you know, where they overlap. 1818 03:53:14,959 --> 03:53:21,119 But for the most part, our algorithm is able to find the three different categories, and do a 1819 03:53:21,120 --> 03:53:26,480 fairly good job at predicting them without without any information from us, we haven't given our 1820 03:53:26,479 --> 03:53:32,879 algorithm any labels. So that's a gist of unsupervised learning. I hope you guys enjoyed 1821 03:53:32,879 --> 03:53:38,799 this course. I hope you know, a lot of these examples made sense. 
If there are certain things 1822 03:53:38,799 --> 03:53:44,239 that I have done, and you know, you're somebody with more experience than me, please let me know 1823 03:53:44,239 --> 03:53:50,559 in the comments and we can all as a community learn from this together. So thank you all for watching.