[00:16] Let's start with a quick review. Last week we looked at BERT and how it was created, and we learned about a technique called masking, which is a kind of self-supervised learning. The idea of masking was very simple. We had seen how people can take images and pre-train models like ResNet on a vast body of images, but for each image somebody had to go and label it. So for text we asked: what does it mean to label a piece of text when we don't have a clearly defined end goal in mind, beyond the general goal of pre-training? And the answer was: we can replace some of the words in every sentence with a mask token and then train the network to recover the blanks, to fill in the blanks. This technique, which is one of many ways of doing what's called self-supervised learning, is called masking. We described how, if you take all of Wikipedia, mask every sentence like this, and train a network to fill in the blanks, the resulting network becomes really good at all kinds of interesting things, and that one of the first such networks was called BERT. In your homework you've been working with BERT, and so on. So that's masking. Now we're going to switch gears and talk about a different kind of self-supervised learning, which turns out to be, weirdly, more interesting and powerful.

[01:45] So we're going to look at another technique, and this technique is called next word prediction. It is in some sense a special case of masking: instead of randomly picking a word in a sentence and blanking it out, you take the last word and make it the blank. You send the sentence in and have the machine fill in the blank on the last word, that is, predict the next word. And you don't have to use full sentences; you can use parts of sentences, sentence fragments, as well. If you take the same sentence as before, "the mission of the MIT Sloan School...", you can divide it up: you can give it "the" and ask it to predict "mission"; you can give it "the mission" and ask it to predict "of"; you can give it "the mission of" and ask it to predict "the"; you get the idea. Every sentence fragment can be turned into an example: give it the first few words and predict the next one; first few, next one; first few, next one. So this is next word prediction.
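Here is a small sketch (my own illustration, not code from the lecture) of how one sentence expands into (first few words, next word) training examples:

```python
# Turn one sentence into (prefix, next-word) examples for next word prediction.
sentence = "the mission of the MIT Sloan School".split()

pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
# the -> mission
# the mission -> of
# the mission of -> the
# ...and so on for every fragment of the sentence.
```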
[02:44] So what we're going to do now is take the transformer encoder architecture that we used to build BERT in the last class and try to use it to solve next word prediction, to build a model that can do next word prediction.

[03:01] Take the phrase "the cat sat on the mat". What you might want to do is set up input/output pairs: input "the", output "cat"; input "the cat", output "sat"; input "the cat sat", output "on"; and so on; and finally input "the cat sat on the", output "mat". That's basically what we have, all these inputs and outputs. But we're going to express it very compactly, as if it's just one data point in one batch. We stack it up so that on the input side we have "the cat sat on the", meaning everything but the last word, and then we take that same sentence and shift it to the left by one, cutting off the first word, and that becomes the output: "cat sat on the mat". When you look at it that way, you can see that you want "the" to be used to predict "cat", you want "the cat" to be used to predict "sat", and so on and so forth. This is just a little manipulation so that we don't need dozens of separate examples for one starting sentence.

[04:44] If you have something like this, you can run it through positional input embeddings, like we have done before with BERT. Then we run it through a whole stack of transformer blocks and get contextual embeddings. Then we run them through one or more dense/ReLU layers if you want, because it's always a good idea to stick some ReLUs at the very end. And then we attach a softmax to every one of the outputs, and each softmax ranges over the entire vocabulary.

[05:23] For now, let's assume the vocabulary is a vocabulary of words, not tokens; we'll get into tokens a bit later in the class. And roughly speaking, let's say there are 50,000 words in our vocabulary. So each of these softmaxes, and this is exactly what we did for BERT, by the way, is a 50,000-way softmax.
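A small sketch (my own, with assumed shapes, not the lecture's code) of that shift-by-one setup and of what the stack produces: one vocabulary-sized softmax per input position.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
words = "the cat sat on the mat".split()

input_ids  = np.array([vocab[w] for w in words[:-1]])   # "the cat sat on the"  (drop the last word)
target_ids = np.array([vocab[w] for w in words[1:]])    # "cat sat on the mat"  (drop the first word)

seq_len, vocab_size = len(input_ids), 50_000
logits = np.random.randn(seq_len, vocab_size)           # stand-in for embeddings + transformer stack + dense
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # one softmax per position

print(probs.shape)   # (5, 50000): the softmax at position i should put its mass on target_ids[i]
```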
[05:43] But since we are fundamentally interested in next word prediction, as you will see later on, we are actually going to ignore all of these predictions, because who cares? We are only going to look at the last one to figure out: what is the last prediction? Because the last prediction is based on everything that came before it. That is really the next word being predicted; the things before it we don't care about so much. All of this will become slightly clearer as you make a couple of passes through it. Yeah?

[06:20] >> How do we [inaudible]?

>> The notion of a sentence has disappeared at this point. When we look at how we tokenize the input for these kinds of models, we're actually going to take punctuation into account: periods, exclamation marks, and so on and so forth. That will answer your question, and we'll come back to it.

[06:47] So, just to be clear: the embedding coming out of the final dense layer is passed through its own softmax, with the number of softmax categories equal to the vocabulary size.

[07:01] All right. Now, let's say we train a model like this with lots of inputs and outputs. It just looks like BERT, right? It's not that different, except there's no notion of a mask. Do you notice any problems with the way this thing has been set up?

[07:19] >> For some words, like "the", you're going to have a lot of potential output pairs that come out of that.

>> True. Which means that if you have a word like "the", the next word...

>> ...is hard to predict.

>> True. So some words may be hard to predict depending on the last word of the input sentence. That's what you're getting at. Yeah. Other concerns?

[07:43] >> Since you're using contextual embeddings, the output at the first word is going to have access to the second word, so it's kind of like cheating.

>> Bingo. And remember, "bingo" is a technical term in deep learning which means "great."

[08:01] So, as she points out, look at the self-attention layer. Remember that the self-attention layer is the key building block of the transformer block, and in the self-attention layer we calculate every word's contextual embedding by a weighted average over its relationships to all the other words in the sentence. So the last word can see the first word, the first word can see the last word, and so on and so forth. But when you're doing next word prediction, this feels problematic, because you're peeking into the future.

[08:40] Say you want to predict the next word. With this architecture, the model can simply copy it from the input, because it can see the whole sentence. If I tell you "the cat sat on the mat" and then ask, given "the cat sat on the", can you predict the next word for me, you'll say: yeah, duh, it's "mat".
[09:01] The whole thing becomes challenging only if I say "the cat sat on the ___", now predict the blank. To put it another way: say you have fed in the first two words and you want to predict the third; that is the right answer for the prediction, and the network should only use the first two. However, because self-attention can see "sat", it can see this next word, it will trivially learn to predict the next word to be "sat". There is no challenge for it. So this is the key problem if we just use the transformer as is.

[09:41] >> What's our loss function here?

>> The loss function in all these things is actually the same as before. Imagine you have a traditional classification problem with one output, say classifying things into 10 categories like we did with Fashion-MNIST, 10 digits. You have 10 outputs, they go through a softmax, you get 10 probabilities, and there we used cross entropy. Here, for every one of these outputs we use cross entropy: a cross entropy just for this output, plus a cross entropy for that one, and so on and so forth. So we still minimize cross entropy, just the sum of all these cross entropies.

[10:22] >> And does it get complicated at all by the fact that we have a large vocabulary size now?

>> It gets complicated just in the sense that there is more to worry about, compute and so on, but conceptually there's no difference between 10 categories and 50,000; it's the same thing. It's just that instead of classifying one input into one of 10 categories, the inputs themselves are as long as the number of words in your sentence, so each word coming into your sentence is being classified in one of 50,000 ways. Essentially you have as many classification problems as you have words in the sentence. At the end of the day, the loss function is just the sum of all those things, or to be more precise, the average.

[11:02] Actually, I think I have a slide about this that I had hidden because I wasn't sure I would have time; let's unhide it. And no, we did not agree ahead of time to set this up like this. So yes, we still use the cross entropy loss function. The cross entropy is minus the log probability of the right answer; you may recall this from earlier in the class. We just do the same thing for "cat", "sat", "on", "the", everything, and then we take the average (1/7 in the slide's example). Boom. That's it.
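A minimal sketch (my own, reusing the arrays from the earlier sketch) of that loss: one cross entropy per position, minus the log probability of the correct next word, averaged over all positions.

```python
import numpy as np

def next_word_loss(probs, target_ids):
    # probs:      (seq_len, vocab_size) rows coming out of the per-position softmaxes
    # target_ids: (seq_len,) index of the correct next word at each position
    per_position = -np.log(probs[np.arange(len(target_ids)), target_ids])  # -log p(right answer)
    return per_position.mean()                                             # average over positions

# e.g. next_word_loss(probs, target_ids) with the probs and target_ids built above
```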
[11:47] So, to go back to the problem: the issue is that we can't allow words to be predicted knowing the future. They should only know about the past words. So what do we do? We have to make a change to the transformer to make it work for next word prediction. What we're going to do is this: when we are calculating the contextual embedding for a word, remember the contextual embedding for a word is a weighted average of the other words' embeddings, we will simply give zero weight to future words. If you give zero weight to future words, it's almost as if they don't exist.

[12:26] This will become clear in a second. Imagine this is the table we are going to calculate: for every word in the sentence we calculate the pairwise attention weights; you'll remember I went through this with the iPad last week. All the weights in each row add up to one. So you take the embeddings of "the cat sat on the mat", multiply them by the respective weights, which add up to one, in the first row of the table, and that gives you the contextual embedding for the word "the", and so on and so forth. And since we can't look at future words, all we do is take this table and zero out everything in red, everything above the diagonal, and then we renormalize so that the remaining nonzero cells still add up to one in each row. What that means is that for "cat", only the weights on the words up to and including "cat" play a role. To give an example: to predict "on", you'll only look at the words "the cat sat"; the rest will not be considered at all.

[13:51] The effect of doing all this, by the way, is called causal self-attention; this tweak is called causal self-attention. It is also called masked self-attention; they're just different labels for the same thing. And what it means is that when you're looking at the input, for "the", only "the" is going to be used to predict "cat"; for "the cat", only those two are going to be used to predict "sat"; and so on and so forth. Okay.
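A small sketch (illustrative random numbers, not the lecture's table) of exactly that operation: zero out the attention weights on future words, then renormalize each row so the surviving weights still add up to one.

```python
import numpy as np

words = ["the", "cat", "sat", "on", "the", "mat"]
n = len(words)

weights = np.random.rand(n, n)
weights = weights / weights.sum(axis=1, keepdims=True)   # full (bidirectional) attention rows

causal_mask = np.tril(np.ones((n, n)))                   # 1 on and below the diagonal, 0 above
masked = weights * causal_mask                            # zero weight to future words
masked = masked / masked.sum(axis=1, keepdims=True)       # renormalize each row to sum to 1

print(np.round(masked, 2))   # row i now only attends to words 0..i
```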
[14:28] So all we do is go into the transformer and change each attention head to be a causal attention head. The way it's actually done under the hood is very elegant, for computational efficiency purposes, but I won't get into it because it gets a bit involved. The key idea is: replace basic, plain vanilla attention with causal attention, a.k.a. masked attention. You do that and, boom, suddenly it starts working for next word prediction; it can't cheat anymore. And when we do that, we get the transformer causal encoder. By the way, the word "causal" here has no connection to causality; it's just a term.

[15:19] If you look at the original transformer paper, it was created for machine translation, English to German, those kinds of use cases. So it had something called an encoder, which we are very familiar with from last week, and it had something called a decoder; it's called the encoder-decoder architecture. We are not going to cover the encoder-decoder architecture, because we are not covering machine translation in this class, but I'm mentioning it because that part of the architecture is called a decoder, and because it uses this masked attention business, the transformer causal encoder is also sometimes referred to as a transformer decoder. So the word "decoder" has two meanings: it's a synonym for the causal encoder, like we've seen today, and it's also used in sequence-to-sequence translation problems to refer to the second part of that architecture. It will become clear from context which one we mean; in this course there's no confusion, because we're not going to be looking at translation. We may say decoder or causal encoder; it's the same thing.

[16:32] >> I thought there were some transformers that use bidirectional attention. Is that different from this?

>> No. All "bidirectional" means is that the model can see everything. The encoder we looked at last week, the basic self-attention thing, is bidirectional: it can look in both directions to see what other words are there. In causal attention you're just not using the words in the future. Correct.

[17:02] All right. So, to summarize where we are: what we looked at last week for BERT is the transformer encoder. We take the same thing, and instead of multi-head attention we use causal multi-head attention, and we get the decoder, a.k.a. causal encoder. We use the left one for masked prediction and the right one for next word prediction. All right.
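The under-the-hood detail is skipped in the lecture; one common way it's implemented (an assumption on my part, not something the lecturer showed) is to fold the mask into the softmax itself: add a very large negative number to the attention scores of future positions before the softmax, which drives those weights to essentially zero without a separate renormalization step.

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: (n, n) raw attention scores, entry (i, j) relating query word i to key word j
    n = scores.shape[0]
    future = np.triu(np.ones((n, n)), k=1).astype(bool)    # True strictly above the diagonal
    scores = np.where(future, -1e9, scores)                 # "minus infinity" on future words
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # softmax, row by row
    return e / e.sum(axis=1, keepdims=True)                 # rows sum to 1; future weights are ~0

print(np.round(causal_attention_weights(np.random.randn(4, 4)), 2))
```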
[17:32] So now, if instead of an encoder you have a causal encoder, a TCE, we can train models for next word prediction using the same exact approach as before. We set up the inputs and the outputs like I described earlier, run it through a stack of causal encoders, dense, ReLU, softmax, and so on and so forth. Otherwise the details don't change; the all-important change goes into the attention layer, making it masked, or causal.

[18:02] Any questions so far?

[18:08] >> This would only apply when we're training the model, not when we're validating and testing, right?

>> If you give me a sentence after training, the final prediction is the only thing you care about, and by definition the final prediction will use everything that came before it. So we are okay. Was that your question?

>> No. I think the fact that we're zeroing out the weights on the future words would apply more when we're training the model and trying to minimize the loss, as opposed to when we're generating the next word.

>> Right, but the point is: when we actually use the model for inference, once we've finished training, what do we want to do? The objective is, given a particular string, get me the next word. And to find the next word you can in fact use everything that came before it. Therefore, without any change to this model, it'll just work for your intended purpose; you don't have to go in and unmask it for inference, because you don't need to. Yes?

[19:14] >> I have one question regarding the causal transformers: we are putting certain weights to zero for the words which are to be predicted, and then we...

>> No: the words that are in the future.

>> The future, yeah. And then we normalize it. And we trained a transformer earlier on all the words packed together. So won't there be a difference in weights between the two?

>> Between the two ways of training? The weights are going to be very different, and they are two different models. BERT is used for certain things, and this kind of model, which is the basis of GPT, is going to be used for other things.

>> But we are training it as well like that, I mean, while putting some of the weights to zero.

>> Correct, correct.
[19:56] So what I'm talking about here is this: let's say we want to do next word prediction as the self-supervised learning task, and we want to train such a model on a vast amount of text data. Well, we can't just use what we did last week, because it's not going to work; it can see the future. Therefore we make a tweak and build this model. Now the question becomes: what can you do with such a model? We've basically trained two different kinds of models, one that can see everything, BERT, and one that can't see the future, which is actually GPT. So what can you do with it? We're going to come to that.

[20:32] All right. Once you train such a model, given any input sentence, let's say "it was a dark and", it goes through all these layers. And remember what I said earlier: the fact that it's predicting something for a word it has just seen, we don't really care about. All we're really curious about is: what is the next thing it's going to say? And the next thing it's going to say is basically what comes out of the last softmax. Does that make sense? We don't care about anything before it, because we already have a half-formed sentence and we just want the next word. The earlier outputs will come out of the architecture of the model, but we throw them out; we don't even pay any attention to them. We only look at what's coming out of this last one.

[21:26] And what comes out of the softmax, remember, is a 50,000-way table of probabilities. That's what a softmax is: a whole bunch of probabilities that add up to one. Let's say the table runs from "aardvark" all the way to "zebra", with a probability for each. So for "it was a dark and", just for kicks, I've put "stormy" as the highest-probability word, at 0.6, and all these numbers add up to one. We have this table.

[21:59] Then what we do is choose a token from this table. We get to choose: there's a whole bunch of numbers in this table, and we get to pick a token. The simplest thing one can think of is to just choose the word that is the most likely, and we're going to have a whole section on how to choose these things coming up. For now, let's go with the simple option: we choose the most likely one, the one at 0.6, and then we attach it to the input. So now the input has become "it was a dark and stormy". We run it through again, and again we only care about the last softmax. Okay, we do that.
We get another table, and the table keeps changing, because the softmax is different each time you run it through: the input has changed. So you get a new table, and it turns out the most likely word is "night". So "night" comes out the other end, we attach "night" here, and we keep on going. We can keep going until, say, we tell the model: generate up to 100 tokens and then stop. It might stop after 100, or the model may in fact decide that when it sees punctuation like a period or an exclamation mark it's going to stop; we have control over when and how it stops. But this is the basic process, and you folks are all very used to it, because you've all been playing with ChatGPT and the like. The basic building block is: predict the next word, feed it back into the input, predict the next word, and keep doing it. You keep doing it and suddenly it's writing entire novels for you.

[23:42] >> Does that mean that the longer the initial input is, the better the prediction you get?

>> It depends on your objective. Fundamentally, you have some task you want the thing to do for you, and you need to give it all the information it can plausibly find useful. So the more helpful the input, the better; maybe that's how I would say it.

[24:09] >> Would this also apply to something like Google Search? Do they also do next-word prediction, just with a deeper model?

>> For Google autocomplete, for example, I don't know if they actually use this kind of model under the hood or not; I just don't know, these things tend to be kept tightly under wraps. I don't know if you folks have seen, over the last few months there's a generative AI panel that opens up when you do a Google search; that panel, I suspect, uses this. But I don't know whether the default Google autocomplete actually uses it or not, because it's very compute-heavy. So I don't know what they do. Other questions on the mechanics of this?
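A compact sketch of the generation loop just described, in its greedy form. The model here is a stand-in function (an assumption for illustration only); in reality it would be the trained transformer causal encoder returning the last position's softmax over the 50,000-way vocabulary.

```python
import numpy as np

def last_position_softmax(token_ids, vocab_size=50_000):
    # Placeholder for the real model: returns a probability table over the whole vocabulary.
    rng = np.random.default_rng(len(token_ids))          # deterministic noise, not a real model
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prompt_ids, max_new_tokens=100):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = last_position_softmax(ids)               # only the last softmax matters
        next_id = int(np.argmax(probs))                  # greedy: take the most likely token
        ids.append(next_id)                              # feed the prediction back into the input
    return ids

print(generate([11, 42, 7], max_new_tokens=5))
```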
[25:01] >> For our vocabulary list, I'm assuming it's static?

>> Yeah, correct. And as you will see, it's not really a word vocabulary, it's a token vocabulary, but yes, it is static for a given model.

>> And for Google or any other search engine that wouldn't necessarily be static... I guess I'm thinking about what happens with new words that get coined, and how it handles that if the vocabulary is static.

>> There's a very elegant solution to that coming up.

[25:45] All right. So now, in other words, we have learned how to do sequence generation. We already saw that we can do classification and labeling with BERT-like models, which are trained on masked prediction. And now we know how to generate sequences: we just need to use a transformer causal encoder.

[26:10] These kinds of models, sequence generation models trained on text sequences using next word prediction, are called autoregressive language models, or causal language models. And of course the GPT family is perhaps the most well-known example of an autoregressive, causal language model. "Autoregressive" because, as those of you who have done econometrics and some regression know, autoregression means that you predict something and then use the past predictions as inputs the next time you predict. That's the notion of autoregression: you predict, feed the prediction back, get the next prediction, and keep cycling through. Yes?

[26:51] >> So when you put an input into GPT, for example, and it shows you the next words as they're coming, is that an indication of it doing this recalculation you described here?

>> Correct, that's exactly what's going on. In fact, if you use the API, there's a thing called the streaming API, where it will actually stream each token that comes out on every pass, and you can see everything very clearly. But when you work with the web interface and you see the thing typing almost like a human, what I've heard from people, and I don't know if this is true, is that they can actually do it much faster; they slow it down intentionally to give you the feeling that it's coming from a human. It's like a UX trick: when you're interacting with a chatbot and you see the bubble and the slow typing, it's sometimes intentionally slowed down, because otherwise you'd know it's a bot. So there's maybe a little bit of UX creepiness going on. I don't know to what extent this is 100% true or how pervasive it is, but folks who work in the field have told me it's not uncommon.

[28:10] Okay, so that's what's going on here.
[28:12] These are language models, and of course GPT-3 is an autoregressive language model. The reason we put an "L" in front of "LM" is that it was trained on lots of data with lots of parameters; at some point it's not a small language model anymore, it's a large language model, so: LLM. Nothing more momentous than that.

[28:31] As it turns out, GPT-3 uses 96 transformer blocks, 96 blocks, and each block has 96 causal attention heads. You can read the GPT-3 paper; it gives you all the details of the architecture. That's interesting because for GPT-4 they didn't publish the architecture; after GPT-3 everything became closed, so we actually don't know what the architecture is, even though there's a lot of speculation on Twitter. But for GPT-3 we know exactly what happened: 96 blocks, each with 96 causal attention heads. As for the data, they scraped about 30 billion sentences from a whole bunch of sources: web text, Wikipedia, a bunch of book databases. And then they took those 30 billion sentences and trained on exactly this, next word prediction. That's it.

[29:27] When they trained GPT-3, I think it cost them a lot of money, because we hadn't yet figured out how to do it as efficiently as we know now. But it was still pretty amazing, and I'll talk about what is so special about GPT-3 in a minute or two. As you folks have seen, the notion of generating text is very powerful. We can obviously generate text, but we can also generate code, because code is just text; we can generate documentation for code; we can summarize text; we can answer questions; we can do chat; the list goes on. All the excitement around generative AI since ChatGPT came out is precisely because the simple idea of text in, text out is so flexible, so versatile. It can handle all sorts of use cases. That's why there's so much excitement.

[30:17] By the way, if you're really curious, I would recommend watching the video where Andrej Karpathy builds GPT from scratch. It's a fantastic video; if you have even a little bit of curiosity about how these things are actually built, I strongly recommend checking it out. There's also a little blog post where someone shows that, if you know NumPy, you can create a GPT using NumPy without using any frameworks. I found it super interesting and helpful for understanding exactly what's going on, so take a look if you'd like.

[30:55] Okay. So now we're going to talk about decoding, or sampling strategies. As I said, when we come up with the softmax for that last token, we have 50,000 choices.
What do we pick? As it turns out, to get really good performance out of generative AI systems like ChatGPT, you need to be quite thoughtful about how to decode, how to actually sample from that table. So we'll talk about that for a bit.

[31:25] First, a definition: the process of choosing a token from the probability distribution coming out of the softmax (I'm sticking with this table here; this is the softmax) is called decoding. That's the technical term for it: we get this table, and we have to decode, meaning we have to pick something from the table.

[31:48] There are two extreme, very simple ways to do it. The first, of course, is to just pick the word with the highest probability. This is called greedy decoding. In this case, for example, if "stormy" at 0.6 is the highest probability in the whole table, we just pick "stormy". That's the obvious, extreme, simple case. The other thing we can do, which is also super simple, is this: because we have a probability table here, we can just reach into the table and sample a word from it in proportion to its probability. Which means that if you sample from this table 100 times, about 60 of those times you'll get "stormy", because its probability is 0.6, but some small fraction of the time you may get strange things like "aardvark" or "zebra". You're literally doing random sampling. That's a fine way to do it too; there's nothing wrong with it. So these are both options.
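A small sketch of the two simple options on a toy softmax table (the words and probabilities are illustrative, loosely following the lecture's running example):

```python
import numpy as np

words = np.array(["stormy", "night", "foggy", "aardvark", "zebra"])
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])         # the softmax table; must sum to 1

greedy_choice = words[np.argmax(probs)]                # greedy decoding: always picks "stormy"

rng = np.random.default_rng(0)
random_choice = rng.choice(words, p=probs)             # random sampling, in proportion to probability

print(greedy_choice, random_choice)
```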
[32:50] The key thing you need to remember is that which one you pick, and there are some variations on them which we'll get to in a moment, really depends on what your task is, what you're trying to use the system, the LLM, for. The broad thing to remember is this: if you're working on questions for which the factual accuracy of the response is really important, and/or you want the output to be deterministic, meaning every time you ask a particular question you really want the same answer back (imagine a customer support agent where two different customers ask the same question and get different answers; you don't want that, so you want deterministic outputs), then greedy decoding is a good starting point. You won't get any random stuff, because for any given input sentence the softmax table that comes out is not going to change; it's the same table, and if you're always picking the highest number in the table, that's not going to change either.

[34:03] So: guaranteed determinism. And I've found that for reasoning questions, math questions, logic questions, you should really keep it as greedy as possible, in my experience. Now, there are other situations where random sampling is actually a better option. If you're doing creative things, write a poem, write a haiku, write a screenplay, things like that, you do want a lot of creativity, in which case randomness is your friend. You get a lot of different varieties of responses, diversity of responses, and all that is really good. The price you pay is that you lose determinism: the outputs are going to be stochastic, they're going to be random, and the answer to the same question is going to vary again and again. But in many cases maybe that's okay, you don't care. So that's roughly how to think about it.

[34:50] The other thing I want to say is that diversity of response is also important because, if you imagine a chatbot that always responds in the same stilted, robotic fashion, it starts to get annoying. You want some variation in the output, because a human will never give you exactly the same thing back. Though I must say that when I interact with call center agents, I think they're just cutting and pasting from a text library, so it does look kind of robotic; maybe we're already used to this. Anyway, those are some of the things to keep in mind. Yeah?

[35:24] >> If you're using random sampling, do you end up with a better estimate of the uncertainty? Are the probabilities more calibrated, in the sense that the table you end up with is the real probability you observe from the words in your corpus?

>> The table doesn't change regardless of how you sample from it. The table is the starting point for sampling. All of decoding is about which token from the table you're going to pull out.

>> Oh, so it doesn't impact the loss function.

>> No. All those things are fixed. You literally get the table, and then you can forget how you got the table; now decoding starts.

[36:06] >> Is the reason it would generate a different answer given the same prompt, if we run it again and again, because they are using random sampling?

>> Correct, that's exactly why. And I'll do a demo of it very shortly, because you can actually manipulate it.

[36:22] >> If you do the prediction word by word, is there a way to make it resilient to mistakes? Like if it says "the night was dark and hard work", that can mess up the next word, right?

>> It can totally mess it up.

>> So can it get itself back on track?

>> It cannot. Great question, and we'll look at an example of things going off the rails in just a second. Yep.
[36:46] >> Is this how Bing works, where you can slide between being more creative and more accurate?

>> Yeah, exactly. Bing has creative, balanced, and precise modes. Under the hood they're basically manipulating some of the parameters we're going to look at in just a moment; they're just manipulating them for you. But if you use the API, you can manipulate them directly.

[37:09] All right. So here's the basic thing to remember about random sampling. Our hope is that, for any given sentence, there is probably some set of good answers for the next word and a whole bunch of bad answers, intuitively. So we want the probability on the good stuff. You can imagine the distribution: if you sort the words from high to low probability, there is the head of the distribution, the first few words, and then there's the long tail of irrelevant words. Our hope is that the model is so good that, for any given input phrase, it concentrates the output probability in the softmax on just a few good words and more or less zeros out everything else. That is the ideal scenario, because in that scenario, if you do random sampling, by definition you'll pick something from the high-quality head of the distribution, and life is good.

[38:13] Now, we want random sampling to sample from the head and not from the tail. That's the key point. And what do I mean by head and tail? Let's be very clear. Take the softmax table we looked at, which went from "aardvark" to "zebra", and sort it from high to low probability. Maybe what happens is that "stormy" has a probability of, I don't know, 0.6, and if I remember right "night" had a probability of 0.3, and then there's a whole bunch of other words, all the way down to the 50,000th word, from highest to lowest probability. You can think of this as a probability distribution. What we're saying is that the first few words are the head of the distribution, while the long tail of the remaining words is the tail, and we want our system to grab something from the head and not from the tail, because the head is the relevant, useful, good stuff. That's really what we're trying to do here. Does it make sense? Okay.

[39:32] So, to come back to this, here is the most important point to remember about this slide: while the probability of choosing any individual word in the long tail is pretty small, the probability of choosing some word from the tail is high. Some word from the tail: high.
[39:56] To go back to the example: 0.6 plus 0.3 means there's a 0.9 probability the next word is going to be either "stormy" or "night", but there's a 10% probability it's going to be one of the other words, and who knows which word that might be; it might be some random nonsense word. What that means, and this goes back to the point from before, is that if the LLM happens to sample a token from the tail, which is not good, it won't be able to recover from its mistake; it'll just go off the rails. Which is why every word that gets generated is really important to get right, because very often it can't recover.

[40:37] >> Is there a technical way to define the difference between the head and the tail?

>> No, it's just a common term people use, and the reason there isn't one is that it's so problem-dependent. For one particular question the right number of head words is maybe 20; for a different question maybe it's 40; for a totally different model on the same question maybe it's 10. Because of that variability, we just can't pin it down.

[41:09] Okay. I'll show you how to do this in just a moment, but just for kicks, I went into GPT-3.5 and typed "Students at the MIT Sloan School of Management are" and asked it to predict the next word. It turns out "invited" is the most likely next word, followed by "given", "expected", "required", and "able"; those are the top five words. The probabilities are around 3%, 2%, pretty small, but the remaining 50,000-odd words are even lower. So here the most likely word is "invited". Then I went in and said, okay, let me try again, now with "Students at the MIT Sloan School of Management are invited", and asked it to autocomplete the next thing. So now this is my new prompt, and it comes back with: "Students at the MIT Sloan School of Management are invited to submit their original white papers to the annual MIT..." something. Seems reasonable, doesn't seem bad, right?

[42:13] Now, let's mess it up a bit. I noticed that the word "masters" and the word "spending" were much lower probability than those top five words; I just mucked around until I found them. "Masters" is only 0.05%, and "spending" is 0.1%, so these are clearly in the tail; they're not the most likely. So I said: what's going to happen if I actually force it to use "masters", and then force it to use "spending"? This is what you get: "Students at the MIT Sloan School of Management are masters of chaos. They routinely blow past deadlines, fracture..." and then I couldn't take it anymore; I stopped it.
[42:58] Then, with just that single word swapped, I said "Students at the MIT Sloan School of Management are spending", the other unlikely word, and got "...the semester learning life skills", which so far looks promising, "...through knitting socks." I'm not making this stuff up; this is GPT-3.5. So yes, it will go off the rails, and you have to be super careful. And so the way we tame random sampling to make it work for us... yes?

[43:29] >> Do you think these sentences, the "masters of chaos", "blow past deadlines" ones, are something that was in the training set?

>> The thing is, it's basically doing some very rough, approximate pattern matching over all the training data it was trained on. So it doesn't mean, for example, that somewhere on the mit.edu collection of sites there was text saying that MIT Sloan students were doing all this crazy stuff. It's probably more that a whole bunch of college and university websites had some content like that, and maybe a bunch of Reddit people were posting stuff like that, and it's just doing some rough pattern matching. The thing you always have to remember with large language models is that what it's trying to give you is a response that is not implausible. There is no guarantee of correctness, no accuracy, nothing like that. It's giving you a probabilistically plausible response. That's it. Now, us being Sloan, we look at stuff like this and get offended, so we are imputing our values onto its generation, but it doesn't know and it doesn't care.

[44:43] In fact, when I typed in something like "list all the awards that Professor Ramakrishnan has won", it gave me an amazing list of awards; apparently I won this and I won that. None of it is true. To which a student said: not yet. So I made a note of that fine person's name. [laughter] So yeah, that's what's going on. Yeah?

[45:11] >> I get the sense, like, maybe there's...

>> Could you use the microphone, please?

>> I get the sense that maybe there's some sort of sliding window that's weighting later words more strongly than earlier words, because I feel like the context of "students at MIT" should have steered it in a certain direction even with the presence of the word "masters". So is there something like that happening?

>> No. Think about the training process: in the training process we gave it sentence fragments and asked it to predict the next word. Now, clearly, the more you know about the input that's coming in, and the longer the input, the more clues you have to figure out what the right next prediction is going to be.
[45:53] If I say "the capital of", you'll be like: I don't know, it's got to be a country, I guess, or a state, but I don't know anything more than that. But if I say "the capital of France is", there's a dramatic narrowing of the cone of uncertainty. That's basically what's going on. In fact, there's a very beautiful expression I've heard for what these LLMs do: subtractive sculpting. What I mean by that is, it's like you start with this big block of marble, and every word chips away at the marble, and by the time you're done it's pretty clear there's a David inside the marble. That's sort of what's going on.

[46:34] All right. So, to come back to this: what can we do? There are three ways in which you can tune random sampling to make it work for you. The idea behind all of them is that you have some probability distribution, and we are now going to manually focus on the head, kill everything else, and sample only from the head. Which immediately begs the question: how will you decide what the head is? That was sort of Alina's question from before. One way to do it is to say: I know we have 50,000 words in the vocabulary, I don't care, each time I'm only going to pick the top K words. K could be 10, 20, 30, 40, 50; it's very problem-dependent. I'm going to pick the top 20 words, ignore everything else, and only sample from the top 10 or the top 20. That's called top-K sampling.

[47:24] The way it works: say this is your whole distribution, and I've just stopped at "wet" instead of going all the way to word 50,000. You decide, let's say, that you want K to be two. You just grab the top two words, K = 2, and then you renormalize their probabilities so they add up to one: 0.6 and 0.2, renormalized, become 0.75 and 0.25. Now just imagine that this is the new softmax table you're sampling from, grab a word from it, and you're done. That's called top-K sampling, and it's very commonly used.
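A small sketch of top-K sampling on the toy table (illustrative numbers): keep only the K most likely words, renormalize so they sum to one, then sample from that reduced table.

```python
import numpy as np

def top_k_sample(words, probs, k, rng):
    top = np.argsort(probs)[::-1][:k]                  # indices of the k highest probabilities
    kept_probs = probs[top] / probs[top].sum()         # renormalize, e.g. 0.6, 0.2 -> 0.75, 0.25
    return rng.choice(words[top], p=kept_probs)

words = np.array(["stormy", "night", "foggy", "wet", "zebra"])
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
print(top_k_sample(words, probs, k=2, rng=np.random.default_rng(0)))   # only "stormy" or "night"
```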
Which brings us to something called top-p sampling, also called nucleus sampling. Instead of deciding up front on the number of words we're going to pick every time, we decide: we're just going to keep choosing words until the total probability of the words we've chosen is at least p. Sometimes that may be just two words; sometimes it may be 20 words. We don't care. And then we sample from that set.
[48:58] Same picture here. Let's say you go with p = 0.9. So 0.6 + 0.2 + 0.1 = 0.9; boom, we've hit 0.9. We stop, grab those three words, renormalize them, and sample from the result. This is actually even more effective, in my opinion, because it adapts; it doesn't hardcode the number of words you think is important. Was there a question? Yeah.
>> [49:25] What if 0.9 ended up partway through a word? Like, if "foggy" was 0.12, will it only take 0.1 from "foggy"?
>> [49:33] Yeah, so what it does is: you give it 0.9, and it keeps adding whole words until it just crosses that number.
>> [49:43] I was thinking, can't you just set a threshold on the words themselves? Don't pick any word below some probability. Like, what if one word was 0.89 and the other one is just 0.1, so you pick two words?
>> [50:00] Yeah, you can do that. In fact, you can always say "I want to pick the single most likely word"; you can do that. But if you say "I only want to consider words whose probabilities are at least some value," then basically you're just drawing a line at that value, and the problem is you don't know how many words have crept over your threshold. To go to your example, maybe you set 0.9 as the threshold and there was a word at 0.89 that just missed it because it didn't make the threshold, and you'll be like, "oh no, I should have made it 0.89." There's no right answer, unfortunately. But this is exactly the kind of thinking that brought us these ways of tuning things. [50:46] The foundation here is the realization that we cannot decide a priori what the right number of words is, so we have to find heuristics. In practice, people try all these methods. In fact, you can do both: you can set it up so you do top-p and top-K at the same time. Basically you're saying: grab words until you cross the probability p or you cross K, whichever comes first.
[51:15] Okay. So those are two methods people use heavily. The third method is called temperature.
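As a code sketch (again a toy illustration of my own, with numbers echoing the example above; note that the word which crosses p is kept whole, as in the question just now):

```python
# A minimal top-p (nucleus) sampling sketch: sort by probability, keep adding
# words until the cumulative probability first reaches p, renormalize, sample.
import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]               # most probable word first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest head whose mass reaches p
    head = order[:cutoff]                         # the word that crosses p is kept whole
    head_probs = probs[head] / probs[head].sum()  # renormalize the head
    return rng.choice(head, p=head_probs)

# With p = 0.9 and probabilities 0.6, 0.2, 0.1, ... the head is the first three
# words (0.6 + 0.2 + 0.1 = 0.9), matching the walkthrough above.
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
print(top_p_sample(probs, p=0.9))
```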
And the idea of temperature is this: in top-K and top-p, we have to decide on a number up front, K or p, and then we draw a line and keep only the words that pass it. Temperature is a softer way to do the same thing: a softer way to emphasize the head more than the tail. Let me switch to the iPad. All right.
[51:52] So remember the softmax. You have "aardvark" all the way to "zebra," with all these probabilities. Where did these probabilities come from? They came from a softmax. And what is a softmax? We had all these nodes, say 50,000 of them, in some output layer, and they were just numbers; call them a1 through a50,000. We ran them through a softmax function, which computed e raised to a1, e raised to a2, all the way to e raised to an, and divided each one by the sum of all of them. So the probability for word i is e^(ai) divided by (e^(a1) + e^(a2) + ... + e^(an)). That's how softmax works; I'm just refreshing your memory from a few weeks ago.
[53:03] What temperature does is introduce a new parameter, T, and divide every a by T before exponentiating. The probability for word i becomes e^(ai/T) divided by (e^(a1/T) + e^(a2/T) + ... + e^(an/T)).
[53:41] The effect of adding this little knob called temperature is very interesting. Assume for a second that T is a very, very small number, pretty close to zero. Since T sits in the denominator, all the numbers ai/T become really big in magnitude: if ai happens to be positive, ai/T becomes really big; if ai is negative, it becomes a really, really large negative number. In particular, the biggest of all the a's, which was already the biggest, now gets massive, which means its probability is going to dominate everything else, because you're raising e to a really big number. [54:59] So if T is close to zero, the word corresponding to the biggest a will have a probability of one, or close to one. And since all the probabilities have to add up to one, everything else goes to zero. So reducing the temperature toward zero means the probability distribution peaks at the most likely word, and everything else goes to zero.
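Here is that temperature-scaled softmax as a small sketch, with toy numbers of my own (not anything from the slides), just to see both ends of the knob:

```python
# Temperature-scaled softmax: p_i = exp(a_i / T) / sum_j exp(a_j / T).
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [3.0, 2.0, 1.0, 0.5]     # hypothetical output-layer numbers a1..a4
for T in (0.1, 1.0, 2.0, 10.0):
    print(T, np.round(softmax_with_temperature(scores, T), 3))
# T = 0.1  -> roughly [1.0, 0.0, 0.0, 0.0]: the biggest word takes everything
# T = 10.0 -> roughly [0.29, 0.26, 0.23, 0.22]: nearly flat, any word becomes likely
```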
So in practice, what that means is that if you apply a small temperature to something like this, the "stormiest" entry is going to get something like 0.999 and everything else gets wiped out: really small, even smaller, and so on. When T is exactly zero, this one is exactly one and everything else is zero. And when one entry is one and everything else is zero, sampling from it just picks that biggest entry, which means it becomes greedy decoding. So that is the value of having temperature as a knob. [56:12] Conversely, if you take the temperature T and make it bigger and bigger, rather than smaller and smaller, the distribution becomes flat, meaning all the words get roughly the same probability, so any one of them becomes equally likely. So: with T close to zero, the biggest word gets picked; with T above one, say 1.5 or 2, any word becomes likely and it becomes truly random. That is the effect of temperature. And this knob is something you can actually tune.
[56:53] All right. So this is platform.openai.com; it's called the OpenAI Playground. In this playground you can put in whatever sentences you want, choose the model, and it will actually show you the softmax output. Very handy. A few things I want to draw your attention to. First, you see temperature here: the default is one. If you make it zero, it becomes greedy decoding, but you can also make it more than one if you want, and it'll give you all kinds of crazy stuff, as you'll see in a second. [57:27] They don't have top-K (OpenAI doesn't support top-K here), but they do support top-P; you can set P in this field. I'll ignore the other settings; you can read the documentation to understand them. You can also ask it to show the probabilities, so I'm going to turn that on. And I'm going to tell it not to go nuts, just give me a few output tokens, let's say 30. [57:53] Now I'm going to enter the same sentence as before, "Students at the MIT Sloan School of Management are," I think that's what we had, and submit.
[58:14] Okay, this is what it's filling out. Now you click on a word and you get all the probabilities. Pretty cool, right? You can see "invited," "given," "expected," some of the same things we had. But wait, "aching"? What is that? That's very weird. So I'm going to check again to make sure that I used the same sentence as before.
[58:43] It's very brittle. "Students at the MIT Sloan School of Management are..." okay. Ah, I know what it is. Okay, so let's try that again.
[59:03] Okay: "invited," 3.18. That's what we had, right? It was 3.19 before; okay, close enough. So this is what we have. Now, if you want to force it to choose "invited" here, you just go in and set the temperature to zero. Temperature zero means it's always going to pick the best one: greedy decoding. So you can hit it again, and it had better give you "invited." See, it has given you "invited." That's how you manipulate it using temperature. [59:31] You can also manipulate top-P; you can do all these things. People actually use this very heavily for debugging, when they're playing with a bunch of data and a model for a particular use case: you play with it to get a sense for what kinds of probability distributions you're seeing, and then you fine-tune the settings using that knowledge. So yeah, check this out.
[59:54] Oh, and I said that if the temperature goes above one to a higher number, every word in the 50,000 becomes more or less equally likely, which means it's going to produce garbage. So let's actually see garbage production in action. [01:00:09] All right, let's just nuke this. I'm going to take the temperature and max it out; I'm going to set it to two, which means that literally anything is possible. Submit.
[01:00:25] Ladies and gentlemen, I present to you: a modern large language model. Isn't it shocking? When we work with these language models and see them doing smart things, we ascribe some level of interesting abilities and intelligence to them, and then you realize all I had to do was go in and change one parameter, and it's garbage. You can see the amount of garbage it's producing just from twiddling one parameter. So in production use cases, when you're building applications on top of these large language models, you have to be very, very careful with these parameters. Pay attention. All right, what did I have next?
[01:01:13] Okay. So that brings us to the end of the decoding section.
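(By the way, you can poke at these same knobs from code instead of the playground. Here is a rough sketch using the OpenAI Python client; the model name is just a placeholder, and while the parameters shown, temperature, top_p, logprobs, and top_logprobs, exist in the chat completions API at the time of writing, check the current documentation before relying on this.)

```python
# Rough sketch: querying temperature, top_p, and per-token probabilities
# through the OpenAI Python client rather than the playground UI.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",       # placeholder; use whatever model you're studying
    messages=[{"role": "user",
               "content": "Students at the MIT Sloan School of Management are"}],
    temperature=0,               # 0 -> greedy decoding; >1 -> increasingly random
    top_p=1.0,                   # nucleus-sampling threshold
    max_tokens=30,
    logprobs=True,
    top_logprobs=5,              # top alternatives for each generated token
)

print(resp.choices[0].message.content)
for tok in resp.choices[0].logprobs.content:
    print(tok.token, [(alt.token, round(alt.logprob, 2)) for alt in tok.top_logprobs])
```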
[01:01:22] Now I'm going to switch gears and talk about tokenization. So far, in everything we have done, including the homeworks, we've used the standard process for taking a bunch of text and vectorizing it, the STIE process: standardize, tokenize, index, and encode. And the standardization step, as I mentioned earlier, strips out punctuation, lowercases everything, sometimes removes stop words like "a" and "the," and also does things like stemming. [01:01:57] But if you've actually worked with something like GPT, you know that it hasn't stripped out punctuation (the punctuation is really good), and it uses case, uppercase and lowercase. Even better, you can make up a word as part of your question and it will use that word consistently in the output. So just for fun, I made up a word; I did this just yesterday or the day before. I said: here's a new word and its definition. The word is "relo." The definition: a student who understands deep learning backwards. Please use this word in a sentence. And here is the sentence it came up with (I was a little shocked): "During the advanced neural network seminar, it became evident that Jane was a true relo, effortlessly explaining even the most complex deep learning concepts in reverse order."
[01:02:50] So it clearly knows how to use anything you make up. It has the ability to compose things from scratch, as opposed to just looking things up. So where is that ability coming from? That's the question. And the answer is this very beautiful thing called byte pair encoding, which we'll look at next.
[01:03:10] So when we look at the standard process, its disadvantages are the things we've just discussed: we want to preserve punctuation, we want to preserve case, we want to handle new words, and so on. The modern models like BERT and GPT use different tokenization schemes; they don't do the STIE thing. The GPT family uses byte pair encoding, BPE, and BERT uses something called WordPiece. [01:03:40] In all of these encodings, the fundamental idea is: whatever language you're working with, why not start, first of all, with all the individual characters? Because if you can work with individual characters, you can clearly compose any word that comes up; "relo" is just its individual characters, a handful of tokens, if you're working at the character level. But working only with characters is not great, right?
Because that means you're giving the model no information about the world: it has to learn every word from scratch, what the word means, and so on. So it would be nice if we could give it whole words as well. But we don't want to give it infrequent words, because infrequent words, by definition, are not worth adding to the vocabulary; each one would just take up another embedding vector, and so on. Infrequent words we'll simply compose; we'll construct them on the fly, because we can always fall back to characters. [01:04:35] So we don't want to put every word in the vocabulary, only frequent words. But to give the model the ability to compose new words without always having to go all the way down to characters, we will also give it parts of words. These are called subwords.
[01:04:47] So the key idea is: let's come up with a way to build a vocabulary that contains the characters, the full words that are frequent enough to be worth adding, and the subwords, word fragments, that occur frequently enough to be worth adding. For example, take words like "standardize" and "normalize": "ize" is going to show up in a lot of places, so you don't want to store "standardize" and "normalize" and so on as separate whole units; you just want "ize," which you can attach to all kinds of words. That's the basic idea of all these tokenization schemes, and BPE is one particular way of figuring out how to construct such a vocabulary from a training corpus. And by the way, when I say characters, that includes not just uppercase and lowercase letters and digits; it also includes punctuation, so all of those become atomic units.
[01:05:40] All right. So the way BPE works is that we start with each character as a token; I'll talk about the rest of what's on the slide in just a moment, don't worry about it. Let's say your training corpus is just a single sentence: "The cat sat on the mat." Now, GPT does not actually do any lowercasing (uppercase "Th" is a different thing from lowercase "th"), but just for simplicity I'm going to standardize it here, so it becomes "the cat sat on the mat." Then I'm going to write it in this form, where I put a comma after every word and a little underscore to show the space between words; it'll become clear why in a second. My starting vocabulary is just all the individual letters in the training corpus.
[01:06:31] So the starting vocabulary is just all of these letters; that's it. That's the starting point. And now comes the key step: we merge the pair of tokens that most frequently occurs right next to each other. If two characters, two tokens, occur right next to each other a lot, let's just merge them; they seem to occur together a lot, so we may as well treat them as one thing. [01:06:54] Here, for example, I've listed the frequencies of adjacent token pairs. "t h" shows up next to each other here, and it also shows up here, so it appears twice. "h e," again, shows up here and here, so that also appears twice. "c a," on the other hand, only shows up here, nowhere else, so it appears once. "a t" shows up three times: in "mat," "sat," and "cat." And so on; you get the idea. So you just look at pairwise adjacent tokens and pick the most frequent pair, which in this case happens to be "a t." Then you take "a" and "t" and merge them into "at."
[01:07:40] When you do the merge, you add the new token you've just created to your vocabulary list, and you update the corpus to reflect the merge. So the corpus is still "the cat sat on the mat," but there is no separate "a" followed by "t" anywhere; there is just the combined "at" token. Are we good with this step so far? Take the most frequent pair and merge it.
[01:08:12] It's a way to compress the data. In fact, the algorithm came from someone trying to figure out a way to compress data. Think of it this way: suppose I ask you to compress a message I'm going to send you. You look at all the past messages you've had to deal with, and it turns out certain characters occur next to each other all the time. Say, for the sake of argument, "abc" shows up ridiculously often in the messages. Then you'd say: if it's always showing up together, why treat it as three things? Let me just call it one thing, "abc." You send a single token called "abc" every time you need "abc," not "a," "b," "c." That's the basic idea.
[01:08:56] So, coming back here, that's what we have. Now we redo the adjacent-pair counts on the updated corpus, and you can see "t h" shows up here and here, so it gets two; "h e" also shows up twice; everything else shows up once. And when several pairs show up with equal frequency, you just pick one of them, say at random.
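Here is the merge step we just did (the "a t" merge) as a tiny sketch of my own; it's toy code that ignores the underscore marker for spaces and just treats each word as a list of character tokens.

```python
# One BPE merge step: count adjacent token pairs across the corpus and fuse
# the most frequent pair into a single new token.
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:                       # each word is a list of tokens
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]     # ties broken arbitrarily (first seen)

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])   # fuse the two tokens into one
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list(w) for w in "the cat sat on the mat".split()]
pair = most_frequent_pair(corpus)   # ('a', 't'): it appears in cat, sat, and mat
corpus = merge_pair(corpus, pair)
print(corpus)                       # 'at' is now a single token wherever it occurred
```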
[01:09:19] So we pick "t h" and we merge that, which means we add "th" to our vocabulary, and once we do that we update the corpus. Now "th" is one fused token, alongside the previously fused "at"; that is the corpus after the second merge. Then we do the same thing again: we count the frequencies of adjacent tokens, and it turns out "th" followed by "e" shows up twice while everything else shows up once, so we merge "th" and "e" and, boom, we get "the." And now we have "the cat sat on the mat" with "the" as a single token.
[01:09:53] This process continues until we reach a predefined limit on the vocabulary size. As it turns out (I did some digging around on this), when they built GPT-2 and GPT-3 they set the vocabulary size to roughly 50,000, so it basically kept doing this until it hit a limit of 50,000 tokens and then stopped. GPT-4, on the other hand, goes all the way to a vocabulary size of about 100,000.
[01:10:23] Okay, so that is BPE in action. And once you've finished all this, you have your vocabulary and the list of merges you made. When a new piece of text comes in, the tokenizer applies the merges in the exact same order. Remember, here we merged "a" and "t" to get "at," then "t" and "h" became "th," and so on. So if the new text that comes in is "the rat," it first applies the "a t" merge to fuse "at," then it fuses "t h," then it fuses "th" and "e" to get "the." And the final list of tokens that goes into your model is: the token for "the," the token for the space, the token for "r," and the token for "at."
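To make the "replay the merges in order" idea concrete, here is a small self-contained sketch that hard-codes the three merges from this walkthrough (a real tokenizer stores tens of thousands of learned merges):

```python
# Tokenizing new text by replaying the learned BPE merges in training order.
# The merges below are the ones from the walkthrough: ('a','t'), ('t','h'), ('th','e').
merges = [("a", "t"), ("t", "h"), ("th", "e")]

def bpe_tokenize(word, merges):
    tokens = list(word)                    # start from individual characters
    for a, b in merges:                    # apply merges in the order they were learned
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(bpe_tokenize("rat", merges))   # -> ['r', 'at']
print(bpe_tokenize("the", merges))   # -> ['the']
```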
[01:11:06] So let's see this in action. OpenAI has its own tool, but I found this site to be really good. Let's tokenize "Hands-on deep learning." [01:11:26] You can see here: uppercase "H" is its own token (token number 39), "ands" is its own token, the dash is its own token, "on" is its own token, and then " deep" with the leading space is its own token and " learning" is its own token. [01:11:45] Note one thing: suppose you had typed just "deep deep learning." "deep" at the start has a different token than " deep" with a space in front of it. What they realized is that most words are going to show up after a space, much more often than not, so attaching the space to the beginning of the word saves you a lot of tokens and compute, because words will almost always arrive with a space before them. That's why the space is attached to the word itself. [01:12:21] And note that "deep" and "Deep" are different tokens, so it's clearly taking case into account. Then I put an exclamation mark here and, boom, that's its own token too.
[01:12:43] So ultimately, for a phrase like "The cat sat on the mat," you can see uppercase "The"... and let's look at one more thing: uppercase " The" with a space is token 383, lowercase " the" is 262, and that's distinct from "the" without any space, which is yet another token. So these are all the tokens.
[01:13:16] Now let's try something. Let's try "Jane." So "Jane" is one token, which is great. Let's see: "Rama." Ah, darn. My name wasn't worthy enough to be its own token. But strangely enough (I was very surprised by this), if I put "rama" in lowercase, it is its own token. I have no idea what they were scraping, or from which websites. And if I put "Jane" in here like this, now the "J" with the space becomes its own token and the rest becomes a different token.
[01:14:01] So tokenization is a very interesting thing, and it works in interesting ways, but that's the basic idea of what's going on under the hood. I'd encourage you to check out your own names and see how they get tokenized. All right, I'm done. Thanks, folks. I'll see you on Wednesday.
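(If you want to try this offline, the same token IDs are easy to reproduce in code. Here is a quick sketch assuming the tiktoken package is installed; the IDs depend on which encoding you load, and the GPT-2 / GPT-3 style encoding appears to match the 39 / 383 / 262 values above, but check for yourself.)

```python
# Offline peek at a real BPE tokenizer, assuming `pip install tiktoken` has been run.
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2 / GPT-3 style BPE vocabulary
for text in ["Hands-on deep learning", "deep", " deep", " The", " the", "Rama", "rama"]:
    print(repr(text), "->", enc.encode(text))
```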