[00:05] So, let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently. So, ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this. And today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have a question, please interrupt me and ask. Most likely other people in the room or on Zoom have the same questions. So, please ask.

[00:56] Great. So what matters when training LLMs? There are a few key components. One is the architecture. As you probably all know, LLMs are neural networks, and when you think about neural networks, you have to think about what architecture you're using. Another component, which is really important, is the training loss and the training algorithm: how you actually train these models. Then it's data: what do you train these models on? Then evaluation: how do you know whether you're actually making progress towards the goal of LLMs? And then the systems component: how do you actually make these models run on modern hardware? That's really important because these models are really large, so now more than ever, systems are a really important topic for LLMs.

[01:47] So those are the five components. You probably all know that LLMs are all based on transformers, or at least some version of transformers, and if you didn't, now you do. I'm actually not going to talk about the architecture today. One, because I gave a lecture on transformers a few weeks ago, and two, because you can find so much information online about transformers. There's much less information about the other four topics, so I really want to talk about those.

[02:17] Another thing to say is that most of academia actually focuses on architectures, training algorithms, and losses. As academics, and I've done that for a big part of my career, we simply like thinking that making new architectures and new models is what's important. But in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry actually focuses on. So that's also one of the reasons why I don't want to talk too much about the architecture: because really, the rest is super important.

[02:55] Great. So, overview of the lecture. I'll be talking about pretraining. Pretraining, you've probably heard that word; this is kind of the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet.
[03:10] And then there's post-training, which is a more recent paradigm: taking these large language models and making them essentially AI assistants. This is more of a recent trend, since ChatGPT. So, if you've ever heard of GPT-3 or GPT-2, that's really pretraining land. If you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pretraining, and specifically what the task of pretraining LLMs is and what loss people actually use.

[03:43] So, language modeling, a quick recap. Language models, at a high level, are simply models of a probability distribution over sequences of tokens or words. So it's basically some model of p(x1, ..., xL), where x1 is the first word and xL is the last one in the sequence or sentence. Very concretely, if you have a sentence like "the mouse ate the cheese", what the language model gives you is simply the probability of this sentence being uttered by a human, or being found online.

[04:17] If you have another sentence like "The the mouse ate cheese", there are grammatical mistakes here, so a model with some syntactic knowledge should know that this has a lower likelihood of appearing online. If you have another sentence like "the cheese ate the mouse", then the model should hopefully know that cheese doesn't usually eat mice. So there's some semantic knowledge, and this is less likely than the first sentence. That's basically, at a high level, what language models are.

[04:50] One term that you've probably been hearing a lot in the news is generative models. That's just something that can generate: models that can generate sentences, or some data. The reason we say language models are generative models is that once you have a model of a distribution, you can simply sample from it, and now you can generate data. So we can generate sentences using a language model.

[05:12] The type of models that people are all currently using are what we call autoregressive language models. And the key idea of autoregressive language models is that you take this distribution over words and decompose it into the distribution of the first word, multiplied by the likelihood of the second word given the first word, multiplied by p of the third word given the first two words, and so on. There's no approximation here: this is just the chain rule of probability, which hopefully you all know about. Really, no approximation; it's just one way of modeling a distribution. So, slightly more concisely, you can write it as a product of p's of the next word given everything that happened in the past, i.e. the context. This is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution; it's just one way. It has some benefits and some downsides.
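To make that factorization concrete, here is a minimal sketch in Python. The toy next_token_probs function is a hypothetical stand-in (it just returns a uniform distribution where a real LLM would run a transformer); the point is the chain-rule scoring and the token-by-token sampling loop it implies.

```python
import math
import random

VOCAB = ["<s>", "the", "mouse", "ate", "cheese", "."]

def next_token_probs(context):
    # Toy stand-in for a trained model: returns p(next token | context).
    # A real LLM would run a transformer on the context here; for the
    # sketch we just return a uniform distribution over the vocabulary.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sequence_log_prob(tokens):
    # Chain rule: log p(x1..xL) = sum_t log p(x_t | x_<t).
    # No approximation; we condition on the start token "<s>".
    total = 0.0
    for t in range(1, len(tokens)):
        total += math.log(next_token_probs(tokens[:t])[tokens[t]])
    return total

def sample(max_len=10):
    # Autoregressive sampling: a for loop that generates one token,
    # then conditions on it to generate the next, so longer sequences
    # take proportionally longer to generate.
    tokens = ["<s>"]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tokens.append(random.choices(list(probs), weights=probs.values())[0])
    return tokens

print(sequence_log_prob(["<s>", "the", "mouse", "ate", "cheese"]))
print(sample())
```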
[06:10] One downside of autoregressive language models is that when you actually sample from them, you basically have a for loop which generates the next word, conditions on that next word, and then generates another word. So if you want to generate a longer sentence, it takes more time to generate it. There are some downsides to this current paradigm, but that's what we currently have, so I'm going to talk about this one.

[06:36] Great. So, autoregressive language models. At a high level, the task of an autoregressive language model is simply predicting the next word, as I just said. If we have a sentence like "she likely prefers", one potential next word might be "dogs". And the way we do it is that we first tokenize. So you take these words or subwords, you tokenize them, and you give an ID to each token; here you have one, two, three. Then you pass it through this black box; as I already said, we're not going to talk about the architecture. You pass it through a model, and you get a probability distribution over the next word, or over the next token. Then you sample from this distribution, you get a new token ID, you detokenize, and that's how you basically sample from a language model.

[07:28] One thing which is important to note is that the last two steps are only needed during inference. When you do training, you just need to predict the most likely token; you compare it to the real token that came next, and then you change the weights of your model to increase the probability of generating that token.

[07:49] Great. So, autoregressive neural language models. To be slightly more specific, still without talking about the architecture, the first thing we do is-- sorry, yes?

[07:59] On the previous slide: predicting the probability of the next token, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have?

[08:09] Yes.

[08:10] How do you deal with it if you have more tokens? Adding more tokens to your [INAUDIBLE]?

[08:16] Yeah, so we're going to talk about tokenization later, so you'll get some sense of this. You can deal with adding new tokens-- I'm kind of exaggerating; there are methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point to note that the vocabulary size, the number of tokens that you have, is essentially the output dimension of your language model. So it's actually pretty large.

[08:46] So, autoregressive neural language models. The first thing you do is take every word, or every token, and embed it, so you get some vector representation for each of these tokens.
[08:58] You pass them through some neural network; as we said, it's a transformer. Then you get a representation for all the words in the context, so it's basically a representation of the entire sentence. You pass that through a linear layer, as you just said, to map it to the right dimension, so that the number of outputs is the number of tokens. You then pass it through a softmax, and you get a probability distribution over the next word given every word in the context.

[09:30] And the loss that you use: this is essentially a task of classifying the next token, so it's a very simple, kind of, machine learning task, and you use the cross-entropy loss. You look at the actual target that happened, which is the target distribution, a one-hot encoding; in this case it says the real word that happened is "cat", so that's a one-hot distribution over "cat". And here-- do you see my mouse? Oh, yeah-- this is the distribution that you generated. And you do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens.

[10:08] One thing to notice, as you all know, is that this is just equivalent to maximizing the log-likelihood of the text: you can rewrite the maximum of the probability under this autoregressive language modeling task, by adding a log and a minus sign, as the minimum of the loss, which is the cross-entropy loss. So minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions?

[10:43] OK, tokenizers. This is one thing that people usually don't talk that much about. Tokenizers are extremely important, so it's really important that you understand at least what they do at a high level.

[10:57] So why do we need tokenizers in the first place? First, they're more general than words. One simple thing you might think of is to take every word and say every word is a token in its own right. But then what happens if there's a typo in a word? You might not have any token associated with that misspelled word, and then you don't know how to pass it into the large language model. So what do you do? And also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words. That's the first thing.

[11:44] The second thing you might think of is tokenizing every sentence character by character: A is one token, B is another token. That would actually work, probably very well. The issue is that your sequence then becomes super long.
[11:58] And as you probably remember from the lecture on transformers, the complexity grows quadratically with the length of the sequence, so you really don't want a super long sequence. So tokenizers basically try to deal with those two problems and assign common subsequences their own tokens. And the way you should think about it is that, on average, a token is around three to four letters.

[12:27] There are many algorithms for tokenization. I'll just talk about one of them to give you a high-level picture: Byte Pair Encoding, which is one of the two most common tokenizers. The way you train a tokenizer is that first you start with a very large corpus of text. And here I'm really not talking about training the large language model yet; this is purely for the tokenization step. So this is my large corpus of text, with these five words. Then you associate with every character in this corpus a different token. Here, I just split up every character into a different token and color-coded all of those tokens. And then you go through your text, and every time you see a pair of tokens that is very common-- the most common pair of tokens-- you just merge them. Here you see the tokens "t" and "o" next to each other three times, so you just say that's a new token. Then you continue, you repeat that: now you have "tok", which happens three times; "toke" with an E, which happens twice; "token", which happens twice; and then "ex", which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would finish with a trained tokenizer.

[13:47] In reality, you do it on a much larger corpus of text. And this is a real tokenizer-- actually, I think this is GPT-3 or ChatGPT-- and here you see how it would actually separate these words. You basically see the same thing as in the previous example: "token" becomes its own token, so "tokenizer" actually gets split into two tokens, "token" and "-izer".

[14:12] So yeah, that's all about tokenizers. Any questions on that? Yeah.

[14:16] How do you deal with spaces, and how do you deal with [INAUDIBLE]?

[14:19] Yeah, so actually there's a step before tokenizers, which is what we call pre-tokenizers, and that's exactly what you just said. In theory, there's no reason to deal with spaces and punctuation separately; you could just say every space gets its own token, every punctuation mark gets its own token, and do all the merging. The problem is an efficiency question: training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing-- and this is very English-specific; pre-tokenizers are very English-specific--
[14:57] is to say that if there's a space, we're not going to look at the token that came before and the token that came after. So you're not merging across spaces. But this is just a computational optimization; you could theoretically deal with spaces the same way as any other character. And-- yeah?

[15:16] When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens that you merged?

[15:22] You actually keep the smaller tokens. I mean, in reality it doesn't matter much, because on a large corpus of text you will actually see everything, but you usually keep the small ones. And the reason you want to do that is that, as we said before, if there are grammatical mistakes or typos, you still want to be able to represent those words character by character. So, yeah. Yes?

[15:48] Are the tokens unique? Say, in this case, T-O-K-E-N: is there only one occurrence, or do you need to leave multiple occurrences so they could take on different meanings or something?

[16:02] Oh, I see what you're saying. No, every token has its own unique ID. This is a great question: for example, if you think about "bank", which could be a bank for money or the bank of a river, it will have the same token. But the transformer will learn, based on the words around it-- I'm being very handwavy here-- to associate it with a representation that is either more on the money side or more on the river side. But it's the transformer that does that, not the tokenizer. Yes?

[16:39] You mentioned that during tokenization you keep the smaller tokens you started with, right? Like if you start with a T, you keep the T, and then you build your tokenizer out to [INAUDIBLE] allow input tokens. So let's say maybe you didn't train on "token", but in your data you're trying to encode "token". How does the tokenizer know to encode it with "token" or to [INAUDIBLE]?

[16:59] Yeah, great question. When you tokenize-- so that's after training of the tokenizer, when you actually apply it-- you basically always choose the largest token that you can apply. So if you can use "token", you will never use "t"; you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks you can use to make these things faster.

[17:27] And honestly, I think a lot of people think we should just get away from tokenizers and tokenize character by character or byte by byte. As I said, right now there's this issue of sequence length, but maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll move away from tokenizers.
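Here is a minimal sketch of the BPE training loop described above, on a made-up five-word corpus (the lecture's actual example corpus isn't reproduced here). Note that it only records merges; the single-character tokens are never deleted, matching the answer above.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with one token per character, keeping words separate,
    # as a pre-tokenizer would.
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges, words

corpus = "token tokens tokenizer text texts"  # hypothetical toy corpus
merges, words = train_bpe(corpus, num_merges=6)
print(merges)  # ('t','o'), ('to','k'), ('tok','e'), ('toke','n'), ...
print(words)   # "tokenizer" ends up as 'token' plus leftover characters
```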
[17:50] So can you share with us the drawbacks? Why do people want to move away from tokenizers?

[17:58] Yeah. I think one good example is math. If you think about math, numbers right now are not tokenized digit by digit. For example, 327 might have its own token, which means that models, when they see numbers, don't see them the same way we do. And this is very annoying, because the reason we can generalize in math is that we can deal with every digit separately and then do composition: you know that adding numbers is the same as adding each digit separately, plus carrying over the units. Models can't do that, so you have to do special tokenization. And one of the big changes GPT-4 made is changing the way they tokenize code. For example, in Python code you often have these four spaces at the beginning of a line. Those were dealt with strangely before, and as a result, the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. OK, I'll move on right now, but we can come back to tokenizers later.

[19:05] Great. So we've talked about the task, the loss, and the tokenizer; let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference with perplexity is that we use something slightly more interpretable: you take the average per-token loss, and then you exponentiate it. The reason you exponentiate is that the loss has a log inside, and one, humans are actually pretty bad at thinking in log space; two, the log depends on its base, while once you exponentiate, everything is in vocabulary-size units. And the average per token is just so that your perplexity is independent of the length of your sequence. So perplexity is just 2 to the power of the average per-token loss of the sequence.

[20:00] Perplexity is between one and the vocabulary size of your tokenizer. One: if you predict every word perfectly, then every factor is basically one, so the best perplexity you can have is one. If you really have no idea, you predict every token with probability one over the vocabulary size, and if you do the simple math, you get a perplexity equal to the vocabulary size. So the intuition of perplexity is that it's basically the number of tokens your model is, kind of, hesitating between. If your model is perfect, it doesn't hesitate; it knows exactly the word. If it really has no idea, it hesitates between the entire vocabulary.

[20:43] And perplexity really improved. This is perplexity on a standard dataset between 2017 and 2023: it went from around 70 tokens to less than 10 tokens over those five or six years.
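As a quick illustration of the definition and the bounds just described, here is a minimal sketch using base 2, as in the lecture's formula (the 5-token sequences and 8-token vocabulary are made up):

```python
import math

def perplexity(token_log_probs, base=2):
    # Average per-token negative log-likelihood, then exponentiate, so the
    # result is independent of sequence length and of the log base.
    avg_loss = -sum(token_log_probs) / len(token_log_probs)
    return base ** avg_loss

vocab_size = 8
# Model with no idea: p = 1/vocab_size for every token -> perplexity = vocab_size.
clueless = [math.log(1 / vocab_size, 2)] * 5
# Perfect model: p = 1 for every token -> perplexity = 1.
perfect = [math.log(1.0, 2)] * 5

print(perplexity(clueless))  # 8.0
print(perplexity(perfect))   # 1.0
```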
[20:56] So that means the models previously hesitated between around 70 words every time they generated a word, and now they hesitate between fewer than 10 words. That's much better.

[21:06] Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data that people evaluate on. But it's still very important for the development of LLMs: when you actually train your own LLM, people will still really look at the perplexity.

[21:26] Another way, now more common in academia, of evaluating these LLMs is just taking all the classical NLP benchmarks-- I'll give you a few examples later-- and, kind of, aggregating everything. So you collect as many automatically evaluatable benchmarks as possible and evaluate across all of them. Two such benchmark suites are HELM, which is from Stanford, and the Hugging Face open leaderboard; those are probably the two most common ones right now.

[22:00] Just to give you an idea: in HELM, you have all of these types of tasks, which are mostly things that can be easily evaluated, like question answering. So think about many different question answering tasks. The benefit of question answering is that you usually know the real answer, so the way you evaluate these models-- and I'll give you a concrete example in one second-- is to look at how likely the language model is to generate the real answer compared to some other answers. That's essentially, at a high level, how you evaluate these models.

[22:33] To give you a specific example, MMLU is probably the most common academic benchmark for LLMs. It's just a collection of many questions and answers in all of these domains-- for example, college medicine, college physics, astronomy, these types of topics. And the questions are things like-- this one is from astronomy-- "What is true for a type-Ia supernova?" You give four potential answers and you ask the model which one is most likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is most likely. So there are different ways you can prompt the model, but at a high level, you know which answer is correct and that the three others are wrong. Yes?

[23:22] [For a model] generating unconstrained text as output: how do you evaluate it if it gives something that's semantically completely identical, but is not the exact tokens that you expect?

[23:35] Yeah, that's a great question. I'll talk more about that later. Here, in this case, we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then look at the likelihood of the model generating A, the likelihood of generating B, C, and D, and you look at which one is most likely; or you can ask the model which of A, B, C, or D is most likely.
[23:59] And you look at whether the most likely next token is A, B, C, or D. So you constrain the model to only answer those four things.

[24:09] You say you constrain-- do you constrain the prompt, or do you mean that out of the whole probability distribution it outputs, you're only comparing the outputs of, like-- you're only comparing the A token [INAUDIBLE]?

[24:20] Yeah. So in the second case I gave you, you would actually do both: you would prompt the model with "A, B, C, or D", plus you would constrain it to only look at those four tokens. In the first case, you don't even need to generate anything. Given that it's a language model, it gives a distribution over sentences, so you literally just look at the likelihood of generating the first choice, the likelihood of generating the second choice, and so on, and you check whether the most likely sentence is actually the real answer. You don't actually sample from it; you really just use p(x1, ..., xL). Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it's actually really important and really challenging. Yes?

[25:10] Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do your tokenization, some design choices. I was wondering if you could speak more to that.

[25:24] Yeah. Think about perplexity: I told you perplexity is between 1 and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens. Then the upper bound on the perplexity you can get is actually worse for Gemini than for ChatGPT. Does that make sense? It's actually a little more complicated than that, but that's one very simple example where you can see that the tokenizer actually matters.

[26:05] Great. OK, so, evaluation challenges. There are many; I'll just talk about two really briefly. One: as I told you, there are two ways of doing evaluation for something like MMLU-- actually, there are many more than two, but I gave you two examples. And it happens that for a long time, even though it was a very classical benchmark that everyone used, different companies and organizations were actually using different ways of evaluating MMLU. And as a result, you get completely different results. For example, Llama-65B, which was the first model from Meta in the Llama series, had 63.7 accuracy on HELM, but on this other benchmark it had something like 48.8. So really, the way that you evaluate matters-- and this is not even talking about prompting; this is really just the way that you evaluate the models. Prompting is another issue. So really, there are a lot of inconsistencies; it's not as easy as it looks. That's the first thing.

[27:08] Yeah, sorry.
[27:08] How can we make sure that all these models aren't trained on the benchmark?

[27:13] Second thing-- and this is a great question-- train-test contamination. This is something I would say is really important in academia. Given that the talk is mostly about training large language models: for companies, it's maybe not that important, because they know what they trained on. For us, we have no idea, so for us it's a real problem.

[27:37] There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in [? Tatsu's ?] lab have found: given that most of the datasets online are not randomized, and that language models just predict the next word, you can take the entire test set and check whether the model is more likely to generate all the examples in their original order or in a shuffled order. If it's more likely to generate them in the original order, given that there's no real order there, then it probably was in the training set. Does that make sense? That's one way; there are many others. Train-test contamination, again: not that important for development, really important for academic benchmarking.

[28:33] Great. There are many other challenges, but I'll move on for now.

[28:38] Data. So data is another really big topic. At a high level, people just say you basically train large language models on all of the internet. What does that even mean? People sometimes say "all of the clean internet", which is even less well defined. The internet is very dirty and really not representative of what we want in practice. If I downloaded a random website right now, you would be shocked at what's in there. It's definitely not your Wikipedia.

[29:08] So I'll go really briefly over what people do. I can answer some questions, but data on its own is a huge topic. Basically, first you download all of the internet. What that means is that you use web crawlers that go to every web page on the internet-- or every web page that's on Google-- and that's around 250 billion pages right now, around 1 petabyte of data. Common Crawl is one such web crawler. People don't usually write their own web crawlers; they use standard ones, and Common Crawl is one of them. Every month it basically adds all the new websites that were added to the internet and found by Google, and puts them into one big dataset. So on Common Crawl you have around 250 billion pages right now-- 1e6 gigabytes of data.

[30:07] Once you have this-- so this is a random web page, literally random, from Common Crawl. And what you see is that, one, it really doesn't look like the type of thing you would usually see-- so this is an HTML page.
[30:21] It's hard to see, but if you look through it, you will see some content. For example, here: "Test King World is your ultimate source for the system x high performance server", and then you have three dots-- the sentence isn't even finished. That's what random internet looks like. Of course, it's not that useful to train a large language model to generate things like this. So what are some of the steps that are needed?

[30:48] First, you extract the text from the HTML. That's what I just tried to do by looking at the correct tags. There are a lot of challenges in this. For example, extracting math is actually very complicated, but pretty important for training large language models. Or boilerplate: a lot of forums will have the same headers and footers, and you don't want to repeat all of that in your data.

[31:13] Then you filter undesirable content: not-safe-for-work material, harmful content, PII. Usually every company has a blacklist of websites they don't want to train their models on. That blacklist is very long, and you basically say: if it comes from there, we don't train on it. There are other ways of doing these things, like training a small model to classify what is PII and removing it. It's hard-- every step I'm going to show you here is a huge amount of work, but I'm just going to go through it quickly.

[31:48] So, filtering undesirable content. The next step is de-duplication. As I said, you might have things like headers and footers in forums that are always the same; you want to remove that. Another thing you might have is a lot of URLs that are different but actually show the same website. And you might have a lot of paragraphs from common books that are duplicated 1,000 or 10,000 times across the internet, so you have to de-duplicate. This is also very challenging, because you have to do it at scale.

[32:24] Once you've done the de-duplication, you do some heuristic filtering to try to remove low-quality documents. The way you do that is things like rule-based filtering. For example, if you see outlier tokens-- if the distribution of tokens on a website is very different from the usual distribution-- then it's probably an outlier. If the words on the website are super long, something strange is going on there. If the website has only three words, maybe it's not worth training on; if it has 10 million words, maybe something is also wrong with that page. So, a lot of rules like this. Yes?

[33:02] Why do we filter undesirable content out of our dataset instead of putting it in with, like, a supervised loss? Can we not just say: here's this hate-speech website, let's actively penalize the model for generating it?
[33:19] We'll do exactly that, but not at this step. That's where post-training comes in. In pretraining, the idea is just to say: I want to model, kind of, how humans speak, and I want to remove all these headers, footers, menus, and things like that. But it's a very good idea you just had, and that's exactly what we'll do later.

[33:45] Next step: model-based filtering. Once you've filtered a lot of data-- this is actually a very cute trick-- you take all of Wikipedia and you look at all the links that are referenced by Wikipedia pages. Because if something is referenced by Wikipedia, it's probably a high-quality website. And you train a classifier to predict whether a document comes from one of these Wikipedia references or from the random web, and you basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So you train a machine learning model-- usually a very simple one, because you need to run it at scale; just think about those 250 billion pages.

[34:34] Next, you try to classify your data into different domains: this is entertainment, this is books, this is code, these types of domains. And then you try to either up-weight or down-weight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning. That's something people usually say in a very hand-wavy way: training your model more on code helps reasoning. So you want to up-weight the code distribution, because that helps general language modeling skills. Books are usually another one that people up-weight; entertainment they usually down-weight. Things like this. People used to do it kind of heuristically; now there are entire pipelines, which we'll talk about, for doing these things slightly more automatically.

[35:33] And then, at the end of training-- after training on all of this data that we saw-- you usually train on very high-quality data while you decrease your learning rate. That basically means you're, kind of, overfitting your model on very high-quality data. Usually what you do there is Wikipedia: you basically overfit on Wikipedia, and on human data that was collected.

[36:04] There are other things, like continual pretraining to get longer contexts. I'm going to skip over all of that, but it's just to give you a sense of how hard this is: when people just say "I'm going to train on the internet", that's a lot of work. And really, we haven't figured it out yet. So collecting data well is a huge part of practical large language modeling; some might say it's actually the key. Yes?
[36:27] [INAUDIBLE] about data. A basic question: usually when you start with, like, a petabyte of data, after you go through all the steps, what's the typical amount of data you have remaining? And then how large a team does it typically take to go through all the data steps you talked about?

[36:43] Sorry, is your question how large the data is after you filter?

[36:47] Yeah, after you filter and go through all the steps. And how large a team do you need to go through all the filtration steps you mentioned? How slow is it? How many people would you need to be able to do this [INAUDIBLE]?

[37:02] OK, that's a great question. I'll answer the data part-- how large the dataset is-- at the end of this slide. For the number of people that work on it, that's a good question; I'm actually not quite sure, but I would say it's probably even bigger than the number of people who work on the tuning of the pretraining of the model. So the data side is bigger than the modeling side. I don't think I have a good sense, but in the Llama team, which has 70-ish people, I would say maybe 15 work on data. For all these things you don't need that many people, but you do need a lot of compute-- for data, you need a lot of CPUs. So, yeah. And I'll answer the data-size question at the end of this slide.

[37:56] So, as I just alluded to, we really haven't solved data at all for pretraining. There's a lot of research to be done. First: how do you process these things super efficiently? Second: how do you balance all of these different domains? Can you do synthetic data generation? That's actually a big one right now, because-- we'll talk about this later-- we don't have enough data on the internet. Can you use multimodal data instead of just text, and how does that improve even your text performance?

[38:28] There's a lot of secrecy, because this is really the key to most of pretraining large language models. For competitive reasons, these companies usually don't talk about how they do data collection. And there's also copyright liability: they definitely don't want to tell you they trained on books, even though they did, because otherwise they can get sued.

[38:50] Common academic benchmarks. This will, kind of, answer what you asked. It started-- those are the smaller ones; the names are not that important-- at around 150 billion tokens, which is around 800 gigabytes of data. And now it's around 15 trillion tokens, which is also the amount of data the best models right now are probably trained on. So 15 trillion tokens, which is, I guess, two orders of magnitude bigger than that: around 80e3 gigabytes. So that would be around 100 to 1,000 times filtering of the Common Crawl, if I'm not mistaken. So, yeah.
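As a rough sanity check on those sizes, here is the arithmetic, assuming roughly 5 bytes of raw text per token (a common rule of thumb, not a figure from the lecture):

```python
bytes_per_token = 5   # rough rule of thumb (assumption, varies by tokenizer)
small = 150e9         # ~150B tokens, the older academic datasets
big = 15e12           # ~15T tokens, today's largest

print(f"{small * bytes_per_token / 1e9:,.0f} GB")  # ~750 GB, i.e. the ~800 GB quoted
print(f"{big * bytes_per_token / 1e9:,.0f} GB")    # ~75,000 GB, i.e. the ~80e3 GB quoted
```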
[39:34] One very famous one is the Pile. This is an academic dataset, and we can just look at the distribution of data it contains: things like arXiv; PubMed Central, which is all the biology stuff; here Wikipedia; you see Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if you look here, this is 280B tokens, so in reality the big datasets are like 100 times bigger, and you cannot have that much GitHub and Wikipedia.

[40:09] In terms of closed-source models, just to give you an idea: Llama 2 was trained on 2 trillion tokens; Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much it was trained on-- the same as the biggest academic dataset, 15 trillion tokens. GPT-4 we don't really know, but it's probably in the same order of magnitude; actually, from leaks, it's probably around 13 trillion-- if the leaks are true.

[40:39] Great. So, scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training large language models.

[40:54] Great, scaling laws. The idea is that what people saw around 2020-- or at least suspected for a long time, but have been able to show empirically since 2020-- is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class. In this class we teach you about overfitting. Overfitting doesn't happen with large language models: larger models, better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists.

[41:33] So, the idea of scaling laws: given that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous OpenAI paper called Scaling Laws. On the x-axis you see compute-- how much compute you spent on training-- and on the y-axis the test loss. This is essentially, I mean, perplexity-- it's your validation loss, so it's the log of the perplexity. And if you put these two on a log scale, you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will decrease. Same thing with data, and same thing for parameters: if you increase the dataset size, your loss will decrease by a somewhat predictable amount; if you increase the number of parameters, the loss will decrease by a somewhat predictable amount. This is really amazing. Very surprising.
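To make "linear in log-log space" concrete, here is a minimal sketch of fitting and extrapolating such a power law; the compute/loss points are made up for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical measurements: training compute C in FLOPs, and test loss L.
compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])
loss = np.array([4.2, 3.6, 3.1, 2.65, 2.3])

# A power law L = a * C^(-b) is a straight line in log-log space:
# log L = log a - b * log C, so we fit a degree-1 polynomial there.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

def predict_loss(c):
    # Extrapolate the fitted line back out of log space.
    return 10 ** (intercept + slope * np.log10(c))

# What loss should 100x more compute buy us?
print(predict_loss(1e23))
```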
[42:49] I mean, it looks innocuous when you look at these types of plots, but it's crazy, because it means you can predict how well we're going to perform in two or three years, depending on how much compute we add-- assuming these trends hold. There's nothing theoretical about it. Yes?

[43:05] Two things. One: what is the loss that they're using here? Is this perplexity?

[43:09] So, I said perplexity is like 2 to the power of the loss; the loss here is the log of the perplexity.

[43:17] And the second thing: when you increase the number of parameters, or you increase the dataset size [INAUDIBLE] times, doesn't that just inherently increase your compute? Does all of this [INAUDIBLE] come down to just how [INAUDIBLE]?

[43:32] No, this is a great question. The compute here is actually a function of two things: the data and the parameters. What I'm showing here is that you can-- well, actually, we're going to talk about that in detail. But basically, if you increase the number of parameters, you should increase the amount of data that you have. So you actually don't go multiple times over the same dataset-- no one does multiple epochs, at least not yet, because we still have, kind of, enough data. So yeah, this is all the same trend: increase compute, decrease loss. Yes?

[44:06] Have we seen the numbers for the last two years? Is this still holding?

[44:10] It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. Yes?

[44:21] Is there no evidence that [INAUDIBLE] will ever plateau? In theory, we would expect it to plateau, [INAUDIBLE]?

[44:28] No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. I mean, it doesn't need to, because it's actually in log scale, so it's not as if it mathematically had to plateau; it could continue decreasing like this. Most people think it will probably plateau at some point; we don't know when.

[44:54] So, I'll talk more about scaling laws now. Why are scaling laws really cool? Imagine that-- you're very fortunate-- I gave you 10,000 GPUs for a month. What model would you train? How do you even go about answering that question? This is a hypothetical, but it's exactly what these companies are faced with.

[45:16] The old pipeline was basically: tune hyperparameters on the big models. Let's say I have 30 days; I train 30 models for one day each, pick the best one, and that's the final model I use in production. That means the model I actually use was only trained for one day.

[45:36] The new pipeline is that you first find a scaling recipe. So you find something that tells you, for example-- one common one is that if you increase the size of your model, you should decrease your learning rate.
[45:47] So you find a scaling recipe such that you know: if I increase the size of my model, here's what I should do with the hyperparameters. Then you tune your hyperparameters on smaller models of different sizes. Let's say that for three of my 30 days, I train many different small models, each of a different size, and do hyperparameter tuning on them. Then I fit a scaling law and try to extrapolate from these smaller models which one will be the best if I train it as a much larger model. And then I train the final huge model for 27 days instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model you're going to use in practice; do things on smaller models at different scales, and try to predict how well they will perform once you make them bigger.

[46:43] I'll give you a very concrete example right now: transformers versus LSTMs. Let's say you have these 10,000 GPUs and you're not sure which one you should be using: a transformer-based model or an LSTM-based model. What I will do is train transformers at different scales-- here you see the number of parameters on the x-axis, and the y-axis is my test loss-- and then train different LSTMs at different scales. Once I have these points, I see that they, kind of, fit a scaling law. I fit my scaling law, and then I can predict: if I had 10 times more compute, here's how well the LSTM would perform. It's actually slightly less linear for the LSTM, but you can still try to predict where you would end up. And clearly, from this plot, you see that transformers are better.

[47:30] One thing to note when you read these types of scaling laws is that there are two things that are important. One is your scaling rate, which is the slope of the scaling law. The other is your intercept: you could start worse but actually become better over time. It just happens that LSTMs are worse on both, but I could show you another example where you can predict that after a certain scale, you're better off using one type of model than the other. So that's why scaling laws are actually really useful. Any questions on that? Yeah.

[48:12] These are all, kind of-- how sensitive are these to small differences in the architecture? Like, one transformer architecture versus another transformer architecture. Do you have to fit your own curve and basically say, oh, scaling laws tell me this should be some logarithmic function, let me extrapolate that for my own specific architecture?

[48:35] Yeah. So usually, for example, if you're an academic-- at least that's pretty recent-- and you want to propose a new activation, that's exactly what you will do.
[48:45] You fit a scaling law for your new activation, show another scaling law for the standard one-- like, I don't know, GELU-- and you say that yours is better. In reality, once you start thinking in scaling-law terms, you realize that all the small, minor architecture differences we can make mostly just change the intercept a little bit. And that really doesn't matter much, because you can just train for 10 hours longer, or wait for the next generation of GPUs; these things are really secondary. Which is exactly why I was telling you at the start: people spend too much time on the architecture and the losses. In reality, those don't matter as much. Data, though: if you use good data, you will have much better scaling laws than if you use bad data. So that really matters.

[49:27] Another really cool thing you can do with scaling laws is ask yourself how to optimally allocate training resources. Should I train larger models? We saw that it's better when you train larger models, but we also saw that it's better when you use more data. So which one should I do? Should I train a smaller model on more data, or a larger model on less data?

[49:49] Chinchilla is the very famous paper that first showed this, and I want to give you a little bit of a sense of what these plots are. Here you see training loss on the y-axis, and on the x-axis the parameter count-- the size of the model. All of these curves are what we call iso-FLOP curves, which means that all the models on one curve have been trained with the same amount of compute. The way you do that is that you vary the number of tokens trained on and the size of the models, but in such a way that the total compute is constant. So all these curves with different colors correspond to different amounts of training compute.

[50:32] Then you take the best model on each of those curves. Once you have the best one per curve, you can plot how many FLOPs that curve used and how many parameters its best point actually had. You put that on a log-log scale, and you fit a scaling law again. So now I have something that tells me: if I want to train a model with 10 to the power 23 FLOPs, here is the number of parameters I should be using-- 100B. And you can do the same thing with FLOPs and tokens. So now, if you tell me exactly how much compute you have-- say, one month of compute-- I fit the scaling law and I tell you what size of model you should be training.

[51:21] Of course, that all looks beautiful. In reality there are a lot of small things-- like, should you count embedding parameters-- there's a lot of complexity. But if you do things well, these things actually do hold.
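Here is a minimal sketch of that procedure with made-up numbers: pretend we measured a few (model size, loss) points per compute budget, take the best size on each iso-FLOP curve, and fit the optimal parameter count as a power law in compute.

```python
import numpy as np

# Hypothetical iso-FLOP results: for each compute budget C (FLOPs), a few
# model sizes N (params) trained at that fixed budget, with measured loss.
iso_flop_results = {
    1e19: [(1e8, 3.90), (3e8, 3.70), (1e9, 3.75)],
    1e20: [(3e8, 3.45), (1e9, 3.30), (3e9, 3.38)],
    1e21: [(1e9, 3.05), (3e9, 2.92), (1e10, 3.00)],
}

budgets, best_n = [], []
for c, runs in sorted(iso_flop_results.items()):
    n, _ = min(runs, key=lambda r: r[1])  # best model size on this curve
    budgets.append(c)
    best_n.append(n)

# Fit N_opt(C) as a power law: a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(budgets), np.log10(best_n), deg=1)

def optimal_params(c_flops):
    return 10 ** (intercept + slope * np.log10(c_flops))

print(f"{optimal_params(1e23):.1e}")  # extrapolated optimal model size
```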
[51:35] So the optimal ratio that the Chinchilla paper found is to use 20 tokens for every parameter that you train. [51:44] If you add one more parameter, you should train your model on 20 more tokens. [51:49] One caveat here is that this is optimal for training resources. [51:53] It's telling me: if I have 10 to the power 23 FLOPs — say, $5 million — to train the model that gets the lowest loss, what should I train? [52:09] In reality, these companies also need to think about inference. [52:12] If you have a smaller model, you will spend less over time. [52:17] So if you account for inference cost, there are other papers showing the ratio is more like 150 tokens per parameter, because you prefer a smaller model: over time you're going to spend less money on inference. [52:37] So 150 to 1 — that's around what the best models are trained at right now, at least the ones actually used in production. [52:49] Great. [52:51] Any questions on Chinchilla? [52:56] Oh sorry, yes. [52:58] In practice, how expensive is inference for these models relative to training? [53:03] Actually, very expensive. [53:05] I won't talk about inference because that would be another entire lecture. [53:09] But just think about ChatGPT, where they have — I don't know what it is now — something like 600 million people using it. [53:18] That's a lot. [53:23] So it's actually very expensive. [53:24] There is a lot of optimization you can do for inference, though. [53:27] That's an entire other lecture, so I'll skip it this time, but it's very interesting. [53:33] OK, continuing. [53:34] As I said, there are many things you can answer with scaling laws; I just tried to give you two examples. [53:42] What data do you use? [53:43] What data-mixing weights do you use — the mixtures we talked about before. [53:49] What architecture do you use; should you make your model wider or deeper? [53:54] Should you be paying for more GPUs, or collecting more data? [53:58] All of these are things you can try to answer with scaling laws.
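To make these token-per-parameter ratios concrete, here is an illustrative calculation using the C ≈ 6 × parameters × tokens approximation (the same formula used in the cost estimate below). The 1e23-FLOP budget is an arbitrary example, and the slide's own ISO-FLOP fit may give somewhat different numbers.

```python
import math

def allocation(compute_flops: float, tokens_per_param: float):
    """Solve C ~= 6 * params * tokens, with tokens = ratio * params."""
    params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return params, tokens_per_param * params

# Chinchilla's training-optimal ratio (~20) vs. the inference-aware
# ratio (~150) mentioned above, for a hypothetical 1e23-FLOP budget.
for ratio in (20, 150):
    p, t = allocation(1e23, ratio)
    print(f"ratio {ratio:>3}: {p/1e9:5.0f}B params, {t/1e12:4.1f}T tokens")
```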
[54:03] One thing I want to mention here is the bitter lesson. [54:05] If you've ever heard of Richard Sutton — it's a very famous blog post from 2019. [54:12] What he realized, which I think not enough people realize, and which I definitely did not realize at the time, [54:19] is that once you see these types of scaling laws, you know that the more compute you have, the better the models you will get. [54:26] So with scale, you get better models. [54:28] And you also know, by Moore's law or its variants, that you will always get more compute. [54:34] Then the only thing that matters is to have architectures that can leverage computation. [54:40] So what matters is basically systems and data, and much less the small architecture differences, like your activation functions and things like that. [54:49] I think that's one of the reasons why most of research focuses on things that matter less for industry. [54:56] And I was one of those researchers for a large part of my career. [55:02] So don't spend time overcomplicating. [55:04] Do the simple things, do them well, and scale them. [55:08] That's really what OpenAI taught us with ChatGPT and with all the GPTs before it. [55:15] OK, I want to give you some back-of-the-envelope computations. [55:18] I might be off by a factor or two, but I just want to give you a sense of how costly it is to train some of these models. [55:25] I'll use as an example Llama 3 405B, which is currently the best open-source model you can get. [55:31] It was trained on 15.6 trillion tokens and has 405 billion parameters. [55:37] Now that you know about optimal tokens per parameter: here that ratio is around 40. [55:43] So a little more than Chinchilla, but less than the inference-optimal ratio. [55:50] So they went for training optimality. [55:53] FLOPs for this model: [55:55] one simple way to approximate training FLOPs is 6 times the number of parameters times the number of tokens you train on. [56:03] If you do that simple calculation here, it's 3.8e25 FLOPs. [56:07] The reason this is important is that, if you follow the news a little, there's an executive order from Biden that basically says that once you go above 1e26 FLOPs, your model gets special scrutiny. [56:21] So they went more than 2x below that — [56:23] they really went right under the threshold to avoid the special scrutiny. [56:27] So 3.8e25; I might be off by a little, but it's definitely under 1e26. [56:36] Here P is the number of parameters and N is the data, the number of tokens; this is just an approximation. [56:48] OK, compute. [56:49] We know they trained on 16,000 H100s, and we know the throughput they achieved. [56:58] If you do the computation, it takes around 70 days, or 26 million GPU-hours. [57:05] At least, that's what my back-of-the-envelope computation says. [57:08] They actually reported using 30 million GPU-hours instead of 26 million, [57:13] so maybe they had some challenges — I don't really know. [57:18] But following the simple computation, it's around 70 days. [57:22] Cost. [57:24] This is hard to estimate, so I'm just going to use the rental price: [57:29] if I wanted to rent that many H100s for that many days, how much would I pay? [57:36] A lower bound on the renting cost of an H100 is around $2 per hour. [57:43] Multiply that by 26 million hours, and you get $52 million. [57:50] They probably pay less than that, but actually not much less, because the services that rent GPUs don't make that much margin. [58:00] So probably slightly less, but not that much less. [58:04] Now salary: say 50 employees at $500k per year. [58:10] That's probably the right ballpark: $25 million. [58:13] So all together, around $75 million for training this Llama model. [58:21] I'm probably off by $10 million or so, but that's the right ballpark.
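The same back-of-the-envelope as a script. The H100 peak throughput, the 40% utilization, and the $2-per-GPU-hour rate are rough assumptions on my part, so read the outputs as order-of-magnitude only.

```python
# Back-of-envelope for Llama 3 405B, following the 6 * P * N rule.
params = 405e9            # parameters
tokens = 15.6e12          # training tokens
flops = 6 * params * tokens   # ~3.8e25 training FLOPs

gpus = 16_000             # H100s
peak = 989e12             # assumed H100 bf16 peak, FLOP/s (dense)
mfu = 0.40                # assumed model FLOP utilization

seconds = flops / (gpus * peak * mfu)
gpu_hours = gpus * seconds / 3600
print(f"{flops:.2e} FLOPs, {seconds/86400:.0f} days, "
      f"{gpu_hours/1e6:.0f}M GPU-hours, ~${2 * gpu_hours / 1e6:.0f}M rent")
# Prints roughly: 3.79e+25 FLOPs, 69 days, 27M GPU-hours, ~$53M rent
```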
[58:27] Carbon emitted. [58:29] A lot of people will ask about this — cost is not the only thing that matters. [58:33] So I did the computation: it's around 4,000 tons of CO2 equivalent. [58:42] That is actually only about 2,000 return tickets from JFK to London. [58:47] So right now, the carbon emitted is — I mean, it's large, but it's not the dominant concern yet. [58:56] I think maybe at GPT-6 or GPT-7, once you multiply this by 100, it might become a real issue. [59:04] Right now it's still not, I think, an issue in the grand scheme of things. [59:09] For the next models, the way to think about it is that with every new generation, the number of FLOPs essentially multiplies by 10x — or at least that's what they aim for, if they have enough energy and can buy enough GPUs. [59:23] Great. [59:23] Any questions on this back-of-the-envelope math? [59:29] No? [59:30] OK. [59:31] So now that we've talked about pretraining, I also wanted to chat about systems, because we now know compute is really important, so there's the question of how you optimize that compute. [59:43] I'll leave that for the end because I'm not sure how much time we'll have. [59:46] I think it's important, but hopefully I can talk about it later; it's slightly different from what we've been discussing. [59:54] So I'll move on to post-training for now. [59:56] The task of post-training: the reason we need post-training is, as I told you before, to make AI assistants. [01:00:06] Language modeling is not really what you want from an AI assistant. [01:00:12] For example, take GPT-3, which is a pure language model, not an aligned one. [01:00:20] If you prompt it with "Explain the moon landing to a six-year-old," the completion you get is something like "Explain the theory of gravity to a six-year-old." [01:00:29] Because what it learned is that on the internet, one question is usually followed by a bullet point of other similar questions — you don't usually see a question and then its answer. [01:00:39] This is not what you want from an AI assistant. [01:00:42] So how do we do this alignment, this post-training that turns these models into assistants? [01:00:49] The goal of alignment is basically to get LLMs to follow the instructions given by users, and maybe the designers' desires. [01:01:02] As motivation: OpenAI doesn't want the model to say things that are very toxic. [01:01:09] So here you see on the left-hand side that when you ask a question, the model actually provides a real answer — not like the pure language model before. [01:01:17] And on the right-hand side, if you ask it to write a tweet describing how a certain part of the population is evil, it says that it cannot do that. [01:01:29] So that's the alignment. [01:01:32] The background here is that we basically know what data we want for training these models:
[01:01:42] it's just asking humans — here is a question, here is the answer that we want. [01:01:46] The thing is that this data is very expensive to collect, and it's hard to find online. [01:01:51] In contrast, pretraining data is not what you want, but there's a lot of it. [01:01:56] So the main idea is simply to take a large language model pretrained on all of the internet, and then fine-tune it: [01:02:03] you change the weights a little bit on the type of data that you actually want. [01:02:07] And hopefully, given that it was pretrained on all of the internet, it already knows how to speak English and knows standard language syntax, [01:02:18] so you can fine-tune it with very little data. [01:02:23] OK, SFT. [01:02:24] Supervised fine-tuning is exactly what I just said: [01:02:27] fine-tuning the large language model on desired answers collected from humans. [01:02:35] Why is it called supervised fine-tuning? [01:02:37] Because you do language modeling on the desired answers. [01:02:41] Language modeling — next-word prediction — is the fine-tuning part, [01:02:45] and doing it on desired answers given by humans is why we call it supervised. [01:02:51] So how do we collect this data? [01:02:52] I just said it: you ask humans to write down, for a given question, the answer they would want from the model. [01:03:00] Here is an example — let's read this one. [01:03:09] "Can you write a short introduction about the relevance of the term monopsony?" [01:03:13] And the answer: "Monopsony refers to a market structure..." and so on — and that answer was written by a human. [01:03:19] This is from Open Assistant, which was an effort to collect this kind of data online from humans. [01:03:27] This type of supervised fine-tuning, or alignment, is really the key to ChatGPT. [01:03:33] It's what made the big jump from GPT-3, which was mostly known by AI researchers, to ChatGPT, which became known by basically everyone. [01:03:46] The problem with human data is that it's very slow and expensive to collect. [01:03:56] So one simple idea is to use LLMs to scale data collection. [01:04:03] That's exactly what we did with Alpaca one year ago. [01:04:06] We took a data set of 175 human-written question-answer pairs, [01:04:15] and we asked the best model at the time, text-davinci-003, to generate many more of them: [01:04:22] here is what humans wrote; now write similar questions and similar answers. [01:04:27] We collected 52,000 LLM-generated question-answer pairs. [01:04:32] Then we simply took LLaMA 7B, the best pretrained model at the time, [01:04:36] and fine-tuned it with supervised fine-tuning, as I told you — a minimal sketch of that step is shown below.
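A minimal sketch of the SFT step in PyTorch, assuming a Hugging Face-style causal LM. The model name and the single data pair are placeholders, and real pipelines typically mask the prompt tokens out of the loss; the point is only that the loss is plain language modeling on the desired answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for any causal LM; the lecture's setting
# would use a LLaMA-class base model instead.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # post-training LR

# One hypothetical (instruction, desired answer) pair.
pairs = [("What does algorithm mean?",
          "An algorithm is a step-by-step set of instructions...")]

for question, answer in pairs:
    batch = tok(question + "\n" + answer, return_tensors="pt")
    # labels = input_ids gives the standard next-token loss; real
    # pipelines usually mask the prompt tokens out of this loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```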
[01:04:39] And that's how we got the Alpaca 7B model. [01:04:44] This is the type of data that we collected — [01:04:47] things like: "What does algorithm mean?" [01:04:49] "An algorithm is a step-by-step set of instructions you use to solve a problem or achieve a goal," and so on. [01:04:56] The data is actually pretty good, given that it was generated by an LLM from essentially two generations ago. [01:05:04] That started, at least for us, as an academic replication of ChatGPT. [01:05:10] Now there's a big field of synthetic data generation: how to use LLMs to make the development of LLMs faster, [01:05:21] basically by decreasing the number of human hours you need. [01:05:26] Quantity of data. [01:05:28] So we talked about what type of data and how to collect it. [01:05:31] One surprising thing with SFT is that you don't need that much data. [01:05:36] What this paper, called LIMA, showed is that if you scale the amount of supervised fine-tuning data from 2,000 to 32,000 examples, it really doesn't help much. [01:05:47] So here, scaling laws definitely don't apply. [01:05:49] The intuition is that all you learn in SFT is how to format your desired answers. [01:05:58] Another way of saying it: your pretrained model essentially models the distribution of every user on the internet — one who might write bullet points, another who might actually answer a question with an answer. [01:06:10] So all you tell your model is: you should put more weight on this type of user than on that one. [01:06:17] You're not actually teaching it anything new through SFT; [01:06:23] all you do is tell the model to optimize for one type of user that it already saw in the pretraining data. [01:06:30] The knowledge is already in the pretrained LLM; you just specialize it to one type of user. [01:06:37] Great. [01:06:38] Any questions on SFT? [01:06:40] Yes. [01:06:41] So I know there's a big issue with synthetic data where, if you keep generating data from the same distribution, [01:06:49] eventually you're not learning a new distribution, you're essentially just bootstrapping from it. [01:06:55] Surely you can't scale that forever, right? [01:06:57] You can't keep generating from the same distribution and hope to learn something new. [01:07:02] So — it's an active area of research, but do you have any thoughts on how people are thinking about this and on better ways to bootstrap? [01:07:11] Or do we give up on the idea, note that the chart shows you don't need that many examples, and just get humans to generate 2,000 really good ones? [01:07:19] Yeah, that's a very good question. [01:07:21] On the data point: I'm saying it's not that important for SFT, but there's another stage, which we'll talk about right after, where data does matter.
[01:07:29] My intuition, based on not that many empirical results, [01:07:33] is that if you use purely LLM-generated text, and you do that for three or four generations of LLMs, [01:07:43] I agree that you probably won't improve much. [01:07:45] But what matters to me is: how do you use humans in the loop with LLMs? [01:07:49] Not purely LLMs, not purely humans. [01:07:53] Maybe what you can do is have the model generate some new text and have humans just make a few edits. [01:07:59] Edits are much faster than writing the entire text. [01:08:01] With that type of collaboration, from an information-theoretic point of view you still get additional information, but you're much faster than if you only used humans. [01:08:11] I think that as a field we'll probably move towards these kinds of approaches — finding the examples that matter and asking humans about those. [01:08:20] It's kind of active learning: asking humans exactly when you need their input. [01:08:28] Yes. [01:08:28] Do we train with the same loss function and the same general training algorithm for the supervised fine-tuning as we do for pretraining? [01:08:36] Because for the examples you showed, the important thing about the good examples is that they're super factually accurate; there are these more complex things and it's still just [INAUDIBLE]. [01:08:48] Same loss. [01:08:49] Yeah, maybe I didn't emphasize this enough: it's just language modeling. [01:08:53] You fine-tune the LLM with the language-modeling loss on the desired answers. [01:08:56] So it's literally the same loss. [01:08:59] It will be different in two seconds, but the first step, SFT, is literally the same loss, where you just say: OK, I want to specialize on this type of data. [01:09:08] There's even a question of what counts as pretraining versus post-training, because in reality it's just different data that you use. [01:09:13] The reason we usually call it post-training is that the way we collect that data is very different. [01:09:18] Great, great questions. [01:09:20] Yes. [01:09:22] Maybe it's the same question, but why would these 2,000 examples have such an outsized influence on fine-tuning? [01:09:30] That's another reason we call it post-training: we use a different set of hyperparameters. [01:09:35] I told you that at the end of pretraining you essentially end up with a learning rate of 0. [01:09:40] Here, you increase your learning rate again, to something like 1e-5. [01:09:44] So the weight you effectively give to this data is different. [01:09:52] OK. [01:09:54] The second part of post-training is what we call reinforcement learning from human feedback, or RLHF. [01:10:02] Some of you might have heard of it. [01:10:05] The idea is that SFT has a problem: it's behavioral cloning, which means you just try to clone what the humans would say.
[01:10:14] And that has many issues. [01:10:16] One of them is that you're bound by human abilities. [01:10:19] Humans won't generate the thing they actually think is the best possible output. [01:10:28] If you ask me to write a book — I can definitely enjoy a book, [01:10:32] and I can probably say one book is better than another, [01:10:34] but I'm definitely not going to be able to write the book I would want to read. [01:10:37] So you're bound by the human ability to generate things, even though humans may be much better at distinguishing between things. [01:10:43] That's issue number one. [01:10:44] Issue number two, which I find pretty interesting: [01:10:49] you've probably heard the word hallucination — LLMs generating false information. [01:10:55] People have hypothesized that hallucination can come from supervised fine-tuning, even if you do supervised fine-tuning on data that is correct. [01:11:06] The reason is this: I told you that SFT uses very little data, and that the model doesn't learn anything new from it. [01:11:17] So what if the human gives an answer that the model didn't know was true? [01:11:23] From the model's perspective, the human is basically telling it: generate this thing that sounds plausible, even though you have no idea whether it's true. [01:11:34] To give a very concrete example, going back to the monopsony question: [01:11:41] imagine the human's answer cites a reference, some book. [01:11:46] That book might exist; it might be a correct reference. [01:11:47] But what if the LLM never saw that reference during pretraining? [01:11:52] Then it doesn't know that it's a correct reference. [01:11:54] So what you're really teaching the model is to make up plausible-sounding references, rather than to give the real references it saw during pretraining. [01:12:05] So hallucination might be caused by SFT. [01:12:12] That's problem number two. [01:12:14] Does that all make sense? [01:12:15] Great. [01:12:16] Problem number three: price. [01:12:18] Generating the ideal answers is very expensive. [01:12:21] That comes back to your earlier question: having humans write the entire answer is pretty expensive. [01:12:28] So that's where RLHF comes in. [01:12:30] The idea is that instead of cloning the behavior of humans, we're going to maximize human preference. [01:12:37] The pipeline is: for every instruction, you ask a model to generate two answers — [01:12:45] and you usually use a pretty good model here; not a base LLM, but a model that already went through SFT, so it gives pretty good answers. [01:12:56] Then you ask labelers which of the two answers is better — [01:13:01] select the preferred one.
[01:13:02] And then, with different types of algorithms — we're going to talk about the algorithms right now — you fine-tune the model to generate more of the green thing than the red thing. [01:13:10] More of the good stuff. [01:13:12] So the question is how, and we'll talk about that now. [01:13:17] There are two approaches we'll cover, the two mainly used in the community. [01:13:23] The first is simply to use reinforcement learning. [01:13:26] Hopefully you all know what reinforcement learning is by now. [01:13:30] When you use reinforcement learning, one important question is: what is the reward we're optimizing? [01:13:36] In this case, there are really two options. [01:13:39] First, you could compare the output generated by some baseline with the output generated by your model, [01:13:46] ask the human which one is better, and use that as the reward: [01:13:51] if I'm better than the baseline, it's a +1; if not, a -1. [01:13:55] So a binary reward. [01:13:57] The problem with a binary reward is that it's very sparse, and you don't get much information out of it. [01:14:01] Maybe your answer was slightly better, maybe it was way better — you can't tell from this how much better it was. [01:14:10] Option two is to train what we call a reward model, which is simply a classifier. [01:14:16] You use machine learning to classify how much better one output is than the other, from the perspective of the human. [01:14:26] This is a little bit meta, but what you do is take a reward model — which is also just a large model, a classifier — [01:14:41] and give it the input and one of the two outputs. [01:14:45] You exponentiate its reward and divide by the sum of the exponentiated rewards of the two outputs — that's the softmax loss you all know about. [01:15:01] You train this reward model to be able to classify how much better one output is than another. [01:15:13] A slightly less convoluted way of saying it: your reward model outputs a reward that is used as the logit of your softmax. [01:15:22] So a high logit means this output is very likely the better one. [01:15:32] That's what we call the Bradley-Terry model. [01:15:34] Yes. [01:15:35] Will this reward model [INAUDIBLE] the entire output, or is it going to [INAUDIBLE]? [01:15:40] Yeah, this takes the entire output at once. [01:15:46] It takes all of the input and all of the output, and it gives one number.
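In code, the Bradley-Terry objective is just a softmax over the two rewards, which reduces to a logistic loss on their difference. A sketch with placeholder tensors:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor,
                       r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar rewards the reward model assigned
    to the human-preferred and dispreferred outputs for the same prompt.
    Maximizing exp(r_c) / (exp(r_c) + exp(r_r)) is the same as
    minimizing -log(sigmoid(r_c - r_r))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: rewards for a batch of 3 preference pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 2.0]),
                          torch.tensor([0.7, 0.9, -0.5]))
```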
[01:15:51] Yes — so with the reward model, where would the human be? [01:16:01] Oh, I see — sorry, maybe I wasn't clear. [01:16:02] You train this reward model to fit the green-and-red preferences from humans. [01:16:09] So you train a classifier to say whether the humans preferred red or green. [01:16:15] But instead of using the binary label, which is what the human gives you, you use the logits of the softmax. [01:16:23] And the thing with logits is that they're continuous. [01:16:26] So if your reward model gives a high logit, then, in some sense, the human highly preferred this answer over the other one. [01:16:36] Great. [01:16:38] So, as I just said, continuous information is better. [01:16:41] That's what people use in practice — or at least used to use; I'll tell you about the other algorithm later. [01:16:48] In the end, you just apply the reinforcement learning you know about. [01:16:53] Now we have a reward. [01:16:55] What you sample is the generation from your large language model, and you add a regularization term. [01:17:02] The reason for the regularization term is to avoid what we call over-optimization: [01:17:06] the reward model might not perfectly model human preferences, [01:17:12] so you don't want to maximize it all the way to infinity. [01:17:17] And you do this with PPO, a common reinforcement learning algorithm. [01:17:24] One thing to note here, because it will be important later: [01:17:32] the large language model is now a policy for your reinforcement learning; it's not doing maximum likelihood anymore, [01:17:41] which means you're not modeling a distribution anymore. [01:17:43] The reason this is important is that models that went through this type of PPO don't give you meaningful likelihoods of text. [01:17:52] What you optimized them for is generating the most preferred thing, not modeling all the answers that humans might give. [01:18:02] Another way of saying it: there's nothing here that incentivizes the model to produce more than a single possible generation. [01:18:11] Nothing here says it's good to keep a distribution with some entropy. [01:18:18] If you didn't follow that, it's not that important, but it's good to know. [01:18:22] Great. [01:18:23] So PPO is exactly what ChatGPT did originally. [01:18:27] Here, from their blog post: step one, do supervised fine-tuning, which you all now know about. [01:18:34] Step two, train a reward model on human preferences. [01:18:38] Step three, do PPO for multiple steps — that's this blue arrow: [01:18:43] you train the model once with PPO, collect new data, and continue. [01:18:47] That's exactly what ChatGPT did, and that was the big breakthrough between GPT-3 and ChatGPT.
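Written out, the objective is: maximize E[r(x, y)] − β · KL(π_θ ‖ π_ref). Below is a hedged sketch of how the KL-penalized reward is often computed per sequence; PPO's clipping, value function, and rollout machinery are omitted, and β = 0.1 is an arbitrary choice.

```python
import torch

def regularized_reward(reward: torch.Tensor,
                       logprob_policy: torch.Tensor,
                       logprob_ref: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """reward: r(x, y) from the reward model, one scalar per sequence.
    logprob_*: summed token log-probs of y under the current policy
    and under the frozen SFT reference model. The penalty keeps the
    policy from drifting too far from the reference, which is the
    guard against over-optimizing an imperfect reward model."""
    kl_estimate = logprob_policy - logprob_ref  # per-sample KL estimate
    return reward - beta * kl_estimate
```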
[01:18:55] One thing to note is that PPO has many challenges. [01:18:58] Reinforcement learning is super nice theoretically. [01:19:02] In practice, anyone who has ever worked with reinforcement learning knows it's a mess. [01:19:06] There are rollouts, outer loops, clipping — so many complications. [01:19:11] So it's messy. [01:19:13] This is the idealized PPO objective used in the LLM setting, and that's already much more complicated than the expectation we saw before. [01:19:19] And in practice it's even more complicated. [01:19:21] We had to write one implementation of it, and I'm not going to go through it, [01:19:25] but there is so much you have to think about when you implement this type of PPO algorithm: [01:19:31] clipping everywhere, a lot of complexity, and things are not well documented. [01:19:37] All this to say that a new method was proposed, also from Stanford, one year ago, called DPO, which is essentially a simplification of PPO. [01:19:49] The idea is that instead of using reinforcement learning, you can just maximize the probability of generating the stuff that you like and minimize the probability of the stuff that you don't like. [01:20:02] Thinking of the human preference as red and green: maximize green, minimize red. [01:20:08] The loss is this one: this term is simply the log-likelihood of the model generating the answer the human preferred, given the input, [01:20:23] and what you do is maximize the likelihood of generating the things you like while minimizing the likelihood of the things you don't like. [01:20:33] The rest of the terms are not too important; it's really not that complicated to understand. [01:20:39] At a high level, it's just maximizing the things you like and minimizing the rest. [01:20:45] One thing to note is that the remaining terms are chosen such that the global minimum of PPO and the global minimum of DPO are, under some assumptions, essentially equivalent. [01:21:01] So this is the right thing to do mathematically; I won't go through the derivation, but it is the right thing to do. [01:21:08] It's pretty different from PPO: with PPO, you had to collect human preferences, then train a reward model with maximum likelihood, then do reinforcement learning. [01:21:17] Now all you do is maximum likelihood. [01:21:19] Much simpler.
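The DPO loss itself is short enough to sketch. Each argument is a summed token log-probability under the trained policy or the frozen reference model; beta is a hyperparameter, and 0.1 here is just an illustrative default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Summed token log-probs of the chosen / rejected answers under
    the policy being trained and under the frozen reference model.
    The loss pushes the policy's margin on the chosen answer above
    the reference model's margin -- maximize green, minimize red."""
    margin = (logp_chosen - ref_logp_chosen) \
           - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Note that this is plain supervised optimization: no sampling, no reward model, no rollouts, which is exactly the simplification over PPO described above.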
[01:21:21] Yes — so it seems like this is, A, much simpler, and B, what you would intuitively do [INAUDIBLE]. [01:21:27] Why did they start with the reward model? [01:21:29] What led them to do that? [01:21:31] I think that's a great question; I don't really know. [01:21:34] What I can tell you is that the people who initially built ChatGPT are the ones who actually wrote PPO. [01:21:47] There were a lot of reinforcement learning people there, and I think for them it was the intuitive approach. [01:21:54] There are also some additional potential benefits. [01:22:00] For example, the cool thing with a reward model and reinforcement learning is that you can use unlabeled data: [01:22:08] with DPO you can only use the labeled preference data, [01:22:12] while with PPO, you first train your reward model, and then the reward model can label unlabeled data for you. [01:22:21] So there could be potential improvements from that; in practice, it turns out there are none. [01:22:29] And I think a lot of people on that team were simply reinforcement learning experts, including the main author of PPO, John Schulman. [01:22:39] So DPO is much simpler than PPO, and it performs basically as well. [01:22:43] Now it's the standard thing that people use, at least in the open-source community, and I believe it's actually the standard in industry too. [01:22:51] So that's DPO. [01:22:53] Gains: these plots are from the papers on the left. [01:22:57] This one is on a summarization task. [01:22:59] All I want to show you is that the pretrained models were OK and improved with scale; [01:23:05] if you do supervised fine-tuning, you improve them a little more; [01:23:08] and if you do PPO, or some form of RLHF, you get performance that is oftentimes, depending on the benchmark, even better than humans — this line is the human reference summaries. [01:23:21] Same thing here, from our AlpacaFarm paper. [01:23:25] The evaluation metric here is not too important, [01:23:27] but you see the pretrained model, then the jump to SFT, then the jump to PPO and DPO — and PPO and DPO have essentially the same performance. [01:23:36] So RLHF helps — that's the conclusion — and DPO is simple. [01:23:42] Data. [01:23:43] How do you collect this type of preference data? [01:23:46] The first idea is to just use humans, as we already discussed. [01:23:51] The guidelines for what humans should be labeling are very complicated, and it's really not easy. [01:23:55] If you ever do some of this labeling yourself, you'll see that it's extremely hard. [01:24:01] If I zoom in on this example: the question is "tell me about self-driving cars." [01:24:07] You read both answers — "self-driving cars are vehicles that are capable of detecting their surroundings," blah blah blah; "self-driving cars are cars that are equipped with sensors... to navigate without the need for a driver." [01:24:16] Both seem OK. [01:24:18] Which one is better? [01:24:19] It's actually hard to say at a glance. [01:24:21] And as a result, the problem with human labels is that you end up optimizing for a lot of surface-level features.
[01:24:28] For example, the second answer is longer. [01:24:30] I can guarantee you that most humans will choose the second one, even though maybe the first one is better — I don't know, I haven't read it carefully. [01:24:38] So, challenges with humans. [01:24:39] First, they're slow and expensive. [01:24:42] Second, as I just mentioned, it's hard for them to focus on the things that matter, like correctness; [01:24:47] people usually look at things that matter less, like form and length. [01:24:53] What I show here is that the more RLHF you do, the longer the outputs of the models become. [01:25:01] So if you've ever been annoyed by ChatGPT answering with super long responses, that is because of RLHF. [01:25:08] Annotator distribution shift: the distribution of annotators you use matters a lot, and you have to ask which humans we even want these models to represent. [01:25:20] Another question is crowdsourcing ethics. [01:25:22] A lot of the people who do this labeling are not paid well, and they have to go through a lot of toxic data, precisely because you want the model to avoid saying toxic things. [01:25:36] So crowdsourcing ethics, too. [01:25:40] So, many challenges with human data. [01:25:43] What we did, also last year, is the same move as Alpaca: there are challenges with humans, so maybe we can just replace them with LLMs. [01:25:58] You replace the human preferences with LLM preferences. [01:26:02] In this figure, the x-axis is the price we paid for collecting human data: [01:26:09] around $300 per 1,000 examples, and that was with Mechanical Turk workers, who are usually cheaper than some of the other annotation companies you could go through. [01:26:20] The y-axis is the agreement with the mode of other humans. [01:26:27] And what you see is that, as I told you before, labeling is really complicated: [01:26:30] humans agree with the mode of other humans only around 66% of the time on this binary task. [01:26:36] And it's not that these particular humans are bad — the five main authors of this paper tried to label the data ourselves, [01:26:43] and we only got around 67 or 68% accuracy, even after discussing for three hours how we should be labeling. [01:26:51] It's genuinely complicated; it's not an easy task. [01:26:54] And here I show many different models, and you see that models are much cheaper, and they can actually get higher agreement with the mode of humans than individual humans do. [01:27:04] The reason is that humans have a lot of variance, while models have essentially none: [01:27:08] models may be a little more biased, but they have less variance. [01:27:11] So it works surprisingly well.
[01:27:13] And now it's, kind of, the standard in the open-source community. [01:27:16] I think even in industry, a lot of people use both humans and LLMs to improve the collection of RLHF data. [01:27:24] This is the paper from last year, but honestly, by now the LLMs would be around this agreement, at something like 50x lower cost than humans — and with better agreement with the mode of humans than humans themselves. [01:27:39] OK. [01:27:39] That gets us to the evaluation of post-training. [01:27:45] This goes back to your question at the beginning of the lecture: how do you evaluate something like ChatGPT? [01:27:50] The answers it could give are basically unbounded, and it's not that there's one right answer; many answers are just as good. [01:27:59] So there are many challenges. [01:28:00] One: you can't use validation loss, because one method might use PPO and another DPO — the validation losses are not comparable. [01:28:08] Second: you can't use perplexity. [01:28:11] That's the thing I told you before: these models are not calibrated anymore; [01:28:16] they don't give you meaningful distributions, they just optimize for one output. [01:28:19] So you can't use perplexity to evaluate these models once they're aligned. [01:28:26] Third: there's a huge diversity of questions that humans might ask these models — [01:28:31] generation, open-ended QA, summarization, all of these things — so there's a lot to cover. [01:28:38] And the tasks are really open-ended, so it's very hard to automate. [01:28:42] That's what you were alluding to before. [01:28:45] So the idea is that instead of trying to come up with easily automated benchmarks, we take the questions that users actually ask these models in practice, [01:28:56] and we ask annotators to say which of two models gives the better output. [01:29:03] Basically the exact same setup as the RLHF preference data, but now used for evaluation. [01:29:10] Yes — I'm not sure I understand what you mean by "can't use perplexity" and "not calibrated." [01:29:14] RLHF is still doing next-token prediction, so why can't perplexity be used? [01:29:21] So, think about the optimal solution after doing PPO: it's basically a model that puts all of its probability mass on a single answer — essentially a delta. [01:29:30] It says there's only one sentence that can be generated for that question. [01:29:36] So if you evaluate it on an answer that is even slightly semantically different, it would assign it a likelihood of essentially zero. [01:29:44] In reality it's not that extreme — as you say, it's still a distribution — but it shows you that there's a fundamental issue with perplexity.
[01:29:51] Once these models go through PPO, they're not trained to do maximum likelihood anymore; they're trained to be policies. [01:30:04] So, probably the most common — or at least the most trusted — benchmark is what we call Chatbot Arena. [01:30:10] The idea: random users on the internet blindly talk with two chatbots, ask many questions, see the two answers, and rate which one is better. [01:30:23] You do that over hundreds of thousands of users, and you get actual preferences and a ranking of models. [01:30:30] You can go on Chatbot Arena right now and interact with these models. [01:30:35] One potential issue, just to highlight: the people who want to do this kind of thing are usually more tech-savvy, [01:30:44] so a lot of the questions are tech questions — discussing software errors, inquiries about AI tools, and things like that. [01:30:52] Another issue is cost and speed. [01:30:54] If you want to use something like this in your development process, it's too costly, because you would need to pay a lot of humans. [01:31:03] So one simple idea, as we've said many times now: use an LLM instead of humans. [01:31:10] You probably know the drill at this point. [01:31:13] For every instruction, generate outputs from some baseline and from the model you want to evaluate. [01:31:19] So here, imagine I'm comparing an answer from ChatGPT and one from Mistral. [01:31:24] I just ask another model — say GPT-4 — which one is better, [01:31:32] and I average that out over my entire benchmark. [01:31:39] That gives me a win rate: a win probability for one model compared to another. [01:31:44] And now you can rank models — this is the AlpacaEval leaderboard. [01:31:50] The benefit is that we get 98% correlation with Chatbot Arena — very high correlation with human rankings; this is the comparison of correlations with other benchmarks. [01:32:02] And it takes less than three minutes and less than $10 to run. [01:32:05] So it's pretty cheap.
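A sketch of this judging loop; judge() stands in for an API call to a strong LLM and is entirely hypothetical.

```python
def win_rate(instructions, model_outputs, baseline_outputs, judge):
    """judge(instruction, a, b) -> True if the judge prefers a over b.
    Averaging over the benchmark gives the model's win probability
    against the baseline. Real setups also randomize the answer
    order shown to the judge, to control for position bias."""
    wins = sum(judge(x, a, b)
               for x, a, b in zip(instructions, model_outputs,
                                  baseline_outputs))
    return wins / len(instructions)
```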
[01:32:06] There are downsides, though. [01:32:08] One of them is spurious correlations. [01:32:11] There are many; I'll just talk about one: LLMs prefer longer outputs. [01:32:19] Humans also prefer longer outputs, [01:32:21] but the issue with LLM judges is that once the bias is there, you keep optimizing against it. [01:32:26] With humans, at some point, if I ask a simple question and you give me five pages of answer, I'll say no, I don't like that answer. [01:32:32] But LLMs, if they have this bias they were trained with, will just keep preferring longer outputs. [01:32:37] So here we see the preferences, showing that both humans and models prefer longer outputs. [01:32:46] And here is another view on the initial AlpacaEval benchmark: [01:32:53] when we look at the win rate of GPT-4 against GPT-4 itself, the standard GPT-4 gets 50%, by definition, since we're comparing it to itself. [01:33:06] But if we ask GPT-4 to be slightly more verbose — we just add "be verbose in your answers" to the prompt — it gets a win rate of 64.4%. [01:33:15] And if we ask it to be concise, it gets 20%. [01:33:17] So there's huge variance depending on whether you ask it to be concise or verbose. [01:33:24] That's very annoying. [01:33:25] One possible solution, which is what we did, is to use some regression analysis. [01:33:31] I won't go into details, but basically you use causal-inference tools to control for length. [01:33:36] And now length matters much less: if you ask the model to be verbose, you still get some gains, but much smaller ones. [01:33:44] Great — that's all about post-training. [01:33:46] For the next eight minutes, I can talk about systems, or just answer questions. [01:33:51] Yes — can you go back to post-training? [01:33:57] How did we tune those parameters using such a small body of fine-tuning data and have such a big effect on the model? [01:34:05] You mentioned earlier that there's a different set of hyperparameters — are we changing just some of the weights, the later weights, or all of them? [01:34:12] What's actually happening? [01:34:14] Yeah, I kind of skimmed through this: you change all the weights. [01:34:17] In industry, they change all the weights. [01:34:20] In open-source land, you might have heard of LoRA, which changes only some of the weights — or, to be more specific, adds a small low-rank difference on top of the weights at every layer. [01:34:33] But in industry, you just fine-tune all the weights. [01:34:37] And to say something else about the data: for this last step, RLHF, you usually collect a lot more data than for SFT. [01:34:45] If SFT is like 5,000, 10,000, maybe 50,000 examples, with RLHF I think you're more around the one-million order of magnitude. [01:34:55] It's still much less than pretraining, though. [01:34:57] Right — pretraining is 15 trillion tokens; this is not even a drop, and yet it influences the weights a lot. [01:35:05] Well, you have to think about how you do it: as I said, the learning rate you use is different, and you train only on that data. [01:35:16] Just imagine I trained on a single sentence, but over and over again: at some point my model would only generate that sentence, even though it was one sentence instead of 15 trillion tokens.
[01:35:29] So if you use a large enough learning rate for enough time, you will basically overfit to that sentence. [01:35:35] The key thing to remember is that you don't mix some post-training data into the pretraining data: [01:35:43] you do pretraining, and then you start fine-tuning only on the post-training data. [01:35:48] Another perspective is that the pretraining is just the initialization of your model. [01:35:54] Once you view it that way — it's just an initialization of the weights — there's nothing special about it. [01:35:59] You don't need to remember that you trained on a lot of data before; the only thing that matters is that you had an initialization, and now you're actually training the model. [01:36:07] It's a Markov property, in some ways: these were your weights, that's my initialization, now I train from there. [01:36:14] Does that answer your question? [01:36:16] Kind of, but you said something just now about it being almost equivalent to rerunning the fine-tuning data many times — is that what actually happens, in order to give it so much influence? [01:36:33] I actually don't know how they do it in industry right now. [01:36:37] When we did Alpaca, we did three epochs, so we did run through the data three times. [01:36:44] But even the number of passes is not really the important thing; [01:36:48] what matters is the effective learning rate. [01:36:56] Great. [01:36:58] So I think I have five minutes. [01:37:06] OK, I'll try to give a high-level overview of at least one of the systems tricks. [01:37:14] Systems: as we said, compute is the huge bottleneck for everyone. [01:37:21] One question you might ask is: why not just buy more GPUs? [01:37:24] Well, GPUs are expensive, but they're also scarce: even if you have $10 million right now, you cannot just buy the best GPUs. [01:37:31] [INAUDIBLE] [01:37:33] There are also physical limitations: with multiple GPUs, you have to communicate between them, and that takes time. [01:37:40] So just buying more GPUs is not that easy, [01:37:43] and it's really important to think about how you allocate resources and how you optimize your pipeline — that's systems. [01:37:49] A quick 101 on GPUs — sorry, I'm going slightly fast; I hope at least some of you can follow. [01:37:55] GPUs are optimized for throughput; CPUs are optimized for latency. [01:38:01] The way to think about a GPU is that one command is run on many, many cores at the same time, on different pieces of data. [01:38:11] This is how you should picture a GPU: there are many cores, which we call streaming multiprocessors, and that's very different from the usual CPU architecture. [01:38:20] So for GPUs, just think high-throughput parallelization. [01:38:24] GPUs are optimized for fast matrix multiplication.
[01:38:27] Every time you do something on a GPU, if you can express it as a matrix multiplication, it's going to be something like 10 times faster than anything else. [01:38:36] That is a little bit annoying, because it means we're, kind of, constrained to doing everything with matrix multiplications. [01:38:44] Another thing to note about GPUs is that compute has been improving faster than memory and communication. [01:38:50] So right now, the data you send to a GPU has a hard time keeping up with its processors: [01:39:00] most of your GPU will actually sit idle if you just run normal, unoptimized code. [01:39:06] Communication is the bottleneck, and this will only continue over time. [01:39:10] Another thing to know about GPUs is that there's a memory hierarchy — the same is actually true of CPUs: [01:39:15] the closer the memory is to your cores, the smaller it is but the faster it runs; further away, there's more memory, but it's slower. [01:39:26] OK, actually, one more thing on this communication issue. [01:39:31] The metric people usually look at is model FLOP utilization (MFU): the observed throughput divided by the theoretical maximum number of FLOPs per second the GPU could run at. [01:39:45] In general, if you reach 50%, you're very happy. [01:39:49] I looked at it for Llama: Facebook was at around 45%. [01:39:52] So even for these big companies, the data doesn't come in fast enough. [01:39:58] So here's one simple trick — maybe the only one I'll have time to tell you about — low precision. [01:40:04] The idea is that if I put my floats in low precision, there are fewer bits I have to move around on my GPU. [01:40:12] Fewer bits means faster communication and lower memory consumption, so things go faster. [01:40:17] And for deep learning, it just happens that the extra decimal places are not that important. [01:40:22] When you do matrix multiplications, when you do SGD for example, there's already so much noise that if you update something by 0.01 or 0.015, who cares? [01:40:33] So instead of using 32 bits per float, which is what people used to use — or 64, which is what you'd use in other domains — you use 16 bits for the matrix multiplications. [01:40:49] For training, you have what we call automatic mixed precision, where some things are in 32 bits and others are in 16 bits. [01:41:00] Generally, the way to think about it is that the weights of your model are stored in 32 bits, [01:41:06] but just before the computation you cast everything to 16 bits, do the computation super fast, [01:41:12] and at the end you update your weights in 32 bits. [01:41:16] The reason you do the updates in 32 bits: if your learning rate is very small, you still want to be able to make a difference to your weights. [01:41:25] So all the computation is done in 16 bits, but the weights are stored in 32 bits. [01:41:30] That's the standard way people do it.
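This is the standard PyTorch automatic-mixed-precision pattern; model, loss_fn, opt, and loader here are placeholders.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # keeps fp16 grads from underflowing

for batch, labels in loader:          # model, loss_fn, opt, loader: placeholders
    opt.zero_grad()
    # Inside autocast, the matmuls run in 16-bit; the master weights
    # that the optimizer updates stay in 32-bit, as described above.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.step(opt)                  # the 32-bit weight update
    scaler.update()
# With bfloat16 instead of float16, the GradScaler is usually dropped.
```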
[01:41:35] OK, I'll actually talk about just one more thing, operator fusion, and then skip the rest, because I think this one is pretty cool. [01:41:40] As I just said, communication is very slow, and every line of PyTorch basically moves variables to and from the global memory of your GPU. [01:41:49] So when you have something like x1 = x.cos(), and then you call x1.cos(), what happens behind the scenes is that you take x, which is data sitting in the GPU's main memory, ship it to the actual processors of your GPU, apply the cosine, and ship the result back to main memory. [01:42:07] Then you hit the next line: ship the data back to the GPU processors, apply another cosine, and ship it back again. [01:42:15] Another way to see it is that you go from your DRAM, the global memory of your GPU, to compute and back for every single line. [01:42:24] This is the naive way of doing it, and it seems very wasteful. [01:42:28] So the simple idea of operator fusion is: communicate once, do all the computation, and ship the result back once. [01:42:35] And this is exactly what fused kernels are. [01:42:39] So if you ever want to make your computations in PyTorch much faster, just apply torch.compile to your model. That will typically make your model around two times faster. [01:42:51] What it does is rewrite your PyTorch code, basically into C++ and CUDA, so that the communication happens only once, then all the operations run, then the result is shipped back.
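As an illustrative sketch of exactly that cosine example (assuming a CUDA device is available):

```python
import torch

# Naive version: each line launches its own kernel, with a round trip
# through GPU global memory (DRAM) in between.
def naive(x):
    x1 = x.cos()
    return x1.cos()

# torch.compile traces the function and generates one fused kernel that
# reads x once, applies both cosines, and writes the result back once.
fused = torch.compile(naive)

x = torch.randn(10_000_000, device="cuda")
out = fused(x)  # first call compiles; later calls reuse the fused kernel
```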
[01:43:07] OK, I'm not going to have time to talk about tiling. Tiling is important. [01:43:11] Parallelization. Parallelization is important. [01:43:15] And mixture of experts. Mixture of experts is important. [01:43:18] Outlook. [01:43:19] There are many things we haven't talked about. We haven't talked about architectures, and we definitely haven't talked about inference. [01:43:27] There are many other things that are important with LLMs. What is the UI that you use? Arguably, the big novelty of ChatGPT was just having a simple UI for using it. Multimodality. All the misuses you could have. The fact that there might not be enough data on the internet to train all these models. The legality of data collection. So many other things. [01:43:45] If you are interested in these topics, I would suggest three classes. [01:43:49] CS224N is probably the one that touches least on LLMs, but it gives some background and historical context for all of the LLMs and covers adjacent material. [01:44:01] CS324, which I think is just called Large Language Models, has more in-depth readings and lectures on everything I talked about. [01:44:10] And CS336, which is Large Language Models from Scratch, where you actually build your own LLM. [01:44:16] It's an amazing class, also given by my two supervisors. [01:44:20] Very heavy workload, so be careful. [01:44:23] Great.