Okay, so let's continue the journey we started last time. If you remember, in the last class we showed how we can actually build an autoregressive large language model, a.k.a. a causal large language model, using this idea of a causal encoder, a transformer causal encoder, and then we showed how you can take a bunch of sentences, use next-word prediction, just run it through, and boom, you get GPT-3. Okay, so that's what we saw last time. I want to point out an important clarification slash correction, which is that when we work with these kinds of causal models, unlike when we work with BERT, for instance, when the contextual embeddings come out, you don't actually have to use ReLU activations here. You can literally just run them through a single dense layer with linear activations and then pass that into a softmax, and boom, you're done. Okay, so that's how GPT-3 and all these models are trained. And the other thing I want to point out, which may not have been clear, is that what comes out of this dense layer, this vector, is as long as your vocabulary, because only then, when it goes into the softmax, do you get probabilities that are as long as your vocabulary, which means you get to pick one word or token out of that entire 50,000-long vocabulary. Okay, I just want to point that out, because I think it's easy for us to get a little confused by this small difference between the way masked language models like BERT work and causal language models like GPT-3 work.
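To make that concrete, here is a minimal sketch of the language-modeling head just described: the contextual embedding for a position goes through one dense (linear) layer whose output dimension is the vocabulary size, and a softmax turns those logits into a probability over every token. The sizes and random weights are illustrative assumptions, not the real model's.

```python
import numpy as np

vocab_size, d_model = 50_000, 768                 # illustrative sizes
W = np.random.randn(d_model, vocab_size) * 0.02   # the single dense layer (linear activation)
b = np.zeros(vocab_size)

def next_token_probs(h):
    """h: contextual embedding (d_model,) for the current position."""
    logits = h @ W + b            # no ReLU, just the linear layer
    logits -= logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()            # softmax over the entire vocabulary

probs = next_token_probs(np.random.randn(d_model))
print(probs.shape, probs.sum())   # (50000,) 1.0
```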
Okay, so now let's continue: we know how to build GPT-3. So what about GPT-1 and GPT-2, what's up with them? Why is GPT-3 so famous and not GPT-2? Well, first of all, you folks know that GPT stands for generative pre-trained transformer. Now, GPT-3, GPT-2, and GPT-1 were trained in basically the same fashion, predict the next word, with the same sort of transformer stack, except that GPT-3 was trained on much more data, because the underlying transformer stack had many more layers. It is a much bigger stack, meaning lots more parameters, and therefore you need lots more data to train it well. So that was really the only difference. The difference was literally one of scale: scale of network and scale of data. And unlike GPT-1 and GPT-2, GPT-3, even though it was trained basically the same way with the same kind of network, was one of those situations where more became different. There was almost some sort of phase change that happened between two and three. Unlike GPT-1 and GPT-2, GPT-3 could do amazingly coherent continuations of any starting prompt.

So for example, if you have this little prompt that says "The Importance of Being on Twitter, by Jerome K. Jerome" (who was a famous humorist), and you give it this prompt ending with the word "it", it produces this continuation, which is strikingly good. And if any of you have read Jerome K. Jerome and you read this thing, you'll be like, "Wow, that actually sounds like Jerome K. Jerome." So, amazing continuations. But the interesting thing here is not so much the continuation; it's the fact that if you give the same prompt to GPT-2 or GPT-1, it won't be very good. In fact, after the first one, two, or three sentences it'll become incoherent, meander, and start rambling. This thing can keep faking it for a lot longer. That's the amazing thing that was unexpected; researchers did not expect this. But it wasn't good at following your instructions.

So for instance, if you ask it, "Help me write a short note to introduce myself to my neighbor," this is the kind of thing it'll come up with. And you can actually run it yourself: you can go to GPT-3 in the playground (I think GPT-3 is still available in the playground), and if it is, you can try running these prompts. You will start getting garbage very quickly. For example here, "Help me write a short note," and it says, "What's a good introduction to a résumé?" Résumé has somehow glommed on here; I have no idea why. But the reason it's doing stuff like this is that a lot of the training data it was trained on is basically lots of lists of things. So when you say, for example, "the capital of France", continue, it'll come back with "the capital of France is Paris, the capital of Hungary is Budapest," and so on. It just starts producing a list. So it's very list-driven: it thinks you need it to complete some sort of list. That's what's going on here. And so it's not very good; it doesn't realize that you're actually asking it to do something specific.

So this is the problem when you have an autocomplete thing that doesn't realize what you're asking it; it just thinks it's an autocomplete. Now, in addition to these unhelpful answers, it can also produce offensive answers, factually incorrect answers, and so on. The list of bad things it can do is long. So why does it do that? Why does it produce unhelpful answers? Well, as you recall, it was only trained to predict the next word. It wasn't explicitly trained to follow instructions. So it seems reasonable that if it's simply trying to guess the next word repeatedly, it can't really do anything more. How can it figure out that there's an instruction it needs to follow?
Unless the training data on the net was all instructional, which it clearly is not. So, light-bulb idea: let's explicitly train it with instruction data. Let's just train it with instruction data. And so OpenAI developed an approach called instruction tuning to do exactly this. And this paper is the paper that was the breakthrough; this is what actually put ChatGPT on the map. It's very readable, so I would encourage you to check it out if you're curious.

And so we had GPT-1, GPT-2, GPT-3, just bigger and bigger models trained the same way, and then we run into the problem that it can't handle instructions. So we do instruction tuning to get to 3.5, also called InstructGPT. And then a small tweak after that gets you ChatGPT. And by the way, there are really two things going on in this step, as you will soon see. I'm just calling it instruction tuning so that I don't have to say something long every single time; it's not a consistent piece of terminology, so just be aware of that.

All right, first step: they got a bunch of people to write high-quality answers to questions, and they created about 12,500 such question-answer pairs. So for example, let's say this was the question: "Explain the moon landing to a six-year-old in a few sentences." Believe it or not, GPT-3's answer to that question was another question, because it thinks there's a list of questions it needs to autocomplete. So it comes up with "Explain the theory of gravity to a six-year-old." It's like one of those people who, when you ask them a question, ask you a question back. So what they did is they said, "Okay, let's create a nice answer to this question," and here's a human-created answer: people went to the moon in a big rocket, walked around, blah blah blah. A much better answer to that question. And once you create these 12,500 question-answer pairs as training data, you just train GPT-3 some more, using next-word prediction as before. No difference. So here is the input, "Explain the moon landing...", this is the question, and then we have the answer right there. And then we take that answer, move it over, and shift it by one position, so that when it finishes the question it needs to predict "People", and then you give it "People" and it needs to predict "went", and so on and so forth. Just like we saw before: "the cat sat on the mat" became "the cat sat on the" as input and "cat sat on the mat" as the shifted target. That's what makes prediction possible and necessary. So that's what they did; this is step one, same as before. And once you do that (this step is called supervised fine-tuning), it turns out it really helped. GPT-3, once you supervised fine-tuned it, was much, much better at following instructions.
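As a concrete illustration of that shifting, here is a minimal sketch of how one supervised fine-tuning example might be turned into next-word-prediction inputs and targets. The whitespace "tokenizer" and example text are simplifications for illustration; real pipelines use subword tokenizers.

```python
# Build (input, target) pairs for next-word prediction from one SFT example.
question = "Explain the moon landing to a six-year-old in a few sentences."
answer = "People went to the moon in a big rocket and walked around."

tokens = (question + " " + answer).split()   # toy whitespace tokenizer
inputs = tokens[:-1]                         # everything except the last token
targets = tokens[1:]                         # the same sequence shifted by one

for x, y in zip(inputs, targets):
    print(f"after {x!r:>12}  predict {y!r}")
# Training then minimizes the usual next-word cross-entropy on these pairs.
```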
But there's a small problem with this approach: it takes a lot of money and effort to have humans write high-quality answers to thousands of questions. It takes a lot of money. So the question is, what can we do? What is easier than writing a good answer to a question? Well, what? Okay, how about somebody from this side?
>> Yeah, Joseph.
>> Perhaps writing a question for an answer.
>> Oh, that's actually a good one. Yeah, I like that. So, given an answer, find a question. And while that is not what I'm going to talk about here, that technique is actually used very heavily in LLMs. So that's great, very creative. Mark?
>> Thumbs up, thumbs down.
>> Sorry?
>> Thumbs up or thumbs down?
>> Thumbs up or thumbs down. Exactly. Because everyone loves to be a critic; it's much easier to be a critic than to be a creator. Right? So what do we do? We basically say, let's rank answers written by somebody else. Which begs the question: who's going to write those answers? And there's a brilliant answer to that question. Wikipedia? Reddit? No: we will just ask GPT-3 to write the answers. It might be crap, but we don't care, because we can rank them.

So we ask GPT-3 to generate several answers to the question. And how can we generate several answers? Because we can do sampling. We can do sampling. The fact that we had these stochastic outputs because of sampling is now a feature, not a bug. We create lots of different answers to the question: we feed it a question and get, say, three answers out. Just run it three times, get three answers out, with a nice temperature of like 1 or 1.1 or something so that it's nice and random. And then we literally have humans just rank them, do the thumbs up, thumbs down, rank them from most useful to least useful. Okay, so this is step two of instruction tuning. OpenAI collected 33,000 instructions, fed them to GPT-3, generated answers, and had humans rank them. And once you do this, you can assemble a beautiful training data set. So basically what we have is an instruction and, let's say, just two answers, A and B. In practice you can have many answers which you rank, but for simplicity I'll go with Mark's thumbs-up/thumbs-down sort of answer: let's assume you have only two answers to every question. And the human has said, "I prefer this one to that one." That's it. So we now have a data set where a data point is: instruction, preferred answer A, other answer B. Yeah?
>> The thumbs up, thumbs down technique that we're talking about, is that why ChatGPT now also uses thumbs up, thumbs down? It's using our answers to train?
>> Exactly. Right.
>> Yeah.
>> So yeah, all the models have the thumbs up, thumbs down stuff going on somewhere.
They are all collecting data for this step.
>> Thank you.
>> Yeah. It's the old adage, right? If you're not sure who the product is, you are the product. So it's one of those things. Yeah?
>> So, if we understand correctly, when we see thumbs up, thumbs down, it does mean that ChatGPT is going to train on our data, right?
>> Unless you opt out, yeah. So if you actually go to the ChatGPT settings, there is something called data controls or something like that; you can toggle it off, but I think, when I last checked, if you toggle it off you lose your chat history. So they have hobbled that feature to discourage people from turning it off as much as possible. Clever. But you can opt out, and if you use the API, as opposed to the web interface, you're automatically opted out; you'd have to deliberately opt in. And if you use the versions that are available through Microsoft Azure and so on, there are all kinds of safety controls and things like that. In fact, I think with the Microsoft Copilot license that MIT has, the default is opted out.

Okay. So, to go back here: once you have these data points, you can build something called a reward model. And this is a very clever piece of work. What you do is you have an instruction, you have a preferred answer, and you have the other answer. You feed it to a network. This is just a nice language model, right? It's just a language model. And the language model produces a number which measures how good this thing is: how good an answer is this to that particular instruction. So you get a rating here and a rating here, and then you run them through a little loss function which essentially encourages the model to give higher numbers to the better answer. It's the same model: you just run the question with the first answer, then the question with the second answer, and you get these two numbers. Initially those numbers are just random, but then you tell the model, "Hey, this is the preferred thing; make sure the preferred answer's rating, the R value, is higher than the other number," because more is better, higher is better. And this thing is just a sigmoid here: you basically take the difference of these two numbers,
pass the difference through a sigmoid, and take the logarithm. And you can convince yourself afterwards, and I encourage you to check this for yourself, that if we give a higher number to the better answer, the loss will be lower; and since we are minimizing loss, we're essentially training the network to try to give higher ratings to better answers. That's it. So that's the approach. Did you have a question? Yeah, Ben.
>> So you could imagine training the model on only the good answers. Is the idea of having both that the model actually learns what makes an answer good?
>> Correct. Exactly. Much like if you want to build a dog/cat classifier, you have to show it pictures of both.
>> Yeah.
>> So, I understand the feedback mechanism of thumbs up, thumbs down, but there are a lot of times when the popular response is not the accurate one. So is there a layer where they actually correct for that?
>> Yeah, good question, Swati. As it turns out, all these companies like OpenAI have a huge document, 100 to 200 pages long, a very bulky document, which instructs and teaches the labelers, the rankers, how to rank these things. They have to follow these very strict guidelines to precisely handle strange corner cases and things like that. And that document is on the web; you can dig it up, and it's actually very instructive to read through. I think they put it out on the web because they wanted to convince people that they go to inordinate lengths to make sure the rankings are actually good. Do you have a question? Comment? Okay.

All right. So, back to this: how do you train this thing? SGD. You have a network, it's coming up with an answer, and you have some way to know whether that answer is good or bad (better answers give lower loss), so you backpropagate through the network, keep updating the weights, and boom, you're done. And once you do that, this reward model can provide a numerical rating for any instruction-answer pair. You just give it an instruction and an answer, could be a crappy answer or a good answer, and it tells you how good it is. So in this case, for example, maybe it gives a number like 1.5 for this answer, but then a better answer comes along and gets a 3.2. What we have done with this whole modeling exercise is that we have essentially learned how humans rank responses, because we can only have humans rank responses for some finite number of questions. What we really want is to automate that ranking process so that we can do it for tens of thousands of questions really fast. So we have essentially built a model of how humans rank things, which is beautiful. A lot of the stuff here is very self-referential, which I find very elegant.
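Here is a minimal sketch of the pairwise loss described above, just to make the "sigmoid of the difference, then the logarithm" idea concrete. The ratings are made up, and this is the generic preference-ranking loss form, not OpenAI's actual code.

```python
import math

def pairwise_loss(r_preferred: float, r_other: float) -> float:
    """-log(sigmoid(r_preferred - r_other)): small when the preferred answer scores higher."""
    diff = r_preferred - r_other
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Made-up ratings from a hypothetical reward_model(instruction, answer):
print(pairwise_loss(3.2, 1.5))   # preferred answer rated higher: low loss (about 0.17)
print(pairwise_loss(1.5, 3.2))   # preferred answer rated lower: high loss (about 1.87)
# Minimizing this loss pushes the network toward giving higher ratings to human-preferred answers.
```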
Anyway, so this can be used to improve GPT-3 even further. We take the instruction, as before, and feed it in; it gives you some answer. Then we feed this instruction and the answer to our newly minted reward model, and it gives us a numerical rating. And then, this is the key step, we take this numerical rating and use it to nudge the internal weights of GPT-3 in the right direction. This nudging uses a technique called reinforcement learning, which, in the interest of time, we can't get into in this lecture. But that's the technique you use to nudge these things in the right direction. So that's what we do; that's reinforcement learning; we nudge it in the right direction. And OpenAI did this with 31,000 questions. Nudge, nudge, nudge, nudge, nudge. And when you do that, you get GPT-3.5, a.k.a. InstructGPT. That's it. And by the way, this step here is called reinforcement learning with human feedback, because we use reinforcement learning, and since humans ranked the answers that fed into the building of the reward model, we get the human feedback. Okay, that's reinforcement learning with human feedback. Yeah?
>> I have a question regarding the type of questions that they're using. I can imagine maybe there are very simple questions to answer, but now you can ask GPT to, for example, respond as a pirate or something like that. It's going to be harder to train if you have a bunch of questions that only involve small interactions and then there's a question like that.
>> That's a good question. The quality of the questions in the data set clearly is a big factor, because if you have simplistic questions, it won't be able to handle complex questions later on. And that actually begs the question of where they got these questions from. They got them from their API. People were asking GPT-3 questions on the API, right before it became 3.5; the API was already fully commercially available, and a lot of people were building products on it by then. So they collected all those questions and filtered them for quality, and that was the question set they used. Then they judiciously added to it with human-created questions, but they couldn't do a lot of that, because it's expensive. Collecting stuff that somebody else is already asking your API: very easy.
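Coming back to the nudging step for a moment: as a very rough intuition for what "use the rating to nudge the weights" means, here is a toy sketch in the spirit of a simple policy-gradient (REINFORCE-style) update. It is emphatically not PPO, which is what OpenAI actually used, and the canned answers, fixed rewards, and learning rate are all made up for illustration.

```python
import math
import random

# Toy "policy": a softmax over three canned answers to one fixed instruction.
answers = ["People went to the moon in a big rocket.",
           "Explain the theory of gravity to a six-year-old.",
           "I don't know."]
logits = [0.0, 0.0, 0.0]

# Stand-in for the learned reward model: fixed, made-up ratings per answer.
reward = {answers[0]: 3.2, answers[1]: 0.1, answers[2]: 1.0}

def probs():
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [x / s for x in z]

lr = 0.1
for _ in range(500):
    p = probs()
    i = random.choices(range(len(answers)), weights=p)[0]   # sample an answer
    r = reward[answers[i]]                                   # score it with the "reward model"
    for j in range(len(answers)):                            # nudge: raise log-prob in proportion to reward
        logits[j] += lr * r * ((1.0 if j == i else 0.0) - p[j])

best = max(range(len(answers)), key=lambda j: logits[j])
print(answers[best])   # the policy ends up concentrated on the highest-reward answer
```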
Yeah, Tomaso?
>> This might be more of a philosophical question, but the human bias that's present in the small subset of human labelers they've chosen eventually gets compounded into this model, which we often consider a source of objective truth.
>> Yes, that's very true. I think the reward model will probably very faithfully learn all the biases of the human labelers, which is why they have these very complex frameworks and guidelines to try to prevent the bias from happening, to mitigate it. So for example, they might give the same question and set of possible answers to many different labelers, and only if people pick the same ranking might they use it, so that at least inter-labeler bias can be minimized. But if everybody is biased in the same direction, that won't protect you. In general, there's a whole body of work being done to try to debias these things and build them without too much bias in them. It's a whole world unto itself, which we just don't have time to get into. Olivia?
>> Depending on the medium that's being returned by these models, would there be more than one reward model? Because isn't this what Gemini is running into issues with right now, with their image generation, the bias that they try to...
>> Yeah. So the Gemini business that's going on: it's unclear what's causing it. It may be in this step; maybe they were a little overzealous in preventing certain things from happening. Some of these systems will also intercept the question you ask and route it differently based on what they sense is sitting in the question. So there could be pre-processing, post-processing, a lot of stuff going on. So it's unclear to me where in the pipeline these things enter, and it could be more than one place. But yes, this may very well be where it enters: a situation where people are told, "If you see this kind of answer, downrank it, don't uprank it," and then the model learns that ranking very faithfully and proceeds to apply it where it should not be applied. That does happen. Joselyn, you had a question?
>> I think I still don't totally understand why, when I ask ChatGPT a question, even in a lengthy response it doesn't wander away from the topic I'm asking about. Understanding that it's predicting each word, it's sort of taking a random walk from one word to the next, in some sense.
>> But each word it utters now becomes part of the input to the next word it utters.
>> Right.
>> So it's not truly a random walk in that sense; the next step is not independent of the previous step. It depends on the journey so far, so it's going to try to be very consistent with the journey so far.
>> Okay. Does this part with fine-tuning it on these question-answer sets play some role in it being able to constrain itself and not meander away?
>> I don't think so.
I think this is more to make sure that the weights generally tend to produce the right answer. Now, one thing that is possible: when I'm a ranker and I'm looking at a few different answers, I have to figure out if the answer is helpful, if it is accurate, if it is non-toxic, things like that, and part of the rubric for evaluating these answers could be their coherence. So it could also be that they are saying short coherent answers are better than long ones, but once you adjust for length, maybe coherence is more important. It could be any number of these things. So it could play a role in that.
>> So, just one small follow-up. In other words, when it's learning from these question-and-answer pairs, it's able to look at the whole response and learn something about the whole response, rather than just one word at a time, right?
>> Correct. Yeah, the entire answer is being ranked.
>> Yeah.
>> Correct. Correct.
>> Yeah. On a related note, when it's generating a new word on a topic, does the attention pertain to the entire prior text, or can you have, like, traveling attention, say the last five words?
>> So, the short answer is yeah, you can; it's called sliding window attention. It can be done. They typically tend to do it not so much because they want to focus on the more recent words, but because it actually makes things very compute-efficient. That's why they do it. It's called sliding window attention; you can Google it.
>> So normally it's full attention?
>> Normally the default is full attention.

Okay. So that's what they did. And by the way, as I think you pointed out, that's exactly what's going on: you're training the reward model with these thumbs up and thumbs down. Hold on to the questions. And if you give the same question to GPT-3.5, a.k.a. InstructGPT: amazing answer. Like a night-and-day difference, an amazingly good answer. And then, to go from 3.5 to ChatGPT, they basically followed the exact same playbook, except that because they wanted a chatbot, meaning something that could carry on question, answer, question, answer, as opposed to just a single question and answer, they wanted conversation, so they trained it on conversations. That's it. Instead of training it on instruction-answer data, they trained it on instruction, answer, instruction, answer, instruction, answer: a sequence of such things strung into a conversation. That's it; that is the only difference between 3.5 and ChatGPT.
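To make that difference in training data concrete, here is a rough sketch of what a single conversational training example might look like next to a single instruction-answer example. The point is the structure (alternating turns flattened into one training sequence); the field names, example text, and separator are illustrative assumptions, not OpenAI's actual format.

```python
# One instruction-tuning example: a single (instruction, answer) pair.
sft_example = {
    "instruction": "Write a short note introducing me to my neighbor.",
    "answer": "Hi, I'm your new neighbor in 4B. I'd love to say hello sometime!",
}

# One conversational example: alternating turns strung into a single sequence.
chat_example = [
    {"role": "user", "content": "Write a short note introducing me to my neighbor."},
    {"role": "assistant", "content": "Hi, I'm your new neighbor in 4B. I'd love to say hello sometime!"},
    {"role": "user", "content": "Can you make it more formal?"},
    {"role": "assistant", "content": "Dear neighbor, I recently moved into apartment 4B and wanted to introduce myself."},
]

# Either way, the example is flattened into one token sequence and trained
# with next-word prediction, exactly as before.
flat = " <sep> ".join(f"{turn['role']}: {turn['content']}" for turn in chat_example)
print(flat)
```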
And now, given that, ChatGPT gives you a much nicer response, and then you can ask a follow-on question, "Can you make it more formal?", and boom, it gives you a nice response, because now it knows about conversations; it's been trained on conversational data. So that's it. That's how they built ChatGPT, and all the things we are seeing later on are all continuations of this sort of approach. So let's pause for a couple of quick questions. Swati, you had a question, then we'll go to you, and then to you. Yeah?
>> So does it make a difference if a new question-answer pair, or new training data, comes in early in the building of the model or later in the building of the model?
>> You mean, does the order of the questions matter?
>> So I might have, let's say, 5,000 images to start with. Then, after my model is trained and developed, a new use case comes in. Will it make a difference if I add it in now?
>> So, if you have a new use case for which you want to essentially adapt the model, there's a whole set of techniques you use, which is going to be the next section.
>> But it's not...
>> Yeah, because what you have out of the box is just a generally good chatbot. It knows about a lot of stuff because it's been trained on those 30 billion sentences; it can answer a lot of questions reasonably well using common sense and world knowledge. But any specific use case, like medical and so on, it may not know. So you'll need to adapt it to your particular situation, and that's coming. All right. Yes, Habit?
>> What determines whether a whole conversation is ranked positively versus a specific answer within it?
>> Is it: if the first answer doesn't get a positive response, but after a follow-up the second one does, is that correct?
>> Exactly. So if you're a human and you read the transcript of an exchange between two people, and I give you two exchanges which both start with the same question, you'll be able to assess which one is the better transcript. That's basically what's going on. There was a question here, right? Yeah?
>> So I was wondering: when you ask a question, very often you can sort of tell that something was not written by an actual person. Do you think that comes from the reinforcement learning part, or where do you think it comes from?
>> It's a good question. I don't know, because I know that part of the evaluation, the ranking rubric they use, is to favor responses that sound more humanlike rather than robotlike. So if anything, I'm hoping that reinforcement learning would actually make it sound more humanlike, because the rankers would have prioritized that. So if it still comes up with robotic stuff, it's something else that's going on. Maybe a lot of the text on the internet is not literature; it's just people writing some crap, right? So it could be that. Yeah.
>> How much of this instruction tuning or conversational tuning is happening in real time, within a conversation?
>> None of it.
>> None of it. So as you give feedback to the model, it's just basically regenerating, like, "I don't like that answer, come up with something else"?
>> No, it's not doing it in real time. Basically, whatever signals you're giving it with this thumbs up, thumbs down business get added to the training logs, and they periodically retrain it.

Okay. So, by the way, this is instruction tuning in a nutshell, and I want to point that out. You don't have to read the whole thing, but just to quickly point out: this was where we had to have human involvement, in the first step writing a lot of responses to these questions, and then ranking the answers. So these two steps are still human-labor intensive. Now, it turns out you can actually use helper LLMs to automate this too. This is not what OpenAI did in the beginning with ChatGPT, but now you can do it this way, because there are lots of really good LLMs available to automate many of these things. We don't have time, but if you're curious, I had a little blog post on this; check it out.

Okay, so now we come to this question: if you want to take a base LLM like GPT-3 and make it useful and responsive to instructions, we have seen that we had to adapt it with high-quality instruction-answer data, using supervised fine-tuning and reinforcement learning with human feedback. That's what made GPT-3 actually useful and turned it into ChatGPT. By the same token, this holds true more generally: if you want to take a large language model and make it useful for a medical use case, a legal use case, or some other narrow business use case, you have to adapt it with business-domain-specific data. So let's look at techniques for doing so. Adaptation is the rough name for the process of taking a base large language model and tailoring it to your particular use case. And there's a ladder of things you can do, and we're going to look at every one of them. You can do this thing called zero-shot prompting, which is: you literally ask the LLM, nicely and clearly, for what you want, and maybe it just gives it to you. This is the use case we're all used to in the web interface. You can also do something called few-shot prompting, where you ask it something and you also give a few examples of the kind of thing you want, and that helps it a great deal. And then there's this thing called retrieval-augmented generation, and fine-tuning, and we'll look at all of them; I'll explain all these things as we go along. Okay, so let's start with zero-shot prompting, where, by the way, the word "shot" is a synonym for "example": zero-example prompting.
You literally ask in the prompt for what you want, without giving even a single example. So let's say we want to look at product reviews and build a detector to figure out whether a product review contains (not sentiment, that's kind of boring) a description of a potential product defect or not. And here is something I actually pulled off Wayfair, with apologies to Wayfair. It says, "The curve of the back of the chair does not leave enough room to sit comfortably." Sounds like kind of a defect-ish thing, right? So, back in the day, you would have collected all these reviews and built a special-purpose NLP-based classifier to figure out: defect, yes or no. Here, you can literally just feed this thing into GPT-3 and ask it, "Tell me if a product defect is being described in this product review," followed by "The curve of the back of the chair...", and boom, it comes back and says, yep, that's a product defect. So that's zero-shot: you just ask a question, you get the answer back. And it actually works remarkably well, and the bigger, better models tend to be much better at zero-shot than the smaller, simpler models.

All right. Now, when you adapt an LLM to a specific task, obviously you need to carefully design the prompt. As you folks know, this is called prompt engineering, and we're not going to spend much time on prompt engineering, except that I just want to give a simple example. If you ask ChatGPT, "What is the fifth word of this sentence?", very often it'll give the wrong answer. It's very strange why it can't get this question right; it's a very simple question. Sometimes it gets it right, but very often it gets it wrong. But now you can do a little prompt engineering and it'll always get it right. For example, you can say: "I'll give you a sentence. First list all the words that are in the sentence, then tell me the fifth word. Here is the sentence." Boom, it gets it right. So that's an example of how you can help it along by being very prescriptive about what you want it to do and breaking down all the steps. Don't make it guess things, and it does a great job.
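To give a concrete picture of those two prompting styles, here is a minimal sketch of the zero-shot defect prompt and the more prescriptive fifth-word prompt as plain strings. The exact wording is illustrative, and `ask_llm` is just a placeholder for whatever chat-completion call you are using.

```python
# Zero-shot: just ask for what you want, with no examples.
review = "The curve of the back of the chair does not leave enough room to sit comfortably."
zero_shot_prompt = (
    "Tell me if a product defect is being described in this product review.\n"
    f"Review: {review}"
)

# Prescriptive prompt engineering: spell out the steps instead of making the model guess.
sentence = "The quick brown fox jumps over the lazy dog."
prescriptive_prompt = (
    "I'll give you a sentence. First list all the words in the sentence, "
    f"then tell me the fifth word. Here is the sentence: {sentence}"
)

# ask_llm(...) is a stand-in for your chat-completion API call of choice.
# print(ask_llm(zero_shot_prompt))
# print(ask_llm(prescriptive_prompt))
```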
Okay. So anyway, there are lots of other tricks people have figured out over the last couple of years. For a long time this one was pretty hot: you say, "Let's think step by step." You give it a question and say "Let's think step by step," and it actually has a better shot at giving you a good, accurate answer back. Now, this kind of thing is already baked into the LLMs. When you ask ChatGPT a question, your prompt gets appended to what's called the system prompt, and the whole thing goes into the LLM. You never see the system prompt, and the system prompt is telling ChatGPT things like: think step by step, take your time, don't blurt out an answer, stuff like that. And you can just Google it; the system prompts have been jailbroken and you can find them on the web.

And this is funny; this came out maybe a month or two ago: apparently "Take a deep breath and work on the problem step by step" works better than just saying "work on it step by step." And then more recently (I literally read this two nights ago), apparently if you have a math or reasoning question and you tell it, "You are an officer on the Starship Enterprise; now solve this problem for me," it's more likely to get it right. Go figure. Thomas?
>> I read two more that were super fun.
>> Yeah.
>> One was "I will keep you if you solve this correctly," and the other one was for when the answer was "I cannot do that." I tried it on Gemini, and the way to solve it was to go back and forth: "Can you solve this? Can you solve this?"
>> Nice. Very good, excellent. One more thing, just on that, let's have some fun: you can say, "I'm going to tip you a thousand bucks if you solve this." And apparently this person kept using this tip, and at one point the model said, "You keep promising me tips and you never give me the tip, so I'm not going to solve this problem for you."

Okay. So, there are many prompt-engineering resources; this one came out a couple of weeks ago and I thought it was pretty good, so I just put a link to it here. Now let's look at few-shot prompting, where you give it a few examples. So here, let's say we want to build a grammar corrector. What you can do is give it examples of poor English and good English. You can see: poor English, "I eated the purple berries"; good English, "I ate the purple berries"; and similarly, three such examples. And then you end the prompt with just the poor-English input, and the response from GPT-3 is the good-English output: it fixes the error. So this is an example of giving a few examples of what you want, and it just learns on the fly what you have in mind, what your intention is. Okay, so that's that.
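Here is a minimal sketch of what such a few-shot prompt might look like as a single string: a few poor-English/good-English pairs followed by the new input, leaving the model to supply the corrected line. Beyond the one pair quoted above, the examples are illustrative.

```python
# Few-shot prompt: show the pattern with a few examples, then leave the last slot open.
few_shot_prompt = """Correct the grammar.

Poor English: I eated the purple berries.
Good English: I ate the purple berries.

Poor English: She no went to the market.
Good English: She didn't go to the market.

Poor English: He go to school yesterday.
Good English:"""

# Sent to the model, the expected continuation is something like:
#   "He went to school yesterday."
print(few_shot_prompt)
```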
Now, the ability of LLMs to learn from just a few examples, or even no examples and just a clear instruction, is called in-context learning, and that was something GPT-2 and GPT-1 could not do. It was new in GPT-3, and it's what they call an emergent capability: it was completely unanticipated by the people who built it. All right, so that's that.

Now let's look at retrieval-augmented generation; by the way, this thing is also sometimes called indexing. The idea of RAG (retrieval-augmented generation) is actually very simple. Let's say we want to ask a question to a chatbot, but we want the chatbot to leverage proprietary data that we might have. Maybe it's customer support in a call-center kind of operation, and you have this massive FAQ database, a content database, and you want to give that FAQ to the chatbot along with your question, so that it can leverage the FAQ to answer the question for you, as opposed to whatever it has learned previously in its general training. So can't we just include the entire FAQ, the whole data set, in the prompt and send it in? Maybe we just take our question, take everything we have in the database that is potentially relevant to the question, and attach it to the question; the whole thing becomes a prompt. Feed it in and say, "Hey, find the answer for me." Can't you just do that?

Theoretically, what stops us? The reason you can't do it is this pesky thing called the context window. For any LLM, the prompt plus the output, their combined length, cannot exceed a predefined limit, called the context window. Remember the max sequence length we had in our earlier models, which was the size of the sentence that could be fed in? Basically, there is such a size for any of these things; it's called the context window. There are only so many tokens it can accommodate, and since what comes in is what comes out, the limit covers the input and the output together. That's the context window. Furthermore, when you have a conversation with one of these chatbots, the entire conversation is fed in every single time. That's how it actually remembers what happened earlier in the conversation; it doesn't have any memory per se. Each time you ask a question, the entire thread is fed in. So initially you ask, "What's the square root of 17?", and it gives you an answer; initially, only that question goes in. Then for the next question, the first question, its answer, and the second question are all fed in; then all of those are fed in the next time. So as the conversation goes on, you're consuming more and more of the context window. So can you imagine taking a whole FAQ, asking a question, and then saying, "Well, I didn't mean that, I wanted something else"? Before you know it, boom, you've blown out the context window, and it's going to come back and give you an error.
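Here is a tiny sketch of that re-feeding behavior: on every turn, the whole message history goes back in, so the prompt grows with the conversation. The message format mirrors the common chat-API convention, and the token count is just a crude word count for illustration.

```python
# The chatbot has no memory: every turn, the full history is resent.
history = []

def send(user_message, assistant_reply):
    history.append({"role": "user", "content": user_message})
    # In reality the reply comes back from the LLM; here it is passed in for illustration.
    history.append({"role": "assistant", "content": assistant_reply})
    tokens_used = sum(len(m["content"].split()) for m in history)  # crude stand-in for a token count
    print(f"this turn sends {len(history)} messages, roughly {tokens_used} 'tokens' of context")

send("What's the square root of 17?", "It's about 4.123.")
send("And of 18?", "It's about 4.243.")
send("Now, using this whole FAQ, answer my question: ...", "...")
# Each call consumes more of the fixed context window than the last one.
```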
>> If the conversation gets longer than the context window, does it just cut it off, or does it take specific windows of it?
>> Yeah. So there is a whole research cottage industry around what to do when your thread is longer than the context window. The simplest case is a moving window: if you have a thousand tokens, you just look at the last thousand tokens. But there are cleverer schemes where you take the earlier stuff that doesn't fit into the window, use another LLM to summarize it, and then attach the summary to your current prompt. I know, it gets crazy.

So, for all these reasons, we need to pick and choose what we send in order to answer a particular question. What we do is: since we can't include the whole thing, we first retrieve the relevant content from the database or the FAQ and then send it to the LLM along with the question we have. Retrieval-augmented generation: that's what's going on. Make sense? And so, pictorially, basically what we do is this. Let's say this is our external set of documents; think of it as the FAQ. We take each question and answer in the FAQ, treat it as its own little unit of text, and then calculate a contextual embedding for each of those question-answer pairs. Remember, we know how to do contextual embeddings; that's a piece of cake at this point. You folks know how to do a contextual embedding: run it through something like BERT and you're done. So you get embeddings for everything in your FAQ. And now, when a new question comes in, you take that question and calculate a contextual embedding for it too. Then you look to see which of the FAQ elements, which of those chunks, are the most similar to your question. You grab the ones that are most similar, pack them into the prompt, and send it in. Maybe you have 10,000 questions, but you can only accommodate five of them in your prompt because the context window is small; so you pick the five pieces of content you think are most relevant to your particular question and feed those in. That's the idea; that is retrieval-augmented generation.
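Here is a minimal sketch of that chunk, embed, retrieve, and pack flow. `embed()` is a placeholder for whatever embedding model you use (BERT, an embeddings API, and so on), cosine similarity is the comparison we come back to below, and the FAQ entries and the top-k of 2 are made up for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice, call BERT or an embeddings API here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

faq_chunks = [
    "Q: How do I reset my password? A: Click 'Forgot password' on the login page.",
    "Q: What is the return policy? A: Returns are accepted within 30 days.",
    "Q: Do you ship internationally? A: Yes, to most countries.",
]
chunk_vecs = [embed(c) for c in faq_chunks]          # computed once, offline

question = "How can I change my password?"
q_vec = embed(question)

# Retrieve the top-k most similar chunks (k=2 here because the context window is small).
scores = [cosine(q_vec, v) for v in chunk_vecs]
top = sorted(range(len(faq_chunks)), key=lambda i: scores[i], reverse=True)[:2]

prompt = "Use the FAQ entries below to answer the question.\n\n"
prompt += "\n".join(faq_chunks[i] for i in top)
prompt += f"\n\nQuestion: {question}"
print(prompt)   # this packed prompt is what finally goes to the LLM
```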
Yeah, Rolando?
>> So does this tie in, for example, if I were to prompt it and say, "Help me work on my startup pitch, but in the voice of Steve Jobs"? Is it then going out there and reducing the subset of data to things that have been written by Steve Jobs, and then generating its response based on that?
>> Not as a default, typically, because there's a lot of Steve Jobs material on the web and it's just using that; it's all part of its pre-training data. This tends to be more useful for very targeted applications where you don't expect it to know the answer, because the answer is not on the public internet: it's your proprietary data, you want it to use that proprietary data, and this is how you do it. Yeah?
>> [inaudible question about chunking a document]
>> Sure, there will be some loss, because you have to figure out how to chunk it. Maybe you have a 300-page PDF, and maybe you look for each section and make it a chunk, or maybe you make each paragraph a chunk. Again, there's a whole empirical cottage industry of techniques for doing these things better or worse, depending on the use case and so on. But the conceptual idea is: chunk and embed.
>> So chunking is another step.
>> Yeah. In fact, we're going to do it ourselves in the Colab right now.
>> Can we give more weight to certain chunks?
[laughter]
>> In the default implementation, no. But in some sense, by picking the five most relevant chunks out of 10,000 chunks, you're giving the other 9,995 chunks a weight of zero and these five a weight of one. So in some sense you are weighting them.
>> Yeah.
>> I was just curious how much structure you have to have in an external document, say for a hospital or something. Do you have to do a bunch of labeling?
>> No, you just need to make sure it's relatively clean. But you will see in the Colab that it can be kind of crappy and it still works, because there is so much crap on the internet that it has been trained on already.

Okay, so let's look at the Colab. By the way, retrieval-augmented generation is, in my opinion, the most prevalent business application of LLMs that I've seen to date, and there's a huge ecosystem of tools and vendors and so on around it. I'm going to skip through the verbiage here. So, you have to install the OpenAI library and this thing called tiktoken, which we'll get to in a bit. I've already installed them before class because it takes some time, so we don't have to wait for this.
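For reference, the setup cells amount to something like the following. The install line, placeholder key, and exact client syntax are illustrative assumptions; you need your own OpenAI API key, and the precise calls depend on which version of the `openai` Python library you have installed.

```python
# In a Colab cell (illustrative; package versions and syntax may differ):
# %pip install openai tiktoken

import pandas as pd
import tiktoken                      # tokenizer utilities, handy for counting tokens
from openai import OpenAI

# Never hard-code a real key into a shared notebook; this is a placeholder.
client = OpenAI(api_key="YOUR_API_KEY_HERE")

CHAT_MODEL = "gpt-3.5-turbo"                   # the model we use to demonstrate RAG
EMBEDDING_MODEL = "text-embedding-ada-002"     # the OpenAI contextual-embedding model
```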
So, I've imported pandas as before, and you can read through these cells. Basically, I have an OpenAI API key that I have to use, and I'm not showing you the key, obviously; I have to remember to delete it before I upload the Colab. You have to get your own key to make it all work, but the instructions are here. We're going to use GPT-3.5 Turbo to demonstrate RAG, so I give it the name of the model. OpenAI also has a whole bunch of different embedding models that can be used: you feed in a sentence or a chunk of text and it gives you a contextual embedding back. It's a nice little API; you don't have to use your own BERT and so on, you can just use the OpenAI embeddings. Obviously you have to pay OpenAI every time you make a request, but it's really, really cheap at this point. Yep, a question?
>> About dealing with proprietary data: a lot of companies say, "We need to invest in our own LLM because we don't want our data going out." In this kind of context, how good is the cybersecurity, or the compliance and legal side?
>> I think each vendor has their own set of rules and contractual commitments they're willing to sign up for, so you just...
>> If you use your data here, does it go into the public domain, or no?
>> No, but the vendor gets to see it.
>> Okay.
>> Right, meaning the vendor's systems get to see it. But do the vendor's employees get to see it if they need to? Unclear. Those are the legal nitty-gritty details you have to worry about. The other thing you can do is just download an open-source LLM and do it all within your own premises. That's totally possible to do. In fact, I probably won't have time today, but I have a whole section on how you actually do fine-tuning with an open-source LLM, which I'll do as a video if we don't have time.

Okay. So this model, text-embedding-ada-002, is the name of the OpenAI model that actually gives you contextual embeddings, and we're going to use that. The use case here is that we want to create a chatbot that can answer questions about the 2022 Olympics, random questions you might have about the Olympics. So let's first ask it this question about the 2020 Summer Olympics. That's the query, and this is the API request we have to make; you can read through it, and I have linked to the documentation here for how it works. And it says that Barshim of Qatar and Tamberi of Italy both won the gold, and you can actually fact-check this: it's accurate. It's correct.
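The request itself looks roughly like this: a single user message sent to the chat-completions endpoint. The query wording here is a paraphrase of the question just described, and the call is a sketch that assumes the `client` and model name set up earlier; the exact syntax depends on your `openai` library version.

```python
query = "Which athletes won the gold medal in the men's high jump at the 2020 Summer Olympics?"

response = client.chat.completions.create(
    model=CHAT_MODEL,                                     # "gpt-3.5-turbo"
    messages=[{"role": "user", "content": query}],
    temperature=0,                                        # keep the answer as deterministic as possible
)
print(response.choices[0].message.content)
```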
So, which athletes won the gold in curling at the 2022 Olympics? And it says the gold medal in curling was won by the Swedish men's team and the South Korean women's team. Okay. Turns out, if you fact-check this, wait for it: Sweden did win the men's gold, and yes, the South Korean team participated, but Great Britain actually won the women's gold. So it got it wrong, and it sounds like GPT-3.5 Turbo could use some help. The reason GPT-3.5 Turbo didn't know about this is that its training cutoff date was September 2021. As far as it's concerned, the 2022 Olympics haven't happened yet, so it confidently gave you the wrong answer, as it is often prone to do. This, by the way, is called hallucination: it gives you a very eloquent, confident, wrong answer. Or, as some folks have said about another business school that shall remain nameless: often in error, but never in doubt. All right, back to this. One simple thing we can try right off the bat is to tell GPT-3.5 Turbo to say "I don't know" if it doesn't know, rather than just make stuff up. And how do you do that? It's very simple. You say in your prompt: answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know." Now here's the question, this is the query, so let's run it through. "Sorry, I don't know." Not bad, huh? So it worked. It's trying to be humble and honest and self-aware and things like that; it's almost like a Sloanie at this point. All right. Now, as I mentioned earlier, you can check the cutoff dates and see it's 2021. Actually, you know what, let me just open a new tab: these are all the training-data cutoff dates, and for 3.5 Turbo, which is what we are using, the cutoff is 2021. That's why. So now what we can do is provide the relevant data in the prompt itself; we're leading up to RAG here. And by the way, the extra information we provide in the prompt to help it answer a question is called context; that's the lingo for it. So we can do that, and we'll first do it manually. We'll use the Wikipedia article on curling at the 2022 Winter Olympics, and we tell the model explicitly to make use of this context, because telling it things explicitly always seems to help. This is the thing we cut and pasted here, the Wikipedia article on curling, and it's a pretty long article. It's got all kinds of stuff in it, and it's not even all that cleanly formatted; it looks very strange. Look at that. Does that answer your question, Spencer? The document can be in pretty bad shape and it still seems to work. Okay. So: use the below article on the Olympics to answer the subsequent question, and if you don't know, say you don't know. That's what we have.
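A sketch of what that manually assembled query can look like in code. The prompt wording follows the lecture; the variable and file names are illustrative, and client and GPT_MODEL are assumed to be set up as in the earlier sketch.

    # "RAG by hand": paste the whole Wikipedia article into the prompt as context.
    wikipedia_article = open("curling_2022_wikipedia.txt").read()   # hypothetical file with the pasted article

    query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If you don't know the answer, say "Sorry, I don't know."

    Article:
    \"\"\"
    {wikipedia_article}
    \"\"\"

    Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""

    response = client.chat.completions.create(
        model=GPT_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
            {"role": "user", "content": query},
        ],
    )
    print(response.choices[0].message.content)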
That's the query. And by the way, before I send it into the LLM, this is the actual query that's going to be sent; I'm printing it out. Look at how long the query is: "Use the article below," and here is the article, scroll, scroll, scroll, it keeps going, and then finally I say which teams won the gold. Okay, so let's run it. Look at that: women's curling, Great Britain. It got it right. Pretty good, right? I mean, it had to parse all that junk to find the nuggets, so nicely done. But maybe it wasn't super hard, because we literally gave it the answer. So let's make it a bit harder. I noticed that this person, Oskar Eriksson, won two medals in the event. So let's ask whether any athlete won multiple medals; that requires a little bit of abstraction. All right, same query, only the question has changed: did any athlete win multiple medals in curling? Hit it, let's see what happens. "Yes, Oskar Eriksson won multiple medals in curling. He won a gold in the men's event and a bronze in the mixed doubles." Pretty cool, right? Take that, Google. So now we come to retrieval-augmented generation, where instead of doing this manually, which obviously doesn't scale, we will do it automatically. The thing you have to remember, as I mentioned just a few minutes ago, is that there is a context window for every LLM, and for GPT-3.5 Turbo the context window is 16,385 tokens. That is the combined length of the input and the output, and we can't exceed it. By the way, GPT-4's context window is, I think, up to 128,000 tokens, and Google Gemini 1.5 Pro (they really need to work on their names) has a context window of 1 million tokens; in research they have tested 10 million tokens. Crazy times. All that means is that you can upload entire videos and ask it questions about the video. So, to come back to this: what we'll do is grab only the data from the Wikipedia articles about the Olympics that are relevant to our question, by using pre-trained embeddings. Again, this is the thing we talked about earlier, the picture we saw in class. The only thing I want to point out is that if you have an embedding for the question and an embedding for a chunk of text in your database, you have to figure out how similar, how related, they are. And for that we can use what? The dot product, or something closely related that's a bit easier for us to work with: cosine similarity. We have done cosine similarity previously, I've explained it in class, and we're just going to use it. How similar are these vectors? That's what we're going to do. All right.
So, the same picture as we saw in class. First, we need to break up the dataset into sections, then take each section and run it through the embedding model. I have code here that actually does this for you manually, and you can play around with it later, but fortunately OpenAI has already given us the chunked dataset, so we'll just use that because it's easy for us. I downloaded it already, because it takes about five minutes to download, and I've stuck it in a data frame here. So let's print out five randomly chosen chunks. You can see the first chunk here, and look at all this crazy stuff: the formatting is off, but these are all basically paragraphs and sections grabbed straight from Wikipedia with no cleaning. Okay. Now we define a simple function to send any arbitrary piece of text to the embedding model and get the contextual embedding vector back; there's a little function that does that, using the embedding model: we send in a text, it gives us something back. So let's try it on "hodl is amazing"; you should get a vector back. (Oh, come on, don't fail me now.) All right. How long is it? 1536. Now how about I say "hodl is incredible"? Hopefully the two vectors will be quite similar in terms of cosine. To calculate the cosine similarity, I use a function from SciPy; it just calculates it, and I hit it: 0.9934, and the maximum is one, so 0.9934 means they're very, very similar, which is comforting because "amazing" and "incredible" are obviously synonyms. Okay. So now, given a data frame with a column of text chunks in it, we can use this function on every one of them to calculate its embedding, and there's a function here that basically does that for you. I'm not going to run it, because it takes a long time (be prepared to go get a cup of coffee while it runs), but happily OpenAI has already done this step for us, so it's already available in this data frame. If you look at it, you can see there is a text column, and right next to it there is an embedding, and these embeddings are 1536-long vectors. All right, so that's what we have. Now that we have this, whenever we get a question we calculate the question's embedding and then compute its cosine similarity with all the embeddings sitting in this data frame. To do that, we're going to define a couple of helper functions here.
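For reference, here is roughly what the embedding helper and the cosine check from this cell can look like. It is a sketch: the embedding model name is the one mentioned in the lecture, client is the OpenAI client set up earlier, and SciPy's cosine function returns a distance, so similarity is one minus that distance.

    from scipy import spatial

    EMBEDDING_MODEL = "text-embedding-ada-002"

    def get_embedding(text: str) -> list[float]:
        """Send a piece of text to the OpenAI embeddings endpoint and return its vector."""
        response = client.embeddings.create(model=EMBEDDING_MODEL, input=text)
        return response.data[0].embedding

    v1 = get_embedding("hodl is amazing")
    v2 = get_embedding("hodl is incredible")
    print(len(v1))                              # 1536
    print(1 - spatial.distance.cosine(v1, v2))  # cosine similarity, close to 1 for near-synonyms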
You can read through the Python later to understand it; it's basic Python manipulation that's going on. Let's just test this function. We have a little function called strings_ranked_by_relatedness: you give it any input question or text, and it gives you back the top five most related chunks of text from the data frame. So let me just run it. For "curling gold medal," the things it pulls back had better involve curling and medals, and they do: this one, "Curling at the 2022 Winter Olympics," has a cosine similarity of 0.888, then a results summary, a medal summary, another results summary. It's all pretty good; even the fifth one has a cosine similarity of 0.867, which is pretty high. So it's doing the right thing: given the input text "curling gold medal," it has picked up the right chunks. Now let's see what we can do with the original question. Here is a header I'm going to use in the prompt: use the below articles to answer the subsequent question, answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know," as before. Okay, that's our prompt. And now, here's the thing: we don't want to exceed the context window, so we need to count the tokens we're sending in, plus the likely number of tokens we're going to get back, so that we don't exceed the budget. We use the tiktoken package for this; it helps you count tokens, and you can read through it, it's again just some basic Python for counting tokens. And now comes the part where we actually assemble the prompt. We start with the header (be truthful and all that), then we say "here is the question I'm going to ask you," and then we keep grabbing Wikipedia chunks until the number of tokens in the prompt is about to exceed the token budget, and then we stop, because we can't exceed the budget. That's the whole thing. All right, let's run the tiktoken function. Now, it turns out, as you saw, we can go up to 16,000-something tokens in the context window, but I'm just using 3,700 as my budget, partly to show you how to use this thing, and also because it's charging my credit card for every token I use, so I'm being careful. It charges by the token; it's a beautiful business model. Anyway, back here, let's ask the question: which athletes won the gold medal in curling at the Olympics? Here is the data frame to use, here is the GPT model, and don't exceed 3,700 tokens. Okay, that's the query, or rather the prompt. It's going to compose the prompt now.
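The cells just described can be condensed into the sketch below: rank the chunks by relatedness, count tokens with tiktoken, pack chunks into the prompt until the budget runs out, then send it. It assumes df is a pandas DataFrame with "text" and "embedding" columns (with the embeddings already parsed into lists of floats), and it reuses get_embedding, client, and GPT_MODEL from the earlier sketches; the details are illustrative rather than the exact Colab code.

    import tiktoken
    from scipy import spatial

    def strings_ranked_by_relatedness(query: str, df, top_n: int = 5):
        """Return the top_n chunks most related to the query, with their similarities."""
        query_embedding = get_embedding(query)
        scored = [
            (row["text"], 1 - spatial.distance.cosine(query_embedding, row["embedding"]))
            for _, row in df.iterrows()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]

    def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
        """Count tokens the way the model's tokenizer would."""
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))

    def query_message(question: str, df, token_budget: int = 3700) -> str:
        """Pack the header, as many relevant chunks as fit, and the question into one prompt."""
        header = ('Use the below articles on the 2022 Winter Olympics to answer the subsequent '
                  'question. Answer as truthfully as possible, and if you are unsure of the '
                  'answer, say "Sorry, I don\'t know."')
        question_part = f"\n\nQuestion: {question}"
        message = header
        for text, _score in strings_ranked_by_relatedness(question, df):
            candidate = f'\n\nWikipedia article section:\n"""\n{text}\n"""'
            if num_tokens(message + candidate + question_part) > token_budget:
                break  # stop before exceeding the budget
            message += candidate
        return message + question_part

    def ask(question: str, df, token_budget: int = 3700) -> str:
        """Compose the prompt and send it to the chat model."""
        prompt = query_message(question, df, token_budget)
        response = client.chat.completions.create(
            model=GPT_MODEL,
            temperature=0,
            messages=[
                {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content

    # Example:
    # print(ask("Which athletes won the gold medal in curling at the 2022 Winter Olympics?", df))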
And this is the whole prompt. Let's just go to the very top; it's really long. "Use the below articles to answer the subsequent question," and then boom, boom, boom, it has added a whole bunch of paragraphs from the Wikipedia pages, and it finally ends with the question: which athletes won the gold? All right, now let's ask it. This is just a little function to send stuff into the API, and now we're finally ready to ask GPT the question. Fingers crossed. All right: curling. Stefania Constantini in the mixed doubles, and the team consisting of so-and-so in the men's tournament. And, oh, interesting, I think it has actually ignored the Great Britain team completely; last night it didn't. Welcome to stochasticity. When you try it, it might actually give you the full answer. Now let's ask it a question about the 2016 Winter Olympics, which, by the way, didn't happen; there were no Winter Olympics in 2016. So if you ask it: "Sorry, I don't know." All right. Now let's change the header so that we don't tell it to be truthful; we'll remove the instruction to be truthful and see what happens. Which athletes won the gold? Oh, now it's telling you about the 2022 Olympics. So once you remove the instruction to be truthful, it confidently answers anyway, giving an accurate answer about 2022 to an irrelevant question about an event that never happened. So I guess the moral of the story is: first, you can use RAG to grab stuff from massive databases, and it's very heavily used in industry. Second, you have to be careful about these token budgets and so on, and small wording changes in the prompt can dramatically alter behavior, which makes it very difficult to do QA on this stuff in enterprise settings. A lot of care has to go into it. And you have seen examples: Air Canada had a chatbot that gave the wrong advice to a customer, the customer sued Air Canada, the court ruled in favor of the passenger, and they pulled the chatbot off the website. So you've got to be very careful. I think that without a human in the loop checking these answers, it's kind of dangerous, in my opinion, in the current state; hopefully it will get better, and there's a lot of potential, but you have to be careful. All right. So this is what we have, and you can actually take this and use it yourself. You can take a thousand-page PDF you might have, chunk it, and use this approach. I've done it for a whole bunch of different things, and it works really well most of the time; it will make errors here and there, but most of the time it actually works really well. Okay, so, yeah.
>> Sorry, just a question:
When GPT-4 now lets you upload PDFs, is it chunking them, or is it actually ingesting the whole thing?
>> No. When you upload something, because GPT-4 Turbo has 128,000 tokens of context, it can accommodate a whole lot of documents, so when you upload stuff it's not doing any chunking. The chunking you're talking about, you have to do, and the LLM doesn't even know you're doing it. As far as the LLM is concerned, it only sees the prompt it sees, and the prompt says, "Hey, here's a bunch of information, here's a question, answer it for me using this information, and be truthful." That's it. Now, when you ask these things a question about something later than their training data, you will actually see GPT-4 saying it's doing a Bing search and things like that. What's actually going on is that there's a pre-processing step: a program does a Bing search, gathers a bunch of Bing results, takes the top few, chunks them, embeds them, packs them into a prompt, and sends it into GPT-4, and you don't see all of this going on under the hood. So when it's "thinking" and says "Bing search," that's what's happening under the hood. Was there a question somewhere here? No? Oh, sorry, yes.
>> I have a question about formatting. It seems to be able to understand and ignore irrelevant formatting, even loosely formatted tables that aren't really defined as tables, and when it outputs, it formats things very readably. Is that something it's figuring out through the neural network, or something that's explicitly programmed in somewhere?
>> There is no explicit programming going on. It's typically because of the question-answer pairs that were used for supervised fine-tuning, instruction tuning, and reinforcement learning: given the same sort of badly formatted input, the better-formatted answers are rewarded, ranked higher. That's what's going on. But on a related note, one thing that's very useful is that you can ask it to give you the answer back in certain formats, like Markdown or JSON, and by forcing it to adhere to a well-defined format you actually increase the chance of it getting the right answer in the first place. Again, there's a whole tangent we could go into here, but those are some of the things that are part of prompt engineering. All right, so that's what we have here; back to the PowerPoint. So that's retrieval-augmented generation, and we finally come to fine-tuning. Fine-tuning: up to this point, none of the things we have seen alter the internals of the LLM.
You have not messed around with the weights or changed them at all; you're just using the model as a black box. With fine-tuning, you actually train it further, meaning the weights are going to change. So remember, we take something like a causal LM, like GPT (and I haven't fixed this slide yet: there is no ReLU here, as I mentioned earlier, just remember that), and then, if you have domain-specific input-output examples, you can just train it like this: input, and then the shifted output, and that will update these weights, all of these weights. This is basically fine-tuning, exactly like we saw with BERT, and even with ResNet; it's the same sort of thing. That is fine-tuning. Now, before we discuss the mechanics of how to do it, I want to show you a quick example of the usefulness of fine-tuning. So imagine for a second that we want to generate synthetic product reviews from product descriptions. We're building some product that can simulate customer behavior in e-commerce, and for that we need to be able to generate the kinds of reviews that customers might come up with, and writing a lot of reviews by hand is very time-consuming. What you can do is get a whole bunch of product descriptions from the internet. So let's say you ask an LLM, "Hey, write a positive product review using this information," with the product description here, and it comes up with this: "timeless, authentic, iconic." Seriously, do product reviewers actually write stuff like this? No. This reads like marketing copy, because there's a whole bunch of marketing copy on the internet. So it's not good; it doesn't feel like a review, it's not authentic. Here's another example, for Urban Outfitters, and it says "the boxy and cropped silhouette is flattering on all body types." Come on. Okay, so it's not going to work. So what we do is fine-tune the LLM: we take an LLM and fine-tune it with examples of instruction, product description, and product review. For instance, we can take something like this; let me zoom in. It says, "Write a positive review for the following product," then the description is the input, and the output is something like "They're the best, my husband's favorite, they fit well." These feel like product reviews. So you just have to get a few hundred of these product-review examples. Just a few hundred, and you may not even need that many.
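To make the shape of that training data concrete, here is one plausible way such (instruction, input, output) examples could be laid out, as JSONL with one record per line. The field names, example text, and file name are illustrative; different fine-tuning APIs and libraries expect slightly different schemas.

    import json

    examples = [
        {
            "instruction": "Write a positive review for the following product.",
            "input": "Men's relaxed-fit cargo pants, cotton twill, six pockets, machine washable.",
            "output": "These are my husband's favorite pants. They fit well and have held up "
                      "great after months of wear. Would definitely buy again.",
        },
        {
            "instruction": "Write a negative review for the following product.",
            "input": "Slim-fit stretch jeans, mid-rise, available in three washes.",
            "output": "Way too tight in the waist and the color faded after two washes. "
                      "Returning these.",
        },
    ]

    with open("review_finetune_data.jsonl", "w") as f:   # hypothetical file name
        for record in examples:
            f.write(json.dumps(record) + "\n")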
And [01:05:35] once you do that, [01:05:37] once you do that, you basically do uh [01:05:40] used to fine-tuning like I showed [01:05:42] earlier, you know, in instruction, [01:05:45] input, output, and then you take that [01:05:46] output and shift it a bit and make it [01:05:48] the actual label, the actual output. [01:05:50] Fine tune, fine tune, fine tune, fine [01:05:51] tune a bunch of times, gradient descent, [01:05:53] weights gets updated. Now you have a new [01:05:55] LM, an updated LLM. And when you do that [01:05:58] now for the same things, here's what you [01:06:00] get. Write a review. These are the best [01:06:02] jeans I've ever owned. I am whatever [01:06:04] some details. I've been wearing them for [01:06:06] a few weeks. They still look brand new, [01:06:07] right? It looks much better. Doesn't [01:06:09] look like marketing. [01:06:11] This is completely fake. By the way, the [01:06:13] came up with it after the fine tuning. [01:06:15] And then we say, "Write a horrible [01:06:16] review because we want to be balanced. [01:06:18] These are the worst genes I've ever [01:06:20] worn. They're too tight here and there. [01:06:22] I'm going to return them and try a 30, [01:06:23] but I'm not optimistic. [01:06:25] I'm going to stick with Levis's." Few. [01:06:27] Okay. [01:06:29] So, that is So, these read like real [01:06:31] reviews. So just by taking a few hundred [01:06:33] examples and fine-tuning it, it [01:06:34] completely changes the the behavior that [01:06:36] you want for your particular use case. [01:06:38] That's the key thing. So for me, the [01:06:40] biggest sort of benefit here is that [01:06:43] while it took billions of sentences for [01:06:45] pre-training the original LLM and then [01:06:47] it took tens of thousands of examples to [01:06:49] do supervised finetuning and or HF and [01:06:52] so on and so forth, for you for it to [01:06:55] make it work for your narrow business [01:06:56] use case, you only had to spend a couple [01:06:59] hundred examples. That's it. It's [01:07:02] amazing. Imagine that if you had to, you [01:07:04] know, collect like 30,000 examples to [01:07:06] make it. Nobody's going to do these [01:07:07] things. It's too much work. But a couple [01:07:10] of hundred anybody can do. That's why [01:07:12] it's so powerful to finetune these [01:07:14] things. Yeah. [01:07:16] You talked about being able to um you [01:07:19] know, in industries where you you don't [01:07:22] want to put some of this stuff on the [01:07:23] internet, downloading uh the pre-train [01:07:26] model and being able to do this on your [01:07:28] own. would you still need talking about [01:07:30] computer power some of the computers we [01:07:32] have now GPUs I don't know how they are [01:07:35] um are you able to do some of these very [01:07:37] small use cases on those types of [01:07:39] devices [01:07:40] >> perfect question uh Ike I mean you're [01:07:42] going to get to that because the short [01:07:44] answer it's hard yeah just a few hundred [01:07:46] examples but actually trying to [01:07:47] fine-tune these big models on consumer [01:07:50] grade hardware is actually not easy so [01:07:52] you have to make certain tricks and [01:07:53] simplifications which is the next topic [01:07:56] uh yeah [01:07:57] >> is tuning always supervised like you [01:08:00] need those pairs or could you do it if [01:08:02] the company has like less structured [01:08:05] data? [01:08:05] >> No, you can. 
The thing is, it depends on whether you just want to make it generally smarter about the company's business details, in which case you can take a whole bunch of company text and just do next-word prediction on it; it will get generally smarter about those things, but that doesn't mean it will specifically follow your instructions on your particular business problem. If you want it to follow instructions, you need supervision. Okay. So, all right, these are great reviews. Now, for small LLMs like GPT-2, fine-tuning isn't difficult, to come back to your question; you can actually do this with small models. For example, Google recently released Gemma, a small model, something like two billion parameters for the smallest one if I remember right, and those will typically fit into one GPU and you can fine-tune them there. You still need GPUs, just to be clear, but they will fit on a single one. If you want to use a larger model, though, it won't fit, and to make that work you have to do other things, which is what we're going to talk about now. There's a family of models called Llama, Llama 2; these are open-source LLMs and they are widely used for fine-tuning, because you can just download the model and do whatever you want with it. It's open. Well, not strictly open, because there are some footnote considerations you have to worry about, but for most purposes it's open enough, in my opinion. So let's see how hard it is to build the biggest model in this family, the Llama 2 model with 70 billion parameters. First of all, the model is gigantic: 70 billion parameters, and let's say we store each parameter in two bytes. On top of that, we will need a multiplier on each parameter to store various details about how the optimization is done; we won't get into the details here, but the one thing I do want to point out is that the "3 to 4" multiplier on the slide should really be "1 to 6" (I didn't have a chance to change it this morning). The point is that it's going to be huge: even with these numbers, it's something like 420 to 560 gigabytes just to hold the model in memory and manipulate it. And if you use a GPU like an A100 or an H100, which are Nvidia GPUs with typically 80 GB of memory each, you need between six and seven of them just to accommodate this thing. So that's the first problem: the model is big, and just to hold it and work with it you need lots of GPUs.
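To make that back-of-the-envelope estimate concrete, here is the arithmetic as a small Python sketch. It uses the slide's 3x to 4x multiplier for gradients and optimizer state, which, as noted above, can really be anywhere from about 1x to 6x depending on optimizer and precision choices.

    # Rough memory estimate for training Llama 2 70B.
    params = 70e9                 # 70 billion parameters
    bytes_per_param = 2           # 16-bit storage
    base_gb = params * bytes_per_param / 1e9        # ~140 GB just for the weights

    multiplier_low, multiplier_high = 3, 4          # slide's multiplier for gradients + optimizer state
    total_low = base_gb * multiplier_low            # ~420 GB
    total_high = base_gb * multiplier_high          # ~560 GB

    gpu_memory_gb = 80                              # one A100 or H100
    print(f"{total_low:.0f}-{total_high:.0f} GB total, "
          f"about {total_low / gpu_memory_gb:.1f}-{total_high / gpu_memory_gb:.1f} GPUs just to hold it")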
The second problem: Llama 2 was trained on two trillion tokens of text. Two trillion tokens. These GPUs can process about 400 tokens per GPU per second, where by "process" I mean the forward pass through the network. So if you actually use seven GPUs for all of this, it's going to take you about 8,000 days. Say you want to do it in about a month: you need something on the order of 2,000 GPUs, and at a cost of roughly two to five dollars per GPU per hour, that will cost you about 4 million dollars. And we'd expect the actual cost to be a lot higher than this, because it's very optimistic: it assumes you do one pass through the data and you're done, whereas in general you'll make mistakes and have to do it a bunch of times. So this is an overly optimistic estimate, and it's already 4 million dollars. You need lots of GPUs and you need to spend a lot of money. Now, what can we do with fewer resources? First of all, you need to reduce the size of the dataset. The second thing is to reduce the memory required, so that we can ideally do it on many fewer GPUs, hopefully even one GPU, literally on Colab. And we have good news on the data front, because, as I mentioned earlier, while it takes a lot of data to build these models, to fine-tune them for your specific use case you may just need a few hundred examples. So the data for fine-tuning is actually not a problem at all; it's only a problem for building the model in the first place. In fact, there's the famous Alpaca fine-tuning dataset, about 50,000 instruction-response pairs, which is way less than two trillion tokens, and fine-tuning on it can actually be done in about 20 hours. Okay, Tomaso?
>> Could Microsoft's one-bit model drastically reduce the amount of compute?
>> Yeah. There's a whole bunch of approximations and simplifications for making all these things fit into smaller GPUs and so on, and that's one of them. So the short answer is yes, there are many possibilities, and we have to look at them very carefully, because every one of these simplifications will cost you something in terms of accuracy and the ability of the model to do what it needs to do; there's always a trade-off to worry about. For folks who are interested, there's a whole field called LLM quantization. Google it, and that's an entry point into the area. Okay. So now, how do we reduce the memory required so that we can do this with fewer GPUs, ideally just one GPU on Colab?
So if you look at what actually consumes memory: you have the model parameters, 70 billion parameters times two bytes each, which is 140 GB; the gradient computations are another 140 GB to hold the gradients; and then the optimizer state is another 2x (and, as I mentioned earlier, that multiplier could really be anywhere from about 1x to 6x rather than 3x to 4x, but we'll go with these numbers for the moment). So the total is 560 gigabytes if you just naively want to train it. Now, it turns out you can't do anything about the parameters; that's just the 140 GB. But by using a trick called gradient checkpointing, the gradient memory can be squashed close to zero: basically you say, "I don't mind it running longer, but I don't want to use as much memory." That trick is called gradient checkpointing, and we won't go into the technical details, but that part can go to essentially zero. And then this piece, the optimizer state: it turns out even this can be squashed very close to zero, and that was a breakthrough from maybe a year ago. To do that, what we're going to say is: look, there are a whole bunch of weights here, but we're only going to take the matrices inside each attention layer, look only at those, and freeze everything else. So we take only a small set of parameters, unfreeze them, update them, and see if that actually gets the job done, instead of unfreezing everything and updating it all. And if you look at one of these weight matrices, say the key weight matrix in Llama 2, it's roughly an 8,000 by 8,000 matrix, which means there are about 64 million parameters inside each of these matrices. 64 million. So imagine this matrix, and as a thought experiment suppose you do the fine-tuning and the numbers change as a result. Then the resulting matrix is just the original matrix plus the changes: the original plus a change matrix, which we'll call delta. And of course, in general, this change matrix is also 8,000 by 8,000, another 64 million numbers. So the question is: can we make this change matrix smaller? Making it smaller seems reasonable, because a fine-tune on a couple of hundred examples should only make small changes to a few weights; it's not going to change everything by a lot. So the key insight here is that maybe we can force this change matrix to be simple and still get the job done. And it turns out you can.
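The continuation below spells out what "simple" means here: the change matrix is written as the product of two thin matrices. As a preview, here is a tiny numpy sketch of that low-rank idea and the parameter savings it buys. The 8,192 dimension matches Llama 2 70B's hidden size; the rank r is a free choice (small values like 1, 2, 8, or 16 are common), and the initialization shown (one factor at zero) is just the usual convention so that the change starts out as zero.

    import numpy as np

    d, r = 8192, 2                                           # hidden size; small adapter rank
    B = np.zeros((d, r), dtype=np.float32)                   # trained; starts at zero
    A = (np.random.randn(r, d) * 0.01).astype(np.float32)    # trained
    # The original d x d weight matrix stays frozen; the effective weight after
    # fine-tuning is W + delta_W, where delta_W = B @ A has rank at most r.
    delta_W = B @ A                                          # materialized here only to show the shape

    full_params = d * d                                      # ~67 million entries in a full change matrix
    lora_params = B.size + A.size                            # 32,768 trained parameters for r = 2
    print(delta_W.shape, full_params, lora_params, f"{100 * lora_params / full_params:.3f}%")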
And what you do is think of this change matrix as really coming from two thin, skinny matrices which, if you multiply them together, give you back the change matrix. I'm not going to get into the mathematical details; this is called a low-rank approximation. But the point is that you can take two very small matrices, and if you multiply them the right way, you can approximate the original matrix. And these two matrices are much smaller, because each one is only about 8,000 by 2, on the order of 16,000 numbers, so you end up training a few tens of thousands of parameters instead of 64 million, a few hundredths of a percent of the original. This is called low-rank adaptation, or LoRA, and it's incredibly widely used in industry. So what we do is freeze all the original parameters, initialize the change matrices to zero, and then update just those two skinny matrices using gradient descent. And when you do that, everything fits into memory, which means the whole thing fits and you can get the job done with just a couple of GPUs. And if you use Llama's smaller models, like the 7-billion or 13-billion parameter versions, they can be fine-tuned comfortably on a single GPU, a single Colab GPU. All right, it's 9:54 and time does not permit, so: I have a Colab on how to do the fine-tuning using this technique, and I will do a video walkthrough tomorrow or the day after. And I'm done. Thanks, folks. Have a good rest of your week. [applause] Thank you.