[00:05] Hi, everyone. [00:06] Welcome to CS 25 Transformers United V2. [00:09] This was a course that was held at Stanford [00:11] in the winter of 2023. [00:13] This course is not about robots that [00:14] can transform into cars, as this picture might suggest. [00:17] Rather, it's about deep learning models [00:18] that have taken the world by storm [00:21] and have revolutionized the field of AI and others. [00:23] Starting from natural language processing, [00:25] transformers have been applied all over: [00:27] computer vision, reinforcement learning, biology, robotics, [00:30] et cetera. [00:31] We have an exciting set of videos lined up for you [00:34] with some truly fascinating speakers and talks, presenting [00:37] how they're applying transformers [00:39] to research in different fields and areas. [00:44] We hope you'll enjoy and learn from these videos. [00:47] So without any further ado, let's get started. [00:52] This is a purely introductory lecture. [00:54] And we'll go into the building blocks of transformers. [00:58] So first, let's start with introducing the instructors. [01:03] So for me, I'm currently on a temporary deferral from the PhD [01:06] program, and I'm leading AI at a robotics startup, Collaborative [01:09] Robotics, which is working on some general-purpose robots, [01:13] somewhat like [INAUDIBLE]. [01:14] And I'm very passionate about robotics and building [01:18] learning algorithms. [01:19] My research interests are in reinforcement learning, [01:21] computer vision, and modeling, and I [01:23] have a bunch of publications in robotics, [01:25] autonomous driving, and other areas. [01:28] My undergrad was at Cornell. [01:29] If someone is from Cornell, it's nice to [INAUDIBLE]. [01:33] So I'm Stephen, currently a first-year CS PhD here. [01:37] Previously, I did my master's at CMU and undergrad at Waterloo. [01:40] I'm mainly into NLP research, anything involving language [01:43] and text, but more recently, I've [01:45] been getting more into computer vision as well as [INAUDIBLE]. [01:48] And just some stuff I do for fun: a lot of music [01:51] stuff, mainly piano. [01:52] Some self-promo: I post a lot on my Insta, YouTube, [01:55] and TikTok, so if you guys want to check it out. [01:58] My friends and I are also starting a Stanford piano club, [02:01] so if anybody's interested, feel free to email [02:04] or DM me for details. [02:07] Other than that, martial arts, bodybuilding, and I'm a huge fan [02:11] of K-dramas and anime, and an occasional gamer. [02:14] [LAUGHS] [02:18] OK, cool. [02:19] Yeah, so my name is Rylan. [02:20] Instead of talking about myself, I just [02:21] want to very briefly say that I'm super [02:23] excited to take this class. [02:24] I took it the last time-- sorry-- to teach this. [02:26] Excuse me. [02:26] I took it the last time it was offered. [02:28] I had a bunch of fun. [02:30] I thought we brought in a really great group of speakers [02:32] last time. [02:33] I'm super excited for this offering. [02:35] And yeah, I'm thankful that you're all here, [02:37] and I'm looking forward to a really fun quarter together. [02:39] Thank you. [02:39] Yeah, so fun fact, Rylan was the most outspoken student [02:42] last year. [02:43] And so if someone wants to become an instructor next year, [02:45] you know what to do. [02:46] [LAUGHTER] [02:50] OK, cool. [02:53] Let's see. [02:54] OK, I think we have a few minutes.
[02:56] So what we hope you will learn in this class is, first of all, [02:59] how transformers work, how they [03:02] are being applied beyond just NLP, [03:04] and nowadays, they are pretty much [INAUDIBLE] [03:06] everywhere in AI and machine learning, [03:10] and what are some new and interesting directions [03:12] of research in these topics. [03:17] Cool, so this class is just introductory. [03:19] So we're just talking about the basics of transformers, [03:22] introducing them, talking about the self-attention mechanism [03:24] on which they're founded. [03:26] And we'll do more of a deep dive on models like BERT [03:30] and GPT, stuff like that. [03:32] So with that, happy to get started. [03:35] OK, so let me start with presenting the attention [03:38] timeline. [03:40] Attention all started with this one paper, [03:43] Attention Is All You Need, by Vaswani et al. in 2017. [03:46] That was the beginning of transformers. [03:49] Before that, we had the prehistoric era, [03:51] where we had models like RNNs, LSTMs, [03:55] and simple attention mechanisms that didn't work [03:57] or [INAUDIBLE]. [03:59] Starting 2017, we saw this explosion of transformers [04:02] into NLP, where people started using it for everything. [04:07] I even heard this quote from Google. [04:08] It's like, our performance increased every time [04:10] we [INAUDIBLE] [04:11] [CHUCKLES] [04:15] For the [INAUDIBLE] after 2018 to 2020, [04:17] we saw this explosion of transformers [04:18] into other fields like vision, a bunch of other stuff, [04:23] and biology as a whole. [04:25] And last year, 2021, was the start [04:28] of the generative era, where a lot of generative modeling [04:31] started, with models like Codex, GPT, DALL-E, [04:35] Stable Diffusion, and a lot of things [04:37] happening in generative modeling. [04:40] And we started scaling up in AI. [04:44] And now, the present. [04:45] So this is 2022 and the start of '23. [04:49] And now we have models like ChatGPT, Whisper, [04:53] a bunch of others. [04:54] And we're scaling onwards without slowing down, [04:57] so that's great. [04:58] And then there's the future. [05:01] So going more into this, so once there were RNNs. [05:06] So we had Seq2Seq models, LSTMs, GRUs. [05:10] What worked there was that they were good at encoding history, [05:13] but what did not work was that they couldn't encode long sequences, [05:17] and they were very bad at encoding context. [05:21] So consider this example. [05:24] Consider trying to predict the last word in the text, [05:27] "I grew up in France, dot, dot, dot. [05:29] I speak fluent..." [05:31] Here, you need to understand the context [05:33] to predict "French," and the attention mechanism [05:36] is very good at that, whereas if you're just using LSTMs, [05:39] it doesn't work that well here. [05:42] Another thing transformers are good at, [05:46] which is also content-based context prediction, [05:50] is finding attention maps. [05:52] If I have something like the word "it," [05:56] which noun does it refer to? [05:57] And the attention mechanism can put most of its weight [06:01] on the right candidate noun. [06:05] And this works better than existing mechanisms. [06:10] OK, so where we were in 2021: we were on the verge of takeoff. [06:16] We were starting to realize the potential of transformers [06:18] in different fields. [06:20] We solved a lot of long sequence problems [06:23] like protein folding, AlphaFold, offline RL. [06:28] We started to see few-shot, zero-shot generalization.
[06:31] We saw multimodal tasks and applications [06:34] like generating images from language. [06:36] So that was DALL-E. And it feels like [INAUDIBLE]. [06:43] And this was also a talk on transformers [06:45] that you can watch on YouTube. [06:48] Yeah, cool. [06:51] And this is where we were going from 2021 to 2022, [06:55] which is we have gone from the version of [INAUDIBLE] [06:58] And now, we are seeing unique applications [07:00] in audio generation, art, music, storytelling. [07:03] We are starting to see these new capabilities [07:05] like commonsense reasoning, logical reasoning, [07:08] mathematical reasoning. [07:09] We are also now able to get human alignment [07:12] and interaction. [07:13] These models are able to use reinforcement learning [07:15] from human feedback. [07:16] That's how ChatGPT is trained to perform really well. [07:19] We have a lot of mechanisms for controlling [07:21] toxicity, bias, and ethics now. [07:24] And there are also a lot [07:26] of developments in other areas like diffusion models. [07:30] Cool. [07:33] So the future is a spaceship, and we are all [07:35] excited about it. [07:39] And there are a lot more applications [07:40] that we can enable, and it'll be great [07:44] if we can see transformers also up there. [07:47] One big example is video understanding and generation. [07:49] That is something that everyone is interested in, [07:51] and I'm hoping we'll see a lot of models [07:53] in this area this year, also finance, business. [07:59] I'll be very excited to see GPT author a novel, [08:02] but we need to solve very long sequence modeling. [08:04] And most transformer models are still [08:07] limited to 4,000 tokens or something like that. [08:09] So we need to make them generalize much [08:13] better on long sequences. [08:17] We also want to have generalized agents [08:19] that can do a lot of multitask, multi-input predictions, [08:27] like Gato. [08:28] And so I think we will see more of that, too. [08:31] And finally, we also want domain-specific models. [08:37] So you might want a GPT model for, let's [08:39] say, your health. [08:41] So that could be like a DoctorGPT model. [08:43] You might have a LawyerGPT model that's [08:45] trained on only law data. [08:46] So currently, we have GPT models that are trained on everything. [08:49] But we might start to see more niche models that [08:51] are good at one task. [08:53] And we could have a mixture of experts, [08:55] so it's like, [08:57] how you'd normally consult a human expert, [08:58] you'll have expert AI models. [09:00] And you can go to a different AI model for your different needs. [09:05] There are still a lot of missing ingredients [09:07] to make this all successful. [09:10] The first of all is external memory. [09:12] We are already starting to see this with models [09:15] like ChatGPT, where the interactions are short-lived. [09:18] There's no long-term memory, and they [09:20] don't have the ability to remember or store [09:23] conversations for the long term. [09:25] And this is something we want to fix. [09:29] Second is reducing the computational complexity. [09:32] So the attention mechanism is quadratic in the sequence [09:36] length, which is slow. [09:37] And we want to reduce it and make it faster. [09:42] Another thing we want to do is we [09:44] want to enhance the controllability of these models, [09:46] since a lot of these models can be stochastic.
[09:48] And we want to be able to control what sort of outputs [09:51] we get from them. [09:52] And you might have experienced with ChatGPT [09:54] that if you just refresh, you get a different output each time. [09:56] But you might want to have a mechanism that controls [09:59] what sort of things you get. [10:01] And finally, we want to align our state-of-the-art language [10:04] models with how the human brain works. [10:06] And we are seeing this emerge, but we still [10:09] need more research on how to make them more informed by the brain. [10:12] Thank you. [10:14] Great, hi. [10:16] Yes, I'm excited to be here. [10:18] I live very nearby, so I got the invite to come to class. [10:21] And I was like, OK, I'll just walk over. [10:23] But then I spent like 10 hours on the slides, [10:25] so it wasn't as simple. [10:28] So yeah, I'm going to talk about transformers. [10:30] I'm going to skip the first two over there. [10:32] I'm not going to talk about those. [10:34] We'll talk about that one just to simplify the lecture [10:36] since we don't have time. [10:39] OK, so I wanted to provide a little bit of context [10:41] on why this transformers class even exists. [10:44] So a little bit of historical context. [10:45] I feel like Bilbo over there, [10:47] like I'm telling you guys about this history. [10:50] I don't know if you guys saw Lord of the Rings. [10:52] And basically, I joined AI in roughly 2012, [10:56] so maybe a decade ago. [10:58] And back then, you wouldn't even say [10:59] that you joined AI, by the way. [11:00] That was like a dirty word. [11:02] Now, it's OK to talk about, but back then, it [11:04] was not even deep learning. [11:05] It was machine learning. [11:06] That was the term we would use if you were serious. [11:08] But now, now, AI is OK to use, I think. [11:11] So basically, do you even realize [11:13] how lucky you are, potentially entering [11:15] this area in roughly 2023? [11:17] So back then, in 2011 or so, when I was working specifically [11:20] on computer vision, your pipelines looked like this. [11:25] So if you wanted to classify some images, [11:28] you would go to a paper, and I think this is representative. [11:30] You would have three pages in the paper describing [11:32] all kinds of a zoo, a kitchen sink, [11:34] of different kinds of features and descriptors. [11:36] And you would go to a poster session [11:38] at a computer vision conference, [11:40] and everyone would have their favorite feature descriptor [11:41] that they're proposing. [11:42] And it's totally ridiculous, and you [11:44] would take notes on which ones you should incorporate [11:45] into your pipeline because you would extract all of them, [11:48] and then you would put an SVM on top. [11:49] So that's what you would do. [11:51] So there's two pages. [11:52] Make sure you get your sparse SIFT histograms, [11:54] your SSIMs, your color histograms, textons, [11:56] tiny images. [11:57] And don't forget the geometry-specific histograms. [11:59] All of them basically have complicated code by themselves. [12:02] So you're collecting code from everywhere and running it, [12:04] and it was a total nightmare. [12:06] So on top of that, it also didn't work. [12:10] [LAUGHTER] [12:11] So this, I think, represents the predictions [12:14] from that time. [12:15] You would just get predictions like this once in a while, [12:17] and you'd just shrug your shoulders, [12:19] like that just happens once in a while. [12:20] Today, you would be looking for a bug.
[12:23] And worse than that, every single chunk of AI [12:30] had its own completely separate vocabulary [12:32] that they worked with. [12:33] So if you went to NLP papers, those papers [12:36] would be completely different. [12:38] So you're reading the NLP paper, and you're like, [12:40] what is this part-of-speech tagging, [12:42] morphological analysis, syntactic parsing, [12:44] coreference resolution? [12:46] What is MPBTKJ? [12:48] And you're confused. [12:49] So the vocabulary and everything was completely different. [12:51] And you couldn't read papers, I would [12:52] say, across different areas. [12:55] So now, that changed a little bit [12:56] starting 2012 when Alex Krizhevsky and colleagues basically [13:02] demonstrated that if you scale a large neural network [13:05] on a large dataset, you can get very strong performance. [13:08] And so up till then, there was a lot of focus on algorithms. [13:10] But this showed that actually neural nets scale very well. [13:13] So you just need to worry about compute and data, [13:15] and you can scale it up. [13:16] It works pretty well. [13:17] And then that recipe actually did copy-paste [13:19] across many areas of AI. [13:21] So we started to see neural networks pop up everywhere [13:23] since 2012. [13:25] So we saw them in computer vision, and NLP, and speech, [13:28] and translation, in RL, and so on. [13:30] So everyone started to use the same kind of modeling [13:32] toolkit, modeling framework. [13:33] And now when you go to NLP, and you start reading papers there, [13:36] in machine translation, for example, [13:38] this is a sequence-to-sequence paper [13:40] which we'll come back to in a bit. [13:41] You start to read those papers, and you're like, OK, [13:44] I can recognize these words. [13:45] Like, there's a neural network. [13:46] There's some parameters. [13:47] There's an optimizer, and it starts to read like things [13:50] that you know of. [13:50] So that decreased tremendously the barrier to entry [13:54] across the different areas. [13:56] And then, I think, the big deal is [13:57] that when the transformer came out in 2017, [14:00] it's not even just that the toolkits and the neural networks [14:02] were similar-- it's that literally the architectures [14:05] converged to like one architecture that you [14:07] copy-paste across everything, seemingly. [14:10] So this was kind of an unassuming machine translation [14:12] paper at the time, proposing the transformer architecture. [14:15] But what we found since then is that you can just basically [14:17] copy-paste this architecture and use it everywhere. [14:21] And what's changing is the details of the data, [14:23] and the chunking of the data, and how you feed it in. [14:26] And that's a caricature, but it's [14:28] kind of like a correct first-order statement. [14:29] And so now, papers are even more similar looking [14:32] because everyone's just using the transformer. [14:34] And so this convergence was remarkable to watch [14:38] and unfolded over the last decade. [14:40] And it's pretty crazy to me. [14:42] What I find interesting is I think [14:44] this is some kind of a hint that we're maybe converging [14:46] to something that maybe the brain is doing [14:48] because the brain is very homogeneous and uniform [14:50] across the entire sheet of your cortex. [14:52] And OK, maybe some of the details are changing, [14:54] but those feel like hyperparameters [14:56] of a transformer.
[14:57] But your auditory cortex and your visual cortex [14:59] and everything else look very similar. [15:01] And so maybe we're converging to some kind [15:02] of a uniform, powerful learning algorithm here. [15:06] Something like that, I think, is interesting and exciting. [15:09] OK, so I want to talk about where the transformer came [15:11] from briefly, historically. [15:12] So I want to start in 2003. [15:15] I like this paper quite a bit. [15:17] It was the first popular application of neural networks [15:21] to the problem of language modeling, [15:22] so predicting, in this case, the next word [15:24] in the sequence, which allows you to build [15:26] generative models over text. [15:27] And in this case, they were using a multi-layer perceptron, [15:29] so a very simple neural net. [15:30] The neural net took three words and predicted the probability [15:33] distribution for the fourth word in the sequence. [15:36] So this was all well and good at this point. [15:39] Now, over time, people started to apply this [15:41] to machine translation. [15:43] So that brings us to the sequence-to-sequence paper [15:45] from 2014 that was pretty influential, [15:48] and the big problem here was, OK, we [15:49] don't just want to take three words and predict the fourth. [15:52] We want to predict how to go from an English sentence [15:55] to a French sentence. [15:56] And the key problem was, OK, you can [15:58] have an arbitrary number of words in English and an arbitrary number [16:00] of words in French, so how do you [16:03] get an architecture that can process [16:04] this variably sized input? [16:06] And so here they used an LSTM, and there are basically [16:10] two chunks of this, which are partly covered up here. [16:16] But basically, you have an encoder LSTM on the left, [16:19] and it just consumes one word at a time [16:22] and builds up a context of what it has read. [16:24] And then that acts as a conditioning vector [16:26] to the decoder RNN or LSTM [16:29] that basically goes chunk, chunk, [16:30] chunk, producing the next word in the sequence, [16:32] translating the English to French or something like that. [16:35] Now, the big problem with this, that people identified, [16:37] I think, very quickly and tried to resolve, [16:40] is what's called this encoder bottleneck. [16:43] So this entire English sentence that we are trying to condition [16:46] on is packed into a single vector [16:48] that goes from the encoder to the decoder. [16:50] And so this is just too much information [16:52] to potentially maintain in a single vector, [16:54] and that didn't seem correct. [16:55] And so people were looking around for ways [16:57] to alleviate the encoder bottleneck, as it [17:00] was called at the time. [17:02] And so that brings us to this paper, [17:03] Neural Machine Translation by Jointly Learning [17:05] to Align and Translate. [17:07] And here, just quoting from the abstract: "In this paper, [17:11] we conjecture that the use of a fixed-length vector [17:13] is a bottleneck in improving the performance [17:15] of this basic encoder-decoder architecture, [17:17] and propose to extend this by allowing [17:19] a model to automatically (soft-)search [17:21] for parts of a source sentence that are relevant to predicting [17:24] a target word, without having to form [17:28] these parts as a hard segment explicitly." [17:30] So this was a way to look back at the words that [17:34] are coming from the encoder. [17:35] And it was achieved using this soft search.
[17:38] So as you are decoding the words [17:42] here, while you are decoding them, [17:44] you are allowed to look back at the words [17:45] at the encoder via this soft attention mechanism proposed [17:49] in this paper. [17:50] And so this paper, I think, is the first time that I saw, [17:52] basically, attention. [17:55] So your context vector that comes from the encoder [17:58] is a weighted sum of the hidden states [18:01] of the words in the encoder. [18:05] And then the weights of this sum come [18:07] from a softmax that is based on these compatibilities [18:10] between the current state as you're decoding [18:13] and the hidden states generated by the encoder. [18:15] And so this is really the first time that you [18:17] start to see it, and these are the current, modern equations [18:22] of attention. [18:23] And I think this was the first paper that I saw it in. [18:25] It's the first time that the word [18:27] attention is used, as far as I know, to call this mechanism. [18:32] So I actually tried to dig into the details of the history [18:34] of attention. [18:35] So the first author here, Dzmitry, I [18:38] had an email correspondence with him, [18:40] and I basically sent him an email. [18:41] I'm like, Dzmitry, this is really interesting. [18:43] Rumors have kind of taken over. [18:44] Where did you come up with the soft attention [18:45] mechanism that ends up being the heart of the transformer? [18:48] And to my surprise, he wrote me back this massive email, which [18:52] was really fascinating. [18:52] So this is an excerpt from that email. [18:57] So basically, he talks about how he was looking for a way [18:59] to avoid this bottleneck between the encoder and decoder. [19:02] He had some ideas about cursors that [19:04] traverse the sequences that didn't quite work out. [19:06] And then here: "So one day, I had this thought [19:08] that it would be nice to enable the decoder [19:10] RNN to learn to search where to put the cursor in the source [19:13] sequence. [19:14] This was sort of inspired by translation exercises [19:16] that learning English in my middle school involved. [19:21] Your gaze shifts back and forth between the source and target [19:23] sequence as you translate." [19:24] So literally, I thought that this was kind of interesting, [19:27] that he's not a native English speaker, [19:28] and here, that gave him an edge in machine translation, [19:31] which led to attention and then led to the transformer. [19:34] So that's really fascinating. [19:37] "I expressed the soft search as a softmax [19:38] and then weighted averaging of the [INAUDIBLE] states. [19:40] And basically, to my great excitement, [19:43] this worked from the very first try." [19:45] So really, I think, an interesting piece of history. [19:48] And as it later turned out, the name RNNsearch [19:51] was kind of lame, so the better name, attention, came [19:54] from Yoshua on one of the final passes [19:57] as they went over the paper. [19:58] So maybe Attention Is All You Need [20:00] would have been called RNN Search Is All You Need, [20:03] but we have Yoshua Bengio to thank [20:05] for a bit of a better name, I would say. [20:07] So apparently, that's the history [20:08] of this, which I thought was interesting. [20:11] OK, so that brings us to 2017, which is Attention [20:13] Is All You Need.
[20:14] So this attention component, which [20:16] in Dzmitry's paper was just one small segment-- [20:19] there's all this bidirectional RNN, RNN [20:21] decoder-- and the Attention Is All You Need paper is saying, [20:25] OK, you can actually delete everything. [20:26] What's making this work very well [20:28] is just attention by itself. [20:29] And so delete everything, keep attention. [20:32] And then what's remarkable about this paper, actually, is usually [20:35] you see papers that are very incremental. [20:36] They add one thing, and they show that it's better. [20:39] But I feel like Attention Is All You [20:41] Need was like a mix of multiple things at the same time. [20:44] They were combined in a very unique way, [20:46] and then also achieved a very good local minimum [20:49] in the architecture space. [20:50] And so to me, this is really a landmark paper [20:52] that is quite remarkable and, I think, [20:55] had quite a lot of work behind the scenes. [20:58] So delete all the RNNs, just keep attention. [21:01] Because attention operates over sets-- [21:03] and I'm going to get to this in a second-- [21:05] you now need to positionally encode your inputs [21:07] because attention doesn't have the notion of space by itself. [21:14] I have to be very careful. [21:17] They adopted this residual network structure [21:19] from ResNets. [21:21] They interspersed attention with multi-layer perceptrons. [21:24] They used layer norms, which came from a different paper. [21:27] They introduced the concept of multiple heads of attention [21:29] that were applied in parallel. [21:30] And they gave us, I think, a fairly good set [21:33] of hyperparameters that to this day are used. [21:35] So the expansion factor in the multi-layer perceptron goes up [21:39] by 4x-- [21:40] and we'll go into a bit more detail-- [21:41] and this 4x has stuck around. [21:43] And I believe there's a number of papers [21:44] that try to play with all kinds of little details [21:47] of the transformer, and nothing sticks because this is actually [21:50] quite good. [21:51] The only thing, to my knowledge, that didn't stick [21:54] was this reshuffling of the layer norms [21:56] to go into the pre-norm version: here you [21:59] see the layer norms are after the multi-headed attention and feed [22:01] forward. [22:02] They just put them before instead. [22:04] So just a reshuffling of layer norms, but otherwise, [22:06] the GPTs and everything else that you're seeing today [22:08] is basically the 2017 architecture from 5 years ago. [22:11] And even though everyone is working on it, [22:13] it's been proven remarkably resilient, [22:15] which I think is really interesting. [22:17] There are innovations that, I think, [22:18] have been adopted, also in positional encoding. [22:21] It's now more common to use rotary and relative [22:24] positional encodings and so on. [22:25] So I think there have been changes, but for the most part, [22:28] it's proven very resilient. [22:31] So really quite an interesting paper. [22:32] Now, I wanted to go into the attention mechanism. [22:36] And I think the way I interpret it is not similar to the ways [22:43] that I've seen it presented before. [22:44] So let me try a different way of how I see it.
[22:47] Basically, to me, attention is kind of like the communication [22:49] phase of the transformer, and the transformer [22:52] interleaves two phases: the communication phase, which [22:55] is the multi-headed attention, and the computation [22:57] phase, which is this multi-layer perceptron [23:00] or [INAUDIBLE]. [23:01] So in the communication phase, it's [23:03] really just data-dependent message [23:05] passing on directed graphs. [23:07] And you can think of it as, OK, forget everything [23:09] about machine translation, everything. [23:10] Let's just-- we have directed graphs. [23:13] At each node, you are storing a vector. [23:16] And then let me talk now about the communication [23:18] phase of how these vectors talk to each other [23:20] in this directed graph. [23:21] And then the compute phase later is just [23:23] a multi-layer perceptron, which then basically acts on every node [23:27] individually. [23:28] But how do these nodes talk to each other [23:30] in this directed graph? [23:32] So I wrote some simple Python-- [23:36] I wrote this in Python basically to create [23:39] one round of communication using attention [23:44] as the message passing scheme. [23:46] So here, a node has this private data vector, [23:51] which you can think of as private information [23:53] to this node. [23:54] And then it can also emit a key, a query, and a value. [23:57] And simply, that's done by a linear transformation [24:00] from this node. [24:01] So the key is what are the things that I am-- [24:07] sorry. [24:07] The query is, what are the things that I'm looking for? [24:10] The key is, what are the things that I have? [24:12] And the value is, what are the things that I will communicate? [24:15] And so then when you have your graph that's [24:16] made up of nodes and some random edges, when you actually [24:19] have these nodes communicating, what's happening is [24:21] you loop over all the nodes individually [24:23] in some random order, and you're at some node, [24:27] and you get the query vector q, which [24:29] is, I'm a node in some graph, and this [24:32] is what I'm looking for. [24:33] And so that's just achieved via this linear transformation [24:36] here. [24:36] And then we look at all the inputs that point to this node, [24:39] and then they broadcast, what are the things that I have, [24:42] which is their keys. [24:44] So they broadcast the keys. [24:45] I have the query; then those interact by dot product [24:49] to get scores. [24:51] So basically, simply by doing a dot product, [24:53] you get some unnormalized weighting [24:55] of the interestingness of all of the information in the nodes [24:59] that point to me, relative to the things I'm looking for. [25:02] And then when you normalize that with a softmax, [25:03] so it just sums to 1, you basically just [25:06] end up using those scores, which now sum to 1 and form a probability [25:09] distribution, and you do a weighted sum of the values [25:13] to get your update. [25:15] So I have a query. [25:17] They have keys; dot products get interestingness, or like [25:21] affinity; softmax normalizes it; and then [25:24] a weighted sum of those values flows to me and updates me. [25:27] And this is happening for each node individually. [25:29] And then we update at the end. [25:30] And so this kind of a message passing scheme [25:32] is at the heart of the transformer.
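To make that concrete, here is a minimal sketch, in plain Python with NumPy, of one round of attention as message passing on a directed graph, in the spirit of the toy code being described. The names (Node, communicate, inputs_to) and the vector sizes are illustrative assumptions for this sketch, not the actual code shown in the lecture.

    import numpy as np

    def softmax(x):
        x = x - x.max()          # numerical stability
        e = np.exp(x)
        return e / e.sum()

    class Node:
        """One node in the directed graph, holding a private data vector."""
        def __init__(self, n_embd=16):
            self.data = np.random.randn(n_embd)
            # linear maps that emit this node's query, key, and value
            self.wq = np.random.randn(n_embd, n_embd)
            self.wk = np.random.randn(n_embd, n_embd)
            self.wv = np.random.randn(n_embd, n_embd)

        def query(self): return self.wq @ self.data   # what am I looking for?
        def key(self):   return self.wk @ self.data   # what do I have?
        def value(self): return self.wv @ self.data   # what will I communicate?

    def communicate(nodes, inputs_to):
        """One round of attention as message passing.
        inputs_to[i] lists the indices of the nodes that point to node i."""
        updates = []
        for i, node in enumerate(nodes):
            q = node.query()
            keys = np.stack([nodes[j].key() for j in inputs_to[i]])
            vals = np.stack([nodes[j].value() for j in inputs_to[i]])
            scores = keys @ q               # dot products: unnormalized interestingness
            weights = softmax(scores)       # normalize so the scores sum to 1
            updates.append(weights @ vals)  # weighted sum of the values
        for node, u in zip(nodes, updates):
            node.data = u                   # update every node at the end

    # decoder-style connectivity: node i is pointed to by nodes 0..i, including itself
    nodes = [Node() for _ in range(8)]
    inputs_to = [list(range(i + 1)) for i in range(8)]
    communicate(nodes, inputs_to)

The last two lines set up the triangular, decoder-style connectivity that comes up again later: information only flows from earlier nodes to later ones.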
[25:35] And it happens in a more vectorized, batched way [25:40] that is more confusing and is also interspersed with layer [25:44] norms and things like that to make the training behave [25:46] better. [25:47] But that's roughly what's happening in the attention [25:49] mechanism, I think, on a high level. [25:53] So yeah, so in the communication phase of the transformer, [25:59] this message passing scheme happens [26:00] in every head in parallel and then in every layer in series, [26:06] and with different weights each time. [26:08] And that's it as far as the multi-headed attention goes. [26:13] And so if you look at these encoder-decoder models, [26:15] you can think of it then in terms of the connectivity [26:18] of these nodes in the graph. [26:19] You can think of it as like, OK, all these tokens that [26:21] are in the encoder that we want to condition on, [26:23] they are fully connected to each other. [26:25] So when they communicate, they communicate fully [26:28] when you calculate their features. [26:30] But in the decoder, because we are [26:32] trying to have a language model, we [26:33] don't want to have communication from future tokens [26:35] because they give away the answer at this step. [26:38] So the tokens in the decoder are fully connected [26:40] to all the encoder states, and then they [26:43] are also fully connected to everything that came before them [26:46] in the decoding. [26:46] And so you end up with this triangular structure [26:49] in the data graph. [26:50] But that's the message passing scheme [26:52] that this basically implements. [26:54] And then you have to be also a little bit careful because [26:57] in the cross-attention here with the decoder, [26:59] you consume the features from the top of the encoder. [27:01] So think of it as, in the encoder, [27:03] all the nodes are looking at each other, [27:05] all the tokens are looking at each other many, many times. [27:08] And they really figure out what's in there, [27:09] and then the decoder is looking only at the top nodes. [27:14] So that's roughly the message passing scheme. [27:16] I was going to go into more of an implementation [27:18] of a transformer. [27:19] I don't know if there's any questions about this. [27:23] [INAUDIBLE] self-attention and multi-headed attention, [27:26] but what is the advantage of [INAUDIBLE]? [27:30] Yeah, so self-attention and multi-headed attention: so [27:35] the multi-headed attention is just this attention scheme, [27:38] but it's just applied multiple times in parallel. [27:40] Multiple heads just means independent applications [27:42] of the same attention. [27:44] So this message passing scheme basically just [27:47] happens in parallel multiple times [27:49] with different weights for the query, key, and value. [27:52] So you can almost look at it like, in parallel, I'm [27:55] seeking different kinds of information [27:57] from different nodes. [27:59] And I'm collecting it all in the same node. [28:01] It's all done in parallel. [28:03] So heads is really just copy-paste in parallel. [28:06] And layers are copy-paste but in series. [28:12] Maybe that makes sense. [28:15] And self-attention, when it's self-attention, [28:18] what it's referring to is that each node here [28:21] produces its own keys, queries, and values. [28:23] So as I described it here, this is really self-attention [28:25] because every one of these nodes produces [28:27] a key, a query, and a value from this individual node.
[28:30] When you have cross-attention, you have one cross-attention [28:33] here, coming from the encoder. [28:36] That just means that the queries are still [28:38] produced from this node, but the keys and the values [28:42] are produced as a function of nodes that [28:44] are coming from the encoder. [28:48] So I have my queries because I'm trying to decode, say, [28:52] the fifth word in the sequence. [28:53] And I'm looking for certain things [28:55] because I'm the fifth word. [28:56] And then the keys and the values, in terms [28:58] of the source of information that could answer my queries, [29:01] can come from the previous nodes in the current decoding [29:04] sequence or from the top of the encoder. [29:06] So all the nodes that have already seen all [29:09] of the encoding tokens many, many times can now broadcast [29:12] what they contain in terms of information. [29:14] So I guess, to summarize, the self-attention is-- [29:18] sorry, cross-attention and self-attention [29:20] only differ in where the keys and the values come from. [29:24] Either the keys and values are produced from this node, [29:28] or they are produced from some external source, like an encoder [29:31] and the nodes over there. [29:33] But algorithmically, it's the same mathematical operations. [29:39] Question. [29:39] Yeah, OK. [29:40] So two questions for you. [29:41] First question is, in the message passing [INAUDIBLE] [29:56] So think of-- so each one of these nodes is a token. [30:04] I guess I don't have a very good picture of it [30:06] in the transformer. [30:06] But this node here could represent the third word [30:14] in the output in the decoder, and in the beginning, [30:19] it is just the embedding of the word. [30:27] And then, OK, I have to think through this analogy [30:30] a little bit more. [30:31] I came up with it this morning. [30:32] [LAUGHTER] [30:34] [INAUDIBLE] [30:39] What example of instantiation [INAUDIBLE] nodes, [30:45] as in blocks or embeddings? [30:50] These nodes are basically the vectors. [30:53] I'll go to an implementation. [30:54] I'll go to the implementation, and then maybe I'll [30:56] make the connections to the graph. [30:58] So let me try to first go to-- let me now go, [31:01] with this intuition in mind, at least, [31:03] to nanoGPT, which is a concrete implementation [31:05] of a transformer that is very minimal. [31:06] So I worked on this over the last few days, [31:08] and here it is reproducing GPT-2 on OpenWebText. [31:11] So it's a pretty serious implementation that reproduces [31:14] GPT-2, I would say, provided enough compute-- [31:17] this was one node of 8 GPUs for 38 hours or something [31:21] like that, if I remember correctly. [31:22] And it's very readable. [31:23] It's 300 lines, so everyone can take a look at it. [31:27] And yeah, let me basically briefly step through it. [31:30] So let's try to have a decoder-only transformer. [31:34] So what that means is that it's a language model. [31:36] It tries to model the next word in the sequence [31:39] or the next character in the sequence. [31:41] So the data that we train this on [31:43] is always some kind of text. [31:44] So here's some fake Shakespeare. [31:45] Sorry, this is real Shakespeare. [31:47] We're going to produce fake Shakespeare. [31:48] So this is called the Tiny Shakespeare [31:50] dataset, which is one of my favorite toy datasets.
[31:52] You take all of Shakespeare, concatenate it, [31:54] and it's a 1 megabyte file, and then [31:55] you can train language models on it [31:56] and get infinite Shakespeare, if you like, [31:58] which I think is kind of cool. [31:59] So we have a text. [32:00] The first thing we need to do is we [32:02] need to convert it to a sequence of integers, [32:05] because transformers natively process-- [32:09] you can't plug text into a transformer. [32:10] You need to somehow encode it. [32:11] So the way that encoding is done is [32:13] we convert, for example, in the simplest case, [32:15] every character gets an integer, and then instead of "hi [32:18] there," we would have this sequence of integers. [32:21] So then you can encode every single character as an integer [32:25] and get a massive sequence of integers. [32:27] You just concatenate it all into one [32:29] large, long, one-dimensional sequence. [32:31] And then you can train on it. [32:32] Now, here, we only have a single document. [32:34] In some cases, if you have multiple independent documents, [32:36] what people like to do is create special tokens, [32:38] and they intersperse those documents [32:40] with those special end-of-text tokens [32:42] that they splice in between to create boundaries. [32:46] But those boundaries actually don't have any modeling impact. [32:50] It's just that the transformer is supposed [32:52] to learn via backpropagation that the end-of-document [32:55] token means that it should wipe the memory. [33:00] OK, so then we produce batches. [33:02] So these batches of data just mean [33:04] that we go back to the one-dimensional sequence, [33:06] and we take out chunks of this sequence. [33:08] So say the block size is 8. Then the block size indicates [33:13] the maximum length of context that your transformer will [33:17] process. [33:18] So if our block size is 8, that means [33:20] that we are going to have up to eight characters of context [33:23] to predict the ninth character in the sequence. [33:26] And the batch size indicates how many sequences in parallel [33:29] we're going to process. [33:30] And we want this to be as large as possible, [33:31] so we're fully taking advantage of the GPU [33:33] and its parallelism [INAUDIBLE] So in this example, [33:36] we're doing 4 by 8 batches. [33:38] So every row here is an independent example, [33:41] a small chunk of the sequence [33:47] that we're going to train on. [33:48] And then we have both the inputs and the targets [33:50] at every single point here. [33:52] So to fully spell out what's contained in a single 4 [33:55] by 8 batch to the transformer-- [33:57] I sort of compacted it here-- [33:59] when the input is 47 by itself, the target is 58. [34:04] And when the input is the sequence 47, 58, [34:07] the target is 1. [34:08] And when it's 47, 58, 1, the target is 51, and so on. [34:13] So actually, a single batch of examples, that 4 by 8, [34:15] actually has a ton of individual examples [34:17] that we are expecting the transformer [34:18] to learn on in parallel. [34:21] And so you'll see that the batches are trained [34:23] on completely independently, but the time dimension here, along [34:28] the horizontal, is also trained on in parallel. [34:30] So your real batch size is more like B times T. [34:34] And it's just that the context grows linearly [34:37] for the predictions that you make along the T direction [34:41] in the model.
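As a rough sketch of the encoding and batching just described, the following PyTorch snippet builds a character-level vocabulary, encodes the text into one long sequence of integers, and pulls out random 4-by-8 chunks with targets offset by one. The file name and the random-offset sampling are illustrative assumptions, not a verbatim excerpt from nanoGPT.

    import torch

    # read the Tiny Shakespeare text (file name is an assumption for illustration)
    text = open('tiny_shakespeare.txt').read()
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
    itos = {i: ch for ch, i in stoi.items()}       # integer -> character
    data = torch.tensor([stoi[c] for c in text], dtype=torch.long)  # one long 1-D sequence

    block_size = 8   # maximum context length the transformer will see
    batch_size = 4   # how many chunks we process in parallel

    def get_batch():
        # sample batch_size random starting offsets into the long sequence
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i + block_size] for i in ix])          # inputs, shape (4, 8)
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, offset by one
        return x, y

    x, y = get_batch()
    # every prefix x[b, :t+1] is an example whose target is y[b, t],
    # so a single 4-by-8 batch really contains B*T = 32 training examples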
[34:42] So this is all the examples that the model will learn from, [34:45] this single batch. [34:48] So now, this is the GPT class. [34:52] And because this is a decoder-only model, [34:55] we're not going to have an encoder, because there's no [34:58] English we're translating from-- [34:59] we're not trying to condition on some other external [35:02] information. [35:02] We're just trying to produce a sequence of words that [35:05] follow each other, or are likely to. [35:08] So this is all PyTorch, and I'm going slightly faster [35:10] because I'm assuming people have taken 231 or something [35:12] along those lines. [35:15] But here in the forward pass, we take these indices, [35:19] and then we encode the identity of the indices, [35:24] just via an embedding lookup table. [35:26] So for every single integer, we index into a lookup table of [35:31] vectors, an embedding table, and pull out [35:34] the word vector for that token. [35:38] And then because the transformer by itself [35:41] natively processes sets, it doesn't know about order. [35:43] So we need to also positionally encode these vectors [35:45] so that we basically have both the information [35:47] about the token identity and its place in the sequence, from 1 [35:51] to block size. [35:53] Now, the information about what and where [35:56] is combined additively, so the token embeddings [35:58] and the positional embeddings are just added, exactly as here. [36:02] So then there's optional dropout; [36:06] this x here basically just contains [36:08] the set of words and their positions, [36:14] and that feeds into the blocks of the transformer. [36:16] And we're going to look into what a block is here. [36:18] But for now, this is just a series [36:20] of blocks in the transformer. [36:22] And then in the end, there's a layer norm, [36:23] and then you're decoding the logits [36:26] for the next word or next integer in the sequence, [36:30] using a linear projection of the output of this transformer. [36:33] So lm_head here is short for language model head. [36:36] It's just a linear function. [36:38] So basically, positionally encode all the words, [36:42] feed them into a sequence of blocks, [36:45] and then apply a linear layer to get the probability [36:47] distribution for the next character. [36:50] And then if we have the targets, which [36:51] we produced in the data loader-- [36:54] and you'll notice that the targets are just [36:55] the inputs offset by one in time-- [36:59] then those targets feed into a cross-entropy loss. [37:01] So this is just a negative log likelihood, [37:03] typical classification loss. [37:04] So now let's drill into what's here in the blocks. [37:08] So these blocks are applied sequentially, [37:11] and there's, again, as I mentioned, this communicate [37:13] phase and the compute phase. [37:15] So in the communicate phase, all the nodes [37:17] get to talk to each other, and these nodes are basically-- [37:21] if our block size is 8, then we are [37:23] going to have eight nodes in this graph. [37:26] There's eight nodes in this graph. [37:28] The first node is pointed to only by itself. [37:30] The second node is pointed to by the first node and itself. [37:33] The third node is pointed to by the first two nodes [37:35] and itself, et cetera. [37:36] So there's eight nodes here. [37:38] So you apply-- there's a residual pathway, and x, [37:42] you take it off of that pathway.
[37:43] You apply a layer norm, and then the self-attention, [37:45] so that these communicate, these eight nodes communicate. [37:47] But you have to keep in mind that the batch is 4. [37:50] So because the batch is 4, this is also applied-- [37:54] so we have eight nodes communicating, [37:55] but there's a batch of four such groups of eight nodes, each communicating [37:58] individually. [37:59] There's no crisscrossing across the batch dimension, of course. [38:02] Nothing crosses the batch anywhere, luckily. [38:04] And then once they've exchanged information, [38:06] they are processed using the multi-layer perceptron. [38:09] And that's the compute phase. [38:12] And then also here we are missing the cross-attention [38:18] because this is a decoder-only model. [38:19] So all we have is this step here, [38:21] the multi-headed attention, and that's [38:22] this line, the communicate phase. [38:24] And then we have the feed-forward, which is the MLP, [38:27] and that's the compute phase. [38:29] I'll take questions a bit later. [38:31] Then the MLP here is fairly straightforward. [38:34] The MLP is just individual processing on each node, [38:38] just transforming the feature representation at that node. [38:41] So it's applying a two-layer neural net [38:45] with a GELU nonlinearity, which-- just [38:47] think of it as a ReLU or something like that. [38:49] It's just a nonlinearity. [38:51] And then the MLP is straightforward. [38:53] I don't think there's anything too crazy there. [38:55] And then this is the causal self-attention part, [38:57] the communication phase. [38:59] So this is like the meat of things [39:01] and the most complicated part. [39:03] It's only complicated because of the batching [39:06] and the implementation detail of how you mask the connectivity [39:10] in the graph so that you can't obtain [39:13] any information from the future when [39:15] you're predicting your token. [39:16] Otherwise, it gives away the information. [39:18] So if I'm the fifth token, and if I'm at the fifth position, [39:23] then I'm getting the fourth token coming into the input, [39:26] and I'm attending to the third, second, and first, [39:29] and I'm trying to figure out what is the next token. [39:32] Well then, in this batch, in the next element [39:34] over in the time dimension, the answer is at the input. [39:37] So I can't get any information from there. [39:40] So that's why this is all tricky, [39:41] but basically, in the forward pass, [39:45] we are calculating the queries, keys, and values based on x. [39:50] So these are the keys, queries, and values. [39:52] Here, when I'm computing the attention, [39:54] I have the queries matrix-multiplying the keys. [39:58] So this is the dot product in parallel for all the queries [40:00] and all the keys in all the heads. [40:03] So I failed to mention that there's also [40:06] the aspect of the heads, which is also done all in parallel [40:08] here. [40:09] So we have the batch dimension, the time dimension, [40:10] and the head dimension, and you end up [40:12] with four-dimensional tensors, and it's all really confusing. [40:14] So I invite you to step through it later and convince yourself [40:17] that this is actually doing the right thing. [40:19] But basically, you have the batch dimension, the head [40:21] dimension, and the time dimension, [40:23] and then you have features at them.
[40:25] And so this is evaluating, for all the batch elements, for all [40:28] the head elements, and all the time elements, [40:31] the simple Python that I gave you earlier, which is query [40:34] dot product key. [40:35] Then here, we do a masked_fill, and what this is doing [40:38] is it's basically clamping the attention between the nodes [40:44] that are not supposed to communicate to be negative [40:46] infinity. [40:47] And we're doing negative infinity [40:48] because we're about to softmax, and so negative infinity will [40:51] make the attention at those elements basically be zero. [40:54] And so here we are going to basically end up [40:56] with the weights, the affinities between these nodes; optional [41:03] dropout. [41:03] And then here, attention matrix-multiply v is basically [41:08] the gathering of the information according to the affinities [41:10] we calculated. [41:11] And this is just a weighted sum of the values [41:14] at all those nodes. [41:15] So this matrix multiply is doing that weighted sum. [41:19] And then transpose, contiguous, view, [41:20] because it's all complicated and batched [41:22] in four-dimensional tensors, but it's really not [41:24] doing anything; optional dropout, [41:26] and then a linear projection back to the residual pathway. [41:30] So this is implementing the communication phase here. [41:34] Then you can train this transformer. [41:37] And then you can generate infinite Shakespeare. [41:41] And you will simply do this by-- [41:43] because our block size is 8, we start with some token. [41:47] Say, like I used in this case, you [41:50] can use something like a newline as the start token. [41:53] And then it communicates only with itself [41:55] because there's a single node, and you [41:57] get the probability distribution for the first character [41:59] in the sequence. [42:00] And then you decode the first character [42:03] in the sequence. [42:04] You decode the character. [42:05] And then you bring back the character, [42:06] and you re-encode it as an integer. [42:08] And now, you have the second thing. [42:10] And so you get-- [42:12] OK, we're at the first position, and this [42:14] is whatever integer it is; add the positional encodings; [42:17] it goes into the sequence, goes into the transformer, [42:19] and again, this token now communicates [42:21] with the first token and its identity. [42:26] And so you just keep plugging it back. [42:28] And once you run out of the block size, which is eight, [42:31] you have to start cropping, because you can never [42:33] have a context longer than eight in the way you've [42:34] trained this transformer. [42:35] So we have more and more context until eight. [42:37] And then if you want to generate beyond eight, [42:39] you have to start cropping because the transformer only [42:41] works for eight elements in the time dimension. [42:43] And so all of these transformers in the [INAUDIBLE] setting [42:47] have a finite block size or context length, [42:50] and in typical models, this will be 1,024 tokens or 2,048 [42:54] tokens, something like that. [42:56] But these tokens are usually like BPE tokens, [42:58] or SentencePiece tokens, or WordPiece tokens. [43:00] There's many different encodings. [43:02] So it's not like that long. [43:03] And so that's why, I think, [INAUDIBLE]. [43:05] We really want to expand the context size, [43:06] and it gets gnarly because the attention [43:08] is quadratic in the [INAUDIBLE] case.
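Pulling the pieces of this walkthrough together, here is a condensed, self-contained sketch of a decoder-only transformer in the same spirit as nanoGPT: the causal self-attention (communicate phase), the MLP (compute phase), the block that interleaves them with pre-layer-norms and residuals, the GPT module that adds token and positional embeddings, and a simple generation loop that crops the context to the block size. It is a simplification with made-up hyperparameters, not nanoGPT's actual code; dropout and weight initialization are omitted.

    import math
    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class CausalSelfAttention(nn.Module):
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            assert n_embd % n_head == 0
            self.n_head = n_head
            self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # emits queries, keys, values
            self.c_proj = nn.Linear(n_embd, n_embd)       # projection back to the residual pathway
            # lower-triangular mask: token t may only attend to tokens <= t
            # (an encoder block would simply not apply this mask)
            self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            q, k, v = self.c_attn(x).split(C, dim=2)
            # split into heads: (B, n_head, T, head_dim), all heads run in parallel
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))       # query dot key, scaled
            att = att.masked_fill(self.mask[:T, :T] == 0, float('-inf'))  # no peeking at the future
            att = F.softmax(att, dim=-1)                                  # affinities sum to 1
            y = att @ v                                                   # weighted sum of values
            y = y.transpose(1, 2).contiguous().view(B, T, C)              # re-assemble the heads
            return self.c_proj(y)

    class MLP(nn.Module):
        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(                  # per-node compute, 4x expansion factor
                nn.Linear(n_embd, 4 * n_embd),
                nn.GELU(),
                nn.Linear(4 * n_embd, n_embd),
            )
        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            self.ln1 = nn.LayerNorm(n_embd)
            self.attn = CausalSelfAttention(n_embd, n_head, block_size)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = MLP(n_embd)
        def forward(self, x):
            x = x + self.attn(self.ln1(x))   # communicate phase
            x = x + self.mlp(self.ln2(x))    # compute phase
            return x

    class GPT(nn.Module):
        def __init__(self, vocab_size, block_size, n_embd=64, n_head=4, n_layer=4):
            super().__init__()
            self.block_size = block_size
            self.tok_emb = nn.Embedding(vocab_size, n_embd)   # what (token identity)
            self.pos_emb = nn.Embedding(block_size, n_embd)   # where (position in sequence)
            self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(n_embd)
            self.lm_head = nn.Linear(n_embd, vocab_size)      # logits for the next token

        def forward(self, idx, targets=None):
            B, T = idx.shape
            pos = torch.arange(T, device=idx.device)
            x = self.tok_emb(idx) + self.pos_emb(pos)         # what + where, added
            x = self.blocks(x)
            logits = self.lm_head(self.ln_f(x))
            loss = None
            if targets is not None:
                loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
            return logits, loss

        @torch.no_grad()
        def generate(self, idx, n_new):
            for _ in range(n_new):
                idx_cond = idx[:, -self.block_size:]          # crop to the last block_size tokens
                logits, _ = self(idx_cond)
                probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution for the next token
                idx_next = torch.multinomial(probs, num_samples=1)
                idx = torch.cat([idx, idx_next], dim=1)       # append and feed it back in
            return idx

The masked_fill line is what makes this a decoder; the encoder and cross-attention variants discussed next differ only in whether that mask is applied and in where the keys and values come from.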
[43:11] Now, if you want to implement an encoder instead of decoder [43:16] attention, [43:18] then all you have to do is this [INAUDIBLE] [43:21] and you just delete that line. [43:23] So if you don't mask the attention, [43:25] then all the nodes communicate with each other, [43:27] and everything is allowed, and information [43:29] flows between all the nodes. [43:31] So if you want to have the encoder here, just delete it. [43:35] All the encoder blocks will use attention [43:38] where this line is deleted. [43:39] That's it. [43:40] So you're allowing whatever-- this encoder might store, say, [43:44] 10 tokens, 10 nodes, and they are all [43:46] allowed to communicate with each other going up the transformer. [43:51] And then if you want to implement cross-attention, [43:53] so you have a full encoder-decoder transformer, [43:55] not just a decoder-only transformer or a GPT, [43:59] then we need to also add cross-attention in the middle. [44:03] So here, there is a self-attention piece where all [44:05] the-- [44:06] there's a self-attention piece, a cross-attention piece, [44:08] and this MLP. [44:09] And in the cross-attention, we need [44:12] to take the features from the top of the encoder. [44:14] We need to add one more line here, [44:16] and this would be the cross-attention instead of a-- [44:20] I should have implemented it instead of just pointing, [44:22] I think. [44:23] But there will be a cross-attention line here. [44:25] So we'll have three lines because we [44:26] need to add another block. [44:28] And the queries will come from x, but the keys [44:31] and the values will come from the top of the encoder. [44:35] And there will basically be information [44:36] flowing from the encoder, strictly, [44:38] to all the nodes inside x. [44:41] And then that's it. [44:42] So it's a very simple modification [44:44] of the decoder attention. [44:47] So you'll hear people talk about how you can have [44:49] a decoder-only model like GPT. [44:51] You can have an encoder-only model like BERT, [44:53] or you can have an encoder-decoder model, [44:55] like, say, T5, doing things like machine translation. [44:59] And BERT, you don't train using this language modeling [45:04] setup that's autoregressive, where you're just [45:06] trying to predict the next [INAUDIBLE] in the sequence. [45:07] You train it with slightly different objectives. [45:09] You're putting in the full sentence, [45:12] and the full sentence is allowed to communicate fully. [45:14] And then you're trying to classify sentiment or something [45:16] like that. [45:18] So you're not trying to model the next token in the sequence. [45:21] So these are trained slightly differently, [45:26] using masking and other denoising techniques. [45:31] OK. [45:32] So that's like the transformer. [45:34] I'm going to continue. [45:36] So yeah, maybe more questions. [45:38] [INAUDIBLE] [46:01] This is like we are enforcing these constraints on it [46:06] by just masking [INAUDIBLE] [46:12] So I'm not sure if I fully follow. [46:14] So there's different ways to look at this analogy, [46:16] but one analogy is you can interpret [46:18] this graph as really fixed. [46:20] It's just that every time we do the communication, [46:22] we are using different weights. [46:23] You can look at it that way. [46:24] So if we have a block size of eight, as in my example, [46:26] we would have eight nodes. [46:27] Here we have 2, 4, 6. [46:29] OK, so we'd have eight nodes.
[46:30] They would be connected in-- [46:33] you lay them out, and you only connect from left to right. [46:35] [INAUDIBLE] [46:42] Why would they connect-- usually, [46:44] the connections don't change as a function of the data [46:46] or something like that-- [46:47] [INAUDIBLE] [47:00] I don't think I've seen a single example where [47:02] the connectivity changes dynamically [47:03] as a function of the data. [47:04] Usually, the connectivity is fixed. [47:05] If you have an encoder, and you're training a BERT, [47:07] you have however many tokens you want, [47:09] and they are fully connected. [47:11] And if you have a decoder-only model, [47:13] you have this triangular thing, and if you [47:15] have encoder-decoder, then you have, [47:16] awkwardly, two pools of nodes. [47:21] Yeah. [47:24] Go ahead. [47:25] [INAUDIBLE] I wonder, you know much more about this [47:45] than I know. [47:46] But do you have a sense of, like, if you ran [INAUDIBLE] [48:00] In my head, I'm thinking [INAUDIBLE] but then you also [48:08] have different things for one or more of [INAUDIBLE]-- [48:13] Yeah, it's really hard to say, so that's [48:15] why I think this paper is so interesting because, like, yeah, [48:17] usually you'd see the path leading up to it, [48:18] and maybe they had that path internally. [48:19] They just didn't publish it. [48:20] All you can see before it is things that didn't look like a transformer. [48:23] I mean, you have ResNets, which have lots of this. [48:26] But a ResNet would be like this, but there's [48:29] no self-attention component. [48:31] But the MLP is there, kind of, in a ResNet. [48:35] So a ResNet looks very much like this, [48:37] except there's no-- you can use layer norms in ResNets, [48:40] I believe, as well. [48:41] Typically, though, they're batch norms. [48:43] So it is kind of like a ResNet. [48:45] It is like they took a ResNet, and they [48:47] put in a self-attention block in addition [48:50] to the preexisting MLP block, which [48:52] is kind of like the convolutions. [48:53] And the MLP is, strictly speaking, like a [48:55] one-by-one convolution, but I think [48:59] the idea is similar in that the MLP is just a typical weights, [49:04] nonlinearity, weights operation. [49:11] But I will say, yeah, this is kind of interesting [49:13] because a lot of that work is not in the paper, [49:15] and then they just give you this transformer. [49:17] And then it turns out, 5 years later, [49:18] it hasn't changed, even though everyone's trying to change it. [49:20] So it's interesting to me that it arrived like a complete package, [49:23] which I think is really [49:25] interesting historically. [49:26] And I also talked to the paper's authors, [49:30] and they were unaware of the impact [49:32] that the transformer would have at the time. [49:33] So when you read this paper, actually, it's unfortunate, [49:37] because this is the paper that changed everything, [49:39] but when people read it, it's like question marks, [49:41] because it reads like a pretty random machine translation [49:45] paper. [49:46] It's like, oh, we're doing machine translation. [49:47] Oh, here's a cool architecture. [49:48] OK, great, good results. [49:51] It doesn't know what's going to happen. [49:53] [LAUGHS] And so when people read it today, [49:56] I think they're confused, potentially. [50:00] I will have some tweets at the end, [50:02] but I think I would have renamed it, [50:03] with the benefit of hindsight, of like-- well, I'll get to it. [50:08] [INAUDIBLE] [50:20] Yeah, I think that's a good question as well.
[50:22] Currently, I mean, I certainly don't [50:24] love the autoregressive modeling approach. [50:27] I think it's kind of weird to sample a token [50:29] and then commit to it. [50:31] So maybe there are some ways, some hybrids [50:36] with diffusion, as an example, which [50:38] I think would be really cool, or we'll [50:41] find some other ways to edit the sequences later but still [50:44] in an autoregressive framework. [50:47] But I think diffusion is an up-and-coming modeling [50:49] approach that I personally find much more appealing. [50:51] When I sample text, I don't go chunk, chunk, chunk, [50:54] and commit. [50:55] I do a draft one, and then I do a better draft two. [50:58] And that feels like a diffusion process. [51:00] So that would be my hope. [51:05] OK, also a question. [51:07] So yeah, you'd think the [INAUDIBLE] [51:20] And then once we have the edge weights, [51:21] we just have to multiply it by the values, [51:23] and then you just [INAUDIBLE] it. [51:25] Yes, yeah, that's right. [51:27] And do you think there's an analogy with graph neural networks [51:30] and they'll potentially-- [51:32] I find graph neural networks like a confusing term [51:34] because, I mean, yeah, previously, [51:38] there was this notion of-- [51:40] I feel like maybe today everything is a graph neural [51:42] network because a transformer is basically a graph neural [51:44] network. [51:45] The native representation that the transformer operates over [51:48] is sets that are connected by edges in a directed way. [51:51] And so that's the native representation, and then, yeah. [51:55] OK, I should go on because I still have 30 slides. [51:57] [INAUDIBLE] [52:08] Oh yeah, yeah, the square root of d, I think, it's basically [52:11] like, if you're initializing with random weights [52:14] sampled from a [INAUDIBLE], as your dimension size grows, [52:17] so do your values; the variance grows. [52:19] And then your softmax will just become a one-hot vector. [52:23] So it's just a way to control the variance [52:25] and bring it to always be in a good range for the softmax, [52:28] a nice diffuse distribution. [52:31] OK, so it's almost like an initialization thing. [52:37] OK, so transformers have been applied [52:41] to all the other fields, and the way this was done [52:44] is, in my opinion, in ridiculous ways, [52:46] honestly, because I was a computer vision person, [52:49] and you have ConvNets, and they make sense. [52:51] So what we're doing now with ViTs, as an example, is [52:53] you take an image and you chop it up into little squares. [52:56] And then those squares literally [52:57] feed into a transformer, and that's [52:59] it, which is kind of ridiculous. [53:01] And so, I mean, yeah, and so the transformer [53:06] doesn't even, in the simplest case, really know where [53:08] these patches might come from. [53:10] They are usually positionally encoded, [53:12] but it has to rediscover a lot of the structure, [53:16] I think, of them in some ways. [53:19] And it's kind of weird to approach it that way. [53:23] But it's just the simplest baseline [53:25] of just chopping up big images into small squares [53:27] and feeding them in as the individual nodes, and it actually [53:29] works fairly well. [53:30] And then this is in a transformer encoder, [53:32] so all the patches are talking to each other [53:34] throughout the entire transformer. [53:36] And the number of nodes here would be like nine.
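[Circling back to the earlier question about the square root of d: a tiny numerical check, a toy sketch rather than anything from the lecture, of how the raw dot-product logits have variance that grows with the dimension and saturate the softmax toward a one-hot vector, while dividing by the square root of d keeps the distribution diffuse.]

```python
import math
import torch

torch.manual_seed(0)
d = 512
q = torch.randn(8, d)                    # unit-variance queries
k = torch.randn(8, d)                    # unit-variance keys

raw = q @ k.T                            # each logit has variance roughly d (~512 here)
scaled = raw / math.sqrt(d)              # variance brought back to roughly 1

print(raw.var().item(), scaled.var().item())
print(torch.softmax(raw[0], dim=-1))     # nearly one-hot: the largest logit dominates
print(torch.softmax(scaled[0], dim=-1))  # nicely diffuse distribution
```

[And a minimal sketch of the ViT-style "chop it up into little squares" idea just described. Sizes and names are illustrative, not from any specific implementation: a strided convolution is just a linear projection of each patch, and a 96x96 image with 32x32 patches gives the nine nodes mentioned above.]

```python
import torch
import torch.nn as nn

C, H, W, P, d_model = 3, 96, 96, 32, 256            # 96x96 image, 32x32 patches -> 3x3 = 9 tokens

patch_embed = nn.Conv2d(C, d_model, kernel_size=P, stride=P)   # linear embedding of each patch

img = torch.randn(1, C, H, W)
tokens = patch_embed(img).flatten(2).transpose(1, 2)           # (1, 9, d_model)
print(tokens.shape)   # these 9 "nodes" all talk to each other in the transformer encoder
```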
[53:42] Also, in speech recognition, you just take your mel spectrogram, [53:44] and you chop it up into slices, and you feed them [53:46] into a transformer. [53:47] So there were papers like this, but also Whisper. [53:49] Whisper is a copy-paste transformer. [53:51] If you saw Whisper from OpenAI, you just chop up the mel spectrogram [53:55] and feed it into a transformer and then pretend [53:57] you're dealing with text. [53:58] And it works very well. [54:00] Decision Transformer in RL: you take your states, actions, [54:03] and rewards that you experience in an environment, [54:05] and you just pretend it's a language. [54:07] Then you start to model the sequences of that, [54:09] and then you can use that for planning later. [54:11] That works really well. [54:13] Even things like AlphaFold-- so we were briefly [54:15] talking about molecules and how you can plug them in. [54:17] So at the heart of AlphaFold, computationally, [54:19] is also a transformer. [54:21] One thing I wanted to also say about transformers [54:23] is I find that they're very flexible, [54:26] and I really enjoy that. [54:28] I'll give you an example from Tesla. [54:31] You have a ConvNet that takes an image [54:32] and makes predictions about the image. [54:34] And then the big question is, how do you [54:35] feed in extra information? [54:37] And it's not always trivial. Like say, I [54:38] have additional information [54:40] that I want the outputs to be informed by. [54:43] Maybe I have other sensors like radar. [54:45] Maybe I have some map information, or a vehicle type, [54:47] or some audio. [54:48] And the question is, how do you feed information into a ConvNet? [54:50] Like where do you feed it in? [54:52] Do you concatenate it? [54:54] Do you add it? [54:55] At what stage? [54:56] And so with a transformer, it's much easier [54:58] because you just take whatever you want, you chop it [55:00] up into pieces, and you feed it in with the set [55:02] of what you had before. [55:03] And you let the self-attention figure out [55:04] how everything should communicate. [55:06] And that actually apparently works. [55:07] So just chop up everything and throw it into the mix [55:10] is like the way. [55:11] And it frees neural nets from this burden [55:15] of Euclidean space, where previously you [55:19] had to arrange your computation to conform to the Euclidean [55:21] space of three dimensions in how you're laying out the compute. [55:25] Like the compute actually kind of [55:26] happens in almost like 3D space if you think about it. [55:29] But in attention, everything is just sets. [55:32] So it's a very flexible framework, [55:33] and you can just throw in stuff into your conditioning set, [55:35] and everything just gets self-attended over. [55:37] So it's quite beautiful from that perspective. [55:39] OK, so now what exactly makes transformers so effective? [55:43] I think a good example of this comes [55:44] from the GPT-3 paper, which I encourage people to read, [55:48] "Language Models are Few-Shot Learners." [55:50] I would have probably renamed this a little bit. [55:52] I would have said something like transformers [55:54] are capable of in-context learning or meta-learning. [55:57] That's like what makes them really special. [56:00] So basically the setting that they're working with [56:02] is, OK, I have some context, and I'm [56:03] trying-- like say, a passage. [56:04] This is just one example of many. [56:06] I have a passage, and I'm asking questions about it.
[56:08] And then as part of the context in the prompt, [56:12] I'm giving the questions and the answers. [56:14] So I'm giving one example of question-answer, [56:16] another example of question-answer, [56:17] another example of question-answer, and so on. [56:19] And this becomes-- [56:21] Oh yeah, people are going to have to leave soon, huh? [56:24] OK, is this really important? [56:25] Let me think. [56:29] OK, so what's really interesting is basically [56:31] that with more examples given in the context, [56:35] the accuracy improves. [56:37] And so what that suggests is that the transformer [56:39] is able to somehow learn in the activations [56:42] without doing any gradient descent [56:43] in the typical fine-tuning fashion. [56:45] So if you fine-tune, you have to give an example and the answer, [56:48] and you fine-tune it using gradient descent. [56:51] But it looks like the transformer internally, [56:53] in its activations, is doing something [56:54] that looks potentially like gradient descent, some kind [56:56] of meta-learning, [56:57] as it is reading the prompt. [56:59] And so in this paper, they go into, OK, [57:01] distinguishing this outer loop of stochastic gradient [57:03] descent from this inner loop of in-context learning. [57:06] So the inner loop runs as the transformer is reading [57:08] the sequence, almost, and the outer loop is the training [57:12] by gradient descent. [57:14] So basically, there's some training [57:15] happening in the activations of the transformer [57:17] as it is consuming a sequence, that [57:18] maybe very much looks like gradient descent. [57:21] And so there are some recent papers that hint at this [57:23] and study it. [57:23] And so as an example, in this paper [57:25] here, they propose something called the raw operator. [57:28] And they argue that the raw operator is implemented [57:32] by the transformer, and then they show [57:33] that you can implement things like ridge regression [57:35] on top of the raw operator. [57:36] And so this is giving-- [57:39] There are papers hinting that maybe there [57:40] is something that looks like gradient-based learning [57:42] inside the activations of the transformer. [57:45] And I think this is not impossible to think through [57:47] because what is gradient-based learning? [57:49] Forward pass, backward pass, and then update. [57:52] Oh, that looks like a ResNet, right, [57:54] because you're adding to the weights. [57:57] So you start with an initial random set of weights, [57:59] forward pass, backward pass, and update your weights, [58:01] and then forward pass, backward pass, update the weights. [58:04] Looks like a ResNet. [58:04] A transformer is a ResNet, so-- this is much more hand-wavy, [58:10] but basically, some papers are trying [58:11] to hint at why that would be potentially possible. [58:14] And then I have a bunch of tweets I just copy-pasted here [58:16] at the end. [58:18] These were meant for general consumption, [58:20] so they're a bit more high-level and a little bit hypey. [58:22] But I'm talking about why this architecture is so interesting [58:26] and why potentially it became so popular. [58:27] And I think it simultaneously optimizes [58:29] three properties that, I think, are very desirable. [58:31] Number one, the transformer is very [58:33] expressive in the forward pass. [58:35] It's sort of like-- it's able to implement [58:37] very interesting functions, potentially functions [58:39] that can even do meta-learning.
[58:41] Number two, it is very optimizable, thanks [58:43] to things like residual connections, layer norms, [58:45] and so on. [58:45] And number three, it's extremely efficient. [58:47] This is not always appreciated, but the transformer, [58:49] if you look at the computational graph, [58:51] is a shallow, wide network, which [58:53] is perfect to take advantage of the parallelism of GPUs. [58:56] So I think the transformer was designed very deliberately [58:58] to run efficiently on GPUs. [59:00] There's previous work like the Neural GPU [59:02] that I really enjoy as well, which is really just [59:05] about how do we design neural nets that are efficient on GPUs, [59:08] thinking backwards from the constraints of the hardware, [59:10] which I think is a very interesting way [59:11] to think about it. [59:17] Oh yeah, so here, I'm saying, I probably would have called-- [59:21] I probably would've called the transformer a general-purpose, [59:24] efficient, optimizable computer instead of "Attention [59:27] Is All You Need." [59:28] That's what I would have maybe, in hindsight, called that paper. [59:31] It's proposing a model that is very general purpose, so [59:37] its forward pass is expressive. [59:38] It's very efficient in terms of GPU usage [59:40] and is easily optimizable by gradient descent and trains [59:44] very nicely. [59:46] And then I have some other hype tweets here. [59:51] Anyway, so you can read them later. [59:53] But I think this one is maybe interesting. [59:55] So if previous neural nets are special-purpose computers [59:58] designed for a specific task, GPT [01:00:00] is a general-purpose computer, reconfigurable at runtime [01:00:03] to run natural language programs. [01:00:06] So the programs are given as prompts, [01:00:08] and then GPT runs the program by completing the document. [01:00:12] So I really like these analogies to a computer, personally. [01:00:16] It's just like a powerful computer, [01:00:18] and it's optimizable by gradient descent. [01:00:22] And I don't know-- [01:00:30] OK, yeah. [01:00:31] That's it. [01:00:31] [LAUGHTER] [01:00:33] You can read the tweets later, but that's it for now. [01:00:35] I'll just say thank you. [01:00:36] I'll just leave this up. [01:00:45] Sorry, I just found this tweet. [01:00:46] So it turns out that if you scale up the training set [01:00:49] and use a powerful enough neural net like a transformer, [01:00:51] the network becomes a kind of general-purpose [01:00:53] computer over text. [01:00:54] So I think that's a nice way to look at it. [01:00:56] And instead of performing a single fixed text sequence, [01:00:58] you can design the sequence in the prompt. [01:01:00] And because the transformer is both powerful [01:01:02] but also is trained on a large enough and hard enough data set, [01:01:05] it becomes this general-purpose text computer. [01:01:07] And so I think that's kind of an interesting way to look at it. [01:01:11] Yeah. [01:01:13] [INAUDIBLE] [01:02:01] And I guess my question is [INAUDIBLE] how [01:02:04] much do you think [INAUDIBLE]? [01:02:10] really because it's mostly more efficient or [INAUDIBLE] [01:02:25] So I think there's a bit of that. [01:02:27] Yeah, so I would say RNNs, in principle, [01:02:29] yes, they can implement arbitrary programs. [01:02:31] I think it's like a useless statement to some extent [01:02:33] because they're probably-- [01:02:35] I mean, they're probably expressive [01:02:37] in the sense that they can implement [01:02:40] these arbitrary functions.
[01:02:43] But they're not optimizable. [01:02:44] And they're certainly not efficient because they [01:02:46] are serial computing devices. [01:02:50] So if you look at it as a compute graph, [01:02:51] RNNs are a very long, thin compute graph. [01:02:58] What if you stretched out the neurons and you looked-- [01:03:00] like, take all the individual neurons' interconnectivity, [01:03:02] and stretch them out, and try to visualize them. [01:03:04] RNNs would be like a very long graph, and that's bad. [01:03:07] And it's bad also for optimizability [01:03:08] because-- I don't exactly know why, [01:03:10] but just the rough intuition is, when you're backpropagating, [01:03:13] you don't want to make too many steps. [01:03:15] And so transformers are a shallow, wide graph, and so [01:03:19] from supervision to inputs it's a very small number of hops. [01:03:23] And there are these long residual pathways, [01:03:25] which make gradients flow very easily. [01:03:26] And there's all these layer norms [01:03:28] to control the scales of all of those activations. [01:03:32] And so there are not too many hops, [01:03:34] and you're going from supervision to input [01:03:36] very quickly, and it just flows through the graph. [01:03:40] And it can all be done in parallel, [01:03:42] so you don't need to do this-- [01:03:43] with encoder and decoder RNNs, you have to go from the first word, [01:03:46] then the second word, then the third word. [01:03:47] But here in the transformer, every single word [01:03:49] is processed completely in parallel, which is kind of a-- [01:03:54] So I think all of these are really important. [01:03:57] And I think number three is less talked about but extremely [01:04:00] important because in deep learning, scale matters. [01:04:03] And so the size of the network that you can train [01:04:06] is extremely important. [01:04:08] And so if it's efficient on the current hardware, [01:04:10] then you can make it bigger. [01:04:14] You mentioned that if you do it with multiple modalities [01:04:17] of data, [INAUDIBLE]. [01:04:21] How does that actually work? [01:04:22] Do you leave the different data as different tokens, [01:04:26] or is it [INAUDIBLE]? [01:04:29] No, so yeah, so you take your image, [01:04:31] and you just chop it up into patches. [01:04:33] So those are the first thousand tokens or whatever. [01:04:35] And now, I have a special-- [01:04:37] so radar could also be there, but I don't actually [01:04:40] want to make up a representation of radar. [01:04:43] But you just need to chop it up and enter it. [01:04:46] And then you have to encode it somehow. [01:04:47] Like the transformer needs to know [01:04:48] that they're coming from radar. [01:04:49] So you create a special-- [01:04:52] you have some kind of a special token for that-- [01:04:55] these radar tokens are what's slightly [01:04:57] different in the representation, and it's [01:04:58] learnable by gradient descent. [01:05:00] And like vehicle information would also [01:05:03] come in with a special embedding token that can be learned. [01:05:07] So-- [01:05:09] So how do you align those before really-- [01:05:11] Actually, you don't. [01:05:12] It's all just a set. [01:05:13] And there's-- [01:05:14] Even the [INAUDIBLE] [01:05:18] Yeah, it's all just a set, but you can positionally [01:05:20] encode these sets if you want. [01:05:23] So positional encoding means you can [01:05:26] hardwire, for example, the coordinates [01:05:28] like using [INAUDIBLE].
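[A hedged sketch of what "chop up everything and throw it into the mix" might look like in code. The three modalities, names, and sizes here are illustrative assumptions, not any production setup: each modality becomes tokens, a learned modality embedding marks where each token came from, learned position vectors hang out at each slot, and everything is concatenated into one set for self-attention.]

```python
import torch
import torch.nn as nn

d_model = 256
modality_emb = nn.Embedding(3, d_model)                  # 0 = image patch, 1 = radar, 2 = vehicle info
pos_emb = nn.Parameter(torch.zeros(1, 1024, d_model))    # learned vectors "hanging out" at each slot

def tag(tokens, modality_id, offset):
    """Add a learned modality embedding and learned position vectors to a group of tokens."""
    n = tokens.size(1)
    return tokens + modality_emb.weight[modality_id] + pos_emb[:, offset:offset + n]

image_tokens  = torch.randn(1, 9, d_model)   # e.g. from a patch embedding as sketched earlier
radar_tokens  = torch.randn(1, 4, d_model)   # hypothetical radar features, already projected
vehicle_token = torch.randn(1, 1, d_model)   # hypothetical vehicle-type embedding

x = torch.cat([tag(image_tokens, 0, 0),
               tag(radar_tokens, 1, 9),
               tag(vehicle_token, 2, 13)], dim=1)   # one big set; self-attention sorts out the rest
print(x.shape)                                      # (1, 14, d_model)
```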
[01:05:29] You can hardwire that, but it's better [01:05:31] if you don't hardwire the position. [01:05:33] It's just a vector that is always [01:05:34] hanging out at this location. [01:05:35] Whatever content is there, it just gets added onto it. [01:05:37] And this vector is trainable by backprop. [01:05:39] That's how you do it. [01:05:43] Good point. [01:05:43] I don't really like the [INAUDIBLE]. [01:05:48] They seem to work, but it seems like they're sometimes [01:05:51] [INAUDIBLE] [01:06:08] I'm not sure if I understand your question. [01:06:10] [LAUGHTER] [01:06:11] So I mean, the positional encoders, [01:06:12] they're actually like not-- [01:06:14] OK, so they have very little inductive bias or something [01:06:16] like that. [01:06:17] They're just vectors hanging out at each location, always, [01:06:19] and you're trying to help the network in some way. [01:06:23] And I think the intuition is good, [01:06:28] but if you have enough data, usually, [01:06:30] trying to mess with it is a bad thing. [01:06:33] Trying to inject knowledge when you [01:06:35] have enough knowledge in the data [01:06:36] set itself is not usually productive. [01:06:38] So it all really depends on what scale you're at. [01:06:40] If you have infinite data, then you actually [01:06:41] want to encode less and less. [01:06:43] That turns out to work better. [01:06:44] And if you have very little data, then actually, you do [01:06:46] want to encode some biases. [01:06:47] And maybe if you have a much smaller data set, then [01:06:49] maybe convolutions are a good idea [01:06:50] because you actually have this bias coming from your filters. [01:06:55] But I think-- so the transformer is extremely general, [01:06:58] but there are ways to mess with the encodings [01:07:01] to put in more structure. [01:07:02] Like you could, for example, encode [INAUDIBLE] and fix it, [01:07:05] or you could actually go to the attention mechanism [01:07:07] and say, OK, if my image is chopped up into patches, [01:07:10] this patch can only communicate with this neighborhood. [01:07:13] And you just do that in the attention matrix: [01:07:15] you just mask out whatever you don't want to communicate. [01:07:18] And so people really play with this [01:07:19] because the full attention is inefficient. [01:07:22] So they will intersperse, for example, layers [01:07:25] that only communicate in little patches [01:07:26] and then layers that communicate globally. [01:07:28] And they will do all kinds of tricks like that. [01:07:30] So you can slowly bring in more inductive bias. [01:07:33] You could do that, but the inductive biases [01:07:35] are factored out from the core transformer. [01:07:38] They are factored out into the connectivity [01:07:41] of the nodes, [01:07:42] and they are factored out into the positional encodings, [01:07:44] and you can mess with those for computation. [01:07:49] [INAUDIBLE] [01:08:02] So there's probably about 200 papers on this now, if not more. [01:08:06] They're kind of hard to keep track of. [01:08:07] Honestly, like my Safari browser, which is-- oh, [01:08:10] it's all up on my computer, like 200 open tabs. [01:08:13] But yes, I'm not even sure if I want [01:08:20] to pick my favorite, honestly. [01:08:23] Yeah, [INAUDIBLE] [01:08:42] Maybe you can use a transformer like that [INAUDIBLE] [01:08:45] The other one that I actually like even more [01:08:46] is, potentially, keep the context length fixed [01:08:49] but allow the network to somehow use a scratch pad.
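[A tiny sketch of the neighborhood-masking idea mentioned a moment ago; the window size and the one-dimensional layout are illustrative. You build a boolean mask over the attention matrix and fill everything outside the neighborhood with minus infinity before the softmax, and layers masked like this can be interleaved with fully global layers.]

```python
import torch

def local_mask(T, window):
    """Allow position i to attend only to positions within `window` steps of i."""
    idx = torch.arange(T)
    return (idx[None, :] - idx[:, None]).abs() <= window   # (T, T) boolean

mask = local_mask(T=6, window=2)
print(mask.int())
# In an attention layer you would apply it before the softmax, e.g.:
#   att = att.masked_fill(~mask, float('-inf'))
```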
[01:08:53] And so the way this works is you will teach the transformer, [01:08:55] somehow, via examples in [INAUDIBLE]: hey, [01:08:57] you actually have a scratch pad. [01:09:00] Basically, you can't remember too much. [01:09:01] Your context length is finite. [01:09:02] But you can use a scratch pad. [01:09:04] And you do that by emitting a start-scratchpad token, [01:09:06] and then writing whatever you want to remember, and then [01:09:08] an end-scratchpad token. [01:09:10] And then you continue with whatever you want. [01:09:12] And then later when it's decoding, [01:09:14] you actually have special logic [01:09:15] so that when you detect the start-scratchpad token, [01:09:18] you will save whatever it puts [01:09:19] in there in some external thing and allow it to attend over it. [01:09:22] So basically, you can teach the transformer, just dynamically, [01:09:25] because it's so good at meta-learning-- [01:09:27] you can teach it dynamically to use other gizmos and gadgets [01:09:30] and allow it to expand its memory that way, [01:09:31] if that makes sense. [01:09:32] It's just like a human learning to use a notepad, right? [01:09:35] You don't have to keep it in your brain. [01:09:37] So keeping things in your brain is like the context length [01:09:39] of the transformer. [01:09:39] But maybe we can just give it a notebook. [01:09:42] And then it can query the notebook, and read from it, [01:09:45] and write to it. [01:09:46] [INAUDIBLE] transformer to plug in another transformer. [01:09:48] [LAUGHTER] [01:09:53] [INAUDIBLE] [01:10:09] I don't know if I detected that. [01:10:10] I feel like-- did you feel like there was more than just [01:10:12] a long prompt that's unfolding? [01:10:14] Yeah, [INAUDIBLE] [01:10:19] I didn't try extensively, but I did see a [INAUDIBLE] event. [01:10:22] And I felt like the block size was just moved. [01:10:28] Maybe I'm wrong. [01:10:28] I don't actually know about the internals of ChatGPT. [01:10:31] We have two online questions. [01:10:33] So one question is, "What do you think about architecture [01:10:35] [INAUDIBLE]?" [01:10:38] S4? [01:10:39] S4. [01:10:40] I'm sorry. [01:10:41] I don't know S4. [01:10:42] Which one is this one? [01:10:45] The second question-- this one's a personal question. [01:10:47] "What are you going to work on next?" [01:10:49] [INAUDIBLE] [01:10:51] I mean, so right now, I'm working on things like nanoGPT. [01:10:53] Where is nanoGPT? [01:10:58] I mean, I'm basically moving slightly from computer vision [01:11:01] and computer vision-based products to doing [01:11:03] a little bit in the language domain. [01:11:05] Where's ChatGPT? [01:11:06] OK, nanoGPT. [01:11:07] So originally, I had minGPT, which I rewrote into nanoGPT. [01:11:10] And I'm working on this. [01:11:11] I'm trying to reproduce GPTs, and I mean, [01:11:14] I think something like ChatGPT, I think, [01:11:16] incrementally improved in a product fashion, [01:11:17] would be extremely interesting. [01:11:19] And I think a lot of people feel it, [01:11:23] and that's why it went so wide. [01:11:24] So I think there's something like a Google-plus-plus-plus [01:11:28] to build that I think is more interesting. [01:11:31] Shall we give our speaker a round of applause?