
Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

Stanford Online · May 10, 2026
Transcript ~12193 words · 1:11:40
0:05
Hi, everyone.
0:06
Welcome to CS 25 Transformers United V2.
0:09
This was a course that was held at Stanford
0:11
in the winter of 2023.
0:13
This course is not about robots that
0:14
can transform into cars as this picture might suggest.
0:17
Rather, it's about deep learning models
0:18
that have taken the world by storm
0:21
and have revolutionized the field of AI and others.
0:23
Starting from natural language processing,
0:25
transformers have been applied all over,
0:27
computer vision, reinforcement learning, biology, robotics,
0:30
et cetera.
0:31
We have an exciting set of videos lined up for you
0:34
with some truly fascinating speakers, talks, presenting
0:37
how they're applying transformers
0:39
to the research in different fields and areas.
0:44
We hope you'll enjoy and learn from these videos.
0:47
So without any further ado, let's get started.
0:52
This is a purely introductory lecture.
0:54
And we'll go into the building blocks of transformers.
0:58
So first, let's start with introducing the instructors.
1:03
So for me, I'm currently on a temporary deferral from the PhD
1:06
program, and I'm leading AI at a robotics startup, Collaborative
1:09
Robotics, that are working on some general purpose robots,
1:13
somewhat like [INAUDIBLE].
1:14
And I'm very passionate about robotics and building FSG
1:18
learning algorithms.
1:19
My research interests are in reinforcement learning,
1:21
computer vision, and remodeling, and I
1:23
have a bunch of publications in robotics,
1:25
autonomous driving, and other areas.
1:28
My undergrad was at Cornell.
1:29
If someone is from Cornell, nice to [INAUDIBLE].
1:33
So I'm Stephen, currently a first-year CS PhD here.
1:37
Previously did my master's at CMU and undergrad at Waterloo.
1:40
I'm mainly into NLP research, anything involving language
1:43
and text, but more recently, I've
1:45
been getting more into computer vision as well as [INAUDIBLE]
1:48
And just some stuff I do for fun, a lot of music
1:51
stuff, mainly piano.
1:52
Some self-promo of what I post a lot on my Insta, YouTube,
1:55
and TikTok, so if you guys want to check it out.
1:58
My friends and I are also starting a Stanford piano club,
2:01
so if anybody's interested, feel free to email
2:04
or DM me for details.
2:07
Other than that, martial arts, bodybuilding, and huge fan
2:11
of k-dramas, anime, and occasional gamer.
2:14
[LAUGHS]
2:18
OK, cool.
2:19
Yeah, so my name is Rylan.
2:20
Instead of talking about myself, I just
2:21
want to very briefly say that I'm super
2:23
excited to take this class.
2:24
I took it the last time-- sorry-- to teach this.
2:26
Excuse me.
2:26
I took it the last time it was offered.
2:28
I had a bunch of fun.
2:30
I thought we brought in a really great group of speakers
2:32
last time.
2:33
I'm super excited for this offering.
2:35
And yeah, I'm thankful that you're all here,
2:37
and I'm looking forward to a really fun quarter together.
2:39
Thank you.
2:39
Yeah, so fun fact, Rylan was the most outspoken student
2:42
last year.
2:43
And so if someone wants to become an instructor next year,
2:45
you know what to do.
2:46
[LAUGHTER]
2:50
OK, cool.
2:53
Let's see.
2:54
OK, I think we have a few minutes.
2:56
So what we hope you will learn in this class is, first of all,
2:59
how do transformers work, how they
3:02
are being applied, just beyond NLP,
3:04
and nowadays, like they are pretty [INAUDIBLE]
3:06
them everywhere in AI and machine learning.
3:10
And what are some new and interesting directions
3:12
of research in these topics.
3:17
Cool, so this class is just introductory.
3:19
So we're just talking about the basics of transformers,
3:22
introducing them, talking about the self-attention mechanism
3:24
on which they're founded.
3:26
And we'll do more of a deep dive on models like BERT
3:30
and GPT, stuff like that.
3:32
So with that, happy to get started.
3:35
OK, so let me start with presenting the attention
3:38
timeline.
3:40
Attention all started with this one paper.
3:43
"Attention Is All You Need" by Vaswani et al. in 2017.
3:46
That was the beginning of transformers.
3:49
Before that, we had the prehistoric era,
3:51
where we had models like RNNs, LSTMs,
3:55
and simple attention mechanisms that didn't work
3:57
or [INAUDIBLE].
3:59
Starting 2017, we saw this explosion of transformers
4:02
into NLP, where people started using it for everything.
4:07
I even heard this quote from Google.
4:08
It's like our performance increased every time
4:10
we [INAUDIBLE]
4:11
[CHUCKLES]
4:15
For the [INAUDIBLE] after 2018 to 2020,
4:17
we saw this explosion of transformers
4:18
into other fields like vision, a bunch of other stuff,
4:23
and like biology as a whole.
4:25
And then last year, 2021 was the start
4:28
of the generative era, where we got a lot of generative modeling,
4:31
starting with models like Codex, GPT, DALL-E,
4:35
Stable Diffusion, and a lot of things
4:37
happening in generative modeling.
4:40
And we started scaling up in AI.
4:44
And now, the present.
4:45
So this is 2022 and the start of '23.
4:49
And now we have models like ChatGPT, Whisper,
4:53
a bunch of others.
4:54
And we're scaling onwards without letting up,
4:57
so that's great.
4:58
So that's the future.
5:01
So going more into this, so once there were RNNs.
5:06
So we had Seq2Seq models, LSTMs, GRU.
5:10
What worked there was that they were good at encoding history,
5:13
but what did not work was they didn't encode long sequences
5:17
and they were very bad at encoding context.
5:21
So consider this example.
5:24
Consider trying to predict the last word in the text,
5:27
"I grew up in France, dot, dot, dot.
5:29
I speak fluent Dutch."
5:31
Here, you need to understand the context for it
5:33
to predict French, and the attention mechanism
5:36
is very good at that, whereas if you're just using LSTMs,
5:39
it doesn't work that well here.
5:42
Another thing transformers are good at,
5:46
more based on content, is context prediction,
5:50
like finding attention maps.
5:52
If I have something like a word like "it,"
5:56
what noun does it refer to?
5:57
And we can give appropriate attention
6:01
to one of the possible activations.
6:05
And this works better than existing mechanisms.
6:10
OK, so where we were in 2021, we were on the verge of takeoff.
6:16
We were starting to realize the potential of transformers
6:18
in different fields.
6:20
We solved a lot of long sequence problems
6:23
like protein folding, AlphaFold, offline RL.
6:28
We started to see few-shots, zero-shot generalization.
6:31
We saw multimodal tasks and applications
6:34
like generating images from language.
6:36
So that was DALL-E. And it feels like [INAUDIBLE]..
6:43
And this was also a talk on transformers
6:45
that you can watch on YouTube.
6:48
Yeah, cool.
6:51
And this is where we were going from 2021 to 2022,
6:55
which is we have gone from the version of [INAUDIBLE]
6:58
And now, we are seeing unique applications
7:00
in audio generation, art, music, storytelling.
7:03
We are starting to see these new capabilities
7:05
like commonsense, logical reasoning,
7:08
mathematical reasoning.
7:09
We are also able to now get human alignment
7:12
and interaction.
7:13
They're able to use reinforcement learning
7:15
from human feedback.
7:16
That's how ChatGPT is trained to perform really well.
7:19
We have a lot of mechanisms for controlling
7:21
toxicity, bias, and ethics now.
7:24
And there are a lot of also, a lot
7:26
of developments in other areas like diffusion models.
7:30
Cool.
7:33
So the future is a spaceship, and we are all
7:35
excited about it.
7:39
And there are a lot more applications
7:40
that we can enable, and it'll be great
7:44
if you can see transformers also up there.
7:47
One big example is video understanding and generation.
7:49
That is something that everyone is interested in,
7:51
and I'm hoping we'll see a lot of models
7:53
in this area this year, also, finance, business.
7:59
I'll be very excited to see GPT author a novel,
8:02
but we need to solve very long sequence modeling.
8:04
And most transformer models are still
8:07
limited to 4,000 tokens or something like that.
8:09
So we need to make them generalize much
8:13
better on long sequences.
8:17
We also want to have generalized agents
8:19
that can do a lot of multitask, multi-input predictions
8:27
like Gato.
8:28
And so I think we will see more of that, too.
8:31
And finally, we also want domain specific models.
8:37
So you might want a GPT model for, let's
8:39
say, maybe your health.
8:41
So that could be like a DoctorGPT model.
8:43
You might have a LawyerGPT model that's
8:45
trained on only law data.
8:46
So currently, we have GPT models that are trained on everything.
8:49
But we might start to see more niche models that
8:51
are good at one task.
8:53
And we could have a mixture of experts,
8:55
so it's like, you can think this is a--
8:57
how you'd normally consult an expert,
8:58
you'll have expert AI models.
9:00
And you can go to a different AI model for your different needs.
9:05
There are still a lot of missing ingredients
9:07
to make this all successful.
9:10
The first of all is external memory.
9:12
We are already starting to see this with the models
9:15
like ChatGPT, where the interactions are short-lived.
9:18
There's no long-term memory, and they
9:20
don't have ability to remember or store
9:23
conversations for long-term.
9:25
And this is something you want to fix.
9:29
Second is reducing the computation complexity.
9:32
So the attention mechanism is quadratic in the sequence
9:36
length, which is slow.
9:37
And we want to reduce it and make it faster.
9:42
Another thing we want to do is we
9:44
want to enhance the controllability of these models
9:46
since a lot of these models can be stochastic.
9:48
And we want to be able to control what sort of outputs
9:51
we get from them.
9:52
And you might have experienced this with ChatGPT:
9:54
if you just refresh, you get a different output each time.
9:56
But you might want to have a mechanism that controls
9:59
what sort of things you get.
10:01
And finally, we want to align our state-of-the-art language
10:04
models with how the human brain works.
10:06
And we are seeing this surge, but we still
10:09
need more research on seeing how they can be made more informed.
10:12
Thank you.
10:14
Great, hi.
10:16
Yes, I'm excited to be here.
10:18
I live very nearby, so I got the invite to come to class.
10:21
And I was like, OK, I'll just walk over.
10:23
But then I spent like 10 hours on the slides,
10:25
so it wasn't as simple.
10:28
So yeah, I'm going to talk about transformers.
10:30
I'm going to skip the first two over there.
10:32
I'm not going to talk about those.
10:34
We'll talk about that one just to simplify the lecture
10:36
since we don't have time.
10:39
OK, so I wanted to provide a little bit of context
10:41
on why does this transformers class even exist.
10:44
So a little bit of historical context.
10:45
I feel like Bilbo over there.
10:47
I enjoy, like, telling you guys about this.
10:50
I don't know if you guys saw Lord of the Rings.
10:52
And basically, I joined AI in roughly 2012, the full course,
10:56
so maybe a decade ago.
10:58
And back then, you wouldn't even say
10:59
that you joined AI by the way.
11:00
That was like a dirty word.
11:02
Now, it's OK to talk about, but back then, it
11:04
was not even deep learning.
11:05
It was machine learning.
11:06
That was the term we would use if you were serious.
11:08
But now, now, AI is OK to use, I think.
11:11
So basically, do you even realize
11:13
how lucky you are potentially entering
11:15
this area in roughly 2023?
11:17
So back then, in 2011 or so when I was working specifically
11:20
on computer vision, your pipelines looked like this.
11:25
So if you wanted to classify some images,
11:28
you would go to a paper, and I think this is representative.
11:30
You would have three pages in the paper describing
11:32
all kinds of a zoo, of kitchen sink,
11:34
of different kinds of features, descriptors.
11:36
And you would go to a poster session
11:38
and in computer vision conference,
11:40
and everyone would have their favorite feature descriptor
11:41
that they're proposing.
11:42
And it's totally ridiculous, and you
11:44
would take notes on which one you should incorporate
11:45
into your pipeline because you would extract all of them,
11:48
and then you would put an SVM on top.
11:49
So that's what you would do.
11:51
So there's two pages.
11:52
Make sure you get your sparse SIFT histograms,
11:54
your SSIMs, your color histograms, textons,
11:56
tiny images.
11:57
And don't forget the geometry specific histograms.
11:59
All of them have basically complicated code by themselves.
12:02
So you're collecting code from everywhere and running it,
12:04
and it was a total nightmare.
12:06
So on top of that, it also didn't work.
12:10
[LAUGHTER]
12:11
So this would be, I think, it represents the prediction
12:14
from that time.
12:15
You would just get predictions like this once in a while,
12:17
and you'd be like, you just shrug your shoulders
12:19
like that just happens once in a while.
12:20
Today, you would be looking for a bug.
12:23
And worse than that, every single chunk of AI
12:30
had their own completely separate vocabulary
12:32
that they work with.
12:33
So if you go to NLP papers, those papers
12:36
would be completely different.
12:38
So you're reading the NLP paper, and you're like,
12:40
what is this part of speech tagging,
12:42
morphological analysis, and tactic parsing,
12:44
co-reference resolution?
12:46
What is MPBTKJ?
12:48
And you're confused.
12:49
So the vocabulary and everything was completely different.
12:51
And you couldn't read papers, I would
12:52
say, across different areas.
12:55
So now, that changed a little bit
12:56
starting in 2012 when Alex Krizhevsky and colleagues basically
13:02
demonstrated that if you scale a large neural network
13:05
on large data set, you can get very strong performance.
13:08
And so up till then, there was a lot of focus on algorithms.
13:10
But this showed that actually neural nets scale very well.
13:13
So you need to now worry about compute and data,
13:15
and you can scale it up.
13:16
It works pretty well.
13:17
And then that recipe actually did copy paste
13:19
across many areas of AI.
13:21
So we start to see neural networks pop up everywhere
13:23
since 2012.
13:25
So we saw them in computer vision, and NLP, and speech,
13:28
and translation in RL and so on.
13:30
So everyone started to use the same kind of modeling
13:32
toolkit, modeling framework.
13:33
And now when you go to NLP, and you start reading papers there,
13:36
in machine translation, for example,
13:38
this is a sequence to sequence paper
13:40
which we'll come back to in a bit.
13:41
You start to read those papers, and you're like, OK,
13:44
I can recognize these words.
13:45
Like there's a neural network.
13:46
There's some parameters.
13:47
There's an optimizer, and it starts to read things
13:50
that you know of.
13:50
So that decreased tremendously the barrier to entry
13:54
across the different areas.
13:56
And then, I think, the big deal is
13:57
that when the transformer came out in 2017,
14:00
it's not even just that the toolkits and the neural networks
14:02
were similar-- it's that literally the architectures
14:05
converged to like one architecture that you
14:07
copy paste across everything seemingly.
14:10
So this was kind of an unassuming machine translation
14:12
paper at the time, proposing the transformer architecture.
14:15
But what we found since then is that you can just basically
14:17
copy paste this architecture and use it everywhere.
14:21
And what's changing is the details of the data,
14:23
and the chunking of the data, and how you feed it in.
14:26
And that's a caricature, but it's
14:28
kind of like a correct first order statement.
14:29
And so now, papers are even more similar looking
14:32
because everyone's just using transformer.
14:34
And so this convergence was remarkable to watch
14:38
and unfolded over the last decade.
14:40
And it's pretty crazy to me.
14:42
What I find interesting is I think
14:44
this is some kind of a hint that we're maybe converging
14:46
to something that maybe the brain is doing
14:48
because the brain is very homogeneous and uniform
14:50
across the entire sheet of your cortex.
14:52
And OK, maybe some of the details are changing,
14:54
but those feel like hyperparameters
14:56
like a transformer.
14:57
But your auditory cortex and your visual cortex
14:59
and everything else looks very similar.
15:01
And so maybe we're converging to some kind
15:02
of a uniform powerful learning algorithm here.
15:06
Something like that, I think, is interesting and exciting.
15:09
OK, so I want to talk about where the transformer came
15:11
from briefly, historically.
15:12
So I want to start in 2003.
15:15
I like this paper quite a bit.
15:17
It was the first popular application of neural networks
15:21
to the problem of language modeling,
15:22
so predicting in this case, the next word
15:24
in the sequence, which allows you to build
15:26
generative models over text.
15:27
And in this case, they were using multi-layer perceptron,
15:29
so very simple neural net.
15:30
The neural nets took three words and predicted the probability
15:33
distribution for the fourth word in a sequence.
15:36
So this was well and good at this point.
15:39
Now, over time, people started to apply this
15:41
to machine translation.
15:43
So that brings us to sequence to sequence paper
15:45
from 2014 that was pretty influential,
15:48
and the big problem here was OK, we
15:49
don't just want to take three words and predict the fourth.
15:52
We want to predict how to go from an English sentence
15:55
to a French sentence.
15:56
And the key problem was OK, you can
15:58
have arbitrary number of words in English and arbitrary number
16:00
of words in French, so how do you
16:03
get an architecture that can process
16:04
this variably sized input?
16:06
And so here they used an LSTM, and there's basically
16:10
two chunks of this, which are covered by the slide, by this.
16:16
But basically have an encoder LSDM on the left,
16:19
and it just consumes one word at a time
16:22
and builds up a context of what it has read.
16:24
And then that acts as a conditioning vector
16:26
to the decoder RNN or LSTM.
16:29
That basically goes chunk, chunk,
16:30
chunk for the next word in the sequence,
16:32
translating the English to French or something like that.
16:35
Now, the big problem with this, that people identified,
16:37
I think, very quickly and tried to resolve
16:40
is that there's what's called this encoder bottleneck.
16:43
So this entire English sentence that we are trying to condition
16:46
on is packed into a single vector
16:48
that goes from the encoder to the decoder.
16:50
And so this is just too much information
16:52
to potentially maintain in a single vector,
16:54
and that didn't seem correct.
16:55
And so people were looking around for ways
16:57
to alleviate the encoder bottleneck, as it
17:00
was called at the time.
17:02
And so that brings us to this paper,
17:03
Neural Machine Translation by Jointly Learning
17:05
to Align and Translate.
17:07
And here, just quoting from the abstract, "in this paper,
17:11
we conjectured that the use of a fixed length vector
17:13
is a bottleneck in improving the performance
17:15
of the basic encoder-decoder architecture
17:17
and propose to extend this by allowing
17:19
the model to automatically soft search
17:21
for parts of the source sentence that are relevant to predicting
17:24
a target word without having to form
17:28
these parts or hard segments exclusively."
17:30
So this was a way to look back to the words that
17:34
are coming from the encoder.
17:35
And it was achieved using this soft search.
17:38
So as you are decoding in the words
17:42
here, while you are decoding them,
17:44
you are allowed to look back at the words
17:45
at the encoder via this soft attention mechanism proposed
17:49
in this paper.
17:50
And so this paper, I think, is the first time that I saw,
17:52
basically, attention.
17:55
So your context vector that comes from the encoder
17:58
is a weighted sum of the hidden states
18:01
of the words in the encoding.
18:05
And then the weights of this sum come
18:07
from a softmax that is based on these compatibilities
18:10
between the current state as you're decoding
18:13
and the hidden states generated by the encoder.
18:15
And so this is the first time that really you
18:17
start to look at it, and this is the current modern equations
18:22
of the attention.
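Written out as a rough sketch (this notation is mine; the slide's exact symbols aren't in the transcript): the context vector at decoding step t is a softmax-weighted sum of the encoder hidden states, with weights given by compatibility scores against the current decoder state.
```latex
% Sketch of the soft attention described here (notation is mine, not from the slide).
% s_{t-1}: current decoder state; h_i: encoder hidden states; c_t: context vector.
\[
  e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
  \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
  c_t = \sum_i \alpha_{t,i}\, h_i .
\]
```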
18:23
And I think this was the first paper that I saw it in.
18:25
It's the first time that the word
18:27
attention is used, as far as I know, to describe this mechanism.
18:32
So I actually tried to dig into the details of the history
18:34
of the attention.
18:35
So the first author here, Dzmitry, I
18:38
had an email correspondence with him,
18:40
and I basically sent him an email.
18:41
I'm like, Dzmitry, this is really interesting.
18:43
Just rumors have taken over.
18:44
Where did you come up with the soft attention
18:45
mechanism that ends up being the heart of the transformer?
18:48
And to my surprise, he wrote me back this massive email, which
18:52
was really fascinating.
18:52
So this is an excerpt from that email.
18:57
So basically, he talks about how he was looking for a way
18:59
to avoid this bottleneck between the encoder and decoder.
19:02
He had some ideas about cursors that
19:04
traverse the sequences that didn't quite work out.
19:06
And then here, "so one day, I had this thought
19:08
that it would be nice to enable the decoder
19:10
RNN to learn to search where to put the cursor in the source
19:13
sequence.
19:14
This was sort of inspired by translation exercises
19:16
that learning English in my middle school involved.
19:21
Your gaze shifts back and forth between source and target,
19:23
sequence as you translate."
19:24
So literally, I thought that this was kind of interesting,
19:27
that he's not a native English speaker,
19:28
and here, that gave him an edge in this machine translation
19:31
that led to attention and then led to transformer.
19:34
So that's really fascinating.
19:37
"I expressed a soft search a softmax
19:38
and then weighted averaging of the [INAUDIBLE] states.
19:40
And basically, to my great excitement,
19:43
this worked from the very first try."
19:45
So really, I think, interesting piece of history.
19:48
And as it later turned out, the name RNN Search
19:51
was kind of lame, so the better name attention came
19:54
from Yoshua on one of the final passes
19:57
as they went over the paper.
19:58
So maybe Attention is All You Need
20:00
would have been called RNN Search is All You Need,
20:03
but we have Yoshua Bengio to thank
20:05
for a little bit of better name, I would say.
20:07
So apparently, that's the history
20:08
of this, which I thought was interesting.
20:11
OK, so that brings us to 2017, which is Attention
20:13
is All You Need.
20:14
So this attention component, which
20:16
in Dzmitry's paper was just one small segment,
20:19
and there's all this bidirectional RNN encoder
20:21
and decoder, and this Attention Is All You Need paper is saying,
20:25
OK, you can actually delete everything.
20:26
What's making this work very well
20:28
is just attention by itself.
20:29
And so delete everything, keep attention.
20:32
And then what's remarkable about this paper actually is usually,
20:35
you see papers that are very incremental.
20:36
They add one thing, and they show that it's better.
20:39
But I feel like Attention is All You
20:41
Need was like a mix of multiple things at the same time.
20:44
They were combined in a very unique way,
20:46
and then also achieve a very good local minimum
20:49
in the architecture space.
20:50
And so to me, this is really a landmark paper
20:52
that is quite remarkable and, I think,
20:55
had quite a lot of work behind the scenes.
20:58
So delete all the RNN, just keep attention.
21:01
Because attention operates over sets--
21:03
and I'm going to go to this in a second--
21:05
you now need to positionally encode your inputs
21:07
because attention doesn't have the notion of space by itself.
21:14
I have to be very careful.
21:17
They adopted this residual network structure
21:19
from ResNets.
21:21
They interspersed attention with multi-layer perceptrons.
21:24
They used layer norms, which came from a different paper.
21:27
They introduced the concept of multiple heads of attention
21:29
that were applied in parallel.
21:30
And they gave us, I think, like a fairly good set
21:33
of hyperparameters that to this day are used.
21:35
So the expansion factor in the multi-layer perceptron goes up
21:39
by 4X--
21:40
and we'll go into a bit more detail--
21:41
and this 4X has stuck around.
21:43
And I believe there's a number of papers
21:44
that try to play with all kinds of little details
21:47
of the transformer, and nothing sticks because this is actually
21:50
quite good.
21:51
The only thing to my knowledge that didn't stick
21:54
was this reshuffling of the layer norms
21:56
to go into the prenorm version where here you
21:59
see the layer norms are after the multiheaded attention feed
22:01
forward.
22:02
They just put them before instead.
22:04
So just reshuffling of layer norms, but otherwise,
22:06
the GPTs and everything else that you're seeing today
22:08
is basically the 2017 architecture from 5 years ago.
22:11
And even though everyone is working on it,
22:13
it's been proven remarkably resilient,
22:15
which I think is real interesting.
22:17
There are innovations that, I think,
22:18
have been adopted also in positional encoding.
22:21
It's more common to use different rotary and relative
22:24
positional encoding and so on.
22:25
So I think there have been changes, but for the most part,
22:28
it's proven very resilient.
22:31
So really quite an interesting paper.
22:32
Now, I wanted to go into the attention mechanism.
22:36
And I think, the way I interpret it is not similar to the ways
22:43
that I've seen it presented before.
22:44
So let me try a different way of how I see it.
22:47
Basically, to me, attention is kind of like the communication
22:49
phase of the transformer, and the transformer
22:52
interweaves two phases of the communication phase, which
22:55
is the multi-headed attention, and the computation
22:57
stage, which is this multilayered perceptron
23:00
or [INAUDIBLE].
23:01
So in the communication phase, it's
23:03
really just a data dependent message
23:05
passing on directed graphs.
23:07
And you can think of it as OK, forget everything
23:09
with machine translation, everything.
23:10
Let's just-- we have directed graphs.
23:13
At each node, you are storing a vector.
23:16
And then let me talk now about the communication
23:18
phase of how these vectors talk to each other
23:20
and this directed graph.
23:21
And then the compute phase later is just
23:23
a multi-layer perceptron, which then basically acts on every node
23:27
individually.
23:28
But how do these nodes talk to each other
23:30
in this directed graph?
23:32
So I wrote like some simple Python--
23:36
I wrote this in Python basically to create
23:39
one round of communication of using attention
23:44
as the message passing scheme.
23:46
So here, a node has this private data vector,
23:51
which you can think of as private information
23:53
to this node.
23:54
And then it can also emit a key, a query, and a value.
23:57
And simply, that's done by linear transformation
24:00
from this node.
24:01
So the key is what are the things that I am--
24:07
sorry.
24:07
The query is what are the things that I'm looking for?
24:10
The key is what are the things that I have?
24:12
And the value is what are the things that I will communicate?
24:15
And so then when you have your graph that's
24:16
made up of nodes in some random edges, when you actually
24:19
have these nodes communicating, what's happening is
24:21
you loop over all the nodes individually
24:23
in some random order, and you're at some node,
24:27
and you get the query vector q, which
24:29
is, I'm a node in some graph, and this
24:32
is what I'm looking for.
24:33
And so that's just achieved via this linear transformation
24:36
here.
24:36
And then we look at all the inputs that point to this node,
24:39
and then they broadcast what are the things that I have,
24:42
which is their keys.
24:44
So they broadcast the keys.
24:45
I have the query, then those interact by dot product
24:49
to get scores.
24:51
So basically, simply by doing dot product,
24:53
you get some unnormalized weighting
24:55
of the interestingness of all of the information in the nodes
24:59
that point to me and to the things I'm looking for.
25:02
And then when you normalize that with softmax,
25:03
so it just sums to 1, you basically just
25:06
end up using those scores, which now sum to 1 in our probability
25:09
distribution, and you do a weighted sum of the values
25:13
to get your update.
25:15
So I have a query.
25:17
They have keys, dot products to get interestingness or like
25:21
affinity, softmax to normalize it, and then
25:24
weighted sum of those values flow to me and update me.
25:27
And this is happening for each node individually.
25:29
And then we update at the end.
25:30
And so this kind of a message passing scheme
25:32
is at the heart of the transformer.
25:35
And it happens in the more vectorized batched way
25:40
that is more confusing and is also interspersed with layer
25:44
norms and things like that to make the training behave
25:46
better.
25:47
But that's roughly what's happening in the attention
25:49
mechanism, I think, on a high level.
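As a minimal sketch of that message-passing view (my own toy code, not the exact snippet shown in the lecture; the names Node and communicate are illustrative):
```python
# One round of attention as message passing on a directed graph (toy sketch).
# Each node holds a private vector and emits a key, query, and value via linear maps.
import numpy as np

class Node:
    def __init__(self, dim):
        self.data = np.random.randn(dim)        # private information stored at this node
        self.wk = np.random.randn(dim, dim)      # key projection
        self.wq = np.random.randn(dim, dim)      # query projection
        self.wv = np.random.randn(dim, dim)      # value projection
    def key(self):   return self.wk @ self.data  # "what do I have?"
    def query(self): return self.wq @ self.data  # "what am I looking for?"
    def value(self): return self.wv @ self.data  # "what will I communicate?"

def communicate(node, inputs):
    """Update one node from the nodes that point to it."""
    q = node.query()
    scores = np.array([inp.key() @ q for inp in inputs])  # dot-product affinities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax: scores now sum to 1
    node.data = sum(w * inp.value() for w, inp in zip(weights, inputs))

nodes = [Node(dim=4) for _ in range(3)]
communicate(nodes[2], inputs=nodes[:3])  # node 2 attends to nodes 0, 1, and itself
```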
25:53
So yeah, so in the communication phase of the transformer, then
25:59
this message passing scheme happens
26:00
in every head in parallel and then in every layer in series
26:06
and with different weights each time.
26:08
And that's it as far as the multi-headed attention goes.
26:13
And so if you look at these encoder-decoder models,
26:15
you can think of it then in terms of the connectivity
26:18
of these nodes in the graph.
26:19
You can think of it as like, OK, all these tokens that
26:21
are in the encoder that we want to condition on,
26:23
they are fully connected to each other.
26:25
So when they communicate, they communicate fully
26:28
when you calculate their features.
26:30
But in the decoder, because we are
26:32
trying to have a language model, we
26:33
don't want to have communication for future tokens
26:35
because they give away the answer at this step.
26:38
So the tokens in the decoder are fully connected
26:40
from all the encoder states, and then they
26:43
are also fully connected from everything that is decoding.
26:46
And so you end up with this triangular structure
26:49
in the data graph.
26:50
But that's the message passing scheme
26:52
that this basically implements.
26:54
And then you have to be also a little bit careful because
26:57
in the cross attention here with the decoder,
26:59
you consume the features from the top of the encoder.
27:01
So think of it as in the encoder,
27:03
all the nodes are looking at each other,
27:05
all the tokens are looking at each other many, many times.
27:08
And they really figure out what's in there,
27:09
and then the decoder is looking only at the top nodes.
27:14
So that's roughly the message passing scheme.
27:16
I was going to go into more of an implementation
27:18
of a transformer.
27:19
I don't know if there's any questions about this.
27:23
[INAUDIBLE] self-attention and multi-headed attention,
27:26
but what is the advantage of [INAUDIBLE]??
27:30
Yeah, so self-attention and multi-headed attention, so
27:35
the multi-headed attention is just this attention scheme,
27:38
but it's just applied multiple times in parallel.
27:40
Multiple heads just means independent applications
27:42
of the same attention.
27:44
So this message passing scheme basically just
27:47
happens in parallel multiple times
27:49
with different weights for the query, key, and value.
27:52
So you can almost look at it like in parallel, I'm
27:55
looking for, I'm seeking different kinds of information
27:57
from different nodes.
27:59
And I'm collecting it all in the same node.
28:01
It's all done in parallel.
28:03
So heads is really just copy-paste in parallel.
28:06
And layers are copy-paste but in series.
28:12
Maybe that makes sense.
28:15
And self-attention, when it's self-attention,
28:18
what it's referring to is that each node here
28:23
produces its keys, queries, and values from itself.
28:23
So as I described it here, this is really self-attention
28:25
because every one of these nodes produces
28:27
a key, query, and value from this individual node.
28:30
When you have cross-attention, you have one cross-attention
28:33
here, coming from the encoder.
28:36
That just means that the queries are still
28:38
produced from this node, but the keys and the values
28:42
are produced as a function of nodes that
28:44
are coming from the encoder.
28:48
So I have my queries because I'm trying to decode some--
28:52
the fifth word in the sequence.
28:53
And I'm looking for certain things
28:55
because I'm the fifth word.
28:56
And then the keys and the values in terms
28:58
of the source of information that could answer my queries
29:01
can come from the previous nodes in the current decoding
29:04
sequence or from the top of the encoder.
29:06
So all the nodes that have already seen all
29:09
of the encoding tokens many, many times can now broadcast
29:12
what they contain in terms of information.
29:14
So I guess, to summarize, the self-attention is--
29:18
sorry, cross-attention and self-attention
29:20
only differ in where the keys and the values come from.
29:24
Either the keys and values are produced from this node,
29:28
or they are produced from some external source like an encoder
29:31
and the nodes over there.
29:33
But algorithmically, it's the same mathematical operations.
29:39
Question.
29:39
Yeah, OK.
29:40
So two questions for you.
29:41
First question is, in the message passing [INAUDIBLE]
29:56
So think of-- so each one of these nodes is a token.
30:04
I guess they don't have a very good picture of it
30:06
in the transformer.
30:06
But this node here could represent the third word
30:14
in the output in the decoder, and in the beginning,
30:19
it is just the embedding of the word.
30:27
And then, OK, I have to think through this analogy
30:30
a little bit more.
30:31
I came up with it this morning.
30:32
[LAUGHTER]
30:34
[INAUDIBLE]
30:39
What example of instantiation [INAUDIBLE] nodes
30:45
as in in blocks were embedding?
30:50
These nodes are basically the vectors.
30:53
I'll go to an implementation.
30:54
I'll go to the implementation, and then maybe I'll
30:56
make the connections to the graph.
30:58
So let me try to first go to-- let me now go to,
31:01
with this intuition in mind, at least,
31:03
to a nanoGPT, which is a concrete implementation
31:05
of a transformer that is very minimal.
31:06
So I worked on this over the last few days,
31:08
and here it is reproducing GPT-2 on OpenWebText.
31:11
So it's a pretty serious implementation that reproduces
31:14
GPT-2, I would say, provided enough compute--
31:17
this was one node of 8 GPUs for 38 hours or something
31:21
like that, if I remember correctly.
31:22
And it's very readable.
31:23
It's 300 lines, so everyone can take a look at it.
31:27
And yeah, let me basically briefly step through it.
31:30
So let's try to have a decoder-only transformer.
31:34
So what that means is that it's a language model.
31:36
It tries to model the next word in the sequence
31:39
or the next character in the sequence.
31:41
So the data that we train on this
31:43
is always some kind of text.
31:44
So here's some fake Shakespeare.
31:45
Sorry, this is real Shakespeare.
31:47
We're going to produce fake Shakespeare.
31:48
So this is called a Tiny Shakespeare
31:50
dataset, which is one of my favorite toy datasets.
31:52
You take all of Shakespeare, concatenate it,
31:54
and it's 1 megabyte file, and then
31:55
you can train language models on it
31:56
and get infinite Shakespeare, if you like,
31:58
which I think is kind of cool.
31:59
So we have a text.
32:00
The first thing we need to do is we
32:02
need to convert it to a sequence of integers
32:05
because transformers natively process--
32:09
you can't plug text into a transformer.
32:10
You need to somehow encode it.
32:11
So the way that encoding is done is
32:13
we convert, for example, in the simplest case,
32:15
every character gets an integer, and then instead of "hi
32:18
there," we would have this sequence of integers.
32:21
So then you can encode every single character as an integer
32:25
and get a massive sequence of integers.
32:27
You just concatenate it all into one
32:29
large, long one-dimensional sequence.
32:31
And then you can train on it.
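A minimal sketch of that character-level encoding (illustrative code, not the exact nanoGPT data preparation):
```python
# Map every character to an integer and back (character-level tokenization sketch).
text = "hi there"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hi there"))           # [2, 3, 0, 5, 2, 1, 4, 1]
print(decode(encode("hi there")))   # "hi there"
```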
32:32
Now, here, we only have a single document.
32:34
In some cases, if you have multiple independent documents,
32:36
what people like to do is create special tokens,
32:38
and they intersperse those documents
32:40
with those special end of text tokens
32:42
that they splice in between to create boundaries.
32:46
But those boundaries actually don't have any modeling impact.
32:50
It's just that the transformer is supposed
32:52
to learn via backpropagation that the end of document
32:55
sequence means that you should wipe the memory.
33:00
OK, so then we produce batches.
33:02
So these batches of data just mean
33:04
that we go back to the one-dimensional sequence,
33:06
and we take out chunks of this sequence.
33:08
So say, if the block size is 8, then the block size indicates
33:13
the maximum length of context that your transformer will
33:17
process.
33:18
So if our block size is 8, that means
33:20
that we are going to have up to eight characters of context
33:23
to predict the ninth character in a sequence.
33:26
And the batch size indicates how many sequences in parallel
33:29
we're going to process.
33:30
And we want this to be as large as possible,
33:31
so we're fully taking advantage of the GPU
33:33
and the parallelism [INAUDIBLE]. So in this example,
33:36
we're doing 4 by 8 batches.
33:38
So every row here is an independent example
33:41
and then every row here is a small chunk of the sequence
33:47
that we're going to train on.
33:48
And then we have both the inputs and the targets
33:50
at every single point here.
33:52
So to fully spell out what's contained in a single 4
33:55
by 8 batch to the transformer--
33:57
I sort of compact it here--
33:59
so when the input is 47, by itself, the target is 58.
34:04
And when the input is the sequence 47, 58,
34:07
the target is one.
34:08
And when it's 47, 58, 1, the target is 51 and so on.
34:13
So actually, the single batch of examples, that 4 by 8,
34:15
actually has a ton of individual examples
34:17
that we are expecting a transformer
34:18
to learn on in parallel.
34:21
And so you'll see that the batches are learned
34:23
on completely independently, but the time dimension here along
34:28
horizontally is also trained on in parallel.
34:30
So your real batch size is more like B times T.
34:34
And it's just that the context grows linearly
34:37
for the predictions that you make along the T direction
34:41
in the model.
34:42
So this is all the examples that the model will learn from,
34:45
this single batch.
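A hedged sketch of how such a batch could be sliced out of the long one-dimensional sequence (the function name get_batch and the shapes are illustrative, not the exact nanoGPT code):
```python
# Slice a batch of inputs x and targets y out of one long 1-D sequence of integers.
import torch

def get_batch(data, block_size=8, batch_size=4):
    # pick batch_size random starting points in the sequence
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # targets = inputs shifted by one
    return x, y

data = torch.randint(0, 65, (1000,))   # stand-in for the encoded Tiny Shakespeare
x, y = get_batch(data)                 # x and y both have shape (4, 8)
```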
34:48
So now, this is the GPT class.
34:52
And because this is a decoder-only model,
34:55
so we're not going to have an encoder because there's no
34:58
English we're translating from--
34:59
we're not trying to condition in some other external
35:02
information.
35:02
We're just trying to produce a sequence of words that
35:05
follow each other or likely to.
35:08
So this is all PyTorch, and I'm going slightly faster
35:10
because I'm assuming people have taken 231 or something
35:12
along those lines.
35:15
But here in the forward pass, we take these indices,
35:19
and then we both encode the identity of the indices,
35:24
just via an embedding lookup table.
35:26
So every single integer, we index into a lookup table of
35:31
vectors in this, an embedding, and pull out
35:34
the word vector for that token.
35:38
And then because the transformer by itself
35:41
doesn't actually-- it processes a set natively.
35:43
So we need to also positionally encode these vectors
35:45
so that we basically have both the information
35:47
about the token identity and its place in the sequence from 1
35:51
to block size.
35:53
Now, the information about what and where
35:56
is combined additively, so the token embeddings
35:58
and the positional embeddings are just added exactly as here.
36:02
So then there's optional dropout,
36:06
this x here basically just contains
36:08
the set of words and their positions,
36:14
and that feeds into the blocks of transformer.
36:16
And we're going to look into what's block here.
36:18
But for here, for now, this is just a series
36:20
of blocks in a transformer.
36:22
And then in the end, there's a layer norm,
36:23
and then you're decoding the logits
36:26
for the next word or next integer in a sequence,
36:30
using the linear projection of the output of this transformer.
36:36
So LM head here is short for language model head.
36:36
It's just a linear function.
36:38
So basically, positionally encode all the words,
36:42
feed them into a sequence of blocks,
36:45
and then apply a linear layer to get the probability
36:47
distribution for the next character.
36:50
And then if we have the targets, which
36:51
we produced in the data loader--
36:54
and you'll notice that the targets are just
36:55
the inputs offset by one in time--
36:59
then those targets feed into a cross entropy loss.
37:01
So this is just a negative log likelihood
37:03
typical classification loss.
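Condensed into a sketch (simplified and renamed from what nanoGPT actually contains; TinyGPT and its fields are illustrative), the forward pass just described looks roughly like this:
```python
# Decoder-only forward pass: token embedding + positional embedding, blocks,
# final layer norm, linear head to logits, and optional cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, blocks):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # what: token identity
        self.pos_emb = nn.Embedding(block_size, n_embd)   # where: position 0..block_size-1
        self.blocks = nn.ModuleList(blocks)                # communicate/compute blocks
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)       # logits for the next token

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # what + where, added together
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:                            # targets are inputs shifted by one
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```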
37:04
So now let's drill into what's here in the blocks.
37:08
So these blocks that are applied sequentially,
37:11
there's, again, as I mentioned, this communicate
37:13
phase and the compute phase.
37:15
So in the communicate phase, all the nodes
37:17
get to talk to each other, and so these nodes are basically,
37:21
if our block size is 8, then we are
37:23
going to have eight nodes in this graph.
37:26
There's eight nodes in this graph.
37:28
The first node is pointed to only by itself.
37:30
The second node is pointed to by the first node and itself.
37:33
The third node is pointed to by the first two nodes
37:35
and itself, et cetera.
37:36
So there's eight nodes here.
37:38
So you apply-- there's a residual pathway and x.
37:42
You take it out.
37:43
You apply a layer norm, and then the self-attention
37:45
so that these communicate, these eight nodes communicate.
37:47
But you have to keep in mind that the batch is 4.
37:50
So because batch is 4, this is also applied--
37:54
so we have eight nodes communicating,
37:55
but there's a batch of four of them individually communicating
37:58
in one of those eight nodes.
37:59
There's no crisscross across the batch dimension, of course.
38:02
There's no batch anywhere luckily.
38:04
And then once they've exchanged information,
38:06
they are processed using the multi-layer perceptron.
38:09
And that's the compute phase.
38:12
And then also here we are missing the cross-attention
38:18
because this is a decoder-only model.
38:19
So all we have is this step here,
38:21
the multi-headed attention, and that's
38:22
this line, the communicate phase.
38:24
And then we have the feed forward, which is the MLP,
38:27
and that's the compute phase.
38:29
I'll take questions a bit later.
38:31
Then the MLP here is fairly straightforward.
38:34
The MLP is just individual processing on each node,
38:38
just transforming the feature representation at that node.
38:41
So applying a two-layer neural net
38:45
with a GELU nonlinearity, which is just
38:47
think of it as a ReLU or something like that.
38:49
It's just a nonlinearity.
38:51
And then MLP is straightforward.
38:53
I don't think there's anything too crazy there.
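A sketch of one such block with its MLP (illustrative, simplified from the real implementation; the attention module is passed in here and is discussed next):
```python
# One transformer block: communicate (attention), then compute (MLP),
# each behind a layer norm and a residual connection (pre-norm variant).
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # the 4x expansion factor that stuck around
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)                   # applied to every node independently

class Block(nn.Module):
    def __init__(self, n_embd, attn):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)
        self.attn, self.mlp = attn, MLP(n_embd)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))       # communicate phase
        x = x + self.mlp(self.ln2(x))        # compute phase
        return x
```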
38:55
And then this is the causal self-attention part,
38:57
the communication phase.
38:59
So this is like the meat of things
39:01
and the most complicated part.
39:03
It's only complicated because of the batching
39:06
and the implementation detail of how you mask the connectivity
39:10
in the graph so that you can't obtain
39:13
any information from the future when
39:15
you're predicting your token.
39:16
Otherwise, it gives away the information.
39:18
So if I'm the fifth token and if I'm the fifth position,
39:23
then I'm getting the fourth token coming into the input,
39:26
and I'm attending to the third, second, and first,
39:29
and I'm trying to figure out what is the next token.
39:32
Well then, in this batch, in the next element
39:34
over in the time dimension, the answer is at the input.
39:37
So I can't get any information from there.
39:40
So that's why this is all tricky,
39:41
but basically, in the forward pass,
39:45
we are calculating the queries, keys, and values based on x.
39:50
So these are the keys, queries, and values.
39:52
Here, when I'm computing the attention,
39:54
I have the queries matrix multiplying the keys.
39:58
So this is the dot product in parallel for all the queries
40:00
and all the keys in all the heads.
40:03
So I failed to mention that there's also
40:06
the aspect of the heads, which is also done all in parallel
40:08
here.
40:09
So we have the batch dimension, the time dimension,
40:10
and the head dimension, and you end up
40:12
with five-dimensional tensors, and it's all really confusing.
40:14
So I invite you to step through it later and convince yourself
40:17
that this is actually doing the right thing.
40:19
But basically, you have the batch dimension, the head
40:21
dimension and the time dimension,
40:23
and then you have features at them.
40:25
And so this is evaluating for all the batch elements, for all
40:28
the head elements, and all the time elements,
40:31
the simple Python that I gave you earlier, which is query
40:34
dot product key.
40:35
Then here, we do a masked_fill, and what this is doing
40:38
is it's basically clamping the attention between the nodes
40:44
that are not supposed to communicate to be negative
40:46
infinity.
40:47
And we're doing negative infinity
40:48
because we're about to softmax, and so negative infinity will
40:51
make basically the attention at those elements be zero.
40:54
And so here we are going to basically end up
40:56
with the weights, the affinities between these nodes, optional
41:03
dropout.
41:03
And then here, attention matrix multiply v is basically
41:08
the gathering of the information according to the affinities
41:10
we calculated.
41:11
And this is just a weighted sum of the values
41:14
at all those nodes.
41:15
So this matrix multiplies is doing that weighted sum.
41:19
And then transpose contiguous view
41:20
because it's all complicated and batched
41:22
in five-dimensional tensors, but it's really not
41:24
doing anything, optional drop out,
41:26
and then a linear projection back to the residual pathway.
41:30
So this is implementing the communication phase here.
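Here is a single-head sketch of that causal self-attention (the real code batches all the heads into one tensor; this simplified version is mine, for readability):
```python
# Causal self-attention, one head: project to k, q, v; dot-product affinities;
# mask out the future with -inf; softmax; weighted sum of values.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: token t may only look at tokens <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))          # (B, T, T) affinities
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))   # no peeking at the future
        att = F.softmax(att, dim=-1)                                   # rows sum to 1
        return att @ v                                                 # weighted sum of values
```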
41:34
Then you can train this transformer.
41:37
And then you can generate infinite Shakespeare.
41:41
And you will simply do this by--
41:43
because our block size is 8, we start with some token,
41:47
say like, I used in this case, you
41:50
can use something like a new line as the start token.
41:53
And then you communicate only to yourself
41:55
because there's a single node, and you
41:57
get the probability distribution for the first word
41:59
in the sequence.
42:00
And then you decode it for the first character
42:03
in the sequence.
42:04
You decode the character.
42:05
And then you bring back the character,
42:06
and you re-encode it as an integer.
42:08
And now, you have the second thing.
42:10
And so you get--
42:12
OK, we're at the first position, and this
42:14
is whatever integer it is, add the positional encodings,
42:17
goes into the sequence, goes in the transformer,
42:19
and again, this token now communicates
42:21
with the first token and its identity.
42:26
And so you just keep plugging it back.
42:28
And once you run out of the block size, which is eight,
42:31
you start to crop, because you can never
42:33
have block size more than eight in the way you've
42:34
trained this transformer.
42:35
So we have more and more context until eight.
42:37
And then if you want to generate beyond eight,
42:39
you have to start cropping because the transformer only
42:41
works for eight elements in time dimension.
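A sketch of that generation loop with cropping (illustrative, assuming a model with the TinyGPT-style forward sketched earlier that returns logits and loss):
```python
# Autoregressive generation: sample one token at a time, append it, feed it back in,
# and crop the context to the trained block size once it grows past it.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=8):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]               # crop to the trained context length
        logits, _ = model(idx_cond)
        probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)       # append and go again
    return idx
```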
42:43
And so all of these transformers in the [INAUDIBLE] setting
42:47
have a finite block size or context length,
42:50
and in typical models, this will be 1,024 tokens or 2,048
42:54
tokens, something like that.
42:56
But these tokens are usually like BPE tokens,
42:58
or SentencePiece tokens, or WordPiece tokens.
43:00
There's many different encodings.
43:02
So it's not like that long.
43:03
And so that's why, I think, [INAUDIBLE]..
43:05
We really want to expand the context size,
43:06
and it gets gnarly because the attention
43:08
is quadratic in the [INAUDIBLE] case.
43:11
Now, if you want to implement an encoder instead of decoder
43:16
attention,
43:18
then all you have to do is this [INAUDIBLE],
43:21
and you just delete that line.
43:23
So if you don't mask the attention,
43:25
then all the nodes communicate to each other,
43:27
and everything is allowed, and information
43:29
flows between all the nodes.
43:31
So if you want to have the encoder here, just delete.
43:35
All the encoder blocks will use attention
43:38
where this line is deleted.
43:39
That's it.
43:40
So you're allowing whatever-- this encoder might store say,
43:44
10 tokens, 10 nodes, and they are all
43:46
allowed to communicate to each other going up the transformer.
43:51
And then if you want to implement cross-attention,
43:53
so you have a full encoder-decoder transformer,
43:55
not just a decoder-only transformer or a GPT.
43:59
Then we need to also add cross-attention in the middle.
44:03
So here, there is a self-attention piece where all
44:05
the--
44:06
there's a self-attention piece, a cross-attention piece,
44:08
and this MLP.
44:09
And in the cross-attention, we need
44:12
to take the features from the top of the encoder.
44:14
We need to add one more line here,
44:16
and this would be the cross-attention instead of a--
44:20
I should have implemented it instead of just pointing,
44:22
I think.
44:23
But there will be a cross-attention line here.
44:25
So we'll have three lines because we
44:26
need to add another block.
44:28
And the queries will come from x but the keys
44:31
and the values will come from the top of the encoder.
44:35
And there will be basically information
44:36
flowing from the encoder, strictly
44:38
to all the nodes inside x.
44:41
And then that's it.
44:42
So it's a very simple modification
44:44
on the decoder attention.
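A sketch of that cross-attention modification (illustrative single-head code, mine rather than the lecture's): the queries come from the decoder nodes x, while the keys and values come from the encoder output.
```python
# Cross-attention, one head: q from the decoder, k and v from the encoder's top features.
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # from the decoder side
        self.key = nn.Linear(n_embd, head_size, bias=False)    # from the encoder side
        self.value = nn.Linear(n_embd, head_size, bias=False)  # from the encoder side

    def forward(self, x, enc_out):
        q = self.query(x)                       # what the decoder nodes are looking for
        k, v = self.key(enc_out), self.value(enc_out)
        att = F.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return att @ v                          # no causal mask: decoder may see all encoder nodes
```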
44:47
So you'll hear people talk that you have
44:49
a decoder-only model like GPT.
44:51
You can have an encoder-only model like BERT,
44:53
or you can have an encoder-decoder model
44:55
like say T5, doing things like machine translation.
44:59
And in BERT, you can't train it using this language modeling
45:04
setup that's autoregressive, where you're just
45:06
trying to predict next [INAUDIBLE] in the sequence.
45:07
You're training it doing slightly different objectives.
45:09
You're putting in the full sentence,
45:12
and, the full sentence is allowed to communicate fully.
45:14
And then you're trying to classify sentiment or something
45:16
like that.
45:18
So you're not trying to model the next token in the sequence.
45:21
So these are trained slightly differently
45:26
using masking and other denoising techniques.
45:31
OK.
45:32
So that's like the transformer.
45:34
I'm going to continue.
45:36
So yeah, maybe more questions.
45:38
[INAUDIBLE]
46:01
This is like we are enforcing these constraints on it
46:06
by just masking [INAUDIBLE]
46:12
So I'm not sure if I fully follow.
46:14
So there's different ways to look at this analogy,
46:16
but one analogy is you can interpret
46:18
this graph as really fixed.
46:20
It's just that every time we do the communicate,
46:22
we are using different weights.
46:23
You can look at it that way.
46:24
So if we have block size of eight in my example,
46:26
we would have eight nodes.
46:27
Here we have 2, 4, 6.
46:29
OK, so we'd have eight nodes.
46:30
They would be connected in--
46:33
you lay them out, and you only connect from left to right.
46:35
[INAUDIBLE]
46:42
Why would they connect-- usually,
46:44
the connections don't change as a function of the data
46:46
or something like that--
46:47
[INAUDIBLE]
47:00
I don't think I've seen a single example where
47:02
the connectivity changes dynamically
47:03
in the function data.
47:04
Usually, the connectivity is fixed.
47:05
If you have an encoder, and you're training a BERT,
47:07
you have how many tokens you want,
47:09
and they are fully connected.
47:11
And if you have a decoder-only model,
47:13
you have this triangular thing, and if you
47:15
have encoder-decoder, then you have
47:16
awkwardly two pools of nodes.
47:21
Yeah.
47:24
Go ahead.
47:25
[INAUDIBLE] I wonder, you know much more about this
47:45
than I know.
47:46
But do you have a sense of like if you ran [INAUDIBLE]
48:00
In my head, I'm thinking [INAUDIBLE] but then you also
48:08
have different things for one or more of [INAUDIBLE]----
48:13
Yeah, it's really hard to say, so that's
48:15
why I think this paper is so interesting because like, yeah,
48:17
usually, you'd see like the path,
48:18
and maybe they had path internally.
48:19
They just didn't publish it.
48:20
All you can see is things that didn't look like a transformer.
48:23
I mean, you have ResNets, which have lots of this.
48:26
But a ResNet would be like this, but there's
48:29
no self-attention component.
48:31
But the MLP is there kind of in a ResNet.
48:35
So a ResNet looks very much like this
48:37
except there's no-- you can use layer norms in ResNets,
48:40
I believe, as well.
48:41
Typically, sometimes, they can be batch norms.
48:43
So it is kind of like a ResNet.
48:45
It is like they took a ResNet, and they
48:47
put in a self-attention block in addition
48:50
to the preexisting MLP block, which
48:52
is kind of like convolutions.
48:53
And MLP was, strictly speaking, a convolution,
48:55
one by one convolution, but I think
48:59
the idea is similar in that MLP is just like a typical weights,
49:04
nonlinearity weights operation.
49:11
But I will say, yeah, this is kind of interesting
49:13
because a lot of work is not there,
49:15
and then they give you this transformer.
49:17
And then it turns out 5 years later,
49:18
it's not changed, even though everyone's trying to change it.
49:20
So it's interesting to me that it's like a package,
49:23
in like a package, which I think is really
49:25
interesting historically.
49:26
And I also talked to paper authors,
49:30
and they were unaware of the impact
49:32
that the transformer would have at the time.
49:33
So when you read this paper, actually, it's unfortunate
49:37
because this is the paper that changed everything,
49:39
but when people read it, it's like question marks
49:41
because it reads like a pretty random machine translation
49:45
paper.
49:46
It's like, oh, we're doing machine translation.
49:47
Oh, here's a cool architecture.
49:48
OK, great, good results.
49:51
It doesn't know what's going to happen.
49:53
[LAUGHS] And so when people read it today,
49:56
I think they're confused potentially.
50:00
I will have some tweets at the end,
50:02
but I think I would have renamed it
50:03
with the benefit of hindsight of like, well, I'll get to it.
50:08
[INAUDIBLE]
50:20
Yeah, I think that's a good question as well.
50:22
Currently, I mean, I certainly don't
50:24
love the autoregressive modeling approach.
50:27
I think it's kind of weird to sample a token
50:29
and then commit to it.
50:31
So maybe there are some ways, some hybrids
50:36
with diffusion as an example, which
50:38
I think would be really cool, or we'll
50:41
find some other ways to edit the sequences later but still
50:44
in an autoregressive framework.
50:47
But I think diffusion is like an up-and-coming modeling
50:49
approach that I personally find much more appealing.
50:51
When I sample text, I don't go chunk, chunk, chunk,
50:54
and commit.
50:55
I do a draft one, and then I do a better draft two.
50:58
And that feels like a diffusion process.
51:00
So that would be my hope.
51:05
OK, also a question.
51:07
So yeah, you'd think the [INAUDIBLE]
51:20
And then once we have the edge weights,
51:21
we just have to multiply it by the values,
51:23
and then you just [INAUDIBLE] it.
51:25
Yes, yeah, that's right.
51:27
And you think there's an analogy with graph neural networks
51:30
and they'll potentially--
51:32
I find graph neural networks kind of a confusing term
51:34
because, I mean, yeah, previously,
51:38
there was this notion of--
51:40
I feel like maybe today everything is a graph neural
51:42
network because a transformer is a graph neural network
51:44
processor.
51:45
The native representation that the transformer operates over
51:48
is sets that are connected by edges in a directed way.
51:51
And so that's the native representation, and then, yeah.
51:55
OK, I should go on because I still have 30 slides.
51:57
[INAUDIBLE]
52:08
Oh yeah, yeah, the square root of d, I think, is basically
52:11
like if you're initializing with random weights
52:14
setup from a [INAUDIBLE] as your dimension size grows,
52:17
so do your values; the variance grows.
52:19
And then your softmax will just become a one-hot vector.
52:23
So it's just a way to control the variance
52:25
and bring it to always be in a good range for softmax
52:28
with a nice, diffuse distribution.
52:31
OK, so it's almost like an initialization thing.
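A minimal numerical sketch of that scaling argument, assuming PyTorch and unit-variance random queries and keys (illustrative only):

```python
import torch

torch.manual_seed(0)
d = 512                      # head dimension
q = torch.randn(d)           # one query, unit-variance entries
keys = torch.randn(16, d)    # 16 keys, unit-variance entries

logits = keys @ q            # dot products have variance ~ d, i.e. std ~ sqrt(d) ~ 22
scaled = logits / d ** 0.5   # dividing by sqrt(d) brings the variance back to ~ 1

print(torch.softmax(logits, dim=-1).max())  # close to 1.0: softmax is nearly one-hot
print(torch.softmax(scaled, dim=-1).max())  # much smaller: a nicely diffuse distribution
```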
52:37
OK, so transformers have been applied
52:41
to all the other fields, and the way this was done
52:44
is, in my opinion, in ridiculous ways,
52:46
honestly because I was a computer vision person,
52:49
and you have ConvNets, and they make sense.
52:51
So what we're doing now with ViTs, as an example, is
52:53
you take an image and you chop it up into little squares.
52:56
And then those squares, literally,
52:57
feed into a transformer, and that's
52:59
it, which is kind of ridiculous.
53:01
And so, I mean, yeah, and so the transformer
53:06
doesn't even, in the simplest case, really know where
53:08
these patches might come from.
53:10
They are usually positionally encoded,
53:12
but it has to rediscover a lot of the structure,
53:16
I think, of them in some ways.
53:19
And it's kind of weird to approach it that way.
53:23
But it's just the simplest baseline
53:25
of just chopping up big images into small squares
53:27
and feeding them in as the individual nodes actually
53:29
works fairly well.
53:30
And then this is in a transformer encoder,
53:32
so all the patches are talking to each other
53:34
throughout the entire transformer.
53:36
And the number of nodes here would be like nine.
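A minimal sketch of that chop-into-squares idea in PyTorch (the sizes are illustrative; real ViTs also add a class token and positional embeddings):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 96, 96)   # (batch, channels, height, width)
P = 32                            # patch size -> a 3x3 grid, i.e. the nine "nodes"

# Chop the image into non-overlapping PxP squares and flatten each one.
patches = img.unfold(2, P, P).unfold(3, P, P)               # (1, 3, 3, 3, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, 3 * P * P)

proj = nn.Linear(3 * P * P, 768)  # linear projection of each flattened patch to a token
tokens = proj(patches)            # (1, 9, 768)

# Transformer encoder: all nine patches talk to each other in every layer.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens)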
53:42
Also, in speech recognition, you just take your mel spectrogram,
53:44
and you chop it up into slices and you feed them
53:46
into a transformer.
53:47
So there was a paper like this, but also Whisper.
53:49
Whisper is a copy-paste transformer.
53:51
If you saw Whisper from OpenAI, you just chop up the mel spectrogram
53:55
and feed it into a transformer and then pretend
53:57
you're dealing with text.
53:58
And it works very well.
54:00
With the Decision Transformer in RL, you take your states, actions,
54:03
and rewards that you experience in the environment,
54:05
and you just pretend it's a language.
54:07
Then you start to model the sequences of that,
54:09
and then you can use that for planning later.
54:11
That works really well.
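A minimal sketch of that "pretend it's a language" tokenization, roughly in the spirit of the Decision Transformer, with made-up state and action sizes (illustrative only):

```python
import torch
import torch.nn as nn

D = 128
embed_return = nn.Linear(1, D)    # returns-to-go -> token
embed_state  = nn.Linear(4, D)    # 4-dim state -> token (illustrative size)
embed_action = nn.Linear(2, D)    # 2-dim action -> token (illustrative size)

returns = torch.randn(1, 3, 1)    # a short trajectory of length 3
states  = torch.randn(1, 3, 4)
actions = torch.randn(1, 3, 2)

# Interleave (return, state, action) per timestep into one language-like sequence.
tokens = torch.stack(
    [embed_return(returns), embed_state(states), embed_action(actions)], dim=2
).reshape(1, 9, D)                # (batch, 3 * timesteps, D), ready for a causal transformer
```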
54:13
Even things like AlphaFold, so we were briefly
54:15
talking about molecules and how you can plug them in.
54:17
So at the heart of AlphaFold, computationally,
54:19
is also a transformer.
54:21
One thing I wanted to also say about transformers
54:23
is I find that they're very flexible,
54:26
and I really enjoy that.
54:28
I'll give you an example from Tesla.
54:31
You have a ConvNet that takes an image
54:32
and makes predictions about the image.
54:34
And then the big question is, how do you
54:35
feed in extra information?
54:37
And it's not always trivial. Like, say I
54:38
have additional information
54:40
that I want the outputs to be informed by.
54:43
Maybe I have other sensors like Radar.
54:45
Maybe I have some map information, or a vehicle type,
54:47
or some audio.
54:48
And the question is, how do you feed information into a ConvNet?
54:50
Like where do you feed it in?
54:52
Do you concatenate it?
54:54
Do you add it?
54:55
At what stage?
54:56
And so with a transformer, it's much easier
54:58
because you just take whatever you want, you chop it
55:00
up into pieces, and you feed it in with a set
55:02
of what you had before.
55:03
And you let the self-attention figure out
55:04
how everything should communicate.
55:06
And that actually apparently works.
55:07
So just chop up everything and throw it into the mix
55:10
is like the way.
55:11
And it frees neural nets from this burden
55:15
of Euclidean space, where previously you
55:19
had to arrange your computation to conform to the Euclidean
55:21
space or three dimensions of how you're laying out the compute.
55:25
Like the compute actually kind of
55:26
happens in almost like 3D space if you think about it.
55:29
But in attention, everything is just sets.
55:32
So it's a very flexible framework,
55:33
and you can just throw in stuff into your conditioning set.
55:35
And everything just gets self-attended over.
55:37
So it's quite beautiful from that perspective.
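A hedged sketch of that "chop everything up and throw it into the mix" idea in PyTorch; the modality names, sizes, and learned modality embeddings are made up for illustration and are not the actual production setup:

```python
import torch
import torch.nn as nn

D = 256
camera_tokens = torch.randn(1, 9, D)   # e.g. image patches already projected to D dims
radar_tokens  = torch.randn(1, 4, D)   # some chopped-up radar representation (hypothetical)
vehicle_token = torch.randn(1, 1, D)   # one token for vehicle metadata (hypothetical)

# Learned "modality" embeddings so the transformer knows where each token came from.
modality = nn.Embedding(3, D)
camera_tokens = camera_tokens + modality(torch.tensor(0))
radar_tokens  = radar_tokens  + modality(torch.tensor(1))
vehicle_token = vehicle_token + modality(torch.tensor(2))

# Throw everything into one set and let self-attention figure out the communication.
x = torch.cat([camera_tokens, radar_tokens, vehicle_token], dim=1)  # (1, 14, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
out = layer(x)
```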
55:39
OK, so now what exactly makes transformers so effective?
55:43
I think a good example of this comes
55:44
from the GPT-3 paper, which I encourage people to read.
55:48
Language Models are Few-Shot Learners.
55:50
I would have probably renamed this a little bit.
55:52
I would have said something like transformers
55:54
are capable of in-context learning or meta-learning.
55:57
That's like what makes them really special.
56:00
So basically the setting that they're working with
56:02
is, OK, I have some context, and I'm
56:03
trying-- like say, a passage.
56:04
This is just one example of many.
56:06
I have a passage, and I'm asking questions about it.
56:08
And then as part of the context in the prompt,
56:12
I'm giving the questions and the answers.
56:14
So I'm giving one example of question-answer,
56:16
another example of question-answer,
56:17
another example of question-answer, and so on.
56:19
And this becomes--
56:21
Oh yeah, people are going to have to leave soon, huh?
56:24
OK, is this really important?
56:25
Let me think.
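A small illustration of the few-shot prompt format being described, with placeholder passage, questions, and answers (not the actual GPT-3 evaluation setup):

```python
# Build a few-shot prompt: a passage, several worked question-answer pairs,
# and a final question the model answers just by completing the text.
passage = "..."  # some passage of interest
examples = [
    ("Question 1 about the passage?", "Answer 1"),
    ("Question 2 about the passage?", "Answer 2"),
    ("Question 3 about the passage?", "Answer 3"),
]
new_question = "A new question about the passage?"

prompt = passage + "\n\n"
for q, a in examples:
    prompt += f"Q: {q}\nA: {a}\n\n"
prompt += f"Q: {new_question}\nA:"  # no gradient descent: the "learning" happens in-context
```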
56:29
OK, so what's really interesting is basically
56:31
like with more examples given in a context,
56:35
the accuracy improves.
56:37
And so what that suggests is that the transformer
56:39
is able to somehow learn in the activations
56:42
without doing any gradient descent
56:43
in a typical fine-tuning fashion.
56:45
So if you fine-tune, you have to give an example and the answer,
56:48
and you fine-tune it, using gradient descent.
56:51
But it looks like the transformer internally
56:53
in its weights is doing something
56:54
that looks potentially like gradient descent, some kind
56:56
of meta-learning in the weights of the transformer
56:57
as it is reading the prompt.
56:59
And so in this paper, they go into, OK,
57:01
distinguishing the outer loop of stochastic gradient
57:03
descent from the inner loop of in-context learning.
57:06
So the inner loop is as the transformer is reading
57:08
the sequence almost and the outer loop is the training
57:12
by gradient descent.
57:14
So basically, there's some training
57:15
happening in the activations of the transformer
57:17
as it is consuming a sequence, which
57:18
maybe very much looks like gradient descent.
57:21
And so there are some recent papers that hint at this
57:23
and study it.
57:23
And so as an example, in this paper
57:25
here, they propose something called the raw operator.
57:28
And they argue that the raw operator is implemented
57:32
by the transformer, and then they show
57:33
that you can implement things like ridge regression
57:35
on top of the raw operator.
57:36
And so this is giving--
57:39
There are papers hinting that maybe there
57:40
is something that looks like gradient-based learning
57:42
inside the activations of the transformer.
57:45
And I think this is not impossible to think through
57:47
because what is gradient-based learning?
57:49
Forward pass, backward pass, and then update.
57:52
Oh, that looks like a ResNet, right,
57:54
because you're adding to the weights.
57:57
So you start with an initial random set of weights,
57:59
forward pass, backward pass, and update your weights,
58:01
and then forward pass, backward pass, update the weights.
58:04
Looks like a ResNet.
58:04
A transformer is a ResNet, so this is much more hand-wavy,
58:10
but basically, some papers are trying
58:11
to hint at why that would be potentially possible.
58:14
And then I have a bunch of tweets I just copy-pasted here
58:16
in the end.
58:18
This was like meant for general consumption,
58:20
so they're a bit more high-level and hypey a little bit.
58:22
But I'm talking about why this architecture is so interesting
58:26
and why potentially it became so popular.
58:27
And I think it simultaneously optimizes
58:29
three properties that, I think, are very desirable.
58:31
Number one, the transformer is very
58:33
expressive in the forward pass.
58:35
It's able to implement
58:37
very interesting functions, potentially functions
58:39
that can even do meta-learning.
58:41
Number two, it is very optimizable thanks
58:43
to things like residual connections, layer norms,
58:45
and so on.
58:45
And number three, it's extremely efficient.
58:47
This is not always appreciated, but the transformer,
58:49
if you look at the computational graph,
58:51
is a shallow, wide network, which
58:53
is perfect to take advantage of the parallelism of GPUs.
58:56
So I think the transformer was designed very deliberately
58:58
to run efficiently on GPUs.
59:00
There's previous work like neural GPU
59:02
that I really enjoy as well, which is really just
59:05
like how do we design neural nets that are efficient on GPUs
59:08
and thinking backwards from the constraints of the hardware,
59:10
which I think is a very interesting way
59:11
to think about it.
59:17
Oh yeah, so here, I'm saying, I probably would have called--
59:21
I probably would've called the transformer a general purpose
59:24
efficient optimizable computer instead of attention
59:27
is all you need.
59:28
That's what I would have maybe in hindsight called that paper.
59:31
It's proposing a model that is very general purpose, so
59:37
its forward pass is expressive.
59:38
It's very efficient in terms of GPU usage
59:40
and is easily optimizable by gradient descent and trains
59:44
very nicely.
59:46
And then I have some other hype tweets here.
59:51
Anyway, so you can read them later.
59:53
But I think this one is maybe interesting.
59:55
So if previous neural nets are special purpose computers
59:58
designed for a specific task, GPT
1:00:00
is a general purpose computer, reconfigurable at runtime
1:00:03
to run natural language programs.
1:00:06
So the programs are given as prompts,
1:00:08
and then GPT runs the program by completing the document.
1:00:12
So I personally really like these analogies to a computer.
1:00:16
It's just like a powerful computer,
1:00:18
and it's optimizable by gradient descent.
1:00:22
And I don't know--
1:00:30
OK, yeah.
1:00:31
That's it.
1:00:31
[LAUGHTER]
1:00:33
You can read the tweets later, but that's for now.
1:00:35
I'll just thank you.
1:00:36
I'll just leave this up.
1:00:45
Sorry, I just found this tweet.
1:00:46
So it turns out that if you scale up the training set
1:00:49
and use a powerful enough neural net like a transformer,
1:00:51
the network becomes a kind of general purpose
1:00:53
computer over text.
1:00:54
So I think that's a nice way to look at it.
1:00:56
And instead of processing a single fixed text sequence,
1:00:58
you can design the sequence in the prompt.
1:01:00
And because the transformer is both powerful
1:01:02
but also trained on a large enough, very hard data set,
1:01:05
it becomes this general purpose text computer.
1:01:07
And so I think that's kind of an interesting way to look at it.
1:01:11
Yeah.
1:01:13
[INAUDIBLE]
1:02:01
And I guess my question is [INAUDIBLE] how
1:02:04
much do you think [INAUDIBLE]?
1:02:10
really because it's mostly more efficient or [INAUDIBLE]
1:02:25
So I think there's a bit of that.
1:02:27
Yeah, so I would say RNNs in principle,
1:02:29
yes, they can implement arbitrary programs.
1:02:31
I think it's kind of a useless statement to some extent
1:02:33
because they're probably--
1:02:35
I'm not sure how expressive they are in practice,
1:02:37
but in the sense of raw power, yes, they can implement
1:02:40
these arbitrary functions.
1:02:43
But they're not optimizable.
1:02:44
And they're certainly not efficient because they
1:02:46
are serial computing devices.
1:02:50
So if you look at it as a compute graph,
1:02:51
RNNs are a very long, thin compute graph.
1:02:58
What if you stretched out the neurons and you looked--
1:03:00
like take all the individual neurons interconnectivity,
1:03:02
and stretch them out, and try to visualize them.
1:03:04
RNNs would be like a very long graph and that's bad.
1:03:07
And it's bad also for optimizability
1:03:08
because I don't exactly know why,
1:03:10
but just the rough intuition is when you're backpropagating,
1:03:13
you don't want to make too many steps.
1:03:15
And so transformers are a shallow wide graph, and so
1:03:19
from supervision to inputs is a very small number of hops.
1:03:23
And there are these long residual pathways,
1:03:25
which make gradients flow very easily.
1:03:26
And there's all these layer norms
1:03:28
to control the scales of all of those activations.
1:03:32
And so there's not too many hops,
1:03:34
and you're going from supervision to input
1:03:36
very quickly and just flows through the graph.
1:03:40
And it can all be done in parallel,
1:03:42
so you don't need to do this--
1:03:43
In encoder and decoder RNNs, you have to go from the first word,
1:03:46
then second word, then third word.
1:03:47
But here in the transformer, every single word
1:03:49
was processed completely in parallel, which is kind of a--
1:03:54
So I think all three of these properties are really
1:03:57
important.
1:03:57
And I think number 3 is less talked about but extremely
1:04:00
important because in deep learning scale matters.
1:04:03
And so the size of the network that it lets you train
1:04:06
is extremely important.
1:04:08
And so if it's efficient on the current hardware,
1:04:10
then you can make it bigger.
1:04:14
You mentioned that if you do it with multiple modalities
1:04:17
of data, [INAUDIBLE].
1:04:21
How does that actually work?
1:04:22
Do you leave the different data as different tokens,
1:04:26
or is it [INAUDIBLE]?
1:04:29
No, so yeah, so you take your image,
1:04:31
and you again chop it up into patches.
1:04:33
So there's the first thousand tokens or whatever.
1:04:35
And now, I have a special--
1:04:37
so radar could be also, but I don't actually
1:04:40
want to make a representation of radar.
1:04:43
But you just need to chop it up and enter it.
1:04:46
And then you have to encode it somehow.
1:04:47
Like the transformer needs to know
1:04:48
that they're coming from radar.
1:04:49
So you create a special--
1:04:52
you have some kind of a special token of that to--
1:04:55
these radar tokens are what's slightly
1:04:57
different in the representation, and it's
1:04:58
learnable by gradient descent.
1:05:00
And like vehicle information would also
1:05:03
come in with a special embedded token that can be learned.
1:05:07
So--
1:05:09
So how do you align those before really--
1:05:11
Actually, but you don't.
1:05:12
It's all just a set.
1:05:13
And there's--
1:05:14
Even the [INAUDIBLE]
1:05:18
Yeah, it's all just a set, but you can positionally
1:05:20
encode these sets if you want.
1:05:23
So positional encoding means you can
1:05:26
hardwire, for example, the coordinates
1:05:28
like using [INAUDIBLE].
1:05:29
You can hardwire that, but it's better
1:05:31
if you don't hardwire the position.
1:05:33
It's just a vector that is always
1:05:34
hanging out at this location.
1:05:35
Whatever content is there, it just adds onto it.
1:05:37
And this vector is trainable by backprop.
1:05:39
That's how you do it.
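A minimal sketch of such a learned positional vector in PyTorch, one per location, simply added onto the content and trained by backprop (illustrative only):

```python
import torch
import torch.nn as nn

T, D = 16, 256                                  # max sequence length, embedding size
pos_emb = nn.Parameter(torch.zeros(1, T, D))    # one learnable vector "hanging out" per location

content = torch.randn(2, T, D)                  # token / patch embeddings for a batch of 2
x = content + pos_emb                           # whatever content is there, the position vector
                                                # just adds onto it; gradients flow into pos_emb
```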
1:05:43
Good point.
1:05:43
I don't really like the [INAUDIBLE].
1:05:48
They seem to work, but it seems like they're sometimes
1:05:51
[INAUDIBLE]
1:06:08
I'm not sure if I understand your question.
1:06:10
[LAUGHTER]
1:06:11
So I mean the positional encoders
1:06:12
like they're actually like not--
1:06:14
OK, so they have very little inductive bias or something
1:06:16
like that.
1:06:17
They're just vectors hanging out at each location, always,
1:06:19
and you're trying to help the network in some way.
1:06:23
And I think the intuition is good,
1:06:28
but if you have enough data, usually,
1:06:30
trying to mess with it is a bad thing.
1:06:33
Trying to inject knowledge when you
1:06:35
have enough knowledge in the data
1:06:36
set itself is not usually productive.
1:06:38
So it all really depends on what scale you want.
1:06:40
If you have infinite data, then you actually
1:06:41
want to encode less and less.
1:06:43
That turns out to work better.
1:06:44
And if you have very little data, then actually, you do
1:06:46
want to encode some biases.
1:06:47
And maybe if you have a much smaller data set, then
1:06:49
maybe convolutions are a good idea
1:06:50
because you actually have this bias coming from your filters.
1:06:55
But I think-- so the transformer is extremely general,
1:06:58
but there are ways to mess with the encodings
1:07:01
to put in more structure.
1:07:02
Like you could, for example, encode [INAUDIBLE] and fix it,
1:07:05
or you could actually go to the attention mechanism
1:07:07
and say, OK, if my image is chopped up into patches,
1:07:10
this patch can only communicate to this neighborhood.
1:07:13
And you just do that in the attention matrix,
1:07:15
you just mask out whatever you don't want to communicate.
1:07:18
And so people really play with this
1:07:19
because the full attention is inefficient.
1:07:22
So they will intersperse, for example, layers
1:07:25
that only communicate in little patches
1:07:26
and then layers that communicate globally.
1:07:28
And they will do all kinds of tricks like that.
1:07:30
So you can slowly bring in more inductive bias.
1:07:33
You can do that, but the inductive biases
1:07:35
are factored out from the core transformer.
1:07:38
They are factored out into the interconnectivity
1:07:41
of the nodes,
1:07:42
and they are factored out into the positional encodings,
1:07:44
and you can mess with those for computation.
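A minimal sketch of that kind of masking in PyTorch, assuming a simple 1D neighborhood (schemes like Sparse Transformers or Swin use fancier layouts); a global layer would simply skip the mask:

```python
import torch

T, window = 10, 2   # sequence length, half-width of the local neighborhood

idx = torch.arange(T)
local_mask = (idx[None, :] - idx[:, None]).abs() <= window   # (T, T) boolean

# Inside an attention layer, disallowed pairs get -inf before the softmax,
# so each position only communicates with its neighborhood.
scores = torch.randn(T, T)
scores = scores.masked_fill(~local_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)

# Stacks often intersperse local layers like this with occasional full-attention
# layers to keep the quadratic cost under control.
```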
1:07:49
[INAUDIBLE]
1:08:02
So there's probably about 200 papers on this now if not more.
1:08:06
They're kind of hard to keep track of.
1:08:07
Honestly, like my Safari browser, which is-- oh,
1:08:10
it's all up on my computer, like 200 open tabs.
1:08:13
But yes, I'm not even sure if I want
1:08:20
to pick my favorite honestly.
1:08:23
Yeah, [INAUDIBLE]
1:08:42
Maybe you can use a transformer like that [INAUDIBLE]
1:08:45
The other one that I actually like even more
1:08:46
is potentially, keep the context length fixed
1:08:49
but allow the network to somehow use a scratch pad.
1:08:53
And so the way this works is you will teach the transformer
1:08:55
somehow via examples in [INAUDIBLE] hey,
1:08:57
you actually have a scratch pad.
1:09:00
Basically, you can't remember too much.
1:09:01
Your context length is finite.
1:09:02
But you can use a scratch pad.
1:09:04
And you do that by emitting a start scratch pad,
1:09:06
and then writing whatever you want to remember, and then
1:09:08
end scratch pad.
1:09:10
And then you continue with whatever you want.
1:09:12
And then later when it's decoding,
1:09:14
you actually have special logic so
1:09:15
that when you detect start scratch pad,
1:09:18
you save whatever it puts
1:09:19
in there in some external thing and allow it to attend over it.
1:09:22
So basically, you can teach the transformer just dynamically
1:09:25
because it's so good at meta-learning.
1:09:27
You can teach it dynamically to use other gizmos and gadgets
1:09:30
and allow it to expand its memory that way
1:09:31
if that makes sense.
1:09:32
It's just like a human learning to use a notepad, right?
1:09:35
You don't have to keep it in your brain.
1:09:37
So keeping things in your brain is like the context length
1:09:39
of the transformer.
1:09:39
But maybe we can just give it a notebook.
1:09:42
And then it can query the notebook, and read from it,
1:09:45
and write to it.
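A hedged sketch of what that external scratch-pad bookkeeping could look like at decoding time; the marker strings and the model.next_token interface are hypothetical, purely to illustrate the idea:

```python
# Hypothetical decoding loop: the model has been taught (via in-context examples)
# to wrap things it wants to remember in <scratch> ... </scratch>; external code
# stores that text in a "notebook" it can later splice back into the context.
START, END = "<scratch>", "</scratch>"

def generate_with_scratchpad(model, prompt, max_steps=200):
    text, notebook = prompt, []
    in_scratch, buffer = False, []
    for _ in range(max_steps):
        token = model.next_token(text)        # hypothetical: returns the next token as a string
        text += token
        if token == START:
            in_scratch, buffer = True, []
        elif token == END:
            in_scratch = False
            notebook.append("".join(buffer))  # saved outside the finite context window
        elif in_scratch:
            buffer.append(token)
    return text, notebook
```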
1:09:46
[INAUDIBLE] transformer to plug in another transformer.
1:09:48
[LAUGHTER]
1:09:53
[INAUDIBLE]
1:10:09
I don't know if I detected that.
1:10:10
I feel like-- did you feel like there was more than just
1:10:12
a long prompt that's unfolding?
1:10:14
Yeah, [INAUDIBLE]
1:10:19
I didn't try extensively, but I did see a [INAUDIBLE] event.
1:10:22
And I felt like the block size was just moved.
1:10:28
Maybe I'm wrong.
1:10:28
I don't actually know about the internals of ChatGPT.
1:10:31
We have two online questions.
1:10:33
So one question is, "what do you think about architecture
1:10:35
[INAUDIBLE]?"
1:10:38
S4?
1:10:39
S4.
1:10:40
I'm sorry.
1:10:41
I don't know S4.
1:10:42
Which one is this one?
1:10:45
The second question, this one's a personal question.
1:10:47
"What are you going to work on next?"
1:10:49
[INAUDIBLE]
1:10:51
I mean, so right now, I'm working on things like nanoGPT.
1:10:53
Where is nanoGPT?
1:10:58
I mean, I'm basically moving slightly from computer vision
1:11:01
and computer vision-based products to do
1:11:03
a little bit in the language domain.
1:11:05
Where's ChatGPT?
1:11:06
OK, nanoGPT.
1:11:07
So originally, I had minGPT, which I rewrote to nanoGPT.
1:11:10
And I'm working on this.
1:11:11
I'm trying to reproduce GPTs, and I mean,
1:11:14
I think something like ChatGPT, I think,
1:11:16
incrementally improved in a product fashion
1:11:17
would be extremely interesting.
1:11:19
And I think a lot of people feel it,
1:11:23
and that's why it went so wide.
1:11:24
So I think there's something like a Google plus
1:11:28
plus plus to build that I think is more interesting.
1:11:31
Shall we give our speaker a round of applause?
— end of transcript —