Stanford CS25: V2 | Introduction to Transformers w/ Andrej Karpathy
Stanford Online · May 10, 2026
Transcript
0:05
Hi, everyone.
0:06
Welcome to CS 25
Transformers United V2.
0:09
This was a course that
was held at Stanford
0:11
in the winter of 2023.
0:13
This course is not
about robots that
0:14
can transform into cars as
this picture might suggest.
0:17
Rather, it's about
deep learning models
0:18
that have taken
the world by storm
0:21
and have revolutionized
the field of AI and others.
0:23
Starting from natural
language processing,
0:25
transformers have
been applied all over,
0:27
computer vision, reinforcement
learning, biology, robotics,
0:30
et cetera.
0:31
We have an exciting set
of videos lined up for you
0:34
with some truly fascinating
speakers presenting
0:37
how they're applying
transformers
0:39
to the research in
different fields and areas.
0:44
We hope you'll enjoy and
learn from these videos.
0:47
So without any further
ado, let's get started.
0:52
This is a purely
introductory lecture.
0:54
And we'll go into the building
blocks of transformers.
0:58
So first, let's start with
introducing the instructors.
1:03
So for me, I'm currently on a
temporary deferral from the PhD
1:06
program, and I'm leading AI at a
robotics startup, Collaborative
1:09
Robotics, which is working on
some general purpose robots,
1:13
somewhat like [INAUDIBLE].
1:14
And I'm very passionate about
robotics and building FSG
1:18
learning algorithms.
1:19
My research interests are
in reinforcement learning,
1:21
computer vision, and
remodeling, and I
1:23
have a bunch of
publications in robotics,
1:25
autonomous driving,
and other areas.
1:28
My undergrad was at Cornell.
1:29
If someone is from Cornell,
so nice to [INAUDIBLE]..
1:33
So I'm Stephen, currently
a first-year CS PhD here.
1:37
Previously did my master's at
CMU and undergrad at Waterloo.
1:40
I'm mainly into NLP research,
anything involving language
1:43
and text, but more
recently, I've
1:45
been getting more into computer
vision as well as [INAUDIBLE]
1:48
And just some stuff I do
for fun, a lot of music
1:51
stuff, mainly piano.
1:52
Some self-promo: I post a lot
on my Insta, YouTube,
1:55
and TikTok, so if you
guys want to check it out.
1:58
My friends and I are also
starting a Stanford piano club,
2:01
so if anybody's interested,
feel free to email
2:04
or DM me for details.
2:07
Other than that, martial arts,
bodybuilding, and huge fan
2:11
of k-dramas, anime,
and occasional gamer.
2:14
[LAUGHS]
2:18
OK, cool.
2:19
Yeah, so my name is Rylan.
2:20
Instead of talking
about myself, I just
2:21
want to very briefly
say that I'm super
2:23
excited to take this class.
2:24
I took it the last time--
sorry-- to teach this.
2:26
Excuse me.
2:26
I took it the last
time it was offered.
2:28
I had a bunch of fun.
2:30
I thought we brought in a
really great group of speakers
2:32
last time.
2:33
I'm super excited
for this offering.
2:35
And yeah, I'm thankful
that you're all here,
2:37
and I'm looking forward to a
really fun quarter together.
2:39
Thank you.
2:39
Yeah, so fun fact, Rylan was
the most outspoken student
2:42
last year.
2:43
And so if someone wants to
become an instructor next year,
2:45
you know what to do.
2:46
[LAUGHTER]
2:50
OK, cool.
2:53
Let's see.
2:54
OK, I think we
have a few minutes.
2:56
So what we hope you will learn
in this class is, first of all,
2:59
how do transformers
work, how they
3:02
are being applied,
just beyond NLP,
3:04
and nowadays, like they
are pretty [INAUDIBLE]
3:06
them everywhere in
AI and machine learning.
3:10
And what are some new and
interesting directions
3:12
of research in these topics.
3:17
Cool, so this class is
just introductory.
3:19
So we're just talking about
the basics of transformers,
3:22
introducing them, talking about
the self-attention mechanism
3:24
on which they're founded.
3:26
And we'll do a deeper dive
on models like BERT
3:30
and GPT, stuff like that.
3:32
So with that, happy
to get started.
3:35
OK, so let me start with
presenting the attention
3:38
timeline.
3:40
Attention all started
with this one paper.
3:43
"Attention Is All You Need" by
Vaswani et al. in 2017.
3:46
That was the beginning
of transformers.
3:49
Before that, we had
the prehistoric era,
3:51
where we had models
like RNNs, LSTMs,
3:55
and simple attention
mechanisms that didn't work
3:57
or [INAUDIBLE].
3:59
Starting 2017, we saw this
explosion of transformers
4:02
into NLP, where people started
using it for everything.
4:07
I even heard this
quote from Google.
4:08
It's like our performance
increased every time
4:10
we [INAUDIBLE]
4:11
[CHUCKLES]
4:15
For the [INAUDIBLE]
after 2018 to 2020,
4:17
we saw this explosion
of transformers
4:18
into other fields like vision,
a bunch of other stuff,
4:23
and like biology as a whole.
4:25
And then 2021 was the start
4:28
of the generative era, where we
got a lot of generative modeling,
4:31
starting with models like
Codex, GPT, DALL-E,
4:35
Stable Diffusion;
a lot of things
4:37
happening in generative modeling.
4:40
And we started scaling up in AI.
4:44
And now, the present.
4:45
So this is 2022 and
the start of '23.
4:49
And now we have models
like ChatGPT, Whisper,
4:53
a bunch of others.
4:54
And we're scaling onwards
without stopping,
4:57
so that's great.
4:58
So that's the future.
5:01
So going more into this,
so once there were RNNs.
5:06
So we had Seq2Seq
models, LSTMs, GRU.
5:10
What worked there was that they
were good at encoding history,
5:13
but what did not work was they
didn't encode long sequences
5:17
and they were very bad
at encoding context.
5:21
So consider this example.
5:24
Consider trying to predict
the last word in the text,
5:27
"I grew up in France,
dot, dot, dot.
5:29
I speak fluent [blank]."
5:31
Here, you need to understand
the context for it
5:33
to predict "French", and
the attention mechanism
5:36
is very good at that, whereas
if they're just using LSTMs,
5:39
it doesn't work that well here.
5:42
Another thing transformers
are good at is
5:46
content-based context prediction,
which you can see
5:50
by looking at attention maps.
5:52
If I have something
like the word "it",
5:56
what noun does it refer to?
5:57
And the attention mechanism can
put its weight
6:01
on the right one of the
possible candidates.
6:05
And this works better
than existing mechanisms.
6:10
OK, so where we were in 2021,
we were on the verge of takeoff.
6:16
We were starting to realize
the potential of transformers
6:18
in different fields.
6:20
We solved a lot of
long sequence problems
6:23
like protein folding,
AlphaFold, offline RL.
6:28
We started to see few-shots,
zero-shot generalization.
6:31
We saw multimodal
tasks and applications
6:34
like generating
images from language.
6:36
So that was DALL-E. And
it feels like [INAUDIBLE]..
6:43
And this was also a
talk on transformers
6:45
that you can watch on YouTube.
6:48
Yeah, cool.
6:51
And this is where we were
going from 2021 to 2022,
6:55
which is we have gone from
the version of [INAUDIBLE]
6:58
And now, we are seeing
unique applications
7:00
in audio generation,
art, music, storytelling.
7:03
We are starting to see
these new capabilities
7:05
like commonsense,
logical reasoning,
7:08
mathematical reasoning.
7:09
We are also able to now
get human alignment
7:12
and interaction.
7:13
They're able to use
reinforcement learning
7:15
with human feedback.
7:16
That's how ChatGPT is trained
to perform really well.
7:19
We have a lot of
mechanisms for controlling
7:21
toxicity, bias, and ethics now.
7:24
And there are also a lot
7:26
of developments in other
areas like diffusion models.
7:30
Cool.
7:33
So the future is a
spaceship, and we are all
7:35
excited about it.
7:39
And there are a lot
more applications
7:40
that we can enable,
and it'll be great
7:44
if you can see
transformers also up there.
7:47
One big example is video
understanding and generation.
7:49
That is something that
everyone is interested in,
7:51
and I'm hoping we'll
see a lot of models
7:53
in this area this year,
also, finance, business.
7:59
I'll be very excited to
see GPT author a novel,
8:02
but we need to solve very
long sequence modeling.
8:04
And most transformer
models are still
8:07
limited to 4,000 tokens
or something like that.
8:09
So we need to make them
generalize much
8:13
better on long sequences.
8:17
We also want to have
generalized agents
8:19
that can do a lot of multitask,
multi-input predictions
8:27
like Gato.
8:28
And so I think we will
see more of that, too.
8:31
And finally, we also want
domain specific models.
8:37
So you might want
a GPT model, let's
8:39
put it like maybe your health.
8:41
So that could be like
a DoctorGPT model.
8:43
You might have a
LawyerGPT model that's
8:45
trained on only law data.
8:46
So currently, we have GPT models
that are trained on everything.
8:49
But we might start to see
more niche models that
8:51
are good at one task.
8:53
And we could have a
mixture of experts,
8:55
so it's like, you
can think this is a--
8:57
how you'd normally
consult an expert,
8:58
you'll have expert AI models.
9:00
And you can go to a different AI
model for your different needs.
9:05
There are still a lot
of missing ingredients
9:07
to make this all successful.
9:10
The first of all
is external memory.
9:12
We are already starting to
see this with the models
9:15
like ChatGPT, where the
interactions are short-lived.
9:18
There's no long-term
memory, and they
9:20
don't have ability
to remember or store
9:23
conversations for long-term.
9:25
And this is something
we want to fix.
9:29
Second is reducing the
computation complexity.
9:32
So the attention mechanism is
quadratic in the sequence
9:36
length, which is slow.
9:37
And we want to reduce
it and make it faster.
9:42
Another thing we
want to do is we
9:44
want to enhance the
controllability of these models
9:46
like a lot of these
models can be stochastic.
9:48
And we want to be able to
control what sort of outputs
9:51
we get from them.
9:52
And you might have
experienced this with ChatGPT:
9:54
if you just refresh, you get
a different output each time.
9:56
But you might want to have
a mechanism that controls
9:59
what sort of things you get.
10:01
And finally, we want to align
our state-of-the-art language
10:04
models with how the
human brain works.
10:06
And we are seeing a
surge, but we still
10:09
need more research to see
how we can make them more informed.
10:12
Thank you.
10:14
Great, hi.
10:16
Yes, I'm excited to be here.
10:18
I live very nearby, so I got
the invites to come to class.
10:21
And I was like, OK,
I'll just walk over.
10:23
But then I spent like
10 hours on the slides,
10:25
so it wasn't as simple.
10:28
So yeah, I'm going to
talk about transformers.
10:30
I'm going to skip the
first two over there.
10:32
I'm not going to
talk about those.
10:34
We'll talk about that one
just to simplify the lecture
10:36
since we don't have time.
10:39
OK, so I wanted to provide
a little bit of context
10:41
on why this transformers
class even exists.
10:44
So a little bit of
historical context.
10:45
I feel like Bilbo over there,
10:47
like, telling
you guys about this.
10:50
I don't know if you guys
saw Lord of the Rings.
10:52
And basically, I joined AI in
roughly 2012,
10:56
so maybe a decade ago.
10:58
And back then, you
wouldn't even say
10:59
that you joined AI by the way.
11:00
That was like a dirty word.
11:02
Now, it's OK to talk
about, but back then, it
11:04
was not even deep learning.
11:05
It was machine learning.
11:06
That was the term we would
use if you were serious.
11:08
But now, now, AI is
OK to use, I think.
11:11
So basically, do
you even realize
11:13
how lucky you are
potentially entering
11:15
this area in roughly 2023?
11:17
So back then, in 2011 or so
when I was working specifically
11:20
on computer vision, your
pipelines looked like this.
11:25
So you wanted to
classify some images,
11:28
you would go to a paper, and I
think this is representative.
11:30
You would have three pages
in the paper describing
11:32
a whole zoo, a kitchen sink,
11:34
of different kinds of
features and descriptors.
11:36
And you would go
to a poster session
11:38
at a computer
vision conference,
11:40
and everyone would have their
favorite feature descriptor
11:41
that they're proposing.
11:42
And it's totally
ridiculous, and you
11:44
would take notes on which
one you should incorporate
11:45
into your pipeline because
you would extract all of them,
11:48
and then you would
put an SVM on top.
11:49
So that's what you would do.
11:51
So there's two pages.
11:52
Make sure you get your
[? Spar ?] SIFT histograms,
11:54
your SSIMs, your color
histograms, textons,
11:56
tiny images.
11:57
And don't forget the
geometry specific histograms.
11:59
All of them have basically
complicated code by themselves.
12:02
So you're collecting code from
everywhere and running it,
12:04
and it was a total nightmare.
12:06
So on top of that,
it also didn't work.
12:10
[LAUGHTER]
12:11
So this, I think,
represents the predictions
12:14
from that time.
12:15
You would just get predictions
like this once in a while,
12:17
and you'd be like, you
just shrug your shoulders
12:19
like that just happens
once in a while.
12:20
Today, you would be
looking for a bug.
12:23
And worse than that,
every single chunk of AI
12:30
had its own completely
separate vocabulary
12:32
that it worked with.
12:33
So if you go to NLP
papers, those papers
12:36
would be completely different.
12:38
So you're reading the NLP
paper, and you're like,
12:40
what is this part
of speech tagging,
12:42
morphological analysis,
syntactic parsing,
12:44
co-reference resolution?
12:46
What is MPBTKJ?
12:48
And you're confused.
12:49
So the vocabulary and everything
was completely different.
12:51
And you couldn't
read papers, I would
12:52
say, across different areas.
12:55
So now, that
changed a little bit
12:56
starting in 2012 when Alex Krizhevsky
and colleagues basically
13:02
demonstrated that if you
scale a large neural network
13:05
on large data set, you can
get very strong performance.
13:08
And so up till then, there was
a lot of focus on algorithms.
13:10
But this showed that actually
neural nets scale very well.
13:13
So you need to now worry
about compute and data,
13:15
and you can scale it up.
13:16
It works pretty well.
13:17
And then that recipe
actually did copy paste
13:19
across many areas of AI.
13:21
So we start to see neural
networks pop up everywhere
13:23
since 2012.
13:25
So we saw them in computer
vision, and NLP, and speech,
13:28
and translation in RL and so on.
13:30
So everyone started to use
the same kind of modeling
13:32
toolkit, modeling framework.
13:33
And now when you go to NLP, and
you start reading papers there,
13:36
in machine translation,
for example,
13:38
this is a sequence
to sequence paper
13:40
which we'll come
back to in a bit.
13:41
You start to read those
papers, and you're like, OK,
13:44
I can recognize these words.
13:45
Like there's a neural network.
13:46
There's some parameters.
13:47
There's an optimizer, and
it starts to read things
13:50
that you know of.
13:50
So that decreased tremendously
the barrier to entry
13:54
across the different areas.
13:56
And then, I think,
the big deal is
13:57
that when the transformer
came out in 2017,
14:00
it's not even just that the
toolkits and the neural networks
14:02
were similar-- it's that
literally the architectures
14:05
converged to like one
architecture that you
14:07
copy paste across
everything seemingly.
14:10
So this was kind of an
unassuming machine translation
14:12
paper at the time, proposing
the transformer architecture.
14:15
But what we found since then
is that you can just basically
14:17
copy paste this architecture
and use it everywhere.
14:21
And what's changing is
the details of the data,
14:23
and the chunking of the
data, and how you feed it in.
14:26
And that's a
caricature, but it's
14:28
kind of like a correct
first order statement.
14:29
And so now, papers are
even more similar looking
14:32
because everyone's
just using transformer.
14:34
And so this convergence
was remarkable to watch
14:38
and unfolded over
the last decade.
14:40
And it's pretty crazy to me.
14:42
What I find
interesting is I think
14:44
this is some kind of a hint
that we're maybe converging
14:46
to something that maybe
the brain is doing
14:48
because the brain is very
homogeneous and uniform
14:50
across the entire
sheet of your cortex.
14:52
And OK, maybe some of
the details are changing,
14:54
but those feel like
hyperparameters
14:56
like in a transformer.
14:57
But your auditory cortex
and your visual cortex
14:59
and everything else
looks very similar.
15:01
And so maybe we're
converging to some kind
15:02
of a uniform powerful
learning algorithm here.
15:06
Something like that, I think,
is interesting and exciting.
15:09
OK, so I want to talk about
where the transformer came
15:11
from briefly, historically.
15:12
So I want to start in 2003.
15:15
I like this paper quite a bit.
15:17
It was the first popular
application of neural networks
15:21
to the problem of
language modeling,
15:22
so predicting in this
case, the next word
15:24
in the sequence, which
allows you to build
15:26
generative models over text.
15:27
And in this case, they were
using multi-layer perceptron,
15:29
so very simple neural net.
15:30
The neural nets took three words
and predicted the probability
15:33
distribution for the
fourth word in a sequence.
15:36
So this was well and
good at this point.
15:39
Now, over time, people
started to apply this
15:41
to machine translation.
15:43
So that brings us to
sequence to sequence paper
15:45
from 2014 that was
pretty influential,
15:48
and the big problem
here was OK, we
15:49
don't just want to take three
words and predict the fourth.
15:52
We want to predict how to
go from an English sentence
15:55
to a French sentence.
15:56
And the key problem
was OK, you can
15:58
have arbitrary number of words
in English and arbitrary number
16:00
of words in French,
so how do you
16:03
get an architecture
that can process
16:04
this variably sized input?
16:06
And so here they used a
LSTM, and there's basically
16:10
two chunks of this, which are
covered by the slack, by this.
16:16
But basically you have an
encoder LSTM on the left,
16:19
and it just consumes
one word at a time
16:22
and builds up a context
of what it has read.
16:24
And then that acts as
a conditioning vector
16:26
to the decoder RNN or LSTM.
16:29
That basically
goes chonk, chonk,
16:30
chonk for the next
word in a sequence,
16:32
translating the English to
French or something like that.
16:35
Now, the big problem with
this, that people identified,
16:37
I think, very quickly
and tried to resolve
16:40
is that there's what's called
this encoder bottleneck.
16:43
So this entire English sentence
that we are trying to condition
16:46
on is packed into
a single vector
16:48
that goes from the
encoder to the decoder.
16:50
And so this is just
too much information
16:52
to potentially maintain
in a single vector,
16:54
and that didn't seem correct.
16:55
And so people were
looking around for ways
16:57
to alleviate
the encoder bottleneck, as it
17:00
was called at the time.
17:02
And so that brings
us to this paper,
17:03
Neural Machine Translation
by Jointly Learning
17:05
to Align and Translate.
17:07
And here, just quoting from
the abstract, "in this paper,
17:11
we conjectured that the use
of a fixed length vector
17:13
is a bottleneck in
improving the performance
17:15
of the basic
encoder-decoder architecture
17:17
and propose to extend
this by allowing
17:19
the model to
automatically soft search
17:21
for parts of the source sentence
that are relevant to predicting
17:24
a target word without
having to form
17:28
these parts as a hard
segment explicitly."
17:30
So this was a way to look
back to the words that
17:34
are coming from the encoder.
17:35
And it was achieved
using this soft search.
17:38
So as you are
decoding the words
17:42
here, while you
are decoding them,
17:44
you are allowed to
look back at the words
17:45
at the encoder via this soft
attention mechanism proposed
17:49
in this paper.
17:50
And so this paper, I think,
is the first time that I saw,
17:52
basically, attention.
17:55
So your context vector
that comes from the encoder
17:58
is a weighted sum
of the hidden states
18:01
of the words in the encoding.
18:05
And then the weights
of this sum come
18:07
from a softmax that is based
on these compatibilities
18:10
between the current
state as you're decoding
18:13
and the hidden states
generated by the encoder.
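Written out, the mechanism from that paper has roughly this form; the symbols below follow the Bahdanau et al. notation rather than anything shown on the slide:

$$c_i = \sum_j \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j),$$

where $h_j$ are the encoder hidden states, $s_{i-1}$ is the current decoder state, and $a$ is a small learned compatibility function.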
18:15
And so this is the first
time that really you
18:17
start to look at it, and these
are the current modern equations
18:22
of the attention.
18:23
And I think this was the
first paper that I saw it in.
18:25
It's the first time
that there's a word
18:27
attention used, as far as I
know, to call this mechanism.
18:32
So I actually tried to dig
into the details of the history
18:34
of the attention.
18:35
So the first author
here, Dzmitry, I
18:38
had an email
correspondence with him,
18:40
and I basically
sent him an email.
18:41
I'm like, Dzmitry, this
is really interesting.
18:43
Just rumors have taken over.
18:44
Where did you come up
with the soft attention
18:45
mechanism that ends up being
the heart of the transformer?
18:48
And to my surprise, he wrote me
back this massive email, which
18:52
was really fascinating.
18:52
So this is an excerpt
from that email.
18:57
So basically, he talks about
how he was looking for a way
18:59
to avoid this bottleneck
between the encoder and decoder.
19:02
He had some ideas
about cursors that
19:04
traverse the sequences
that didn't quite work out.
19:06
And then here, "so one
day, I had this thought
19:08
that it would be nice
to enable the decoder
19:10
RNN to learn to search where
to put the cursor in the source
19:13
sequence.
19:14
This was sort of inspired
by translation exercises
19:16
that learning English in
my middle school involved.
19:21
Your gaze shifts back and forth
between source and target
19:23
sequence as you translate."
19:24
So literally, I thought that
this was kind of interesting,
19:27
that he's not a native
English speaker,
19:28
and here, that gave him an edge
in this machine translation
19:31
that led to attention and
then led to transformer.
19:34
So that's really fascinating.
19:37
"I expressed a soft
search a softmax
19:38
and then weighted averaging
of the [INAUDIBLE] states.
19:40
And basically, to
my great excitement,
19:43
this worked from
the very first try."
19:45
So really, I think,
interesting piece of history.
19:48
And as it later turned out,
the name RNNSearch
19:51
was kind of lame, so the
better name attention came
19:54
from Yoshua on one
of the final passes
19:57
as they went over the paper.
19:58
So maybe Attention
is All You Need
20:00
would have been called RNN
Search is All You Need,
20:03
but we have Yoshua
Bengio to thank
20:05
for a bit of a
better name, I would say.
20:07
So apparently,
that's the history
20:08
of this, which I
thought was interesting.
20:11
OK, so that brings us to
2017, which is Attention
20:13
is All You Need.
20:14
So this attention
component, which
20:16
in Dzmitry's paper was
just one small segment,
20:19
and there's all this
bidirectional RNN, RNN
20:21
and decoder, and this Attention
Is All You Need paper is saying,
20:25
OK, you can actually
delete everything.
20:26
What's making this
work very well
20:28
is just attention by itself.
20:29
And so delete everything,
keep attention.
20:32
And then what's remarkable about
this paper actually is usually,
20:35
you see papers that
are very incremental.
20:36
They add one thing, and
they show that it's better.
20:39
But I feel like
Attention is All You
20:41
Need was like a mix of multiple
things at the same time.
20:44
They were combined
in a very unique way,
20:46
and then also achieved a
very good local minimum
20:49
in the architecture space.
20:50
And so to me, this is
really a landmark paper
20:52
that is quite
remarkable and, I think,
20:55
had quite a lot of
work behind the scenes.
20:58
So delete all the RNN,
just keep attention.
21:01
Because attention
operates over sets--
21:03
and I'm going to go
to this in a second--
21:05
you now need to positionally
encode your inputs
21:07
because attention doesn't have
the notion of space by itself.
21:14
I have to be very careful.
21:17
They adopted this
residual network structure
21:19
from ResNets.
21:21
They interspersed attention
with multi-layer perceptrons.
21:24
They used layer norms, which
came from a different paper.
21:27
They introduced the concept
of multiple heads of attention
21:29
that were applied in parallel.
21:30
And they gave us, I think,
like a fairly good set
21:33
of hyperparameters that
to this day are used.
21:35
So the expansion factor in the
multi-layer perceptron goes up
21:39
by 4X--
21:40
and we'll go into
a bit more detail--
21:41
and this 4X has stuck around.
21:43
And I believe there's
a number of papers
21:44
that try to play with all
kinds of little details
21:47
of the transformer, and nothing
sticks because this is actually
21:50
quite good.
21:51
The only thing to my
knowledge that didn't stick
21:54
was this reshuffling
of the layer norms
21:56
to go into the prenorm
version where here you
21:59
see the layer norms are after
the multi-headed attention and feed
22:01
forward.
22:02
They just put them
before instead.
22:04
So just reshuffling of
layer norms, but otherwise,
22:06
the GPTs and everything else
that you're seeing today
22:08
is basically the 2017
architecture from 5 years ago.
22:11
And even though everyone
is working on it,
22:13
it's been proven
remarkably resilient,
22:15
which I think is
real interesting.
22:17
There are innovations
that, I think,
22:18
have been adopted also
in positional encoding.
22:21
It's more common to use
different rotary and relative
22:24
positional encoding and so on.
22:25
So I think there have been
changes, but for the most part,
22:28
it's proven very resilient.
22:31
So really quite an
interesting paper.
22:32
Now, I wanted to go into
the attention mechanism.
22:36
And I think, the way I interpret
it is not similar to the ways
22:43
that I've seen it
presented before.
22:44
So let me try a different
way of how I see it.
22:47
Basically, to me, attention is
kind of like the communication
22:49
phase of the transformer,
and the transformer
22:52
interweaves two phases of the
communication phase, which
22:55
is the multi-headed
attention, and the computation
22:57
stage, which is this
multilayered perceptron
23:00
or [INAUDIBLE].
23:01
So in the communication
phase, it's
23:03
really just a data
dependent message
23:05
passing on directed graphs.
23:07
And you can think of it
as OK, forget everything
23:09
with machine
translation, everything.
23:10
Let's just-- we have
directed graphs.
23:13
At each node, you
are storing a vector.
23:16
And then let me talk now
about the communication
23:18
phase of how these
vectors talk to each other
23:20
and this directed graph.
23:21
And then the compute
phase later is just
23:23
a multi-layer perceptron, which then
basically acts on every node
23:27
individually.
23:28
But how do these nodes
talk to each other
23:30
in this directed graph?
23:32
So I wrote like
some simple Python--
23:36
I wrote this in Python
basically to create
23:39
one round of communication
using attention
23:44
as the message passing scheme.
23:46
So here, a node has this
private data vector,
23:51
which you can think of
as private information
23:53
to this node.
23:54
And then it can also emit a
key, a query, and a value.
23:57
And simply, that's done
by linear transformation
24:00
from this node.
24:01
So the key is what are
the things that I am--
24:07
sorry.
24:07
The query is what are the
things that I'm looking for?
24:10
The key is: what are
the things that I have?
24:12
And the value is what are the
things that I will communicate?
24:15
And so then when you
have your graph that's
24:16
made up of nodes and some
random edges, when you actually
24:19
have these nodes communicating,
what's happening is
24:21
you loop over all the
nodes individually
24:23
in some random order,
and you're at some node,
24:27
and you get the
query vector q, which
24:29
is, I'm a node in
some graph, and this
24:32
is what I'm looking for.
24:33
And so that's just achieved
via this linear transformation
24:36
here.
24:36
And then we look at all the
inputs that point to this node,
24:39
and then they broadcast what
are the things that I have,
24:42
which is their keys.
24:44
So they broadcast the keys.
24:45
I have the query, then those
interact by dot product
24:49
to get scores.
24:51
So basically, simply
by doing dot product,
24:53
you get some
unnormalized weighting
24:55
of the interestingness of all
of the information in the nodes
24:59
that point to me and to
the things I'm looking for.
25:02
And then when you normalize
that with softmax,
25:03
so it just sums to
1, you basically just
25:06
end up using those scores, which
now sum to 1 in our probability
25:09
distribution, and you do a
weighted sum of the values
25:13
to get your update.
25:15
So I have a query.
25:17
They have keys, dot products
to get interestingness or like
25:21
affinity, softmax to
normalize it, and then
25:24
weighted sum of those values
flow to me and update me.
25:27
And this is happening for
each node individually.
25:29
And then we update at the end.
25:30
And so this kind of a
message passing scheme
25:32
is at the heart of
the transformer.
25:35
And it happens in the more
vectorized batched way
25:40
that is more confusing and is
also interspersed with layer
25:44
norms and things like that
to make the training behave
25:46
better.
25:47
But that's roughly what's
happening in the attention
25:49
mechanism, I think,
on a high level.
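A minimal sketch of the kind of Python he's describing -- my reconstruction for illustration, not the code from the slide; the dimensions and class names are made up:

```python
import numpy as np

d = 16  # size of each node's vector (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class Node:
    def __init__(self):
        self.data = np.random.randn(d)      # private information stored at this node
        self.wk = np.random.randn(d, d)     # linear maps that emit key / query / value
        self.wq = np.random.randn(d, d)
        self.wv = np.random.randn(d, d)
    def key(self):   return self.wk @ self.data   # what do I have?
    def query(self): return self.wq @ self.data   # what am I looking for?
    def value(self): return self.wv @ self.data   # what will I communicate?

def communicate(nodes, edges):
    """One round of attention as message passing; edges[i] lists nodes pointing to node i."""
    updates = []
    for i, node in enumerate(nodes):
        q = node.query()
        inputs = [nodes[j] for j in edges[i]]
        scores = np.array([q @ n.key() for n in inputs])          # dot products: affinities
        weights = softmax(scores)                                 # normalize so they sum to 1
        updates.append(sum(w * n.value() for w, n in zip(weights, inputs)))
    for node, u in zip(nodes, updates):                           # update everyone at the end
        node.data = u

nodes = [Node() for _ in range(8)]
edges = {i: list(range(i + 1)) for i in range(8)}   # causal graph: node i sees nodes 0..i
communicate(nodes, edges)
```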
25:53
So yeah, so in the communication
phase of the transformer, then
25:59
this message passing
scheme happens
26:00
in every head in parallel and
then in every layer in series
26:06
and with different
weights each time.
26:08
And that's it as far as the
multi-headed attention goes.
26:13
And so if you look at these
encoder-decoder models,
26:15
you can think of it then in
terms of the connectivity
26:18
of these nodes in the graph.
26:19
You can think of it as like,
OK, all these tokens that
26:21
are in the encoder that
we want to condition on,
26:23
they are fully
connected to each other.
26:25
So when they communicate,
they communicate fully
26:28
when you calculate
their features.
26:30
But in the decoder,
because we are
26:32
trying to have a
language model, we
26:33
don't want to have
communication for future tokens
26:35
because they give away
the answer at this step.
26:38
So the tokens in the
decoder are fully connected
26:40
from all the encoder
states, and then they
26:43
are also fully connected from
everything that is decoding.
26:46
And so you end up with
this triangular structure
26:49
in the data graph.
26:50
But that's the
message passing scheme
26:52
that this basically implements.
26:54
And then you have to be also
a little bit careful because
26:57
in the cross attention
here with the decoder,
26:59
you consume the features
from the top of the encoder.
27:01
So think of it as
in the encoder,
27:03
all the nodes are
looking at each other,
27:05
all the tokens are looking at
each other many, many times.
27:08
And they really figure
out what's in there,
27:09
and then the decoder when it's
looking only at the top nodes.
27:14
So that's roughly the
message passing scheme.
27:16
I was going to go into
more of an implementation
27:18
of a transformer.
27:19
I don't know if there's
any questions about this.
27:23
[INAUDIBLE] self-attention
and multi-headed attention,
27:26
but what is the
advantage of [INAUDIBLE]??
27:30
Yeah, so self-attention and
multi-headed attention, so
27:35
the multi-headed attention is
just this attention scheme,
27:38
but it's just applied
multiple times in parallel.
27:40
Multiple heads just means
independent applications
27:42
of the same attention.
27:44
So this message passing
scheme basically just
27:47
happens in parallel
multiple times
27:49
with different weights for
the query, key, and value.
27:52
So you can almost look at
it like in parallel, I'm
27:55
looking for, I'm seeking
different kinds of information
27:57
from different nodes.
27:59
And I'm collecting it
all in the same node.
28:01
It's all done in parallel.
28:03
So heads is really just
copy-paste in parallel.
28:06
And layers are
copy-paste but in series.
28:12
Maybe that makes sense.
28:15
And self-attention, when
it's self-attention,
28:18
what it's referring to
is that the keys, queries, and values
28:21
all come from the same nodes.
28:23
So as I described it here,
this is really self-attention
28:25
because every one of
these nodes produces
28:27
a key, a query, and a value
from this individual node.
28:30
When you have cross-attention,
you have one cross-attention
28:33
here, coming from the encoder.
28:36
That just means that
the queries are still
28:38
produced from this node,
but the keys and the values
28:42
are produced as a
function of nodes that
28:44
are coming from the encoder.
28:48
So I have my queries because
I'm trying to decode some--
28:52
the fifth word in the sequence.
28:53
And I'm looking
for certain things
28:55
because I'm the fifth word.
28:56
And then the keys and
the values in terms
28:58
of the source of information
that could answer my queries
29:01
can come from the previous
nodes in the current decoding
29:04
sequence or from the
top of the encoder.
29:06
So all the nodes that
have already seen all
29:09
of the encoding tokens many,
many times can now broadcast
29:12
what they contain in
terms of information.
29:14
So I guess, to summarize,
the self-attention is--
29:18
sorry, cross-attention
and self-attention
29:20
only differ in where the keys
and the values come from.
29:24
Either the keys and values
are produced from this node,
29:28
or they are produced from some
external source like an encoder
29:31
and the nodes over there.
29:33
But algorithmically, it's the
same mathematical operations.
29:39
Question.
29:39
Yeah, OK.
29:40
So two questions for you.
29:41
First question is, in the
message passing [INAUDIBLE]
29:56
So think of-- so each one
of these nodes is a token.
30:04
I guess I don't have
a very good picture of it
30:06
in the transformer.
30:06
But this node here could
represent the third word
30:14
in the output in the decoder,
and in the beginning,
30:19
it is just the
embedding of the word.
30:27
And then, OK, I have to
think through this analogy
30:30
a little bit more.
30:31
I came up with it this morning.
30:32
[LAUGHTER]
30:34
[INAUDIBLE]
30:39
What example of instantiation
[INAUDIBLE] nodes
30:45
as in in blocks were embedding?
30:50
These nodes are
basically the vectors.
30:53
I'll go to an implementation.
30:54
I'll go to the implementation,
and then maybe I'll
30:56
make the connections
to the graph.
30:58
So let me try to first
go to-- let me now go to,
31:01
with this intuition
in mind, at least,
31:03
to a nanoGPT, which is a
concrete implementation
31:05
of a transformer
that is very minimal.
31:06
So I worked on this
over the last few days,
31:08
and here it is reproducing
GPT-2 on open web text.
31:11
So it's a pretty serious
implementation that reproduces
31:14
GPT-2, I would say, and
provided enough compute--
31:17
This was one node of 8 GPUs
for 38 hours or something
31:21
like that, if I
remember correctly.
31:22
And it's very readable.
31:23
It's 300 lines, so everyone
can take a look at it.
31:27
And yeah, let me basically
briefly step through it.
31:30
So let's try to have a
decoder-only transformer.
31:34
So what that means is that
it's a language model.
31:36
It tries to model the
next word in the sequence
31:39
or the next character
in the sequence.
31:41
So the data that
we train this on
31:43
is always some kind of text.
31:44
So here's some fake Shakespeare.
31:45
Sorry, this is real Shakespeare.
31:47
We're going to produce
fake Shakespeare.
31:48
So this is called
a Tiny Shakespeare
31:50
dataset, which is one of
my favorite toy datasets.
31:52
You take all of
Shakespeare, concatenate it,
31:54
and it's a 1 megabyte
file, and then
31:55
you can train
language models on it
31:56
and get infinite
Shakespeare, if you like,
31:58
which I think is kind of cool.
31:59
So we have a text.
32:00
The first thing we
need to do is we
32:02
need to convert it to
a sequence of integers
32:05
because transformers
natively process--
32:09
you can't plug text
into a transformer.
32:10
You need to somehow encode it.
32:11
So the way that
encoding is done is
32:13
we convert, for example,
in the simplest case,
32:15
every character gets an
integer, and then instead of "hi
32:18
there," we would have
this sequence of integers.
32:21
So then you can encode every
single character as an integer
32:25
and get a massive
sequence of integers.
32:27
You just concatenate
it all into one
32:29
large, long
one-dimensional sequence.
32:31
And then you can train on it.
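As a rough sketch of that encoding step (a paraphrase, not the actual nanoGPT code; the file name here is hypothetical):

```python
# build a character-level vocabulary from the corpus
text = open('tiny_shakespeare.txt').read()           # hypothetical path to the 1 MB file
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}          # character -> integer
itos = {i: ch for ch, i in stoi.items()}              # integer -> character

encode = lambda s: [stoi[c] for c in s]               # "hi there" -> a list of small integers
decode = lambda ids: ''.join(itos[i] for i in ids)

data = encode(text)   # one long one-dimensional sequence of integers to train on
```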
32:32
Now, here, we only
have a single document.
32:34
In some cases, if you have
multiple independent documents,
32:36
what people like to do
is create special tokens,
32:38
and they intersperse
those documents
32:40
with those special
end of text tokens
32:42
that they splice in between
to create boundaries.
32:46
But those boundaries actually
don't have any modeling impact.
32:50
It's just that the
transformer is supposed
32:52
to learn via backpropagation
that the end of document
32:55
sequence means that you
should wipe the memory.
33:00
OK, so then we produce batches.
33:02
So these batches
of data just mean
33:04
that we go back to the
one-dimensional sequence,
33:06
and we take out chunks
of this sequence.
33:08
So say, if the block size is 8,
then the block size indicates
33:13
the maximum length of context
that your transformer will
33:17
process.
33:18
So if our block size
is 8, that means
33:20
that we are going to have up
to eight characters of context
33:23
to predict the ninth
character in a sequence.
33:26
And the batch size indicates
how many sequences in parallel
33:29
we're going to process.
33:30
And we want this to be
as large as possible,
33:31
so we're fully taking
advantage of the GPU
33:33
and the parallelism [INAUDIBLE]
So in this example,
33:36
we're doing a 4 by 8 batch.
33:38
So every row here is
an independent example,
33:41
and each row is a
small chunk of the sequence
33:47
that we're going to train on.
33:48
And then we have both the
inputs and the targets
33:50
at every single point here.
33:52
So to fully spell out what's
contained in a single 4
33:55
by 8 batch to the transformer--
33:57
I sort of compact it here--
33:59
so when the input is 47, by
itself, the target is 58.
34:04
And when the input is
the sequence 47, 58,
34:07
the target is one.
34:08
And when it's 47, 58, 1,
the target is 51 and so on.
34:13
So actually, the single batch
of examples, that 4 by 8,
34:15
actually has a ton of
individual examples
34:17
that we are expecting
a transformer
34:18
to learn on in parallel.
34:21
And so you'll see that
the batches are learned
34:23
on completely independently, but
the time dimension here along
34:28
horizontally is also
trained on in parallel.
34:30
So your real batch size
is more like B times T.
34:34
And it's just that the
context grows linearly
34:37
for the predictions that you
make along the T direction
34:41
in the model.
34:42
So this is all the examples
that the model will learn from,
34:45
this single batch.
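A rough sketch of how such a batch could be cut out of the long sequence (illustrative, not the exact nanoGPT data loader):

```python
import torch

block_size = 8   # maximum context length
batch_size = 4   # independent sequences processed in parallel

def get_batch(data):
    # data: a 1-D tensor of token integers
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs,  shape (B, T)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, shape (B, T), shifted by one
    return x, y

# every (b, t) position is its own training example:
# x[b, :t+1] is the context and y[b, t] is the token to predict,
# so one 4x8 batch really contains 4 * 8 = 32 examples.
```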
34:48
So now, this is the GPT class.
34:52
And because this is
a decoder-only model,
34:55
so we're not going to have
an encoder because there's no
34:58
English we're translating from--
34:59
we're not trying to condition
in some other external
35:02
information.
35:02
We're just trying to produce
a sequence of words that
35:05
follow each other or likely to.
35:08
So this is all PyTorch, and
I'm going slightly faster
35:10
because I'm assuming people
have taken 231 or something
35:12
along those lines.
35:15
But here in the forward
pass, we take these indices,
35:19
and then we both encode the
identity of the indices,
35:24
just via an embedding
lookup table.
35:26
So every single integer, we
index into a lookup table of
35:31
vectors in this embedding, and pull out
35:34
the word vector for that token.
35:38
And then because the
transformer by itself
35:41
doesn't actually-- it
processes sets natively.
35:43
So we need to also positionally
encode these vectors
35:45
so that we basically
have both the information
35:47
about the token identity and
its place in the sequence from 1
35:51
to block size.
35:53
Now, the information
about what and where
35:56
is combined additively,
so the token embeddings
35:58
and the positional embeddings
are just added exactly as here.
36:02
So then there's
optional dropout,
36:06
this x here basically
just contains
36:08
the set of words
and their positions,
36:14
and that feeds into the
blocks of transformer.
36:16
And we're going to look
into what's block here.
36:18
But for here, for now,
this is just a series
36:20
of blocks in a transformer.
36:22
And then in the end,
there's a layer norm,
36:23
and then you're
decoding the logits
36:26
for the next word or next
integer in a sequence,
36:30
using the linear projection of
the output of this transformer.
36:33
So LM head here is short
for language model head.
36:36
It's just a linear function.
36:38
So basically, positionally
encode all the words,
36:42
feed them into a
sequence of blocks,
36:45
and then apply a linear
layer to get the probability
36:47
distribution for
the next character.
36:50
And then if we have
the targets, which
36:51
we produced in the data loader--
36:54
and you'll notice that
the targets are just
36:55
the inputs offset
by one in time--
36:59
then those targets feed
into a cross entropy loss.
37:01
So this is just a
negative log likelihood
37:03
typical classification loss.
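Condensed, the forward pass he's walking through looks roughly like this (a paraphrase of nanoGPT for illustration; the hyperparameter names are assumptions, and the Block class is sketched a bit further down):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, n_layer):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # "what": token identity
        self.pos_emb = nn.Embedding(block_size, n_embd)   # "where": position in the sequence
        self.blocks = nn.ModuleList([Block(n_embd) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)      # logits for the next token

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # "what" and "where" combined additively
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))               # (B, T, vocab_size)
        loss = None
        if targets is not None:                           # targets are the inputs offset by one
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```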
37:04
So now let's drill into
what's here in the blocks.
37:08
So these blocks that are
applied sequentially,
37:11
there's, again, as I
mentioned, this communicate
37:13
phase and the compute phase.
37:15
So in the communicate
phase, all the nodes
37:17
get to talk to each other, and
so these nodes are basically,
37:21
if our block size
is 8, then we are
37:23
going to have eight
nodes in this graph.
37:26
There's eight nodes
in this graph.
37:28
The first node is pointed
to only by itself.
37:30
The second node is pointed to
by the first node and itself.
37:33
The third node is pointed
to by the first two nodes
37:35
and itself, et cetera.
37:36
So there's eight nodes here.
37:38
So you apply-- there's a
residual pathway and x.
37:42
You take it out.
37:43
You apply a layer norm,
and then the self-attention
37:45
so that these communicate,
these eight nodes communicate.
37:47
But you have to keep in
mind that the batch is 4.
37:50
So because batch is 4,
this is also applied--
37:54
so we have eight
nodes communicating,
37:55
but there's a batch of four of
them individually communicating
37:58
in their own eight nodes.
37:59
There's no crisscross across
the batch dimension, of course.
38:02
There's no cross-batch communication
anywhere, luckily.
38:04
And then once they've
exchanged information,
38:06
they are processed using
the multi-layer perceptron.
38:09
And that's the compute phase.
38:12
And then also here we are
missing the cross-attention
38:18
because this is a
decoder-only model.
38:19
So all we have is
this step here,
38:21
the multi-headed
attention, and that's
38:22
this line, the
communicate phase.
38:24
And then we have the feed
forward, which is the MLP,
38:27
and that's the compute phase.
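In code, one such block is roughly the following (a sketch in the spirit of nanoGPT, using the MLP and CausalSelfAttention classes sketched below):

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: communicate (attention), then compute (MLP)."""
    def __init__(self, n_embd, n_head=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)   # communicate: the nodes talk to each other
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)                            # compute: each node processed individually

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # communication phase, added back onto the residual pathway
        x = x + self.mlp(self.ln2(x))    # compute phase
        return x
```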
38:29
I'll take questions
a bit later.
38:31
Then the MLP here is
fairly straightforward.
38:34
The MLP is just individual
processing on each node,
38:38
just transforming the feature
representation at that node.
38:41
So applying a
two-layer neural net
38:45
with a GELU nonlinearity,
which is-- just
38:47
think of it as a ReLU
or something like that.
38:49
It's just a nonlinearity.
38:51
And then MLP is straightforward.
38:53
I don't think there's
anything too crazy there.
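Roughly, that per-node computation is (a sketch; the 4x expansion is the hyperparameter mentioned earlier):

```python
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Individual processing at each node: expand by 4x, nonlinearity, project back."""
    def __init__(self, n_embd):
        super().__init__()
        self.fc = nn.Linear(n_embd, 4 * n_embd)     # the 4x expansion factor
        self.proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.proj(F.gelu(self.fc(x)))        # GELU: think of it as a smooth ReLU
```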
38:55
And then this is the
causal self-attention part,
38:57
the communication phase.
38:59
So this is like
the meat of things
39:01
and the most complicated part.
39:03
It's only complicated
because of the batching
39:06
and the implementation detail
of how you mask the connectivity
39:10
in the graph so that
you can't obtain
39:13
any information
from the future when
39:15
you're predicting your token.
39:16
Otherwise, it gives
away the information.
39:18
So if I'm the fifth token and
if I'm the fifth position,
39:23
then I'm getting the fourth
token coming into the input,
39:26
and I'm attending to the
third, second, and first,
39:29
and I'm trying to figure
out what is the next token.
39:32
Well then, in this batch,
in the next element
39:34
over in the time dimension,
the answer is at the input.
39:37
So I can't get any
information from there.
39:40
So that's why this
is all tricky,
39:41
but basically, in
the forward pass,
39:45
we are calculating the queries,
keys, and values based on x.
39:50
So these are the keys,
queries, and values.
39:52
Here, when I'm
computing the attention,
39:54
I have the queries matrix
multiplying the keys.
39:58
So this is the dot product in
parallel for all the queries
40:00
and all the keys
in all the heads.
40:03
So I failed to mention
that there's also
40:06
the aspect of the heads, which
is also done all in parallel
40:08
here.
40:09
So we have the batch
dimension, the time dimension,
40:10
and the head dimension,
and you end up
40:12
with five-dimensional tensors,
and it's all really confusing.
40:14
So I invite you to step through
it later and convince yourself
40:17
that this is actually
doing the right thing.
40:19
But basically, you have the
batch dimension, the head
40:21
dimension and the
time dimension,
40:23
and then you have
features at them.
40:25
And so this is evaluating for
all the batch elements, for all
40:28
the head elements, and
all the time elements,
40:31
the simple Python that I gave
you earlier, which is query
40:34
dot product key.
40:35
Then here, we do a masked_fill,
and what this is doing
40:38
is it's basically clamping the
attention between the nodes
40:44
that are not supposed to
communicate to be negative
40:46
infinity.
40:47
And we're doing
negative infinity
40:48
because we're about to softmax,
and so negative infinity will
40:51
basically make the attention
at those elements be zero.
40:54
And so here we are going
to basically end up
40:56
with the weights, the affinities
between these nodes, optional
41:03
dropout.
41:03
And then here, attention
matrix multiply v is basically
41:08
the gathering of the information
according to the affinities
41:10
we calculated.
41:11
And this is just a
weighted sum of the values
41:14
at all those nodes.
41:15
So this matrix multiplies
is doing that weighted sum.
41:19
And then transpose
contiguous view
41:20
because it's all
complicated and batched
41:22
in five-dimensional
tensors, but it's really not
41:24
doing anything,
optional drop out,
41:26
and then a linear projection
back to the residual pathway.
41:30
So this is implementing the
communication phase here.
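A condensed sketch of that batched, masked attention (a paraphrase of the nanoGPT layer, with the dropout left out for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """The message-passing scheme from before, vectorized over batch, heads, and time."""
    def __init__(self, n_embd, n_head, block_size=8):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)    # queries, keys, values for all heads at once
        self.proj = nn.Linear(n_embd, n_embd)        # projection back to the residual pathway
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer('mask', mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # fold the heads into their own dimension: (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)               # affinities, scaled
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))  # no information from the future
        att = F.softmax(att, dim=-1)
        y = att @ v                                            # weighted sum of the values
        y = y.transpose(1, 2).contiguous().view(B, T, C)       # re-assemble the heads
        return self.proj(y)
```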
41:34
Then you can train
this transformer.
41:37
And then you can generate
infinite Shakespeare.
41:41
And you will simply do this by--
41:43
because our block size is 8,
we start with some token,
41:47
say like, I used
in this case, you
41:50
can use something like a
new line as the start token.
41:53
And then you communicate
only to yourself
41:55
because there's a
single node, and you
41:57
get the probability
distribution for the first word
41:59
in the sequence.
42:00
And then you decode it
for the first character
42:03
in the sequence.
42:04
You decode the character.
42:05
And then you bring
back the character,
42:06
and you re-encode
it as an integer.
42:08
And now, you have
the second thing.
42:10
And so you get--
42:12
OK, we're at the first
position, and this
42:14
is whatever integer it is,
add the positional encodings,
42:17
goes into the sequence,
goes in the transformer,
42:19
and again, this token
now communicates
42:21
with the first token
and its identity.
42:26
And so you just keep
plugging it back.
42:28
And once you run out of the
block size, which is eight,
42:31
you have to start to crop,
because you can never
42:33
have block size more than
eight in the way you've
42:34
trained this transformer.
42:35
So we have more and more
context until eight.
42:37
And then if you want to
generate beyond eight,
42:39
you have to start cropping
because the transformer only
42:41
works for eight elements
in time dimension.
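The generation loop he's describing is roughly (a sketch, assuming the GPT class from the earlier sketch):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    # idx: (B, T) tensor of token integers, e.g. a single newline character as the start token
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]     # crop: never more than block_size of context
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                 # distribution for the next token only
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)    # append, then plug it back in
    return idx
```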
42:43
And so all of these transformers
in the [INAUDIBLE] setting
42:47
have a finite block
size or context length,
42:50
and in typical models, this
will be 1,024 tokens or 2,048
42:54
tokens, something like that.
42:56
But these tokens are
usually like BPE tokens,
42:58
or SentencePiece tokens,
or WordPiece tokens.
43:00
There's many
different encodings.
43:02
So it's not like that long.
43:03
And so that's why, I
think, [INAUDIBLE]..
43:05
We really want to
expand the context size,
43:06
and it gets gnarly
because the attention
43:08
is quadratic in the
[INAUDIBLE] case.
43:11
Now, if you want to implement
an encoder instead of a decoder
43:16
attention,
43:18
then all you have to
do is this [INAUDIBLE]
43:21
and you just delete that line.
43:23
So if you don't
mask the attention,
43:25
then all the nodes
communicate to each other,
43:27
and everything is
allowed, and information
43:29
flows between all the nodes.
43:31
So if you want to have the
encoder here, just delete.
43:35
All the encoder blocks
will use attention
43:38
where this line is deleted.
43:39
That's it.
43:40
So you're allowing whatever--
this encoder might store say,
43:44
10 tokens, 10 nodes,
and they are all
43:46
allowed to communicate to each
other going up the transformer.
43:51
And then if you want to
implement cross-attention,
43:53
so you have a full
encoder-decoder transformer,
43:55
not just a decoder-only
transformer or a GPT.
43:59
Then we need to also add
cross-attention in the middle.
44:03
So here, there is a
self-attention piece where all
44:05
the--
44:06
there's a self-attention
piece, a cross-attention piece,
44:08
and this MLP.
44:09
And in the
cross-attention, we need
44:12
to take the features from
the top of the encoder.
44:14
We need to add one
more line here,
44:16
and this would be the
cross-attention instead of a--
44:20
I should have implemented
it instead of just pointing,
44:22
I think.
44:23
But there will be a
cross-attention line here.
44:25
So we'll have three
lines because we
44:26
need to add another block.
44:28
And the queries will
come from x but the keys
44:31
and the values will come
from the top of the encoder.
44:35
And there will be
basically information
44:36
flowing from the
encoder directly
44:38
to all the nodes inside x.
44:41
And then that's it.
44:42
So it's a very
simple modification
44:44
of the decoder attention.
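A sketch of what that extra cross-attention line could look like (my illustration, not code from nanoGPT; an encoder block is just the self-attention above with the masked_fill line deleted):

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Same math as self-attention, but the keys and values come from the encoder output."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.n_head = n_head
        self.q = nn.Linear(n_embd, n_embd)        # queries from the decoder nodes
        self.kv = nn.Linear(n_embd, 2 * n_embd)   # keys and values from the top of the encoder
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x, enc):                    # x: (B, T, C) decoder, enc: (B, S, C) encoder top
        B, T, C = x.shape
        S, hs = enc.size(1), C // self.n_head
        q = self.q(x).view(B, T, self.n_head, hs).transpose(1, 2)
        k, v = self.kv(enc).split(C, dim=2)
        k = k.view(B, S, self.n_head, hs).transpose(1, 2)
        v = v.view(B, S, self.n_head, hs).transpose(1, 2)
        att = F.softmax((q @ k.transpose(-2, -1)) / hs ** 0.5, dim=-1)   # note: no causal mask here
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```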
44:47
So you'll hear people
talk that you have
44:49
a decoder-only model like GPT.
44:51
You can have an encoder-only
model like BERT,
44:53
or you can have an
encoder-decoder model
44:55
like say T5, doing things
like machine translation.
44:59
And in BERT, you don't train
it using this language modeling
45:04
setup that's
autoregressive, where you're just
45:06
trying to predict the next
[INAUDIBLE] in the sequence.
45:07
You're training it doing
slightly different objectives.
45:09
You're putting in
the full sentence,
45:12
and the full sentence is
allowed to communicate fully.
45:14
And then you're trying to
classify sentiment or something
45:16
like that.
45:18
So you're not trying to model
the next token in the sequence.
45:21
So these are trained
slightly differently
45:26
using masking and other
denoising techniques.
45:31
OK.
45:32
So that's like the transformer.
45:34
I'm going to continue.
45:36
So yeah, maybe more questions.
45:38
[INAUDIBLE]
46:01
This is like we are enforcing
these constraints on it
46:06
by just masking [INAUDIBLE]
46:12
So I'm not sure
if I fully follow.
46:14
So there's different ways
to look at this analogy,
46:16
but one analogy is
you can interpret
46:18
this graph as really fixed.
46:20
It's just that every time
we do the communicate,
46:22
we are using different weights.
46:23
You can look at it that way.
46:24
So if we have block size
of eight in my example,
46:26
we would have eight nodes.
46:27
Here we have 2, 4, 6.
46:29
OK, so we'd have eight nodes.
46:30
They would be connected in--
46:33
you lay them out, and you only
connect from left to right.
46:35
[INAUDIBLE]
46:42
Why would they
connect-- usually,
46:44
the connections don't change
as a function of the data
46:46
or something like that--
46:47
[INAUDIBLE]
47:00
I don't think I've seen
a single example where
47:02
the connectivity
changes dynamically
47:03
in the function data.
47:04
Usually, the
connectivity is fixed.
47:05
If you have an encoder,
and you're training a BERT,
47:07
you have how many
tokens you want,
47:09
and they are fully connected.
47:11
And if you have a
decoder-only model,
47:13
you have this triangular
thing, and if you
47:15
have encoder-decoder,
then you have
47:16
awkwardly two pools of nodes.
47:21
Yeah.
47:24
Go ahead.
47:25
[INAUDIBLE] I wonder, you
know much more about this
47:45
than I know.
47:46
But do you have a sense of
like if you ran [INAUDIBLE]
48:00
In my head, I'm thinking
[INAUDIBLE] but then you also
48:08
have different things for
one or more of [INAUDIBLE]----
48:13
Yeah, it's really
hard to say, so that's
48:15
why I think this paper is so
interesting because like, yeah,
48:17
usually, you'd
see like the path,
48:18
and maybe they had
path internally.
48:19
They just didn't publish it.
48:20
All you can see is things that
didn't look like a transformer.
48:23
I mean, you have ResNets,
which have lots of this.
48:26
But a ResNet would be
like this, but there's
48:29
no self-attention component.
48:31
But the MLP is there
kind of in a ResNet.
48:35
So a ResNet looks
very much like this
48:37
except there's no-- you can
use layer norms in ResNets,
48:40
I believe, as well.
48:41
Typically, sometimes,
they can be batch norms.
48:43
So it is kind of like a ResNet.
48:45
It is like they took
a ResNet, and they
48:47
put in a self-attention
block in addition
48:50
to the preexisting
MLP block, which
48:52
is kind of like convolutions.
48:53
And the MLP was, strictly
speaking, a convolution, a
48:55
one by one convolution,
but I think
48:59
the idea is similar in that MLP
is just like a typical weights,
49:04
nonlinearity weights operation.
49:11
But I will say, yeah, this
is kind of interesting
49:13
because a lot of
the work is not shown there,
49:15
and then they give
you this transformer.
49:17
And then it turns
out 5 years later,
49:18
it's not changed, even though
everyone's trying to change it.
49:20
So it's interesting to me
that it's like a package,
49:23
in like a package,
which I think is really
49:25
interesting historically.
49:26
And I also talked
to paper authors,
49:30
and they were
unaware of the impact
49:32
that the transformer
would have at the time.
49:33
So when you read this paper,
actually, it's unfortunate
49:37
because this is the paper
that changed everything,
49:39
but when people read it,
it's like question marks
49:41
because it reads like a pretty
random machine translation
49:45
paper.
49:46
It's like, oh, we're
doing machine translation.
49:47
Oh, here's a cool architecture.
49:48
OK, great, good results.
49:51
It doesn't know what's
going to happen.
49:53
[LAUGHS] And so when
people read it today,
49:56
I think they're
confused potentially.
50:00
I will have some
tweets at the end,
50:02
but I think I would
have renamed it
50:03
with the benefit of hindsight
of like, well, I'll get to it.
50:08
[INAUDIBLE]
50:20
Yeah, I think that's a
good question as well.
50:22
Currently, I mean,
I certainly don't
50:24
love the autoregressive
modeling approach.
50:27
I think it's kind of
weird to sample a token
50:29
and then commit to it.
50:31
So maybe there are
some ways, some hybrids
50:36
with diffusion as
an example, which
50:38
I think would be
really cool, or we'll
50:41
find some other ways to edit
the sequences later but still
50:44
in our regressive framework.
50:47
But I think the Fusion is
like an up and coming modeling
50:49
approach that I personally
find much more appealing.
50:51
When I sample text, I don't
go chunk, chunk, chunk,
50:54
and commit.
50:55
I do a draft one, and then
I do a better draft two.
50:58
And that feels like
a diffusion process.
51:00
So that would be my hope.
51:05
OK, also a question.
51:07
So yeah, you'd think
the [INAUDIBLE]
51:20
And then once we
have the edge weights,
51:21
we just have to multiply
them by the values,
51:23
and then you just
[INAUDIBLE] it.
51:25
Yes, yeah, that's right.
51:27
And you think there's an analogy
with graph neural networks
51:30
and they'll potentially--
51:32
I find the graph neural
networks like a confusing term
51:34
because, I mean,
yeah, previously,
51:38
there was this notion of--
51:40
I feel like maybe today
everything is a graph neural
51:42
network because a transformer
is a graph neural network
51:44
processor.
51:45
The native representation that
the transformer operates over
51:48
is sets that are connected
by edges in a directed way.
51:51
And so that's the native
representation, and then, yeah.
51:55
OK, I should go on because
I still have 30 slides.
51:57
[INAUDIBLE]
52:08
Oh yeah, yeah, the root
d_k, I think, it's basically
52:11
like if you're initializing
with random weights
52:14
set up from a [INAUDIBLE], as
your dimension size grows,
52:17
so do your values,
the variance grows.
52:19
And then your softmax will just
become a one-hot vector.
52:23
So it's just a way to
control the variance
52:25
and bring it to always be
in a good range for softmax,
52:28
a nice diffuse distribution.
52:31
OK, so it's almost like
an initialization thing.
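As a small numerical sketch of that variance argument (my own illustration, assuming unit-variance random queries and keys, not code from the lecture), you can see the effect of dividing by the square root of d_k directly:

```python
# Toy sketch of why 1/sqrt(d_k) scaling keeps softmax in a usable range.
import torch

d_k, T = 512, 8
q = torch.randn(T, d_k)                 # unit-variance random queries
k = torch.randn(T, d_k)                 # unit-variance random keys

raw = q @ k.T                           # each entry has variance ~ d_k, so values are huge
scaled = raw / d_k**0.5                 # variance brought back to ~1

print(raw.var().item(), scaled.var().item())
# Unscaled logits tend to saturate softmax toward a one-hot vector; scaled ones stay diffuse.
print(torch.softmax(raw[0], dim=-1).max().item())     # typically close to 1.0
print(torch.softmax(scaled[0], dim=-1).max().item())  # much smaller
```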
52:37
OK, so transformers
have been applied
52:41
to all the other fields,
and the way this was done
52:44
is, in my opinion,
in ridiculous ways,
52:46
honestly, because I was a
computer vision person,
52:49
and you have ConvNets,
and they make sense.
52:51
So what we're doing now
with ViTs as an example is
52:53
you take an image and you chop
it up into little squares.
52:56
And then those
squares, literally,
52:57
feed into a
transformer, and that's
52:59
it, which is kind of ridiculous.
53:01
And so, I mean, yeah,
and so the transformer
53:06
doesn't even, in the simplest
case, really know where
53:08
these patches might come from.
53:10
They are usually
positionally encoded,
53:12
but it has to rediscover
a lot of the structure,
53:16
I think, of them in some ways.
53:19
And it's kind of weird
to approach it that way.
53:23
But just the
simplest baseline
53:25
of chopping up big
images into small squares
53:27
and feeding them in as the
individual nodes actually
53:29
works fairly well.
53:30
And then this is in a
transformer encoder,
53:32
so all the patches are
talking to each other
53:34
throughout the
entire transformer.
53:36
And the number of nodes
here would be like nine.
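A minimal sketch of that patching idea, assuming a toy 48x48 image and a 16x16 patch size so you get the nine nodes just mentioned (illustrative dimensions, not the actual ViT code):

```python
# Chop an image into square patches and turn each patch into a token.
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 48, 48                # a toy 48x48 RGB image
P = 16                                   # patch size, so 3x3 = 9 patches ("nodes")
d_model = 64

x = torch.randn(B, C, H, W)

# (B, C, H, W) -> (B, num_patches, C*P*P): each row is one flattened square.
patches = x.unfold(2, P, P).unfold(3, P, P)                 # (B, C, 3, 3, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

embed = nn.Linear(C * P * P, d_model)    # linear projection of flattened patches
pos = nn.Parameter(torch.zeros(1, patches.shape[1], d_model))  # learned positions

tokens = embed(patches) + pos            # (B, 9, d_model) -> feed to a transformer encoder
print(tokens.shape)                      # torch.Size([1, 9, 64])
```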
53:42
Also, in speech recognition, you
just take your mel spectrogram,
53:44
and you chop it up into
slices and you feed them
53:46
into a transformer.
53:47
So there was paper like
this, but also Whisper.
53:49
Whisper is a
copy-paste transformer.
53:51
If you saw Whisper from OpenAI,
you just chop up the mel spectrogram
53:55
and feed it into a
transformer and then pretend
53:57
you're dealing with text.
53:58
And it works very well.
54:00
Decision transformer in RL,
you take your states, actions,
54:03
and reward that you
experience in environment,
54:05
and you just pretend
it's a language.
54:07
Then you start to model
the sequences of that,
54:09
and then you can use
that for planning later.
54:11
That works really well.
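A rough sketch of that interleaving, under my own simplified assumptions about dimensions (not the actual Decision Transformer code):

```python
# "Pretend it's a language": interleave returns, states, and actions as one token sequence.
import torch
import torch.nn as nn

T, state_dim, act_dim, d_model = 10, 4, 2, 32

returns = torch.randn(T, 1)              # return-to-go at each step
states = torch.randn(T, state_dim)
actions = torch.randn(T, act_dim)

embed_r = nn.Linear(1, d_model)
embed_s = nn.Linear(state_dim, d_model)
embed_a = nn.Linear(act_dim, d_model)

# Interleave as (R_1, s_1, a_1, R_2, s_2, a_2, ...) and model it autoregressively,
# exactly like a token sequence in language modeling.
tokens = torch.stack([embed_r(returns), embed_s(states), embed_a(actions)], dim=1)
tokens = tokens.reshape(3 * T, d_model)
print(tokens.shape)                      # torch.Size([30, 32]) -> feed to a causal transformer
```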
54:13
Even things like AlphaFold,
so we were briefly
54:15
talking about molecules and
how you can plug them in.
54:17
So at the heart of
AlphaFold, computationally,
54:19
is also a transformer.
54:21
One thing I wanted to also
say about transformers
54:23
is I find that
they're very flexible,
54:26
and I really enjoy that.
54:28
I'll give you an
example from Tesla.
54:31
You have a ConvNet
that takes an image
54:32
and makes predictions
about the image.
54:34
And then the big
question is, how do you
54:35
feed in extra information?
54:37
And it's not always
trivial. Like say, I
54:38
have additional
information that I
54:40
want
the outputs to be informed by.
54:43
Maybe I have other
sensors like Radar.
54:45
Maybe I have some map
information, or a vehicle type,
54:47
or some audio.
54:48
And the question is, how do you
feed information into a ConvNet?
54:50
Like where do you feed it in?
54:52
Do you concatenate it?
54:54
Do you add it?
54:55
At what stage?
54:56
And so with a transformer,
it's much easier
54:58
because you just take
whatever you want, you chop it
55:00
up into pieces, and you
feed it in with a set
55:02
of what you had before.
55:03
And you let the
self-attention figure out
55:04
how everything
should communicate.
55:06
And that actually
apparently works.
55:07
So just chop up everything
and throw it into the mix
55:10
is like the way.
55:11
And it frees neural
nets from this burden
55:15
of Euclidean space,
where previously you
55:19
had to arrange your computation
to conform to the Euclidean
55:21
space or three dimensions of how
you're laying out the compute.
55:25
Like the compute
actually kind of
55:26
happens in almost like 3D
space if you think about it.
55:29
But in attention,
everything is just sets.
55:32
So it's a very
flexible framework,
55:33
and you can just throw in stuff
into your conditioning set.
55:35
And everything is just
self-attended over.
55:37
So it's quite beautiful
from that perspective.
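A hand-wavy sketch of that "chop everything up and throw it into the mix" idea, with made-up token counts for each sensor (not any production stack):

```python
# Tokens from different modalities are simply concatenated into one set;
# self-attention decides how they should communicate.
import torch
import torch.nn as nn

d_model = 64
image_tokens = torch.randn(1, 100, d_model)   # e.g. 100 image patch tokens
radar_tokens = torch.randn(1, 20, d_model)    # e.g. 20 radar-derived tokens
map_tokens = torch.randn(1, 5, d_model)       # e.g. 5 map / vehicle-info tokens

x = torch.cat([image_tokens, radar_tokens, map_tokens], dim=1)   # (1, 125, d_model)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
print(layer(x).shape)                          # torch.Size([1, 125, 64])
```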
55:39
OK, so now what exactly makes
transformers so effective?
55:43
I think a good
example of this comes
55:44
from the GPT-3 paper, which
I encourage people to read.
55:48
Language Models are
Few-Shot Learners.
55:50
I would have probably
renamed this a little bit.
55:52
I would have said
something like transformers
55:54
are capable of in-context
learning or meta-learning.
55:57
That's like what makes
them really special.
56:00
So basically the setting
that they're working with
56:02
is, OK, I have some
context, and I'm
56:03
trying-- like say, a passage.
56:04
This is just one
example of many.
56:06
I have a passage, and I'm
asking questions about it.
56:08
And then as part of the
context in the prompt,
56:12
I'm giving the questions
and the answers.
56:14
So I'm giving one example
of question-answer,
56:16
another example of
question-answer,
56:17
another example of
question-answer, and so on.
56:19
And this becomes--
56:21
Oh yeah, people are going
to have to leave soon, huh?
56:24
OK, is this really important?
56:25
Let me think.
56:29
OK, so what's really
interesting is basically
56:31
like with more examples
given in a context,
56:35
the accuracy improves.
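A toy illustration of that few-shot setup, with a hypothetical passage and questions of my own; the point is that the "training examples" live entirely in the prompt, with no gradient updates:

```python
# Build a few-shot prompt: passage, several (question, answer) pairs, then a new question.
passage = "Tom brought an umbrella because the forecast said rain."

examples = [
    ("Q: What did Tom bring?", "A: An umbrella."),
    ("Q: Why did he bring it?", "A: Because rain was forecast."),
]
query = "Q: What did the forecast say?"

# More (question, answer) pairs in the context tends to improve accuracy,
# which is the in-context / meta-learning behavior discussed here.
prompt = passage + "\n" + "\n".join(q + "\n" + a for q, a in examples) + "\n" + query + "\nA:"
print(prompt)
```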
56:37
And so what that suggests
is that the transformer
56:39
is able to somehow
learn in the activations
56:42
without doing any
gradient descent
56:43
in a typical
fine-tuning fashion.
56:45
So if you fine-tune, you have to
give an example and the answer,
56:48
and you fine-tune it,
using gradient descent.
56:51
But it looks like the
transformer internally
56:53
in its weights is
doing something
56:54
that looks like potentially
gradient descent, some kind
56:56
of meta-learning in the
weights of the transformer
56:57
as it is reading the prompt.
56:59
And so in this paper,
they go into, OK,
57:01
distinguishing this outer
loop of stochastic gradient
57:03
descent from this inner loop
of in-context learning.
57:06
So the inner loop is as
the transformer is reading
57:08
the sequence almost and the
outer loop is the training
57:12
by gradient descent.
57:14
So basically,
there's some training
57:15
happening in the activations
of the transformer
57:17
as it is consuming
a sequence that
57:18
maybe very much looks
like gradient descent.
57:21
And so there are some recent
papers that hint at this
57:23
and study it.
57:23
And so as an example,
in this paper
57:25
here, they propose something
called the raw operator.
57:28
And they argue that the
raw operator is implemented
57:32
by the transformer,
and then they show
57:33
that you can implement
things like ridge regression
57:35
on top of the raw operator.
57:36
And so this is giving--
57:39
There are papers
hinting that maybe there
57:40
is something that looks
like gradient-based learning
57:42
inside the activations
of the transformer.
57:45
And I think this is not
impossible to think through
57:47
because what is
gradient-based learning?
57:49
Forward pass, backward
pass, and then update.
57:52
Oh, that looks like
a ResNet, right,
57:54
because you're adding
to the weights.
57:57
So you start with an initial
random set of weights,
57:59
forward pass, backward pass,
and update your weights,
58:01
and then forward pass, backward
pass, update the weights.
58:04
Looks like a ResNet.
58:04
A transformer is a ResNet,
so-- much more hand-wavey,
58:10
but basically, some
papers are trying
58:11
to hint at why that would
be potentially possible.
58:14
And then I have a bunch of
tweets I just copy-pasted here
58:16
in the end.
58:18
This was like meant for
general consumption,
58:20
so they're a bit more high-level
and hypey a little bit.
58:22
But I'm talking about why this
architecture is so interesting
58:26
and why potentially
it became so popular.
58:27
And I think it
simultaneously optimizes
58:29
three properties that, I
think, are very desirable.
58:31
Number one, the
transformer is very
58:33
expressive in the forward pass.
58:35
It's
able to implement
58:37
very interesting functions,
potentially functions
58:39
that can even do meta-learning.
58:41
Number two, it is very
optimizable thanks
58:43
to things like residual
connections, layer norms,
58:45
and so on.
58:45
And number three, it's
extremely efficient.
58:47
This is not always appreciated,
but the transformer,
58:49
if you look at the
computational graph,
58:51
is a shallow, wide
network, which
58:53
is perfect to take advantage
of the parallelism of GPUs.
58:56
So I think the transformer
was designed very deliberately
58:58
to run efficiently on GPUs.
59:00
There's previous
work like neural GPU
59:02
that I really enjoy as
well, which is really just
59:05
like how do we design neural
nets that are efficient on GPUs
59:08
and thinking backwards from the
constraints of the hardware,
59:10
which I think is a
very interesting way
59:11
to think about it.
59:17
Oh yeah, so here, I'm saying,
I probably would have called--
59:21
I probably would've called the
transformer a general purpose
59:24
efficient optimizable
computer instead of attention
59:27
is all you need.
59:28
That's what I would have maybe
in hindsight called that paper.
59:31
It's proposing a model that
is very general purpose, so
59:37
the forward pass is expressive.
59:38
It's very efficient
in terms of GPU usage
59:40
and is easily optimizable by
gradient descent and trains
59:44
very nicely.
59:46
And then I have some
other hype tweets here.
59:51
Anyway, so you can
read them later.
59:53
But I think this one
is maybe interesting.
59:55
So if previous neural nets
are special purpose computers
59:58
designed for a
specific task, GPT
1:00:00
is a general purpose computer,
reconfigurable at runtime
1:00:03
to run natural
language programs.
1:00:06
So the programs are
given as prompts,
1:00:08
and then GPT runs the program
by completing the document.
1:00:12
So I really like these analogies
personally to computer.
1:00:16
It's just like a
powerful computer,
1:00:18
and it's optimizable
by gradient descent.
1:00:22
And I don't know--
1:00:30
OK, yeah.
1:00:31
That's it.
1:00:31
[LAUGHTER]
1:00:33
You can read the tweets
later, but that's for now.
1:00:35
I'll just thank you.
1:00:36
I'll just leave this up.
1:00:45
Sorry, I just found this tweet.
1:00:46
So turns out that if you
scale up the training set
1:00:49
and use a powerful enough
neural net like a transformer,
1:00:51
the network becomes a
kind of general purpose
1:00:53
computer over text.
1:00:54
So I think that's a nice
way to look at it.
1:00:56
And instead of performing
a single fixed text sequence,
1:00:58
you can design the
sequence in the prompt.
1:01:00
And because the transformer
is both powerful
1:01:02
but also is trained on a large
enough, very hard data set,
1:01:05
it becomes this general
purpose text computer.
1:01:07
And so I think that's kind of
an interesting way to look at it.
1:01:11
Yeah.
1:01:13
[INAUDIBLE]
1:02:01
And I guess my question
is [INAUDIBLE] how
1:02:04
much do you think [INAUDIBLE]?
1:02:10
really because it's mostly
more efficient or [INAUDIBLE]
1:02:25
So I think there's
a bit of that.
1:02:27
Yeah, so I would say
RNNs in principle,
1:02:29
yes, they can implement
arbitrary programs.
1:02:31
I think it's like a useless
statement to some extent
1:02:33
because they're probably--
1:02:35
I mean, they're
probably expressive
1:02:37
in the sense of power,
in that they can implement
1:02:40
these arbitrary functions.
1:02:43
But they're not optimizable.
1:02:44
And they're certainly not
efficient because they
1:02:46
are serial computing devices.
1:02:50
So if you look at it
as a compute graph,
1:02:51
RNNs are a very long,
thin compute graph.
1:02:58
What if you stretched out
the neurons and you looked--
1:03:00
like take all the individual
neurons' interconnectivity,
1:03:02
and stretch them out, and
try to visualize them.
1:03:04
RNNs would be like a very
long graph and that's bad.
1:03:07
And it's bad also
for optimizability
1:03:08
because I don't
exactly know why,
1:03:10
but just the rough intuition
is when you're backpropagating,
1:03:13
you don't want to
make too many steps.
1:03:15
And so transformers are a
shallow wide graph, and so
1:03:19
from supervision to inputs is
a very small number of hops.
1:03:23
And there are long
residual pathways,
1:03:25
which make gradients
flow very easily.
1:03:26
And there's all
these layer norms
1:03:28
to control the scales of
all of those activations.
1:03:32
And so there's
not too many hops,
1:03:34
and you're going from
supervision to input
1:03:36
very quickly and just
flows through the graph.
1:03:40
And it can all be
done in parallel,
1:03:42
so you don't need to do this--
1:03:43
in encoder-decoder RNNs, you
have to go from the first word,
1:03:46
then the second word,
then the third word.
1:03:47
But here in the transformer,
every single word
1:03:49
is processed completely in
parallel, which is kind of a--
1:03:54
So I think all of these are
really important.
1:03:57
And I think number 3 is less
talked about but extremely
1:04:00
important because in deep
learning, scale matters.
1:04:03
And so the size of the
network that it allows you to
1:04:06
train is
extremely important.
1:04:08
And so if it's efficient
on the current hardware,
1:04:10
then you can make it bigger.
1:04:14
You mentioned that if you do
it with multiple modalities
1:04:17
of data, [INAUDIBLE].
1:04:21
How does that actually work?
1:04:22
Do you leave the different
data as different tokens,
1:04:26
or is it [INAUDIBLE]?
1:04:29
No, so yeah, so you
take your image,
1:04:31
and you apparently chop
it up into patches.
1:04:33
So there's the first
thousand tokens or whatever.
1:04:35
And now, I have a special--
1:04:37
so radar could be there also,
but I don't actually
1:04:40
want to make a
representation of radar.
1:04:43
But you just need to
chop it up and enter it.
1:04:46
And then you have to
encode it somehow.
1:04:47
Like the transformer
needs to know
1:04:48
that they're coming from radar.
1:04:49
So you create a special--
1:04:52
you have some kind of a
special token of that to--
1:04:55
these radar tokens
are what's slightly
1:04:57
different in the
representation, and it's
1:04:58
learnable by gradient descent.
1:05:00
And like vehicle
information would also
1:05:03
come in with a special embedded
token that can be learned.
1:05:07
So--
1:05:09
So how do you align
those before really--
1:05:11
Actually, but you don't.
1:05:12
It's all just a set.
1:05:13
And there's--
1:05:14
Even the [INAUDIBLE]
1:05:18
Yeah, it's all just a set,
but you can positionally
1:05:20
encode these sets if you want.
1:05:23
So positional
encoding means you can
1:05:26
hardwire, for example,
the coordinates
1:05:28
like using [INAUDIBLE].
1:05:29
You can hardwire
that, but it's better
1:05:31
if you don't hardwire
the position.
1:05:33
It's just a vector
that is always
1:05:34
hanging out at this location.
1:05:35
Whatever content is
there, it just adds onto it.
1:05:37
And this vector is
trainable by backprop.
1:05:39
That's how you do it.
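A minimal sketch of what that means in code (assumptions mine): positions and modality tags are just embedding tables whose vectors get added to the content at each slot and trained by backprop like any other weight.

```python
# Learnable position and modality embeddings added to content tokens.
import torch
import torch.nn as nn

num_positions, num_modalities, d_model = 128, 3, 64   # e.g. 0=image, 1=radar, 2=vehicle info

pos_emb = nn.Embedding(num_positions, d_model)         # one vector per slot
mod_emb = nn.Embedding(num_modalities, d_model)        # one vector per sensor type

content = torch.randn(10, d_model)                     # 10 tokens of whatever content
positions = torch.arange(10)
modality = torch.full((10,), 1, dtype=torch.long)      # say these 10 tokens came from radar

tokens = content + pos_emb(positions) + mod_emb(modality)
# Both embedding tables receive gradients like any other weight during training.
print(tokens.shape)                                    # torch.Size([10, 64])
```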
1:05:43
Good point.
1:05:43
I don't really like
the [INAUDIBLE]..
1:05:48
They seem to work, but it
seems like they're sometimes
1:05:51
[INAUDIBLE]
1:06:08
I'm not sure if I
understand your question.
1:06:10
[LAUGHTER]
1:06:11
So I mean the
positional encoders
1:06:12
like they're actually like not--
1:06:14
OK, so they have very little
inductive bias or something
1:06:16
like that.
1:06:17
They're just vectors hanging
out in location always,
1:06:19
and you're trying to help
the network in some way.
1:06:23
And I think the
intuition is good,
1:06:28
but if you have
enough data, usually,
1:06:30
trying to mess with
it is a bad thing.
1:06:33
Trying to inject
knowledge when you
1:06:35
have enough
knowledge in the data
1:06:36
set itself is not
usually productive.
1:06:38
So it all really depends
on what scale you want.
1:06:40
If you have infinite
data, then you actually
1:06:41
want to encode less and less.
1:06:43
That turns out to work better.
1:06:44
And if you have very little
data, then actually, you do
1:06:46
want to encode some biases.
1:06:47
And maybe if you have a
much smaller data set, then
1:06:49
maybe convolutions
are a good idea
1:06:50
because you actually have this
bias coming from your filters.
1:06:55
But I think-- so the transformer
is extremely general,
1:06:58
but there are ways to
mess with the encodings
1:07:01
to put in more structure.
1:07:02
Like you could, for example,
encode [INAUDIBLE] and fix it,
1:07:05
or you could actually go
to the attention mechanism
1:07:07
and say, OK, if my image
is chopped up into patches,
1:07:10
this patch can only communicate
to this neighborhood.
1:07:13
And you just do that in
the attention matrix,
1:07:15
you just mask out whatever
you don't want to communicate.
1:07:18
And so people really
play with this
1:07:19
because the full
attention is inefficient.
1:07:22
So they will intersperse,
for example, layers
1:07:25
that only communicate
in little patches
1:07:26
and then layers that
communicate globally.
1:07:28
And they will do all
kinds of tricks like that.
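A small sketch of that masking trick, with an illustrative window size (my own example, not any specific paper's code): each patch may only attend to a local neighborhood.

```python
# Restrict attention by masking: each of 9 patch tokens attends only to +/-1 neighbor.
import torch

T, window = 9, 1
idx = torch.arange(T)
local_mask = (idx[None, :] - idx[:, None]).abs() <= window   # (T, T) boolean

scores = torch.randn(T, T)
scores = scores.masked_fill(~local_mask, float("-inf"))      # cut the disallowed edges
weights = torch.softmax(scores, dim=-1)
print(weights)   # layers like this can be interleaved with fully connected ones
```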
1:07:30
So you can slowly bring
in more inductive bias.
1:07:33
You would do it, but
the inductive biases
1:07:35
are like they're factored out
from the core transformer.
1:07:38
They are factored out
into the interconnectivity
1:07:41
of the nodes.
1:07:42
And they are factored
out into the positional encodings--
1:07:44
and you can mess with
this for computation.
1:07:49
[INAUDIBLE]
1:08:02
So there's probably about 200
papers on this now if not more.
1:08:06
They're kind of hard
to keep track of.
1:08:07
Honestly, like my Safari
browser, which is-- oh,
1:08:10
it's all up on my computer,
like 200 open tabs.
1:08:13
But yes, I'm not
even sure if I want
1:08:20
to pick my favorite honestly.
1:08:23
Yeah, [INAUDIBLE]
1:08:42
Maybe you can use a transformer
like that [INAUDIBLE]
1:08:45
The other one that I
actually like even more
1:08:46
is potentially, keep
the context length fixed
1:08:49
but allow the network to
somehow use a scratch pad.
1:08:53
And so the way this works is
you will teach the transformer
1:08:55
somehow via examples
in [INAUDIBLE] hey,
1:08:57
you actually have a scratch pad.
1:09:00
Basically, you can't
remember too much.
1:09:01
Your context length is finite.
1:09:02
But you can use a scratch pad.
1:09:04
And you do that by emitting
a start scratch pad,
1:09:06
and then writing whatever you
want to remember, and then
1:09:08
end scratch pad.
1:09:10
And then you continue
with whatever you want.
1:09:12
And then later
when it's decoding,
1:09:14
you actually have
special logic
1:09:15
so that when you detect
the start scratch pad token,
1:09:18
you will save
whatever it puts
1:09:19
in there in some external thing
and allow it to attend over it.
1:09:22
So basically, you can teach the
transformer just dynamically
1:09:25
because it's so good at meta-learning.
1:09:27
You can teach it dynamically
to use other gizmos and gadgets
1:09:30
and allow it to expand
its memory that way
1:09:31
if that makes sense.
1:09:32
It's just like human learning
to use a notepad, right.
1:09:35
You don't have to
keep it in your brain.
1:09:37
So keeping things in your
brain is like the context length
1:09:39
of the transformer.
1:09:39
But maybe we can just
give it a notebook.
1:09:42
And then it can query the
notebook, and read from it,
1:09:45
and write to it.
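Since this scratch-pad mechanism is described speculatively here, the following is only a toy sketch of the decoding-time bookkeeping it would need; the marker tokens and the model.next_token call are hypothetical names of my own.

```python
# Speculative sketch: detect scratch-pad markers during decoding and store the
# contents in an external memory the model can keep attending over.
def decode_with_scratchpad(model, prompt, max_steps=100):
    memory = []                                  # external storage the model writes to
    out, in_scratch, buffer = list(prompt), False, []
    for _ in range(max_steps):
        tok = model.next_token(out + memory)     # hypothetical API; attends over saved notes too
        if tok == "<scratch>":
            in_scratch, buffer = True, []
        elif tok == "</scratch>":
            in_scratch = False
            memory.extend(buffer)                # persist the notes outside the context window
        elif in_scratch:
            buffer.append(tok)
        else:
            out.append(tok)
    return out, memory
```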
1:09:46
[INAUDIBLE] transformer to
plug in another transformer.
1:09:48
[LAUGHTER]
1:09:53
[INAUDIBLE]
1:10:09
I don't know if I detected that.
1:10:10
I feel like-- did you feel
like there was more than just
1:10:12
a long prompt that's unfolding?
1:10:14
Yeah, [INAUDIBLE]
1:10:19
I didn't try extensively, but
I did see a [INAUDIBLE] event.
1:10:22
And I felt like the block
size was just moved.
1:10:28
Maybe I'm wrong.
1:10:28
I don't actually know about
the internals of ChatGPT.
1:10:31
We have two online questions.
1:10:33
So one question is, "what do
you think about architecture
1:10:35
[INAUDIBLE]?"
1:10:38
S4?
1:10:39
S4.
1:10:40
I'm sorry.
1:10:41
I don't know S4.
1:10:42
Which one is this one?
1:10:45
The second question, this
one's a personal question.
1:10:47
"What are you going
to work on next?"
1:10:49
[INAUDIBLE]
1:10:51
I mean, so right now, I'm
working on things like nanoGPT.
1:10:53
Where is nanoGPT?
1:10:58
I mean, I'm moving basically
slightly from computer vision
1:11:01
and like computer
vision-based products, to doing
1:11:03
a little bit in the language domain.
1:11:05
Where's ChatGPT?
1:11:06
OK, nanoGPT.
1:11:07
So originally, I had minGPT,
which I rewrote to nanoGPT.
1:11:10
And I'm working on this.
1:11:11
I'm trying to reproduce
GPTs, and I mean,
1:11:14
I think something
like ChatGPT, I think,
1:11:16
incrementally improved
in a product fashion
1:11:17
would be extremely interesting.
1:11:19
And I think a lot
of people feel it,
1:11:23
and that's why it went so wide.
1:11:24
So I think there's
something like a Google plus
1:11:28
plus plus to build that I
think is more interesting.
1:11:31
Shall we give our speaker
a round of applause?
— end of transcript —