
5: Deep Learning for Natural Language – The Basics

MIT OpenCourseWare · May 11, 2026
Transcript ~13802 words · 1:17:03
0:16
Okay. So today we start the natural
0:20
language processing sequence and so just
0:23
to give you a quick idea we're going to
0:24
start with uh what's called
0:26
vectorization
0:27
uh and then the bag of words model and
0:29
then we'll spend a fair amount of time
0:30
on a Colab, and then on Wednesday we
0:33
talk about these things called
0:34
embeddings which you'll come to
0:36
appreciate over the next couple of
0:38
weeks form the core
0:40
atomic unit of all modern natural
0:42
language processing and for that matter
0:45
vision processing as well. uh and then
0:47
the following week we'll do
0:49
transformers: two lectures on
0:50
transformers we'll get into the theory
0:52
and then we'll get into a bunch of
0:53
applications and then lectures nine and
0:55
10 will be all about LLMs, so
0:59
it's going to be a lot of fun. This is
1:01
one of my favorite segments of the class
1:04
of course truth be told every segment of
1:05
the class is my favorite so don't judge
1:08
me all right so let's get going uh so
1:10
why natural language processing?
1:13
You know, these are in some sense the
1:16
things I have on the slide here are sort
1:17
of obvious but I think it's actually
1:18
worth reminding ourselves of how
1:21
important text is for everything we do.
1:24
Uh obviously human knowledge is mostly
1:26
encoded as text. The internet is mostly
1:29
text. At least this was true till the
1:30
advent of TikTok and YouTube. And
1:33
human communication is mostly text and
1:35
cultural production you know movies,
1:37
books, uh arts and so on. So much of it
1:40
is so text-heavy, and so in some sense
1:43
text forms not just a big chunk of all
1:47
the media that's out there but it also
1:49
happens to be the way in which we think
1:50
and communicate and so on and so forth.
1:52
So it's sort of uh primacy is in my
1:55
opinion sort of unparalleled uh in how
1:57
we think about the world. And so the the
1:59
tantalizing possibility is that imagine
2:02
if we had an AI system which could just
2:04
read and quote unquote understand all
2:06
this text, right? Um and so you can
2:09
imagine such a system reading all of
2:11
PubMed, reading all the medical
2:13
literature and then coming back and
2:15
saying you know for this particular
2:17
disease you know this particular sort of
2:19
protein is actually the malfunctioning
2:21
protein and for that that small molecule
2:23
is going to dock into the protein and
2:24
cure the disease and you didn't know
2:26
this. It came back and told you that.
2:27
Wouldn't it be unbelievable? So my
2:29
feeling is that such things are going to
2:31
happen. It's just that it's not going to
2:33
happen soon enough for my lifetime, but
2:36
perhaps it'll happen in yours. All
2:38
right. Okay. So, let's continue. So, NLP
2:40
is in action all around us. Um, you
2:42
know, according to Google, apparently
2:44
Google autocomplete, uh, which uses a
2:46
fair bit of NLP, uh, saves 200 years of
2:49
typing time apparently, every day. Uh, I
2:53
actually thought it was, you know, this
2:54
I wasn't very impressed with this
2:55
number, frankly, because billions of
2:57
searches are being done every day and
2:58
I'm like, only 200 years? So anyway, but
3:01
I think the more important point is that
3:03
it made mobile possible right if you if
3:06
you didn't have autocomplete people
3:08
would not be you know typing and pecking
3:09
on their keyboards it's going to be much
3:11
worse it would have had a hugely
3:13
dampening effect on e-commerce for
3:15
instance so this humble little
3:17
autocomplete has incredible incredible
3:19
impact on the world economy and the
3:21
other thing which I heard about I'm not
3:23
sure if it's 100% true but it's an
3:25
interesting example apparently the very
3:26
first iPhone keyboard that came out
3:28
right the soft keyboard not the hard
3:30
keyboard. Um they had some very basic,
3:34
you know, sort of word continuation
3:35
prediction going on. And so when
3:38
you start typing T and H, obviously it's
3:41
going to guess the E is going to come
3:43
next, right? So that part is old old
3:46
news, nothing new there. But apparently
3:48
the E letter in the keyboard will become
3:50
slightly bigger. So when your finger
3:53
goes towards it, it has a better shot of
3:54
actually connecting with it. Right? So
3:57
these kinds of things are used to change
3:59
the UI in real time in a whole bunch of
4:01
applications and you just don't even
4:02
realize it. All right. So uh and of
4:06
course we all know about LLMs at this
4:08
point. So I asked it to write a
4:09
limerick about the beauty and power of
4:11
deep learning yesterday and it says in a
4:13
world where data flows like a stream
4:15
deep learning is more than a dream.
4:16
Sifts through the noise with an elegant
4:18
poise unveiling insights that gleam.
4:22
Cool, right? All right. So let's get
4:25
back to work. Uh so NLP has
4:26
extraordinary potential for making
4:28
products and services much, much
4:30
smarter. Uh and what I want to point out
4:33
here is that you know even if you focus
4:35
on this very very simple sort of
4:37
formalism right a bunch of text comes in
4:40
a bunch of text goes out that's it. If
4:42
you take that very simple text in text
4:44
out formalism this little humble little
4:46
thing has just an enormous enormous
4:49
range of applicability. Right? So
4:51
obviously you can send a bunch of text
4:53
in and ask it to classify it right for
4:56
you know, sentiment; route it for
4:58
customer support you can try to figure
5:00
out the intent of what the person is
5:01
asking in search you can filter it you
5:03
can content filter to make sure there's
5:04
no toxic abusive stuff going on I mean
5:06
the the possibilities for just text
5:08
classification are numerous okay but
5:11
that's a that's sort of a use case we
5:12
are all kind of familiar with right so
5:14
no surprise there now text extraction we
5:17
may be less familiar with here and the
5:19
idea is that you can actually look at a
5:20
lot lot of uh unstructured textual data
5:23
and extract all sorts of interesting
5:25
entities from it. Right? Hedge
5:27
funds use it very heavily. They will
5:29
extract all sorts of company information
5:30
from news articles, and then obviously
5:33
doctor's notes. There are a whole bunch
5:34
of NLP startups that will take the
5:36
doctor-patient conversation,
5:38
transcribe it and then extract disease
5:40
codes, diagnosis codes, medication codes
5:43
and things like that. Uh right. So the
5:45
possibilities for this are enormous. Of
5:47
course text summarization and we all
5:48
have been doing it thanks to ChatGPT
5:50
right take text in and any kind of
5:53
summary that comes out of the text is
5:54
just text out okay and then text
5:57
generation of course we can take text
5:58
and do marketing copy sales emails
6:00
market summaries so on so forth and
6:01
including troublingly for educators
6:03
college application essays
6:06
code generation is a more subtle example
6:10
of text out because code is just text
6:14
right so text in text out also covers
6:16
text in, code out. Okay. And question
6:20
answering. So you can take a bunch of
6:22
text,
6:24
you can take a whole bunch of documents,
6:25
you can add a bit of text to it which is
6:27
your question and this whole thing at
6:29
the end of the day is just text in
6:31
and then you can use it
6:33
to answer questions and therefore create
6:35
chat bots for all sorts of interesting
6:36
applications.
6:39
And you know if you look at this example
6:42
call centers that's that is where a lot
6:44
of money is being spent right now to
6:46
build these call center chatbots for
6:47
text in, text out question answering and
6:49
so just if you drill into this right if
6:52
you imagine taking all the call center
6:54
transcripts and their internal product
6:56
documentation service documentation FAQs
6:59
etc stick it in you can start to answer
7:02
these kinds of questions okay yesterday
7:04
what are the top reasons why customers
7:05
were upset with us what interventions
7:08
made by the agent actually worked what
7:09
did not work, right? What characterizes
7:12
the best agents from the rest? How
7:14
should we grade this particular agent's
7:16
interaction with the particular
7:16
customer? How should we
7:18
change the call center script? How should
7:20
we coach the agent in real time? Every
7:23
one of these applications is amenable to
7:25
this very humble text in, text out
7:26
model.
7:28
Okay. And so, and of course the
7:30
potential is now clear; everybody knows
7:32
this potential because of the advent of
7:33
large language models. Uh, by the way,
7:36
Google has released something called
7:38
Google Gemini 1.5 Pro a couple of days
7:42
ago. Uh, and it's incredible.
7:46
It's incredible, right? And anyway,
7:49
we'll get back to that later. But the
7:50
point is that the kind of potential we
7:52
have is just amazing, even for text in,
7:54
text out. Okay. And as you would imagine
8:00
>> this is all like though we are calling
8:02
it language this is all primarily
8:04
English right
8:05
>> now there are lots of multilingual uh
8:07
models as well uh there are multilingual
8:09
models by that I mean models which are
8:12
specialized to other languages
8:13
non-English languages and models which
8:15
are truly multilingual, like
8:16
polyglot models as well and both of them
8:18
are available uh right now and many many
8:21
modern LLMs are actually trained from
8:23
the get-go to be multilingual in a bunch
8:26
of the what are called high resource
8:28
languages. Languages which are spoken by
8:30
lots of people. Uh but actually it's
8:32
funny you should ask that question
8:33
because this Google Gemini model
8:34
that I just described, so
8:37
there is a language called Kalamang
8:40
which is spoken by 200 people in the
8:41
world and so a researcher had created a
8:45
single book which is sort of like a grammar
8:48
manual for Kalamang, right, because there
8:50
are no other written works in that
8:52
language. And so what they did is they
8:54
took a whole bunch of English dialogue
8:56
and this book, fed it into Google
9:00
Gemini 1.5 Pro and it translated
9:04
into Kalamang at human-level
9:06
proficiency.
9:07
It had never seen it before. So that's
9:10
an example
9:12
of of this.
9:15
Yes. So the question is the question
9:18
text here is all the things you want to
9:19
translate from English to Kalamang. The
9:21
documents here is just one document
9:23
singular the grammar book the manual and
9:25
then what comes out is a translation. So
9:29
these models even when they're not
9:30
explicitly trained on a different
9:31
language if you give them enough of sort
9:34
of grammar manuals and stuff like that
9:35
they may do a pretty decent job from the
9:37
get-go with no training.
9:40
It's kind of a shocker. Two years ago
9:42
people would be like that's impossible.
9:44
All right. So
9:47
back to this.
9:50
All right. And as you folks, you know,
9:51
may already know and maybe you're in
9:53
fact participating in this gold rush
9:54
already. Um, you know, lots of people
9:57
are creating lots of really cool
9:58
companies to take some of these ideas
10:00
and actually create really interesting
10:02
products and services out of them. Um,
10:04
so if you're not doing it and if you've
10:06
been thinking about entrepreneurial
10:07
stuff, here's a word of advice. Take the
10:10
plunge.
10:15
Dismissed. Just kidding. All right. So,
10:18
and as you can imagine, enterprise
10:19
vendors are rushing to add NLP to all
10:22
their products. Salesforce Einstein now
10:24
has Einstein GPT. Microsoft has
10:27
co-pilot. I mean, the list goes on.
10:28
Everybody, everybody's like scrambling
10:30
and really trying hard to infuse some
10:32
GPT magic into whatever they're doing.
10:34
Okay, some of it is real, a lot of it is
10:36
not. Uh, okay. So, let's go to like the
10:39
arc of NLP progress. How did we get to
10:41
this kind of crazy times that we live
10:43
in? Um so if you look at natural
10:46
language processing basically efforts to
10:48
take language and try to analyze
10:50
language and you do predictions with
10:52
language and so on and so forth. Um
10:56
the first phase of it was just
10:58
handcrafted rules based on linguistics.
11:00
So these are all linguists who would
11:02
really understand the grammar of a
11:03
language and then they would use a deep
11:05
knowledge of linguistics to figure out
11:07
all these rules by which you can process
11:08
and analyze natural language text. And
11:11
then this other thing came along which
11:13
was a statistical machine learning
11:15
approach which basically said never mind
11:17
all that complicated knowledge of
11:19
linguistics and grammar. Why don't we
11:21
simply count things? Let's count the
11:24
number of times these two will co-
11:25
occur. Now let's count that. Let's count
11:26
this basically just count a lot. Okay.
11:29
And let's see how well it
11:31
does for predicting things, say for
11:32
classifying text and so on. And
11:34
shockingly those methods ended up being
11:36
really good. They ended up being really
11:39
good and in fact they actually were
11:41
better than the lovingly hand-curated
11:44
linguistically driven rules. Okay, so
11:47
much so that there's a famous quote which
11:50
says every time I fire a linguist the
11:52
performance of the speech recognizer goes up
11:55
right, obviously said in jest, but
11:57
there is a kernel of truth to it.
11:59
So that's
12:01
where we were, and then deep
12:03
learning happened okay um in 2012
12:06
roughly and then we had these things
12:08
called recurrent neural networks which
12:09
are based on deep learning which
12:11
actually moved the ball forward and then
12:13
in 2017
12:15
something called the transformer was
12:17
invented
12:18
and the transformer replaced
12:21
everything else across the board so we
12:26
are just going to leapfrog directly to
12:27
transformers; in this course we will not spend
12:29
any time on recurrent neural networks, and
12:30
that is not to say that they are sort of
12:32
dead. Um, there's a very
12:35
interesting work which actually is
12:36
trying to now revive recurrent neural
12:38
networks to make it work for these kinds
12:40
of modern LLM kinds of tasks but it's
12:42
still very early days. Okay. So for now
12:44
we'll just focus on transformers.
12:46
Okay. So the the very high level view of
12:49
the problem here is that like most
12:51
things in deep learning it's basically
12:53
fancy regression.
12:55
There is some variable X that comes in.
12:57
It goes through
12:59
this very complicated function along
13:01
with this W which is the weights and
13:03
then out pops an output. Right? That's
13:05
just the view that you've always had.
13:07
And so in this case X happens to be
13:10
text. Y can be text. It could be labels.
13:12
It could be numbers. It could be
13:13
anything else. The W is the weights. And
13:15
the function is a deep neural network.
13:16
Right? This by by at this point when you
13:19
look at this slide it should be like
13:20
blindingly obvious.
13:23
So now the key question here is how do
13:26
you actually represent X? That's the key
13:28
question. For pictures, for images, we saw
13:31
that we just took the pixel values which
13:34
were light intensity numbers between 0
13:36
and 255 and you could just use that
13:37
directly. But when a sentence
13:39
comes in like I love deep learning like
13:41
what do you do right how do you actually
13:43
represent it because remember we have to
13:45
numericalize everything that's coming
13:46
in. So that's a key question and and
13:49
this actually is a very subtle question
13:50
very important question and we'll focus
13:52
on that today and then next week when we
13:56
look at transformers we will look at
13:58
what neural network architecture is best
14:00
suited to process this sort of text
14:02
inputs that are coming in right those
14:04
are the two big questions we're going to
14:06
look at all right so processing basics
14:11
We're going to follow this very standard
14:12
process
14:15
this is the process by which we take any
14:18
any text that comes in and we do run it
14:21
through these four steps and this
14:23
process is called text vectorization and
14:25
as the name suggests, we are
14:26
essentially taking text and creating
14:28
vectors of numbers out of it right text
14:30
vectorization and we'll go through each
14:32
of these processes one after the other
14:34
so I just find it very useful to just
14:36
have this acronym STIE in my head, like
14:39
STIE. Just keep that in mind; it may be
14:41
helpful. All right, so what we do is
14:45
the setup here is that we have a whole
14:48
bunch of documents, right? We call it
14:50
the training corpus. We have a whole
14:51
bunch of text documents, text data. Uh,
14:54
and as far as we are concerned, you can
14:55
just imagine it as just lists of long
14:58
passages. Okay? What is a novel? It's
15:01
just a long passage, right, of text. So
15:03
whether it's a novel or a sentence
15:05
doesn't really matter. We just think of
15:07
them as a big list of strings, a big
15:09
list of text. Okay, that's a training
15:11
corpus. And what we do is we take this
15:13
training corpus and we run it through
15:15
and we apply standardization and
15:17
tokenization which I will describe to
15:19
this entire training corpus up front.
15:22
Okay. So we first do this, and
15:26
standardization is basically
15:29
the default for most applications tends
15:32
to be this which is we first strip
15:34
capitalization and make everything lower
15:36
case
15:38
and then we remove punctuation and
15:40
accents and so on and so forth. Okay,
15:42
that's the first thing we do. I'll talk
15:44
about why we do it in just a moment, but
15:46
the mechanics of it are we do this
15:48
first. Then we look at words like a,
15:51
the, it, and so on and so forth.
15:53
Basically filler words, right? Which
15:55
which we need to actually make
15:57
complete sentences, but they may not
15:59
have any value predicting things. So we
16:02
remove them and they are called stop
16:03
words. And then finally we take words
16:06
which are very similar which have sort
16:08
of a same kind of stem or root and then
16:10
we just map it to like a common
16:12
representation like ate eaten eating
16:14
eaten all these things just becomes
16:16
let's say eats and we do that sometimes.
16:19
So this we almost always do this we
16:21
often do and this we do it sometimes.
16:23
Okay. Now, why do we do any of these
16:25
things?
16:34
>> I think we want to try to recognize the
16:36
essential thing with the word, right?
16:38
Whether it's eaten or eat, but the
16:40
essential thing is the eat, right? So,
16:42
we want to try to sort of abstract from
16:45
it the more essential thing,
16:47
>> right? So, why do we need to abstract? I
16:49
guess you're absolutely correct. We're
16:50
trying to abstract. Why is there a
16:52
benefit to doing this abstraction?
16:58
How about somebody from this side of the
16:59
room? Oh yes.
17:03
>> So I want to reduce the library.
17:07
>> Why is it a good idea to reduce the
17:08
library? The size of the library
17:12
>> because of the the amount of computation
17:14
needed. So that is part of the answer.
17:17
There's another part to the answer which
17:20
says all right let's swing to the right
17:26
um is it faculties comparison between
17:28
different sets
17:30
of standard
17:33
[clears throat]
17:34
>> okay so I will go with that but I think
17:37
the the key thing we want to uh the key
17:39
thing to realize here is that you want
17:42
the model much like when you go when we
17:44
talk about computer vision we said look
17:46
if it's a vertical line, I want to be able
17:48
to detect it wherever it happens. I
17:51
don't want the model to think that the
17:52
vertical line on the left side is
17:54
different from the vertical line on the
17:55
right side and then later realize they
17:57
are the same thing because you would
17:58
have wasted valuable capacity learning
18:00
things which actually happen to be the
18:02
same because you didn't know it was the
18:03
same. So here if you for example take a
18:06
word and lowercase it, clearly the case of
18:09
it whether it's uppercase or lower case
18:11
most of the time it's not going to
18:12
matter for anything you want to predict.
18:14
So you're essentially telling the model
18:16
you know, the lowercase version, uppercase
18:18
version they are not different they're
18:19
actually the same and the easiest way to
18:21
tell the model they are the same is just
18:23
make everything lower case so that is
18:25
the key idea okay and similarly if you
18:29
look at stop words the reason is that
18:31
these stop words may not help you
18:32
predict anything. Whether a word like 'and' or
18:34
'the' showed up in a movie review probably
18:36
does not affect the sentiment of the
18:38
review and therefore let's remove it so
18:40
that's a slightly different reason
18:42
stemming is the same reason as the first
18:44
which is that all these words kind of
18:46
mean the same thing. We don't have to be
18:48
super precise about it and so let's just
18:50
like collapse them onto the same thing.
18:51
Now that these are all the standard
18:54
things we do there are totally notice
18:57
you know important exceptions to all
18:58
these things. Okay we'll come back to
19:00
the exceptions a bit later but that is
19:02
the standard thing we do. Make sense? All
19:05
right.
19:08
So if you look at something like this um
19:11
this sentence here right hola what do
19:14
you picture when you think of travel
19:15
Mexico boom and then you can see here
19:17
this is the standardized version like
19:20
everything has become lower case like
19:21
the h has become small h the punctuation
19:24
has disappeared that's part of
19:25
standardization and then uh travel and
19:29
you can see here that Mexico m has
19:32
become small sipping has become sips uh
19:35
things think has become things and so on
19:37
and so forth
19:38
So that's an example of standardization at
19:41
work.
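To make the standardization step concrete, here is a minimal Python sketch of the defaults just described: lowercasing, stripping accents and punctuation, and optional stop-word removal and stemming. The helper name, the tiny stop-word list, and the stem map are illustrative assumptions, not the lecture's Colab code.

```python
import re
import unicodedata

# Toy stop-word list and stem map for illustration only; a real pipeline
# would use a library such as NLTK or spaCy for these steps.
STOP_WORDS = {"a", "an", "the", "it", "of", "and"}
STEM_MAP = {"ate": "eat", "eaten": "eat", "eating": "eat", "eats": "eat"}

def standardize(text, remove_stop_words=False, stem=False):
    # 1. Lowercase (almost always done).
    text = text.lower()
    # 2. Strip accents and punctuation (usually done).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    # 3. Optionally drop stop words.
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    # 4. Optionally stem, i.e. map variants to a common root (done only sometimes).
    if stem:
        words = [STEM_MAP.get(w, w) for w in words]
    return " ".join(words)

print(standardize("¡Hola! What do you picture when you think of Mexico?"))
# -> "hola what do you picture when you think of mexico"
```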
19:47
Okay.
19:49
The next thing we do is something very
19:51
important and it's called tokenization.
19:53
So what we do typically is that okay now
19:55
we have standardized everything. We have
19:56
a bunch of words. Uh we need to now
19:59
split them into what are called tokens.
20:01
So the most common default is to just
20:04
think of a word as a token.
20:07
We just split on the white space, right?
20:09
You take each string and wherever there
20:11
is white space, meaning actual spaces,
20:14
uh, carriage returns and things like
20:15
that, boom, you just split on them and
20:17
you just create words out of it. So, so
20:20
for instance, if you have this
20:22
standardized sentence here, you just
20:24
split it after every word and you get
20:26
this thing. Okay? So, each of these is
20:29
now a token.
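As a minimal sketch, the default tokenization really is just a split on whitespace; the standardized example sentence here is only an illustration.

```python
sentence = "hola what do you picture when you think of travel mexico"

# Default tokenization: split on whitespace, so each word becomes one token.
tokens = sentence.split()
print(tokens)
# ['hola', 'what', 'do', 'you', 'picture', 'when', 'you', 'think', 'of', 'travel', 'mexico']
```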
20:32
Now, this has some disadvantages.
20:36
What are some disadvantages of just
20:38
splitting on on the space between words?
20:40
Uh yeah,
20:43
>> I think we lose any context because we
20:46
look at each word separately. Uh so we
20:49
don't have any password or what happens
20:52
next,
20:53
>> right? So for example, the cat sat on
20:55
the mat and the mat sat on the cat will
20:57
have the same like set, right? Yeah. So
21:00
you lose the order. What are some other
21:02
issues with it?
21:05
for words that should have two together
21:07
like you lose the fact that that's one
21:10
name because you separated
21:11
>> right exactly so there are compound
21:14
words right like father-in-law for
21:16
instance that's one problem another
21:18
problem is that lots of non-English
21:20
languages they actually don't have this
21:22
notion of a space between words right
21:25
actually runs one after the other and it
21:27
is and the native speakers know from
21:29
context how to chunk it and break it so
21:32
well what do we do Right?
21:34
Because you basically will have one word
21:36
for the whole passage, one token. The
21:39
other problem is that there are
21:40
languages, German is perhaps the most
21:42
notable one in which you have very long
21:44
words.
21:47
Um I saw a word uh which I think I might
21:50
have it on the slide somewhere, it's like
21:52
this long which means uh
21:57
you realize that something amazing is
21:59
happening but the rest of the world
22:00
hasn't woken up to it yet. It's that
22:02
feeling.
22:04
There's a word for that. Amazing, right?
22:07
Anyway, so yeah, some words or Japanese,
22:10
for example, there's a word called komorebi. Do
22:12
people know the meaning of the word
22:13
komorebi?
22:16
It means the transient beauty of
22:20
sunlight going through fall foliage.
22:24
There's a word for that. How cool is
22:26
that? Anyway, sorry. I love that word.
22:29
So, back to this. Um so we have this
22:31
thing here. So there are all reasons for
22:33
which splitting on the space between
22:35
words is not going to work. Okay. Um
22:38
so, what about
22:41
modern large language models? So,
22:44
what we have described so far, despite
22:46
its shortcomings, is actually really good
22:47
for lots of NLP use cases. Okay. If you
22:50
want to classify text as good enough for
22:52
instance but if you want to generate
22:54
text like LLMs do it's not going to
22:57
work. It's not going to work because you
22:59
know, when you ask ChatGPT a question
23:01
it comes back with perfect punctuation.
23:03
Clearly punctuation was not stripped. It
23:05
comes back with particular upper and
23:07
lower case clearly that wasn't stripped.
23:09
You can actually make up new words and
23:11
ask it to use the new word, and it'll
23:12
use it. Therefore, it's not like
23:15
it can only recognize a finite set. So
23:17
there's a very clever scheme called byte
23:19
pair encoding, right, which was
23:22
invented to do all those things. And I
23:24
have slides at the end and if we have
23:26
time we'll talk about it.
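For a rough idea of what byte pair encoding does before we get to those slides, here is a toy sketch of its core loop: start from characters and repeatedly merge the most frequent adjacent pair of symbols. This is only an illustration of the idea under simplified assumptions, not the actual GPT tokenizer.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair-encoding trainer: repeatedly merge the most frequent
    adjacent pair of symbols across the whole (tiny) corpus."""
    corpus = [list(w) for w in words]  # start with each word as a list of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair with one symbol.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=4)
print(merges)  # the first merge is the most frequent adjacent pair, ('w', 'e') here
print(corpus)  # each word rewritten as a sequence of learned subword units
```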
23:28
All right, for now let's continue this
23:29
thing. So when this is done for every
23:33
sentence or every uh passage in our
23:35
training data set, we now have a
23:37
list of distinct tokens, right? We have
23:40
a list of distinct tokens. In this
23:41
simple case, it happens to be all the
23:42
distinct words that we have seen, right?
23:45
That's called the vocabulary.
23:47
That's called the vocabulary.
23:49
So now we move to the third and fourth
23:51
stages. In these stages, the
23:53
indexing and encoding stage, we only
23:55
work with the vocabulary. Okay. And so
23:58
what we do is the first thing the
24:00
indexing we assign a unique integer to
24:03
each distinct token in the vocabulary.
24:05
So for instance, let's say that you know
24:07
you took a whole bunch of English
24:09
literature as your training corpus and
24:12
you ran it through, you'll basically
24:14
come up with an English dictionary, right?
24:16
So it'll have maybe starting with a all
24:18
the way to zebra a whole bunch of words.
24:20
Um, and so I'm just putting 50,000 here
24:24
because it turns out the GPT family uses
24:26
something like 50,000 tokens. So I'm
24:28
just using 50,000. It's not the actual
24:30
number of words in the English language.
24:31
It's much more than that. So let's say
24:33
that we give a number one through
24:35
50,000. And then we actually also
24:37
introduce a special token called UNK. It
24:40
stands for unknown. And we'll come back
24:42
to this later. And we give unknown the
24:44
integer zero.
24:46
Okay. So this is what we mean
24:48
by indexing: take the tokens
24:51
you have identified and just map it to
24:52
an integer.
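A minimal sketch of the indexing step, using a made-up toy vocabulary; in practice the vocabulary would come from the training corpus, and 0 is reserved for the UNK token.

```python
# Every distinct token gets a unique integer; index 0 is reserved for UNK ("unknown").
vocabulary = ["[UNK]", "a", "cat", "mat", "on", "sat", "the", "zebra"]
token_to_index = {token: i for i, token in enumerate(vocabulary)}

print(token_to_index["cat"])            # 2
print(token_to_index.get("hamlet", 0))  # 0 -> unseen words fall back to [UNK]
```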
24:55
Okay, that's the indexing step. Then
24:57
what we do is we assign a vector to
25:00
every one of these integers.
25:03
Okay, and that is the encoding step. We
25:05
assign a vector to each integer.
25:08
So you have a bunch of distinct words
25:10
and each word we put an integer on it
25:12
and then we take that integer and map it
25:14
to a vector. Yeah. Can you please
25:16
explain
25:17
to
25:18
>> Can you please explain what unknown
25:20
means?
25:20
>> Yeah. So, so I'll come back to that for
25:23
now. Just assume that we have a token
25:25
called unknown. And the way we are going
25:26
to use it will become apparent in a few
25:28
minutes.
25:29
>> Does it mean there's a base to it
25:31
though? There's like a letter or
25:32
something.
25:32
>> It's it's a it's a placeholder for
25:34
something else which I'll describe
25:36
shortly.
25:38
Okay. So, that's what we have. U so
25:42
let's say that we want to assign a
25:44
vector to each integer in our vocabulary
25:46
and let's assume that we have uh okay
25:50
let's say we have 50,000 possible
25:52
integers because we have 50,000 possible
25:54
words and we want to assign a vector so
25:56
that if you take the vector of two
25:58
different words they should look
25:59
different right clearly that's the whole
26:02
point of mapping from integer to vector
26:04
they better be different uh what is the
26:06
simplest way to come up with a vector
26:08
for each of these tokens?
26:20
the same as the index.
26:21
>> Sorry,
26:22
>> the same as the index. It's just a
26:24
vector one one by one with the index.
26:26
>> So, a vector of uh zeros and ones or
26:31
>> it's just a vector with one dimension.
26:34
>> Oh. Oh, I see. So, god. Well, it's it
26:38
it's creative, but it's a little
26:39
cheating, right? Because you're
26:40
essentially putting a square bracket
26:42
around the number and saying it's a
26:43
vector. Good try.
26:47
>> You can try one hot encoding,
26:48
>> right? You can try one hot encoding.
26:51
So remember the list of distinct tokens
26:53
you have, you can just think of them as
26:55
the distinct levels of a categorical
26:57
variable,
26:59
right? And you can just use one-hot
27:01
encoding for it.
27:04
So what you can do is you can the
27:07
simplest thing is to do one-hot
27:08
encoding and the way it's going to work
27:10
is that if you have let's say 50,000
27:13
uh 50,000 possible values the vector is
27:16
going to be 50,000 long it's going to
27:17
have zeros everywhere except in the
27:20
index value of whatever that token is.
27:22
So for instance, since we said UNK is
27:25
going to be the first uh first number
27:28
zero, it has a one in the
27:31
zero index position and everything
27:33
else is zero. 'a' happens to be the second one,
27:36
so it happens to be one in the second
27:37
position, zero elsewhere. You get the idea.
27:40
okay
27:40
>> so this is really one-hot encoding; we can do
27:42
the one-hot encoding,
27:45
and so the dimension of this encoding
27:47
vector how long it is it's basically the
27:50
number of distinct tokens that you have
27:51
seen in in the training corpus plus one
27:54
for this UNK thing that we'll get to.
27:59
Okay,
28:01
so that is the dimension of the encoding vector,
28:03
which is called the vocabulary
28:05
size.
28:09
It's called the vocabulary size.
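A minimal sketch of the one-hot encoding step on the same kind of toy vocabulary; the helper name one_hot and the word list are just for illustration.

```python
import numpy as np

# Toy vocabulary for illustration; index 0 is reserved for [UNK].
vocabulary = ["[UNK]", "a", "cat", "mat", "on", "sat", "the", "zebra"]
token_to_index = {tok: i for i, tok in enumerate(vocabulary)}
vocab_size = len(vocabulary)

def one_hot(token):
    # A vector of zeros of length vocab_size with a single 1 at the token's index;
    # tokens not in the vocabulary fall back to index 0, the [UNK] slot.
    vec = np.zeros(vocab_size)
    vec[token_to_index.get(token, 0)] = 1.0
    return vec

print(one_hot("cat"))     # [0. 0. 1. 0. 0. 0. 0. 0.]
print(one_hot("hamlet"))  # [1. 0. 0. 0. 0. 0. 0. 0.]  -> unseen word maps to [UNK]
```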
28:13
All right. So at this point we have
28:16
created a vocabulary for the training
28:18
data training corpus. every distinct
28:20
token vocabulary has been assigned a one
28:22
hot vector and we are done with basic
28:24
pre-processing.
28:26
Okay, so all the text that has come in,
28:29
every token has been mapped to some one
28:31
hot one potentially very long one hot
28:33
vector.
28:35
Any questions on the mechanics of this
28:37
before we continue on?
28:45
>> Now let's see if when you get a new
28:47
input sentence in a new sentence freshly
28:50
arriving and we want to feed it into a
28:52
deep neural network, how will this
28:53
process actually apply to the new
28:55
sentence that's coming in? Okay, so
28:57
let's assume um that we have completed
29:00
our STIE on the training corpus and it
29:02
turns out we found only you know 99
29:05
distinct tokens 99 distinct words and
29:08
then we add this UNK thing to it, so we
29:10
got 100. Okay, so this is our vocabulary;
29:13
it starts with UNK, a, and then goes all
29:16
the way to zebra but there are only 100
29:17
of them in total right and just to be
29:20
very clear we didn't bother to do things
29:22
like stemming and stop word removal and
29:24
stuff like that which is why you have
29:26
words like 'the' showing up in this
29:28
list.
29:30
Okay. All right. So,
29:34
let's say this input string arrives, the
29:35
cats are on the mat, and then we run it
29:38
through STIE. So, the cats are on the
29:40
mat goes through this thingoop.
29:43
Then the output is going to be a table
29:46
with a bunch of rows and a bunch of
29:49
columns. Any guesses
29:52
how many rows and how many columns?
30:02
Just raise your hands. I'll call on you.
30:13
>> Yeah, you use a microphone. Go for it.
30:14
>> Yeah, I would guess uh 100 rows and uh
30:18
six columns.
30:20
All right, we'll take a look. Uh
30:23
100-by-6 as well as 6-by-100 are
30:24
both correct. So, so the way I've done
30:27
it is 6 by 100. And that's
30:30
exactly right. So, the idea is that this
30:33
is your vocabulary, right? So, the word
30:36
the cat sat on the mat once you change
30:38
the case of it, it becomes like this.
30:41
So, 'the' happens to be a one-hot
30:43
vector with a one where the 'the' is
30:47
and zero everywhere else. I'm not
30:48
showing all the zeros because it'll get
30:50
too cluttered.
30:52
Similarly, cat has a one where the
30:55
cat position is and zero everywhere else
30:57
and so on and so forth. Does that make
30:59
sense? So, the phrase 'the cat sat on
31:02
the mat' came in as just six
31:04
words and then it became this you know
31:06
600-entry table.
31:12
Okay. Now, what is the best way to feed
31:15
this table to a deep neural network?
31:18
What can we do?
31:23
It's not a vector. It's a table.
31:26
If it's a vector, we know what to do. We
31:27
just feed it in. We'll just maybe send
31:29
it to some, you know, hidden layer and
31:30
declare victory at that point.
31:34
>> Yeah.
31:37
>> You would like to flatten it. And like
31:38
how how might you do it?
31:43
Flattening is a reasonable answer by the
31:45
way.
31:46
I think you mean you just have to like
31:49
take each like each column
31:52
take the first one each row and each row
31:54
each word kind of like
31:56
>> yeah so basically you can take all the
31:57
first columns and then take the second
31:59
column and attach it under the first
32:01
column and so on and so forth right so
32:03
we can certainly do that and that's very
32:05
akin to how we work with images, right?
32:08
but there is one downside to that what
32:10
is that downside
32:15
uh Um,
32:18
>> it's pretty long. Like I wonder if
32:20
instead you could for the first word
32:23
it's one, for the second word it's two,
32:25
and then you maintain the order, but you
32:27
still keep it just as like one row.
32:30
>> One row. So one issue, so we'll come
32:33
back to what we do about this, but what
32:34
you're pointing out is it could be very
32:36
long, right? Because if each word is a
32:39
50,000-long one-hot vector, with just six
32:42
words, it becomes a 300,000 long vector.
32:45
Imagine take the 300,000 long vector and
32:48
sending it into a 100 hidden unit hidden
32:50
layer. 300,000 times 100 parameters. Too
32:53
much; you can't learn anything.
32:56
So that's one issue. The other issue is
32:58
that different length texts that are
33:01
coming in will have different sized
33:02
inputs.
33:04
So here the cat sat on the mat has six
33:06
times 50,000 but maybe the cat sat on
33:08
the mat and the rat rat ran over to the
33:10
cat becomes even longer. We can't handle
33:13
variable sized inputs.
33:15
the inputs all have to be mapped to the
33:16
same length.
33:19
That's another problem.
33:22
>> So maybe you can count how many you can
33:24
sum the columns basically and count how
33:26
many times each word appears since
33:27
you're using the like spatial
33:29
relationship.
33:30
>> Yes. So you Yeah. So both you and are on
33:33
the same sort of trajectory which is
33:34
that uh we need to somehow take this
33:37
table and make it into a vector. And
33:39
there are many ways like what you folks
33:40
are describing to make it into a vector
33:42
and turns out um this is all the things
33:46
that we've been discussing so far the
33:48
varying length ratio and so on. So, so
33:50
what we can do is we can aggregate all
33:53
these things. If you just add them up,
33:56
this is what you described. I believe
33:58
it's called sum encoding.
34:00
And if instead of adding you just OR
34:02
them, meaning if you look at the column
34:04
and say, is there any one in this
34:05
column? If there's any one, I'll just
34:07
stick a one there, otherwise it's a zero.
34:08
It's called multi-hot encoding. So, if
34:12
you look at this thing, if you literally
34:13
just go column by column and count
34:15
everything. Okay, there's a one here,
34:17
one here. Oh, wait. There are two ones
34:19
here, so you put a two. That's count
34:21
encoding. Multi-hot encoding, it
34:23
just looks for any ones and puts a one.
34:26
Make sense? So by the way there are many
34:28
ways to take these tables and make them
34:30
into vectors. These two happen to be
34:32
very commonly used and they kind of make
34:34
common sense.
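Here is a minimal sketch of these two aggregation schemes on the same toy vocabulary as above; the function name and the example sentence are just for illustration.

```python
import numpy as np

vocabulary = ["[UNK]", "a", "cat", "mat", "on", "sat", "the", "zebra"]
token_to_index = {tok: i for i, tok in enumerate(vocabulary)}

def bag_of_words(sentence, mode="multi_hot"):
    # Aggregate the one-hot rows column by column: either count occurrences
    # (count / sum encoding) or just record presence (multi-hot encoding).
    vec = np.zeros(len(vocabulary))
    for token in sentence.split():
        vec[token_to_index.get(token, 0)] += 1
    if mode == "multi_hot":
        vec = (vec > 0).astype(float)
    return vec

print(bag_of_words("the cat sat on the mat", mode="count"))
# [0. 0. 1. 1. 1. 1. 2. 0.]   <- 'the' appears twice
print(bag_of_words("the cat sat on the mat", mode="multi_hot"))
# [0. 0. 1. 1. 1. 1. 1. 0.]
# Note: "the mat sat on the cat" gives exactly the same vectors; the order is lost.
```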
34:39
Okay.
34:41
Right. So this aggregation approach that
34:43
we just described is called the bag of
34:44
words model.
34:46
Bag of words model. And the reason is
34:49
that first of all this bag that we have
34:51
has words either it counts whether a
34:53
word exists or not or it counts how many
34:56
words how many times the word has
34:58
appeared, right: count versus multi-hot
35:01
versus sum encoding. But
35:04
more importantly and this goes back to
35:05
your observation is that we have lost
35:09
the order of the words now whether the
35:12
phrase came in was the cat sat on the
35:14
mat or the mat sat on the cat the count
35:18
encoding and the multi-hot encoding
35:19
are exactly the same. There's no
35:21
difference because we're just looking
35:23
for the the presence or absence of
35:24
words. That's it. We don't care in what
35:27
which order they appear, right? That's a
35:29
huge limitation, but shockingly for many
35:32
applications, it doesn't matter. It's
35:34
good enough. So, it's called the bag of
35:36
words model.
35:38
All right, so this called the bag of
35:40
words model.
35:42
Um, now does it have any shortcomings? I
35:46
already talked about the first
35:47
shortcoming which is that it loses
35:48
sequentiality the order we lost this
35:51
order information right uh we we lose
35:54
the meaning inherent in the order of the
35:55
words what are some other issues with it
36:04
what do you mean by that
36:12
>> right so there are lots of zeros not
36:14
that many ones so you have it's a very
36:16
sparse amount of information but maybe
36:18
is carrying around a lot of information
36:19
to make it all work. Now there are
36:22
some tricks CS computer science tricks
36:24
to handle sparsity in some clever ways
36:26
but it is certainly an issue. Now the
36:29
other issue is that let's say the
36:30
vocabulary is very long.
36:32
Each input sentence whether it's the
36:34
collected works of William Shakespeare
36:36
or the phrase I love you will have the
36:39
same length input.
36:42
The same length input,
36:45
because ultimately every incoming thing
36:48
gets mapped into one vector. Okay, that
36:51
feels a little suboptimal.
36:54
Clearly the collected works of Shakespeare have
36:56
a lot more stuff going on in them.
36:59
Right? So that's a problem. In
37:02
particular, very very small things that
37:04
come in, you'll be spending a lot of
37:06
compute on those long vectors and
37:08
processing them. Um, now you can
37:10
mitigate some of this by choosing only
37:13
the most frequent words. You don't have
37:14
to take, you know, I think the English
37:16
language I read somewhere has roughly
37:18
500,000 words or so. Uh, but turns out
37:20
the top 50,000 most frequent words are
37:23
responsible for just about everything
37:24
you're going to see ever. And the other
37:27
450,000 are what's called the long tail.
37:29
They almost never happen, right? You
37:31
never see them. So, you can be very
37:33
pragmatic and say, "I'm not going to
37:34
take every little word that I see in my
37:36
vocabulary. I'm going to only take the
37:38
most frequent words. I'm just going to
37:40
ignore the rest.
37:42
I'm just going to ignore the rest."
37:44
Okay?
37:46
But if you ignore the rest, let's say
37:50
the there is one word uh let's take some
37:52
Shakespeare word hamlet. Let's let's
37:55
assume that you ignore the word Hamlet
37:57
from your training corpus. You just
37:58
delete it because it's not one of the
38:00
top most frequent things you have seen.
38:02
And then somebody sends you a text
38:04
saying, you know, Hamlet was a bad
38:06
prince.
38:08
Analyze the sentiment of the sentence.
38:10
Well, when you see Hamlet, what is your
38:12
system going to do?
38:14
It's going to look at the Hamlet and
38:15
say, I can't see it in my vocabulary
38:16
anywhere.
38:18
And if it can't see in the vocabulary,
38:19
what is the only thing it can do?
38:22
Replace it with UNK. So that's where
38:26
UNK comes into the picture.
38:28
So whenever it can't see something in
38:30
the vocabulary in a new input, it just
38:32
replaces it with UNK. Which means that
38:35
if you had ignored Romeo, Juliet, and
38:37
Hamlet in the training corpus,
38:40
all of them are going to be replaced by
38:42
the same UNK, which means that we can't
38:44
distinguish between them anymore.
38:46
>> So is this where hallucination
38:48
comes into play here, where it doesn't
38:52
recognize it?
38:54
Ah, interesting question. Is this
38:56
where hallucination comes up? Actually, as it
38:58
turns out, no, as we will see when we
39:00
talk about LLMs later. Uh LLMs actually
39:03
will not have this UNK problem because
39:06
they use a different tokenization scheme
39:08
which can handle anything you throw at
39:09
it, including new stuff you just made
39:10
up.
39:12
So, we'll come back to that.
39:14
All right. Um so, that's what we have.
39:17
And so what we're going to do is despite
39:19
its shortcomings, bag of words is
39:21
actually a really good default for many
39:23
NLP tasks. Uh and in the spirit of do
39:26
the simple stuff first and do
39:27
complicated things only if the simple
39:28
doesn't work. We'll use a bag of words
39:30
model right now. Okay. So we'll switch
39:32
to a Colab and see how it's done.
39:36
So here the the application we're going
39:39
to work with is kind of a fun
39:40
application. Uh we're going to try to
39:43
predict the genre of songs.
39:46
Okay, it's a nice classification use
39:47
case. Um, so we want to take some
39:50
arbitrary song and then classify it into
39:52
either hip-hop, rock or pop.
39:55
Okay. Um, and so for instance,
39:59
right, this is the kind of lyric you're
40:01
lyrics you're going to see. And as you
40:03
will see in this data set, the data set,
40:04
just a quick word of caution, uh, the
40:07
data set does have lyrics which may not
40:10
be sort of, you know, safe for work as
40:12
it were. So I'm not going to be like
40:14
exploring the lyrics in the Colab, but
40:16
I just wanted to be aware of it. Okay.
40:18
Um, so but it's just some data set that
40:20
we downloaded from somewhere, right? Uh,
40:22
it's got all these lyrics. Okay. So
40:24
we're going to try to classify each
40:25
verse that we see into one of three
40:27
things. Hip hop, rock or pop. It's a
40:29
multi-class classification problem.
40:31
All right. Actually, what is the
40:33
simplest neural network based classifier
40:35
we can build
40:37
for this problem?
40:41
All right. So what is the simplest
40:42
neural network we can build for this
40:44
problem? So remember what is the input?
40:47
The input is going to be a bunch of song
40:49
lyrics. It's going to be a really long
40:50
song for all you know, right? And we're
40:52
going to use the bag of words model. Uh
40:54
and let's assume for a moment that we
40:56
will use multi-hot encoding, right? We'll
40:59
create a vocabulary from this for the
41:02
song. We'll take all the songs. We'll
41:04
process them, run them through STIE. We'll
41:06
do multi-hot encoding, which means that
41:08
every song that comes in will
41:10
be a vector that's how long?
41:14
it'll be as long as the
41:17
Correct, as long as the vocabulary size, right. So um
41:20
so maybe what comes in is this phrase um
41:24
since it's supposed to be songs I'll say
41:26
something which is probably common to
41:28
90% of songs I love you
41:30
okay that goes in
41:34
it goes into our STIE process
41:38
and then this STIE process gives us a
41:42
vector which is X1 X2 all the way to XV
41:49
where V stands for the size of
41:50
the vocabulary. Okay. So that's our
41:52
input layer
41:54
all the way. So knowing what we know now
41:58
about deep learning what can we do next?
42:02
Couldn't you or maybe I'm getting ahead
42:04
but wouldn't the classifier just be like
42:07
the baseline would be classify it as the
42:10
most common genre?
42:11
>> That is the baseline. Correct. Correct.
42:13
I'm just saying and we'll come to the
42:14
baseline a bit later. But here I'm
42:17
saying suppose you need to you wanted to
42:18
build a neural network model for this.
42:21
How would you set it up?
42:23
>> You think about the layers that you
42:25
want,
42:26
>> right? And what is the simplest thing
42:27
you can do with a neural network? How
42:29
many layers?
42:30
>> Uh no layers. Well, then it becomes
42:33
problematic with even a neural network
42:35
because it could just be logistic
42:36
regression
42:37
>> one hidden layer.
42:38
>> Yes, thank you. I'm being a little
42:41
squishy about this because there are
42:43
some people who be like well even if
42:44
there's no hidden layers if you're using
42:46
ReLUs and this and that and sigmoids, then
42:48
maybe it's a neural network and I don't
42:49
want to get into that how-many-angels-on-
42:51
the-head-of-a-pin argument. So, yeah,
42:54
we need one hidden layer right in this
42:56
course we need at least one hidden layer
42:57
for it to qualify as a neural network.
42:59
Okay, so let's have a hidden layer and
43:01
we'll have a bunch of ReLUs as usual.
43:04
Okay, a bunch of ReLUs, and I'll ignore
43:07
all the arrows between them. It's kind
43:09
of a pain. And then we come to the
43:11
output layer. And what should the output
43:13
layer be?
43:15
How many nodes do we need in the
43:16
output layer? Three, right? Hip-hop,
43:19
rock, pop. And then that
43:22
layer is called what? What activation
43:23
function?
43:25
Softmax. Perfect. Love it. Love this
43:27
class. All right, three things. Uh,
43:30
rock, hip-hop,
43:33
and uh, pop, right? And this is a soft
43:36
max right there.
43:39
And then it's going to give us three
43:41
probabilities that add up to one because
43:44
it's a soft max. So that's our basic
43:46
network, right? Perfect. Yeah.
43:49
>> Why do you need those probabilities?
43:51
Again, if you just want to identify the
43:52
most likely genre, the soft max just
43:55
give you a way to kind of add them all
43:56
up once. Why do you need soft? Why don't
43:59
you just take the max value and say it's
44:01
that?
44:01
>> Oh, interesting question. Why can't we
44:03
just produce three numbers and grab the
44:05
maximum number? So, it turns out finding
44:09
the maximum of a bunch of numbers, that
44:11
function
44:12
is not very friendly for
44:14
differentiation.
44:16
And ultimately you want to take this
44:18
output, run it through a loss function
44:20
like cross entropy and then be able to
44:23
run back prop on it. And so
44:25
fundamentally back propagation is just
44:27
differentiation and it requires
44:29
everything inside of it to have well-
44:31
behaved gradients. And so this little
44:34
max function is actually not well
44:36
behaved, which is why we have a soft
44:39
version of it soft max which makes it
44:41
easy to differentiate. So I can tell you
44:44
more about it offline but that's sort of
44:45
the quick synopsis.
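For reference, here is a minimal sketch of the softmax function itself: it turns arbitrary scores into positive numbers that sum to one and, unlike a hard max, is smooth enough to differentiate for backpropagation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; exponentiate and normalize so
    # the outputs are positive and sum to 1 -- a smooth, differentiable "max".
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```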
44:49
So a lot of tricks you will see in the
44:50
neural network literature are ways to
44:52
avoid this problem where
44:55
the obvious choice of function
44:57
will not be well behaved for
44:59
differentiation. That's why you need to
45:00
go through all these other mechanisms
45:02
much like we couldn't just say accuracy.
45:05
Why don't you just maximize accuracy
45:06
instead of doing this cross entropy
45:07
business? Same reason.
45:10
All right. So let's come back here.
45:14
All right.
45:20
So that's what we created on the thing.
45:23
Right? Cats out of the mat vocabulary
45:27
thing and so on. And I you know I was
45:28
playing around with it uh earlier and so
45:31
I found that, you know, eight ReLU
45:33
neurons were pretty good to get the job
45:35
done. So I'm just going to go with eight
45:36
ReLU
45:37
neurons in the hidden layer.
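A minimal Keras sketch of the architecture just described: a multi-hot bag-of-words input of vocabulary size, one hidden layer of eight ReLU units, and a three-way softmax output. The variable names and the placeholder vocabulary size are assumptions, not the actual Colab code.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 50_000  # placeholder; in the Colab it would be the lyrics vocabulary size

model = keras.Sequential([
    keras.Input(shape=(VOCAB_SIZE,)),        # multi-hot bag-of-words vector
    layers.Dense(8, activation="relu"),      # the eight ReLU hidden units
    layers.Dense(3, activation="softmax"),   # hip-hop / rock / pop probabilities
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # targets are one-hot encoded genres
              metrics=["accuracy"])
model.summary()
```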
45:39
So I think that brings us to the Colab.
45:44
Yeah. So let's switch to the Colab.
45:47
All right. So um that's what we have
45:49
here. We you know there's a little bit
45:50
of verbiage here which just describes
45:52
what I just talked about. So we'll do
45:54
the usual things and upload everything
45:56
uh import everything we want. TensorFlow
45:58
and Keras and the holy trinity of
46:01
NumPy, pandas and Matplotlib. Uh, set
46:03
the random seed as usual at 42.
46:07
This is our STIE framework here. And the
46:09
nice thing is that all four of these
46:11
things, STIE, are beautifully implemented
46:14
in Keras as a single simple layer called
46:16
the TextVectorization layer. Okay, which
46:19
is nice. Um, so we have the TextVectorization layer
46:22
right here. And so in our first example,
46:25
what we'll do is we will use a default
46:26
standardization which will just remove
46:29
punctuation, convert to lowercase. We'll
46:31
use a default tokenization which just
46:33
means split on the space between words.
46:35
And then we will set the output to
46:37
multi-hot. Right? All the things we
46:39
talked about, Keras will just do it for
46:41
you automatically. And so output mode
46:43
is multi-hot, standardize, split on white
46:45
space, and boom, you run the text
46:47
vectorization thing. And once you do it,
46:49
Keras creates this TextVectorization layer
46:52
with these settings and it's now ready
46:53
to swing into action. So what does swing
46:56
into action actually mean? Well, now we
46:58
need to actually feed it a training
46:59
corpus so that it can do all the things
47:01
it's supposed to do and create the
47:02
vocabulary for you, right? So um so and
47:07
that thing is called the adapt method.
47:08
So we create a tiny training corpus for
47:11
us. This is our data set. Um right this
47:14
just a bunch of words from some of these
47:16
lyrics. And then what we'll do is we'll
47:18
take this layer that we just defined
47:19
here that we have set up here. And then
47:21
we will ask this layer to actually
47:24
create the vocabulary using this adapt
47:26
command. Okay. Index the vocabulary. And
47:29
it's done. And once it does it, you can
47:31
actually ask it for the vocabulary.
47:34
Okay, this is the vocabulary using the
47:36
get vocabulary command. And so first of
47:38
all, how long is the vocab? 17 words,
47:41
17 tokens. What are they?
47:45
And see here, and you can see these are
47:46
all the words, and you can see it has
47:48
stuck UNK in at the very beginning,
47:50
right? It's sort of the default. By the
47:52
way, uh just a little programming tip if
47:54
you're not familiar with if you don't
47:55
have a ton of programming experience. If
47:57
you want to, you know, print these
47:58
Python objects like list and all in a
48:00
pretty way, one trick that often works
48:02
is just stick it into a data frame
48:05
and then print it. Usually, it'll print
48:08
it in a much better way. So, you can see
48:09
it like that.
48:11
So, you can see here UNK, arrays, blah
48:13
blah blah blah blah. And you can see
48:15
integer zero assigned to the UNK token. By
48:17
the way, how come it picked the word
48:19
arrays as the second entry? Why not
48:22
something like an or um you know why
48:26
not? Why not a how come a is not chosen
48:29
as a second entry? Why why did it pick
48:32
arrays? You think
48:40
>> maybe maybe it tried like the words that
48:43
are most influential on the meaning of
48:45
the sentence to be on the
48:49
But it at this point it doesn't know
48:51
what we're going to use it for.
48:54
So it has no way to know what word is
48:56
useful because we haven't told it how
48:57
we're going to use it.
48:59
But but you're kind of on the right
49:01
track. So what Keras does is it'll
49:04
calculate it'll find all these tokens
49:06
and then it'll actually just sort them
49:07
by frequency.
49:09
So the most frequent as it turns out in
49:12
those four sentences we gave it happen
49:13
to be the word arrays. That's why arrays
49:15
is showing up on top. Um, and you can
49:17
actually confirm this by going to the
49:19
our little data set and you can see here
49:21
arrays shows up here and here,
49:23
twice and that's why it came up on top.
49:25
Okay. All right. So that's what we have
49:29
and now that we have populated
49:32
this we can run any sentence through it
49:34
easily. Yeah.
49:36
>> Does [clears throat] it matter that it's
49:37
on the top or is it just
49:39
>> it doesn't matter. It doesn't matter.
49:41
The reason why it's helpful later on is
49:43
because suppose you tell Keras, hey, don't
49:45
take every word you see here give me
49:48
only the most frequent 100 words I don't
49:50
want any more than that it can easily do
49:52
that that's the reason yeah
50:01
>> this is just a vocabulary so basically
50:03
you you give it all this phrases it
50:05
happens just four phrases in our example
50:07
and then it finds all the distinct words
50:09
and you know does all that stuff and and
50:10
then it has created a vocabulary. At
50:12
this point the training corpus you
50:14
fed it is forgotten and the only
50:17
thing that has survived this processing is
50:19
just the vocabulary. That's it. Now we
50:21
have to start applying it to any kind of
50:23
text we want to use it for.
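Here is a minimal sketch of this TextVectorization workflow: create the layer with multi-hot output, adapt it on a small corpus to build the vocabulary, inspect the vocabulary, and apply the layer to new text. The toy corpus is an assumption, not the Colab's actual lyrics data.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Defaults: standardize="lower_and_strip_punctuation", split="whitespace".
vectorize_layer = TextVectorization(output_mode="multi_hot")

# A tiny stand-in training corpus; the Colab adapts on song-lyric verses instead.
corpus = [
    "Write the song and rewrite the song",
    "Arrays of light over the water",
    "Write it down and sing it out",
    "The sun arrays the morning sky",
]
vectorize_layer.adapt(corpus)             # standardize, tokenize, and index the corpus

vocab = vectorize_layer.get_vocabulary()  # ['[UNK]', 'the', ...] sorted by frequency
print(len(vocab), vocab[:5])

# Apply the layer to new text: words never seen during adapt() map to [UNK] (index 0).
print(vectorize_layer(tf.constant(["still write the song"])))
```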
50:25
So here when you come back here u so
50:28
this is what we have and so what you can
50:30
do is you can take any sentence and you
50:32
can just run it through a layer and to
50:33
make sure that actually is doing the
50:35
right thing for you. So we'll take the
50:37
sentence, we will then run it through
50:39
the text vectorization layer by just
50:40
passing that sentence into it and then
50:42
we can just print it.
50:46
So now it's giving you a tensor. This is
50:47
a multi-hot encoded tensor with all these
50:50
ones and zeros. So note that this tensor
50:54
is 17 units long which is which is a
50:56
good check because our vocabulary is 17
50:58
long. So it had better match that. Uh, now
51:00
recall that the UNK token is at the
51:03
first location. It's at index zero and
51:05
it says that this encoded sentence does
51:08
have an unk word.
51:10
Okay. So
51:13
why is that? What is this UNK word?
51:15
Anyone can guess?
51:19
Well, it turns out to be the word still.
51:21
Um I think yeah still is not in our
51:24
vocabulary because the four sentences
51:26
which is our training corpus used to
51:28
build vocabulary. They had a lot of
51:30
write and rewrite but there was no still
51:32
in it anyway. That's why there's an
51:33
UNK for it. Uh, we can just double-check
51:35
that by asking Python: is it in the
51:38
vocabulary? Nope, it's not. Okay. Now,
51:40
in the spirit of making small changes to
51:41
the code to understand what's going on,
51:42
which is a very useful tip for folks who
51:45
don't have a ton of programming
51:46
knowledge. Let's say that you send the
51:48
phrase "Sloan Hodddle DMD". Uh, I
51:52
think you will agree with me that none
51:54
of these words is in the training
51:55
corpus, right? So what is
51:59
the multi-hot encoded vector for this
52:02
phrase "Sloan Hodddle DMD"?
52:07
three
52:11
It's not count encoding, it's multi-hot
52:13
encoding.
52:14
Right, it's going to be 1, 0, 0, ... So you can
52:17
see here, or in this case, remember the
52:19
vocabulary is 17 long,
52:21
right, so each of these words is going to
52:23
be a one followed by 16 zeros.
52:27
And then it's going to multi-hot encode
52:29
them, which means the three ones in the
52:30
[UNK] column just become a single one. So you
52:34
still have only this single 1. Okay. All
52:37
right. Good. So now
52:39
let's actually get to the data
52:41
set. We have these 90,000 songs. Uh, and
52:45
it's in this little thing here. Uh, we
52:47
have grabbed the data and cleaned it up.
52:49
Cleaned it up meaning formatting-
52:50
wise, not content-wise. Uh, and then we
52:53
stuck it in this data frame, and
52:55
we have already divided it into train, test
52:56
and validation for your benefit. So you
52:58
don't have to worry about it. So turns
53:00
out we have almost 49,000 songs in
53:03
the training set, 16,000 songs in the
53:05
validation set and roughly 22,000 in
53:08
the test set. Okay, a lot of songs. It's a
53:10
lot. It's a big data set. Um, so let's
53:13
just look at the first few.
53:15
So oh girl, I can't get ready. We met on
53:18
rainy evening. Paralysis through
53:20
analysis.
53:22
Okay, that I can relate to as a data
53:23
science person. But anyway, by
53:27
the way, these things are very
53:29
useful for exploration of any data
53:31
frames that you might have. It's a
53:33
Colab feature, just check it out. Um, so
53:36
anyway, those are the first
53:38
few rows. Let's look at the last few
53:40
rows.
53:43
Okay,
53:48
you never listen to me as pop. Beamer
53:51
Benz is hip-hop. Yeah, of course.
53:57
So, okay. Uh, now to go back to the
53:59
question of, okay, um, what could be a
54:01
good baseline model? We need to
54:02
understand the proportion of these three
54:04
classes of songs. So, we'll do a quick
54:07
check. Turns out rock is 55%. So, if you
54:10
had to just guess something just
54:12
naively, you would just guess everything
54:13
to be rock and you'd be right 55% of the
54:15
time. Uh, so now, by the way, the
54:18
target variable, which tells you
54:20
which of these three genres it
54:21
is, is actually a categorical
54:24
variable. So we need to one-hot encode
54:26
it, right. Um, so we'll just do that
54:29
using the pandas get_dummies
54:32
function. And when we do that, this is
54:34
y train, which contains the dependent
54:35
variable. And you can see that it is one-
54:37
hot encoded now: 0 1 0, 0 1 0, 0 1, and
54:40
so on and so forth. That's it. So I
54:42
think the first few are, I forget, rock,
54:44
hip-hop, rock, pop, or whatever. It's in
54:46
some order. We'll get to that
54:48
later. So it's one-hot encoded as well.
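(A hedged sketch of the get_dummies step; the labels here are made up, and the Colab's actual DataFrame and column names may differ.)

```python
import pandas as pd

# Hypothetical genre labels; in the Colab this would be the genre column
# of the train/validation/test DataFrames.
labels = pd.Series(["rock", "pop", "hip-hop", "rock"])
y = pd.get_dummies(labels, dtype=int)   # one column per genre, a single 1 per row
print(y)
#    hip-hop  pop  rock
# 0        0    0     1
# 1        0    1     0
# 2        1    0     0
# 3        0    0     1
```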
54:50
So that is as far as the data
54:52
downloading and setup is concerned. Any
54:54
questions?
54:55
>> Yeah.
54:57
>> Uh this kind of goes back to the
54:58
transfer learning concept. But do you
55:01
always want to build your corpus based
55:04
off of the vocabulary of your training
55:06
data or could you have like a
55:08
pre-compiled, like somebody's already made
55:10
like a list of the 50,000 words?
55:13
>> That's a really good question. Uh
55:15
unfortunately I'm going to punt on it
55:16
for the moment because um with modern
55:20
large language models a number of these
55:22
NLP tasks for which you had to sort of
55:25
roll your own and build your own thing
55:27
can now be very easily done using large
55:29
language models without even any further
55:31
training.
55:33
The price you pay for it is that you have to
55:34
use a large language model, which means
55:35
you have to pay somebody for an API call and
55:37
things like that, and there are other
55:38
issues with it. Uh, but
55:41
we'll talk a lot about transfer learning
55:43
for text when we come to it a little later
55:46
in the NLP sequence. So if I forget,
55:48
please bring it up again.
55:53
>> Yeah.
55:54
>> Um, a quick clarification on the encoded
55:58
vector. It shows up as floats, not ints.
56:00
If it gets incredibly long, wouldn't that
56:03
eat into compute time? Is there a reason
56:05
why it's floats?
56:06
>> Yeah. So the question is that when I
56:09
showed you that tensor, it is
56:11
actually written as a continuous
56:13
number, right, a floating point
56:14
number, but we know these are zeros
56:16
and ones, so why do we have
56:18
to waste compute capacity by telling the
56:20
computer that these are all big
56:21
continuous numbers when it's just a zero or a
56:23
one? There are ways to optimize that, but
56:25
these problems are so small we just
56:26
don't worry about it. But when we come to
56:28
something called parameter-efficient
56:30
fine-tuning, lecture maybe 10-ish, we
56:34
actually exploit that particular fact to
56:35
make things faster.
56:38
Okay, so that's what we have. Uh, so
56:41
we'll do the bag of words model.
56:43
Um, by the way, there's a whole bunch of
56:46
stuff here. It just repeats what I've
56:47
been telling you in the lecture. So feel
56:49
free to read it again, but we can ignore
56:50
it for the moment. And now there's a new
56:54
thing we are doing here. So we are
56:55
basically saying, look, instead of
56:58
taking every word you see in these
57:00
49,000 uh songs in the training corpus,
57:03
uh, that's going to be too many words;
57:05
just pick the 5,000 most frequent words,
57:09
and that's what this max tokens stands
57:11
for. Okay. And so we tell it, all
57:15
right, do this thing, max tokens 5,000,
57:18
sorry, not 50,000, 5,000, and still do
57:20
multi-hot, and we are not explicitly
57:22
setting the standardization and all that
57:24
stuff, because the defaults are what
57:25
we're going with. Okay. Yeah.
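(A sketch of how such a layer is likely defined, using the parameter values mentioned in the lecture; the variable name is mine.)

```python
from tensorflow.keras.layers import TextVectorization

text_vectorizer = TextVectorization(
    max_tokens=5000,          # keep only the 5,000 most frequent tokens
    output_mode="multi_hot",  # one 0/1 slot per vocabulary entry
    # standardize / split are left at their defaults
    # (lowercase, strip punctuation, split on whitespace)
)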
57:29
This is for making it more efficient.
57:30
Like, don't waste your time
57:32
on these thousands of rare words.
57:36
Just focus on the most used ones to make
57:39
it more efficient.
57:40
>> Make more efficient. But there is a
57:42
related and important point which is
57:44
that fundamentally the number of tokens
57:46
you allow this layer to have dictates
57:49
the size of your vocabulary and the size
57:51
of your vocabulary dictates the size of
57:53
the vector that you feed in. So shorter
57:56
vectors are better than longer vectors.
57:57
That's the efficiency point. The other
57:59
point is that the longer the input
58:00
vector, the more the number of
58:02
parameters the network has to learn
58:04
because the first layer itself is roughly the
58:06
size of the input times
58:08
the size of the hidden layer. So if this
58:10
thing becomes 10 times as long, you have
58:11
10 times as many parameters to learn, and
58:13
given a finite amount of data, right?
58:15
The more parameters you have, the worse
58:17
it's going to do when you actually start
58:18
using it in the real world. It's going
58:19
to overfit heavily. That's why you
58:21
need to be very careful.
58:24
Okay.
58:25
Yeah.
58:27
So, um, you downloaded the data set, but
58:29
are you still using the vocabulary the
58:31
17 words or did you
58:33
>> No, no, I'm that was just for fun. I'm
58:35
going to actually build a vocabulary
58:36
now. It's coming. Yeah, good question.
58:38
Yeah. So, all right, let's do that. Um,
58:41
so I first, you know, I defined this
58:43
layer. Uh, okay. I just defined it. All
58:46
right. Now we actually build the
58:47
vocabulary by essentially telling it to
58:49
adapt the layer using essentially the
58:53
full, basically 49,000, songs in
58:56
the training data set. Right, that's a
58:58
long list of songs. As far as Keras is
59:01
concerned, it's just looking for a list
59:02
of strings, so you just give it the list
59:04
of strings. Instead of four, we're giving
59:06
it 49,000; the same philosophy applies.
59:09
So we run it.
59:11
it's obviously going to take a few
59:12
seconds to do that because it's 49,000
59:15
songs
59:17
Five seconds. Uh, all right. Let's look
59:19
at the most common 20,
59:21
right? We get the vocabulary from our
59:23
layer. See, once you adapt the layer and
59:26
it has built a vocabulary, the layer has
59:27
sort of been populated with all this
59:29
information. So, you can query it. So,
59:31
you can get the vocab's top 20 words, the
59:34
most frequent ones, no surprise: [UNK], "i",
59:37
blah, blah, blah. Uh, let's look at the
59:39
last few.
59:41
Dagger cheddar
59:43
verified
59:46
moving on
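(Sketch of the adapt-and-inspect step; `train_df["lyrics"]` is an assumed name for the training lyrics column, not necessarily what the Colab uses.)

```python
# Adapt on the training lyrics (a list of strings), then inspect the vocabulary.
text_vectorizer.adapt(train_df["lyrics"].tolist())

vocab = text_vectorizer.get_vocabulary()
print(len(vocab))     # at most 5,000
print(vocab[:20])     # most frequent tokens
print(vocab[-5:])     # least frequent tokens that still made the cut
```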
59:48
Right, and then, once we have done
59:51
that, now we actually can vectorize all
59:52
the data sets we have using this, and by
59:55
vectorize I mean take every string and
59:57
create the multi-hot encoded vector from
59:59
it. Uh, yeah.
1:00:00
>> Are we doing standardization? Because we're keeping
1:00:02
stuff like "d", "a", etc. Yeah, we are not
1:00:05
strictly doing full standardization, or to put it
1:00:07
differently, standardization typically covers
1:00:09
lowercasing, stripping punctuation,
1:00:12
stemming, stop word removal; here the
1:00:14
default in Keras happens to not do
1:00:16
stemming and not do stop word removal, so
1:00:18
we're just going with the default. Thanks
1:00:20
for the clarification.
1:00:22
and in fact in practice what I find
1:00:23
these days is that don't even bother to
1:00:25
stem don't even bother to remove the
1:00:27
stop words it's going to work well
1:00:28
enough
1:00:31
Okay, so, all right. Okay, so now each
1:00:34
phrase is a vector. How long is this
1:00:36
vector? Each song is now a vector. How
1:00:38
long is that vector?
1:00:41
5,000. Correct. Because that is the size
1:00:43
of the vocabulary. Correct.
1:00:47
It's max tokens long, which is 5,000. So
1:00:49
if you actually look at X Oh, wait,
1:00:51
wait, wait, wait, wait. I haven't done
1:00:52
this thing yet.
1:00:57
It's going through 49,000. It's going
1:00:59
through another what? 23,000. Fine. So
1:01:02
let's run it.
1:01:04
Okay, now we can see X train, which is
1:01:06
all the training data you have, is a
1:01:09
tensor, a table with 48,991 rows, and
1:01:12
each row is a 5,000-long vector.
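(Sketch of vectorizing the three splits; again the DataFrame and column names are assumptions.)

```python
# Turn each split into a multi-hot matrix: one 5,000-long vector per song.
X_train = text_vectorizer(train_df["lyrics"].tolist())
X_val   = text_vectorizer(val_df["lyrics"].tolist())
X_test  = text_vectorizer(test_df["lyrics"].tolist())

print(X_train.shape)  # roughly (48991, 5000)
```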
1:01:18
All right, good. Now we will try the
1:01:20
simple neural network that we wrote up
1:01:23
in class. And at this point this
1:01:28
code should be sort of second nature,
1:01:31
right? Isn't that cool? It's so easy to
1:01:34
write the thing; the power of
1:01:36
abstraction. So we take keras.Input,
1:01:39
as usual, the input layer; we tell it what is
1:01:41
the size of each thing that's coming in.
1:01:42
Well, the size of each thing is a max-
1:01:44
tokens-long vector, 5,000. So we tell it the
1:01:46
shape is max tokens, and then we run it
1:01:48
through a dense layer with eight ReLUs.
1:01:51
Okay I'm hurrying.
1:01:54
So we get the outputs then we string the
1:01:56
inputs and the outputs into a model and
1:01:58
then we summarize the model. That's it.
1:01:59
So we go here and this has about 40,000
1:02:02
parameters, and you can see here, right,
1:02:04
when you go from the input, the 5,000 * 8
1:02:08
that gives you 40,000, plus the eight
1:02:10
neurons have a bias coming in, that's
1:02:11
another eight, so you get 40,008. Okay.
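(A sketch of the model as described, with the parameter arithmetic in the comments; exact variable names are mine.)

```python
from tensorflow import keras
from tensorflow.keras import layers

max_tokens = 5000

inputs = keras.Input(shape=(max_tokens,))           # one 5,000-long multi-hot vector per song
x = layers.Dense(8, activation="relu")(inputs)      # 5,000 * 8 weights + 8 biases = 40,008 params
outputs = layers.Dense(3, activation="softmax")(x)  # three genres
model = keras.Model(inputs, outputs)
model.summary()
```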
1:02:15
And we compile it as usual, we use Adam
1:02:17
as usual, and because now the output
1:02:20
y variable, the y train variable,
1:02:23
is itself actually one-hot encoded,
1:02:27
right, 0 1 0, 0 1 0, depending on pop, rock
1:02:29
and so on and so forth, we don't use
1:02:31
sparse categorical cross-entropy. We
1:02:33
just use plain old categorical cross-
1:02:35
entropy here. Okay. And this was
1:02:38
explained in lecture last week. So you
1:02:40
can revisit it if it's not
1:02:42
familiar. We again report accuracy,
1:02:44
right? So let's compile it. And we've
1:02:46
got a model. So we just run it for 10
1:02:48
epochs with a batch size of 32. And
1:02:50
because we have validation data already
1:02:52
supplied to us, we don't have to tell
1:02:53
Keras to take the training data and keep
1:02:55
20% of it aside for validation. We can
1:02:58
literally tell it what validation to
1:02:59
use. That's what we're doing here. Okay.
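(Sketch of the compile and fit calls being described; model, X_train, y_train, X_val, y_val come from the earlier sketches.)

```python
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",   # y_train is already one-hot encoded
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_val, y_val),    # use the pre-made validation split
)
```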
1:03:04
All right. So, it's running.
1:03:06
Um,
1:03:09
it's pretty fast.
1:03:16
Any questions so far?
1:03:18
>> Yes.
1:03:20
>> The microphone.
1:03:23
>> How do we decide the max tokens? Like, we
1:03:25
define the number as 5,000 here, but we
1:03:27
do not know how many words would be
1:03:29
there in the entire text.
1:03:29
>> Yeah. So it's a good question. How do
1:03:31
you decide on the maximum
1:03:32
vocabulary? What you typically do in
1:03:34
practice is that you do it
1:03:36
without the max tokens and then you see
1:03:38
how long the vocabulary is and then you
1:03:40
actually get statistics on how
1:03:41
frequently the very infrequent words
1:03:43
actually show up. And then you'll
1:03:45
typically see like a dramatic fall-off
1:03:47
at some point and you pick that fall-off
1:03:49
point and then set that to be the max.
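(One hedged way to get those frequency statistics with a plain Counter; this is an illustration, not necessarily the instructor's exact recipe, and the column name is assumed.)

```python
from collections import Counter

# Count raw token frequencies and look at how quickly the counts decay.
counts = Counter()
for lyric in train_df["lyrics"]:
    counts.update(lyric.lower().split())   # crude whitespace tokenization

freqs = sorted(counts.values(), reverse=True)
print(freqs[:10])        # counts of the most common words
print(freqs[4990:5010])  # counts around the 5,000 mark; pick max_tokens near the drop-off
```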
1:03:54
Uh all right. So perfect. Let's test it.
1:03:58
Uh accuracy is pretty good. 87% on the
1:04:01
training and 73 on the validation. We'll
1:04:05
do it on the test set. All right. 72%.
1:04:09
So we saw earlier the largest class
1:04:11
of the three is rock, with around
1:04:13
55%. So the naive model is going to get
1:04:15
about 55% accuracy, and this little neural
1:04:17
network model gets you 72%, which is
1:04:19
pretty nice.
1:04:22
kick it up a notch and make it slightly
1:04:23
more capable. So the key thing here is
1:04:26
that, as has been observed in
1:04:29
class already when you go with a bag of
1:04:31
words model we lose all notion of order
1:04:33
right the word order clearly matters and
1:04:35
we're kind of ignoring it. So what we do
1:04:38
to get around it is, well, actually this is a
1:04:40
really interesting sentence
1:04:42
here. Let's say this is a movie review:
1:04:44
"Kate Winslet's performance as a
1:04:46
detective trying to solve a terrible
1:04:48
crime in a small Pennsylvania town is
1:04:50
anything but disappointing."
1:04:52
Tricky thing, right? Because if
1:04:55
you look at the words separately, the
1:04:56
words "terrible" and "disappointing" look like
1:04:58
negative sentiment, right? But then if
1:05:01
you actually know that the word "terrible"
1:05:04
refers to the crime, not to the
1:05:06
movie, and that "anything but disappointing"
1:05:08
changes the meaning of the word
1:05:09
"disappointing", you will see obviously
1:05:10
it's a positive review, right? So
1:05:12
clearly the words around a
1:05:14
word provide valuable clues as to how to
1:05:17
interpret that word. And so what we do
1:05:20
is ask how we can make our little model a
1:05:23
bit more capable of recognizing the
1:05:25
context around every word. And the way
1:05:27
we do it is something called bigrams.
1:05:29
Okay. And for bigrams, what we
1:05:32
basically do is, instead of
1:05:34
just taking each word, we take
1:05:36
each word and we further take every pair
1:05:39
of adjacent words,
1:05:42
and those become our tokens. And because
1:05:44
we take two adjacent words, right, they are
1:05:47
called bigrams; you can take three adjacent
1:05:49
words, trigrams; you get the idea, n-
1:05:51
grams. Okay, so that's the idea of bigrams.
1:05:54
And so, for example, if you had "the
1:05:56
cat sat on the mat",
1:05:59
you will have "the cat", "cat sat", ... you
1:06:03
get the idea, right? Uh, that's what we
1:06:05
have. So let's do a little example, and
1:06:07
Keras makes it very easy: you literally
1:06:09
tell it ngrams equals 2,
1:06:12
bigrams. And from this you
1:06:15
immediately should know that ngrams
1:06:16
equals 1 is the default; that's why
1:06:19
we didn't have to specify it. Okay, so you
1:06:23
run it, and then you do
1:06:25
"the cat sat on the mat" as your training corpus,
1:06:27
and then you get the vocabulary, and you
1:06:29
can see here, right? It has created all
1:06:31
these nice bigrams for you.
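(Sketch of the bigram toy example: with an integer ngrams argument, the layer emits all n-grams up to that length, so ngrams=2 gives unigrams plus bigrams.)

```python
from tensorflow.keras.layers import TextVectorization

bigram_demo = TextVectorization(output_mode="multi_hot", ngrams=2)
bigram_demo.adapt(["the cat sat on the mat"])
print(bigram_demo.get_vocabulary())
# e.g. ['[UNK]', 'the', 'the cat', 'cat sat', 'sat on', 'on the', 'the mat', 'cat', 'sat', 'on', 'mat']
```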
1:06:34
So that's it. All right. Now, what we do is
1:06:35
we'll go back to the songs, and we
1:06:37
actually tell Keras to not just take
1:06:39
each word, but take all the bigrams as
1:06:41
well. And hopefully it'll do a better
1:06:43
job, right, of figuring out what the
1:06:45
genre is. And now, because, you know,
1:06:47
when you say,
1:06:49
okay, take the top 5,000 words, that's
1:06:51
great for single words, unigrams as they are
1:06:53
called. But when you have bigrams, you
1:06:56
have 5,000 possibilities for the first
1:06:57
word, maybe 5,000 for the second word,
1:06:59
right? That's a lot of possibilities: 25
1:07:01
million. Now, most of the 25 million
1:07:03
possibilities are not going to show up
1:07:04
in the data. So, you don't need to
1:07:05
actually make it that much larger, but you
1:07:07
should make the vocabulary a bit more
1:07:08
than 5,000. So, here we go with, say,
1:07:11
20,000, right? Otherwise, it's the same.
1:07:13
Still multi-hot. So, let's run it. And
1:07:16
now we will run this. Now that the layer
1:07:18
has been set up with all the right
1:07:20
settings, we'll ask it to create the
1:07:21
vocabulary. Okay? again by doing exactly
1:07:24
what we did before. Create the
1:07:25
vocabulary
1:07:30
seconds
1:07:42
If you go to trigrams and beyond, all of them will get much
1:07:44
more compute intensive; that's why
1:07:46
you're seeing this. So, all right, let's
1:07:48
look at the first 10 words. The first 10
1:07:51
words are all just single words, and
1:07:53
that's not surprising, because the single
1:07:54
words are going to be the most
1:07:55
frequent, right?
1:07:59
and then the last few
1:08:02
your mom your god you short you hell
1:08:09
All right, let's just, you know,
1:08:13
vectorize all the data we have, the
1:08:15
training, validation, and test sets, using this
1:08:17
vocabulary.
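(Sketch of the bigram vectorizer for the songs, reusing the assumed DataFrame names from the earlier sketches.)

```python
from tensorflow.keras.layers import TextVectorization

bigram_vectorizer = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
    ngrams=2,                 # unigrams and bigrams
)
bigram_vectorizer.adapt(train_df["lyrics"].tolist())

# Re-vectorize each split with the new layer.
X_train_2 = bigram_vectorizer(train_df["lyrics"].tolist())
X_val_2   = bigram_vectorizer(val_df["lyrics"].tolist())
X_test_2  = bigram_vectorizer(test_df["lyrics"].tolist())
```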
1:08:23
Perfect. Now we come to our second model
1:08:24
where we say the incoming
1:08:26
shape is now 20,000 long, right, because
1:08:28
we increased max tokens from 5,000 to
1:08:30
20,000. So each thing is a 20,000-long
1:08:32
vector; otherwise it's the same. And now
1:08:35
we will use this thing called dropout
1:08:37
for the first time, which is a
1:08:38
regularization thing that I have referred
1:08:41
to earlier, that I never really described,
1:08:43
and I will describe today if we have
1:08:45
time, but I'll first run through the
1:08:47
whole demo. So for now,
1:08:49
just think of dropout as just
1:08:52
another layer you can insert, and it's
1:08:52
essentially a great way to prevent
1:08:54
overfitting. So I just routinely will
1:08:56
use it and I'll talk more about it. So
1:08:58
for now you have this dropout layer in
1:09:00
the middle. It receives the input from
1:09:02
the dense layer and then sends it to the
1:09:04
output layer. The output layer is
1:09:05
unchanged. It's a three-way softmax.
1:09:07
Same model as before. Okay. And now,
1:09:10
all right, we'll come back to dropout.
1:09:11
So we'll compile it the same way as
1:09:13
before, and then I will
1:09:15
just fit it for three epochs. Um if
1:09:17
you're interested after class later on
1:09:19
you can actually try it for more epochs
1:09:20
and see if it does better. Uh for now in
1:09:22
the interest of time we'll just do it
1:09:23
for three
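(Sketch of the second model, with a Dropout layer between the hidden and output layers; the dropout rate of 0.5 is an assumption, the lecture does not state it.)

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(20000,))                # 20,000-long multi-hot bigram vector
x = layers.Dense(8, activation="relu")(inputs)
x = layers.Dropout(0.5)(x)                          # randomly zero activations during training
outputs = layers.Dense(3, activation="softmax")(x)  # same three-way softmax
model_2 = keras.Model(inputs, outputs)

model_2.compile(optimizer="adam",
                loss="categorical_crossentropy",
                metrics=["accuracy"])
model_2.fit(X_train_2, y_train, epochs=3, batch_size=32,
            validation_data=(X_val_2, y_val))
```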
1:09:29
right
1:09:36
I think 72%, right, was the single-
1:09:39
word, unigram, thing we had.
1:09:43
>> If you're rerunning this code with the
1:09:45
same number of epochs, do you ever expect the
1:09:47
accuracy to change?
1:09:49
>> Um if if you had to run this code in
1:09:51
your machine, you would expect it to be
1:09:53
roughly the same, but there are some
1:09:55
minute differences due to hardware and
1:09:57
device drivers.
1:09:58
>> If you rerun it on your own machine
1:09:59
twice, would you expect a change?
1:10:02
>> That's actually a very tricky question.
1:10:05
Uh because it depends on what else I
1:10:07
have been doing in that notebook.
1:10:09
If I start fresh and do nothing but
1:10:11
that, typically I get the same numbers
1:10:13
typically. But for some reason I don't
1:10:15
get exactly the same numbers.
1:10:19
Okay. So we come to this. Let's evaluate
1:10:22
our little model.
1:10:25
Okay. 75%. So it went from 72 to 75.
1:10:29
It's actually a meaningful jump just by
1:10:30
using bigrams. Okay. And I ran it only
1:10:32
for three epochs. If you run it for 10,
1:10:34
maybe it's going to do even better. All
1:10:36
right. So that is the beauty of this
1:10:38
thing. Now let's just actually do a
1:10:40
little demo. Uh we'll try to predict
1:10:42
some lyrics. Okay, I'll try another one.
1:10:45
Bites the dust.
1:10:49
It's a rock song. I think that's
1:10:50
correct. Yes. Okay. Okay, folks. Your
1:10:53
turn now.
1:10:55
Uh, somebody tell me your favorite song.
1:11:00
>> Dancing Queen from Aba.
1:11:03
>> I love ABBA. That's awesome. All right.
1:11:05
Okay.
1:11:07
Uh, Dancing Queen
1:11:11
lyrics.
1:11:17
Verse one, intro... I don't like that.
1:11:18
Let's just go to something without all
1:11:20
this metadata.
1:11:23
Right.
1:11:27
All right. I'll just take the first
1:11:28
page. Okay.
1:11:40
Are we good?
1:11:42
All right,
1:11:45
down model. Let's predict
1:11:50
pop just about. Yay.
1:11:55
All right. So, uh yeah. So, that's
1:11:58
basically the model, but we have five
1:12:00
minutes. I want to get back to dropout, but you can
1:12:01
play around and put your own lyrics in.
1:12:03
Uh typically what happens is that the
1:12:05
last two years that I've been doing this
1:12:07
particular lecture, I've noticed that
1:12:09
the songs are always rock songs for some
1:12:11
reason.
1:12:13
>> First time I'm getting a pop song,
1:12:14
from a group that I actually like.
1:12:16
So thank you.
1:12:18
Uh all right. Uh let's go back to
1:12:20
dropout.
1:12:22
So the idea here in dropout is that, you
1:12:24
know, you have all these layers; the input comes
1:12:26
in, it goes through a hidden layer, and
1:12:28
so on and so forth. What does dropout do? So
1:12:30
dropout is a layer, and you put this
1:12:33
layer in just like you use any other layer.
1:12:35
And what dropout does is it takes
1:12:37
all the things that are coming into it
1:12:38
from the previous layer and randomly
1:12:41
decides to replace each number with a
1:12:43
zero.
1:12:46
That's it. It drops that number and
1:12:48
replaces it with a zero. Okay? But it
1:12:50
does it randomly. It basically tosses a
1:12:52
coin, and if the coin comes up heads, zero.
1:12:54
If it comes up tails, let it through.
1:12:55
Pass it through. Okay? And the reason
1:12:58
why this is very effective is because
1:13:02
you can imagine all the neurons in a
1:13:04
particular layer when they overfit to a
1:13:07
particular data set the overfitting
1:13:09
happens because the neurons essentially
1:13:11
collude with each other right they sort
1:13:14
of collude with each other to actually
1:13:15
overfit and predict things in sort of
1:13:17
a very accurate way. So you want to
1:13:19
break any sort of collusion between the
1:13:21
neurons, right? I'm obviously using sort
1:13:24
of, you know, a game-theoretic way
1:13:26
of describing it, but the idea is that
1:13:28
any kind of spurious correlations in
1:13:30
your data, the neurons can pick up by
1:13:33
being correlated themselves.
1:13:36
And so the way you avoid the spurious
1:13:38
correlation is by dropping neurons
1:13:40
randomly. You just kill the neuron
1:13:42
randomly which means that no neuron can
1:13:44
depend on another neuron being
1:13:45
available.
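(A tiny illustration of the mechanism: at training time a Keras Dropout layer zeroes each incoming value with probability `rate` and rescales the survivors so the expected sum stays the same; at inference time it passes values through unchanged.)

```python
import tensorflow as tf

x = tf.ones((1, 8))
drop = tf.keras.layers.Dropout(rate=0.5)
print(drop(x, training=True))   # roughly half the entries are 0, the rest scaled up
print(drop(x, training=False))  # at inference time the layer is a pass-through
```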
1:13:47
I know it's a bit grim but that's the
1:13:50
basic idea of dropout. And apparently the
1:13:52
story goes that the person on
1:13:54
the team that invented it, Geoff Hinton,
1:13:56
who won the Turing Award for this stuff, not
1:13:58
for dropout, just for deep
1:13:59
learning, um, he said, I don't know if it's
1:14:02
true but he said that apparently he got
1:14:03
the idea when he went to a bank and
1:14:05
realized that, you know, very often the
1:14:07
folks who were working in that bank
1:14:09
branch that he used to go to kept
1:14:11
changing,
1:14:13
right they were never sort of the same
1:14:14
the people would be transferring in
1:14:16
transferring out and he was like why Why
1:14:17
can't they just leave these people
1:14:18
alone? Why does it keep changing? And
1:14:19
then he got the insight that maybe a lot
1:14:21
of fraud happens because the person
1:14:24
working in the branch colludes with the
1:14:26
customer, but by changing the staff
1:14:28
constantly, you break the the risk of
1:14:30
fraud happening. And that apparently was
1:14:32
the genesis for this idea. True,
1:14:34
apocryphal? I have no idea. But it's
1:14:36
sort of a fun story. Uh yes,
1:14:40
>> Instead of random, if we go to the way
1:14:43
historical models are built, concepts of
1:14:45
multicollinearity and all of that, would that
1:14:47
make it sharper as compared to this?
1:14:50
>> The problem is that these networks
1:14:53
are massive, right? And for you to take
1:14:56
each layer and look at its correlation
1:14:58
with some other layer, and so on and so
1:14:59
forth. First of all, investigating
1:15:01
multicollinearity is a problem. The
1:15:04
second thing is, okay, what do you do
1:15:05
then? In linear regression you
1:15:08
can do things like principal components
1:15:09
analysis to get around it. Here
1:15:11
everything is nonlinear. There is no
1:15:12
easy way to solve the problem. So we're
1:15:14
like, we'll just solve the problem in one
1:15:16
shot using dropout. That's it. All right. Um,
1:15:20
so I had some material on
1:15:23
something called byte pair encoding,
1:15:25
which I will um which I will do when we
1:15:28
get to LLMs and I stuck it in the end
1:15:30
because I knew that we probably won't
1:15:31
have enough time to cover this anyway.
1:15:33
And that is a very clever tokenization
1:15:35
scheme used by for example the GPT
1:15:37
family and that allows them to do
1:15:40
beautiful punctuation, keep the case
1:15:41
intact and then use words that you just
1:15:43
made up and things like that. Okay. So
1:15:45
we have two one more minute. I'm happy
1:15:47
to answer any questions you might have.
1:15:50
>> And so initially, when we are picking
1:15:52
like the hidden layer, the number of
1:15:54
neurons and width: so far in all the
1:15:57
materials this has been given to us,
1:15:59
but initially how do you pick it? Is it
1:16:01
more of a trial and error type of thing,
1:16:03
or
1:16:03
>> It tends to be trial and error. Um, so
1:16:05
that's in fact what I did when I created
1:16:07
the Colabs. So, and you can
1:16:10
actually make it a bit more systematic
1:16:12
by trying lots of different values and
1:16:14
there is a particular Python
1:16:16
package called KerasTuner. So just
1:16:18
Google KerasTuner, and it comes with
1:16:20
very nice Colabs, and if I have a chance
1:16:22
maybe I'll just record a screen
1:16:23
walkthrough of doing that. But that's
1:16:25
that's a very efficient way to do these
1:16:27
things. And it comes under the broad
1:16:28
category of something called
1:16:29
hyperparameter optimization where the
1:16:31
number of neurons, the activation you
1:16:33
use, the learning rate, all those things
1:16:35
can all be tried. You can try lots of
1:16:36
variations, and KerasTuner is a great way to do
1:16:39
it in the context of Keras.
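(A hedged KerasTuner sketch searching over the number of hidden neurons and the learning rate; the search space and variable names are illustrative, not from the lecture, and it reuses the bigram data from the earlier sketches.)

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential([
        keras.Input(shape=(20000,)),
        layers.Dense(hp.Int("units", min_value=8, max_value=64, step=8),
                     activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
tuner.search(X_train_2, y_train, epochs=3, validation_data=(X_val_2, y_val))
best_hp = tuner.get_best_hyperparameters(1)[0]
print(best_hp.values)   # e.g. best number of units and learning rate found
```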
1:16:42
Other questions?
1:16:45
>> All right, I give you 30 seconds back.
1:16:47
Thank you. See you tomorrow.
— end of transcript —