5: Deep Learning for Natural Language – The Basics
MIT OpenCourseWare · May 11, 2026
Transcript
0:16
Okay. So today we start the natural language processing sequence, and just to give you a quick idea, we're going to start with what's called vectorization, and then the bag-of-words model, and then we'll spend a fair amount of time on a Colab. Then on Wednesday we talk about these things called embeddings, which you'll come to appreciate over the next couple of weeks form the core atomic unit of all modern natural language processing, and for that matter vision processing as well. The following week we'll do transformers, two lectures on transformers: we'll get into the theory and then we'll get into a bunch of applications. And then lectures nine and ten will be all about LLMs. So it's going to be a lot of fun. This is one of my favorite segments of the class; of course, truth be told, every segment of the class is my favorite, so don't judge me. All right, so let's get going.
1:10
So why natural language processing? The things I have on the slide here are sort of obvious, but I think it's actually worth reminding ourselves of how important text is for everything we do. Obviously human knowledge is mostly encoded as text. The internet is mostly text; at least this was true till the advent of TikTok and YouTube. Human communication is mostly text, and cultural production, you know, movies, books, the arts and so on, so much of it is so text-heavy. So in some sense text forms not just a big chunk of all the media that's out there, but it also happens to be the way in which we think and communicate. So its primacy is, in my opinion, unparalleled in how we think about the world. And so the tantalizing possibility is: imagine if we had an AI system which could just read and, quote unquote, understand all this text. You can imagine such a system reading all of PubMed, reading all the medical literature, and then coming back and saying, you know, for this particular disease, this particular protein is actually the malfunctioning protein, and that small molecule is going to dock into the protein and cure the disease, and you didn't know this. It came back and told you that. Wouldn't it be unbelievable? My feeling is that such things are going to happen. It's just that it's not going to happen soon enough for my lifetime, but perhaps it'll happen in yours.
2:38
All right. Okay, so let's continue. So NLP is in action all around us. According to Google, apparently Google autocomplete, which uses a fair bit of NLP, saves 200 years of typing time every day. I actually wasn't very impressed with this number, frankly, because billions of searches are being done every day, and I'm like, only 200 years? But I think the more important point is that it made mobile possible. If you didn't have autocomplete, people would not be, you know, typing and pecking on their keyboards; it would be much worse. It would have had a hugely dampening effect on e-commerce, for instance. So this humble little autocomplete has incredible impact on the world economy. And the other thing which I heard about, I'm not sure if it's 100% true, but it's an interesting example: apparently the very first iPhone keyboard that came out, the soft keyboard, not the hard keyboard, had some very basic word-continuation prediction going on. So when you start typing T and H, obviously it's going to guess that E is going to come next, right? That part is old news, nothing new there. But apparently the E key on the keyboard would become slightly bigger, so when your finger goes towards it, it has a better shot of actually connecting with it. Right? So these kinds of things are used to change the UI in real time in a whole bunch of applications, and you just don't even realize it.
4:06
All right. And of course we all know about LLMs at this point. So I asked one to write a limerick about the beauty and power of deep learning yesterday, and it said: in a world where data flows like a stream, deep learning is more than a dream; it sifts through the noise with an elegant poise, unveiling insights that gleam. Cool, right?
4:25
All right, so let's get back to work. NLP has extraordinary potential for making products and services much, much smarter. And what I want to point out here is that even if you focus on this very simple formalism, a bunch of text comes in, a bunch of text goes out, that's it; if you take that very simple text-in, text-out formalism, this humble little thing has an enormous range of applicability. Right? So obviously you can send a bunch of text in and ask it to classify it: for sentiment, say; route it for customer support; try to figure out the intent of what the person is asking in search; you can filter it, content-filter to make sure there's no toxic, abusive stuff going on. I mean, the possibilities for just text classification are numerous. Okay, but that's a use case we're all kind of familiar with.
5:14
So no surprise there. Now, text extraction we may be less familiar with. The idea is that you can actually look at a lot of unstructured textual data and extract all sorts of interesting entities from it. Right? Hedge funds use it very heavily; they will extract all sorts of company information from news articles. And then obviously doctors' notes: there are a whole bunch of NLP startups that will take the doctor-patient conversation, transcribe it, and then extract disease codes, diagnosis codes, medication codes, and things like that. So the possibilities for this are enormous. Of course, text summarization, and we've all been doing it thanks to ChatGPT: take text in, and any kind of summary that comes out of the text is just text out. And then text generation, of course: we can take text and do marketing copy, sales emails, market summaries, and so on, including, troublingly for educators, college application essays.
6:06
Code generation is a more subtle example of text out, because code is just text. So text-in, text-out also covers text in, code out. Okay. And question answering: you can take a bunch of text, a whole bunch of documents, and you can add a bit of text to it which is your question, and this whole thing at the end of the day is just text in; you can use it to answer questions and therefore create chatbots for all sorts of interesting applications.
6:39
And if you look at this example, call centers, that is where a lot of money is being spent right now: building these call-center chatbots for text-in, text-out question answering. If you drill into this, imagine taking all the call-center transcripts and the internal product documentation, service documentation, FAQs, etc., stick it in, and you can start to answer these kinds of questions. Yesterday, what were the top reasons why customers were upset with us? What interventions made by the agent actually worked, and what did not work? What characterizes the best agents from the rest? How should we grade this particular agent's interaction with this particular customer? How should we change the call-center script? How should we coach the agent in real time? Every one of these applications is amenable to this very humble text-in, text-out model.
7:28
Okay. And of course, everybody now knows this potential because of the advent of large language models. By the way, Google released something called Google Gemini 1.5 Pro a couple of days ago, and it's incredible. It's incredible, right? Anyway, we'll get back to that later. But the point is that the kind of potential we have is just amazing, even for text-in, text-out. Okay. And, as you would imagine...
8:02
it language this is all primarily
8:04
English right
8:05
>> Now, there are lots of multilingual models as well. By that I mean models which are specialized to other, non-English languages, and models which are truly multilingual, like polyglot models, and both of them are available right now. Many modern LLMs are actually trained from the get-go to be multilingual in a bunch of what are called high-resource languages: languages which are spoken by lots of people. But actually it's funny you should ask that question, because of this Google Gemini model I just described. There is a language called Kalamang which is spoken by about 200 people in the world, and a researcher had created one book which is sort of like a grammar manual for Kalamang, because there are no other written works in that language. And so what they did is they took a whole bunch of English dialogue and this book, fed it into Google Gemini 1.5 Pro, and it translated into Kalamang at human-level proficiency. It had never seen it before. So that's an example of this.
9:15
Yes. So the question is: the text here is all the things you want to translate from English to Kalamang; the documents here is just one document, singular, the grammar book, the manual; and then what comes out is a translation. So these models, even when they're not explicitly trained on a different language, if you give them enough grammar manuals and stuff like that, they may do a pretty decent job from the get-go with no training.
9:40
It's kind of a shocker. Two years ago people would have said that's impossible. All right. So, back to this.
9:50
All right. And as you folks may already know, and maybe you're in fact participating in this gold rush already, lots of people are creating lots of really cool companies to take some of these ideas and actually create really interesting products and services out of them. So if you're not doing it and you've been thinking about entrepreneurial stuff, here's a word of advice: take the plunge.
10:15
Dismissed. Just kidding. All right. And as you can imagine, enterprise vendors are rushing to add NLP to all their products. Salesforce Einstein now has Einstein GPT. Microsoft has Copilot. I mean, the list goes on. Everybody's scrambling and really trying hard to infuse some GPT magic into whatever they're doing. Okay, some of it is real, a lot of it is not.
10:39
Okay. So let's go to the arc of NLP progress. How did we get to these kinds of crazy times that we live in? If you look at natural language processing, basically efforts to take language, analyze it, and do predictions with it, the first phase of it was just handcrafted rules based on linguistics. These were all linguists who would really understand the grammar of a language, and they would use a deep knowledge of linguistics to figure out all these rules by which you can process and analyze natural language text. And then this other thing came along, a statistical machine learning approach, which basically said: never mind all that complicated knowledge of linguistics and grammar. Why don't we simply count things? Let's count the number of times these two words co-occur; let's count that; let's count this; basically, just count a lot, and let's see if it does well for predicting things, say for classifying text and so on. And shockingly, those methods ended up being really good. They ended up being really good, and in fact they were actually better than the lovingly hand-curated, linguistically driven rules. So much so that there is a famous quote which says, "Every time I fire a linguist, the performance of the speech recognizer goes up." Obviously said in jest, but there is a kernel of truth to it.
11:59
So that's where we were, and then deep learning happened, in roughly 2012, and then we had these things called recurrent neural networks, which are based on deep learning, which actually moved the ball forward. And then in 2017 something called the transformer was invented, and the transformer replaced everything else across the board. So we're just going to leapfrog directly to transformers in this course; we will not spend any time on recurrent neural networks. That is not to say that they are dead. There's some very interesting work which is now trying to revive recurrent neural networks to make them work for these kinds of modern LLM tasks, but it's still very early days. Okay. So for now we'll just focus on transformers.
12:46
Okay. So the very high-level view of the problem here is that, like most things in deep learning, it's basically fancy regression. There is some variable X that comes in; it goes through this very complicated function along with this W, which is the weights, and then out pops an output. Right? That's just the view that you've always had. And in this case X happens to be text. Y can be text; it could be labels; it could be numbers; it could be anything else. The W is the weights, and the function is a deep neural network. By this point, when you look at this slide, it should be blindingly obvious.
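To put that picture in symbols (my own notation for what the slide describes, not necessarily the slide's):

```latex
\hat{y} = f(x;\ W)
```

where $x$ is the numericalized input text, $W$ is the weights, $f$ is a deep neural network, and $\hat{y}$ is the output, which may be text, labels, or numbers.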
13:23
So now the key question here is: how do you actually represent X? That's the key question. For pictures, for images, we saw that we just took the pixel values, which were light-intensity numbers between 0 and 255, and you could use those directly. But when a sentence comes in, like "I love deep learning", what do you do? How do you actually represent it? Because remember, we have to numericalize everything that's coming in. So that's a key question, and this is actually a very subtle, very important question, and we'll focus on it today. And then next week, when we look at transformers, we will look at what neural network architecture is best suited to process this sort of text input that's coming in. Those are the two big questions we're going to look at.
14:11
All right, so processing basics. We're going to follow this very standard process. This is the process by which we take any text that comes in and run it through these four steps, and this process is called text vectorization. As the name suggests, we are essentially taking text and creating vectors of numbers out of it. We'll go through each of these steps one after the other. I just find it very useful to have this acronym, STIE, in my head: standardize, tokenize, index, encode. Just keep that in mind; it may be helpful. All right. So the setup here is that we have a whole bunch of documents; we call it the training corpus. We have a whole bunch of text documents, text data, and as far as we're concerned, you can just imagine it as lists of long passages. What is a novel? It's just a long passage of text. So whether it's a novel or a sentence doesn't really matter; we just think of them as a big list of strings, a big list of text. Okay, that's the training corpus. And what we do is we take this training corpus and apply standardization and tokenization, which I will describe, to the entire training corpus up front.
15:26
Okay. So we first do this. And standardization, the default for most applications, tends to be this: we first strip capitalization and make everything lowercase, and then we remove punctuation and accents and so on. Okay, that's the first thing we do; I'll talk about why we do it in just a moment, but mechanically we do this first. Then we look at words like "a", "the", "it", and so on: basically filler words, which we need to make complete sentences, but which may not have any value for predicting things. So we remove them; they are called stop words. And then finally we take words which are very similar, which have the same kind of stem or root, and we just map them to a common representation: "ate", "eaten", "eating" all just become, let's say, "eat". We do that sometimes. So the first we almost always do, the second we often do, and the third we do sometimes. Okay. Now, why do we do any of these things?
16:34
>> I think we want to try to recognize the essential thing within the word, right? Whether it's "eaten" or "eat", the essential thing is the "eat". So we want to try to abstract from it the more essential thing.
>> Right. So why do we need to abstract? You're absolutely correct, we're trying to abstract. Why is there a benefit to doing this abstraction? How about somebody from this side of the room? Oh, yes.
>> So I want to reduce the library.
>> Why is it a good idea to reduce the library, the size of the library?
>> Because of the amount of computation needed.
>> So that is part of the answer. There's another part to the answer. All right, let's swing to the right.
>> Is it that it facilitates comparison between different sets of standards?
17:37
the the key thing we want to uh the key
17:39
thing to realize here is that you want
17:42
the model much like when you go when we
17:44
talk about computer vision we said look
17:46
if it's vertical line. I want to be able
17:48
to detect it wherever it happens. I
17:51
don't want the model to think that the
17:52
vertical line on the left side is
17:54
different from the vertical line on the
17:55
right side and then later realize they
17:57
are the same thing because you would
17:58
have wasted valuable capacity learning
18:00
things which actually happen to be the
18:02
same because you didn't know it was the
18:03
same. So here if you for example take a
18:06
word and lowerase it clearly the case of
18:09
it whether it's uppercase or lower case
18:11
most of the time it's not going to
18:12
matter for anything you want to predict.
18:14
So you're essentially telling the model
18:16
you know the lowerase version uppercase
18:18
version they are not different they're
18:19
actually the same and the easiest way to
18:21
tell the model they are the same is just
18:23
make everything lower case so that is
18:25
the key idea okay and similarly if you
18:29
look at stop words the reason is that
18:31
these stop words may not help you
18:32
predict anything whether a word uh and
18:34
the showed up in a movie review probably
18:36
does not affect the sentiment of the
18:38
review and therefore let's remove it so
18:40
that's a slightly different reason
18:42
stemming is the same reason as the first
18:44
which is that all these words kind of
18:46
mean the same thing. We don't have to be
18:48
super precise about it and so let's just
18:50
like collapse them onto the same thing.
18:51
Now that these are all the standard
18:54
things we do there are totally notice
18:57
you know important exceptions to all
18:58
these things. Okay we'll come back to
19:00
the exceptions a bit later but that is
19:02
the standard thing we do make sense. All
19:05
right.
19:08
So if you look at something like this, this sentence here: "¡Hola! What do you picture when you think of travel? Mexico!" Boom, and then you can see here the standardized version: everything has become lowercase, the H has become a small h, the punctuation has disappeared; that's part of standardization. And then you can see that the M in Mexico has become small, "sipping" has become "sips", "things" has become "thing", and so on and so forth. So that's an example of standardization at work. Okay.
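As a rough illustration of what this standardization step might look like in code (a minimal sketch, not the lecture's actual implementation; the stop-word list and the stemming rule here are toy assumptions, and accent removal is omitted for brevity):

```python
import re

STOP_WORDS = {"a", "the", "it", "and", "of", "is"}  # toy stop-word list (assumption)

def standardize(text):
    # Almost always: lowercase and strip punctuation
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    # Often: remove stop words
    words = [w for w in text.split() if w not in STOP_WORDS]
    # Sometimes: very crude stemming, e.g. "eating"/"eaten" -> "eat"
    words = [re.sub(r"(ing|ed|en)$", "", w) for w in words]
    return " ".join(words)

print(standardize("The cat ATE, and the cat is eating!"))
# -> "cat ate cat eat"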
19:49
The next thing we do is something very important, and it's called tokenization. So now we have standardized everything; we have a bunch of words. We need to now split them into what are called tokens. The most common default is to just think of a word as a token: we just split on the whitespace. You take each string, and wherever there is whitespace, meaning actual spaces, carriage returns, and things like that, boom, you just split on it and you create words out of it. So for instance, if you have this standardized sentence here, you just split it after every word and you get this thing. Okay? So each of these is now a token.
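In code, this default tokenization is literally just a whitespace split; a minimal illustration:

```python
sentence = "hola what do you picture when you think of travel mexico"
tokens = sentence.split()  # splits on spaces, tabs, and newlines
print(tokens)
# ['hola', 'what', 'do', 'you', 'picture', 'when', 'you', 'think', 'of', 'travel', 'mexico']
```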
20:32
Now, this has some disadvantages. What are some disadvantages of just splitting on the space between words? Yeah?
>> I think we lose the context, because we look at each word separately, so we don't know what word came before or what happens next.
>> Right. So for example, "the cat sat on the mat" and "the mat sat on the cat" will have the same set of tokens, right? So you lose the order. What are some other issues with it?
>> For words that should go together, you lose the fact that it's one name, because you separated them.
>> Right, exactly. So there are compound words, like "father-in-law" for instance; that's one problem. Another problem is that lots of non-English languages actually don't have this notion of a space between words: the text actually runs one word after the other, and the native speakers know from context how to chunk it and break it. So what do we do then, right?
21:34
Because you'd basically have one token for the whole passage. The other problem is that there are languages, German perhaps the most notable one, in which you have very long words. I saw a word, which I think I might have on the site somewhere, about this long, which means the feeling when you realize that something amazing is happening but the rest of the world hasn't woken up to it yet. There's a word for that. Amazing, right? Or Japanese, for example: there's a word, komorebi. Do people know the meaning of the word komorebi? It means the transient beauty of sunlight going through fall foliage. There's a word for that. How cool is that? Anyway, sorry, I love that word. So, back to this. We have this thing here. So there are all these reasons for which splitting on the space between words is not going to work.
22:38
So what happens is that the scheme we have described so far, despite its shortcomings, is actually really good for lots of NLP use cases. If you want to classify text, it's good enough, for instance. But if you want to generate text, like LLMs do, it's not going to work. It's not going to work because, you know, when you ask ChatGPT a question, it comes back with perfect punctuation; clearly punctuation was not stripped. It comes back with particular upper and lower case; clearly that wasn't stripped either. You can actually make up new words and ask it to use the new word, and it'll use it; therefore it's not like it can only recognize a finite set. So there's a very clever scheme called byte pair encoding which was invented to do all those things. I have slides at the end, and if we have time we'll talk about it.
23:28
All right, for now let's continue. So when this is done for every sentence, every passage in our training data set, we now have a list of distinct tokens. In this simple case, it happens to be all the distinct words that we have seen. That's called the vocabulary. That's called the vocabulary. So now we move to the third and fourth stages, the indexing and encoding stages; in these stages we only work with the vocabulary. Okay.
23:58
And so the first thing, the indexing: we assign a unique integer to each distinct token in the vocabulary. So for instance, let's say that you took a whole bunch of English literature as your training corpus and ran it through; you'll basically come up with an English dictionary, right? It'll have a whole bunch of words, maybe starting with "a" all the way to "zebra". I'm just putting 50,000 here because it turns out the GPT family uses a vocabulary of about 50,000 tokens, so I'm just using 50,000; it's not the actual number of words in the English language, which is much more than that. So let's say that we give them the numbers one through 50,000. And then we also introduce a special token called UNK, which stands for unknown; we'll come back to this later. And we give unknown the integer zero. Okay, so this is what we mean by indexing: take the tokens you have identified and just map each one to an integer. Okay, that's the indexing step.
24:57
Then what we do is we assign a vector to every one of these integers. Okay? And that is the encoding step: we assign a vector to each integer. So you have a bunch of distinct words; on each word we put an integer, and then we take that integer and map it to a vector. Yeah?
>> Can you please explain what unknown means?
>> Yeah, so I'll come back to that. For now, just assume that we have a token called unknown; the way we're going to use it will become apparent in a few minutes.
>> Does it mean there's a base to it, though? There's like a letter or something?
>> It's a placeholder for something else, which I'll describe shortly.
25:38
Okay. So that's what we have. So let's say that we want to assign a vector to each integer in our vocabulary, and let's say we have 50,000 possible integers because we have 50,000 possible words, and we want to assign vectors so that if you take the vectors of two different words, they look different. Clearly that's the whole point of mapping from integer to vector: they had better be different. What is the simplest way to come up with a vector for each of these tokens?
>> The same as the index.
>> Sorry?
>> The same as the index. It's just a vector, one by one, with the index.
>> So, a vector of zeros and ones, or...
>> It's just a vector with one dimension.
>> Oh, I see. Well, it's creative, but it's a little cheating, right? Because you're essentially putting a square bracket around the number and saying it's a vector. Good try.
>> You can try one-hot encoding.
>> Right, you can try one-hot encoding. So remember, the list of distinct tokens you have, you can just think of them as the distinct levels of a categorical variable, right? And you can just use one-hot encoding for it.
27:04
So the simplest thing you can do is one-hot encoding, and the way it's going to work is that if you have, let's say, 50,000 possible values, the vector is going to be 50,000 long; it's going to have zeros everywhere except at the index value of whatever that token is. So for instance, since we said UNK is going to be number zero, it has a one in the zero index position and zeros everywhere else. "a" happens to be the second one, so it has a one in the second position and zeros elsewhere. You get the idea. So that's one-hot encoding. And the dimension of this encoding vector, how long it is, is basically the number of distinct tokens that you have seen in the training corpus, plus one for this UNK thing that we'll get to. Okay? So that is the dimension of the encoding vector, and it's called the vocabulary size. It's called the vocabulary size.
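A minimal NumPy sketch of the indexing and encoding steps just described (the tiny vocabulary and variable names are my own illustration; a real vocabulary would be far larger):

```python
import numpy as np

vocab = ["[UNK]", "a", "cat", "mat", "on", "sat", "the"]  # toy vocabulary
index = {token: i for i, token in enumerate(vocab)}       # indexing: token -> integer

def one_hot(token):
    # Encoding: all zeros except a single one at the token's index;
    # tokens not in the vocabulary fall back to [UNK] at index 0.
    vec = np.zeros(len(vocab))
    vec[index.get(token, 0)] = 1
    return vec

print(one_hot("cat"))     # [0. 0. 1. 0. 0. 0. 0.]
print(one_hot("hamlet"))  # [1. 0. 0. 0. 0. 0. 0.]  -> mapped to [UNK]
```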
28:13
All right. So at this point we have created a vocabulary for the training corpus, every distinct token in the vocabulary has been assigned a one-hot vector, and we are done with basic pre-processing. Okay? So for all the text that has come in, every token has been mapped to some, potentially very long, one-hot vector. Any questions on the mechanics of this before we continue on?
28:45
Now let's see: when you get a new input sentence, a new sentence freshly arriving, and we want to feed it into a deep neural network, how will this process actually apply to the new sentence that's coming in? Okay, so let's assume that we have completed our STIE on the training corpus, and it turns out we found only 99 distinct tokens, 99 distinct words, and then we add this UNK thing to it, so we've got 100. Okay, so this is our vocabulary: it starts with UNK and "a" and then goes all the way to "zebra", but there are only 100 of them in total. And just to be very clear, we didn't bother to do things like stemming and stop-word removal, which is why you have words like "the" showing up in this list. Okay. All right. So let's say this input string arrives, "the cats are on the mat", and then we run it through STIE. So "the cats are on the mat" goes through this pipeline. Then the output is going to be a table with a bunch of rows and a bunch of columns. Any guesses how many rows and how many columns?
30:02
Just raise your hands; I'll call on you.
>> Yeah, I would guess 100 rows and six columns.
>> All right, we'll take a look. 100 and six, as well as six and 100, are both correct. The way I've done it is six and 100, and that's exactly right. So the idea is that this is your vocabulary, right? The phrase "the cat sat on the mat", once you change the case of it, becomes like this. "the" happens to be a one-hot vector with a one where the "the" is and zeros everywhere else; I'm not showing all the zeros because it would get too cluttered. Similarly, "cat" has a one where the "cat" position is and zeros everywhere else, and so on and so forth. Does that make sense? So the phrase "the cat sat on the mat" came in as just six words, and then it became this 600-entry table.
31:12
Okay. Now, what is the best way to feed this table to a deep neural network? What can we do? It's not a vector; it's a table. If it were a vector, we'd know what to do: we'd just feed it in, maybe send it to some hidden layer and declare victory at that point.
>> Yeah.
>> You would like to flatten it. And how might you do it? Flattening is a reasonable answer, by the way.
>> I think you mean you just have to take each column, take the first one, each row, each word, kind of like...
>> Yeah, so basically you can take the first column, then take the second column and attach it under the first column, and so on and so forth, right? We can certainly do that, and that's very akin to how we work with images. But there is one downside to that. What is that downside?
>> It's pretty long. I wonder if instead you could, for the first word, use one, for the second word, two, and then you maintain the order but you still keep it as just one row.
>> One row. So one issue, and we'll come back to what we do about this, but what you're pointing out is it could be very long, right? Because if each word is a 50,000-long one-hot vector, with just six words it becomes a 300,000-long vector. Imagine taking that 300,000-long vector and sending it into a 100-hidden-unit hidden layer: 300,000 times 100 parameters. Too much; it can't learn anything. So that's one issue. The other issue is that different-length texts that come in will have different-sized inputs. So here "the cat sat on the mat" has six times 50,000, but maybe "the cat sat on the mat and the rat ran over to the cat" becomes even longer. We can't handle variable-sized inputs; the inputs all have to be mapped to the same length. That's another problem.
33:22
>> So maybe you can sum the columns, basically, and count how many times each word appears, since you're losing the spatial relationship anyway.
>> Yes. So both of you are on the same sort of trajectory, which is that we need to somehow take this table and make it into a vector, and there are many ways, like what you folks are describing, to make it into a vector, and it turns out they address the things we've been discussing so far, the varying lengths and so on. So what we can do is aggregate. If you just add the rows up, that's what you described; I believe it's called sum encoding. And if instead of adding you just OR them, meaning you look at each column and ask, is there any one in this column? If there's any one, I put a one; otherwise it's a zero. That's called multi-hot encoding. So if you look at this thing and literally go column by column and count everything: okay, there's a one here, a one here; oh wait, there are two ones here, so you put a two. That's count encoding. Multi-hot encoding just looks for any ones and puts a one. Make sense? So there are many ways to take these tables and make them into vectors; these two happen to be very commonly used, and they kind of make common sense.
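A self-contained NumPy sketch of these aggregations, using the same toy vocabulary as before (my illustration, not the lecture's code):

```python
import numpy as np

vocab = ["[UNK]", "a", "cat", "mat", "on", "sat", "the"]
index = {t: i for i, t in enumerate(vocab)}

# One one-hot row per token of "the cat sat on the mat"
tokens = "the cat sat on the mat".split()
rows = np.zeros((len(tokens), len(vocab)))
for r, t in enumerate(tokens):
    rows[r, index.get(t, 0)] = 1

count_enc = rows.sum(axis=0)             # count/sum encoding: occurrences per word
multi_hot = (count_enc > 0).astype(int)  # multi-hot: presence/absence per word

print(count_enc)   # [0. 0. 1. 1. 1. 1. 2.]  ('the' appears twice)
print(multi_hot)   # [0 0 1 1 1 1 1]
```

Note that "the mat sat on the cat" would produce exactly the same two vectors, which is the order-loss point made next.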
34:39
Okay. Right. So this aggregation approach that we just described is called the bag-of-words model. Bag-of-words model. And the reason is that, first of all, this bag that we have has words: it either records whether a word exists or not, or it counts how many times the word has appeared; that's multi-hot versus count (or sum) encoding. But more importantly, and this goes back to your observation, we have lost the order of the words. Now, whether the phrase that came in was "the cat sat on the mat" or "the mat sat on the cat", the count encoding and the multi-hot encoding are exactly the same. There's no difference, because we're just looking for the presence or absence of words. That's it. We don't care in which order they appear, right? That's a huge limitation, but shockingly, for many applications it doesn't matter; it's good enough. So it's called the bag-of-words model. All right, so this is called the bag-of-words model. Now, does it have any shortcomings? I already talked about the first shortcoming, which is that it loses sequentiality, the order; we lost this order information, we lose the meaning inherent in the order of the words. What are some other issues with it?
36:04
What do you mean by that?
>> Right, so there are lots of zeros and not that many ones, so it's a very sparse amount of information, but you're carrying around a lot of data to make it all work.
Now, there are some tricks, computer science tricks, to handle sparsity in clever ways, but it is certainly an issue. Now, the other issue is that, let's say the vocabulary is very long: each input, whether it's the collected works of William Shakespeare or the phrase "I love you", will have the same length input. The same length input, because ultimately every incoming thing gets mapped into one vector. Okay, that feels a little suboptimal; clearly the collected works have a lot more stuff going on in them. Right? So that's a problem. In particular, for very small things that come in, you'll be spending a lot of compute on those long vectors and processing them. Now, you can mitigate some of this by choosing only the most frequent words. You don't have to take every word; I read somewhere that the English language has roughly 500,000 words, but it turns out the top 50,000 most frequent words are responsible for just about everything you're ever going to see, and the other 450,000 are what's called the long tail: they almost never happen; you never see them. So you can be very pragmatic and say, I'm not going to take every little word that I see in my vocabulary; I'm going to take only the most frequent words, and I'm just going to ignore the rest. I'm just going to ignore the rest. Okay?
37:46
But if you ignore the rest: let's say there is one word, let's take some Shakespeare word, Hamlet. Let's assume that you ignore the word "Hamlet" from your training corpus; you just delete it because it's not one of the most frequent things you have seen. And then somebody sends you a text saying, you know, "Hamlet was a bad prince. Analyze the sentiment of the sentence." Well, when it sees "Hamlet", what is your system going to do? It's going to look at "Hamlet" and say, I can't see it in my vocabulary anywhere. And if it can't see it in the vocabulary, what is the only thing it can do? Replace it with UNK. So that's where UNK comes into the picture. Whenever it can't see something in the vocabulary in a new input, it just replaces it with UNK. Which means that if you had ignored Romeo, Juliet, and Hamlet in the training corpus, all of them are going to be replaced by the same UNK, which means we can't distinguish between them anymore.
>> So is this where hallucination comes into play, where it doesn't recognize...?
>> Interesting question: is this where hallucination comes up? Actually, as it turns out, no, as we will see when we talk about LLMs later. LLMs actually will not have this UNK problem, because they use a different tokenization scheme which can handle anything you throw at it, including new stuff you just made up. So we'll come back to that.
39:14
All right. So that's what we have. And despite its shortcomings, bag of words is actually a really good default for many NLP tasks. And in the spirit of "do the simple stuff first, and do complicated things only if the simple doesn't work", we'll use a bag-of-words model right now. Okay. So we'll switch to a Colab and see how it's done.
39:36
So here the application we're going to work with is kind of a fun application: we're going to try to predict the genre of songs. It's a nice classification use case. We want to take some arbitrary song and classify it as either hip-hop, rock, or pop. Okay. And so, for instance, this is the kind of lyrics you're going to see. And as you will see in this data set, just a quick word of caution: the data set does have lyrics which may not be, you know, safe for work, as it were. So I'm not going to be exploring the lyrics in the Colab, but I just wanted you to be aware of it. Okay. But it's just some data set that we downloaded from somewhere, right? It's got all these lyrics. Okay. So we're going to try to classify each verse that we see into one of three things: hip-hop, rock, or pop. It's a multi-class classification problem. All right. Actually, what is the simplest neural-network-based classifier we can build for this problem?
40:41
All right. So what is the simplest neural network we can build for this problem? Remember what the input is: the input is going to be a bunch of song lyrics; it's going to be a really long song for all you know, right? And we're going to use the bag-of-words model, and let's assume for a moment that we will use multi-hot encoding. We'll create a vocabulary from the songs: we'll take all the songs, process them, run them through STIE, and do multi-hot encoding, which means that every song that comes in will be a vector that's how long? It'll be as long as the... correct, as long as the vocabulary size, right. So maybe what comes in is this phrase; since it's supposed to be songs, I'll say something which is probably common to 90% of songs: "I love you". Okay, that goes in; it goes into our STIE process, and the STIE process gives us a vector, which is x1, x2, all the way to xV, where V stands for the size of the vocabulary. Okay, so that's our input layer. So knowing what we know now about deep learning, what can we do next?
42:02
>> Couldn't you, or maybe I'm getting ahead, but wouldn't the baseline classifier just be to classify it as the most common genre?
>> That is the baseline. Correct. And we'll come to the baseline a bit later. But here I'm saying: suppose you wanted to build a neural network model for this. How would you set it up?
>> You think about the layers that you want.
>> Right. And what is the simplest thing you can do with a neural network? How many layers?
>> Uh, no layers.
>> Well, then it becomes problematic to even call it a neural network, because it could just be logistic regression.
>> One hidden layer.
>> Yes, thank you. I'm being a little squishy about this, because there are some people who'd say, well, even if there are no hidden layers, if you're using ReLUs and this and that and sigmoids, maybe it's a neural network, and I don't want to get into that "how many angels on the head of a pin" argument. So yes, we need one hidden layer; in this course we need at least one hidden layer for it to qualify as a neural network. Okay, so let's have a hidden layer, and we'll have a bunch of ReLUs as usual. Okay, a bunch of ReLUs, and I'll ignore all the arrows between them; it's kind of a pain. And then we come to the output layer. What should the output layer be? How many nodes do we need in the output layer? Three, right? Hip-hop, rock, pop. And then that layer uses what activation function? Softmax. Perfect. Love it; love this class. All right, three things: rock, hip-hop, and pop, right? And this is a softmax right there. And then it's going to give us three probabilities that add up to one, because it's a softmax. So that's our basic network, right? Perfect. Yeah?
43:49
>> Why do you need those probabilities? If you just want to identify the most likely genre, the softmax just gives you a way to make them all add up to one. Why do you need softmax? Why don't you just take the max value and say it's that?
>> Oh, interesting question. Why can't we just produce three numbers and grab the maximum number? It turns out that finding the maximum of a bunch of numbers, that function, is not very friendly for differentiation. And ultimately you want to take this output, run it through a loss function like cross-entropy, and then be able to run backprop on it. Fundamentally, backpropagation is just differentiation, and it requires everything inside of it to have well-behaved gradients. And this little max function is actually not well behaved, which is why we have a soft version of it, softmax, which makes it easy to differentiate. I can tell you more about it offline, but that's the quick synopsis.
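For the curious, here is the idea in a few lines of NumPy (a sketch; the genre scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize: a smooth, differentiable stand-in for argmax.
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores for rock, hip-hop, pop (made up)
print(softmax(logits))              # three probabilities that sum to one
```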
44:49
A lot of the tricks you will see in the neural network literature are ways to avoid this problem of the obvious choice of function not being well behaved for differentiation; that's why you need to go through all these other mechanisms. Much like we couldn't just use accuracy: why don't you just maximize accuracy instead of doing this cross-entropy business? Same reason. All right, so let's come back here.
45:14
All right. So that's what we created on the board, right? The cat-sat-on-the-mat vocabulary thing and so on. And, you know, I was playing around with it earlier, and I found that eight ReLU neurons were pretty good to get the job done. So I'm just going to go with eight ReLU neurons in the hidden layer.
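Put together, the network sketched on the board might look like this in Keras (a sketch under the stated assumptions: V stands in for the vocabulary size, eight ReLU units, three softmax outputs):

```python
import tensorflow as tf
from tensorflow import keras

V = 50_000  # vocabulary size (placeholder)

model = keras.Sequential([
    keras.layers.Input(shape=(V,)),              # multi-hot bag-of-words vector
    keras.layers.Dense(8, activation="relu"),    # hidden layer of 8 ReLU neurons
    keras.layers.Dense(3, activation="softmax"), # rock / hip-hop / pop probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # targets are one-hot encoded
              metrics=["accuracy"])
```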
45:39
So I think that brings us to the Colab. Yeah, so let's switch to the Colab.
45:47
All right. So that's what we have here. There's a little bit of verbiage here which just describes what I just talked about. So we'll do the usual things: upload everything, import everything we want, TensorFlow and Keras and the holy trinity of NumPy, pandas, and matplotlib, and set the random seed, as usual, at 42.
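The setup cells look roughly like this (a sketch of the Colab preamble described here):

```python
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

keras.utils.set_random_seed(42)  # set the random seed as usual
```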
46:07
This is our STIE framework here. And the nice thing is that all four of these things, STIE, are beautifully implemented in Keras as a single simple layer called the TextVectorization layer, which is nice. So we have the TextVectorization layer right here. In our first example, we will use the default standardization, which will just remove punctuation and convert to lowercase; we'll use the default tokenization, which just means split on the space between words; and then we will set the output to multi-hot. Right? All the things we talked about, Keras will just do for you automatically: output mode multi-hot, standardize lower-and-strip-punctuation, split on whitespace, and boom, you run the text vectorization thing. And once you do, Keras creates this TextVectorization layer with these settings, and it's now ready to swing into action. So what does "swing into action" actually mean? Well, now we need to actually feed it a training corpus so that it can do all the things it's supposed to do and create the vocabulary for you, right? And that thing is called the adapt method. So we create a tiny training corpus for us; this is our data set, just a bunch of words from some of these lyrics. And then we take this layer that we just defined, that we have set up here, and we ask this layer to actually create the vocabulary using this adapt command. Okay, index the vocabulary, and it's done. And once it does it, you can actually ask it for the vocabulary, using the get_vocabulary command. And so, first of all, how long is the vocab? 17: 17 words, 17 tokens. What are they? And see here, you can see these are all the words, and you can see that UNK is stuck in at the very beginning. That's sort of the default.
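A sketch of what those cells might look like (the tiny corpus below is a made-up stand-in, not the actual phrases used in class):

```python
from tensorflow.keras.layers import TextVectorization

# S, T, I, E in a single Keras layer: default standardization
# (lowercase + strip punctuation), default whitespace split, multi-hot output
vectorize_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="multi_hot",
)

corpus = ["write it, rewrite it", "rewrite those arrays",
          "arrays of words", "write more words"]  # made-up stand-in corpus

vectorize_layer.adapt(corpus)            # build and index the vocabulary
vocab = vectorize_layer.get_vocabulary()
print(len(vocab), vocab)                 # '[UNK]' at index 0, rest sorted by frequency

print(vectorize_layer(["write still"]))  # multi-hot tensor; 'still' maps to [UNK]
```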
47:52
By the way, just a little programming tip if you don't have a ton of programming experience: if you want to print these Python objects, like lists and so on, in a pretty way, one trick that often works is to just stick it into a data frame and then print that; usually it'll print in a much better way. So you can see it like that. So you can see here: unk, arrays, and so on, and you can see integer zero assigned to the UNK token. By the way, how come it picked the word "arrays" as the second entry? Why not something like "an", or, you know, why not "a"? How come "a" is not chosen as the second entry? Why did it pick "arrays", do you think?
48:43
are most influential on the meaning of
48:45
the sentence to be on the
48:49
But it at this point it doesn't know
48:51
what we're going to use it for.
48:54
So it has no way to know what word is
48:56
useful because we haven't told it how
48:57
we're going to use it.
48:59
But but you're kind of on the right
49:01
track. So what KAS does is it'll
49:04
calculate it'll find all these tokens
49:06
and then it'll actually just sort them
49:07
by frequency.
49:09
So the most frequent as it turns out in
49:12
those four sentences we gave it happen
49:13
to be the word arrays. That's why arrays
49:15
is showing up on top. Um, and you can
49:17
actually confirm this by going to the
49:19
our little data set and you can see here
49:21
array shows up here and was up here
49:23
twice and that's why it came up on top.
49:25
Okay. All right. So that's what we have
49:29
and u and now now that we have populated
49:32
this we can run any sentence through it
49:34
easily. Yeah.
49:36
>> Does it matter that it's at the top, or is it just...?
>> It doesn't matter. It doesn't matter. The reason why it's helpful later on is because suppose you tell Keras, hey, don't take every word you see here; give me only the most frequent 100 words, I don't want any more than that. It can easily do that. That's the reason. Yeah?
50:01
>> This is just a vocabulary. So basically you give it all these phrases, it happens to be just four phrases in our example, and then it finds all the distinct words, does all that stuff, and then it has created a vocabulary. At this point the training corpus you fed it is forgotten, and the only thing that has survived this processing is the vocabulary. That's it. Now we start applying it to whatever kind of text we want to use it for.
50:25
So when you come back here, this is what we have, and what you can do is take any sentence and just run it through the layer, to make sure it actually is doing the right thing for you. So we'll take the sentence, run it through the text vectorization layer by just passing that sentence into it, and then we can just print it. So now it's giving you a tensor. This is a multi-hot encoded tensor with all these ones and zeros. Note that this tensor is 17 units long, which is a good check, because our vocabulary is 17 long, so it had better match that. Now recall that the UNK token is at the first location, at index zero, and it says that this encoded sentence does have an UNK word. Okay. So why is that? What is this UNK word? Anyone care to guess? Well, it turns out to be the word "still". "still" is not in our vocabulary, because the four sentences, which are the training corpus used to build the vocabulary, had a lot of "write" and "rewrite", but there was no "still" in them. That's why there's an UNK for it. We can just double-check that by asking Python: is it in the vocabulary? Nope, it's not.
51:40
in the spirit of making small changes to
51:41
the code to understand what's going on,
51:42
which is a very useful tip for folks who
51:45
don't have a ton of programming
51:46
knowledge, let's say that you send in the
51:48
made-up phrase Sloan Hodddle DMD. Uh, I
51:52
think you will agree with me that none
51:54
of these words is in the training
51:55
corpus, right? So what is
51:59
the multi-hot encoded vector for this
52:02
phrase, Sloan Hodddle DMD?
52:07
three
52:11
It's not count encoding, it's multi-hot
52:13
encoding,
52:14
right, it's going to be 1 0 0, so you can
52:17
see here, or in this case, remember, the
52:19
vocabulary is 17,
52:21
right? So each of these words is going to
52:23
be a one followed by 16 zeros.
52:27
And then it's going to multi-hot encode
52:29
them, which means the three ones in the
52:30
same column just become a single one. So you
52:34
still have only this one, at index zero. Okay.
52:37
All right. Good. So now
52:39
let's actually get to the data
52:41
set. We have these 90,000 songs. Uh, and
52:45
it's in this little thing here. Uh we
52:47
have grabbed the data and cleaned it up.
52:49
Cleaned it up meaning like formatting
52:50
wise, not content wise. Uh, and then we
52:53
stuck it in this data frame, and
52:55
we have already divided it into train, test,
52:56
and validation for your benefit. So you
52:58
don't have to worry about it. So turns
53:00
out we have almost 49,000 songs in
53:03
the training set, 16,000 songs in the
53:05
validation set, and roughly 22,000 in
53:08
the test set. Okay, lots of songs. It's a
53:10
lot. It's a big data set. Um so let's
53:13
just look at the first few.
53:15
So oh girl, I can't get ready. We met on
53:18
rainy evening. Paralysis through
53:20
analysis.
53:22
Okay, that I can relate to as a data
53:23
science person. But anyway, by
53:27
the way, these things are very
53:29
useful for exploring any data
53:31
frames that you might have. It's a
53:33
Colab feature, just check it out. Um, so
53:36
anyway, that's the first few the first
53:38
few rows. Let's look at the last few
53:40
rows.
53:43
Okay,
53:48
You never listen to me is pop. Beamer
53:51
Benz is hip-hop. Yeah, of course.
53:57
So, okay. Uh, now to go back to the
53:59
question of, okay, um, what could be a
54:01
good baseline model? We need to
54:02
understand the proportion of these three
54:04
classes of songs. So, we'll do a quick
54:07
check. Turns out rock is 55%. So, if you
54:10
had to just guess something just
54:12
naively, you would just guess everything
54:13
to be rock and you'd be right 55% of the
54:15
time. Uh, so now, by the way, the
54:18
target variable which tells you
54:20
which of these three genres it
54:21
is is actually a categorical
54:24
variable. So we need to one hot encode
54:26
that, right? Um, so we'll just turn that
54:29
this way using the pandas get dummies
54:32
function. And when we do that, this is
54:34
y train, which contains the dependent
54:35
variable. And you can see that it is one
54:37
hot encoded now: 0 1 0, 0 1 0, 0 1, and
54:40
so on and so forth. That's it. So I
54:42
think the first one is, I forget, rock,
54:44
hip-hop, rock, pop, or whatever. It's in
54:46
some order. We'll get to that
54:48
later. So it's one hot encoded as well.
54:50
So that is as far as the data
54:52
downloading and setup is concerned. Any
54:54
questions?
54:55
>> Yeah.
54:57
>> Uh this kind of goes back to the
54:58
transfer learning concept. But do you
55:01
always want to build your corpus based
55:04
off of the vocabulary of your training
55:06
data or could you have like a
55:08
pre-compiled, like somebody's already made
55:10
like a list of the 50,000 words?
55:13
>> That's a really good question. Uh
55:15
unfortunately I'm going to punt on it
55:16
for the moment because um with modern
55:20
large language models a number of these
55:22
NLP tasks for which you had to sort of
55:25
roll your own and build your own thing
55:27
can now be very easily done using large
55:29
language models without even any further
55:31
training.
55:33
The price you pay is that you have to
55:34
use a large language model, which means
55:35
you have to pay somebody for API calls and
55:37
things like that, and there are other
55:38
issues with it. Uh, but
55:41
we'll talk a lot about transfer learning
55:43
for text when we come to it a little later
55:46
in the NLP sequence. So if I forget
55:48
please bring it up again.
55:53
>> Yeah.
55:54
>> Um, quick clarification on the encoded
55:58
vector. It's stored as floats, not ints.
56:00
If it gets incredibly long wouldn't that
56:03
eat into compute time? Is there a reason
56:05
why it's floats?
56:06
>> Yeah. So the question is that when I
56:09
showed you that tensor, it is
56:11
actually written as a continuous
56:13
number, right, a floating point
56:14
number, but we know these are zeros
56:16
and ones, so why do we have
56:18
to waste compute capacity by telling the
56:20
computer that these are all big
56:21
continuous numbers when it's just a zero or a
56:23
one? There are ways to optimize that, but
56:25
these problems are so small we just
56:26
don't worry about it but when we come to
56:28
something called parameter-efficient
56:30
fine-tuning, lecture 10-ish maybe, we
56:34
actually exploit that particular fact to
56:35
make things faster.
56:38
Okay, so that's what we have. Uh, so
56:41
we'll do the bag of words model.
56:43
Um, by the way, there's a whole bunch of
56:46
stuff here. It just repeats what I've
56:47
been telling you in the lecture. So feel
56:49
free to read it again, but we can ignore
56:50
it for the moment. And now there's a new
56:54
thing we are doing here. So we are
56:55
basically saying, look, instead of
56:58
taking every word you see in these
57:00
49,000 uh songs in the training corpus,
57:03
uh, it's going to be too many words.
57:05
just pick the 5,000 most frequent words
57:09
and that's what this max tokens stands
57:11
for. Okay. And so we tell it, all
57:15
right, do this thing, max tokens 5,000,
57:18
sorry, not 50,000, 5,000, and still do
57:20
multi-hot, and we are not explicitly
57:22
setting the standardization and all that
57:24
stuff, because the defaults are what
57:25
we're going with. Okay. Yeah.
57:29
>> So this is for making it more efficient?
57:30
Like, don't waste your time
57:32
on these thousands of rare words,
57:36
just focus on the frequent ones to make it
57:39
more efficient?
57:40
>> To make it more efficient, yes. But there is a
57:42
related and important point which is
57:44
that fundamentally the number of tokens
57:46
you allow this layer to have dictates
57:49
the size of your vocabulary and the size
57:51
of your vocabulary dictates the size of
57:53
the vector that you feed in. So shorter
57:56
vectors are better than longer vectors.
57:57
That's the efficiency point. The other
57:59
point is that the longer the input
58:00
vector, the more the number of
58:02
parameters the network has to learn
58:04
because the first layer itself is roughly the
58:06
size of the input times
58:08
the size of the hidden layer. So if this
58:10
thing becomes 10 times as long, you have
58:11
10 times as many parameters to learn, and
58:13
given a finite amount of data, right,
58:15
the more parameters you have, the worse
58:17
it's going to do when you actually start
58:18
using it in the real world. It's going
58:19
to overfit heavily. That's why you
58:21
need to be very careful.
58:24
Okay.
58:25
Yeah.
58:27
So, um, you downloaded the data set, but
58:29
are you still using the vocabulary, the
58:31
17 words, or did you
58:33
>> No, no, that was just for fun. I'm
58:35
going to actually build a vocabulary
58:36
now. It's coming. Yeah, good question.
58:38
Yeah. So, all right, let's do that. Um,
58:41
so I first, you know, I defined this
58:43
layer. Uh, okay. I just defined it. All
58:46
right. Now we actually build the
58:47
vocabulary by essentially telling it to
58:49
adapt the layer using the
58:53
full, basically all 49,000 songs in
58:56
the training data set. Right, that's a
58:58
long list of songs. As far as Keras is
59:01
concerned, it's just looking for a list
59:02
of strings, so you just give it the list
59:04
of strings. Instead of four we're giving
59:06
it 49,000; the same philosophy applies.
59:09
So we run it.
59:11
it's obviously going to take a few
59:12
seconds to do that because it's 49,000
59:15
songs
59:17
five seconds. Uh, all right. Let's look
59:19
at the most common 20,
59:21
right? We get the vocabulary from our
59:23
layer. See, once you adapt the layer and it
59:26
has built a vocabulary, the layer has
59:27
sort of been populated with all this
59:29
information. So, you can query it. So,
59:31
you can get the vocab top 20 words; the
59:34
most frequent words, no surprise: you, I,
59:37
blah, blah, blah. Uh, let's look at the
59:39
last few.
59:41
Dagger cheddar
59:43
verified
59:46
moving on
59:48
Right, and then, so once we have done
59:51
that, now we actually can vectorize all
59:52
the data sets we have using this, and by
59:55
vectorize I mean take every string and
59:57
create the multi-hot encoded vector from
59:59
it. Uh, yeah.
1:00:00
>> Are we doing standardization? Because we're keeping
1:00:02
stuff like d, a, etc. Yeah, we are not
1:00:05
strictly doing standardization, or to put it
1:00:07
differently, standardization typically means
1:00:09
lowercasing, stripping punctuation,
1:00:12
stemming, stop word removal. Here the
1:00:14
default in Keras happens to not do
1:00:16
stemming, not do stop word removal, so
1:00:18
we're just going with the default. Thanks
1:00:20
for the clarification.
1:00:22
And in fact, in practice, what I find
1:00:23
these days is: don't even bother to
1:00:25
stem, don't even bother to remove the
1:00:27
stop words, it's going to work well
1:00:28
enough.
1:00:31
Okay, so, all right. Okay, so now each
1:00:34
phrase is a vector. How long is this
1:00:36
vector? Each song is now a vector. How
1:00:38
long is that vector?
1:00:41
5,000. Correct. Because that is the size
1:00:43
of the vocabulary. Correct.
1:00:47
It's max tokens long, which is 5,000. So
1:00:49
if you actually look at X Oh, wait,
1:00:51
wait, wait, wait, wait. I haven't done
1:00:52
this thing yet.
1:00:57
It's going through 49,000. It's going
1:00:59
through another what? 23,000. Fine. So
1:01:02
let's run it.
1:01:04
Okay, now we can see X train which is
1:01:06
all the training data you have, is a
1:01:09
tensor, a table with 48,991 rows, and
1:01:12
each row is a 5,000-long vector.
1:01:18
All right, good. Now we will try the
1:01:20
simple neural network that we wrote up
1:01:23
in class. So and now at this point this
1:01:28
code should be sort of second nature,
1:01:31
right? Isn't that cool? It's so easy to
1:01:34
write the thing; the power of
1:01:36
abstraction. So we take a Keras Input
1:01:39
as usual, the input layer, we tell it what is
1:01:41
the size of each thing that's coming in.
1:01:42
Well, the size of each thing is a max-
1:01:44
tokens-long vector. So we tell it the
1:01:46
shape is max tokens, and then we run it
1:01:48
through a dense layer with eight ReLUs.
1:01:51
Okay I'm hurrying.
1:01:54
So we get the outputs then we string the
1:01:56
inputs and the outputs into a model and
1:01:58
then we summarize the model. That's it.
1:01:59
So we go here, and this has 40,000
1:02:02
parameters, and you can see here, right,
1:02:04
when you go from the input, the 5,000 × 8
1:02:08
gives you 40,000, plus the eight
1:02:10
neurons each have a bias coming in, that's
1:02:11
another eight, so you get 40,008. Okay.
1:02:15
And we compile it as usual; we use Adam
1:02:17
as usual, and because now the output
1:02:20
y variable, the y train variable,
1:02:23
itself is actually one hot encoded,
1:02:27
right, 0 1 0, 0 1, depending on pop, rock,
1:02:29
and so on and so forth, we don't use
1:02:31
sparse categorical cross entropy. We
1:02:33
just use plain old categorical cross
1:02:35
entropy here. Okay. And this was
1:02:38
explained in lecture last week. So you
1:02:40
can revisit it if it's not
1:02:42
familiar. We again report accuracy,
1:02:44
right? So let's compile it. And we've
1:02:46
got a model. So we just run it for 10
1:02:48
epochs with a batch size of 32. And
1:02:50
because we have validation data already
1:02:52
supplied to us, we don't have to tell
1:02:53
Keras, take the training data and keep
1:02:55
20% of it aside for validation. We can
1:02:58
literally tell it what validation to
1:02:59
use. That's what we're doing here. Okay.
1:03:04
All right. So, it's running.
1:03:06
Um,
1:03:09
it's pretty fast.
1:03:16
Any questions so far?
1:03:18
>> Yes.
1:03:20
>> The microphone.
1:03:23
>> How do we decide the max tokens? Like, we
1:03:25
defined the number 5,000 here, but we
1:03:27
do not know how many words would be
1:03:29
there in the entire text.
1:03:29
>> Yeah. So it's a good question. How do
1:03:31
you decide on this maximum
1:03:32
vocabulary? What you typically do in
1:03:34
practice is that you do it
1:03:36
without the max tokens and then you see
1:03:38
how long the vocabulary is and then you
1:03:40
actually get statistics on how
1:03:41
frequently the very infrequent words
1:03:43
actually show up. And then you'll
1:03:45
typically see like a dramatic fall-off
1:03:47
at some point and you pick that fall-off
1:03:49
point and then set that to be the max.
1:03:54
Uh all right. So perfect. Let's test it.
1:03:58
Uh accuracy is pretty good. 87% on the
1:04:01
training and 73 on the validation. We'll
1:04:05
do it on the test set. All right. 72%.
1:04:09
So we saw earlier that the largest class
1:04:11
of the three is rock, with around
1:04:13
55%. So the naive model is going to get
1:04:15
55% accuracy, and this little neural
1:04:17
network model gets you 72%, which is
1:04:19
pretty nice. Okay. So now let's actually
1:04:22
kick it up a notch and make it slightly
1:04:23
more capable. So the key thing here is
1:04:26
that, as has been observed in
1:04:29
class already, when you go with a bag of
1:04:31
words model we lose all notion of order,
1:04:33
right? The word order clearly matters and
1:04:35
we're kind of ignoring it. So what we do
1:04:38
to get around it is, well, actually there's a
1:04:40
really interesting sentence
1:04:42
here. Let's say this is a movie review:
1:04:44
Kate Winslet's performance as a
1:04:46
detective trying to solve a terrible
1:04:48
crime in a small Pennsylvania town is
1:04:50
anything but disappointing.
1:04:52
Tricky thing, right? Because if
1:04:55
you look at the words separately, the
1:04:56
words terrible and disappointing look like
1:04:58
negative sentiment, right? But then if
1:05:01
you actually know that the word terrible
1:05:04
refers to the crime, not to the
1:05:06
movie, and that anything but
1:05:08
changes the meaning of the word
1:05:09
disappointing, you will see obviously
1:05:10
it's a positive review, right? So
1:05:12
clearly the words around the
1:05:14
word provide valuable clues as to how to
1:05:17
interpret that word. And so what we do
1:05:20
is ask how we can make our little model a
1:05:23
bit more capable of recognizing the
1:05:25
context around every word. And the way
1:05:27
we do it is something called bigrams.
1:05:29
Okay. And for bigrams, what we
1:05:32
basically do is, instead of
1:05:34
just taking each word, we take
1:05:36
each word and we further take every pair
1:05:39
of adjacent words,
1:05:42
and those become our tokens. And because
1:05:44
we take two adjacent words, right, they are
1:05:47
called bigrams. You can take three adjacent
1:05:49
words, trigrams; you get the idea:
1:05:51
n-grams. Okay, so that's the idea of bigrams.
1:05:54
And so, for example, if you had the
1:05:56
sentence the cat sat on the mat,
1:05:59
you will have the, the cat, cat sat; you
1:06:03
get the idea, right? That's what we
1:06:05
have. So let's do a little example, and
1:06:07
Keras makes it very easy: you literally
1:06:09
tell it ngrams equals 2.
1:06:12
And now from this you
1:06:15
immediately should know that ngrams
1:06:16
equals 1 is the default; that's why
1:06:19
we didn't have to specify it. Okay, so you
1:06:23
run it, and then
1:06:25
the cat sat on the mat is your training corpus,
1:06:27
and then you get the vocabulary, and you
1:06:29
can see here, right? It has created all
1:06:31
these nice bigrams for you. And that's it.
1:06:34
All right. Now, what we do is
1:06:35
we'll go back to the songs and we
1:06:37
actually tell Keras to not just take
1:06:39
each word, but take all the bigrams as
1:06:41
well. And hopefully it'll do a better
1:06:43
job, right, of figuring out what the
1:06:45
genre is. And now, because, you know,
1:06:47
when you say,
1:06:49
okay, take the top 5,000 words, that's
1:06:51
great for single words, unigrams as they are
1:06:53
called. But when you have bigrams, you
1:06:56
have 5,000 possibilities for the first
1:06:57
word, maybe 5,000 for the second word,
1:06:59
right? That's a lot of possibilities. 25
1:07:01
million. Now, most of the 25 million
1:07:03
possibilities are not going to show up
1:07:04
in the data. So, you don't need to
1:07:05
actually make it much larger, but you
1:07:07
should make the vocabulary a bit more
1:07:08
than 5,000. So, here we go with say
1:07:11
20,000, right? Otherwise, it's the same.
1:07:13
Still multi-hot. So, let's run it. And
1:07:16
now we will run this. Now that the layer
1:07:18
has been set up with all the right
1:07:20
settings, we'll ask it to create the
1:07:21
vocabulary. Okay? Again by doing exactly
1:07:24
what we did before: create the
1:07:25
vocabulary.
1:07:30
It takes a few seconds.
1:07:42
With bigrams, trigrams, all of them, it will get much
1:07:44
more compute-intensive; that's why
1:07:46
you're seeing this. So all right let's
1:07:48
look at the first 10 words. The first 10
1:07:51
words are all just single words and
1:07:53
that's not surprising because the single
1:07:54
words are going to be the most
1:07:55
frequent, right?
1:07:59
And then the last few:
1:08:02
your mom, your god, you short, you hell.
1:08:09
All right, let's just, you know,
1:08:13
index all the data we have, the
1:08:15
training, validation, test sets, using this
1:08:17
vocabulary.
1:08:23
Perfect. Now we come to our second model
1:08:24
where we say the incoming
1:08:26
shape is now 20,000 long, right, because
1:08:28
we increased max tokens from 5,000 to
1:08:30
20,000. So each thing is a 20,000-long
1:08:32
vector; otherwise it's the same. And now
1:08:35
we will use this thing called dropout
1:08:37
for the first time, which is a
1:08:38
regularization technique that I have referred
1:08:41
to earlier that I never really described
1:08:43
and I will describe today if we have
1:08:45
time but I'll first run through the
1:08:47
whole demo. So for now, you
1:08:49
can just think of dropout as just
1:08:50
another layer you can insert and it's
1:08:52
essentially a great way to prevent
1:08:54
overfitting. So I just routinely will
1:08:56
use it and I'll talk more about it. So
1:08:58
for now you have this dropout layer in
1:09:00
the middle. It receives the input from
1:09:02
the dense layer and then sends it to the
1:09:04
output layer. The output layer is
1:09:05
unchanged. It's a three-way softmax.
1:09:07
Same model as before. Okay.
1:09:10
And now, all right, we'll come back to dropout.
1:09:11
So we'll compile it the same way as
1:09:13
before, and then I will
1:09:15
just fit it for three epochs. Um if
1:09:17
you're interested after class later on
1:09:19
you can actually try it for more epochs
1:09:20
and see if it does better. Uh for now in
1:09:22
the interest of time we'll just do it
1:09:23
for three
1:09:29
right
1:09:36
I think 72%, right, was the single-
1:09:39
word unigram thing we had.
1:09:43
>> If you're rerunning this code with the
1:09:45
same number of epochs, do you ever expect the
1:09:47
accuracy to change?
1:09:49
>> Um, if you ran this code on
1:09:51
your machine, you would expect it to be
1:09:53
roughly the same, but there are some
1:09:55
minute differences due to hardware and
1:09:57
device drivers.
1:09:58
>> If you rerun it on your own machine
1:09:59
twice, would you expect a change?
1:10:02
>> That's actually a very tricky question.
1:10:05
Uh because it depends on what else I
1:10:07
have been doing in that notebook.
1:10:09
If I start fresh and do nothing but
1:10:11
that, typically I get the same numbers
1:10:13
typically. But for some reason I don't
1:10:15
always get exactly the same numbers.
1:10:19
Okay. So we come to this. Let's evaluate
1:10:22
our little model.
1:10:25
Okay. 75%. So it went from 72 to 75.
1:10:29
It's actually a meaningful jump just by
1:10:30
using bigrams. Okay. And I ran it only
1:10:32
for three epochs. If you run it for 10,
1:10:34
maybe it's going to do even better. All
1:10:36
right. So that is the beauty of this
1:10:38
thing. Now let's just actually do a
1:10:40
little demo. Uh we'll try to predict
1:10:42
some lyrics. Okay, I'll try Another One
1:10:45
Bites the Dust.
1:10:49
It's a rock song. I think that's
1:10:50
correct. Yes. Okay. Okay, folks. Your
1:10:53
turn now.
1:10:55
Uh, somebody tell me your favorite song.
1:11:00
>> Dancing Queen from ABBA.
1:11:03
>> I love ABBA. That's awesome. All right.
1:11:05
Okay.
1:11:07
Uh, Dancing Queen
1:11:11
Rex.
1:11:17
This one has an intro. I don't like that.
1:11:18
Let's just go to something without all
1:11:20
this metadata.
1:11:23
Right.
1:11:27
All right. I'll just take the first
1:11:28
page. Okay.
1:11:40
Are we good?
1:11:42
All right,
1:11:45
down to the model. Let's predict
1:11:50
pop just about. Yay.
1:11:55
All right. So, uh yeah. So, that's
1:11:58
basically the model, but we have five
1:12:00
minutes, and I want to get back to dropout. You can
1:12:01
play around and put your own lyrics in.
1:12:03
Uh typically what happens is that the
1:12:05
last two years that I've been doing this
1:12:07
particular lecture, I've noticed that
1:12:09
the songs are always rock songs for some
1:12:11
reason.
1:12:13
>> First time I'm getting a pop song from
1:12:14
a group that I actually like.
1:12:16
So thank you.
1:12:18
Uh all right. Uh let's go back to
1:12:20
dropout.
1:12:22
So the idea here in dropout is that you
1:12:24
know, you have all these layers: the input comes
1:12:26
in, it goes through a hidden layer, and
1:12:28
so on and so forth. What does dropout do? So
1:12:30
dropout is a layer and you put this
1:12:33
layer just like you use any other layer.
1:12:35
And what dropout does is that it takes
1:12:37
all the things that are coming into it
1:12:38
from the previous layer and randomly
1:12:41
decides to replace that number with a
1:12:43
zero.
1:12:46
That's it. It drops that number and
1:12:48
replaces it with a zero. Okay? But it
1:12:50
does it randomly. It basically tosses a
1:12:52
coin, and if the coin comes up heads, zero.
1:12:54
If it comes up tails, let it through.
1:12:55
Pass it through. Okay? And the reason
1:12:58
why this is very effective is because
1:13:02
you can imagine all the neurons in a
1:13:04
particular layer, when they overfit to a
1:13:07
particular data set, the overfitting
1:13:09
happens because the neurons essentially
1:13:11
collude with each other, right? They sort
1:13:14
of collude with each other to
1:13:15
overfit and predict things in sort of
1:13:17
a very accurate way. So you want to
1:13:19
break any sort of collusion between the
1:13:21
neurons, right? I'm obviously using sort
1:13:24
of a, you know, game-theoretic way
1:13:26
of describing it, but the idea is that
1:13:28
any kind of spurious correlations in
1:13:30
your data, neurons can pick up by
1:13:33
being correlated themselves.
1:13:36
And so the way you avoid the spurious
1:13:38
correlation is by dropping neurons
1:13:40
randomly. You just kill the neuron
1:13:42
randomly, which means that no neuron can
1:13:44
depend on another neuron being
1:13:45
available.
1:13:47
I know it's a bit grim but that's the
1:13:50
basic idea of dropout and apparently the
1:13:52
story goes that the person whose
1:13:54
team invented it, Geoff Hinton,
1:13:56
who won the Turing Award for this stuff, not
1:13:58
for dropout, just for deep
1:13:59
learning generally, he said, I don't know if it's
1:14:02
true, but he said that apparently he got
1:14:03
the idea when he went to a bank and
1:14:05
realized that, you know, very often
1:14:07
the folks working in that bank
1:14:09
branch that he used to go to kept
1:14:11
changing,
1:14:13
right? They were never sort of the same;
1:14:14
people would be transferring in,
1:14:16
transferring out, and he was like, why
1:14:17
can't they just leave these people
1:14:18
alone? Why does it keep changing? And
1:14:19
then he got the insight that maybe a lot
1:14:21
of fraud happens because the person
1:14:24
working in the branch colludes with the
1:14:26
customer, but by changing the staff
1:14:28
constantly, you reduce the risk of
1:14:30
fraud happening. And that apparently was
1:14:32
the genesis of this idea. True?
1:14:34
Apocryphal? I have no idea. But it's
1:14:36
sort of a fun story. Uh yes,
1:14:40
>> instead of random, if we go to the way
1:14:43
historical models are built, concepts of
1:14:45
multicollinearity and all of that, would that
1:14:47
make it sharper as compared to this?
1:14:50
>> The problem is that um these networks
1:14:53
are massive, right? And for you to take
1:14:56
each layer and look at its correlation
1:14:58
with some other layer and so on and so
1:14:59
forth: first of all, investigating
1:15:01
multicollinearity is a problem. The
1:15:04
second thing is, okay, what do you do
1:15:05
then? Next, in linear regression you
1:15:08
can do things like principal components
1:15:09
analysis to get around it. Here
1:15:11
everything is nonlinear. There is no
1:15:12
easy way to solve the problem. So we're
1:15:14
like, we'll just solve the problem in one
1:15:16
shot using dropout. That's it. All right. Um,
1:15:20
so I had some material on
1:15:23
something called byte pair encoding,
1:15:25
which I will do when we
1:15:28
get to LLMs and I stuck it in the end
1:15:30
because I knew that we probably won't
1:15:31
have enough time to cover this anyway.
1:15:33
And that is a very clever tokenization
1:15:35
scheme used by for example the GPT
1:15:37
family, and that allows them to handle
1:15:40
punctuation beautifully, keep the case
1:15:41
intact, and handle words that you just
1:15:43
made up, and things like that. Okay. So
1:15:45
we have one more minute. I'm happy
1:15:47
to answer any questions you might have.
1:15:50
>> And so initially when we are picking
1:15:52
like the hidden layer, the number of
1:15:54
neurons and width. So far in all the
1:15:57
materials this has been given to us,
1:15:59
but initially how do you pick it? Is it
1:16:01
more of a trial and error type of thing
1:16:03
or
1:16:03
>> it tends to be trial and error. Um so
1:16:05
that's in fact what I did when I created
1:16:07
the Colabs. So, and you can
1:16:10
actually make it a bit more systematic
1:16:12
by trying lots of different values and
1:16:14
there is a particular Python
1:16:16
package called Keras Tuner. So just
1:16:18
Google Keras Tuner, and it comes with
1:16:20
very nice Colabs, and if I have a chance
1:16:22
maybe I'll just record a screen
1:16:23
walkthrough of doing that. But that's
1:16:25
a very efficient way to do these
1:16:27
things. And it comes under the broad
1:16:28
category of something called
1:16:29
hyperparameter optimization where the
1:16:31
number of neurons, the activation you
1:16:33
use, the learning rate, all those things
1:16:35
can all be tried. You can try lots of
1:16:36
variations, and Keras Tuner is a great way to do
1:16:39
it in the context of Keras.
1:16:42
Other questions?
1:16:45
>> All right, I give you 30 seconds back.
1:16:47
Thank you. See you tomorrow.
— end of transcript —