WEBVTT

00:00:16.800 --> 00:00:23.039
Okay. So today we start the natural

00:00:20.399 --> 00:00:24.799
language processing sequence and so just

00:00:23.039 --> 00:00:26.400
to give you a quick idea we're going to

00:00:24.800 --> 00:00:27.920
start with uh what's called

00:00:26.399 --> 00:00:29.759
vectorization

00:00:27.920 --> 00:00:30.960
uh and then the bag of words model and

00:00:29.760 --> 00:00:33.920
then we'll spend a fair amount of time

00:00:30.960 --> 00:00:34.799
on a Colab, uh, and then on Wednesday we

00:00:33.920 --> 00:00:36.480
talk about these things called

00:00:34.799 --> 00:00:38.000
embeddings which you'll come to

00:00:36.479 --> 00:00:40.238
appreciate over the next couple of

00:00:38.000 --> 00:00:42.640
weeks form sort of the core

00:00:40.238 --> 00:00:45.280
atomic unit of all modern natural

00:00:42.640 --> 00:00:47.439
language processing and for that matter

00:00:45.280 --> 00:00:49.280
vision processing as well. uh and then

00:00:47.439 --> 00:00:50.640
we will uh following week we'll do

00:00:49.280 --> 00:00:52.399
transformers two lectures on

00:00:50.640 --> 00:00:53.520
transformers we'll get into the theory

00:00:52.399 --> 00:00:55.759
and then we'll get into a bunch of

00:00:53.520 --> 00:00:59.280
applications and then lectures nine and

00:00:55.759 --> 00:01:01.358
10 will be all about LLMs so

00:00:59.280 --> 00:01:04.320
it's going to be a lot of fun. This is

00:01:01.359 --> 00:01:05.920
one of my favorite segments of the class

00:01:04.319 --> 00:01:08.000
of course truth be told every segment of

00:01:05.920 --> 00:01:10.879
the class is my favorite so don't judge

00:01:08.000 --> 00:01:13.599
me all right so let's get going uh so

00:01:10.879 --> 00:01:16.079
why natural language processing?

00:01:13.599 --> 00:01:17.599
you know these are in some sense the

00:01:16.079 --> 00:01:18.879
things I have on the slide here are sort

00:01:17.599 --> 00:01:21.679
of obvious but I think it's actually

00:01:18.879 --> 00:01:24.239
worth reminding ourselves of how

00:01:21.680 --> 00:01:26.320
important text is for everything we do.

00:01:24.239 --> 00:01:29.280
Uh obviously human knowledge is mostly

00:01:26.319 --> 00:01:30.959
encoded as text. The internet is mostly

00:01:29.280 --> 00:01:33.920
text. At least this was true till the

00:01:30.959 --> 00:01:35.759
advent of TikTok and YouTube. Uh, and

00:01:33.920 --> 00:01:37.840
human communication is mostly text and

00:01:35.759 --> 00:01:40.640
cultural production you know movies,

00:01:37.840 --> 00:01:43.680
books, uh arts and so on. So much of it

00:01:40.640 --> 00:01:47.040
is so text-heavy, and so in some sense

00:01:43.680 --> 00:01:49.200
text forms not just a big chunk of all

00:01:47.040 --> 00:01:50.880
the media that's out there but it also

00:01:49.200 --> 00:01:52.560
happens to be the way in which we think

00:01:50.879 --> 00:01:55.438
and communicate and so on and so forth.

00:01:52.560 --> 00:01:57.920
So its primacy is in my

00:01:55.438 --> 00:01:59.919
opinion sort of unparalleled uh in how

00:01:57.920 --> 00:02:02.399
we think about the world. And so the

00:01:59.920 --> 00:02:04.719
tantalizing possibility is that imagine

00:02:02.399 --> 00:02:06.560
if we had an AI system which could just

00:02:04.718 --> 00:02:09.519
read and quote unquote understand all

00:02:06.560 --> 00:02:11.759
this text, right? Um and so you can

00:02:09.520 --> 00:02:13.599
imagine such a system reading all of

00:02:11.759 --> 00:02:15.199
PubMed, reading all the medical

00:02:13.598 --> 00:02:17.199
literature and then coming back and

00:02:15.199 --> 00:02:19.439
saying you know for this particular

00:02:17.199 --> 00:02:21.039
disease you know this particular sort of

00:02:19.439 --> 00:02:23.039
protein is actually the malfunctioning

00:02:21.039 --> 00:02:24.400
protein and for that that small molecule

00:02:23.039 --> 00:02:26.400
is going to dock into the protein and

00:02:24.400 --> 00:02:27.680
cure the disease and you didn't know

00:02:26.400 --> 00:02:29.920
this. It came back and told you that.

00:02:27.680 --> 00:02:31.439
Wouldn't it be unbelievable? So my

00:02:29.919 --> 00:02:33.759
feeling is that such things are going to

00:02:31.439 --> 00:02:36.239
happen. It's just that it's not going to

00:02:33.759 --> 00:02:38.000
happen soon enough for my lifetime, but

00:02:36.239 --> 00:02:40.560
perhaps it'll happen in yours. All

00:02:38.000 --> 00:02:42.639
right. Okay. So, let's continue. So, NLP

00:02:40.560 --> 00:02:44.400
is an action all around us. Um, you

00:02:42.639 --> 00:02:46.958
know, according to Google, apparently

00:02:44.400 --> 00:02:49.840
Google autocomplete, uh, which uses a

00:02:46.959 --> 00:02:53.199
fair bit of NLP, uh, saves 200 years of

00:02:49.840 --> 00:02:54.640
typing time every day. Uh, I

00:02:53.199 --> 00:02:55.759
actually thought, you know,

00:02:54.639 --> 00:02:57.598
I wasn't very impressed with this

00:02:55.759 --> 00:02:58.959
number, frankly, because billions of

00:02:57.598 --> 00:03:01.919
searches are being done every day and

00:02:58.959 --> 00:03:03.598
I'm like, only 200 years? So anyway, but

00:03:01.919 --> 00:03:06.399
I think the more important point is that

00:03:03.598 --> 00:03:08.000
it made mobile possible, right? If

00:03:06.400 --> 00:03:09.920
you didn't have autocomplete people

00:03:08.000 --> 00:03:11.759
would not be you know typing and pecking

00:03:09.919 --> 00:03:13.679
on their keyboards; it would be much

00:03:11.759 --> 00:03:15.199
worse it would have had a hugely

00:03:13.680 --> 00:03:17.519
dampening effect on e-commerce for

00:03:15.199 --> 00:03:19.598
instance so this humble little

00:03:17.519 --> 00:03:21.759
autocomplete has an incredible

00:03:19.598 --> 00:03:23.359
impact on the world economy and the

00:03:21.759 --> 00:03:25.039
other thing which I heard about I'm not

00:03:23.360 --> 00:03:26.959
sure if it's 100% true but it's an

00:03:25.039 --> 00:03:28.799
interesting example apparently the very

00:03:26.959 --> 00:03:30.640
first iPhone keyboard that came out

00:03:28.800 --> 00:03:34.239
right the soft keyboard not the hard

00:03:30.639 --> 00:03:35.759
keyboard. Um they had some very basic,

00:03:34.239 --> 00:03:38.719
you know, sort of word continuation

00:03:35.759 --> 00:03:41.519
prediction going on. And so when

00:03:38.719 --> 00:03:43.039
you start typing T and H, obviously it's

00:03:41.519 --> 00:03:46.080
going to guess the E is going to come

00:03:43.039 --> 00:03:48.239
next, right? So that part is old

00:03:46.080 --> 00:03:50.719
news, nothing new there. But apparently

00:03:48.239 --> 00:03:53.360
the E key on the keyboard would become

00:03:50.719 --> 00:03:54.959
slightly bigger. So when your finger

00:03:53.360 --> 00:03:57.280
goes towards it, it has a better shot of

00:03:54.959 --> 00:03:59.438
actually connecting with it. Right? So

00:03:57.280 --> 00:04:01.280
these kinds of things are used to change

00:03:59.438 --> 00:04:02.560
the UI in real time in a whole bunch of

00:04:01.280 --> 00:04:06.000
applications and you just don't even

00:04:02.560 --> 00:04:08.560
realize it. All right. So uh and of

00:04:06.000 --> 00:04:09.919
course we all know about LLMs at this

00:04:08.560 --> 00:04:11.680
point. So I asked it to write a

00:04:09.919 --> 00:04:13.919
limerick about the beauty and power of

00:04:11.680 --> 00:04:15.360
deep learning yesterday and it says in a

00:04:13.919 --> 00:04:16.798
world where data flows like a stream

00:04:15.360 --> 00:04:18.319
deep learning is more than a dream.

00:04:16.798 --> 00:04:22.399
Sifts through the noise with an elegant

00:04:18.319 --> 00:04:25.199
poise unveiling insights that gleam.

00:04:22.399 --> 00:04:26.879
Cool, right? All right. So let's get

00:04:25.199 --> 00:04:28.478
back to work. Uh so NLP has

00:04:26.879 --> 00:04:30.478
extraordinary potential for making

00:04:28.478 --> 00:04:33.279
products and services much, much

00:04:30.478 --> 00:04:35.758
smarter. Uh and what I want to point out

00:04:33.279 --> 00:04:37.599
here is that you know even if you focus

00:04:35.759 --> 00:04:40.160
on this very very simple sort of

00:04:37.600 --> 00:04:42.160
formalism right a bunch of text comes in

00:04:40.160 --> 00:04:44.000
a bunch of text goes out that's it. If

00:04:42.160 --> 00:04:46.720
you take that very simple text in text

00:04:44.000 --> 00:04:49.199
out formalism, this humble little

00:04:46.720 --> 00:04:51.840
thing has just an enormous

00:04:49.199 --> 00:04:53.840
range of applicability. Right? So

00:04:51.839 --> 00:04:56.079
obviously you can send a bunch of text

00:04:53.839 --> 00:04:58.399
in and ask it to classify it, right? For

00:04:56.079 --> 00:05:00.159
you know, sentiment; route it for

00:04:58.399 --> 00:05:01.519
customer support you can try to figure

00:05:00.160 --> 00:05:03.520
out the intent of what the person is

00:05:01.519 --> 00:05:04.799
asking in search you can filter it you

00:05:03.519 --> 00:05:06.879
can content filter to make sure there's

00:05:04.800 --> 00:05:08.639
no toxic abusive stuff going on I mean

00:05:06.879 --> 00:05:11.038
the possibilities for just text

00:05:08.639 --> 00:05:12.879
classification are numerous okay but

00:05:11.038 --> 00:05:14.879
that's sort of a use case we

00:05:12.879 --> 00:05:17.360
are all kind of familiar with right so

00:05:14.879 --> 00:05:19.038
no surprise there now text extraction we

00:05:17.360 --> 00:05:20.879
may be less familiar with here and the

00:05:19.038 --> 00:05:23.360
idea is that you can actually look at a

00:05:20.879 --> 00:05:25.279
lot of unstructured textual data

00:05:23.360 --> 00:05:27.759
and extract all sorts of interesting

00:05:25.279 --> 00:05:29.439
entities from it. Right? Hedge

00:05:27.759 --> 00:05:30.720
funds use it very heavily. They will

00:05:29.439 --> 00:05:33.600
extract all sorts of company information

00:05:30.720 --> 00:05:34.880
from news articles, and then obviously

00:05:33.600 --> 00:05:36.800
doctor's notes. There are a whole bunch

00:05:34.879 --> 00:05:38.959
of NLP startups that will take the

00:05:36.800 --> 00:05:40.879
doctor-patient conversation,

00:05:38.959 --> 00:05:43.279
transcribe it and then extract disease

00:05:40.879 --> 00:05:45.199
codes, diagnosis codes, medication codes

00:05:43.279 --> 00:05:47.279
and things like that. Uh right. So the

00:05:45.199 --> 00:05:48.960
possibilities for this are enormous. Of

00:05:47.279 --> 00:05:50.799
course text summarization and we all

00:05:48.959 --> 00:05:53.038
have been doing it thanks to ChatGPT,

00:05:50.800 --> 00:05:54.319
right take text in and any kind of

00:05:53.038 --> 00:05:57.120
summary that comes out of the text is

00:05:54.319 --> 00:05:58.719
just text out okay and then text

00:05:57.120 --> 00:06:00.478
generation of course we can take text

00:05:58.720 --> 00:06:01.680
and do marketing copy sales emails

00:06:00.478 --> 00:06:03.199
market summaries so on so forth and

00:06:01.680 --> 00:06:06.840
including, troublingly for educators,

00:06:03.199 --> 00:06:06.840
college application essays

00:06:06.959 --> 00:06:14.478
code generation is a more subtle example

00:06:10.800 --> 00:06:16.720
of text out because code is just text

00:06:14.478 --> 00:06:20.959
right so text in text out also covers

00:06:16.720 --> 00:06:22.639
text in, code out. Okay. And question

00:06:20.959 --> 00:06:24.399
answering. So you can take a bunch of

00:06:22.639 --> 00:06:25.759
text,

00:06:24.399 --> 00:06:27.758
you can take a whole bunch of documents,

00:06:25.759 --> 00:06:29.680
you can add a bit of text to it which is

00:06:27.759 --> 00:06:31.840
your question and this whole thing at

00:06:29.680 --> 00:06:33.680
the end of the day is just text in,

00:06:31.839 --> 00:06:35.038
and then you can use it

00:06:33.680 --> 00:06:36.560
to answer questions and therefore create

00:06:35.038 --> 00:06:39.560
chat bots for all sorts of interesting

00:06:36.560 --> 00:06:39.560
applications.

00:06:39.918 --> 00:06:44.799
And you know if you look at this example

00:06:42.240 --> 00:06:46.319
call centers, that is where a lot

00:06:44.800 --> 00:06:47.680
of money is being spent right now to

00:06:46.319 --> 00:06:49.840
build these call center chatbots for

00:06:47.680 --> 00:06:52.160
text-in, text-out question answering, and

00:06:49.839 --> 00:06:54.318
so just if you drill into this right if

00:06:52.160 --> 00:06:56.720
you imagine taking all the call center

00:06:54.319 --> 00:06:59.280
transcripts and their internal product

00:06:56.720 --> 00:07:02.000
documentation service documentation FAQs

00:06:59.279 --> 00:07:04.159
etc stick it in you can start to answer

00:07:02.000 --> 00:07:05.519
these kinds of questions okay yesterday

00:07:04.160 --> 00:07:08.000
what are the top reasons why customers

00:07:05.519 --> 00:07:09.758
were upset with us what interventions

00:07:08.000 --> 00:07:12.319
made by the agent actually worked what

00:07:09.759 --> 00:07:14.560
did not work, right? What characterizes

00:07:12.319 --> 00:07:16.000
the best agents from the rest? How

00:07:14.560 --> 00:07:16.879
should we grade this particular agent's

00:07:16.000 --> 00:07:18.800
interaction with the particular

00:07:16.879 --> 00:07:20.478
customer? How should we

00:07:18.800 --> 00:07:23.280
change the call center script? How should

00:07:20.478 --> 00:07:25.038
we coach the agent in real time? Every

00:07:23.279 --> 00:07:26.399
one of these applications is amenable to

00:07:25.038 --> 00:07:28.159
this very humble text-in, text-out

00:07:26.399 --> 00:07:30.478
model.

00:07:28.160 --> 00:07:32.080
Okay. And so, and of course the

00:07:30.478 --> 00:07:33.598
potential is now clear; everybody knows

00:07:32.079 --> 00:07:36.399
this potential because of the advent of

00:07:33.598 --> 00:07:38.959
large language models. Uh, by the way,

00:07:36.399 --> 00:07:42.318
Google has released something called

00:07:38.959 --> 00:07:46.879
Google Gemini 1.5 Pro a couple of days

00:07:42.319 --> 00:07:49.199
ago. Uh, and it's incredible.

00:07:46.879 --> 00:07:50.560
It's incredible, right? And anyway,

00:07:49.199 --> 00:07:52.319
we'll get back to that later. But the

00:07:50.560 --> 00:07:54.000
point is that the kind of potential we

00:07:52.319 --> 00:07:59.000
have is just amazing even for text in,

00:07:54.000 --> 00:07:59.000
text out. Okay. And as you would imagine

00:08:00.478 --> 00:08:04.478
>> this is all like though we are calling

00:08:02.478 --> 00:08:05.680
it language this is all primarily

00:08:04.478 --> 00:08:07.758
English, right?

00:08:05.680 --> 00:08:09.598
>> now there are lots of multilingual uh

00:08:07.759 --> 00:08:12.160
models as well uh there are multilingual

00:08:09.598 --> 00:08:13.439
models by that I mean models which are

00:08:12.160 --> 00:08:15.039
specialized to other languages

00:08:13.439 --> 00:08:16.478
non-English languages and models which

00:08:15.038 --> 00:08:18.800
are truly multilingual, like

00:08:16.478 --> 00:08:21.918
polyglot models as well and both of them

00:08:18.800 --> 00:08:23.598
are available uh right now and many many

00:08:21.918 --> 00:08:26.318
modern LLMs are actually trained from

00:08:23.598 --> 00:08:28.319
the get-go to be multilingual in a bunch

00:08:26.319 --> 00:08:30.080
of the what are called high resource

00:08:28.319 --> 00:08:32.320
languages. Languages which are spoken by

00:08:30.079 --> 00:08:33.199
lots of people. Uh but actually it's

00:08:32.320 --> 00:08:34.959
funny you should ask that question

00:08:33.200 --> 00:08:37.680
because this Google Gemini model

00:08:34.958 --> 00:08:40.000
that I just described. So

00:08:37.679 --> 00:08:41.918
there is a language called Kalamang

00:08:40.000 --> 00:08:45.919
which is spoken by 200 people in the

00:08:41.918 --> 00:08:48.720
world and so a researcher had created a

00:08:45.919 --> 00:08:50.879
book which is sort of like a grammar

00:08:48.720 --> 00:08:52.240
manual for Kalamang, right? Because there

00:08:50.879 --> 00:08:54.559
are no other written works in that

00:08:52.240 --> 00:08:56.799
language. And so what they did is they

00:08:54.559 --> 00:09:00.799
took a whole bunch of English dialogue

00:08:56.799 --> 00:09:04.479
and this book fed it into uh Google

00:09:00.799 --> 00:09:06.079
Gemini 1.5 Pro and it translated

00:09:04.480 --> 00:09:07.680
into Kalamang at human-level

00:09:06.080 --> 00:09:10.639
proficiency.

00:09:07.679 --> 00:09:12.399
It had never seen it before. So that's

00:09:10.639 --> 00:09:15.600
an example

00:09:12.399 --> 00:09:18.399
of of this.

00:09:15.600 --> 00:09:19.920
Yes. So the question

00:09:18.399 --> 00:09:21.759
text here is all the things you want to

00:09:19.919 --> 00:09:23.360
translate from English to Kalamang. The

00:09:21.759 --> 00:09:25.919
documents here is just one document

00:09:23.360 --> 00:09:29.039
singular: the grammar book, the manual, and

00:09:25.919 --> 00:09:30.559
then what comes out is a translation. So

00:09:29.039 --> 00:09:31.838
these models even when they're not

00:09:30.559 --> 00:09:34.159
explicitly trained on a different

00:09:31.839 --> 00:09:35.839
language if you give them enough of sort

00:09:34.159 --> 00:09:37.679
of grammar manuals and stuff like that

00:09:35.839 --> 00:09:40.480
they may do a pretty decent job from the

00:09:37.679 --> 00:09:42.479
get-go with no training.

00:09:40.480 --> 00:09:44.879
It's kind of a shocker. Two years ago

00:09:42.480 --> 00:09:47.440
people would be like that's impossible.

00:09:44.879 --> 00:09:50.159
All right. So

00:09:47.440 --> 00:09:51.680
back to this.

00:09:50.159 --> 00:09:53.039
All right. And as you folks, you know,

00:09:51.679 --> 00:09:54.559
may already know and maybe you're in

00:09:53.039 --> 00:09:57.039
fact participating in this gold rush

00:09:54.559 --> 00:09:58.799
already. Um, you know, lots of people

00:09:57.039 --> 00:10:00.639
are creating lots of really cool

00:09:58.799 --> 00:10:02.000
companies to take some of these ideas

00:10:00.639 --> 00:10:04.159
and actually create really interesting

00:10:02.000 --> 00:10:06.320
products and services out of them. Um,

00:10:04.159 --> 00:10:07.439
so if you're not doing it and if you've

00:10:06.320 --> 00:10:10.000
been thinking about entrepreneurial

00:10:07.440 --> 00:10:13.000
stuff, here's a word of advice. Take the

00:10:10.000 --> 00:10:13.000
plunge.

00:10:15.120 --> 00:10:19.839
Dismissed. Just kidding. All right. So,

00:10:18.240 --> 00:10:22.240
and as you can imagine, enterprise

00:10:19.839 --> 00:10:24.480
vendors are rushing to add NLP to all

00:10:22.240 --> 00:10:27.039
their products. Salesforce Einstein now

00:10:24.480 --> 00:10:28.800
has Einstein GPT. Microsoft has

00:10:27.039 --> 00:10:30.639
Copilot. I mean, the list goes on.

00:10:28.799 --> 00:10:32.319
Everybody, everybody's like scrambling

00:10:30.639 --> 00:10:34.559
and really trying hard to infuse some

00:10:32.320 --> 00:10:36.480
GPT magic into whatever they're doing.

00:10:34.559 --> 00:10:39.679
Okay, some of it is real, a lot of it is

00:10:36.480 --> 00:10:41.759
not. Uh, okay. So, let's go to like the

00:10:39.679 --> 00:10:43.759
arc of NLP progress. How did we get to

00:10:41.759 --> 00:10:46.720
this kind of crazy times that we live

00:10:43.759 --> 00:10:48.958
in? Um so if you look at natural

00:10:46.720 --> 00:10:50.639
language processing basically efforts to

00:10:48.958 --> 00:10:52.239
take language and try to analyze

00:10:50.639 --> 00:10:56.240
language and you do predictions with

00:10:52.240 --> 00:10:58.000
language and so on and so forth. Um

00:10:56.240 --> 00:11:00.240
the first phase of it was just

00:10:58.000 --> 00:11:02.000
handcrafted rules based on linguistics.

00:11:00.240 --> 00:11:03.360
So these are all linguists who would

00:11:02.000 --> 00:11:05.200
really understand the grammar of a

00:11:03.360 --> 00:11:07.278
language and then they would use a deep

00:11:05.200 --> 00:11:08.959
knowledge of linguistics to figure out

00:11:07.278 --> 00:11:11.919
all these rules by which you can process

00:11:08.958 --> 00:11:13.919
and analyze natural language text. And

00:11:11.919 --> 00:11:15.360
then this other thing came along which

00:11:13.919 --> 00:11:17.679
was a statistical machine learning

00:11:15.360 --> 00:11:19.440
approach which basically said never mind

00:11:17.679 --> 00:11:21.359
all that complicated knowledge of

00:11:19.440 --> 00:11:24.160
linguistics and grammar. Why don't we

00:11:21.360 --> 00:11:25.360
simply count things? Let's count the

00:11:24.159 --> 00:11:26.879
number of times these two will co-

00:11:25.360 --> 00:11:29.120
occur. Now let's count that. Let's count

00:11:26.879 --> 00:11:31.278
this basically just count a lot. Okay.
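
NOTE
The "just count things" idea above can be sketched in a few lines. This is my own minimal illustration, not code from the lecture: the toy corpus and names are invented, and real statistical NLP systems count far more than adjacent bigrams.

```python
from collections import Counter

# Count how often adjacent word pairs (bigrams) co-occur in a tiny toy corpus.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for pair in zip(words, words[1:]):  # adjacent (w1, w2) pairs
        bigram_counts[pair] += 1

print(bigram_counts[("the", "cat")])  # → 2
```

Tables of counts like this, scaled to huge corpora, were the backbone of the statistical approach described here.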

00:11:29.120 --> 00:11:32.879
And let's see how it does, right, for

00:11:31.278 --> 00:11:34.799
predicting things, say for

00:11:32.879 --> 00:11:36.799
classifying text and so on. And

00:11:34.799 --> 00:11:39.199
shockingly those methods ended up being

00:11:36.799 --> 00:11:41.120
really good. They ended up being really

00:11:39.200 --> 00:11:44.000
good and in fact they actually were

00:11:41.120 --> 00:11:47.039
better than the lovingly hand-curated

00:11:44.000 --> 00:11:50.159
linguistically driven rules. Okay, so

00:11:47.039 --> 00:11:52.159
much so that there is a famous quote which

00:11:50.159 --> 00:11:55.278
says every time I fire a linguist the

00:11:52.159 --> 00:11:57.039
performance of the speech recognizer goes up

00:11:55.278 --> 00:11:59.759
right, obviously made in jest, but

00:11:57.039 --> 00:12:01.199
there is a kernel of truth to it.

00:11:59.759 --> 00:12:03.600
So that's

00:12:01.200 --> 00:12:06.639
where we were, and then deep

00:12:03.600 --> 00:12:08.320
learning happened okay um in 2012

00:12:06.639 --> 00:12:09.839
roughly and then we had these things

00:12:08.320 --> 00:12:11.278
called recurrent neural networks which

00:12:09.839 --> 00:12:13.600
are based on deep learning which

00:12:11.278 --> 00:12:15.600
actually moved the ball forward and then

00:12:13.600 --> 00:12:17.120
in 2017

00:12:15.600 --> 00:12:18.959
something called the transformer was

00:12:17.120 --> 00:12:21.919
invented

00:12:18.958 --> 00:12:26.159
2017 and the transformer replaced

00:12:21.919 --> 00:12:27.519
everything else across the board so we

00:12:26.159 --> 00:12:29.199
are just going to leapfrog directly to

00:12:27.519 --> 00:12:30.959
transformers; we will not spend

00:12:29.200 --> 00:12:32.879
any time on recurrent neural networks and

00:12:30.958 --> 00:12:35.119
that is not to say that they are sort of

00:12:32.879 --> 00:12:36.559
dead. Um there's a very

00:12:35.120 --> 00:12:38.320
interesting work which actually is

00:12:36.559 --> 00:12:40.159
trying to now revive recurrent neural

00:12:38.320 --> 00:12:42.639
networks to make them work for these kinds

00:12:40.159 --> 00:12:44.399
of modern LLM kinds of tasks but it's

00:12:42.639 --> 00:12:46.639
still very early days. Okay. So for now

00:12:44.399 --> 00:12:49.759
we'll just focus on transformers.

00:12:46.639 --> 00:12:51.759
Okay. So the very high-level view of

00:12:49.759 --> 00:12:53.600
the problem here is that like most

00:12:51.759 --> 00:12:55.600
things in deep learning it's basically

00:12:53.600 --> 00:12:57.680
fancy regression.

00:12:55.600 --> 00:12:59.120
There is some variable X that comes in.

00:12:57.679 --> 00:13:01.599
It goes through

00:12:59.120 --> 00:13:03.839
this very complicated function along

00:13:01.600 --> 00:13:05.920
with this W which is the weights and

00:13:03.839 --> 00:13:07.760
then out pops an output. Right? That's

00:13:05.919 --> 00:13:10.399
just the view that you've always had.

00:13:07.759 --> 00:13:12.720
And so in this case X happens to be

00:13:10.399 --> 00:13:13.600
text. Y can be text. It could be labels.

00:13:12.720 --> 00:13:15.360
It could be numbers. It could be

00:13:13.600 --> 00:13:16.879
anything else. The W is the weights. And

00:13:15.360 --> 00:13:19.600
the function is a deep neural network.

00:13:16.879 --> 00:13:20.639
Right? By this point, when you

00:13:19.600 --> 00:13:23.440
look at this slide it should be like

00:13:20.639 --> 00:13:26.000
blindingly obvious.
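
NOTE
The "fancy regression" view, output = f(input; weights), can be sketched as below. This is my own toy illustration, not the course's code: f stands in for a deep network but is collapsed here to a single linear layer plus a threshold, and the numbers are invented.

```python
# y = f(x; w): some numericalized input x and weights w go in, a prediction comes out.
def f(x, w):
    score = sum(xi * wi for xi, wi in zip(x, w))  # dot product, i.e. one linear layer
    return "positive" if score > 0 else "negative"

x = [1.0, 0.0, 2.0]    # X: a numericalized input (how text becomes numbers is the key question)
w = [0.5, -1.0, 0.25]  # W: the learned weights

print(f(x, w))  # → positive  (score = 0.5 + 0.0 + 0.5 = 1.0)
```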

00:13:23.440 --> 00:13:28.560
So now the key question here is how do

00:13:26.000 --> 00:13:31.679
you actually represent X? That's the key

00:13:28.559 --> 00:13:34.399
question. For images, we saw

00:13:31.679 --> 00:13:36.078
that we just took the pixel values which

00:13:34.399 --> 00:13:37.600
were light intensity numbers between 0

00:13:36.078 --> 00:13:39.599
and 255 and you could just use that

00:13:37.600 --> 00:13:41.839
directly but when a sentence

00:13:39.600 --> 00:13:43.600
comes in like I love deep learning like

00:13:41.839 --> 00:13:45.279
what do you do right how do you actually

00:13:43.600 --> 00:13:46.639
represent it because remember we have to

00:13:45.278 --> 00:13:49.439
numericalize everything that's coming

00:13:46.639 --> 00:13:50.959
in. So that's a key question, and

00:13:49.440 --> 00:13:52.800
this actually is a very subtle question

00:13:50.958 --> 00:13:56.638
very important question and we'll focus

00:13:52.799 --> 00:13:58.719
on that today and then next week when we

00:13:56.639 --> 00:14:00.720
look at transformers we will look at

00:13:58.720 --> 00:14:02.959
what neural network architecture is best

00:14:00.720 --> 00:14:04.560
suited to process this sort of text

00:14:02.958 --> 00:14:06.000
inputs that are coming in right those

00:14:04.559 --> 00:14:11.198
are the two big questions we're going to

00:14:06.000 --> 00:14:12.879
look at. All right, so processing basics.

00:14:11.198 --> 00:14:15.879
we are going to follow this very standard

00:14:12.879 --> 00:14:15.879
process

00:14:15.919 --> 00:14:21.519
this is the process by which we take any

00:14:18.639 --> 00:14:23.120
text that comes in, and we run it

00:14:21.519 --> 00:14:25.360
through these four steps and this

00:14:23.120 --> 00:14:26.720
process is called text vectorization and

00:14:25.360 --> 00:14:28.159
as the name suggests, we are

00:14:26.720 --> 00:14:30.399
essentially taking text and creating

00:14:28.159 --> 00:14:32.399
vectors of numbers out of it right text

00:14:30.399 --> 00:14:34.559
vectorization and we'll go through each

00:14:32.399 --> 00:14:36.879
of these processes one after the other

00:14:34.559 --> 00:14:39.278
so I just find it very useful to just

00:14:36.879 --> 00:14:41.519
have this acronym STIE in my head, like

00:14:39.278 --> 00:14:45.519
STIE. Just keep that in mind; it may be

00:14:41.519 --> 00:14:48.240
helpful. All right, so what we do is

00:14:45.519 --> 00:14:50.078
the setup here is that we have a whole

00:14:48.240 --> 00:14:51.839
bunch of documents, right? We call it

00:14:50.078 --> 00:14:54.319
the training corpus. We have a whole

00:14:51.839 --> 00:14:55.760
bunch of text documents, text data. Uh,

00:14:54.320 --> 00:14:58.399
and as far as we are concerned, you can

00:14:55.759 --> 00:15:01.120
just imagine it as just lists of long

00:14:58.399 --> 00:15:03.360
passages. Okay? What is a novel? It's

00:15:01.120 --> 00:15:05.839
just a long passage, right, of text. So

00:15:03.360 --> 00:15:07.360
whether it's a novel or a sentence

00:15:05.839 --> 00:15:09.440
doesn't really matter. We just think of

00:15:07.360 --> 00:15:11.360
them as a big list of strings, a big

00:15:09.440 --> 00:15:13.600
list of text. Okay, that's a training

00:15:11.360 --> 00:15:15.759
corpus. And what we do is we take this

00:15:13.600 --> 00:15:17.680
training corpus and we run it through

00:15:15.759 --> 00:15:19.600
and we apply standardization and

00:15:17.679 --> 00:15:22.399
tokenization which I will describe to

00:15:19.600 --> 00:15:26.879
this entire training corpus up front.

00:15:22.399 --> 00:15:29.919
Okay. So we first do this, and

00:15:26.879 --> 00:15:32.480
standardization is basically

00:15:29.919 --> 00:15:34.639
the default for most applications tends

00:15:32.480 --> 00:15:36.399
to be this which is we first strip

00:15:34.639 --> 00:15:38.240
capitalization and make everything lower

00:15:36.399 --> 00:15:40.078
case

00:15:38.240 --> 00:15:42.240
and then we remove punctuation and

00:15:40.078 --> 00:15:44.559
accents and so on and so forth. Okay,

00:15:42.240 --> 00:15:46.639
that's the first thing we do. I'll talk

00:15:44.559 --> 00:15:48.559
about why we do it in just a moment, but

00:15:46.639 --> 00:15:51.360
the mechanics of it are we do this

00:15:48.559 --> 00:15:53.359
first. Then we look at words like a,

00:15:51.360 --> 00:15:55.199
the, it, and so on and so forth.

00:15:53.360 --> 00:15:57.680
Basically filler words, right? Which

00:15:55.198 --> 00:15:59.439
which we need to actually make

00:15:57.679 --> 00:16:02.399
complete sentences, but they may not

00:15:59.440 --> 00:16:03.759
have any value predicting things. So we

00:16:02.399 --> 00:16:06.559
remove them and they are called stop

00:16:03.759 --> 00:16:08.079
words. And then finally we take words

00:16:06.559 --> 00:16:10.399
which are very similar which have sort

00:16:08.078 --> 00:16:12.159
of a same kind of stem or root and then

00:16:10.399 --> 00:16:14.958
we just map it to like a common

00:16:12.159 --> 00:16:16.559
representation, like ate, eaten, eating;

00:16:14.958 --> 00:16:19.439
all these things just become

00:16:16.559 --> 00:16:21.679
let's say eat, and we do that sometimes.

00:16:19.440 --> 00:16:23.600
So this we almost always do this we

00:16:21.679 --> 00:16:25.599
often do and this we do it sometimes.

00:16:23.600 --> 00:16:28.600
Okay. Now, why do we do any of these

00:16:25.600 --> 00:16:28.600
things?
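
NOTE
The standardization steps just described (lowercasing, stripping punctuation, removing stop words, stemming) can be sketched as below. This is my own toy stand-in, not the course's pipeline: the stop list and suffix rules are invented for illustration, and real pipelines use libraries such as NLTK or spaCy.

```python
import re

STOP_WORDS = {"a", "the", "it", "is", "and"}  # illustrative stop-word list

def crude_stem(word):
    # Chop a few common suffixes to map similar words to a shared root.
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def standardize(text):
    text = text.lower()                  # strip capitalization
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # drop stop words
    return [crude_stem(w) for w in tokens]

print(standardize("The model is Eating, eaten!"))  # → ['model', 'eat', 'eat']
```

Note how the differently cased and differently inflected forms collapse to one representation, which is exactly the point discussed next.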

00:16:34.320 --> 00:16:38.879
>> I think we want to try to recognize the

00:16:36.480 --> 00:16:40.480
essential thing with the word, right?

00:16:38.879 --> 00:16:42.799
Whether it's eaten or eat, but the

00:16:40.480 --> 00:16:45.600
essential thing is the eat, right? So,

00:16:42.799 --> 00:16:47.198
we want to try to sort of abstract from

00:16:45.600 --> 00:16:49.120
it the more essential thing,

00:16:47.198 --> 00:16:50.799
>> right? So, why do we need to abstract? I

00:16:49.120 --> 00:16:52.560
guess you're absolutely correct. We're

00:16:50.799 --> 00:16:56.439
trying to abstract. Why is there a

00:16:52.559 --> 00:16:56.439
benefit to doing this abstraction?

00:16:58.000 --> 00:17:02.919
How about somebody from this side of the

00:16:59.278 --> 00:17:02.919
room? Oh yes.

00:17:03.440 --> 00:17:08.880
>> So I want to reduce the library.

00:17:07.359 --> 00:17:12.240
>> Why is it a good idea to reduce the

00:17:08.880 --> 00:17:14.480
library? The size of the library

00:17:12.240 --> 00:17:17.519
>> because of the the amount of computation

00:17:14.480 --> 00:17:20.160
needed. So that is part of the answer.

00:17:17.519 --> 00:17:25.240
There's another part to the answer which

00:17:20.160 --> 00:17:25.240
says all right let's swing to the right

00:17:26.480 --> 00:17:30.720
um, is it facilitates comparison between

00:17:28.880 --> 00:17:33.880
different sets

00:17:30.720 --> 00:17:33.880
of standard

00:17:33.906 --> 00:17:37.279
[clears throat]

00:17:34.480 --> 00:17:39.759
>> okay so I will go with that but I think

00:17:37.279 --> 00:17:42.240
the the key thing we want to uh the key

00:17:39.759 --> 00:17:44.480
thing to realize here is that you want

00:17:42.240 --> 00:17:46.640
the model much like when you go when we

00:17:44.480 --> 00:17:48.720
talk about computer vision we said look

00:17:46.640 --> 00:17:51.038
if it's vertical line. I want to be able

00:17:48.720 --> 00:17:52.640
to detect it wherever it happens. I

00:17:51.038 --> 00:17:54.000
don't want the model to think that the

00:17:52.640 --> 00:17:55.038
vertical line on the left side is

00:17:54.000 --> 00:17:57.038
different from the vertical line on the

00:17:55.038 --> 00:17:58.720
right side and then later realize they

00:17:57.038 --> 00:18:00.879
are the same thing because you would

00:17:58.720 --> 00:18:02.319
have wasted valuable capacity learning

00:18:00.880 --> 00:18:03.760
things which actually happen to be the

00:18:02.319 --> 00:18:06.798
same because you didn't know it was the

00:18:03.759 --> 00:18:09.519
same. So here if you for example take a

00:18:06.798 --> 00:18:11.918
word and lowercase it, clearly the case of

00:18:09.519 --> 00:18:12.960
it whether it's uppercase or lower case

00:18:11.919 --> 00:18:14.559
most of the time it's not going to

00:18:12.960 --> 00:18:16.400
matter for anything you want to predict.

00:18:14.558 --> 00:18:18.399
So you're essentially telling the model

00:18:16.400 --> 00:18:19.840
you know the lowercase version, uppercase

00:18:18.400 --> 00:18:21.919
version they are not different they're

00:18:19.839 --> 00:18:23.678
actually the same and the easiest way to

00:18:21.919 --> 00:18:25.919
tell the model they are the same is just

00:18:23.679 --> 00:18:29.200
make everything lower case so that is

00:18:25.919 --> 00:18:31.038
the key idea okay and similarly if you

00:18:29.200 --> 00:18:32.080
look at stop words the reason is that

00:18:31.038 --> 00:18:34.400
these stop words may not help you

00:18:32.079 --> 00:18:36.720
predict anything whether a word uh and

00:18:34.400 --> 00:18:38.160
the showed up in a movie review probably

00:18:36.720 --> 00:18:40.400
does not affect the sentiment of the

00:18:38.160 --> 00:18:42.240
review and therefore let's remove it so

00:18:40.400 --> 00:18:44.320
that's a slightly different reason

00:18:42.240 --> 00:18:46.400
stemming is the same reason as the first

00:18:44.319 --> 00:18:48.319
which is that all these words kind of

00:18:46.400 --> 00:18:50.160
mean the same thing. We don't have to be

00:18:48.319 --> 00:18:51.839
super precise about it and so let's just

00:18:50.160 --> 00:18:54.080
like collapse them onto the same thing.

00:18:51.839 --> 00:18:57.038
Now, these are all the standard

00:18:54.079 --> 00:18:58.399
things we do, but do notice there are,

00:18:57.038 --> 00:19:00.319
you know, important exceptions to all

00:18:58.400 --> 00:19:02.000
these things. Okay we'll come back to

00:19:00.319 --> 00:19:05.359
the exceptions a bit later but that is

00:19:02.000 --> 00:19:08.359
the standard thing we do make sense. All

00:19:05.359 --> 00:19:08.359
right.

00:19:08.720 --> 00:19:14.240
So if you look at something like this um

00:19:11.679 --> 00:19:15.440
this sentence here right hola what do

00:19:14.240 --> 00:19:17.919
you picture when you think of travel

00:19:15.440 --> 00:19:20.000
Mexico boom and then you can see here

00:19:17.919 --> 00:19:21.520
this is the standardized version like

00:19:20.000 --> 00:19:24.000
everything has become lower case like

00:19:21.519 --> 00:19:25.279
the h has become small h the punctuation

00:19:24.000 --> 00:19:29.440
has disappeared that's part of

00:19:25.279 --> 00:19:32.160
standardization and then uh travel and

00:19:29.440 --> 00:19:35.759
you can see here that Mexico m has

00:19:32.160 --> 00:19:37.759
become small, sipping has become sip, uh

00:19:35.759 --> 00:19:38.960
thinks has become think, and so on

00:19:37.759 --> 00:19:41.279
and so forth

00:19:38.960 --> 00:19:44.279
So that's an example of standardization at

00:19:41.279 --> 00:19:44.279
work.
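The standardization just shown (lowercasing plus stripping punctuation) can be sketched in a few lines of Python; the function name and regex here are illustrative, not from the lecture:

```python
import re
import string

def standardize(text):
    """Lowercase the text and strip punctuation, the two steps
    shown in the example (stemming would be a separate, optional step)."""
    text = text.lower()
    # Drop every punctuation character, e.g. '!' and '?'
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

print(standardize("Hola! What do you picture when you think of travel? Mexico?"))
# hola what do you picture when you think of travel mexico
```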

00:19:47.038 --> 00:19:51.200
Okay.

00:19:49.200 --> 00:19:53.840
The next thing we do is something very

00:19:51.200 --> 00:19:55.600
important and it's called tokenization.

00:19:53.839 --> 00:19:56.720
So what we do typically is that okay now

00:19:55.599 --> 00:19:59.439
we have standardized everything. We have

00:19:56.720 --> 00:20:01.919
a bunch of words. Uh we need to now

00:19:59.440 --> 00:20:04.558
split them into what are called tokens.

00:20:01.919 --> 00:20:07.120
So the most common default is to just

00:20:04.558 --> 00:20:09.279
think of a word as a token.

00:20:07.119 --> 00:20:11.119
We just split on the white space, right?

00:20:09.279 --> 00:20:14.160
You take each string and wherever there

00:20:11.119 --> 00:20:15.678
is white space, meaning actual spaces,

00:20:14.160 --> 00:20:17.440
uh, carriage returns and things like

00:20:15.679 --> 00:20:20.720
that, boom, you just split on them and

00:20:17.440 --> 00:20:22.080
you just create words out of it. So, so

00:20:20.720 --> 00:20:24.160
for instance, if you have this

00:20:22.079 --> 00:20:26.159
standardized sentence here, you just

00:20:24.160 --> 00:20:29.120
split it after every word and you get

00:20:26.160 --> 00:20:32.519
this thing. Okay? So, each of these is

00:20:29.119 --> 00:20:32.519
now a token.
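Splitting on whitespace, as just described, is one line in Python: `str.split()` with no argument splits on any run of spaces, tabs, or newlines (a minimal sketch):

```python
def tokenize(standardized_text):
    # split() with no argument splits on any whitespace run:
    # spaces, tabs, newlines / carriage returns.
    return standardized_text.split()

print(tokenize("the cat sat on the mat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```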

00:20:32.880 --> 00:20:38.400
Now, this has some disadvantages.

00:20:36.159 --> 00:20:40.799
What are some disadvantages of just

00:20:38.400 --> 00:20:43.840
splitting on on the space between words?

00:20:40.798 --> 00:20:46.319
Uh yeah,

00:20:43.839 --> 00:20:49.439
>> I think we lose any context because we

00:20:46.319 --> 00:20:52.639
look at each word separately. Uh so we

00:20:49.440 --> 00:20:53.440
don't know the past word or what happens

00:20:52.640 --> 00:20:55.280
next,

00:20:53.440 --> 00:20:57.759
>> right? So for example, the cat sat on

00:20:55.279 --> 00:21:00.558
the mat and the mat sat on the cat will

00:20:57.759 --> 00:21:02.400
have the same like set, right? Yeah. So

00:21:00.558 --> 00:21:05.798
you lose the order. What are some other

00:21:02.400 --> 00:21:05.798
issues with it?

00:21:05.839 --> 00:21:10.319
for words that should have two together

00:21:07.440 --> 00:21:11.840
like you lose the fact that that's one

00:21:10.319 --> 00:21:14.240
name because you separated

00:21:11.839 --> 00:21:16.480
>> right exactly so there are compound

00:21:14.240 --> 00:21:18.640
words right like father-in-law for

00:21:16.480 --> 00:21:20.319
instance that's one problem another

00:21:18.640 --> 00:21:22.240
problem is that lots of non-English

00:21:20.319 --> 00:21:25.038
languages they actually don't have this

00:21:22.240 --> 00:21:27.359
notion of a space between words right

00:21:25.038 --> 00:21:29.359
actually runs one after the other, and

00:21:27.359 --> 00:21:32.479
the native speakers know from

00:21:29.359 --> 00:21:34.879
context how to chunk it and break it so

00:21:32.480 --> 00:21:36.480
well, what do we do, right?

00:21:34.880 --> 00:21:39.280
Because you basically will have one word

00:21:36.480 --> 00:21:40.720
for the whole passage, one token. The

00:21:39.279 --> 00:21:42.960
other problem is that there are

00:21:40.720 --> 00:21:44.960
languages, German is perhaps the most

00:21:42.960 --> 00:21:47.279
notable one in which you have very long

00:21:44.960 --> 00:21:50.798
words.

00:21:47.279 --> 00:21:52.319
Um I saw a word uh which I think I might

00:21:50.798 --> 00:21:57.200
have it on the site somewhere this like

00:21:52.319 --> 00:21:59.200
this long which means uh

00:21:57.200 --> 00:22:00.720
you realize that something amazing is

00:21:59.200 --> 00:22:02.798
happening but the rest of the world

00:22:00.720 --> 00:22:04.400
hasn't woken up to it yet. It's that

00:22:02.798 --> 00:22:07.918
feeling.

00:22:04.400 --> 00:22:10.640
There's a word for that. Amazing, right?

00:22:07.919 --> 00:22:12.640
Anyway, so yeah, some words or Japanese,

00:22:10.640 --> 00:22:13.520
for example, there's a word komorebi. Do

00:22:12.640 --> 00:22:16.520
people know the meaning of the word

00:22:13.519 --> 00:22:16.519
komorebi?

00:22:16.640 --> 00:22:24.640
It means the transient beauty of

00:22:20.240 --> 00:22:26.720
sunlight going through fall foliage.

00:22:24.640 --> 00:22:29.280
There's a word for that. How cool is

00:22:26.720 --> 00:22:31.440
that? Anyway, sorry. I love that word.

00:22:29.279 --> 00:22:33.440
So, back to this. Um so we have this

00:22:31.440 --> 00:22:35.200
thing here. So there are all reasons for

00:22:33.440 --> 00:22:38.798
which splitting on the space between

00:22:35.200 --> 00:22:41.360
words is not going to work. Okay. Um

00:22:38.798 --> 00:22:44.720
so what happens with modern large

00:22:41.359 --> 00:22:46.319
language models is the following. So,

00:22:44.720 --> 00:22:47.919
what we have described so far, despite

00:22:46.319 --> 00:22:50.960
its shortcomings is actually really good

00:22:47.919 --> 00:22:52.640
for lots of NLP use cases. Okay. If you

00:22:50.960 --> 00:22:54.640
want to classify text it's good enough, for

00:22:52.640 --> 00:22:57.679
instance but if you want to generate

00:22:54.640 --> 00:22:59.840
text like LLMs do it's not going to

00:22:57.679 --> 00:23:01.600
work. It's not going to work because you

00:22:59.839 --> 00:23:03.839
know when you ask ChatGPT a question

00:23:01.599 --> 00:23:05.918
it comes back with perfect punctuation.

00:23:03.839 --> 00:23:07.119
Clearly punctuation was not stripped. It

00:23:05.919 --> 00:23:09.600
comes back with particular upper and

00:23:07.119 --> 00:23:11.359
lower case clearly that wasn't stripped.

00:23:09.599 --> 00:23:12.719
You can actually make up new words and

00:23:11.359 --> 00:23:15.359
ask it to use the new word. It'll

00:23:12.720 --> 00:23:17.919
use it. Therefore, it's not like

00:23:15.359 --> 00:23:19.918
it can only recognize a finite set. So

00:23:17.919 --> 00:23:22.240
there's a very clever scheme called byte

00:23:19.919 --> 00:23:24.880
pair encoding right which is which is

00:23:22.240 --> 00:23:26.400
invented to do all those things. And I

00:23:24.880 --> 00:23:28.240
have slides at the end and if we have

00:23:26.400 --> 00:23:29.759
time we'll talk about it.
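As a preview, the core move of byte pair encoding can be sketched like this: start from characters and repeatedly merge the most frequent adjacent pair into a new subword symbol. This is a toy illustration of the idea, not the exact tokenizer any particular LLM uses:

```python
from collections import Counter

def most_frequent_pair(token_seqs):
    """Count adjacent symbol pairs across all tokenized words."""
    pairs = Counter()
    for seq in token_seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Start from characters; each merge creates one new subword symbol.
words = [list("lower"), list("lowest"), list("newer"), list("wider")]
for _ in range(3):
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]
print(words)
```

Because frequent pairs become single symbols, common words end up as one token while rare or made-up words still decompose into smaller known pieces, which is why an LLM can handle words it has never seen.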

00:23:28.240 --> 00:23:33.440
All right, for now let's continue this

00:23:29.759 --> 00:23:35.359
thing. So when this is done for every

00:23:33.440 --> 00:23:37.440
sentence or every uh passage in our

00:23:35.359 --> 00:23:40.079
training data set, we now have a

00:23:37.440 --> 00:23:41.519
list of distinct tokens, right? We have

00:23:40.079 --> 00:23:42.960
a list of distinct tokens. In this

00:23:41.519 --> 00:23:45.200
simple case, it happens to be all the

00:23:42.960 --> 00:23:47.360
distinct words that we have seen, right?

00:23:45.200 --> 00:23:49.840
That's called the vocabulary.

00:23:47.359 --> 00:23:51.279
That's called the vocabulary.

00:23:49.839 --> 00:23:53.599
So now we move to the third and fourth

00:23:51.279 --> 00:23:55.279
stages. In these stages, the

00:23:53.599 --> 00:23:58.719
indexing and encoding stage, we only

00:23:55.279 --> 00:24:00.319
work with the vocabulary. Okay. And so

00:23:58.720 --> 00:24:03.200
what we do first is the

00:24:00.319 --> 00:24:05.359
indexing: we assign a unique integer to

00:24:03.200 --> 00:24:07.600
each distinct token in the vocabulary.

00:24:05.359 --> 00:24:09.599
So for instance, let's say that you know

00:24:07.599 --> 00:24:12.000
you took a whole bunch of English

00:24:09.599 --> 00:24:14.240
literature as your training corpus and

00:24:12.000 --> 00:24:16.079
you ran it through you basically you'll

00:24:14.240 --> 00:24:18.159
come up with English dictionary right?

00:24:16.079 --> 00:24:20.879
So it'll have maybe starting with a all

00:24:18.159 --> 00:24:24.000
the way to zebra a whole bunch of words.

00:24:20.880 --> 00:24:26.960
Um, and so I'm just putting 50,000 here

00:24:24.000 --> 00:24:28.480
because it turns out the GPT family uses

00:24:26.960 --> 00:24:30.159
a vocabulary of about 50,000 tokens. So I'm

00:24:28.480 --> 00:24:31.519
just using 50,000. It's not the actual

00:24:30.159 --> 00:24:33.600
number of words in the English language.

00:24:31.519 --> 00:24:35.119
It's much more than that. So let's say

00:24:33.599 --> 00:24:37.519
that we give a number one through

00:24:35.119 --> 00:24:40.158
50,000. And then we actually also

00:24:37.519 --> 00:24:42.240
introduce a special token called UNK. It

00:24:40.159 --> 00:24:44.559
stands for unknown. And we'll come back

00:24:42.240 --> 00:24:46.960
to this later. And we give unknown the

00:24:44.558 --> 00:24:48.558
integer zero.

00:24:46.960 --> 00:24:51.600
Okay. So this what this is what we mean

00:24:48.558 --> 00:24:52.798
by indexing: take the tokens

00:24:51.599 --> 00:24:55.038
you have identified and just map it to

00:24:52.798 --> 00:24:57.440
an integer.
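A minimal sketch of the indexing step in Python, reserving integer 0 for the unknown token (the `[UNK]` spelling and the alphabetical ordering are illustrative choices, not from the lecture):

```python
def build_index(tokens):
    """Assign a unique integer to each distinct token; 0 is reserved
    for the special UNK (unknown) token."""
    vocab = {"[UNK]": 0}
    for tok in sorted(set(tokens)):
        vocab[tok] = len(vocab)
    return vocab

corpus_tokens = ["the", "cat", "sat", "on", "the", "mat"]
index = build_index(corpus_tokens)
print(index)
# {'[UNK]': 0, 'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
```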

00:24:55.038 --> 00:25:00.400
Okay, that's the indexing step. Then

00:24:57.440 --> 00:25:03.120
what we do is we assign a vector to

00:25:00.400 --> 00:25:05.600
every one of these integers.

00:25:03.119 --> 00:25:08.959
Okay, and that is the encoding step. We

00:25:05.599 --> 00:25:10.639
assign a vector to each integer.

00:25:08.960 --> 00:25:12.480
So you have a bunch of distinct words

00:25:10.640 --> 00:25:14.000
and each word we put an integer on it

00:25:12.480 --> 00:25:16.079
and then we take that integer and map it

00:25:14.000 --> 00:25:17.599
to a vector. Yeah. Can you please

00:25:16.079 --> 00:25:18.558
explain

00:25:17.599 --> 00:25:20.158
to

00:25:18.558 --> 00:25:20.960
>> Can you please explain what unknown

00:25:20.159 --> 00:25:23.679
means?

00:25:20.960 --> 00:25:25.200
>> Yeah. So, so I'll come back to that for

00:25:23.679 --> 00:25:26.720
now. Just assume that we have a token

00:25:25.200 --> 00:25:28.240
called unknown. And the way we are going

00:25:26.720 --> 00:25:29.759
to use it will become apparent in a few

00:25:28.240 --> 00:25:31.038
minutes.

00:25:29.759 --> 00:25:32.480
>> Does it mean there's a base to it

00:25:31.038 --> 00:25:32.960
though? There's like a letter or

00:25:32.480 --> 00:25:34.798
something.

00:25:32.960 --> 00:25:36.400
>> It's it's a it's a placeholder for

00:25:34.798 --> 00:25:38.639
something else which I'll describe

00:25:36.400 --> 00:25:42.798
shortly.

00:25:38.640 --> 00:25:44.320
Okay. So, that's what we have. U so

00:25:42.798 --> 00:25:46.879
let's say that we want to assign a

00:25:44.319 --> 00:25:50.720
vector to each integer in our vocabulary

00:25:46.880 --> 00:25:52.880
and let's assume that we have uh okay

00:25:50.720 --> 00:25:54.400
let's say we have 50,000 possible

00:25:52.880 --> 00:25:56.480
integers because we have 50,000 possible

00:25:54.400 --> 00:25:58.400
words and we want to assign a vector so

00:25:56.480 --> 00:25:59.759
that if you take the vector of two

00:25:58.400 --> 00:26:02.320
different words they should look

00:25:59.759 --> 00:26:04.319
different right clearly that's the whole

00:26:02.319 --> 00:26:06.399
point of mapping from integer to vector

00:26:04.319 --> 00:26:08.079
they better be different uh what is the

00:26:06.400 --> 00:26:12.240
simplest way to come up with a vector

00:26:08.079 --> 00:26:12.240
for each of these tokens

00:26:20.079 --> 00:26:22.399
the same as the index.

00:26:21.839 --> 00:26:24.319
>> Sorry,

00:26:22.400 --> 00:26:26.880
>> the same as the index. It's just a

00:26:24.319 --> 00:26:31.599
one-by-one vector with the index.

00:26:26.880 --> 00:26:34.799
>> So, a vector of uh zeros and ones or

00:26:31.599 --> 00:26:38.158
>> it's just a vector with one dimension.

00:26:34.798 --> 00:26:39.359
>> Oh. Oh, I see. So, god. Well, it's it

00:26:38.159 --> 00:26:40.799
it's creative, but it's a little

00:26:39.359 --> 00:26:42.000
cheating, right? Because you're

00:26:40.798 --> 00:26:43.038
essentially putting a square bracket

00:26:42.000 --> 00:26:47.038
around the number and saying it's a

00:26:43.038 --> 00:26:48.640
vector. Good try.

00:26:47.038 --> 00:26:51.440
>> You can try one hot encoding,

00:26:48.640 --> 00:26:53.520
>> right? You can try one hot encoding.

00:26:51.440 --> 00:26:55.360
So remember the list of distinct tokens

00:26:53.519 --> 00:26:57.119
you have, you can just think of them as

00:26:55.359 --> 00:26:59.759
the distinct levels of a categorical

00:26:57.119 --> 00:27:01.759
variable,

00:26:59.759 --> 00:27:04.558
right? And you can just use one-hot

00:27:01.759 --> 00:27:07.359
encoding for it.

00:27:04.558 --> 00:27:08.480
So what you can do, the

00:27:07.359 --> 00:27:10.399
simplest thing, is one-hot

00:27:08.480 --> 00:27:13.599
encoding and the way it's going to work

00:27:10.400 --> 00:27:16.000
is that if you have let's say 50,000

00:27:13.599 --> 00:27:17.599
uh 50,000 possible values the vector is

00:27:16.000 --> 00:27:20.079
going to be 50,000 long it's going to

00:27:17.599 --> 00:27:22.719
have zeros everywhere except in the

00:27:20.079 --> 00:27:25.359
index value of whatever that token is.

00:27:22.720 --> 00:27:28.319
So for instance, since we said UNK is

00:27:25.359 --> 00:27:31.278
going to be number

00:27:28.319 --> 00:27:33.359
zero, it has a one in the

00:27:31.278 --> 00:27:36.159
zero index position, and everything else

00:27:33.359 --> 00:27:37.918
is zero. 'a' happens to be the second one,

00:27:36.159 --> 00:27:40.480
so it has a one in the second

00:27:37.919 --> 00:27:40.880
position. You get the idea

00:27:40.480 --> 00:27:42.480
okay

00:27:40.880 --> 00:27:45.039
>> So this is one-hot encoding. We can do

00:27:42.480 --> 00:27:47.599
this one-hot encoding,

00:27:45.038 --> 00:27:50.079
and so the dimension of this encoding

00:27:47.599 --> 00:27:51.678
vector, how long it is, is basically the

00:27:50.079 --> 00:27:54.798
number of distinct tokens that you have

00:27:51.679 --> 00:27:59.320
seen in the training corpus, plus one

00:27:54.798 --> 00:27:59.319
for this UNK thing that we'll get to.

00:27:59.759 --> 00:28:03.278
Okay,

00:28:01.278 --> 00:28:05.278
so that is the dimension of the encoding vector,

00:28:03.278 --> 00:28:08.278
and this is called the vocabulary

00:28:05.278 --> 00:28:08.278
size.

00:28:09.519 --> 00:28:13.398
It's called the vocabulary size.
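Constructing a one-hot vector for a given index is straightforward; here is a minimal sketch (the toy three-entry vocabulary is made up for illustration):

```python
def one_hot(index, vocab_size):
    """A vocab_size-long vector of zeros with a single 1 at `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

vocab = {"[UNK]": 0, "a": 1, "cat": 2}    # toy vocabulary
print(one_hot(vocab["cat"], len(vocab)))  # [0, 0, 1]
```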

00:28:13.440 --> 00:28:18.159
All right. So at this point we have

00:28:16.798 --> 00:28:20.480
created a vocabulary for the training

00:28:18.159 --> 00:28:22.240
data training corpus. every distinct

00:28:20.480 --> 00:28:24.240
token in the vocabulary has been assigned a

00:28:22.240 --> 00:28:26.480
one-hot vector, and we are done with basic

00:28:24.240 --> 00:28:29.359
pre-processing.

00:28:26.480 --> 00:28:31.440
Okay, so all the text that has come in,

00:28:29.359 --> 00:28:33.359
every token has been mapped to some,

00:28:31.440 --> 00:28:35.840
potentially very long, one-hot

00:28:33.359 --> 00:28:37.439
vector.

00:28:35.839 --> 00:28:41.158
Any questions on the mechanics of this

00:28:37.440 --> 00:28:41.159
before we continue on?

00:28:45.038 --> 00:28:50.000
>> Now let's see: when you get a new

00:28:47.278 --> 00:28:52.000
input sentence, a new sentence freshly

00:28:50.000 --> 00:28:53.599
arriving and we want to feed it into a

00:28:52.000 --> 00:28:55.038
deep neural network, how will this

00:28:53.599 --> 00:28:57.599
process actually apply to the new

00:28:55.038 --> 00:29:00.240
sentence that's coming in? Okay, so

00:28:57.599 --> 00:29:02.558
let's assume um that we have completed

00:29:00.240 --> 00:29:05.038
our STIE steps on the training corpus and it

00:29:02.558 --> 00:29:08.000
turns out we found only you know 99

00:29:05.038 --> 00:29:10.079
distinct tokens 99 distinct words and

00:29:08.000 --> 00:29:13.599
then we add this UNK thing to it so we

00:29:10.079 --> 00:29:16.158
got a 100 okay so this is our vocabulary

00:29:13.599 --> 00:29:17.599
it starts with UNK, a, and then goes all

00:29:16.159 --> 00:29:20.399
the way to zebra but there are only 100

00:29:17.599 --> 00:29:22.398
of them in total right and just to be

00:29:20.398 --> 00:29:24.319
very clear we didn't bother to do things

00:29:22.398 --> 00:29:26.239
like stemming and stop word removal and

00:29:24.319 --> 00:29:28.000
stuff like that which is why you have

00:29:26.240 --> 00:29:30.319
words like 'the' showing up in this

00:29:28.000 --> 00:29:34.159
list.

00:29:30.319 --> 00:29:35.759
Okay. All right. So,

00:29:34.159 --> 00:29:38.000
let's say this input string arrives, the

00:29:35.759 --> 00:29:40.158
cats are on the mat, and then we run it

00:29:38.000 --> 00:29:43.440
through STIE. So, the cats are on the

00:29:40.159 --> 00:29:46.640
mat goes through this thing, boom.

00:29:43.440 --> 00:29:49.440
Then the output is going to be a table

00:29:46.640 --> 00:29:52.559
with a bunch of rows and a bunch of

00:29:49.440 --> 00:29:56.840
columns. Any guesses

00:29:52.558 --> 00:29:56.839
how many rows and how many columns?

00:30:02.000 --> 00:30:06.278
Just raise your hands. I'll call on you.

00:30:13.359 --> 00:30:18.319
>> Yeah, you use a microphone. Go for it.

00:30:14.880 --> 00:30:20.000
>> Yeah, I would guess uh 100 rows and uh

00:30:18.319 --> 00:30:23.200
six columns.

00:30:20.000 --> 00:30:24.880
All right, we'll take a look. Uh

00:30:23.200 --> 00:30:27.919
100 and six as well as six and 100 are

00:30:24.880 --> 00:30:30.799
both correct. So, so the way I've done

00:30:27.919 --> 00:30:33.038
it is six and 100. And that's

00:30:30.798 --> 00:30:36.158
exactly right. So, the idea is that this

00:30:33.038 --> 00:30:38.720
is your vocabulary, right? So, the word

00:30:36.159 --> 00:30:41.600
the cat sat on the mat once you change

00:30:38.720 --> 00:30:43.919
the case of it, it becomes like this.

00:30:41.599 --> 00:30:47.199
So, 'the' happens to be a one-hot

00:30:43.919 --> 00:30:48.799
vector with a one where there is a the

00:30:47.200 --> 00:30:50.159
and zero everywhere else. I'm not

00:30:48.798 --> 00:30:52.079
showing all the zeros because it'll get

00:30:50.159 --> 00:30:55.679
too cluttered.

00:30:52.079 --> 00:30:57.519
Similarly, cat has a one where the

00:30:55.679 --> 00:30:59.919
cat position is and zero everywhere else

00:30:57.519 --> 00:31:02.079
and so on and so forth. Does that make

00:30:59.919 --> 00:31:04.159
sense? So, the phrase the cat sat on

00:31:02.079 --> 00:31:06.319
the mat came in as just whatever six

00:31:04.159 --> 00:31:10.200
words and then it became this you know

00:31:06.319 --> 00:31:10.200
600 entry table.
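Putting the pieces together, encoding a six-word sentence against a toy vocabulary yields exactly this kind of table: one one-hot row per token, with out-of-vocabulary tokens falling back to UNK (the helper name and six-entry vocabulary are illustrative):

```python
def encode_sentence(tokens, vocab):
    """One row per token; tokens not in the vocabulary map to [UNK] (index 0)."""
    rows = []
    for tok in tokens:
        idx = vocab.get(tok, 0)   # 0 is the [UNK] index
        row = [0] * len(vocab)
        row[idx] = 1
        rows.append(row)
    return rows

vocab = {"[UNK]": 0, "cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
table = encode_sentence(["the", "cat", "sat", "on", "the", "mat"], vocab)
print(len(table), len(table[0]))  # 6 6 -- six rows (tokens) by six columns (vocab size)
```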

00:31:12.240 --> 00:31:18.000
Okay. Now, what is the best way to feed

00:31:15.679 --> 00:31:21.559
this table to a deep neural network?

00:31:18.000 --> 00:31:21.558
What can we do?

00:31:23.599 --> 00:31:27.678
It's not a vector. It's a table.

00:31:26.319 --> 00:31:29.359
If it's a vector, we know what to do. We

00:31:27.679 --> 00:31:30.960
just feed it in. We'll just maybe send

00:31:29.359 --> 00:31:34.398
it to some, you know, hidden layer and

00:31:30.960 --> 00:31:37.200
declare victory at that point.

00:31:34.398 --> 00:31:38.959
>> Yeah.

00:31:37.200 --> 00:31:42.840
>> You would like to flatten it. And like

00:31:38.960 --> 00:31:42.840
how how might you do it?

00:31:43.200 --> 00:31:46.960
Flattening is a reasonable answer by the

00:31:45.119 --> 00:31:49.038
way.

00:31:46.960 --> 00:31:52.480
I think you mean you just have to like

00:31:49.038 --> 00:31:54.798
take each like each column

00:31:52.480 --> 00:31:56.319
take the first one each row and each row

00:31:54.798 --> 00:31:57.839
each word kind of like

00:31:56.319 --> 00:31:59.599
>> yeah so basically you can take all the

00:31:57.839 --> 00:32:01.439
first columns and then take the second

00:31:59.599 --> 00:32:03.359
column and attach it under the first

00:32:01.440 --> 00:32:05.120
column and so on and so forth right so

00:32:03.359 --> 00:32:08.158
we can certainly do that and that's very

00:32:05.119 --> 00:32:10.319
akin to how we work with images right u

00:32:08.159 --> 00:32:13.640
but there is one downside to that what

00:32:10.319 --> 00:32:13.639
is that downside

00:32:15.759 --> 00:32:20.798
uh Um,

00:32:18.480 --> 00:32:23.360
>> it's pretty long. Like I wonder if

00:32:20.798 --> 00:32:25.440
instead you could for the first word

00:32:23.359 --> 00:32:27.439
it's one, for the second word it's two,

00:32:25.440 --> 00:32:30.558
and then you maintain the order, but you

00:32:27.440 --> 00:32:33.038
still keep it just as like one row.

00:32:30.558 --> 00:32:34.960
>> One row. So one issue, so we'll come

00:32:33.038 --> 00:32:36.240
back to what we do about this, but what

00:32:34.960 --> 00:32:39.440
you're pointing out is it could be very

00:32:36.240 --> 00:32:42.399
long, right? Because if each word is a

00:32:39.440 --> 00:32:45.278
50,000-long one-hot vector, with just six

00:32:42.398 --> 00:32:48.000
words, it becomes a 300,000 long vector.

00:32:45.278 --> 00:32:50.798
Imagine take the 300,000 long vector and

00:32:48.000 --> 00:32:53.839
sending it into a 100 hidden unit hidden

00:32:50.798 --> 00:32:56.158
layer. 300,000 times 100 parameters. Too

00:32:53.839 --> 00:32:58.879
much can't learn anything.
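The parameter-count arithmetic just mentioned, spelled out (numbers taken from the lecture's example, counting weights only and ignoring biases):

```python
vocab_size = 50_000   # one-hot length per word
n_tokens = 6          # "the cat sat on the mat"
hidden_units = 100

flattened_len = n_tokens * vocab_size               # 300,000-long input vector
first_layer_weights = flattened_len * hidden_units  # weights into the hidden layer
print(flattened_len, first_layer_weights)  # 300000 30000000
```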

00:32:56.159 --> 00:33:01.360
So that's one issue. The other issue is

00:32:58.880 --> 00:33:02.720
that different length texts that are

00:33:01.359 --> 00:33:04.398
coming in will have different sized

00:33:02.720 --> 00:33:06.319
inputs.

00:33:04.398 --> 00:33:08.879
So here the cat sat on the mat has six

00:33:06.319 --> 00:33:10.558
times 50,000 but maybe the cat sat on

00:33:08.880 --> 00:33:13.200
the mat and the rat ran over to the

00:33:10.558 --> 00:33:15.359
cat becomes even longer. We can't handle

00:33:13.200 --> 00:33:16.798
variable sized inputs.

00:33:15.359 --> 00:33:19.599
the inputs all have to be mapped to the

00:33:16.798 --> 00:33:22.158
same length.

00:33:19.599 --> 00:33:24.079
That's another problem.

00:33:22.159 --> 00:33:26.000
>> So maybe you can count how many you can

00:33:24.079 --> 00:33:27.599
sum the columns basically and count how

00:33:26.000 --> 00:33:29.519
many times each word appears since

00:33:27.599 --> 00:33:30.240
you're losing the, like, spatial

00:33:29.519 --> 00:33:33.359
relationship.

00:33:30.240 --> 00:33:34.880
>> Yes. Yeah. So both of you are on

00:33:33.359 --> 00:33:37.199
the same sort of trajectory which is

00:33:34.880 --> 00:33:39.120
that uh we need to somehow take this

00:33:37.200 --> 00:33:40.960
table and make it into a vector. And

00:33:39.119 --> 00:33:42.879
there are many ways like what you folks

00:33:40.960 --> 00:33:46.880
are describing to make it into a vector

00:33:42.880 --> 00:33:48.159
and turns out um this is all the things

00:33:46.880 --> 00:33:50.880
that we've been discussing so far the

00:33:48.159 --> 00:33:53.039
varying length issue and so on. So, so

00:33:50.880 --> 00:33:56.720
what we can do is we can aggregate all

00:33:53.038 --> 00:33:58.319
these things. If you just add them up,

00:33:56.720 --> 00:34:00.720
this is what you described. I believe

00:33:58.319 --> 00:34:02.720
it's called sum encoding.

00:34:00.720 --> 00:34:04.079
And if instead of adding you just OR

00:34:02.720 --> 00:34:05.360
them, meaning if you look at the column

00:34:04.079 --> 00:34:07.038
and say, is there any one in this

00:34:05.359 --> 00:34:08.878
column? If there's any one, I'll put a

00:34:07.038 --> 00:34:12.239
one, otherwise it's a zero.

00:34:08.878 --> 00:34:13.918
It's called multihot encoding. So, so if

00:34:12.239 --> 00:34:15.358
you look at this thing, if you literally

00:34:13.918 --> 00:34:17.199
just go column by column and count

00:34:15.358 --> 00:34:19.838
everything. Okay, there's a one here,

00:34:17.199 --> 00:34:21.519
one here. Oh, wait. There are two ones

00:34:19.838 --> 00:34:23.039
here. So you put a two. That's count

00:34:21.519 --> 00:34:26.159
encoding. Multi-hot encoding, it

00:34:23.039 --> 00:34:28.800
just looks for any ones and puts a one.

00:34:26.159 --> 00:34:30.159
Make sense? So by the way there are many

00:34:28.800 --> 00:34:32.159
ways to take these tables and make them

00:34:30.159 --> 00:34:34.159
into vectors. These two happen to be

00:34:32.159 --> 00:34:37.480
very commonly used and they kind of make

00:34:34.159 --> 00:34:37.480
common sense.
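Count encoding and multi-hot encoding, as just described, are column-wise aggregations of the one-hot table; a minimal sketch (the toy table encodes "the cat sat on the mat" over a six-entry vocabulary, with column order [UNK], cat, mat, on, sat, the):

```python
def count_encode(table):
    """Sum the one-hot rows column by column: entry j is how many
    times vocabulary word j appears."""
    return [sum(col) for col in zip(*table)]

def multi_hot_encode(table):
    """OR the rows: entry j is 1 if word j appears at all, else 0."""
    return [1 if any(col) else 0 for col in zip(*table)]

# "the cat sat on the mat"; columns: [UNK], cat, mat, on, sat, the
table = [
    [0, 0, 0, 0, 0, 1],  # the
    [0, 1, 0, 0, 0, 0],  # cat
    [0, 0, 0, 0, 1, 0],  # sat
    [0, 0, 0, 1, 0, 0],  # on
    [0, 0, 0, 0, 0, 1],  # the
    [0, 0, 1, 0, 0, 0],  # mat
]
print(count_encode(table))      # [0, 1, 1, 1, 1, 2]
print(multi_hot_encode(table))  # [0, 1, 1, 1, 1, 1]
```

Note that "the mat sat on the cat" would produce exactly the same two vectors, which is the order-loss limitation discussed next.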

00:34:39.199 --> 00:34:43.039
Okay.

00:34:41.039 --> 00:34:44.800
Right. So this aggregation approach that

00:34:43.039 --> 00:34:46.800
we just described is called the bag of

00:34:44.800 --> 00:34:49.039
words model.

00:34:46.800 --> 00:34:51.760
Bag of words model. And the reason is

00:34:49.039 --> 00:34:53.918
that first of all this bag that we have

00:34:51.760 --> 00:34:56.560
has words either it counts whether a

00:34:53.918 --> 00:34:58.000
word exists or not or it counts how many

00:34:56.559 --> 00:35:01.039
words how many times the word has

00:34:58.000 --> 00:35:04.000
appeared right count versus multihot

00:35:01.039 --> 00:35:05.920
versus sum encoding count encoding but

00:35:04.000 --> 00:35:09.199
more importantly and this goes back to

00:35:05.920 --> 00:35:12.320
your observation is that we have lost

00:35:09.199 --> 00:35:14.399
the order of the words now whether the

00:35:12.320 --> 00:35:18.079
phrase came in was the cat sat on the

00:35:14.400 --> 00:35:19.599
mat or the mat sat on the cat the count

00:35:18.079 --> 00:35:21.440
encoding and the multih hard encoding

00:35:19.599 --> 00:35:23.200
are exactly the same. There's no

00:35:21.440 --> 00:35:24.880
difference because we're just looking

00:35:23.199 --> 00:35:27.039
for the the presence or absence of

00:35:24.880 --> 00:35:29.599
words. That's it. We don't care in what

00:35:27.039 --> 00:35:32.480
which order they appear, right? That's a

00:35:29.599 --> 00:35:34.160
huge limitation, but shockingly for many

00:35:32.480 --> 00:35:36.800
applications, it doesn't matter. It's

00:35:34.159 --> 00:35:38.960
good enough. So, it's called the bag of

00:35:36.800 --> 00:35:40.480
words model.

00:35:38.960 --> 00:35:42.720
All right, so this is called the bag of

00:35:40.480 --> 00:35:46.320
words model.

00:35:42.719 --> 00:35:47.199
Um, now does it have any shortcomings? I

00:35:46.320 --> 00:35:48.960
already talked about the first

00:35:47.199 --> 00:35:51.279
shortcoming which is that it loses

00:35:48.960 --> 00:35:54.320
sequentiality the order we lost this

00:35:51.280 --> 00:35:55.680
order information, right? We lose

00:35:54.320 --> 00:36:00.280
the meaning inherent in the order of the

00:35:55.679 --> 00:36:00.279
words what are some other issues with it

00:36:04.079 --> 00:36:07.720
what do you mean by that

00:36:12.480 --> 00:36:16.559
>> right so there are lots of zeros not

00:36:14.639 --> 00:36:18.239
that many ones, so it's a very

00:36:16.559 --> 00:36:19.920
sparse representation, but you

00:36:18.239 --> 00:36:22.000
are carrying around a lot of data

00:36:19.920 --> 00:36:24.159
to make it all work. Now there are

00:36:22.000 --> 00:36:26.239
some tricks CS computer science tricks

00:36:24.159 --> 00:36:29.118
to handle sparsity in some clever ways

00:36:26.239 --> 00:36:30.319
but it is certainly an issue. Now the

00:36:29.119 --> 00:36:32.640
other issue is that let's say the

00:36:30.320 --> 00:36:34.960
vocabulary is very long.

00:36:32.639 --> 00:36:36.879
Each input sentence whether it's the

00:36:34.960 --> 00:36:39.838
collected works of William Shakespeare

00:36:36.880 --> 00:36:42.640
or the phrase I love you will have the

00:36:39.838 --> 00:36:45.519
same length input.

00:36:42.639 --> 00:36:48.078
Why? The same length input,

00:36:45.519 --> 00:36:51.440
because ultimately every incoming thing

00:36:48.079 --> 00:36:54.480
gets mapped into one vector. Okay, that

00:36:51.440 --> 00:36:56.159
feels a little suboptimal.

00:36:54.480 --> 00:36:59.280
Clearly the collected works of Shakespeare have

00:36:56.159 --> 00:37:02.719
a lot more stuff going on in them.

00:36:59.280 --> 00:37:04.480
Right? So that's a problem. In

00:37:02.719 --> 00:37:06.239
particular, for very small things that

00:37:04.480 --> 00:37:08.159
come in, you'll be spending a lot of

00:37:06.239 --> 00:37:10.799
compute on those long vectors and

00:37:08.159 --> 00:37:13.039
processing them. Um, now you can

00:37:10.800 --> 00:37:14.560
mitigate some of this by choosing only

00:37:13.039 --> 00:37:16.000
the most frequent words. You don't have

00:37:14.559 --> 00:37:18.000
to take, you know, I think the English

00:37:16.000 --> 00:37:20.800
language I read somewhere has roughly

00:37:18.000 --> 00:37:23.440
500,000 words or so. Uh, but turns out

00:37:20.800 --> 00:37:24.640
the top 50,000 most frequent words are

00:37:23.440 --> 00:37:27.200
responsible for just about everything

00:37:24.639 --> 00:37:29.519
you're going to see ever. And the other

00:37:27.199 --> 00:37:31.358
450,000 are what's called the long tail.

00:37:29.519 --> 00:37:33.119
They almost never happen, right? You

00:37:31.358 --> 00:37:34.639
never see them. So, you can be very

00:37:33.119 --> 00:37:36.640
pragmatic and say, "I'm not going to

00:37:34.639 --> 00:37:38.559
take every little word that I see in my

00:37:36.639 --> 00:37:40.000
vocabulary. I'm going to only take the

00:37:38.559 --> 00:37:42.078
most frequent words. I'm just going to

00:37:40.000 --> 00:37:44.000
ignore the rest.

00:37:42.079 --> 00:37:46.960
I'm just going to ignore the rest."

00:37:44.000 --> 00:37:50.079
Okay?

00:37:46.960 --> 00:37:52.400
But if you ignore the rest, let's say

00:37:50.079 --> 00:37:55.280
there is one word, uh, let's take some

00:37:52.400 --> 00:37:57.358
Shakespeare word, Hamlet. Let's

00:37:55.280 --> 00:37:58.640
assume that you ignore the word Hamlet

00:37:57.358 --> 00:38:00.400
from your training corpus. You just

00:37:58.639 --> 00:38:02.159
delete it because it's not one of the

00:38:00.400 --> 00:38:04.480
top most frequent things you have seen.

00:38:02.159 --> 00:38:06.559
And then somebody sends you a text

00:38:04.480 --> 00:38:08.240
saying, you know, Hamlet was a bad

00:38:06.559 --> 00:38:10.400
prince.

00:38:08.239 --> 00:38:12.159
Analyze the sentiment of the sentence.

00:38:10.400 --> 00:38:14.160
Well, when you see Hamlet, what is your

00:38:12.159 --> 00:38:15.358
system going to do?

00:38:14.159 --> 00:38:16.799
It's going to look at the Hamlet and

00:38:15.358 --> 00:38:18.480
say, I can't see it in my vocabulary

00:38:16.800 --> 00:38:19.920
anywhere.

00:38:18.480 --> 00:38:22.400
And if it can't see in the vocabulary,

00:38:19.920 --> 00:38:26.000
what is the only thing it can do?

00:38:22.400 --> 00:38:28.400
Replace it with UNK. So that's where

00:38:26.000 --> 00:38:30.079
UNK comes into the picture.

00:38:28.400 --> 00:38:32.000
So whenever it can't see something in

00:38:30.079 --> 00:38:35.839
the vocabulary in a new input, it just

00:38:32.000 --> 00:38:37.838
replaces it with UNK, which means that

00:38:35.838 --> 00:38:40.880
if you had ignored Romeo, Juliet, and

00:38:37.838 --> 00:38:42.239
Hamlet in the in the training corpus,

00:38:40.880 --> 00:38:44.079
all of them are going to be replaced by

00:38:42.239 --> 00:38:46.719
the same UNK, which means that we can't

00:38:44.079 --> 00:38:48.960
distinguish between them anymore.
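
The vocabulary-truncation and UNK behavior just described can be sketched in a few lines. The corpus and the cutoff k here are made up for illustration:

```python
from collections import Counter

# Keep only the top-k most frequent words; everything else maps to UNK.
corpus = "romeo loves juliet hamlet was a prince romeo was a prince".split()
k = 4
top_k = [w for w, _ in Counter(corpus).most_common(k)]

def to_tokens(text, vocab):
    # Out-of-vocabulary words all collapse into the same UNK token, so
    # dropped words like "hamlet" and "juliet" become indistinguishable.
    return [w if w in vocab else "[UNK]" for w in text.split()]

print(to_tokens("hamlet was a bad prince", top_k))
```

Here "hamlet" and "bad" fall outside the top-4 vocabulary, so both come back as the same `[UNK]` token.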

00:38:46.719 --> 00:38:52.159
>> So is this where hallucination

00:38:48.960 --> 00:38:54.880
comes into play here where it doesn't

00:38:52.159 --> 00:38:56.239
recognize it?

00:38:54.880 --> 00:38:58.400
Hm, interesting question. Is this

00:38:56.239 --> 00:39:00.799
where hallucination comes up? Actually, as it

00:38:58.400 --> 00:39:03.680
turns out, no, as we will see when we

00:39:00.800 --> 00:39:06.480
talk about LLMs later. Uh LLMs actually

00:39:03.679 --> 00:39:08.078
will not have this UNK problem because

00:39:06.480 --> 00:39:09.440
they use a different tokenization scheme

00:39:08.079 --> 00:39:10.960
which can handle anything you throw at

00:39:09.440 --> 00:39:12.480
it, including new stuff you just made

00:39:10.960 --> 00:39:14.800
up.

00:39:12.480 --> 00:39:17.838
So, we'll come back to that.

00:39:14.800 --> 00:39:19.760
All right. Um so, that's what we have.

00:39:17.838 --> 00:39:21.440
And so what we're going to do is despite

00:39:19.760 --> 00:39:23.599
its shortcomings, bag of words is

00:39:21.440 --> 00:39:26.400
actually a really good default for many

00:39:23.599 --> 00:39:27.599
NLP tasks. Uh and in the spirit of do

00:39:26.400 --> 00:39:28.880
the simple stuff first and do

00:39:27.599 --> 00:39:30.400
complicated things only if the simple

00:39:28.880 --> 00:39:32.079
doesn't work. We'll use a bag of words

00:39:30.400 --> 00:39:36.480
model right now. Okay. So we'll switch

00:39:32.079 --> 00:39:39.440
to a Colab and see how it's done.

00:39:36.480 --> 00:39:40.719
So here the the application we're going

00:39:39.440 --> 00:39:43.119
to work with is kind of a fun

00:39:40.719 --> 00:39:46.000
application. Uh we're going to try to

00:39:43.119 --> 00:39:47.599
predict the genre of songs.

00:39:46.000 --> 00:39:50.480
Okay, it's a nice classification use

00:39:47.599 --> 00:39:52.800
case. Um, so we want to take some

00:39:50.480 --> 00:39:55.440
arbitrary song and then classify it into

00:39:52.800 --> 00:39:59.599
either hip-hop, rock or pop.

00:39:55.440 --> 00:40:01.200
Okay. Um, and so for instance,

00:39:59.599 --> 00:40:03.200
right, these are the kind of

00:40:01.199 --> 00:40:04.879
lyrics you're going to see. And as you

00:40:03.199 --> 00:40:07.279
will see in this data set,

00:40:04.880 --> 00:40:10.320
just a quick word of caution, uh, the

00:40:07.280 --> 00:40:12.720
data set does have lyrics which may not

00:40:10.320 --> 00:40:14.320
be sort of, you know, safe for work as

00:40:12.719 --> 00:40:16.719
it were. So I'm not going to be like

00:40:14.320 --> 00:40:18.880
exploring the lyrics in the Colab, but

00:40:16.719 --> 00:40:20.959
I just wanted you to be aware of it. Okay.

00:40:18.880 --> 00:40:22.480
Um, so but it's just some data set that

00:40:20.960 --> 00:40:24.240
we downloaded from somewhere, right? Uh,

00:40:22.480 --> 00:40:25.599
it's got all these lyrics. Okay. So

00:40:24.239 --> 00:40:27.759
we're going to try to classify each

00:40:25.599 --> 00:40:29.200
verse that we see into one of three

00:40:27.760 --> 00:40:31.680
things. Hip hop, rock or pop. It's a

00:40:29.199 --> 00:40:33.279
multi-class classification problem.

00:40:31.679 --> 00:40:35.039
All right. Actually, what is the

00:40:33.280 --> 00:40:37.760
simplest neural network based classifier

00:40:35.039 --> 00:40:41.119
we can build

00:40:37.760 --> 00:40:42.800
for this problem?

00:40:41.119 --> 00:40:44.880
All right. So what is the simplest

00:40:42.800 --> 00:40:47.519
neural network we can build for this

00:40:44.880 --> 00:40:49.519
problem? So remember what is the input?

00:40:47.519 --> 00:40:50.719
The input is going to be a bunch of song

00:40:49.519 --> 00:40:52.800
lyrics. It's going to be a really long

00:40:50.719 --> 00:40:54.879
song for all you know, right? And we're

00:40:52.800 --> 00:40:56.560
going to use the bag of words model. Uh

00:40:54.880 --> 00:40:59.680
and let's assume for a moment that we

00:40:56.559 --> 00:41:02.239
will use multi-hot encoding, right? We'll

00:40:59.679 --> 00:41:04.000
create a vocabulary from this for the

00:41:02.239 --> 00:41:06.559
song. We'll take all the songs. We'll

00:41:04.000 --> 00:41:08.239
process them, run it through STIE. We will

00:41:06.559 --> 00:41:10.719
do multi-hot encoding which means that

00:41:08.239 --> 00:41:14.239
every song that comes in will

00:41:10.719 --> 00:41:17.279
be a vector that's how long?

00:41:14.239 --> 00:41:20.719
it'll be as long as the

00:41:17.280 --> 00:41:24.720
Correct, the vocabulary size, right. So, um,

00:41:20.719 --> 00:41:26.480
so maybe what comes in is this phrase um

00:41:24.719 --> 00:41:28.000
since it's supposed to be songs I'll say

00:41:26.480 --> 00:41:30.960
something which is probably common to

00:41:28.000 --> 00:41:34.639
90% of songs I love you

00:41:30.960 --> 00:41:38.480
okay that goes in

00:41:34.639 --> 00:41:42.000
it goes into our STIE process

00:41:38.480 --> 00:41:49.039
and then this STIE process gives us a

00:41:42.000 --> 00:41:50.318
vector which is x1, x2, all the way to xV,

00:41:49.039 --> 00:41:52.639
where V stands for the size of

00:41:50.318 --> 00:41:54.960
vocabulary. Okay. So that that's our

00:41:52.639 --> 00:41:58.239
input layer

00:41:54.960 --> 00:42:02.400
all the way. So knowing what we know now

00:41:58.239 --> 00:42:04.959
about deep learning what can we do next?

00:42:02.400 --> 00:42:07.920
Couldn't you or maybe I'm getting ahead

00:42:04.960 --> 00:42:10.240
but wouldn't the classifier just be like

00:42:07.920 --> 00:42:11.920
the baseline would be classify it as the

00:42:10.239 --> 00:42:13.199
most common genre?

00:42:11.920 --> 00:42:14.800
>> That is the baseline. Correct. Correct.

00:42:13.199 --> 00:42:17.039
I'm just saying and we'll come to the

00:42:14.800 --> 00:42:18.720
baseline a bit later. But here I'm

00:42:17.039 --> 00:42:21.119
saying suppose you wanted to

00:42:18.719 --> 00:42:23.358
build a neural network model for this.

00:42:21.119 --> 00:42:25.280
How would you set it up?

00:42:23.358 --> 00:42:26.078
>> You think about the layers that you

00:42:25.280 --> 00:42:27.359
want,

00:42:26.079 --> 00:42:29.039
>> right? And what is the simplest thing

00:42:27.358 --> 00:42:30.159
you can do with a neural network? How

00:42:29.039 --> 00:42:33.279
many layers?

00:42:30.159 --> 00:42:35.358
>> Uh no layers. Well, then it becomes

00:42:33.280 --> 00:42:36.800
problematic with even a neural network

00:42:35.358 --> 00:42:37.759
because it could just be logistic

00:42:36.800 --> 00:42:38.800
regression

00:42:37.760 --> 00:42:41.760
>> one hidden layer.

00:42:38.800 --> 00:42:43.119
>> Yes, thank you. I'm being a little

00:42:41.760 --> 00:42:44.800
squishy about this because there are

00:42:43.119 --> 00:42:46.480
some people who be like well even if

00:42:44.800 --> 00:42:48.560
there's no hidden layers if you're using

00:42:46.480 --> 00:42:49.838
ReLUs and this and that and sigmoids,

00:42:48.559 --> 00:42:51.519
maybe it's a neural network and I don't

00:42:49.838 --> 00:42:54.400
want to get into that how many angels on

00:42:51.519 --> 00:42:56.079
the head of a pin argument. So, um, so yeah,

00:42:54.400 --> 00:42:57.358
we need one hidden layer right in this

00:42:56.079 --> 00:42:59.039
course we need at least one hidden layer

00:42:57.358 --> 00:43:01.119
for it to qualify as a neural network.

00:42:59.039 --> 00:43:04.800
Okay, so let's have a hidden layer and

00:43:01.119 --> 00:43:07.680
we'll have a bunch of ReLUS as usual.

00:43:04.800 --> 00:43:09.119
Okay, a bunch of ReLUs and I'll ignore

00:43:07.679 --> 00:43:11.519
all the arrows between them. It's kind

00:43:09.119 --> 00:43:13.039
of a pain. U and then we come to the

00:43:11.519 --> 00:43:15.358
output layer. And what should the output

00:43:13.039 --> 00:43:16.960
layer be?

00:43:15.358 --> 00:43:19.519
How many nodes do we have need in the

00:43:16.960 --> 00:43:22.400
output layer? Three, right? Hip-hop,

00:43:19.519 --> 00:43:23.759
rock, whatever, pop. And then that

00:43:22.400 --> 00:43:25.358
layer is called what? What activation

00:43:23.760 --> 00:43:27.520
function?

00:43:25.358 --> 00:43:30.960
Softmax. Perfect. Love it. Love this

00:43:27.519 --> 00:43:33.838
class. All right, three things. Uh,

00:43:30.960 --> 00:43:36.880
rock, hip-hop,

00:43:33.838 --> 00:43:39.199
and uh, pop, right? And this is a

00:43:36.880 --> 00:43:41.760
softmax right there.

00:43:39.199 --> 00:43:44.639
And then it's going to give us three

00:43:41.760 --> 00:43:46.400
probabilities that add up to one because

00:43:44.639 --> 00:43:49.679
it's a softmax. So that's our basic

00:43:46.400 --> 00:43:51.039
network, right? Perfect. Yeah.
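
The whole network just sketched on the board can be written out numerically. This is an illustrative sketch (not the Colab's actual code): the vocabulary size, the input indices, and the random weights are all made up, and the weights are untrained:

```python
import numpy as np

# Multi-hot input of size V -> hidden layer of 8 ReLUs -> 3-way softmax.
rng = np.random.default_rng(42)
V = 17                                 # vocabulary size
x = np.zeros(V)
x[[1, 4, 9]] = 1.0                     # a made-up multi-hot input, e.g. "I love you"

W1, b1 = rng.normal(size=(V, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

h = np.maximum(0.0, x @ W1 + b1)       # hidden layer: 8 ReLUs
z = h @ W2 + b2                        # 3 logits: hip-hop, rock, pop
p = np.exp(z - z.max())
p /= p.sum()                           # softmax: probabilities summing to 1
print(p)
```

Whatever the weights are, the output is always three non-negative numbers that sum to one.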

00:43:49.679 --> 00:43:52.799
>> Why do you need those probabilities?

00:43:51.039 --> 00:43:55.279
Again, if you just want to identify the

00:43:52.800 --> 00:43:56.720
most likely genre, the softmax just

00:43:55.280 --> 00:43:59.359
gives you a way to kind of make them all add

00:43:56.719 --> 00:44:01.358
up to one. Why do you need softmax? Why don't

00:43:59.358 --> 00:44:01.759
you just take the max value and say it's

00:44:01.358 --> 00:44:03.679
that?

00:44:01.760 --> 00:44:05.760
>> Oh, interesting question. Why can't we

00:44:03.679 --> 00:44:09.519
just produce three numbers and grab the

00:44:05.760 --> 00:44:11.200
maximum number? So, it turns out finding

00:44:09.519 --> 00:44:12.719
the maximum of a bunch of numbers, that

00:44:11.199 --> 00:44:14.960
function

00:44:12.719 --> 00:44:16.959
is not very friendly for

00:44:14.960 --> 00:44:18.880
differentiation.

00:44:16.960 --> 00:44:20.800
And ultimately you want to take this

00:44:18.880 --> 00:44:23.200
output, run it through a loss function

00:44:20.800 --> 00:44:25.839
like cross entropy and then be able to

00:44:23.199 --> 00:44:27.679
run back prop on it. And so

00:44:25.838 --> 00:44:29.599
fundamentally back propagation is just

00:44:27.679 --> 00:44:31.199
differentiation and it requires

00:44:29.599 --> 00:44:34.160
everything inside of it to have well-

00:44:31.199 --> 00:44:36.239
behaved gradients. And so this little

00:44:34.159 --> 00:44:39.039
max function is actually not well

00:44:36.239 --> 00:44:41.598
behaved, which is why we have a soft

00:44:39.039 --> 00:44:44.318
version of it, softmax, which makes it

00:44:41.599 --> 00:44:45.760
easy to differentiate. So I can tell you

00:44:44.318 --> 00:44:49.079
more about it offline but that's sort of

00:44:45.760 --> 00:44:49.079
the quick synopsis.
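
That synopsis can be checked numerically. A minimal sketch (the logits and step size are made up): nudging a logit changes the softmax output smoothly, while the hard argmax is flat almost everywhere, so it gives backpropagation nothing to work with:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is the standard numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
eps = 1e-3

# Finite-difference sensitivity of the first output to the first logit:
d_soft = (softmax(z + [eps, 0, 0])[0] - softmax(z)[0]) / eps  # smooth, nonzero
d_hard = (np.argmax(z + [eps, 0, 0]) - np.argmax(z)) / eps    # flat: zero almost everywhere
print(d_soft, d_hard)
```

The softmax sensitivity is strictly positive, so gradients flow; the hard max gives zero.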

00:44:49.119 --> 00:44:52.640
So a lot of tricks you will see in the

00:44:50.480 --> 00:44:55.440
neural network literature are ways to

00:44:52.639 --> 00:44:57.358
avoid this problem, that

00:44:55.440 --> 00:44:59.200
the obvious choice of function

00:44:57.358 --> 00:45:00.400
will not be well behaved for

00:44:59.199 --> 00:45:02.960
differentiation. That's why you need to

00:45:00.400 --> 00:45:05.039
go through all these other mechanisms

00:45:02.960 --> 00:45:06.400
much like we couldn't just say accuracy.

00:45:05.039 --> 00:45:07.679
Why don't you just maximize accuracy

00:45:06.400 --> 00:45:10.880
instead of doing this cross entropy

00:45:07.679 --> 00:45:14.480
business? Same reason.

00:45:10.880 --> 00:45:17.640
All right. So let's come back here.

00:45:14.480 --> 00:45:17.639
All right.

00:45:20.639 --> 00:45:27.279
So that's what we created on the thing.

00:45:23.679 --> 00:45:28.960
Right? The "cat sat on the mat" vocabulary

00:45:27.280 --> 00:45:31.359
thing and so on. And I you know I was

00:45:28.960 --> 00:45:33.519
playing around with it uh earlier and so

00:45:31.358 --> 00:45:35.039
I found that, you know, eight ReLU

00:45:33.519 --> 00:45:36.159
neurons were pretty good to get the job

00:45:35.039 --> 00:45:37.838
done. So I'm just going to go with eight

00:45:36.159 --> 00:45:39.920
ReLU

00:45:37.838 --> 00:45:44.078
neurons in the hidden layer.

00:45:39.920 --> 00:45:47.039
So I think that brings us to the Colab.

00:45:44.079 --> 00:45:49.519
Yeah. So let's switch to the Colab.

00:45:47.039 --> 00:45:50.960
All right. So um that's what we have

00:45:49.519 --> 00:45:52.318
here. We you know there's a little bit

00:45:50.960 --> 00:45:54.159
of verbiage here which just describes

00:45:52.318 --> 00:45:56.400
what I just talked about. So we'll do

00:45:54.159 --> 00:45:58.639
the usual things and upload everything

00:45:56.400 --> 00:46:01.280
uh import everything we want. TensorFlow

00:45:58.639 --> 00:46:03.838
and Keras and the holy trinity of

00:46:01.280 --> 00:46:07.040
NumPy, pandas, and matplotlib. Uh, set

00:46:03.838 --> 00:46:09.679
the random seed as usual at 42.

00:46:07.039 --> 00:46:11.759
This is our STIE framework here. And the

00:46:09.679 --> 00:46:14.480
nice thing is that all four of these

00:46:11.760 --> 00:46:16.880
things in STIE are beautifully implemented

00:46:14.480 --> 00:46:19.440
in Keras as a single simple layer called

00:46:16.880 --> 00:46:22.880
the TextVectorization layer. Okay, which

00:46:19.440 --> 00:46:25.200
is nice. Um, so we have the TextVectorization

00:46:22.880 --> 00:46:26.960
layer right here. And so in our first example,

00:46:25.199 --> 00:46:29.039
what we'll do is we will use a default

00:46:26.960 --> 00:46:31.199
standardization which will just remove

00:46:29.039 --> 00:46:33.039
punctuation, convert to lowercase. We'll

00:46:31.199 --> 00:46:35.598
use a default tokenization which just

00:46:33.039 --> 00:46:37.358
means split on the space between words.

00:46:35.599 --> 00:46:39.680
And then we will set the output to

00:46:37.358 --> 00:46:41.039
multi-hot. Right? All the things we

00:46:39.679 --> 00:46:43.598
talked about, Keras will just do it for

00:46:41.039 --> 00:46:45.759
you automatically. And so output mode

00:46:43.599 --> 00:46:47.359
multi-hot, standardize, split on

00:46:45.760 --> 00:46:49.760
whitespace, and boom, you run the text

00:46:47.358 --> 00:46:52.000
vectorization thing. And once you do it,

00:46:49.760 --> 00:46:53.599
Keras creates this TextVectorization layer

00:46:52.000 --> 00:46:56.159
with these settings and it's now ready

00:46:53.599 --> 00:46:58.480
to swing into action. So what does swing

00:46:56.159 --> 00:46:59.679
into action actually mean? Well, now we

00:46:58.480 --> 00:47:01.920
need to actually feed it a training

00:46:59.679 --> 00:47:02.960
corpus so that it can do all the things

00:47:01.920 --> 00:47:07.039
it's supposed to do and create the

00:47:02.960 --> 00:47:08.800
vocabulary for you, right? So, and

00:47:07.039 --> 00:47:11.599
that thing is called the adapt method.

00:47:08.800 --> 00:47:14.880
So we create a tiny training corpus for

00:47:11.599 --> 00:47:16.160
us. This is our data set. Um right this

00:47:14.880 --> 00:47:18.240
just a bunch of words from some of these

00:47:16.159 --> 00:47:19.920
lyrics. And then what we'll do is we'll

00:47:18.239 --> 00:47:21.838
take this layer that we just defined

00:47:19.920 --> 00:47:24.240
here that we have set up here. And then

00:47:21.838 --> 00:47:26.078
we will ask this layer to actually

00:47:24.239 --> 00:47:29.679
create the vocabulary using this adapt

00:47:26.079 --> 00:47:31.760
command. Okay. Index the vocabulary. And

00:47:29.679 --> 00:47:34.239
it's done. And once it does it, you can

00:47:31.760 --> 00:47:36.160
actually ask it for the vocabulary.

00:47:34.239 --> 00:47:38.479
Okay, this is the vocabulary using the

00:47:36.159 --> 00:47:41.679
get vocabulary command. And so first of

00:47:38.480 --> 00:47:45.119
all, how long is the vocab? 17 17 words,

00:47:41.679 --> 00:47:46.799
17 tokens. What are they?
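
To see what adapt and get_vocabulary are doing, here is a rough pure-Python mock of the layer's behavior, not the Keras code itself: adapt builds a frequency-sorted vocabulary with "[UNK]" reserved at index 0, and calling the object multi-hot encodes a sentence against that vocabulary. The toy corpus is made up, and the real layer also strips punctuation:

```python
from collections import Counter

class TinyVectorizer:
    def adapt(self, corpus):
        counts = Counter(w for line in corpus for w in line.lower().split())
        # most frequent words first, after the reserved UNK slot
        self.vocab = ["[UNK]"] + [w for w, _ in counts.most_common()]

    def __call__(self, sentence):
        words = {w if w in self.vocab else "[UNK]" for w in sentence.lower().split()}
        return [1 if v in words else 0 for v in self.vocab]

vec = TinyVectorizer()
vec.adapt(["the cat sat", "the cat ran"])
print(vec.vocab)           # ['[UNK]', 'the', 'cat', 'sat', 'ran']
print(vec("the dog sat"))  # "dog" is out of vocabulary -> the UNK slot gets a 1
```

The output vector is always as long as the vocabulary, and any unseen word just lights up position zero.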

00:47:45.119 --> 00:47:48.880
And see here, and you can see these are

00:47:46.800 --> 00:47:50.640
all the words and you can see it is

00:47:48.880 --> 00:47:52.400
stuck UNK in the very beginning,

00:47:50.639 --> 00:47:54.239
right? It's sort of the default. By the

00:47:52.400 --> 00:47:55.599
way, uh just a little programming tip if

00:47:54.239 --> 00:47:57.118
you're not familiar with if you don't

00:47:55.599 --> 00:47:58.400
have a ton of programming experience. If

00:47:57.119 --> 00:48:00.240
you want to, you know, print these

00:47:58.400 --> 00:48:02.960
Python objects like list and all in a

00:48:00.239 --> 00:48:05.838
pretty way, one trick that often works

00:48:02.960 --> 00:48:08.240
is just stick it into a data frame

00:48:05.838 --> 00:48:09.599
and then print it. Usually, it'll print

00:48:08.239 --> 00:48:11.679
it in a much better way. So, you can see

00:48:09.599 --> 00:48:13.760
it like that.
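
The printing tip is just this (the vocabulary list here is a made-up stand-in):

```python
import pandas as pd

# Wrapping a plain Python list in a DataFrame often prints it more
# readably than print(list) does, one row per item with an index.
vocab = ["[UNK]", "arrays", "the", "cat"]
print(pd.DataFrame(vocab, columns=["token"]))
```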

00:48:11.679 --> 00:48:15.598
So, you can see here UNK, arrays, blah

00:48:13.760 --> 00:48:17.920
blah blah blah blah. And you can see

00:48:15.599 --> 00:48:19.760
integer zero assigned to the UNK token. By

00:48:17.920 --> 00:48:22.559
the way, how come it picked the word

00:48:19.760 --> 00:48:26.960
arrays as the second entry? Why not

00:48:22.559 --> 00:48:29.839
something like "an"? Or, um, you know, why

00:48:26.960 --> 00:48:32.400
not "a"? How come "a" is not chosen

00:48:29.838 --> 00:48:36.039
as the second entry? Why did it pick

00:48:32.400 --> 00:48:36.039
arrays, do you think?

00:48:40.318 --> 00:48:45.358
>> Maybe it tried to put the words that

00:48:43.358 --> 00:48:49.119
are most influential on the meaning of

00:48:45.358 --> 00:48:49.119
the sentence to be on the

00:48:49.760 --> 00:48:54.160
But it at this point it doesn't know

00:48:51.280 --> 00:48:56.000
what we're going to use it for.

00:48:54.159 --> 00:48:57.358
So it has no way to know what word is

00:48:56.000 --> 00:48:59.599
useful because we haven't told it how

00:48:57.358 --> 00:49:01.838
we're going to use it.

00:48:59.599 --> 00:49:04.559
But but you're kind of on the right

00:49:01.838 --> 00:49:06.400
track. So what KAS does is it'll

00:49:04.559 --> 00:49:07.680
calculate it'll find all these tokens

00:49:06.400 --> 00:49:09.760
and then it'll actually just sort them

00:49:07.679 --> 00:49:12.239
by frequency.

00:49:09.760 --> 00:49:13.680
So the most frequent as it turns out in

00:49:12.239 --> 00:49:15.838
those four sentences we gave it happen

00:49:13.679 --> 00:49:17.838
to be the word arrays. That's why arrays

00:49:15.838 --> 00:49:19.279
is showing up on top. Um, and you can

00:49:17.838 --> 00:49:21.759
actually confirm this by going to the

00:49:19.280 --> 00:49:23.760
our little data set and you can see here

00:49:21.760 --> 00:49:25.920
arrays shows up here, and it was up here

00:49:23.760 --> 00:49:29.920
twice and that's why it came up on top.

00:49:25.920 --> 00:49:32.559
Okay. All right. So that's what we have

00:49:29.920 --> 00:49:34.400
and, uh, now that we have populated

00:49:32.559 --> 00:49:36.319
this we can run any sentence through it

00:49:34.400 --> 00:49:37.358
easily. Yeah.

00:49:36.318 --> 00:49:39.599
>> Does [clears throat] it matter that it's

00:49:37.358 --> 00:49:41.199
on the top or is it just

00:49:39.599 --> 00:49:43.519
>> it doesn't matter. It doesn't matter.

00:49:41.199 --> 00:49:45.598
The reason why it's helpful later on is

00:49:43.519 --> 00:49:48.079
because suppose you tell Keras, hey, don't

00:49:45.599 --> 00:49:50.559
take every word you see here give me

00:49:48.079 --> 00:49:52.318
only the most frequent 100 words, I don't

00:49:50.559 --> 00:49:56.519
want any more than that. It can easily do

00:49:52.318 --> 00:49:56.519
that. That's the reason. Yeah.

00:50:01.199 --> 00:50:05.679
>> this is just a vocabulary so basically

00:50:03.280 --> 00:50:07.519
you give it all these phrases, it

00:50:05.679 --> 00:50:09.039
happens just four phrases in our example

00:50:07.519 --> 00:50:10.639
and then it finds all the distinct words

00:50:09.039 --> 00:50:12.558
and you know does all that stuff and and

00:50:10.639 --> 00:50:14.480
then it has created a vocabulary. At

00:50:12.559 --> 00:50:17.680
this point, the training corpus you

00:50:14.480 --> 00:50:19.440
fed it is forgotten and the only

00:50:17.679 --> 00:50:21.838
thing that has survived this processing is

00:50:19.440 --> 00:50:23.280
just the vocabulary. That's it. Now we

00:50:21.838 --> 00:50:25.838
have to start applying it to any kind of

00:50:23.280 --> 00:50:28.559
text we want to use it for.

00:50:25.838 --> 00:50:30.159
So when you come back here, uh, so

00:50:28.559 --> 00:50:32.240
this is what we have and so what you can

00:50:30.159 --> 00:50:33.920
do is you can take any sentence and you

00:50:32.239 --> 00:50:35.039
can just run it through a layer and to

00:50:33.920 --> 00:50:37.039
make sure that it actually is doing the

00:50:35.039 --> 00:50:39.119
right thing for you. So we'll take the

00:50:37.039 --> 00:50:40.558
sentence, we will then run it through

00:50:39.119 --> 00:50:42.000
the text vectorization layer by just

00:50:40.559 --> 00:50:45.640
passing that sentence into it and then

00:50:42.000 --> 00:50:45.639
we can just print it.

00:50:46.000 --> 00:50:50.559
So now it's giving you a tensor. This is

00:50:47.838 --> 00:50:54.318
a multi-hot encoded tensor with all these

00:50:50.559 --> 00:50:56.400
ones and zeros. So note that this tensor

00:50:54.318 --> 00:50:58.079
is 17 units long, which is a

00:50:56.400 --> 00:51:00.880
good check because our vocabulary is 17

00:50:58.079 --> 00:51:03.519
long. So it had better match that. Uh, now

00:51:00.880 --> 00:51:05.680
recall that the UNK token is at the

00:51:03.519 --> 00:51:08.159
first location. It's at index zero and

00:51:05.679 --> 00:51:10.558
it says that this encoded sentence does

00:51:08.159 --> 00:51:13.358
have an UNK word.

00:51:10.559 --> 00:51:15.920
Okay. So

00:51:13.358 --> 00:51:19.039
Why is that? What is this UNK word?

00:51:15.920 --> 00:51:21.680
Anyone can guess?

00:51:19.039 --> 00:51:24.400
Well, it turns out to be the word still.

00:51:21.679 --> 00:51:26.480
Um I think yeah still is not in our

00:51:24.400 --> 00:51:28.079
vocabulary because the four sentences

00:51:26.480 --> 00:51:30.240
which is our training corpus used to

00:51:28.079 --> 00:51:32.000
build vocabulary. They had a lot of

00:51:30.239 --> 00:51:33.838
write and rewrite but there was no still

00:51:32.000 --> 00:51:35.920
in it anyway. That's why there's an

00:39:33.838 --> 00:39:38.159
UNK for it. Uh, we can just double check

00:51:35.920 --> 00:51:40.000
that by asking Python: is it in the

00:51:38.159 --> 00:51:41.598
vocabulary? Nope, it's not. Okay. Now,

00:51:40.000 --> 00:51:42.960
in the spirit of making small changes to

00:51:41.599 --> 00:51:45.680
the code to understand what's going on,

00:51:42.960 --> 00:51:46.880
which is a very useful tip for folks who

00:51:45.679 --> 00:51:48.879
don't have a ton of programming

00:51:46.880 --> 00:51:52.960
knowledge. Let's say that you send the

00:51:48.880 --> 00:51:54.480
phrase Sloan Hodddle and DM, DMD. Uh I

00:51:52.960 --> 00:51:55.760
think you will agree with me that none

00:51:54.480 --> 00:51:59.358
of these words is in the training

00:51:55.760 --> 00:52:02.000
corpus, right? So what will this what is

00:51:59.358 --> 00:52:06.199
the multi-hot encoded vector for this

00:52:02.000 --> 00:52:06.199
phrase sloan hodddle BMD

00:52:07.440 --> 00:52:10.440
three

00:52:11.440 --> 00:52:14.800
It's not count encoding, it's multi-hot

00:52:13.119 --> 00:52:17.358
encoding

00:52:14.800 --> 00:52:19.039
right, it's going to be 1, 0, 0, so you can

00:52:17.358 --> 00:52:21.598
see here or in this case remember the

00:52:19.039 --> 00:52:23.599
vocabulary is 17

00:52:21.599 --> 00:52:27.440
right so each of these words is going to

00:52:23.599 --> 00:52:29.200
be a one followed by 16 zeros

00:52:27.440 --> 00:52:30.880
And then it's going to multi-hot encode

00:52:29.199 --> 00:52:34.318
them which means the three ones in the

00:52:30.880 --> 00:52:37.039
column just become a one. So you

00:52:34.318 --> 00:52:39.599
still have only this one. Okay. All

00:52:37.039 --> 00:52:41.358
right. Good. So now

00:52:39.599 --> 00:52:45.359
let's actually get to the data

00:52:41.358 --> 00:52:47.598
set. We have these 90,000 songs. Uh, and

00:52:45.358 --> 00:52:49.440
it's in this little thing here. Uh we

00:52:47.599 --> 00:52:50.720
have grabbed the data and cleaned it up.

00:52:49.440 --> 00:52:53.280
Cleaned it up meaning like formatting

00:52:50.719 --> 00:52:55.039
wise not content wise. uh and then we

00:52:53.280 --> 00:52:56.960
stuck it in this uh data frame and it's

00:52:55.039 --> 00:52:58.480
we already have divided into train, test

00:52:56.960 --> 00:53:00.720
and validation for your benefit. So you

00:52:58.480 --> 00:53:03.599
don't have to worry about it. So turns

00:53:00.719 --> 00:53:05.759
out we have almost 49,000 songs in

00:53:03.599 --> 00:53:08.800
the training set, 16,000 songs in the

00:53:05.760 --> 00:53:10.960
validation set and roughly 22,000 in

00:53:08.800 --> 00:53:13.119
the test set. Okay, lot of songs. It's a

00:53:10.960 --> 00:53:15.838
lot. It's a big data set. Um so let's

00:53:13.119 --> 00:53:18.079
just look at the first few.

00:53:15.838 --> 00:53:20.558
So oh girl, I can't get ready. We met on

00:53:18.079 --> 00:53:22.000
rainy evening. Paralysis through

00:53:20.559 --> 00:53:23.599
analysis.

00:53:22.000 --> 00:53:27.599
Okay, that I can relate to as a data

00:53:23.599 --> 00:53:29.280
science person. But anyway, u but uh by

00:53:27.599 --> 00:53:31.440
the way this uh these things are very

00:53:29.280 --> 00:53:33.440
useful for exploration of any uh data

00:53:31.440 --> 00:53:36.720
frames that you might have. It's a

00:53:33.440 --> 00:53:38.318
Colab feature, just check it out. Um, so

00:53:36.719 --> 00:53:40.159
anyway, that's the first few the first

00:53:38.318 --> 00:53:43.119
few rows. Let's look at the last few

00:53:40.159 --> 00:53:46.118
rows.

00:53:43.119 --> 00:53:46.119
Okay,

00:53:48.800 --> 00:53:56.280
you never listen to me is pop. Beamer

00:53:51.440 --> 00:53:56.280
Benz is hip-hop. Yeah, of course.

00:53:57.599 --> 00:54:01.440
So, okay. Uh, now to go back to the

00:53:59.679 --> 00:54:02.639
question of, okay, um, what could be a

00:54:01.440 --> 00:54:04.559
good baseline model? We need to

00:54:02.639 --> 00:54:07.118
understand the proportion of these three

00:54:04.559 --> 00:54:10.559
classes of songs. So, we'll do a quick

00:54:07.119 --> 00:54:12.480
check. Turns out rock is 55%. So, if you

00:54:10.559 --> 00:54:13.599
had to just guess something just

00:54:12.480 --> 00:54:15.920
naively, you would just guess everything

00:54:13.599 --> 00:54:18.400
to be rock and you'd be right 55% of the

00:54:15.920 --> 00:54:20.159
time. Uh, so now, by the way, the

00:54:18.400 --> 00:54:21.680
target variable which tells you

00:54:20.159 --> 00:54:24.639
which of these three genres it

00:54:21.679 --> 00:54:26.318
is, uh, is actually a categorical

00:54:24.639 --> 00:54:29.598
variable. So we need to one hot encode

00:54:26.318 --> 00:54:32.000
that right. Um so we'll just turn that

00:54:29.599 --> 00:54:34.559
this way using the pandas get_dummies

00:54:32.000 --> 00:54:35.920
function. And when we do that uh this is

00:54:34.559 --> 00:54:37.200
y train which contains the dependent

00:54:35.920 --> 00:54:40.800
variable. And you can see that is one

00:54:37.199 --> 00:54:42.719
hot encoded now. Uh 0 1 0 0 1 0 0 1 and

00:54:40.800 --> 00:54:44.960
so on and so forth. That's it. So I

00:54:42.719 --> 00:54:46.799
think the first is, I forget, rock,

00:54:44.960 --> 00:54:48.400
hip-hop, rock, pop or whatever. It's in

00:54:46.800 --> 00:54:50.800
some order. We'll we'll get to that

00:54:48.400 --> 00:54:52.559
later. So it's one hot encoded as well.

00:54:50.800 --> 00:54:54.240
So that is as far as the data

00:54:52.559 --> 00:54:55.680
downloading and setup is concerned. Any

00:54:54.239 --> 00:54:57.439
questions?

00:54:55.679 --> 00:54:58.960
>> Yeah.

00:54:57.440 --> 00:55:01.440
>> Uh this kind of goes back to the

00:54:58.960 --> 00:55:04.000
transfer learning concept. But do you

00:55:01.440 --> 00:55:06.079
always want to build your corpus based

00:55:04.000 --> 00:55:08.000
off of the vocabulary of your training

00:55:06.079 --> 00:55:10.559
data or could you have like a

00:55:08.000 --> 00:55:13.679
precompiled one, like somebody's already made

00:55:10.559 --> 00:55:15.280
like a list of the 50,000 words?

00:55:13.679 --> 00:55:16.558
>> That's a really good question. Uh

00:55:15.280 --> 00:55:20.240
unfortunately I'm going to punt on it

00:55:16.559 --> 00:55:22.240
for the moment because um with modern

00:55:20.239 --> 00:55:25.039
large language models a number of these

00:55:22.239 --> 00:55:27.039
NLP tasks for which you had to sort of

00:55:25.039 --> 00:55:29.759
roll your own and build your own thing

00:55:27.039 --> 00:55:31.838
can now be very easily done using large

00:55:29.760 --> 00:55:33.520
language models without even any further

00:55:31.838 --> 00:55:34.639
training.

00:55:33.519 --> 00:55:35.759
The price you pay for it is that you have to

00:55:34.639 --> 00:55:37.759
use a large language model which means

00:55:35.760 --> 00:55:38.800
you have to pay somebody an API call and

00:55:37.760 --> 00:55:41.760
things like that and there are other

00:55:38.800 --> 00:55:43.920
issues with it. uh but

00:55:41.760 --> 00:55:46.319
we'll talk a lot about transfer learning

00:55:43.920 --> 00:55:48.559
for text when we come to a little later

00:55:46.318 --> 00:55:52.279
in the NLP sequence. So if I forget

00:55:48.559 --> 00:55:52.280
please bring it up again.

00:55:53.358 --> 00:55:58.159
>> Yeah.

00:55:54.880 --> 00:56:00.880
>> Um quick clarification on the encoded

00:55:58.159 --> 00:56:03.440
vector. It's stored as floats, not ints.

00:56:00.880 --> 00:56:05.599
If it gets incredibly long wouldn't that

00:56:03.440 --> 00:56:06.559
eat into compute time? Is there a reason

00:56:05.599 --> 00:56:09.119
why it's floats?

00:56:06.559 --> 00:56:11.359
>> Yeah. So uh the question is that when I

00:56:09.119 --> 00:56:13.200
showed you that tensor, it is

00:56:11.358 --> 00:56:14.639
actually written as a continuous

00:56:13.199 --> 00:56:16.399
number, right, a floating point

00:56:14.639 --> 00:56:18.159
number but we know these are one zeros

00:56:16.400 --> 00:56:20.240
and ones so why can't we why do we have

00:56:18.159 --> 00:56:21.519
to waste compute capacity by telling the

00:56:20.239 --> 00:56:23.118
computer that these are all big

00:56:21.519 --> 00:56:25.199
continuous numbers when it's just a zero

00:56:23.119 --> 00:56:26.559
one there are ways to optimize that but

00:56:25.199 --> 00:56:28.960
these problems are so small we just

00:56:26.559 --> 00:56:30.319
don't worry about it but when we come to

00:56:28.960 --> 00:56:34.079
something called parameter efficient

00:56:30.318 --> 00:56:35.838
fine-tuning lecture maybe 10ish uh we

00:56:34.079 --> 00:56:38.318
actually exploit that particular fact to

00:56:35.838 --> 00:56:38.318
make things faster

00:56:38.480 --> 00:56:43.519
Okay, so that's what we have. Uh, so

00:56:41.199 --> 00:56:46.000
we'll do the bag of words model.

00:56:43.519 --> 00:56:47.119
Um, by the way, there's a whole bunch of

00:56:46.000 --> 00:56:49.199
stuff here. It just repeats what I've

00:56:47.119 --> 00:56:50.880
been telling you in the lecture. So feel

00:56:49.199 --> 00:56:54.000
free to read it again, but we can ignore

00:56:50.880 --> 00:56:55.920
it for the moment. And now there's a new

00:56:54.000 --> 00:56:58.159
thing we are doing here. So we are

00:56:55.920 --> 00:57:00.159
basically saying, look, instead of

00:56:58.159 --> 00:57:03.519
taking every word you see in these

00:57:00.159 --> 00:57:05.358
49,000 uh songs in the training corpus,

00:57:03.519 --> 00:57:09.119
uh, it's going to be too many words.

00:57:05.358 --> 00:57:11.679
just pick the 5,000 most frequent words

00:57:09.119 --> 00:57:15.039
and that's what this max tokens stands

00:57:11.679 --> 00:57:18.719
for. Okay. And so we tell it uh all

00:57:15.039 --> 00:57:20.798
right do this thing max tokens 5,000

00:57:18.719 --> 00:57:22.318
sorry not 50,000 5,000 and still do

00:57:20.798 --> 00:57:24.318
multi-hot and we are not explicitly

00:57:22.318 --> 00:57:25.599
saying the standardization and all that

00:57:24.318 --> 00:57:29.119
stuff because the defaults are what

00:57:25.599 --> 00:57:30.960
we're going with. Okay. Yeah.
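A minimal pure-Python sketch of what this multi-hot, most-frequent-words setup does (build_vocab and multi_hot are hypothetical helpers, not the Keras API; Keras's TextVectorization layer additionally reserves vocabulary slots for padding and an [UNK] token, so real indices differ):

```python
from collections import Counter

def build_vocab(corpus, max_tokens):
    # Count every word in the corpus; keep the max_tokens most frequent.
    counts = Counter(w for text in corpus for w in text.lower().split())
    return [w for w, _ in counts.most_common(max_tokens)]

def multi_hot(text, vocab):
    # 1 if the vocabulary word occurs anywhere in the text, else 0;
    # word order and counts are discarded — the bag-of-words idea.
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(corpus, max_tokens=5)
vec = multi_hot("the cat", vocab)
```

The vector is always exactly max_tokens long, no matter how long the song is, which is why the model input shape below is fixed.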

00:57:29.119 --> 00:57:32.798
This is for making it more efficient.

00:57:30.960 --> 00:57:36.639
Like this is like don't waste your time

00:57:32.798 --> 00:57:39.358
on these thousands of rare words.

00:57:36.639 --> 00:57:40.239
Just focus on the common ones to make it

00:57:39.358 --> 00:57:42.318
more efficient.

00:57:40.239 --> 00:57:44.000
>> Make more efficient. But there is a

00:57:42.318 --> 00:57:46.400
related and important point which is

00:57:44.000 --> 00:57:49.599
that fundamentally the number of tokens

00:57:46.400 --> 00:57:51.760
you allow this layer to have dictates

00:57:49.599 --> 00:57:53.680
the size of your vocabulary and the size

00:57:51.760 --> 00:57:56.079
of your vocabulary dictates the size of

00:57:53.679 --> 00:57:57.358
the vector that you feed in. So shorter

00:57:56.079 --> 00:57:59.039
vectors are better than longer vectors.

00:57:57.358 --> 00:58:00.639
That's the efficiency point. The other

00:57:59.039 --> 00:58:02.719
point is that the longer the input

00:58:00.639 --> 00:58:04.400
vector, the more the number of

00:58:02.719 --> 00:58:06.558
parameters the network has to learn

00:58:04.400 --> 00:58:08.480
because the first layer itself is the

00:58:06.559 --> 00:58:10.000
size of the input times roughly times

00:58:08.480 --> 00:58:11.199
the size of the hidden layer. So this

00:58:10.000 --> 00:58:13.039
thing becomes 10 times as long. You have

00:58:11.199 --> 00:58:15.439
10 times as many parameters to learn and

00:58:13.039 --> 00:58:17.199
given a finite amount of data, right?

00:58:15.440 --> 00:58:18.400
The more parameters you have, the worse

00:58:17.199 --> 00:58:19.679
it's going to do when you actually start

00:58:18.400 --> 00:58:21.200
using it in the real world. It's going

00:58:19.679 --> 00:58:24.000
to overfit heavily. That's why you

00:58:21.199 --> 00:58:25.679
need to be very careful.

00:58:24.000 --> 00:58:27.519
Okay.

00:58:25.679 --> 00:58:29.440
Yeah.

00:58:27.519 --> 00:58:31.358
So, um, you downloaded the data set, but

00:58:29.440 --> 00:58:33.760
are you still using the vocabulary the

00:58:31.358 --> 00:58:35.598
17 words or did you

00:58:33.760 --> 00:58:36.720
>> No, no, I'm that was just for fun. I'm

00:58:35.599 --> 00:58:38.960
going to actually build a vocabulary

00:58:36.719 --> 00:58:41.838
now. It's coming. Yeah, good question.

00:58:38.960 --> 00:58:43.599
Yeah. So, all right, let's do that. Um,

00:58:41.838 --> 00:58:46.000
so I first, you know, I defined this

00:58:43.599 --> 00:58:47.599
layer. Uh, okay. I just defined it. All

00:58:46.000 --> 00:58:49.760
right. Now we actually build the

00:58:47.599 --> 00:58:53.519
vocabulary by essentially telling it to

00:58:49.760 --> 00:58:56.640
adapt the layer using essentially the

00:58:53.519 --> 00:58:58.719
full, basically all 49,000 songs in

00:58:56.639 --> 00:59:01.679
the training data set right that's a

00:58:58.719 --> 00:59:02.798
long list of songs; as far as Keras is

00:59:01.679 --> 00:59:04.879
concerned you're just looking for a list

00:59:02.798 --> 00:59:06.159
of strings so you just give it the list

00:59:04.880 --> 00:59:09.200
of strings instead of four we're giving

00:59:06.159 --> 00:59:11.358
it 49,000 the same uh philosophy applies

00:59:09.199 --> 00:59:12.879
so we run it

00:59:11.358 --> 00:59:15.039
it's obviously going to take a few

00:59:12.880 --> 00:59:17.280
seconds to do that because it's 49,000

00:59:15.039 --> 00:59:19.039
songs

00:59:17.280 --> 00:59:21.519
five seconds. Uh, all right. Let's look

00:59:19.039 --> 00:59:23.759
at the most common 20,

00:59:21.519 --> 00:59:26.318
right? We get the vocabulary from our

00:59:23.760 --> 00:59:27.839
layer. See, once you adapt the layer and

00:59:26.318 --> 00:59:29.358
has built a vocabulary, the layer is

00:59:27.838 --> 00:59:31.279
sort of been populated with all this

00:59:29.358 --> 00:59:34.719
information. So, you can query it. So,

00:59:31.280 --> 00:59:37.040
you can get the vocab top 20 words, the

00:59:34.719 --> 00:59:39.039
most frequent words, no surprise: "you", "I",

00:59:37.039 --> 00:59:41.039
blah, blah, blah. Uh, let's look at the

00:59:39.039 --> 00:59:43.599
last few.

00:59:41.039 --> 00:59:46.599
Dagger cheddar

00:59:43.599 --> 00:59:46.599
verified

00:59:46.798 --> 00:59:51.199
moving on

00:59:48.880 --> 00:59:52.960
right and then we so once we have done

00:59:51.199 --> 00:59:55.439
that now we actually can vectorize all

00:59:52.960 --> 00:59:57.039
the data sets we have using this and by

00:59:55.440 --> 00:59:59.119
vectorize you mean take every string and

00:59:57.039 --> 01:00:00.400
create the multihot encoded vector from

00:59:59.119 --> 01:00:02.480
it uh yeah

01:00:00.400 --> 01:00:05.358
>> Are we doing standardization? Because we're keeping

01:00:02.480 --> 01:00:07.119
stuff like "the", "a", etc. Yeah, we are not

01:00:05.358 --> 01:00:09.598
strictly doing standardization, or to put it

01:00:07.119 --> 01:00:12.000
differently, standardization typically has

01:00:09.599 --> 01:00:14.960
lowercasing, stripping punctuation,

01:00:12.000 --> 01:00:16.798
stemming, stop word removal; here the

01:00:14.960 --> 01:00:18.639
default in Keras happens to not do

01:00:16.798 --> 01:00:20.000
stemming not do stop word removal so

01:00:18.639 --> 01:00:22.078
we're just going with the default thanks

01:00:20.000 --> 01:00:23.519
for the clarification

01:00:22.079 --> 01:00:25.039
and in fact in practice what I find

01:00:23.519 --> 01:00:27.039
these days is that don't even bother to

01:00:25.039 --> 01:00:28.239
stem don't even bother to remove the

01:00:27.039 --> 01:00:31.119
stop words it's going to work well

01:00:28.239 --> 01:00:34.399
enough

01:00:31.119 --> 01:00:36.000
okay so all right uh okay so now Each

01:00:34.400 --> 01:00:38.639
phrase is a vector. How long is this

01:00:36.000 --> 01:00:41.039
vector? Each song is now a vector. How

01:00:38.639 --> 01:00:43.279
long is that vector?

01:00:41.039 --> 01:00:46.920
5,000. Correct. Because that is the size of the

01:00:43.280 --> 01:00:46.920
vocabulary. Correct.

01:00:47.199 --> 01:00:51.679
It's max tokens long, which is 5,000. So

01:00:49.599 --> 01:00:52.960
if you actually look at X Oh, wait,

01:00:51.679 --> 01:00:56.358
wait, wait, wait, wait. I haven't done

01:00:52.960 --> 01:00:56.358
this thing yet.

01:00:57.838 --> 01:01:02.400
It's going through 49,000. It's going

01:00:59.599 --> 01:01:04.400
through another what? 23,000. Fine. So

01:01:02.400 --> 01:01:06.798
let's run it.

01:01:04.400 --> 01:01:09.200
Okay, now we can see X train which is

01:01:06.798 --> 01:01:12.960
all the training data you have has is a

01:01:09.199 --> 01:01:18.039
tensor, a table with 48,991 rows, and

01:01:12.960 --> 01:01:18.039
each row is a 5,000 long vector.

01:01:18.079 --> 01:01:23.280
All right, good. Now we will try the

01:01:20.559 --> 01:01:28.240
simple neural network that we wrote up

01:01:23.280 --> 01:01:31.359
in class. So and now at this point this

01:01:28.239 --> 01:01:34.078
code should be sort of second nature,

01:01:31.358 --> 01:01:36.159
right? Isn't that cool? It's so easy to

01:01:34.079 --> 01:01:39.280
write the write the thing the power of

01:01:36.159 --> 01:01:41.279
abstraction. So uh we take a Keras Input

01:01:39.280 --> 01:01:42.720
as usual input layer we tell it what is

01:01:41.280 --> 01:01:44.480
the size of each thing that's coming in.

01:01:42.719 --> 01:01:46.480
Well the size of each thing is a max

01:01:44.480 --> 01:01:48.880
tokens long vector. So we tell it the

01:01:46.480 --> 01:01:51.119
shape is max tokens and then we run it

01:01:48.880 --> 01:01:54.160
through a dense layer with eight ReLUs.

01:01:51.119 --> 01:01:56.079
Okay I'm hurrying.

01:01:54.159 --> 01:01:58.000
So we get the outputs then we string the

01:01:56.079 --> 01:01:59.680
inputs and the outputs into a model and

01:01:58.000 --> 01:02:02.239
then we summarize the model. That's it.

01:01:59.679 --> 01:02:04.639
So we go here and this has 40,000

01:02:02.239 --> 01:02:08.239
parameters and you can see here right

01:02:04.639 --> 01:02:10.239
when you go from the input the 5,000 * 8

01:02:08.239 --> 01:02:11.838
that gives you 40,000 plus the eight

01:02:10.239 --> 01:02:15.039
neurons have a bias coming in that's

01:02:11.838 --> 01:02:17.119
another eight, so you get 40,008. Okay,
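The parameter arithmetic just stated can be checked directly:

```python
# Parameter count of the first dense layer: one weight per
# input-times-neuron pair, plus one bias per neuron.
max_tokens = 5000   # length of each multi-hot input vector
hidden = 8          # neurons in the dense layer
n_params = max_tokens * hidden + hidden
print(n_params)  # 40008
```

This is also why shrinking the vocabulary shrinks the network: the first layer's parameter count scales linearly with the input vector length.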

01:02:15.039 --> 01:02:20.159
and we compile it as usual; we use Adam

01:02:17.119 --> 01:02:23.760
as usual and because now the the output

01:02:20.159 --> 01:02:27.039
y variable the y train variable is now

01:02:23.760 --> 01:02:29.599
it itself is actually one hot encoded

01:02:27.039 --> 01:02:31.440
right 0 1 0 0 1 depending on pop rock

01:02:29.599 --> 01:02:33.519
and so on and so forth. We don't use

01:02:31.440 --> 01:02:35.119
sparse categorical cross entropy. We

01:02:33.519 --> 01:02:38.000
just use plain old categorical cross

01:02:35.119 --> 01:02:40.318
entropy here. Okay. And this was

01:02:38.000 --> 01:02:42.400
explained in lecture last week. So you

01:02:40.318 --> 01:02:44.318
can revisit it if uh if it's if it's not

01:02:42.400 --> 01:02:46.400
familiar. We again report accuracy,

01:02:44.318 --> 01:02:48.558
right? So let's compile it. And we've

01:02:46.400 --> 01:02:50.798
got a model. So we just run it for 10

01:02:48.559 --> 01:02:52.640
epochs with a batch size of 32. And

01:02:50.798 --> 01:02:53.838
because we have validation data already

01:02:52.639 --> 01:02:55.679
supplied to us, we don't have to tell

01:02:53.838 --> 01:02:58.159
Keras take the training data and keep

01:02:55.679 --> 01:02:59.519
20% of it aside for validation. We can

01:02:58.159 --> 01:03:04.000
literally tell it what validation to

01:02:59.519 --> 01:03:06.798
use. That's what we're doing here. Okay.

01:03:04.000 --> 01:03:09.119
All right. So, it's running.

01:03:06.798 --> 01:03:12.599
Um,

01:03:09.119 --> 01:03:12.599
it's pretty fast.

01:03:16.318 --> 01:03:20.480
Any questions so far?

01:03:18.159 --> 01:03:23.519
>> Yes.

01:03:20.480 --> 01:03:25.358
>> The microphone.

01:03:23.519 --> 01:03:27.679
>> How do we decide the max tokens? Like we

01:03:25.358 --> 01:03:29.038
define the number of 5,000 here but we

01:03:27.679 --> 01:03:29.919
do not know how many words would be

01:03:29.039 --> 01:03:31.200
there in the entire text.

01:03:29.920 --> 01:03:32.720
>> Yeah. So it's a good question. How do

01:03:31.199 --> 01:03:34.399
you decide on this the maximum

01:03:32.719 --> 01:03:36.480
vocabulary? What you typically do in

01:03:34.400 --> 01:03:38.240
practice is that you actually you do it

01:03:36.480 --> 01:03:40.079
without the max tokens and then you see

01:03:38.239 --> 01:03:41.838
how long the vocabulary is and then you

01:03:40.079 --> 01:03:43.839
actually get statistics on how

01:03:41.838 --> 01:03:45.279
frequently the very infrequent words

01:03:43.838 --> 01:03:47.279
actually show up. And then you'll

01:03:45.280 --> 01:03:49.599
typically see like a dramatic fall-off

01:03:47.280 --> 01:03:54.119
at some point and you pick that fall-off

01:03:49.599 --> 01:03:54.119
point and then set that to be the max.
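The fall-off heuristic just described can be sketched in a few lines (frequency_falloff is a hypothetical helper and the toy corpus is made up; in practice you would plot these ranked counts and pick the knee of the curve as max_tokens):

```python
from collections import Counter

def frequency_falloff(corpus):
    # Word counts in descending rank order; in real corpora these drop
    # off sharply, and the drop-off point suggests a max_tokens cutoff.
    counts = Counter(w for text in corpus for w in text.lower().split())
    return [c for _, c in counts.most_common()]

freqs = frequency_falloff(["the the the cat cat sat", "the dog"])
```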

01:03:54.960 --> 01:04:01.599
Uh all right. So perfect. Let's test it.

01:03:58.719 --> 01:04:05.358
Uh accuracy is pretty good. 87% on the

01:04:01.599 --> 01:04:09.280
training and 73 on the validation. We'll

01:04:05.358 --> 01:04:11.440
do it on the test set. All right. 72%.
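The naive majority-class baseline being compared against here can be computed directly (the 55/25/20 split below is illustrative, matching the roughly 55% rock share mentioned earlier; the other two proportions are made up):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Always predict the most common class; accuracy equals
    # that class's share of the data.
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)

labels = ["rock"] * 55 + ["pop"] * 25 + ["hip-hop"] * 20
acc = majority_baseline_accuracy(labels)
```

Any model worth keeping has to beat this number, which is why 72% on the test set is a meaningful result.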

01:04:09.280 --> 01:04:13.200
So we saw earlier the the largest class

01:04:11.440 --> 01:04:15.358
of the three is rock with around

01:04:13.199 --> 01:04:17.279
50%. So the naive model is going to get

01:04:15.358 --> 01:04:19.279
50% accuracy and this little neural

01:04:17.280 --> 01:04:22.160
network model gets you 72%, which is

01:04:19.280 --> 01:04:23.839
pretty nice. Okay. So now let's actually

01:04:22.159 --> 01:04:26.798
kick it up a notch and make it slightly

01:04:23.838 --> 01:04:29.358
more capable. So the key thing here is

01:04:26.798 --> 01:04:31.119
that, as has been observed in

01:04:29.358 --> 01:04:33.759
class already when you go with a bag of

01:04:31.119 --> 01:04:35.358
words model we lose all notion of order

01:04:33.760 --> 01:04:38.000
right the word order clearly matters and

01:04:35.358 --> 01:04:40.400
we're kind of ignoring it. So what we do

01:04:38.000 --> 01:04:42.079
to get around it is um so actually this

01:04:40.400 --> 01:04:44.720
actually really interesting uh sentence

01:04:42.079 --> 01:04:46.640
here. Let's say this is a movie review.

01:04:44.719 --> 01:04:48.639
Kate Winslet's performance as a

01:04:46.639 --> 01:04:50.639
detective trying to solve a terrible

01:04:48.639 --> 01:04:52.879
crime in a small Pennsylvania town is

01:04:50.639 --> 01:04:55.038
anything but disappointing.

01:04:52.880 --> 01:04:56.160
Tricky tricky thing, right? Because if

01:04:55.039 --> 01:04:58.400
you look at the words separately, the

01:04:56.159 --> 01:05:01.358
words terrible and disappointing signal

01:04:58.400 --> 01:05:04.000
negative sentiment, right? But then if

01:05:01.358 --> 01:05:06.318
you actually know that the word terrible

01:05:04.000 --> 01:05:08.559
refers to the crime, not to the

01:05:06.318 --> 01:05:09.440
movie or anything but disappointing

01:05:08.559 --> 01:05:10.798
changes the meaning of the word

01:05:09.440 --> 01:05:12.639
disappointing, you will see obviously

01:05:10.798 --> 01:05:14.719
it's a positive review, right? So

01:05:12.639 --> 01:05:17.679
clearly the words around the

01:05:14.719 --> 01:05:20.558
word provide valuable clues as to how to

01:05:17.679 --> 01:05:23.519
interpret that word. And so what we do

01:05:20.559 --> 01:05:25.599
is how can we make our little model a

01:05:23.519 --> 01:05:27.599
bit more capable of recognizing the

01:05:25.599 --> 01:05:29.680
context around every word. And the way

01:05:27.599 --> 01:05:32.960
we do it is something called bigrams.

01:05:29.679 --> 01:05:34.318
Okay. And for bigrams, what we

01:05:32.960 --> 01:05:36.960
basically do is instead of taking

01:05:34.318 --> 01:05:39.599
instead of just taking each word we take

01:05:36.960 --> 01:05:42.240
each word and we further take every pair

01:05:39.599 --> 01:05:44.400
of adjacent words

01:05:42.239 --> 01:05:47.279
and those become our tokens and because

01:05:44.400 --> 01:05:49.440
we take two adjacent words, right, they are

01:05:47.280 --> 01:05:51.680
called bigrams; you can take three adjacent

01:05:49.440 --> 01:05:54.480
words, trigrams; you get the idea,

01:05:51.679 --> 01:05:56.719
n-grams. Okay, so that's the idea of bigrams.
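The unigram-plus-bigram tokenization can be sketched in plain Python (ngrams here is a hypothetical helper that mimics, but does not call, what Keras's TextVectorization produces with ngrams=2):

```python
def ngrams(text, n=2):
    # Emit every run of 1..n adjacent words: for n=2 that is
    # all the unigrams plus all the bigrams.
    words = text.lower().split()
    out = []
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            out.append(" ".join(words[i:i + size]))
    return out

tokens = ngrams("the cat sat on the mat", n=2)
```

Six words yield six unigrams and five bigrams, eleven tokens total; setting n=3 would add trigrams on top.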

01:05:54.480 --> 01:05:59.920
and so um so for example if you had the

01:05:56.719 --> 01:06:03.519
cat sat on the mat,

01:05:59.920 --> 01:06:05.680
you will have "the cat", "cat sat", you

01:06:03.519 --> 01:06:07.679
get the idea right uh that's what we

01:06:05.679 --> 01:06:09.279
have so let's do a little example and

01:06:07.679 --> 01:06:12.399
Keras makes it very easy; you literally

01:06:09.280 --> 01:06:15.119
tell it ngrams equals 2,

01:06:12.400 --> 01:06:16.960
bigrams, and now from this you

01:06:15.119 --> 01:06:19.358
immediately should know that

01:06:16.960 --> 01:06:23.039
ngrams equals 1 is the default; that's why

01:06:19.358 --> 01:06:25.440
we didn't have to specify it okay so you

01:06:23.039 --> 01:06:27.520
run it and then you do

01:06:25.440 --> 01:06:29.119
"the cat sat on the mat" is your training corpus

01:06:27.519 --> 01:06:31.280
and then you get the vocabulary and you

01:06:29.119 --> 01:06:34.160
can see here, right? It has created all

01:06:31.280 --> 01:06:35.920
these nice bigrams for you. And so

01:06:34.159 --> 01:06:37.679
that's it. All right. Now, what we do is

01:06:35.920 --> 01:06:39.680
we'll go back to the songs and we

01:06:37.679 --> 01:06:41.919
actually tell Keras to not just take

01:06:39.679 --> 01:06:43.519
each word, but take all the bigrams as

01:06:41.920 --> 01:06:45.440
well. And hopefully you'll do a better

01:06:43.519 --> 01:06:47.759
job, right, of figuring out what the

01:06:45.440 --> 01:06:49.679
sentiment is. And now because you know

01:06:47.760 --> 01:06:51.680
when you have when you when you say,

01:06:49.679 --> 01:06:53.919
okay, take the top 5,000 words, that's

01:06:51.679 --> 01:06:56.078
great for single words, unigrams as they are

01:06:56.079 --> 01:06:59.599
called. But when you have bigrams, you

01:06:56.079 --> 01:06:59.599
have 5,000 possibilities for the first

01:06:57.679 --> 01:07:01.279
word, maybe 5,000 for the second word,

01:06:59.599 --> 01:07:03.280
right? That's a lot of possibilities. 25

01:07:01.280 --> 01:07:04.240
million. Now, most of the 25 million

01:07:03.280 --> 01:07:05.680
possibilities are not going to show up

01:07:04.239 --> 01:07:07.199
in the data. So, you don't need to

01:07:05.679 --> 01:07:08.798
actually make it much larger, but you

01:07:07.199 --> 01:07:11.038
should make the vocabulary a bit more

01:07:08.798 --> 01:07:13.759
than 5,000. So, here we go with say

01:07:11.039 --> 01:07:16.160
20,000, right? Otherwise, it's the same.

01:07:13.760 --> 01:07:18.240
Still multi-hot. So, let's run it. And

01:07:16.159 --> 01:07:20.000
now we will run this. Now that the layer

01:07:18.239 --> 01:07:21.519
has been set up with all the right

01:07:20.000 --> 01:07:24.079
settings, we'll ask it to create the

01:07:21.519 --> 01:07:25.679
vocabulary. Okay? again by doing exactly

01:07:24.079 --> 01:07:28.680
what we did before. Create the

01:07:25.679 --> 01:07:28.679
vocabulary

01:07:30.000 --> 01:07:33.000
seconds

01:07:42.000 --> 01:07:46.639
With bigrams or trigrams, all of this gets much

01:07:44.480 --> 01:07:48.400
more compute intensive; that's why

01:07:46.639 --> 01:07:51.358
you're seeing this. So all right let's

01:07:48.400 --> 01:07:53.200
look at the first 10 words. The first 10

01:07:51.358 --> 01:07:54.639
words are all just single words and

01:07:53.199 --> 01:07:55.598
that's not surprising because the single

01:07:54.639 --> 01:07:59.519
words are going to be the more

01:07:55.599 --> 01:08:02.559
frequent, right?

01:07:59.519 --> 01:08:08.038
and then the last few

01:08:02.559 --> 01:08:08.039
your mom your god you short you hell

01:08:09.920 --> 01:08:15.838
all right let's just uh you know uh

01:08:13.039 --> 01:08:17.520
index all the data we have, the

01:08:15.838 --> 01:08:19.920
training validation test sets using this

01:08:17.520 --> 01:08:19.920
vocabulary

01:08:23.198 --> 01:08:26.479
Perfect. Now we come to our second model

01:08:24.798 --> 01:08:28.479
where we say the shape the incoming

01:08:26.479 --> 01:08:30.238
shape is now 20,000 long right because

01:08:28.479 --> 01:08:32.718
we increase max tokens from 5,000 to

01:08:30.238 --> 01:08:35.198
20,000. So each thing is a 20,000 long

01:08:32.719 --> 01:08:37.198
vector otherwise it's the same and now

01:08:35.198 --> 01:08:38.639
we will use this thing called dropout

01:08:37.198 --> 01:08:41.119
for the first time which is a

01:08:38.640 --> 01:08:43.440
regularization thing that I have referred

01:08:41.119 --> 01:08:45.439
to earlier that I never really described

01:08:43.439 --> 01:08:47.119
and I will describe today if we have

01:08:45.439 --> 01:08:49.439
time but I'll first run through the

01:08:47.119 --> 01:08:50.960
whole demo. So just you know just you

01:08:49.439 --> 01:08:52.879
can just you think of dropout as just

01:08:50.960 --> 01:08:54.079
another layer you can insert and it's

01:08:52.880 --> 01:08:56.798
essentially a great way to prevent

01:08:54.079 --> 01:08:58.798
overfitting. So I just routinely will

01:08:56.798 --> 01:09:00.719
use it and I'll talk more about it. So

01:08:58.798 --> 01:09:02.399
for now you have this dropout layer in

01:09:00.719 --> 01:09:04.319
the middle. It receives the input from

01:09:02.399 --> 01:09:05.278
the dense layer and then sends it to the

01:09:04.319 --> 01:09:07.039
output layer. The output layer is

01:09:05.279 --> 01:09:10.319
unchanged. It's a three-way softmax.

01:09:07.039 --> 01:09:11.838
Same model as before. Okay. And now uh

01:09:10.319 --> 01:09:13.279
all right we'll come back to drop out.

01:09:11.838 --> 01:09:15.198
So we'll compile it the same way as

01:09:13.279 --> 01:09:17.440
before and then we will we will I will

01:09:15.198 --> 01:09:19.198
just fit it for three epochs. Um if

01:09:17.439 --> 01:09:20.479
you're interested after class later on

01:09:19.198 --> 01:09:22.879
you can actually try it for more epochs

01:09:20.479 --> 01:09:23.838
and see if it does better. Uh for now in

01:09:22.880 --> 01:09:27.000
the interest of time we'll just do it

01:09:23.838 --> 01:09:27.000
for three

01:09:29.838 --> 01:09:32.838
right

01:09:36.560 --> 01:09:43.440
I think 72% right was the uh the single

01:09:39.520 --> 01:09:45.279
word unigram thing we had.

01:09:43.439 --> 01:09:47.358
>> If you're rerunning this code with the

01:09:45.279 --> 01:09:49.120
same number of Do you ever expect the

01:09:47.359 --> 01:09:51.759
accuracy to change?

01:09:49.119 --> 01:09:53.679
>> Um if you had to run this code on

01:09:51.759 --> 01:09:55.359
your machine, you would expect it to be

01:09:53.679 --> 01:09:57.119
roughly the same, but there are some

01:09:55.359 --> 01:09:58.238
minute differences due to hardware and

01:09:57.119 --> 01:09:59.920
device drivers.

01:09:58.238 --> 01:10:02.959
>> If you rerun it on your own machine

01:09:59.920 --> 01:10:05.039
twice, would you expect a change?

01:10:02.960 --> 01:10:07.198
>> That's actually a very tricky question.

01:10:05.039 --> 01:10:09.679
Uh because it depends on what else I

01:10:07.198 --> 01:10:11.759
have been doing in that notebook.

01:10:09.679 --> 01:10:13.840
If I start fresh and do nothing but

01:10:11.760 --> 01:10:15.199
that, typically I get the same numbers

01:10:13.840 --> 01:10:19.000
typically. But for some reason I don't

01:10:15.198 --> 01:10:19.000
get it exactly the same.

01:10:19.359 --> 01:10:25.559
Okay. So we come to this. Let's evaluate

01:10:22.000 --> 01:10:25.560
our little model.

01:10:25.840 --> 01:10:30.640
Okay. 75%. So it went from 72 to 75.

01:10:29.119 --> 01:10:32.960
It's actually a meaningful jump just by

01:10:30.640 --> 01:10:34.800
using bigrams. Okay. And I ran it only

01:10:32.960 --> 01:10:36.239
for three epochs. If you run it for 10,

01:10:34.800 --> 01:10:38.960
maybe it's going to do even better. All

01:10:36.238 --> 01:10:40.639
right. So that is the beauty of this

01:10:38.960 --> 01:10:42.719
thing. Now let's just actually do a

01:10:40.640 --> 01:10:45.920
little demo. Uh we'll try to predict

01:10:42.719 --> 01:10:49.198
some lyrics. Okay, I'll try "Another One

01:10:45.920 --> 01:10:50.399
Bites the Dust."

01:10:49.198 --> 01:10:53.359
It's a rock song. I think that's

01:10:50.399 --> 01:10:55.839
correct. Yes. Okay. Okay, folks. Your

01:10:53.359 --> 01:11:00.359
turn now.

01:10:55.840 --> 01:11:00.360
Uh, somebody tell me your favorite song.

01:11:00.479 --> 01:11:05.519
>> "Dancing Queen" from ABBA.

01:11:03.039 --> 01:11:07.039
>> I love ABBA. That's awesome. All right.

01:11:05.520 --> 01:11:11.120
Okay.

01:11:07.039 --> 01:11:14.119
Uh, Dancing Queen

01:11:11.119 --> 01:11:14.119
Rex.

01:11:17.119 --> 01:11:20.158
Verse one, intro. I don't like that.

01:11:18.560 --> 01:11:23.480
Let's just go to something without all

01:11:20.158 --> 01:11:23.479
this metadata.

01:11:23.679 --> 01:11:26.679
Right.

01:11:27.359 --> 01:11:31.559
All right. I'll just take the first

01:11:28.238 --> 01:11:31.559
page. Okay.

01:11:40.479 --> 01:11:45.439
Are we good?

01:11:42.560 --> 01:11:49.560
All right,

01:11:45.439 --> 01:11:49.559
Run the model. Let's predict

01:11:50.319 --> 01:11:55.238
pop just about. Yay.

01:11:55.679 --> 01:12:00.399
All right. So, uh yeah. So, that's

01:11:58.238 --> 01:12:01.919
basically the model, but we have five

01:12:00.399 --> 01:12:03.599
minutes. I want to get back to you can

01:12:01.920 --> 01:12:05.440
play around and put your own lyrics in.

01:12:03.600 --> 01:12:07.600
Uh typically what happens is that the

01:12:05.439 --> 01:12:09.519
last two years that I've been doing this

01:12:07.600 --> 01:12:11.679
particular lecture, I've noticed that

01:12:09.520 --> 01:12:13.280
the songs are always rock songs for some

01:12:11.679 --> 01:12:14.800
reason.

01:12:13.279 --> 01:12:16.000
>> First time I'm getting a pop song from

01:12:14.800 --> 01:12:18.320
the from a group that I actually like.

01:12:16.000 --> 01:12:20.158
So thank you.

01:12:18.319 --> 01:12:22.000
Uh all right. Uh let's go back to

01:12:20.158 --> 01:12:24.879
dropout.

01:12:22.000 --> 01:12:26.560
So the idea here in dropout is that you

01:12:24.880 --> 01:12:28.079
know you have all these the input comes

01:12:26.560 --> 01:12:30.719
in, it goes through a hidden layer and

01:12:28.079 --> 01:12:33.760
so on and so forth. What the dropout? So

01:12:30.719 --> 01:12:35.198
dropout is a layer and you put this

01:12:33.760 --> 01:12:37.600
layer just like you use any other layer.

01:12:35.198 --> 01:12:38.960
And what dropout does is that it takes

01:12:37.600 --> 01:12:41.360
all the things that are coming into it

01:12:38.960 --> 01:12:43.760
from the previous layer and randomly

01:12:41.359 --> 01:12:46.079
decides to replace that number with a

01:12:43.760 --> 01:12:48.239
zero.

01:12:46.079 --> 01:12:50.399
That's it. It drops that number and

01:12:48.238 --> 01:12:52.158
replace it with a zero. Okay? But it

01:12:50.399 --> 01:12:54.319
does it randomly. It basically tosses a

01:12:52.158 --> 01:12:55.839
coin, and if the coin comes up heads, zero.

01:12:54.319 --> 01:12:58.479
If it comes up tails, let it through.

01:12:55.840 --> 01:13:02.960
Pass it through. Okay? And the reason

01:12:58.479 --> 01:13:04.799
why this is very effective is because

01:13:02.960 --> 01:13:07.600
you can imagine all the neurons in a

01:13:04.800 --> 01:13:09.840
particular layer when they overfit to a

01:13:07.600 --> 01:13:11.360
particular data set the overfitting

01:13:09.840 --> 01:13:14.319
happens because the neurons essentially

01:13:11.359 --> 01:13:15.839
collude with each other right they sort

01:13:14.319 --> 01:13:17.439
of collude with each other to actually

01:13:15.840 --> 01:13:19.920
overfitit and predict things in sort of

01:13:17.439 --> 01:13:21.839
a very accurate way. So you want to

01:13:19.920 --> 01:13:24.480
break any sort of collusion between the

01:13:21.840 --> 01:13:26.239
neurons, right? I'm obviously using sort

01:13:24.479 --> 01:13:28.799
of like a, you know, game-theoretic way

01:13:26.238 --> 01:13:30.718
of describing it but the idea is that

01:13:28.800 --> 01:13:33.440
any kind of spurious correlations in

01:13:30.719 --> 01:13:36.079
your data neurons can pick it up by

01:13:33.439 --> 01:13:38.079
being correlated themselves.

01:13:36.079 --> 01:13:40.800
And so the way you avoid the spurious

01:13:38.079 --> 01:13:42.000
correlation is by dropping neurons

01:13:40.800 --> 01:13:44.159
randomly. You just kill the neuron

01:13:42.000 --> 01:13:45.520
randomly which means that no neuron can

01:13:44.158 --> 01:13:47.679
depend on another neuron being

01:13:45.520 --> 01:13:50.320
available.

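The coin-toss mechanic described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the Keras implementation: the transcript only mentions zeroing inputs at random, while the sketch below also rescales the survivors by 1/(1-p), the "inverted dropout" convention many real layers use so the expected activation is unchanged (that scaling is an assumption on my part, not something said in the lecture).

```python
import numpy as np

def dropout(x, p=0.5, rng=None):
    """Inverted-dropout sketch: zero each input with probability p,
    then scale survivors by 1/(1-p) so the expected sum is unchanged."""
    rng = rng or np.random.default_rng(0)  # fixed seed for reproducibility
    mask = rng.random(x.shape) >= p        # one "coin toss" per element
    return x * mask / (1.0 - p)

x = np.ones(10)
out = dropout(x, p=0.5)   # each entry is either 0.0 (dropped) or 2.0 (kept, rescaled)
```

At inference time a dropout layer does nothing, which is why the rescaling is done during training.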
01:13:47.679 --> 01:13:52.560
I know it's a bit grim but that's the

01:13:50.319 --> 01:13:54.399
basic idea of dropout and apparently the

01:13:52.560 --> 01:13:56.640
story goes that the person on

01:13:54.399 --> 01:13:58.238
the team that invented it, Geoff Hinton,

01:13:56.640 --> 01:13:59.679
who won the Turing Award for this stuff, not

01:13:58.238 --> 01:14:02.158
for dropout, just for deep

01:13:59.679 --> 01:14:03.279
learning um he said I don't know if it's

01:14:02.158 --> 01:14:05.519
true but he said that apparently he got

01:14:03.279 --> 01:14:07.759
the idea when he went to a bank and

01:14:05.520 --> 01:14:09.440
realized that you know very often the

01:14:07.760 --> 01:14:11.280
bank, the folks working in that bank

01:14:09.439 --> 01:14:13.119
branch that he used to go to kept

01:14:11.279 --> 01:14:14.800
changing

01:14:13.119 --> 01:14:16.000
right they were never sort of the same

01:14:14.800 --> 01:14:17.279
the people would be transferring in

01:14:16.000 --> 01:14:18.158
transferring out, and he was like, why

01:14:17.279 --> 01:14:19.519
can't they just leave these people

01:14:18.158 --> 01:14:21.519
alone? Why does it keep changing? And

01:14:19.520 --> 01:14:24.560
then he got the insight that maybe a lot

01:14:21.520 --> 01:14:26.080
of fraud happens because the person

01:14:24.560 --> 01:14:28.400
working in the branch colludes with the

01:14:26.079 --> 01:14:30.399
customer, but by changing the staff

01:14:28.399 --> 01:14:32.319
constantly, you break the risk of

01:14:30.399 --> 01:14:34.879
fraud happening. And that apparently was

01:14:32.319 --> 01:14:36.559
the genesis for this idea. True or

01:14:34.880 --> 01:14:40.480
apocryphal? I have no idea. But it's

01:14:36.560 --> 01:14:43.039
sort of a fun story. Uh yes,

01:14:40.479 --> 01:14:45.198
>> instead of random, if we go to the way

01:14:43.039 --> 01:14:47.039
historical models are built, concepts of

01:14:45.198 --> 01:14:50.079
multicollinearity and all of that, would that

01:14:47.039 --> 01:14:53.279
make it sharper as compared to this?

01:14:50.079 --> 01:14:56.238
>> The problem is that um these networks

01:14:53.279 --> 01:14:58.158
are massive, right? And for you to take

01:14:56.238 --> 01:14:59.839
each layer and look at its correlation

01:14:58.158 --> 01:15:01.679
with some other layer and so on and so

01:14:59.840 --> 01:15:04.319
forth. First of all, investigating

01:15:01.679 --> 01:15:05.359
multicollinearity is a problem. The

01:15:04.319 --> 01:15:08.559
second thing is okay, what do you do

01:15:05.359 --> 01:15:09.920
then? Next uh in linear regression you

01:15:08.560 --> 01:15:11.360
can do things like principal components

01:15:09.920 --> 01:15:12.960
analysis to get around it. Here

01:15:11.359 --> 01:15:14.719
everything is nonlinear. There is no

01:15:12.960 --> 01:15:16.000
easy way to solve the problem. So we are

01:15:14.719 --> 01:15:20.000
like we'll just solve the problem in one

01:15:16.000 --> 01:15:23.840
shot using dropout. That's all right. Um

01:15:20.000 --> 01:15:25.920
so I had uh some material on

01:15:23.840 --> 01:15:28.239
something called byte pair encoding

01:15:25.920 --> 01:15:30.319
which I will do when we

01:15:28.238 --> 01:15:31.759
get to LLMs and I stuck it in the end

01:15:30.319 --> 01:15:33.519
because I knew that we probably won't

01:15:31.760 --> 01:15:35.280
have enough time to cover this anyway.

01:15:33.520 --> 01:15:37.679
And that is a very clever tokenization

01:15:35.279 --> 01:15:40.079
scheme used by for example the GPT

01:15:37.679 --> 01:15:41.679
family and that allows them to do

01:15:40.079 --> 01:15:43.920
beautiful punctuation, keep the case

01:15:41.679 --> 01:15:45.679
intact and then use words that you just

01:15:43.920 --> 01:15:47.520
made up and things like that. Okay. So

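The byte pair encoding idea mentioned above can be shown with a toy sketch: start from characters and repeatedly merge the most frequent adjacent pair into a new token. This is only the core merge loop; real GPT tokenizers operate on bytes with a learned merge table and special handling of whitespace, none of which is reproduced here.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start from individual characters
for _ in range(3):                  # three merge rounds
    tokens = merge(tokens, most_frequent_pair(tokens))
# after a few merges, frequent substrings like "low" become single tokens
```

Because merges bottom out at characters (bytes, in the real thing), made-up words still tokenize: they just decompose into smaller learned pieces.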
01:15:45.679 --> 01:15:50.719
we have one more minute. I'm happy

01:15:47.520 --> 01:15:52.640
to answer any questions you might have.

01:15:50.719 --> 01:15:54.079
>> And so initially when we are picking

01:15:52.640 --> 01:15:57.119
like the hidden layer the number of

01:15:54.079 --> 01:15:59.279
neurons and width. So far in all the

01:15:57.119 --> 01:16:01.039
materials this has been given to us

01:15:59.279 --> 01:16:03.439
but initially how do you pick it? Is it

01:16:01.039 --> 01:16:03.920
more of a trial and error type of thing

01:16:03.439 --> 01:16:05.678
or

01:16:03.920 --> 01:16:07.520
>> it tends to be trial and error. Um so

01:16:05.679 --> 01:16:10.319
that's in fact what I did when I created

01:16:07.520 --> 01:16:12.400
the Colabs. So, um, and you can

01:16:10.319 --> 01:16:14.319
actually make it a bit more systematic

01:16:12.399 --> 01:16:16.719
by trying lots of different values and

01:16:14.319 --> 01:16:18.719
there is a particular package uh Python

01:16:16.719 --> 01:16:20.719
package called Keras tuner. So just

01:16:18.719 --> 01:16:22.640
Google Keras tuner and it comes with

01:16:20.719 --> 01:16:23.920
very nice Colabs and if I have a chance

01:16:22.640 --> 01:16:25.440
maybe I'll just record a screen

01:16:23.920 --> 01:16:27.119
walkthrough of doing that. But that's

01:16:25.439 --> 01:16:28.559
that's a very efficient way to do these

01:16:27.119 --> 01:16:29.359
things. And it comes under the broad

01:16:28.560 --> 01:16:31.920
category of something called

01:16:29.359 --> 01:16:33.519
hyperparameter optimization where the

01:16:31.920 --> 01:16:35.279
number of neurons, the activation you

01:16:33.520 --> 01:16:36.960
use, the learning rate, all those things

01:16:35.279 --> 01:16:39.039
can all be tried. You can try lots of

01:16:36.960 --> 01:16:42.000
variations and Keras Tuner is a great way to do

01:16:39.039 --> 01:16:45.238
it in the context of Keras.

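The trial-and-error the answer describes can be made systematic even without Keras Tuner. Below is a minimal random-search sketch in plain Python; the `evaluate` function is a hypothetical stand-in for "build a Keras model with this configuration, train it, return validation score", since actually training a model is out of scope here. Keras Tuner essentially wraps this loop with smarter search strategies and bookkeeping.

```python
import random

def evaluate(units, learning_rate):
    """Hypothetical stand-in for training a model and returning a
    validation score; a real search would fit a Keras model here."""
    return -abs(units - 64) / 64 - abs(learning_rate - 1e-3)

random.seed(0)  # make the sampled trials reproducible
search_space = {"units": [16, 32, 64, 128],
                "learning_rate": [1e-2, 1e-3, 1e-4]}

best_score, best_cfg = float("-inf"), None
for _ in range(10):                               # random search: 10 trials
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
```

The same loop generalizes to any hyperparameter mentioned in the answer: number of neurons, activation, learning rate, and so on, at the cost of one training run per trial.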
01:16:42.000 --> 01:16:45.238
Other questions?

01:16:45.920 --> 01:16:51.399
>> All right, I give you 30 seconds back.

01:16:47.359 --> 01:16:51.399
Thank you. See you tomorrow.
