English_srt/CS 285 Lecture 1, Part 2.srt

1
00:00:01,199 --> 00:00:06,580
So, why should we care about deep reinforcement learning?
In particular deep is in the title of this class

2
00:00:06,720 --> 00:00:09,220
So, let's talk about that a little bit.

3
00:00:10,000 --> 00:00:12,900
And to start this conversation, 
we'll start with a really big question.

4
00:00:13,000 --> 00:00:18,100
and this is a question that we'll come back 
to a few times in the first lecture module.

5
00:00:18,240 --> 00:00:24,320
How do we build intelligent machines?

6
00:00:21,840 --> 00:00:26,880
and by this, i really mean literally
what it says on at the top

7
00:00:24,880 --> 00:00:32,719
intelligent machines the kind of machines that we see in cartoons

8
00:00:28,960 --> 00:00:34,320
are robot butlers robot helpers that

9
00:00:32,719 --> 00:00:36,880
help with medical care

10
00:00:34,320 --> 00:00:38,079
or even science fiction robots that

11
00:00:36,880 --> 00:00:39,760
pilot starships

12
00:00:38,079 --> 00:00:41,120
or if you're a little bit more

13
00:00:39,760 --> 00:00:44,559
mischievously inclined

14
00:00:41,120 --> 00:00:46,079
evil robot villains.

15
00:00:44,559 --> 00:00:48,079
Intelligent machines have to be able to

16
00:00:46,079 --> 00:00:49,760
adapt they have to handle

17
00:00:48,079 --> 00:00:52,079
flexibly the complexity and

18
00:00:49,760 --> 00:00:54,000
unpredictability of the real world.

19
00:00:52,079 --> 00:00:56,079
If we wanted to build for instance an

20
00:00:54,000 --> 00:00:57,440
autonomous oil tanker

21
00:00:56,079 --> 00:00:59,920
that would not be very difficult to do 
today.

22
00:00:57,440 --> 00:01:00,559
While it's hard for humans to

23
00:00:59,920 --> 00:01:02,320
figure out

24
00:01:00,559 --> 00:01:04,239
how to navigate the ocean to reach a

25
00:01:02,320 --> 00:01:06,640
destination half a world away

26
00:01:04,239 --> 00:01:07,760
a combination of gps and motion planning

27
00:01:06,640 --> 00:01:09,360
can actually solve this problem

28
00:01:07,760 --> 00:01:11,439
reasonably well.

29
00:01:09,360 --> 00:01:12,799
However, most oil tankers still have

30
00:01:11,439 --> 00:01:14,479
human crews on board.

31
00:01:12,799 --> 00:01:15,680
Why is that? Well, it's because when

32
00:01:14,479 --> 00:01:17,040
something goes wrong when something

33
00:01:15,680 --> 00:01:18,320
breaks in the engine room,

34
00:01:17,040 --> 00:01:20,159
you really need a person to go down

35
00:01:18,320 --> 00:01:21,280
there and fix it.

36
00:01:20,159 --> 00:01:23,040
While navigating the oil tanker

37
00:01:21,280 --> 00:01:25,200
is comparatively not a very difficult

38
00:01:23,040 --> 00:01:27,520
artificial intelligence problem.

39
00:01:25,200 --> 00:01:29,280
Fixing something when it goes wrong with

40
00:01:27,520 --> 00:01:31,680
current technology is exceedingly difficult

41
00:01:32,159 --> 00:01:35,360
The difficulty really comes from the

42
00:01:33,520 --> 00:01:36,479
fact the real world is unstructured and

43
00:01:35,360 --> 00:01:38,400
unpredictable.

44
00:01:36,479 --> 00:01:40,079
And a very powerful technology for

45
00:01:38,400 --> 00:01:41,759
handling the unstructured unpredictable

46
00:01:40,079 --> 00:01:44,240
nature of the real world that we have at

47
00:01:41,759 --> 00:01:47,280
our disposal is deep learning.

48
00:01:44,240 --> 00:01:49,119
In deep learning, we train a very large

49
00:01:47,280 --> 00:01:51,040
heavily over parametrized model like a

50
00:01:49,119 --> 00:01:54,079
deep neural network to map inputs to outputs.

51
00:01:52,000 --> 00:01:55,119
For instance, if you want to recognize

52
00:01:54,079 --> 00:01:56,799
objects in an image

53
00:01:55,119 --> 00:01:58,479
you would collect a large number of

54
00:01:56,799 --> 00:02:00,320
labeled images and then

55
00:01:58,479 --> 00:02:01,840
use typically standard supervised

56
00:02:00,320 --> 00:02:04,560
learning methods to predict inputs from outputs.

57
00:02:02,719 --> 00:02:06,240
But deep learning is fundamentally about

58
00:02:04,560 --> 00:02:08,239
the choice of large over-parameterized models

59
00:02:06,880 --> 00:02:10,319
more than it is about the choice of the algorithm.

60
00:02:08,239 --> 00:02:11,440
We've seen deep learning methods

61
00:02:10,319 --> 00:02:15,360
succeed at tasks ranging from

62
00:02:12,720 --> 00:02:17,440
image classification to translating text

63
00:02:15,360 --> 00:02:20,480
even directly from images to recognizing human speech.

64
00:02:18,480 --> 00:02:21,840
And these are all open world settings in

65
00:02:20,480 --> 00:02:23,040
the sense that

66
00:02:21,840 --> 00:02:24,720
you need models that generalize

67
00:02:23,040 --> 00:02:28,080
effectively to things they've never seen before

68
00:02:25,599 --> 00:02:31,599
and all sorts of weird special cases and

69
00:02:28,080 --> 00:02:31,599
unusual situations can arise.

70
00:02:31,920 --> 00:02:34,800
Reinforcement learning provides a

71
00:02:33,200 --> 00:02:36,480
formalism for behavior as i mentioned

72
00:02:34,800 --> 00:02:38,239
before it provides a mathematical

73
00:02:36,480 --> 00:02:41,680
way of thinking about sequential decision-making.

74
00:02:39,680 --> 00:02:44,239
In reinforcement learning, 
an agent interacts with the world,

75
00:02:41,680 --> 00:02:46,400
gets observations and rewards

76
00:02:44,239 --> 00:02:48,000
and these kinds of methods have been

77
00:02:46,400 --> 00:02:51,120
used in combination

78
00:02:48,000 --> 00:02:51,440
with deep neural networks for a variety

79
00:02:51,120 --> 00:02:53,280
of

80
00:02:51,440 --> 00:02:54,560
applications where you do have to handle

81
00:02:53,280 --> 00:02:56,800
flexibly

82
00:02:54,560 --> 00:02:58,640
unusual and unpredictable situations.

83
00:02:56,800 --> 00:03:00,159
For instance, one of the early successes

84
00:02:58,640 --> 00:03:02,000
of the combination of reinforcement

85
00:03:00,159 --> 00:03:03,120
learning and neural networks has been

86
00:03:02,000 --> 00:03:06,080
learning to play

87
00:03:03,120 --> 00:03:07,519
the board game of backgammon.

88
00:03:06,080 --> 00:03:08,959
This is a system called td gammon

89
00:03:07,519 --> 00:03:10,800
that learned to play backgammon at the

90
00:03:08,959 --> 00:03:11,760
level of an expert human.

91
00:03:10,800 --> 00:03:13,760
Not quite at the level of beating

92
00:03:11,760 --> 00:03:15,680
the championed human player but a very

93
00:03:13,760 --> 00:03:18,080
at a very professional level.

94
00:03:15,680 --> 00:03:21,360
The technology behind alphago which

95
00:03:18,080 --> 00:03:23,440
defeated the human champion for go 2016

96
00:03:21,360 --> 00:03:26,000
in many ways had a lot in common with

97
00:03:23,440 --> 00:03:27,360
td gammon back in the 90s.

98
00:03:26,000 --> 00:03:28,720
Deep reinforcement learning methods

99
00:03:27,360 --> 00:03:31,280
meaning reinforcement learning algorithms

100
00:03:29,360 --> 00:03:33,040
that use deep neural networks have been

101
00:03:31,280 --> 00:03:34,879
used for tasks ranging from

102
00:03:33,040 --> 00:03:36,080
robotic locomotion to robotic

103
00:03:34,879 --> 00:03:39,280
manipulation skills

104
00:03:36,080 --> 00:03:42,319
playing video games and so on.

105
00:03:39,280 --> 00:03:44,000
So what is deep RL exactly?

106
00:03:42,319 --> 00:03:46,319
And why should we care about it?

107
00:03:44,000 --> 00:03:47,920
Well, to understand the importance of

108
00:03:46,319 --> 00:03:49,040
deep RL the difference that it makes in

109
00:03:47,920 --> 00:03:51,519
reinforcement learning methods

110
00:03:49,040 --> 00:03:52,799
Let's start with

111
00:03:51,519 --> 00:03:54,400
an example from a different domain an

112
00:03:52,799 --> 00:03:56,959
example from computer vision

113
00:03:54,400 --> 00:03:57,599
to see why it is that deep neural

114
00:03:56,959 --> 00:03:59,200
networks

115
00:03:57,599 --> 00:04:01,920
have such a transformative effect on the

116
00:03:59,200 --> 00:04:04,000
capabilities of machine learning systems.

117
00:04:01,920 --> 00:04:05,040
So if we go back in time maybe about

118
00:04:04,000 --> 00:04:06,159
15 to 20 years

119
00:04:05,040 --> 00:04:07,680
and see how computer vision was

120
00:04:06,159 --> 00:04:08,720
typically done we would see something

121
00:04:07,680 --> 00:04:10,720
like this.

122
00:04:08,720 --> 00:04:12,640
You start with pixels in an image and

123
00:04:10,720 --> 00:04:14,159
then you extract some hand-designed

124
00:04:12,640 --> 00:04:16,079
low-level visual features from those

125
00:04:14,159 --> 00:04:18,000
pixels like for instance a histogram of

126
00:04:16,079 --> 00:04:20,079
oriented gradients.

127
00:04:18,000 --> 00:04:22,320
Then you might extract some mid-level features,

128
00:04:20,079 --> 00:04:25,120
like a deformable parts model,
for example,

129
00:04:23,280 --> 00:04:26,960
and then on top of those mid-level features

130
00:04:25,120 --> 00:04:28,000
you would train some simple

131
00:04:26,960 --> 00:04:29,600
linear classifier

132
00:04:28,000 --> 00:04:30,880
like a support vector machine to

133
00:04:29,600 --> 00:04:35,280
actually classify the thing that you want.

134
00:04:32,639 --> 00:04:37,600
Now with deep learning the deep neural net

135
00:04:35,840 --> 00:04:39,520
performs much the same function

136
00:04:37,600 --> 00:04:40,639
internally it has mid-level features and

137
00:04:39,520 --> 00:04:42,639
low level features

138
00:04:40,639 --> 00:04:44,000
and a classifier the difference is that

139
00:04:42,639 --> 00:04:46,800
these now don't have to be designed by hand.

140
00:04:44,479 --> 00:04:48,560
They're actually learned end-to-end by

141
00:04:46,800 --> 00:04:50,160
the deep neural network.

142
00:04:48,560 --> 00:04:52,000
This not only means that we save a lot

143
00:04:50,160 --> 00:04:52,960
of human effort in designing all those

144
00:04:52,000 --> 00:04:54,880
features.

145
00:04:52,960 --> 00:04:56,320
But it also means that the features are

146
00:04:54,880 --> 00:04:57,600
optimally adapted to the tasks that they

147
00:04:56,320 --> 00:04:58,960
actually have to solve.

148
00:04:57,600 --> 00:05:00,800
So you don't just get some generic

149
00:04:58,960 --> 00:05:01,919
histogram or ingredients features you

150
00:05:00,800 --> 00:05:06,320
get the right features

151
00:05:01,919 --> 00:05:07,680
for classifying tigers from jaguars.

152
00:05:06,320 --> 00:05:09,520
Now let's think about how this lesson

153
00:05:07,680 --> 00:05:10,120
maps on to 
the reinforcement learning setting.

154
00:05:10,160 --> 00:05:13,379
Let's think about the game of backgammon.

155
00:05:13,520 --> 00:05:16,720
If you wanted to use standard

156
00:05:14,479 --> 00:05:18,160
reinforcement learning methods you would

157
00:05:16,720 --> 00:05:19,759
have to extract features from the game

158
00:05:18,160 --> 00:05:21,840
of backgammon somehow.

159
00:05:19,759 --> 00:05:23,440
What kind of features do you use?

160
00:05:21,840 --> 00:05:24,720
Well, maybe if you're an extra backgammon

161
00:05:23,440 --> 00:05:26,160
player you might know that there are

162
00:05:24,720 --> 00:05:27,759
some things that matter in the game i

163
00:05:26,160 --> 00:05:29,039
i'm not an expert backgammon player so i

164
00:05:27,759 --> 00:05:30,320
don't know what those things are but

165
00:05:29,039 --> 00:05:30,800
perhaps you know what they are and then

166
00:05:30,320 --> 00:05:33,360
you could

167
00:05:30,800 --> 00:05:33,620
write them down.

168
00:05:32,460 --> 00:05:34,700
But it's not enough to just have

169
00:05:34,320 --> 00:05:36,800
features that you think are important

170
00:05:35,600 --> 00:05:37,440
for the game they have to also be

171
00:05:36,800 --> 00:05:40,320
features

172
00:05:37,440 --> 00:05:41,840
that can be used to represent policies

173
00:05:40,320 --> 00:05:43,360
value functions and other objects

174
00:05:41,840 --> 00:05:44,400
relevant to reinforcement learning in

175
00:05:43,360 --> 00:05:47,440
some simple way

176
00:05:44,400 --> 00:05:48,800
like a tabular or linear representation.

177
00:05:47,440 --> 00:05:50,880
And that's much harder design because

178
00:05:48,800 --> 00:05:53,360
now you need someone who is not only

179
00:05:50,880 --> 00:05:55,520
an expert and backgammon but also an

180
00:05:53,360 --> 00:05:56,880
expert in reinforcement learning.

181
00:05:55,520 --> 00:05:59,039
And you need a lot of intuition for

182
00:05:56,880 --> 00:06:00,560
which features are good

183
00:05:59,039 --> 00:06:02,080
This turned out to be very difficult in

184
00:06:00,560 --> 00:06:03,039
practice and for a long time it was

185
00:06:02,080 --> 00:06:04,800
extremely hard

186
00:06:03,039 --> 00:06:07,680
to apply reinforcement learning methods

187
00:06:04,800 --> 00:06:09,360
to complex problems.

188
00:06:07,680 --> 00:06:11,600
Deep learning applies the same formula

189
00:06:09,360 --> 00:06:13,360
to reinforcement learning that it did

190
00:06:11,600 --> 00:06:14,720
to the computer mission problem which is

191
00:06:13,360 --> 00:06:15,520
that we replace the manual feature

192
00:06:14,720 --> 00:06:17,199
extraction

193
00:06:15,520 --> 00:06:18,880
with automatically learned features

194
00:06:17,199 --> 00:06:21,440
represented by a deep neural network

195
00:06:18,880 --> 00:06:23,199
and train it end to end.

196
00:06:21,440 --> 00:06:24,479
However in the reinforcement learning setting

197
00:06:23,199 --> 00:06:26,160
typically for the breadth of problems

198
00:06:24,479 --> 00:06:27,360
that we want to handle our intuition for

199
00:06:26,160 --> 00:06:29,039
designing features

200
00:06:27,360 --> 00:06:33,280
is substantially weaker than it is for
computer vision

201
00:06:30,960 --> 00:06:34,319
and for this reason,
deep reinforcement learning methods have

202
00:06:33,280 --> 00:06:35,919
a transformative effect on the

203
00:06:34,319 --> 00:06:38,319
capabilities of reinforcement learning
algorithms.

204
00:06:39,200 --> 00:06:46,400
So what does end-to-end learning mean
for sequential decision making?

205
00:06:43,600 --> 00:06:49,199
Well, first let me explain what it means

206
00:06:46,400 --> 00:06:50,319
to not have intended learning.

207
00:06:49,199 --> 00:06:51,680
When you don't have intended learning

208
00:06:50,319 --> 00:06:53,039
that means that you have to handle the

209
00:06:51,680 --> 00:06:54,880
recognition part of the problem and the

210
00:06:53,039 --> 00:06:57,440
control part of the problem separately.

211
00:06:54,880 --> 00:06:58,560
So maybe you have one system that

212
00:06:57,440 --> 00:07:00,560
figures out what you're seeing in an

213
00:06:58,560 --> 00:07:02,720
image are you seeing a tiger or jaguar

214
00:07:00,560 --> 00:07:04,400
or something innocuous and then a

215
00:07:02,720 --> 00:07:06,160
pipeline that leads to another component

216
00:07:04,400 --> 00:07:09,360
that decides what actually to take based

217
00:07:06,160 --> 00:07:09,360
on that perceptual outcome.

218
00:07:09,520 --> 00:07:12,639
And you train your perception system to

219
00:07:11,360 --> 00:07:13,440
be a good perception system and

220
00:07:12,639 --> 00:07:14,800
recognize

221
00:07:13,440 --> 00:07:16,479
what you're seeing accurately and you

222
00:07:14,800 --> 00:07:18,400
train your control system to be a good

223
00:07:16,479 --> 00:07:19,759
control system to take the right actions.

224
00:07:18,400 --> 00:07:21,680
But because the perception system is

225
00:07:19,759 --> 00:07:22,560
trained separately it's not informed by

226
00:07:21,680 --> 00:07:24,800
the demands

227
00:07:22,560 --> 00:07:26,000
of the behavioral system it doesn't know

228
00:07:24,800 --> 00:07:27,120
what kind of detections are important,

229
00:07:26,000 --> 00:07:29,599
what kind are not important,

230
00:07:27,120 --> 00:07:30,960
what kind of mistakes are costly

231
00:07:29,599 --> 00:07:32,560
and what kind of mistakes are less costly.

232
00:07:30,960 --> 00:07:35,680
And that's a big deal we

233
00:07:32,560 --> 00:07:35,680
need to run away from the tiger.

234
00:07:35,759 --> 00:07:38,720
An intense system closes the sensory

235
00:07:37,680 --> 00:07:40,240
motor loop but actually trains the

236
00:07:38,720 --> 00:07:42,400
entire system end-to-end

237
00:07:40,240 --> 00:07:43,759
performing both perception and control

238
00:07:42,400 --> 00:07:46,319
acquiring both visual

239
00:07:43,759 --> 00:07:48,000
and behavioral features directly

240
00:07:46,319 --> 00:07:51,440
informed by the final performance of the task.

241
00:07:50,000 --> 00:07:53,039
Here's what that means in some example

242
00:07:51,440 --> 00:07:54,240
application scenarios if you want to do

243
00:07:53,039 --> 00:07:56,160
robotic control

244
00:07:54,240 --> 00:07:57,440
a traditional robotics pipeline will

245
00:07:56,160 --> 00:07:59,680
consist of stages

246
00:07:57,440 --> 00:08:01,759
taking in observations estimating the

247
00:07:59,680 --> 00:08:02,960
state such as the positions of objects

248
00:08:01,759 --> 00:08:04,720
doing a little bit of modeling and prediction

249
00:08:02,960 --> 00:08:06,000
figure out how those objects

250
00:08:04,720 --> 00:08:05,960
will behave in the future

251
00:08:06,000 --> 00:08:08,600
doing some planning based on that
modeling and prediction

252
00:08:08,939 --> 00:08:11,960
doing some low level control on top of that
and so on

253
00:08:12,100 --> 00:08:15,979
Importantly each stage in this
process can incur some error.

254
00:08:16,100 --> 00:08:20,120
You could make a

255
00:08:16,479 --> 00:08:20,879
mistake in detecting where the object is

256
00:08:18,720 --> 00:08:22,400
at which point the plan you construct

257
00:08:20,879 --> 00:08:23,680
might not have any meaningful effect in

258
00:08:22,400 --> 00:08:25,360
the real world because it's based on

259
00:08:23,680 --> 00:08:26,960
false premises.

260
00:08:25,360 --> 00:08:28,560
An intent approach would overcome this

261
00:08:26,960 --> 00:08:30,479
issue because

262
00:08:28,560 --> 00:08:31,759
the each stage in this pipeline will be

263
00:08:30,479 --> 00:08:33,120
informed by the demands of the next

264
00:08:31,759 --> 00:08:33,100
stage down the line.

265
00:08:33,120 --> 00:08:34,740
So you could imagine

266
00:08:34,800 --> 00:08:37,539
a deepreinforcement learning
approach to robotic control

267
00:08:37,599 --> 00:08:42,719
where you have 
a convolutional deep neural network

268
00:08:40,640 --> 00:08:44,320
that performs both perception and action.

269
00:08:42,719 --> 00:08:46,080
So images from the robot's camera are

270
00:08:44,320 --> 00:08:49,279
fitted to the bottom of this network

271
00:08:46,080 --> 00:08:51,200
and outputs from the end are fed into

272
00:08:49,279 --> 00:08:53,680
the robot's actuators.

273
00:08:51,200 --> 00:08:54,640
This kind of sensory motor loop sort of

274
00:08:53,680 --> 00:08:56,320
represents

275
00:08:54,640 --> 00:08:57,920
a little microcosm of brain for this

276
00:08:56,320 --> 00:08:59,279
robot with a little visual component a

277
00:08:57,920 --> 00:09:00,880
little behavioral component.

278
00:08:59,279 --> 00:09:02,800
But all of them are trained end-to-end

279
00:09:00,880 --> 00:09:04,160
for the final performance of the task.

280
00:09:02,800 --> 00:09:06,320
So you can think of the convolutional

281
00:09:04,160 --> 00:09:08,000
layers to use a very strained analogy

282
00:09:06,320 --> 00:09:10,240
as a kind of tiny highly specialized

283
00:09:08,000 --> 00:09:11,839
visual cortex and the fully connected

284
00:09:10,240 --> 00:09:12,959
layers is a tiny highly specialized

285
00:09:11,839 --> 00:09:14,720
motor cortex.

286
00:09:12,959 --> 00:09:16,080
But they're all trained and to end and

287
00:09:14,720 --> 00:09:17,600
because of that they'll end up doing

288
00:09:16,080 --> 00:09:18,880
whatever is optimal for the task

289
00:09:17,600 --> 00:09:20,640
within the capabilities of their

290
00:09:18,880 --> 00:09:21,839
representational capacity.

291
00:09:20,640 --> 00:09:24,000
So then the robot could learn from

292
00:09:21,839 --> 00:09:26,399
experience practicing this task

293
00:09:24,000 --> 00:09:27,200
and using that to train all the weights

294
00:09:26,399 --> 00:09:29,279
in this network end to end.

295
00:09:30,240 --> 00:09:34,880
So if we think about these uh example

296
00:09:32,640 --> 00:09:37,519
applications of reinforcement problems,

297
00:09:34,880 --> 00:09:39,440
when we combine them with deep neural

298
00:09:37,519 --> 00:09:42,000
network representations,

299
00:09:39,440 --> 00:09:42,959
the reinforcement learning system really

300
00:09:42,000 --> 00:09:44,959
represents

301
00:09:42,959 --> 00:09:47,040
in a sense the entirety of the ai problem.

302
00:09:44,959 --> 00:09:48,000
Supervised learning systems

303
00:09:47,040 --> 00:09:50,160
require

304
00:09:48,000 --> 00:09:51,120
input and output supervision.

305
00:09:50,160 --> 00:09:53,600
Reinforced learning systems

306
00:09:51,120 --> 00:09:55,360
aim for optimal behavior without relying

307
00:09:53,600 --> 00:09:57,920
on such supervision just for

308
00:09:55,360 --> 00:09:59,120
relying on reward feedback.

309
00:09:57,920 --> 00:10:01,200
And deep models are what allows

310
00:09:59,120 --> 00:10:02,720
reinforced learning algorithms to solve

311
00:10:01,200 --> 00:10:03,760
complex problems end-to-end so

312
00:10:02,720 --> 00:10:05,680
reinforcement learning

313
00:10:03,760 --> 00:10:06,959
provides the mathematical formalism the

314
00:10:05,680 --> 00:10:08,480
algorithmic foundations

315
00:10:06,959 --> 00:10:10,160
while deep models provide the

316
00:10:08,480 --> 00:10:11,360
representations that allow those

317
00:10:10,160 --> 00:10:15,120
algorithmic foundations

318
00:10:11,360 --> 00:10:15,120
to be scaled up to real-world systems.

319
00:10:16,079 --> 00:10:19,000
So why should we study these questions now?

320
00:10:19,200 --> 00:10:22,399
Well, there have been a number of

321
00:10:20,720 --> 00:10:23,839
advances over the last decade that make

322
00:10:22,399 --> 00:10:25,360
this a really exciting time to study

323
00:10:23,839 --> 00:10:26,480
deep reinforcement learning.

324
00:10:25,360 --> 00:10:28,720
First, there have been tremendous

325
00:10:26,480 --> 00:10:31,440
advances in deep learning itself

326
00:10:28,720 --> 00:10:33,360
methods for building

327
00:10:30,640 --> 00:10:33,359
deep neural network representations.

328
00:10:33,360 --> 00:10:36,399
There have also been a lot of advances

329
00:10:34,959 --> 00:10:37,839
in reinforcement learning algorithms

330
00:10:36,399 --> 00:10:38,060
that make it possible to scale these

331
00:10:37,839 --> 00:10:40,780
methods up to use those deep learning
representations.

332
00:10:40,880 --> 00:10:43,920
And lastly, advances in computational

333
00:10:43,120 --> 00:10:45,279
capability

334
00:10:43,920 --> 00:10:49,200
have made it more practical than ever before

335
00:10:45,279 --> 00:10:49,200
to combine these two elements

336
00:10:49,360 --> 00:10:52,959
That said the basic foundations of deep

337
00:10:51,279 --> 00:10:54,160
reinforcement learning are not by any

338
00:10:52,959 --> 00:10:55,760
means new.

339
00:10:54,160 --> 00:10:57,519
For instance, there have been textbooks

340
00:10:55,760 --> 00:10:58,880
on neural networks for control

341
00:10:57,519 --> 00:11:00,720
going back decades going back to the

342
00:10:58,880 --> 00:11:03,440
1980s.

343
00:11:00,720 --> 00:11:04,560
And work even from the 90s has actually

344
00:11:03,440 --> 00:11:07,040
discussed many of the

345
00:11:04,560 --> 00:11:08,560
ingredients that are part of cutting

346
00:11:07,040 --> 00:11:09,680
edge research today for instance this

347
00:11:08,560 --> 00:11:12,560
phd thesis

348
00:11:09,680 --> 00:11:13,920
from 1993 says a few things that modern

349
00:11:12,560 --> 00:11:15,519
deep reinforcement learning researchers

350
00:11:13,920 --> 00:11:17,519
would find very familiar such as,

351
00:11:15,519 --> 00:11:18,720
reinforcement learning can be naturally

352
00:11:17,519 --> 00:11:19,279
integrated with artificial neural

353
00:11:18,720 --> 00:11:21,519
networks

354
00:11:19,279 --> 00:11:23,279
to obtain high quality generalization.

355
00:11:21,519 --> 00:11:24,560
Experience replays a simple technique

356
00:11:23,279 --> 00:11:26,079
that implements this idea.

357
00:11:24,560 --> 00:11:27,760
We'll learn about experience we play

358
00:11:26,079 --> 00:11:29,760
later in the course.

359
00:11:27,760 --> 00:11:31,360
Instructive training instances provided

360
00:11:29,760 --> 00:11:33,839
by human teachers

361
00:11:31,360 --> 00:11:35,200
can result in significant speed up.

362
00:11:33,839 --> 00:11:36,800
That's uh we call that

363
00:11:35,200 --> 00:11:39,360
curriculum learning or imitation learning,

364
00:11:36,800 --> 00:11:40,560
hierarchical learning,

365
00:11:39,360 --> 00:11:42,480
reinforced learning agents can

366
00:11:40,560 --> 00:11:43,519
significantly reduce learning time by

367
00:11:42,480 --> 00:11:45,040
hierarchical learning.

368
00:11:43,519 --> 00:11:46,640
That's a cutting-edge research topic in

369
00:11:45,040 --> 00:11:48,560
deep RL today.

370
00:11:46,640 --> 00:11:50,079
And reinforced learning agents can deal

371
00:11:48,560 --> 00:11:50,880
with a wide range of non-markovian

372
00:11:50,079 --> 00:11:52,800
environments

373
00:11:50,880 --> 00:11:54,480
by having a memory of their past and

374
00:11:52,800 --> 00:11:56,160
we'll talk about integrating memory

375
00:11:54,480 --> 00:11:58,240
and recurrence into deep learning

376
00:11:56,160 --> 00:12:00,240
methods.

377
00:11:58,240 --> 00:12:02,240
And of course these methods have had

378
00:12:00,240 --> 00:12:04,079
enormous successes in the past too,

379
00:12:02,240 --> 00:12:05,360
such as the td gammon system that i

380
00:12:04,079 --> 00:12:06,560
mentioned before which was actually done

381
00:12:05,360 --> 00:12:10,480
in 1996

382
00:12:06,560 --> 00:12:11,920
more than 20 years prior to alphago

383
00:12:10,480 --> 00:12:13,920
but in recent years we've seen some

384
00:12:11,920 --> 00:12:15,360
pretty tremendous advances.

385
00:12:13,920 --> 00:12:16,800
Deep reinforcement learning algorithms

386
00:12:15,360 --> 00:12:19,200
that can directly learn to play video

387
00:12:16,800 --> 00:12:20,880
games directly from raw image pixels

388
00:12:19,200 --> 00:12:23,920
robotic systems that can learn a wide

389
00:12:20,880 --> 00:12:27,120
range of highly generalizable skills

390
00:12:23,920 --> 00:12:28,720
and deep reinforced learning algorithms

391
00:12:27,120 --> 00:12:37,360
that could even beat the world champion

392
00:12:28,720 --> 00:12:37,360
at go