CS 285 Lecture 2, Part 6

In the final part of today's lecture, I want to talk a little bit about some more advanced material: some other ways that we can use imitation learning to solve interesting decision-making and control problems. This other imitation idea deals with how data that is not necessarily optimal for one task could be optimal for another one, and how we can leverage that observation for imitation learning. So let's say that you have a trajectory for your agent that reached some point p1. If you have many trajectories that all reach p1, that's good: you can use those trajectories for imitation learning and learn how to reach p1.

But what if you have a bunch of demonstrations that all reach different points: one of them reached p1, another one reached p2, another one reached p3? Now the trouble is that you might not have enough demonstrations for any one of those points by itself, so maybe if you just use the demonstrations for p1, you won't actually be able to learn to reach p1 successfully.

What if you condition your policy on the point that was reached? Now, even if you have just a single demonstration for each point, you can learn a conditional policy, which kind of amounts to just appending p to the state. That conditional policy would then be able to reach any point, including p1. So in this way, by conditioning your policy on additional information, you can actually train policies for performing some tasks even if you didn't have enough data for that task.

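To make "appending p to the state" concrete, here is a minimal sketch of a goal-conditioned policy in PyTorch; the MLP architecture, dimensions, and names are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): an ordinary MLP policy whose input is the state
    with the goal concatenated onto it."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, goal):
        # "Appending p to the state" is literally just a concatenation.
        return self.net(torch.cat([state, goal], dim=-1))
```

Nothing else about the policy class has to change; the goal is treated as just more input features.
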
There are a number of works that have used this idea in one form or another that I want to tell you about briefly. We're going to call this goal-conditioned behavior cloning. During training we'll observe a bunch of demonstrations that, in general, all do different things, but then we're going to treat each demonstration as a successful example of doing whatever it is that the demonstration actually succeeded at. So if a demonstration reached the state s_T, it's a successful demonstration for reaching s_T, and then we'll learn a policy conditioned on both the state and the goal. The way this works is that for each demo you basically select the goal to be the last state that was reached, and then just do regular behavior cloning.

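As a sketch of that recipe, assuming continuous actions and a mean-squared-error behavior cloning loss (the lecture doesn't commit to a particular loss), the relabeling and training step might look like this, where `demos` is a hypothetical list of (states, actions) trajectories:

```python
def relabel_with_final_state(demos):
    """Turn each demo into (state, goal, action) training tuples,
    taking the goal to be the last state the demo actually reached."""
    dataset = []
    for states, actions in demos:      # one trajectory per demo
        goal = states[-1]              # the state this demo "succeeded" at reaching
        for s, a in zip(states[:-1], actions):
            dataset.append((s, goal, a))
    return dataset

def bc_loss(policy, states, goals, actions):
    # Regular behavior cloning: regress the demonstrated action
    # from the (state, goal) input.
    return ((policy(states, goals) - actions) ** 2).mean()
```
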
This idea has been used in a few papers; here are a couple of examples, and I'm going to actually talk about the first one, called Learning Latent Plans from Play. The idea in this paper was to collect data from human demonstrators who are not doing anything in particular, but are actually just attempting a wide range of different behaviors. This data is then processed to essentially select the last state as a goal.

The actual model architecture in this paper is a little bit complicated, for the reasons that I described earlier in the lecture: essentially, in order for this to work, the imitation needs to be very powerful. It needs to match the expert very accurately, which means it has to deal with multimodality and non-Markovian problems. The way that it does this is by using a combination of autoregressive discretization and a latent variable model. If you remember, in a latent variable model we feed an additional noise input into the policy, and that noise input allows us to represent arbitrary output distributions. So it's a more sophisticated imitation learning architecture, but the overall algorithm follows the scheme that I had on the previous slide.

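The paper's actual model is a sequence-level architecture, but the noise-input idea on its own can be sketched roughly as follows, building on the earlier policy sketch; the latent dimension and network are assumptions, and a real latent variable model would also need a VAE-style training objective, which is omitted here.

```python
class LatentVariablePolicy(nn.Module):
    """A policy with an extra noise input z: different samples of z can
    pick out different modes of a multimodal action distribution."""
    def __init__(self, state_dim, goal_dim, action_dim, z_dim=8, hidden_dim=256):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, goal, z=None):
        if z is None:
            # At test time, sample the noise input.
            z = torch.randn(state.shape[0], self.z_dim, device=state.device)
        return self.net(torch.cat([state, goal, z], dim=-1))
```
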
Once this policy is trained, it can actually reach a variety of different goals. Here the image on the left shows the goal, and the video on the right shows the policy actually trying to reach that goal; this is a single policy trained on a wide variety of different training data.

Now, another idea that we could imagine from watching these results: if we don't need demonstrations to be successful at a single specific task, but can instead use demonstrations for a wide range of different tasks with this conditioning trick, do we even need them to be demonstrations at all? Could we instead maybe take bad, random data and use that random data as demonstrations? Basically, if you have a bad trajectory, but that trajectory reaches some state, maybe it's actually a good trajectory for whatever state it reached.

That's actually the algorithm described in the paper called Learning to Reach Goals via Iterated Supervised Learning, where the overall recipe is to start from scratch: there's no human data at all. Instead, we start with a random policy, and that random policy collects random data. But then that random data is relabeled with whatever goal it reached: if your random trajectory reached state s_T, then it's a good demonstration for the goal s_T. That data is then used to retrain the policy, which gives you a policy for reaching goals that is better than random, and then you repeat. And it can actually be shown that repeating this process can solve some pretty sophisticated reinforcement learning tasks, just with an inner-loop imitation learning procedure, and this relabeling stage

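A minimal sketch of that loop is below, with `rollout` and `train_bc` as hypothetical helpers; note that the actual paper also relabels with intermediate states along each trajectory, whereas this version uses only the final state, matching the description above.

```python
def gcsl(env, policy, num_iterations, episodes_per_iteration, horizon):
    """Iterated supervised learning for goal reaching: collect data with
    the current policy (random at first), relabel each trajectory with
    the goal it actually reached, retrain with behavior cloning, repeat."""
    dataset = []
    for _ in range(num_iterations):
        for _ in range(episodes_per_iteration):
            states, actions = rollout(env, policy, horizon)  # hypothetical helper
            goal = states[-1]                                # hindsight relabeling
            dataset += [(s, goal, a) for s, a in zip(states[:-1], actions)]
        train_bc(policy, dataset)  # supervised learning on the relabeled data
    return policy
```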
learning procedure and this relabeling stage