CS 285 Lecture 2, Part 6

In the final part of today's lecture, I want to talk a little bit about some more advanced material: some other ways that we can use imitation learning to solve interesting decision-making and control problems. This other imitation idea deals with how data that is not necessarily optimal for one task could be optimal for another one, and how we can leverage that observation for imitation learning. So let's say that you have a trajectory for your agent that reached some point p1. If you have many trajectories that all reach p1, that's good: you can use those trajectories for imitation learning and learn how to reach p1.

But what if you have a bunch of demonstrations that all reach different points: one of them reached p1, another one reached p2, another one reached p3? Now the trouble is that you might not have enough demonstrations for any one of those points by itself, so maybe if you just use the demonstrations for p1, you won't actually be able to learn to reach p1 successfully.

What if you condition your policy on the point that was reached? Now, even if you have just a single demonstration for each point, you can learn a conditional policy, which kind of amounts to just appending p to the state. That conditional policy would then be able to reach any point, including p1. So in this way, by conditioning your policy on additional information, you can actually train policies for performing some tasks even if you didn't have enough data for that task.

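To make "appending p to the state" concrete, here is a minimal sketch of a goal-conditioned policy in PyTorch; the MLP architecture, dimensions, and names are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): an ordinary MLP policy whose input is the state
    with the goal concatenated onto it."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, goal):
        # "Appending p to the state" is literally just a concatenation.
        return self.net(torch.cat([state, goal], dim=-1))
```

Nothing else about the policy class has to change; the goal is treated as just more input features.
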
There are a number of works that have used this idea in one form or another that I want to tell you about briefly. We're going to call this goal-conditioned behavior cloning. During training we'll observe a bunch of demonstrations that, in general, all do different things, but then we're going to treat each demonstration as a successful example of doing whatever it is that the demonstration actually succeeded at. So if a demonstration reached the state s_T, it's a successful demonstration for reaching s_T, and then we'll learn a policy conditioned on both the state and the goal. The way this works is that for each demo you basically select the goal to be the last state that was reached, and then just do regular behavior cloning.

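As a sketch of that recipe, assuming continuous actions and a mean-squared-error behavior cloning loss (the lecture doesn't commit to a particular loss), the relabeling and training step might look like this, where `demos` is a hypothetical list of (states, actions) trajectories:

```python
def relabel_with_final_state(demos):
    """Turn each demo into (state, goal, action) training tuples,
    taking the goal to be the last state the demo actually reached."""
    dataset = []
    for states, actions in demos:      # one trajectory per demo
        goal = states[-1]              # the state this demo "succeeded" at reaching
        for s, a in zip(states[:-1], actions):
            dataset.append((s, goal, a))
    return dataset

def bc_loss(policy, states, goals, actions):
    # Regular behavior cloning: regress the demonstrated action
    # from the (state, goal) input.
    return ((policy(states, goals) - actions) ** 2).mean()
```
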
This idea has been used in a few papers; here are a couple of examples, and I'm going to actually talk about the first one, called Learning Latent Plans from Play. The idea in this paper was to collect data from human demonstrators who are not doing anything in particular, but are actually just attempting a wide range of different behaviors. This data is then processed to essentially select the last state as a goal.

The actual model architecture in this paper is a little bit complicated, for the reasons that I described earlier in the lecture: essentially, in order for this to work, the imitation needs to be very powerful. It needs to match the expert very accurately, which means it has to deal with multimodality and non-Markovian problems. The way that it does this is by using a combination of autoregressive discretization and a latent variable model. If you remember, in a latent variable model we feed an additional noise input into the policy, and that noise input allows us to represent arbitrary output distributions. So it's a more sophisticated imitation learning architecture, but the overall algorithm follows the scheme that I had on the previous slide.

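The paper's actual model is a sequence-level architecture, but the noise-input idea on its own can be sketched roughly as follows, building on the earlier policy sketch; the latent dimension and network are assumptions, and a real latent variable model would also need a VAE-style training objective, which is omitted here.

```python
class LatentVariablePolicy(nn.Module):
    """A policy with an extra noise input z: different samples of z can
    pick out different modes of a multimodal action distribution."""
    def __init__(self, state_dim, goal_dim, action_dim, z_dim=8, hidden_dim=256):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, goal, z=None):
        if z is None:
            # At test time, sample the noise input.
            z = torch.randn(state.shape[0], self.z_dim, device=state.device)
        return self.net(torch.cat([state, goal, z], dim=-1))
```
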
Once this policy is trained, it can actually reach a variety of different goals. Here the image on the left shows the goal, and the video on the right shows the policy actually trying to reach that goal; this is a single policy trained on a wide variety of different training data.

Now, another idea that we could imagine from watching these results: if we don't need demonstrations to be successful at a single specific task, but can instead use demonstrations for a wide range of different tasks with this conditioning trick, do we even need them to be demonstrations at all? Could we instead maybe take bad, random data and use that random data as demonstrations? Basically, if you have a bad trajectory, but that trajectory reaches some state, maybe it's actually a good trajectory for whatever state it reached.

That's actually the algorithm described in the paper called Learning to Reach Goals via Iterated Supervised Learning, where the overall recipe is to start from scratch: there's no human data at all. Instead, we start with a random policy, and that random policy collects random data. But then that random data is relabeled with whatever goal it reached: if your random trajectory reached state s_T, then it's a good demonstration for the goal s_T. That data is then used to retrain the policy, which gives you a policy for reaching goals that is better than random, and then you repeat. And it can actually be shown that repeating this process can solve some pretty sophisticated reinforcement learning tasks, just with an inner-loop imitation learning procedure, and this relabeling stage

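A minimal sketch of that loop is below, with `rollout` and `train_bc` as hypothetical helpers; note that the actual paper also relabels with intermediate states along each trajectory, whereas this version uses only the final state, matching the description above.

```python
def gcsl(env, policy, num_iterations, episodes_per_iteration, horizon):
    """Iterated supervised learning for goal reaching: collect data with
    the current policy (random at first), relabel each trajectory with
    the goal it actually reached, retrain with behavior cloning, repeat."""
    dataset = []
    for _ in range(num_iterations):
        for _ in range(episodes_per_iteration):
            states, actions = rollout(env, policy, horizon)  # hypothetical helper
            goal = states[-1]                                # hindsight relabeling
            dataset += [(s, goal, a) for s, a in zip(states[:-1], actions)]
        train_bc(policy, dataset)  # supervised learning on the relabeled data
    return policy
```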
learning procedure and this relabeling stage