Korean_srt/CS 285 Lecture 2, Part 5.ko.srt

1
00:00:01,199 --> 00:00:04,240
오늘 강의의 다음 주제에 대해 알아보겠습니다.

2
00:00:02,960 --> 00:00:07,759
다음 주제로 imitation learning(모방학습)에 관한 이론에 대해 다루어 보겠습니다.

3
00:00:06,000 --> 00:00:12,799
이 이론은 비용함수(cost function)과 보상함수(reward function)를 바탕으로 합니다.

4
00:00:10,400 --> 00:00:16,160
그럼 우리는 어떤 비용함수가 모방학습에서 좋은것인지

5
00:00:14,160 --> 00:00:19,439
혹은 보상함수가 모방학습에서 좋은것인지 생각해봅시다.

6
00:00:17,840 --> 00:00:23,680
이 그림은 이전에 다루었던 모방학습에 관한 내용을 나타내고 있습니다.

7
00:00:21,199 --> 00:00:25,760
우리는 관찰(observations)을 행동(actions)로 대응 시키려고 합니다.

8
00:00:23,680 --> 00:00:29,359
그리고 합리적인 보상함수를 가정 할수 있습니다.

9
00:00:27,359 --> 00:00:33,200
전문가의 행동을 할당한 로그 확률이 될 수 있습니다.


10
00:00:31,359 --> 00:00:37,760
그래서 이 행동에 대한 로그확률은 최적-정책함수(pi*)와 동일하게 행동을 취합니다.

11
00:00:35,440 --> 00:00:40,800
여기서 최적-정책함수는 전문가의 알수없는 정책입니다.

12
00:00:39,040 --> 00:00:44,480
또한 우리는 0과 1의 손실이라고 부르는 것을 사용 할 수 있습니다.

13
00:00:42,320 --> 00:00:48,239
만약 우리가 전문가와 완전히 동일하게 관찰을 행동으로 변환한다면 0의 손실을 얻습니다.

14
00:00:45,520 --> 00:00:51,520
반대로 그렇지 못한 경우에는 1의 손실을 받게 됩니다.

15
00:00:49,120 --> 00:00:52,879
이렇게 하는것은 이론적인 분석을 편리하게 하기위해

16
00:00:51,520 --> 00:00:56,000
그리고 당신히 실수한 숫자를 바로 셀수 있도록 하기 위함입니다.

17
00:00:54,000 --> 00:01:00,960
로그 가능도가 좀더 실질적인 선택임에도 말입니다.

18
00:00:58,320 --> 00:01:04,320
그러나 강화학습에서는 다음을 기억해야 합니다.

19
00:01:01,520 --> 00:01:08,240
보상과 손실은 반드시 기댓값으로 계산이 되어져야 합니다.

20
00:01:05,360 --> 00:01:12,960
이 기댓값 계산은 학습되어지는 정책하에(전문가의 정책이 아님) 이루어집니다.

21
00:01:10,640 --> 00:01:14,880
따라서 우리는 전문가의 행동들을 일치시키기를 원합니다.

22
00:01:12,960 --> 00:01:20,000
전문가가 방문한 상태뿐만 아니라 실제로 방문한 상태들을 포함합니다.

23
00:01:17,680 --> 00:01:23,119
그리고 이것이 왜 행동복제가 왜 실제로 최적이 아닌지를 설명합니다.

24
00:01:21,600 --> 00:01:28,880
목표로하는 행동복제는 로그 가능도의 기댓값을 최대화 하는것이 일반적입니다.

25
00:01:27,439 --> 00:01:32,560
전문가의 상태 혹은 관찰의 분포에 기반해서 말이죠

26
00:01:30,640 --> 00:01:38,640
하지만 우리는 우리의 상태나 관찰의 분포에 대한 기댓값을 최대화 하고 싶습니다.

27
00:01:35,840 --> 00:01:42,880
이말은 분포의 불일치 문제를 가지는 다른 방법의 문제로 말할 수 있습니다.

28
00:01:41,040 --> 00:01:47,119
바로 이것이 우리가 정확하게 수정하려고 하는 문제입니다.

29
00:01:45,200 --> 00:01:52,479
그래서 제가 다음으로 하려고 하는것이

30
00:01:49,200 --> 00:01:58,640
순진한 행동복제를 위한 기대비용을 기반으로 이론적으로 분석해 보려고 합니다.

31
00:01:56,560 --> 00:02:01,360
그리고 우리가 이런 분포 불일치 문제를 어떻게 수식화 할것인지에 대해 논의합니다.

32
00:02:01,520 --> 00:02:06,640
그럼 이제부터 분석할 내용에 T 문자를 사용합니다.

33
00:02:04,960 --> 00:02:08,319
이것은 궤적의 길이를 나타냅니다.

34
00:02:06,640 --> 00:02:10,160
그래서 T를 보게된다면

35
00:02:08,319 --> 00:02:14,319
우리의 궤적이 T 단계만큼 존재한다고 보면됩니다.

36
00:02:12,800 --> 00:02:18,400
그리고 0, 1 손실에 대해 살펴봅니다.

37
00:02:16,319 --> 00:02:22,959
그리고 우리가 분석하려는 양은 0,1 손실에 대한 기대되는 총합이 됩니다.

38
00:02:21,360 --> 00:02:27,599
이것은 하나의 궤적을 그릴때 실수의 기댓값으로 생각해도 됩니다.

39
00:02:27,840 --> 00:02:33,760
또한 우리는 분석을 위해 몇가지 가정들을 할것입니다.

40
00:02:32,480 --> 00:02:39,519
그리고 우리가 가정하려고 하는 이것들은

41
00:02:36,080 --> 00:02:41,040
지도학습이 최초에 동작하게 하려고 하는것입니다.

42
00:02:39,519 --> 00:02:42,400
상당히 순진한 가정을 해본다면

43
00:02:41,040 --> 00:02:48,000
훈련 집합에 있는 모든 상태들의 확률값은

44
00:02:45,840 --> 00:02:52,239
해당 상태에서 실수를 할 확률값이 입실론 보다 작거나 같게 됩니다.

45
00:02:50,239 --> 00:02:56,239
이것은 쓸만한 추측이 됩니다. 

46
00:02:53,840 --> 00:02:58,000
왜냐하면 여러분들이 그것을 효과적으로 배웠다는 것을 보았기 때문입니다.

47
00:02:56,239 --> 00:03:02,000
만약 누군가가 상태 s 에서 행동 a를 취한다면 여러분들은 a를 할당합니다.

48
00:03:00,640 --> 00:03:09,200
엡실론보다 작거나 같은 확률 엡실론이 작은 숫자인 다른 동작에 대한 확률입니다.

49
00:03:06,720 --> 00:03:12,480
그리고 지금부터 실수의 총합의 경계에 대해 생각해 보겠습니다.

50
00:03:10,879 --> 00:03:16,560
우리는 일어날수 있는 가장 비관적인 시나리오를 고려해보겠습니다.

51
00:03:14,319 --> 00:03:20,239
우리가 비관적일때 이론적인 경계를 구축할때 

52
00:03:18,080 --> 00:03:24,879
우리는 최악의 경우를 내포하는 부등식을 적어볼수 있습니다.

53
00:03:22,720 --> 00:03:28,000
그리고 0,1 손실에 대한 가장 비관적인 시나리오는

54
00:03:26,560 --> 00:03:32,799
줄타기라고 부를 시나리오입니다.

55
00:03:30,080 --> 00:03:37,360
그래서 탱크 로프 워커를 위해 줄타기는 반드시 올바른 행동을 취해야 합니다.

56
00:03:35,120 --> 00:03:39,360
시간적으로 매순간 순간에서 말이죠.

57
00:03:37,360 --> 00:03:43,200
그리고 만약 그들이 1번의 실수만 해도 줄타기에서 떨어질겁니다.

58
00:03:41,440 --> 00:03:46,080
그리고 그들은 무엇을 해야할지 모릅니다.

59
00:03:43,200 --> 00:03:49,599
그래서 우리는 이런 형태의 워커를 그리드월드에 나타낼 수 있습니다.

60
00:03:48,159 --> 00:03:54,159
그들이 회색 사방면에 머물러야 할곳에서 오른쪽으로 계속 진행합니다.

61
00:03:51,680 --> 00:03:57,680
그리고 그들이 한번이라도 벗어나면 떨어져서 빨간 사방면으로 가게 됩니다.

62
00:03:55,920 --> 00:03:59,519
그것들을 보여줄 데이터도 없이 말입니다.

63
00:03:57,680 --> 00:04:00,640
왜냐하면 전문가는 절대 떨어지지 않기 때문입니다.

64
00:03:59,519 --> 00:04:05,760
그래서 그들은 어떻게 해야할지를 알 방법이 없습니다.

65
00:04:03,120 --> 00:04:09,360
이런 형태의 워커는 추락에 관계가 없음이 자명합니다.

66
00:04:07,680 --> 00:04:13,599
왜먀하면 그들이 떨어질때만 그들이 상처를 받기 때문입니다.

67
00:04:11,360 --> 00:04:14,640
왜냐하면 그들은 0, 1 손실을 발생시키기 때문입니다.

68
00:04:13,599 --> 00:04:17,919
그들이 떨어질때 아무런 데이터가 없는 상태에 있다는 것입니다.

69
00:04:16,400 --> 00:04:21,440
이것은 그 상태에서는 어떤것을 해야할지 알길이 없음을 의미합니다.

70
00:04:19,919 --> 00:04:22,720
그래서 이런 줄타기 재미있는 종류입니다.

71
00:04:21,440 --> 00:04:25,199
그들의 관심사는 그들을 다치게 하는것에 관한것이 아니라

72
00:04:23,360 --> 00:04:28,880
그들의 관심은 어떻게 해야할지를 아는것이 아닙니다.

73
00:04:27,120 --> 00:04:32,639
오케이, 이 줄타기에 대해 분석해 봅시다.

74
00:04:30,320 --> 00:04:41,040
우리가 경계를 취하려는 수는 t 시간동안의 0,1 손실을 모두 더한것입니다.

75
00:04:37,919 --> 00:04:43,280
이것은 실수들에 대한 기댓값입니다.

76
00:04:41,040 --> 00:04:47,120
맨처음의 단계에 대해 생각해 봅시다.

77
00:04:45,280 --> 00:04:50,880
줄타기 선수는 훈련 집합에 있는 상태로 부터 시작할 것입니다.

78
00:04:48,479 --> 00:04:55,520
이말은 입실론 확률보다는 크다는 의미입니다.

79
00:04:52,960 --> 00:04:57,840
그들은 확률만큼 올바른 행동을 취하게 됩니다.

80
00:04:56,080 --> 00:05:00,960
입실론 보다 작거나 같은 확률로 잘못된 행동을 취합니다.

81
00:04:59,520 --> 00:05:03,520
만약 줄타기선수가 잘못된 행동을 한다면 떨어지게 될것입니다.

82
00:05:00,960 --> 00:05:07,280
그러나 일단 선수가 떨어지고 난뒤에도 여전히 T 만큼의 단계들이 남아있습니다.

83
00:05:06,080 --> 00:05:08,320
그들은 어떻게 해야할지를 알 방법이 없습니다.

84
00:05:07,280 --> 00:05:10,000
그래서 일단 선수가 떨어진다면

85
00:05:08,320 --> 00:05:15,840
T 실수개만큼 더해줍니다.

86
00:05:12,479 --> 00:05:17,680
그럼 우리는 맨처음 단계에 대해 말할수 있습니다.

87
00:05:15,840 --> 00:05:21,520
거기에는 입실론 만큼의 떨어질 확률이 있습니다.

88
00:05:19,600 --> 00:05:25,840
그리고 만약 떨어진다면 T 만큼의 부가적인 실수들을 얻게 됩니다.

89
00:05:23,199 --> 00:05:28,080
그래서 거기에 입실론 곱하기 T 만큼이 있으며

90
00:05:25,840 --> 00:05:34,720
그리고 (1 - 입실론)*(입실론*(T-1) + ... 가 됩니다.

91
00:05:32,240 --> 00:05:41,600
여기에 더하기를 해서 (1-입실론)(...) 이 다음 단계부터 마지막까지의 식이 됩니다.

92
00:05:39,840 --> 00:05:46,240
그리고 다음의 상태에도 똑같은 일이 일어나게 됩니다.

93
00:05:44,800 --> 00:05:51,759
만약 선수가 처음 단계에서 올바른 행동을 했다면 입실론 확률값만큼 떨어질 확률을 가집니다.

94
00:05:49,600 --> 00:05:53,759
그리고 t - 1 단계만큼만 남게됩니다.

95
00:05:51,759 --> 00:05:56,479
그래서 두번째 단계에서 떨어집니다.

96
00:05:53,759 --> 00:06:04,479
그들은 입실론*(t-1)+(1-입실론)*(나머지) 가 되게 됩니다.

97
00:06:00,960 --> 00:06:17,840
그래서 이런 화면과 같은 시계열의 식을 얻게 됩니다. 

98
00:06:14,720 --> 00:06:20,080
기본적으로 T개만큼 얻게 되고

99
00:06:17,840 --> 00:06:24,560
이들 각각은 입실론의 승수가 되게 됩니다.

100
00:06:21,840 --> 00:06:27,120
빅오 라고 표시되어집니다. 

101
00:06:24,560 --> 00:06:29,520
우리는 앞단의 상수를 제거할수 있습니다.

102
00:06:27,120 --> 00:06:32,240
그리고 1-입실론은 보통 매우 큰값입니다.


103
00:06:30,880 --> 00:06:36,880
우리가 실제로 얻는 것은 입실론 t 항들의 다발입니다.

104
00:06:34,800 --> 00:06:40,720
그래서 그 의미는 총경계는 T*입실론t가 되게 됩니다.


105
00:06:38,160 --> 00:06:46,000
그래서 에러는 입실론t^2이 되게 됩니다.

106
00:06:45,039 --> 00:06:50,319
입실론t^2와 같거나

107
00:06:49,440 --> 00:06:52,400
비슷한 다른것이 됩니다.

108
00:06:50,319 --> 00:06:54,160
그러나 만약 우리가 여기서 빅오에만 관심을 가진다면

109
00:06:52,400 --> 00:06:57,599
다른 상수들은 신경을 쓸 필요가 없습니다.

110
00:06:54,160 --> 00:07:00,160
입실론T^2이 꽤 나쁜 경계입니다.

111
00:06:58,160 --> 00:07:05,440
왜냐하면 우리의 궤적이 커지면 커질수록

112
00:07:02,400 --> 00:07:08,240
우리의 에러도 기하급수적으로 늘어난다는 의미입니다.

113
00:07:05,440 --> 00:07:12,000
그리고 그게 왜 순진한 행동복제가 이론적으로 나쁜 아이디어인 이유입니다.

114
00:07:14,000 --> 00:07:19,599
그런 분석은 약간 순진했었습니다.

115
00:07:18,000 --> 00:07:21,280
우리는 일반적인 어떤 것도 가정하지 못했습니다.
because we didn't assume anything about

116
00:07:19,599 --> 00:07:23,840
단지 다음과 같은 가정만 했기때문입니다.

117
00:07:21,280 --> 00:07:28,240
모든 훈련 집합의 상태는 입실론에 경계되어진다라는 가정

118
00:07:25,759 --> 00:07:33,199
더 일반적인 분석은 사실을 설명해 줍니다.

119
00:07:30,479 --> 00:07:35,919
당신의 정책은 단지 훈련 지점에서만 행동하는게 아니고

120
00:07:34,479 --> 00:07:40,319
같은 분포를 가지는 다른 지점에서도 행동을 하는것을 말합니다.

121
00:07:38,080 --> 00:07:42,000
그래서 이렇게 하는 방법은

122
00:07:40,319 --> 00:07:46,560
우리는 우리의 에러가 입실론에 경계되어진다고 말할수 있습니다.

123
00:07:42,720 --> 00:07:49,280
어떤 상태 s 에서 p확률을가지는 곳으로 부터 표본을 뽑을때

124
00:07:46,560 --> 00:07:55,120
사실 우리의 에러가 입실론의 기대값에 경계화 된다고해도 충분하지만

125
00:07:53,759 --> 00:07:58,560
그리고 우리의 실제 훈련목적에 좀더 실질적이지만

126
00:07:56,879 --> 00:08:03,120
제가 보여주려고 하는 그 분석을 위해서

127
00:08:00,720 --> 00:08:07,120
p 훈련 에서 표본을 추출하는게 확장하기 쉬운 방법입니다.

128
00:08:05,360 --> 00:08:12,879
입실론에 바운드되는 기대 에러와 실제로 이루어지는 것들은 

129
00:08:09,520 --> 00:08:14,319
uh in the stephon ross dagger paper

130
00:08:12,879 --> 00:08:18,080
그래서 여기서 좀더 복잡한 분석을 해보겠습니다.

131
00:08:16,319 --> 00:08:22,800
uh with that just to get this out of the
way with dagger

132
00:08:19,680 --> 00:08:27,039
p train approaches p theta right that's
the whole point of dagger

133
00:08:24,240 --> 00:08:31,759
so if p train approaches p theta then
all points will be in distribution

134
00:08:29,199 --> 00:08:34,880
which means that s is always sampled
from p train

135
00:08:32,800 --> 00:08:38,080
which means that error is always bounded
by epsilon

136
00:08:36,240 --> 00:08:40,880
which means that no matter what you do

137
00:08:38,080 --> 00:08:45,600
the bound on the expected cost will be
epsilon times t it'll be linear in t

138
00:08:44,159 --> 00:08:47,279
that's in some ways not actually all

139
00:08:45,600 --> 00:08:49,519
that interesting of a result

140
00:08:47,279 --> 00:08:50,800
because we're removing all distributional shift

141
00:08:49,519 --> 00:08:57,040
what's more interesting is how we
can show that the bound is still quadratic

142
00:08:54,560 --> 00:09:00,080
if we don't do dagger if we if we don't
have p train going to be theta

143
00:08:58,880 --> 00:09:06,399
and that's what we're going to do next on
this side so if p train is not equal to p theta

144
00:09:03,519 --> 00:09:10,560
then what we can do and then this
analysis is basically

145
00:09:08,000 --> 00:09:14,560
directly based on stefan ross's paper a reduction
of imitation learning and structural prediction

146
00:09:13,040 --> 00:09:18,320
to no regret online learning which is
the docker paper

147
00:09:16,240 --> 00:09:25,680
what we can do is we can write the distribution over
states at some time step t as the sum of two terms

148
00:09:24,000 --> 00:09:30,240
and these two terms are kind of
analogous to the type of walker scenario

149
00:09:28,160 --> 00:09:31,920
the first term represents what happened

150
00:09:30,240 --> 00:09:36,399
if we haven't made a mistake
at any previous time step and the second term represents

151
00:09:34,320 --> 00:09:38,399
what happens if we did

152
00:09:36,399 --> 00:09:43,200
so for the first term what's the probability
that we haven't made a mistake so far

153
00:09:41,120 --> 00:09:45,440
well our probability of doing the right thing

154
00:09:43,200 --> 00:09:48,480
if we're in distribution is one
minus epsilon

155
00:09:47,120 --> 00:09:55,440
and if we do the right thing then we
remain in distribution

156
00:09:50,240 --> 00:09:57,519
which means that by time step little t

157
00:09:55,440 --> 00:10:01,519
we have a probability of one minus
epsilon to the power t

158
00:09:59,360 --> 00:10:03,040
then we've done the right stuff up

159
00:10:01,519 --> 00:10:04,959
until now and we've done the right stuff

160
00:10:03,040 --> 00:10:08,640
up until now then our state distribution
is simply p train

161
00:10:06,959 --> 00:10:10,959
because we've matched the expert every single step up until

162
00:10:08,640 --> 00:10:17,920
now so the first term is 1 minus epsilon to the power
of t times p train the second term is everything else

163
00:10:15,519 --> 00:10:21,279
1 minus 1 minus epsilon to the t so this
basically accounts

164
00:10:18,720 --> 00:10:23,120
for all the other probability mass and

165
00:10:21,279 --> 00:10:25,120
if we have made a mistake in the past

166
00:10:23,120 --> 00:10:28,000
in general we can't really say anything
about where we're at

167
00:10:26,399 --> 00:10:33,120
so we might have fallen off the
tightrope we might be in some crazy place

168
00:10:29,360 --> 00:10:34,640
so we're going to call that p mistake

169
00:10:33,120 --> 00:10:36,160
so what this is saying is that for at

170
00:10:34,640 --> 00:10:39,839
any time step we can write the
distribution over states

171
00:10:37,680 --> 00:10:42,959
as the sum of two terms the first one
for not having made any mistakes

172
00:10:41,600 --> 00:10:48,640
and the second one for having made at
least one mistake

173
00:10:47,440 --> 00:10:52,000
now in general we can't say anything
about p mistake

174
00:10:50,480 --> 00:10:59,680
so for p mistake we might have arbitrarily
large probability of additional error

175
00:10:56,640 --> 00:11:10,560
so what we can do is we can bound the total variation divergence
between p theta s t and p train st

176
00:11:07,680 --> 00:11:15,360
total variation divergence is just the sum of the
absolute values of the differences

177
00:11:11,920 --> 00:11:18,079
between the probabilities over all the states

178
00:11:15,360 --> 00:11:21,279
so if we take the left-hand side of this
equation

179
00:11:18,880 --> 00:11:24,240
and substitute in the above equation for
p theta

180
00:11:22,560 --> 00:11:33,920
basically substitute in 1 minus epsilon to the t times b
train plus 1 minus 1 minus epsilon to the t times p mistake

181
00:11:30,399 --> 00:11:44,480
the first p train term cancels with the minus sign and we just end up with 1
minus 1 minus epsilon to the t of the absolute value of p mistake minus p train

182
00:11:42,959 --> 00:11:48,720
now remember that we can't say anything
about p mistake

183
00:11:46,240 --> 00:11:52,399
so the only bound that we can put on the
total variation divergence

184
00:11:50,160 --> 00:11:58,160
between p mistake and p train is the maximum
error between two distributions which is two right

185
00:11:56,320 --> 00:12:01,279
because the distribution sum to one
in the worst case one of them will be

186
00:11:59,839 --> 00:12:03,200
one in one place and the other will be
zero

187
00:12:01,760 --> 00:12:07,519
the other only one in another place
where the other one is zero

188
00:12:05,120 --> 00:12:09,040
so the only bound that we can put on

189
00:12:07,519 --> 00:12:15,440
this if we don't know anything about p mistake
is 2 times 1 minus 1 minus epsilon of t

190
00:12:15,600 --> 00:12:19,760
a useful identity that we can use to

191
00:12:17,200 --> 00:12:22,880
simplify this is that 1 minus epsilon to
the t

192
00:12:20,480 --> 00:12:26,480
is greater than or equal to 1 minus
epsilon times t

193
00:12:24,240 --> 00:12:28,000
if epsilon is between zero and one and

194
00:12:26,480 --> 00:12:29,440
we know that epsilon is between zero and

195
00:12:28,000 --> 00:12:31,440
one because epsilon is a bound on a
probability

196
00:12:30,079 --> 00:12:36,000
and probabilities are always less than
one and always positive

197
00:12:33,680 --> 00:12:39,839
so we can write a simpler bound for this
by turning this into two times epsilon t

198
00:12:38,399 --> 00:12:44,639
we don't have to do this this is a
looser bound but it's just convenient

199
00:12:42,320 --> 00:12:52,720
to remove the exponential and to give us kind of an
understanding of the big o magnitude of this okay

200
00:12:51,279 --> 00:13:00,959
so now let's go to the quantity that we really want what we
care about is the expected value of the cost overall time

201
00:12:59,360 --> 00:13:06,720
so we can write the expected value of
the cost uh as a sum over time

202
00:13:04,320 --> 00:13:15,360
of a sum over states of the probability of that state times
the cost that's just from the definition of expected value

203
00:13:12,160 --> 00:13:24,560
and we can express this uh we can express a bound on
this cost in terms of two terms p train times the cost

204
00:13:22,000 --> 00:13:27,920
plus the difference between p theta and
p train

205
00:13:26,079 --> 00:13:32,160
the way that you get this inequality is
just by writing

206
00:13:29,519 --> 00:13:34,399
p theta is equal to p theta plus p train
minus p

207
00:13:32,720 --> 00:13:36,079
train right if you add and subtract

208
00:13:34,399 --> 00:13:37,680
something that's all good

209
00:13:36,079 --> 00:13:43,120
then we group the terms and we put absolute value
around p theta minus p train

210
00:13:42,160 --> 00:13:46,720
because the absolute value of a number
is always greater than or equal to

211
00:13:44,880 --> 00:13:52,320
that number itself so that's how we can get
this bound

212
00:13:49,279 --> 00:13:56,720
now the first part p
train times c we know what that is

213
00:13:54,880 --> 00:13:59,519
because the expected value under p train
of the cost

214
00:13:57,760 --> 00:14:04,880
is bounded by epsilon that's our
assumption

215
00:14:01,760 --> 00:14:06,880
the second term well the

216
00:14:04,880 --> 00:14:16,160
bound that we have above on the absolute value of p theta minus p train
is two times epsilon t and c max the worst case cost is one

217
00:14:14,399 --> 00:14:24,079
so then the bound that we can write out for this is epsilon that comes from the
expected value of the constant or p train plus two times epsilon t

218
00:14:23,519 --> 00:14:32,079
which comes from the total variation divergence
or absolute value between p theta and p train

219
00:14:30,399 --> 00:14:36,720
so we have one term that's linear in
T and one term that's quadratic

220
00:14:34,639 --> 00:14:41,279
because the sum over t of two epsilon t
is on the order of capital t squared

221
00:14:41,360 --> 00:14:52,320
so we've derived an o of epsilon t squared
bound with this kind of more general assumption

222
00:14:49,360 --> 00:14:55,440
and notice here that uh the only
assumption on the policy

223
00:14:53,680 --> 00:14:58,480
that we needed is that the expected value of the cost

224
00:14:55,440 --> 00:15:03,839
under p train is epsilon so this is kind of the
more general assumption that i stated before

225
00:15:01,680 --> 00:15:06,959
all right so this derivation maybe went
by a little fast

226
00:15:04,880 --> 00:15:08,800
so if you have any questions about this

227
00:15:06,959 --> 00:15:10,240
do feel free to ask a comment on youtube

228
00:15:08,800 --> 00:15:13,600
we can discuss this in class

229
00:15:10,240 --> 00:15:15,680
and also consider checking out the paper

230
00:15:13,600 --> 00:15:20,639
cited below by stefan ross that
describes the full proof in detail