\documentclass[aspectratio=169]{beamer}
\mode<presentation>
{
\usetheme{Warsaw}
% or ...
\setbeamercovered{transparent}
% or whatever (possibly just delete it)
}
\setbeamertemplate{footline}[page number]
\usepackage[english]{babel}
\usepackage[latin1]{inputenc}
\usepackage{graphicx}
%\usepackage{times}
%\usepackage[T1]{fontenc}
% Or whatever. Note that the encoding and the font should match. If T1
% does not look nice, try deleting the line with the fontenc.
\usepackage{amsmath,amsfonts,amssymb}
\input{macros}
\title{Regular Expressions}
\date{Computing for Data Analysis}
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}{Regular expressions}
\begin{itemize}
\item
Regular expressions can be thought of as a combination of literals and
\textit{metacharacters}
\item
To draw an analogy with natural language, think of literal text
forming the words of this language, and the metacharacters defining
its grammar
\item
Regular expressions have a rich set of metacharacters
\end{itemize}
\end{frame}
\begin{frame}[fragile]{Literals}
The simplest pattern consists only of literals. The literal ``nuclear''
would match the following lines:
\begin{verbatim}
Ooh. I just learned that to keep myself alive after a
nuclear blast! All I have to do is milk some rats
then drink the milk. Aweosme. :}
Laozi says nuclear weapons are mas macho
Chaos in a country that has nuclear weapons -- not good.
my nephew is trying to teach me nuclear physics, or
possibly just trying to show me how smart he is
so I'll be proud of him [which I am].
lol if you ever say "nuclear" people immediately think
DEATH by radiation LOL
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Literals}
The literal ``Obama'' would match the following lines:
\begin{verbatim}
Politics r dum. Not 2 long ago Clinton was sayin Obama
was crap n now she sez vote 4 him n unite? WTF?
Screw em both + Mcain. Go Ron Paul!
Clinton conceeds to Obama but will her followers listen??
Are we sure Chelsea didn't vote for Obama?
thinking ... Michelle Obama is terrific!
jetlag..no sleep...early mornig to starbux..Ms. Obama
was moving
\end{verbatim}
\end{frame}
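\begin{frame}[fragile]{Literals: a quick check in code}
A minimal sketch of literal matching, assuming Python's \texttt{re}
module (the slides are language-neutral; the variable names and the
shortened sample lines are illustrative):
\small
\begin{verbatim}
import re

# Sample lines shortened from the slides; the last one is a non-match.
lines = [
    "Laozi says nuclear weapons are mas macho",
    "my nephew is trying to teach me nuclear physics",
    "i think i need to go to work",
]
# re.search scans the whole line, so the literal may occur anywhere
hits = [ln for ln in lines if re.search("nuclear", ln)]
print(len(hits))  # 2
\end{verbatim}
\end{frame}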
\begin{frame}{Regular Expressions}
\begin{itemize}
\item
Simplest pattern consists only of literals; a match occurs if the
sequence of literals occurs anywhere in the text being tested
\item
What if we only want the word ``Obama'', or sentences that end in
the word ``Clinton'', ``clinton'', or ``clinto''?
\end{itemize}
\end{frame}
\begin{frame}{Regular Expressions}
We need a way to express
\begin{itemize}
\item
whitespace and word boundaries
\item
sets of literals
\item
the beginning and end of a line
\item
alternatives (``war'' or ``peace'')
\end{itemize}
Metacharacters to the rescue!
\end{frame}
\begin{frame}[fragile]{Metacharacters}
Some metacharacters represent positions rather than characters;
\verb+^+ represents the start of a line
\begin{verbatim}
^i think
\end{verbatim}
will match the lines
\begin{verbatim}
i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
i think i need to go to work
i think i first saw zombo in 1999.
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Metacharacters}
\$ represents the end of a line
\begin{verbatim}
morning$
\end{verbatim}
will match the lines
\begin{verbatim}
well they had something this morning
then had to catch a tram home in the morning
dog obedience school in the morning
and yes happy birthday i forgot to say it earlier this morning
I walked in the rain this morning
good morning
\end{verbatim}
\end{frame}
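\begin{frame}[fragile]{Metacharacters: anchors in code}
A small sketch of both anchors, assuming Python's \texttt{re} module;
the sample lines are illustrative, not taken from a real data set:
\small
\begin{verbatim}
import re

lines = [
    "i think i have been outed",
    "well i think so",
    "good morning",
    "morning has broken",
]
# ^ pins the match to the start of the line, $ to the end
starts = [ln for ln in lines if re.search(r"^i think", ln)]
ends = [ln for ln in lines if re.search(r"morning$", ln)]
print(starts)  # ['i think i have been outed']
print(ends)    # ['good morning']
\end{verbatim}
\end{frame}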
\begin{frame}[fragile]{Character Classes with []}
We can list a set of characters we will accept at a given point in the
match
\begin{verbatim}
[Bb][Uu][Ss][Hh]
\end{verbatim}
will match the lines
\begin{verbatim}
The democrats are playing, "Name the worst thing about Bush!"
I smelled the desert creosote bush, brownies, BBQ chicken
BBQ and bushwalking at Molonglo Gorge
Bush TOLD you that North Korea is part of the Axis of Evil
I'm listening to Bush - Hurricane (Album Version)
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Character Classes with []}
\begin{verbatim}
^[Ii] am
\end{verbatim}
will match
\begin{verbatim}
i am so angry at my boyfriend i can't even bear to
look at him
i am boycotting the apple store
I am twittering from iPhone
I am a very vengeful person when you ruin my sweetheart.
I am so over this. I need food. Mmmm bacon...
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{Character Classes with []}
Similarly, you can specify a range of letters, such as \verb+[a-z]+ or
\verb+[a-zA-Z]+; notice that the order of the ranges inside the
brackets doesn't matter
\begin{verbatim}
^[0-9][a-zA-Z]
\end{verbatim}
will match the lines
\begin{verbatim}
7th inning stretch
2nd half soon to begin. OSU did just win something
3am - cant sleep - too hot still.. :(
5ft 7 sent from heaven
1st sign of starvagtion
\end{verbatim}
\end{frame}
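\begin{frame}[fragile]{Character Classes with []: in code}
A sketch of the two class patterns above, assuming Python's
\texttt{re} module; the sample lines are illustrative:
\small
\begin{verbatim}
import re

lines = [
    "Bush TOLD you",
    "creosote bush",
    "7th inning stretch",
    "inning 7",
]
# Each [..] accepts exactly one character from the listed set
any_case = [ln for ln in lines
            if re.search(r"[Bb][Uu][Ss][Hh]", ln)]
# A digit at the start of the line, followed by a letter
digit_letter = [ln for ln in lines
                if re.search(r"^[0-9][a-zA-Z]", ln)]
print(any_case)      # ['Bush TOLD you', 'creosote bush']
print(digit_letter)  # ['7th inning stretch']
\end{verbatim}
\end{frame}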
\begin{frame}[fragile]{Character Classes with []}
When used at the beginning of a character class, the ``\verb+^+'' is also a
metacharacter and indicates matching characters NOT in the
indicated class
\begin{verbatim}
[^?.]$
\end{verbatim}
will match the lines
\begin{verbatim}
i like basketballs
6 and 9
dont worry... we all die anyway!
Not in Baghdad
helicopter under water? hmmm
\end{verbatim}
\end{frame}
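\begin{frame}[fragile]{Character Classes with []: negation in code}
A sketch of the negated class, assuming Python's \texttt{re} module;
the last two sample lines are made up to show non-matches:
\small
\begin{verbatim}
import re

lines = [
    "i like basketballs",
    "Not in Baghdad",
    "is this a question?",
    "The end.",
]
# [^?.]$ : the last character is anything except "?" or "."
no_stop = [ln for ln in lines if re.search(r"[^?.]$", ln)]
print(no_stop)  # ['i like basketballs', 'Not in Baghdad']
\end{verbatim}
\normalsize
A negated class still has to match one character, so an empty line
does not match this pattern.
\end{frame}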
\begin{frame}[fragile]{More Metacharacters}
``.'' is used to refer to any character. So
\begin{verbatim}
9.11
\end{verbatim}
will match the lines
\begin{verbatim}
its stupid the post 9-11 rules
if any 1 of us did 9/11 we would have been caught in days.
NetBios: scanning ip 203.169.114.66
Front Door 9:11:46 AM
Sings: 0118999881999119725...3 !
\end{verbatim}
\end{frame}
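\begin{frame}[fragile]{More Metacharacters: the dot in code}
A sketch of the ``.'' metacharacter, assuming Python's \texttt{re}
module; the last sample line is made up to show a non-match:
\small
\begin{verbatim}
import re

lines = [
    "its stupid the post 9-11 rules",
    "Front Door 9:11:46 AM",
    "call 911 now",
]
# "." stands for exactly one character
# (in Python, any character except a newline by default)
hits = [ln for ln in lines if re.search(r"9.11", ln)]
print(len(hits))  # 2
\end{verbatim}
\normalsize
``call 911 now'' is not matched: there is no character between the
9 and the 11.
\end{frame}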
\begin{frame}[fragile]{More Metacharacters: $|$}
In regular expressions, ``$|$'' does not mean ``pipe''; instead it
translates to ``or''. We can use it to combine two expressions, the
subexpressions being called alternatives
\begin{verbatim}
flood|fire
\end{verbatim}
will match the lines
\begin{verbatim}
is firewire like usb on none macs?
the global flood makes sense within the context of the bible
yeah ive had the fire on tonight
... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More Metacharacters: $|$}
We can include any number of alternatives...
\begin{verbatim}
flood|earthquake|hurricane|coldfire
\end{verbatim}
will match the lines
\begin{verbatim}
Not a whole lot of hurricanes in the Arctic.
We do have earthquakes nearly every day somewhere in our State
hurricanes swirl in the other direction
coldfire is STRAIGHT!
'cause we keep getting earthquakes
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More Metacharacters: $|$}
The alternatives can be real expressions and not just literals; note
that here the \verb+^+ applies only to the first alternative
\begin{verbatim}
^[Gg]ood|[Bb]ad
\end{verbatim}
will match the lines
\begin{verbatim}
good to hear some good knews from someone here
Good afternoon fellow american infidels!
good on you-what do you drive?
Katie... guess they had bad experiences...
my middle name is trouble, Miss Bad News
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More Metacharacters: ( and )}
Subexpressions are often contained in parentheses to constrain the
alternatives
\begin{verbatim}
^([Gg]ood|[Bb]ad)
\end{verbatim}
will match the lines
\begin{verbatim}
bad habbit
bad coordination today
good, becuase there is nothing worse than a man in kinky underwear
Badcop, its because people want to use drugs
Good Monday Holiday
Good riddance to Limey
\end{verbatim}
\end{frame}
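\begin{frame}[fragile]{Alternatives with and without parentheses}
A sketch comparing \verb+^[Gg]ood|[Bb]ad+ with
\verb+^([Gg]ood|[Bb]ad)+, assuming Python's \texttt{re} module; the
sample lines are illustrative:
\small
\begin{verbatim}
import re

lines = [
    "good to hear",
    "Miss Bad News",
    "Badcop, its because",
    "not so good",
]
# Without parentheses, ^ belongs to the first alternative only
ungrouped = [ln for ln in lines
             if re.search(r"^[Gg]ood|[Bb]ad", ln)]
# With parentheses, ^ applies to both alternatives
grouped = [ln for ln in lines
           if re.search(r"^([Gg]ood|[Bb]ad)", ln)]
print(len(ungrouped), len(grouped))  # 3 2
\end{verbatim}
\end{frame}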
\begin{frame}[fragile]{More Metacharacters: ?}
The question mark indicates that the preceding expression is optional
\begin{verbatim}
[Gg]eorge( [Ww]\.)? [Bb]ush
\end{verbatim}
will match the lines
\begin{verbatim}
i bet i can spell better than you and george bush combined
BBC reported that President George W. Bush claimed God told him to invade Iraq
a bird in the hand is worth two george bushes
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{One thing to note...}
In the following
\begin{verbatim}
[Gg]eorge( [Ww]\.)? [Bb]ush
\end{verbatim}
we wanted to match a ``.'' as a literal period; to do that, we had to
``escape'' the metacharacter, preceding it with a backslash. In
general, we have to do this for any metacharacter we want to include
in our match
\end{frame}
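\begin{frame}[fragile]{Optional parts and escaping: in code}
A sketch of the optional group and the escaped period, assuming
Python's \texttt{re} module; the third sample line is made up to
show a non-match:
\small
\begin{verbatim}
import re

# Pattern copied from the slide above.
pattern = r"[Gg]eorge( [Ww]\.)? [Bb]ush"
lines = [
    "i can spell better than george bush",
    "President George W. Bush claimed",
    "George Wx Bush",
]
# \. is a literal period; ( ... )? makes the initial optional
hits = [ln for ln in lines if re.search(pattern, ln)]
print(len(hits))  # 2 ("Wx" is not "W.")
\end{verbatim}
\end{frame}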
\begin{frame}[fragile]{More metacharacters: * and +}
The * and + signs are metacharacters used to indicate repetition; *
means ``any number, including none, of the item'' and + means ``at
least one of the item''
\begin{verbatim}
\(.*\)
\end{verbatim}
will match the lines
\begin{verbatim}
anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
()
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More metacharacters: * and +}
The * and + signs are metacharacters used to indicate repetition; *
means ``any number, including none, of the item'' and + means ``at
least one of the item''
\begin{verbatim}
[0-9]+ (.*)[0-9]+
\end{verbatim}
will match the lines
\begin{verbatim}
working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we finish
it went down on several occasions for like, 3 or 4 *days*
Mmmm its time 4 me 2 go 2 bed
\end{verbatim}
\end{frame}
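\begin{frame}[fragile]{Repetition with * and +: in code}
A sketch of the two repetition patterns above, assuming Python's
\texttt{re} module; the sample lines are illustrative:
\small
\begin{verbatim}
import re

lines = [
    "(he means older men)",
    "()",
    "time 4 me 2 go 2 bed",
    "just 1 number",
]
# .* may match nothing at all, so "()" qualifies
parens = [ln for ln in lines if re.search(r"\(.*\)", ln)]
# [0-9]+ needs at least one digit on each side
two_nums = [ln for ln in lines
            if re.search(r"[0-9]+ (.*)[0-9]+", ln)]
print(parens)    # ['(he means older men)', '()']
print(two_nums)  # ['time 4 me 2 go 2 bed']
\end{verbatim}
\end{frame}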
\begin{frame}[fragile]{More metacharacters: \{ and \}}
\{ and \} are referred to as interval quantifiers; they let us specify
the minimum and maximum number of matches of an expression
\begin{verbatim}
[Bb]ush( +[^ ]+){1,5} debate
\end{verbatim}
will match the lines
\begin{verbatim}
Bush has historically won all major debates he's done.
in my view, Bush doesn't need these debates..
bush doesn't need the debates? maybe you are right
That's what Bush supporters are doing about the debate.
Felix, I don't disagree that Bush was poorly prepared for the debate.
indeed, but still, Bush should have taken the debate more seriously.
Keep repeating that Bush smirked and scowled during the debate
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More metacharacters: \{ and \}}
\begin{itemize}
\item
\{m,n\} means at least m but not more than n matches
\item
\{m\} means exactly m matches
\item
\{m,\} means at least m matches
\end{itemize}
\end{frame}
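\begin{frame}[fragile]{Interval quantifiers: in code}
A sketch of the interval quantifier from the earlier slide, assuming
Python's \texttt{re} module; the second and third sample strings are
made up to show non-matches:
\small
\begin{verbatim}
import re

# Pattern copied from the slide: 1 to 5 words between the two.
pattern = r"[Bb]ush( +[^ ]+){1,5} debate"
close = "Bush was poorly prepared for the debate."
far = ("Bush said one two three four five six "
       "words before the debate")
adjacent = "bush debate"

print(bool(re.search(pattern, close)))     # True: 5 words between
print(bool(re.search(pattern, far)))       # False: too many words
print(bool(re.search(pattern, adjacent)))  # False: no word between
\end{verbatim}
\end{frame}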
\begin{frame}[fragile]{More metacharacters: ( and ) revisited}
\begin{itemize}
\item
In most implementations of regular expressions, the parentheses not
only limit the scope of alternatives divided by a ``$|$'', but can
also be used to ``remember'' the text matched by the enclosed
subexpression
\item
We refer to the matched text with \verb+\1+, \verb+\2+, etc.
\end{itemize}
\end{frame}
\begin{frame}[fragile]{More metacharacters: ( and ) revisited}
So the expression
\begin{verbatim}
+([a-zA-Z]+) +\1 +
\end{verbatim}
will match the lines
\begin{verbatim}
time for bed, night night twitter!
blah blah blah blah
my tattoo is so so itchy today
i was standing all all alone against the world outside...
hi anybody anybody at home
estudiando css css css css.... que desastritooooo
\end{verbatim}
\end{frame}
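\begin{frame}[fragile]{Remembered text: in code}
A sketch of the back-reference pattern above, assuming Python's
\texttt{re} module, where the remembered text is also available
through \verb+group(1)+:
\small
\begin{verbatim}
import re

# Pattern copied from the slide (note the leading space).
pattern = r" +([a-zA-Z]+) +\1 +"
m = re.search(pattern, "my tattoo is so so itchy today")
print(m.group(1))  # so

# No match here: the pattern needs a space after the repeat.
print(re.search(pattern, "it is so so"))  # None
\end{verbatim}
\end{frame}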
\begin{frame}[fragile]{More metacharacters: ( and ) revisited}
The \verb+*+ is ``greedy'' so it always matches the \textit{longest}
possible string that satisfies the regular expression. So
\begin{verbatim}
^s(.*)s
\end{verbatim}
matches
\begin{verbatim}
sitting at starbucks
setting up mysql and rails
studying stuff for the exams
spaghetti with marshmallows
stop fighting with crackers
sore shoulders, stupid ergonomics
\end{verbatim}
\end{frame}
\begin{frame}[fragile]{More metacharacters: ( and ) revisited}
The greediness of \verb+*+ can be turned off by following it with
\verb+?+, as in
\begin{verbatim}
^s(.*?)s
\end{verbatim}
\end{frame}
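\begin{frame}[fragile]{Greedy versus non-greedy: in code}
A sketch contrasting the two patterns on one line from the earlier
slide, assuming Python's \texttt{re} module (which supports the
\verb+*?+ form):
\small
\begin{verbatim}
import re

line = "sitting at starbucks"
# Greedy: .* runs to the LAST "s" on the line
greedy = re.search(r"^s(.*)s", line).group(1)
# Non-greedy: .*? stops at the FIRST "s" after the opening one
lazy = re.search(r"^s(.*?)s", line).group(1)
print(repr(greedy))  # 'itting at starbuck'
print(repr(lazy))    # 'itting at '
\end{verbatim}
\end{frame}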
\begin{frame}{Summary}
\begin{itemize}
\item Regular expressions are used in many different languages; they
are not unique to R.
\item Regular expressions are composed of literals and metacharacters
that represent sets or classes of characters/words
\item Text processing via regular expressions is a very powerful way
to extract data from ``unfriendly'' sources (not all data comes as a
CSV file)
\end{itemize}
(Thanks to Mark Hansen for some material in this lecture.)
\end{frame}
\end{document}