Friday, June 25, 2010

Regular Expressions

Useful information about regular expressions, reproduced from Ben Weaver's talk at Scicoder. The PDF is in the repository here.

. stands for any one character
d.g matches dog, dig, dfg, ... but not dg (the dot must match exactly one character).

[] matches a set of characters
d[aeiou]g matches dag, deg, dig, dog, dug but not dfg.
d[a-z]g matches dag through dzg
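
A quick way to check these claims is to try the patterns in Python. This is only a sketch: the word list and the helper name matching_words are made up for illustration, and re.fullmatch is used so that a pattern has to cover the whole word.

```python
import re

def matching_words(pattern, words):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Illustrative word list, not from the talk.
words = ["dog", "dig", "dfg", "dg", "doog"]

# The dot stands for exactly one character, so "dg" and "doog" are rejected.
dot_matches = matching_words(r"d.g", words)

# A bracketed set accepts exactly one character taken from the set.
vowel_matches = matching_words(r"d[aeiou]g", words)

# A range such as a-z covers every lowercase letter but not digits.
range_matches = matching_words(r"d[a-z]g", words + ["d9g"])

print(dot_matches)
print(vowel_matches)
print(range_matches)
```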

Quantifiers
? = 0 or 1, same as {0,1}
+ = 1 or more, same as {1,}
* = 0 or more, same as {0,}
{i,j} = at least i, up to j
{i,} = at least i, with no upper limit

Examples:
do+g matches dog, doog, dooog, . . .
do*g matches dg, dog, doog, dooog, . . .
do?g matches dg or dog, nothing else.
do{2,3}g matches doog or dooog.
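
The quantifier examples above can be checked with a small Python sketch. The candidate list and the dictionary keys are invented for illustration; re.fullmatch is used so each pattern must match the entire candidate.

```python
import re

# Illustrative candidates with zero to four o's.
candidates = ["dg", "dog", "doog", "dooog", "doooog"]

patterns = {
    "plus": r"do+g",          # one or more o
    "star": r"do*g",          # zero or more o
    "optional": r"do?g",      # zero or one o
    "counted": r"do{2,3}g",   # two or three o
}

# For each quantifier, collect the candidates it accepts in full.
results = {
    name: [c for c in candidates if re.fullmatch(pattern, c)]
    for name, pattern in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```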

<.*> matches the whole of <h1>Title</h1>
<.*?> matches only <h1> or </h1> in <h1>Title</h1>
The non-greedy modifier ? can be applied to any quantifier.
.* is very, very greedy. Use it with caution.
<[a-z0-9/ ]+> might be a better choice here
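
Here is a minimal Python sketch of the greedy versus non-greedy behaviour, using the same <h1>Title</h1> string. The variable names are just for illustration.

```python
import re

html = "<h1>Title</h1>"

# Greedy: .* runs to the last > it can find, so there is a single match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first > it can, so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)

# A restricted character set cannot cross the uppercase T in Title.
restricted = re.findall(r"<[a-z0-9/ ]+>", html)

print(greedy)
print(lazy)
print(restricted)
```

The restricted set gives the same answer as the non-greedy version here, but it states the intent directly instead of relying on how the quantifier backtracks.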


| is the symbol for ‘or’
dog|cat matches dog or cat.
It is not the same as do(g|c)at: | has low precedence, so the alternatives are the whole words dog and cat, not g and c.

( ) defines a group
Does not match anything on its own
do(g|c)at matches dogat or docat
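
The difference between a bare | and a grouped one can be sketched in Python as follows. The word list is made up; re.fullmatch is used so the pattern must account for the whole word.

```python
import re

words = ["dog", "cat", "dogat", "docat"]

# Without a group, | splits the entire pattern into "dog" and "cat".
plain_alternation = [w for w in words if re.fullmatch(r"dog|cat", w)]

# With a group, | only chooses between g and c inside the parentheses.
grouped_alternation = [w for w in words if re.fullmatch(r"do(g|c)at", w)]

# A group also captures what it matched, available through group(1).
captured = re.fullmatch(r"do(g|c)at", "docat").group(1)

print(plain_alternation)
print(grouped_alternation)
print(captured)
```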

^ matches the start of a string
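^[A-Z] matches My dog has no nose. (it matches the leading M)
^dog does not match that sentence, because dog is not at the start
Note: this is different from [^ ], where ^ inside the brackets negates the character set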

$ matches the end of a string
dogs$ matches cats and dogs
The ‘end’ can (usually) be thought of as squeezed in
between the last character and the newline.
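
The anchors can be tried out with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. A small sketch, with variable names chosen only for illustration:

```python
import re

sentence = "My dog has no nose."

# ^ pins the pattern to the start of the string.
starts_with_capital = re.search(r"^[A-Z]", sentence) is not None
starts_with_dog = re.search(r"^dog", sentence) is not None

# Inside brackets, ^ negates the set: this finds the first non-uppercase character.
first_non_capital = re.search(r"[^A-Z]", sentence).group()

# $ pins the pattern to the end, and still matches just before a final newline.
ends_with_dogs = re.search(r"dogs$", "cats and dogs") is not None
ends_with_dogs_newline = re.search(r"dogs$", "cats and dogs\n") is not None

print(starts_with_capital, starts_with_dog, first_non_capital)
print(ends_with_dogs, ends_with_dogs_newline)
```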

\ is the escape character
Turns off the special meanings of other metacharacters
\[0-9\] matches [0-9]

Also turns ordinary characters into metacharacters
\d = [0-9]
\D = [^0-9]
\s matches whitespace (space, tab, . . . )
\S matches non-whitespace
\w matches ‘alphanumeric’ = [A-Za-z0-9_]
\W = [^A-Za-z0-9_]
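
A short Python sketch of the escaped classes in action. The sample text is invented for illustration.

```python
import re

# Illustrative text; the \t here is a real tab character.
text = "Room 42, floor 3\t(east_wing)"

digits = re.findall(r"\d+", text)      # runs of digits
word_runs = re.findall(r"\w+", text)   # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)       # individual whitespace characters

# Escaped brackets are literal characters, not a character set.
literal = re.search(r"\[0-9\]", "the class [0-9]").group()

print(digits)
print(word_runs)
print(spaces)
print(literal)
```

One caveat: for ordinary Python 3 text strings, \d, \s and \w are Unicode-aware by default, so they match somewhat more than the ASCII sets listed above. Passing the re.ASCII flag restricts them to those sets.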

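Python implements regular expressions through a module, re
This differs from Perl, which has REs built into the language
You don’t have to import re if you don’t need it
However, expressing an RE gets trickier, because a pattern is written as a Python string
import re
RE patterns must be distinguished from ordinary strings, so write them as raw strings (the r prefix)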
'\b' is the backspace character
r'\b' is a backslash followed by a b
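
To see why the r prefix matters, here is a minimal sketch. In Python, '\b' in an ordinary string is a single backspace character (code 8), while in a raw string the backslash survives and re reads \b as a word boundary. The sample sentence is made up.

```python
import re

plain = '\b'   # one character: backspace
raw = r'\b'    # two characters: backslash and b

plain_length = len(plain)
plain_code = ord(plain)
raw_length = len(raw)

sample = "dog dogma hotdog dog."

# Raw string: \b reaches re intact and means "word boundary".
whole_words = re.findall(r'\bdog\b', sample)

# Ordinary string: the pattern now looks for literal backspace characters.
no_matches = re.findall('\bdog\b', sample)

print(plain_length, plain_code, raw_length)
print(whole_words)
print(no_matches)
```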


Example:
Problem: Remove trailing comments plus any trailing
whitespace from a line.
line = 'keyword value \t # A keyword-value pair'

The re way:
commentRe = re.compile(r'^([^#]*?)(\s*)(#.*)?$') # Reusable!
clean = commentRe.sub(r'\1', line)

The string way:
try:
    clean = line[0:line.index('#')].strip()
except ValueError: # if the string contains no comment
    clean = line.strip()
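
To compare the two approaches side by side, they can be wrapped in functions and run on a few lines. This is a sketch only: the function names strip_comment_re and strip_comment_str and the extra test lines are hypothetical, while the pattern and the string logic are the ones shown above.

```python
import re

# Same pattern as above: lazy text, trailing whitespace, optional comment.
COMMENT_RE = re.compile(r'^([^#]*?)(\s*)(#.*)?$')

def strip_comment_re(line):
    """The re way: keep only the first group (hypothetical wrapper)."""
    return COMMENT_RE.sub(r'\1', line)

def strip_comment_str(line):
    """The string way: cut at the first # if there is one (hypothetical wrapper)."""
    try:
        return line[0:line.index('#')].strip()
    except ValueError:  # the line contains no comment
        return line.strip()

with_comment = 'keyword value \t # A keyword-value pair'
without_comment = 'keyword value   '
indented = '  indented = 1  # note'

print(repr(strip_comment_re(with_comment)), repr(strip_comment_str(with_comment)))
print(repr(strip_comment_re(without_comment)), repr(strip_comment_str(without_comment)))
print(repr(strip_comment_re(indented)), repr(strip_comment_str(indented)))
```

The two agree on the first two lines. On the indented line they differ, because strip() also removes leading whitespace while the regular expression leaves it alone; rstrip() would be the closer equivalent if the indentation matters.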

For more, see the PDF.
