Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:
- a, X, 9, < – ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
- . (a period) – matches any single character except newline ‘\n’
- \w – (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
- \b – boundary between word and non-word
- \s – (lowercase s) matches a single whitespace character – space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r – tab, newline, return
- \d – decimal digit [0-9] (some older regex utilities do not support but \d, but they all support \w and \s)
- ^ = start, $ = end – match the start or end of the string
- \ – inhibit the “specialness” of a character. So, for example, use . to match a period or \ to match a slash. If you are unsure if a character has special meaning, such as ‘@’, you can put a slash in front of it, \@, to make sure it is treated just as a character.
Repetition
Things get more interesting when you use + and * to specify repetition in the pattern
- + – 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
- * – 0 or more occurrences of the pattern to its left
? – match 0 or 1 occurrences of the pattern to its left
findall With Files
For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you – much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):
# Open file
f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'some pattern', f.read())
Options
The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).
- IGNORECASE – ignore upper/lowercase differences for matching, so ‘a’ matches both ‘a’ and ‘A’.
- DOTALL – allow dot (.) to match newline – normally it matches anything but newline. This can trip you up – you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
- MULTILINE – Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
Substitution
The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include ‘\1’, ‘\2’ which refer to the text from group(1), group(2), and so on from the original matching text.
Here’s an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str)
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher