Regular Expressions: A Brief Tutorial

by Dorothea Salo

Copyright 2000 Dorothea Salo. This work is licensed under a Creative Commons Attribution 3.0 United States License.

Lesson 1: Defining regular expressions

Regular expressions represent a way to identify patterns in a text. They can be used to search for patterns and replace them with text, or with different patterns. They can also be used to identify a piece of a text for special handling.

What qualifies as a pattern? Almost anything. Here is an example text with several simple patterns:

January 1, 2000

Dear Sir:

It’s the end of the world as we know it, and I feel fine.

Sincerely,
Computer User

If you keep all your correspondence, you might have a terrible time searching through your files for this specific letter. You might not even know from its filename that it was a letter. A regular expression search could help.

Notice the greeting and closing. The greeting, since this is a business letter, is a line starting with the word “Dear” and ending with a colon. Anything might appear between “Dear” and the colon (“Sir”, “Madam”, any imaginable name), but “Dear” and the colon will always be there. The closing is a line ending with a comma, whether the actual words are “Sincerely,” “Yours truly,” “Love,” or whatever. Hardly any other line in any other kind of text ends with a comma. A regular expression can look for these patterns, and tell you that the file containing them is in all probability a letter. An ordinary text search can’t do that.

Notice the dateline. A string of letters (starting with a capital), a number, a comma, and a four-digit number. That’s probably how all your datelines in all your letters look. That’s also a pattern that a regular expression can catch. So if you knew that the letter was written sometime in early January, you could craft a regular expression that searched your files for all the letters with a line starting with January and ending with 2000.

The history of regular expressions

Before diving too deeply into regular expression syntax (that is, how to put together a regular expression to find what you’re looking for), a brief summary of the history of regular expressions is in order.

Regular expressions were invented by computer programmers, who call them nasty names to this day. They are often represented as “the most difficult thing about computer programming.” That is absurd, and you shouldn’t let it scare you. Programmers who talk that way just think in numbers, and they find text patterns hard to wrap their minds around.

In the early days of regular expressions, different programmers implemented them in different ways, with different (sometimes wildly different) syntaxes. After a while, most regular expression engines under most circumstances settled on a syntax that resembles that used for the tool known as “grep,” common to computers running the Unix operating system. (Yes, there is an explanation for the name, and no, I’m not going to offer it. Look in the book Mastering Regular Expressions.)

Once you understand Unix-style regular-expression syntax, it isn’t too hard to learn modifications to it, or even completely unfamiliar regex syntaxes.

Regular expression resources

The best book on regular expressions is Mastering Regular Expressions by Jeffrey Friedl. This book goes into a lot of detail, but is written in a very friendly fashion. You will probably not need or want to read more than the first few chapters.

If you’re a Mac user, the BBEdit manual has a pretty good introduction to regular expressions, and explains how to use them within BBEdit. (Ever wondered what that funny little “Use grep” box in the Find dialog meant? Now you know. It means “use regular expressions.”)

You can find regular-expression testers/debuggers on the Web.

Exercises

  1. What text pattern could you use to find the beginning of this lesson?
  2. What other patterns can be found in a typical business letter? (Don’t worry about how to specify them; that’s for later.)
  3. What text patterns might you find at the beginning of a book chapter? A journal article?
  4. What text patterns might help you pick apart bibliographic references (to find the author name, date of publication, etc)? Look at two different style guides, or bibliographies in two different publications, and determine how the patterns differ.

Lesson 2: Dots, stars, plusses, and backslashes

Let’s return to the letter from Lesson 1. As you will recall, we noted that the salutation to a business letter starts with the word “Dear” and ends with a colon. What we need, in order to match all possible business letter salutations, is some way to signal the letters between “Dear” and the colon without being specific.

Enter the dot (.). In regular expression syntax, a dot means “any single character.” So the regular expression b.b will match “bib”, “bob”, “brb”, or “bub”. It will also match two b’s with a space or a tab between them; the dot matches these “whitespace” characters also. (Quick exercise: Where would the pattern b.b match in the previous sentence?)
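If you have Python handy, you can try the dot for yourself. What follows is only a small sketch: the word list is made up for the occasion, and it assumes Python 3, where re.fullmatch asks whether an entire string fits the pattern.

```python
import re

# The dot stands for exactly one character of any kind (newlines excepted).
pattern = re.compile(r"b.b")

# Made-up test words: the last two have zero and two characters between the b's.
words = ["bib", "bob", "brb", "bub", "b b", "b\tb", "bb", "boob"]

# fullmatch succeeds only if the whole word fits the pattern.
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```

The last two words are left out because the dot insists on exactly one character, no fewer and no more.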

This is helpful, but not quite enough. After all, there might be any number of letters between “Dear” and the colon in our salutation. Regular expression syntax offers three ways to say “not just one”: the question mark (?), the star (*) and the plus sign (+).

The question mark means “zero or one of the previous character.” So the regex Bb? will match “B” (one capital B, zero lowercase b) or “Bb” (one capital B, one lowercase b).

The star means “zero or more of the previous character.” So the regex Bb* will match “B”, “Bb”, “Bbb”, “Bbbb”, and so on. (Note that you must have the capital B, or the match will fail; the B must match before b* can try to.)

The plus means “one or more of the previous character.” So the regex Bb+ will match “Bb”, “Bbb”, “Bbbb”, and so on, but will not match “B” by itself.

Combining the dot with the plus or star solves our salutation problem. The regex .* means “zero or more of any character,” while the regex .+ means “one or more of any character.” So either  Dear.*: or  Dear.+: will match any imaginable business salutation.

Which is better, the star or the plus? It depends. If you know that you occasionally forget to insert the name of the person you’re writing to on your salutation line, the star would obviously be better (because it would match a line consisting entirely of “Dear:” or “Dear  :”). If you want to require that some kind of name be there, however, it’s wise to use the plus.
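The difference between the question mark, the star, and the plus is easiest to see side by side. Here is a minimal Python sketch; the helper full_matches and the sample salutations are my own inventions for illustration, not part of any library.

```python
import re

candidates = ["B", "Bb", "Bbb", "Bbbb"]

def full_matches(regex, strings):
    """Return the strings that fit the regex from start to finish."""
    return [s for s in strings if re.fullmatch(regex, s)]

optional = full_matches(r"Bb?", candidates)  # zero or one b
star = full_matches(r"Bb*", candidates)      # zero or more b's
plus = full_matches(r"Bb+", candidates)      # one or more b's

# Hypothetical salutation lines, one of them missing its name.
salutations = ["Dear Sir:", "Dear Ms. Lovelace:", "Dear:"]
with_star = [s for s in salutations if re.search(r"Dear.*:", s)]
with_plus = [s for s in salutations if re.search(r"Dear.+:", s)]

print(optional, star, plus)
print(with_star)
print(with_plus)
```

Only the star version finds the nameless “Dear:” line, which is exactly the trade-off described above.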

Cancelling special meanings

But what if you want to find an actual period (or an actual plus, or an actual question mark, or an actual star) in a regular expression? What if (to use a silly example) you want to find every letter where you mistyped the colon as a period in your salutation?

The backslash (\) signals to a regular expression that the following character, if it has a special regular expression meaning, should be interpreted literally, not in its special regex sense. So  .  by itself means “any single character,” while  \.  with the backslash means an actual period. This works for any character that has a special meaning in a regular expression:  \*  means an actual star, and  \+  means an actual plus. Of course, the backslash works on itself, too; to search for a literal backslash, a regex pattern must contain  \\ .
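To see the backslash at work, here is a short Python sketch using the silly mistyped-salutation example. The sample lines are invented. The re.escape function is part of Python’s standard re module; it adds the backslashes for you when you want a whole string taken literally.

```python
import re

# Invented salutation lines; the second one ends in a period by mistake.
lines = ["Dear Sir:", "Dear Madam.", "Dear Mr. Smith:"]

# An unescaped final dot accepts any character, so every line fits.
unescaped = [s for s in lines if re.fullmatch(r"Dear.+.", s)]

# An escaped final dot accepts only an actual period.
escaped = [s for s in lines if re.fullmatch(r"Dear.+\.", s)]

# re.escape backslashes the special characters in a string for you.
literal_sum = re.escape("2+2")
plain_hit = re.fullmatch(r"2+2", "222")        # plus means "one or more 2s"
literal_hit = re.fullmatch(literal_sum, "2+2")  # plus means a plus sign

print(unescaped)
print(escaped)
```

Note the r before each pattern string: it tells Python itself to leave the backslashes alone so that they reach the regex engine intact.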

Some other special regex characters

Aside from brackets and parentheses (to be discussed in future lessons), a few other characters have special meanings in a regex: the caret (^), which matches at the beginning of a document, and the dollar sign ($), which matches at the end of a document. They aren’t used as often as the dot, question mark, star, and plus; I mention them mostly so that you know to use a backslash if you mean the actual character.

Python note: The ^ and $ characters can also be forced to mean the beginning and end of a line, respectively, if the Python flag “multiline” is triggered. More on Python flags in a future lesson. Do note, though, that regex engines in text editors like BBEdit and UltraEdit treat these characters this way all the time.
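Here is a rough Python sketch of that difference, using the letter from Lesson 1 squeezed into one string. By default ^ and $ anchor to the start and end of the whole string; with re.MULTILINE they anchor to each line. The variable names are mine.

```python
import re

letter = "January 1, 2000\n\nDear Sir:\n\nSincerely,\nComputer User"

# Without the flag, ^ only matches at the very start of the string,
# and the string starts with "January", not "Dear".
start_only = re.findall(r"^Dear.+:", letter)

# With MULTILINE, ^ matches at the start of every line.
per_line = re.findall(r"^Dear.+:", letter, re.MULTILINE)

# A closing is a whole line that ends with a comma.
closings = re.findall(r"^.+,$", letter, re.MULTILINE)

print(start_only)
print(per_line)
print(closings)
```

The dateline contains a comma too, but not at the end of its line, so the closing pattern leaves it alone.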

Summary

In a nutshell, here is everything discussed in this lesson:

Symbol  Meaning
.       Any single character.
?       Zero or one of the previous character.
*       Zero or more of the previous character.
+       One or more of the previous character.
^       Beginning of a document (or line).
$       End of a document (or line).
\       Forces a special regex character to be interpreted literally.

Exercises

  1. Which regex matches one or more actual periods? (This is a TRICK QUESTION.)

    .+
    \.+

  2. Which among the following words is matched by the regex  p.p.+  ?

    pop
    pops
    popper
    p.p
    p.ps
    p ps

  3. What pattern(s) might you use to find the closing (“Sincerely,” “Yours truly,” etc.) to a letter?

  4. If you were writing a friendly letter instead of a business letter, what else would you find with the pattern you wrote for the previous question? Can you find just the closing with what you know about regular expressions? If not, what else do you think you need?

Lesson 3: Character classes

Remember the dateline of our letter? We could find it with the regex  .+ ..?, .... , but that isn’t terribly specific, and it looks awful. Could you easily guess, looking at that regex, that what you want is a date?

Dates are far more predictable than that; they contain (ignoring spaces) a word, a number, a comma, and a four-digit number. Surely there is some way to distinguish letters from numbers?

Indeed there is. Regex engines allow you to create “character classes” that narrow a search to specific collections of characters.

Character classes are contained within square brackets  [] . Simply put the collection of characters you want to search for inside the square brackets. If you want not just one of them, you may use the question mark, star, or plus after the ending square bracket.

So one way to find our elusive dateline would be:

[JFMASOND][abceghilmnoprstuvy]+ [123]?[0123456789], [12][09][0123456789][0123456789]

Yikes! That will work (for the dates you’ll see in a business letter; obviously it doesn’t cover all possible dates), but it isn’t very pretty. (And can you spot where it’s a tiny bit over-general?) Fortunately, there is a shortcut. Square brackets also allow you to look for ranges of characters, such as “the digits 0 through 9” or “the lowercase letters a through z.” Simply separate the beginning and end of the range with a hyphen: for example, the character class for “all digits” is [0-9] .

Let’s try that dateline again:

[A-Z][a-z]+ [1-3]?[0-9], [12][09][0-9][0-9]

That’s rather better. It could be simplified further at the cost of a little specificity (how?), but as it stands, it’s fairly understandable, and will get the job done.
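If you want to check the dateline regex against a few strings, a Python sketch like the following will do. The sample strings are invented, and I use re.fullmatch so that each string must fit the pattern from start to finish.

```python
import re

dateline = re.compile(r"[A-Z][a-z]+ [1-3]?[0-9], [12][09][0-9][0-9]")

samples = [
    "January 1, 2000",
    "March 15, 1999",
    "january 1, 2000",   # no capital letter
    "January 1 2000",    # no comma
    "Smarch 39, 2999",   # nonsense, but it has the right shape
]

for text in samples:
    verdict = "matches" if dateline.fullmatch(text) else "does not match"
    print(f"{text!r} {verdict}")

shaped_like_dates = [s for s in samples if dateline.fullmatch(s)]
```

The last sample is a reminder that a regex checks the shape of the text, not whether the date makes sense.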

One thing you must know about character classes is that special regex characters are not special any more inside them. So another way to search for literal dots, stars, etc. is to put them inside square brackets:  [.?*+]  would look for a dot, a question mark, a star, or a plus. Another handy tip involves hyphens: to include them in a character class (since normally they indicate a range), the simplest way is to put the hyphen first in that class. So the regex  [-0-9] would search for any digit or a hyphen.

One more important trick with character classes is that they are negatable: that is, you can just as easily look for any character that is not in a class, as any character that is. You could look for “any character that is not a number,” or “any character that is not a space.” To negate a character class, put ^ (shift-6) as the first character in the class. The regex  [^0-9]  represents any non-number character (like  .  without the numbers).

But ^ already has a meaning! Yes, it does, but the meaning “at the beginning of a document” only applies outside character classes. So the expression  ^[^A-Z]  will find a non-capital-letter character (a number, space, punctuation mark, or lowercase letter) at the very beginning of a document.

Negation is truly handy when looking for something between delimiters, like parentheses, square brackets, or wedges. The easiest way to find something between wedges (such as an XML tag) is to type the opening wedge, then a class negating the closing wedge, then the closing wedge:  <[^>]+> . It looks terrible, but works beautifully, and is much more reliable than trying to use the dot (for reasons to be explored in the next lesson).
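The character-class tricks from this lesson can be tried out in a few lines of Python. This is only a sketch with invented sample text; re.findall returns every non-overlapping match as a list.

```python
import re

# Everything between wedges: an opening wedge, one or more
# characters that are not a closing wedge, then the closing wedge.
xml = "<p>Some <b>bold</b> text.</p>"
tags = re.findall(r"<[^>]+>", xml)

# Special characters lose their powers inside square brackets.
specials = re.findall(r"[.?*+]", "2+2=4. Really? Yes*")

# A hyphen placed first in the class is an ordinary hyphen.
digits_and_hyphens = re.findall(r"[-0-9]+", "Call 555-1234 now")

# Outside the brackets ^ means "beginning"; inside it means "not".
starts_without_capital = [
    s for s in ["january", "January", "1 January"]
    if re.search(r"^[^A-Z]", s)
]

print(tags)
print(specials)
print(digits_and_hyphens)
print(starts_without_capital)
```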

Summary

Here’s what we know from this and previous lessons:

Symbol  Meaning outside character class                                         Meaning inside character class
.       Any single character.                                                   .
?       Zero or one of the previous character.                                  ?
*       Zero or more of the previous character.                                 *
+       One or more of the previous character.                                  +
^       Beginning of a document or line.                                        Negates character class.
$       End of a document or line.                                              $
\       Forces a special regex character to be interpreted literally.           (more on this in next lesson)
[ ]     Demarcates a character class. To interpret literally, use a backslash.

Exercises

  1. Write a regex that will find:

    any uppercase or lowercase letter
    any sentence-ending punctuation mark
    any character that is not a sentence-ending punctuation mark
    a star at the beginning of a document
    a dollar sign at the end of a document
    an XML character entity
    anything inside parentheses (be careful!)

  2. In Europe, dates are typically written “1 January 2000”. Write a regex that will find this kind of date.
  3. Write a regex that will find the time of day, assuming that it is always followed by “a.m.” or “p.m.”
  4. Write a regex that finds email addresses.
  5. Write a regex that will separate a journal article title from its subtitle. Do not use negation.
  6. Rewrite the previous regex using negation. Will one of your regexes work more consistently than the other? Why?

And some brainteasers:

  1. Why might one use a character set  [0-9ivxlcdm]+ ? (Hint: Think about book indexes.)
  2. Inside a character class, you can have  [ , but not  ] . Why?

Lesson 4: Metacharacters

If you’re typing away in a word processor or text editor, how do you start a new paragraph?

Most people hit return, and perhaps follow up with a tab. Unless you have “show invisibles” turned on, you can’t see a character on the screen that indicates the key you typed, but you see the result: you start a new line if you hit return, and the cursor moves in a bit if you hit tab.

Can you specify these invisible characters in a regular expression? Absolutely. All you need is a backslash.

As you recall, a backslash tells a regular expression that a character with a special meaning should be interpreted as itself, not the special meaning. The second use of a backslash is to give certain other characters special meanings.

To find a hard return with a regular expression, use the pattern  \n  (for “newline”). To find a tab, use the pattern  \t . These can be qualified with the usual occurrence indicators (question mark, star, etc). So to find files whose authors have the obnoxious habit of typing a whole bunch of carriage returns at the end, you could use \n+$.

BBEdit note: Because of ancient and bizarre computer rituals involving Macintoshes, BBEdit tends to recognize newlines as \r, not \n. If \n doesn’t work, try \r.

Another use for backslashed characters is as a shortcut for common character classes. These shortcuts vary highly across regex engines; never assume that your favorite one will be available in a different program! The most commonly available of these characters are  \d , which means “any digit” (that is,  [0-9] ), and  \s , which means “any whitespace character” (tab, space, or hard return).

These backslashed characters work inside character classes also. You could find the beginning of a paragraph, whether you use tabs or not, with  [\n\t]+ .

Backslashes also have the same function inside character classes as outside: that of protecting characters that would otherwise have a special meaning. Remember the brainteaser from the last lesson about ] inside a character class? If you want to include ] in a character class, simply backslash it:  [\]] . This kind of pattern can quickly become excruciatingly hard to read (programmer-types call it “obfuscated” and hold contests for who can produce the most obfuscated regex), but it is useful nevertheless.
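Here is a brief Python sketch of the backslashed characters from this lesson. The sample text is invented, and as the warning above says, the shortcuts available in your own editor may differ from Python’s.

```python
import re

text = "Title\n\tFirst paragraph.\n\nSecond paragraph, call 555 1234.\n\n\n"

# A bunch of hard returns at the very end of the text.
trailing_returns = re.search(r"\n+$", text).group()

# \d is shorthand for [0-9].
numbers = re.findall(r"\d+", text)

# \s covers spaces, tabs, and hard returns alike.
words = re.split(r"\s+", "one \t two\n three")

# Backslashed characters work inside a character class too.
breaks = re.findall(r"[\n\t]+", text)

# A backslash protects ] inside a class.
closers = re.findall(r"[\]]", "a[1] b[2]")

print(repr(trailing_returns), numbers, words, breaks, closers)
```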

Dots, newlines, and greed

Now that you are accustomed to the idea of treating a hard return as a character, you should know one thing about the hard return character: a dot does not match it! This limits regexes to searching one line of text at a time.

This isn’t as bad as it sounds; as you probably already know, a “line” of text for a regex is the text between two hard returns, regardless of “soft” returns (where your text editor is obligingly word-wrapping for you). Moreover, it’s possible to get past this limitation sometimes: remember that \s matches hard return as well as space and tab characters. (More ways to evade this limitation will appear in a future lesson.)

There is method to this madness; it puts a leash on a regex behavior known as “greed” that can get you in trouble. Simply put, greed is a regex’s desire to make the largest match it can, even when a smaller match is available. If you turn the regex  <I>(.+)</I>  loose on the following bit of text:

<I>Some italicized text.</I> Some more text. <I>Some more italicized text.</I>

you may not get what you bargained for. In fact, you’ll get the whole line, when all you probably wanted was “Some italicized text.” Again, remember that the regex makes the largest match it can.

This behavior is sometimes annoying and must be watched for. If the dot matched a newline, though, greed could be absolutely devastating; imagine the above regex matching most of a file!
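You can watch greed in action with a few lines of Python. This sketch runs the regex above on the sample line, then tries the negated character class from Lesson 3 as one possible way to rein it in, and finally shows that the dot stops at a hard return. The variable names are mine.

```python
import re

line = ("<I>Some italicized text.</I> Some more text. "
        "<I>Some more italicized text.</I>")

# Greedy: .+ runs to the end of the line, then backs up only as far
# as the LAST </I>, so the match swallows everything in between.
greedy = re.search(r"<I>(.+)</I>", line)

# A negated class cannot run past the next wedge, so it stops early.
careful = re.search(r"<I>([^<]+)</I>", line)

# The dot does not match a hard return, so this finds nothing at all.
two_lines = "<I>Some italicized\ntext.</I>"
across = re.search(r"<I>(.+)</I>", two_lines)

print(greedy.group(1))
print(careful.group(1))
print(across)
```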

Summary

It can be hard to remember when to use a backslash and when not. Eventually, you’ll get used to it. If you’re not sure, simply ask yourself “in this context, does this character mean something weird?” If the answer is “yes,” and you don’t want it to mean something weird, use a backslash.

The famous table, updated:

Symbol  Meaning outside character class                                         Meaning inside character class
.       Any single character.                                                   .
?       Zero or one of the previous character.                                  ?
*       Zero or more of the previous character.                                 *
+       One or more of the previous character.                                  +
^       Beginning of a document.                                                Negates character class.
$       End of a document.                                                      $
[ ]     Demarcates a character class. To interpret literally, use a backslash.  N/A. Use a backslash for the literal characters.
\       Forces a special character to be interpreted literally.                 Forces a special character to be interpreted literally.
        Makes some ordinary characters special.                                 Makes some ordinary characters special.
\n      Newline.                                                                Newline.
\t      Tab.                                                                    Tab.
\s      Any whitespace character (including tabs and hard returns).             Any whitespace character (including tabs and hard returns).
\d      Any digit.                                                              Any digit.

Exercises

  1. Write a regular expression that will find a P.O. Box address.
  2. Write a regular expression that will find your street address.
  3. Most business letters end with a closing, a few blank lines (for a handwritten signature), and a typed signature. Write a regular expression that will find this pattern, using your signature as a model.
  4. Write a regular expression that will find a line of a three-column, tab-delimited table.
  5. Rewrite the same expression to account for use of more than one tab to delimit columns.

Lesson 5: Parentheses and backreferencing

Thus far, we have been finding pieces of text with regular expressions, but we haven’t been doing anything with the text once we found it. This lesson will teach you how to isolate parts of a pattern and use them later.

Just as mathematical expressions do, regular expressions isolate parts of themselves using parentheses,  () . If you wanted to split that business-letter dateline into month, day, and year, you would use the regex

([A-Z][a-z]+) ([1-3]?[0-9]), ([12][09][0-9][0-9])

Note that we have now added another pair of characters with special regex meanings, which need to be backslashed outside character classes when you mean the actual parentheses.

What can you do with split-up regexes? If you’re writing a program, you can store one of the pieces in a variable and use it somewhere else. If you’re searching and replacing, you can use your pieces in your replace line. This technique is called “backreferencing.”

If you use parentheses to break up a regex search, when a match is found, the regex engine “remembers” whatever piece of the match fits inside the parentheses. (A match for the entire regex must be found before any part is “remembered.”) The engine stores the piece so that it can be retrieved with a backslashed number, starting with  \1  for the leftmost (first) group. (The number of sets of parentheses available to you will vary by regex engine. The least-capable engine I know of allows nine parenthetical groups per regex. Python allows far more: older versions capped it at 100 groups, and versions since 3.5 have no fixed limit.) Parenthetical groups are permitted to nest (groups (within groups) within groups), but they cannot otherwise overlap each other.
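For the programmers in the audience, here is a sketch of how the remembered pieces can be pulled out in Python. The group and groups methods belong to Python’s match objects; the variable names and the tiny nesting example are mine.

```python
import re

dateline = re.compile(r"([A-Z][a-z]+) ([1-3]?[0-9]), ([12][09][0-9][0-9])")

match = dateline.search("January 1, 2000")

# groups() returns every remembered piece, left to right.
month, day, year = match.groups()

# group(1) is the piece that \1 would refer to; group(0) is the whole match.
first_piece = match.group(1)
whole_match = match.group(0)

# Nested groups are numbered by where their opening parenthesis falls.
nested = re.search(r"((a)(b))", "ab").groups()

print(month, day, year)
print(nested)
```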

Remember an exercise a few lessons ago that asked you to find a dateline in European form, “1 January 2000” instead of “January 1, 2000”? A regex search-and-replace can find all your datelines and change them to the European form, so that you don’t alienate your European colleagues with your ugly American dates.

In the dateline regex above, you will see that there are three sets of parentheses, one for the month, one for the day, and one for the year. If you used that regex as the search line, you would replace it with:

\2 \1 \3

It’s just as easy to go the other way:

Search: ([1-3]?[0-9]) ([A-Z][a-z]+) ([12][09][0-9][0-9])
Replace: \2 \1, \3

The power of backreferencing is truly phenomenal. It allows text to be rearranged in almost any conceivable fashion. (The one thing it can’t do that would be nice is fold case, unfortunately.)
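In Python, the search-and-replace above can be sketched with re.sub, which takes the search pattern, the replace line, and the text to work on. The variable names are mine. As a bonus, re.sub in Python also accepts a function in place of the replace line, which is one way around the case-folding gap if you happen to be writing a program.

```python
import re

american = r"([A-Z][a-z]+) ([1-3]?[0-9]), ([12][09][0-9][0-9])"
european = r"([1-3]?[0-9]) ([A-Z][a-z]+) ([12][09][0-9][0-9])"

# Backreferences in the replace line rearrange the remembered pieces.
to_european = re.sub(american, r"\2 \1 \3", "January 1, 2000")
to_american = re.sub(european, r"\2 \1, \3", "1 January 2000")

# A replacement function receives the match and returns the new text,
# so it can change case on the way through.
def shout_month(match):
    return f"{match.group(2)} {match.group(1).upper()} {match.group(3)}"

shouted = re.sub(american, shout_month, "January 1, 2000")

print(to_european)
print(to_american)
print(shouted)
```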

One more go at the table:

Symbol  Meaning outside character class                                         Meaning inside character class
.       Any single character.                                                   .
?       Zero or one of the previous character.                                  ?
*       Zero or more of the previous character.                                 *
+       One or more of the previous character.                                  +
^       Beginning of a document.                                                Negates character class.
$       End of a document.                                                      $
[ ]     Demarcates a character class. To interpret literally, use a backslash.  N/A. Use a backslash for the literal characters.
( )     Marks a group of characters to be remembered separately.                ( )
\       Forces a special character to be interpreted literally.                 Forces a special character to be interpreted literally.
        Makes some ordinary characters special.                                 Makes some ordinary characters special.
\n      Newline.                                                                Newline.
\t      Tab.                                                                    Tab.
\s      Any whitespace character (including tabs and hard returns).             Any whitespace character (including tabs and hard returns).
\d      Any digit.                                                              Any digit.

Exercises

  1. Write a search-and-replace that will split an author name into first and last names, putting <first> tags around the first name and <last> tags around the last name. For the sake of simplicity, assume that the last name is always the last word of the name, and include middle names or initials with the first name.

Lesson 6: Advanced tricks

Python regex flags

Python allows you to change the behavior of the regular expression engine using “flags.” Three flags are commonly used; they can be set singly or in any combination:

Flag (abbreviation and full name)  Effect of setting flag                                          Behavior when flag is not set
i, IGNORECASE                      Matches are not case-sensitive                                  Matches are case-sensitive
m, MULTILINE                       Makes ^ and $ apply to each line, not to the whole file/string  ^ and $ apply to the whole file/string
s, DOTALL                          Makes . match all characters, including newline (\n).           . does not match newline

(For completeness’s sake, the other two flags are l/LOCALE and x/VERBOSE. I don’t think anyone at Impressions has ever used them.)
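Here is a sketch of the three common flags in use. In Python code a flag is passed as an extra argument (re.IGNORECASE, re.MULTILINE, re.DOTALL), and flags are combined with the | operator. The sample text is invented.

```python
import re

text = "Dear Sir:\nfirst line\nDEAR MADAM:\nlast line"

# IGNORECASE: "dear" matches "Dear" and "DEAR" alike.
any_case = re.findall(r"dear.+:", text, re.IGNORECASE)

# MULTILINE: ^ and $ apply to each line rather than the whole string.
without_m = re.findall(r"^[a-z]+ line$", text)
with_m = re.findall(r"^[a-z]+ line$", text, re.MULTILINE)

# DOTALL: the dot is allowed to cross hard returns.
without_s = re.search(r"first.+last", text)
with_s = re.search(r"first.+last", text, re.DOTALL)

# Flags can be combined.
combined = re.findall(r"^dear.+:$", text, re.IGNORECASE | re.MULTILINE)

print(any_case, without_m, with_m, combined)
print(without_s, repr(with_s.group()))
```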

UltraEdit note: UltraEdit treats ^ and $ the way Python does when MULTILINE is turned on. That is, these characters always refer to the beginning and end (respectively) of lines, not files.

BBEdit note: BBEdit does the same thing. ^ and $ are specific to lines, not files.

Two friendly tips:

Greed

If you haven’t been bitten by regex greed yet, just wait; you will. Fortunately, Python allows you to turn regex greed off, so that the engine always tries to make the smallest match it can when presented with a  *  or  + .

It’s easy. Just add a question mark after the * or +. This is most helpful when using these in combination with the dot. To find the contents of <I></I> tags while ensuring you catch only what’s in one set of them at a time, use the regex <I>.+?</I>
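A quick Python sketch shows the difference the question mark makes, using the sample line from Lesson 4. The variable names are mine.

```python
import re

line = ("<I>Some italicized text.</I> Some more text. "
        "<I>Some more italicized text.</I>")

# Greedy: one enormous match running from the first <I> to the last </I>.
greedy = re.findall(r"<I>.+</I>", line)

# Non-greedy: .+? stops at the first </I> it can, so each pair of
# tags is found separately.
lazy = re.findall(r"<I>.+?</I>", line)

# With parentheses, findall returns just the remembered contents.
contents = re.findall(r"<I>(.+?)</I>", line)

print(len(greedy), len(lazy))
print(contents)
```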

Other Python goodies

Python also allows a couple of tricks called “lookaheads” and “group naming.” They are mostly useful for programmers, not people using a text editor, so I won’t discuss them here. If you decide you might need them, they are documented in Python books or online documentation.