2013-08-14

Why writers want to learn regular expressions

Anyone who’s worked with text has been in the situation where something needs to be changed but the changes are too complex for the standard search and replace function to do any good, resulting in hours of manual work. For these situations, learning regular expressions is a worthwhile investment. Though regular expressions may look scary, just understanding the basics will save you hours of time.

Simply put, a regular expression offers a very powerful way to search (and also replace) text and makes the computer do work you’d otherwise have to spend hours doing.

These days most self-respecting text editors support regular expressions. Unfortunately, many people turn back at the sight of them, because they look far more daunting than they actually are. That is a pity, as they’re useful to more people than just programmers. Anyone who writes regularly will benefit from understanding how to use them.

If you invest some time in learning them, the investment will pay for itself many times over. Often the standard search and replace isn’t up to the task, and most people resort to making the changes by hand. This is boring, tedious and repetitive work that you can often avoid with the help of regular expressions.

What you can do with them

Regular expressions (or “regexes” for short) open up the possibility of smarter pattern matching. This means you can search for more than just exact matches (replace this with that) and also find things like:

Find every word with five characters
Find every line that has more than 50 characters in it
Find every number
Find every word starting with O, F or G
Find every phone number

Regular expressions can also be used to replace text. For example:

Find word and replace with (word)
Prefix every line in the document with >

Keep reading to learn the regular expression for each of the examples above.

Let’s try it out

Regular expressions look like formulas and can seem quite cryptic if you have never seen them before. But don’t give up yet. They’re not that hard. In regular expressions there are several special sequences of characters that have special meaning. In this short introduction I will explain some of these.

Note: In programming parlance, sequences of characters are called strings, a word I will use from here on.

These special strings aren’t interpreted literally. When the software sees one of them, it doesn’t look for that particular string in the text you are searching; instead it treats it as standing for something else. This allows you to describe what you are looking for in broad terms, such as “a word that’s this long”, and so do things regular search and replace can’t.

These special sequences fall into the following broad categories:

Classes: what kind of character they match such as letter, number or something else
Quantifiers: how many times the character appears
Anchors: referring to the beginning and end of a line, they allow you to match entire lines of text
References: sequences referring to something matched earlier in the expression, useful for replacements

In the following examples I will introduce some of these special strings and how they can be used to do some heavy lifting.

For some of these examples I will be using the first three sentences from the Hitchhiker’s Guide to the Galaxy by Douglas Adams.

Finding every word that’s five characters long

For this expression we will be using:

The character class \w which will match a single “word” character (a letter a to z or A to Z, a digit 0 through 9, or the underscore).
The word boundary \b which matches the position where a word character isn’t followed or preceded by another word character (such as the edge of a word next to a space, or the very start or end of the text).
The simple quantifier {n} which lets us define exactly how many word characters we want to match.

Regular Expression: \b\w{5}\b

In human words: Look for a sequence of exactly 5 “word” characters that is not preceded or followed by another “word” character.

Result (matches in square brackets):

Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a [small] unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million [miles] is an utterly insignificant little blue [green] planet [whose] ape- descended life [forms] are so amazingly primitive that they [still] [think] digital watches are a pretty neat idea. This planet has or rather had a problem, [which] was this: most of the people on it were unhappy for pretty much of the time.
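If you would rather try this outside a text editor, the same expression can be run from a short script. What follows is only a minimal sketch using Python’s standard re module; the variable names are my own, and the text is shortened to the first part of the passage.

```python
import re

# The opening of the passage used in the example above (shortened).
text = (
    "Far out in the uncharted backwaters of the unfashionable end of the "
    "western spiral arm of the Galaxy lies a small unregarded yellow sun. "
    "Orbiting this at a distance of roughly ninety-two million miles is an "
    "utterly insignificant little blue green planet"
)

# \b = word boundary, \w{5} = exactly five word characters.
# The r"..." prefix stops Python from interpreting the backslashes itself.
five_letter_words = re.findall(r"\b\w{5}\b", text)

print(five_letter_words)
```

Without the two \b anchors, the expression would also pick five-character chunks out of longer words such as “uncharted”.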

Find every line that has more than 50 characters in it

For this expression we will be using:

The “match all” . (dot, period) special character, which matches any character, including spaces (but normally not the line break itself).
The “at least n times” {n,} quantifier allowing us to set the minimum number of times a character should repeat.
The anchors ^ and $ denoting the beginning and the end of a line (in the case of using a text editor).

Regular expression: ^.{50,}$

In human words: Starting at the beginning of the line and until its end, look for at least 50 characters of any kind. (If you want strictly more than 50 characters, use {51,} instead.)

Text:

Far out in the uncharted backwaters of the unfashionable
end of the western spiral arm of the Galaxy
lies a small unregarded yellow sun. Orbiting this at a
distance of roughly ninety-two million miles is
an utterly insignificant little blue green planet whose
ape-descended life forms are so amazingly primitive that
they still think digital watches are a pretty neat idea.
This planet has or rather had a problem, which was this:
most of the people on it were unhappy for pretty much
of the time.

Result (only the lines that match):

Far out in the uncharted backwaters of the unfashionable
lies a small unregarded yellow sun. Orbiting this at a
an utterly insignificant little blue green planet whose
ape-descended life forms are so amazingly primitive that
they still think digital watches are a pretty neat idea.
This planet has or rather had a problem, which was this:
most of the people on it were unhappy for pretty much
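The line-length search can be sketched in a script as well. This is an illustration in Python rather than what any particular text editor does; the re.MULTILINE flag is what makes ^ and $ work line by line, which many editors do without being asked. Only the first four lines of the text are used.

```python
import re

lines = [
    "Far out in the uncharted backwaters of the unfashionable",
    "end of the western spiral arm of the Galaxy",
    "lies a small unregarded yellow sun. Orbiting this at a",
    "distance of roughly ninety-two million miles is",
]
text = "\n".join(lines)

# With re.MULTILINE, ^ and $ match at the start and end of every line
# instead of only at the start and end of the whole text.
long_lines = re.findall(r"^.{50,}$", text, flags=re.MULTILINE)

for line in long_lines:
    print(len(line), line)
```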

Find every number

For this expression we will be using:

The “number” \d character class.
The “at least one” + quantifier.

Regular expression: \d+

In human words: Find every sequence of characters that consists of at least one digit.

Note that this particular expression will not work with decimal numbers: it treats a number such as 3.14 as two separate numbers, 3 and 14.

Text: There were 5 cows and 56 chickens, in addition to the thirty-five hogs, at the farm.

Result (matches in square brackets): There were [5] cows and [56] chickens, in addition to the thirty-five hogs, at the farm.
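Here is the same search as a small Python sketch, again using the standard re module with variable names of my own. The second search uses a made-up sentence to show what happens with a decimal number.

```python
import re

text = ("There were 5 cows and 56 chickens, in addition to "
        "the thirty-five hogs, at the farm.")

# \d+ = one or more digits in a row.
numbers = re.findall(r"\d+", text)
print(numbers)

# A decimal number is split at the dot, because "." is not a digit.
decimal_parts = re.findall(r"\d+", "The ratio is 3.14")
print(decimal_parts)
```

Note that “thirty-five” is not found, since the expression only knows about digits, not about numbers written out in words.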

Find every word starting with O, F or G

For this expression we will be using:

“Alternation”, using parentheses and the pipe character, allowing us to match one of several possible characters.
The word character class \w we used earlier.
The “at least one” + quantifier we used earlier.

Regular expression: (O|F|G)\w+

In human words: Find every sequence of characters that starts with O, F or G and is immediately followed by at least one word character.

This expression will not match an “O”, “F” or “G” standing by itself, only words that begin with one of these letters. Sometimes you need to specify that you want case matching, otherwise the software will find words starting with “o”, “f” or “g” as well. If you omit the “+”, it will only match O, F or G and the next word character, even if they’re part of a complete word. Also note that the expression as written can match in the middle of a word (the “Gregor” in “McGregor”, for example); putting the word boundary in front, \b(O|F|G)\w+, prevents that.

Text:

Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape- descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has or rather had a problem, which was this: most of the people on it were unhappy for pretty much of the time.

Result (matches in square brackets):

[Far] out in the uncharted backwaters of the unfashionable end of the western spiral arm of the [Galaxy] lies a small unregarded yellow sun. [Orbiting] this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape- descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has or rather had a problem, which was this: most of the people on it were unhappy for pretty much of the time.
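For the curious, here is a rough Python sketch of the same search, including what happens when case matching is switched off. The variable names and the shortened text are my own choices. One Python-specific detail: when an expression contains parentheses, re.findall returns only the part inside them, so this sketch uses re.finditer to get the whole word.

```python
import re

text = ("Far out in the uncharted backwaters of the unfashionable end of the "
        "western spiral arm of the Galaxy lies a small unregarded yellow sun. "
        "Orbiting this at a distance")

# finditer yields one match object per hit; group(0) is the whole match.
words = [m.group(0) for m in re.finditer(r"(O|F|G)\w+", text)]
print(words)

# re.IGNORECASE switches off case matching, so "out" and "of" match too.
# The leading \b keeps the match from starting in the middle of a word
# (otherwise "fashionable" would be picked out of "unfashionable").
loose = [m.group(0)
         for m in re.finditer(r"\b(O|F|G)\w+", text, flags=re.IGNORECASE)]
print(loose)
```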

Find every phone number

For this expression we will be using:

The “number” \d character class.
The “escape” \ character allowing us to tell the software that the next character is to be treated as being literal and not part of a special expression.
The “n times” quantifier {}.

Regular expression: \(\d{3}\)-\d{3}-\d{4}

In human words: Find every sequence starting with “(”, followed by three digits, then “)”, followed by a dash, three more digits, a dash and four more digits.

This assumes a valid phone number has the format (xxx)-xxx-xxxx. In this expression we need to use the backslash escape character \ to tell the software that each parenthesis is a literal parenthesis. Otherwise the software would treat it as a special character denoting a group (covered later).

Text:

123 556 334
876-12-12
1745-78-12
(444)-555-2221
001-55-123
(787)-123-4567

Result (matches in square brackets):

123 556 334
876-12-12
1745-78-12
[(444)-555-2221]
001-55-123
[(787)-123-4567]
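The phone number check can be sketched in Python like this. It is an illustration only: the names candidates, phone and matches are mine, and the expression assumes the (xxx)-xxx-xxxx format described above.

```python
import re

candidates = [
    "123 556 334",
    "876-12-12",
    "1745-78-12",
    "(444)-555-2221",
    "001-55-123",
    "(787)-123-4567",
]

# \( and \) are literal parentheses; an unescaped ( would start a group.
phone = re.compile(r"\(\d{3}\)-\d{3}-\d{4}")

# Keep only the lines in which the pattern is found.
matches = [c for c in candidates if phone.search(c)]
print(matches)
```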

Find word and replace with (word)

When you use regexes for replacements, there are even more possibilities.

For this expression we will be using:

The “escape character” \ we used earlier.
The “group” or “capturing group” characters (…) allowing us to save what we matched and refer to it later.
The “back reference” \n or sometimes $n (depending on your text editor), where n is a number, allowing us to refer back to what we captured earlier. If there are multiple capturing groups, you can refer to the first with \1, the second with \2 and so on.

Match expression: (\b\w+\b)

Replacement expression: (\1)

In human words: Find every word (as we showed earlier), capture it as a group, then replace it with the captured word wrapped in parentheses.

Result: (Far) (out) (in) (the) (uncharted) (backwaters) (of) (the) (unfashionable) (end) (of) (the) (western) (spiral) (arm) (of) (the) (Galaxy) (lies) (a) (small) (unregarded) (yellow) (sun). (Orbiting) (this) (at) (a) (distance) (of) (roughly) (ninety)-(two) (million) (miles) (is) (an) (utterly) (insignificant) (little) (blue) (green) (planet) (whose) (ape)- (descended) (life) (forms) (are) (so) (amazingly) (primitive) (that) (they) (still) (think) (digital) (watches) (are) (a) (pretty) (neat) (idea). (This) (planet) (has) (or) (rather) (had) (a) (problem), (which) (was) (this): (most) (of) (the) (people) (on) (it) (were) (unhappy) (for) (pretty) (much) (of) (the) (time).
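In a script, the same replacement might look like the sketch below. It uses Python’s re.sub, which writes back references in the replacement as \1 rather than $1; the short sample text and the variable names are my own.

```python
import re

text = "roughly ninety-two million miles"

# The parentheses in the pattern capture each word as group 1;
# \1 in the replacement puts the captured word back, wrapped in ( ).
wrapped = re.sub(r"(\b\w+\b)", r"(\1)", text)
print(wrapped)
```

The hyphen in “ninety-two” is not a word character, so the two halves are captured and wrapped separately, just as in the result above.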

Prefix every line in the document with >

For this expression we will be using:

The anchors ^ and $ we used earlier.
The “match all” . (dot) character we used earlier.
The “match all 0 or more times” * (asterisk, star) character that will match even if there aren’t any characters.
The “group” or “capturing group” characters (…) we used earlier.
The “back reference” \n or sometimes $n we used earlier.

Match expression: ^(.*)$

Replacement expression: > \1

Result:

> Far out in the uncharted backwaters of the unfashionable
> end of the western spiral arm of the Galaxy
> lies a small unregarded yellow sun. Orbiting this at a
> distance of roughly ninety-two million miles is
> an utterly insignificant little blue green planet whose
> ape-descended life forms are so amazingly primitive that
> they still think digital watches are a pretty neat idea.
> This planet has or rather had a problem, which was this:
> most of the people on it were unhappy for pretty much
> of the time.
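And finally, a minimal Python sketch of the line prefixing. As before, this is an illustration with names of my own choosing, and the re.MULTILINE flag is needed so that ^ and $ apply to each line rather than to the text as a whole. Only the first three lines of the text are used.

```python
import re

text = "\n".join([
    "Far out in the uncharted backwaters of the unfashionable",
    "end of the western spiral arm of the Galaxy",
    "lies a small unregarded yellow sun. Orbiting this at a",
])

# ^(.*)$ captures the whole content of each line as group 1;
# the replacement writes "> " followed by that captured line.
quoted = re.sub(r"^(.*)$", r"> \1", text, flags=re.MULTILINE)
print(quoted)
```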

Learning more

As you can see, regular expressions can be quite handy. These are just some simple examples that writers will find useful. There are more advanced concepts like greediness that are good to know in order to create more advanced regexes.

If you want to learn more about these concepts and more ways you can use regexes, I recommend the following:

Photo: abcdz2000
