Friday, August 6, 2010

RegEx

Learning About Regular Expressions

Regular expressions are a very powerful way to match arbitrary text. Stemming from neurophysiological research conducted in the early 1940s, their mathematical foundation was established during the 1950s and 1960s. They have a long history in computer science, and they are an integral part of many UNIX tools, including awk, egrep, lex, perl, and sed, as well as many text editors. Regular expressions are slower than simple pattern matching algorithms, and they can be cryptic and difficult to write correctly. Small mistakes in a pattern can yield surprising results. They are, however, vastly more succinct and powerful than simple pattern matching, and can easily handle tasks that would be difficult or impossible otherwise.

The topic of regular expressions is a very large one, complicated by the arbitrary differences in the implementations found in various tools. Anything beyond an extremely simplistic sketch is well beyond the scope of this manual. To understand them better, we recommend a good text on the subject, such as "Mastering Regular Expressions", by Jeffrey E.F. Friedl (O'Reilly & Associates, Inc, ISBN 1-56592-257-3). The following is an abbreviated, simplified, and incomplete explanation of regular expressions, sufficient to gain a cursory understanding of them.

The regular expression engine attempts to match the regular expression against the input string. Such matching starts at the beginning of the string and moves from left to right. The matching is considered to be "greedy", because at any given point, it will always match the longest possible substring. For example, if a regular expression could match the substring 'aa' or 'aaa', it will always take the longer option.
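The greedy, left-to-right behavior described above can be sketched in a few lines. This is only an illustration written in Python, whose `re` module is a different engine from the one behind IDL's STREGEX; the helper name `first_match` is hypothetical. Repetition operators are greedy in both, but the two engines can differ in other cases (alternation, for instance), so treat this as a sketch rather than a statement about IDL.

```python
import re

def first_match(pattern: str, text: str) -> str:
    """Return the text of the leftmost match, or '' if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""

# 'a{2,3}' could match 'aa' or 'aaa'; the greedy quantifier takes the longer one.
greedy = first_match(r"a{2,3}", "xaaaay")

# Scanning runs left to right, so the first run of a's is the one reported.
leftmost = re.search(r"a+", "baaab-aaaaa").span()

print(greedy)    # aaa
print(leftmost)  # (1, 4)
```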

Meta Characters

A regular expression "ordinary character" is a character that matches itself. Most characters are ordinary. The exceptions, sometimes called "meta characters", have special meanings. To convert a meta character into an ordinary one, you "escape" it by preceding it with a backslash character (e.g. '\*').
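As a small illustration of escaping, the sketch below compares an unescaped period with an escaped one, and shows an escaped asterisk and backslash. It uses Python's `re` module as a stand-in (an assumption of this example, not part of the original text); the escaping rule shown is the same backslash convention described above.

```python
import re

text = "1.5 and 125"

# Unescaped, '.' is a meta character and matches any single character.
unescaped = re.findall(r"1.5", text)

# Escaped, '\.' is an ordinary character and matches only a literal period.
escaped = re.findall(r"1\.5", text)

# '\*' matches a literal asterisk instead of acting as a repetition operator.
star_position = re.search(r"\*", "a*b").start()

# A double backslash in the pattern matches one literal backslash.
backslashes = re.findall(r"\\", "C:\\temp\\x")  # the string is C:\temp\x

print(unescaped)         # ['1.5', '125']
print(escaped)           # ['1.5']
print(star_position)     # 1
print(len(backslashes))  # 2
```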

Character   Description

. The period matches any single character.

[ ] The open bracket character indicates a "bracket expression", which is discussed below. The close bracket character terminates such an expression.

\ The backslash suppresses the special meaning of the character it precedes, and turns it into an ordinary character. To insert a backslash into your regular expression pattern, use a double backslash ('\\').

( ) The open parenthesis indicates a "subexpression", discussed below. The close parenthesis character terminates such a subexpression.

Repetition Characters

The characters below are used to specify repetition. The repetition is applied to the character or expression directly to the left of the repetition operator.

* Zero or more of the character or expression to the left. Hence, 'a*' means "zero or more instances of 'a' ".

+ One or more of the character or expression to the left. Hence, 'a+' means "one or more instances of 'a'".

? Zero or one of the character or expression to the left. Hence, 'a?' will match 'a' or the empty string ''.

{} An interval qualifier allows you to specify exactly how many instances of the character or expression to the left to match. If it encloses a single unsigned integer length, it means to match exactly that number of instances. Hence, 'a{3}' will match 'aaa'. If it encloses 2 such integers separated by a comma, it specifies a range of possible repetitions. For example, 'a{2,4}' will match 'aa', 'aaa', or 'aaaa'. Note that '{0,1}' is equivalent to '?'.

| Alternation. This operator is used to indicate that one of several possible choices can match. For example, '(a|b|c)z' will match any of 'az', 'bz', or 'cz'.

^ $ Anchors. A '^' matches the beginning of a string, and '$' matches the end. As we have seen above, regular expressions usually match any possible substring. Anchors can be used to change this and require a match to occur at the beginning or end of the string. For example, '^abc' will only match strings that start with the string 'abc'. '^abc$' will only match a string containing only 'abc'.
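The repetition, alternation, and anchor examples from the table can be checked with a short sketch. Python's `re` module is used here as a stand-in, and the helper `matches_whole` is a hypothetical name introduced for this example; details may differ in IDL's STREGEX, so this is an illustration of the operators rather than of IDL itself.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text, not just a substring."""
    return re.fullmatch(pattern, text) is not None

# Repetition operators apply to the item directly to their left.
star_empty = matches_whole(r"a*", "")       # zero instances are allowed
plus_empty = matches_whole(r"a+", "")       # at least one instance is required
optional_two = matches_whole(r"a?", "aa")   # at most one instance is allowed
exact_three = matches_whole(r"a{3}", "aaa")
range_hits = [n for n in range(7) if matches_whole(r"a{2,4}", "a" * n)]

# Alternation: any one of the choices may precede 'z'.
candidates = ("az", "bz", "cz", "dz")
alternation_hits = [s for s in candidates if matches_whole(r"(a|b|c)z", s)]

# Anchors restrict where in the string a substring match may occur.
samples = ("abc", "abcdef", "xabc")
unanchored = [s for s in samples if re.search(r"abc", s)]
starts_with = [s for s in samples if re.search(r"^abc", s)]
only_abc = [s for s in samples if re.search(r"^abc$", s)]

print(star_empty, plus_empty, optional_two, exact_three)  # True False False True
print(range_hits)        # [2, 3, 4]
print(alternation_hits)  # ['az', 'bz', 'cz']
print(unanchored)        # ['abc', 'abcdef', 'xabc']
print(starts_with)       # ['abc', 'abcdef']
print(only_abc)          # ['abc']
```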

Subexpressions

Subexpressions are those parts of a regular expression enclosed in parentheses. There are two reasons to use subexpressions:

To apply a repetition operator to more than one character. For example, '(fun){3}' matches 'funfunfun', while 'fun{3}' matches 'funnn'.

To allow location of the subexpression using the SUBEXPR keyword to STREGEX.
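Both uses of subexpressions can be sketched as follows. The example is in Python, not IDL: the match object's `span` and `group` methods are used here as a rough analogue of locating subexpressions, and nothing below shows the actual behavior of the SUBEXPR keyword. The sample string and pattern are made up for illustration.

```python
import re

# 1. Grouping changes what a repetition operator applies to.
grouped = re.fullmatch(r"(fun){3}", "funfunfun") is not None    # repeats 'fun'
ungrouped = re.fullmatch(r"fun{3}", "funnn") is not None        # repeats only 'n'
ungrouped_on_repeat = re.fullmatch(r"fun{3}", "funfunfun") is not None

# 2. Each parenthesized subexpression can be located within the match.
m = re.search(r"([a-z]+)=([0-9]+)", "set width=640 now")
whole_span = m.span(0)   # position of the full match
name_span = m.span(1)    # position of the first subexpression
value_span = m.span(2)   # position of the second subexpression

print(grouped, ungrouped, ungrouped_on_repeat)  # True True False
print(whole_span, name_span, value_span)        # (4, 13) (4, 9) (10, 13)
print(m.group(1), m.group(2))                   # width 640
```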

Bracket Expressions

Bracket expressions (expressions enclosed in square brackets) are used to specify a set of characters that can satisfy a match. Many of the meta characters described above (.*[\) lose their special meaning within a bracket expression. The right bracket loses its special meaning if it occurs as the first character in the expression (after an initial '^', if any).

There are several different forms of bracket expressions, including:

Matching List — A matching list expression specifies a list that matches any one of the characters in the list. For example, '[abc]' matches any of the characters 'a', 'b', or 'c'.

Non-Matching List — A non-matching list expression begins with a '^', and specifies a list that matches any character not in the list. For example, '[^abc]' matches any character except 'a', 'b', or 'c'. The '^' only has this special meaning when it occurs first in the list, immediately after the opening '['.

Range Expression — A range expression consists of 2 characters separated by a hyphen, and matches any characters lexically within the range indicated. For example, '[A-Za-z]' will match any alphabetic character, upper or lower case. Another way to get this effect is to specify '[a-z]' and use the FOLD_CASE keyword to STREGEX.
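The three forms of bracket expression, along with the loss of special meaning inside brackets, can be tried out in a short sketch. It is written in Python as a stand-in for IDL; the `re.IGNORECASE` flag is used here only as an approximate counterpart to the case-folding idea mentioned above, and is not a claim about how FOLD_CASE is implemented.

```python
import re

word = "cabbage"

# Matching list: any one of 'a', 'b', or 'c'.
matching = re.findall(r"[abc]", word)

# Non-matching list: any character except 'a', 'b', or 'c'.
non_matching = re.findall(r"[^abc]", word)

# Range expression: runs of alphabetic characters, upper or lower case.
text = "IDL 8 rocks"
ranged = re.findall(r"[A-Za-z]+", text)

# The same effect with a lower-case range plus case-insensitive matching.
folded = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)

# Inside brackets, '.' and '*' are ordinary characters.
literal_meta = re.findall(r"[.*]", "a.b*c")

# A ']' placed first in the list is taken literally.
literal_bracket = re.findall(r"[]x]", "a]x")

print(matching)         # ['c', 'a', 'b', 'b', 'a']
print(non_matching)     # ['g', 'e']
print(ranged)           # ['IDL', 'rocks']
print(folded)           # ['IDL', 'rocks']
print(literal_meta)     # ['.', '*']
print(literal_bracket)  # [']', 'x']
```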

Special Characters in Regular Expressions

Special (non-printing) characters are often represented in regular expressions using backslash escape codes, such as \t to represent a TAB character or \n to represent a newline character. IDL does not support these backslash codes in regular expressions. See Non-Printing Characters for information on how to represent these special characters in regular expressions.
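One general way to work without backslash codes is to place the non-printing character itself into the pattern string. The sketch below illustrates that idea in Python by building the pattern from a character code. This is an assumption made for illustration only: it does not reproduce what the Non-Printing Characters section recommends for IDL, and Python's own `re` module does accept \t, so the construction is not needed there.

```python
import re

# Build the TAB character from its character code instead of writing '\t'.
TAB = chr(9)

# A bracket expression containing a literal TAB and a space, repeated.
pattern = "[" + TAB + " ]+"

# Split a line on any run of tabs and spaces.
line = "alpha" + TAB + "beta  gamma"
fields = re.split(pattern, line)

print(fields)  # ['alpha', 'beta', 'gamma']
```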
