Regular Expressions

Oftentimes, as systems administrators, we will need to search the contents of a file, like a log file. One of the commands that we use to do that is the grep command. We have already discussed using the grep command, which is not unlike doing any kind of search, such as in Google. The command simply involves running grep along with the search string and against a file.

Multiword strings

It's good habit to include search strings within quotes, but this is especially important if we would search for multiword strings. In these cases, we must enclose them in quotes.

Command:

cat cities.csv

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777

Command:

grep "San Antonio" cities.csv

Output:

San Antonio, TX | 1451853 | 1718

Whole words, case sensitive by default

As a reminder, grep commands are case-sensitive by default, and since the contents of cities.csv are all in lowercase, if I run the above command without the city named capitalized, then grep will return nothing:

Command:

grep "san antonio" cities.csv

In order to tell grep to ignore case, I need to use the -i option. We also want to make sure that we enclose our entire search string withing double quotes.

This is a reminder for you to run man grep and to read through the documentation and see what the various options exit for this command.

Command:

grep -i "san antonio" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718

Whole words by the edges

To search whole words, we can use special characters to match strings at the start and/or the end of words. For example, note the output if I search for cities in California in my file by searching for the string ca. Since this string appears in Chicago, then that city matches my grep search:

Command:

grep -i "ca" cities.csv

Output:

Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

To limit results to only CA, we can enclose our search in certain special characters:

Command:

grep -i "\bca\b" cities.csv

Output:

Los Angeles, CA   | 3898747     | 1781
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

We can reverse that output and look for strings within other words. Here is an example of searching for the string ca within words:

Command:

grep -i "\Bca\B" cities.csv

Output:

Chicago, IL       | 2746388     | 1780

Bracket Expressions and Character Classes

In conjunction with the grep command, we can also use regular expressions to search for more general patterns in text files. For example, we can use bracket expressions and character classes to search for patterns in the text. Here again using man grep is very important because it includes instructions on how to use these regular expressions.

Bracket expressions

From man grep on bracket expressions:

A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list. If the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.

The regular expression [^0123456789] matches the inverse.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters.

To see how this works, let's search the cities.csv file for letters matching A, B, or C. Specifically in the following command I use a hyphen to match any characters in the range A, B, C. The output does not include the cities Houston or Dallas since neither of those lines contain capital A, B, or C characters:

Command:

grep "[A-C]" cities.csv 

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

Bracket expressions, inverse searches

When placed after the first bracket, the carat key acts as a Boolean NOT. The following command matches any characters not in the range A,B,C:

Command:

grep "[^A-C]" cities.csv

However, the output matches all lines since there are no instances of A, B, and C in all lines:

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777

Process substitution

We can confirm that output from the first command does not include Houston or Dallas in the second command by comparing the outputs of the two commands using process substitution. This is a technique that pipes the standard output of multiple commands to be processed by another command. Here I use the diff command to compare the output of both grep commands:

Command:

diff <(grep "[A-C]" cities.csv) <(grep "[^A-C]" cities.csv)

The diff output shows that the second grep command includes the two lines below that are not in the output of the first grep command:

Output:

4a5
> Houston, TX       | 2304580     | 1837
8a10
> Dallas, TX        | 1288457     | 1856

The output of the diff command is nicely explained in this Stack Overflow answer.

Try this command for an alternate output:

diff -y <(grep "[A-C]" cities.csv) <(grep "[^A-C]" cities.csv)

Our ranges may be alphabetical or numerical. The following command matches any numbers in the range 1,2,3:

Command:

grep [1-3] cities.csv

Since all single digits appear in the file, the above command returns all lines. To invert the search, we can use the following grep command. This will match all non-integers:

Command:

grep [^0-9] cities.csv

Bracket expressions, carat preceding the bracket

We saw in a previous section that the carat ^ key indicates the start of line; however, we learned above that it is used to return the inverse of a string. To use the carat to signify the start of a line, the carat key must precede the opening bracket. For example, the following command matches any lines that start with the upper case letters within the range of N,O,P:

Command:

grep ^[N-P] cities.csv

Output:

New York City, NY | 8804190 | 1624
Phoenix, AZ       | 1624569 | 1881
Philadelphia, PA  | 1576251 | 1701

And we can reverse that with the following command, which returns all lines that do not start with N,O, or P:

Command:

grep ^[^N-P] cities.csv

Output:

City            | 2020 Census | Founded
Los Angeles, CA | 3898747     | 1781
Chicago, IL     | 2746388     | 1780
Houston, TX     | 2304580     | 1837
San Antonio, TX | 1451853     | 1718
San Diego, CA   | 1381611     | 1769
Dallas, TX      | 1288457     | 1856
San Jose, CA    | 983489      | 1777

Character classes

Character classes are special types of predefined bracket expressions. They make it easy to search for general patterns. From man grep on character classes:

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means the character class of numbers and letters ...

Here I search for anything that matches the Year column. Specifically, I search for a empty space [[:blank:]], a four digit string [[:digit:]]{4}. The {4} means "The preceding item is matched exactly 4 times" (man grep), and the number 4 can be replaced with any relevant number. and an end of line $:

Command:

grep -Eo "[[:blank:]][[:digit:]]{4}$" cities.csv 

Output:

 1624
 1781
 1780
 1837
 1881
 1701
 1718
 1769
 1856
 1777

In the above command, the [[:blank:]] can be excluded and we'd still retrieve the desired results because we've included the dollar sign to mark the end of the line, but I include it here for demonstration purposes. Note that I also added the -E option. This is required for character classes.

Anchoring

As seen above, outside of bracket expressions and character classes, we use the caret ^ to mark the beginning of a line. We can also use the $ to match the end of a line. Using either (or both) is called anchoring. Anchoring works in many places. For example, to search all lines that start with capital D through L

Command:

grep "^[D-L]" cities.csv

Output:

Los Angeles, CA | 3898747 | 1781
Houston, TX     | 2304580 | 1837
Dallas, TX      | 1288457 | 1856

And all lines that end with the numbers 4, 5, or 6:

Command:

grep "[4-6]$" cities.csv

Output:

New York City, NY | 8804190 | 1624
Dallas, TX        | 1288457 | 1856

We can use both anchors in our grep commands. The following searches for any lines starting with capital letters ranging from D through L and any lines ending with the numbers starting from 4 through 6. The single dot stands for any character, and the asterisk stands for "the preceding character will zero or more times" (man grep).

Command:

grep "^[D-L].*[4-6]$" cities.csv

Output:

Dallas, TX        | 1288457     | 1856

Repetition

If we want to use regular expressions to identify repetitive patterns, then we can use repetition operators. As we saw above, the most useful one is the * asterisk. But there are other options:

In come cases, we need to add the -E option to extend grep's regular expression functionality:

Here, the preceding item S is matched one or more times:

Command:

grep -E "S+" cities.csv

Output:

San Antonio, TX | 1451853 | 1718
San Diego, CA   | 1381611 | 1769
San Jose, CA    | 983489  | 1777

In the next search, the preceding item l is matched exactly 2 times:

Command:

grep -E "l{2}" cities.csv

Output:

Dallas, TX | 1288457 | 1856

Finally, in this example, the preceding item 7 is matched at least two times or at most three times:

Command:

grep -E "7{2,3}" cities.csv

Output:

San Jose, CA | 983489 | 1777

OR searches

We can use the vertical bar | to do a Boolean OR search. In a Boolean OR statement, the statement is True if either one part is true, the other part is true, or both are true. In a search statement, this means that at least one part of the search is true.

The following will return lines for each city because they both appear in the file:

Command:

grep -E "San Antonio|Dallas" cities.csv

Output:

San Antonio, TX | 1451853 | 1718
Dallas, TX      | 1288457 | 1856

The following will match San Antonio even though Lexington does not appear in the file:

Command:

grep -E "San Antonio|Lexington" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718

Conclusion

We covered a lot in this section on grep and regular expressions.

We specifically covered:

  • multiword strings
  • whole word searches and case sensitivity
  • bracket expressions and character classes
  • anchoring
  • repetition
  • Boolean OR searches

Even though we focused on grep, many these regular expressions work across many programming languages.

See Regular-Expression.info for more in-depth lessons on regular expressions.