Regular Expressions with grep
By the end of this section, you will:
- Understand the purpose of
grep
: Recognize the versatility ofgrep
for searching through text and its use in filtering output, searching for patterns in files, and extracting relevant data. - Perform basic searches using
grep
: Search for multiword strings and whole words while understanding how to handle case sensitivity and word boundaries. - Utilize regular expressions: Apply regular expressions with
grep
to search for more complex text patterns, using features like bracket expressions, character classes, and anchoring. - Leverage repetition and OR operators: Use repetition operators (e.g.,
*
,+
) and Boolean OR searches to find repetitive patterns or multiple possible matches in your text. - Compare outputs with process substitution: Understand how to compare the
output of multiple
grep
commands using process substitution techniques with thediff
command. - Understand broader applications: Gain a foundational understanding of
regular expressions that apply across multiple programming languages and
tools beyond just
grep
.
Getting Started
The grep
command is a powerful tool used in the Linux command line for searching through text.
It scans files or input for lines that match a specified pattern, which can be a simple word or a more complex regular expression.
grep
is often used to filter output, search for specific data in logs, or find occurrences of certain text patterns within files.
Its versatility makes it an essential tool for efficiently locating information in large sets of data or documents.
In this section, we learn how to use grep
to search files.
We will use simple search strings with grep
to search for regular words.
But we will use regular expressions to search for more complex patterns.
Download Data File
To follow along in this tutorial, download the following file to your home directory on your Google Cloud VM:
wget https://raw.githubusercontent.com/cseanburns/linux_sysadmin/refs/heads/master/data/cities.md
Multiword strings
It's good habit to include search strings within quotes, but this is especially important if we would search for multiword strings. In these cases, we must enclose them in quotes.
Command:
cat cities.md
Output:
| City | 2020 Census | Founded |
|-------------------|-------------|---------|
| New York City, NY | 8804190 | 1624 |
| Los Angeles, CA | 3898747 | 1781 |
| Chicago, IL | 2746388 | 1780 |
| Houston, TX | 2304580 | 1837 |
| Phoenix, AZ | 1624569 | 1881 |
| Philadelphia, PA | 1576251 | 1701 |
| San Antonio, TX | 1451853 | 1718 |
| San Diego, CA | 1381611 | 1769 |
| Dallas, TX | 1288457 | 1856 |
| San Jose, CA | 983489 | 1777 |
Command:
grep "San Antonio" cities.md
Output:
| San Antonio, TX | 1451853 | 1718 |
Whole words, case sensitive by default
As a reminder, grep
commands are case-sensitive by default.
Thus, note that the contents of cities.md are all in lowercase.
If I run the above command without the city named capitalized, then grep
will return nothing:
Command:
grep "san antonio" cities.md
To tell grep to ignore case, I need to use the -i
option.
We also want to make sure that we enclose our entire search string withing double quotes.
This is a reminder for you to run man grep
and to read through the documentation and see what the various options exit for this command.
Command:
grep -i "san antonio" cities.md
Output:
| San Antonio, TX | 1451853 | 1718 |
Whole words by the edges
To search whole words, we can use special characters to match strings at the start and/or the end of words. For example, note the output if I search for cities in California in my file by searching for the string ca. Since this string appears in Chicago, then that city matches my grep search:
Command:
grep -i "ca" cities.md
Output:
| Los Angeles, CA | 3898747 | 1781 |
| Chicago, IL | 2746388 | 1780 |
| San Diego, CA | 1381611 | 1769 |
| San Jose, CA | 983489 | 1777 |
To limit results to only CA, we can enclose our search in special characters that tell grep
to limit by whole words only:
Command:
grep -i "\bca\b" cities.md
Output:
| Los Angeles, CA | 3898747 | 1781 |
| San Diego, CA | 1381611 | 1769 |
| San Jose, CA | 983489 | 1777 |
Note: in some cases you might need an extra backslash:
grep -i "\\bca\\b" cities.md
.
We can reverse that output and look for strings within other words. Here is an example of searching for the string ca within words:
Command:
grep -i "\Bca\B" cities.md
Output:
| Chicago, IL | 2746388 | 1780 |
Bracket Expressions and Character Classes
In conjunction with the grep
command, we can also use regular expressions to search for more general patterns in text files.
For example, we can use bracket expressions and character classes to search for patterns in the text.
Here again using man grep
is very important because it includes instructions on how to use these regular expressions.
Bracket expressions
From man grep
on bracket expressions:
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list. If the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.
The regular expression [^0123456789] matches the inverse.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters.
To see how this works, let's search the cities.md file for letters matching A, B, or C. Specifically, in the following command I use a hyphen to match any characters in the range A, B, C. The output does not include the cities Houston or Dallas since neither of those lines contain capital A, B, or C characters:
Command:
grep "[A-C]" cities.md
Output:
| City | 2020 Census | Founded |
| New York City, NY | 8804190 | 1624 |
| Los Angeles, CA | 3898747 | 1781 |
| Chicago, IL | 2746388 | 1780 |
| Phoenix, AZ | 1624569 | 1881 |
| Philadelphia, PA | 1576251 | 1701 |
| San Antonio, TX | 1451853 | 1718 |
| San Diego, CA | 1381611 | 1769 |
| San Jose, CA | 983489 | 1777 |
Note: Use
grep -i "[A-C]" cities.md
for a case insensitive search.
Bracket expressions, inverse searches
When placed after the first bracket, the carat key acts as a Boolean NOT. The following command matches any characters not in the range A,B,C:
Command:
grep "[^A-C]" cities.md
The output matches all lines since there are no instances of A, B, and C in all lines:
Output:
| City | 2020 Census | Founded |
|-------------------|-------------|---------|
| New York City, NY | 8804190 | 1624 |
| Los Angeles, CA | 3898747 | 1781 |
| Chicago, IL | 2746388 | 1780 |
| Houston, TX | 2304580 | 1837 |
| Phoenix, AZ | 1624569 | 1881 |
| Philadelphia, PA | 1576251 | 1701 |
| San Antonio, TX | 1451853 | 1718 |
| San Diego, CA | 1381611 | 1769 |
| Dallas, TX | 1288457 | 1856 |
| San Jose, CA | 983489 | 1777 |
Process substitution
Process substitution allows you to use the output of a command as if it were a file. This is particularly useful when you want to compare the outputs of two commands directly, without having to save them to temporary files.
For example, we can confirm that output from one command does not include Houston or Dallas in a second command by comparing the outputs. Specifically, we compare the outputs of two or more commands using process substitution. This works because the process substitution creates temporary files from the outputs.
Command:
diff <(grep "[A-C]" cities.md) <(grep "[^A-C]" cities.md)
Output:
1a2
> |-----------------|-------------|------|
4a6
> | Houston, TX | 2304580 | 1837 |
8a11
> Dallas, TX | 1288457 | 1856
How It Works
<(command)
creates a temporary file (or file-like stream) that holds the output of command.diff
can then read from these streams as if they were regular files, comparing their contents without needing you to manually save and load files. The output of thediff
command is nicely explained in this Stack Overflow answer.
Without process substitution, you would need to save the outputs of both grep commands to temporary files and then compare them:
grep "[A-C]" cities.md > output1.txt
grep "[^A-C]" cities.md > output2.txt
diff output1.txt output2.txt
This alternative works but is more cumbersome, as it requires managing temporary files. Process substitution simplifies the process by handling this behind the scenes.
Try this command for an alternate output:
diff -y <(grep "[A-C]" cities.md) <(grep "[^A-C]" cities.md)
Our ranges may be alphabetical or numerical. The following command matches any numbers in the range 1,2,3:
Command:
grep "[1-3]" cities.md
Since all single digits appear in the file, the above command returns all lines. To invert the search, we can use the following grep command. This will match all non-integers:
Command:
grep "[^0-9]" cities.md
Bracket expressions, carat preceding the bracket
We saw in a previous section that the carat ^
key indicates the start of line.
However, we learned above that it can be used to return the inverse of a string in special circumstances.
To use the carat to signify the start of a line, the carat key must precede the opening bracket.
For example, the following command matches any lines that start with the upper case letters within the range of N,O,P:
Command:
grep "^| [N-P]" cities.md
Output:
New York City, NY | 8804190 | 1624
Phoenix, AZ | 1624569 | 1881
Philadelphia, PA | 1576251 | 1701
And we can reverse that with the following command, which returns all lines that do not start with N,O, or P:
Command:
grep "^| [^N-P]" cities.md
Output:
| City | 2020 Census | Founded |
| Los Angeles, CA | 3898747 | 1781 |
| Chicago, IL | 2746388 | 1780 |
| Houston, TX | 2304580 | 1837 |
| San Antonio, TX | 1451853 | 1718 |
| San Diego, CA | 1381611 | 1769 |
| Dallas, TX | 1288457 | 1856 |
| San Jose, CA | 983489 | 1777 |
Character classes
Character classes are special types of predefined bracket expressions.
They make it easy to search for general patterns.
From man grep
on character classes:
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means the character class of numbers and letters ...
Below I use the awk
command to select the fourth column (or field) using the pipe as the field delimiter.
I pipe the output to grep
to select lines containing a vertical bar and four digit numbers [[:digit:]]{4}
from the results of the awk
command:
Command:
awk -F"|" '{ print $4 }' cities.md | grep -Eo "[[:digit:]]{4}"
Output:
1624
1781
1780
1837
1881
1701
1718
1769
1856
1777
I first tested that the
awk
command selects the appropriate field by running it by itself:awk -F"|" '{ print $4 }' cities.md
.
Anchoring
As seen above, outside of bracket expressions and character classes, we use the caret ^
to mark the beginning of a line.
We can also use the $
to match the end of a line.
Using either (or both) is called anchoring.
Anchoring works in many places.
For example, to search all lines that start with capital D through L
Command:
grep "^| [D-L]" cities.md
Output:
| Los Angeles, CA | 3898747 | 1781 |
| Houston, TX | 2304580 | 1837 |
| Dallas, TX | 1288457 | 1856 |
To show how to anchor the end of a line, let's look at the operating-systems.csv file.
Command:
grep "1993$" operating-systems.csv
Output:
FreeBSD, BSD, 1993
Windows NT, Proprietary, 1993
We can use both anchors in our grep
commands.
The following searches for any lines starting with capital letters that range from C through F.
Then any lines ending with the numbers starting from 3 through 6.
The single dot stands for any character, and the asterisk stands for "the preceding character will zero or more times" (man grep
).
Command:
grep "^[C-F].*[3-6]$" operating-systems.csv
Output:
CP/M, Proprietary, 1974
FreeBSD, BSD, 1993
Repetition
If we want to use regular expressions to identify repetitive patterns, then we can use repetition operators.
As we saw above, the most useful one is the *
asterisk.
But there are other options:
In come cases, we need to add the -E option
to extend grep
's regular expression functionality:
Here, the preceding item S is matched one or more times:
Command:
grep -E "S+" cities.md
Output:
| San Antonio, TX | 1451853 | 1718 |
| San Diego, CA | 1381611 | 1769 |
| San Jose, CA | 983489 | 1777 |
In the next search, the preceding item l is matched exactly 2 times:
Command:
grep -E "l{2}" cities.md
Output:
| Dallas, TX | 1288457 | 1856 |
Finally, in this example, the preceding item 7 is matched at least two times or at most three times:
Command:
grep -E "7{2,3}" cities.md
Output:
| San Jose, CA | 983489 | 1777 |
OR searches
We can use the vertical bar |
to do a Boolean OR search.
In a Boolean OR statement, the statement is True if either one part is true, the other part is true, or both are true.
In a search statement, this means that at least one part of the search is true.
The following will return lines for each city because they both appear in the file:
Command:
grep -E "San Antonio|Dallas" cities.md
Output:
| San Antonio, TX | 1451853 | 1718 |
| Dallas, TX | 1288457 | 1856 |
The following will match San Antonio even though Lexington does not appear in the file:
Command:
grep -E "San Antonio|Lexington" cities.md
Output:
| San Antonio, TX | 1451853 | 1718 |
Conclusion
We covered a lot in this section on grep
and regular expressions.
We specifically covered:
- multiword strings
- whole word searches and case sensitivity
- bracket expressions and character classes
- anchoring
- repetition
- Boolean OR searches
Even though we focused on grep
, many these regular expressions work across many programming languages.
See Regular-Expression.info for more in-depth lessons on regular expressions.