Searching with grep

We have available some powerful utilities and programs to process, manipulate, and analyze text files. In this section, we will focus on the grep utility, which offers some advanced methods for searching the contents of text files.

Grep

The grep command is one of my most often used commands. Basically, grep "prints lines that match patterns" (see man grep). In other words, it's search, and it's super powerful.

grep works line by line. So when we use it to search a file for a string of text, it returns the whole line that matches the string. This line-by-line idea is part of the history of Unix-like operating systems, and it's important to remember that most utilities and programs that we use on the command line are line oriented.

"A string is any series of characters that are interpreted literally by a script. For example, 'hello world' and 'LKJH019283' are both examples of strings." -- Computer Hope. More generally, it's the literal characters that we type. It's data.

To visualize how grep works, let's consider a file called operating-systems.csv with content as seen below:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
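
To follow along, the file first has to exist. One way to create it, sketched here on the assumption that we are working in a directory where we can write files, is a here-document:

```shell
# Create operating-systems.csv in the current directory using a here-document.
# The quoted 'EOF' keeps the shell from expanding anything inside the text.
cat > operating-systems.csv <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Confirm the number of lines in the new file.
wc -l < operating-systems.csv
```

The file has seven lines: one header line and six data lines.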

We can use grep to search for anything in that file. Let's start with a search for the string Chrome. Notice that even though the string Chrome only appears once, and in one part of a line, grep returns the entire line.

Command:

grep "Chrome" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009

Be aware that, by default, grep is case-sensitive, which means a search for the string chrome, with a lower case c, would return no results. Fortunately, grep has an -i option, which means to ignore the case of the search string. In the following examples, grep returns nothing in the first search since we do not capitalize the string chrome. However, adding the -i option results in success:

Command:

grep "chrome" operating-systems.csv

Output:

None.

Command:

grep -i "chrome" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009

We can also search for lines that do not match our string using the -v option. We can combine that with the -i option to ignore the string's case. Therefore, in the following example, all lines that do not contain the string chrome are returned:

Command:

grep -vi "chrome" operating-systems.csv

Output:

OS, License, Year
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Sometimes data files, like spreadsheets, contain a header row as the first line. We can use grep to remove that line by inverting our search and selecting all lines that do not match "OS" at the start of a line. Here the caret ^ is a regex anchor that indicates the start of a line. Again, this grep command returns all lines that do not match the string os at the start of a line, ignoring case:

Command:

grep -vi "^os" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Alternatively, since we know that the string Year comes at the end of the first line, we can use grep to invert search for that. Here the dollar sign $ is a regex anchor that indicates the end of a line. Like the above, this grep command returns all lines that do not match the string year at the end of a line, ignoring case. The result, in this specific instance, is exactly the same as the last command:

Command:

grep -vi "year$" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
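
One caveat with removing a header by pattern is that the pattern might also match data lines. The sketch below uses a small made-up data set (the OS/2 row is not in our sample file and is there only for illustration) and compares the inverted grep with tail -n +2, a different tool that removes the first line by position rather than by content:

```shell
# Hypothetical data: the sample header plus a data row that also starts with "OS".
data='OS, License, Year
OS/2, Proprietary, 1987
Linux, GPL, 1991'

# The inverted grep drops every line that starts with "os", header or not.
by_pattern="$(printf '%s\n' "$data" | grep -vi "^os")"

# tail -n +2 starts its output at line 2, so it drops only the first line.
by_position="$(printf '%s\n' "$data" | tail -n +2)"

printf 'grep:\n%s\n\ntail:\n%s\n' "$by_pattern" "$by_position"
```

With this data, the grep version loses the OS/2 row along with the header, while the tail version keeps it.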

The man grep page lists other options, but a couple of other good ones include:

Get a count of the matching lines with the -c option:

Command:

grep -ic "proprietary" operating-systems.csv

Output:

3

Print only the match and not the whole line with the -o option:

Command:

grep -io "proprietary" operating-systems.csv

Output:

Proprietary
Proprietary
Proprietary
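
It is worth noting that -c counts matching lines, not individual matches. A minimal sketch of the difference, using the FreeBSD line from our file because it contains the string BSD twice:

```shell
# One line from the sample file; it contains the string BSD twice.
line='FreeBSD, BSD, 1993'

# -c counts matching lines, so a line with two matches still counts once.
lines="$(printf '%s\n' "$line" | grep -c "BSD")"

# -o prints each match on its own line, so wc -l counts occurrences.
# tr strips the padding that some versions of wc add.
hits="$(printf '%s\n' "$line" | grep -o "BSD" | wc -l | tr -d ' ')"

echo "matching lines: $lines, occurrences: $hits"
```

So when we need the number of occurrences rather than the number of lines, piping grep -o through wc -l is one way to get it.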

We can simulate a Boolean OR search, and print lines matching one or both strings, using the -E option. We separate the strings with a vertical bar |. This is similar to a Boolean OR search: if at least one of the strings matches a line, that line is returned.

Here is an example where only one string returns a true value:

Command:

grep -Ei "(bsd|atari)" operating-systems.csv

Output:

FreeBSD, BSD, 1993

Here's an example where both strings evaluate to true:

Command:

grep -Ei "(bsd|gpl)" operating-systems.csv

Output:

FreeBSD, BSD, 1993
Linux, GPL, 1991
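
The same OR search can also be written without the vertical bar by giving grep one -e option per string. The sketch below recreates our sample file in a temporary location (the temporary path is only there to keep the example self-contained) and runs both forms:

```shell
# Recreate the sample file in a temporary location (the path is arbitrary).
csv="$(mktemp)"
cat > "$csv" <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Alternation inside one extended regular expression.
with_alternation="$(grep -Ei "(bsd|gpl)" "$csv")"

# One -e option per pattern; a line is printed if any pattern matches.
with_e_options="$(grep -i -e "bsd" -e "gpl" "$csv")"

printf '%s\n' "$with_e_options"
rm -f "$csv"
```

Both commands select the FreeBSD and Linux lines. Which form to use is mostly a matter of taste.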

By default, grep will return results where the string appears within a larger word, like OS in macOS.

Command:

grep -i "os" operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
macOS, Proprietary, 2001

However, we might want to limit results so that we only return results where OS is a complete word. To do that, we can surround the string with the word-boundary markers \< and \>:

Command:

grep -i "\<os\>" operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009

Sometimes I find it hard to remember the backslash and angle bracket combinations because they look a lot like HTML syntax but are not exactly like it. Fortunately, grep has a -w option to match whole words:

Command:

grep -wi "os" operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009

Sometimes we want the context for a result; that is, we might want to print lines that surround our matches. For example, print the matching line plus the two lines after the matching line using the -A NUM option:

Command:

grep -i "linux" -A2 operating-systems.csv

Output:

Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993

Or, print the matching line plus the two lines before the matching line using the -B NUM option:

Command:

grep -i "linux" -B2 operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991

We can combine many of the variations. Here I search for the whole word BSD, case insensitive, and print the line before and the line after the match:

Command:

grep -iw -C1 "bsd" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
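
When a context search has several hits that are not next to each other, grep separates each group of lines with a line containing two hyphens. Here is a sketch of that behavior, assuming GNU grep and a temporary copy of our sample file:

```shell
# Recreate the sample file in a temporary location (the path is arbitrary).
csv="$(mktemp)"
cat > "$csv" <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Two hits (lines 2 and 5), each followed by one line of context.
# The groups do not touch, so grep prints a separator line between them.
out="$(grep -Ei -A1 "(chrome|macos)" "$csv")"

printf '%s\n' "$out"
rm -f "$csv"
```

The output has five lines: the Chrome OS and FreeBSD lines, the separator, and then the macOS and Windows NT lines.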

We can use the -m NUM option to stop returning results after some number of hits. Here I use grep to search for the string "proprietary" and stop after the first hit:

Command:

grep -i -m1 "proprietary" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009

We can add the -n option to instruct grep to print the line number of each hit. Below we see that the string "proprietary" is found on lines 2, 5, and 6.

Command:

grep -in "proprietary" operating-systems.csv

Output:

2:Chrome OS, Proprietary, 2009
5:macOS, Proprietary, 2001
6:Windows NT, Proprietary, 1993
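
Because -n puts the line number before the first colon, we can pipe the result through cut if we only want the numbers. A small sketch, again using a temporary copy of our sample file:

```shell
# Recreate the sample file in a temporary location (the path is arbitrary).
csv="$(mktemp)"
cat > "$csv" <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# grep -n prefixes each hit with "NUMBER:"; cut keeps the first
# colon-delimited field, which is the line number.
numbers="$(grep -in "proprietary" "$csv" | cut -d: -f1)"

printf '%s\n' "$numbers"
rm -f "$csv"
```

This prints 2, 5, and 6, each on its own line.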

We can use grep to search for patterns in strings instead of literal words. Here we use what are called character classes and repetition to search for five-letter words:

Command:

grep -Eiw "[a-z]{5}" operating-systems.csv

Output:

Linux, GPL, 1991
macOS, Proprietary, 2001

Or four-digit numbers:

Command:

grep -Eiw "[0-9]{4}" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

grep can also search for words that begin with some letter and end with some letter and with a specified number of characters between. Here we search for words that start with m, end with s, and have any three characters in the middle (the dot matches any character, not only letters):

Command:

grep -Eiw "m.{3}s" operating-systems.csv

Output:

macOS, Proprietary, 2001
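
If we want the middle of the word to be letters only, we can swap the dot for a character class. The sketch below compares the two patterns on two test strings: macOS from our file, and m123s, which is made up for this example:

```shell
# Two test lines: macOS from the sample file, and m123s, a made-up string.
words='macOS
m123s'

# The dot matches any character, so both lines match.
any_chars="$(printf '%s\n' "$words" | grep -Eiw "m.{3}s")"

# [a-z] with -i matches letters only, so m123s is excluded.
letters_only="$(printf '%s\n' "$words" | grep -Eiw "m[a-z]{3}s")"

printf 'dot:\n%s\n\nclass:\n%s\n' "$any_chars" "$letters_only"
```

The dot version returns both strings, and the character class version returns only macOS.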

Practice

Let's practice by looking at the auth.log file. This file records all attempts to log in to the system:

First, we change directory to /var/log.

Second, we use less to peruse the auth.log file.

Third, we search for sessions opened for the users sean or root and page through the results with less.

Fourth, we do a simple grep search for the string invalid user and pipe that through another grep command that extracts IP addresses, and then we count and sort the addresses.

Fifth, we do another simple search for a longer string and pipe that through other commands to count and sort the data, first in ascending and then in descending order.

cd /var/log
less auth.log
grep -E "session opened for user (sean|root)" auth.log | less
grep "invalid user" auth.log | grep -Eo "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" | sort | uniq -c | sort
grep "Connection closed by invalid user" auth.log | cut -d" " -f11 | sort | uniq -c | sort | less
grep "Connection closed by invalid user" auth.log | cut -d" " -f11 | sort | uniq -c | sort -r | less
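
If there is no auth.log to practice on, the IP address pipeline can be tried on a small stand-in file. The log lines below are invented for illustration: the host name, user names, and addresses are hypothetical (the addresses come from ranges reserved for documentation), and real log lines may be formatted differently on your system:

```shell
# Hypothetical log lines written to a temporary file (the path is arbitrary).
log="$(mktemp)"
cat > "$log" <<'EOF'
Mar 10 08:15:01 host sshd[1001]: Connection closed by invalid user admin 192.0.2.10 port 40022 [preauth]
Mar 10 08:15:09 host sshd[1002]: Connection closed by invalid user test 203.0.113.7 port 40100 [preauth]
Mar 10 08:16:44 host sshd[1003]: Connection closed by invalid user oracle 192.0.2.10 port 40188 [preauth]
Mar 10 08:17:02 host sshd[1004]: Accepted password for sean from 198.51.100.4 port 50122 ssh2
EOF

# Keep lines that mention an invalid user, extract anything shaped like an
# IPv4 address, then count how often each address appears.
counts="$(grep "invalid user" "$log" \
  | grep -Eo "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" \
  | sort | uniq -c | sort)"

printf '%s\n' "$counts"
rm -f "$log"
```

The first sort groups identical addresses so that uniq -c can count them, and the last sort orders the result by count. The address from the accepted login does not appear because its line does not contain the string invalid user.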

Conclusion

grep is very powerful, and there are more options listed in its man page.

Note that I enclose my search strings in double quotes. For example: grep "search string" filename.txt. It's not always required to enclose a search string in double quotes, but it's good practice. If an unquoted string contains spaces, the shell splits it into separate arguments, grep treats every word after the first as a file name, and the search will fail.
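
Here is a sketch of what happens with and without the quotes. It assumes GNU grep and works in an empty temporary directory with a two-line copy of our data, so that no file named OS exists:

```shell
# Work in an empty temporary directory so that no file named OS exists.
dir="$(mktemp -d)"
cd "$dir" || exit 1
printf '%s\n' 'Chrome OS, Proprietary, 2009' 'macOS, Proprietary, 2001' \
  > operating-systems.csv

# Quoted: grep receives one pattern, the two-word string "Chrome OS".
quoted="$(grep "Chrome OS" operating-systems.csv)"

# Unquoted: grep receives the pattern Chrome and two file names, OS and
# operating-systems.csv. The missing file OS causes an error.
status=0
unquoted="$(grep Chrome OS operating-systems.csv 2>/dev/null)" || status=$?

echo "quoted:   $quoted"
echo "unquoted: $unquoted (exit status $status)"
```

In the unquoted case, grep still finds Chrome in operating-systems.csv, but it prefixes the hit with the file name because it was given more than one file, and it exits with an error status because it could not open OS.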

The Linux (and other Unix-like OSes) command line offers a lot of utilities to examine data. It's fun to learn and practice these. Despite this, you do not have to become an advanced grep user. For most cases, simple grep searches work well.

If you want to learn more, there are many grep tutorials on the web.