# Text Processing: Part 1
In this section, we will cover:

- Text processing tools are fundamental: Learning to process and manipulate text is a crucial skill for systems administrators, programmers, and data analysts. Linux provides a variety of tools to examine, manipulate, and analyze text.
- Plain text is foundational: Programs and data are often stored in plain text, making it essential to know how to handle text files effectively.
- Essential text processing commands: Commands such as `cat`, `cut`, `head`, `tail`, `sort`, `uniq`, and `wc` allow users to view, manipulate, and analyze text files, even large datasets, with ease.
- Power of pipes and redirection: Using pipes (`|`) and redirection (`>`, `>>`), you can chain commands together to create more complex workflows for processing text files.
- CSV data manipulation: This lecture shows how to work with CSV (comma-separated value) files, demonstrating how to view, sort, and filter data with tools like `cut` and `uniq`.
- Practical applications for systems administration: The lecture emphasizes that text processing skills are directly applicable to managing user accounts, security, system configuration, and more in a systems administration context.
## Getting Started
One of the more important sets of tools that Linux (as well as Unix-like) operating systems provide are those that aid in processing and manipulating text. The ability to process and manipulate text programmatically is a basic and essential part of many programming languages (e.g., Python, JavaScript, etc.), and learning how to process and manipulate text is an important skill for a variety of jobs, including statistics, data analytics, data science, programming, web programming, systems administration, and so forth. In other words, this functionality of Linux (and Unix-like) operating systems means that learning Linux and the tools it provides is akin to learning how to program.
Plain text files are the basic building blocks of programs and data. Programs are written in plain text editors (Vim, NeoVim, VS Code, etc.), and data is often stored as plain text. Linux offers many tools to examine, manipulate, process, analyze, and visualize data in plain text files.
In this section, we will learn some of the basic tools to examine plain text (i.e., data). We will do some programming later in this class, but for us, the main objective with learning to program aligns with our work as systems administrators. That means our text processing and programming goals will serve our interests in managing users, security, networking, system configuration, and so forth as Linux system administrators.
In the meantime, the goal of this section is to acquaint ourselves with some of the tools that can be used to process text. In this section, we will only cover a handful of text processing programs or utilities, but here is a fairly comprehensive list, and we'll examine some additional ones from this list later in the semester:
- `cat`: concatenate files and print on the standard output
- `cut`: remove sections from each line of files
- `diff`: compare files line by line
- `echo`: display a line of text
- `expand`: convert tabs to spaces
- `find`: search for files in a directory hierarchy
- `fmt`: simple optimal text formatter
- `fold`: wrap each input line to fit in specified width
- `grep`: print lines that match patterns
- `head`: output the first part of files
- `join`: join lines of two files on a common field
- `look`: display lines beginning with a given string
- `nl`: number lines of files
- `paste`: merge lines of files
- `printf`: format and print data
- `shuf`: generate random permutations
- `sort`: sort lines of text files
- `tail`: output the last part of files
- `tr`: translate or delete characters
- `unexpand`: convert spaces to tabs
- `uniq`: report or omit repeated lines
- `wc`: print newline, word, and byte counts for each file
We will also discuss two types of operators, the pipe and the redirect. The latter has a version that will write over the contents of a file, and a version that will append contents to the end of a file:
- `|`: redirect standard output from command1 to standard input of command2
- `>`: redirect standard output to a file, overwriting
- `>>`: redirect standard output to a file, appending
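To make the difference between the two redirect operators concrete, here is a minimal sketch (the file name `notes.txt` is just a placeholder for this example):

```
echo "first line" > notes.txt     # > creates notes.txt, or overwrites it if it exists
echo "second line" >> notes.txt   # >> appends to the end of notes.txt
cat notes.txt                     # prints both lines
```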
Today I want to cover a few of the above commands for processing data in a file; specifically:
- `cat`: concatenate files and print on the standard output
- `cut`: remove sections from each line of files
- `head`: output the first part of files
- `sort`: sort lines of text files
- `tail`: output the last part of files
- `uniq`: report or omit repeated lines
- `wc`: print newline, word, and byte counts for each file
Let's look at a toy sample file that contains structured data as a CSV (comma-separated value) file.
You can download the file to your `gcloud` virtual machine using the following command:

```
wget https://raw.githubusercontent.com/cseanburns/linux_sysadmin/master/data/operating-systems.csv
```
The file contains a list of operating systems (column one), their software license (column two), and the year the OSes were released (column three).
We can use the `cat` command to view the entire contents of this small file:

Command:

```
cat operating-systems.csv
```

Output:

```
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
```
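As an aside, `cat` takes its name from "concatenate": given more than one file argument, it prints them one after the other. For instance, passing our file twice prints its contents twice:

```
cat operating-systems.csv operating-systems.csv
```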
It's a small file, but we might want the line and word count of the file.
To acquire that, we can use the `wc` (word count) command.
By itself, the `wc` command will print the number of lines, words, and bytes of a file.
The following output states that the file contains seven lines, 23 words, and 165 bytes:

Command:

```
wc operating-systems.csv
```

Output:

```
7 23 165 operating-systems.csv
```
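If we only need one of those counts, `wc` provides options to print each individually; for example:

```
wc -l operating-systems.csv   # number of lines only
wc -w operating-systems.csv   # number of words only
wc -c operating-systems.csv   # number of bytes only
```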
We can use the `head` command to output the first ten lines of a file.
Since our file is only seven lines long,
we can use the `-n` option to change the default number of lines.
In the following example, I print the first three lines of the file:

Command:

```
head -n3 operating-systems.csv
```

Output:

```
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
```
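The `tail` command, which we'll use later in this lesson, is the mirror image of `head`; the same option prints the last three lines of the file instead:

```
tail -n3 operating-systems.csv
```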
Using the `cut` command, we can select data from the file.
In the first example, I want to select column two (or field two), which contains the license information.
Since this is a CSV file, the fields (aka, columns) are separated by commas.
Therefore I use the `-d` option to instruct the `cut` command to use the comma as the delimiter.
The `-f` option tells the `cut` command to select field two.
Note that a file like this may use other characters as the delimiter, like the Tab character or a colon.
In such cases, it may still be called a CSV file, but you might also see `.dat` files for data files or other variations.
Command:

```
cut -d"," -f2 operating-systems.csv
```

Output:

```
Proprietary
BSD
GPL
Proprietary
Proprietary
Proprietary
Apache
```
From there it's trivial to select a different column. In the next example, I select column three to get the release year:
Command:

```
cut -d"," -f3 operating-systems.csv
```

Output:

```
2009
1993
1991
2007
2001
1993
2008
```
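The `-f` option also accepts comma-separated lists and ranges of fields. For example, to select both the OS name and its release year:

```
cut -d"," -f1,3 operating-systems.csv
```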
A genius aspect of the Linux (and Unix) command line is the ability to pipe and redirect output from one program to another. Output can be further redirected to a file. By stringing together multiple programs in this way, we can create small programs that do much more than the simple programs that compose them.
For example, in the following command, I use the pipe operator to send the output of the `cut` command to the `sort` command.
This sorts the data in alphabetical or numerical order, depending on the character type (lexical or numerical).
I then pipe that output to the `uniq` command, which removes adjacent duplicate rows (this is why we `sort` first).
Finally, I redirect that final output to a new file titled `os-years.csv`.
Since the year 1993 appears twice in the original file, it appears only once in the output because the `uniq` command removed the duplicate:

Command:

```
cut -d"," -f3 operating-systems.csv | sort | uniq > os-years.csv
```
Output:

```
cat os-years.csv
1991
1993
2001
2007
2008
2009
```
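A handy variation on this pipeline: `uniq -c` prefixes each line with a count of its occurrences, and a final `sort -nr` ranks those counts from most to least frequent. For example, to tally how many operating systems in our file carry each license:

```
cut -d"," -f2 operating-systems.csv | sort | uniq -c | sort -nr
```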
Data files like this often have a header line at the top row that names the data columns.
It's useful to know how to work with such files, so let's add a header row to the top of the file.
In this example, I'll use the `sed` command, which we will learn more about in the next lesson.
For now, we use `sed` with the option `-i` to edit the file in place.
Then `1i` instructs `sed` to insert text at line 1, and `OS, License, Year` is the text that we want inserted there (the leading backslash is part of `sed`'s insert syntax).
We wrap the argument within single quotes:

Command:

```
sed -i '1i \OS, License, Year' operating-systems.csv
cat operating-systems.csv
```
Output:

```
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
```
I added the header row just to demonstrate how to remove it when processing files with header rows. Say we want the license field data, but we need to remove that first line. In this case, we can use the `tail` command:
Command:

```
tail -n +2 operating-systems.csv | cut -d"," -f2 | sort | uniq > license-data.csv
cat license-data.csv
```

Output:

```
Apache
BSD
GPL
Proprietary
```
The `tail` command generally outputs the last lines of a file, but the `-n +2` option is special.
It makes the `tail` command output a file starting at the second line.
We could specify a different number in order to start output at a different line.
See `man tail` for more information.
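As a final sketch tying these pieces together, we can skip the header and sort the remaining rows numerically by release year; here `-t","` tells `sort` to split fields on commas, `-k3` selects the third field as the sort key, and `-n` requests numeric order:

```
tail -n +2 operating-systems.csv | sort -t"," -k3 -n
```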
## Conclusion
In this lesson, we learned how to process and make sense of data held in a text file. We used some commands that let us select, sort, de-duplicate, redirect, and view data in different ways. Our data file was a small one, but these are powerful and useful commands and operators that can make sense of large data files.
The commands we used in this lesson include:
- `cat`: concatenate files and print on the standard output
- `cut`: remove sections from each line of files
- `head`: output the first part of files
- `sort`: sort lines of text files
- `tail`: output the last part of files
- `uniq`: report or omit repeated lines
- `wc`: print newline, word, and byte counts for each file
We also used two types of operators, the pipe and the redirect:
- `|`: redirect standard output from command1 to standard input of command2
- `>`: redirect standard output to a file, overwriting