UNIX Unleashed

Chapter 6

Popular File Tools

By Pete Holsberg

IN THIS CHAPTER

Determining the Nature of a File's Contents with file
Browsing Through Text Files with more (page) and pg
Searching for Strings with the grep Family
Sorting Text Files
Compressing Files—compress, uncompress, and zcat
Printing with pr
Printing Hard Copy Output
Comparing Directories with dircmp
Encrypting a File with the crypt Command
Printing the Beginning or End of a File with head and tail
Pipe Fitting with tee
Updating a File's Time and Date with touch
Splitting Files with split and csplit
Comparing Files with cmp and diff

Files are the heart of UNIX. Unlike most other operating systems, UNIX was designed with a simple, yet highly sophisticated, view of files: Everything is a file. Information stored in an area of a disk or memory is a file; a directory is a file; the keyboard is a file; the screen is a file. This single-minded view makes it easy to write tools that manipulate files, because files have no structure—UNIX sees every file merely as a simple stream of bytes. This makes life much simpler for both the UNIX programmer and the UNIX user. The user benefits from being able to send the contents of a file to a command without having to go through a complex process of opening the file. In a similar way, the user can capture the output of a command in a file without having previously created that file. And perhaps most importantly, the user can send the output of one command directly to the input of another, using memory as a temporary storage device or file. Finally, users benefit from UNIX's unstructured files because they are simply easier to use than files that must conform to one of several highly structured formats.
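The three conveniences just described (feeding a file to a command, capturing output in a file that did not previously exist, and connecting one command to another) can be tried with a short sketch like the following. The scratch directory and the file names fruit and sorted are made up for illustration.

```shell
# Hypothetical scratch directory and file names, used only for this sketch.
dir=$(mktemp -d)

printf 'pear\napple\nfig\n' > "$dir/fruit"       # capture output in a brand-new file
sort < "$dir/fruit" > "$dir/sorted"              # send a file to a command, save the result
count=$(sort "$dir/fruit" | wc -l | tr -d ' ')   # pipe one command into another
first=$(head -n 1 "$dir/sorted")

echo "$count lines; the first sorted line is $first"
rm -r "$dir"
```

None of the files had to be created or opened in a separate step; the shell did that work as part of each command line.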

Determining the Nature of a File's Contents with file

A user—especially a power user—must take a closer look at a file before manipulating it. If you've ever sent a binary file to a printer, you're aware of the mess that can result. Murphy's Law assures that every binary file includes a string of bytes that does one or more of the following:

Spew a ream of paper through the printer before you can shut it off, printing just enough on each page to render the paper fit only for the recycling bin
Put the printer into a print mode that prints all characters at 1/10 their intended size
Lock your keyboard
Dump core—that is, create a file consisting of whatever was in memory at that instant of time!

In a similar way, sending a binary file to the screen can lock the keyboard, put the screen in a mode that changes the displayed character set to one that is clearly not English, dump core, and so on.

While it's true that many files already stored on the system—and certainly every file you create with a text editor (see Chapter 7)—are text files, many are not. UNIX provides a command, file, that attempts to determine the nature of the contents of files when you supply their file names as arguments. You can invoke the file command in one of two ways:

file [-h] [-m mfile] [-f ffile] arg(s)
file [-h] [-m mfile] -f ffile

The file command performs a series of tests on each file in the list of arg(s) or on the list of files whose names are contained in the file ffile. If the file being tested is a text file, file examines the first 512 bytes and tries to determine the language in which it is written. The wording of the identification comes from a file called /etc/magic. If you don't like what's in that file, you can use the -m mfile option, replacing mfile with the name of the "magic file" you'd like to use. (Consult your local magician for suitable spells and potions!) Here are the kinds of text files that UnixWare Version 1.0's file command can identify:

Empty files
SCCS files
troff (typesetter runoff) output files
Data files
C program text files (with or without garbage)
FORTRAN program text files (with or without garbage)
Assembler program text files (with or without garbage)
[nt]roff, tbl, or eqn input text (with or without garbage)
Command text files (with or without garbage)
English text files (with or without garbage)
ASCII text files (with or without garbage)
PostScript program text files (with or without garbage)

Don't be concerned if you're not familiar with some of these kinds of text. Many of them are peculiar to UNIX and are explained in later chapters. If the file is not text, file looks near the beginning of the file for a magic number: a number or string that is associated with a file type, that is, an arbitrary value that is coupled with a descriptive phrase. Then file uses /etc/magic, which provides a database of magic numbers and kinds of files, or the file specified as mfile to determine the file's contents. If the file being tested is a symbolic link, file follows the link and tries to determine the nature of the contents of the file to which it is linked. The -h option causes file not to follow symbolic links.

The /etc/magic file contains the table of magic numbers and their meanings. For example, here is an excerpt from UnixWare Version 1.0's /etc/magic file. The number following uxcore: is the magic number, and the phrase that follows is the file type. The other columns tell file how and where to look for the magic number:

>16  short    2        uxcore:231     executable
0    string            uxcore:648     expanded ASCII cpio archive
0    string            uxcore:650     ASCII cpio archive
>1   byte     0235     uxcore:571     compressed data
0    string            uxcore:248     current ar archive
0    short    0432     uxcore:256     Compiled Terminfo Entry
0    short    0434     uxcore:257     Curses screen image
0    short    0570     uxcore:259     vax executable
0    short    0510     uxcore:263     x86 executable
0    short    0560     uxcore:267     WE32000 executable
0    string   070701   uxcore:565     DOS executable (EXE)
0    string   070707   uxcore:566     DOS built-in
0    byte     0xe9     uxcore:567     DOS executable (COM)
0    short    0520     uxcore:277     mc68k executable
0    string            uxcore:569     core file (Xenix)
0    byte     0x80     uxcore:280     8086 relocatable (Microsoft)

CAUTION:
Human beings cannot read any of the files listed in this excerpt, so you should not send any of these files to the screen or the printer. The same is true for any of the previously listed text files that have garbage.
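To get a feel for what a magic number is, you can look at the first bytes of a file yourself. The sketch below writes a small made-up file that starts with octal 037 and 0235, the two bytes with which compress output begins, and dumps them with od. The file is only a stand-in, not real compressed data, and the wording that file prints for it will vary from system to system.

```shell
# Build a stand-in file whose first two bytes are octal 037 and 0235.
sample=$(mktemp)
printf '\037\235not really compressed' > "$sample"

# Dump the first two bytes in octal, with spaces and the newline removed.
magic=$(od -An -to1 -N2 "$sample" | tr -d ' \n')
echo "leading bytes (octal): $magic"

# If file is installed, ask for its opinion; no particular wording is assumed.
if command -v file >/dev/null 2>&1; then
    file "$sample"
fi
rm -f "$sample"
```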

Browsing Through Text Files with more (page) and pg

After you identify a file as being a text file that humans can read, you may want to read it. The cat command streams the contents of a file to the screen, but you must be quick with the Scroll Lock (or equivalent) key so that the file content does not flash by so quickly that you cannot read it (your speed-reading lessons notwithstanding). UNIX provides a pair of programs that present the contents of a file one screen at a time. The more and page programs are almost identical, and will be discussed as if they were a single program. The only differences are the following:

page clears the screen automatically between pages, but more does not.
more provides a two-line overlap from one screen to the next, while page provides only a one-line overlap.

Both more and page have several commands, many of which take a numerical argument that controls the number of times the command is actually executed. You can issue these commands while using the more or page program (see the syntax below), and none of these commands are echoed to the screen. Table 6.1 lists the major commands.

     more [-cdflrsuw] [-lines] [+linenumber] [+/pattern] [file(s)]
     page [-cdflrsuw] [-lines] [+linenumber] [+/pattern] [file(s)]

Table 6.1. Commands for more(page)

nSpacebar: If no positive number is entered, display the next screenfull. If an n value is entered, display n more lines.
nReturn: If no positive number is entered, display another line. If an n value is entered, display n more lines. (Depending on your keyboard, you can press either the Return or Enter key.)
n^D, nd: If no positive number is entered, scroll down 11 more lines. If an n value is given, scroll the screen down n times.
nz: Same as nSpacebar, except that if an n value is entered, it becomes the new number of lines per screenfull.
n^B, nb: Skip back n screensfull and then print a screenfull.
q, Q: Exit from more or page.
=: Display the number of the current line.
v: Drop into the editor (see Chapter 7) indicated by the EDITOR environment variable (see Chapters 11, 12, 13), at the current line of the current file.
h: Display a Help screen that describes all the more or page commands.
:f: Display the name and current line number of the file that you are viewing.
:q, :Q: Exit from more or page (same as q or Q).
. (dot): Repeat the previous command.

After you type the more and page programs' commands, you need not press the Enter or Return key (except, of course, in the case of the nReturn command). The programs execute the commands immediately after you type them.

You can invoke more(page) with certain options that specify the program's behavior. For example, these programs can display explicit error messages instead of just beeping. Table 6.2 lists the most commonly used options for more and page.

Table 6.2. Options for more and page

-c: Clear before displaying. To display screens more quickly, this option redraws the screen instead of scrolling. You need not use this option with page.
-d: Display an error message instead of beeping if an unrecognized command is typed.
-r: Display each control character as a two-character pattern consisting of a caret followed by the specified character, as in ^C.
-s: Replace any number of consecutive blank lines with a single blank line.
-lines: Make lines the number of lines in a screenfull.
+n: Start at line number n.
+/pattern: Start two lines above the line that contains the regular expression pattern. (Regular expressions are explained in the next section.)

The more(page) program is a legacy from the Berkeley version of UNIX. System V variants give us pg, another screen-at-a-time file viewer. The pg program offers a little more versatility by giving you more control over your movement within a file (you can move both forward and backward) and your search for patterns. The program has its own commands and a set of command-line options. Table 6.3 lists the more frequently used commands. Unlike more and page, the pg program requires that you always press the Return or Enter key to execute its commands.

$ pg [options] file

Table 6.3. Commands for pg

nReturn: If no n value is entered or if a value of +1 is entered, display the next page. If the value of n is -1, display the previous page. If the value of n has no sign, display page number n. For example, a value of 3 causes pg to display page 3. (Depending on your keyboard, you can press either the Return or Enter key.)
nd, ^D: Scroll half a screen. The value n can be positive or negative. So, for example, 2d will scroll one full screen forward, and -3d will scroll one and a half screens back.
nz: Same as nReturn except that if an n value is entered, it becomes the number of lines per screenfull.
., ^L: Redisplay (clear the screen and then display again) the current page of text.
$: Displays the last screenfull in the file.
n/pattern/: Search forward for the nth occurrence of pattern. (The default value for n is 1.) Searching begins immediately after the current page and continues to the end of the current file, without wrap-around.
n?pattern?: Search backward for the nth occurrence of pattern. (The default value for n is 1.) Searching begins immediately before the current page and continues to the beginning of the current file, without wrap-around.
h: Display an abbreviated summary of available commands.
q, Q: Quit pg.
!command: Execute the shell command command as if it were typed on a command line.

Addressing is the ability to specify a number with a sign or a number without a sign. A number with no sign provides absolute addressing; for example, pressing 3 followed by the Return key displays page 3. A number with a sign provides relative addressing; that is, the command moves you to a line relative to the current line.

The pg program has several startup options that modify its behavior. Table 6.4 describes the most frequently used options.

Table 6.4. Some of pg's Startup Options

-number: Change the number of lines per page to the value of number. Otherwise, the number of lines is determined automatically by the terminal. For example, a 24-line terminal automatically uses 23 lines per page.
-c: Clear the screen before displaying a page.
-n: Remove the requirement that you press Return or Enter after you type the command. Note: Some commands will still require that you press Enter or Return.
-p string: Change the prompt from a colon (:) to string. If string contains the two characters %d, they are replaced by the current page number when the prompt appears.
-r: Prevent the use of !command and display an error message if the user attempts to use it.
-s: Print all messages and prompts in standout mode (which is usually inverse video).
+n: Start the display at line number n.
+/pattern/: Start the display at the first line that contains the regular expression pattern. Regular expressions are explained in the next section.

Each of the commands discussed in this section can accept a list of file names on the command line, and display the next file when it reaches the end of the current file.

Searching for Strings with the grep Family

Suppose that you want to know whether a certain person has an account on your system. You can use more, page, or pg to browse through /etc/passwd looking for that person's name, but if your system has many users, that can take a long time. Besides, an easier way is available: grep. It searches one or more files for the pattern of the characters that you specify and displays every line in the file or files that has that pattern in it.

grep stands for global/regular expression/print; that is, search through an entire file (do a global search) for a specified regular expression (the pattern that you specified) and display the line or lines that contain the pattern.
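As a rough illustration of the account search described above, the following sketch runs grep against a small stand-in for /etc/passwd. The file and the three accounts in it are hypothetical; on a real system you would name /etc/passwd itself.

```shell
# A made-up stand-in for /etc/passwd; the accounts are hypothetical.
passwd_copy=$(mktemp)
cat > "$passwd_copy" <<'EOF'
root:x:0:0:Super User:/:/sbin/sh
pat:x:101:10:Pat Jones:/home/pat:/bin/sh
lynn:x:102:10:Lynn Smith:/home/lynn:/bin/ksh
EOF

# Display every line that contains the string Jones.
match=$(grep Jones "$passwd_copy")
echo "$match"
rm -f "$passwd_copy"
```

Only the line for the account whose comment field contains Jones is displayed; the other lines are passed over silently.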

Before you can use grep and the other members of the grep family, you must explore regular expressions, which are what gives the grep commands (and many other UNIX commands) their power. After that, you will learn all of the details of the grep family of commands.

Regular Expressions

A regular expression is a sequence of ordinary characters and special operators. Ordinary characters include the set of all uppercase and lowercase letters, digits, and other commonly used characters: the tilde (~), the back quotation mark (`), the exclamation mark (!), the "at" sign (@), the pound sign (#), the underscore (_), the hyphen (-), the equals sign (=), the colon (:), the semicolon (;), the comma (,), and the slash (/). The special operators are backslash (\), dot (.), asterisk (*), left square bracket ([), caret (^), dollar sign ($), and right square bracket (]). By using regular expressions, you can search for general strings in a file. For example, you can tell grep to show you all lines in a file that contain any of the following: the word Unix, the word UNIX, a pattern consisting of four digits, a ZIP code, a name, nothing, or all the vowels in alphabetic order.

You can also combine two strings into a pattern. For example, to combine a search for Unix and UNIX, you can specify a word that begins with U, followed by n or N, followed by i or I, and ending with x or X.
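The combined pattern can be written with the square-bracket notation that is explained later in this section. The following is only a sketch; the word list is made up for the occasion, and grep's -c option reports the number of matching lines instead of the lines themselves.

```shell
# Hypothetical word list, one word per line.
words=$(mktemp)
printf 'Unix\nUNIX\nunix\nLinux\nUNiX\n' > "$words"

# U, then n or N, then i or I, then x or X.
hits=$(grep -c 'U[nN][iI][xX]' "$words")
echo "$hits"     # Unix, UNIX, and UNiX match
rm -f "$words"
```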

Several UNIX commands use regular expressions to find text in files. Usually you supply a regular expression to a command to tell that command what to search for. Most regular expressions match more than one text string.

There are two kinds of regular expressions: limited and full (sometimes called extended). Limited regular expressions are a subset of full regular expressions, but UNIX commands are inconsistent in the extended operations that they permit. At the end of this discussion, you'll find a table that lists the most common commands in UNIX System V Release 4 that use regular expressions, along with the operations that they can perform.

The simplest form of a regular expression includes only ordinary characters, and is called a string. The grep family (grep, egrep, and fgrep) matches a string wherever it finds the regular expression, even if it's surrounded by other characters. For example, the is a regular expression that matches only the three-letter sequence t, h, and e. This string is found in the words the, therefore, bother, and many others.

Two of the members of the grep family use regular expressions—the third, fgrep, operates only on strings:

grep

The name means to search globally (throughout the entire file) for a regular expression and print the line that contains it. In its simplest form, grep is called as follows:

 grep regular_expression filename

When grep finds a match of regular_expression, it displays the line of the file that contains it and then continues searching for a subsequent match. Thus, grep displays every line of a file that contains a text string that matches the regular expression.

egrep

You call this member exactly the same way as you call grep. However, this member uses an extended set of regular expression operators, which will be explained later, after you master the usual set.

CAUTION:
None of these commands alter the original file. Output goes to stdout (by default, stdout is the screen). To save the results, you must redirect the output to a file.
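A minimal sketch of that advice follows. The file names and contents are invented, and cksum is used here only as a convenient way to show that the source file is the same before and after the search.

```shell
# Hypothetical source and result files.
src=$(mktemp)
out=$(mktemp)
printf 'the first line\na second line\nthe third line\n' > "$src"

before=$(cksum < "$src")      # fingerprint of the source before searching
grep the "$src" > "$out"      # redirect the matching lines into a file
after=$(cksum < "$src")       # fingerprint afterward

saved=$(wc -l < "$out" | tr -d ' ')
echo "lines saved: $saved"
if [ "$before" = "$after" ]; then
    echo "source file unchanged"
fi
rm -f "$src" "$out"
```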

The contents of the following file are used in subsequent sections to demonstrate how you can use the grep family to search for regular expressions:

$ cat REfile

A regular expression is a sequence of characters taken
from the set of uppercase and lowercase letters, digits,
punctuation marks, etc., plus a set of special regular
expression operators. Some of these operators may remind
you of file name matching, but be forewarned: in general,
regular expression operators are different from the
shell metacharacters we discussed in Chapter 1.

The simplest form of a regular expression is one that
includes only letters. For example, the would match only
the three-letter sequence t, h, e. This pattern is found
in the following words: the, therefore, bother. In other
words, wherever the regular expression pattern is found
" even if it is surrounded by other characters " it will
be matched.

Regular Expression Characters

Regular expressions match patterns that consist of a combination of ordinary characters, such as letters, digits, and various other characters used as operators. You will meet examples of these below. A character's use often determines its meaning in a regular expression. All programs that use regular expressions have a search pattern. The editor family of programs (vi, ex, ed, and sed; see Chapter 7, "Editing Text Files") also has a replacement pattern. In some cases, the meaning of a special character differs depending on whether it's used as part of the search pattern or in the replacement pattern.

A Regular Expression with No Special Characters

Here's an example of a simple search for a regular expression. This regular expression is a character string with no special characters in it.

$ grep only REfile

includes only letters. For example, the would match only

The sole occurrence of only satisfied grep's search, so grep printed the matching line.

Special Characters

Certain characters have special meanings when used in regular expressions, and some of them have special meanings depending on their position in the regular expression. Some of these characters are used as placeholders and some as operators. Some are used for both, depending on their position in the regular expression.

The dot (.), asterisk (*), left square bracket ([) and backslash (\) are special except when they appear between a left and right pair of square brackets ([]).
A circumflex or caret (^) is special when it's the first character of a regular expression, and also when it's the first character after the opening left square bracket in a left and right pair of square brackets.
A dollar sign ($) is special when it's the last character of a regular expression.
A pair of delimiters, usually a pair of slash characters (//), is special because it delimits the regular expression.
NOTE:
Any character not used in the current regular expression can be used as the delimiter, but the slash is traditional.
A special character preceded by a backslash is matched by the character itself. This is called escaping. When a special character is escaped, the command recognizes it as a literal—the actual character with no special meaning. In other words, as in file-name matching, the backslash cancels the special meaning of the character that follows it.
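The effect of escaping is easiest to see by running the same search with and without the backslash. This is a sketch on a made-up three-line file; the -c option makes grep print a count of matching lines.

```shell
# Hypothetical price list.
prices=$(mktemp)
printf 'cost 1.50\ncost 1250\ncost 2.50\n' > "$prices"

loose=$(grep -c '1.5' "$prices")     # dot is special: matches 1.5 and also 125
strict=$(grep -c '1\.5' "$prices")   # escaped dot: matches only a literal period

echo "unescaped: $loose, escaped: $strict"
rm -f "$prices"
```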

Now let's look at each character in detail.

Matching Any One Character

The dot matches any one character except a newline. For example, consider the following:

$ grep "w.r" REfile
from the set of uppercase and lowercase letters, digits,
you of file name matching, but be forewarned: in general,
in the following words: the, therefore, bother. In other
words, wherever the regular expression pattern is found

NOTE:
The regular expression w.r appears within quotation marks. Quoting the pattern is necessary if grep is to function reliably. If the quotes are omitted, the shell (see Chapters 11, 12, and 13) may interpret certain special characters in the regular expression as if they were "shell special characters" rather than "grep special characters" and the result will be unexpected. Apostrophes (referred to by UNIXees as "single quotes") are the safest choice, because the shell leaves every character between them untouched.

The pattern w.r is matched by wer in lowercase on the first displayed line, by war in forewarned on the second, by wor in words on the third, and by wor in words on the fourth. Expressed in English, the sample command says "Find and display all lines that match the following pattern: w followed by any character except a newline followed by r."

You can form a somewhat different one-character regular expression by enclosing a list of characters in a left and right pair of square brackets. The matching is limited to those characters listed between the brackets. For example, the pattern

[aei135XYZ]

matches any one of the characters a, e, i, 1, 3, 5, X, Y, or Z. Consider the following example:

$ grep "w[fhmkz]" REfile

words, wherever the regular expression pattern is found

This time, the match was satisfied only by the wh in wherever, matching the pattern "w followed by either f, h, m, k, or z." If the first character in the list is a right square bracket (]), it does not terminate the list—that would make the list empty, which is not permitted. Instead, ] itself becomes one of the possible characters in the search pattern. For example, the pattern

[]a]

matches either ] or a. If the first character in the list is a circumflex (also called a caret), the match occurs on any character that is not in the list:

$ grep "w[^fhmkz]" REfile

from the set of uppercase and lowercase letters, digits,
you of file name matching, but be forewarned: in general,
shell metacharacters we discussed in Chapter 1.
includes only letters. For example, the would match only
in the following words: the, therefore, bother. In other
words, wherever the regular expression pattern is found
" even if it is surrounded by other characters " it will

The pattern "w followed by anything except f, h, m, k, or z" has many matches. On line 1, we in lowercase is a "w followed by anything except an f, an h, an m, a k, or a z." On line 2, wa in forewarned is a match, as is the word we on line 3. Line 4 contains wo in would. Line 5 contains wi in following; the later wo in words on that line is ignored because the match is already satisfied. Line 6 has wo in words, at the beginning of the line, as its match. Finally, at the end of line 7, wi in will matches.
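Both special first characters of a bracket list, the right square bracket and the circumflex, can be tried on a small made-up file. The sketch below also combines the two, which the text above does not show: in [^]a] the ] that immediately follows the circumflex is still taken literally.

```shell
# Hypothetical three-line file.
lines=$(mktemp)
printf 'x]\nxa\nxb\n' > "$lines"

listed=$(grep -c 'x[]a]' "$lines")      # x followed by ] or a
negated=$(grep -c 'x[^]a]' "$lines")    # x followed by anything except ] or a

echo "listed: $listed, negated: $negated"
rm -f "$lines"
```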

You can use a minus sign (-) inside the left and right pair of square brackets to indicate a range of letters or digits. For example, the pattern

[a-z]

matches any lowercase letter.

NOTE:
You cannot write the range "backward"; that is, [z-a] is illegal.

Consider the following example:

$ grep "w[a-f]" REfile

from the set of uppercase and lowercase letters, digits,
you of file name matching, but be forewarned: in general,
shell metacharacters we discussed in Chapter 1.

The matches are we on line 1, wa on line 2, and we on line 3. Look at REfile again and note how many potential matches are omitted because the character following the w is not one of the group a through f.

Furthermore, you can include several ranges in one set of brackets. For example, the pattern

[a-zA-Z]

matches any letter, lower- or uppercase.
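Here is a short sketch of a two-range list at work on a made-up file. Setting LC_ALL=C for the command is an assumption added here so that the ranges follow plain ASCII order; in other locales, ranges can behave differently.

```shell
# Hypothetical file: two lines contain a letter, two do not.
mixed=$(mktemp)
printf '1234\n12a4\nQ\n----\n' > "$mixed"

# Count the lines that contain at least one letter of either case.
letters=$(LC_ALL=C grep -c '[a-zA-Z]' "$mixed")
echo "$letters"
rm -f "$mixed"
```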

Matching Multiples of a Single Character

If you want to specify precisely how many of a given character you want the regular expression to match, you can use the escaped left and right curly brace pair (\{____\}). For example, the pattern

X\{2,5\}

matches at least two but not more than five Xs. That is, it matches XX, XXX, XXXX, or XXXXX. The minimum number of matches is written immediately after the escaped left curly brace, followed by a comma (,) and then the maximum value. If you omit the maximum value (but not the comma), as in

X\{2,\}

you specify that the match should occur for at least two Xs. If you write just a single value, omitting the comma, you specify the exact number of matches, no more and no less. For example, the pattern

X\{4\}

matches only XXXX. Here are some examples of this kind of regular expression:

$ grep "p\{2\}" REfile

from the set of uppercase and lowercase letters, digits,

This is the only line that contains "pp."

$ grep "p\{1\}" REfile

A regular expression is a sequence of characters taken
from the set of uppercase and lowercase letters, digits,
punctuation marks, etc., plus a set of special regular
expression operators. Some of these operators may remind
regular expression operators are different from the
shell metacharacters we discussed in Chapter 1.
The simplest form of a regular expression is one that
includes only letters. For example, the would match only
the three-letter sequence t, h, e. This pattern is found
words, wherever the regular expression pattern is found

Notice that on the second line, the first "p" in "uppercase" satisfies the search. The grep program doesn't even see the second "p" in the word because it stops searching as soon as it finds one "p."
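The same behavior explains something that may be surprising about the upper limit: because grep is satisfied as soon as it finds a match anywhere on the line, a line of six Xs still matches X\{2,5\}. The following sketch, on a made-up file, uses the ^ and $ anchors (covered later in this section) to make the limits apply to the whole line.

```shell
# Hypothetical file with runs of 1, 2, 5, and 6 Xs.
xs=$(mktemp)
printf 'X\nXX\nXXXXX\nXXXXXX\n' > "$xs"

unanchored=$(grep -c 'X\{2,5\}' "$xs")    # XX, XXXXX, and XXXXXX all contain a match
anchored=$(grep -c '^X\{2,5\}$' "$xs")    # only XX and XXXXX are two to five Xs long

echo "unanchored: $unanchored, anchored: $anchored"
rm -f "$xs"
```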

Matching Multiples of a Regular Expression

The asterisk (*) matches zero or more of the preceding regular expression. Therefore, the pattern

X*

matches zero or more Xs: nothing, X, XX, XXX, and so on. To ensure that you get at least one character in the match, use

XX*

For example, the command

$ grep "p*" REfile

displays the entire file, because every line can match "zero or more instances of the letter p." However, note the output of the following commands:

$ grep "pp*" REfile

A regular expression is a sequence of characters taken
from the set of uppercase and lowercase letters, digits,
punctuation marks, etc., plus a set of special regular
expression operators. Some of these operators may remind
regular expression operators are different from the
shell metacharacters we discussed in Chapter 1.
The simplest form of a regular expression is one that
includes only letters. For example, the would match only
the three-letter sequence t, h, e. This pattern is found
words, wherever the regular expression pattern is found

$ grep "ppp*" REfile

from the set of uppercase and lowercase letters, digits,

The regular expression ppp* matches "pp followed by zero or more instances of the letter p," or, in other words, "two or more instances of the letter p."

The extended set of regular expressions includes two additional operators that are similar to the asterisk: the plus sign (+) and the question mark (?). The plus sign is used to match one or more occurrences of the preceding character, and the question mark is used to match zero or one occurrences. For example, the command

$ egrep "p?" REfile

outputs the entire file because every line contains zero or one p. However, note the output of the following command:

$ egrep "p+" REfile

A regular expression is a sequence of characters taken
from the set of uppercase and lowercase letters, digits,
punctuation marks, etc., plus a set of special regular
expression operators. Some of these operators may remind
regular expression operators are different from the
shell metacharacters we discussed in Chapter 1.
The simplest form of a regular expression is one that
includes only letters. For example, the would match only
the three-letter sequence t, h, e. This pattern is found
words, wherever the regular expression pattern is found

Another possibility is [a-z]+. This pattern matches one or more occurrences of any lowercase letter.
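The difference between the two extended operators shows up clearly when a pattern is anchored to the whole line. The sketch below uses grep -E, which POSIX specifies as the equivalent of egrep (an assumption if your system offers only egrep); the word list is made up.

```shell
# Hypothetical word list.
spellings=$(mktemp)
printf 'color\ncolour\ncolouur\n' > "$spellings"

optional=$(grep -E -c '^colou?r$' "$spellings")   # zero or one u: color, colour
repeated=$(grep -E -c '^colou+r$' "$spellings")   # one or more u: colour, colouur

echo "with ?: $optional, with +: $repeated"
rm -f "$spellings"
```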

Anchoring the Match

A circumflex (^) used as the first character of the pattern anchors the regular expression to the beginning of the line. Therefore, the pattern

^[Tt]he

matches a line that begins with either The or the, but does not match a line that has a The or the at any other position on the line. Note, for example, the output of the following two commands:

$ grep "[Tt]he" REfile

from the set of uppercase and lowercase letters, digits,
expression operators. Some of these operators may remind
regular expression operators are different from the
The simplest form of a regular expression is one that
includes only letters. For example, the would match only
the three-letter sequence t, h, e. This pattern is found
in the following words: the, therefore, bother. In other
words, wherever the regular expression pattern is found
" even if it is surrounded by other characters " it will

$ grep "^[Tt]he" REfile

The simplest form of a regular expression is one that
the three-letter sequence t, h, e. This pattern is found

A dollar sign as the last character of the pattern anchors the regular expression to the end of the line, as in the following example:

$ grep "1\.$" REfile

shell metacharacters we discussed in Chapter 1.

This anchoring occurs because the line ends in a match of the regular expression. The period in the regular expression is preceded by a backslash, so the program knows that it's looking for a period and not just any character. Here's another example that uses REfile:

$ grep "[Tt]he$" REfile

regular expression operators are different from the
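The two anchors can also be used together. As a sketch on a made-up file, the pattern ^$ (beginning of line immediately followed by end of line) picks out empty lines, while d$ picks out lines whose last character is d.

```shell
# Hypothetical file containing three empty lines.
doc=$(mktemp)
printf 'first\n\nsecond\n\n\nthird\n' > "$doc"

blank=$(grep -c '^$' "$doc")      # lines with nothing between start and end
ends_d=$(grep -c 'd$' "$doc")     # second and third end in d

echo "blank: $blank, ending in d: $ends_d"
rm -f "$doc"
```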

The regular expression .* is an idiom that matches any sequence of zero or more characters. Any multicharacter regular expression always matches the longest string of characters that fits the regular expression description. Consequently, .* used as the entire regular expression always matches an entire line of text. Therefore, the command

$ grep "^.*$" REfile

prints the entire file. Note that in this case the anchoring characters are redundant. When used as part of an "unanchored" regular expression, that idiomatic regular expression matches the longest string that fits the description, as in the following example:

$ grep "C.*1" REfile

shell metacharacters we discussed in Chapter 1.

The regular expression C.*1 matches the longest string that begins with a C and ends with a 1. Another expression, d.*d, matches the longest string that begins and ends with a d. On each line of output in the following example, the matched string runs from the first d on the line to the last d:

$ grep "d.*d" REfile

from the set of uppercase and lowercase letters, digits,
shell metacharacters we discussed in Chapter 1.
includes only letters. For example, the would match only
words, wherever the regular expression pattern is found
" even if it is surrounded by other characters " it will

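Because grep prints whole lines, it cannot show exactly which characters d.*d matched. One way to see the longest-match rule at work is to let sed (see Chapter 7) wrap the matched text in square brackets. This is only a sketch: the sentence is made up, and the & in the replacement stands for whatever the pattern matched.

```shell
# Made-up sentence containing several d characters.
sentence='a dog and a duck dived in'

# Wrap the text matched by d.*d in brackets.
marked=$(printf '%s\n' "$sentence" | sed 's/d.*d/[&]/')
echo "$marked"     # prints: a [dog and a duck dived] in
```

The match starts at the first d on the line and stretches to the last one, skipping over every d in between.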
You've seen that a regular expression command such as grep finds a match even if the regular expression is surrounded by other characters. For example, the pattern

[Tt]he

matches the, The, there, There, other, oTher, and so on (even though the last word is unlikely to be used). Suppose that you're looking for the word The or the and do not want to match other, There, or there. In a few of the commands that use full regular expressions, you can surround the regular expression with escaped angle brackets (\<___\>). For example, the pattern

\<the\>

represents the string the, where t follows a character that is not a letter, digit, or underscore (or starts the line), and where e is followed by a character that is not a letter, digit, or underscore (or ends the line). If you need to anchor only one end of the word, you can use the angle brackets singly. That is, the pattern \<the matches any word that begins with the, and ter\> matches any word that ends with ter.

You can tell egrep (but not grep) to search for either of two regular expressions as follows. Do not put spaces around the |, because inside the quotation marks they would become part of the patterns:

$ egrep "regular-expression-1|regular-expression-2" filename
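The following sketch exercises both ideas on a handful of invented words. It assumes that the local grep supports the \< and \> word anchors (a common extension rather than a guarantee) and uses grep -E, the POSIX spelling of egrep.

```shell
# Hypothetical word list, one entry per line.
words='the
The
there
other
bathe in the sun'

# \<the\> matches "the" only as a whole word.
whole=$(printf '%s\n' "$words" | grep -c '\<the\>')

# \<the matches any word that merely begins with "the".
prefix=$(printf '%s\n' "$words" | grep -c '\<the')

# Alternation: either The or other, with no spaces around the |.
either=$(printf '%s\n' "$words" | grep -E -c 'The|other')

printf 'whole=%s prefix=%s either=%s\n' "$whole" "$prefix" "$either"
```

The counts differ because there begins with the but is not the whole word, while other contains the only in its interior.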

Regular Expression Examples

When you first look at the list of special characters used with regular expressions, constructing search-and-replacement patterns seems to be a complex process. A few examples and exercises, however, can make the process easier to understand.

Example 1: Matching Lines That Contain a Date

A standard USA date consists of a pattern that includes the capitalized name of a month, a space, a one- or two-digit number representing the day, a comma, a space, and a four-digit number representing the year. For example, Feb 9, 1994 is a standard USA date. You can write that pattern as a regular expression:

[A-Z][a-z]* [0-9]\{1,2\}, [0-9]\{4\}

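You can improve this pattern by bounding the length of the month name. May, the month with the shortest name, has three letters, and September, the longest, has nine. Because [A-Z] already accounts for the first letter, the interval applies to the two to eight lowercase letters that follow it:

[A-Z][a-z]\{2,8\} [0-9]\{1,2\}, [0-9]\{4\}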
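One way to check a date pattern of this kind is to feed grep a few test lines. The sketch below assumes that a month name is one capital letter followed by two to eight lowercase letters (May has two after the M, September has eight after the S). The sample lines and variable names are invented.

```shell
# One capital, 2 to 8 lowercase letters, a space, a 1- or 2-digit day,
# a comma, a space, and a 4-digit year.
date_re='[A-Z][a-z]\{2,8\} [0-9]\{1,2\}, [0-9]\{4\}'

# Hypothetical test lines: three with a date, two without.
sample='Released on Feb 9, 1994.
Filed May 15, 1993.
Due September 30, 1992
feb 9, 1994 has no capital
Feb 1994 has no day'

dates=$(printf '%s\n' "$sample" | grep "$date_re")
printf '%s\n' "$dates"
```

The last two lines are rejected because one lacks the capital letter and the other lacks the day and comma.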

Example 2: Matching Social Security Numbers

Social security numbers also are highly structured: three digits, a dash, two digits, a dash, and four digits. Here's how you can write a regular expression for social security numbers:

[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}

Example 3: Matching Telephone Numbers

Another familiar structured pattern is found in telephone numbers, such as 1-800-555-1212. Here's a regular expression that matches that pattern:

1-[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}
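A short sketch can confirm that the two patterns select different lines. The records are made up for illustration, and the variable names are hypothetical.

```shell
ssn_re='[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'
phone_re='1-[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}'

# Hypothetical records: one SSN-shaped, one phone-shaped, two neither.
records='id 123-45-6789 on file
call 1-800-555-1212 today
bad id 78-05-1120
555-1212 is too short'

ssn_hits=$(printf '%s\n' "$records" | grep "$ssn_re")
phone_hits=$(printf '%s\n' "$records" | grep "$phone_re")

printf '%s\n' "$ssn_hits" "$phone_hits"
```

Neither pattern is anchored, so a longer run of digits that happens to contain the same shape would also match. Adding the \< and \> anchors, where the local grep supports them, is one way to tighten the search.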

Details of the grep Family

The grep family consists of three members:

grep

This command uses a limited set of regular expressions, described earlier in this chapter.

egrep

Extended grep. This command uses full regular expressions (expressions that have string values and use the full set of alphanumeric and special characters) to match patterns. Full regular expressions include all the limited regular expressions of grep (except for \( and \)), as well as the following ones (where RE is any regular expression):

RE+: Matches one or more occurrences of RE. (Contrast that with RE*, which matches zero or more occurrences of RE.)
RE?: Matches zero or one occurrence of RE.
RE1 | RE2: Matches either RE1 or RE2. The | acts as a logical OR operator.
(RE): Groups multiple regular expressions.
The section "The egrep Command" provides examples of these expressions.

fgrep

Fast grep. This command searches for a string, not a pattern. Because fgrep does not use regular expressions, it interprets $, *, [, ], (, ), and \ as ordinary characters. Modern implementations of grep appear to be just as fast as fgrep, so using fgrep is becoming obsolete—except when your search involves the previously mentioned characters.

NOTE:
The $, *, [, ], (, ), and \ regular expression metacharacters also have special meaning to the shell, so you must enclose them within single quotation marks to prevent the shell from interpreting them. (See Chapters 11, 12, and 13.)
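To see the difference between a literal search and a regular expression search, consider the following sketch. It uses grep -F, the POSIX spelling of fgrep, and three invented lines. The single quotation marks keep the shell from touching the $ and the *.

```shell
# Hypothetical lines containing regular expression metacharacters.
prices='cost: $5.00*
cost: $5x00
note (see [1])'

# With -F every character in the search string is literal.
literal=$(printf '%s\n' "$prices" | grep -F '$5.00*')

# As a regular expression, the same characters need backslashes.
escaped=$(printf '%s\n' "$prices" | grep '\$5\.00\*')

printf '%s\n' "$literal" "$escaped"
```

Both commands select only the first line, but the fixed-string form is easier to type and harder to get wrong.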

The grep Command

The most frequently used command in the family is grep. Its complete syntax is

$ grep [options] RE [file(s)]

where RE is a limited regular expression of the kind described earlier in this chapter. The grep command reads from the files named on the command line or, if no files are specified, from standard input. Table 6.5 lists the command-line options that grep takes.

Table 6.5. Command-Line Options for grep

-b: Display, at the beginning of the output line, the number of the block in which the regular expression was found. This can be helpful in locating block numbers by context. (The first block is block zero.)
-c: Print the number of lines that contain the pattern, that is, the number of matching lines.
-h: Prevent the name of the file that contains the matching line from being displayed at the beginning of that line.
NOTE:
When searching multiple files, grep normally reports not only the matching line but also the name of the file that contains it.
-i: Ignore distinctions between uppercase and lowercase during comparisons.
-l: Print only the name of each file that contains a match, one name per line, regardless of how many lines match in that file.
-n: Precede each matching line by its line number in the file.
-s: Suppress error messages about nonexistent or unreadable files.
-v: Print all lines except those that contain the pattern. This reverses the logic of the search.

Here are two sample files on which to exercise grep:

$ cat cron

In SCO Xenix 2.3, or SCO UNIX, you can edit a
crontab file to your heart's content, but it will
not be re-read, and your changes will not take
effect, until you come out of multi-user run
level (thus killing cron), and then re-enter
multi-user run level, when a new cron is started;
or until you do a reboot.

The proper way to install a new version of a
crontab (for root, or for any other user) is to
issue the command "crontab new.jobs", or "cat
new.jobs | crontab", or if in "vi" with a new
version of the commands, "w ! crontab". I find it
easy to type "vi /tmp/tbl", then ":0 r !crontab
-l" to read the existing crontab into the vi
buffer, then edit, then type ":w !crontab", or
"!crontab %" to replace the existing crontab with
what I see on vi's screen.

$ cat pax

This is an announcement for the MS-DOS version of
PAX version 2. See the README file and the man
pages for more information on how to run PAX,
TAR, and CPIO.

For those of you who don't know, pax is a 3 in 1
program that gives the functionality of pax, tar,
and cpio. It supports both the DOS filesystem
and the raw "tape on a disk" system used by most
micro UNIX systems. This will allow for easy
transfer of files to and from UNIX systems. It
also supports multiple volumes. Floppy density
for raw UNIX type read/writes can be specified on
the command line.

The source will eventually be posted to one of
the source groups.

Be sure to use a blocking factor of 20 with
pax-as-tar and B with pax-as-cpio for best
performance.

The following examples show how to find a string in a file:

$ grep "you" pax

For those of you who don't know, pax is a 3 in 1

$ grep "you" cron

In SCO Xenix 2.3, or SCO UNIX, you can edit a
crontab file to your heart's content, but it will
not be re-read, and your changes will not take
effect, until you come out of multi-user run
or until you do a reboot.

Note that the second and third lines match because the string you appears inside the word your.

You can find the same string in two or more files by using a variety of options. In this first example, case is ignored:

$ grep -i "you" pax cron

pax:For those of you who don't know, pax is a 3 in 1
cron:In SCO Xenix 2.3, or SCO UNIX, you can edit a
cron:crontab file to your heart's content, but it will
cron:not be re-read, and your changes will not take
cron:effect, until you come out of multi-user run
cron:or until you do a reboot.

Notice that each line of output begins with the name of the file that contains a match. In the following example, the output includes the name of the file and the number of the line of that file on which the match is found:

$ grep -n "you" pax cron

pax:6:For those of you who don't know, pax is a 3 in 1
cron:1:In SCO Xenix 2.3, or SCO UNIX, you can edit a
cron:2:crontab file to your heart's content, but it will
cron:3:not be re-read, and your changes will not take
cron:4:effect, until you come out of multi-user run
cron:7:or until you do a reboot.

The following example shows how to print only a count of the matching lines in each file, rather than the lines themselves:

$ grep -c "you" pax cron

pax:1
cron:5
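The -l option from Table 6.5 goes one step further and reports only the names of the files that contain a match. Here is a minimal sketch that contrasts it with -c. The two throwaway files, a.txt and b.txt, are hypothetical and are created in a temporary directory (mktemp -d is assumed to be available).

```shell
dir=$(mktemp -d)
printf 'you can edit\nyour changes\n' > "$dir/a.txt"
printf 'nothing relevant\n' > "$dir/b.txt"

# -c prints one count per file, including files with no match.
counts=$(cd "$dir" && grep -c 'you' a.txt b.txt)

# -l prints just the names of the files that contain a match.
names=$(cd "$dir" && grep -l 'you' a.txt b.txt)

printf '%s\n' "$counts" "$names"
rm -r "$dir"
```

This form is handy when you only need to know which files to open next.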

The following output shows the matching lines without specifying the files from which they came:

$ grep -h "you" pax cron

For those of you who don't know, pax is a 3 in 1
In SCO Xenix 2.3, or SCO UNIX, you can edit a
crontab file to your heart's content, but it will
not be re-read, and your changes will not take
effect, until you come out of multi-user run
or until you do a reboot.

The following specifies output of "every line in pax and cron that does not have [Yy][Oo][Uu] in it":

$ grep -iv "you" pax cron

pax:This is an announcement for the MS-DOS version of
pax:PAX version 2. See the README file and the man
pax:pages for more information on how to run PAX,
pax:TAR, and CPIO.
pax:
pax:program that gives the functionality of pax, tar,
pax:and cpio. It supports both the DOS filesystem
pax:and the raw "tape on a disk" system used by most
pax:micro UNIX systems. This will allow for easy
pax:transfer of files to and from UNIX systems. It
pax:also supports multiple volumes. Floppy density
pax:for raw UNIX type read/writes can be specified on
pax:the command line.
pax:
pax:The source will eventually be posted to one of
pax:the source groups.
pax:
pax:Be sure to use a blocking factor of 20 with
pax:pax-as-tar and B with pax-as-cpio for best
pax:performance.
cron:level (thus killing cron), and then re-enter
cron:multi-user run level, when a new cron is started;
cron:
cron:The proper way to install a new version of a
cron:crontab (for root, or for any other user) is to
cron:issue the command "crontab new.jobs", or "cat
cron:new.jobs | crontab", or if in "vi" with a new
cron:version of the commands, "w ! crontab". I find it
cron:easy to type "vi /tmp/tbl", then ":0 r !crontab
cron:-l" to read the existing crontab into the vi
cron:buffer, then edit, then type ":w !crontab", or
cron:"!crontab %" to replace the existing crontab with
cron:what I see on vi's screen.

Note that blank lines are considered to be lines that do not match the given regular expression.

The following example is quite interesting. It lists every line that has r.*t in it and of course it matches the longest possible string in each line. First, let's see exactly how the strings are matched. The matching strings in the listing are highlighted in italics so that you can see what grep actually matches:

$ grep "r.*t" pax cron

pax:This is an announcement for the MS-DOS version of
pax:PAX version 2. See the README file and the man
pax:pages for more information on how to run PAX,
pax:For those of you who don't know, pax is a 3 in 1
pax:program that gives the functionality of pax, tar,
pax:and cpio. It supports both the DOS filesystem
pax:and the raw "tape on a disk" system used by most
pax:micro UNIX systems. This will allow for easy
pax:transfer of files to and from UNIX systems. It
pax:also supports multiple volumes. Floppy density
pax:for raw UNIX type read/writes can be specified on
pax:The source will eventually be posted to one of
pax:Be sure to use a blocking factor of 20 with
pax:pax-as-tar and B with pax-as-cpio for best
cron:In SCO Xenix 2.3, or SCO UNIX, you can edit a
cron:crontab file to your heart's content, but it will
cron:not be re-read, and your changes will not take
cron:level (thus killing cron), and then re-enter
cron:multi-user run level, when a new cron is started;
cron:or until you do a reboot.
cron:The proper way to install a new version of a
cron:crontab (for root, or for any other user) is to
cron:issue the command "crontab new.jobs", or "cat
cron:new.jobs | crontab", or if in "vi" with a new
cron:version of the commands, "w ! crontab". I find it
cron:easy to type "vi /tmp/tbl", then ":0 r !crontab
cron:-l" to read the existing crontab into the vi
cron:buffer, then edit, then type ":w !crontab", or
cron:"!crontab %" to replace the existing crontab with

You can obtain for free a version of grep that highlights the matched string, but the standard version of grep simply shows the line that contains the match.

If you are thinking that grep doesn't seem to do anything with the patterns that it matches, you are correct. But in Chapter 7, "Editing Text Files," you will see how the sed command does replacements.
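Where the local grep offers the -o option (an extension found in several current implementations, not part of the grep described in this chapter), you can ask it to print only the text that matched. The sketch below applies r.*t to one line of the pax file.

```shell
line='program that gives the functionality of pax, tar,'

# -o prints only the matched text instead of the whole line
# (assumed to be supported by the local grep).
matched=$(printf '%s\n' "$line" | grep -o 'r.*t')
printf '%s\n' "$matched"
```

The match runs from the first r (in program) to the last t on the line (in tar), which illustrates the longest-match rule. A shortest-match rule would have stopped at the t in that.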

Now let's look for two or more consecutive occurrences of the letter l (two l's followed by zero or more l's):

$ grep "lll*" pax cron

pax:micro UNIX systems. This will allow for easy
pax:The source will eventually be posted to one of
cron:crontab file to your heart's content, but it will
cron:not be re-read, and your changes will not take
cron:level (thus killing cron), and then re-enter
cron:The proper way to install a new version of a

The following command finds lines that begin with The:

$ grep "^The" pax cron

pax:The source will eventually be posted to one of
cron:The proper way to install a new version of a

The next command finds lines that end with n:

$ grep "n$" pax cron

pax:PAX version 2. See the README file and the man
pax:for raw UNIX type read/writes can be specified on
cron:effect, until you come out of multi-user run

You can easily use the grep command to search for two or more consecutive uppercase letters:

$ grep "[A-Z]\{2,\}" pax cron

pax:This is an announcement for the MS-DOS version of
pax:PAX version 2. See the README file and the man
pax:pages for more information on how to run PAX,
pax:TAR, and CPIO.
pax:and cpio. It supports both the DOS filesystem
pax:micro UNIX systems. This will allow for easy
pax:transfer of files to and from UNIX systems. It
pax:for raw UNIX type read/writes can be specified on
cron:In SCO Xenix 2.3, or SCO UNIX, you can edit a

The egrep Command

As mentioned earlier, egrep uses full regular expressions in the pattern string. The syntax of egrep is the same as that for grep:

$ egrep [options] RE [files]

where RE is a regular expression. The egrep command uses the same regular expressions as the grep command, except for \( and \), and includes the following additional patterns:

RE+: Matches one or more occurrences of RE. (Contrast this with grep's RE* pattern, which matches zero or more occurrences of RE.)
RE?: Matches zero or one occurrence of RE.
RE1 | RE2: Matches either RE1 or RE2. The | acts as a logical OR operator.
(RE): Groups multiple regular expressions.

The egrep command accepts the same command-line options as grep (see Table 6.5) as well as the following additional command-line options:

-e special_expression: Search for a special expression (that is, a regular expression that begins with a -)
-f file: Read the regular expressions from file
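The two options can be sketched as follows, using grep -E (the POSIX spelling of egrep). The file names notes and patterns are hypothetical, and mktemp -d is assumed to be available for the scratch directory.

```shell
dir=$(mktemp -d)
printf '%s\n' 'crontab -l lists jobs' 'tar and cpio' 'DOS filesystem' > "$dir/notes"

# -e marks the next argument as the pattern even though it begins with -.
dash=$(grep -E -e '-l' "$dir/notes")

# -f reads the patterns, one per line, from a file.
printf '%s\n' 'DOS' 'cpio' > "$dir/patterns"
from_file=$(grep -E -f "$dir/patterns" "$dir/notes")

printf '%s\n' "$dash" "$from_file"
rm -r "$dir"
```

A line is selected if it matches any one of the patterns in the file, so a pattern file acts like a long alternation.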

Here are a few examples of egrep's extended regular expressions. The first finds two or more consecutive uppercase letters:

$ egrep "[A-Z][A-Z]+" pax cron

pax:This is an announcement for the MS-DOS version of
pax:PAX version 2. See the README file and the man
pax:pages for more information on how to run PAX,
pax:TAR, and CPIO.
pax:and cpio. It supports both the DOS filesystem
pax:micro UNIX systems. This will allow for easy
pax:transfer of files to and from UNIX systems. It
pax:for raw UNIX type read/writes can be specified on
cron:In SCO Xenix 2.3, or SCO UNIX, you can edit a

The following command finds each line that contains either DOS or SCO:

$ egrep "DOS|SCO" pax cron

pax:This is an announcement for the MS-DOS version of
pax:and cpio. It supports both the DOS filesystem
cron:In SCO Xenix 2.3, or SCO UNIX, you can edit a

The next example finds all lines that contain either new or now:

$ egrep "n(e|o)w" cron

multi-user run level, when a new cron is started;
The proper way to install a new version of a
issue the command "crontab new.jobs", or "cat
new.jobs | crontab", or if in "vi" with a new
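For a compact comparison of ?, +, and grouped alternation, the sketch below counts matches in a small invented word list. It uses grep -E, the POSIX spelling of egrep, and anchors each pattern to the whole line so that the counts are easy to verify by eye.

```shell
# Hypothetical word list, one entry per line.
text='color
colour
colouur
colouuur
new
now
naw'

# u? : zero or one u.
optional=$(printf '%s\n' "$text" | grep -E -c '^colou?r$')

# u+ : one or more u.
repeated=$(printf '%s\n' "$text" | grep -E -c '^colou+r$')

# (e|o) : grouping combined with alternation.
grouped=$(printf '%s\n' "$text" | grep -E -c '^n(e|o)w$')

printf 'optional=%s repeated=%s grouped=%s\n' "$optional" "$repeated" "$grouped"
```

Note that color satisfies u? but not u+, because + requires at least one occurrence.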

The fgrep Command

The fgrep command searches a file for a character string and prints all lines that contain the string. Unlike grep and egrep, fgrep interprets each character in the search string as a literal character, because fgrep has no metacharacters. The syntax of fgrep is

fgrep [options] string [files]

The options you use with the fgrep command are exactly the same as those that you use for egrep, with the addition of -x, which prints only the lines that are matched in their entirety. As an example of fgrep's -x option, consider the following file named sample:

$ cat sample
this is
a
file for testing
fgrep's x
option.

Now, invoke fgrep with the -x option and a as the pattern.

$ fgrep -x a sample
a

That matches the second line of the file, but

$ fgrep -x option sample

outputs nothing, as option doesn't match a line in the file. However,

$ fgrep -x option. sample
option.

matches the entire last line.

Sorting Text Files

UNIX provides two commands that are useful when you are sorting text files: sort and uniq. The sort command sorts the lines of one or more text files and can merge files that are already sorted; the uniq command compares adjacent lines of a file and eliminates all but one occurrence of adjacent duplicate lines.

The sort Command

The sort command is useful with database files—files that are line- and field-oriented—because it can sort or merge one or more text files into a sequence that you select. The command normally treats a blank or a tab as a delimiter. If the file has multiple blanks, multiple tabs, or both between two fields, only the first is considered a delimiter; all the others belong to the next field. The -b option tells sort to ignore the blanks and tabs that are not delimiters, discarding them instead of adding them to the beginning of the next field.

The normal ordering for sort follows the ASCII code sequence.
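On current systems the collating order can also depend on the locale settings. The sketch below therefore sets LC_ALL=C, which is an assumption made here to force plain ASCII ordering. It also shows how -f changes the result. The input lines are invented.

```shell
names='apple
Banana
cherry
10
9'

# Plain ASCII order: digits, then uppercase, then lowercase letters.
ascii=$(printf '%s\n' "$names" | LC_ALL=C sort)

# -f folds lowercase to uppercase before comparing.
folded=$(printf '%s\n' "$names" | LC_ALL=C sort -f)

printf '%s\n' "$ascii" '--' "$folded"
```

In ASCII order Banana precedes apple because every uppercase letter has a lower code than any lowercase letter, and 10 precedes 9 because the comparison proceeds character by character.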

The syntax for sort is

$ sort [-cmu] [-ooutfile] [-ymemsize] [-zrecsize] [-dfiMnr] [-btchar]
     [+pos1 [-pos2]] [file(s)]

Table 6.6 describes the options of sort.

Table 6.6. The sort Command's Options

-c: Tells sort to check only whether the file is in the order specified.
-u: Tells sort to ignore any repeated lines (but see the next section, "The uniq Command").
-m: Tells sort to merge (and sort) the files that are already sorted. (This section features an example.)
-zrecsize: Specifies the length of the longest line to be merged and prevents sort from terminating abnormally when it sees a line that is longer than usual. You use this option only when merging files.
-ooutfile: Specifies the name of the output file. This option is an alternative to and an improvement on redirection, in that outfile can have the same name as the file being sorted.
-ymemsize: Specifies the amount of memory that sort uses. This option keeps sort from consuming all the available memory. -y0 causes sort to begin with the minimum possible memory that your system permits, and -y initially gives sort the most it can get. memsize is specified in kilobytes.
-d: Causes a dictionary order sort, in which sort ignores everything except letters, digits, blanks, and tabs.
-f: Causes sort to ignore upper- and lowercase distinctions when sorting.
-i: Causes sort to ignore nonprinting characters (decimal ASCII codes 0 to 31 and 127).
-M: Compares the contents of specified fields as if they contained the name of a month, by examining the first three letters or digits in each field, converting the letters to uppercase, and sorting them in calendar order.
-n: Causes sort to ignore blanks and sort in numerical order. Digits and associated characters—the plus sign, the minus sign, the decimal point, and so on—have their usual mathematical meanings.
-r: When added to any option, causes sort to sort in reverse.
-tchar: Selects the delimiter used in the file. (This option is unnecessary if the file uses a blank or a tab as its delimiter.)
+pos1 [-pos2]: Restricts the key on which the sort is based to one that begins at field pos1 and ends at field pos2. For example, to sort on field number 2, you must use +1 -2 (begin just after field 1 and continue through field 2). In addition, you can use - as an argument to force sort to take its input from stdin.
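Some current versions of sort may not accept the +pos1 -pos2 notation. POSIX specifies the -k option in its place, which counts fields from 1 rather than from 0, so +1 -2 corresponds to -k 2,2. The following sketch assumes a sort that supports -k and uses a cut-down, invented three-field list (class, driver, time).

```shell
results='ES Arther 49.412
AS Saint 46.629
ES Downs 47.133
AS Lisle 51.030'

# -k 1,1  : primary key is field 1 only (like +0 -1).
# -k 3,3n : secondary key is field 3 only, compared numerically.
by_class_time=$(printf '%s\n' "$results" | LC_ALL=C sort -k 1,1 -k 3,3n)
printf '%s\n' "$by_class_time"
```

Within each class the entries come out in order of increasing time.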

Here are some examples that demonstrate some common options. The file auto is a tab-delimited list of the results of an automobile race. From left to right, the fields list the class, driver's name, car year, car make, car model, and time:

$ cat auto

ES   Arther    85   Honda     Prelude   49.412
BS   Barker    90   Nissan    300ZX     48.209
AS   Saint     88   BMW       M-3       46.629
ES   Straw     86   Honda     Civic     49.543
DS   Swazy     87   Honda     CRX-Si    49.693
ES   Downs     83   VW        GTI       47.133
ES   Smith     86   VW        GTI       47.154
AS   Neuman    84   Porsche   911       47.201
CS   Miller    84   Mazda     RX-7      47.291
CS   Carlson   88   Pontiac   Fiero     47.398
DS   Kegler    84   Honda     Civic     47.429
ES   Sherman   83   VW        GTI       48.489
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
ES   Shorn     87   VW        GTI       49.357
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
DS   Lisanti   73   Porsche   914       50.609
CS   McGill    83   Porsche   944       50.642
AS   Lisle     72   Porsche   911       51.030
ES   Peerson   86   VW        Golf      54.493

If you invoke sort with no options, it sorts on the entire line:

$ sort auto

AS   Lisle     72   Porsche   911       51.030
AS   Neuman    84   Porsche   911       47.201
AS   Saint     88   BMW       M-3       46.629
BS   Barker    90   Nissan    300ZX     48.209
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
CS   McGill    83   Porsche   944       50.642
CS   Miller    84   Mazda     RX-7      47.291
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
DS   Kegler    84   Honda     Civic     47.429
DS   Lisanti   73   Porsche   914       50.609
DS   Swazy     87   Honda     CRX-Si    49.693
ES   Arther    85   Honda     Prelude   49.412
ES   Downs     83   VW        GTI       47.133
ES   Peerson   86   VW        Golf      54.493
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357
ES   Smith     86   VW        GTI       47.154
ES   Straw     86   Honda     Civic     49.543

To alphabetize a list by the driver's name, you need sort to begin with the second field (+1 means skip the first field). Sort normally treats the first blank (space or tab) in a sequence of blanks as the field separator and considers the rest of the blanks to be part of the next field. This has no effect on sorting on the second field, because there is an equal number of blanks between the class letters and the driver's name. However, whenever a field is "ragged" (for example, driver's name, car make, and car model), the next field will include leading blanks:

$ sort +1 auto

DS   Arbiter   86   Honda     CRX-Si    48.628
ES   Arther    85   Honda     Prelude   49.412
BS   Barker    90   Nissan    300ZX     48.209
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
ES   Downs     83   VW        GTI       47.133
DS   Karle     74   Porsche   914       48.826
DS   Kegler    84   Honda     Civic     47.429
DS   Lisanti   73   Porsche   914       50.609
AS   Lisle     72   Porsche   911       51.030
CS   McGill    83   Porsche   944       50.642
CS   Miller    84   Mazda     RX-7      47.291
AS   Neuman    84   Porsche   911       47.201
ES   Peerson   86   VW        Golf      54.493
AS   Saint     88   BMW       M-3       46.629
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357
ES   Smith     86   VW        GTI       47.154
ES   Straw     86   Honda     Civic     49.543
DS   Swazy     87   Honda     CRX-Si    49.693

Note that the driver's name alone determines the order here, because no two drivers share a name. If two drivers had the same name, they would have been further sorted by the car year. In other words, +1 actually means skip the first field and sort on the rest of the line. Here's a list sorted by race times:

$ sort -b +5 auto

AS   Saint     88   BMW       M-3       46.629
ES   Downs     83   VW        GTI       47.133
ES   Smith     86   VW        GTI       47.154
AS   Neuman    84   Porsche   911       47.201
CS   Miller    84   Mazda     RX-7      47.291
CS   Carlson   88   Pontiac   Fiero     47.398
DS   Kegler    84   Honda     Civic     47.429
BS   Barker    90   Nissan    300ZX     48.209
ES   Sherman   83   VW        GTI       48.489
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
ES   Shorn     87   VW        GTI       49.357
ES   Arther    85   Honda     Prelude   49.412
ES   Straw     86   Honda     Civic     49.543
CS   Chunk     85   Toyota    MR2       49.558
DS   Swazy     87   Honda     CRX-Si    49.693
CS   Cohen     91   Mazda     Miata     50.046
DS   Lisanti   73   Porsche   914       50.609
CS   McGill    83   Porsche   944       50.642
AS   Lisle     72   Porsche   911       51.030
ES   Peerson   86   VW        Golf      54.493

The -b option tells sort not to treat the blanks between the car model (for example, M-3) and the race time as part of the race time.
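The effect of -b is easiest to see on a tiny example. The sketch below uses two invented lines with different amounts of space before the second field, the POSIX -k notation for the key, and LC_ALL=C (an assumption) to force ASCII ordering.

```shell
rows='x    2
y 1'

# Without -b the key for field 2 keeps its leading blanks, and a
# blank sorts before any digit, so the x line comes first.
plain=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2)

# With -b the leading blanks are skipped, so 1 sorts before 2.
trimmed=$(printf '%s\n' "$rows" | LC_ALL=C sort -b -k 2)

printf '%s\n' "$plain" '--' "$trimmed"
```

The two commands produce opposite orders, which is why -b matters whenever the preceding field varies in width.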

Suppose that you want a list of times by class. You try the following command and discover that it fails:

$ sort +0 -b +5 auto

AS   Lisle     72   Porsche   911       51.030
AS   Neuman    84   Porsche   911       47.201
AS   Saint     88   BMW       M-3       46.629
BS   Barker    90   Nissan    300ZX     48.209
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
CS   McGill    83   Porsche   944       50.642
CS   Miller    84   Mazda     RX-7      47.291
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
DS   Kegler    84   Honda     Civic     47.429
DS   Lisanti   73   Porsche   914       50.609
DS   Swazy     87   Honda     CRX-Si    49.693
ES   Arther    85   Honda     Prelude   49.412
ES   Downs     83   VW        GTI       47.133
ES   Peerson   86   VW        Golf      54.493
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357
ES   Smith     86   VW        GTI       47.154
ES   Straw     86   Honda     Civic     49.543

This command line fails because it tells sort to skip nothing and sort on the rest of the line, then sort on the sixth field. To restrict the first sort to just the class, and then sort on time as the secondary sort, use the following expression:

$ sort +0 -1 -b +5 auto

AS   Saint     88   BMW       M-3       46.629
AS   Neuman    84   Porsche   911       47.201
AS   Lisle     72   Porsche   911       51.030
BS   Barker    90   Nissan    300ZX     48.209
CS   Miller    84   Mazda     RX-7      47.291
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
CS   McGill    83   Porsche   944       50.642
DS   Kegler    84   Honda     Civic     47.429
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
DS   Swazy     87   Honda     CRX-Si    49.693
DS   Lisanti   73   Porsche   914       50.609
ES   Downs     83   VW        GTI       47.133
ES   Smith     86   VW        GTI       47.154
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357
ES   Arther    85   Honda     Prelude   49.412
ES   Straw     86   Honda     Civic     49.543
ES   Peerson   86   VW        Golf      54.493

This command says skip nothing and stop after sorting on the first field, then skip to the end of the fifth field and sort on the rest of the line. In this case, the rest of the line is just the sixth field. Here's a simple merge example. Notice that both files are already sorted by class and name.

$ cat auto.1

AS   Neuman    84   Porsche   911       47.201
AS   Saint     88   BMW       M-3       46.629
BS   Barker    90   Nissan    300ZX     48.209
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Miller    84   Mazda     RX-7      47.291
DS   Swazy     87   Honda     CRX-Si    49.693
ES   Arther    85   Honda     Prelude   49.412
ES   Downs     83   VW        GTI       47.133
ES   Smith     86   VW        GTI       47.154
ES   Straw     86   Honda     Civic     49.543

$ cat auto.2

AS   Lisle     72   Porsche   911       51.030
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
CS   McGill    83   Porsche   944       50.642
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
DS   Kegler    84   Honda     Civic     47.429
DS   Lisanti   73   Porsche   914       50.609
ES   Peerson   86   VW        Golf      54.493
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357

$ sort -m auto.1 auto.2

AS   Lisle     72   Porsche   911       51.030
AS   Neuman    84   Porsche   911       47.201
AS   Saint     88   BMW       M-3       46.629
BS   Barker    90   Nissan    300ZX     48.209
CS   Carlson   88   Pontiac   Fiero     47.398
CS   Chunk     85   Toyota    MR2       49.558
CS   Cohen     91   Mazda     Miata     50.046
CS   McGill    83   Porsche   944       50.642
CS   Miller    84   Mazda     RX-7      47.291
DS   Arbiter   86   Honda     CRX-Si    48.628
DS   Karle     74   Porsche   914       48.826
DS   Kegler    84   Honda     Civic     47.429
DS   Lisanti   73   Porsche   914       50.609
DS   Swazy     87   Honda     CRX-Si    49.693
ES   Arther    85   Honda     Prelude   49.412
ES   Downs     83   VW        GTI       47.133
ES   Peerson   86   VW        Golf      54.493
ES   Sherman   83   VW        GTI       48.489
ES   Shorn     87   VW        GTI       49.357
ES   Smith     86   VW        GTI       47.154
ES   Straw     86   Honda     Civic     49.543
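Because -m trusts that its input files are already sorted, it can be worth confirming that with -c first. The sketch below does so on two small invented files; the names part1 and part2 are hypothetical, and mktemp -d is assumed to be available.

```shell
dir=$(mktemp -d)
printf '%s\n' 'AS Neuman' 'CS Miller' 'ES Smith' > "$dir/part1"
printf '%s\n' 'AS Lisle' 'DS Karle' > "$dir/part2"

merged=''
# sort -c exits with status 0 only if the file is already in order.
if LC_ALL=C sort -c "$dir/part1" && LC_ALL=C sort -c "$dir/part2"; then
    # -m interleaves the sorted files without re-sorting them.
    merged=$(LC_ALL=C sort -m "$dir/part1" "$dir/part2")
fi

printf '%s\n' "$merged"
rm -r "$dir"
```

If either check fails, the merge is skipped and merged stays empty.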

For a final example, pass1 is an excerpt from /etc/passwd. Sort it on the user ID field, which is field number 3. Specify the -t option so that the field separator used by sort is the colon, as used by /etc/passwd.

$ cat pass1

root:x:0:0:System Administrator:/usr/root:/bin/ksh
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
labuucp:x:21:100:shevett's UPC:/usr/spool/uucppublic:/usr/lib/uucp/uucico
pcuucp:x:35:100:PCLAB:/usr/spool/uucppublic:/usr/lib/uucp/uucico
techuucp:x:36:100:The 6386:/usr/spool/uucppublic:/usr/lib/uucp/uucico
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
lkh:x:250:1:lkh:/usr/lkh:/bin/ksh
shevett:x:251:1:dave shevett:/usr/shevett:/bin/ksh
mccollo:x:329:1:Carol McCollough:/usr/home/mccollo:/bin/ksh
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
grice:x:273:20:grice steven a:/u1/fall91/dp270/grice:/bin/ksh
gross:x:305:20:gross james l:/u1/fall91/dp168/gross:/bin/ksh
hagerho:x:326:20:hagerhorst paul j:/u1/fall91/dp168/hagerho:/bin/ksh
hendric:x:274:20:hendrickson robbin:/u1/fall91/dp270/hendric:/bin/ksh
hinnega:x:320:20:hinnegan dianna:/u1/fall91/dp163/hinnega:/bin/ksh
innis:x:262:20:innis rafael f:/u1/fall91/dp270/innis:/bin/ksh
intorel:x:286:20:intorelli anthony:/u1/fall91/dp168/intorel:/bin/ksh

Now run sort with the delimiter set to a colon:

$ sort -t: +2 -3 pass1

root:x:0:0:System Administrator:/usr/root:/bin/ksh
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
labuucp:x:21:100:shevett's UPC:/usr/spool/uucppublic:/usr/lib/uucp/uucico
lkh:x:250:1:lkh:/usr/lkh:/bin/ksh
shevett:x:251:1:dave shevett:/usr/shevett:/bin/ksh
innis:x:262:20:innis rafael f:/u1/fall91/dp270/innis:/bin/ksh
grice:x:273:20:grice steven a:/u1/fall91/dp270/grice:/bin/ksh
hendric:x:274:20:hendrickson robbin:/u1/fall91/dp270/hendric:/bin/ksh
intorel:x:286:20:intorelli anthony:/u1/fall91/dp168/intorel:/bin/ksh
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
gross:x:305:20:gross james l:/u1/fall91/dp168/gross:/bin/ksh
hinnega:x:320:20:hinnegan dianna:/u1/fall91/dp163/hinnega:/bin/ksh
hagerho:x:326:20:hagerhorst paul j:/u1/fall91/dp168/hagerho:/bin/ksh
mccollo:x:329:1:Carol McCollough:/usr/home/mccollo:/bin/ksh
pcuucp:x:35:100:PCLAB:/usr/spool/uucppublic:/usr/lib/uucp/uucico
techuucp:x:36:100:The 6386:/usr/spool/uucppublic:/usr/lib/uucp/uucico
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:

Note that 35 comes after 329, because sort does not recognize numeric characters as being numbers. You want the user ID field to be sorted by numerical value, so correct the command by adding the -n option:

$ sort -t: -n +2 -3 pass1

root:x:0:0:System Administrator:/usr/root:/bin/ksh
labuucp:x:21:100:shevett's UPC:/usr/spool/uucppublic:/usr/lib/uucp/uucico
pcuucp:x:35:100:PCLAB:/usr/spool/uucppublic:/usr/lib/uucp/uucico
techuucp:x:36:100:The 6386:/usr/spool/uucppublic:/usr/lib/uucp/uucico
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
lkh:x:250:1:lkh:/usr/lkh:/bin/ksh
shevett:x:251:1:dave shevett:/usr/shevett:/bin/ksh
innis:x:262:20:innis rafael f:/u1/fall91/dp270/innis:/bin/ksh
grice:x:273:20:grice steven a:/u1/fall91/dp270/grice:/bin/ksh
hendric:x:274:20:hendrickson robbin:/u1/fall91/dp270/hendric:/bin/ksh
intorel:x:286:20:intorelli anthony:/u1/fall91/dp168/intorel:/bin/ksh
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
gross:x:305:20:gross james l:/u1/fall91/dp168/gross:/bin/ksh
hinnega:x:320:20:hinnegan dianna:/u1/fall91/dp163/hinnega:/bin/ksh
hagerho:x:326:20:hagerhorst paul j:/u1/fall91/dp168/hagerho:/bin/ksh
mccollo:x:329:1:Carol McCollough:/usr/home/mccollo:/bin/ksh
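
The +2 -3 notation used above is the historical way of naming a sort key. POSIX now specifies the -k option instead, and some current versions of sort may treat +2 as a file name. The following is a minimal sketch of the same numeric sort written with -k; the file pass.sample and its four shortened entries are made up for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# pass.sample is a hypothetical, shortened stand-in for pass1.
cat > pass.sample <<'EOF'
slan:x:57:57:StarGROUP:/usr/slan:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
root:x:0:0:System Administrator:/usr/root:/bin/ksh
pcuucp:x:35:100:PCLAB:/usr/spool/uucppublic:/usr/lib/uucp/uucico
EOF

# -t: sets the field separator; -k3,3n sorts on field 3 only, numerically.
# This corresponds to "sort -t: -n +2 -3" in the older notation.
sorted=$(sort -t: -k3,3n pass.sample)
echo "$sorted"

# Keep only the login names, in sorted order, for a quick check.
logins=$(echo "$sorted" | cut -d: -f1 | tr '\n' ' ')
echo "$logins"
```

The key 3,3 starts and ends the comparison in field 3, just as +2 -3 skips two fields and stops before the fourth.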

The uniq Command

The uniq command compares adjacent lines of a file. If it finds duplicates, it passes only one copy to stdout.

CAUTION:
The uniq command detects duplicates only when they are on adjacent lines, which normally means that the file has been sorted. Make sure that you sort a file before you feed it to uniq.
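
A small sketch of why the sorting matters, using three made-up lines in which the two duplicates are not adjacent:

```shell
# Three lines; the two "comp.lang.c" lines are NOT adjacent.
unsorted=$(printf 'comp.lang.c\nalt.sources\ncomp.lang.c\n')

# uniq alone sees no adjacent duplicates, so all three lines survive.
without_sort=$(printf '%s\n' "$unsorted" | uniq | wc -l | tr -d ' ')

# Sorting first brings the duplicates together, so uniq can drop one.
with_sort=$(printf '%s\n' "$unsorted" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone:     $without_sort lines"
echo "sort then uniq: $with_sort lines"
```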

Here is uniq's syntax:

uniq [-udc [+n] [-m]] [input.file [output.file]]

The following examples demonstrate the options. The sample file contains the results of a survey taken by a USENET news administrator on a local computer. He asked users what newsgroups they read (newsgroups are a part of the structure of USENET News, an international electronic bulletin board), used cat to merge the users' responses into a single file, and used sort to sort the file. ngs is a piece of that file.

$ cat ngs

alt.dcom.telecom
alt.sources
comp.archives
comp.bugs.sys5
comp.databases
comp.databases.informix
comp.dcom.telecom
comp.lang.c
comp.lang.c
comp.lang.c
comp.lang.c
comp.lang.c++
comp.lang.c++
comp.lang.postscript
comp.laserprinters
comp.mail.maps
comp.sources
comp.sources.3b
comp.sources.3b
comp.sources.3b
comp.sources.bugs
comp.sources.d
comp.sources.misc
comp.sources.reviewed
comp.sources.unix
comp.sources.unix
comp.sources.wanted
comp.std.c
comp.std.c
comp.std.c++
comp.std.c++
comp.std.unix
comp.std.unix
comp.sys.3b
comp.sys.att
comp.sys.att
comp.unix.questions
comp.unix.shell
comp.unix.sysv386
comp.unix.wizards
u3b.sources

To produce a list that contains no duplicates, simply invoke uniq:

$ uniq ngs

alt.dcom.telecom
alt.sources
comp.archives
comp.bugs.sys5
comp.databases
comp.databases.informix
comp.dcom.telecom
comp.lang.c
comp.lang.c++
comp.lang.postscript
comp.laserprinters
comp.mail.maps
comp.sources
comp.sources.3b
comp.sources.bugs
comp.sources.d
comp.sources.misc
comp.sources.reviewed
comp.sources.unix
comp.sources.wanted
comp.std.c
comp.std.c++
comp.std.unix
comp.sys.3b
comp.sys.att
comp.unix.questions
comp.unix.shell
comp.unix.sysv386
comp.unix.wizards
u3b.sources

This is the desired list. Of course, you can get the same result by using the sort command's -u option while sorting the original file.
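
As a quick check of that equivalence, the sketch below runs both forms on a small, made-up file of unsorted responses (the file name responses is hypothetical) and compares the results:

```shell
tmp=$(mktemp -d)

# responses is a hypothetical unsorted file of survey answers.
printf '%s\n' comp.std.c alt.sources comp.lang.c comp.std.c comp.lang.c \
    > "$tmp/responses"

# Two ways of producing a duplicate-free list.
via_uniq=$(sort "$tmp/responses" | uniq)
via_sort_u=$(sort -u "$tmp/responses")

if [ "$via_uniq" = "$via_sort_u" ]; then
    same=yes
else
    same=no
fi
echo "identical output: $same"
echo "$via_sort_u"
```

The separate uniq step is still worth knowing, because sort -u cannot report counts or pick out only the repeated lines.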

The -c option displays the so-called repetition count—the number of times each line appears in the original file:

$ uniq -c ngs

   1 alt.dcom.telecom
   1 alt.sources
   1 comp.archives
   1 comp.bugs.sys5
   1 comp.databases
   1 comp.databases.informix
   1 comp.dcom.telecom
   4 comp.lang.c
   2 comp.lang.c++
   1 comp.lang.postscript
   1 comp.laserprinters
   1 comp.mail.maps
   1 comp.sources
   3 comp.sources.3b
   1 comp.sources.bugs
   1 comp.sources.d
   1 comp.sources.misc
   1 comp.sources.reviewed
   2 comp.sources.unix
   1 comp.sources.wanted
   2 comp.std.c
   2 comp.std.c++
   2 comp.std.unix
   1 comp.sys.3b
   2 comp.sys.att
   1 comp.unix.questions
   1 comp.unix.shell
   1 comp.unix.sysv386
   1 comp.unix.wizards
   1 u3b.sources
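
Because the count comes first on each line, the output of uniq -c can be passed straight to sort to rank the entries. The following sketch, which uses a shortened and made-up version of the ngs file, lists the three most popular newsgroups:

```shell
tmp=$(mktemp -d)

# ngs.sample is a hypothetical, shortened version of the sorted ngs file.
printf '%s\n' comp.lang.c comp.lang.c comp.lang.c comp.lang.c \
    comp.lang.c++ comp.lang.c++ comp.sources.3b comp.sources.3b \
    comp.sources.3b u3b.sources > "$tmp/ngs.sample"

# uniq -c prefixes each line with its count; sort -rn puts the largest
# counts first; head -n 3 keeps the top three.
top3=$(uniq -c "$tmp/ngs.sample" | sort -rn | head -n 3)
echo "$top3"

# Strip the counts to leave just the group names, most popular first.
names=$(echo "$top3" | awk '{print $2}' | tr '\n' ' ')
echo "$names"
```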

The -u option tells uniq to output only the truly unique lines; that is, the lines that have a repetition count of 1:

$ uniq -u ngs

alt.dcom.telecom
alt.sources
comp.archives
comp.bugs.sys5
comp.databases
comp.databases.informix
comp.dcom.telecom
comp.lang.postscript
comp.laserprinters
comp.mail.maps
comp.sources
comp.sources.bugs
comp.sources.d
comp.sources.misc
comp.sources.reviewed
comp.sources.wanted
comp.sys.3b
comp.unix.questions
comp.unix.shell
comp.unix.sysv386
comp.unix.wizards
u3b.sources

The -d option tells uniq to output only those lines that have a repetition count of 2 or more:

$ uniq -d ngs

comp.lang.c
comp.lang.c++
comp.sources.3b
comp.sources.unix
comp.std.c
comp.std.c++
comp.std.unix
comp.sys.att
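
The -u and -d options split the plain uniq output into two groups that do not overlap. The sketch below checks this on a small made-up sample by counting lines:

```shell
tmp=$(mktemp -d)

# A short, already sorted sample (hypothetical data).
printf '%s\n' alt.sources comp.lang.c comp.lang.c comp.std.c comp.std.c \
    u3b.sources > "$tmp/ngs.sample"

# Count the lines produced by each form of the command.
all=$(uniq "$tmp/ngs.sample" | wc -l | tr -d ' ')
singles=$(uniq -u "$tmp/ngs.sample" | wc -l | tr -d ' ')
repeated=$(uniq -d "$tmp/ngs.sample" | wc -l | tr -d ' ')

echo "$all distinct = $singles unrepeated + $repeated repeated"
```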

The uniq command also can handle lines that are divided into fields by a separator that consists of one or more spaces or tabs. The -m option tells uniq to skip the first m fields. The file mccc.ngs contains an abbreviated and modified newsgroup list in which every dot (.) is changed to a tab:

$ cat mccc.ngs

alt     dcom   telecom
alt     sources
comp    dcom   telecom
comp    sources
u3b     sources

Notice that some of the lines are identical except for the first field, so sort the file on the second field:

$ sort +1 mccc.ngs > mccc.ngs-1

$ cat mccc.ngs-1

alt     dcom    telecom
comp    dcom    telecom
alt     sources
comp    sources
u3b     sources

Now display lines that are unique except for the first field:

$ uniq -1 mccc.ngs-1

alt     dcom    telecom
alt     sources

The uniq command can also ignore the first n columns (character positions) of each line in a sorted file; the +n option tells it to do so. In the new file mccc.ngs-2, the first field of every line occupies exactly four columns:

$ cat mccc.ngs-2

alt .dcom.telecom
comp.dcom.telecom
alt .sources
comp.sources
u3b .sources

$ uniq +4 mccc.ngs-2

alt .dcom.telecom
alt .sources
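
The -m and +n forms shown above are the historical notation. POSIX spells the same requests as -f fields and -s chars. Here is a minimal sketch of both, assuming a uniq that follows POSIX; the file names by-field and by-column are made up, and single spaces stand in for the tabs of mccc.ngs-1.

```shell
tmp=$(mktemp -d)

# Fields separated by blanks, already ordered on the second field.
printf '%s\n' 'alt dcom telecom' 'comp dcom telecom' 'alt sources' \
    'comp sources' 'u3b sources' > "$tmp/by-field"

# A fixed-width first column of four characters.
printf '%s\n' 'alt .dcom.telecom' 'comp.dcom.telecom' 'alt .sources' \
    'comp.sources' 'u3b .sources' > "$tmp/by-column"

# -f 1 ignores the first field when comparing (like "uniq -1").
skip_field=$(uniq -f 1 "$tmp/by-field")
echo "$skip_field"

# -s 4 ignores the first four characters when comparing (like "uniq +4").
skip_chars=$(uniq -s 4 "$tmp/by-column")
echo "$skip_chars"
```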

Compressing Files—compress, uncompress, and zcat

While investigating storage techniques, some computer science researchers discovered that certain types of files are stored quite inefficiently in their natural form. Most common among these "offenders" is the text file, which is stored one ASCII character per byte of memory. An ASCII character requires only seven bits, but almost all memory devices handle a minimum of eight bits at a time—that is, a byte. (A bit is a binary digit—the 1 or 0 found on electronic on/off switches.) Consequently, the researchers found that 12.5 percent of the memory device is wasted. These researchers further studied the field of language patterns, and found that they could code characters into even smaller bit patterns according to how frequently they are used.

The result of this research is a programming technique that compresses text files to about 50 percent of their original lengths. Although not as efficient with files that include characters that use all eight bits, this technique can indeed reduce file sizes substantially. Because the files are smaller, storage and file transfer can be much more efficient.

There are three UNIX commands associated with compression: compress, uncompress, and zcat. Here is the syntax for each command:

compress [ -cfv ] [ -b bits ] file(s)
uncompress [ -cv ] [ file(s) ]
zcat [ file(s)]

The options for these commands are listed in Table 6.7.

Table 6.7. Options for the Compression commands

-c: Writes to stdout instead of changing the file.
-f: Forces compression even if the compressed file is no smaller than the original.
-v: Displays the percentage of reduction for each compressed file.
-b bits: Tells compress how efficient to be. By default, bits is 16, but you can reduce it to as little as 9 for compatibility with computers that are not sufficiently powerful to handle full, 16-bit compression.

Normally, compress shrinks the file and replaces it with a file whose name is the original name with the extension .Z appended. However, things can go wrong; for example, the original file name might already have 13 or 14 characters (which leaves no room for .Z on file systems that limit names to 14 characters), or the compressed file might be no smaller than the original when you have not specified the -f option. You can use uncompress to expand the file and replace the .Z file with the expanded file under an appropriate name (usually the name of the compressed file without the .Z extension). The zcat command uncompresses a compressed file to stdout and leaves the .Z file itself unchanged.

Incidentally, note that all three of these utilities can take their input from stdin through a pipe. For example, suppose that you retrieve a compressed tar archive (see Chapter 32, "Backing Up") from some site that archives free programs. If the compressed file were called archive.tar.Z, you could then uncompress it and separate it into its individual files with the following command:

$ zcat archive.tar.Z | tar -xf -
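
The compress utilities are not installed on every current system, so the following sketch swaps in gzip, a different compression program, purely to show the same pipeline pattern: compress to stdout, then uncompress to stdout and hand the result to tar. The directory and file names are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Build a small archive to experiment with (names are hypothetical).
mkdir project
echo "hello" > project/readme
tar -cf archive.tar project

# Compress to stdout (-c) and keep the original, as "compress -c" would.
gzip -c archive.tar > archive.tar.gz

# Remove the originals, then uncompress to stdout and unpack in one pipeline.
rm -r project archive.tar
gzip -dc archive.tar.gz | tar -xf -

restored=$(cat project/readme)
echo "$restored"
```

No uncompressed copy of the archive is ever written to disk; the pipe carries it directly from one command to the other.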

Printing with pr

The pr command is the "granddaddy" of all of the programs that format files. It can separate a file into pages of a specified number of lines, number the pages, put a header on each page, and so on. This section looks at some of the command's more useful options (see Table 6.8).

Incidentally, pr has nothing to do with actual printing on a printer. The name was used originally because the terminals of that time were printers—there were no screens as we know them today. You'll learn about true printing in the next section, "Printing Hard Copy Output." The syntax for the pr command is as follows:

pr -m [-N [-wM] [-a]] [-ecK] [-icK] [-drtfp] [+p] [ -ncK] [-oO] [-lL] 
[-sS] [-h header] [-F] [file(s)]

Table 6.8. Options for the pr Command

+p: Begin the display with page number p. If this is omitted, display begins with page 1.
-N: Display in N columns.
-d: Double-space the display.
-ecK: Expand tabs to character positions K+1, 2K+1, 3K+1, etc. With the default K of 8, tabs expand to positions 9, 17, 25, etc. If a character is entered for "c", use it as the tab character.
-ncK: Number each line with a K-digit number (default value is 5; e.g., 1, 2, 3, etc.). If a character is entered for "c", use it instead of a tab immediately following the K-digit number.
-wM: Set the width of each column to M characters when displaying two or more columns (default is 72).
-oO: Offset each line by O character positions to the right.
-lL: Set the length of a page to L lines (default is 66).
-h header: Use header as the text of the header of each page of the display in place of the name of the file. Note: there must be a space between the h and the first character of the actual header string.
-p: Pause at the end of each page and ring the terminal bell. Proceed on receipt of a carriage return.
-f: Use a form-feed character instead of a sequence of line feeds to begin a new page. Pause before displaying the first page on a terminal.
-r: Do not print error messages about files that cannot be opened.
-t: Omit the 5-line header and the 5-line trailer that each page normally has. Do not space to the beginning of a new page after displaying the last page. Takes precedence over -h header.
-sS: Separate columns by the character entered for S instead of a tab.
-F: Fold lines to fit the width of the column in multi-column display mode, or to fit an 80-character line.
-m: Merge and display up to eight files, one per column. May not be used with -N or -a.

Here is the sample file that you'll use to examine pr:

$ cat names

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray
chang liang
crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph

The pr command normally prints a file with a five-line header and a five-line footer. The header, by default, consists of these five lines: two blank lines; a line that shows the date, time, filename, and page number; and two more blank lines. The footer consists of five blank lines. The blank lines provide proper top and bottom margins so that you can pipe the output of the pr command to a command that sends a file to the printer. The pr command normally uses 66-line pages, but to save space the demonstrations use a page length of 17: five lines of header, five lines of footer, and seven lines of text.

Use the -l option with a 17 argument to do this:

$ pr -l17 names

Sep 19 15:05 1991  names Page 1

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray

(Seven blank lines follow.)

Sep 19 15:05 1991  names Page 2

chang liang
crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph

Notice that pr puts the name for the file in the header, just before the page number. You can specify your own header with -h:

$ pr -l17 -h "This is the NAMES file" names

Sep 19 15:05 1991  This is the NAMES file Page 1

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray

(Seven blank lines follow.)

Sep 19 15:05 1991  This is the NAMES file Page 2

chang liang
crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph

The header that you specify replaces the file name.

NOTE:
There must be a space between -h and the start of the header string. Also, if the header string contains spaces, you must quote the entire string.
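
Because the header text appears once on every page, counting the lines that contain it is a simple way to find out how many pages pr produced. The sketch below assumes a 13-line stand-in for the names file with made-up contents:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Thirteen made-up lines stand in for the names file.
i=1
while [ "$i" -le 13 ]; do
    echo "name $i"
    i=$((i + 1))
done > names

# 17-line pages leave 7 lines of text per page, so 13 lines need 2 pages.
title="This is the NAMES file"
pages=$(pr -l17 -h "$title" names | grep -c "$title")
echo "pages printed: $pages"
```

The header is quoted through the variable, so the spaces in it reach pr as a single argument.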

Multicolumn output is a pr option. Note how you specify two-column output (-2):

$ pr -l17 -2 names

Sep 19 15:05 1991  names Page 1

allen christopher            chang liang
babinchak david              crawford patricia
best betty                   crowley charles
bloom dennis                 cuddy michael
boelhower joseph             czyzewski sharon
bose anita                   delucia joseph
cacossa ray

You can number the lines of text; the numbering always begins with 1:

$ pr -l17 -n names

Sep 19 15:05 1991  names Page 1

    1     allen christopher
    2     babinchak david
    3     best betty
    4     bloom dennis
    5     boelhower joseph
    6     bose anita
    7     cacossa ray

(Seven blank lines follow.)

Sep 19 15:05 1991  names Page 2

    8     chang liang
    9     crawford patricia
   10     crowley charles
   11     cuddy michael
   12     czyzewski sharon
   13     delucia joseph

Combining numbering and multicolumns results in the following:

$ pr -l17 -n -2 names

Sep 19 15:05 1991  names Page 1

    1     allen christopher           8     chang liang
    2     babinchak david             9     crawford patricia
    3     best betty                 10     crowley charles
    4     bloom dennis               11     cuddy michael
    5     boelhower joseph           12     czyzewski sharon
    6     bose anita                 13     delucia joseph
    7     cacossa ray

The pr command is also good for combining two or more files. Here are three files created from fields in /etc/passwd:

$ cat p-login

allen
babinch
best
bloom
boelhow
bose
cacossa
chang
crawfor
crowley
cuddy
czyzews
delucia
diesso
dimemmo
dintron

$ cat p-home

/u1/fall91/dp168/allen
/u1/fall91/dp270/babinch
/u1/fall91/dp163/best
/u1/fall91/dp168/bloom
/u1/fall91/dp163/boelhow
/u1/fall91/dp168/bose
/u1/fall91/dp270/cacossa
/u1/fall91/dp168/chang
/u1/fall91/dp163/crawfor
/u1/fall91/dp163/crowley
/u1/fall91/dp270/cuddy
/u1/fall91/dp168/czyzews
/u1/fall91/dp168/delucia
/u1/fall91/dp270/diesso
/u1/fall91/dp168/dimemmo
/u1/fall91/dp168/dintron

$ cat p-uid

278
271
312
279
314
298
259
280
317
318
260
299
300
261
301
281

The -m option tells pr to merge the files:

$ pr -m -l20 p-home p-uid p-login

Oct 12 14:15 1991   Page 1

/u1/fall91/dp168/allen   278            allen
/u1/fall91/dp270/babinc  271            babinch
/u1/fall91/dp163/best    312            best
/u1/fall91/dp168/bloom   279            bloom
/u1/fall91/dp163/boelho  314            boelhow
/u1/fall91/dp168/bose    298            bose
/u1/fall91/dp270/cacoss  259            cacossa
/u1/fall91/dp168/chang   280            chang
/u1/fall91/dp163/crawfo  317            crawfor
/u1/fall91/dp163/crowle  318            crowley

(Seven blank lines follow.)

Oct 12 14:15 1991   Page 2

/u1/fall91/dp270/cuddy   260            cuddy
/u1/fall91/dp168/czyzew  299            czyzews
/u1/fall91/dp168/deluci  300            delucia
/u1/fall91/dp270/diesso  261            diesso
/u1/fall91/dp168/dimemm  301            dimemmo
/u1/fall91/dp168/dintro  281            dintron

You can tell pr what to put between fields by using -s and a character. If you omit the character, pr uses a tab character.

$ pr -m -l20 -s p-home p-uid p-login

Oct 12 14:16 1991   Page 1

/u1/fall91/dp168/allen   278  allen
/u1/fall91/dp270/babinch        271  babinch
/u1/fall91/dp163/best    312  best
/u1/fall91/dp168/bloom   279  bloom
/u1/fall91/dp163/boelhow        314  boelhow
/u1/fall91/dp168/bose    298  bose
/u1/fall91/dp270/cacossa        259  cacossa
/u1/fall91/dp168/chang   280  chang
/u1/fall91/dp163/crawfor        317  crawfor
/u1/fall91/dp163/crowley        318  crowley

(Seven blank lines follow.)

Oct 12 14:16 1991   Page 2

/u1/fall91/dp270/cuddy   260  cuddy
/u1/fall91/dp168/czyzews        299  czyzews
/u1/fall91/dp168/delucia        300  delucia
/u1/fall91/dp270/diesso  261  diesso
/u1/fall91/dp168/dimemmo        301  dimemmo
/u1/fall91/dp168/dintron        281  dintron

The -t option makes pr act somewhat like cat: it tells pr not to print (or leave room for) the header and footer. The order in which you name the files on the command line determines the order of the merged columns:

$ pr -m -t -s p-uid p-login p-home

278  allen     /u1/fall91/dp168/allen
271  babinch   /u1/fall91/dp270/babinch
312  best      /u1/fall91/dp163/best
279  bloom     /u1/fall91/dp168/bloom
314  boelhow   /u1/fall91/dp163/boelhow
298  bose      /u1/fall91/dp168/bose
259  cacossa   /u1/fall91/dp270/cacossa
280  chang     /u1/fall91/dp168/chang
317  crawfor   /u1/fall91/dp163/crawfor
318  crowley   /u1/fall91/dp163/crowley
260  cuddy     /u1/fall91/dp270/cuddy
299  czyzews   /u1/fall91/dp168/czyzews
300  delucia   /u1/fall91/dp168/delucia
261  diesso    /u1/fall91/dp270/diesso
301  dimemmo   /u1/fall91/dp168/dimemmo
281  dintron   /u1/fall91/dp168/dintron
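
When a character follows -s, pr puts that character between the columns instead of a tab. The sketch below uses shortened copies of the three files and a colon as the separator, which produces lines that resemble /etc/passwd entries:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Shortened copies of the three sample files.
printf '%s\n' 278 271 312 > p-uid
printf '%s\n' allen babinch best > p-login
printf '%s\n' /u1/fall91/dp168/allen /u1/fall91/dp270/babinch \
    /u1/fall91/dp163/best > p-home

# -m merges the files side by side, -t drops header and footer,
# and -s: joins the columns with a colon instead of a tab.
merged=$(pr -m -t -s: p-uid p-login p-home)
echo "$merged"

first=$(echo "$merged" | head -n 1)
count=$(echo "$merged" | wc -l | tr -d ' ')
```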

Printing Hard Copy Output

Displaying the results of your work on your terminal is fine, but when you need to present a report for management to read, nothing beats printed output. Three general types of printers are available:

Dot-matrix printers are usually very fast, but do not offer the print quality required for formal reports.
Inkjet printers are not quite as fast, but do offer better letter quality.
Laser printers provide the best print quality, and some are also quite fast. The two main types of laser printers, HP and PostScript, use different languages to convert your text file to something that the printer's engine can convert to hard copy.

Your system administrator can tell you which printers are available on your computer, or you can use the lpstat command to find out yourself. (This command is described later in this section.)

Requesting to Print

UNIX computers are multiuser computers, and there may be more users on a system than there are printers. For that reason, every print command that you issue is placed in a queue, to be acted on after all the ones previously issued are completed. To cancel requests, you use the cancel command.

The lp Command

Normally, the System V lp command has the following syntax:

lp [options] [files]

This command causes the named files and the designated options (if any) to become a print request. If no files are named in the command line, lp takes its input from the standard input so that it can be the last command in a pipeline. Table 6.9 contains the most frequently used options for lp.

Table 6.9. Options for lp Command

-m

Send mail after the files have been printed (see Chapter 9, "Communicating with Others").

-d dest

Choose dest as the printer or class of printers that is to do the printing. If dest is a printer, then lp prints the request only on that specific printer. If dest is a class of printers, then lp prints the request on the first available printer that is a member of the class. If dest is any, then lp prints the request on any printer that can handle it. For more information see the discussion below on lpstat.

-n N

Print N copies of the output. The default is one copy.

-o option

Specify a printer-dependent option. You can use the -o option as many times consecutively as you want, as in -o option1 -o option2 . . . -o optionN, or by specifying a list of options with one -o followed by the list enclosed in double quotation marks, as in -o "option1 option2 . . . optionN". The options are as follows:

nobanner: Do not print a banner page with this request. Normally, a banner page containing the user-ID, file name, date, and time is printed for each print request, to make it easy for several users to identify their own printed copy.
lpi=N: Print this request with the line pitch set to N.
cpi=pica|elite|compressed: Print this request with the character pitch set to pica (representing 10 characters per inch), elite (representing 12 characters per inch), or compressed (representing as many characters per inch as a printer can handle).
stty=stty-option-list: A list of options valid for the stty command. Enclose the list with single quotation marks if it contains blanks.

-t title

Print title on the banner page of the output. The default is no title. Enclose title in quotation marks if it contains blanks.

-w

Write a message on the user's terminal after the files are printed. If the user is not logged in, or if the printer resides on a remote system, send a mail message instead.

To print the file sample on the default printer, type:

$ lp sample
request id is lj-19 (1 file)

Note the response from the printing system. If you don't happen to remember the request id later, don't worry; lpstat will tell it to you, as long as it has not finished printing the file. Once the system has finished printing, your request has been fulfilled and no longer exists.

Suppose your organization has a fancy, all-the-latest-bells-and-whistles-and-costing-more-than-an-arm-and-a-leg printer, code-named the_best, in the Chairman's secretary's office in the next building. People are permitted to use it for the final copies of important documents, so it is kept fairly busy. You don't want to walk over to that building and climb six flights of stairs to retrieve your print job until you know that it has been printed. So you type:

$ lp -m -d the_best final.report.94
request id is the_best-19882 (1 file)

You have asked that the printer called the_best be used and that mail be sent to you when the printing has completed. (This assumes that this printer and your computer are connected on some kind of network that will transfer the actual file from your computer to the printer.)
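
Several of the options in Table 6.9 can be combined in one request. The following sketch is only a dry run: it assembles an lp command line and displays it without submitting anything to a print queue, because the printer name and file come from the example above and are unlikely to exist on your system. The number of copies is a made-up value.

```shell
# Dry run: assemble the lp command line and display it, without
# submitting anything to a print queue.
printer=the_best          # destination from the example above
copies=2                  # hypothetical number of copies
file=final.report.94      # file from the example above

# -m: mail on completion; -d: destination; -n: copies;
# -o nobanner: skip the banner page.
set -- lp -m -d "$printer" -n "$copies" -o nobanner "$file"
lp_command="$*"
echo "$lp_command"
```

Once the displayed line looks right for your own printer and file, type it at the prompt to make the actual request.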

The cancel Command

You may want to cancel a print request for any number of reasons, but only one command enables you to do it—the cancel command. Usually, you invoke it as follows:

cancel [request-ID(s)]

where request-ID(s) is the print job number that lp displays when you make a print request. Again, if you forget the request-ID, lpstat (see the section on lpstat) will show it to you.

Getting Printer and Print Request Status

The lpstat command gives the user information about the print services, including the status of all current print requests, the name of the default printer, and the status of each printer. The syntax of lpstat is very simple:

lpstat [options] [request-ID(s)]

When you use the lp command, it puts your request in a queue and issues a request ID for that particular command. If you supply that ID to lpstat, it reports on the status of that request. If you omit all IDs and use the lpstat command with no arguments, it displays the status of all your print requests.

Some options take a parameter list as arguments, indicated by [list] below. You can supply that list as either a list separated by commas, or a list enclosed in double quotation marks and separated by spaces, as in the following examples:

-p printer1,printer2

-u "user1 user2 user3"

If you specify all as the argument to an option that takes a list or if you omit the argument entirely, lpstat provides information about all requests, devices, statuses, and so on, appropriate to that option letter. For example, the following commands both display the status of all output requests:

$ lpstat -o all

$ lpstat -o

Here are some of the more common arguments and options for lpstat:

-d: Report what the system default destination is (if any).
-o [list]: Report the status of print requests. list is a list of printer names, class names, and request IDs. You can omit the -o.
-s: Display a status summary, including the status of the print scheduler, the system default destination, a list of class names and their members, a list of printers and their associated devices, and other, less pertinent information.
-p [list] [-D] [-l]: Report the status of the printers in list. If the -D option is given, also print a brief description of each printer. If the -l option is given, print a full description of each printer's configuration.
-t: Display all status information: all the information obtained with the -s option, plus the acceptance and idle/busy status of all printers and the status of all requests.
-a [list]: Report whether print destinations are accepting requests. list is a list of intermixed printer names and class names.

Comparing Directories with dircmp

The dircmp command examines the contents of two directories—including all subdirectories—and displays information about the contents of each. It lists all the files that are unique to each directory and all the files that are common. The command specifies whether each common file is different or the same by comparing the contents of each of those files. The syntax for dircmp is

dircmp [-d] [-s] [-wN] dir1 dir2

The options are as follows:

-d: Perform a diff operation on pairs of files with the same names (see the section "The diff Command" later in this chapter).
-s: Suppress messages about identical files.
-wN: Change the width of the output line to N columns. The default width is 72.

As an example, suppose that the two directories have the following contents:

./phlumph:
total 24
-rw-r--r--   1 pjh      sys         8432 Mar  6 13:02 TTYMON
-rw-r--r--   1 pjh      sys           51 Mar  6 12:57 x
-rw-r--r--   1 pjh      sys          340 Mar  6 12:55 y
-rw-r--r--   1 pjh      sys          222 Mar  6 12:57 z

./xyzzy:
total 8
-rw-r--r--   1 pjh      sys          385 Mar  6 13:00 CLEANUP
-rw-r--r--   1 pjh      sys           52 Mar  6 12:55 x
-rw-r--r--   1 pjh      sys          340 Mar  6 12:55 y
-rw-r--r--   1 pjh      sys          241 Mar  6 12:55 z

Each directory contains one unique file (TTYMON in phlumph, CLEANUP in xyzzy) and three files (x, y, and z) whose names appear in both. Of those three pairs, two (x and z) differ in size and presumably in content. Now use dircmp to determine whether the files in the two directories are the same or different, as follows:

$ dircmp xyzzy phlumph

Mar  6 13:02 1994  xyzzy only and phlumph only Page 1

./CLEANUP                               ./TTYMON

(Many blank lines removed to save space.)
Mar  6 13:02 1994  Comparison of xyzzy phlumph Page 1

directory       .
different       ./x
same            ./y
different       ./z

(Many blank lines removed to save space.)

$

Note that dircmp first reports on the files unique to each directory and then comments on the common files. With the -d option, dircmp also runs diff on each pair of common files that differ:

$ dircmp -d xyzzy phlumph

Mar  6 13:02 1994  xyzzy only and phlumph only Page 1

./CLEANUP                               ./TTYMON

(Many blank lines removed to save space.)

Mar  6 13:02 1994  Comparison of xyzzy phlumph Page 1

directory       .
different       ./x
same            ./y
different       ./z

(Many blank lines removed to save space.)

Mar  6 13:02 1994  diff of ./x in xyzzy and phlumph Page 1

3c3
< echo "root has logged out..."
---
> echo "pjh has logged out..."

(Many blank lines removed to save space.)

Mar  6 13:02 1994  diff of ./z in xyzzy and phlumph Page 1

6d5
<       j) site=jonlab ;;

(Many blank lines removed to save space.)
$

At this point, you may want to look ahead to the section "The diff Command" later in this chapter.
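
The dircmp command is not available on every system. Where it is missing, diff itself can make a similar comparison. The sketch below swaps in diff -rq, a different tool from the one described above: -r descends into subdirectories, and -q (supported by the GNU and BSD versions of diff, though not required by POSIX) reports only whether files differ. The file contents are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir xyzzy phlumph

# Recreate the shape of the example: one unique file per directory,
# y identical in both, x different. File contents are hypothetical.
echo "cleanup script" > xyzzy/CLEANUP
echo "ttymon notes"   > phlumph/TTYMON
echo "same text"      > xyzzy/y
echo "same text"      > phlumph/y
echo 'echo "root has logged out..."' > xyzzy/x
echo 'echo "pjh has logged out..."'  > phlumph/x

# diff exits with status 1 when it finds differences, which is not an
# error here, so the status is deliberately ignored.
report=$(diff -rq xyzzy phlumph) || true
echo "$report"

lines=$(echo "$report" | wc -l | tr -d ' ')
```

Identical files such as y are simply not mentioned, which corresponds to the -s option of dircmp.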

Encrypting a File with the crypt Command

If you have sensitive information stored in text files that you want to give to other users, you may want to encrypt the files so that casual users cannot read them. Owners of the UNIX Encryption Utilities, which are available only to purchasers in the United States, can encrypt a text file, in any way they see fit, before they transmit it to another user or site. The person who receives the encrypted file needs a copy of the crypt command and the key (password) used by the person who encrypted the file. The usual syntax for the crypt command is

crypt [ key ] < clearfile > encryptedfile

where key is any phrase. For example,

$ crypt "secret agent 007" < mydat > xyzzy

encrypts the contents of mydat and writes the result to xyzzy.

TIP:
This approach requires that you type key, the encryption key, at your keyboard, in which case someone nearby might notice it. You can define your encryption key in an environment variable (see Chapters 11, 12, and 13) called CRYPTKEY.

Then use the following syntax:

$ crypt -k < clearfile > encryptedfile

The encryption key need not be complex, but it should not be too short. The crypt command deliberately makes the conversion of a key into its internal settings expensive, yet a key restricted to three lowercase letters can still be found by trying every possibility in about five minutes of machine time, and possibly much more on a busy multiuser machine.

CAUTION:
Do not concatenate encrypted files, even if they were encrypted with the same key. If you try to do so, you will successfully decrypt only the first file.

Also, do not pipe the output of crypt through any program that changes the settings of your terminal. Otherwise, when crypt finishes, your terminal may be left in a strange state.

Printing the Beginning or End of a File with head and tail

By default, the head command prints the first 10 lines of a file to stdout (normally the screen):

$ head names

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray
chang liang
crawford patricia
crowley charles

You can specify the number of lines that head displays, as follows:

$ head -4 names

allen christopher
babinchak david
best betty
bloom dennis
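
POSIX writes the line count as an argument to the -n option, so head -4 becomes head -n 4; most versions of head accept both forms. A minimal sketch with a six-line excerpt of the names file:

```shell
tmp=$(mktemp -d)

# The first six lines of the names file.
printf '%s\n' 'allen christopher' 'babinchak david' 'best betty' \
    'bloom dennis' 'boelhower joseph' 'bose anita' > "$tmp/names"

# -n 4 asks for the first four lines.
first4=$(head -n 4 "$tmp/names")
echo "$first4"

count=$(echo "$first4" | wc -l | tr -d ' ')
last_shown=$(echo "$first4" | tail -n 1)
```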

To view the last few lines of a file, use the tail command. This command is helpful when you have a large file and want to look at the end only. For example, suppose that you want to see the last few entries in the log file that records the transactions that occur when files are transferred between your machine and a neighboring machine. That log file may be large, and you surely don't want to have to read all the beginning and middle of it just to get to the end.

By default, tail prints the last 10 lines of a file to stdout (normally the screen). Suppose that your names file consists of the following:

$ cat names

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray
chang liang
crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph

The tail command limits your view to the last 10 lines:

$ tail names

bloom dennis
boelhower joseph
bose anita
cacossa ray
chang liang
crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph

You can change this display by specifying the number of lines to print. For example, the following command prints the last five lines of names:

$ tail -5 names

crawford patricia
crowley charles
cuddy michael
czyzewski sharon
delucia joseph
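The two commands also combine naturally in a pipeline. The following is a minimal sketch, assuming a head and tail that accept the -n option (the -4 and -5 forms shown above are an older spelling that not every current system accepts). The file names.txt and its contents are made up for the illustration.

```shell
# Build a small sample file (hypothetical data, one name per line).
printf '%s\n' allen babinchak best bloom boelhower bose cacossa chang > names.txt

# First four lines; the same request as the older "head -4 names.txt".
first=$(head -n 4 names.txt)

# Last three lines; the same request as the older "tail -3 names.txt".
last=$(tail -n 3 names.txt)

# Lines 3 through 5: keep the first five lines, then the last three of those.
middle=$(head -n 5 names.txt | tail -n 3)

printf '%s\n' "$middle"
```

Piping head into tail in this way is a common idiom for pulling a range of lines out of the middle of a file without opening an editor.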

The tail command also can follow a file; that is, it can continue looking at a file as a program continues to add text to the end of that file. The syntax is

tail -f logfile

where logfile is the name of the file being written to. If you're logged into a busy system, try one of the following forms:

$ tail -f /var/uucp/.Log/uucico/neighbor

$ tail -f /var/uucp/.Log/uuxqt/neighbor

where neighbor is the name of a file that contains log information about a computer that can exchange information with yours. The first is the log file that logs file-transfer activity between your computer and neighbor, and the second is the log of commands that your computer has executed as requested by neighbor. The tail command has several other useful options:

+n: Begin printing at line n of the file.
b: Count by blocks rather than lines (blocks are either 512 or 1,024 characters long).
c: Count by characters rather than lines.
r: Print from the designated starting point in the reverse direction. For example, tail -5r file prints the last five lines of the file in reverse order. You cannot use option r with option f.
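Here is a short sketch of the +n and c options. It assumes a tail that takes these counts through the -n and -c options, which is how current POSIX-style systems spell them; the file lines.txt is a made-up sample.

```shell
# A five-line sample file (hypothetical data).
printf '%s\n' one two three four five > lines.txt

# Begin printing at line 3 (the +n form, written with -n here).
from_three=$(tail -n +3 lines.txt)

# Count by characters rather than lines: the last 5 characters
# are "five" plus the newline that ends the file.
last_chars=$(tail -c 5 lines.txt)

printf '%s\n' "$from_three"
printf '%s\n' "$last_chars"
```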

Pipe Fitting with tee

In UNIX pipelines, you use the tee command just as a plumber uses a tee-fitting in a water line: to send output in two directions simultaneously. Fortunately, electrons behave differently from water molecules, because tee can send all its input to both destinations. Probably the most common use of tee is to siphon off the output of a command and save it in a file while simultaneously passing it down the pipeline to another command. The syntax for the tee command is

$ tee [-i] [-a] [file(s)]

The tee command can send its output to multiple files simultaneously. With the -a option specified, tee appends the output to those files instead of overwriting them. The -i option makes tee ignore interrupts, which prevents the pipeline from being broken. To show the use of tee, type the command that follows:

$ lp /etc/passwd | tee status

This command causes the file /etc/passwd to be sent to the default printer, prints a message about the print request on the screen, and simultaneously captures that message in a file called status. The tee command sends the output of the lp command to two places: the screen and the named file.
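If you have no printer handy, the same idea can be tried with ordinary commands. This is only a sketch: the file names raw.txt and sorted.txt and the sample words are invented for the illustration.

```shell
# Save an unsorted copy in raw.txt while the same lines continue on to sort.
printf '%s\n' pear apple fig | tee raw.txt | sort > sorted.txt

# With -a, tee appends to raw.txt instead of overwriting it.
printf '%s\n' kiwi | tee -a raw.txt > /dev/null

cat raw.txt
cat sorted.txt
```

The first pipeline shows tee in the middle of a pipeline, where it is most useful: raw.txt records what went into sort, and sorted.txt records what came out.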

Updating a File's Time and Date with touch

The touch command updates the access and modification time and date stamps of the files mentioned as its arguments. (See Chapters 4 and 35 for more information on the time and date of a file.) If the file mentioned does not exist, it is immediately created as a 0-byte file with no contents. You can use touch to protect files that might otherwise be removed by cleanup programs that delete files that have not been accessed or modified within a specified number of days. Using touch, you can change the time and date stamp in any way you choose, if you include that information in the command line. Here's the syntax:

$ touch [ -amc ] [ mmddhhmm[yy] ] file(s)

This command returns to the terminal an integer number that represents the number of files whose time and/or date could not be changed. With no options, touch updates both the access and modification time and date stamps. The options are as follows:

-a: Update the access time and date only.
-m: Update the modification time and date only.
-c: Do not create a file that does not exist.

The pattern for the time-date stamp, mmddhhmm[yy], consists of the month (01-12), day (01-31 as appropriate), hour (00-23), minute (00-59) and, optionally, year (00-99). Therefore, the command

$ touch 0704202090 fireworks

changes both access and modification time and dates of the file fireworks to July 4, 1990, 8:20 P.M.
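On many current systems, touch follows the POSIX form, in which the time stamp is given with the -t option and the year comes first: [[CC]YY]MMDDhhmm. The sketch below assumes such a system; the file names today and no_such_file are invented for the illustration.

```shell
# Set both time stamps to July 4, 1990, 8:20 P.M. (local time).
# The -t format is [[CC]YY]MMDDhhmm, so the year comes first.
touch -t 199007042020 fireworks

# A file touched without a time stamp gets the current time.
touch today

# -c: do not create a file that does not exist.
touch -c no_such_file

ls -l fireworks today
```

If you are unsure which form your system expects, try the command on a scratch file first and check the result with ls -l.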

Splitting Files with split and csplit

There are occasions when you have a text file that's too big for some application. For example, suppose you have a 2MB file that you want to copy to a 1.4MB floppy disk. You will have to use split (or csplit) to divide it into two (or more) smaller files. The syntax for split is

$ split [ -n ] [ in-file [ out-file ] ]

This command reads the text file in-file and splits it into several files, each consisting of n lines (except possibly the last file). If you omit -n, split creates 1,000-line files. The names of the small files depend on whether or not you specify out-file. If you do, these files are named out-fileaa, out-fileab, out-fileac, and so on. If you have more than 26 output files, the 27th is named as out-fileba, the 28th as out-filebb, and so forth. If you omit out-file, split uses x in its place, so that the files are named xaa, xab, xac, and so on.

To recreate the original file from a group of files named xaa and xab, etc., type

$ cat xa* > new-name
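The round trip can be checked on a small scale. This sketch assumes a split that takes the line count through the -l option, the spelling used on current systems in place of -n; big.txt, the part. prefix, and rebuilt.txt are names invented for the illustration.

```shell
# Sample input: seven numbered lines (hypothetical data).
printf 'line %s\n' 1 2 3 4 5 6 7 > big.txt

# Three lines per piece, with "part." as the output prefix.
split -l 3 big.txt part.

# Three pieces result: part.aa, part.ab, and part.ac.
ls part.*

# The shell expands part.* in alphabetical order, so cat restores the original order.
cat part.* > rebuilt.txt
cmp -s big.txt rebuilt.txt && echo "rebuilt.txt matches big.txt"
```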

It may be more sensible to divide a file according to the context of its contents, rather than on a chosen number of lines. UNIX offers a context splitter, called csplit. This command's syntax is

$ csplit [ -s ] [ -k ] [ -f out-file ] in-file arg(s)

where in-file is the name of the file to be split, and out-file is the base name of the output files.

The arg(s) determine where each file is split. If you have N args, you get N+1 output files, named out-file00, out-file01, and so on, through out-fileN (with a 0 in front of N if N is less than 10). N cannot be greater than 99. If you do not specify an out-file argument, csplit names the files xx00, xx01, and so forth. See below for an example where a file is divided by context into five files. The -s option suppresses csplit's reporting of the number of characters in each output file. The -k option prevents csplit from deleting all output files if an error occurs.

Suppose that you have a password file such as the following. It is divided into sections: an unlabeled one at the beginning, followed by UUCP Logins, Special Users, DP Fall 1991, and NCR.

$ cat passwd

root:x:0:0:System Administrator:/usr/root:/bin/ksh
reboot:x:7:1:- -:/:/etc/shutdown -y -g0 -i6
listen:x:37:4:x:/usr/net/nls:
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
lp:x:71:2:x:/usr/spool/lp:
_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  UUCP Logins                    :6:
_:-             :6:6:    ==============================     :6:
uucp:x:5:5:0000-uucp(0000):x:
nuucp:x:10:10:0000-uucp(0000):/usr/spool/uucppublic:/usr/lib/uucp/uucico
zzuucp:x:37:100:Bob Sorenson:/usr/spool/uucppublic:/usr/lib/uucp/uucico
asyuucp:x:38:100:Robert L. Wald:/usr/spool/uucppublic:/usr/lib/uucp/uucico
knuucp:x:39:100:Kris Knigge:/usr/spool/uucppublic:/usr/lib/uucp/uucico
_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  Special Users                  :6:
_:-             :6:6:    ==============================     :6:
msnet:x:100:99:Server Program:/usr/net/servers/msnet:/bin/false
install:x:101:1:x:/usr/install:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
hohen:x:346:1:Michael Hohenshilt:/usr/home/hohen:/bin/ksh
reilly:x:347:1:Joan Reilly:/usr/home/reilly:/bin/ksh
_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  DP Fall 1991                   :6:
_:-             :6:6:    ==============================     :6:
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
lewis:x:288:20:lewis prince e:/u1/fall91/dp168/lewis:/bin/ksh
metelit:x:265:20:metelitsa natalya:/u1/fall91/dp270/metelit:/bin/ksh
nadaraj:x:307:20:nadarajah kalyani:/u1/fall91/dp168/nadaraj:/bin/ksh
nado:x:266:20:nado conan j:/u1/fall91/dp270/nado:/bin/ksh
_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ===  NCR   ===================     :6:
_:-             :6:6:    ==============================     :6:
antello:x:334:20:antello ronald f:/u1/fall91/ff437/antello:/bin/ksh
cilino:x:335:20:cilino michael a:/u1/fall91/ff437/cilino:/bin/ksh
emmons:x:336:20:emmons william r:/u1/fall91/ff437/emmons:/bin/ksh
foreste:x:337:20:forester james r:/u1/fall91/ff437/foreste:/bin/ksh
hayden:x:338:20:hayden richard:/u1/fall91/ff437/hayden:/bin/ksh

You might want to split this file so that each section has its own file. To split this file into multiple files, you must specify the appropriate arguments to csplit. Each takes the form of a text string surrounded by slash (/) marks. The csplit command then copies from the current line up to, but not including, the argument. The following is the first attempt at splitting the file with csplit:

$ csplit -f PA passwd /UUCP/ /Special/ /Fall/ /NCR/

270
505
426
490
446

Note that there are four args: /UUCP/, /Special/, /Fall/, and /NCR/. Five files are created: PA00 contains everything from the beginning of passwd to (but not including) the first line that contains UUCP. PA01 contains everything from the first line containing UUCP up to (but not including) the line that contains Special, and so on. The numbers that csplit prints are the sizes of the files: the first has 270 characters, the second has 505 characters, and so on. Now let's see what they look like:

$ cat PA00

root:x:0:0:System Administrator:/usr/root:/bin/ksh
reboot:x:7:1:- -:/:/etc/shutdown -y -g0 -i6
listen:x:37:4:x:/usr/net/nls:
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
lp:x:71:2:x:/usr/spool/lp:
_:-             :6:6:    ==============================     :6:

$ cat PA01

_:-             :6:6:    ==  UUCP Logins                    :6:
_:-             :6:6:    ==============================     :6:
uucp:x:5:5:0000-uucp(0000):x:
nuucp:x:10:10:0000-uucp(0000):/usr/spool/uucppublic:/usr/lib/uucp/uucico
zzuucp:x:37:100:Bob Sorenson:/usr/spool/uucppublic:/usr/lib/uucp/uucico
asyuucp:x:38:100:Robert L. Wald:/usr/spool/uucppublic:/usr/lib/uucp/uucico
knuucp:x:39:100:Kris Knigge:/usr/spool/uucppublic:/usr/lib/uucp/uucico
_:-             :6:6:    ==============================     :6:

$ cat PA02

_:-             :6:6:    ==  Special Users                  :6:
_:-             :6:6:    ==============================     :6:
msnet:x:100:99:Server Program:/usr/net/servers/msnet:/bin/false
install:x:101:1:x:/usr/install:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
hohen:x:346:1:Michael Hohenshilt:/usr/home/hohen:/bin/ksh
reilly:x:347:1:Joan Reilly:/usr/home/reilly:/bin/ksh
_:-             :6:6:    ==============================     :6:

$ cat PA03

_:-             :6:6:    ==  DP Fall 1991                   :6:
_:-             :6:6:    ==============================     :6:
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
lewis:x:288:20:lewis prince e:/u1/fall91/dp168/lewis:/bin/ksh
metelit:x:265:20:metelitsa natalya:/u1/fall91/dp270/metelit:/bin/ksh
nadaraj:x:307:20:nadarajah kalyani:/u1/fall91/dp168/nadaraj:/bin/ksh
nado:x:266:20:nado conan j:/u1/fall91/dp270/nado:/bin/ksh
_:-             :6:6:    ==============================     :6:

$ cat PA04

_:-             :6:6:    ===  NCR   ===================     :6:
_:-             :6:6:    ==============================     :6:
antello:x:334:20:antello ronald f:/u1/fall91/ff437/antello:/bin/ksh
cilino:x:335:20:cilino michael a:/u1/fall91/ff437/cilino:/bin/ksh
emmons:x:336:20:emmons william r:/u1/fall91/ff437/emmons:/bin/ksh
foreste:x:337:20:forester james r:/u1/fall91/ff437/foreste:/bin/ksh
hayden:x:338:20:hayden richard:/u1/fall91/ff437/hayden:/bin/ksh

This is not bad, but each file ends or begins with one or more lines that you don't want. The csplit command enables you to adjust the split point by appending an offset to the argument. For example, /UUCP/-1 means that the split point is the line before the one on which UUCP appears for the first time. Add -1 to each argument, and you should get rid of the unwanted line that ends each of the first four files:

$ csplit -f PB passwd /UUCP/-1 /Special/-1 /Fall/-1 /NCR/-1

213
505
426
490
503

You can see that the first file is smaller than the previous first file. Perhaps this is working. Let's see:

$ cat PB00

root:x:0:0:System Administrator:/usr/root:/bin/ksh
reboot:x:7:1:- -:/:/etc/shutdown -y -g0 -i6
listen:x:37:4:x:/usr/net/nls:
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
lp:x:71:2:x:/usr/spool/lp:

$ cat PB01

_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  UUCP Logins                    :6:
_:-             :6:6:    ==============================     :6:
uucp:x:5:5:0000-uucp(0000):x:
nuucp:x:10:10:0000-uucp(0000):/usr/spool/uucppublic:/usr/lib/uucp/uucico
zzuucp:x:37:100:Bob Sorenson:/usr/spool/uucppublic:/usr/lib/uucp/uucico
asyuucp:x:38:100:Robert L. Wald:/usr/spool/uucppublic:/usr/lib/uucp/uucico
knuucp:x:39:100:Kris Knigge:/usr/spool/uucppublic:/usr/lib/uucp/uucico

$ cat PB02

_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  Special Users                  :6:
_:-             :6:6:    ==============================     :6:
msnet:x:100:99:Server Program:/usr/net/servers/msnet:/bin/false
install:x:101:1:x:/usr/install:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
hohen:x:346:1:Michael Hohenshilt:/usr/home/hohen:/bin/ksh
reilly:x:347:1:Joan Reilly:/usr/home/reilly:/bin/ksh

$ cat PB03

_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ==  DP Fall 1991                   :6:
_:-             :6:6:    ==============================     :6:
gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
lewis:x:288:20:lewis prince e:/u1/fall91/dp168/lewis:/bin/ksh
metelit:x:265:20:metelitsa natalya:/u1/fall91/dp270/metelit:/bin/ksh
nadaraj:x:307:20:nadarajah kalyani:/u1/fall91/dp168/nadaraj:/bin/ksh
nado:x:266:20:nado conan j:/u1/fall91/dp270/nado:/bin/ksh

$ cat PB04

_:-             :6:6:    ==============================     :6:
_:-             :6:6:    ===  NCR   ===================     :6:
_:-             :6:6:    ==============================     :6:
antello:x:334:20:antello ronald f:/u1/fall91/ff437/antello:/bin/ksh
cilino:x:335:20:cilino michael a:/u1/fall91/ff437/cilino:/bin/ksh
emmons:x:336:20:emmons william r:/u1/fall91/ff437/emmons:/bin/ksh
foreste:x:337:20:forester james r:/u1/fall91/ff437/foreste:/bin/ksh
hayden:x:338:20:hayden richard:/u1/fall91/ff437/hayden:/bin/ksh

This is very good indeed. Now, to get rid of the unwanted lines at the beginning, you have csplit advance its current line without copying anything. A pair of arguments, /UUCP/-1 and %uucp%, tells csplit to skip all the lines beginning with the one that precedes the line containing UUCP, to the one that precedes the line containing uucp. This causes csplit to skip the lines that begin with _:-. The following displays the full command:

$ csplit -f PC passwd /UUCP/-1 %uucp% /Special/-1 %msnet% \
/Fall/-1 %dp[12][67][80]% /NCR/-1 %ff437%

213
334
255
321
332

Note the backslash (\) at the end of the first line of the command. This is simply a continuation character: it tells the shell that the carriage return (or Enter) that you're about to press is not the end of the command, but that you'd like to continue typing on the next line on the screen. Also note that any argument can be a regular expression. Here are the resulting files:

$ cat PC00

root:x:0:0:System Administrator:/usr/root:/bin/ksh
reboot:x:7:1:- -:/:/etc/shutdown -y -g0 -i6
listen:x:37:4:x:/usr/net/nls:
slan:x:57:57:StarGROUP Software NPP Administration:/usr/slan:
lp:x:71:2:x:/usr/spool/lp:

$ cat PC01

uucp:x:5:5:0000-uucp(0000):x:
nuucp:x:10:10:0000-uucp(0000):/usr/spool/uucppublic:/usr/lib/uucp/uucico
zzuucp:x:37:100:Bob Sorenson:/usr/spool/uucppublic:/usr/lib/uucp/uucico
asyuucp:x:38:100:Robert L. Wald:/usr/spool/uucppublic:/usr/lib/uucp/uucico
knuucp:x:39:100:Kris Knigge:/usr/spool/uucppublic:/usr/lib/uucp/uucico

$ cat PC02

msnet:x:100:99:Server Program:/usr/net/servers/msnet:/bin/false
install:x:101:1:x:/usr/install:
pjh:x:102:0:Peter J. Holsberg:/usr/pjh:/bin/ksh
hohen:x:346:1:Michael Hohenshilt:/usr/home/hohen:/bin/ksh
reilly:x:347:1:Joan Reilly:/usr/home/reilly:/bin/ksh

$ cat PC03

gordon:x:304:20:gordon gary g:/u1/fall91/dp168/gordon:/bin/csh
lewis:x:288:20:lewis prince e:/u1/fall91/dp168/lewis:/bin/ksh
metelit:x:265:20:metelitsa natalya:/u1/fall91/dp270/metelit:/bin/ksh
nadaraj:x:307:20:nadarajah kalyani:/u1/fall91/dp168/nadaraj:/bin/ksh
nado:x:266:20:nado conan j:/u1/fall91/dp270/nado:/bin/ksh

$ cat PC04

antello:x:334:20:antello ronald f:/u1/fall91/ff437/antello:/bin/ksh
cilino:x:335:20:cilino michael a:/u1/fall91/ff437/cilino:/bin/ksh
emmons:x:336:20:emmons william r:/u1/fall91/ff437/emmons:/bin/ksh
foreste:x:337:20:forester james r:/u1/fall91/ff437/foreste:/bin/ksh
hayden:x:338:20:hayden richard:/u1/fall91/ff437/hayden:/bin/ksh

The program, therefore, has been a success.
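The pairing of an offset argument with a skip argument is easier to follow on a miniature file. The sketch below is not the passwd file from the example; the file mini, its contents, and the MC prefix are invented for the illustration.

```shell
# A miniature sectioned file (hypothetical data).
cat > mini <<'EOF'
root
lp
=====
== UUCP Logins
=====
uucp
nuucp
EOF

# /UUCP/-1 ends the first piece one line before the line containing UUCP.
# %^uucp% then skips ahead, writing nothing, to the first line that starts with uucp.
csplit -s -f MC mini '/UUCP/-1' '%^uucp%'

cat MC00
cat MC01
```

The arguments are quoted here so that the shell passes characters such as ^ and % to csplit untouched; quoting is a good habit whenever an argument contains a regular expression.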

In addition, an argument can be a line number (typed as an argument but without slashes) to indicate that the desired split should take place at the line before the specified number. You also can specify a repeat factor by following an argument with {number}. For example, /login/ {8} means apply the /login/ pattern eight more times, so that the first nine lines that contain login are used as split points.
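Both forms can be tried on a small file. This sketch assumes a csplit that follows the POSIX rules, in which the repeat factor is a separate argument that applies the preceding argument that many more times. The file notes and the prefixes sec and num are invented for the illustration.

```shell
# A small file with three marked sections (hypothetical data).
cat > notes <<'EOF'
intro
== A
a1
a2
== B
b1
== C
c1
EOF

# Split before each line that starts with "== ". The pattern is applied once,
# and {2} repeats it two more times: three split points, four files.
csplit -s -f sec notes '/^== /' '{2}'

# A bare line number splits before that line: lines 1-4, then line 5 to the end.
csplit -s -f num notes 5

ls sec?? num??
```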

Comparing Files with cmp and diff

So far, you have seen UNIX commands that work with a single file at a time. However, often a user must compare two files and determine whether they are different, and if so, just what the differences are. UNIX provides commands that can help:

The cmp command compares two files, and then simply reports the character number and line number where they differ.
The diff command compares two files and tells you exactly where the files differ and what you must do to make them agree.

The cmp command is especially useful in shell scripts (see Chapters 11, 12 and 13). The diff command is more specialized in what it does and where you can use it.

The cmp Command

The simplest command for comparing two files, cmp, simply tells you whether the files are different or not. If they are different and you use cmp with no options, it tells you where in the file it spotted the first difference. The command's syntax is

$ cmp [ -l ] [ -s ] file1 file2

The -l option gives you more information. It displays the number of each character that is different (the first character in the file is number 1), and then prints the octal value of the ASCII code of that character in each of the two files. (You will probably not have any use for the octal value of a character until you become a shell programming expert!) The -s option prints nothing, but returns an appropriate result code (0 if there are no differences, 1 if there are one or more differences). This option is useful when you write shell scripts (see Chapters 11, 12, and 13).

Here are two files that you can compare with cmp:

$ cat na.1

allen christopher
babinchak david
best betty
bloom dennis
boelhower joseph
bose anita
cacossa ray
delucia joseph

$ cat na.2

allen christopher
babinchak David
best betty
boelhower joseph
bose
cacossa ray
delucia joseph

Note that the first difference between the two files is on the second line. The D in David in the second file is the 29th character, counting all newline characters at the ends of lines.

$ cmp na.1 na.2

na.1 na.2 differ: char 29, line 2

$ cmp -l na.1 na.2

cmp:
    29 144 104
    68 141  12
    69 156 143
    70 151 141
    71 164 143
    72 141 157
    73  12 163
    74 143 163
    76 143  40
    77 157 162
    78 163 141
    79 163 171
    80 141  12
    81  40 144
    82 162 145
    83 141 154
    84 171 165
    85  12 143
    86 144 151
    87 145 141
    88 154  40
    89 165 152
    90 143 157
    91 151 163
    92 141 145
    93  40 160
    94 152 150
    95 157  12

This is quite a list! The 29th character is octal 144 in the first file and octal 104 in the second. If you look them up in an ASCII table, you'll see that the former is a d, and the latter is a D. Character 68 is the first a in anita in na.1 and the newline after the space after bose in na.2.

Now let's try the -s option on the two files:

$ cmp -s na.1 na.2

$ echo $?

1

The variable ? is the shell variable that contains the result code of the last command, and $? is its value. The value 1 on the last line indicates that cmp found at least one difference between the two files. (See Chapters 11, 12, and 13.) Next, for contrast, compare a file with itself to see how cmp reports no differences:

$ cmp -s na.1 na.1

$ echo $?

0

The value 0 means that cmp found no differences.
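In a shell script you rarely look at $? by hand; an if statement can test the result code of cmp directly. Here is a minimal sketch, with old.cfg and new.cfg as invented sample files.

```shell
# Two sample files that differ only in the case of one letter (hypothetical data).
printf 'alpha\nbeta\n' > old.cfg
printf 'alpha\nBeta\n' > new.cfg

# cmp -s prints nothing; the if statement tests its result code.
if cmp -s old.cfg new.cfg; then
    result="same"
else
    result="different"
fi
echo "old.cfg and new.cfg are $result"

# A file always matches itself, so this comparison reports "same".
if cmp -s old.cfg old.cfg; then
    self="same"
else
    self="different"
fi
echo "old.cfg compared with itself: $self"
```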

The diff Command

The diff command is much more powerful than the cmp command. It shows you the differences between two files by outputting the editing changes (see Chapter 7, "Editing Text Files") that you would need to make to convert one file to the other. The syntax of diff is one of the following lines:

$ diff [-bitw] [-c | -e | -f | -h | -n] file1 file2
$ diff [-bitw] [-C number] file1 file2
$ diff [-bitw] [-D string] file1 file2
$ diff [-bitw] [-c | -e | -f | -h | -n] [-l] [-r] [-s] [-Sname] dir1 dir2

The three sets of options (-cefhn, -C number, and -D string) are mutually exclusive. The common options are

-b: Ignores trailing blanks, and treats all other strings of blanks as equivalent to one another.
-i: Ignores uppercase and lowercase distinctions.
-t: Preserves indentation level of the original file by expanding tabs in the output.
-w: Ignores all blanks (spaces and tabs).
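The difference between -b and -w can be checked using nothing more than diff's result code, which is 0 when no differences are found and 1 when some are. The sketch below uses three invented files, sp.1, sp.2, and sp.3, whose first lines differ only in their blanks.

```shell
# Line 1 differs only in blanks (hypothetical data).
printf 'bose anita\nchang liang\n'    > sp.1
printf 'bose   anita \nchang liang\n' > sp.2   # extra blanks inside and at the end
printf 'boseanita\nchang liang\n'     > sp.3   # no blank at all

# Record "same" for result code 0 and "differ" otherwise.
diff    sp.1 sp.2 > /dev/null && plain=same  || plain=differ
diff -b sp.1 sp.2 > /dev/null && with_b=same || with_b=differ
diff -b sp.1 sp.3 > /dev/null && b_none=same || b_none=differ
diff -w sp.1 sp.3 > /dev/null && with_w=same || with_w=differ

echo "no option: $plain, -b: $with_b, -b with a blank removed: $b_none, -w: $with_w"
```

With -b, a run of blanks still has to match at least one blank in the other file, so sp.1 and sp.3 are reported as different; with -w, blanks are ignored altogether and the two files compare as the same.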

Later in this section you'll see examples that demonstrate each of these options.

Let's apply diff to the files na.1 and na.2 (the files with which cmp was demonstrated):

$ diff na.1 na.2

2c2
< babinchak david
---
> babinchak David
4d3
< bloom dennis
6c5
< bose anita
---
> bose

The first four lines show

2c2
< babinchak david
---
> babinchak David

which means that you can change the second line of file1 (na.1) to match the second line of file2 (na.2) by executing the command 2c2, which means change line 2 of file1 to line 2 of file2. Note that both the line from file1 (prefaced with <) and the line from file2 (prefaced with >) are displayed, separated by a line consisting of three dashes. The next command, 4d3, says to delete line 4 from file1; the 3 is the line of file2 after which the deleted line would have appeared. Finally, notice that there is another change command, 6c5, which says change line 6 of file1 by replacing it with line 5 of file2.

Note that in line 2, the difference that diff found was the d versus D letter in the second word.

You can use the -i option to tell diff to ignore the case of the characters, as follows:

$ diff -i na.1 na.2

4d3
< bloom dennis
6c5
< bose anita
---
> bose

The -c option causes the differences to be printed in context; that is, the output displays several of the lines above and below a line in which diff finds a difference. Each difference is marked with one of the following:

An exclamation point (!) indicates that corresponding lines in the two files are similar but not the same.
A minus sign (-) means that the line is not in file2.
A plus sign (+) means that the line is in file2 but not in file1.

Note in the following example that the output includes a header that displays the names of the two files, and the times and dates of their last changes. The header also shows either stars (***) to designate lines from the first file, or dashes (---) to designate lines from the second file.

$ diff -c na.1 na.2

*** na.1     Sat Nov  9 12:57:55 1991
--- na.2     Sat Nov  9 12:58:27 1991
***************
*** 1,8 ****
  allen christopher
! babinchak david
  best betty
- bloom dennis
  boelhower joseph
! bose anita
  cacossa ray
  delucia joseph
--- 1,7 ----
  allen christopher
! babinchak David
  best betty
  boelhower joseph
! bose
  cacossa ray
  delucia joseph

After the header comes another asterisk-filled header that shows which lines of file1 (na.1) will be printed next (1,8), followed by the lines themselves. You see that the babinchak line differs in the two files, as does the bose line. Also, bloom dennis does not appear in file2 (na.2). Next, you see a header of dashes that indicates which lines of file2 will follow (1,7). Note that for the file2 list, the babinchak line and the bose line are marked with exclamation points. The number of lines displayed depends on how close together the differences are (the default is three lines of context). To change the number of context lines, use the -C number form of the command shown in the syntax summary.

diff can create an ed script (see Chapter 7) that you can use to change file1 into file2. First you execute a command such as the following:

$ diff -e na.1 na.2

6c
bose
.
4d
2c
babinchak David
.

Then you redirect this output to another file using a command such as the following:

$ diff -e na.1 na.2 > ed.scr

Edit the file by adding two lines, w and q (see Chapter 7), which results in the following file:

$ cat ed.scr

6c
bose
.
4d
2c
babinchak David
.
w
q

Then you execute the command:

$ ed na.1 < ed.scr

This command changes the contents of na.1 to agree with na.2.
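The w and q lines can also be appended from the shell, which avoids the editing step. The following is a sketch under a few assumptions: the files v1 and v2 and the script name changes.ed are invented, and because ed is not installed on every system, the script checks for it before using it.

```shell
# Two versions of a short list (hypothetical data).
printf 'allen\nbabinchak david\nbloom\nbose anita\n' > v1
printf 'allen\nbabinchak David\nbose\n'              > v2

# diff returns 1 when the files differ, so that result is not treated as a failure.
diff -e v1 v2 > changes.ed || true

# Append the write and quit commands instead of adding them in an editor.
printf 'w\nq\n' >> changes.ed

if command -v ed > /dev/null 2>&1; then
    ed -s v1 < changes.ed          # -s suppresses ed's character counts
    cmp -s v1 v2 && echo "v1 now matches v2"
else
    echo "ed is not installed; changes.ed was left for inspection"
fi
```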

Perhaps this small example isn't very striking, but here's another, more impressive one. Suppose that you have a large program written in C that does something special for you; perhaps it manages your investments or keeps track of sales leads. Further, suppose that the people who provided the program discover that it has bugs (and what program doesn't?). They could either ship new disks that contain the rewritten program, or they could run diff on both the original and the corrected copy and then send you an ed script so that you can make the changes yourself. If the script were small enough (less than 50,000 characters or so), they could even distribute it through electronic mail.

The -f option creates what appears to be an ed script that changes file2 to file1. However, it is not an ed script at all, but a rather puzzling feature that is almost never used:

$ diff -f na.1 na.2

c2
babinchak David
.
d4
c6
bose
.

Also of limited value is the -h option, which causes diff to work in a "half-hearted" manner (according to the official AT&T UNIX System V Release 4 User's Reference Manual). With the -h option, diff is supposed to work best, and fast, on very large files having sections of change that encompass only a few lines at a time and that are widely separated in the files. Without -h, diff slows dramatically as the sizes of the files to which you apply it increase.

$ diff -h na.1 na.2

2c2
< babinchak david
---
> babinchak David
4d3
< bloom dennis
6c5
< bose anita
---
> bose

As you can see, diff with the -h option also works pretty well with original files that are too small to show a measurable difference in diff's speed.

The -n option, like -f, also produces something that looks like an ed script, but isn't and is also rarely used. The -D option permits C programmers (see Chapter 17) to produce a source code file based on the differences between two source code files. This is useful when writing a program that is to be compiled on two different computers.
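To see what -D produces, try it on two tiny source files. This sketch assumes a diff that supports -D, such as GNU diff. The files prog_a.c and prog_b.c and the macro name NEWCPU are invented for the illustration.

```shell
# Two versions of a tiny C program that differ in one line (hypothetical data).
printf 'int main(void)\n{\n    return 0;\n}\n' > prog_a.c
printf 'int main(void)\n{\n    return 1;\n}\n' > prog_b.c

# The merged file keeps the common lines once and wraps the differing lines in
# preprocessor conditionals: the prog_b.c version is compiled when NEWCPU is
# defined, and the prog_a.c version otherwise.
# diff returns 1 when the files differ, so that result is not treated as a failure.
diff -D NEWCPU prog_a.c prog_b.c > merged.c || true

cat merged.c
```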

Summary

This chapter introduced some tools that enable you to determine the nature of the contents of a file and to examine those contents. Other tools extract selected lines from a file and sort the structured information in a file. Some tools disguise the contents of a file, and others compress the contents so that the resultant file is half its original size. Other tools compare two files and then report the differences. These commands are the foundation that UNIX provides to enable users to create even more powerful tools from relatively simple ones. However, none of these tools enables you to create a file that is exactly—to the tiniest detail—what you want. The next chapter discusses just such tools—UNIX's text editors.