Understanding File Descriptors and IO Redirection

The ability to pass data between processes and to control where it comes from and where it goes is a large part of what makes Linux and Unix systems great. In this article we will look at the methods for doing so and describe where this data comes from and where it goes. A solid understanding of redirection and file descriptors is essential for any systems administrator or programmer working with a UNIX-like system.

A Note On File Descriptors

Any program that runs on a Linux machine has access to something called a “File Descriptor Table.” This table acts as a map providing the process access to files, directories, unnamed pipes, named pipes, sockets and kernel-level data structures. This table exists for each process. Inside the bash shell you have access to the three standard file descriptors: standard input, standard output and standard error.
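As a rough illustration of the table in action, the sketch below asks the shell to open one extra descriptor beyond the standard three and then close it again. The descriptor number 3 and the temporary file are arbitrary choices made for this example, not anything the shell requires.

```shell
# Assumption: a throwaway temporary file stands in for a real destination.
scratch=$(mktemp)

exec 3> "$scratch"                        # add descriptor 3 to this shell's table, open for writing
echo "written through descriptor 3" >&3   # send this command's STDOUT to descriptor 3
exec 3>&-                                 # close descriptor 3 again

contents=$(cat "$scratch")
echo "$contents"
rm -f "$scratch"
```

Descriptors 0, 1 and 2 are handled the same way; they are simply opened for the process before it starts.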

By default, when a program is started from an interactive shell, Standard Input, Standard Output, and Standard Error are all attached to the terminal, which is a character device. Character devices provide a mechanism to send a stream of characters or bytes. A stream provides sequential access: data is read in the order in which it was written. This ordering is known as First-In First-Out (FIFO), and it is the same behavior a pipe provides.

Standard Input

Standard Input is a file descriptor that a program is given for input sent to it. On Linux machines the file descriptor can be accessed under the path /proc/PID/fd/0, where PID is the process id of the program. When the output of one command is piped into the input of another command, the input is taken from Standard Input. It is often referred to by the token STDIN or the file descriptor 0.
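A minimal sketch of a command consuming Standard Input follows. The two sample lines are made up for the example; read is the shell builtin that takes one line from descriptor 0.

```shell
# printf writes two lines to its STDOUT; the pipe delivers them to the
# STDIN of the command group on the right, where read takes the first one.
first_line=$(printf 'alpha\nbeta\n' | { read -r line; echo "$line"; })
echo "$first_line"
```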

Standard Output

Standard Output is a file descriptor that a program is given to send output data to. When a command is run and output goes to a terminal, the output is usually being sent to Standard Output. It is worth noting that there are exceptions where the output is being sent directly to the terminal. This is the case with anything that uses the curses library for rendering, such as irssi or top. On Linux machines the file descriptor can be accessed under the path /proc/PID/fd/1, where PID is the process id of the command. When the output of a command is piped to another command, it is sending the contents of Standard Output. It is often referred to as STDOUT or the file descriptor 1.

Standard Error

Standard Error is a file descriptor that a program is given to send error output data to. When a command reports an error to the terminal, it is most often sent through Standard Error. On Linux machines the file descriptor can be accessed under the path /proc/PID/fd/2, where PID is the process id of the command. When both are displayed in a terminal without any form of redirection, there appears to be no difference to the user between Standard Error and Standard Output. It is often referred to as STDERR or the file descriptor 2.
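Although the two output streams look identical on a terminal, they are separate descriptors and can be split apart. The sketch below assumes a hypothetical helper named report that writes one line to each stream; the 2> and 2>&1 operators used here are standard shell syntax for redirecting descriptor 2.

```shell
# Hypothetical helper: one line to each output stream.
report() {
    echo "normal output"        # goes to STDOUT (descriptor 1)
    echo "error output" >&2     # goes to STDERR (descriptor 2)
}

# Command substitution captures STDOUT only; STDERR is discarded here.
captured_out=$(report 2>/dev/null)

# Swap the roles: point STDERR at the capture first, then discard STDOUT.
captured_err=$(report 2>&1 >/dev/null)

echo "stdout carried: $captured_out"
echo "stderr carried: $captured_err"
```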

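Although it is typically not important to know, it's worth stating anyway that while you can access these file descriptors through the above paths, the paths are only a view: the descriptor table itself is maintained by the kernel for each process, and libc layers the buffered stdin, stdout and stderr streams on top of descriptors 0, 1 and 2 in the program's own memory.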

Redirection

Inline Redirection

Inline Redirection, which the bash manual calls process substitution, is a mechanism for taking the output of a command and presenting it to another command as if it were a file. Bash replaces the construct with a filename before the command we execute is evaluated. It sounds a bit more complicated than it actually is. Let's take a look at a good use for it.

Suppose that we have two SHA512 strings to validate a file:


92789bcce687a24ce695fafe4ad6d34c24ecc067fcfedc1c645ebdb5a8b
75386966bf09570d2a4074ed9e579594e7c629f77ab96e4afeb173b04d5
366213fb6c


92789bcce687a24ce695fafe4ad6d34c24ecc067fcfedc1c645ebdb5a8b
75386966bf09570d2a4074ed9e579594e7c629f77ab96e4afeb173b04d5
366213fb6c

We need to compare them to see if they are the same or different. We could use a number of methods, from glancing at them to putting each into a file and running diff. The latter would take writing to one file, writing to another file and then running diff on the two files. Let's do this with one command and without writing the files. To signal to bash that you wish to use inline redirection, wrap your command inside of <() like so: <(command goes here).


symkat@symkat:~$ diff <(echo 92789bcce687a24ce695fafe4a\
> d6d34c24ecc067fcfedc1c645ebdb5a8b75386966bf09570d2a40\
> 74ed9e579594e7c629f77ab96e4afeb173b04d5366213fb6c)   \
> <(echo 92789bcce687a24ce695fafe4ad6d34c24ecc067fcfedc\
> 1c645ebdb5a8b75386966bf09570d2a4074ed9e579594e7c629f7\
> 7ab96e4afeb173b04d5366213fb6c)
symkat@symkat:~$

The command printed nothing, which means that the diff command found no difference. Let's use a line of Perl to see what the diff program saw.


symkat@symkat:~$ perl -e'print "Got Commandline Arguments: " . join(" ", @ARGV)."\n"' \
> <(echo 92789bcce687a24ce695fafe4a\
> d6d34c24ecc067fcfedc1c645ebdb5a8b75386966bf09570d2a40\
> 74ed9e579594e7c629f77ab96e4afeb173b04d5366213fb6c)   \
> <(echo 92789bcce687a24ce695fafe4ad6d34c24ecc067fcfedc\
> 1c645ebdb5a8b75386966bf09570d2a4074ed9e579594e7c629f7\
> 7ab96e4afeb173b04d5366213fb6c)
Got Commandline Arguments: /dev/fd/63 /dev/fd/62
symkat@symkat:~$

Inline Redirection is extremely useful for any situation where you would normally make a file from the output of a command to do processing, or for programs that require the input be from a file.
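As a further sketch of the same idea, comm expects two sorted files, which would normally mean sorting each list into a temporary file first. The two fruit lists are made up for the example, and printf stands in for any command that produces lines.

```shell
#!/usr/bin/env bash
# Requires bash: the <( ) construct is not part of POSIX sh.

# comm -23 prints the lines that appear only in the first of two sorted inputs.
# Each <( ) is replaced by a /dev/fd/N style path that comm opens like a file.
only_in_first=$(comm -23 <(printf 'pear\napple\nfig\n' | sort) \
                         <(printf 'fig\napple\nkiwi\n' | sort))
echo "$only_in_first"
```

No temporary files are created, and both sort commands run alongside comm.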

Truncating Redirection

Truncating Redirection takes the STDOUT of a command and sends it to a file. The important thing to know about Truncating Redirection is the way it opens the file it writes to. When opening the file it uses the flags O_CREAT and O_TRUNC. This instructs it to create the file if it does not currently exist and write to the beginning of the file, after having removed the contents of the file. To use truncating redirection append > filename to the end of your command.

Let's take a look at some uses of truncating redirection.


symkat@symkat:~$ ls log
ls: cannot access log: No such file or directory
symkat@symkat:~$ uptime
01:37:35 up 32 days,  6:03,  1 user,  load average: 0.00, 0.00, 0.00
symkat@symkat:~$ uptime > log
symkat@symkat:~$ cat log
01:37:38 up 32 days,  6:03,  1 user,  load average: 0.00, 0.00, 0.00
symkat@symkat:~$ uptime > log
symkat@symkat:~$ cat log
01:37:43 up 32 days,  6:03,  1 user,  load average: 0.00, 0.00, 0.00
symkat@symkat:~$

As you can see, the file log did not exist at first. The output of the command uptime shows the time, how long the machine has been up, the number of users logged into the machine and the load average. When using > log, the file log was created and the output of the uptime command was sent to it. Notice something important: because we are redirecting STDOUT, the output does not display in the terminal. When we ran the same command again, it replaced the contents of the file. To avoid replacing the contents of the file we would use Appending Redirection.
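One practical consequence of the truncation is worth sketching: the shell opens and empties the target file before the command starts, so a command that reads from the same file it is redirected into finds nothing to read. The file names below are temporary files created only for this demonstration.

```shell
# Assumption: throwaway temporary files for the demonstration.
scratch=$(mktemp)
sorted=$(mktemp)
printf 'b\na\n' > "$scratch"

# Safe: read from one file, truncate and write a different one.
sort "$scratch" > "$sorted"
safe_result=$(cat "$sorted")

# Unsafe: the target is truncated before sort opens it for reading,
# so sort sees an empty file and the original contents are lost.
sort "$scratch" > "$scratch"
size_after=$(wc -c < "$scratch")

echo "sorted copy: $safe_result"
echo "bytes left in the original: $size_after"
rm -f "$scratch" "$sorted"
```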

Appending Redirection

Appending Redirection takes the STDOUT of a command and sends it to a file, just as Truncating Redirection does. The important difference between them is that Appending Redirection uses the flag O_APPEND instead of O_TRUNC when opening the file. This means that instead of emptying the contents of the file and writing to the beginning, it will leave the contents as they are and write to the end of the file. To use Appending Redirection, add >> file to the end of your command.

Let's look at the example from Truncating Redirection, but with Appending Redirection instead this time.


symkat@symkat:~$ rm log
symkat@symkat:~$ ls log
ls: cannot access log: No such file or directory
symkat@symkat:~$ uptime
01:47:24 up 32 days,  6:13,  1 user,  load average: 0.01, 0.02, 0.00
symkat@symkat:~$ uptime >> log
symkat@symkat:~$ cat log
01:47:27 up 32 days,  6:13,  1 user,  load average: 0.01, 0.02, 0.00
symkat@symkat:~$ uptime >> log
symkat@symkat:~$ cat log
01:47:27 up 32 days,  6:13,  1 user,  load average: 0.01, 0.02, 0.00
01:47:34 up 32 days,  6:13,  1 user,  load average: 0.01, 0.02, 0.00
symkat@symkat:~$

We see here that log now contains two uptime lines. This is because we've appended to the file, instead of truncating it.
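The same behavior makes >> a natural fit for building up a file from inside a loop. This is only a sketch: the file is a temporary one and the "attempt" messages are placeholders for whatever a real script would record.

```shell
# Assumption: a throwaway temporary file plays the role of the log.
log=$(mktemp)

for attempt in 1 2 3; do
    # Each iteration reopens the file with O_APPEND and writes at the end.
    echo "attempt $attempt" >> "$log"
done

line_count=$(wc -l < "$log")
last_line=$(tail -n 1 "$log")
echo "$line_count lines, last one: $last_line"
rm -f "$log"
```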

Input Redirection

Input Redirection takes the contents of a file and sends them to the STDIN of the command being executed. To use Input Redirection add < file to the end of your command. Let's take a look at a usage of Input Redirection.


symkat@symkat:~$ mysql -uroot -pblahblahpass testing_db < schema.sql
symkat@symkat:~$

We imported the schema from schema.sql to testing_db by passing the contents of the file into STDIN for the MySQL process.
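To see that the command really receives the data on STDIN rather than a file name, compare how wc behaves in the two cases. The three-line temporary file is made up for this sketch.

```shell
# Assumption: a throwaway temporary file with three lines.
scratch=$(mktemp)
printf 'one\ntwo\nthree\n' > "$scratch"

# The shell opens the file and hands it to wc as STDIN. wc never learns
# the file name, so it prints only the count.
from_stdin=$(wc -l < "$scratch")

# Given the path as an argument, wc opens the file itself and prints the name too.
from_arg=$(wc -l "$scratch")

echo "via STDIN:    $from_stdin"
echo "via argument: $from_arg"
rm -f "$scratch"
```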

Pipe Redirection

Pipe Redirection takes STDOUT from the process on the Left Hand Side and sends it to the STDIN of the command on the Right Hand Side. Let's say that you have two processes, x and y, and you execute x | y. The shell creates a FIFO pipe, starts the command x with its STDOUT connected to one end, and starts the command y with its STDIN connected to the other. Both processes run simultaneously, with y reading data as x writes it. You can combine multiple pipes with multiple processes such that x | y | z will work as expected: STDOUT from x goes to STDIN for y, and y's STDOUT goes to STDIN on z.

Pipe Redirection is one of the most useful methods of redirection as it allows you to process text through multiple programs.
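A small self-contained sketch of a multi-stage pipeline is given here; the input letters are invented for the example. Each stage reads the previous stage's STDOUT on its own STDIN.

```shell
# Find the most frequent line in the input:
#   sort     groups identical lines together
#   uniq -c  collapses each group into "count line"
#   sort -rn orders by count, highest first
#   awk      prints the second field of the first line
most_common=$(printf 'b\na\nb\nc\nb\na\n' \
    | sort | uniq -c | sort -rn | awk 'NR == 1 { print $2 }')
echo "$most_common"
```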

Check out some of the following things we can do with piping.


symkat@symkat:~$ ps aux | grep light
symkat        11290  0.0  0.0   1764   496 pts/0    S+   03:30   0:00 grep light
www-data 14222  0.0  0.9   8784  4732 ?        S    Aug22   1:28 /usr/sbin/lighttpd -f /etc/lighttpd/lighttpd.conf
symkat@symkat:~$ ps aux | grep light | grep -v grep
www-data 14222  0.0  0.9   8784  4732 ?        S    Aug22   1:28 /usr/sbin/lighttpd -f /etc/lighttpd/lighttpd.conf
symkat@symkat:~$ ps aux | grep light | grep -v grep | awk '{print "Lighttpd CPU=" $3 " MEM=" $4}'
Lighttpd CPU=0.0 MEM=0.9
symkat@symkat:~$

In the above example we took the output of ps aux and passed it to grep asking it to only give us the lines of the output of ps aux that contain the phrase light. This returned two results: the grep command we used (because it contains the word we were looking for) and the lighttpd process.

The second command does the same; however, instead of sending the STDOUT of grep light to the terminal, it pipes it to an additional grep process that uses the -v switch (inverted match: print only lines that do not contain the pattern), which then prints its STDOUT to the terminal.

We follow that command up with one more where we chain together four commands, each being executed concurrently, but waiting for information from the previous command.

Redirection With Tee

Tee is a program that does redirection splitting. It sends its input untouched from STDIN to STDOUT while simultaneously writing it out to a file. If you pass the -a switch before the filename it uses append mode to print to the file, otherwise it uses truncate mode. The basic usage is x | tee [-a] file or x | tee [-a] file | y.

Tee is useful for a handful of things: checking a pipeline in the middle of it, keeping two states of a file through one pipeline or keeping a log of a command you are watching.
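The first of those uses, checking a pipeline in the middle, might look like the following sketch. The snapshot file is a temporary file and the input letters are invented for the example.

```shell
# Assumption: a throwaway temporary file receives the mid-pipeline copy.
snapshot=$(mktemp)

# tee copies the sorted lines into the snapshot file and passes the
# same lines on, unchanged, to uniq.
final=$(printf 'b\na\nb\n' | sort | tee "$snapshot" | uniq)
mid=$(cat "$snapshot")

echo "after sort (seen by tee):"
echo "$mid"
echo "after uniq (end of pipeline):"
echo "$final"
rm -f "$snapshot"
```

The snapshot keeps the duplicate line that uniq later removes, so both states of the data survive a single run.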

Take for instance a situation where we are watching the memory usage on a system to try to catch a spike that occurs every night somewhere around 3AM:


symkat@symkat:~$ ps aux | sort -rnk 4 | head -n 5 | \
> cat <(echo `date`) - | tee -a high_mem.log

This takes the output of ps aux and sends it to sort, which sorts on the 4th field (-k 4), the memory percentage, as a number (-n) and reports the highest values first (-r). The result goes to head, which keeps the first 5 lines (-n 5) of its input. We're using cat here to print two files: the first is an inline redirection that prints the current date, and the second is STDIN (-). As a note, a lone - is often used to mean STDIN or STDOUT, depending on the context it's used in. Finally we send everything to tee -a high_mem.log, which passes its input through to the screen and also appends it to the file high_mem.log.
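To make the sort options easier to follow, here is the same sort and head stage run against three made-up lines instead of live ps aux output. The user names, process ids and values are hypothetical and have only five columns, with memory in the 4th as in ps aux.

```shell
# Hypothetical sample in the column order: user, pid, %cpu, %mem, command.
top_two=$(printf '%s\n' \
    'alice 101 0.0 1.5 sleep' \
    'bob 102 0.1 12.0 mysqld' \
    'carol 103 0.0 4.3 php' \
  | sort -rnk 4 | head -n 2 | awk '{ print $5 }')

# sort -rnk 4 orders by the 4th field as a number, largest first;
# head -n 2 keeps the top two; awk prints only the command name.
echo "$top_two"
```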

Let's take a look at putting this into a while loop we can watch, and then look at the log we can back-reference.


symkat@symkat:~$ while true ; \
> do clear ; \
> ps aux | sort -rnk 4 | head -n 5 | cat <(echo `date`) - | tee -a high_mem.log ; \
> sleep 4 ; \
> done

The terminal now shows the following:


Wed Aug 25 03:54:07 PDT 2010
mysql     3650  0.0  4.5 124256 23180 ?        Sl   Jul23  11:53 /usr/sbin/mysqld
www-data 26460  0.0  4.3  37564 22004 ?        S    Aug24   0:10 /usr/bin/php5-cgi
www-data 31089  0.0  4.0  36680 20788 ?        S    Aug24   0:01 /usr/bin/php5-cgi
www-data 28121  0.0  4.0  37228 20660 ?        S    Aug24   0:06 /usr/bin/php5-cgi
www-data  2228  0.1  3.9  36192 20028 ?        S    00:55   0:12 /usr/bin/php5-cgi

We can press Ctrl-C to exit the while loop, use wc -l to see how many lines long the file is, and use grep to pull entries back out of the log:


symkat@symkat:~$ wc -l high_mem.log
162 high_mem.log
symkat@symkat:~$ grep php5 high_mem.log |head -n3
www-data 26460  0.0  4.3  37564 22004 ?        S    Aug24   0:10 /usr/bin/php5-cgi
www-data 31089  0.0  4.0  36680 20788 ?        S    Aug24   0:01 /usr/bin/php5-cgi
www-data 28121  0.0  4.0  37228 20660 ?        S    Aug24   0:06 /usr/bin/php5-cgi
symkat@symkat:~$