Five Text Processing Tools You Should Know


In the world of UNIX, text is king. Almost anything you want to know about a system can be learned by reading a file. Often that file is a few thousand lines long. Sometimes it is twelve million lines long.

This is where text processing comes in. Text processing aims to do one thing: given a source of information, answer a question. Everybody has questions for their systems. "How many 404 errors has my website gotten?" "Has anyone tried to access my server without me knowing?" "What program is using the most CPU right now?"

This article explores the most common ways to answer these questions with a set of tools that, by extension, provide a method for answering hundreds of others. The questions will be answered with one-liners: several of these tools strung together to produce an answer without writing a full script.

grep

The grep command can be found in one-liners all over the world. This is no accident. When operating on a file, grep prints only the lines of that file that match your criteria.

Let's say that I want to look at all the IP addresses that have logged in to the machine using my username.


symkat@symkat:~$ wc -l /var/log/auth.log
1819 /var/log/auth.log
symkat@symkat:~$ grep symkat /var/log/auth.log
symkat.com sshd[18021]: Accepted publickey for symkat from 10.0.0.234 port 4233 ssh2
symkat.com sshd[30845]: Accepted publickey for symkat from 172.16.32.56 port 56964 ssh2
symkat.com sshd[6065]: Accepted publickey for symkat from 192.168.1.100 port 56374 ssh2
symkat.com sshd[8457]: Accepted publickey for symkat from 10.0.0.234 port 4162 ssh2
symkat.com sshd[8498]: Accepted publickey for symkat from 10.0.0.234 port 5353 ssh2
symkat.com sshd[9474]: Accepted publickey for symkat from 172.16.32.56 port 62164 ssh2
symkat.com sshd[9889]: Accepted publickey for symkat from 10.0.0.234 port 5059 ssh2
symkat.com sshd[23298]: Accepted publickey for symkat from 172.16.32.56 port 51604 ssh2
symkat.com sshd[23607]: Accepted publickey for symkat from 172.16.32.56 port 51611 ssh2
symkat.com sshd[15610]: Accepted publickey for symkat from 10.0.0.234 port 4146 ssh2
symkat.com sshd[17435]: Accepted publickey for symkat from 10.0.0.234 port 4320 ssh2
symkat.com sshd[22907]: Accepted publickey for symkat from 10.0.0.234 port 4254 ssh2
symkat.com sshd[8303]: Accepted publickey for symkat from 192.168.1.100 port 65065 ssh2
symkat.com sshd[26505]: Accepted publickey for symkat from 10.0.0.234 port 4282 ssh2
symkat@symkat:~$

We'll also save this to a file to play with it:


symkat@symkat:~$ grep symkat /var/log/auth.log > auth.log

The auth.log file itself contains a lot of other information we don't care about in answering this question. Above we ran wc -l /var/log/auth.log and found that the file contains 1,819 lines of text. Our grep command showed only what we wanted to know, which turned out to be 14 lines from that file.
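Two other grep arguments come up often in one-liners: -c counts the matching lines and -v prints the lines that do not match. Here is a minimal sketch. The log lines are made-up sample data standing in for /var/log/auth.log, and the pattern is more specific than a bare username because the host name symkat.com would also match the string symkat.

```shell
# Hypothetical sample data standing in for /var/log/auth.log.
log=$(mktemp)
cat > "$log" <<'EOF'
symkat.com sshd[18021]: Accepted publickey for symkat from 10.0.0.234 port 4233 ssh2
symkat.com sshd[18022]: Failed password for root from 203.0.113.7 port 4100 ssh2
symkat.com sshd[30845]: Accepted publickey for symkat from 172.16.32.56 port 56964 ssh2
symkat.com cron[411]: session opened for user root
EOF

# -c prints the number of matching lines instead of the lines themselves.
accepted=$(grep -c 'Accepted publickey for symkat' "$log")
echo "accepted logins: $accepted"

# -v inverts the match: print only the lines that do NOT match.
others=$(grep -v 'Accepted publickey for symkat' "$log")
echo "$others"

rm -f "$log"
```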

awk

With grep we saw how to pull just the matching lines from a file. All we care about, though, is the IP addresses used to log in as symkat. We have a lot of information we don't need and it's making our eyes glaze over.

One trick we can do with awk is to show only the columns of information we want to know. In this case it's the 8th column.


symkat@symkat:~$ awk '{print $8}' auth.log
10.0.0.234
172.16.32.56
192.168.1.100
10.0.0.234
10.0.0.234
172.16.32.56
10.0.0.234
172.16.32.56
172.16.32.56
10.0.0.234
10.0.0.234
10.0.0.234
192.168.1.100
10.0.0.234
symkat@symkat:~$

We gave awk an action but no pattern, and that is enough. To understand how this works we have to understand a few key points about awk:

  • Awk expects to be given a pattern to match and an action to run on each line of input that matches.
  • Awk assigns $1, $2, $3, and so on to the fields of each input line, split on white space. In this example, the first line has $1 = symkat.com and $2 = sshd[18021]:
  • We did not include a pattern. As such, awk matches every line. The action we took was to print the eighth column.

It is worth noting that if I want additional information, such as both the username and the IP address, no concatenation operator is needed: awk joins adjacent values, and quoted text is treated as literal. For instance, this prints a quoted space between the 6th value (the username) and the 8th value (the IP address):

    
    symkat@symkat:~$ awk '{print $6 " " $8 }' auth.log
    symkat 10.0.0.234
    symkat 172.16.32.56
    symkat 192.168.1.100
    symkat 10.0.0.234
    symkat 10.0.0.234
    symkat 172.16.32.56
    symkat 10.0.0.234
    symkat 172.16.32.56
    symkat 172.16.32.56
    symkat 10.0.0.234
    symkat 10.0.0.234
    symkat 10.0.0.234
    symkat 192.168.1.100
    symkat 10.0.0.234
    symkat@symkat:~$
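The examples so far use an action with no pattern. As a rough sketch of the other half, here is awk with a pattern in front of the action. The log lines and the deploy user are made up for illustration.

```shell
# Hypothetical sample data in the same shape as the saved auth.log.
log=$(mktemp)
cat > "$log" <<'EOF'
symkat.com sshd[18021]: Accepted publickey for symkat from 10.0.0.234 port 4233 ssh2
symkat.com sshd[18022]: Accepted publickey for deploy from 10.0.0.99 port 4100 ssh2
symkat.com sshd[30845]: Accepted publickey for symkat from 172.16.32.56 port 56964 ssh2
EOF

# Pattern: field 6 equals "symkat". Action: print the user and the IP.
symkat_ips=$(awk '$6 == "symkat" {print $6 " " $8}' "$log")
echo "$symkat_ips"

# A regular expression pattern works too: only lines containing "deploy".
deploy_ips=$(awk '/deploy/ {print $8}' "$log")
echo "$deploy_ips"

rm -f "$log"
```

With a pattern in place, awk can do the filtering that grep did earlier and pick the columns in the same step.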
    
    

sort

The sort command does what it says: it sorts its input and displays the sorted result. By default sort compares whole lines as text, starting from the first character of each line.

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort
    10.0.0.234
    10.0.0.234
    10.0.0.234
    10.0.0.234
    10.0.0.234
    10.0.0.234
    10.0.0.234
    10.0.0.234
    172.16.32.56
    172.16.32.56
    172.16.32.56
    172.16.32.56
    192.168.1.100
    192.168.1.100
    symkat@symkat:~$
    
    

The | here is the pipe operator. It takes the STDOUT of the command on its left-hand side and makes it the STDIN of the command on its right-hand side. In this case we're taking the IP addresses (the 8th field of each line in auth.log) and passing them to sort as input, which in turn gives us sorted output.

uniq

That's not a misspelling, it's just a UNIX spelling! This command removes duplicate lines, but it only compares each line with the one directly before it, so you need to send it sorted data. Put most simply, uniq looks at a line: if it is the same as the previous line it is not printed, and if it is different it is printed. Because of this, duplicates in unsorted data that are not next to each other will slip through.
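To see why the sort matters, here is a small sketch that runs uniq on the same input with and without sorting it first. The three addresses are made-up input.

```shell
# Hypothetical input: the duplicate 10.0.0.234 lines are not adjacent.
ips=$(printf '%s\n' 10.0.0.234 172.16.32.56 10.0.0.234)

# uniq alone compares neighbours only, so all three lines survive.
unsorted_count=$(printf '%s\n' "$ips" | uniq | awk 'END {print NR}')

# After sort the duplicates sit next to each other and collapse into one.
sorted_count=$(printf '%s\n' "$ips" | sort | uniq | awk 'END {print NR}')

echo "without sort: $unsorted_count lines, with sort: $sorted_count lines"
```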

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort | uniq
    10.0.0.234
    172.16.32.56
    192.168.1.100
    symkat@symkat:~$
    
    

Now we know the three IP addresses that have connected as symkat to symkat.com. We may also want to know how many times each IP connected. In that case we can pass the -c argument to uniq for a count of each line:

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort | uniq -c
          8 10.0.0.234
          4 172.16.32.56
          2 192.168.1.100
    symkat@symkat:~$
    
    

It is worth noting that in this instance the counts from uniq -c happen to run from the highest number of connections to the lowest. This is purely a coincidence; generally we would need to add an additional sort:

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort | uniq -c | sort
          2 192.168.1.100
          4 172.16.32.56
          8 10.0.0.234
    symkat@symkat:~$
    
    

You will notice that our second sort reordered the lines from lowest to highest; this is the default. To switch it to highest to lowest we add the reverse argument, -r. Because we are sorting on a number, we also add the -n argument. We can combine these arguments:

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort | uniq -c | sort -rn
          8 10.0.0.234
          4 172.16.32.56
          2 192.168.1.100
    symkat@symkat:~$
    
    

The difference between a numerical sort and a string sort is obvious when we add some interesting data to a file and sort it:

    
    symkat@symkat:~$ cat > data_source.txt
    101
    12
    133
    symkat@symkat:~$ sort data_source.txt
    101
    12
    133
    symkat@symkat:~$ sort -n data_source.txt
    12
    101
    133
    symkat@symkat:~$
    
    

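As you can see, these are very different results. A string sort compares character by character, so 12 lands after 101 because its second character, 2, comes after 0. The intricacies of sort ordering are outside the scope of this article; suffice it to say that if you are sorting on a number you will want to use a numerical sort.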

head

We have a small data set in the case of this auth.log file. Let's pretend for a moment that the results show a few thousand IP addresses connecting, and we want to know which ones connect the most. In this case we will ask for the top two IP addresses by number of connections.

Head is the command we want for this. We can tell it how many lines we want to see from the start of its input. It has a counterpart called tail, which works from the last line of the input instead of the first:

    
    symkat@symkat:~$ awk '{print $8 }' auth.log | sort | uniq -c | sort -rn | head -n 2
          8 10.0.0.234
          4 172.16.32.56
    symkat@symkat:~$
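Since tail is the mirror image of head, a quick sketch of the two side by side may help. The counts below are made-up data in the same shape as the uniq -c output, already sorted highest first.

```shell
# Hypothetical counts, sorted from most to fewest connections.
counts=$(printf '%s\n' '8 10.0.0.234' '4 172.16.32.56' '2 192.168.1.100')

# head -n 1 keeps the first line: the busiest address.
busiest=$(printf '%s\n' "$counts" | head -n 1)

# tail -n 1 keeps the last line: the quietest address.
quietest=$(printf '%s\n' "$counts" | tail -n 1)

echo "busiest:  $busiest"
echo "quietest: $quietest"
```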
    
    

Above I said I would explore the answers to three questions. The commands are presented below; see if you can understand how they work:

    
    symkat@symkat:~$ awk '{print $7 " " $9}' /var/log/lighttpd/access.log | grep " 404" \
    > | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 5
          7 /robots.txt
          5 /favicon.ico
          2 /sitemap.xml.gz
          1 http://www.wantsfly.com/prx2.php
          1 http://216.245.205.74/judge.php
    symkat@symkat:~$
    
    

Here we found the top 404 errors on the site. Three of them I should fix or have already fixed, and the last two are someone thinking they can use the server as a proxy (they can't).
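The same kind of report can also be sketched with a single awk pattern doing the work of the first awk and the grep. This assumes, as the command above does, that the request path is field 7 and the status code is field 9. The log lines are made-up sample data, not real lighttpd output.

```shell
# Hypothetical access log lines with the path in field 7 and the status in field 9.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 example.com - [01/Aug/2010:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 404 345
192.0.2.11 example.com - [01/Aug/2010:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120
192.0.2.12 example.com - [01/Aug/2010:10:00:09 +0000] "GET /robots.txt HTTP/1.1" 404 345
192.0.2.13 example.com - [01/Aug/2010:10:00:12 +0000] "GET /favicon.ico HTTP/1.1" 404 345
EOF

# Pattern: status field equals 404. Action: print the requested path.
# The rest of the pipeline counts each path and keeps the most frequent one.
top_404=$(awk '$9 == 404 {print $7}' "$log" | sort | uniq -c | sort -rn | head -n 1)
echo "$top_404"

rm -f "$log"
```

Comparing the status field directly also avoids matching a stray " 404" that happens to appear somewhere else in the line.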

    
    symkat@symkat:~$ grep AllowUsers /var/log/auth.log | awk '{print $7 " " $9}' | sort | uniq -c
        221 root 81.30.185.201.dynamic.ufanet.ru
          8 root vayu1.nci.org.au
    symkat@symkat:~$
    
    

It looks like I have 221 attempts from 81.30.185.201 to log in as root. I use AllowUsers in my sshd_config, so rejected attempts are logged with that keyword; if you don't, you'll be looking for failed login messages instead.

    
    symkat@symkat:~$ ps aux | sort -rnk 3 | head -n 5
    httpd 30317  0.0  6.2  46644 31776 ?        S    Aug01   0:01 /usr/bin/php5-cgi
    httpd 30316  0.0  1.0  33016  5176 ?        Ss   Aug01   0:00 /usr/bin/php5-cgi
    httpd 30315  0.0  0.3  33016  1948 ?        S    Aug01   0:00 /usr/bin/php5-cgi
    httpd 30314  0.0  1.0  33016  5180 ?        Ss   Aug01   0:00 /usr/bin/php5-cgi
    httpd 30313  0.0  0.3  33016  1944 ?        S    Aug01   0:00 /usr/bin/php5-cgi
    symkat@symkat:~$
    
    

CPU usage is the third column; across the board I have nothing heavy on the CPU. If we sort on memory instead (the 4th column):

    
    symkat@symkat:~$ ps aux | sort -rnk 4 | head -n 5
    httpd 30317  0.0  6.2  46644 31776 ?        S    Aug01   0:01 /usr/bin/php5-cgi
    mysql     3650  0.0  3.8 123016 19724 ?        Sl   Jul23   3:33 /usr/sbin/mysqld
    httpd 30316  0.0  1.0  33016  5176 ?        Ss   Aug01   0:00 /usr/bin/php5-cgi
    httpd 30314  0.0  1.0  33016  5180 ?        Ss   Aug01   0:00 /usr/bin/php5-cgi
    httpd 30312  0.0  1.0  33016  5176 ?        Ss   Aug01   0:00 /usr/bin/php5-cgi
    symkat@symkat:~$
    
    

We would see that MySQL and PHP are taking up the most memory.
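The -k argument used in these ps commands tells sort which white space separated field to use as the sort key. Here is a minimal sketch on made-up data; the process names and numbers are hypothetical, and -k 3,3 is used to limit the key to field 3 alone.

```shell
# Hypothetical rows: name, PID, CPU percentage, memory percentage.
procs=$(printf '%s\n' \
  'httpd 30317 0.0 6.2' \
  'mysql 3650 1.5 3.8' \
  'cron 411 0.2 0.1')

# Sort numerically (-n), highest first (-r), using field 3 as the key (-k 3,3).
by_cpu=$(printf '%s\n' "$procs" | sort -rn -k 3,3 | head -n 1)

# The same idea with field 4 as the key.
by_mem=$(printf '%s\n' "$procs" | sort -rn -k 4,4 | head -n 1)

echo "most CPU:    $by_cpu"
echo "most memory: $by_mem"
```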

There are far more advanced and interesting ways to process text, some of which we may cover in future articles. These basics should give you a good foundation for accomplishing most simple text processing tasks.

