Saturday, May 8, 2010

Completing the Text Processing Arsenal

For months, I have been putting off learning the last tool in my text processing arsenal: awk. After a particularly uneventful afternoon, I decided today was the day. Sleeves rolled up, chips by the table, all sources of interference disposed of, I sat down and finally put the last piece of the jigsaw in place. Today I rant about my text processing arsenal.

Here is a list of the text processing commands at your disposal in Linux, and how you can combine them to serve your purpose.

1) The very basics: echo, cat, less (useful for looking through large logs or files that don't fit on one screen), more (less is really better), head (prints the first few lines of a file; very handy in scripts), tail (with the -f switch, useful for watching logs in real time as they are generated).
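
A minimal sketch of head and tail on a throwaway file. The file and its contents are made up for illustration; nothing here comes from a real log.

```shell
# Build a small sample "log" in a temporary file (hypothetical contents).
log=$(mktemp)
printf 'line %d\n' 1 2 3 4 5 6 > "$log"

first_two=$(head -n 2 "$log")   # the first two lines
last_two=$(tail -n 2 "$log")    # the last two lines

echo "$first_two"
echo "$last_two"

# tail -f "$log" would keep printing lines as they get appended (Ctrl-C to stop).
```

For a file that is longer than a screen, less "$log" is the interactive way to page through it.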

2) Some more basics: grep. grep is oxygen. I use it all the time! I suggest you really go through the linked tutorial and learn grep thoroughly. It has some nice tools in its toolbox (like, umm, the -v switch, which inverts the match and prints only the lines that do not match the regular expression). I have also written a tutorial on grep earlier.
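
To make the -v switch concrete, here is a small sketch on made-up input. The sample lines are assumptions for illustration; -c (count matching lines) is a standard grep option that I find goes well with it.

```shell
# Three sample lines (hypothetical log content).
sample=$(printf 'error: disk full\ninfo: all good\nerror: timeout\n')

matches=$(printf '%s\n' "$sample" | grep 'error')      # lines that match
others=$(printf '%s\n' "$sample" | grep -v 'error')    # lines that do NOT match
count=$(printf '%s\n' "$sample" | grep -c 'error')     # number of matching lines

echo "$matches"
echo "$others"
echo "$count"
```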

3) sort, uniq: I find both of them terribly useful for handling lists and for looping the same action over a list.
# cat package_list | grep "^java-" | sort | uniq | xargs -n 1
This command takes a list of packages, picks out the java packages, eliminates duplicates and hands the names to xargs one at a time, so that a build can be scheduled for each. (The build command goes after -n 1; with no command given, xargs simply echoes each name.)
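
A runnable sketch of the same pipeline, with echo standing in for the build command. The package names are invented, and "would build:" is just a placeholder for whatever actually schedules a build.

```shell
# Hypothetical package list with one duplicate.
packages=$(printf 'java-foo\npython-bar\njava-baz\njava-foo\n')

# Keep the java packages, sort them, drop duplicates,
# then run one command per remaining name.
result=$(printf '%s\n' "$packages" \
  | grep '^java-' \
  | sort \
  | uniq \
  | xargs -n 1 echo 'would build:')

echo "$result"
```

uniq only removes adjacent duplicates, which is why sort comes first; sort -u does both steps in one go.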

4) All the above commands are mainly about reading or searching through already available data. What about selecting or removing only parts of a line, or translating one character to another (like lowercase characters to uppercase)? Not to worry: cut and tr to the rescue. cut picks out parts of a line (by defining a delimiter and selecting fields), and I use tr mostly to convert lowercase to uppercase or to squeeze whitespace. More on that here.
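
A short sketch of cut and tr together. The input line is a made-up record in /etc/passwd style, used only because it has an obvious delimiter.

```shell
# Hypothetical colon-delimited record with extra spaces in field 5.
line='alice:x:1000:1000:Alice   Smith:/home/alice:/bin/bash'

user=$(printf '%s\n' "$line" | cut -d: -f1)                        # field 1
name=$(printf '%s\n' "$line" | cut -d: -f5 | tr -s ' ')            # field 5, runs of spaces squeezed
upper=$(printf '%s\n' "$user" | tr '[:lower:]' '[:upper:]')        # lowercase to uppercase

echo "$user"
echo "$name"
echo "$upper"
```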

5) Getting your hands really dirty. All of the above commands really enhance your ability to mine data from a file: picking out interesting lines using grep, selecting only part of them using cut, making it presentable using tr, and so on. What about actual editing in shell scripts? Something like find and replace? Well, *lightning and thunder* sed is here! sed (which stands for stream editor) has saved me a lot of time. The core idea of sed is search and replace, and it offers VERY advanced search patterns. You can specify regular expressions, search only a part of a file, replace the first or all occurrences, and a lot more. A VERY awesome tutorial is here. After grep, sed is the most useful tool I know.
e.g. Converting fs_get to xfs_get in the entire source was never easier than this:
# find . -name "*.[cChH]" | xargs sed -i 's/fs_get/xfs_get/g'
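
A small sketch of the "first or all occurrences" and "only a part of a file" options, run on made-up text instead of real source files so that nothing gets edited in place.

```shell
# Two lines of hypothetical source text.
text=$(printf 'fs_get(a); fs_get(b);\nfs_get(c);\n')

# Without the g flag, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/fs_get/xfs_get/')

# With g, every match on each line is replaced.
all=$(printf '%s\n' "$text" | sed 's/fs_get/xfs_get/g')

# An address in front of the command limits it to part of the input (here: line 2).
line2_only=$(printf '%s\n' "$text" | sed '2s/fs_get/xfs_get/g')

echo "$first_only"
echo "$all"
echo "$line2_only"
```

Trying a substitution this way before adding -i is cheap insurance, since -i rewrites the files.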

6) All this is fine. But what about the advanced data mining and report generation tasks? awk is here. awk can process tables and columns of data. I mostly use awk to select and reformat text.
e.g. If you look at your /var/log/messages file, the 5th field is always the process name. Suppose I wanted to find out how many times ntpd synced to the latest time. I can run
# cat /var/log/messages | awk '$5 ~ /^ntpd/' | grep "kernel time sync" | wc -l
Don't be intimidated by the awk syntax; it's simple, really. Just invest 10 minutes in reading this tutorial here.
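
The same idea as a self-contained sketch, with the grep and wc steps folded into awk. The three log lines are invented to look like /var/log/messages entries; your real log will differ.

```shell
# Hypothetical syslog-style lines: month, day, time, host, process, message.
log=$(printf '%s\n' \
  'May  8 10:00:01 host ntpd[123]: kernel time sync enabled' \
  'May  8 10:00:02 host sshd[456]: Accepted password for user' \
  'May  8 10:05:00 host ntpd[123]: kernel time sync status change')

# Count lines whose 5th field starts with ntpd and that mention the sync message.
count=$(printf '%s\n' "$log" \
  | awk '$5 ~ /^ntpd/ && /kernel time sync/ { n++ } END { print n + 0 }')

# Select and reformat: print only the time and process name of sshd lines.
procs=$(printf '%s\n' "$log" | awk '$5 ~ /^sshd/ { print $3, $5 }')

echo "$count"
echo "$procs"
```

awk splits each line on runs of whitespace by default, so the double space in "May  8" does not shift the field numbers.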

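7) Lastly, the most insignificant command, one that I have never used in my life: join. It does exactly what a database join does: it merges the lines of two files that share a common field, provided both files are sorted on that field.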
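
For completeness, a minimal sketch of join on two tiny made-up files. The file names and contents are assumptions; both files are already sorted on the first field, which join expects.

```shell
dir=$(mktemp -d)

# Hypothetical "tables": id + name, and id + team.
printf '1 alice\n2 bob\n3 carol\n' > "$dir/names"
printf '1 dev\n3 ops\n'            > "$dir/teams"

# Join on the first field; ids present in only one file are dropped by default.
joined=$(join "$dir/names" "$dir/teams")

echo "$joined"
```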

So, this is my text processing arsenal. With echo, cat, grep, sed and awk I could even write a mini-database which would have the worst performance ever, but would work just fine nevertheless. Insert, Delete, Edit, Query and Join can all be implemented using just these commands. Between them, they cover almost all text processing requirements.
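
A rough sketch of what such a mini-database could look like, assuming a flat file with one id:name:age record per line. The record layout and the data are invented for illustration, and mv is used only to write the edited copy back.

```shell
db=$(mktemp)

# Insert: append one record per line (id:name:age).
echo '1:alice:30' >> "$db"
echo '2:bob:25'   >> "$db"
echo '3:carol:41' >> "$db"

# Query: fetch the record whose id is 2.
found=$(grep '^2:' "$db")

# Edit: set the age of record 2 to 26, rewriting the file via a temporary copy.
awk -F: -v OFS=: '$1 == 2 { $3 = 26 } { print }' "$db" > "$db.tmp" && mv "$db.tmp" "$db"

# Delete: drop record 1.
sed '/^1:/d' "$db" > "$db.tmp" && mv "$db.tmp" "$db"

remaining=$(cat "$db")
echo "$found"
echo "$remaining"
```

Every edit rewrites the whole file, which is exactly where the "worst performance ever" comes from.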

Finally, I would like to add a few more commands that I rarely use, that are on my wishlist, or that may be useful to my readers.

8) dos2unix, unix2dos: DOS likes to write a newline as "\r\n". UNIX prefers "\n". This can lead to really interesting issues in Makefiles and the like when files are written on Windows and executed on Linux. These two utilities convert from one format to the other. iconv is a generic converter between a large number of character encodings.
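
A small sketch of the problem and one way around it when dos2unix is not installed: tr can delete the carriage returns. This is a swapped-in technique, not what dos2unix does internally, and the Makefile-like content is made up.

```shell
f=$(mktemp)
cr=$(printf '\r')   # a literal carriage return character

# Write two lines with DOS-style \r\n endings (hypothetical content).
printf 'all:\r\n\techo hi\r\n' > "$f"

before=$(grep -c "$cr" "$f")                 # lines containing a carriage return
tr -d '\r' < "$f" > "$f.unix"                # strip every carriage return
after=$(grep -c "$cr" "$f.unix" || true)     # grep exits non-zero when the count is 0

echo "lines with CR: before=$before after=$after"

# With dos2unix installed, dos2unix "$f" would convert the file in place.
```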

9) gettext: This is on my wishlist, but I don't see myself learning gettext anytime in the near future. gettext is used to localize a program into your own language or a foreign language.

That's all for now, folks. Comments/feedback appreciated.


Saurabh Shah said...

Thanks for the post. I did sed and awk at 2.00 am just the day before an imp interview and realized how powerful it was. But I didn't intend to. I was trying to find the average age of all employees in a records-file. I couldn't do it gracefully by simple means and wasted an hour. Then God said "Let there be awk" and the solution was just a one-liner :)

Try this: Sort a file with a billion records on a dual-core in the fastest possible way. (choose your own file structure). A one-liner would be to use "sort" but it'll waste one core. (Hint: use split, sort and merge).

Saurabh Shah said...

Also having a copy of Sumitabha Das's book would help a lot. It's an awesome book for shell scripting.

Jitesh Shah said...

you're obsessed by sorting a billion records :-P

Saurabh Shah said...

The interviewers are :P .. (and btw, that's the only thing I know :P )

Maithili said...

good one! thanks ... keep posting.. sometimes reading blog is more fun than tutorials or man pages..(it doesn't feel like studying :P)

Jitesh Shah said...