Tuesday, September 4, 2012

Parsing tricky whitespace with awk.

I spent a bunch of time trying to make awk properly parse some tricky whitespace in a text file. Essentially, I have a backup log file with a list of files transferred. Here is some example output:


  create d 755       0/0         512 .ccache
  create d 755       0/0         512 .ccache/0
  create d 755       0/0         512 .ccache/0/0
  same     644       0/0        6558 .ccache/0/0/1ebefd822077669fa42316da42e2c4-11870.manifest
  same     644       0/0        8251 .ccache/0/0/5ae5f481490e100d6d85b693aee8df-3158.manifest
  same     644       0/0        8407 .ccache/0/0/70b8ebd6f2cb5085c0f828e9145216-5033.manifest
  same     644       0/0         359 .ccache/0/0/7de2817532c5d1365a70d8dec4e378-30356.manifest

I want to grab the file size from the 4th column, so that I can sort on it and find large files. The problem is, if I use a simple awk statement like

awk '{print $4}'

the file size is either the fourth or fifth column, depending on whether the first (logical) column has a space in it. I tried accounting for this with more complex awk, like so

awk 'BEGIN {FS="[ ]{2,}"} { print $4 }'

but to no avail. Admittedly, my awk-fu is weak. I did find a solution, or rather a colleague came up with one.
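A likely reason the regex separator did not help, assuming the two leading spaces in the sample are really in the log: the leading spaces produce an empty first field, and "create d 755" only has single spaces inside it, so it stays glued together as one field. The sketch below prints the field count and $4 for two of the sample lines. It uses FS="  +" as a stand-in for "[ ]{2,}", since some awk builds do not handle the {2,} form; the sample_log function is just a hypothetical helper that replays the lines.

```shell
# Two sample lines copied from the log excerpt (leading spaces preserved).
sample_log() {
cat <<'EOF'
  create d 755       0/0         512 .ccache
  same     644       0/0        6558 .ccache/0/0/1ebefd822077669fa42316da42e2c4-11870.manifest
EOF
}

# FS="  +" is two spaces followed by "+", i.e. "two or more spaces".
# It is a portable stand-in for "[ ]{2,}".
# Print the field count and $4 in brackets to show where the size ended up.
fields=$(sample_log | awk 'BEGIN { FS = "  +" } { print NF ": [" $4 "]" }')
printf '%s\n' "$fields"
# prints:
# 4: [512 .ccache]
# 5: [0/0]
```

So the field count still differs between the two kinds of line, and $4 is never the bare size. The workaround that follows sidesteps the problem by gluing "create d" into a single token before awk ever sees it.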


sed "s/create d/created/" | awk '{print $4 }'

A bit outside of the box, but clever enough that I want to remember it.
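For the record, here is a rough sketch of how the whole thing might look when finding the largest files. The sample_log function is a hypothetical stand-in for the real log, and the sort and head stages are my assumption about the final step, not something from the colleague's one-liner. It also assumes no path contains spaces, since $5 would cut such a path short.

```shell
# A few sample lines copied from the log excerpt.
sample_log() {
cat <<'EOF'
  create d 755       0/0         512 .ccache
  same     644       0/0        6558 .ccache/0/0/1ebefd822077669fa42316da42e2c4-11870.manifest
  same     644       0/0        8407 .ccache/0/0/70b8ebd6f2cb5085c0f828e9145216-5033.manifest
  same     644       0/0         359 .ccache/0/0/7de2817532c5d1365a70d8dec4e378-30356.manifest
EOF
}

# 1. sed joins "create d" into one token so the size is always field 4.
# 2. awk prints the size followed by the path.
# 3. sort -rn orders numerically, largest first; head keeps the top three.
largest=$(sample_log \
  | sed 's/create d/created/' \
  | awk '{ print $4, $5 }' \
  | sort -rn \
  | head -n 3)
printf '%s\n' "$largest"
```

With the real log, replace sample_log with cat on the log file. Another option that might work when paths have no spaces is awk '{ print $(NF-1) }', which counts from the end of the line instead of the start.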
