Monday, February 1, 2010

Left-Fold for Bash

I'd like to share a recent bash programming experience. It began while I was processing the reams of data generated in my M.Sc. research study. I was producing long lists of frequency data in text files, and needed to sum up all the lines in these tables. This is of course a trivial problem in many languages, and I had a wide array of options available to me:
  • I could write another ant task to do the summing (this required more work than I was willing to invest, as ant isn't really suited for computational tasks)
  • I could write a python script that took the contents of the specified file and returned the sum. I didn't really like this approach because it involved yet another file in my build process, invoked from the ant task. I always sort of thought that if you were forced to use , you were performing a task beyond the scope of your tool.
  • I could skip the generation of the table and go straight to the sum. A few of the tables were created with XSLT, so this was a valid option. However, my XSLT programming ability is very much trial-and-error based, so I thought this might take some time. Also, some of the other tables, created with grep, would not be affected.
  • I could write it in shell. I liked this idea. I really liked the feel of being able to just pipe something to 'sum' and have the sum returned. So this is what I chose.
My first version looked like this:

sum=0
while read line
do
    sum=$(($sum + $line))
done

echo $sum
exit 0
It did the trick quite well. I had suggestions from office mates for the following alternatives:
Using python (courtesy Aran Donohue):

python -c "import sys; print(sum(float(x) for x in sys.stdin))"

Using tr (courtesy Zak Kincaid):

cat numbers | tr '\n' '+' | head --bytes='-1' | bc
(Note that this version doesn't quite work: bc throws a syntax error, and I'm not exactly sure why.)
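
One possible explanation, offered as an assumption rather than a diagnosis: tr turns the final newline into a '+', and once head strips that '+', the expression reaches bc with no terminating newline. Here is a minimal sketch that sidesteps the issue by letting paste do the joining, since paste ends its output with a newline. The helper name join_with_plus is hypothetical.

```shell
# join_with_plus: hypothetical helper, not part of the original pipeline.
# Reads lines from stdin and joins them with '+', so "1\n2\n3\n"
# becomes "1+2+3" followed by a single newline.
join_with_plus() {
    paste -s -d '+' -
}

# Intended use (assumes bc is installed):
#   join_with_plus < numbers | bc
```

Because the joined expression is newline-terminated, it can be piped to bc without any trimming step.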

Using my original design, I realized that if I abstracted out the operator, I could use this script to perform any 2-operand function I wished on the list, essentially creating a basic left fold:

op="$1"
if [ "$1" == "" ]; then
    op="+"    # default to addition when no operator is given
fi

while read line
do
    sum=$(($sum $op $line))
done

echo $sum
exit 0

I've done a little bit of error checking, to see if the parameter supplied is blank, and if so replace it with + by default.

I've only seen this work for '+' and '-'. If I use '*', it breaks because the shell replaces the wildcard/multiplication character with the listing of the current directory.
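
One way around the '*' problem might look like the sketch below. It is not the script above but a variant under a few assumptions: the function name lfold and the operator whitelist are made up for illustration, the accumulator is seeded with the first input line instead of starting empty, and the input is assumed to be one integer per line.

```shell
# lfold OP: left-fold the integers on stdin with the binary operator OP.
# The name "lfold" and the list of accepted operators are assumptions
# made for this sketch.
lfold() {
    local op="${1:-+}"    # default to addition when no operator is given
    local acc line

    # Accept only a handful of integer operators; the quoted '*' is
    # matched literally here, not treated as a pattern.
    case "$op" in
        '+'|'-'|'*'|'/'|'%') ;;
        *) echo "lfold: unsupported operator: $op" >&2; return 1 ;;
    esac

    # Seed the accumulator with the first line so that '*' does not
    # start from zero. Empty input produces no output.
    IFS= read -r acc || [ -n "$acc" ] || return 0

    # Fold the remaining lines; the second test keeps a final line that
    # lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        acc=$(( acc $op line ))
    done

    echo "$acc"
}
```

With the function loaded, lfold '*' < numbers multiplies the lines together. The quotes around '*' matter: they stop the calling shell from expanding it into filenames before lfold ever sees it.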


zkincaid said...

Two (less-than-awesome) fixes for the tr/bc solution:

echo `tr '\n' '+' | head --bytes='-1'` | bc

(tr '\n' '+' | head --bytes='-1' && echo) | bc

... just need to get that newline in at the end...

Ernie said...

It's almost like someone should write a book on working with data ... :)