Sunday 18 November 2018

Poor man's map-reduce

Sometimes, you've just got to say "fuck it" and go with what you've got. I recently had such a night.

I had a large line-oriented input file (think CSV), a slow processing script, and no time. The script couldn't be optimised easily because each record needed a network call, and I didn't have the time to rework it to use the language's built-in concurrency or parallelism either. I needed answers sooner rather than later; every minute spent processing was a minute I was not in bed ahead of an important meeting where I would present the data and my analysis.

I needed to run my script against two different back-end systems, and a quick back-of-the-envelope estimate put the runtime at over two hours. Two hours that I did not have.

Long story short, there was no time for caution, and it was do-or-die. The answer, of course, was some improvised shell scripting.

To sum up, I had:
  1. A large line-oriented input file
  2. A slow line-oriented processing script
  3. A *nix system
  4. No time for caution

Map-reduce

Map-reduce is a way of processing large input data quickly. The input is split into chunks, processed in parallel, then brought back together at the end.

My problem happened to fit this model, but I had no time to wrestle my script into shape. Instead, I split up my input file, ran the script on each part, and then brought the answers back together manually.

Shell to the rescue!

First, I used split -l to break my input file up into chunks:

split -l 459 input.txt

This gave me ~48 files with names like xaa and xab, each with ~460 lines. Each file would take a little over a minute to process at 6 TPS, which was about the maximum throughput of the processing script.
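The back-of-the-envelope sums, roughly (taking the input as roughly 48 chunks of 459 lines, i.e. ~22,000 lines in total):

echo $(( 48 * 459 ))       # ~22,000 lines in total
echo $(( 48 * 459 / 6 ))   # ~3,700 seconds, i.e. about an hour per back-end at 6 TPS
echo $(( 459 / 6 ))        # ~76 seconds per chunk once all 48 run in parallel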

Next, I launched my script 48 times:

for i in x*
do script.py $i > output-$i &
done

Off it went: 48 processes launched in parallel, each writing to its own output file.
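One small addition worth knowing about: a wait after the loop makes the shell block until every background job has exited, so you know exactly when the outputs are complete. Something like:

wait                        # blocks until all background jobs have exited
echo "all chunks processed"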

Bringing it back together

The output files then needed to be brought back together. In my case, a colleague had written another script to analyse a single output file and generate a report, so I needed to join all of my output files into one.

cat output-x* | sort | uniq > aggregate-output.txt
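(sort -u does the same job as sort | uniq, and can read the files directly, if you prefer fewer processes:)

sort -u output-x* > aggregate-output.txt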

Error handling

Some lines in the input would cause my processing script to crash. Because I needed to do this process twice, I tried two different approaches:
  1. Let it crash and re-process the missed lines later
  2. Code up try/except/continue

The first approach turned out to be frustrating and time consuming. I ended up using wc -l to quickly work out which files had failed, then re-collecting, re-splitting, and re-running after removing some of the offending lines. This was especially difficult because a single chunk could have multiple "poison" lines, so I ended up going down to chunks of just 2 lines. Very annoying.
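For what it's worth, the wc -l check needn't be anything clever; something along these lines does it, assuming the script writes one output line per successfully processed input line:

for i in x*
do
  expected=$(wc -l < $i)
  actual=$(wc -l < output-$i)
  [ $actual -ne $expected ] && echo "$i: only $actual of $expected lines processed"
done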

The second approach was much better: it just needed a little extra print >>sys.stderr, "msg" to log the poison lines as they were skipped. All in, it was comfortably the quicker option in this case.

Conclusion

In the end, I took the processing down from >2 hours to a couple of minutes with only a few moments' 'investment' at the shell.

I would not recommend this to any sane person, except in the fairly narrow circumstances listed above. It's difficult to test, requires manual shell work, and failures can really set you back.

From a software engineering perspective it's ridiculous. On the other hand, it's not stupid if it works.