small problem:
QCIF, MP, 3 frame window: 176*144*3*(16+36) = 4.0e6 nonzero coefs, 176*144*3*3 = 2.3e5 variables

worst case problem size at typical settings:
SD, HP, 10 frame window: 720*480*10*(64+72) = 4.7e8 nonzero coefs, 720*480*10*3 = 1.0e7 variables

worst case problem size at max settings:
HD, HP, 10 frame window: 1920*1080*10*(64+72) = 2.8e9 nonzero coefs, 1920*1080*10*3 = 6.2e7 variables
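
The size estimates above all follow one pattern: width * height * frames, times a per-pixel factor. A minimal sketch that reproduces them is below. The function name `problem_size` and its parameter names are hypothetical, and the per-pixel factors (16+36, 64+72, and 3 variables per pixel) are taken from the figures above as given, without assuming what each term stands for.

```python
def problem_size(width, height, frames, coefs_per_pixel, vars_per_pixel=3):
    """Return (nonzero_coefs, variables) for a window of frames.

    coefs_per_pixel and vars_per_pixel are the per-pixel factors used in
    the notes; their breakdown is not modelled here.
    """
    pixels = width * height * frames
    return pixels * coefs_per_pixel, pixels * vars_per_pixel


# (width, height, frames in window, nonzero coefs per pixel)
cases = {
    "QCIF, MP, 3 frame window": (176, 144, 3, 16 + 36),
    "SD, HP, 10 frame window": (720, 480, 10, 64 + 72),
    "HD, HP, 10 frame window": (1920, 1080, 10, 64 + 72),
}

for name, (w, h, n, c) in cases.items():
    nnz, nvar = problem_size(w, h, n, c)
    print(f"{name}: {nnz:.1e} nonzero coefs, {nvar:.1e} variables")
```

Keeping the per-pixel factor as a single argument makes it easy to try other settings without guessing at how the factor decomposes.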

minimum problem (one dct block, no bitrate metric):
4096 nonzero coefs, 128 variables

x264 qcif fast: 350 fps, 2 MB ram
x264 qcif hq-insane: 28 fps, 5 MB ram
xlsc qcif fast: 1200 fps, 2 MB ram
xlsc qcif normal: 950 fps, 2 MB ram
xlsc qcif lookahead: 0.11 fps, 170 MB ram

threads:1 669.62user  9.80sys 679.77real  99%
threads:2 920.78user 13.18sys 494.43real 186%
most of the extra cpu-time comes from extra iterations. convergence slows down
when it gets very close to the end, but I think if I reduced the termination
threshold a little then there wouldn't be much wasted computation with threads.
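
To put numbers on the threading overhead, the two timing lines above can be reduced to a wall-clock speedup, a cpu-time ratio, and a per-thread efficiency. This is a rough sketch only: the helper name `thread_scaling` is hypothetical, and it assumes the two runs encoded the same input so that the times are directly comparable.

```python
def thread_scaling(base, run, threads):
    """Compare a threaded run against a single-thread baseline.

    base and run are (user, sys, real) times in seconds.
    Returns (speedup, cpu_overhead, efficiency) where
      speedup      = baseline real time / threaded real time
      cpu_overhead = threaded cpu time / baseline cpu time
      efficiency   = speedup / threads
    """
    base_cpu = base[0] + base[1]
    run_cpu = run[0] + run[1]
    speedup = base[2] / run[2]
    cpu_overhead = run_cpu / base_cpu
    efficiency = speedup / threads
    return speedup, cpu_overhead, efficiency


# (user, sys, real) copied from the timing lines above
one_thread = (669.62, 9.80, 679.77)
two_threads = (920.78, 13.18, 494.43)

speedup, overhead, eff = thread_scaling(one_thread, two_threads, 2)
print(f"speedup {speedup:.2f}x, cpu-time x{overhead:.2f}, efficiency {eff:.0%}")
```

With these figures the wall-clock speedup and the cpu-time growth both come out at roughly 1.37x, which fits the observation that the second thread's time is mostly absorbed by extra iterations.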

lookahead-lambda=.014
