Low cost cluster

Jonathan Morton chromi at chromatix.demon.co.uk
Tue Feb 25 03:11:01 CET 2003

>>If I really wanted to push the CPU-power envelope, I'd get a refurbished
>>PowerMac.  I still might.  The cluster would give me an aggregate 2500
>>MFLOPS, while a previous-model Mac would give me 3500 for about the
>is this supposed to be funny or do you really believe S. Job's jokes?

I currently have a G3, on which I benchmarked the main component of 
my current algorithm, obtaining 0.5 MFLOPS/MHz.

I then disassembled the code and noted that FP load, multiply-add, 
and store instructions could be replaced by their AltiVec 
equivalents, which operate on four times the number of operands and 
are equally fast.  I even checked the dispatcher rules to make sure 
that would not be a bottleneck - the G4+ is able to dispatch a vector 
multiply-add and a vector load/store, plus an integer operation (say, 
pointer arithmetic) and a branch if required, all in the same clock 
cycle.
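
A rough model of the counting argument above, written in Python rather 
than PowerPC assembly. It is only a sketch: the function names are 
made up for illustration, and the lane count of four assumes 
single-precision values in a 128-bit vector register. Each block of 
four elements stands in for one vector multiply-add.

```python
LANES = 4  # assumption: four single-precision floats per vector register


def scalar_madd(a, b, c):
    """Compute a[i] * b[i] + c[i] one element at a time.

    Returns the results and the number of multiply-add operations issued.
    """
    out = [x * y + z for x, y, z in zip(a, b, c)]
    return out, len(out)


def vector_madd(a, b, c, lanes=LANES):
    """Same arithmetic, grouped into blocks of `lanes` elements.

    Each block models a single vector multiply-add instruction, so the
    issue count is len(a) / lanes instead of len(a).
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("inputs must have the same length")
    if len(a) % lanes:
        raise ValueError("length must be a multiple of the lane count")
    out = []
    issued = 0
    for i in range(0, len(a), lanes):
        block = zip(a[i:i + lanes], b[i:i + lanes], c[i:i + lanes])
        out.extend(x * y + z for x, y, z in block)
        issued += 1  # one modelled vector instruction per block
    return out, issued
```

For the same input, both functions return identical results, but the 
vector model issues a quarter as many multiply-adds.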

Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz 
(four times the measured 0.5), versus an Athlon-XP figure of 0.5 (for 
both x87 and SSE) and a Pentium-4 figure of 0.4 (for SSE).  This is 
not hype - this is me reading the documentation and doing the maths.
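
The arithmetic behind those figures can be written out as a short 
Python sketch. The per-MHz rates are the ones quoted above; the 
function names and the 1000 MHz clock in the usage note are 
placeholders for illustration, not figures from any particular 
machine.

```python
G3_MEASURED = 0.5   # MFLOPS/MHz, the benchmarked G3 figure quoted above
VECTOR_WIDTH = 4    # operands per AltiVec instruction, as stated above
ATHLON_XP = 0.5     # MFLOPS/MHz, quoted above (x87 and SSE)
PENTIUM_4 = 0.4     # MFLOPS/MHz, quoted above (SSE)


def projected_rate(scalar_rate, width):
    """Scale a scalar per-MHz rate by the vector width.

    Assumes vector instructions issue as fast as the scalar ones
    and that the working set stays in cache.
    """
    return scalar_rate * width


def mflops(rate_per_mhz, clock_mhz):
    """Convert a per-MHz rate into MFLOPS at a given clock speed."""
    return rate_per_mhz * clock_mhz


G4_PROJECTED = projected_rate(G3_MEASURED, VECTOR_WIDTH)  # 0.5 * 4
```

For example, `mflops(G4_PROJECTED, 1000)` gives the projected MFLOPS 
at a hypothetical 1000 MHz clock, and `G4_PROJECTED / PENTIUM_4` gives 
the per-clock advantage over the Pentium-4 figure.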

Note that all figures assume the working set fits in cache.  I 
believe I can ensure this, and it's also considerably easier to 
achieve with a Mac's 1MB L3.
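
A minimal sketch of the cache-fit check, in Python. The array count 
and the 4-byte element size are assumptions for illustration (the 
actual data layout is not described here); only the 1MB L3 size comes 
from the text above.

```python
L3_BYTES = 1024 * 1024   # the 1MB L3 mentioned above
FLOAT_BYTES = 4          # assumption: single-precision elements


def working_set_bytes(n_elements, n_arrays, elem_bytes=FLOAT_BYTES):
    """Total bytes touched by n_arrays arrays of n_elements each."""
    return n_elements * n_arrays * elem_bytes


def fits_in_cache(n_elements, n_arrays, cache_bytes=L3_BYTES,
                  elem_bytes=FLOAT_BYTES):
    """True if the whole working set is no larger than the cache."""
    return working_set_bytes(n_elements, n_arrays, elem_bytes) <= cache_bytes


def max_elements(n_arrays, cache_bytes=L3_BYTES, elem_bytes=FLOAT_BYTES):
    """Largest per-array element count that still fits in the cache."""
    return cache_bytes // (n_arrays * elem_bytes)
```

With three arrays (a hypothetical two inputs and one output), 
`max_elements(3)` gives the largest array length that keeps the 
working set within 1MB. This ignores code, stack, and anything else 
competing for the cache.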

from:     Jonathan "Chromatix" Morton
mail:     chromi at chromatix.demon.co.uk
website:  http://www.chromatix.uklinux.net/
tagline:  The key to knowledge is not to rely on people to teach you it.
