Low cost cluster
Jonathan Morton
chromi at chromatix.demon.co.uk
Tue Feb 25 03:11:01 CET 2003
>>If I really wanted to push the CPU-power envelope, I'd get a refurbished
>>PowerMac. I still might. The cluster would give me an aggregate 2500
>>MFLOPS, while a previous-model Mac would give me 3500 for about the
>
>is this supposed to be funny or do you really believe S. Job's jokes?
I currently have a G3, on which I benchmarked the main component of my
current algorithm, obtaining 0.5 MFLOPS/MHz.
I then disassembled the code and noted that the FP load, multiply-add,
and store instructions could be replaced by their AltiVec
equivalents, which operate on four times as many operands and
are equally fast. I even checked the dispatcher rules to make sure
that would not be a bottleneck - the G4+ is able to dispatch a vector
multiply-add and a vector load/store, plus an integer operation (say,
pointer arithmetic) and a branch if required, all in the same clock
cycle.
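
To illustrate the substitution described above, here is a minimal sketch
in plain Python. It is a model, not AltiVec code, and not the benchmarked
algorithm: the saxpy-style loop and the names scalar_madd, vector_madd,
and VECTOR_WIDTH are assumptions made for illustration. Each loop
iteration stands in for one load, multiply-add, store group, so the step
count shows why four operands per instruction cuts the instruction count
to a quarter.

```python
VECTOR_WIDTH = 4  # operands per vector instruction, as stated in the post


def scalar_madd(a, x, y):
    """Compute a * x[i] + y[i] one element per step (scalar FP model)."""
    out = list(y)
    steps = 0
    for i in range(len(x)):
        out[i] = a * x[i] + out[i]
        steps += 1  # one load / multiply-add / store group
    return out, steps


def vector_madd(a, x, y):
    """Same computation, VECTOR_WIDTH elements per step (vector model)."""
    if len(x) % VECTOR_WIDTH != 0:
        raise ValueError("length must be a multiple of VECTOR_WIDTH")
    out = list(y)
    steps = 0
    for i in range(0, len(x), VECTOR_WIDTH):
        chunk_x = x[i:i + VECTOR_WIDTH]
        chunk_y = out[i:i + VECTOR_WIDTH]
        out[i:i + VECTOR_WIDTH] = [a * xi + yi for xi, yi in zip(chunk_x, chunk_y)]
        steps += 1  # one vector load / multiply-add / store group
    return out, steps


if __name__ == "__main__":
    x = [float(i) for i in range(8)]
    y = [1.0] * 8
    scalar_result, scalar_steps = scalar_madd(2.0, x, y)
    vector_result, vector_steps = vector_madd(2.0, x, y)
    print(scalar_result == vector_result, scalar_steps, vector_steps)
```

The model ignores loop overhead, alignment, and memory latency; it only
counts instruction groups, which is the quantity the argument relies on.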
Thus I obtained an effective performance figure of 2.0 MFLOPS/MHz for
the G4 (four times the scalar 0.5), versus an Athlon XP figure of 0.5
(for both x87 and SSE) and a Pentium 4 figure of 0.4 (for SSE). This is
not hype - this is me reading the documentation and doing the maths.
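
The arithmetic behind these figures can be written out as a short
sketch. The constants come from this post; the helper names
vector_rate and aggregate_mflops are hypothetical, and the calculation
assumes the G4's scalar rate per MHz matches the rate measured on the
G3. The 1750 MHz value is simply what the 3500 MFLOPS quoted above
implies at 2.0 MFLOPS/MHz, not a figure for any particular machine.

```python
G3_SCALAR_RATE = 0.5   # MFLOPS/MHz, measured on the G3
ALTIVEC_WIDTH = 4      # operands per vector instruction
ATHLON_XP_RATE = 0.5   # MFLOPS/MHz, x87 and SSE
PENTIUM_4_RATE = 0.4   # MFLOPS/MHz, SSE


def vector_rate(scalar_rate, width):
    """Per-MHz rate if each instruction handles `width` operands at equal speed."""
    return scalar_rate * width


def aggregate_mflops(rate_per_mhz, clock_mhz):
    """Total MFLOPS for a given aggregate clock (summed over CPUs)."""
    return rate_per_mhz * clock_mhz


g4_rate = vector_rate(G3_SCALAR_RATE, ALTIVEC_WIDTH)
implied_clock_mhz = 3500 / g4_rate  # derived, not a quoted specification

if __name__ == "__main__":
    print("G4 rate (MFLOPS/MHz):", g4_rate)
    print("Ratio to Athlon XP:", g4_rate / ATHLON_XP_RATE)
    print("Ratio to Pentium 4:", round(g4_rate / PENTIUM_4_RATE, 2))
    print("Aggregate clock implied by 3500 MFLOPS (MHz):", implied_clock_mhz)
```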
Note that all figures assume the working set fits in cache. I
believe I can ensure this, and it's also considerably easier to
achieve with a Mac's 1 MB L3 cache.
--
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: chromi at chromatix.demon.co.uk
website: http://www.chromatix.uklinux.net/
tagline: The key to knowledge is not to rely on people to teach you it.
More information about the coreboot
mailing list