[coreboot] rfc - gcc builtins and memset memcpy memmove memcmp

Sat Sep 11 19:54:12 CEST 2010

On Sat, Sep 11, 2010 at 9:34 AM, Scott Duplichan <scott at notabs.org> wrote:
> ]-----Original Message-----
> ]From: coreboot-bounces at coreboot.org [mailto:coreboot-bounces at coreboot.org] On Behalf Of Arne Georg Gleditsch
> ]Sent: Saturday, September 11, 2010 06:01 AM
> ]To: Scott Duplichan
> ]Cc: 'Marc Jones'; 'Carl-Daniel Hailfinger'; 'Coreboot'
> ]Subject: Re: [coreboot] rfc - gcc builtins and memset memcpy memmove memcmp
> ]
> ]"Scott Duplichan" <scott at notabs.org> writes:
> ]> In this report:
> ]> http://article.gmane.org/gmane.linux.bios/57707,
> ]> Arne may have been encountering the ClLinesToNbDis issue
> ]> (assuming the memset code was running from flash). Switching
> ]> to rep movs would greatly improve performance because unlike
> ]> a byte loop, rep movs loops in microcode which does not cause
> ]> continuous flash memory accesses.
> ]
> ]This was my assumption as well.  After fixing the ClLinesToNbDis
> ]setting, I have removed the rep stosb code from my tree, and so far I've
> ]not observed the pathological memset behaviour that caused me to put it
> ]in in the first place.  (As mentioned earlier this was never altogether
> ]deterministic, I'm assuming some critical part of the original memset
> ]loop needed to straddle cache lines or something for it to manifest.)
>
> Interesting point about memcpy straddling a cache line boundary. It got
> me thinking about what the DediProg em100 trace function shows when
> booting from SPI flash. With SPI, the SB initially reads a dword at a
> time. If the processor is not caching code, a byte loop memcpy would
> trigger multiple dword reads from the flash chip for every byte copied.
> If BIOS sets SB option PrefetchEnSPIFromHost, then the SB will switch
> to cache line reads, and cache the last line read. Since a byte loop
> memcpy fits in a cache line, it seems conceivable that memcpy performance
> would be good unless the function straddles a cache line boundary. I am
> not sure what the situation is with LPC flash.
>
> Anyway, I noticed coreboot is not setting the AMD SB bit PrefetchEnSPIFromHost.
> For big payloads, setting this bit could cut boot time by eliminating
> overhead when reading big chunks from SPI flash memory.

Oh, we should do that.

But that doesn't really explain why gcc doesn't emit a rep stos or rep
movs (which should hit the cache). That should be an easy optimization
for gcc. It also doesn't address why coreboot has its own functions when
we could use the gcc builtins, which should be optimized for the
architecture they are built for.
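
To make the comparison concrete, here is a minimal sketch of the two
approaches. The function names are made up for illustration and are not
taken from the coreboot tree. Whether gcc turns the builtin into rep stos
depends on the target, the flags and the gcc version; on x86, options
such as -minline-all-stringops and -mstringop-strategy= influence that
choice. With a size that is not known at compile time gcc may simply emit
a call to memset, so an out-of-line implementation may still have to
exist somewhere.

```c
#include <stddef.h>

/* Hypothetical name: a plain byte loop, the kind of generic C memset
 * discussed above. Each iteration fetches the loop instructions again,
 * which is what hurts when code is not cached and runs from flash. */
void *byte_loop_memset(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}

/* Hypothetical name: the same operation through the gcc builtin.
 * The compiler picks the expansion (inline stores, a rep stos sequence,
 * or a library call); nothing here forces a particular one. */
void *builtin_memset(void *dst, int c, size_t n)
{
	return __builtin_memset(dst, c, n);
}
```

Comparing the output of gcc -S for both functions, with and without the
options mentioned above, should show what gcc actually generates for a
given target.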

Marc

-- 
http://se-eng.com