[coreboot] AMD CAR II

Marc Jones marcj303 at gmail.com
Fri May 7 21:54:20 CEST 2010


Hi Rudolf,

Good detailed email. Yes, this is how I think it works, and as far as
I know, the L1 instruction cache also evicts into the L2. The L2 is the
main reason that CAR works at all. I have never been happy with the post_car
code. Something about it doesn't seem right, but I have never found
it. I do think that more care needs to be taken with cache enable/disable
and the MTRR settings.

Marc

On Fri, May 7, 2010 at 12:10 PM, Rudolf Marek <r.marek at assembler.cz> wrote:
>
> Hi all,
>
> I examined a bit how it works. It may help to first read
> http://en.wikipedia.org/wiki/CPU_cache and then continue here :)
>
>
> I was particularly curious because we do a writeback-to-writeback copy of data
> from CAR to RAM (to copy the stack and sysinfo, which must cause L1 evictions),
> we do DQS memory training (which writes to RAM during CAR), and we also use the
> cache to cache the ROM.
>
> This means that not only the L1 is used, we must be using the L2 too. Here are
> some notes on why I think it still works :)
>
> Here is what I found:
>
> The AMD L2 cache is exclusive, meaning it only contains data evicted from the L1
> caches. In other words, the same data is never in both caches. I could not find
> any info on whether this also holds for the icache, i.e. whether icache lines get
> moved to the L2 or not. They should, but it does not seem to happen during CAR.
>
> L1 data cache:
>        Size: 64KB      2-way associative.
>        lines per tag = 1, line size = 64 bytes.
> L1 instruction cache:
>        Size: 64KB      2-way associative.
>        lines per tag = 1, line size = 64 bytes.
>
> 512KB/core
>
> L2 cache:
>        Size: 1024KB    16-way associative.
>        lines per tag = 1, line size = 64 bytes.
>
> Here is the basic math for how the cache organization works out:
>
> Line size => how many bytes are stored in one cache line (this exploits the
> spatial locality of data). Here it is 64 bytes, so address bits 5:0 are used as
> the offset within the line.
>
> Index => selects which set of the cache a line goes into; the number of sets
> follows from the cache size, line size and associativity.
>
> The associativity tells how many lines that compete for the same index can be
> stored in the cache simultaneously.
>
> For the L1 we have: 64*1024 / 64 / 2 = 512 sets. There are 2 ways ("arrays",
> since the associativity is 2), each line is 64 bytes, and the total size is 64KB.
> The index is therefore taken from address bits 14:6. The rest of the address is
> used as the tag (the tag, together with the index, identifies the actual location
> of the data in memory). One can say that addresses differing only above bit 14
> compete for the same index. With an associativity of 2, any contiguous 64KB
> (16 address bits) will exactly fill the whole cache.
>
> For the L2 here it is 512KB with associativity 16. 512KB / 16 ways = 32KB per
> way, and 32KB / 64 bytes per line = 512 sets, so again address bits 14:6 form
> the index. The rest is the tag.
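>
> A minimal sketch of the bit slicing described above (plain C, just for
> illustration; the constants match the L1 geometry given here):
>
> #include <stdint.h>
> #include <stdio.h>
>
> #define LINE_SIZE 64                               /* bytes per line -> offset bits 5:0 */
> #define L1_SIZE   (64 * 1024)                      /* 64KB                              */
> #define L1_ASSOC  2
> #define L1_SETS   (L1_SIZE / LINE_SIZE / L1_ASSOC) /* 512 sets -> index bits 14:6       */
>
> int main(void)
> {
> 	uint32_t addr   = 0x000d1234;
> 	uint32_t offset = addr & (LINE_SIZE - 1);        /* bits 5:0   */
> 	uint32_t index  = (addr / LINE_SIZE) % L1_SETS;  /* bits 14:6  */
> 	uint32_t tag    = addr / LINE_SIZE / L1_SETS;    /* bits 31:15 */
>
> 	printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
> 	return 0;
> }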
>
> The CAR idea on AMD is simply to use the cache while never causing an eviction
> from the L2 cache to main memory (which is not functioning yet).
>
> Step 0) Enable the cache and set WB MTRRs for the relevant ranges.
> 1) All lines start out invalid; validate them with dummy reads over a region
> exactly as big as the L1 data cache (see the sketch right below). For the
> instruction cache an instruction fetch is enough.
> 2) The dummy-read region can now be used to store data - it is simply an
> arbitrary address range, at most 0-64KB.
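>
> A minimal sketch of that dummy-read touch loop, in C only for illustration: the
> real coreboot code does this in assembly (cache_as_ram.inc) before any stack
> exists. CONFIG_DCACHE_RAM_BASE/CONFIG_DCACHE_RAM_SIZE are the same symbols used
> by the memcopy() quoted further down.
>
> #include <stdint.h>
>
> static void car_touch_lines(void)
> {
> 	volatile uint8_t *p = (volatile uint8_t *)CONFIG_DCACHE_RAM_BASE;
> 	unsigned int i;
>
> 	/* one read per 64-byte line allocates that line in the L1 data cache */
> 	for (i = 0; i < CONFIG_DCACHE_RAM_SIZE; i += 64)
> 		(void)p[i];
> }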
>
> 3) Caching of the ROM works too, because:
>
> a) An MTRR for the ROM is set (currently only for part of it). It could be of WP
> type, but we use WB; no harm here because we do not modify any code ;)
> b) The L1 instruction cache is filled directly from the flash chip (remember the
> L2 is an exclusive cache on AMD).
> c) If the L1 instruction cache is not evicted into L2, then on a cache miss the
> L1 line is simply invalidated and refilled from the flash ROM. I tried to check
> this using performance counters, but there is no counter for this. This is the
> uninteresting case because it does not complicate anything.
>
> d) If the L1 instruction cache does get evicted into L2 (which I don't know to
> be true),
>
> then we can run into the following:
>
> I) No L1 data cache lines were evicted into L2 - again not an interesting case
> because nothing goes wrong.
>
> II) Some L1 data cache lines were evicted into L2. This really happens in our CAR!
> print_debug("Copying data from cache to RAM -- switching to use RAM as stack...");
> memcopy((void *)((CONFIG_RAMTOP) - CONFIG_DCACHE_RAM_SIZE),
>         (void *)CONFIG_DCACHE_RAM_BASE, CONFIG_DCACHE_RAM_SIZE);
>
> It happens here because we copy from the CAR region to RAM while CAR is still
> running. Both regions are WB, so we must evict some L1 cache lines for sure, and
> performance counters confirm this. You may say this is not an issue because RAM
> is already running normally, but for example while resuming from S3 we must not
> overwrite random memory with our CAR data... I think these evictions so far
> happen only here, and things still work nicely; here is why:
>
> We have at most 64KB of dirty data; we can spread it into the L2 nicely and
> still have a lot of free space, even on systems with only a 128KB L2. In this
> case there are no evictions to the system because all the data can stay in L2.
>
> Now let's go back: what if the instruction cache does get evicted into L2? That
> could cause problems, because the L2 would then hold both the L1 data cache
> contents and random L1 instruction cache code competing for the same space.
>
> I think it works here because dirty data is evicted with the lowest priority. I
> think that if all ways of a set are full, a way holding "clean" data is
> invalidated first. This saves the day for us because it guarantees that our L1
> data will never fall out of the cache - unless we exceed the L2 cache size with
> dirty data.
>
> So far we have examined the ROM caching and the handling of data spilled from
> L1. But the memory training writes to not-yet-initialized RAM. How does that
> work?
>
> I checked, and the memory write uses an instruction which bypasses the caches.
> The read goes through the cache, but the cache line is invalidated afterwards
> (a sketch of the pattern is below). Again, because we have at most an L1's worth
> of dirty data and the L2 is big enough, it does not spoil the party and nothing
> gets evicted back to the non-functioning memory.
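>
> Something like the following, using SSE2 intrinsics, illustrates the pattern:
> the training write bypasses the cache (non-temporal store), the read-back goes
> through the cache but the line is flushed right after, so no dirty training data
> can ever be evicted to DRAM. This is only a sketch of the idea, not the actual
> raminit code; the exact instructions used there are not re-checked here.
>
> #include <emmintrin.h>
> #include <stdint.h>
>
> static uint32_t train_one_location(uint32_t *dram, uint32_t pattern)
> {
> 	uint32_t readback;
>
> 	_mm_stream_si32((int *)dram, (int)pattern); /* write, cache bypassed     */
> 	_mm_sfence();                               /* make the NT store visible */
>
> 	readback = *(volatile uint32_t *)dram;      /* read allocates a clean line */
> 	_mm_clflush(dram);                          /* and drop that line again    */
>
> 	return readback;
> }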
>
> The last thing which worries me is speculative fills, which the CPU can do on
> its own. I think they are disabled because the bit for probe fills is 0. Family
> 11h, which has better documented use of the L2 as general storage, needs some
> other bits toggled to avoid extra speculation. The Fam10h documentation describes
> only L1 CAR, and the older families likewise describe L1-only CAR, yet in our
> code we effectively use the L2 in all cases.
>
> What we could do is program a performance counter for L2 writebacks to the
> system at the beginning of CAR, and at CAR disable check whether it is still
> zero (a sketch is below). That would tell us whether we did something nasty.
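>
> A sketch of that check using coreboot's rdmsr()/wrmsr() helpers. PERF_EVT_SEL0
> and PERF_CTR0 are the usual AMD MSRs (0xC0010000/0xC0010004); the event and unit
> mask for "L2 writebacks to the system" are assumptions and would need to be
> verified against the BKDG before relying on them.
>
> #include <cpu/x86/msr.h>
>
> #define PERF_EVT_SEL0	0xC0010000
> #define PERF_CTR0	0xC0010004
>
> static void car_wb_counter_start(void)
> {
> 	msr_t zero = { .lo = 0, .hi = 0 };
> 	msr_t sel  = { .lo = 0, .hi = 0 };
>
> 	wrmsr(PERF_CTR0, zero);			/* clear the counter                  */
> 	sel.lo = (1 << 22)			/* EN: enable the counter             */
> 	       | (1 << 17)			/* OS: count CPL0 accesses            */
> 	       | (0x02 << 8)			/* unit mask: writebacks (assumed)    */
> 	       | 0x7F;				/* event: L2 fill/writeback (assumed) */
> 	wrmsr(PERF_EVT_SEL0, sel);
> }
>
> static int car_did_no_writebacks(void)
> {
> 	msr_t ctr = rdmsr(PERF_CTR0);
> 	return ctr.lo == 0 && ctr.hi == 0;	/* 1 = nothing went out to the system */
> }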
>
> We could also avoid the WB-to-WB copy of the CAR area. I tried a WB-to-UC copy
> and we then get 0 evictions from L1, which is fine (I did some experiments in
> January; see the "AMD CAR questions" email). A sketch of one way to set that up
> is below.
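>
> One way to get a WB-to-UC copy is to cover the copy destination with a variable
> MTRR of type UC before the memcopy(), so the stores go straight to DRAM and never
> allocate lines that compete with the CAR data. This is only a sketch with raw MSR
> numbers; whether it matches what was actually tested in January is an assumption,
> and real code has to follow the documented MTRR update sequence around the wrmsr.
>
> #include <stdint.h>
> #include <cpu/x86/msr.h>
>
> #define MTRR_PHYS_BASE_MSR(i)	(0x200 + 2 * (i))
> #define MTRR_PHYS_MASK_MSR(i)	(0x201 + 2 * (i))
> #define MTRR_TYPE_UC		0
>
> /* Mark a naturally aligned, power-of-two sized range UC via variable MTRR i. */
> static void set_var_mtrr_uc(unsigned int i, uint32_t base, uint32_t size)
> {
> 	msr_t msr;
>
> 	msr.lo = base | MTRR_TYPE_UC;
> 	msr.hi = 0;
> 	wrmsr(MTRR_PHYS_BASE_MSR(i), msr);
>
> 	msr.lo = ~(size - 1) | (1 << 11);	/* range mask plus the valid bit   */
> 	msr.hi = 0x0000000f;			/* assume 36 physical address bits */
> 	wrmsr(MTRR_PHYS_MASK_MSR(i), msr);
> }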
>
> Uhh, it's a long email, it took like an hour to write; please tell me if you
> think it really works this way.
>
> Thanks,
> Rudolf
>
>



-- 
http://se-eng.com



