[coreboot] [OT] machine check related to promotion to a large page
avg at freebsd.org
Wed Nov 18 18:22:45 CET 2009
Sorry for the off-topic post, but from time to time I see technical people from
amd.com writing to this list, so I decided to try my luck.
I would be very grateful for any help with the following issue.
Perhaps I could be referred to the proper technical contacts.
Starting with the upcoming 8.0 release, FreeBSD has transparent support for large
pages (superpages in FreeBSD terms). This means that an eligible range of 4KB pages
can be promoted to a large page (2MB in long mode) at an opportune moment. The large
page can also be broken down into normal pages when needed, of course. Large pages
can also be explicitly allocated, but that is beside the point.
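To make "eligible range" concrete, here is a minimal sketch of a promotion check for
long mode. It is not FreeBSD's pmap code; the helper name can_promote() and the
constant names are made up for illustration, and the rule is simplified to "512
present PTEs, physically contiguous, 2MB-aligned, with identical attribute bits".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPTEPG        512                    /* 4KB entries per page table (long mode) */
#define PAGE_SIZE_4K  0x1000ULL
#define PAGE_SIZE_2M  0x200000ULL
#define PTE_PRESENT   0x1ULL                 /* bit 0: entry is valid */
#define PTE_FRAME     0x000ffffffffff000ULL  /* physical frame, bits 12..51 */

/*
 * Hypothetical helper: returns true if the 512 PTEs of one page table could
 * be replaced by a single 2MB mapping. Simplified: a real kernel treats some
 * bits (accessed, dirty) more carefully than a plain equality test.
 */
static bool
can_promote(const uint64_t pte[NPTEPG])
{
    assert(pte != NULL);

    uint64_t base = pte[0] & PTE_FRAME;
    uint64_t attrs = pte[0] & ~PTE_FRAME;

    if ((attrs & PTE_PRESENT) == 0)
        return false;               /* first page is not mapped */
    if (base % PAGE_SIZE_2M != 0)
        return false;               /* physical range is not 2MB-aligned */

    for (size_t i = 1; i < NPTEPG; i++) {
        if ((pte[i] & PTE_FRAME) != base + i * PAGE_SIZE_4K)
            return false;           /* not physically contiguous */
        if ((pte[i] & ~PTE_FRAME) != attrs)
            return false;           /* protection/caching bits differ */
    }
    return true;
}
```

A single 4KB page that is missing, remapped elsewhere, or has different protection
is enough to make the whole range ineligible.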
We seem to have a problem that appears to be triggered by the superpages feature
mentioned above, and that is perhaps caused by a lack of strictness in our code.
But the problem manifests itself only on AMD family 10h processors. To be precise,
we have reports that family Fh is not affected, all problem reports are for family
10h, and we have no positive or negative reports for family 11h.
Another necessary condition for the problem to manifest itself is that machine
check is enabled by either the BIOS or the OS. Also, the problem is reported only
for long mode.
The actual manifestation of the problem is a machine check report about a parity
error in the L1 DC TLB. All reporters have confirmed that they don't experience any
problems if the superpages feature is turned off. So it seems likely that this
machine check report is not indicative of a hardware fault.
It looks like, with the way our code currently works, we could get into a situation
where two TLB entries exist for the same linear-to-physical translation: one through
a large page and another through a normal page. Most likely both would be correct
(point to the same physical location). Is it possible that such a situation could
lead the integrity checking logic to believe that there is a parity error in the
TLB?
I've searched through the errata for family 10h processors but couldn't find one
that would match.
Examples of processors affected by the problem (as reported by FreeBSD kernel):
CPU: AMD Athlon(tm) II X2 250 Processor (3013.75-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f62 Stepping = 2
CPU: Quad-Core AMD Opteron(tm) Processor 2352 (2100.09-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f23 Stepping = 3
This is how the FreeBSD MCA code reported the machine check:
MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x80e5c8000
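For readers not familiar with the flags in that report: UNCOR, PCC and OVER
correspond to bits in the machine check bank status register (MCi_STATUS). The
sketch below shows how such a flag string can be derived from a status value. The
bit positions are the architectural ones; the function name mca_flags() and the
macro names are chosen for illustration, and the status value used with it is made
up, since the report above does not include the raw register contents.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MC_STATUS_VAL    (1ULL << 63)  /* register contents are valid */
#define MC_STATUS_OVER   (1ULL << 62)  /* an earlier error was overwritten */
#define MC_STATUS_UC     (1ULL << 61)  /* error was not corrected by hardware */
#define MC_STATUS_ADDRV  (1ULL << 58)  /* the address register is valid */
#define MC_STATUS_PCC    (1ULL << 57)  /* processor context may be corrupt */

/* Illustrative helper: format the flag part of an MCA report into buf. */
static void
mca_flags(uint64_t status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    snprintf(buf, len, "%s%s%s",
        (status & MC_STATUS_UC) ? "UNCOR " : "COR ",
        (status & MC_STATUS_PCC) ? "PCC " : "",
        (status & MC_STATUS_OVER) ? "OVER " : "");

    /* Drop the trailing separator, if any. */
    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == ' ')
        buf[n - 1] = '\0';
}
```

So the report above says that the error was uncorrected, that the processor context
may be corrupt, and that at least one earlier error in the same bank was overwritten
before it could be read.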
My guess at a possible FreeBSD code issue: 4K mappings are not flushed from the TLB
when the corresponding PDE is updated from pointing to a page table to pointing to
a 2M page.
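The suspected sequence can be illustrated with a toy software model of a TLB. This
is only a sketch under simplifying assumptions: the structure and function names
are invented, real TLBs are not organized like this, and the model says nothing
about whether overlapping entries actually trigger the machine check. It only shows
how a stale 4K entry and a new 2M entry can cover the same address if nothing is
flushed at promotion time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_SLOTS 8
#define SZ_4K     0x1000ULL
#define SZ_2M     0x200000ULL

struct tlb_entry {
    bool     valid;
    uint64_t va;    /* virtual base address of the cached mapping */
    uint64_t size;  /* SZ_4K or SZ_2M */
};

struct toy_tlb {
    struct tlb_entry slot[TLB_SLOTS];
};

/* True if a valid entry overlaps the range [va, va + size). */
static bool
overlaps(const struct tlb_entry *e, uint64_t va, uint64_t size)
{
    return e->valid && e->va < va + size && va < e->va + e->size;
}

/* Model a TLB miss: cache the translation of the given size covering va. */
static void
tlb_fill(struct toy_tlb *t, uint64_t va, uint64_t size)
{
    assert(t != NULL && (size == SZ_4K || size == SZ_2M));

    uint64_t base = va & ~(size - 1);
    struct tlb_entry *victim = &t->slot[0];  /* crude eviction fallback */

    for (size_t i = 0; i < TLB_SLOTS; i++) {
        struct tlb_entry *e = &t->slot[i];
        if (e->valid && e->va == base && e->size == size)
            return;                          /* already cached */
        if (!e->valid)
            victim = e;                      /* prefer a free slot */
    }
    victim->valid = true;
    victim->va = base;
    victim->size = size;
}

/* Number of cached entries that translate va (more than 1 means duplicates). */
static int
tlb_hits(const struct toy_tlb *t, uint64_t va)
{
    int n = 0;
    for (size_t i = 0; i < TLB_SLOTS; i++)
        n += overlaps(&t->slot[i], va, 1) ? 1 : 0;
    return n;
}

/* Model the flush: invalidate every entry overlapping [va, va + size). */
static void
tlb_flush_range(struct toy_tlb *t, uint64_t va, uint64_t size)
{
    for (size_t i = 0; i < TLB_SLOTS; i++)
        if (overlaps(&t->slot[i], va, size))
            t->slot[i].valid = false;
}
```

In this model, filling a 4K entry for some address and then, after the PDE has been
switched to a 2M page, filling a 2M entry for the same address makes tlb_hits()
return 2. Calling tlb_flush_range() over the whole 2M region as part of the PDE
update leaves at most one entry per address afterwards.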