Early SMP startup
This is an excerpt of an email exchange with the subject "locking..." on the coreboot mailing list. Not all the info below is guaranteed to be correct, but it serves as a great source of distilled knowledge.
After CAR the APs should be stopped until CPU init time, which is relatively short.
v2 has (or used to have) working locking code since it was first ported to Opteron. It may have broken while someone was adding 5 more printks, but it is there somewhere.
Making the BSP poll for the APs (which is what we would do if we need to check the APs shared memory) basically renders the BSP unusable to do stuff while waiting for the APs.
With simple locking, everything can run in parallel, and only serial output needs to get synced. Which is what we actually want.
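As a rough illustration of the "simple locking" idea, here is a minimal sketch of a spinlock that serializes only the console output path. All names (console_lock, locked_puts, console_buf) are made up for this example and are not the coreboot API; the buffer merely stands in for the serial port, and C11 atomics are assumed to be usable, which only holds once coherent RAM is available.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock guarding the console; not a coreboot symbol. */
static atomic_flag console_lock = ATOMIC_FLAG_INIT;

/* Stand-in for the serial port so the sketch can run anywhere. */
static char console_buf[256];
static size_t console_len;

static void console_lock_acquire(void)
{
	/* Spin until the previous holder clears the flag. */
	while (atomic_flag_test_and_set_explicit(&console_lock,
						 memory_order_acquire))
		;
}

static void console_lock_release(void)
{
	atomic_flag_clear_explicit(&console_lock, memory_order_release);
}

/* Only the output itself is serialized; callers otherwise run in parallel. */
static void locked_puts(const char *s)
{
	console_lock_acquire();
	while (*s && console_len < sizeof(console_buf) - 1)
		console_buf[console_len++] = *s++;
	console_buf[console_len] = '\0';
	console_lock_release();
}
```

Each CPU would call locked_puts() whenever it has a message; whole messages stay intact, and nobody has to poll anybody else.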
There's no real problem, we've just been doing too much cut and paste in the past without testing the new code. This made us end up with different versions of printk, some with locking, some without.
And, no, porting the code from v3 over is not an option at this point. It does too much different stuff. Let's rather start dropping unneeded implementations in v2 until things look sane again and then we can decide what implementation we want.
So each AP has some part of RAM to copy the buffer to?
The way SMP works, the BSP sets up its RAM. At that point, the APs can use the BSP RAM. That's why APs have a stack in the first place.
APs have a working stack when they are setting up their own RAM.
The way this works on amd64 is that the AP comes up, goes to cache-as-RAM, finds it is an AP and goes to sleep again. Then it wakes up again in stage2 when the BSP sends an IPI. At this point (at least remote) RAM is available. The APs never set up their own RAM (in terms of JEDEC init, or setting up a RAM controller), but only have to clear it, in case of ECC memory.
- the pre-ram locking can't be done with a stack, because the caches of the different CPUs are not necessarily in the same state.
Well, that may be true on Intel stuff. The AMD startup (at least as I understand it) depends on the BSP memory being functional enough to provide the APs with a stack.
- the post-ram code does not need it, works quite nicely already.
Actually, this is only partially true. It is still possible for a malfunctioning AP to lock the BSP out. It's just not something we've seen much of.
The point (of the v3 early init code and stack structure) was to stay as much the same, but also allow the BSP to better monitor (and control) what was going on. I would still claim the structure of the code is a big improvement. The v2 SMP startup is not an easy read.
I was wondering how you make the APs not conflict in the part of RAM they copy their buffers to. I was also wondering how it would affect interleaving, etc. That kind of thing seems difficult to debug, and is the reason I'd want to see the APs messages.
You give each AP a separate stack. All that code is in there already.
The struct-based stack is a direct copy of the v2 startup. Rather than using lots of fiddly offsets onto a memory area, it provides a struct which contains variables and a stack. The variables are shared between the AP and the BSP. It is a more C-like way to do it. Once I did it I realized that the on-stack variables could be used as a communications path from the AP to the BSP, most important one being a POST variable that could be set by the AP and monitored by the BSP. This is a bit better than what we have in v2, where we get one bit back from the AP which tells us "done" or "not done". No real progress indication is available. At some point, the BSP times out the AP, but there is no error code. Plus, the way in which the shared variable is set up in v2 is not very straightforward.
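To make the struct-based layout concrete, here is a hedged sketch of what such a per-AP area might look like. The struct name, field names, stack size and helper functions are all assumptions made for illustration; they are not copied from v2 or v3.

```c
#include <stdint.h>

#define AP_STACK_WORDS 1024	/* assumed size, purely illustrative */

/* One of these per AP: shared variables plus the AP's stack. */
struct ap_area {
	volatile uint8_t post;	/* progress code: written by AP, read by BSP */
	volatile uint8_t done;	/* set by the AP when it has finished */
	uint32_t stack[AP_STACK_WORDS];
};

/* x86 stacks grow downward, so the AP starts at the end of the array. */
static uint32_t *ap_stack_top(struct ap_area *a)
{
	return &a->stack[AP_STACK_WORDS];
}

/* AP side: publish a progress code. */
static void ap_report(struct ap_area *a, uint8_t code)
{
	a->post = code;
}

/*
 * BSP side: wait for the AP within a polling budget.
 * Returns 0 if the AP finished, -1 on timeout. Either way *last_post
 * holds the last progress code, so a timeout says where the AP got stuck.
 */
static int bsp_wait_for_ap(struct ap_area *a, unsigned long budget,
			   uint8_t *last_post)
{
	while (budget--) {
		if (a->done) {
			*last_post = a->post;
			return 0;
		}
	}
	*last_post = a->post;
	return -1;
}
```

Compared with a single done/not-done bit, the BSP can report something like "AP timed out at POST 0x12" instead of just giving up.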
Many systems only have one memory controller. But on all coreboot systems, even those with multiple memory controllers, the controllers are all set up by the BSP. Parallelizing here makes very little sense. The reason we're parallelizing is that you have to clear memory if you have ECC on some (older?) systems where the memory controller is incapable of doing that automatically. So when we have to clear 32GB at a rate of 3 or 6GB/s we want to put that load on several CPUs, so it's 1-2s instead of 5-10s. But the RAM is all there at this point, and can be transparently accessed by all CPUs.
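For the ECC clearing case, the split across CPUs could look roughly like the sketch below. The function and struct names are hypothetical, and the sketch ignores cache-line alignment and NUMA locality, which real code would likely care about. It also assumes the clear rate scales with the number of CPUs, as the 1-2s versus 5-10s estimate above implies.

```c
#include <stdint.h>

/* A contiguous range of physical memory for one CPU to clear. */
struct mem_slice {
	uint64_t start;
	uint64_t len;
};

/*
 * Divide [base, base + size) into ncpus contiguous slices and return
 * the slice for the given cpu index (0-based). Any remainder bytes are
 * handed out one each to the lowest-numbered CPUs.
 */
static struct mem_slice clear_slice(uint64_t base, uint64_t size,
				    unsigned int ncpus, unsigned int cpu)
{
	uint64_t chunk = size / ncpus;
	uint64_t rem = size % ncpus;
	struct mem_slice s;

	s.start = base + chunk * cpu + (cpu < rem ? cpu : rem);
	s.len = chunk + (cpu < rem ? 1 : 0);
	return s;
}
```

With 32GB and four CPUs, each CPU gets an 8GB slice, so at 6GB/s per CPU the clear would take a bit over a second instead of a bit over five.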
Other than that, we could unify those versions (of printk) by just defining an empty (for now) version of the spinlock functions in raminit stage. Then think about where we can place our locking for those platforms that need it this early...?
It's more than spinlock. You must also fix the use of the console_drivers struct. There are several things you need to get right to make this work.
If you want locking in do_printk, it will have to work in CAR.
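The "empty spinlock" suggestion from earlier in the thread might be sketched as follows. PRE_RAM_STAGE is a made-up switch for this example, not a real coreboot config option, and the non-stub branch uses C11 atomics as a stand-in for whatever the platform actually provides. As noted above, this alone does not make a shared printk work: the console_drivers handling has to be fixed as well, and real locking in do_printk would have to work in CAR.

```c
#include <stdatomic.h>

/* Hypothetical switch: 1 while building the raminit (pre-RAM) stage. */
#ifndef PRE_RAM_STAGE
#define PRE_RAM_STAGE 1
#endif

typedef struct {
	atomic_int lock;	/* 0 = free, 1 = held */
} spinlock_t;

#if PRE_RAM_STAGE
/* Empty (for now) versions: the shared printk code compiles unchanged. */
static inline void spin_lock(spinlock_t *l) { (void)l; }
static inline void spin_unlock(spinlock_t *l) { (void)l; }
#else
static inline void spin_lock(spinlock_t *l)
{
	/* Spin until we are the one who changed 0 into 1. */
	while (atomic_exchange_explicit(&l->lock, 1, memory_order_acquire))
		;
}

static inline void spin_unlock(spinlock_t *l)
{
	atomic_store_explicit(&l->lock, 0, memory_order_release);
}
#endif
```

The point of the stubs is only to let one printk implementation be shared between stages; they provide no protection at all in the pre-RAM stage.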