Here’s how the Pi goes from power-on ROM to running your first C function in EL1. We’ll walk through each assembly step on the primary core, then show how the other cores join in.
When the Raspberry Pi 3B powers on, the CPU doesn't start running right away. Instead, the GPU is the first to wake up. It runs a small bit of code from on-chip ROM that loads `bootcode.bin` from the SD card, which sets up the SDRAM so the system has usable memory. Then it loads `start.elf`, which configures things like the system clock and power management. Once everything is set up, the GPU loads your `kernel8.img` into memory and jumps to the `_start` label. At this point, core 0 begins executing your code in EL2.
_start:
mrs x0, mpidr_el1 // Read the Multiprocessor Affinity Register
and x0, x0, #0xFF // Keep only the lowest 8 bits (core number)
cbz x0, master // If that number is zero, we’re on core 0
b proc_hang // Otherwise, park this core
When the CPU resets, every core has the same code in memory, but only one of them (core 0) actually needs to run the initialization sequence right away. We read `MPIDR_EL1`, which among other things tells us "I'm core 0," "I'm core 1," up to "I'm core 3." The lowest eight bits of that register (called "Affinity Level 0") hold exactly that piece of information. By masking with `0xFF`, we strip away everything else and end up with a number from 0 to 3. If it's zero, this is the primary core and we jump into the `master` setup. If it's non-zero, we branch to `proc_hang` and wait quietly until the primary core wakes us later on.
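The masking step can be sketched in C. The helper names here are hypothetical, and the `mrs` read is only meaningful when compiled for an AArch64 target:

```c
#include <stdint.h>

/* Affinity Level 0 lives in bits 7:0 of MPIDR_EL1; on the Pi 3B it is
 * simply the core number, 0-3. */
static inline uint64_t core_from_mpidr(uint64_t mpidr)
{
    return mpidr & 0xFF;
}

#if defined(__aarch64__)
/* Read MPIDR_EL1 directly; mirrors the mrs/and pair in _start. */
static inline uint64_t current_core(void)
{
    uint64_t mpidr;
    __asm__ volatile("mrs %0, mpidr_el1" : "=r"(mpidr));
    return core_from_mpidr(mpidr);
}
#endif
```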
proc_hang:
wfe // Wait For Event: core goes into a low-power sleep
b proc_hang // loop back and sleep again
Although on the real Pi 3B only core 0 ever enters our `_start` sequence at reset, we include this "parking" loop for cores 1–3 in case they do reach our code (for example, in simulators or on future platforms). Any core that branches here sleeps in place: `wfe` halts the core until an event (a `sev`) arrives, and the branch back to `proc_hang` ensures it immediately goes back to sleep if no new work is pending.

On actual hardware, cores 1–3 start out stalled in the firmware's spin loop and are only released once the primary core writes their start address into the mailbox registers and issues a `sev`. When they wake, they skip this parking loop entirely and instead begin execution at `setup_el1_for_secondary`, which carries them into EL1 and on to `secondary_kernel_main` as part of the controlled multicore startup path.
add x1, x1, #LOW_MEMORY // convert per-core offset into actual RAM address
mov sp, x1 // use this as a temporary early stack
bl pickKernelStack // get the real, per-core stack pointer into x0
msr sp_el1, x0 // install it as the EL1 stack pointer
This section ensures that EL1 has a valid stack to use once we transition down from EL2. We first compute a temporary per-core stack pointer and assign it to `sp`, then use `pickKernelStack` to retrieve the properly aligned stack for that core. The key step is storing the final value in `SP_EL1` using `msr`.

This matters for exception handling: if `SP_EL1` is left uninitialized, the CPU has no valid stack for the handler to run on when an exception is taken in EL1h, and the system will crash or behave unpredictably the first time a fault is triggered.
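Here is a hedged C sketch of what `pickKernelStack` and the per-core stack storage might look like. The function name follows the text, but `NCORES`, `STACK_SIZE`, and the calling convention (the core number arriving as the first argument, i.e. in `x0`) are assumptions:

```c
#include <stdint.h>

#define NCORES     4
#define STACK_SIZE (16 * 1024)   /* assumed per-core stack size */

/* No initializer, so this large array lands in .bss: it costs nothing in
 * the kernel image and is zeroed once the boot code runs memzero. */
static uint8_t stacks[NCORES][STACK_SIZE] __attribute__((aligned(16)));

/* Return the top of this core's stack; AArch64 stacks grow downward and
 * sp must stay 16-byte aligned. */
uint64_t pickKernelStack(uint64_t core)
{
    return (uint64_t)&stacks[core][0] + STACK_SIZE;
}
```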
master:
mov x0, #0x33FF
msr cptr_el2, x0 // stop FP/SIMD faults to EL2
msr hstr_el2, xzr // stop system-register traps
mrs x0, CPACR_EL1
orr x0, x0, #(1<<20)|(1<<21)
msr CPACR_EL1, x0 // allow FP/SIMD in EL1
These instructions disable traps that would otherwise fire when EL1 code touches floating-point or SIMD state. No floating-point code runs during the drop from EL2 to EL1 itself, but compiled C code routinely uses SIMD registers behind the scenes (for example, in `memcpy` or when passing floating-point arguments), and without this setup those accesses would trap, either up to EL2 (`cptr_el2`) or within EL1 (`CPACR_EL1`), and crash the kernel. So even if the immediate effect isn't obvious, it's important to have this in place early.
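As a sanity check on the bit patterns (a sketch based on the ARMv8-A register layouts, not code from the kernel): `0x33FF` leaves `CPTR_EL2.TFP` (bit 10) clear while setting bits that are RES1, so FP/SIMD accesses from EL1 are not trapped up to EL2, and the two bits the `orr` sets are exactly `CPACR_EL1.FPEN = 0b11`, which stops EL1 itself from trapping them:

```c
#include <stdint.h>

#define CPTR_EL2_TFP    (1u << 10)  /* trap FP/SIMD to EL2 when set */
#define CPACR_FPEN_MASK (3u << 20)  /* FPEN field, bits 21:20 */

/* 1 if a CPTR_EL2 value lets EL1's FP/SIMD accesses through untrapped. */
int fp_untrapped_at_el2(uint32_t cptr)
{
    return (cptr & CPTR_EL2_TFP) == 0;
}

/* 1 if a CPACR_EL1 value enables FP/SIMD at EL1 (FPEN == 0b11). */
int fp_enabled_at_el1(uint32_t cpacr)
{
    return (cpacr & CPACR_FPEN_MASK) == CPACR_FPEN_MASK;
}
```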
ldr x0, =HCR_VALUE
msr hcr_el2, x0 // Set EL1 to use AArch64 execution state
ldr x0, =_vectors
msr vbar_el1, x0 // Set the exception vector base address for EL1
The `HCR_EL2` register controls how EL1 behaves when we return to it from EL2. Specifically, setting the `RW` bit (bit 31) to 1 tells the CPU that EL1 should run in AArch64 mode rather than AArch32. If this isn't set, your code might end up running in the wrong execution state and behave unpredictably (or not at all).

After that, we configure `VBAR_EL1`, the vector base address register for EL1. This tells the processor where to jump when an exception occurs (like an interrupt, system call, or fault) while running in EL1. By pointing it to our own `_vectors` table, we make sure any exceptions in our kernel get handled by our own code, not some undefined location.
mov x0, #0x3C5
msr spsr_el2, x0 // Set up the saved program status for EL1
adr x0, el1_entry
msr elr_el2, x0 // Set the return address for when we drop to EL1
eret // Exception return: jump to EL1
Before we can drop from EL2 to EL1, we need to tell the processor what kind of state EL1 should start in. `SPSR_EL2` controls that. The value `0x3C5` selects EL1h (mode field `0b0101` in bits 3:0), which means EL1 will use `sp_el1` as the active stack pointer; `0x3C4` would select EL1t and `sp_el0` instead. It also sets the D, A, I, and F bits, masking interrupts so they do not interfere with our setup.

Next, we set `ELR_EL2`, which is the return address for when the CPU exits the current exception level. This is where the processor will jump once we execute `eret`. We point it to `el1_entry`, which is where our EL1 code begins.

Finally, `eret` performs the transition. The CPU exits EL2 and starts executing at the address in `ELR_EL2` using the settings we placed in `SPSR_EL2`.
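The SPSR fields can be decoded directly. This is a sketch using the ARMv8-A layout: bits 9:6 are the D/A/I/F interrupt masks, and bits 3:0 select the mode, where `0b0101` (EL1h) means EL1 runs on `sp_el1` and `0b0100` (EL1t) would mean `sp_el0`:

```c
#include <stdint.h>

/* Mode field M[3:0]: 0b0100 = EL1t (SP_EL0), 0b0101 = EL1h (SP_EL1). */
unsigned spsr_mode(uint64_t spsr)
{
    return (unsigned)(spsr & 0xF);
}

/* D, A, I, F mask bits live in bits 9:6; 0xF means everything masked. */
unsigned spsr_daif(uint64_t spsr)
{
    return (unsigned)((spsr >> 6) & 0xF);
}
```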
el1_entry:
adr x0, __bss_start
adr x1, __bss_end
sub x1, x1, x0
bl memzero // clear .bss
mov sp, #LOW_MEMORY // temporary stack
bl pickKernelStack // returns final SP in x0
mov sp, x0 // use it now
bl primary_kernel_init
b proc_hang
When we enter EL1, the first thing we do is clear the `.bss` section. This section contains all global and static variables that are uninitialized or implicitly initialized to zero. The C runtime expects these to be zeroed out before execution begins. We calculate the size of the section and call `memzero` to clear it.
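A minimal C version of `memzero` looks like this. The kernel implements it in assembly; this sketch assumes the (start, size) argument order implied by how `el1_entry` sets up `x0` and `x1`:

```c
#include <stddef.h>
#include <stdint.h>

/* Clear n bytes starting at dst. The real routine is assembly and likely
 * clears 8 or 16 bytes per iteration; byte-at-a-time keeps the idea clear. */
void memzero(void *dst, size_t n)
{
    uint8_t *p = dst;
    while (n--)
        *p++ = 0;
}
```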
This includes things like the `stacks` object, which is a global variable with no explicit initializer. Even though it reserves a large amount of memory, it goes into the `.bss` section and is not stored in the binary itself. The bootloader (or in our case, this code) ensures that it starts out filled with zeros.
After that, we need a working stack. We first set `sp` to a known safe memory location to avoid crashing during early setup. Then we call `pickKernelStack`, which returns a stack pointer based on the core ID. Once we have the correct address, we switch `sp` over to it and continue.

With memory cleared and a proper stack in place, we can safely enter C by calling `primary_kernel_init`.
wake_up_cores:
adr x0, setup_el1_for_secondary
mov x1, #0xE0
str x0, [x1] // entry for core 1
mov x1, #0xE8
str x0, [x1] // core 2
mov x1, #0xF0
str x0, [x1] // core 3
dsb sy // make the writes visible before signalling
sev // send event
When the Raspberry Pi powers on, only core 0 begins execution. The other cores are held in a low-power state using the `wfe` instruction. To bring them online, we write the address of our secondary core entry point into fixed memory locations that act as mailboxes for each core: `0xE0` for core 1, `0xE8` for core 2, and `0xF0` for core 3. These are monitored by the firmware. After writing the addresses, the `sev` instruction is used to broadcast a signal that wakes all sleeping cores.

This function is typically called after core 0 has completed its early initialization tasks. That includes setting up the UART, enabling `printf`, virtual memory, configuring the heap, and bringing up any necessary drivers or subsystems. By the time the secondary cores start running, the system is already in a usable state. This avoids repeating expensive initialization steps on every core and lets the secondaries immediately start using the infrastructure set up by core 0.
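In C, the same wake-up sequence might be sketched as follows. The function name is hypothetical, and the spin-table base is parameterized so the logic is testable; passing a pointer to physical `0xD8` (an assumption about the firmware's table base) would make slots 1–3 the `0xE0`/`0xE8`/`0xF0` mailboxes from the text:

```c
#include <stdint.h>

/* Hypothetical C equivalent of wake_up_cores: publish the secondary entry
 * point into each core's spin-table slot, then signal an event so the
 * cores leave their wfe loop. */
void wake_cores(volatile uint64_t *slots, uint64_t entry)
{
    for (int core = 1; core <= 3; core++)
        slots[core] = entry;            /* fill each core's mailbox */
#if defined(__aarch64__)
    /* Ensure the stores are visible before the event is broadcast. */
    __asm__ volatile("dsb sy\n\tsev" ::: "memory");
#endif
}
```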
setup_el1_for_secondary:
mrs x0, mpidr_el1
and x0, x0, #0xFF
mov x1, #SECTION_SIZE
mul x1, x1, x0
add x1, x1, #LOW_MEMORY
mov sp, x1 // temporary stack pointer for this core
bl pickKernelStack
msr sp_el1, x0 // install proper EL1 stack pointer
// Disable traps in EL2
mov x0, #0x33FF
msr cptr_el2, x0 // disable FP/SIMD trapping to EL2
msr hstr_el2, xzr // disable system register trapping
mov x0, #(3 << 20)
msr cpacr_el1, x0 // enable FP/SIMD access in EL1
// Configure EL2 to drop into EL1 in 64-bit mode
ldr x0, =HCR_VALUE
msr hcr_el2, x0
ldr x0, =_vectors
msr vbar_el1, x0 // set vector table for EL1
// Set up return from EL2 to EL1
ldr x0, =SPSR_VALUE
msr spsr_el2, x0
adr x0, secondary_kernel_main
msr elr_el2, x0
eret // return to EL1
After each secondary core is woken up, it begins execution in EL2. To bring it into the same environment as core 0, we must carefully prepare the transition into EL1 the same way we did for core 0:

- `pickKernelStack` returns a properly aligned per-core stack, which we install into `sp_el1`.
- FP/SIMD and system-register traps are disabled via `cptr_el2`, `hstr_el2`, and `cpacr_el1`.
- `HCR_EL2` is configured to ensure EL1 will run in AArch64 mode, matching the state of the kernel.
- The exception vector table is installed in `VBAR_EL1`.
- `SPSR_EL2` defines the state that EL1 will start in (EL1h, interrupts masked), and `ELR_EL2` holds the address of `secondary_kernel_main`, which is where execution resumes after `eret`.

Once everything is in place, `eret` drops the core into EL1. From this point onward, the secondary core runs in the same environment as the primary core, with its own stack and full access to system features.
.align 11
_vectors:
// Synchronous
.align 7
mov x0, #0
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// IRQ
.align 7
mov x0, #1
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// FIQ
.align 7
mov x0, #2
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// SError
.align 7
mov x0, #3
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
This is our exception vector table for EL1, which I briefly mentioned earlier. It defines how the CPU should respond to various types of exceptions: synchronous faults, IRQs, FIQs, and system errors (SError). Each entry is placed 128 bytes apart (`.align 7`) to satisfy the vector layout, and the table itself is 2 KB aligned (`.align 11`) as `VBAR_EL1` requires. (A full AArch64 vector table has four such groups of four entries; this listing shows one group.)
- `ESR_EL1` tells us what kind of exception occurred.
- `ELR_EL1` holds the address of the instruction that triggered the exception.
- `SPSR_EL1` saves the processor state at the time of the exception.
- `FAR_EL1` provides the faulting address in memory-related exceptions.
- `x0` is used to label the exception type (e.g. 0 for sync, 1 for IRQ, etc.).
All exception cases branch to a single `exc_handler` function. This keeps handling unified while still giving us enough context to handle or report each type appropriately.
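Given that register setup, the C side of `exc_handler` would receive those values as its first five arguments under the AAPCS64 calling convention. This is a hypothetical sketch of its shape, plus the standard decode of `ESR_EL1`'s exception-class field:

```c
#include <stdint.h>

/* EC (exception class) occupies ESR_EL1 bits 31:26; e.g. 0x15 is an SVC
 * from AArch64, 0x25 a data abort taken at the current EL. */
unsigned esr_ec(uint64_t esr)
{
    return (unsigned)(esr >> 26) & 0x3F;
}

/* x0-x4 from the vector stubs arrive here as arguments. A real handler
 * would switch on `type` and esr_ec(esr), report, and decide whether to
 * hang or recover. */
void exc_handler(uint64_t type, uint64_t esr, uint64_t elr,
                 uint64_t spsr, uint64_t far_addr)
{
    (void)type; (void)elr; (void)spsr; (void)far_addr;
    (void)esr_ec(esr);
}
```

Note that the stubs clobber `x0`–`x4` without saving them first, so this design is fine for reporting and halting but would need full register save/restore before it could ever return to the interrupted code.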