Here’s how the Pi goes from power-on ROM to running your first C function in EL1. We’ll walk through each assembly step on the primary core, then show how the other cores join in.
When the Raspberry Pi 3B powers on, the CPU doesn't start running right away. Instead, the GPU is the first to wake up. It runs a small bit of code from on-chip ROM that loads `bootcode.bin` from the SD card, which sets up the SDRAM so the system has usable memory. Then it loads `start.elf`, which configures things like the system clock and power management. Once everything is set up, the GPU loads your `kernel8.img` into memory and jumps to the `_start` label. At this point, core 0 begins executing your code in EL2.
_start:
mrs x0, mpidr_el1 // Read the Multiprocessor Affinity Register
and x0, x0, #0xFF // Keep only the lowest 8 bits (core number)
cbz x0, master // If that number is zero, we’re on core 0
b proc_hang // Otherwise, park this core
When the CPU resets, every core has the same code in memory, but only one of them (core 0) actually needs to run the initialization sequence right away. We read `MPIDR_EL1`, which among other things tells us "I'm core 0," "I'm core 1," up to "I'm core 3." The lowest eight bits of that register (called "Affinity Level 0") hold exactly that piece of information. By masking with `0xFF`, we strip away everything else and end up with a number from 0 to 3. If it's zero, this is the primary core and we jump into the `master` setup. If it's non-zero, we branch to `proc_hang` and wait quietly until the primary core wakes us later on.
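The masking step can be sketched in C. The helper names here are hypothetical, and the `mrs` read is only meaningful when compiled for an AArch64 target:

```c
#include <stdint.h>

/* Affinity Level 0 lives in bits 7:0 of MPIDR_EL1; on the Pi 3B it is
 * simply the core number, 0-3. */
static inline uint64_t core_from_mpidr(uint64_t mpidr)
{
    return mpidr & 0xFF;
}

#if defined(__aarch64__)
/* Read MPIDR_EL1 directly; mirrors the mrs/and pair in _start. */
static inline uint64_t current_core(void)
{
    uint64_t mpidr;
    __asm__ volatile("mrs %0, mpidr_el1" : "=r"(mpidr));
    return core_from_mpidr(mpidr);
}
#endif
```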
proc_hang:
wfe // Wait For Event: core goes into a low-power sleep
b proc_hang // loop back and sleep again
Although on the real Pi 3B only core 0 ever enters our `_start` sequence at reset, we include this "parking" loop for cores 1–3 in case they do reach our code (for example, in simulators or on future platforms). Any core that branches here sleeps in place: `wfe` halts the core until an event (a `sev`) arrives, and the branch back to `proc_hang` ensures it immediately goes back to sleep if no new work is pending.

On actual hardware, cores 1–3 start out stalled in the firmware's spin loop and are only released once the primary core writes their start address into the mailbox registers and issues a `sev`. When they wake, they skip this parking loop entirely and instead begin execution at `setup_el1_for_secondary`, which carries them into EL1 and on to `secondary_kernel_main` as part of the controlled multicore startup path.
add x1, x1, #LOW_MEMORY // convert per-core offset into actual RAM address
mov sp, x1 // use this as a temporary early stack
bl pickKernelStack // get the real, per-core stack pointer into x0
msr sp_el1, x0 // install it as the EL1 stack pointer
This section ensures that EL1 has a valid stack to use once we transition down from EL2. We first compute a temporary per-core stack pointer and assign it to `sp`, then use `pickKernelStack` to retrieve the properly aligned stack for that core. The key step is storing the final value in `SP_EL1` using `msr`.

This matters for exception handling: if `SP_EL1` is left uninitialized, the CPU has no valid stack for the handler to run on when an exception is taken in EL1h, and the system will crash or behave unpredictably the first time a fault is triggered.
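Here is a hedged C sketch of what `pickKernelStack` and the per-core stack storage might look like. The function name follows the text, but `NCORES`, `STACK_SIZE`, and the calling convention (the core number arriving as the first argument, i.e. in `x0`) are assumptions:

```c
#include <stdint.h>

#define NCORES     4
#define STACK_SIZE (16 * 1024)   /* assumed per-core stack size */

/* No initializer, so this large array lands in .bss: it costs nothing in
 * the kernel image and is zeroed once the boot code runs memzero. */
static uint8_t stacks[NCORES][STACK_SIZE] __attribute__((aligned(16)));

/* Return the top of this core's stack; AArch64 stacks grow downward and
 * sp must stay 16-byte aligned. */
uint64_t pickKernelStack(uint64_t core)
{
    return (uint64_t)&stacks[core][0] + STACK_SIZE;
}
```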
master:
mov x0, #0x33FF
msr cptr_el2, x0 // stop FP/SIMD faults to EL2
msr hstr_el2, xzr // stop system-register traps
mrs x0, CPACR_EL1
orr x0, x0, #(1<<20)|(1<<21)
msr CPACR_EL1, x0 // allow FP/SIMD in EL1
These instructions disable traps that would otherwise fire when EL1 code touches floating-point or SIMD state. No floating-point code runs during the drop from EL2 to EL1 itself, but compiled C code routinely uses SIMD registers behind the scenes (for example, in `memcpy` or when passing floating-point arguments), and without this setup those accesses would trap, either up to EL2 (`cptr_el2`) or within EL1 (`CPACR_EL1`), and crash the kernel. So even if the immediate effect isn't obvious, it's important to have this in place early.
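As a sanity check on the bit patterns (a sketch based on the ARMv8-A register layouts, not code from the kernel): `0x33FF` leaves `CPTR_EL2.TFP` (bit 10) clear while setting bits that are RES1, so FP/SIMD accesses from EL1 are not trapped up to EL2, and the two bits the `orr` sets are exactly `CPACR_EL1.FPEN = 0b11`, which stops EL1 itself from trapping them:

```c
#include <stdint.h>

#define CPTR_EL2_TFP    (1u << 10)  /* trap FP/SIMD to EL2 when set */
#define CPACR_FPEN_MASK (3u << 20)  /* FPEN field, bits 21:20 */

/* 1 if a CPTR_EL2 value lets EL1's FP/SIMD accesses through untrapped. */
int fp_untrapped_at_el2(uint32_t cptr)
{
    return (cptr & CPTR_EL2_TFP) == 0;
}

/* 1 if a CPACR_EL1 value enables FP/SIMD at EL1 (FPEN == 0b11). */
int fp_enabled_at_el1(uint32_t cpacr)
{
    return (cpacr & CPACR_FPEN_MASK) == CPACR_FPEN_MASK;
}
```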
ldr x0, =HCR_VALUE
msr hcr_el2, x0 // Set EL1 to use AArch64 execution state
ldr x0, =_vectors
msr vbar_el1, x0 // Set the exception vector base address for EL1
The `HCR_EL2` register controls how EL1 behaves when we return to it from EL2. Specifically, setting the `RW` bit (bit 31) to 1 tells the CPU that EL1 should run in AArch64 mode rather than AArch32. If this isn't set, your code might end up running in the wrong execution state and behave unpredictably (or not at all).

After that, we configure `VBAR_EL1`, the vector base address register for EL1. This tells the processor where to jump when an exception occurs (like an interrupt, system call, or fault) while running in EL1. By pointing it to our own `_vectors` table, we make sure any exceptions in our kernel get handled by our own code, not some undefined location.
mov x0, #0x3C5
msr spsr_el2, x0 // Set up the saved program status for EL1
adr x0, el1_entry
msr elr_el2, x0 // Set the return address for when we drop to EL1
eret // Exception return: jump to EL1
Before we can drop from EL2 to EL1, we need to tell the processor what kind of state EL1 should start in. `SPSR_EL2` controls that. The value `0x3C5` selects EL1h (mode field `0b0101` in bits 3:0), which means EL1 will use `sp_el1` as the active stack pointer; `0x3C4` would select EL1t and `sp_el0` instead. It also sets the D, A, I, and F bits, masking interrupts so they do not interfere with our setup.

Next, we set `ELR_EL2`, which is the return address for when the CPU exits the current exception level. This is where the processor will jump once we execute `eret`. We point it to `el1_entry`, which is where our EL1 code begins.

Finally, `eret` performs the transition. The CPU exits EL2 and starts executing at the address in `ELR_EL2` using the settings we placed in `SPSR_EL2`.
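The SPSR fields can be decoded directly. This is a sketch using the ARMv8-A layout: bits 9:6 are the D/A/I/F interrupt masks, and bits 3:0 select the mode, where `0b0101` (EL1h) means EL1 runs on `sp_el1` and `0b0100` (EL1t) would mean `sp_el0`:

```c
#include <stdint.h>

/* Mode field M[3:0]: 0b0100 = EL1t (SP_EL0), 0b0101 = EL1h (SP_EL1). */
unsigned spsr_mode(uint64_t spsr)
{
    return (unsigned)(spsr & 0xF);
}

/* D, A, I, F mask bits live in bits 9:6; 0xF means everything masked. */
unsigned spsr_daif(uint64_t spsr)
{
    return (unsigned)((spsr >> 6) & 0xF);
}
```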
el1_entry:
adr x0, __bss_start
adr x1, __bss_end
sub x1, x1, x0
bl memzero // clear .bss
mov sp, #LOW_MEMORY // temporary stack
bl pickKernelStack // returns final SP in x0
mov sp, x0 // use it now
bl primary_kernel_init
b proc_hang
When we enter EL1, the first thing we do is clear the `.bss` section. This section contains all global and static variables that are uninitialized or implicitly initialized to zero. The C runtime expects these to be zeroed out before execution begins. We calculate the size of the section and call `memzero` to clear it.
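A minimal C version of `memzero` looks like this. The kernel implements it in assembly; this sketch assumes the (start, size) argument order implied by how `el1_entry` sets up `x0` and `x1`:

```c
#include <stddef.h>
#include <stdint.h>

/* Clear n bytes starting at dst. The real routine is assembly and likely
 * clears 8 or 16 bytes per iteration; byte-at-a-time keeps the idea clear. */
void memzero(void *dst, size_t n)
{
    uint8_t *p = dst;
    while (n--)
        *p++ = 0;
}
```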
This includes things like the `stacks` object, which is a global variable with no explicit initializer. Even though it reserves a large amount of memory, it goes into the `.bss` section and is not stored in the binary itself. The bootloader (or in our case, this code) ensures that it starts out filled with zeros.
After that, we need a working stack. We first set `sp` to a known safe memory location to avoid crashing during early setup. Then we call `pickKernelStack`, which returns a stack pointer based on the core ID. Once we have the correct address, we switch `sp` over to it and continue.

With memory cleared and a proper stack in place, we can safely enter C by calling `primary_kernel_init`.
wake_up_cores:
adr x0, setup_el1_for_secondary
mov x1, #0xE0
str x0, [x1] // entry for core 1
mov x1, #0xE8
str x0, [x1] // core 2
mov x1, #0xF0
str x0, [x1] // core 3
dsb sy // make the writes visible before signalling
sev // send event
When the Raspberry Pi powers on, only core 0 begins execution. The other cores are held in a low-power state using the `wfe` instruction. To bring them online, we write the address of our secondary core entry point into fixed memory locations that act as mailboxes for each core: `0xE0` for core 1, `0xE8` for core 2, and `0xF0` for core 3. These are monitored by the firmware. After writing the addresses, the `sev` instruction is used to broadcast a signal that wakes all sleeping cores.

This function is typically called after core 0 has completed its early initialization tasks. That includes setting up the UART, enabling `printf`, virtual memory, configuring the heap, and bringing up any necessary drivers or subsystems. By the time the secondary cores start running, the system is already in a usable state. This avoids repeating expensive initialization steps on every core and lets the secondaries immediately start using the infrastructure set up by core 0.
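In C, the same wake-up sequence might be sketched as follows. The function name is hypothetical, and the spin-table base is parameterized so the logic is testable; passing a pointer to physical `0xD8` (an assumption about the firmware's table base) would make slots 1–3 the `0xE0`/`0xE8`/`0xF0` mailboxes from the text:

```c
#include <stdint.h>

/* Hypothetical C equivalent of wake_up_cores: publish the secondary entry
 * point into each core's spin-table slot, then signal an event so the
 * cores leave their wfe loop. */
void wake_cores(volatile uint64_t *slots, uint64_t entry)
{
    for (int core = 1; core <= 3; core++)
        slots[core] = entry;            /* fill each core's mailbox */
#if defined(__aarch64__)
    /* Ensure the stores are visible before the event is broadcast. */
    __asm__ volatile("dsb sy\n\tsev" ::: "memory");
#endif
}
```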
setup_el1_for_secondary:
mrs x0, mpidr_el1
and x0, x0, #0xFF
mov x1, #SECTION_SIZE
mul x1, x1, x0
add x1, x1, #LOW_MEMORY
mov sp, x1 // temporary stack pointer for this core
bl pickKernelStack
msr sp_el1, x0 // install proper EL1 stack pointer
// Disable traps in EL2
mov x0, #0x33FF
msr cptr_el2, x0 // disable FP/SIMD trapping to EL2
msr hstr_el2, xzr // disable system register trapping
mov x0, #(3 << 20)
msr cpacr_el1, x0 // enable FP/SIMD access in EL1
// Configure EL2 to drop into EL1 in 64-bit mode
ldr x0, =HCR_VALUE
msr hcr_el2, x0
ldr x0, =_vectors
msr vbar_el1, x0 // set vector table for EL1
// Set up return from EL2 to EL1
ldr x0, =SPSR_VALUE
msr spsr_el2, x0
adr x0, secondary_kernel_main
msr elr_el2, x0
eret // return to EL1
After each secondary core is woken up, it begins execution in EL2. To bring it into the same environment as core 0, we must carefully prepare the transition into EL1 the same way we did for core 0:

- `pickKernelStack` returns a properly aligned per-core stack, which we install into `sp_el1`.
- FP/SIMD and system-register traps are disabled via `cptr_el2`, `hstr_el2`, and `cpacr_el1`.
- `HCR_EL2` is configured to ensure EL1 will run in AArch64 mode, matching the state of the kernel.
- The exception vector table is installed in `VBAR_EL1`.
- `SPSR_EL2` defines the state that EL1 will start in (EL1h, interrupts masked), and `ELR_EL2` holds the address of `secondary_kernel_main`, which is where execution resumes after `eret`.

Once everything is in place, `eret` drops the core into EL1. From this point onward, the secondary core runs in the same environment as the primary core, with its own stack and full access to system features.
.align 11
_vectors:
// Synchronous
.align 7
mov x0, #0
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// IRQ
.align 7
mov x0, #1
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// FIQ
.align 7
mov x0, #2
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
// SError
.align 7
mov x0, #3
mrs x1, esr_el1
mrs x2, elr_el1
mrs x3, spsr_el1
mrs x4, far_el1
b exc_handler
This is our exception vector table for EL1, which I briefly mentioned earlier. It defines how the CPU should respond to various types of exceptions: synchronous faults, IRQs, FIQs, and system errors (SError). Each entry is placed 128 bytes apart (`.align 7`) to satisfy the vector layout, and the table itself is 2 KB aligned (`.align 11`) as `VBAR_EL1` requires. (A full AArch64 vector table has four such groups of four entries; this listing shows one group.)
- `ESR_EL1` tells us what kind of exception occurred.
- `ELR_EL1` holds the address of the instruction that triggered the exception.
- `SPSR_EL1` saves the processor state at the time of the exception.
- `FAR_EL1` provides the faulting address in memory-related exceptions.
- `x0` is used to label the exception type (e.g. 0 for sync, 1 for IRQ, etc.).
All exception cases branch to a single `exc_handler` function. This keeps handling unified while still giving us enough context to handle or report each type appropriately.
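Given that register setup, the C side of `exc_handler` would receive those values as its first five arguments under the AAPCS64 calling convention. This is a hypothetical sketch of its shape, plus the standard decode of `ESR_EL1`'s exception-class field:

```c
#include <stdint.h>

/* EC (exception class) occupies ESR_EL1 bits 31:26; e.g. 0x15 is an SVC
 * from AArch64, 0x25 a data abort taken at the current EL. */
unsigned esr_ec(uint64_t esr)
{
    return (unsigned)(esr >> 26) & 0x3F;
}

/* x0-x4 from the vector stubs arrive here as arguments. A real handler
 * would switch on `type` and esr_ec(esr), report, and decide whether to
 * hang or recover. */
void exc_handler(uint64_t type, uint64_t esr, uint64_t elr,
                 uint64_t spsr, uint64_t far_addr)
{
    (void)type; (void)elr; (void)spsr; (void)far_addr;
    (void)esr_ec(esr);
}
```

Note that the stubs clobber `x0`–`x4` without saving them first, so this design is fine for reporting and halting but would need full register save/restore before it could ever return to the interrupted code.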