MMU Setup

Last time, I tested my MPSC atomic ring buffer, but it crashed when calling compare_exchange() on my atomic index. After some research and some tests, I realized that this was because the MMU (memory management unit) is not automatically set up by the RPi bootloader. I need to turn on the MMU anyway to enable caching and other hardware speed-ups (speculative fetching, etc.), so in this post I set out to enable the MMU.

I found this blog post which seems to be a walkthrough of exactly what I want to do (set up the MMU on an Arm Cortex A53 cpu in aarch64 assembly); I will use this post as a guide for what I need to do.

The first thing I did was send the value of the CurrentEL register over my newly implemented ring buffer:

pub enum Register {
    CurrentEL,
}

pub fn read_register(reg: Register) -> u64 {
    let mut out: u64;
    unsafe {
        match reg {
            Register::CurrentEL => {
                asm!(
                    "mrs {out}, CurrentEL",
                    out = out(reg) out
                );
            }
        }
    }

    out
}

// Then in my main baremetal loop:

loop {
    counter += 1;

    let r = shared_mem.write_message(
        CoreID::Core1,
        common::shared_mem::types::BaremetalMessage::TestSendingAU64Lol(read_register(
            utils::Register::CurrentEL,
        )),
    );

}

Where the write_message function uses the SPSC implementation I talked about last time. The output of this is Message: TestSendingAU64Lol(8), so the value of the 64-bit register is 0b1000. The blog post I’m following writes:

mrs  x0, CurrentEL
cmp  x0, 0b0100
beq  in_el1
blo  in_el0
cmp  x0, 0b1000
beq  in_el2

At the bottom, we compare the value of CurrentEL to 0b1000, and branch to in_el2 if they match; this means we are in exception level 2! This is good news, since if we were in exception level 0, I would probably have to make a custom stub to replace the default rpi one I’m running (which sounds like a whole rabbit hole in itself). (I also double-checked with the official ARM docs to verify the meaning of this value/register.)
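
The same check can be written in Rust. This is a minimal sketch, not code from the project: the function name exception_level is made up for illustration, and it relies only on the fact that CurrentEL keeps the exception level in bits [3:2] with every other bit reserved.

```rust
/// Extracts the exception level from a raw CurrentEL value.
/// (Hypothetical helper: the name is an assumption, not from the post.)
/// The EL field lives in bits [3:2]; the remaining bits are reserved.
pub fn exception_level(current_el: u64) -> u8 {
    ((current_el >> 2) & 0b11) as u8
}
```

With this, the 8 received over the ring buffer decodes to 2, and a value of 4 would decode to 1.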

At this point, I am going to copy what the blog post does in exception level 2:

in_el2:
    mrs     x0, hcr_el2
    orr     x0, x0, (1 << 31)
    and     x0, x0, ~(1 << 5)
    and     x0, x0, ~(1 << 4)
    and     x0, x0, ~(1 << 3)
    msr     hcr_el2, x0
    mov     x0, 0b00101
    msr     spsr_el2, x0
    adr     x0, in_el1
    msr     elr_el2, x0
    eret

I looked into what this actually does, and here’s my line by line breakdown:

mrs x0, hcr_el2 loads in the hcr_el2 configuration register into x0.
orr x0, x0, (1 << 31) sets the 31st bit to 1, which means that the next lower exception level (el1) will be in aarch64 mode when we switch to it (source)
and x0, x0, ~(1 << 5) sets bit 5 to 0, meaning that physical SError interrupts are not taken to el2 (source)
and x0, x0, ~(1 << 4) sets bit 4 to 0, meaning that physical IRQ interrupts are not taken to el2. (source)
and x0, x0, ~(1 << 3) sets bit 3 to 0, meaning that physical FIQ interrupts are not taken to el2. (source)
msr hcr_el2, x0 writes x0 back into hcr_el2 to save the changes made above.
mov x0, 0b00101 sets x0 to 0b00101 and
msr spsr_el2, x0 loads x0 into spsr_el2, which as far as I can tell means we are setting values that will get copied to the execution state upon switching to el1. (source)
adr x0, in_el1 sets x0 to contain the address of in_el1 (which is a later part of the blog assembly that I did not copy),
msr elr_el2, x0 writes x0 to the elr_el2 register, which determines where the cpu will branch to upon exiting the exception level;
eret returns from exception level 2 (branching to the value stored above, so we are executing whatever is at in_el1).
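
To make the HCR_EL2 part of that breakdown concrete, here is a small host-side sketch of the same bit manipulation. The constant and function names are my own labels for illustration (RW, AMO, IMO and FMO are the Arm field names for bits 31, 5, 4 and 3), and it only computes the value; it does not touch any register.

```rust
// Hypothetical names for the HCR_EL2 bits touched by the assembly above.
const HCR_RW: u64 = 1 << 31; // EL1 runs in the aarch64 execution state
const HCR_AMO: u64 = 1 << 5; // route physical SError interrupts to EL2
const HCR_IMO: u64 = 1 << 4; // route physical IRQ interrupts to EL2
const HCR_FMO: u64 = 1 << 3; // route physical FIQ interrupts to EL2

/// Returns the HCR_EL2 value the assembly would write back,
/// given the value it read: set RW, clear AMO/IMO/FMO, keep the rest.
pub fn hcr_for_el1_drop(hcr: u64) -> u64 {
    (hcr | HCR_RW) & !(HCR_AMO | HCR_IMO | HCR_FMO)
}
```

For example, starting from a value with all of the low six bits set, only bits 5, 4 and 3 are cleared and bit 31 is added; everything else is preserved.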

I don’t really know what “page tables” are, or how the MMU is actually configured; I just know that the MMU is in charge of translating virtual addresses to physical addresses, and that it also controls the memory properties of each section of memory. To figure out what’s going on under the hood, I read this pdf and this documentation (as well as some other official ARM documentation) to get up and running.

Hiccup with u64

For some reason, the above code for reading CurrentEL started crashing on the bare metal side. I was confused, since it wouldn’t even start up at all (evidenced by inspecting the raw shared memory region). After much troubleshooting, I realized that the error stemmed from adding the u64 message variant to my BaremetalMessage enum. At first my mind went to alignment issues, but that didn’t really make sense to me since the SharedMem struct is located at address 0x10000000, which is about as aligned as can be (within reason). I pointed an LLM to my code base and asked what the issue might be (since I couldn’t find anything online the normal way), and among the various suggestions for troubleshooting, one stood out: apparently rustc/LLVM sometimes emit SIMD instructions to operate on larger pieces of data (e.g. u64), and if you don’t enable SIMD on startup, trying to run such instructions causes a crash/exception (and I don’t have exception handling set up at the moment).

Re-compiling with CARGO_BUILD_RUSTFLAGS = "-C target-feature=+nofpu,+nosimd" in my flake made the program work, and confirmed that this was the problem! So I have to set up SIMD at some point to use more efficient instructions (which I will want to have for DSP purposes later anyways). For now I am going to focus on setting up the MMU!

Setting Up Translation Tables for MMU

I mentioned earlier that CurrentEL is 8 (i.e. 0b...001000) according to the message sent by bare metal, and this means we are in EL2 (source). Since I am not building a kernel in the usual sense, and I don’t plan on running any external code, I don’t actually care about using virtual addresses to sandbox programs; the only thing I need the MMU for is for setting up memory attributes. What I mean to say is that my virtual to physical address translation will be “transparent” (i.e. each virtual address maps to the physical address with the same literal address as the input), and the only thing I really need/want is to set memory attributes to allow caching/optimization for normal memory, and to keep the peripheral memory mapped as “device” memory to avoid unpredictable behaviour.

I have decided that all my bare-metal code will execute in EL1, so the first thing I am going to do is drop down to EL1 like the blog post above did. The post-boot initial assembly of my bare metal program now looks like this:

    .section .text._start
    .globl _start
_start:
    # Set the stack pointer for EL1.
    ldr x0, = _stack_start_1
    msr sp_el1, x0

    # Configure hypervisor configuration register (HCR)
    # to avoid trapping exceptions at EL2.
    mrs     x0, hcr_el2
    orr     x0, x0, (1 << 31)
    and     x0, x0, ~(1 << 5)
    and     x0, x0, ~(1 << 4)
    and     x0, x0, ~(1 << 3)
    msr     hcr_el2, x0

    
    mov     x0, 0b00101
    msr     spsr_el2, x0
    adr     x0, _el1_setup
    msr     elr_el2, x0
    eret
    
_el1_setup:

    bl _rust_main

I have copied the assembly from the blog post to drop us down into EL1. The bare metal program now sends Message: TestSendingAU64Lol(4), which means that the bottom bits of the CurrentEL register are 0b0100, indeed corresponding to EL1.

However, sometimes the program would crash, and swapping back to the previous minimal boot assembly seemed to solve the problem. I ended up changing the boot assembly to:

.section .text._start
.globl _start
_start:
    # Set the stack pointer for EL1.
    ldr x0, = _stack_start_1
    msr sp_el1, x0

    # Configure hypervisor configuration register (HCR)
    # to avoid trapping exceptions at EL2.
    mrs     x0, hcr_el2
    orr     x0, x0, #(1 << 31)
    bic     x0, x0, #(1 << 5)
    bic     x0, x0, #(1 << 4)
    bic     x0, x0, #(1 << 3)
    msr     hcr_el2, x0

    # Disable SIMD and FPU instruction trapping at EL2
    msr cptr_el2, xzr

    # Reset sctlr_el1 to zeros to avoid faults
    msr     sctlr_el1, xzr

    # Set DAIF to 1111 to avoid interrupts, set M[4:0] to 0b00101 also.
    mov     x0, 0b1111000101
    msr     spsr_el2, x0
    adr     x0, _el1_setup
    msr     elr_el2, x0

    
    isb
    eret

_el1_setup:

    # Enable FPU and SIMD, since rust likes to use instructions that require this.
    mov     x0, #(3 << 20)
    msr     cpacr_el1, x0

    bl _rust_main

The main additions are isb before eret, as well as resetting some registers that are undefined on power up (though they are likely set to something reasonable by the rpi boot code, I’m doing this anyways just in case). I also enable SIMD and the FPU, just to be certain that this isn’t causing crashes (and this allowed me to remove the “nosimd” flag from my nix shell, which means that my DSP code can be more optimal/efficient later).

Anyways, back to translation tables! From the BCM2837 documentation, peripherals can be accessed at the physical address range 0x3F000000 through 0x3FFFFFFF, and I have defined the shared memory region to be in normal RAM in the address range 0x10000000 through 0x100FFFFF and the bare metal reserved memory region to be from 0x10100000 to 0x1B47EFFF. There is no need (and indeed it is a bad idea) to map the linux specific memory, since then I could accidentally overwrite important linux things; I’d rather the program crash on a read/write to illegal memory than cause less visible latent bugs with linux. So I am not going to map this section.

I want to map the peripherals with the strongest restrictions possible (device memory with no re-ordering, no speculative fetching, and no caching), and I want to map the normal memory as normal, cacheable, and re-orderable (and then use atomics and memory fences where necessary for synchronization).

Looking at the docs for memory attributes, it looks like flagging the peripheral memory as Device-nGnRnE would be best, since it is the most restrictive and the safest bet for now.

For the other memory, we can map it as normal, re-orderable, inner-shareable, cacheable, groupable, etc.
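
The plan above boils down to a lookup from address to memory type. Here is a rough sketch of it, assuming the mapping is done in 2 MiB blocks and that everything between the start of shared memory and the peripheral base is treated as normal memory (which is coarser than the exact ranges listed above). The names Region, block_index and classify are made up for this example.

```rust
/// How an address should be treated (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Region {
    Unmapped, // linux memory, or anything above the peripherals
    Normal,   // shared memory and bare metal RAM: cacheable
    Device,   // peripherals: Device-nGnRnE
}

const BLOCK_SHIFT: u32 = 21; // assumption: 2 MiB blocks
const SHARED_MEM_START: u64 = 0x1000_0000;
const PERIPHERAL_START: u64 = 0x3F00_0000;
const PERIPHERAL_END: u64 = 0x4000_0000; // exclusive

/// Index of the 2 MiB block that contains `addr`.
pub fn block_index(addr: u64) -> u64 {
    addr >> BLOCK_SHIFT
}

/// Classifies a physical address according to the plan above.
pub fn classify(addr: u64) -> Region {
    if addr < SHARED_MEM_START || addr >= PERIPHERAL_END {
        Region::Unmapped
    } else if addr < PERIPHERAL_START {
        Region::Normal
    } else {
        Region::Device
    }
}
```

With 2 MiB blocks, shared memory starts at block 128 and the peripherals occupy blocks 504 through 511, so the whole plan fits in a single 512-entry table.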

From the aarch64 memory management guide section 7.2, some registers relevant to memory translation are:

SCTLR_ELx (fields M, C, EE) enables the MMU, enables caches, and controls endianness for table walks (more info),
TTBR0_ELx and TTBR1_ELx (fields BADDR, ASID) set the physical address for the start of translation tables (more info),
TCR_ELx (fields PS/IPS, TnSZ, TGn, SH/IRGN/ORGN, EPDn) sets the size of the physical address output range, the size of the address space covered by the table, the granule size, the cacheability and shareability to be used by MMU table walks, and whether walks to specific tables are disabled (more info),
MAIR_ELx (Attr field) controls the type and cacheability in Stage 1 tables (more info).
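
The TnSZ field is the least obvious of these: it encodes the size of the translated region as 2^(64 - TnSZ) bytes. A quick sketch of the arithmetic, assuming a 30-bit address space split into 2 MiB blocks (both numbers are assumptions for this example, and the function names are made up):

```rust
/// TnSZ value for an address space that is `va_bits` wide.
pub const fn tnsz_for_va_bits(va_bits: u32) -> u32 {
    64 - va_bits
}

/// Size in bytes of the address space selected by a given TnSZ.
pub const fn va_space_bytes(tnsz: u32) -> u64 {
    1u64 << (64 - tnsz)
}

/// Number of table entries needed to cover `va_bits` of address space
/// when each entry maps a block of 2^block_shift bytes.
pub const fn entries_needed(va_bits: u32, block_shift: u32) -> u64 {
    1u64 << (va_bits - block_shift)
}
```

A 30-bit space gives TnSZ = 34, covers 1 GiB (0x0 through 0x3FFFFFFF, which just includes the peripherals), and needs 512 entries of 2 MiB each.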

Continuing along with the blog post, they do the following:

in_el1:
    mov     x0, 0b0101
    msr     spsr_el1, x0
    msr     DAIFSet, 0b1111

This writes 0b0101 to SPSR_EL1, so the mode field M[4:0] is 0b00101: M[4] = 0 selects the aarch64 execution state, M[3:2] = 0b01 selects EL1, and M[0] = 1 selects the stack pointer SP_EL1. In other words, if we were to return from an exception, we would drop into EL1 in aarch64 using SP_EL1. Writing 0b1111 to DAIFSet sets the D, A, I, F mask bits of PSTATE, which means that debug exceptions, SErrors, IRQs and FIQs are masked (until we clear the bits again), effectively allowing us a “critical section” of code that cannot be interrupted; this is where we can set up the MMU without unexpected interruptions. At this point the blog post sets values in the same 5 registers listed above relating to the MMU (which I found in the arm docs), which is a good sign that we are on the right track.
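
Since the SPSR values are easy to misread as raw binary, here is a small decoding sketch. It is not part of the project code; SpsrMode and decode_spsr are hypothetical names, and only the M[4:0] and DAIF fields (bits [4:0] and [9:6]) are decoded.

```rust
/// The parts of an SPSR value discussed above (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct SpsrMode {
    pub aarch64: bool,     // M[4] == 0
    pub el: u8,            // M[3:2]
    pub uses_sp_elx: bool, // M[0] == 1: use SP_ELx instead of SP_EL0
    pub daif_mask: u8,     // bits [9:6]: D, A, I, F
}

/// Splits a raw SPSR value into its mode and interrupt mask fields.
pub fn decode_spsr(spsr: u64) -> SpsrMode {
    SpsrMode {
        aarch64: (spsr >> 4) & 1 == 0,
        el: ((spsr >> 2) & 0b11) as u8,
        uses_sp_elx: spsr & 1 == 1,
        daif_mask: ((spsr >> 6) & 0b1111) as u8,
    }
}
```

Decoding 0b0101 gives aarch64, EL1, SP_EL1, and no mask bits; decoding the 0b1111000101 used in my boot assembly gives the same mode with all four DAIF bits set.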

Here’s my setup for setting the registers to appropriate values, generating the translation table, and finally enabling the MMU:

pub mod tables {

    use core::arch::asm;

    const TABLE2_SIZE: usize = 512;

    pub fn setup_mmu() {
        #[repr(C, align(4096))]
        struct TranslationTable([u64; TABLE2_SIZE]);

        const fn create_translation_table2() -> TranslationTable {
            let mut out = [0; TABLE2_SIZE];
            let mut i = 0;

            while i < TABLE2_SIZE {
                let mut val: u64 = 0;

                if i < 128 {
                    // Leave as 0; this is linux memory and we don't want to map it.
                } else {
                    if i < 504 {
                        // Normal memory.
                        val |= 0b0 << 54; // Set UXN to 0 (executable at EL0)
                        val |= 0b0 << 53; // Set PXN to 0 (executable at EL1)
                        val |= 0b11 << 8; // Set SH (shareability -> inner shareable)
                        val |= 0b000 << 2; // Set AttrIndex to 0 so we read ATTR0 from MAIR_EL1.
                    } else {
                        // Peripheral memory.
                        val |= 0b1 << 54; // Set UXN to 1 (non-executable at EL0)
                        val |= 0b1 << 53; // Set PXN to 1 (non-executable at EL1)
                        val |= 0b10 << 8; // Set SH (shareability -> outer shareable)
                        val |= 0b001 << 2; // Set AttrIndex to 1 so we read ATTR1 from MAIR_EL1.
                    }

                    val |= 0b1 << 10; // Set AF (access flag) to 1 so that accesses don't raise an Access flag fault.
                    val |= 0b00 << 6; // Set AP[2:1] to 00 (read/write at EL1, no access from EL0)
                    val |= 0b0 << 5; // Set NS (non-secure = 0 means we are in secure address map)

                    val |= (i as u64) << 21; // Set translation destination!

                    val |= 0b01; // Set bits[1:0] to 01, meaning this entry is a valid block descriptor.

                    out[i] = val;
                }

                i += 1;
            }

            TranslationTable(out)
        }

        static TRANSLATION_TABLE_LVL2: TranslationTable = create_translation_table2();

        const TCR_EL1: u64 = {
            let mut out: u64 = 0;

            out |= 0b101 << 32; // IPS (intermediate physical address size) is 48.
            out |= 0b10 << 30; // TG1; set granule size to 4K (for 2MB lvl 2).
            out |= 0b11 << 28; // SH1 (shareability) set to inner-shareable.
            out |= 0b01 << 26; // ORGN1 set outer cacheability to normal outer write-back write-allocate.
            out |= 0b01 << 24; // IRGN1 set inner cacheability to the same
            out |= 0b1 << 23; // EPD1, DISABLE TABLE WALKING on TLB miss; we only want lower addresses to be translated.
            out |= 0b0 << 22; // A1; use TTBR0_EL1 ASID for ASID (good because we have no TTBR1_EL1).
            out |= 34 << 16; // T1SZ; set to 34 so that only 30 bits are used for addressing (only bottom 30 bits are relevant).
            out |= 0b00 << 14; // TG0, set to 4K granule size.
            out |= 0b11 << 12; // SH0; Shareability (same as above)
            out |= 0b01 << 10; // ORGN0 outer cacheability (see above)
            out |= 0b01 << 8; // IRGN0 inner cacheability (see above)
            out |= 0b0 << 7; // EPD0; enable table walks with TTBR0_EL1
            out |= 34; // T0SZ; (same as above) set for 30 bit max address.

            out
        };

        const MAIR_EL1: u64 = {
            let mut out: u64 = 0;
            out |= 0b00000000 << 8; // Set ATTR1 to Device-nGnRnE.
            out |= 0b11111111; // Set ATTR0 to Normal memory, Inner and Outer Write-Back Non-transient, Read-Allocate Write-Allocate.
            out
        };

        unsafe {
            asm!(
                "
                    # disable exceptions so we are not interrupted.
                    msr DAIFSet, 0b1111 


                    # Set relevant registers
                    msr TCR_EL1, {tcr_el1}
                    msr MAIR_EL1, {mair_el1}
                    msr TTBR0_EL1, {table_base_addr}
                    msr TTBR1_EL1, {table_base_addr}

                    # Invalidate TLB cache
                    tlbi vmalle1

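                    # Enable MMU!
                    mrs {tmp}, sctlr_el1
                    # Bit 0 (M): enable the MMU.
                    orr {tmp}, {tmp}, 0x1
                    # Bit 12 (I): enable the instruction cache. Bit 2 (C, data cache) is left unchanged.
                    orr {tmp}, {tmp}, (0x1 << 12)
                    msr sctlr_el1, {tmp}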

                    dsb ish
                    isb
                ",
                tcr_el1 =  in(reg) TCR_EL1,
                mair_el1 = in(reg) MAIR_EL1,
                table_base_addr = in(reg) &TRANSLATION_TABLE_LVL2 as *const TranslationTable as u64,
                tmp = out(reg) _,
            );
        }
    }
}

The comments in there explain what’s going on at each step. I’d like to point out that while I used the blog post above as a vague guide, and LLMs for clarifying which registers are relevant and gaining an intuition for what’s going on, I ultimately wrote all the code and comments in this snippet myself, closely following the armv8 reference manual to figure out which bits mean what, and what I should do in order to set up the MMU the way I need it.

This snippet is the final version after troubleshooting some minor issues, such as accidentally setting all the memory to read-only (causing it to crash), accidentally mixing up ATTR0 and ATTR1 in the index in the translation tables (so I had peripherals mapped as normal memory and RAM mapped as device memory), and some other minor details.
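
Mix-ups like the ATTR0/ATTR1 one are easier to spot with a decoder that runs on the laptop side. The following is a sketch under the assumption of stage 1, level 2 block descriptors with a 4K granule (output address in bits [47:21]); the names BlockFields, decode_block and mair_attr are hypothetical and not part of my project.

```rust
/// Selected fields of a level 2 block descriptor (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct BlockFields {
    pub valid_block: bool, // bits [1:0] == 0b01
    pub output_addr: u64,  // bits [47:21], 2 MiB aligned
    pub attr_index: u8,    // bits [4:2], index into MAIR_EL1
    pub shareability: u8,  // bits [9:8]
    pub access_flag: bool, // bit 10
}

/// Decodes the fields of a descriptor that the table code sets.
pub fn decode_block(desc: u64) -> BlockFields {
    BlockFields {
        valid_block: desc & 0b11 == 0b01,
        output_addr: desc & 0x0000_FFFF_FFE0_0000,
        attr_index: ((desc >> 2) & 0b111) as u8,
        shareability: ((desc >> 8) & 0b11) as u8,
        access_flag: (desc >> 10) & 1 == 1,
    }
}

/// Returns the 8-bit attribute that `index` selects in a MAIR value.
pub fn mair_attr(mair: u64, index: u8) -> u8 {
    ((mair >> (8 * (index as u64 & 0b111))) & 0xFF) as u8
}
```

Feeding it entry 504 of the table together with the MAIR_EL1 value from the snippet should give attribute 0x00 (Device-nGnRnE), and entry 128 should give 0xFF (normal write-back memory); if those come out swapped, the indices are mixed up.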

As soon as this didn’t crash the program, something really cool happened: I started receiving waay more messages from bare metal! I was still sending the value of CurrentEL, but now instead of sending the message only two or three times per userspace poll (i.e. per second), there are about 210 messages per poll cycle! Here are some log snippets comparing the two:

Before:

[INFO] Core Status: [Running, Init, Init]
[INFO] Message: Ping
[INFO] Message: TestSendingAU64Lol(8)
[INFO] Message: TestSendingAU64Lol(8)
[INFO] No new messages.
[INFO] Sending message...
[INFO] testing sending a message: Ok(())

After:

[INFO] Core Status: [Running, Init, Init]
[INFO] Message: Ping
[INFO] Message: TestSendingAU64Lol(4)
[INFO] Message: TestSendingAU64Lol(4)
... (200+ lines of the same)
[INFO] Message: TestSendingAU64Lol(4)
[INFO] Message: TestSendingAU64Lol(4)
[INFO] Message: TestSendingAU64Lol(4)
[INFO] No new messages.
[INFO] Sending message...
[INFO] testing sending a message: Ok(())  

This is very promising, as I believe it is likely that enabling cache has allowed for massive speedup of my “delay” loop, which is currently:

for _ in 1..2500000 {
    asm!("nop");
}

Presumably the nop instructions are now cached to the extent that we have a 100x speedup. Cool!

Conclusion

I tested my MPSC ring buffer again, but it is still crashing. The likely culprit now seems to be incompatible memory attributes between the bare metal core and linux. Eventually I will come back and troubleshoot that, but for now I am happy just using an SPSC channel per bare metal core. It took me a few weeks, but the process of enabling the MMU helped me learn some interesting things about how ARM processors work under the hood (exceptions/exception levels, system registers, what the MMU does, how to read ARM docs, etc), and as a result my code can now benefit from caching and other memory optimizations!

Next up I am going to start working on the actual DSP code. I think I am going to set up a way to execute my code on my laptop through my audio interface, allowing me to test how my DSP stuff sounds separately from testing how the DSP runs on the bare metal RPi. I also have to finish testing my ADC/DAC breakout board, and set up the bare metal audio IO pipeline (how to efficiently read/write from the I2S peripheral) once I am able to control the ADC and DAC.