Generating correct initial page tables from formal hardware descriptions

Conference Paper

Author(s):
Achermann, Reto; Cock, David; Haecki, Roni; Hossle, Nora; Humbel, Lukas; Roscoe, Timothy; Schwyn, Daniel

Publication date:
2021-10

Permanent link:
https://doi.org/10.3929/ethz-b-000511171

Rights / license:
In Copyright - Non-Commercial Use Permitted

Originally published in:
https://doi.org/10.1145/3477113.3487270
Generating correct initial page tables from formal hardware descriptions

Reto Achermann
University of British Columbia
Vancouver, Canada

David Cock
ETH Zurich
Zurich, Switzerland

Roni Haecki
ETH Zurich
Zurich, Switzerland

Nora Hossle
ETH Zurich
Zurich, Switzerland

Lukas Humbel
ETH Zurich
Zurich, Switzerland

Timothy Roscoe
ETH Zurich
Zurich, Switzerland

Daniel Schwyn
ETH Zurich
Zurich, Switzerland

Abstract

Modern hardware platforms are increasingly complex and heterogeneous. System software uses a hodgepodge of different mechanisms and representations to express the memory topology of the target platform. Considerable maintenance effort is required to keep them in sync while often sharing is impossible due to hard-coded values. Incorrect platform-specific values in the hardware initialization sequence can lead to security critical and hard-to-find bugs because of misconfigured translation hardware, inaccessible devices, or the use of bad pointers.

We present a better way for system software to express and initialize memory hardware. We adopt an existing, powerful hardware description language, and efficiently compile it to generate correct initial page tables and memory maps for OS kernels and firmware from a single system description.

We evaluate our system on multiple architectures and platforms, and demonstrate that we can use the generated data structures to successfully initialize translation hardware, devices, memory maps, and allocators enabling easy support of new hardware platforms.

1 Introduction

Hardware is becoming both more complex, and simultaneously more diverse: Even small SoCs now comprise a dozen dramatically different processors (application cores, DSPs, accelerators, etc.), bound together with a complex non-uniform interconnect with each agent having a unique view of system addresses. At the same time, the number of different platforms to which software must be ported is growing dramatically each year, beyond the rate at which high-quality initialization and management code can be written. One result is that, despite enormous investment by platform vendors, the state of the art for platform initialization is a hodgepodge of repurposed mechanisms, none quite fit-for-purpose, and the widespread reuse of canned initialization snippets, silently replicating and importing inaccurate assumptions about hardware.

In this paper, we show by example that there is a better way. We consider the particular case of page table generation and allocator initialization, which exposes much of the complexity of modern hardware (including non-uniform addressing and heterogeneous processing), while being critical to the correct and secure operation of the system. We adopt our existing description language for addressing architectures (Sockeye), which has been shown to be able to express real, extremely complex modern systems, and which has a rigorous formal interpretation (decoding nets). Finally, we formulate the problem of system initialization as one of compilation: can we (efficiently) generate correct initialization data (here, the initial page tables and allocator state) from a Sockeye description of the system?

In the remainder of this paper we show that the answer to the above question is a resounding yes. The decoding net formalization allows us to frame the problem as determining whether a page table exists which maps the desired CPU-visible (“virtual”) address space to the projection of system resources (RAM, device registers, etc.) onto the CPU’s “physical” addresses (as computed from the decoding net), within the constraints of the virtual memory system (e.g. granularity). As we will show, this requires simply the recursive projection of resources in reverse direction (from enclosed to enclosing AS) along the decoding net.

Lastly, we show that a straightforward expression of the decoding net rules in Prolog produces an efficient solution, which works in practice. We are able to generate correct
We further show that the solution is fully general, and not special-cased to any particular architecture, by generating both initialization data and configurations for the official Arm Fast Models simulator for a variety of pathologically complex and nonuniform hypothetical platforms, and showing that the generated configuration suffices to boot the OS in all cases without any user intervention whatsoever.

Our contributions thus are demonstrating that system initialization data does not have to be hand-crafted but can be derived efficiently from a formal model and showing that it can be used to boot an OS even on systems with esoteric topologies. The standalone toolchain is available open source [24].

2 Background

Consider the problem of simply booting a modern machine. A typical boot sequence for an ARMv8 platform is shown in Figure 1: a chain of different firmware and OS Components loaded one after the other, each requiring information about the hardware platform; indeed the OS is the last component to be loaded and executed.

During boot, all this system software needs to understand the complex memory topology of the platform and reflect this in the hardware initialization sequence. A key part of this challenge, and the one we focus on in this paper, is the creation of the page tables to configure the processor’s MMU (and System MMU), constructing the memory map for populating the memory managers, initializing devices at the right location and programming them with the correct memory addresses.

Today, each component in the boot sequence must initialize hardware, or make use of hardware configured in a prior step. The way this hardware is described today is typically hardcoded by programmers (even for “discoverable hardware”), and in a variety of ad-hoc and ill-defined formats. This results in a considerable engineering burden for each new device (as anyone who has done an OS bringup on a new piece of modern hardware can attest), and moreover has the potential to introduce catastrophic and hard-to-debug errors as a result of memory misconfiguration [12, 14, 25], or wrongly passed tables between components [13, 15, 16].

The memory topology of modern hardware platforms makes this problem even worse [7]. Memory accesses from processor cores and devices traverse multiple buses, memory controllers, and memory translation and protection units before reaching their destination e.g. memory cell or device register. For example, Figure 2 shows a typical modern SoC from NXP, where the meaning of a (physical) address is relative to the processor core (A35 or M4) or DMA-capable device (LPUART, PPB) leading to situations where the same resources appear at different addresses, or different resources appear at the same address, viewed from different cores.

Current approaches

Today, computers use a mix of different mechanisms and representations of system state (including memory maps) at different stages in the boot process: ACPI [17], UEFI [20], hardcoded values, DeviceTree [10], etc. Linux even builds initial page tables using hand-written assembly [19]. Arm Trusted Firmware [18] uses a C data structure to initialize system page tables [11].

Keeping these representations in sync is purely manual. Hard-coding various aspects of the platform makes it hard to share code between platforms and increases the maintenance effort needed to support a wide range of systems. OS developers try to separate initialization code from the platform-specific values it uses, but quickly run into problems: in practice, modern hardware cannot be described faithfully as a set of arguments to a C function, and the ideal of a single externalized platform description is never achieved.

The most complete description format for platforms today is DeviceTree [10], used by the Linux kernel to describe non-discoverable platform information, and also employed (with different device tree files) by some intermediate bootloaders.

DeviceTree files capture the platform’s application processors, memory and caches, devices, interrupt sources, and clocks in a tree-like data structure with a single root. While
Generating correct initial page tables from formal hardware descriptions

Generating correct initial page tables from formal hardware descriptions

Discussion
We argue that, rather initializing and booting a machine rely-
ing on replicated, hand-written, low-level code, interpreting a semantic-free and inherently incomplete description of the platform hardware, a better way is needed.

Instead, we start with a **formally-specified** way to describe platforms, which can capture the full complexity of modern systems with different processors and interconnects, and then use this description to generate low-level system software and firmware components that are correct by construction. This approach is not only motivated by reducing engineering cost, but is also an absolute prerequisite for formally verifying low-level system software for a given platform.

Our work is thus aligned with other OS synthesis efforts such as Termite [21] which synthesizes device drivers from behavioral descriptions of devices and the OS. Hu et al. [8] identify problems and opportunities in synthesizing an OS kernel, pointing out that some OS components are inherently hard to synthesize and arguing for a hybrid approach. This paper is an example: page tables are generated from descriptions independently of the rest of the OS.

In this paper, we demonstrate that initial page tables can be constructed generically from formal specifications of the system at hand. We not only show how this can be done efficiently, but also demonstrate that it works for heterogeneous systems with highly esoteric memory addressing. We start with the formal representation of addressing in modern SoCs that forms the foundation of our approach.

Decoding nets
We begin with our formally specified model, the **decoding net** [2, 3]. We have shown it captures the memory topology of a broad variety of hardware platforms in a rigorous and well-defined way. Decoding nets express the addressing structure of a system as a directed graph: nodes represent (virtual or physical) address spaces or devices (including RAM), and edges the possible translation between them. The model distinguishes *address-space-local* names (*address*), and *global* names (*name*) that are qualified by their enclosing address space. Each node may **accept** a set of (local) addresses (e.g., RAM or device registers), and/or **translate** them to one or more global names (e.g., MMU or PCI bridges).

**name** = Name nodeid address

**node** = Node accept = {address}

**translate** :: address → {name}

**Figure 3** shows the simplified decoding net for an x86 machine with a Xeon Phi [9] accelerator card. The dark nodes correspond to the accept-only leaf nodes in the graph.

The Sockeye language [23] is a syntax to express the memory topology of a hardware platform as a decoding net. Sockeye bears some superficial similarities to DeviceTrees, but in contrast has clear semantics that can express decoding nets formally (and, indeed, generate Isabelle/HOL representations of such nets). It provides syntactic elements such as regions and modules that help to express the system in a concise and understandable way. **Figure 4** shows a small excerpt from the description for the system depicted in **Figure 3**.

Sockeye is designed around reusable blocks of decoding net nodes called **modules**. Each module has a name, parameters, and a set of input and output nodes to be bound on instantiation. **Figure 4** is an excerpt of a Xeon Phi PCI-based accelerator module, restricted to the view from its CPU (**K1OM_CORE**). This address space in turn has a window to its local APIC (**LAPIC**) and maps the rest to the core local space (**PHYS**). This in turn contains memory (**GDDR**) an MMIO region for the control registers (**MMIO**) and an aperture on the system address space (**SMPT_IN**).

Sockeye allows tagging of memory regions with predicate logic terms, e.g., RAM regions are tagged with with **mem**, and device registers with **devreg**. We exploit this information in correctly mapping devices in our generated page tables.
Figure 4. Simplified Sockeye description of a Xeon Phi co-processor PCI card. Note the map to SMPT_IN providing a window to host resources.

3 Implementation

We use generating initial kernel page tables as an example as it exercises the model (specifically in identifying device regions), without being dependent on the details of a particular operating system as all kernels use quite similar layouts (in contrast to, say, the operation of their memory allocators). We also use the same techniques described here for both the static initialization and dynamic runtime state of the Barrelfish memory allocator and device manager, which we hope to present in followup work.

The initial structure of a kernel’s virtual address space is quite simple, and generally consists of a 1-1 mapping of some portion of the system address space, including enough RAM for the kernel’s internal needs, plus any devices (such as interrupt controllers) that the kernel itself relies upon. Additional device mappings are typically added at runtime, either within the kernel’s own virtual address space or into a user process’s space.

The challenge in constructing the initial page table is thus not in constructing the page table itself. The virtual-physical map is unconstrained down to the translation granule and thus can represent any desired mapping. The specific problem to be solved by querying the decoding net is rather to identify which regions are accessible to the processor (in particular the required devices), what their properties are (e.g., device registers must usually be mapped uncached), and at what address in the CPU’s ‘physical’ address space they appear. The page-table generator needs to know, for example, whether a large mapping must be split to specify that some sub-range is to be mapped uncached for a device.

Complexity

Recall, the decoding network is a directed acyclic graph, with accepting regions (here RAM or devices) at the leaves, and CPU cores (or other bus-mastering agents) at the roots (Figure 3). It is thus possible in principle to compute the regions visible at any node iteratively: beginning at the leaves, follow the edges in reverse to determine where this region appears in other spaces (noting that it may appear in many, only a sub-region may appear, it may appear twice, etc.), and repeating until all regions have been projected up as far as the target node of the CPU’s page table mappings.

As Figure 5 illustrates, the number of regions (and hence complexity of any algorithm enumerating them) is exponential in the diameter of the decoding net. We here see one accepting region (Acc) mapped twice into the immediately preceding address space, which in turn is mapped twice into its predecessor.

In this example we will generate at least $2^n$ distinct regions for a root address space at distance $n$ from the resource. Note, non-contiguous regions prohibit merging, thus all $2^n$ regions might need to be represented.

Solving for a desired configuration will in general involve a search through this exponentially-large space (runtime algorithms, e.g., allocation have stricter requirements than the initial page-table generation). We cannot expect an efficient sub-exponential algorithm to exist for the general case. In practice, such pathological examples do not occur in actual hardware, and established heuristic search strategies perform well. Indeed, we take advantage of the fact that the experimental OS on which we evaluate (Barrelfish) incorporates the Eclipse/CLP solver for just such system configuration tasks (the System Knowledge Base [22]), and encode the problem quite directly as a set of Prolog predicates which (as Section 4 shows) performs very well in practice. Note, our encoding only uses the Datalog subset of Prolog.

Prolog Encoding

Figure 6 gives the syntax used to encode a decoding net as Prolog assertions. The translate and accept facts are generated by straightforward compilation from a Sockeye description of the system (e.g., Figure 3). For efficient evaluation, we
do not express individual addresses directly, but rather use larger blocks that form (not-necessarily-contiguous) regions.

The the predicate below expresses the one-step projection from destination region $D$ up to some source region $S$ in a predecessor address space. It ensures the existence of a mapping from some region of a source address space $SM$ to some region of a destination space $DM$, such that the intersection of $DM$ and $D$ is exactly $SI$, or the image of the source region under the mapping. The remaining conjunctions establish the position of $S$ within the mapping source region $SM$ as a function of the position of the image within $DM$.

Finding the location of all regions visible in some top-level address space $A$ requires solving for $S$ for every value of $D$ for which accept($D$) holds in the transitive closure of decode_step_rev, i.e., where the source region eventually maps to the accepting region. Adding architecture-specific constraints on allowable page-table entries (e.g., page alignment), and enumerating all solutions for $S$ then gives the values for all page table entries. Properties (e.g., cacheability) are taken directly from the accepting region.

**Output Generation and Integration**

The system integration toolchain (Figure 7) uses the results of the reachable-region query on the decoding net in several compile time and runtime locations.

Firstly, the returned mapping entries are encoded into machine-specific page-table descriptors and output as static initializers for a C array comprising the initial page tables. As the page table is (on all these architectures) multilevel, we exploit the linker and loader to correctly finalize the decoding net. We evaluate our approach with two experiments: First, we constructed, intentionally hard to deal with memory topologies. Then, we demonstrate that it even works for constructed, intentionally hard to deal with memory topologies.

**4 Evaluation**

We show that it is feasible to generate initial kernel page tables and memory maps for an OS running on various real platforms. Then, we demonstrate that it even works for constructed, intentionally hard to deal with memory topologies.

**4.1 Real Platforms**

Decoding nets can accurately capture memory topologies of real hardware [3]. Here we show that our Sockeye-generated ELF relocation records ensuring that the correct addresses will be filled in either by the linker (if building static, position-dependent code) or the (boot-)loader if the kernel is dynamically-loaded and position-independent.

Secondly, the query results are used to initialize various other OS data structures, including the initial state of the memory allocator (i.e., the location of all RAM regions), and the device manager (which is told the resources required by a driver for all statically-discoverable devices). These runtime structures are thus guaranteed consistent with the kernel’s internal view of the memory system, and the offline model.

Finally, the decoding net itself (expressed in the syntax of Figure 6) is (on Barrelfish) seeded into the SKB. This data is queried dynamically at runtime (and indeed, extended as the result of online device discovery), and used to initialize additional devices requiring, for example, IOMMU/page table configuration. This includes, among others, the Xeon Phi accelerator used as an example here, which incorporates numerous full CPU cores which run their own instance of the OS kernel. The online model is further used (with appropriate queries) to correctly allocate and map DMA-able memory accessible to devices with different views of the system from that of the CPU (again, including the Xeon Phi).

---

**Figure 6.** Prolog datatypes and dynamic predicates

**Figure 7.** Integration of the query engine with the OS. Highlighted components are part of the boot image.
We generate page tables and other initialization data for each platform as described in Section 3, and use them to boot Barrelfish [6]. The bootdriver uses the generated page tables to initialize the kernel address space by simply setting the translation base register to the start of the page table binary. The bootdriver then loads the kernel, passing the locations of page tables and device mappings as an argument. The kernel accesses all memory and device via the generated tables.

In all cases, the kernel boots and configures all devices correctly using the generated page tables installed.

Despite the simplicity of the Prolog implementation, the generation process is robust enough to handle a wide variety of real hardware platforms. While we use Barrelfish for demonstration, nothing about the generated tables or data structures is OS specific, and could be used just as well in other OSes such as Linux or seL4.

4.2 Simulated Platforms

We show that the Sockeye toolchain is also capable of handling platforms with pathologically-hard memory topologies, where cores have completely different views on memory.

We specify the memory topology of the platforms using the Sockeye language. We use another backend to generate a configuration for the Arm FastModels simulator [4] corresponding to the specified topology. We use this to generate a range of unusual platforms for evaluation as follows:

We adapt the base topology of the A57 FVP of the previous experiment to generate three additional topologies (Figure 8 shows the basic structure). There are four one-GiB DRAM regions. In addition to its MMU, each core has its own configurable memory map defining its visibility of DRAM.

In the base case, both cores have the same view: $[0, 1, 2, 3]$ the first GiB maps to DRAM0, the second to DRAM1, etc. The remainder are configured as follows:

- **Swapped**: DRAM is split in two and the address ranges where the cores see the halves are swapped relative to each other. One sees DRAM as $[0, 1, 2, 3]$ and the other as $[2, 3, 0, 1]$.
- **Private**: DRAM $[1, 2]$ are shared, and each core has a private region of DRAM. The mappings are $[1, 2, 0]$ and $[1, 2, 3]$.
- **Swapped + Private**: This combines the others: the shared regions of the previous topology are swapped. The resulting mappings are $[1, 2, 0]$ and $[2, 1, 3]$.

The toolchain generates the simulator configuration, and the page tables and memory maps from the same Sockeye description. We boot Barrelfish on the simulated platforms.

Barrelfish boots successfully into userspace on all platforms. The cores use the same code except of the parts generated from the topology information. Processes on the cores successfully communicate over shared memory.

The generation approach is robust enough not only for real hardware, but in adversarial scenarios with exceptionally peculiar memory topologies. By using a single Sockeye description, the generated core-specific memory maps ensure a consistent view of memory and thus enable shared-memory message channels.

5 Future Work

Generating page tables is a first step towards OS configuration based on the decoding net model. We plan to apply the approach outlined in the paper to the full boot process. If we can precisely specify the starting state for each boot stage, then we can not only eliminate unsafe memory accesses due to wrongly configured translation tables, but also precisely specify the contract between two stages.

Similarly, we can use the same approach to express additional protection mechanisms (e.g., Arm TrustZone [5]) in Sockeye and generate configurations to divide resources in secure and non-secure worlds.

Moreover, we plan to use the runtime representation and algorithms presented in this paper in memory allocators to find memory regions that can be shared between the driver software and accelerators/devices, and to guide configuration using the recently proposed mmapx interface [1].

Finally, our deep embedding of Prolog in Isabelle/HOL provides a framework to link the algorithms and facts produced by the Sockeye compiler back to the decoding net model and enables proofs about its correctness.

6 Conclusion

In this paper, we have presented a system that leverages the sound foundation provided by the decoding net model, and the Sockeye language to generate platform-specific data structures such as page tables and memory maps. We have outlined the required algorithms, their implementation in Prolog, and the integration into the build system to obtain a page table binary image that is then used by the operating system to configure the translation hardware.

Our evaluation qualitatively shows the application and integration of the address space model into the OS toolchain to generate low-level, platform-specific OS code and data structures. Our approach and implementation thereof is functional even when run on simulated platforms with unusual address space topologies not supported by other systems.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback and comments. We acknowledge the generous support of Arm Ltd., VMware and Huawei for this work.
References


