Anatomy of Meltdown – A Technical Journey
- This blog reviews the details of Meltdown and discusses the inherent immunity for end users provided by Bromium’s architecture.
- Meltdown is an Intel CPU vulnerability leveraging speculative execution which gives an attacker-controlled process the ability to read arbitrary memory belonging to the kernel.
- Although it doesn’t allow an attacker to take control of the kernel directly, there is enough sensitive information residing within kernel memory (password hashes, BitLocker keys, etc.) that this presents a direct threat to the end user’s security.
- In an upcoming blog post, I will discuss Meltdown’s brother, Spectre, and show how Bromium mitigates likely Spectre attacks. I’ll explain the steps we have taken in response to harden the Bromium hypervisor against possible future Spectre attacks.
What follows is a technical summary, or anatomy, of Meltdown and how it works. It’s important to know that we don’t believe Bromium customers are at risk from Meltdown, because our application isolation is hardware-enforced. There’s more on this at the end of the blog. So let’s get started with a short review.
Key Concepts Related to Meltdown
- Virtual memory versus physical memory
For reasons of both security and storage optimization, the memory a process sees does not map directly onto physical memory. Rather, its virtual addresses are translated to physical addresses via page tables, which the operating system sets up and the CPU walks. Since the mappings of the user space memory of each process in the system do not generally overlap (with the exception of shared libraries and memory segments designed with this in mind), one process cannot directly read sensitive data from the user space memory of another.
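As a rough illustration of the translation step, here is a minimal Python sketch. It assumes a single flat page table held in a dictionary, whereas real x86-64 hardware walks a multi-level table; the names `translate`, `process_a` and `process_b` are hypothetical and exist only for this example.

```python
PAGE_SIZE = 4096  # 4 KiB, the common x86 base page size


def translate(page_table, virtual_address):
    """Translate a virtual address using a flat page table.

    page_table maps virtual page number -> physical frame number.
    A missing entry raises LookupError, standing in for a page fault.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise LookupError(f"page fault at {virtual_address:#x}")
    return page_table[page_number] * PAGE_SIZE + offset


# Two processes use the same virtual page but are backed by different frames,
# so neither can reach the other's data through its own addresses.
process_a = {0x10: 0x200}
process_b = {0x10: 0x350}
virtual_address = 0x10 * PAGE_SIZE + 0x2A
```

Calling `translate(process_a, virtual_address)` and `translate(process_b, virtual_address)` gives two different physical addresses for the same virtual one, which is the isolation property described above.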
- User space versus kernel space
The operating system’s kernel has access to all the hardware devices and resources of the machine, and its role is to abstract and multiplex those resources to application processes in a controlled manner, such that security properties are maintained. The kernel runs at a high CPU privilege level, ring 0, whereas application processes normally run in “user space”, ring 3. When an application process wishes to make use of some resource to which it does not have direct access, for example to write to disk, it would make a request to the kernel via a system call.
Most operating systems use the CPU’s virtual memory hardware such that the address space of a process is split into two sections: an area only accessible when running in ring 0 (the kernel), and an area accessible at any privilege level. The application code and data reside in the area accessible at any privilege level, and when the application makes a system call to transfer to the kernel, execution commences in the ring 0 only area. The kernel accessible area is typically common to all processes and contains a wealth of sensitive data, much of it unrelated to the currently executing process.
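The split can be pictured with a small model in which every page carries a flag saying whether ring 3 may touch it. This is only a sketch: `PageEntry`, `check_access`, `ProtectionFault` and the layout of `address_space` are hypothetical names chosen for illustration, and the flag stands in for the user/supervisor bit of an x86 page table entry.

```python
from dataclasses import dataclass

USER_RING = 3  # ring 3: application code; ring 0 is the kernel


class ProtectionFault(Exception):
    """Raised when ring 3 touches a page reserved for the kernel."""


@dataclass(frozen=True)
class PageEntry:
    frame: int
    user_accessible: bool  # models the user/supervisor bit


def check_access(entry, ring):
    """Return the physical frame if code running in `ring` may use the page."""
    if ring == USER_RING and not entry.user_accessible:
        raise ProtectionFault("ring 3 access to a kernel-only page")
    return entry.frame


# One address space shared by the application and the kernel (made-up layout).
address_space = {
    "app_data": PageEntry(frame=0x100, user_accessible=True),
    "kernel_data": PageEntry(frame=0x900, user_accessible=False),
}
```

Both pages are mapped in the same address space; only the flag, checked by the CPU, keeps ring 3 away from `kernel_data`.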
Having the user space and kernel in a single address space is very convenient from a programming point of view, and has never previously presented a problem as the CPU can normally be relied upon to correctly maintain the security isolation to protect the kernel area from access from ring 3. The Meltdown attack violates this isolation by cleverly exploiting a rash internal design decision as to precisely how Intel CPUs implement the security checks.
- Speculative out-of-order execution
Programs are compiled into a series of sequential instructions for a CPU to execute. However, in the quest for improved performance, modern CPUs only provide an illusion of sequential execution: they are actually looking ahead in the instruction sequence to try to spot instructions that they can get started on in advance and execute in parallel with other instructions. The first x86 CPU to have out-of-order execution was the Pentium Pro, and in the following 20 years the technique has become ever more sophisticated, with the CPU now searching hundreds of instructions into the future.
Take the following x86 code:
; 1) load the memory at rsp into rax
mov rax, [rsp]
; 2) increment rbx
inc rbx
; 3) move the contents of rbx into rdx
mov rdx, rbx
We have one slow operation followed by two fast operations. Since the result of 2) and 3) is not dependent on 1) (and even in some cases when it is), the CPU will begin computing the results of 2) and 3) without yet knowing the answer to 1). This is what is meant by “out-of-order”.
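To make the dependency reasoning concrete, the sketch below works out which later instructions can start without waiting for a slow one. It is a deliberate simplification that tracks register dependencies only (no memory dependencies, flags or execution ports), and the `reads`/`writes` encoding and the function name `ready_without` are assumptions made for this example.

```python
def ready_without(instructions, slow_index):
    """Return indices of later instructions that do not depend, directly or
    transitively, on the register results of instructions[slow_index]."""
    pending = set(instructions[slow_index]["writes"])  # registers still unknown
    independent = []
    for index in range(slow_index + 1, len(instructions)):
        instruction = instructions[index]
        if pending & set(instruction["reads"]):
            # Consumes an unknown value, so its own outputs are unknown too.
            pending |= set(instruction["writes"])
        else:
            independent.append(index)
            # Overwriting a register with a known value clears its pending state.
            pending -= set(instruction["writes"])
    return independent


program = [
    {"text": "mov rax, [rsp]", "reads": ["rsp"], "writes": ["rax"]},  # 1) slow load
    {"text": "inc rbx", "reads": ["rbx"], "writes": ["rbx"]},         # 2)
    {"text": "mov rdx, rbx", "reads": ["rbx"], "writes": ["rdx"]},    # 3)
]
```

For the three instructions above, `ready_without(program, 0)` reports that 2) and 3) can proceed while the load in 1) is still outstanding.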
Crucially, the CPU must ensure that the illusion of sequential execution in program order is maintained. Thus while speculatively executing future instructions it must make sure that if something goes wrong with an earlier instruction, e.g. a page fault or other exception, it must be able to wind the internal state back such that there are no visible effects from the speculation. It does this by “retiring” instructions in-order, determining that the instruction operation has completed and that nothing has gone wrong, and then permanently committing the results.
In the case of Meltdown, a user space program contains code that performs a load from an address that should only be accessible in ring 0 kernel mode, and then performs other instructions using the value that is read. Racing ahead looking for work to do, Intel CPUs will speculatively issue the load from kernel memory. They then allow further computation on the value that is read. When program execution order catches up with the load instruction and it is ready for retirement, the privilege violation will be spotted and a page fault generated. The internal CPU state will be wound back to the faulting instruction, and hence the results of any computation done on the value that was read should never be visible. No harm, no foul. Or so it was thought…
It turns out that although the value that was read or any computation based on it isn’t visible directly, the discoverers of Meltdown learnt that they could leak the value indirectly, via subtle side effects resulting from the instructions speculatively executed after the load that use the value that was read.
- CPU caching
Due to the high latency of loading the contents of a piece of memory from RAM, often taking hundreds of CPU cycles, all modern CPUs use one or more levels of cache to store the contents of recently accessed memory addresses. Before retrieving the contents from main memory, the CPU first checks whether that piece of memory is already stored in the cache.
The variant of Meltdown described here uses the cache as a “covert side-channel” to leak the value read from kernel memory, made possible by the ability to reliably measure the difference in the lookup time between cached and non-cached memory.
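The timing difference can be illustrated with a simulated cache. The latencies and the threshold below are made-up round numbers rather than measurements, and `ToyCache` and `was_cached` are hypothetical names; a real attack would time actual memory accesses with a cycle counter instead.

```python
CACHE_LINE = 64     # bytes per cache line on current x86 CPUs
HIT_CYCLES = 4      # assumed latency of a cached access (illustrative)
MISS_CYCLES = 200   # assumed latency of an access served from RAM (illustrative)
THRESHOLD = 100     # assumed cut-off separating hits from misses


class ToyCache:
    """Tracks which cache lines are present and reports a simulated latency."""

    def __init__(self):
        self.lines = set()

    def flush(self, address):
        """Evict the line holding `address`."""
        self.lines.discard(address // CACHE_LINE)

    def access(self, address):
        """Return the simulated latency in cycles, then cache the line."""
        line = address // CACHE_LINE
        latency = HIT_CYCLES if line in self.lines else MISS_CYCLES
        self.lines.add(line)
        return latency


def was_cached(cache, address):
    """Classify an address as cached by timing one access against THRESHOLD."""
    return cache.access(address) < THRESHOLD
```

The measurement is destructive: timing an address also pulls its line into the cache, so an observer has to flush a line before each round of measurement.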
With this information in mind, here is an overview of the execution of the attack. As previously mentioned, a process cannot access kernel memory directly. Nevertheless, Meltdown exposes how it can be read indirectly using speculative execution and communicated to the attacker. The process is relatively straightforward once the previously described concepts are understood.
An instruction sequence is set up in which user space code attempts to read a byte of the kernel’s memory. The processor will eventually generate an exception for this access, but Intel CPUs perform the memory lookup straight away and only act on the permission check for that lookup at instruction retirement time. This establishes a race condition whereby the CPU is free to speculatively execute based on the results of the forbidden lookup up until the load is retired and the exception generated. This gives the attacker a window to leak the value via a side-channel.
In this variant, the cache is used as a side channel to exfiltrate the forbidden kernel value from the compromised process. This is done by using the byte as an index to dereference a user-defined memory array of cache line-sized values, thereby causing an element of the array to be loaded into the cache. Although the process which made the forbidden memory lookup will eventually generate an exception and crash, the attacker, working from another process created by forking, iterates through the user-defined array’s elements and determines from their lookup times which index was recently cached, thus deriving the forbidden byte that was read from the kernel. The process can be easily repeated to iterate through the kernel address space, accessing further data.
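The whole sequence can be walked through in a toy simulation. This is a model of the logic only, not exploit code: it runs no speculative instructions, the latencies are invented, and `ToyMachine`, `leak_byte`, `leak_range` and the kernel address used in the usage note are hypothetical.

```python
CACHE_LINE = 64                    # probe slots are one cache line apart
HIT_CYCLES, MISS_CYCLES = 4, 200   # illustrative latencies, not measurements


class ToyMachine:
    """Models only what the walkthrough needs: kernel bytes, one register
    and a record of which probe-array slots are in the cache."""

    def __init__(self, kernel_memory):
        self.kernel_memory = kernel_memory  # address -> byte, ring 0 only
        self.rax = 0                        # architectural register
        self.cached_slots = set()           # offsets into the probe array

    def flush_probe_array(self):
        self.cached_slots.clear()

    def transient_read(self, kernel_address):
        """Forbidden load, dependent probe access, then fault and rollback."""
        saved_rax = self.rax
        self.rax = self.kernel_memory[kernel_address]  # speculative load
        self.cached_slots.add(self.rax * CACHE_LINE)   # byte used as array index
        self.rax = saved_rax                           # fault at retirement: rollback
        # The cache is not rolled back, and that is the leak.

    def probe_latency(self, slot_offset):
        return HIT_CYCLES if slot_offset in self.cached_slots else MISS_CYCLES


def leak_byte(machine, kernel_address):
    """Recover one byte by finding the probe slot with the fastest access."""
    machine.flush_probe_array()
    machine.transient_read(kernel_address)
    latencies = [machine.probe_latency(value * CACHE_LINE) for value in range(256)]
    return latencies.index(min(latencies))


def leak_range(machine, start, length):
    """Repeat the single-byte leak across consecutive kernel addresses."""
    return bytes(leak_byte(machine, start + i) for i in range(length))
```

Given a `ToyMachine` whose `kernel_memory` holds some bytes starting at a made-up address such as `0xFFFF0000`, `leak_range` returns those bytes even though `rax` never visibly holds any of them after the rollback.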
Good engineering practice is to always implement any security checks on untrusted input as early as possible. This makes them easier to reason about, and minimizes the exposure that other code has to unvalidated data, reducing the attack surface. The same good sense applies to hardware design too, and Intel were asking for trouble by leaving the check until retirement time. Even if performance reasons dictated that the load from kernel address space be issued before privilege checks were performed, it should never have been possible for other instructions to consume and operate on the value that was read. AMD has stated that its CPUs don’t have this problem, and I am sure that future Intel CPU microarchitectures won’t either. However, I expect such CPUs might be at least a couple of years out unless Intel can patch up a core design that is already in flight.
In the meantime, operating system vendors are having to make significant changes to the virtual memory handling of their OSes to work around the issue. Rather than having the kernel address space mapped into each process, on vulnerable CPUs Linux and Windows now operate with separate page tables for kernel mode and user mode. The Linux folks call this “Kernel Page Table Isolation” (KPTI), Microsoft call it “Kernel Virtual Address Shadow” (KVA Shadow, or KVAS). By avoiding having the kernel address space mapped into user processes the CPU is unable to speculatively load values from it, mitigating the issue.
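The effect of splitting the page tables can be sketched as follows. The model assumes the vulnerable behaviour described earlier (data is fetched whenever a mapping exists, with the privilege flag only acted on later) and simplifies the mitigation: real implementations keep a small part of the kernel mapped in user mode so that system call and interrupt entry still work. All names and values here are hypothetical.

```python
def speculative_load(page_table, physical_memory, virtual_page):
    """Model a vulnerable CPU: fetch the data if any mapping exists,
    ignoring the privilege flag that is only enforced at retirement."""
    entry = page_table.get(virtual_page)
    if entry is None:
        return None  # no translation, so there is nothing to fetch
    frame, _user_accessible = entry
    return physical_memory[frame]


physical_memory = {1: "app data", 9: "kernel secret"}

# Before the mitigation: one table per process, kernel pages included.
shared_table = {"user_page": (1, True), "kernel_page": (9, False)}

# After the mitigation: the table active in user mode omits kernel pages,
# and the full table is only switched in on entry to the kernel.
user_mode_table = {"user_page": (1, True)}
kernel_mode_table = dict(shared_table)
```

With `shared_table` active in user mode, the speculative load reaches the kernel data; with `user_mode_table` there is no mapping for it to follow.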
KPTI/KVAS has significant implementation complexity and it has required a fairly heroic effort from the OS vendors to hurriedly create patches to mitigate the issue. However, switching page tables when entering and leaving the kernel results in Translation Look-aside Buffer (TLB) flushes and impacts performance, particularly for workloads that perform lots of system calls. Fortunately, on newer CPUs, some of this overhead can be mitigated using a feature called Process Context IDentifiers (PCID) that eliminates some of the TLB flushing. The net result is that the overhead of KPTI/KVAS might not be too noticeable on a typical modern desktop system, but for IO-intensive workloads that perform many system calls (e.g. a database server) it could still be quite an issue. Expect lots of class action lawsuits…
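To see why tagging helps, consider a toy TLB that counts misses across a loop of system calls. It is a sketch under generous assumptions: unlimited TLB capacity, one page touched on each side of the boundary, and hypothetical identifiers `USER_PCID` and `KERNEL_PCID`. The absolute numbers mean nothing; only the contrast between the two modes is of interest.

```python
USER_PCID, KERNEL_PCID = 1, 2  # hypothetical identifiers for the two page tables


class ToyTLB:
    """Counts translation misses, with or without PCID tagging."""

    def __init__(self, use_pcid):
        self.use_pcid = use_pcid
        self.entries = set()  # (tag, virtual_page) pairs
        self.current_pcid = 0
        self.misses = 0

    def switch_page_table(self, pcid):
        if not self.use_pcid:
            self.entries.clear()  # untagged TLB: every switch flushes it
        self.current_pcid = pcid

    def access(self, virtual_page):
        tag = self.current_pcid if self.use_pcid else 0
        key = (tag, virtual_page)
        if key not in self.entries:
            self.misses += 1  # translation must be re-walked
            self.entries.add(key)


def run_syscalls(tlb, count):
    """Touch one user page and one kernel page per system call."""
    tlb.switch_page_table(USER_PCID)
    for _ in range(count):
        tlb.access("user_buffer")
        tlb.switch_page_table(KERNEL_PCID)  # enter the kernel
        tlb.access("kernel_code")
        tlb.switch_page_table(USER_PCID)    # return to user mode
    return tlb.misses
```

In this model, 100 system calls cost 200 misses without tagging and 2 with it, because tagged entries survive the switch between the two page tables.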
Bromium’s Isolation Provides Immunity
Attacking a machine using Meltdown requires the ability to run code on it. On a shared machine like a Terminal Server the attacker may already have a login, but for a typical endpoint desktop or laptop the most likely vector will be the same way that such machines are regularly compromised: via a malicious web page, email attachment, downloaded file, etc. The attacker will use a malicious web page or file to get execution on the machine, then leverage Meltdown to read sensitive data that might enable them to escalate privilege.
Using Bromium, web pages, email attachments, documents, etc. will all be opened in their own dedicated micro-VM, isolated from the underlying physical system, and without access to any sensitive information. Even on a vulnerable CPU without Microsoft’s KVAS mitigation, were an attacker to use Meltdown to read kernel data in the micro-VM, they would not find anything sensitive to read, because the micro-VM does not contain any of the host’s secrets, password hashes, etc. Further, even if they managed to take full control of the micro-VM, they would not be able to use Meltdown to access memory belonging to the host, since the guest and host do not share an address space. Bromium endpoint users are thus protected against the likely routes for delivering Meltdown attacks.
In the next post, I will be discussing the Spectre vulnerability and explaining how it compares to Meltdown. We’ll also take a look at the steps that have been taken to harden Bromium isolation against potential new attack vectors that may appear as a result.