x86/x64 CPU architecture: the stack & stack frames
20.Feb.2024, Yuriy Georgiev
The stack as a data structure
The CPU stack
The stack in the CPU is managed and affected by several registers.
SP – stack pointer
BP – base pointer
IP – instruction pointer
Respectively the register names ESP, EBP, EIP are used in 32-bit (x86) systems, and RSP, RBP, RIP in 64-bit (x64) systems.
The letter “E” before the name of the register in 32-bit systems stands for “Extended”.
The AMD64 ISA extension added 8 additional general-purpose registers, named R8 through R15. The 64-bit extended versions of the original 8 registers had an “R” prefix added to them for symmetry.
The RIP register (instruction pointer) points to the next instruction to be executed by the CPU.
The RBP register (base pointer) points to the bottom (the beginning) of the stack.
The RSP register (stack pointer) points to the top (the tip, or the end) of the stack.
Stack vs Heap
A program in your RAM is loaded at a certain address when using protected mode in modern Operating Systems.
Let’s ignore the paging and memory virtualization and imagine we work with the RAM in a simple and raw manner. Since it really doesn’t matter for the purpose of this tutorial, we will consider that the application is loaded and mapped in the memory starting from the very first address – address 0.
The application has code, predefined data (strings, images, etc.), undefined data (area of memory that is reserved but not occupied yet with information), etc.
Whenever you decide to request a big chunk of memory at runtime (like in the malloc() function in C), the operating system requests this memory from the heap.
The heap is a dynamically allocated memory. You don’t have control over where within the memory it will be allocated – the operating system takes care of that.
However, the heap grows upwards. Let’s say we have a block of heap memory starting at address 1000 with a size of 200 bytes. This block of memory will occupy the addresses between 1000 and 1200 (decimal).
The stack, however, grows in the opposite direction. Therefore the stack base pointer (RBP) is set at an address higher than the tip of the heap memory and grows downwards.
For example, if the allocated memory ends at address 1200, the stack Base Pointer could be pointing at 1300 and it will grow down to address 1200 (meaning that the stack is 100 bytes big). These are just imaginary numbers and examples to illustrate the way it works.
The important thing you need to remember is that the stack “grows” from higher addresses to lower addresses as items are pushed onto it.
How the CPU stack works
Function Prologue
Whenever you enter a function, pass arguments to this function, declare a local variable, or leave a function, the stack is used by the CPU in one way or another.
In order to be able to explain the code, we need… code. So here is a small program written in C that sums 3 integers.
On the left is the C code, and on the right is its assembly code after compilation.
Let’s say you are in the main() function, about to call the function called sum() and you pass 3 arguments to it.
int a = sum(10, 20, 30)
In low-level assembly instructions it will look like this:
mov edx, 30
mov esi, 20
mov edi, 10
Finally, main() will issue the subroutine call instruction:
call sum
When the call instruction is executed, the value of the RIP (Instruction Pointer) register is pushed onto the stack which essentially saves the address before entering the sum() function, to which the program should return execution after leaving sum().
The state of the stack is currently looking like this:
After the “call” instruction, the RIP is redirected to the address of the function sum().
As you can notice the 3 arguments are moved to 3 32-bit general purpose registers (EDX, ESI, EDI).
You are now in the sum() function. The first thing that happens on CPU level is creating a stack frame. This will be a portion of the stack that will be available only for the current function. Whenever the function returns this stack frame will be destructed.
In sum() the very first instructions that will actually prepare the stack frame are:
push rbp
mov rbp, rsp
What this does is to “save” the current position of the Base Pointer (the bottom of the “current” stack frame) and replace it with the Stack Pointer (the tip/top of the stack). So the new Base Pointer is the current top of the stack.
This operation of saving the old base pointer and moving it to the tip of the stack is the so-called Function Prologue.
The Stack Frame
The stack frame is the gap between the RBP and RSP.
In our example, the only thing that is in the stack is the old RBP value, so the stack frame looks like this:
However, we have 3 arguments inside registers that are used as arguments to the function sum(). To use them, the compiler generates code that will save them into the stack right after we enter the function sum().
mov DWORD PTR [rbp-12], edx
The old code is colored in yellow.
The lines “mov dowrd ptr[rbp-4], edi” etc. instruct the CPU to copy the values that are in the registers (EDI, ESI, EDX) to the specified memory address locations.
Note the address arithmetic. Every address is an offset relative to the RBP. Since each integer is 4 bytes in size, for each value the CPU uses an address that is 4 bytes away from the previous one. Note it is subtracting from the RBP. If you remember the stack grows downwards, from higher to lower memory addresses. This should speak to you.
This code utilizes the Base Pointer to save the values of the 3 arguments passed to the function via registers, on the stack.
At this point the stack should look like this:
Function Epilogue
After doing its job, the function sum() prepares to return to its caller (the main() function).
This is called the Function Epilogue.
The code that does that is the following:
add eax, edx
pop rbp
ret
The code “add eax, edx” adds the value of EDX to the value in EAX. EAX is the default register that holds data that a function will return. It is not a part of the function epilogue but I left it here to explain how functions return values. This explains why functions can return only a single value. If you like to return more than one value, you need to use a structure or pointers passed to the function as arguments. Higher-level languages like Python can disguise that, but it is what happens under the hood.
After that, it pops the “old” RBP from the stack, replacing the current value of RBP, which restores the previous “bottom” of the stack frame (that belongs to main()).
The “ret” instruction transfers control to the return address located on the stack (the RIP (instruction pointer) pushed to the stack before calling the sum() function).
Why the Stack is Faster than the Heap
– Stack is allocated when the thread starts – this is really the first and foremost thing. You do not have to ask your OS for memory; instead, you get a preallocated chunk. For ARM64, x86, and x64 machines, the default stack size is 1 MB.
– Stack variables do not require allocating on the heap, which is managed by a memory allocator, which is a slow process. Stack variables do not need to be “freed”, which is also a slow process.
These are the major reasons why people say the “stack” is faster than the “heap”, even though they all come from the same memory place physically.
Also, stack variables are put one next to another. This plays well with the CPU cache.
– Deallocating is a simple matter of moving the Base Pointer and Stack Pointer – given that we are really just moving up & down inside our own 1MB of space, all we need to do in order to allocate and wipe out data is to move a pointer that tells us where’s our stack bottom.
– Addresses can be precalculated at compile time, while the heap is allocated dynamically at run time and retrieves its address after the allocation.
– When the CPU requests data from a certain address in the memory it retrieves a whole cache line, which is 64 bytes in x64 systems (128 in Apple’s M2), instead of simply the value it will work with. Once the data is retrieved from the RAM it is stored in the CPU cache L1 which is an extremely fast memory. The only memory faster than the L1 Cache is the registers.
If the stack variables are stored one after another, requesting the value of one variable leads to retrieving everything after it up to 64 bytes. This leads to caching the following stack variables in the L1 cache making them available for fast access.