x86/x64 CPU architecture: the stack & stack frames

20.Feb.2024, Yuriy Georgiev

The stack as a data structure

The stack is a linear data structure. It follows a last in, first out (LIFO) protocol. The latest piece of data added to a stack is the one which is eligible to be removed first. If three nodes ( a , b and, c ) are added to a stack in this exact same order, the node “c” must be removed first.

The CPU stack

The stack in the CPU is managed and affected by several registers.

SP – stack pointer
BP – base pointer
IP – instruction pointer


Respectively the register names ESP, EBP, EIP are used in 32-bit (x86) systems, and RSP, RBP, RIP in 64-bit (x64) systems.
The letter “E” before the name of the register in 32-bit systems stands for “Extended”.

The AMD64 ISA extension added 8 additional general-purpose registers, named R8 through R15. The 64-bit extended versions of the original 8 registers had an “R” prefix added to them for symmetry.

The RIP register (instruction pointer) points to the next instruction to be executed by the CPU.
The RBP register (base pointer) points to the bottom (the beginning) of the stack.
The RSP register (stack pointer) points to the top (the tip, or the end) of the stack.

 

Stack vs Heap

A program in your RAM is loaded at a certain address when using protected mode in modern Operating Systems.
Let’s ignore the paging and memory virtualization and imagine we work with the RAM in a simple and raw manner. Since it really doesn’t matter for the purpose of this tutorial, we will consider that the application is loaded and mapped in the memory starting from the very first address – address 0.

The application has code, predefined data (strings, images, etc.), undefined data (area of memory that is reserved but not occupied yet with information), etc.
Whenever you decide to request a big chunk of memory at runtime (like in the malloc() function in C), the operating system requests this memory from the heap.


The heap is a dynamically allocated memory. You don’t have control over where within the memory it will be allocated – the operating system takes care of that.
However, the heap grows upwards. Let’s say we have a block of heap memory starting at address 1000 with a size of 200 bytes. This block of memory will occupy the addresses between 1000 and 1200 (decimal).

The stack, however, grows in the opposite direction. Therefore the stack base pointer (RBP) is set at an address higher than the tip of the heap memory and grows downwards.
For example, if the allocated memory ends at address 1200, the stack Base Pointer could be pointing at 1300 and it will grow down to address 1200 (meaning that the stack is 100 bytes big).  These are just imaginary numbers and examples to illustrate the way it works.

The important thing you need to remember is that the stack “grows” from higher addresses to lower addresses as items are pushed onto it.

How the CPU stack works

Function Prologue

Whenever you enter a function, pass arguments to this function, declare a local variable, or leave a function, the stack is used by the CPU in one way or another.

In order to be able to explain the code, we need… code. So here is a small program written in C that sums 3 integers.
On the left is the C code, and on the right is its assembly code after compilation.

 

Let’s say you are in the main() function, about to call the function called sum() and you pass 3 arguments to it.

int a = sum(10, 20, 30)

In low-level assembly instructions it will look like this:

mov    edx, 30
mov    esi, 20
mov    edi, 10

Finally, main() will issue the subroutine call instruction:

call sum

When the call instruction is executed, the value of the RIP (Instruction Pointer) register is pushed onto the stack which essentially saves the address before entering the sum() function, to which the program should return execution after leaving sum().
The state of the stack is currently looking like this:

 

st1

After the “call” instruction, the RIP is redirected to the address of the function sum().
As you can notice the 3 arguments are moved to 3 32-bit general purpose registers (EDX, ESI, EDI).

You are now in the sum() function. The first thing that happens on CPU level is creating a stack frame. This will be a portion of the stack that will be available only for the current function. Whenever the function returns this stack frame will be destructed.
In sum() the very first instructions that will actually prepare the stack frame are:

push    rbp
mov     rbp, rsp

What this does is to “save” the current position of the Base Pointer (the bottom of the “current” stack frame) and replace it with the Stack Pointer (the tip/top of the stack). So the new Base Pointer is the current top of the stack.
This operation of saving the old base pointer and moving it to the tip of the stack is the so-called Function Prologue.

The Stack Frame

The stack frame is the gap between the RBP and RSP.
In our example, the only thing that is in the stack is the old RBP value, so the stack frame looks like this:

st2

However, we have 3 arguments inside registers that are used as arguments to the function sum(). To use them, the compiler generates code that will save them into the stack right after we enter the function sum().

push    rbp
mov     rbp, rsp
mov     DWORD PTR [rbp-4], edi
mov     DWORD PTR [rbp-8], esi

mov     DWORD PTR [rbp-12], edx

 

The old code is colored in yellow.

The lines “mov dowrd ptr[rbp-4], edi” etc. instruct the CPU to copy the values that are in the registers (EDI, ESI, EDX) to the specified memory address locations.
Note the address arithmetic. Every address is an offset relative to the RBP. Since each integer is 4 bytes in size, for each value the CPU uses an address that is 4 bytes away from the previous one. Note it is subtracting from the RBP. If you remember the stack grows downwards, from higher to lower memory addresses. This should speak to you.

This code utilizes the Base Pointer to save the values of the 3 arguments passed to the function via registers, on the stack. 
At this point the stack should look like this:

st3

Function Epilogue

After doing its job, the function sum() prepares to return to its caller (the main() function).

This is called the Function Epilogue.
The code that does that is the following:

add eax, edx
pop rbp

ret

The code “add eax, edx” adds the value of EDX to the value in EAX. EAX is the default register that holds data that a function will return. It is not a part of the function epilogue but I left it here to explain how functions return values. This explains why functions can return only a single value. If you like to return more than one value, you need to use a structure or pointers passed to the function as arguments. Higher-level languages like Python can disguise that, but it is what happens under the hood.

After that, it pops the “old” RBP from the stack, replacing the current value of RBP, which restores the previous “bottom” of the stack frame (that belongs to main()).
The “ret” instruction transfers control to the return address located on the stack (the RIP (instruction pointer) pushed to the stack before calling the sum() function).

Why the Stack is Faster than the Heap

– Stack is allocated when the thread starts – this is really the first and foremost thing. You do not have to ask your OS for memory; instead, you get a preallocated chunk. For ARM64, x86, and x64 machines, the default stack size is 1 MB. 

– Stack variables do not require allocating on the heap, which is managed by a memory allocator, which is a slow process. Stack variables do not need to be “freed”, which is also a slow process.
These are the major reasons why people say the “stack” is faster than the “heap”, even though they all come from the same memory place physically.
Also, stack variables are put one next to another. This plays well with the CPU cache.

– Deallocating is a simple matter of moving the Base Pointer and Stack Pointer – given that we are really just moving up & down inside our own 1MB of space, all we need to do in order to allocate and wipe out data is to move a pointer that tells us where’s our stack bottom.

– Addresses can be precalculated at compile time, while the heap is allocated dynamically at run time and retrieves its address after the allocation.

– When the CPU requests data from a certain address in the memory it retrieves a whole cache line, which is 64 bytes in x64 systems (128 in Apple’s M2), instead of simply the value it will work with. Once the data is retrieved from the RAM it is stored in the CPU cache L1 which is an extremely fast memory. The only memory faster than the L1 Cache is the registers.
If the stack variables are stored one after another, requesting the value of one variable leads to retrieving everything after it up to 64 bytes. This leads to caching the following stack variables in the L1 cache making them available for fast access.