Virtual Memory Management in 32-bit Operating Systems (The 4GB Addressing Limit)
Yuriy Georgiev, 09.Aug.2022
In this article I will briefly explain how the OS organises virtual memory on 32-bit systems and where the 4 GB addressing limit comes from.
It works with two levels of tables (imagine a 2D array): a page directory with 1024 entries, where each entry points to a page table with 1024 entries, and each page table entry points to a page of 4 KB that holds data and code.
So 1024 * 1024 * 4096 = 4 294 967 296 bytes, which is exactly 4 GB. Now you understand why 64-bit systems have an addressing limit that is orders of magnitude bigger.
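The multiplication above is easy to check yourself. Here is a minimal sketch in Python (the constant names are my own, chosen just for illustration) that also shows the same number falls out of the pointer width:

```python
# Two-level paging parameters described above (32-bit system, 4 KB pages).
PAGE_DIRECTORY_ENTRIES = 1024   # entries in the page directory
PAGE_TABLE_ENTRIES = 1024       # entries in each page table
PAGE_SIZE = 4096                # bytes per page

addressable_bytes = PAGE_DIRECTORY_ENTRIES * PAGE_TABLE_ENTRIES * PAGE_SIZE

# A 32-bit address can take 2**32 distinct values, which is the same number.
assert addressable_bytes == 2**32

print(addressable_bytes)            # 4294967296
print(addressable_bytes // 2**30)   # 4
print(2**64 // 2**32)               # 4294967296
```

The last line compares a full 64-bit address space with the 32-bit one: it is 2**32 times larger.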
In the source code, as well as in the compiled binary, the address of every variable or function is virtual. It gets converted to a physical address by the mechanism illustrated in the image below.
So basically you have a 2D array of 4096-byte buffers that hold the data and code of your program.
The pages are also aligned: if the contents of a page take less than 4096 bytes, the rest is padding (typically zero bytes), so that the next page starts at the 4097th byte (offset 4096). So the pages are always 4096 bytes big, whether or not the process uses the whole page.
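To make the alignment rule concrete, here is a small sketch of the arithmetic. The helper names are hypothetical, and the only assumption is the 4096-byte page size used throughout this article:

```python
PAGE_SIZE = 4096  # bytes per page

def pages_needed(size_in_bytes):
    """Number of whole pages needed to hold size_in_bytes (rounded up)."""
    return (size_in_bytes + PAGE_SIZE - 1) // PAGE_SIZE

def page_start(page_number):
    """Byte offset at which a page begins (page 0 starts at offset 0)."""
    return page_number * PAGE_SIZE

def unused_bytes(size_in_bytes):
    """Padding left over in the last page."""
    return pages_needed(size_in_bytes) * PAGE_SIZE - size_in_bytes

print(pages_needed(100))    # 1
print(unused_bytes(100))    # 3996
print(pages_needed(4097))   # 2
print(page_start(1))        # 4096
```

Even 100 bytes of data occupy a whole page, and a single byte over 4096 already costs a second page.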
The memory pages are not separated by anything physical; they are ordinary memory in your RAM.
The separation is logical and is performed by the CPU and the OS at a privileged level (meaning that a user-level process cannot access another process’s memory; you need higher privileges).
This “container” of memory is shared between the threads of a process in the case of multi-threading.
Every process has its own virtual memory space to keep its data and code. This isolation is the idea behind the famous Protected Mode.
Every 32-bit address is split into three parts: 10 bits for the Page Directory index, 10 bits for the Page Table index, and 12 bits for the offset of the data within the page. The physical address of the page directory itself is kept in the CPU register CR3, also known as the “Page Directory Base Register” (PDBR).
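The 10/10/12 split is just bit shifting and masking. The sketch below shows it, together with a toy page walk. Note that `translate` and the nested dictionaries are my own simplified model for illustration: real hardware reads page directory and page table entries from physical memory (starting from the address in CR3), and those entries also carry permission and status bits that are ignored here.

```python
PAGE_SIZE = 4096  # 2**12 bytes, so the offset needs 12 bits

def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    if not 0 <= vaddr < 2**32:
        raise ValueError("not a 32-bit address")
    directory_index = (vaddr >> 22) & 0x3FF  # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF      # middle 10 bits
    offset = vaddr & 0xFFF                   # low 12 bits
    return directory_index, table_index, offset

def translate(page_directory, vaddr):
    """Toy page walk: nested dicts stand in for the tables the hardware reads.

    page_directory maps a directory index to a page table, and each page table
    maps a table index to the physical base address of a 4 KB page.
    Returns None if the address is not mapped (a real CPU raises a page fault).
    """
    directory_index, table_index, offset = split_virtual_address(vaddr)
    page_table = page_directory.get(directory_index)
    if page_table is None:
        return None
    page_base = page_table.get(table_index)
    if page_base is None:
        return None
    return page_base + offset

# Hypothetical mapping: directory entry 1, table entry 3 -> physical page at 0x0009A000.
toy_directory = {1: {3: 0x0009A000}}

print(split_virtual_address(0x00403025))           # (1, 3, 37)
print(hex(translate(toy_directory, 0x00403025)))   # 0x9a025
print(translate(toy_directory, 0x00000000))        # None
```

Notice that the offset passes through the translation unchanged: only the upper 20 bits of the address are looked up in the tables.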
The best-case scenario during code optimization is to get a page (and its virtual-to-physical address conversion) cached in the CPU. The conversions are cached in the TLB (translation lookaside buffer).
This is one of the reasons stack variables can be faster than data reached through pointers: they do not require a second virtual-to-physical address conversion (CPU cycle eater 1) for data stored in another page (CPU cycle eater 2).
If the data the pointer points to is in another page and is not cached yet, the CPU sends a fetch request over the bus, which in turn requests the data from RAM. The data is then transferred to the CPU cache, and only then can the CPU do its job with it. All of that is controlled by the CPU’s Control Unit, and it is sometimes a huge performance penalty, when the Execution Unit finishes its current job before the requested data arrives.
It’s good to note that the bus fetches a whole 64 bytes (called a “cache line”; 64 bytes is the usual size on current x86 CPUs) no matter what, and the CPU then uses just the data it needs out of it. Therefore declaring your variables one after another raises the chance that the “next” variable you are going to work with is already cached in the CPU when you finish working with the current one.
This is called “cache optimization” or “locality”.
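Whether two variables arrive in the same fetch depends only on their addresses. A rough sketch of that check follows, assuming 64-byte cache lines; the function names and the example addresses are made up for illustration:

```python
CACHE_LINE_SIZE = 64  # assumption: typical cache line size on x86 CPUs

def cache_line_index(address):
    """Index of the cache-line-sized block that contains this address."""
    return address // CACHE_LINE_SIZE

def same_cache_line(addr_a, addr_b):
    """True if both addresses are brought in by the same cache line fetch."""
    return cache_line_index(addr_a) == cache_line_index(addr_b)

def lines_touched(start_address, size):
    """How many cache lines an object of `size` bytes at start_address spans."""
    first = cache_line_index(start_address)
    last = cache_line_index(start_address + size - 1)
    return last - first + 1

print(same_cache_line(0x1000, 0x1004))  # True
print(same_cache_line(0x1000, 0x1040))  # False
print(lines_touched(0x1000, 64))        # 1
print(lines_touched(0x103C, 8))         # 2
```

The last call shows an 8-byte value that straddles two cache lines, so reading it needs two fetches instead of one.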
If there is one thing to remember from this article, it’s this:
Be smart when declaring your variables. Declare them by context, in the order they are used in your code, the biggest ones first, so you raise the chance of fetching more than one of them in a single cache line fetch.
There is also the memory alignment of your variables. You can search for that on the internet or wait for me to write a tutorial on it; it’s out of the scope of this article.
A few variables used in the same subroutine should be declared one after another for cache-friendly code.
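Here is a hedged illustration of that advice using Python’s `ctypes` module, which lays out structure fields the way a C compiler would. I use a struct rather than local variables because C keeps struct fields in declaration order, while a compiler is free to place locals wherever it likes. The struct names, the fields and the 64-byte line size are assumptions for the example, and the struct is assumed to start on a cache line boundary.

```python
import ctypes

CACHE_LINE_SIZE = 64  # assumption: typical cache line size on x86 CPUs

class Scattered(ctypes.Structure):
    # position and velocity are used together, but a big buffer sits between them.
    _fields_ = [
        ("position", ctypes.c_double),
        ("log_buffer", ctypes.c_char * 256),
        ("velocity", ctypes.c_double),
    ]

class Grouped(ctypes.Structure):
    # Fields used by the same routine are declared one after another.
    _fields_ = [
        ("position", ctypes.c_double),
        ("velocity", ctypes.c_double),
        ("log_buffer", ctypes.c_char * 256),
    ]

def cache_line_of(struct_type, field_name):
    """Cache line (relative to the start of the struct) where a field begins."""
    return getattr(struct_type, field_name).offset // CACHE_LINE_SIZE

print(cache_line_of(Scattered, "position"), cache_line_of(Scattered, "velocity"))  # 0 4
print(cache_line_of(Grouped, "position"), cache_line_of(Grouped, "velocity"))      # 0 0
```

In the grouped layout a single cache line fetch brings in both `position` and `velocity`; in the scattered one they are four lines apart.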
I think that wraps up the surface of the VMM. I hope I clarified a few things for you when it comes to virtual memory management, at least in theory. As you can see, it’s no black magic; it’s just a matter of several details.
Good luck and stay tuned.