8088 & 386SX – the stupidest CPU architectures ever

Yuriy Georgiev, 10.Aug.2022

Recently I was reading some ancient literature, and I was hit by a horrible statement.

Let me take a few steps back and briefly explain the core elements of the 8088 CPU so you can better understand my frustration.

Some of the main units of the 8088 CPU are as follows.

The Execution Unit (EU) executes the code and performs actions on all the operands associated with it.

The Bus Interface Unit (BIU) provides an interface between the Execution Unit and the BUS on the mainboard. It also has a Prefetch Queue, which holds the instructions to be executed next by the CPU. So whenever the EU finishes an instruction, it immediately continues with the next one in the Queue.

The Control Unit (CU) coordinates that whole process, and some more.

The usual (and preferable) way those units work is roughly as follows.

While the EU does its computing magic, the Control Unit ensures the Queue is fully loaded so it can feed the EU and not allow it to be idle. If there is free space in the Queue (which holds just 4 bytes on the 8088), the CU requests the BUS to fetch the next instruction bytes from the RAM. Loading from RAM is SLOW. Very slow. So we really want the next instructions to be sitting in the Prefetch Queue already by the time the Execution Unit is done with the current one, to avoid “CPU Starvation”.

And here comes the “fun part”, which caused my frustration.

The Execution Unit on the 8088 is 16-bit, meaning it can load WORD-size (2-byte) data into one register and execute 16-bit instructions fed from the Prefetch Queue. That’s good for a small chip that operates at a 5-10 MHz frequency.

The BUS, however, is 8-bit. And that kills the party. The BUS access by the 8088 takes 4 clock cycles, or 0.838 microseconds in the 4.77 MHz PC, and transfers 1 byte.
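If you want to check that figure yourself, here is a minimal Python sketch of the arithmetic. The 4 cycles per access and the 1 byte per access come from the text above. The exact clock value is my assumption: the PC's "4.77 MHz" is a rounding of a 14.31818 MHz crystal divided by 3. The function names are made up for illustration, and the throughput is a best case that ignores DMA.

```python
CLOCK_HZ = 14_318_180 / 3   # IBM PC clock, usually rounded to 4.77 MHz (assumed exact value)
CYCLES_PER_BUS_ACCESS = 4   # one 8088 bus access, per the text
BYTES_PER_BUS_ACCESS = 1    # 8-bit bus: one byte per access


def bus_access_time_us(clock_hz=CLOCK_HZ, cycles=CYCLES_PER_BUS_ACCESS):
    """Duration of one bus access in microseconds."""
    return cycles / clock_hz * 1_000_000


def peak_bytes_per_second(clock_hz=CLOCK_HZ, cycles=CYCLES_PER_BUS_ACCESS,
                          bytes_per_access=BYTES_PER_BUS_ACCESS):
    """Best-case bus throughput, assuming back-to-back accesses and no DMA."""
    return clock_hz / cycles * bytes_per_access


if __name__ == "__main__":
    print(f"one bus access: {bus_access_time_us():.3f} us")
    print(f"peak throughput: {peak_bytes_per_second():,.0f} bytes/s")
```

So even in the best case the BUS moves a bit under 1.2 million bytes per second, and that budget is shared between instruction bytes and data.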

But the fun keeps going. In those days, peripheral devices commonly used Direct Memory Access (DMA) to communicate with the RAM via the BUS as well. While a DMA transfer is in progress, the BUS stops fetching data for the CPU and lets the device access the RAM for a few microseconds at a time.

So what we end up with is a fast Execution Unit that goes into starvation for several microseconds once in a while (remember that’s on a 5-10 MHz chip, so it’s not a negligible amount of time). And that’s because of the low amount of data transferred by the BUS (cycle eater 1), which is interrupted once in a while by the peripheral devices (cycle eater 2).

Even with the prefetch queue, EU starvation is inevitable. Test cases show that clearly.
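A rough way to see why the queue cannot save us is the simplified Python model below. It is a sketch, not a cycle-accurate simulation: it assumes a long run of identical, register-only instructions, 4 BUS cycles per instruction byte, and no DMA. The example instructions and their sizes and cycle counts (a 2-byte `shr ax, 1` at about 2 cycles, a 2-byte `mul bx` at roughly 118 cycles or more) are figures I am assuming for illustration; they do not come from the text above.

```python
BUS_CYCLES_PER_BYTE = 4  # 8088: one byte per 4-cycle bus access (from the text)


def steady_state_cycles(instr_bytes, exec_cycles,
                        bus_cycles_per_byte=BUS_CYCLES_PER_BYTE):
    """Approximate cycles per instruction in a long run of identical,
    register-only instructions: the slower of fetching and executing wins."""
    fetch_cycles = instr_bytes * bus_cycles_per_byte
    return max(fetch_cycles, exec_cycles)


def starved_fraction(instr_bytes, exec_cycles):
    """Share of the time the EU sits idle waiting for instruction bytes."""
    total = steady_state_cycles(instr_bytes, exec_cycles)
    return (total - exec_cycles) / total


if __name__ == "__main__":
    # (name, size in bytes, execution cycles): illustrative assumptions
    for name, size, cycles in [("shr ax, 1", 2, 2), ("mul bx", 2, 118)]:
        print(f"{name}: {steady_state_cycles(size, cycles)} cycles each, "
              f"EU idle {starved_fraction(size, cycles):.0%} of the time")
```

Under these assumptions a stream of short, fast instructions is limited by the BUS, not by the Execution Unit, while a slow instruction gives the queue plenty of time to refill.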

A small example of assembly code:
mov AX, word ptr [SomeVar]
spends twice as long on the BUS loading its data as:
mov AL, byte ptr [SomeVar]

Why? Let me explain.

mov means “move” or “copy”, AX is a 16-bit CPU register (think of it as a hardware variable), word ptr [SomeVar] means “the 2 bytes at address SomeVar”.

The second example is the same but works with the lower 8 bits of the AX register (AL for A-Low) and accesses a byte-size variable at the SomeVar address.

So, since we have an 8-bit BUS to transfer data with, we need to access the RAM twice to transfer the 2 bytes of WORD-size data, even though the Execution Unit can process the whole WORD at once.

Hold on; the show doesn’t stop here. Once transferred into the CPU, we need to do something with this data. Multiply it, add it, shift it, whatever. And at the end, we need to store it back in the memory because otherwise, we will lose the result. 

The CPU registers are volatile memory, just like the RAM. However, they are far more limited in count and size: you have just a few of them. So there is a limit to how much data the CPU can hold at a time.

Now, the BUS needs 8 more cycles to transfer those 2 bytes back to the RAM, that is, two transfers of 4 cycles each, one per byte.

And that’s in case it’s not interrupted by some peripheral device.
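To put numbers on the load and the store together, here is a small Python sketch. It assumes aligned operands, 4 cycles per BUS access as stated earlier, and no wait states or DMA interruptions. The function names are hypothetical, and the 16-bit BUS case is included only for comparison.

```python
CYCLES_PER_BUS_ACCESS = 4  # per the text; assumes no wait states or DMA stalls


def bus_accesses(operand_bits, bus_bits):
    """Number of bus accesses to move one aligned operand (ceiling division)."""
    return -(-operand_bits // bus_bits)


def transfer_cycles(operand_bits, bus_bits):
    """Bus cycles to move the operand in one direction."""
    return bus_accesses(operand_bits, bus_bits) * CYCLES_PER_BUS_ACCESS


def round_trip_cycles(operand_bits, bus_bits):
    """Load the operand from RAM and later store the result back."""
    return 2 * transfer_cycles(operand_bits, bus_bits)


if __name__ == "__main__":
    print("WORD over 8-bit BUS, one way: ", transfer_cycles(16, 8))
    print("BYTE over 8-bit BUS, one way: ", transfer_cycles(8, 8))
    print("WORD over 8-bit BUS, there and back: ", round_trip_cycles(16, 8))
    print("WORD over 16-bit BUS, there and back:", round_trip_cycles(16, 16))
```

Under these assumptions a WORD costs 16 BUS cycles to load and store on the 8-bit BUS, against 8 on a 16-bit one, before the Execution Unit has done any useful work on it.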

The solution? Code optimization.

You cannot avoid the transfer back and forth to the RAM, but you can optimize the computation once the data is loaded into the CPU registers. Also, you can arrange your code so that you do as many operations as possible on a piece of data while it is in the CPU, which avoids later reads and writes to the RAM.
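As a back-of-the-envelope illustration of that second point, the Python sketch below compares the data traffic of two hypothetical coding styles for a WORD-size value: going through RAM for every operation, or loading once, doing all the work inside the CPU, and storing once. It counts only data transfers at 8 cycles per WORD, as derived earlier, and ignores instruction fetches and execution time, so treat the numbers as a lower bound on the savings, not a benchmark.

```python
WORD_TRANSFER_CYCLES = 8  # 2 bytes x 4 cycles on the 8-bit bus (from the text)


def data_cycles_memory_each_time(operations):
    """Every operation loads the word from RAM and stores it back."""
    return operations * 2 * WORD_TRANSFER_CYCLES


def data_cycles_kept_in_cpu(operations):
    """Load once, do all operations inside the CPU, store once."""
    if operations == 0:
        return 0
    return 2 * WORD_TRANSFER_CYCLES


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(f"{n:2d} operations: "
              f"{data_cycles_memory_each_time(n):3d} cycles via RAM each time, "
              f"{data_cycles_kept_in_cpu(n):3d} cycles when kept in the CPU")
```

The first style grows linearly with the number of operations, while the second stays flat at one load and one store.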

The 8088 is an ancient processor, released back in 1979. Let’s say it was a design choice forced by business, market, manufacturing, and other constraints. 

Nope, not the case at all. A year earlier, in 1978, Intel had released the 8086, which has a 16-bit BUS. It’s just that at the time a lot of manufacturers still produced mainboards with an 8-bit BUS. The CPU had to be compatible so it could sell.

And here is the kicker. Intel did that again with the 386 architecture. In 1988 they released the 386SX, which has a 32-bit CPU with a 16-bit BUS, three years after its “non-demo version”, the original 386 from 1985 (also known as 80386 or i386), which has a 32-bit CPU with a 32-bit BUS. Surprise, surprise.

This story ends here. I’m still amazed by how much influence marketing had over the design choices made back then, and I wonder if it has something to do with the ones made today. Sure, they are business-justified, but it still feels wrong. We had to wait a very long time to see CPUs with transistor sizes smaller than 14 nanometers, after all.

I really hope that Pat Gelsinger (the CEO of Intel at the time of writing this article) will make the right decisions. He did that for VMware Inc., which I can say for a fact, as I currently work there.

History is full of mind-blowing decisions, which are the foundation of our present. Let’s do better when we are in a position to take directions that will last for years.

That was my horror story. I hope you’ve enjoyed it.