PART 1: On computation - algorithms and automata

PART 2: The art of emulating

PART 3: MOS 6502

PART 4: Once upon a GPU


PART 3: MOS Technology 6502

My undergraduate degree curriculum included a lecture called 'Computer Architecture'. I learned how CPUs and computer systems operate by studying two 'cases' : MIPS and i386. Since then, I went through a wide spectrum of designs and instruction sets, some mainstream like Intel, ARM, NVIDIA, IBM Power, niche ones like CELL BE (PS3) or RISC-V or some specialized such as Tilera or Epiphany. They are all useful. I have also done some work on much older 8-bit CPUs such as Intel 8080, Z80 or MOS 6502. If I were to design a university course, I would start with one of those, because they are small enough to map them at the transistor level, yet functional - they have just the right amount of complexity to express a program and some of the solutions used in those early design help understand why the CPUs evolved in some particular direction.

This may be only a subjective statement, but in my opinion, the 6502 CPU is the most elegant design I have seen. It is so compact and neat, that I will try to explain how it works within this short article. Let's start by giving it some context - technically the full name of the CPU is MOS Technology 6502. It's an 8-bit CPU released in 1975 and powered such machines as Commodore 64, Apple I, Apple II, Atari 2600, Nintendo Entertainment System (NES). A descendant of the original design is still in production! - Link. Just look at this to appreciate its significance:

Steve Wozniak, Steve Jobs and 6502 die-shot on the wall
Commodore 64
Did you know that T-800 was powered by 6502?!

So, we have a CPU die which has some elements and then there is a screenshot of the movie showing some 6502 assembly code - let's see what the Terminator was trying to run and how it was achieved by the CPU.

I'll do something risky now, you may want to skip this part if you already know how CPUs work. Let's assume you don't, but know Neural Networks. Assuming that an average reader is well aware of various machine learning models, it may work. Let's draw an analogy between an CPU and a recurrent model such as LSTM (although technically a CPU would be closer to a Differentiable Neural Computer / NTM). Or it could be a Transformer - you will see the analogy.

CPU vs LSTM cell

The exact architecture is not relevant here, but the key concepts are:
- repeated operation and transformation of data
- part of the input data is aquired from within (internal state)
- part of the input data is taken from the external environment (we may be able to write to it, but cannot assume exclusive access and full control)
- output is also internal/external, input and output addresses do not have to be the same elementwise

The entire CPU has roughly 3500 transistors and can be fully simulated nowadays in software - Visual 6502

The following visualization has been made using Visual 6502 and a tool call xray6502

Therefore we need to figure out the internal/external processing done by a CPU. First, let's assume that we are going to describe the entire CPU's state as a structure like this one:

typedef struct { ... } cpu;

Now we need to answer the following questions:

1. What constitutes the state?
2. How is the state changed?
3. How are the rules governing the state-change represented?
4. What is a program?
5. Can a CPU change the state of an external device?

State = Registers

The state of the 6502 (and possibly all CPUs) can be represented by the contents of its internal registers. A register within a CPU is the smallest building block which has store a state (as opposed to some circuits which are stateless, those are purely combinational, for details, check the difference between sequential, for example latches and combinatonal logic) number of the registers and their sizes vary between CPUs (x64 registers). Some registers have a very specific function (store condition flags, accumulator, floating point, vector ops, etc). Hopefully, by the end of this article you will understand why CPUs don't contain millions of registers.

The 6502 has the following set of registers:

A (Accumulator) (8 bits) - This is the main register that is used in all arithmetic and logical operations
X (X Index) (8 bits) - a special register used to store memory offsets
Y (Y Index) (8 bits) - just like X, it is used mainly for indexing
P (Processor Status) (8 bits) - it 'remembers' some information related to previously executed operation (example: was the result 0?)
SP (Stack Pointer) (8 bits) - Used to address the 256-byte stack.
PC (Program Counter) (16 bits) - It points to the address of the instruction that will be executed next

That's it. What does it mean?

In order to represent the state of the CPU we just need to store 8+8+8+8+8+16 = 56 bits. You will see how. For now, let's complete the structure:

typedef uint8_t u8; typedef uint16_t u16; typedef struct { u8 A; u8 X; u8 Y; u8 P; u8 SP; u16 PC; } state_6502;

Therefore, as per technical description (wiki):

The 6502 is a little-endian 8-bit processor with a 16-bit address bus

What else do we need?

Let's introduce an example, which will motivate what we need next. Any conventional CPU operates in discrete fashion by manipulating it's state. In other words, the following can be expressed as:

while (running) { state[t+1] := function (state[t]) }

To be more concrete, let's try to make it real code:

// modify registers 'in-place' void cpu_step(state_6502 *s) { // current state 's' determines what we are going to do operation = get_operation(s); operation(s); // execute } while (running) { cpu_step(&state); }

Here, the function is the CPU logic we need to accomplish some task. We need to learn more about the way a CPU like 6502 encodes the function. Given a state s at some time t we need to obtain 3 pieces of information: 1. Instruction to be executed, 2. Addressing Mode, 3. External operands. 6502 does that in a 3-step process. The first step is called fetch. It uses the value stored in the Program Counter as an address to a byte B which is then read and PC is incremented (by default). Based on the value of that byte, we can determine the sub-type of the instruction (addressing mode) and if any extra bytes need to be fetched (depending on the addressing mode, there can be 0,1 or 2) - this step is called decode. The last step is optional and retrieves additional data incrementing PC. After this is completed we can execute an instruction which produces new values of registers and results in an optional write to the external memory. A loop like this one (fetch, decode, execute) is called a cycle.

{ u8 B = mem[PC++]; mode, instruction, n_operands = decode(B); /* fetch extra n_operands */ execute(); /* we are done with this cycle */ }

Addressing modes

There are 56 instructions and 13 addressing modes. This makes 6502 quite powerful. Some combinations are not valid, however. Some of the addressing modes make use of X and Y registers. Those are used primarily to implement loops or store memory addressed in a compound form (base + offset). Here's the C code implementing the modes, b is the number of additional bytes to be read

Good resources

void imp() {m=0; b=0; } // implied, 1 void acc() {m=1; b=0; } // accumulator, 1 void imm() {m=2; b=1; d=(u16)f8(); } // immediate, 2 void zp() {m=3; b=1; d=(u16)f8(); } // zero page, 2 void zpx() {m=4; b=1; u8 r=f8(); d=(r+X) & 0xff;} // zero page, x, 3 void zpy() {m=5; b=1; u8 r=f8(); d=(r+Y) & 0xff; } // zero page, y, 3 void rel() {m=6; b=1; u8 r=f8(); if (r<0x80) d=PC+r; else d=PC+r-0x100;} // relative, 2 void abso() {m=7; b=2; d=f16(); } // absolute, 3 void absx() {m=8; b=2; d=f16(); cyc+=((d>>8)!=((d+X)>>8)) ? pg[op] : 0; d+=X; } // absolute, x, 3 void absy() {m=9; b=2; d=f16(); cyc+=(d>>8)!=((d+Y)>>8) ? pg[op] : 0; d+=Y; } // absolute, y, 3 void ind() {m=10; b=2; d=r16(f16()); } // indirect, 3 void indx() {m=11; b=1; u8 r=f8(); d=r16((u8)(r + X)); } // indirect x void indy() {m=12; b=1; u8 r=f8(); d=r16((u8)(r)); cyc+=(d>>8)!=((d+Y)>>8) ? pg[op] : 0; d+=Y;} // indirect y

There is a 1:1 mapping between an instruction opcode and the addressing mode, so one possible implementation of decoding is through a look-up table as we do in this version. The 'Decode ROM' in the 'real' version is most likely implementing it in this fashion

Here's the complete 'step'

void cpu_step() { op = f8(); /* read opcode */ addrtable[op](); /* determine m, b */ /* read b bytes as operands */ optable[op](); /* execute */ }

Interacting with external memory

In order to accomplish any non-trivial task, we need to interact with some kind of external storage which either contains some data we need (let's say the program) or we need to write to it (say, a framebuffer in this case which is used to draw something). There is one fundamental limitation when it comes to communicating with the outside world which makes a CPU more like a single-headed Turing Machine rather than an LSTM - We can address only one memory location at any given point in time - think of it as a form of hard attention. This applied to any memory read or write, including operation fetch. The external memory in theory can be unbounded, but we are also limited by the range of possible addresses generated by a given CPU. 6502 has a 16-bit address bus, which means that we can address up to 64kB of memory.

Address / Data IO

u8 fetch(state_6502 *s, u8 *mem) { u8 byte = mem[s->PC]; s->PC++; return byte; }


A program is a chunk or chunks of memory which the CPU will read in a sequential order, unless a JUMP or CALL/RETURN instruction is executed. The program counter is used to keep track of the location in a program. Typically a program is loaded into a RAM (read/write) memory accessible by a CPU, but there also can be built-in ROM into a CPU which contains some very basic programs used during boot.

Fetch/Decode Cycle

As you can see, there is a fundamental bottleneck which existed and still exists in modern CPU. The so-called 'von Neumann architecture' is one where a CPU communicates through a narrow (compared to the internal CPU's capacity) bus. Now it may become clearer why it is better to compare a CPU to an architecture where external memory is explicitly defined. For example 'Differentiable Neural Computer' uses attention to 'focus' on a memory location for either write or read. There are multiple heads which increase the width of the bus, but there is always a tradeoff between the cost of having a wide bus and the benefits.

Source: Nature

Let's look at an example program (assembly source)

start: lda #$e1 sta $0 lda #$01 sta $1 ldy #$20 write: ldx #$00 eor ($0, x) sta ($0),y inc $0 bne write inc $1 ldx $1 cpx #$06 bne write rts

After assembly:

[0xa9,0xe1,0x85,0x00,0xa9,0x01, 0x85,0x01,0xa0,0x20,0xa2,0x00, 0x41,0x00,0x91,0x00,0xe6,0x00, 0xd0,0xf6,0xe6,0x01,0xa6,0x01, 0xe0,0x06,0xd0,0xee,0x60,0x00]

Here's a link to the assembler I wrote: You can also use any other available assembler to experiment with the CPU. One more thing which is important - the memory map: 0 - 0xff is the ZERO PAGE, 0x100-0x200 is STACK, 0x200 - 0x600 is used as a FRAMEBUFFER and by default the PROGRAM is at 0x600. You can use the python frontend to use the CPU - '' or call it from C directly.

Program Synthesis

Here I will explain one of the motivations of writing an emulator of 6502 - I wanted to experiment with automatic program generation (program synthesis). That is, given some target behavior, come up with a program which accomplishes that task. Let me show you an example experiment (included in the github repo). The goal is to fill the screen (memory region 0x200-0x400), so it is relatively simple. The learning mechanism is also quite primitive, yet works. It is based on the Genetic Algorithm which treats programs as strings (genomes), which can undergo mutation (change some value) or crossover (2 strings 'mate' and produce 2 new offsprings). The algorithm itself is self-contained in where I show the simplest case of finding a secret string and determining an image based on the fitness evaluation.

Applied to a string

Applied to an image

# a simple mutation strategy, just select a random element and replace with a random symbol def mutate(chromosome, chars): i = random.randint(0, len(chromosome) - 1) # i # X = xxxxxxxxxxxxxxxxxxx ( original) # ---------|------------- # M - xxxxxMxxxxxxxxxxxxx ( mutated ) mutated=list(chromosome); mutated[i]=random.choice(chars) return "".join(mutated) def crossover(x,y): # x and y are parents i = random.randint(1, len(x) - 2) # crossover point 0 (i) j = random.randint(1, len(y) - 2) # crossover point 1 (j) if i > j: i, j = j, i # i j # X = xxxxx xxxxxxxxxxxxx xxxxxxx ( parent 0) # Y = yyyyy yyyyyyyyyyyyy yyyyyyy ( parent 1) # ------------------------------- # p = xxxxx yyyyyyyyyyyyy xxxxxxx ( child 0 ) # q = yyyyy xxxxxxxxxxxxx yyyyyyy ( child 1 ) # p,q = x[:i] + y[i:j] + x[j:], y[:i] + x[i:j] + y[j:] return p, q # 2 children

What I wanted to accomplish with the 6502 is similar to the image search case. There is one difficulty - The path leading to the final image may be windy and involves 'learning' concepts like loops, memory copy, counters etc. We generate the final image indirectly through the code. The following images show the intermediate steps of learning such a program. The leftmost image is the first generation and the last one is after approximately 290 generations (about 3 minutes) - not bad!. You can also see the final binary which was generated. We can actually disassemble it and see what happened. I included this example as in the repo (

623c bd3c 4f52 ad26 2dad 94d6 379c 55dc f4b3 9e18 e8fd 6d36 729b 130c 51a2 17e0 395b 00f6 2b58 6b37 dce6 4e4e 603d 8bfc 9e52 c26d 2232 ff1e f4be e93a df9b bfaa 8387 5ee2 4411 8b56 6512 10b4 3799 f6a6 30d0 3bdc b598 102c 4773 78f2 09ce 5931 3f5c 012d 6c64 9917 dc94 0a6e 05d3 97ae 111d 9050 161b a5f8 ed0e 4c7e 7e7c fc44

I have 'played' with this approach some time and also disassembled learned programs. Here's a brief collection of learned behaviors I've seen - it's remarkable that even in case of simple tasks, there are multiple very different solutions found

Just a counter - the intensity of the upper dot is the counter, which then is used to iterate over the lower block

When trying to learn a more complex shape, sometimes the strategy is to fill a chunk, then subtract from it

A different approach is to fill a shape using additive, small increments

A simple loop to fill a block

Weird, but fast fills

Here, we can see that the RAM is used (left)

Recursive fill!

Fill through self-modification

Bonus: I included a CUDA implementation of 6502 which allows you to run thousands of programs in parallel, check the nv6502 directory

It's quite funny, but a single V100 core can emulate a 6502 CPU at about 2MHz which is faster than the original CPU

An array of 6502 CPUs can be emulated in parallel resulting in a large population size

Console output showing the result computed on the GPU

To conclude:

Here is the entire logic for the 6502 written in CUDA - roughly 200 lines - that is what I like about the 6502! It is very concise and elegant.

#define STACK_PG 0x0100 #define ZN(x) { Z=((x)==0); S=((x)>>7) & 0x1; } #define LDM { d=(m>2) ? r8(n,d) : d; } #define LD_A_OR_M() u8 w=(m==1)?A:r8(n,d) #define ST_A_OR_M() if (m!=1) w8(n,d,w); else A=w; __device__ u8 r8 (_6502 *n, u16 a) { return n->mem[a % MEM_SIZE]; } // byte read __device__ void w8 (_6502 *n, u16 a, u8 v) { n->mem[a % MEM_SIZE] = v; } // byte write __device__ u8 f8 (_6502 *n) { return r8(n, PC++); } // byte fetch //// 16-bit versions __device__ u16 r16 (_6502 *n, u16 a) { u16 base=a & 0xff00; return (r8(n,a) | (r8(n,base|((u8)(a+1))) << 8)); } // buggy __device__ u16 r16_ok(_6502 *n, u16 a) { return (r8(n,a) | (r8(n,a+1) << 8)); } __device__ u16 f16 (_6502 *n) { return (f8(n) | ((f8(n))<<8)); } // //// stack ops __device__ u8 pop8 (_6502 *n) { SP++; return r8(n, STACK_PG | SP); } __device__ u16 pop16 (_6502 *n) { return (pop8(n) | ((pop8(n))<<8)); } __device__ void push8 (_6502 *n, u8 v) { w8(n, STACK_PG | SP, v); SP--; } __device__ void push16(_6502 *n, u16 v) { push8(n,(v>>8)); push8(n,v); } __device__ void jr (_6502 *n, u8 cond) { if (cond) { PC=(u16)d; } } // //// decoding addressing mode __device__ void imp (_6502 *n) { m=0; b=0; } // implied, 1 __device__ void acc (_6502 *n) { m=1; b=0; } // accumulator, 1 __device__ void imm (_6502 *n) { m=2; b=1; d=(u16)f8(n); } // immediate, 2 __device__ void zp (_6502 *n) { m=3; b=1; d=(u16)f8(n); } // zero page, 2 __device__ void zpx (_6502 *n) { m=4; b=1; u8 r=f8(n); d=(r+X) & 0xff;} // zero page, x, 3 __device__ void zpy (_6502 *n) { m=5; b=1; u8 r=f8(n); d=(r+Y) & 0xff; } // zero page, y, 3 __device__ void rel (_6502 *n) { m=6; b=1; u8 r=f8(n); if (r<0x80) d=PC+r; else d=PC+r-0x100;} // relative, 2 __device__ void abso(_6502 *n) { m=7; b=2; d=f16(n); } // absolute, 3 __device__ void absx(_6502 *n) { m=8; b=2; d=f16(n); d+=X; } // absolute, x, 3 __device__ void absy(_6502 *n) { m=9; b=2; d=f16(n); d+=Y; } // absolute, y, 3 __device__ void ind (_6502 *n) { m=10; b=2; d=r16(n,f16(n)); } // indirect, 3 __device__ void indx(_6502 *n) { m=11; b=1; u8 r=f8(n); d=r16(n,(u8)(r + X)); } // indirect x __device__ void indy(_6502 *n) { m=12; b=1; u8 r=f8(n); d=r16(n,(u8)(r)); d+=Y;} // indirect y //instructions __device__ void _adc(_6502 *n) { u8 a = A; LDM; A=d+A+C; ZN(A); u16 t = (u16)d + (u16)a + (u16)C; C=(t > 0xff); V = (!((a^d) & 0x80)) && (((a^A) & 0x80)>0 ); } // Add Memory to Accumulator with Carry __device__ void _sbc(_6502 *n) { u8 a = A; LDM; A=A-d-(1-C); ZN(A); s16 t = (s16)a - (s16)d - (1-(s16)C); C=(t >= 0x0); V = (((a^d) & 0x80)>0) && (((a^A) & 0x80)>0); } // Subtract Memory from Accumulator with Borrow __device__ void _cp (_6502 *n, u8 _a, u8 _b) { u8 r=_a-_b; C=(_a>=_b); ZN(r); } __device__ void _ora(_6502 *n) { LDM; A|=d; ZN(A); } // "OR" Memory with Accumulator __device__ void _and(_6502 *n) { LDM; A&=d; ZN(A); } // "AND" Memory with Accumulator __device__ void _eor(_6502 *n) { LDM; A^=d; ZN(A); } // "XOR" Memory with Accumulator __device__ void _cmp(_6502 *n) { LDM; _cp(n,A,d); } // Compare Memory and Accumulator __device__ void _cpx(_6502 *n) { LDM; _cp(n,X,d); } // Compare Memory and Index X __device__ void _cpy(_6502 *n) { LDM; _cp(n,Y,d); } // Compare Memory and Index Y __device__ void _bcc(_6502 *n) { jr(n,!C); } // Branch on Carry Clear __device__ void _bcs(_6502 *n) { jr(n,C); } // Branch on Carry Set __device__ void _beq(_6502 *n) { jr(n,Z); } // Branch on Result Zero __device__ void _bit(_6502 *n) { LDM; S=(d>>7) & 1; V=(d>>6) & 1; Z=(d & A)==0; } // Test Bits in Memory with A __device__ void _bmi(_6502 *n) { jr(n, S); } // Branch on Result Minus __device__ void _bne(_6502 *n) { jr(n,!Z); } // Branch on Result not Zero __device__ void _bpl(_6502 *n) { jr(n,!S); } // Branch on Result Plus __device__ void _brk(_6502 *n) { B=1; } // Force Break __device__ void _bvc(_6502 *n) { jr(n,!V); } // Branch on Overflow Clear __device__ void _bvs(_6502 *n) { jr(n, V); } // Branch on Overflow Set __device__ void _clc(_6502 *n) { C=0; } // Clear Carry Flag __device__ void _cld(_6502 *n) { D=0; } // Clear Decimal Mode __device__ void _cli(_6502 *n) { I=0; } // Clear interrupt Disable Bit __device__ void _clv(_6502 *n) { V=0; } // Clear Overflow Flag __device__ void _dec(_6502 *n) { u16 d0 = d; LDM; d--; d &= 0xff; ZN(d); w8(n,d0,d); } // Decrement Memory by One __device__ void _dex(_6502 *n) { X--; ZN(X); } // Decrement Index X by One __device__ void _dey(_6502 *n) { Y--; ZN(Y); } // Decrement Index Y by One __device__ void _inc(_6502 *n) { u16 d0=d; LDM; d++; d &= 0xff; ZN(d); w8(n,d0,d); d=d0; } // Incr Memory by One __device__ void _inx(_6502 *n) { X++; ZN(X); } // Increment Index X by One __device__ void _iny(_6502 *n) { Y++; ZN(Y); } // Increment Index Y by One __device__ void _jmp(_6502 *n) { PC=d;} // Jump to New Location __device__ void _jsr(_6502 *n) { push16(n,PC-1); PC=d; } // Jump to New Location Saving Return Address __device__ void _lda(_6502 *n) { LDM; A=d; ZN(A); } // Load Accumulator with Memory __device__ void _ldx(_6502 *n) { LDM; X=d; ZN(X); } // Load Index X with Memory __device__ void _ldy(_6502 *n) { LDM; Y=d; ZN(Y); } // Load Index Y with Memory __device__ void _lsr(_6502 *n) { LD_A_OR_M(); C=w & 1; w>>=1; ZN(w); ST_A_OR_M(); } // Shift Right One Bit __device__ void _asl(_6502 *n) { LD_A_OR_M(); C=(w>>7) & 1; w<<=1; ZN(w); ST_A_OR_M();} // Shift Left One Bit __device__ void _rol(_6502 *n) { LD_A_OR_M(); u8 c = C; C=(w>>7) & 1; w=(w<<1) | c; ZN(w); ST_A_OR_M(); } // Rotate One Bit Left (Memory or Accumulator) __device__ void _ror(_6502 *n) { LD_A_OR_M(); u8 c = C; C=(w & 1); w=(w>>1) | (c<<7); ZN(w); ST_A_OR_M(); } // Rotate One Bit Right (Memory or Accumulator) __device__ void _nop(_6502 *n) { /* No Operation */ } __device__ void _pha(_6502 *n) { push8(n, A); } // Push Accumulator on Stack __device__ void _php(_6502 *n) { push8(n, P | 0x10); } // Push Processor Status on Stack __device__ void _pla(_6502 *n) { A=pop8(n); Z=(A==0); S=(A>>7)&0x1;} // Pull Accumulator from Stack __device__ void _plp(_6502 *n) { P=pop8(n) & 0xef | 0x20; } // Pull Processor Status from Stack __device__ void _rti(_6502 *n) { P=(pop8(n) & 0xef) | 0x20; PC=pop16(n); } // Return from Interrupt __device__ void _rts(_6502 *n) { PC=pop16(n)+1;} // Return from Subroutine __device__ void _sec(_6502 *n) { C=1;} // Set Carry Flag __device__ void _sed(_6502 *n) { D=1;} // Set Decimal Mode __device__ void _sei(_6502 *n) { I=1;} // Set Interrupt Disable Status __device__ void _sta(_6502 *n) { w8(n,d,A);} // Store Accumulator in Memory __device__ void _stx(_6502 *n) { w8(n,d,X);} // Store Index X in Memory __device__ void _sty(_6502 *n) { w8(n,d,Y);} // Store Index Y in Memory __device__ void _tax(_6502 *n) { X=A; ZN(X); } // Transfer Accumulator to Index X __device__ void _tay(_6502 *n) { Y=A; ZN(Y); } // Transfer Accumulator to Index Y __device__ void _tsx(_6502 *n) { X=SP;ZN(X); } // Transfer Stack Pointer to Index X __device__ void _txa(_6502 *n) { A=X; ZN(A); } // Transfer Index X to Accumulator __device__ void _txs(_6502 *n) { SP=X; } // Transfer Index X to Stack Pointer __device__ void _tya(_6502 *n) { A=Y; ZN(A); } // Transfer Index Y to Accumulator // undocumented __device__ void _lax(_6502 *n) { _lda(n); X=A; ZN(A); } // lda, ldx __device__ void _sax(_6502 *n) { w8(n,d,A&X); } __device__ void _dcp(_6502 *n) { _dec(n); _cp(n,A,d); } __device__ void _isb(_6502 *n) { _inc(n); _sbc(n); } __device__ void _slo(_6502 *n) { _asl(n); _ora(n); } __device__ void _rla(_6502 *n) { _rol(n); _and(n); } __device__ void _sre(_6502 *n) { _lsr(n); _eor(n); } __device__ void _rra(_6502 *n) { _ror(n); _adc(n); }


And many others I can't remember

My Apple II

PART 4: Once upon a GPU

Nowadays, the term GPU is commonly found in machine learning context. It may be shocking to many to learn that they existed 10, 20 or even 30 years ago. Recently, I tried to recreate the way one had to follow in order to perform a simple computation on a GPU. Things have not changed dramatically and the story of what we have now is a story of unindented consequences.

Just in order to provide some kind of a context I would like to give you a short history of why a GPU was needed.


This is the space where the entire frame to be displayed is stored. Early CPUs such as 8080 and 6502 had a dedicated memory space which could be written to directly by the CPU using a regular load/store operation. Here you can see the 32x32 framebuffer of my toy 6502 system. The pixels are written by the CPU to an area of memory which is used as the canvas.

The 6502 emulator used here:
6502 assembler and the program to generate the pattern:

CPU needs help

Let's take one of my all-time favourite games, Space Invaders, and its first incarnation for a 8080 CPU - the arcade version (Computer Archeology link)

We have a framebuffer/screen, 256 x 224 pixels, to be exact, monochrome. All you need to perform is a very simple task: set each pixel in a frame (i.e. write a value at (x,y) for all 256x224 pixels ), i.e. memset. That is 57344 pixels in total. Not a big deal. But you would probably want to process around 30 frames per second at least or better, 60. Then: 57344 x 60 = 3440640 pixels per second.

The basic CPU-MEM-VIDEO system

The 8080 CPU uses some part of its RAM as a screen buffer. It can read from the memory, and write to it at a rate not exceeding 1 word / cycle. That is not including the cost it takes to loop over an array. A simple task, becomes impossible.

Here is the actual piece of code doing this work: (full disassembly)

ClearScreen: 1A5C: 21 00 24 LD HL,$2400 ; Screen coordinate 1A5F: 36 00 LD (HL),$00 ; Clear it 1A61: 23 INC HL ; Next byte 1A62: 7C LD A,H ; Have we done ... 1A63: FE 40 CP $40 ; ... all the screen? 1A65: C2 5F 1A JP NZ,$1A5F ; No ... keep going 1A68: C9 RET ; Out

Each instruction has some cost expressed in the number of cycles needed - 8080 opcode reference. So, in order to clear one pixel, the lower bound to execute is about 7 cycles. You would need a CPU running at about 25 MHz to do just that.

Here is this piece of code in 'action'. On the left hand side you can see the MEM address being cleared (register HL) - it needs to go all the way up to address 0x3fff. In the middle there is a visualization showing the vram section of the memory (initially the contents are gray). The right hand side shows the program counter, instruction count since start and string dissassembly.

In the RAM space [bytes 2400-3fff] ( super-slow version for debugging )

In the pixel space [224 x 256] (~1000x faster than the above one, still sub-realtime)

In fact, the screen is cleared only when it's really needed. If you analyze the code, the authors had to be quite creative to be able to draw everything quickly, but not everything was possible. I recommend looking at the disassembled code, just to see how painful the process must have been.

Clearing the screen was not the only problem which the authors of the game were facing. A more complicated operation was related to the movement of objects ( the bullet/laser beam, aliens shifting ). This required something a little bit different.

The need for speed

Enter the 'Accelerator'

In addition to the 8080 CPU, there is an auxiliary bit shift register which had to be added in order to `shift` the alien pixels quicker as they move left or right. The CPU can write to this register and read from it through its I/O inferface. This may very well be the first version of a 'GPU' or an accelerator - an alien shifting device (Fujitsu MB14241).

/* a C struct representing what the circuit is and what it does */ struct mb14241 { u8 lo, hi, shift_amount; }; /* written by 8080 (OUT 4) */ void append(struct mb14241 *m, u8 new_val) { m->hi = m->lo; m->lo = new_val; } /* written by 8080 (OUT 2) */ void set_shift(struct mb14241 *m, u8 amount) { m->shift_amount = amount & 0x3; } /* read by 8080 (IN 3) */ u8 read(struct mb14241 *m) { return ((m->lo | m->hi << 8) >> (8 - m->shift_amount)) & 0xff; }

MB14241 in action

To let you experience this wonderful game and experiment with the code, I have written a full emulator of the Space Invaders machine (8080 CPU, the 'GPU', I/O, interrupt subsystem). Here's the github link:

For more details on MB14241, see the Dedicated Shift Hardware at the Computer Archelogy site -

Moving forward

Other consoles from that era like ATARI 2600 also included some specialized circuits to accelerate rendering (TIA). ATARI exploits symmetries and intruduces some special registers used for missiles and battlefield.

Patterns like these are relatively cheap

Link to the code for TIA in C: tia.c

We need more...

Later consoles like NES and Gameboy had a separate PPU (link) a Pixel Processing Pnit which was capable of rendering without CPU intervention and operated on tiles - small textures). Also it allowed smooth scrolling.

Tile layout in NES games

Tile layout in PACMAN for NES, L: Color coded tile IDs, R: What we see after the 'textures' are fetched

NES PPU emulation is much more involved

u8 nmi_enable = bit(PPUCTRL, 7); for (u64 i=0; i> (py & 7)) & 0x1]; // render background u8 nt_idx = PPUCTRL & 0x3; u16 nt_base = nt_bases[nt_idx]; u16 bg_table=bit(PPUCTRL,4) ? 0x1000 : 0x0000; u16 tile=vram[nt_base + ((py & 248) << 2) + (px >> 3)]; u16 pattern_off=bg_table + (tile << 4) + (py & 7); u8 curattrib = (vram[nt_base + 0x3C0 + ((py & 224) >> 2) + (px >> 5)] >> (((px & 16) >> 3) | ((py & 16) >> 2)) & 3) << 2; u8 curpixel = (chr[pattern_off & 0x1fff] >> (~px & 7)) & 1; curpixel |= ((chr[(pattern_off+8) & 0x1fff] >> (~px & 7)) & 1) << 1; u32 color = palette[vram[0x3f00 + (curattrib | curpixel)] & 0x3f]; pix[3*(px+py*IM_W)+0] = (color >> 16) & 0xff; pix[3*(px+py*IM_W)+1] = (color >> 8) & 0xff; pix[3*(px+py*IM_W)+2] = (color >> 0) & 0xff; } col++; ppu_cycle = (ppu_cycle+1) % 341;

Link to the code for NES PPU in C: ppu.c

Gameboy: Smooth 2D scrolling

You can study my implementation of PPU for GAMEBOY:

Part of the code emulating the PPU for Gameboy

void gpu_draw_sprites() { u8 gpu_sprite_on = (REG_LCDC >> 1) & 0x1; u8 gpu_sprite_size = (REG_LCDC >> 2) & 0x1 ? 16 : 8;//8x8, 8x16 if (gpu_sprite_on) { for (u8 idx=0; idx<40; idx++) { u8 i=39-idx; u16 spriteaddr=0xfe00 + ((s16)i) * 4; s32 spritey = ((s32)((u16)r8(spriteaddr+0)))-16; s32 spritex = ((s32)((u16)r8(spriteaddr+1)))-8; u16 tilenum = r8(spriteaddr+2); if (gpu_sprite_size == 8) tilenum &= 0xff; else tilenum &= 0xfe; u8 flags = r8(spriteaddr+3); u8 usepal1 = (flags >> 4) & 0x1; u8 xflip = (flags >> 5) & 0x1; u8 yflip = (flags >> 6) & 0x1; u8 belowbg = (flags >> 7) & 0x1; u8 c_palnr = flags & 0x7; //u8 c_vram1 = (flags >> 3) & 0x1; u8 line = REG_SCANLINE; if (line < spritey || line >= (spritey + gpu_sprite_size)) { continue; } if (spritex < -7 || spritex >= (160)) { continue; } u16 tiley; tiley = yflip ? (gpu_sprite_size-1-(line-spritey)) : (line-spritey); u16 tileaddress = 0x8000 + tilenum*16 + tiley*2; u16 b0 = tileaddress; u16 b1 = tileaddress+1; u8 data0 = r8(b0); u8 data1 = r8(b1); u32 screen_off = line*160*3; for (u8 x=0; x<8; x++) { if (((spritex+x)<0) || ((spritex+x) >= 160)) continue; if (belowbg && bgprio[(spritex+x)] != 0) continue; u8 off = xflip ? x : (7-x); u8 pal = usepal1 ? REG_OBJPAL1 : REG_OBJPAL0; u8 color0_idx = ((data0 >> (off)) & 0x1); u8 color1_idx = ((data1 >> (off)) & 0x1); u8 color_idx = color0_idx + color1_idx*2; if (color_idx == 0) continue; u8 r,g,b; u8 color = (pal>>(color_idx*2))&0x3; if (color == 0) {r=255; g=255; b=255;} if (color == 1) {r=192; g=192; b=192;} if (color == 2) {r=96; g=96; b=96;} if (color == 3) {r=0; g=0; b=0;} pix[!buffer][screen_off+3*(spritex+x)+0] = r; pix[!buffer][screen_off+3*(spritex+x)+1] = g; pix[!buffer][screen_off+3*(spritex+x)+2] = b; } } } }

1990s onward

- S3, 2D acceleration era. The term GPU was coined by Sony in reference to the 32-bit Sony GPU in the Playstation.

- OpenGL 1.0 introduced in 1993

- 1995 Diamond Edge 3D - NV1

1996 Diamond Monster 3D (Voodoo 1)


- 2000s brought programmable shading

- 2007 - CUDA released

Last update: 2/20/20