DEMYSTIFYING COMPUTER ARCHITECTURE
PART 1: On computation - algorithms and automata
PART 2: The art of emulating
PART 3: MOS 6502
PART 4: Once upon a GPU
PART 5: DIY CUDA
PART 3: MOS Technology 6502
My undergraduate curriculum included a course called 'Computer Architecture'. I learned how CPUs and computer systems operate by studying two 'cases': MIPS and i386. Since then, I have gone through a wide spectrum of designs and instruction sets - some mainstream, like Intel, ARM, NVIDIA, IBM Power; some niche, like CELL BE (PS3) or RISC-V; and some specialized, such as Tilera or Epiphany. They are all useful. I have also done some work on much older 8-bit CPUs such as the Intel 8080, Z80 or MOS 6502. If I were to design a university course, I would start with one of those, because they are small enough to be mapped at the transistor level, yet fully functional - they have just the right amount of complexity to express a program, and some of the solutions used in those early designs help explain why CPUs evolved in particular directions.
This may be a subjective statement, but in my opinion the 6502 is the most elegant CPU design I have seen. It is so compact and neat that I will try to explain how it works within this short article. Let's start by giving it some context - technically, the full name of the CPU is MOS Technology 6502. It is an 8-bit CPU released in 1975, and it powered such machines as the Commodore 64, Apple I, Apple II, Atari 2600 and Nintendo Entertainment System (NES). A descendant of the original design is still in production! - Link. Just look at this to appreciate its significance:
Steve Wozniak, Steve Jobs and 6502 die-shot on the wall
Commodore 64
Did you know that T-800 was powered by 6502?!
So, we have a CPU die shot with some annotated elements, and then there is a screenshot from the movie showing some 6502 assembly code - let's see what the Terminator was trying to run and how the CPU would execute it.
I'll do something risky now - you may want to skip this part if you already know how CPUs work. Let's assume you don't, but that you know neural networks. Assuming that the average reader is well aware of various machine learning models, it may work. Let's draw an analogy between a CPU and a recurrent model such as an LSTM (although technically a CPU would be closer to a Differentiable Neural Computer / NTM). Or it could be a Transformer - you will see the analogy.
CPU vs LSTM cell
The exact architecture is not relevant here, but the key concepts are:
- repeated operation and transformation of data
- part of the input data is acquired from within (internal state)
- part of the input data is taken from the external environment (we may be able to write to it, but cannot assume exclusive access and full control)
- output is also internal/external; input and output addresses do not have to be the same elementwise
The entire CPU has roughly 3500 transistors and can nowadays be fully simulated in software - Visual 6502
The following visualization was made using Visual 6502 and a tool called xray6502
Therefore we need to figure out the internal/external processing done by a CPU. First, let's assume that we are going to describe the entire CPU's state as a structure like this one:
typedef struct { ... } cpu;
Now we need to answer the following questions:
1. What constitutes the state?
2. How is the state changed?
3. How are the rules governing the state-change represented?
4. What is a program?
5. Can a CPU change the state of an external device?
State = Registers
The state of the 6502 (and arguably of any CPU) can be represented by the contents of its internal registers. A register is the smallest building block within a CPU which can store state (as opposed to circuits which are stateless - purely combinational; for details, check the difference between sequential logic, for example latches, and combinational logic). The number of registers and their sizes vary between CPUs (x64 registers). Some registers have a very specific function (condition flags, accumulator, floating point, vector ops, etc.). Hopefully, by the end of this article you will understand why CPUs don't contain millions of registers. The 6502 has the following set of registers:
A (Accumulator) (8 bits) - This is the main register that is used in all arithmetic and logical operations
X (X Index) (8 bits) - a special register used to store memory offsets
Y (Y Index) (8 bits) - just like X, it is used mainly for indexing
P (Processor Status) (8 bits) - it 'remembers' some information about the previously executed operation (example: was the result 0?)
SP (Stack Pointer) (8 bits) - Used to address the 256-byte stack.
PC (Program Counter) (16 bits) - It points to the address of the instruction that will be executed next
That's it. What does it mean?
In order to represent the state of the CPU, we just need to store 8+8+8+8+8+16 = 56 bits. You will see how shortly. For now, let's complete the structure:
typedef uint8_t u8;
typedef uint16_t u16;
typedef struct {
u8 A;
u8 X;
u8 Y;
u8 P;
u8 SP;
u16 PC;
} state_6502;
Therefore, as per the technical description (wiki):
The 6502 is a little-endian 8-bit processor with a 16-bit address bus
What else do we need? Let's introduce an example which will motivate what comes next. Any conventional CPU operates in a discrete fashion by manipulating its state. In other words, its operation can be expressed as:
while (running) {
state[t+1] := function (state[t])
}
To be more concrete, let's try to make it real code:
// modify registers 'in-place'
void cpu_step(state_6502 *s) {
// current state 's' determines what we are going to do
operation = get_operation(s);
operation(s); // execute
}
while (running) {
cpu_step(&state);
}
Here, the function is the CPU logic we need to accomplish some task. We need to learn more about the way a CPU like the 6502 encodes that function. Given a state s at some time t, we need to obtain 3 pieces of information: 1. the instruction to be executed, 2. the addressing mode, 3. the external operands. The 6502 does this in a 3-step process. The first step is called fetch. It uses the value stored in the Program Counter as the address of a byte B, which is then read, and PC is incremented (by default). Based on the value of that byte, we determine the sub-type of the instruction (the addressing mode) and whether any extra bytes need to be fetched (depending on the addressing mode, there can be 0, 1 or 2) - this step is called decode. The last, optional step retrieves the additional data, incrementing PC accordingly. After this is completed we can execute the instruction, which produces new register values and results in an optional write to external memory. A loop like this (fetch, decode, execute) is called a cycle.
{
u8 B = mem[PC++];
mode, instruction, n_operands = decode(B);
/* fetch extra n_operands */
execute();
/* we are done with this cycle */
}
Addressing modes
There are 56 instructions and 13 addressing modes, which makes the 6502 quite powerful, although some combinations are not valid. Some of the addressing modes make use of the X and Y registers; those are used primarily to implement loops or to address memory in a compound form (base + offset). Here's the C code implementing the modes; b is the number of additional bytes to be read:
Good resources
https://skilldrick.github.io/easy6502/
https://en.wikibooks.org/wiki/6502_Assembly
https://www.masswerk.at/6502/6502_instruction_set.html
void imp() {m=0; b=0; } // implied, 1
void acc() {m=1; b=0; } // accumulator, 1
void imm() {m=2; b=1; d=(u16)f8(); } // immediate, 2
void zp() {m=3; b=1; d=(u16)f8(); } // zero page, 2
void zpx() {m=4; b=1; u8 r=f8(); d=(r+X) & 0xff;} // zero page, x, 3
void zpy() {m=5; b=1; u8 r=f8(); d=(r+Y) & 0xff; } // zero page, y, 3
void rel() {m=6; b=1; u8 r=f8(); if (r<0x80) d=PC+r; else d=PC+r-0x100;} // relative, 2
void abso() {m=7; b=2; d=f16(); } // absolute, 3
void absx() {m=8; b=2; d=f16(); cyc+=((d>>8)!=((d+X)>>8)) ? pg[op] : 0; d+=X; } // absolute, x, 3
void absy() {m=9; b=2; d=f16(); cyc+=(d>>8)!=((d+Y)>>8) ? pg[op] : 0; d+=Y; } // absolute, y, 3
void ind() {m=10; b=2; d=r16(f16()); } // indirect, 3
void indx() {m=11; b=1; u8 r=f8(); d=r16((u8)(r + X)); } // indirect x
void indy() {m=12; b=1; u8 r=f8(); d=r16((u8)(r)); cyc+=(d>>8)!=((d+Y)>>8) ? pg[op] : 0; d+=Y;} // indirect y
There is a 1:1 mapping between an instruction opcode and its addressing mode, so one possible implementation of decoding is a look-up table, which is what we do in this version. The 'Decode ROM' in the 'real' silicon most likely implements it in a similar fashion.
Here's the complete 'step'
void cpu_step() {
op = f8(); /* read opcode */
addrtable[op](); /* determine m, b */
/* read b bytes as operands */
optable[op](); /* execute */
}
Interacting with external memory
In order to accomplish any non-trivial task, we need to interact with some kind of external storage which either contains data we need (say, the program) or which we need to write to (say, a framebuffer used to draw something). There is one fundamental limitation when it comes to communicating with the outside world which makes a CPU more like a single-headed Turing Machine than an LSTM - we can address only one memory location at any given point in time. Think of it as a form of hard attention. This applies to any memory read or write, including instruction fetch. External memory can in theory be unbounded, but we are also limited by the range of addresses a given CPU can generate. The 6502 has a 16-bit address bus, which means that we can address up to 64 kB of memory.
Address / Data IO
u8 fetch(state_6502 *s, u8 *mem)
{
u8 byte = mem[s->PC];
s->PC++;
return byte;
}
Program
A program is a chunk (or chunks) of memory which the CPU reads in sequential order, unless a JUMP or CALL/RETURN instruction is executed. The program counter keeps track of the current location in the program. Typically a program is loaded into RAM (read/write memory) accessible by the CPU, but there can also be a ROM built into the system which contains some very basic programs used during boot.
Fetch/Decode Cycle
As you can see, there is a fundamental bottleneck which existed then and still exists in modern CPUs. The so-called 'von Neumann architecture' is one where the CPU communicates with memory through a narrow (compared to the CPU's internal capacity) bus. Now it may be clearer why it is better to compare a CPU to an architecture where external memory is explicitly defined. For example, the 'Differentiable Neural Computer' uses attention to 'focus' on a memory location for either a write or a read. Multiple heads increase the width of the bus, but there is always a tradeoff between the cost of a wide bus and its benefits.
Source: Nature
Let's look at an example program (assembly source)
start:
lda #$e1
sta $0
lda #$01
sta $1
ldy #$20
write:
ldx #$00
eor ($0, x)
sta ($0),y
inc $0
bne write
inc $1
ldx $1
cpx #$06
bne write
rts
After assembly:
[0xa9,0xe1,0x85,0x00,0xa9,0x01,
0x85,0x01,0xa0,0x20,0xa2,0x00,
0x41,0x00,0x91,0x00,0xe6,0x00,
0xd0,0xf6,0xe6,0x01,0xa6,0x01,
0xe0,0x06,0xd0,0xee,0x60,0x00]
Here's a link to the assembler I wrote: https://github.com/krocki/as6502. You can also use any other available assembler to experiment with the CPU. One more important thing - the memory map: 0x0-0xff is the ZERO PAGE, 0x100-0x1ff is the STACK, 0x200-0x5ff is used as the FRAMEBUFFER, and by default the PROGRAM starts at 0x600. You can drive the CPU through the python frontend - '6502.py' - or call it from C directly.
Program Synthesis
Here I will explain one of the motivations for writing a 6502 emulator - I wanted to experiment with automatic program generation (program synthesis). That is, given some target behavior, come up with a program which accomplishes that task. Let me show you an example experiment (included in the github repo). The goal is to fill the screen (memory region 0x200-0x400), so it is relatively simple. The learning mechanism is also quite primitive, yet it works. It is based on a Genetic Algorithm which treats programs as strings (genomes), which can undergo mutation (change some value) or crossover (2 strings 'mate' and produce 2 new offspring). The algorithm itself is self-contained in github.com/krocki/min-ga, where I show the simplest cases of finding a secret string and evolving an image based on a fitness evaluation.
Applied to a string
Applied to an image
# a simple mutation strategy, just select a random element and replace with a random symbol
def mutate(chromosome, chars):
i = random.randint(0, len(chromosome) - 1)
# i
# X = xxxxxxxxxxxxxxxxxxx ( original)
# ---------|-------------
# M - xxxxxMxxxxxxxxxxxxx ( mutated )
mutated=list(chromosome); mutated[i]=random.choice(chars)
return "".join(mutated)
def crossover(x,y): # x and y are parents
i = random.randint(1, len(x) - 2) # crossover point 0 (i)
j = random.randint(1, len(y) - 2) # crossover point 1 (j)
if i > j: i, j = j, i
# i j
# X = xxxxx xxxxxxxxxxxxx xxxxxxx ( parent 0)
# Y = yyyyy yyyyyyyyyyyyy yyyyyyy ( parent 1)
# -------------------------------
# p = xxxxx yyyyyyyyyyyyy xxxxxxx ( child 0 )
# q = yyyyy xxxxxxxxxxxxx yyyyyyy ( child 1 )
#
p,q = x[:i] + y[i:j] + x[j:], y[:i] + x[i:j] + y[j:]
return p, q # 2 children
What I wanted to accomplish with the 6502 is similar to the image search case. There is one difficulty - the path leading to the final image may be winding and involves 'learning' concepts like loops, memory copies and counters. We generate the final image indirectly, through the code. The following images show the intermediate steps of learning such a program. The leftmost image is the first generation and the last one is after approximately 290 generations (about 3 minutes) - not bad! You can also see the final binary which was generated. We can actually disassemble it and see what happened. I included this example as train.py in the repo (https://github.com/krocki/6502)
623c bd3c 4f52 ad26 2dad 94d6 379c 55dc
f4b3 9e18 e8fd 6d36 729b 130c 51a2 17e0
395b 00f6 2b58 6b37 dce6 4e4e 603d 8bfc
9e52 c26d 2232 ff1e f4be e93a df9b bfaa
8387 5ee2 4411 8b56 6512 10b4 3799 f6a6
30d0 3bdc b598 102c 4773 78f2 09ce 5931
3f5c 012d 6c64 9917 dc94 0a6e 05d3 97ae
111d 9050 161b a5f8 ed0e 4c7e 7e7c fc44
I 'played' with this approach for some time and disassembled the learned programs. Here's a brief collection of learned behaviors I've seen - it's remarkable that even for simple tasks, multiple very different solutions are found
Just a counter - the intensity of the upper dot is the counter, which then is used to iterate over the lower block
When trying to learn a more complex shape, sometimes the strategy is to fill a chunk, then subtract from it
A different approach is to fill a shape using additive, small increments
A simple loop to fill a block
Weird, but fast fills
Here, we can see that the RAM is used (left)
Recursive fill!
Fill through self-modification
Bonus: I included a CUDA implementation of the 6502 which allows you to run thousands of programs in parallel - check the nv6502 directory
It's quite funny, but a single V100 core can emulate a 6502 CPU at about 2 MHz, which is faster than the original CPU
An array of 6502 CPUs can be emulated in parallel resulting in a large population size
Console output showing the result computed on the GPU
To conclude:
Here is the entire logic of the 6502 written in CUDA - roughly 200 lines - and that is what I like about the 6502! It is very concise and elegant.
#define STACK_PG 0x0100
#define ZN(x) { Z=((x)==0); S=((x)>>7) & 0x1; }
#define LDM { d=(m>2) ? r8(n,d) : d; }
#define LD_A_OR_M() u8 w=(m==1)?A:r8(n,d)
#define ST_A_OR_M() if (m!=1) w8(n,d,w); else A=w;
__device__ u8 r8 (_6502 *n, u16 a) { return n->mem[a % MEM_SIZE]; } // byte read
__device__ void w8 (_6502 *n, u16 a, u8 v) { n->mem[a % MEM_SIZE] = v; } // byte write
__device__ u8 f8 (_6502 *n) { return r8(n, PC++); } // byte fetch
//// 16-bit versions
__device__ u16 r16 (_6502 *n, u16 a) { u16 base=a & 0xff00; return (r8(n,a) | (r8(n,base|((u8)(a+1))) << 8)); } // buggy
__device__ u16 r16_ok(_6502 *n, u16 a) { return (r8(n,a) | (r8(n,a+1) << 8)); }
__device__ u16 f16 (_6502 *n) { return (f8(n) | ((f8(n))<<8)); }
//
//// stack ops
__device__ u8 pop8 (_6502 *n) { SP++; return r8(n, STACK_PG | SP); }
__device__ u16 pop16 (_6502 *n) { return (pop8(n) | ((pop8(n))<<8)); }
__device__ void push8 (_6502 *n, u8 v) { w8(n, STACK_PG | SP, v); SP--; }
__device__ void push16(_6502 *n, u16 v) { push8(n,(v>>8)); push8(n,v); }
__device__ void jr (_6502 *n, u8 cond) { if (cond) { PC=(u16)d; } }
//
//// decoding addressing mode
__device__ void imp (_6502 *n) { m=0; b=0; } // implied, 1
__device__ void acc (_6502 *n) { m=1; b=0; } // accumulator, 1
__device__ void imm (_6502 *n) { m=2; b=1; d=(u16)f8(n); } // immediate, 2
__device__ void zp (_6502 *n) { m=3; b=1; d=(u16)f8(n); } // zero page, 2
__device__ void zpx (_6502 *n) { m=4; b=1; u8 r=f8(n); d=(r+X) & 0xff;} // zero page, x, 3
__device__ void zpy (_6502 *n) { m=5; b=1; u8 r=f8(n); d=(r+Y) & 0xff; } // zero page, y, 3
__device__ void rel (_6502 *n) { m=6; b=1; u8 r=f8(n); if (r<0x80) d=PC+r; else d=PC+r-0x100;} // relative, 2
__device__ void abso(_6502 *n) { m=7; b=2; d=f16(n); } // absolute, 3
__device__ void absx(_6502 *n) { m=8; b=2; d=f16(n); d+=X; } // absolute, x, 3
__device__ void absy(_6502 *n) { m=9; b=2; d=f16(n); d+=Y; } // absolute, y, 3
__device__ void ind (_6502 *n) { m=10; b=2; d=r16(n,f16(n)); } // indirect, 3
__device__ void indx(_6502 *n) { m=11; b=1; u8 r=f8(n); d=r16(n,(u8)(r + X)); } // indirect x
__device__ void indy(_6502 *n) { m=12; b=1; u8 r=f8(n); d=r16(n,(u8)(r)); d+=Y;} // indirect y
//instructions
__device__ void _adc(_6502 *n) {
u8 a = A; LDM; A=d+A+C; ZN(A);
u16 t = (u16)d + (u16)a + (u16)C; C=(t > 0xff);
V = (!((a^d) & 0x80)) && (((a^A) & 0x80)>0 );
} // Add Memory to Accumulator with Carry
__device__ void _sbc(_6502 *n) {
u8 a = A; LDM; A=A-d-(1-C); ZN(A);
s16 t = (s16)a - (s16)d - (1-(s16)C); C=(t >= 0x0);
V = (((a^d) & 0x80)>0) && (((a^A) & 0x80)>0);
} // Subtract Memory from Accumulator with Borrow
__device__ void _cp (_6502 *n, u8 _a, u8 _b) { u8 r=_a-_b; C=(_a>=_b); ZN(r); }
__device__ void _ora(_6502 *n) { LDM; A|=d; ZN(A); } // "OR" Memory with Accumulator
__device__ void _and(_6502 *n) { LDM; A&=d; ZN(A); } // "AND" Memory with Accumulator
__device__ void _eor(_6502 *n) { LDM; A^=d; ZN(A); } // "XOR" Memory with Accumulator
__device__ void _cmp(_6502 *n) { LDM; _cp(n,A,d); } // Compare Memory and Accumulator
__device__ void _cpx(_6502 *n) { LDM; _cp(n,X,d); } // Compare Memory and Index X
__device__ void _cpy(_6502 *n) { LDM; _cp(n,Y,d); } // Compare Memory and Index Y
__device__ void _bcc(_6502 *n) { jr(n,!C); } // Branch on Carry Clear
__device__ void _bcs(_6502 *n) { jr(n,C); } // Branch on Carry Set
__device__ void _beq(_6502 *n) { jr(n,Z); } // Branch on Result Zero
__device__ void _bit(_6502 *n) { LDM; S=(d>>7) & 1; V=(d>>6) & 1; Z=(d & A)==0; } // Test Bits in Memory with A
__device__ void _bmi(_6502 *n) { jr(n, S); } // Branch on Result Minus
__device__ void _bne(_6502 *n) { jr(n,!Z); } // Branch on Result not Zero
__device__ void _bpl(_6502 *n) { jr(n,!S); } // Branch on Result Plus
__device__ void _brk(_6502 *n) { B=1; } // Force Break
__device__ void _bvc(_6502 *n) { jr(n,!V); } // Branch on Overflow Clear
__device__ void _bvs(_6502 *n) { jr(n, V); } // Branch on Overflow Set
__device__ void _clc(_6502 *n) { C=0; } // Clear Carry Flag
__device__ void _cld(_6502 *n) { D=0; } // Clear Decimal Mode
__device__ void _cli(_6502 *n) { I=0; } // Clear interrupt Disable Bit
__device__ void _clv(_6502 *n) { V=0; } // Clear Overflow Flag
__device__ void _dec(_6502 *n) { u16 d0 = d; LDM; d--; d &= 0xff; ZN(d); w8(n,d0,d); } // Decrement Memory by One
__device__ void _dex(_6502 *n) { X--; ZN(X); } // Decrement Index X by One
__device__ void _dey(_6502 *n) { Y--; ZN(Y); } // Decrement Index Y by One
__device__ void _inc(_6502 *n) { u16 d0=d; LDM; d++; d &= 0xff; ZN(d); w8(n,d0,d); d=d0; } // Incr Memory by One
__device__ void _inx(_6502 *n) { X++; ZN(X); } // Increment Index X by One
__device__ void _iny(_6502 *n) { Y++; ZN(Y); } // Increment Index Y by One
__device__ void _jmp(_6502 *n) { PC=d;} // Jump to New Location
__device__ void _jsr(_6502 *n) { push16(n,PC-1); PC=d; } // Jump to New Location Saving Return Address
__device__ void _lda(_6502 *n) { LDM; A=d; ZN(A); } // Load Accumulator with Memory
__device__ void _ldx(_6502 *n) { LDM; X=d; ZN(X); } // Load Index X with Memory
__device__ void _ldy(_6502 *n) { LDM; Y=d; ZN(Y); } // Load Index Y with Memory
__device__ void _lsr(_6502 *n) { LD_A_OR_M(); C=w & 1; w>>=1; ZN(w); ST_A_OR_M(); } // Shift Right One Bit
__device__ void _asl(_6502 *n) { LD_A_OR_M(); C=(w>>7) & 1; w<<=1; ZN(w); ST_A_OR_M();} // Shift Left One Bit
__device__ void _rol(_6502 *n) { LD_A_OR_M(); u8 c = C; C=(w>>7) & 1; w=(w<<1) | c; ZN(w); ST_A_OR_M(); } // Rotate One Bit Left (Memory or Accumulator)
__device__ void _ror(_6502 *n) { LD_A_OR_M(); u8 c = C; C=(w & 1); w=(w>>1) | (c<<7); ZN(w); ST_A_OR_M(); } // Rotate One Bit Right (Memory or Accumulator)
__device__ void _nop(_6502 *n) { /* No Operation */ }
__device__ void _pha(_6502 *n) { push8(n, A); } // Push Accumulator on Stack
__device__ void _php(_6502 *n) { push8(n, P | 0x10); } // Push Processor Status on Stack
__device__ void _pla(_6502 *n) { A=pop8(n); Z=(A==0); S=(A>>7)&0x1;} // Pull Accumulator from Stack
__device__ void _plp(_6502 *n) { P=(pop8(n) & 0xef) | 0x20; } // Pull Processor Status from Stack
__device__ void _rti(_6502 *n) { P=(pop8(n) & 0xef) | 0x20; PC=pop16(n); } // Return from Interrupt
__device__ void _rts(_6502 *n) { PC=pop16(n)+1;} // Return from Subroutine
__device__ void _sec(_6502 *n) { C=1;} // Set Carry Flag
__device__ void _sed(_6502 *n) { D=1;} // Set Decimal Mode
__device__ void _sei(_6502 *n) { I=1;} // Set Interrupt Disable Status
__device__ void _sta(_6502 *n) { w8(n,d,A);} // Store Accumulator in Memory
__device__ void _stx(_6502 *n) { w8(n,d,X);} // Store Index X in Memory
__device__ void _sty(_6502 *n) { w8(n,d,Y);} // Store Index Y in Memory
__device__ void _tax(_6502 *n) { X=A; ZN(X); } // Transfer Accumulator to Index X
__device__ void _tay(_6502 *n) { Y=A; ZN(Y); } // Transfer Accumulator to Index Y
__device__ void _tsx(_6502 *n) { X=SP;ZN(X); } // Transfer Stack Pointer to Index X
__device__ void _txa(_6502 *n) { A=X; ZN(A); } // Transfer Index X to Accumulator
__device__ void _txs(_6502 *n) { SP=X; } // Transfer Index X to Stack Pointer
__device__ void _tya(_6502 *n) { A=Y; ZN(A); } // Transfer Index Y to Accumulator
// undocumented
__device__ void _lax(_6502 *n) { _lda(n); X=A; ZN(A); } // lda, ldx
__device__ void _sax(_6502 *n) { w8(n,d,A&X); }
__device__ void _dcp(_6502 *n) { _dec(n); _cp(n,A,d); }
__device__ void _isb(_6502 *n) { _inc(n); _sbc(n); }
__device__ void _slo(_6502 *n) { _asl(n); _ora(n); }
__device__ void _rla(_6502 *n) { _rol(n); _and(n); }
__device__ void _sre(_6502 *n) { _lsr(n); _eor(n); }
__device__ void _rra(_6502 *n) { _ror(n); _adc(n); }
Sources
https://skilldrick.github.io/easy6502/
https://en.wikibooks.org/wiki/6502_Assembly
https://www.masswerk.at/6502/6502_instruction_set.html
http://www.6502asm.com/
And many others I can't remember
My Apple II
PART 4: Once upon a GPU
Nowadays, the term GPU is commonly found in a machine learning context. It may be shocking to many to learn that GPUs existed 10, 20 or even 30 years ago. Recently, I tried to retrace the steps one had to follow in order to perform a simple computation on an early GPU. Things have not changed dramatically, and the story of what we have now is a story of unintended consequences.
Just to provide some context, I would like to give you a short history of why a GPU was needed.
Framebuffer
This is the memory where the entire frame to be displayed is stored. Early systems built around CPUs such as the 8080 and 6502 had a dedicated memory region which could be written directly by the CPU using regular load/store operations. Here you can see the 32x32 framebuffer of my toy 6502 system. The pixels are written by the CPU to an area of memory which serves as the canvas.
The 6502 emulator used here: github.com/krocki/6502
6502 assembler and the program to generate the pattern: github.com/krocki/as6502
CPU needs help
Let's take one of my all-time favourite games, Space Invaders, and its first incarnation for an 8080 CPU - the arcade version (Computer Archeology link)
We have a framebuffer/screen - 256 x 224 pixels, monochrome, to be exact. All we need to perform is a very simple task: set each pixel in a frame (i.e. write a value at (x,y) for all 256x224 pixels), i.e. a memset. That is 57344 pixels in total. Not a big deal. But you would probably want to process around 30 frames per second, or better, 60. Then: 57344 x 60 = 3440640 pixels per second.
The basic CPU-MEM-VIDEO system
The 8080 CPU uses part of its RAM as a screen buffer. It can read from the memory and write to it at a rate not exceeding 1 word per cycle - and that is not including the cost of looping over an array. A simple task becomes nearly impossible.
Here is the actual piece of code doing this work: (full disassembly)
ClearScreen:
1A5C: 21 00 24 LD HL,$2400 ; Screen coordinate
1A5F: 36 00 LD (HL),$00 ; Clear it
1A61: 23 INC HL ; Next byte
1A62: 7C LD A,H ; Have we done ...
1A63: FE 40 CP $40 ; ... all the screen?
1A65: C2 5F 1A JP NZ,$1A5F ; No ... keep going
1A68: C9 RET ; Out
Each instruction has some cost expressed in the number of cycles needed - see the 8080 opcode reference. So, in order to clear one pixel, the lower bound is about 7 cycles, which means you would need a CPU running at about 25 MHz to do just that.
Here is this piece of code in 'action'. On the left-hand side you can see the MEM address being cleared (register HL) - it needs to go all the way up to address 0x3fff. In the middle there is a visualization showing the vram section of the memory (initially the contents are gray). The right-hand side shows the program counter, the instruction count since start and the string disassembly.
In the RAM space [bytes 2400-3fff] ( super-slow version for debugging )
In the pixel space [224 x 256] (~1000x faster than the above one, still sub-realtime)
In fact, the screen is cleared only when it's really needed. If you analyze the code, you can see that the authors had to be quite creative to draw everything quickly, and even then not everything was possible. I recommend looking at the disassembled code, just to see how painful the process must have been.
Clearing the screen was not the only problem the authors of the game were facing. A more complicated operation was the movement of objects (the bullet/laser beam, the aliens shifting). This required something a little bit different.
The need for speed
Enter the 'Accelerator'
In addition to the 8080 CPU, there is an auxiliary bit-shift register which had to be added in order to 'shift' the alien pixels quickly as they move left or right. The CPU can write to this register and read from it through its I/O interface. This may very well be the first version of a 'GPU' or an accelerator - an alien-shifting device (Fujitsu MB14241).
/* a C struct representing what the circuit is and what it does */
struct mb14241 {
u8 lo, hi, shift_amount;
};
/* written by 8080 (OUT 4) - the newest byte enters the high half */
void append(struct mb14241 *m, u8 new_val) {
  m->lo = m->hi;
  m->hi = new_val;
}
/* written by 8080 (OUT 2) - the shift offset is 3 bits wide (0-7) */
void set_shift(struct mb14241 *m, u8 amount) {
  m->shift_amount = amount & 0x7;
}
/* read by 8080 (IN 3) */
u8 read(struct mb14241 *m) {
  return (((m->hi << 8) | m->lo) >> (8 - m->shift_amount)) & 0xff;
}
MB14241 in action
To let you experience this wonderful game and experiment with the code, I have written a full emulator of the Space Invaders machine (8080 CPU, the 'GPU', I/O, interrupt subsystem). Here's the github link:
github.com/krocki/invaders
For more details on MB14241, see the Dedicated Shift Hardware at the Computer Archelogy site - https://computerarcheology.com/Arcade/SpaceInvaders/Hardware.html
Moving forward
Other consoles from that era, like the ATARI 2600, also included specialized circuits to accelerate rendering (TIA). ATARI exploited symmetries and introduced special registers for the missiles and the playfield.
Patterns like these are relatively cheap
Link to the code for TIA in C: tia.c
We need more...
Later consoles like the NES and Gameboy had a separate PPU (link), a Pixel Processing Unit, which was capable of rendering without CPU intervention and operated on tiles (small textures). It also allowed smooth scrolling.
Tile layout in NES games
Credits dustmop.io
Tile layout in PACMAN for NES, L: Color coded tile IDs, R: What we see after the 'textures' are fetched
NES PPU emulation is much more involved
u8 nmi_enable = bit(PPUCTRL, 7);
for (u64 i=0; i> (py & 7)) & 0x1];
// render background
u8 nt_idx = PPUCTRL & 0x3; u16 nt_base = nt_bases[nt_idx];
u16 bg_table=bit(PPUCTRL,4) ? 0x1000 : 0x0000;
u16 tile=vram[nt_base + ((py & 248) << 2) + (px >> 3)];
u16 pattern_off=bg_table + (tile << 4) + (py & 7);
u8 curattrib = (vram[nt_base + 0x3C0 + ((py & 224) >> 2) + (px >> 5)] >> (((px & 16) >> 3) | ((py & 16) >> 2)) & 3) << 2;
u8 curpixel = (chr[pattern_off & 0x1fff] >> (~px & 7)) & 1;
curpixel |= ((chr[(pattern_off+8) & 0x1fff] >> (~px & 7)) & 1) << 1;
u32 color = palette[vram[0x3f00 + (curattrib | curpixel)] & 0x3f];
pix[3*(px+py*IM_W)+0] = (color >> 16) & 0xff;
pix[3*(px+py*IM_W)+1] = (color >> 8) & 0xff;
pix[3*(px+py*IM_W)+2] = (color >> 0) & 0xff;
}
col++; ppu_cycle = (ppu_cycle+1) % 341;
Link to the code for NES PPU in C: ppu.c
Gameboy: Smooth 2D scrolling
You can study my implementation of PPU for GAMEBOY: github.com/krocki/gb
Part of the code emulating the PPU for Gameboy
void gpu_draw_sprites() {
u8 gpu_sprite_on = (REG_LCDC >> 1) & 0x1;
u8 gpu_sprite_size = (REG_LCDC >> 2) & 0x1 ? 16 : 8;//8x8, 8x16
if (gpu_sprite_on) {
for (u8 idx=0; idx<40; idx++) {
u8 i=39-idx;
u16 spriteaddr=0xfe00 + ((s16)i) * 4;
s32 spritey = ((s32)((u16)r8(spriteaddr+0)))-16;
s32 spritex = ((s32)((u16)r8(spriteaddr+1)))-8;
u16 tilenum = r8(spriteaddr+2);
if (gpu_sprite_size == 8) tilenum &= 0xff; else tilenum &= 0xfe;
u8 flags = r8(spriteaddr+3);
u8 usepal1 = (flags >> 4) & 0x1;
u8 xflip = (flags >> 5) & 0x1;
u8 yflip = (flags >> 6) & 0x1;
u8 belowbg = (flags >> 7) & 0x1;
u8 c_palnr = flags & 0x7;
//u8 c_vram1 = (flags >> 3) & 0x1;
u8 line = REG_SCANLINE;
if (line < spritey || line >= (spritey + gpu_sprite_size)) { continue; }
if (spritex < -7 || spritex >= (160)) { continue; }
u16 tiley;
tiley = yflip ? (gpu_sprite_size-1-(line-spritey)) : (line-spritey);
u16 tileaddress = 0x8000 + tilenum*16 + tiley*2;
u16 b0 = tileaddress;
u16 b1 = tileaddress+1;
u8 data0 = r8(b0);
u8 data1 = r8(b1);
u32 screen_off = line*160*3;
for (u8 x=0; x<8; x++) {
if (((spritex+x)<0) || ((spritex+x) >= 160)) continue;
if (belowbg && bgprio[(spritex+x)] != 0) continue;
u8 off = xflip ? x : (7-x);
u8 pal = usepal1 ? REG_OBJPAL1 : REG_OBJPAL0;
u8 color0_idx = ((data0 >> (off)) & 0x1);
u8 color1_idx = ((data1 >> (off)) & 0x1);
u8 color_idx = color0_idx + color1_idx*2;
if (color_idx == 0) continue;
u8 r,g,b;
u8 color = (pal>>(color_idx*2))&0x3;
if (color == 0) {r=255; g=255; b=255;}
if (color == 1) {r=192; g=192; b=192;}
if (color == 2) {r=96; g=96; b=96;}
if (color == 3) {r=0; g=0; b=0;}
pix[!buffer][screen_off+3*(spritex+x)+0] = r;
pix[!buffer][screen_off+3*(spritex+x)+1] = g;
pix[!buffer][screen_off+3*(spritex+x)+2] = b;
}
}
}
}
1990s onward
- S3 and the 2D acceleration era. The term GPU was coined by Sony in reference to the 32-bit Sony GPU in the PlayStation.
- OpenGL 1.0 introduced in 1993
- 1995 Diamond Edge 3D - NV1
- 1996 Diamond Monster 3D (Voodoo 1)
GLQuake
- 2000s brought programmable shading
- 2007 - CUDA released
Last update: 2/20/20