Things I Wish I Knew About Assembly

My talk for RustConf this year includes a technical deep dive of the MissingNo glitch from Pokemon Red and Blue. It was important to me to really understand not just what happened in this glitch, but why it happened. This meant I had to spend a lot of time over the last year reading through a disassembly of the game.

While I had a very rudimentary understanding of x86 assembly going into this, I had never seen this assembly syntax before. I also didn’t know much about some of the intricacies of assembly programming. As a result I went down a few rabbit holes of misinformation, and had to learn a lot about the GameBoy’s assembly. Here are some things I learned that would have saved me time had I known them up front.

The disassembly I used was built with the RGBDS assembler. Some of this may be specific to that toolchain, some of it may be specific to the GameBoy hardware, and some of it may be specific to z80 assembly. Unfortunately I don’t know enough about other z80 variants or other toolchains to say for sure.

The GameBoy used a variant of z80 assembly. It removed all of the I/O instructions (the GameBoy only used memory mapped I/O), as well as the i, r, ix, and iy registers, and a few other instructions. In their place it added a few extra instructions for manipulating stack pointers, more easily writing to the addresses used for I/O, and a few other helpers. Oh and just for fun, incrementing or decrementing a 16 bit register would corrupt sprite memory if the value in the register was within a certain range.

Instructions Require Specific Operands #

I was pretty perplexed when I came across some code like this:

ld a, [base + 0]
ld b, a
ld a, [base + 1]
ld c, a
ld a, [base + 2]
ld d, a

You’d think they could just write this instead:

ld b, [base + 0]
ld c, [base + 1]
ld d, [base + 2]

But the ld instruction can’t just take any of the logical combinations of source/destination that you might expect. For indirect loads, either the destination must be a, or the source must be [hl]. That means that if you want to read from some arbitrary address into some arbitrary register, you must first either store the address in hl, or store the value in a. The code could have also been written like this:

ld hl, base + 0
ld b, [hl]
ld hl, base + 1
ld c, [hl]
ld hl, base + 2
ld d, [hl]

A side effect of this is that a and hl were rarely used as general purpose registers. You were extremely likely to need them as intermediates, overwriting whatever they held. As a result, data was often loaded from memory in weird places, or even reloaded from memory multiple times, purely due to the limited number of registers available. As you might expect, this made bugs more common in code that couldn’t fit everything it needed to do its work in 4 registers.
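It helped me to write the rule down as a tiny checker. This is only a sketch of the rule as I described it above, not a model of the whole instruction set. The function name and the way operands are spelled are made up for illustration.

```python
# Hypothetical helper: covers only "load one byte from memory" forms of ld.
REGISTERS_8BIT = {"a", "b", "c", "d", "e", "h", "l"}


def can_load_from_memory(dest, source):
    """Return True if `ld dest, source` is a single valid instruction.

    `source` is either "[hl]" or a fixed address in brackets,
    such as "[base + 1]".
    """
    if dest not in REGISTERS_8BIT:
        return False
    if not (source.startswith("[") and source.endswith("]")):
        raise ValueError("source must be a bracketed memory operand")
    if source == "[hl]":
        return True       # any 8 bit register can be loaded through hl
    return dest == "a"    # a fixed address can only be loaded into a
```

With this, can_load_from_memory("b", "[base + 0]") is False, which is why the original code goes through a first, while can_load_from_memory("b", "[hl]") is True.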

Code size was often more important than execution speed #

For a lot of games, doing everything they needed in ~16ms was no problem. Or dropping the frame rate to 30 fps was acceptable. But code size had a very real cost associated with it. The smallest – and therefore cheapest – cartridges only had 32KiB of ROM (though I think the smallest game shipped with more than 256KiB, nobody used the smallest sizes available).

You could have up to 4MiB max of ROM for your game if you needed it, which by today’s standards is absurdly small. But most folks tried to stay even lower than that. Whenever you needed more ROM for your program, you had to double the ROM on the cartridge. And that meant that producing your game was that much more expensive, cutting directly into the profits your game could make.
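To make the doubling concrete, here is a small sketch that picks the cartridge size for a given amount of code and data. It assumes sizes start at 32KiB and double up to the 4MiB ceiling mentioned above. The function name is mine, not something from any toolchain.

```python
KIB = 1024
MIN_ROM = 32 * KIB         # smallest cartridge mentioned above
MAX_ROM = 4 * KIB * KIB    # 4MiB ceiling mentioned above


def cartridge_rom_size(bytes_needed):
    """Return the smallest ROM size (in bytes) that fits the game."""
    size = MIN_ROM
    while size < bytes_needed:
        size *= 2          # the only way to grow is to double
    if size > MAX_ROM:
        raise ValueError("game does not fit on any cartridge")
    return size
```

Going a single byte over 512KiB means paying for a 1MiB ROM, which is why shaving bytes mattered so much.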

As a result, folks often optimized for code size at the expense of execution speed – though there are absolutely some exceptions such as audio processing code, or when doing complex visual effects like parallax.

Random instructions were used as an optimization #

There were times I’d see a seemingly random instruction for no reason. It turns out there were some cases where if you want to do something specific, there’s another instruction you could use that is smaller and faster. Here are some examples:

; load 0 into a. Takes 2 cycles and 2 bytes
ld a, 0
; Takes 1 cycle and 1 byte, but does not preserve flags
xor a

; set z if a is equal to 0. Takes 2 cycles and 2 bytes
cp 0
; Takes 1 cycle and 1 byte
or a
; Also takes 1 cycle and 1 byte
and a

; set z if a is equal to 1. Takes 2 cycles and 2 bytes
cp 1
; Takes 1 cycle and 1 byte, but also changes the register (a is now a - 1).
; Can also be used on registers other than `a`
dec a
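If you want to convince yourself that these swaps are safe, you can check every possible value of a. The sketch below models only the value of a and the zero flag. The other flags do differ between these instructions, so treat it as a simplified model rather than an emulator.

```python
# Each function returns (new value of a, zero flag).

def cp(a, n):
    """cp n: compare a with n without changing a."""
    return a, a == n


def or_a(a):
    result = a | a
    return result, result == 0


def and_a(a):
    result = a & a
    return result, result == 0


def xor_a(a):
    result = a ^ a            # anything xor itself is 0
    return result, result == 0


def dec_a(a):
    result = (a - 1) & 0xFF   # 8 bit wraparound: 0 becomes 255
    return result, result == 0


for a in range(256):
    assert cp(a, 0) == or_a(a) == and_a(a)   # same a, same zero flag
    assert xor_a(a)[0] == 0                  # same end value as ld a, 0
    assert cp(a, 1)[1] == dec_a(a)[1]        # same zero flag as cp 1
```

The one to be careful with is dec a: the zero flag matches cp 1, but a itself has changed, so it only works when you no longer need the old value.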

String literals could mean anything! #

Games would sometimes use custom text encoding. Just because you see something in a string literal doesn’t mean it represents the bytes it would in other languages. For example, in Pokemon, the character “@” wasn’t printed. In the pokered disassembly, they map that character to the byte 0x50, which is the “end of name” control character. And any time you see “<trainer>” in the same disassembly, that is actually just the byte 0x5D.
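Here is a rough sketch of what such a character map does to a string literal. Only the “@” and “<trainer>” entries come from what I described above. The values for the capital letters are an assumption for illustration, so check the disassembly’s own charmap before relying on them.

```python
# Entries taken from the text above.
CHARMAP = {"@": 0x50, "<trainer>": 0x5D}
# Assumption for illustration only: capital letters start at 0x80.
CHARMAP.update({chr(ord("A") + i): 0x80 + i for i in range(26)})


def encode(text):
    """Turn a string literal into game bytes using CHARMAP."""
    tokens = sorted(CHARMAP, key=len, reverse=True)  # longest match first
    out = bytearray()
    i = 0
    while i < len(text):
        for token in tokens:
            if text.startswith(token, i):
                out.append(CHARMAP[token])
                i += len(token)
                break
        else:
            raise ValueError(f"no mapping for {text[i]!r}")
    return bytes(out)
```

Note that “<trainer>” is nine characters in the source file but becomes a single byte, and “@” becomes the terminator rather than anything printable.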

c means multiple things #

I’m a little embarrassed how long I got tripped up by this one. One of the general purpose registers is called c, but c could also mean the “carry” flag. Which one a given c refers to depends on the context. ld c, 1 will load 1 into the c register, but jp c, $F00 will jump to 0xF00 only if the carry flag is set.

But these are separate places, and instructions which set or reset the carry flag will not affect the value of the c register.
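A toy model makes the separation obvious. This sketch is hypothetical and tracks only the pieces needed to show the point: the a register, the c register, and the carry flag. The method names are mine.

```python
class Cpu:
    def __init__(self):
        self.a = 0
        self.c = 0            # the general purpose register named c
        self.carry = False    # the carry flag, also written c in conditions

    def ld_c(self, value):
        """ld c, value: only touches the c register."""
        self.c = value & 0xFF

    def add_a(self, value):
        """add a, value: updates a and the carry flag, never register c."""
        total = self.a + value
        self.carry = total > 0xFF
        self.a = total & 0xFF

    def jp_c_taken(self):
        """jp c, addr: the jump happens only if the carry flag is set."""
        return self.carry
```

After ld_c(1), setting a to 0xFF and calling add_a(1) leaves the carry flag set and a at 0, while the c register still holds 1.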

Bank switching is like a more primitive form of segmentation #

The GameBoy used an 8 bit processor with 16 bit addresses, which gives 64KiB of address space. Half of that address space was used for your cartridge’s ROM. So unless your entire game fit in 32KiB, you needed a way to access more than 32KiB of ROM through that 32KiB window. To do this, the GameBoy used a system called bank switching.

Modern operating systems use segmentation or paging to allow more physical memory than can be represented by the available address space. However, this only helps when you have multiple processes with separate memory spaces. Each individual process can still only access as much memory as can be fit in a pointer.

In these systems, a few bits of the pointer represent a segment or index into a page table, while the remaining bits represent an offset. So a program’s pointer may not be the same as the physical address it represents, but every physical address has a distinct pointer. With bank switching, we instead have the same pointer mean multiple things.

The first 16KiB of address space (0x0000-0x3FFF) always represented “bank 0”. They directly mapped to that physical address on the cartridge. But the next 16KiB (0x4000-0x7FFF) could point to anywhere. By changing the active “bank” number, this address space could point to different sections of the cartridge’s ROM.
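The mapping from a pointer to a location on the cartridge can be sketched as a single function. This assumes 16KiB banks laid out back to back on the cartridge, as described above. The function name is mine.

```python
BANK_SIZE = 0x4000  # 16KiB


def rom_offset(active_bank, address):
    """Translate a CPU address into an offset into the cartridge ROM."""
    if 0x0000 <= address <= 0x3FFF:
        return address                    # always bank 0
    if 0x4000 <= address <= 0x7FFF:
        return active_bank * BANK_SIZE + (address - 0x4000)
    raise ValueError("address is not in cartridge ROM")
```

The same pointer 0x4000 lands at offset 0x4000 with bank 1 active and at offset 0xC000 with bank 3 active, while 0x0150 is offset 0x0150 no matter which bank is selected.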

While this meant that you could have much more code than would otherwise be possible, it also meant that instructions dealing with pointers, like jp, call, ret, or even indirect loads might have to worry about switching to the appropriate bank. This meant that you essentially had to invent your own wide pointer (though it could at least be 3 bytes instead of 4, since you could have at most 256 banks). Additionally, it was common to have your own calling convention that preserved the current bank number as well as the return address.
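Here is a minimal sketch of both ideas: a 3 byte wide pointer, and a call that restores the caller’s bank afterwards. The byte layout (bank first, then the address low byte first) and all of the names are assumptions for illustration, not how any particular game does it. The routine argument is a plain Python function standing in for the code at that address.

```python
def pack_far_pointer(bank, address):
    """Assumed layout: one bank byte, then the address, low byte first."""
    return bytes([bank, address & 0xFF, address >> 8])


def unpack_far_pointer(data):
    bank, low, high = data
    return bank, low | (high << 8)


class Machine:
    def __init__(self):
        self.current_bank = 1
        self.calls = []   # (bank, address) pairs, for inspection

    def far_call(self, far_pointer, routine):
        """Switch banks, run the routine, then switch back."""
        bank, address = unpack_far_pointer(far_pointer)
        saved_bank = self.current_bank   # preserved like a return address
        self.current_bank = bank
        self.calls.append((bank, address))
        try:
            routine(self)
        finally:
            self.current_bank = saved_bank
```

Because the saved bank lives alongside each call, nested far calls unwind in the right order, just like nested return addresses.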

Just to make things a bit more horrifying, the way you switched banks was incredibly funky as well. Given that everything on the GameBoy used memory mapped I/O, you might expect that there’s a random byte of address space you write to with the active RAM bank and active ROM bank. But that’s not the case.

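Instead you would write to addresses inside the ROM region itself, and what a write meant depended on which address range it landed in. On the common MBC1 controller, for example, writing a byte anywhere in 0x2000-0x3FFF picks the ROM bank, writing anywhere in 0x4000-0x5FFF picks the RAM bank (or the upper bits of the ROM bank number, depending on the mode), writing 1 or 0 anywhere in 0x6000-0x7FFF sets RAM/ROM mode, and writing to 0x0000-0x1FFF enables or disables cartridge RAM. Since all of those addresses are ROM, the write can’t change anything stored there. There’s no segfault involved, though: the memory bank controller on your cartridge simply watches for these writes and performs the switch. There was no way to read the active bank number back, so a game that needed to know it had to keep its own copy in RAM.

This also interacts with the quirk we talked about earlier where you can’t just use any combination of operands with ld. If you want to switch to bank 3, you can’t just write ld [$2000], 3. Either the address has to be in hl, or the value has to be in a. So no matter what, switching banks requires clobbering a register as well!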
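To show what “writing to ROM” ends up doing, here is a rough model of a cartridge. It assumes an MBC1-style controller where the ROM bank register sits at 0x2000-0x3FFF, is 5 bits wide, and treats a request for bank 0 as bank 1. Everything else the controller does is left out, and the class and method names are mine.

```python
BANK_SIZE = 0x4000


class Cartridge:
    def __init__(self, rom):
        self.rom = bytes(rom)   # immutable: a write can never change ROM
        self.rom_bank = 1

    def write(self, address, value):
        # Assumption: MBC1-style ROM bank register at 0x2000-0x3FFF.
        if 0x2000 <= address <= 0x3FFF:
            self.rom_bank = (value & 0x1F) or 1
        # Every other write into ROM space is ignored in this sketch.

    def read(self, address):
        if 0x0000 <= address <= 0x3FFF:
            return self.rom[address]
        if 0x4000 <= address <= 0x7FFF:
            offset = self.rom_bank * BANK_SIZE + (address - 0x4000)
            return self.rom[offset]
        raise ValueError("not a ROM address")
```

Notice there is no method for asking which bank is active. In the model you can peek at rom_bank, but the real hardware gives the program no way to do that.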

The stack was a luxury, not to be overused #

The GameBoy only had 8KiB of general purpose RAM (called wram) available to the game. You could extend this by including RAM on the cartridge (called sram), but because cartridge RAM had to deal with bank switching, it was much more difficult to use. As a result, memory in general was very scarce.

By default the GameBoy would set the stack pointer to the top of a 127 byte section of RAM that was otherwise unused, but most programs would just set it to the top of wram at boot, so any unused memory would be available for the stack. For Pokemon Blue this still only gave them a 207 byte stack, so it was used only when absolutely necessary.

This also resulted in an interesting calling convention where all registers were callee preserved, but they preserved them in a specific spot in memory rather than on the stack. This meant that functions couldn’t call other functions, since a nested call would save its registers to that same spot and overwrite the values the outer function had saved. This likely would have been a soft constraint anyway, since a stack overflow would be so easy to achieve.
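The problem with a single save spot is easy to reproduce in a toy model. This is a hypothetical sketch of the convention as I described it, not the game’s actual code. The register set, the decorator, and the routine names are all made up.

```python
class Cpu:
    def __init__(self):
        self.registers = {"b": 0, "c": 0, "d": 0, "e": 0}
        self.save_area = None   # one fixed spot in memory, not a stack


def preserving(routine):
    """Wrap a routine so it saves registers on entry and restores on exit."""
    def wrapper(cpu):
        cpu.save_area = dict(cpu.registers)   # always the same spot
        routine(cpu)
        cpu.registers = dict(cpu.save_area)
    return wrapper


@preserving
def leaf(cpu):
    cpu.registers["b"] = 99   # scratch work, undone on return


@preserving
def outer(cpu):
    cpu.registers["b"] = 42   # scratch work
    leaf(cpu)                 # overwrites the single save area
```

Calling leaf on its own leaves the caller’s registers intact. Calling outer does not: by the time it restores, the save area holds the values leaf saved, so the caller gets back b = 42 instead of whatever it had before.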

Conclusion #

This is just a random smattering of things that tripped me up while going through the disassembly of Pokemon Blue. Developing for the GameBoy must have been an extremely difficult task, and I have a lot of respect for the developers who had to keep all of this straight while developing. If you ever decide to go source diving into your favorite GameBoy games, hopefully this saves you some time.

 