6502 Assembler

If you aren't interested in this topic, just skip this page. You won't miss any information used to build an emulator.

A lot of time has passed since I wrote the 6502 disassembler page, and since then I've done several projects that have put the elements of writing an assembler together in my mind. Even though this is a compiler-type project, I am not a professional compiler engineer, and this code is not production robust. I think that writing assemblers for older processors is well within the capabilities of most programmers. If you can write an emulator, you can write an assembler.

The CSV file of 6502 data contains much of the information the assembler needs. I use it to recognize 6502 instructions (like ADC, LDA, and JSR), which I call mnemonics in the code, and I also use its addressing mode information. The combination of a mnemonic and an addressing mode lets me pick a 6502 opcode for the machine code.
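Here is a minimal sketch of the kind of lookup table that can be built from such a file. The column names (opcode, mnemonic, mode, bytes) and the inline sample data are assumptions for illustration; the real CSV file in the project may be laid out differently.

```python
import csv
import io

# Hypothetical CSV layout; the real file's columns may differ.
SAMPLE_CSV = """opcode,mnemonic,mode,bytes
0xA9,LDA,immediate,2
0xA5,LDA,zeropage,2
0xAD,LDA,absolute,3
0x20,JSR,absolute,3
0x69,ADC,immediate,2
"""

def load_opcode_table(text):
    """Map (mnemonic, addressing mode) -> (opcode, instruction length)."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["mnemonic"], row["mode"])
        table[key] = (int(row["opcode"], 16), int(row["bytes"]))
    return table

OPCODES = load_opcode_table(SAMPLE_CSV)
# The set of known mnemonics is what lets a tokenizer tell instructions from labels.
MNEMONICS = {mnemonic for mnemonic, _mode in OPCODES}

print(OPCODES[("LDA", "immediate")])   # (169, 2)
print("JSR" in MNEMONICS)              # True
```

Keying the dictionary on the (mnemonic, mode) pair means picking an opcode is a single lookup once the addressing mode is known.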

6502 Assembly Syntax

The bad news is that there are a lot of different formats for 6502 assembly syntax. I developed this assembler by looking at different files, and choosing a sensible set of things to implement. Mostly I leaned on uchess.asm to pick a syntax.

Tokenizer

The assembler I'm presenting here has two phases, a text processing phase that I am calling the tokenizer, and a code generation phase. The string processing is all done in the tokenizer so the code generation doesn't have to worry about irregular text. Python takes most of the pain out of string processing for me.

The tokenizer examines each line and builds a list of what it contains. These are the line structures that the tokenizer recognizes, with an example of each:

LABEL                           SETW

BYTE DATA                               .byte   $00, $F0, $FF, $01, $10

LABEL BYTE DATA                 MOVEX   .byte   $00, $F0, $FF

BASE                            *=$1000

INSTRUCTION                             INX

INSTRUCTION OPERAND                     EOR     $55

LABEL INSTRUCTION               OUTC    CLC

LABEL INSTRUCTION OPERAND       DISMV   LDX     #$04

DEFINE = value                  PMAXC   = $F0

The algorithm is:

  1. Strip off comments (everything after the ';')

  2. Strip leading white space

  3. Split the line into "words", each of which is a piece of text that was separated by whitespace

  4. Figure out which line structure applies

  5. Build a data structure (just a python list) that describes the line and includes some debugging information

For example:

source line

WHSET     LDA     SETW,X



is syntax

LABEL INSTRUCTION OPERAND



processes into

[93, [(3, 'WHSET'), (1, 'LDA'), (2, 'SETW,X')]]



93: the source line number (0 based)

then there is a list of tuples:

    3 is label, 'WHSET' is the value

    1 is instruction, 'LDA' is the value

    2 is operand, 'SETW,X' is the value - SETW is a label that will be defined later
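The tokenizer steps above can be sketched roughly as follows. This is not the project's actual code: the kind numbers 1, 2 and 3 come from the example above, but the numbers for DEFINE, BYTE_DATA and BASE, the function name tokenize_line, and the small MNEMONICS set are all assumptions made for illustration.

```python
# Token kinds. 1, 2 and 3 match the example above; the rest are assumed values.
INSTRUCTION, OPERAND, LABEL = 1, 2, 3
DEFINE, BYTE_DATA, BASE = 4, 5, 6   # hypothetical numbering

# A tiny stand-in for the mnemonic list that would come from the CSV file.
MNEMONICS = {"LDA", "LDX", "EOR", "INX", "CLC", "BEQ"}

def tokenize_line(line_number, text):
    """Return [line_number, [(kind, value), ...]], or None for an empty line."""
    text = text.split(";", 1)[0].strip()      # steps 1 and 2: comments, whitespace
    if not text:
        return None
    if text.startswith("*="):                 # BASE, e.g. *=$1000
        return [line_number, [(BASE, text[2:].strip())]]
    words = text.split()                      # step 3: split into words
    if len(words) >= 3 and words[1] == "=":   # DEFINE, e.g. PMAXC = $F0
        return [line_number, [(DEFINE, words[0]), (OPERAND, words[2])]]
    tokens = []
    if words[0].upper() not in MNEMONICS and words[0] != ".byte":
        tokens.append((LABEL, words.pop(0)))  # any other leading word is a label
    if words and words[0] == ".byte":
        tokens.append((BYTE_DATA, "".join(words[1:])))
    elif words:
        tokens.append((INSTRUCTION, words[0].upper()))
        if len(words) > 1:
            tokens.append((OPERAND, "".join(words[1:])))
    return [line_number, tokens]

print(tokenize_line(93, "WHSET     LDA     SETW,X   ; comment"))
# [93, [(3, 'WHSET'), (1, 'LDA'), (2, 'SETW,X')]]
```

The sketch ignores corner cases such as a ';' inside a character literal or a define written without spaces around the '='.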

Support for other 6502 dialects could be added to the tokenizer without modifying any of the downstream code. As long as the tokenizer transforms the text into this format, all the downstream passes will handle it. A main benefit of keeping the tokenizer separate is that it insulates the rest of the code from any particular dialect of 6502 assembly language.

Code Generator

The tokenizer produces the information described above for every line. The code generator takes several passes through this data and builds another list that contains machine code representing each line of source code. I'll describe the passes below.

Pass 1 - Find All Defines and Labels

All 6502 assembly programs (regardless of particular syntax) have a way to use symbols to stand in for numbers in the source code, much like C's #define directive. The example I gave above is PMAXC = $F0. The tokenizer gives define lines the identifier "DEFINE". Labels are always the first element of a line. I build two Python dictionaries, one of defines and one of labels.

For each line, if the line starts with DEFINE, I add the symbol to defines{} with the define's value as the dictionary value. If the line starts with LABEL, I add the label to the labels{} dictionary, with the index of the line containing the label as the value. We'll use this information later to figure out where labels are in the source code.
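A minimal sketch of this pass, assuming the token format shown earlier. The DEFINE kind number (4) and the helper names parse_number and find_defines_and_labels are hypothetical, not taken from the project.

```python
INSTRUCTION, OPERAND, LABEL, DEFINE = 1, 2, 3, 4   # DEFINE = 4 is an assumed value

def parse_number(text):
    """Parse '$F0' (hex) or '240' (decimal)."""
    return int(text[1:], 16) if text.startswith("$") else int(text)

def find_defines_and_labels(token_lines):
    defines, labels = {}, {}
    for index, (_source_line, tokens) in enumerate(token_lines):
        kind, name = tokens[0]
        if kind == DEFINE:
            defines[name] = parse_number(tokens[1][1])
        elif kind == LABEL:
            labels[name] = index      # an index into the token list, not an address yet
    return defines, labels

token_lines = [
    [10, [(DEFINE, "PMAXC"), (OPERAND, "$F0")]],
    [11, [(LABEL, "OUTC"), (INSTRUCTION, "CLC")]],
    [12, [(INSTRUCTION, "INX")]],
    [13, [(LABEL, "DISMV"), (INSTRUCTION, "LDX"), (OPERAND, "#$04")]],
]
defines, labels = find_defines_and_labels(token_lines)
print(defines)   # {'PMAXC': 240}
print(labels)    # {'OUTC': 1, 'DISMV': 3}
```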

Pass 2 - Assign PC and Pick Opcodes

Pass 2 does several things as it iterates through the tokens the second time. This pass builds a new list that all the subsequent passes operate on. The format of an entry in the new list is

( address, index, [INSTRUCTION | RESOLVE], bytes, optional label)



For example, this instruction

LDA    #$00

got tokenized to this

[71, [(1, 'LDA'), (2, '#$00')]]



And that got transformed to

(4096, 0, 1, [169, 0])

slightly reformatted for clarity:

($1000, 0, INSTRUCTION, [$A9, $00])

$A9 is the opcode for LDA in immediate mode

I put comments in the code explaining how each token in the token list is handled. Every token in the token list is going to be placed in memory. I start the PC at 0, but most 6502 assembly programs place themselves somewhere else in memory with a line like ORG $1000 or, in the case of uchess, *=$1000. I pick an opcode for each instruction, which tells me how many bytes it uses. That in turn tells me the address of the next instruction.

Most mnemonics have multiple opcodes, one for each addressing mode of the 6502 that the instruction supports. Usually the way to decide which one to use is by looking at the syntax of the operand. Sometimes the syntax isn't definitive, so an addressing mode is chosen by looking at the mnemonic. The opcode picking looks at the legal addressing modes for the operand, and sometimes uses the length of the address given in the operand. (It might choose zero page for two-digit hex addresses like "$F6", and choose absolute for four-digit addresses like "$1EA6".)
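As a rough illustration of that decision, here is a sketch that guesses an addressing mode from the mnemonic and the operand text. The mode names and the function addressing_mode are made up for this example, the branch list covers only the original 6502 branches, and accumulator syntax such as "ASL A" is not handled. It treats two hex digits as zero page and four as absolute.

```python
import re

# Branch mnemonics of the original 6502; these always use relative addressing.
BRANCHES = {"BCC", "BCS", "BEQ", "BMI", "BNE", "BPL", "BVC", "BVS"}

def addressing_mode(mnemonic, operand=None):
    """Guess an addressing mode name (names are illustrative only)."""
    if operand is None:
        return "implied"
    if operand.startswith("#"):
        return "immediate"
    if mnemonic in BRANCHES:
        return "relative"             # the syntax can't tell us; the mnemonic does
    if operand.startswith("("):
        if operand.upper().endswith(",X)"):
            return "(indirect,X)"
        if operand.upper().endswith("),Y"):
            return "(indirect),Y"
        return "indirect"
    match = re.fullmatch(r"\$([0-9A-Fa-f]+)(,[XxYy])?", operand)
    if match:
        # Two hex digits fit in zero page; anything longer needs a 16-bit address.
        base = "zeropage" if len(match.group(1)) <= 2 else "absolute"
        return base + (match.group(2) or "").upper()
    # Labels: assume a 16-bit address, keeping any ,X or ,Y index.
    index = operand[-2:].upper()
    return "absolute" + (index if index in (",X", ",Y") else "")

print(addressing_mode("LDA", "$F6"))      # zeropage
print(addressing_mode("LDA", "$1EA6"))    # absolute
print(addressing_mode("BEQ", "POUT4"))    # relative
```

A real assembler would then check the guessed mode against the modes that are legal for the mnemonic before looking up the opcode.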

When I see a label in the token stream, I note the line of the new list where the label occurs. I can't do anything else with it yet. When an instruction references a label, I don't know where the label is, because I haven't given every line an address. In that case I mark the instruction as needing more work in a future pass by tagging it RESOLVE and appending the name of the label. Here's an example of that:

BEQ    POUT4

Transformed by Pass 2 to

(5005, 460, 8, [240, 0], 'POUT4')

or reformatted for clarity

($138D, 460, RESOLVE, [$F0, $00], 'POUT4')

$F0 is the opcode for BEQ
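To make the shape of this pass concrete, here is a heavily simplified sketch that assigns addresses and emits entries in the format above. It is not the project's code: the BASE kind number (6), the three-entry opcode table, and the assumption that every non-immediate operand is a branch target are all simplifications for illustration.

```python
INSTRUCTION, OPERAND, LABEL, BASE, RESOLVE = 1, 2, 3, 6, 8   # BASE = 6 is assumed

# Tiny hand-written opcode table: (mnemonic, mode) -> (opcode, length in bytes).
OPCODES = {
    ("LDA", "immediate"): (0xA9, 2),
    ("BEQ", "relative"): (0xF0, 2),
    ("RTS", "implied"): (0x60, 1),
}

def pass2(token_lines):
    pc, entries = 0, []
    for index, (_source_line, tokens) in enumerate(token_lines):
        if tokens[0][0] == BASE:                     # *=$1000 moves the PC
            pc = int(tokens[0][1].lstrip("$"), 16)
            continue
        values = {kind: value for kind, value in tokens}
        if INSTRUCTION not in values:
            continue                                 # e.g. a label on a line by itself
        mnemonic, operand = values[INSTRUCTION], values.get(OPERAND)
        if operand is None:
            opcode, size = OPCODES[(mnemonic, "implied")]
            entries.append((pc, index, INSTRUCTION, [opcode]))
        elif operand.startswith("#$"):
            opcode, size = OPCODES[(mnemonic, "immediate")]
            entries.append((pc, index, INSTRUCTION, [opcode, int(operand[2:], 16)]))
        else:                                        # a label reference: fix up later
            opcode, size = OPCODES[(mnemonic, "relative")]
            entries.append((pc, index, RESOLVE, [opcode, 0], operand))
        pc += size                                   # address of the next instruction
    return entries

token_lines = [
    [0, [(BASE, "$1000")]],
    [1, [(INSTRUCTION, "LDA"), (OPERAND, "#$00")]],
    [2, [(INSTRUCTION, "BEQ"), (OPERAND, "POUT4")]],
    [3, [(LABEL, "POUT4"), (INSTRUCTION, "RTS")]],
]
for entry in pass2(token_lines):
    print(entry)
# (4096, 1, 1, [169, 0])
# (4098, 2, 8, [240, 0], 'POUT4')
# (4100, 3, 1, [96])
```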

Pass 3 - Transform Labels

Now that every line has a PC, I know the address of each label. The labels dictionary currently holds the index of the line that each label refers to. Since those lines now have addresses, I go through the labels and replace each line index (in place) with the actual address.
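A minimal sketch of this transformation, assuming each entry carries its address in position 0 and its line index in position 1, as in the entry format shown earlier. The function name labels_to_addresses is hypothetical.

```python
def labels_to_addresses(labels, entries):
    """Rewrite labels in place: line index -> address of that line."""
    address_of_index = {entry[1]: entry[0] for entry in entries}
    for name, index in labels.items():
        labels[name] = address_of_index[index]   # same keys, so safe while iterating

entries = [
    (0x1000, 1, 1, [0xA9, 0x00]),
    (0x1002, 2, 8, [0xF0, 0x00], "POUT4"),
    (0x1004, 3, 1, [0x60]),
]
labels = {"POUT4": 3}
labels_to_addresses(labels, entries)
print(hex(labels["POUT4"]))   # 0x1004
```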

Pass 4 - Resolve Label References

Loop through the instructions and look for the ones marked RESOLVE. For each of those, use the recorded label name to look up the label address. If the instruction is 2 bytes (a relative branch), compute the signed offset from the address following the branch to the label and store it in the second byte. If it is 3 bytes, it is an absolute reference, so put in the address of the label, low byte first.
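Sketched in code, assuming the entry format from Pass 2 and a labels dictionary that already holds addresses. The function name resolve_labels and the range check are additions for illustration.

```python
INSTRUCTION, RESOLVE = 1, 8

def resolve_labels(entries, labels):
    resolved = []
    for entry in entries:
        if entry[2] != RESOLVE:
            resolved.append(entry)
            continue
        address, index, _kind, code, name = entry
        target = labels[name]
        if len(code) == 2:
            # Relative branch: the offset is measured from the next instruction.
            offset = target - (address + 2)
            if not -128 <= offset <= 127:
                raise ValueError(f"branch to {name} out of range ({offset})")
            code = [code[0], offset & 0xFF]       # two's complement byte
        else:
            # Absolute reference: 16-bit address, low byte first.
            code = [code[0], target & 0xFF, target >> 8]
        resolved.append((address, index, INSTRUCTION, code))
    return resolved

labels = {"START": 0x1000, "POUT4": 0x1005}
entries = [
    (0x1000, 0, RESOLVE, [0xF0, 0], "POUT4"),      # BEQ POUT4 (forward branch)
    (0x1002, 1, RESOLVE, [0x4C, 0, 0], "START"),   # JMP START
    (0x1005, 2, RESOLVE, [0xD0, 0], "START"),      # BNE START (backward branch)
]
for entry in resolve_labels(entries, labels):
    print(entry)
# (4096, 0, 1, [240, 3])
# (4098, 1, 1, [76, 0, 16])
# (4101, 2, 1, [208, 249])
```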

That's It

Now I can write the code to disk. The only thing left is to watch for gaps in the PCs between instructions and fill them with something; I fill with zeros. The gaps occur between sections of code where the programmer changed the base address to something new.
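One way to do the gap filling is sketched below, assuming a non-empty list of entries in the Pass 2 format. The function name build_image is hypothetical, and the overlap check is an addition for illustration.

```python
def build_image(entries):
    """Concatenate the bytes of every entry, padding address gaps with zeros."""
    entries = sorted(entries, key=lambda entry: entry[0])
    start = entries[0][0]
    image = bytearray()
    for address, _index, _kind, code, *_rest in entries:
        gap = address - (start + len(image))
        if gap < 0:
            raise ValueError(f"overlapping code at ${address:04X}")
        image.extend(bytes(gap))      # bytes(n) is n zero bytes
        image.extend(code)
    return bytes(image)

entries = [
    (0x1000, 0, 1, [0xA9, 0x00]),
    (0x1004, 1, 1, [0x60]),           # base address changed, leaving a 2-byte gap
]
print(build_image(entries).hex())     # a900000060
```

The resulting bytes object can be written out with open(path, "wb").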

If you get the code for this from github, you can download the source to uchess and put it through the assembler like this:

./6502assembler.py --ops 65C02ops.csv --source uchess.asm

prints to stdout, or

./6502assembler.py --ops 65C02ops.csv --source uchess.asm --bin uchess.bin

to save to the file uchess.bin

View the 6502 assembler (also in the github project)

uchess is actually written for the 65C02, so to assemble it you'll have to get the 65C02.csv file



Post questions or comments on Twitter @realemulator101, or if you find issues in the code, file them on the github repository.