self-mod'ing code on 65xx

definition and purpose
caveats
ideas, the last point involving hardware
outside links

I was looking around to see what others have done with self-modifying code (SMC), toying with the idea of writing a 6502/816-oriented article on it. The more I look, the more I think there is significant capability there that most of us have been neglecting. Not much of this article is of my own invention. Please help me improve it, by emailing me more ideas, links, short pieces of SMC, etc., at wilsonminesBdslextremeBcom (replacing the B's with @ and .). I'll give you credit of course.

Before you get started in SMC: A recent comment from a newbie prompts me to recommend that you first get cozy with the 65xx processors' existing instructions' offerings in indirection, indexing, and pre- and post-indexed indirects. If you're new to the 6502 family, it's likely that there's already a built-in way to do what you want.

Definition and purpose of SMC:

Self-modifying code (SMC) is where the program writes bytes to its own instructions' op codes or operands. This article is, so far, about SMC done in 65xx assembly language, not high-level languages (HLLs). Possible reasons for using SMC include, but are not limited to, the following:

to make up for instructions or addressing modes the processor doesn't have; for example, double indexing, double indirect, or doing A=A+Y without using a variable
to change instructions based on a condition, and make it execute more quickly than branching to or around various parts of code (or intentionally not execute at all, as in changing a JSR to a BIT!)
to speed up a routine by eliminating a level of indirection; and related to that,
to eliminate the need for variables to take a separate variable space, since the variable can be in an instruction's operand
The 65816's MVP and MVN memory-moving instructions only have one addressing mode, and the operands tell which banks to operate in. In a multitasking system, a program and its data space may occupy different banks every time it is loaded, and SMC may be the only way to specify varying banks.
Related, there's dynamic code generation.
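
As a minimal sketch of the MVN case above, the bank numbers can be written into the instruction just before it runs. The names src_bank, dst_bank, src_addr, dst_addr, and count_m1 are hypothetical variables, not from any particular system. The sketch assumes native mode, that the data bank register points to the bank holding this code and those variables, and that your assembler accepts two bank-number operands for MVN (operand syntax varies among assemblers). In the assembled instruction, the destination bank is the first operand byte and the source bank is the second.

```
        PHB               ; MVN leaves the data bank register set to the
                          ; destination bank, so save it first.
        SEP  #$20         ; Put A in 8-bit mode for the one-byte stores.
        LDA  dst_bank
        STA  move+1       ; 1st operand byte of MVN is the destination bank.
        LDA  src_bank
        STA  move+2       ; 2nd operand byte of MVN is the source bank.
        REP  #$30         ; Put A, X, and Y in 16-bit mode.
        LDX  src_addr     ; X = source start address within its bank.
        LDY  dst_addr     ; Y = destination start address within its bank.
        LDA  count_m1     ; A = number of bytes to move, minus 1.
move:   MVN  0,0          ; Both operand bytes get modified above.
        PLB               ; Restore the data bank register.
```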

SMC is simple to do on the 65xx processors and various others of its era, because an instruction's op code and operand are in separate bytes, not merged, and all op codes are one byte long. These processors also have their instructions and data on the same bus, having a Von Neumann architecture, not a Harvard architecture. There is no problem storing to instructions.

If you're resorting to SMC for speed, its greatest value might be where you have a loop with one or more instructions that get modified from outside the loop before the loop is started. Then, a couple of cycles saved per iteration may result in hundreds of cycles saved by the time the loop finishes.
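
A minimal sketch of that pattern for the 6502: a multi-page copy whose two operand high bytes are set before the loop starts and bumped once per page. The names src_page, dst_page, and pages are hypothetical variables, and both buffers are assumed to start on page boundaries. Inside the loop, LDA abs,X and STA abs,X take 4 and 5 cycles, versus 5 and 6 for LDA (zp),Y and STA (zp),Y, so the saving is two cycles per byte copied.

```
        LDA  src_page     ; High byte of the source address
        STA  rd+2         ; goes into the LDA's operand.
        LDA  dst_page     ; High byte of the destination address
        STA  wr+2         ; goes into the STA's operand.
        LDX  #0
rd:     LDA  $FF00,X      ; Operand high byte gets modified.
wr:     STA  $FF00,X      ; Operand high byte gets modified.
        INX
        BNE  rd           ; Loop until X wraps, ie, one page is done.
        INC  rd+2         ; Advance both operands to the next page.
        INC  wr+2
        DEC  pages        ; Number of pages left to copy (starts at 1 or more).
        BNE  rd
```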

Caveats:

The code must be in RAM, not ROM.
It is usually harder to read and maintain. In particular, it requires an extra measure of clear commenting, beyond the usual!
An extra level may be added to debugging. Bruce Clark wrote, "One thing I've done when writing self-modifying code is only make one change at a time from non-self-modifying to self-modifying, then test that change. [...] Using those principles, self-modifying code doesn't seem (to me) to be significantly more difficult to debug than other code."
If the code is to be re-entrant, the modded instruction(s) will need to be saved at the beginning and restored upon exit. In this case, you'll have to evaluate whether there's a net benefit or not.

Ideas:

Writing to an operand makes the instruction indirect. The 65xx instruction set has for example LDA addr but not LDA (addr) (where addr is a 16-bit address). LDA $1234 assembles AD 34 12. AD is the op code, and 34 12 is the operand. For SMC, you could write a different value to the operand itself, making the operand a variable which the instruction uses to find out what address to read.
Note that this means that on the 65816, if you have the accumulator in 16-bit mode, you can have the accumulator specify the address, like doing an LDA (A)! If the desired address were in 16-bit X or Y, the same example instruction would become LDA (X) or LDA (Y). The parentheses emphasize the indirection, ie, that the address to read was in one of the registers, not stuck permanently in the operand; but another way to write it would be LDA ,A, LDA ,X, or LDA ,Y; or, more to the 65xx lingo, LDA 0,A, LDA 0,X, or LDA 0,Y. The A, X, and Y can be swapped around, and other instructions could be used as well, not just loads. In this particular instance, the '02 would only be able to specify a ZP address, since A, X, and Y are limited to 8 bits, not enough to specify a 16-bit address.
Carrying this a step further, you can add post-indexing to absolute addresses, turning for example LDA addr,X or LDA addr,Y into LDA (addr),X or LDA (addr),Y. You can also get LDX (addr),Y and LDY (addr),X. (You will not be able to do pre-indexed indirects though, for example LDA (addr,X).)
One of the many shortfalls of the NMOS 6502 was that it had no indirect addressing mode at all (even in ZP) without using an index register; so if you wanted a straight, non-indexed indirect, you had to put 0 in Y for (ZP),Y or in X for (ZP,X), which is a problem if you still needed that register for something else. The only exception was JMP (addr). The 65c02 (CMOS) added the non-indexed zero-page indirect (ZP) addressing mode, but not an absolute indirect (addr) except for the already existing JMP (addr). Even the 65816 did not do that with 16-bit addresses, let alone 24-bit. Writing to the operands of any of the direct instructions makes them indirect. Writing to the indirect JMP (addr) instruction makes it doubly indirect. There's an example of that further down. Writing the address to the instruction's operand bytes saves variable space, speeds up execution, and frees up the index register.
It may seem trickier where indexing gets added to the indirection. Just remember that the indexing will always happen after the indirection done by SMC, as shown above.
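To make that concrete, here is a minimal sketch of a synthesized LDA (addr),X for the 6502. The names ptr and load are hypothetical, with ptr being a two-byte variable anywhere in memory that holds the target address, low byte first. The $FFFF placeholder is there only to make sure the assembler picks the three-byte absolute form rather than a zero-page one.

```
        LDA  ptr          ; Copy the pointer's low byte
        STA  load+1       ; into the low byte of the operand,
        LDA  ptr+1        ; and the pointer's high byte
        STA  load+2       ; into the high byte of the operand.
load:   LDA  $FFFF,X      ; Operand gets modified.  The indexing by X is
                          ; applied after the SMC indirection.
```

Dropping the ,X gives the plain LDA (addr). If the pointer only changes occasionally, the four setup instructions can be moved out of the inner loop, or the program can keep its pointer in the operand itself and do away with ptr.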
White Flame wrote,
Self-modification can speed up dispatch points, and I don't think that's a problem. [Note however that the CMOS 65c02 has JMP (addr,X) which the NMOS 6502 lacks. The linked article is for C64 which never had the CMOS version. —GW]
I also have some round-robin code in one of my projects. Each routine gets a pass per frame or per interrupt, depending on whatever's active. The first instruction of the routine is BIT nextRoutine when enabled, which gets self-modded to JMP nextRoutine when disabled. I do similar in interrupting AcheronVM. Rerouting flow of control is really low overhead this way, as no checks are involved in the standard case. Changing a single instruction opcode costs exactly the same as setting a variable/flag to a particular value.
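A minimal sketch of the kind of switch described in that quote, with hypothetical label names: the first op code of a routine is flipped between BIT absolute ($2C) and JMP absolute ($4C). Both are three-byte instructions, so the operand is shared and only the one op-code byte ever changes. Keep in mind that BIT does read the operand address and alters the N, V, and Z flags.

```
disable: LDA  #$4C         ; $4C is the op code for JMP absolute.
         STA  task
         RTS

enable:  LDA  #$2C         ; $2C is the op code for BIT absolute.
         STA  task
         RTS

task:    BIT  next_task    ; Op code gets modified.  As a BIT, this falls
                           ; through; as a JMP, it skips this routine.
         <do_task_work>
         JMP  next_task
```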
Add X to A by storing X to the operand of a subsequent ADC #__ instruction:
```
        STX  label+1
label:  ADC  #0          ; (Operand gets modified by preceding instruction.)
```
(or do other functions like ANDing or ORing A & Y, etc.).
Update a branch destination with a 1-byte vector (instead of a table of two-byte addresses): (This one is again from 6502.org forum member White Flame.)
```
        LDA  branch_table,x
        STA  branch+1
branch: BNE  *
```
Save and restore a register one cycle faster than pushing and pulling, without using the stack:
```
        STY  label+1     ; This is in place of PHY.
        <do_stuff>
label:  LDY  #0          ; Operand gets modified.  PLY replacement.
```
This can be extra beneficial in the case of the NMOS 6502 which lacks the PHX, PLX, PHY, and PLY instructions. (But as usual for SMC, it should not be used in re-entrant code without the appropriate care. If it might nest between the ST_ and LD_, the first-saved value will get overwritten since it's not using a stack.)
Saving and restoring the stack pointer can be done similarly, again without using other memory:
```
        TSX
        STX  label+1        ; Store stack pointer in LDX's operand.
        <mess_up_the_stack>
label:  LDX  #0             ; Operand gets modified.
        TXS
```
For a 65816 in native mode, replace TSX, STX, LDX, and TXS with TSC, STA, LDA, and TCS, with A in 16-bit mode, like this:
```
        TSC
        STA  label+1        ; Store stack pointer in LDA's operand.
        <mess_up_the_stack>
label:  LDA  #0             ; Operand gets modified.
        TCS
```
Here is the NEXT routine in my indirect-threaded code (ITC) 65816 Forth, which gets copied to direct page (actually page 0 in mine) RAM before ever being run. It has to be in RAM for the SMC; but I also put it in direct page for the faster storing to the SMC operands W and IP, since Forth spends a substantial percentage of its time in NEXT so it's important to keep it quick (unless you're running subroutine-threaded code (STC) Forth which has no NEXT). In my '816 Forth, I keep A in 16-bit mode and X and Y in 8-bit mode almost full-time.
Note particularly that the JMP (____) below becomes a double indirect, as the initial LDA fetches an address and stores it in JMP's operand, then the JMP (____) uses that fetched address to look up another address of where to jump to, then finally does the jump.
```
ROM_IMAGE_OF_NEXT:    ; This gets copied to page 0 RAM before running.
                    ; clocks (Remember this is '816, and LDA and STA do 2 bytes):
preIP:  LDA  1234     ; 5  Get cell pointed to by instruction pointer.  (Code &
        STA  W        ; 4  IP together eliminates a level of indirection.)  Put
                      ;    that in the word pointer (which points to a CFA).
        LDA  IP       ; 4  Contents must be kept anyway.  Then increment the
        INA           ; 2  instruction pointer so it will be ready for next
        INA           ; 2
        STA  IP       ; 4  one, either to come to next or be saved to return
                      ;    to after a secondary call.  Faster than INC_zp 2X.
preW:   JMP  (1234)   ; 5  Finally, jump to the code pointed to by the word
                      ;    pointer.  (code & W together here eliminates a JMP,
 ;------------------  ;    an advantage of having NEXT in direct-page RAM.)
```
The "1234" occurrences are just 16-bit placeholders for the operands that will get modified before use. The address of IP is one byte after label preIP, and the address of W is one byte after label preW. Both of these are two-byte variables. The loading and double-incrementing of IP above may seem like a waste; ie, why not shorten the four instructions to just a pair of INC IP lines? However, the way that's shown is faster.
Here's an LDA table,X,Y equivalent, ie, double-indexed. To get the address of the byte you want to read, it takes the table address, adds Y to it, then adds X to that, then reads the final resulting address to get the desired byte. The table should normally start on a page boundary, and the high byte of the operand is not getting written to here (although it might get written to by other code!).
```
        STY  label+1
label:  LDA  table,X
```
With 8-bit index registers, the effective address is base + X + Y. However, on the 65816, with index registers in 16-bit mode, the effective address is X+Y! :) (For either the '02 or the '816, the STY could also be replaced with STA.) If there's other X-modifying code between the two lines shown, or if you simply want to double the X value, the STY could even be replaced with STX.
In the cyclic-executive method of cooperative multitasking, use the first JSR in the main loop to call a task manager which will write to subsequent JSRs. The multitasking method is discussed in my article "Simple methods for multitasking without a multitasking OS (for systems that lack the resources to implement a multitasking OS, or where hard realtime requirements would rule one out anyway)." An actual SMC task manager is not discussed in any detail there, but the idea should be clear enough to get any semi-experienced hobbyist programmer going.
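
As a rough sketch of the mechanism only (not the task manager from that article), a JSR's operand can be rewritten to put a task to sleep or wake it up. All names here are hypothetical, and the < and > low-byte and high-byte operators are an assumption about your assembler's syntax.

```
main:    JSR  task_mgr     ; Decides which tasks should run on this pass.
slot1:   JSR  task1        ; Operand gets modified by sleep1 and wake1.
         JSR  task2
         JMP  main

sleep1:  LDA  #<idle       ; Point the JSR at a do-nothing routine.
         STA  slot1+1
         LDA  #>idle
         STA  slot1+2
         RTS

wake1:   LDA  #<task1      ; Point the JSR back at the real task.
         STA  slot1+1
         LDA  #>task1
         STA  slot1+2
         RTS

idle:    RTS
```

If the operand might be rewritten from an interrupt-service routine, note that the two stores are not atomic.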

This is not exactly self-modifying code, but it nevertheless synthesizes non-existent instructions. The NesDev wiki suggests an identity table of 256 bytes, where each byte is simply the index value, ranging from 0 to $FF, like this:


TBL: BYTE  $00, $01, $02, $03, $04, $05, $06, $07, $08, $09, $0A, $0B, $0C, $0D, $0E, $0F
     BYTE  $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $1A, $1B, $1C, $1D, $1E, $1F
                  < . . . >
     BYTE  $F0, $F1, $F2, $F3, $F4, $F5, $F6, $F7, $F8, $F9, $FA, $FB, $FC, $FD, $FE, $FF

then,

the non-existent instruction	can be made from:
`TYX`	`LDX TBL,Y`
`TXY`	`LDY TBL,X`
`AND X`	`AND TBL,X`
`AND Y`	`AND TBL,Y`
`ORA X`	`ORA TBL,X`
`ORA Y`	`ORA TBL,Y`
`EOR X`	`EOR TBL,X`
`EOR Y`	`EOR TBL,Y`
`ADC X`	`ADC TBL,X`
`ADC Y`	`ADC TBL,Y`
`SBC X`	`SBC TBL,X`
`SBC Y`	`SBC TBL,Y`
`CMP X`	`CMP TBL,X`
`CMP Y`	`CMP TBL,Y`

Related, here's a contribution from Errol ("Strobe") for NMOS 6502's which lack the BIT # instruction but have BIT <absolute>: The use of BIT with the identity table is different from the cases above in that it does not use indexing. You reference TBL+<constant>. For example, if the table is at address $C000 and you want to synthesize BIT #%00010010, the resultant code would be BIT $C012. A macro could be used for clarity. I might call it BITimm (for "BIT immediate"). Then in this example you could write BITimm %00010010, and it would assemble BIT $C012.

incrementing or decrementing an op code:
- choose whether or not to add ,Y post indexing to these ZP indirect instructions on the 65c02, ORA, AND, EOR, ADC, STA, LDA, CMP, and SBC
- choose between BIT and AND in these address modes: ZP, absolute, and abs,X
The initial state of the op code would of course need to be known, to make sure that incrementing or decrementing by 1 doesn't yield an unwanted one.
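For example, a minimal sketch of the BIT-versus-AND choice in absolute addressing, with hypothetical label names. BIT absolute is $2C and AND absolute is $2D, so a single INC or DEC of the op-code byte switches between testing bits and actually masking A.

```
use_and: INC  test         ; $2C (BIT abs) becomes $2D (AND abs).
         RTS               ; Call only when test is currently a BIT.

use_bit: DEC  test         ; $2D (AND abs) goes back to $2C (BIT abs).
         RTS               ; Call only when test is currently an AND.

test:    BIT  mask         ; Op code gets modified.
```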
On the 65c02, columns x3 and xB of the op code table are no-ops, so SMC could be used to optionally turn op codes in neighboring columns into one of those no-ops. The obvious red flag is to be careful about the number of bytes taken by those, since replacing a 3-byte instruction with a 1-byte (or even 2-byte) op code might yield results even more interesting than you wanted. Fortunately, the number of bytes in the instructions remains the same for these neighboring instructions. Op codes 02, 22, 42, 62, 82, C2, E2, 44, 54, D4, and F4 are all two-byte do-nothing instructions, and 5C, DC, and FC are all 3-byte do-nothing instructions. For a little more, see my article on the differences between the NMOS and CMOS 6502's, here.
Note that several of these "do-nothing" ops do access memory, potentially touching an I/O device and altering its status. The cases where it will matter may be rare; but keep it in mind and exercise appropriate caution. Jeff Laughton has investigated this extensively, and has a write-up here. He uses the LDD mnemonic to mean "LoaD and Discard." There are:
```
   2 op codes for LDD Absolute  (DC FC)      ← use caution
   1 op code  for LDD ZP        (44)         ← use caution if there's I/O in ZP
   3 op codes for LDD ZP,X      (54 D4 F4)   ← use caution if there's I/O in ZP
   7 op codes for LDD Immediate (02 22 42 62 82 C2 E2) ← no caution needed
```

This last part goes beyond SMC, as it involves discrete or programmable logic to fool the processor in order to make new instructions. With this hardware, a single WDM instruction can do multiple simultaneous output and input functions, and based on what was found at the specified input, artificially modify the next instruction as it gets delivered to the processor, sometimes adding an extra operand byte (by using RDY along with the special hardware logic). The subject is too involved to detail it here, and I hope to have an article later on how to do it; but for now, maybe the following will stimulate the reader's imagination.

Twiddling bit 5 of op codes is particularly interesting, and Jeff Laughton has devised circuits that do that, depending on inputs during an immediately preceding WDM instruction which he uses to do input and output at the same time. Which input and output bits to use are both specified in the WDM instruction's operand's bit fields. Depending on the state of a specified input, the circuit intercepts bit 5 of the next op code and optionally inverts it before handing it to the processor, giving a lot of options to do one thing or another without the time-consuming branches around the unwanted parts of the code. The circuit knows when an op code is being fetched (as opposed to operand or data or a dead-bus cycle) because both VDA and VPA are high (on the 65816), or SYNC is high (on the 6502).

Following are some noteworthy '816 opcode pairs which differ only in bit 5. (Many could be applied to the '02 as well.) All op codes are in hex.

CLC vs SEC (18 vs 38) Depending on the state of the selected input bit, you can either set or clear the carry flag.
ASL A vs ROL A (0A vs 2A) (also other address modes) Depending on the state of the selected input bit, you can select to shift either the C flag or a 0 in.
LSR A vs ROR A (4A vs 6A) (also other address modes)
These pairs are highly useful for inputting a serial bit stream based on the state of an input bit. If A is preloaded with $FF then a single WDM can input and shift a bit. (Use ASL A vs ROL A to left-shift a 0 or 1. Use LSR A vs ROR A to right-shift a 0 or 1.)
BIT # vs LDA # (89 vs A9)
Highly useful for a conditional overwrite of the contents of A, based on the state of an input bit.
DEX vs NOP (CA vs EA)
Moderately useful for a conditional DEX based on the state of an input bit.
INC A vs DEC A (1A vs 3A) (also other address modes)
Useful?? (Would prefer instead to have DEC vs NOP, or INC vs NOP.)
TCD vs TDC (5B vs 7B)
Moderately useful for a conditional "keep one of two 16-bit values"
TXA vs TAX (8A vs AA)
TXY vs TYX (9B vs BB)
Moderately useful for a conditional "keep one of two values"
STA vs LDA (all address modes)
Moderately useful for a conditional "keep one of two values"
STX vs LDX (most address modes)
STY vs LDY (most address modes)
Moderately useful for a conditional "keep one of two values"
DEY vs TAY (88 vs A8)
Moderately useful for a conditional overwrite of the contents of Y, based on the state of an input bit. Dec's Y if not taken.
CMP # vs SBC # (C9 vs E9) (also other address modes)
Moderately useful for a conditional subtract-with-carry to A, based on the state of an input bit.
BRL vs LDX # (82 vs A2)
Moderately useful for a conditional long branch based on the state of an input bit. If not taken, X is destroyed. (Be careful: LDX # may be 2 or 3 bytes on the '816.)
BPL vs BMI (10 vs 30)
BVC vs BVS (50 vs 70)
BCC vs BCS (90 vs B0)
BNE vs BEQ (D0 vs F0)
Moderately useful. The branch is taken if (pertinent CPU flag) XOR (tested input bit) is true.
ORA vs AND (all address modes)
EOR vs ADC (all address modes)
Too weird to be useful?

He also had this kind of option for inverting an operand byte, the second byte of the next instruction following a WDM instruction! (although the usefulness there takes more brain-twisting to grasp). Following are some examples:

BRA (also any of the conditional branches)
Highly useful. The 2nd byte is a displacement applied to PC, so flipping bit 5 alters the destination by $20. Typically, two parallel, almost-identical code sequences (loops?) would be located $20 bytes apart. WDM / BRA will input a bit and choose between code sequences accordingly. For loops, DEX / WDM / BNE (for example) at the bottom of the loops will perform the exit test and, if no exit, will input & choose between code sequences.
JSR or JMP
Moderately useful. The 2nd byte is the LS byte of a destination address. Flipping bit 5 alters this by $20. Requires attention regarding location of the subroutines.
LDA STA AND ORA EOR ADC SBC CMP INC DEC etc, using various non-indirect address modes
Moderately useful. The 2nd byte is the LS byte of an operand address. Flipping bit 5 alters this by $20. Requires attention regarding location.
LDA STA AND ORA EOR ADC SBC CMP etc, using (ind) (ind,X) (ind),Y address modes
Moderately useful. The 2nd byte is the LS byte of an indirect pointer address. Flipping bit 5 alters this by $20. Requires attention regarding location.
ORA #
AND #
Limited usefulness. The 2nd byte is immediate data. Bit 5 of A is conditionally set or cleared.

In our discussion as he was developing the circuit, we ran into a difficulty in that depending on when an interrupt hit, the next op code fetched might not be the next one executed. Fortunately he figured out a solution, so that keeping the relevant instruction pairs together in execution time won't require disabling interrupts.

This would take programming into new territory. Clearly some good macros would be in order, to take advantage. (You still need to know what's in the macros; it's just that you don't have to keep seeing the ugly internal details every time you invoke one.)

outside links:

Codebase64 search results for SMC
Self modifying code forum topic from 2008
Bruce Clark's post on using SMC to shorten Forth's AND, OR, XOR, -, and +
Software - 65816 - Memory move, on the 6502.org wiki
token threading, on the 6502.org wiki
Fast block move on 65816 in bank 0 (using PEI) forum topic
Strotmann Atari wiki SMC page (archived)
Using BIT to skip 2 bytes (forum topic) Not quite SMC, but related. Also discussed is using BIT ZP or BIT # to skip a single byte. This strange use of BIT is usually for jumping around an instruction intended to be the first one at an entry point, such that the instruction doesn't get executed unless you branch or jump to it from outside. It saves a byte compared to branching around the instruction. See also this forum post.
Wikipedia article on SMC
gardners' comments on SMC, in the context of his GS4502B attempt to create a high-performance 6502-compatible CPU

last updated Feb 21, 2022