Let's admit right up front: The 6502 is poorly suited for relocatable code. It's not totally incapable of it though, nor does every application need to wring maximum performance out of the processor. We sometimes accept compromises in performance to get another desired benefit. (I will also admit that this particular section is only musings on my part which hopefully will give you ideas, as I don't have practical experience in this area. Email me with ideas to improve this section, or discuss it on the 6502.org forum. I'll put the direct link here if a discussion gets started.)
6502 programs usually have the luxury of knowing ahead of time the address of every jump destination, variable, etc. It's a simple system and usually has no memory management unit (MMU); so if you want multiple pre-assembled programs loaded in RAM at once, in unforeseen order, the code must be relocatable, and references to addresses in the same program (not the kernel's fixed addresses) must be relative, not absolute. This is especially true if you want to be able to move the program even after it is loaded, for example to delete a program and move the others so the free memory space is all at one end, available to load a bigger program. (If a program is already active, you probably would only move it when it's at a good pausing point so it doesn't have actual addresses on the stack that would suddenly become invalid as a result of the move.)
Side note: For an impressive 6502 OS that allows program relocation at the time of loading (but not after), see André Fachat's GeckOS scalable preëmptive multitasking/multithreading OS which has Unix-like features, dynamic memory management, relocatable file format, a standard library, internet support, virtual consoles, and remote login, and runs on a Commodore 64 and other 6502 platforms! Undefined address references are solved at load time.
Code that is completely relocatable must refer to routines and data by addresses that are relative to the current program counter, rather than absolute. Outside of the relative-branch instructions, the 6502 is inefficient at this. The 65816 has instructions specifically designed for the job, and if there's an instruction lacking, it's easier to synthesize it on the '816. The native 65816 instructions can be synthesized on the '02 (although inefficiently), and we will look at that in the next section, section 13.
Making the code relocatable on the 6502 may be realistic for some applications, due to relative subroutine calls and long jumps being only occasional. However, I have found that making most of the data relocatable as well is harder, due to the constant references. So whadya do?
One possibility that might work if there's little enough data is to "glue" it to a particular address range and coordinate that across combinations of programs that may be active simultaneously so they don't interfere with each other's data. Each program will probably need a little ZP space in addition to more non-ZP space. This method would yield the best performance, because references to data addresses are not relative; for example, CPY FOOBAR can still be something like CC 3E 17, using CPY abs.
Another possibility is to have one or more arrays of variables, and store the array address(es) in the two-byte ZP slot(s) assigned to that program, for the program to use with (ZP),Y addressing. The address of these pointers in ZP would have to be coordinated with other programs that might be active at the same time, similar to the agreed-upon variable space in the paragraph above (but only for the pointers, not all the variable space). You'll have to use double indexing sometimes, and you might wish for an extra index register available, or be able to add directly to index registers. (Self-modifying code may be helpful.) If the variable space is lumped with the program's space such that they get moved together as a unit, then the array address will have to be re-calculated after loading and after every program move. If the OS assigns the array space and keeps it separate from the program space, it may not need to get moved every time the program does.
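To make the pointer idea concrete, here is a minimal sketch of what access through the ZP slot might look like. The names are all assumptions for illustration: VarBase is a hypothetical two-byte ZP slot that the OS (or the program's start-up code) has loaded with the address of the variable array, and COUNT and FLAGS are made-up offsets into it. The EQU syntax will vary with your assembler.
COUNT: EQU 0 ; Hypothetical offsets of two variables
FLAGS: EQU 1 ; within this program's variable array.
LDY #COUNT ; Increment COUNT. Nothing here depends on
LDA (VarBase),Y ; where the program or the array sits.
CLC
ADC #1
STA (VarBase),Y
LDY #FLAGS ; Set the high bit of FLAGS.
LDA (VarBase),Y
ORA #$80
STA (VarBase),Y
;------------------
Only VarBase needs updating if the array moves. The cost is that Y is tied up for every access, which is where the wish for another index register comes from.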
Do you know of better possibilities? Email me and let me know, at wilsonminesBdslextremeBcom (replacing the B's with @ and .).
If you're still not dissuaded from trying relocatable code on a 6502, the where-am-I subroutines should become valuable. Regardless of where the program got loaded or moved to, you can always get the program counter's contents using a where-am-I subroutine.
The simplest is as follows. Remember that these particular subroutines do have to be at a fixed location in the kernel where
all relocatable programs can find them.
WhereAmI: TSX ; Must start at a fixed location.
LDA $101,X ; Get low byte in A,
LDY $102,X ; high byte in Y.
RTS
;------------------
which will give the address of the last byte of the JSR instruction, returning the low byte of the address in A
and high byte in Y. X gets overwritten.
Note: It might seem appropriate sometimes to just JSR to an RTS instruction, and look at what's beyond the stack pointer (at lower addresses) after the execution of the RTS instruction. The problem with that idea is that interrupts may overwrite the address you want to examine. Even if it is acceptable to SEI...CLI for the operation, there's still the NMI that could overwrite the address.
If it's ok to just leave the address on the stack for later use, you could just do this, without using an RTS:
JSR <next_instruction>
(keeping in mind that JSR pushes the address of the last byte of its own operand onto the stack). This one
is not relocatable, but may have uses anyway.
We might want to go further. You'll normally want to add an offset to it to get the target address. You can do that in the same routine you use to find out where you are.
There's a good page on where-am-I routines in the 6502.org
wiki. Those will not be repeated here, but instead we will look at some modifications. The first one is that we will assume
that your reset routine included the normal LDX #$FF, TXS so we can get rid of the INX
instructions and use $101,X and $102,X instead of $100,X
and not have to worry about possibly accidentally indexing into page 2. We will also make the offset to be 16-bit; and since
the 64K address space wraps, you can then get to any desired address, whether behind or ahead:
; Start with offset lo byte in A, hi in Y.
WhereAmI1: CLC ; Start here if C's state is unknown.
WhereAmI2: TSX ; Start here if C is known and accounted for in offset.
ADC $101,X ; Add lo byte of return address,
PHA ; and save it so we can use A for adding the
TYA ; high bytes. High byte of offset came in on Y.
ADC $102,X
TAY ; Put the high byte away in Y again,
PLA ; and get the low byte back in A.
RTS ; The return address remains unaffected.
;------------------
(Note: Although we did a PHA, the $102,X is still the next byte
after $101,X because the number in X from the TSX did not change when S did.)
(Note: Further down we'll look at a way to shorten these subroutines.)
But if you want to preserve X (like because you're using it for a ZP data stack pointer), do it this way:
; Start with offset lo byte in A, hi in Y.
WhereAmI1: CLC ; Start here if C state is unknown.
WhereAmI2: PHX ; Save X. This requires different base numbers below. Start
TSX ; at WhereAmI2 if C is known and accounted for in the offset.
ADC $102,X ; Add lo byte of return address,
PHA ; and save it so we can use A for adding the
TYA ; high bytes. High byte of offset came in on Y.
ADC $103,X
TAY ; Put the high byte away in Y again,
PLA ; and get the low byte back in A.
PLX ; Restore X. Note: N and Z bits now reflect X, not A or Y.
RTS ; The return address remains unaffected.
;------------------
Again, these subroutines do have to start at known addresses in your kernel. All parts of all relocatable routines that need
them will JSR to them at their respective addresses.
Doing it with inputs and outputs on the ZP data stack does not make it much longer or slower:
WhereAmI3: PHX ; ( offset -- addr )
TSX
TXA
TAY ; Y will be used to index into the return stack,
PLX ; and X, restored, to index into the data stack.
CLC
LDA 0,X ; Get the low byte of the offset, and
ADC $102,Y ; add it to the low byte of the return address.
STA 0,X ; Store it on the data stack.
LDA 1,X ; Now do the same with the high byte.
ADC $103,Y
STA 1,X ; Depth remains the same for both stacks,
; so no adjustment is needed.
RTS ; The return address remains unaffected.
;------------------
WhereAmI3 above leaves the resulting address on the data stack, ready for fetching, jumping to, executing (as a
subroutine), calculating and adding another 16-bit index value to, etc.
For a next step, you can make a macro, used something like:
Rel_to_Abs FOOBAR
For the WhereAmI1's above, the macro definition might look something like:
Rel_to_Abs: MACRO target
LDA #<{target - $ - 6} ; low byte
LDY #>{target - $ - 4} ; high byte
JSR WhereAmI1
ENDM
;------------------
that is, if "$" in your assembler gives the address of the first byte of the instruction formed by that line.
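The two constants differ (6 and 4) because "$" has advanced by two bytes between the LDA line and the LDY line, while both lines must describe the same offset. A worked example, using made-up addresses, may make this clearer:
; Suppose the macro is assembled at $2000 and FOOBAR is at $2345.
; LDA # is at $2000: $2345 - $2000 - 6 = $033F, low byte $3F.
; LDY # is at $2002: $2345 - $2002 - 4 = $033F, high byte $03.
; JSR is at $2004: it pushes $2006, the address of its last byte.
; WhereAmI1 adds: $2006 + $033F = $2345, which is FOOBAR.
; Move the whole program up by $3000 bytes and the JSR pushes $5006:
; $5006 + $033F = $5345, which is where FOOBAR is now.
;------------------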
For WhereAmI3 above (for using the ZP data stack), the only reason to have a macro at all is that
you want a literal offset, not one you already calculated and left on the ZP data stack. You would have something like:
Rel_to_Abs: MACRO offset
LITERAL offset ; (Straightline this if your assembler won't nest macros.)
JSR WhereAmI3
ENDM
;------------------
(Note the use of nested macros. If your assembler does not allow them, just expand LITERAL out in
the Rel_to_Abs macro definition.)
Technically, the WhereAmI routine could be written to find the offset as inlined data following the JSR; but since it will be needed often, it might be much too slow.
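For comparison, here is a rough sketch of what such an inline-data version might look like. It is untested and only one of several ways to do it. WhereAmI_inl and the local label are made-up names, and temp is assumed to be a pair of ZP bytes set aside as a scratch pointer. The caller puts a two-byte offset (low byte first) right after the JSR, the offset being the target address minus the address of the JSR's last byte.
WhereAmI_inl: TSX
LDA $101,X ; Copy the return address (the JSR's last byte)
STA temp ; to temp so we can read the bytes after it.
LDA $102,X
STA temp+1
LDY #1
CLC
LDA (temp),Y ; The offset low byte follows the JSR directly.
ADC temp
PHA ; Save the low byte of the result.
INY
LDA (temp),Y ; Offset high byte.
ADC temp+1
TAY ; High byte of the result goes in Y.
CLC
LDA $101,X ; Step the return address past the two inline
ADC #2 ; bytes so RTS resumes at the next instruction.
STA $101,X
BCC wai1
INC $102,X
wai1: PLA ; Low byte of the result back in A.
RTS
;------------------
That is over twenty instructions against nine for WhereAmI1 above, in exchange for saving two bytes at each call, which shows where the speed goes.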
So what can you use them for? The most common needs would be relative jumps and subroutine calls and memory accesses, and variations
on WhereAmI1 & 2 above can be made to improve the efficiency. A jump becomes:
JMP_rel: CLC ; Start with offset lo byte in A, hi in Y.
TSX ; Start with the low byte.
ADC $101,X
STA $101,X
TYA ; Then do high byte.
ADC $102,X
STA $102,X
RTS ; RTS is used for the jump, not a return.
;------------------
A relative subroutine call will have to use a pair of ZP bytes as a virtual register. Call them temp for
now:
JSR_rel: CLC ; Start with offset lo byte in A, hi in Y.
TSX ; Start with the low byte.
ADC $101,X
STA temp
TYA ; Then do high byte.
ADC $102,X
STA temp+1
JMP (temp) ; Return addr is still on the stack for RTS later.
;------------------
A little further down, we'll get to a trick that makes these slightly more efficient.
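If you're using an NMOS 6502, JMP (temp) works as written, since the indirect JMP is part of the original instruction set; just make sure temp does not have its low byte at $xxFF, where the NMOS part fetches the high byte from the wrong address. What the NMOS 6502 lacks is the (ZP) addressing mode without an index, so where the routines further down use LDA (temp) or STA (temp), you'll have to zero the Y register and use LDA (temp),Y or STA (temp),Y instead. PHX and PLX are also CMOS-only. Again, note that the wrap-around nature of the memory map means you can jump forward or back, to any part of the memory.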
You will probably want to limit the relative loads and stores to larger tables, and find ways to use fixed addresses for smaller,
more-frequently accessed data. In a table, the index value can be calculated before the relative load or store routine is called,
and the calculation will often be done by the assembler so it does not further hamper the run speed. A relative load might be done
this way:
LDA_rel: CLC ; Start with offset lo byte in A, hi in Y.
TSX
ADC $101,X ; Start with the low byte.
STA temp
TYA ; Then do high byte.
ADC $102,X
STA temp+1
LDA (temp)
RTS ; Return addr is still on the stack for RTS.
;------------------
If you want to have the option to do LDA(temp,X) or other such options, you might leave
the LDA out of the subroutine and do it in the main routine after the subroutine is run. (Actually, you'll
probably want to do that anyway, so you have all the other options like ORA, AND, EOR, etc. to use
with (temp).)
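A sketch of that split might look like the following, where ADR_rel is a made-up name for a routine that only leaves the absolute address in temp, and TABLE is a hypothetical label in the relocatable program. The caller's last two lines use Y=0 with (temp),Y so they work on both NMOS and CMOS parts.
ADR_rel: CLC ; Start with offset lo byte in A, hi in Y.
TSX
ADC $101,X ; Start with the low byte.
STA temp
TYA ; Then do high byte.
ADC $102,X
STA temp+1
RTS ; temp now points to the target.
;------------------
; In the relocatable program:
LDA #<{TABLE - $ - 6}
LDY #>{TABLE - $ - 4}
JSR ADR_rel
LDY #0
LDA #%00001111 ; Any of ORA, AND, EOR, ADC, CMP, etc. can be
AND (temp),Y ; used here, not just LDA.
;------------------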
STA relative won't work the same way as LDA relative if we try to keep things
consistent with the offset low byte being in A. The data could go in X, then have an STX relative, like this:
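STX_rel: CLC ; Start with offset lo byte in A, hi in Y, data in X.
PHX ; Save the data. Return addr is now at $102,X & $103,X.
TSX
ADC $102,X ; Start with the low byte.
STA temp
TYA ; Then do high byte.
ADC $103,X
STA temp+1
PLA ; Get the data back, in A this time,
STA (temp) ; and store it.
RTS ; Return addr is still on the stack for RTS.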
;------------------
Probably all of these will usually be called from inside macros, to make the source code a lot more readable without adding any further
overhead in the executable program.
Again, the subroutines for relative access are not efficient; but they may meet a particular need. Not every application requires maximum execution speed. Even for those where speed is paramount, the requirement might only be in the interrupt performance and/or a very small part of the non-interrupt code, and optimizing these may be plenty.
However, there's something you can do to gain a small improvement in performance. For the relocatable code, you can require that all program-moving be carried out in increments of 256 bytes. That way, the low byte of the destination address can usually be finalized by the assembler, and won't need to be calculated at run time. The minimal price you pay to get this added performance is some memory waste with gaps between programs, averaging about 128 bytes. Having for example five relocatable programs in memory at once then (with four gaps) would, on the average, result in somewhere around two pages wasted, or less than 1% of the 6502's memory map space.
The first WhereAmI1 & 2 above then becomes:
; Start with offset high byte in A.
WhereAmI1: CLC ; Start here if C's state is unknown.
WhereAmI2: TSX ; Start here if C is known and accounted for in offset.
ADC $102,X ; Add high byte of return address.
RTS ; The return address remains unaffected.
;------------------
which is much shorter. The second one, preserving X (like because you're using it for a ZP data stack pointer), becomes:
; Start with offset high byte in A.
WhereAmI1: CLC ; Start here if C state is unknown.
WhereAmI2: PHX ; Save X. This requires different base numbers below. Start
TSX ; at WhereAmI2 if C is known and accounted for in the offset.
ADC $103,X ; Add high byte of return address.
PLX ; Restore X. Note: N and Z bits now reflect X, not A.
RTS ; The return address remains unaffected.
;------------------
Again, much shorter. WhereAmI3, using the ZP data stack, still uses 16-bit cells on the stack, but ignores
the low byte, and becomes:
WhereAmI3: PHX ; ( offset[only_hi_byte_handled_here] -- addr )
TSX
TXA
TAY ; Y will be used to index into the return stack,
PLX ; and X, restored, to index into the data stack.
CLC
LDA 1,X ; Handle high byte only.
ADC $103,Y
STA 1,X ; Depth remains the same for both stacks,
; so no adjustment is needed.
RTS ; The return address remains unaffected.
;------------------
and don't forget to modify the macro that calls it. The low byte is still significant; but limiting code movement to multiples of 256
bytes means the low byte can be pre-calculated by the assembler as is normal in non-relocatable code instead of being left to do at runtime.
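As a sketch of what the modified macro might look like for the page-aligned WhereAmI1, consider the following. The name Rel_to_Abs_pg is made up, and the operand syntax is only a guess modeled on the braces used in the macro further up, so adapt it to your assembler. If the target is behind the call, the difference is negative, and your assembler may want it masked with $FF.
Rel_to_Abs_pg: MACRO target
LDA #{>target} - {>{$ + 4}} ; High-byte difference between the target
JSR WhereAmI1 ; and the last byte of this JSR.
LDY #<target ; The low byte needs no run-time math.
ENDM
;------------------
Note that the value handed to WhereAmI1 is the difference of the two high bytes as assembled, not the high byte of a 16-bit difference; that way no carry from the low bytes is involved. Note also that this sketch ends with the high byte in A and the low byte in Y, the reverse of the 16-bit versions.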
Here are the others. JMP_rel only gets one instruction removed.
JMP_rel: TSX ; Start w/ target lo byte in A, offset hi byte in Y.
STA $101,X ; Do low byte just as it is. No need for addition.
; Remember to compensate for RTS going to addr-1.
TYA ; Then do high byte.
CLC
ADC $102,X ; (This one does have the addition, because it's
STA $102,X ; an offset.)
RTS ; RTS is used for the jump, not for a return.
;------------------
JSR_rel: TSX ; Start w/ target lo byte in A, offset hi byte in Y.
STA temp ; Store pre-calc'ed low byte w/o adding anything.
TYA ; Then do high byte.
CLC
ADC $102,X ; (This one does have the addition, because it's
STA temp+1 ; an offset.)
JMP (temp) ; Return addr is still on the stack for RTS later.
;------------------
LDA_rel: TSX ; Start w/ target lo byte in A, offset hi byte in Y.
STA temp
TYA
CLC
ADC $102,X
STA temp+1
LDA (temp)
RTS ; Return addr is still on the stack for RTS.
;------------------
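STX_rel: STA temp ; Store pre-calculated low addr byte. Data is in X.
PHX ; Save the data, then do high address byte.
TYA
TSX
CLC
ADC $103,X ; (The PHX put the return addr one byte further up.)
STA temp+1
PLA ; Get the data back, in A this time,
STA (temp) ; and store it.
RTS ; Return addr is still on the stack for RTS.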
;------------------
The choice of addressing modes here is rather slim. You could write routines for more addressing modes; but using the ZP data stack opens
up all the addressing modes you could possibly want, using the last version of WhereAmI3 above. (Obviously
the subroutines used for the data stack operations would need to be at fixed, known addresses, and then they can be called from multiple
relocatable programs.)
As is often the case, code becomes much more readable by using macros, with no penalty in the executable. For example, you might have:
BSR STORE_SPLIT
where BSR (Branch to SubRoutine, with 16-bit offset) takes the target label and works out the offset the same way Rel_to_Abs does. For the 16-bit version of JSR_rel, it might be defined as:
BSR: MACRO target ; Branch to SubRoutine, with target as a macro parameter.
LDA #<{target - $ - 6} ; Put the low byte of the offset in A,
LDY #>{target - $ - 4} ; and the high byte in Y,
JSR JSR_rel ; then call the subroutine that does it.
ENDM
;------------------
and assembles the three lines, LDA, LDY, and JSR.
Now that your imagination is stirred (or maybe you already knew something I hadn't thought of), you'll probably have some ideas to take this further. Let me know, so I can improve this for everyone. I'll give you credit. Email me at wilsonminesBdslextremeBcom (replacing the B's with @ and .). Let me know also if you find an error, or if something is confusing or misleading.
The next section takes this where-am-I subject further, dealing with synthesizing some of the 65816's new stack-related
instructions on the 6502. It's obviously better to use the '816 for this if you have that hardware option, but I want
to point out the possibilities for the lesser 6502 also.
last updated Sep 3, 2020