Section 5 (on stack addressing) started with a few tricks, and said there would be more later. Here are some that use RTS, RTI, and JSR in possibly unexpected ways.
It is important to remember what these instructions actually do. Thinking only of their normal usage may keep us from seeing other possibilities. JSR and RTS are jumps with special effects on the hardware stack. They don't have to be used together: JSR doesn't have to go to a subroutine at all, and RTS doesn't have to return to the place a routine was called from. (That's why the order in the title is reversed, to break the tight connection to subroutines.) RTI works like RTS, except that it first pulls the status register off the stack, and it jumps to the stacked address itself rather than to that address plus 1. It doesn't have to be used as a return-from-interrupt.
JSR is typically thought of as "Jump to SubRoutine," but it actually stands for "Jump, Saving Return address," and would be more accurately thought of as "Jump, Saving Departure address" ("JSD"?). The departure address is not even the address of the next instruction after the JSR, but rather the address of the last byte of the three-byte JSR instruction itself.
RTS stands for "ReTurn from Subroutine" (why isn't it "RFS"?), but might be more accurately thought of as "Jump to Stacked Address plus 1" (how about "JSA"?).
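To make the "departure address" concrete, here is a small sketch. The origin directive, the addresses, and the labels MAIN, DONE and SHOW are assumptions made purely for illustration (your assembler's syntax may differ), and it assumes the stack pointer is not about to wrap around.

```asm
        *= $1000        ; Hypothetical origin, so the addresses below are concrete.
MAIN:   JSR  SHOW       ; Occupies $1000-$1002.  Pushes $10, then $02 (addr of its last byte).
DONE:   JMP  DONE       ; $1003: SHOW's RTS lands here, ie, at the stacked address plus 1.

SHOW:   TSX             ; X = stack pointer, which points one below the last byte pushed.
        LDA  $0101,X    ; A = $02, the low  byte of the stacked address.
        LDY  $0102,X    ; Y = $10, the high byte of the stacked address.
        RTS             ; Pulls $1002, adds 1, and jumps to $1003.
```

Nothing in SHOW requires that it was entered by a JSR. Any code that leaves a suitable two-byte address on top of the stack can "return" through that RTS.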
Starting with a review of the last (#4) trick in section 5, you can use the hardware stack as a pointer for an indirect jump, instead of using a pair of ZP bytes. It removes the need for any variable space:
LDA ____ ; (high byte) (or it could be that you calculate the target address here.)
PHA ; Push the high byte first,
LDA ____ ; (low byte) Remember RTS requires the 16-bit addr to be the target minus 1.
PHA ; then the low byte.
RTS ; (Not including the LDA's, this takes 3 bytes and 12 clocks.)
Before the 65c02 came along with its JMP(abs,X) instruction, a way to do an indirect jump
using a jump table with the NMOS 6502 was:
; Start with function (an even number) in X.
LDA TABLE+1,X ; Read high address byte from the actual table, and
PHA ; push it. Low byte comes next, below.
LDA TABLE,X ; Be sure to make the table reflect start addresses
PHA ; minus 1, since RTS increments the address by 1.
RTS ; RTS does the absolute indexed indirect jump.
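As a sketch of what such a table might look like, assuming an assembler with a .WORD directive that stores the low byte first, and three hypothetical routines FUNC0, FUNC1 and FUNC2:

```asm
TABLE:  .WORD  FUNC0-1  ; Entry 0 (X=0).  Each entry is the routine's address minus 1.
        .WORD  FUNC1-1  ; Entry 1 (X=2).
        .WORD  FUNC2-1  ; Entry 2 (X=4).

        ; Dispatch, starting with the entry number (0-2) in A rather than in X:
        ASL  A          ; Double it, since each table entry is two bytes.
        TAX
        LDA  TABLE+1,X  ; High byte first,
        PHA
        LDA  TABLE,X    ; then low byte.
        PHA
        RTS             ; Jumps to FUNC0, FUNC1, or FUNC2.
```

Letting the assembler do the subtraction in the table costs nothing at run time.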
If you're using something like a Commodore 64 where you really can't use the CMOS version, you could still make a macro of the above
routine, to get it on a single line. (You can read about the differences between the CMOS and NMOS 6502 in
In the above piece of code, the table has two consecutive bytes for each entry. This means you can't have a table of more than 128
entries, and depending on what you start with, you might have to put the entry number in the accumulator and do ASL
before putting it in X. (That's why the first comment line says you must start with an even number in X.) A way around
that is to split the table into two halves, one holding the high bytes and the other holding the low bytes. Both halves are indexed by
the same number in X or Y, and each can be up to 256 bytes long, so you can have 256 entries even though they're two bytes
each. It would go like this instead:
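LDA high_byte_table,X ; Read the high byte of the target address (minus 1)
PHA ; and push it.
LDA low_byte_table,X ; Then do the same with the low byte.
PHA
RTS ; RTS does the jump, as before.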
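The two half-tables could be built along these lines. This is only a sketch: it assumes an assembler with a .BYTE directive and the common < (low byte) and > (high byte) operators, the labels are written with underscores so they are legal in most assemblers, and FUNC0 through FUNC2 are hypothetical routines.

```asm
low_byte_table:   .BYTE  <(FUNC0-1), <(FUNC1-1), <(FUNC2-1)   ; Low  bytes of each address minus 1.
high_byte_table:  .BYTE  >(FUNC0-1), >(FUNC1-1), >(FUNC2-1)   ; High bytes, in the same order.
```

The entry number now goes into X (or Y) directly, with no doubling.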
This technique is useful for token threading which has been used in interpreted BASIC and other interpreted languages. There's more
discussion on it in this article in the 6502.org wiki.
An interesting extension to that for the 65816 is Bruce Clark's idea of using RTS as a 6-cycle NEXT in a direct-threaded code (DTC) Forth kernel, in the forum topic "65816 direct threaded NEXT." The stack pointer S effectively becomes the Forth program counter! The program then is a list of addresses of routines, but the strange thing is that the list is on the hardware stack! Since the 65816 has a 16-bit stack pointer, all the first 64K of memory is available for the program. The price for the added performance (compared to a much longer NEXT) is that interrupts are only allowed at certain places. (There's a proposed hardware solution for this in section 18, "Stacks potpourri.") There are almost no JSRs in Forth unless it's subroutine-threaded code (STC), and other uses of the return stack could be handled with another virtual stack. Bruce puts forth a similar idea for indirect-threaded code (ITC) Forth, using PLX, JMP(0,X), in the forum topic, "65816 indirect threaded NEXT." (The data stack pointer would have to be Y, since X is taken.)
For situations where you want to jump using the actual address (where subtracting 1 would have to be done at run time and the overhead would be excessive), you can use RTI instead of RTS, but push the status on the stack with PHP after the address, since RTI will be pulling it off and putting it in P before jumping to the address on the stack.
<have actual address on hardware stack>
PHP ; Push the status you want RTI to put into effect.
RTI ; Jump to actual address, without RTS's offset.
Alternatively, if you want a different status to take effect, load the desired status byte into an available register (A, X, or Y) and push that instead of using PHP.
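Put together, the whole sequence might look like the following sketch. TARGET is a hypothetical label, the < and > low-byte and high-byte operators are an assumption about your assembler, and the status value $00 is just an example.

```asm
        LDA  #>TARGET   ; High byte of the actual address (no minus 1 for RTI).
        PHA
        LDA  #<TARGET   ; Low byte.
        PHA
        LDA  #$00       ; Status byte for RTI to put in P.  (Example value: all flags clear,
        PHA             ; which also clears I and so enables IRQs.)  Pushed last, so pulled first.
        RTI             ; Pulls P, then the address, and jumps to TARGET itself.
```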
Unfortunately there's no JMP((addr),Y) (where the content of addr points to the beginning address of
a table, and you read the Y and Y+1 bytes into the table to find out where to jump to); but you can do this:
INY ; We have to start with the high byte.
LDA (ZP),Y ; Jump table address is held by a ZP location.
PHA ; Push high byte.
DEY ; Get ready to point to low byte.
LDA (ZP),Y ; Read the low byte from the table.
PHA ; Push the low byte.
(PHP) ; (Add PHP here if you want to use RTI instead of
; RTS, and put the actual addresses in the table.)
RTS/RTI ; Jump to the address gotten from the table.
In section 7, we looked at how a subroutine can find inlined data following a call to that subroutine, by the fact that the JSR puts its own ending address on the hardware stack. The subroutine adjusted the return address on the stack so that when the processor gets back to the main program, it would skip over the data and not try to execute it as if it were instructions and crash.
Now we'll add another twist. You can do the same thing without the JSR having a
matching RTS. The called code piece will probably still end in RTS, but it
will have removed a return address from the stack after having used it to find the data, so the RTS will instead
take the program pointer back to the routines that called label1, label2, etc. (shown below). In the following example, we
have various places in the code that start with optional individualized actions, and they have individualized data, but then have a common
way to handle that data:
label1: <do "A" stuff>
<do "A" stuff>
JSR Foobar
<data for "A">
(and elsewhere in the program,)
label2: <do "B" stuff>
<do "B" stuff>
JSR Foobar
<data for "B">
Foobar: <Do the process common to A, B, and relatives, using the data following the
JSRs. After using the address stacked by a JSR to find data, discard
that address. (You will not RTS to the JSRs that called Foobar, but
instead RTS back to the routines that called label1, label2, etc.)>
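A minimal sketch of how Foobar might start and end follows. PTR is a hypothetical two-byte ZP variable, and the sketch assumes label1 and label2 were themselves reached by JSR, so that a return address for their callers sits just below the one Foobar discards.

```asm
Foobar: PLA             ; Pull the low byte of the address stacked by the JSR,
        STA  PTR        ; and keep it in a ZP pointer.
        PLA             ; Same for the high byte.  The JSR's return address is now
        STA  PTR+1      ; gone from the stack.
        LDY  #1         ; The stacked address is the JSR's last byte, so the data
        LDA  (PTR),Y    ; starts at offset 1.  This gets the first data byte.
        <use the data>
        RTS             ; Returns to whoever called label1 or label2, not to the JSR.
```

Since the return address is discarded rather than adjusted, the data after each JSR Foobar can be any length without Foobar having to step over it.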
From WDC's excellent
programming manual which I can never pass up an opportunity to
recommend, "Programming the 65816, Including the 6502, 65C02 and 65802," chapter 12, and page 190:
Now go back to the indirect jump but in an entirely different setting, in our RPN operations. Consider a looping program structure like DO...LOOP (introduced near the bottom of section 8) or FOR...NEXT, with a 16-bit index and limit, and we want a macro or subroutine LEAVE which, as you might expect, leaves (aborts) the loop early if some condition is met, without finishing the count, and causes execution to resume at the first instruction following the end of the loop. Just branching out is not enough, because we also have to delete the loop index and limit (4 bytes) from the hardware (return) stack.
We will use RTS, not to return from subroutine, but to jump to the first instruction after the end of the loop. In the DO...LOOP below, the DO macro assembles JSR do followed by the loop end address (minus 1, for RTS) which will get filled in by the LOOP macro since DO does not yet know where the end of the loop will be. do takes the loop's ending address (minus 1) immediately following the JSR do instruction as inlined data, and puts it on the hardware stack, followed by the limit and the index which it takes from the ZP data stack and also puts on the hardware stack. loop, assembled by the LOOP macro, branches back up to the top of the loop if incrementing the index doesn't cause it to cross the line between the limit, and limit minus 1. If the line is crossed so loop drops through, it removes the four bytes of index and limit from the hardware (return) stack first (as does the leave routine before branching to the end using RTS).
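For illustration only, here is one way a routine like do could be written to match that description. It is a sketch, not necessarily the actual routine. It assumes a ZP data stack indexed by X with the top cell's low byte at 0,X and high byte at 1,X, the index on top and the limit under it, and a hypothetical ZP scratch pair N. The label do1 is also made up.

```asm
do:     PLA             ; Pull the address of the last byte of the JSR do instruction
        STA  N          ; into the ZP scratch pair N (low byte first).
        PLA
        STA  N+1
        LDY  #2         ; The inlined loop-end address (minus 1) is at offsets 1 and 2.
        LDA  (N),Y      ; Push its high byte,
        PHA
        DEY
        LDA  (N),Y      ; then its low byte.
        PHA
        LDA  3,X        ; Push the limit (second cell on the data stack), high byte first,
        PHA
        LDA  2,X
        PHA
        LDA  1,X        ; then the index (top cell), high byte first.
        PHA
        LDA  0,X
        PHA
        INX             ; Drop both cells (4 bytes) from the data stack.
        INX
        INX
        INX
        CLC             ; Continue at the first instruction of the loop body, which is
        LDA  N          ; 3 bytes past the address the JSR stacked.
        ADC  #3
        STA  N
        BCC  do1
        INC  N+1
do1:    JMP  (N)
```

With this push order, the four bytes on top of the hardware stack are the index and limit, and directly under them is the ending address minus 1, ready for an RTS.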
Here's an example usage. (To remember the order of the limit and index for DO, think of the tag on a
gift, "To Susan, from Bob." The "to" comes first.)
LITERAL ARRAY+2000 ; Head toward a limit (ie, destination value) of ARRAY+2000 minus one,
LITERAL ARRAY ; starting from ARRAY. (These could arrive on the data stack other ways too of course.)
DO ; Put yada's addr-1 on hardware stack, then the loop limit, then the starting index value.
JSR I ; Subroutine "I" copies the current loop index (counter) value to the data stack, for use
<do_stuff> ; in our routine.
IF_EQ ; If the zero flag is set,
<do_stuff> ; do this and that, and
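LEAVE ; then remove the loop index and limit from the hardware stack, and RTS to "yada" below.
END_IF
LOOP ; Increment the index, and branch back up to the top of the loop unless the limit has been reached.
yada: <continue> ; The macros don't need the label but I put it here to show where LEAVE will jump to.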
where the LEAVE macro just lays down JMP leave (so you don't have to remember
whether to use JMP, or to instead use JSR like you would for
?LEAVE), and leave is defined as:
leave: PLA ; Remove the loop index (counter) from the
PLA ; hardware stack. (Low byte, then high byte.)
PLA ; Remove the loop limit from the hardware
PLA ; stack. (Low byte, then high byte.)
RTS ; DO had stacked the ending address-1; so RTS
;------------- ; removes and increments it, and jumps to "yada".
It uses JMP leave rather than JSR leave, because
we're not coming back there, and JSR would require removing and discarding two additional bytes from the
hardware stack (ie, return stack).
A side note about related loop subroutines and macros:
Without the macros, we would have to remember to use JMP on one and JSR on the other, which would probably be very bug-prone. A way around it if we were stuck with ancient assemblers without macro capability would be to put an extra pair of PLAs in LEAVE and call them both with JSR.
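That macro-less variant, meant to be called with JSR leave, could be sketched like this:

```asm
leave:  PLA             ; Discard the return address stacked by JSR leave
        PLA             ; (two bytes), since we are not going back there.
        PLA             ; Remove the loop index (counter) from the
        PLA             ; hardware stack.
        PLA             ; Remove the loop limit from the
        PLA             ; hardware stack.
        RTS             ; Jump to the loop's stacked ending address-1, plus 1.
```

The cost is two extra bytes and a few extra cycles per early exit, in exchange for not having to remember which routine takes JMP and which takes JSR.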
Section 13, on synthesizing the 65816's new stack-related instructions on the 6502, carries this theme of using RTS and JSR for unconventional uses further.
last updated Mar 8, 2019