

65816's instructions and capabilities relevant to stacks, and 65c02 code which partially synthesizes some of them

For a peek at the 65816's added capabilities in the area of the hardware stack, consider these added instructions it has.  (It also has a 16-bit stack pointer, allowing up to nearly 64K of stack space.  Note also that the '816 does not require the hardware stack area and the direct page to be separate like the 6502 does, and one implication is that you can use the extra direct-page addressing modes on stack addresses.)

Let's look first at the ones that are easier to explain and synthesize.

First are ones that don't have any net effect on the stack but we will use the stack to synthesize them.  These two are too simple to give them their own listing.  TXY can be done on the 65c02 with PHX, PLY; and TYX with PHY, PLX.  If you don't need to preserve A, then of course TXA, TAY and TYA, TAX are faster.

REP and SEP are two-byte instructions for clearing and setting selected bits in the processor status register P according to a mask in the operand; but since you can't synthesize them on the '02 without ANDing or ORing, and the 6502's set/clear instructions are only two clocks each anyway, and there are no useful ones missing, there's probably no point in imitating REP and SEP directly.  A related tactic however is useful inside an ISR to set the interrupt-disable bit (I) in the stacked status byte so the RTI at the end does not re-enable interrupts.  Do it like this:

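       PLA            ; Pull the stacked status byte (assuming it is still on top of the stack),
       ORA  #4        ; set the I bit,
       PHA            ; and put it back for the RTI to restore.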
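
If the ISR has already pushed something on top of the status byte, the same bit can be set in place with indexing.  Here is a minimal sketch, assuming that exactly one byte (A, via PHA) has been pushed since the interrupt sequence stacked the status, and that X is either free or has been saved elsewhere:

       TSX            ; $101,X is the saved A; $102,X is the stacked status.
       LDA  $102,X    ; Read the stacked status byte,
       ORA  #4        ; set the I bit,
       STA  $102,X    ; and put it back for the RTI to restore.

Add 1 to the $102 for each additional byte pushed before this point.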

The '816's PEA, PEI, and PER ("Push Effective Absolute, Indirect, and Relative 16-bit address") always push 16 bits, regardless of whether the accumulator is in 8- or 16-bit mode, since they are normally used for addresses.  ("Immediate" would have been a better name than "Absolute" for PEA, but the "I" was already taken for "Indirect.")

PEA is a three-byte instruction that pushes a two-byte literal (its operand), which is typically an address but it can also be data, onto the stack, without affecting the processor registers.  One use of it is to pass data to a subroutine.  For a 6502 to do it would require a routine like this (shown as a macro):

PEA:   MACRO  data
       STA    temp_A    ; (Can't push this onto the hardware stack.)
          LDA  #>data   ; data high byte
          PHA
          LDA  #<data   ; data  low byte
          PHA
       LDA    temp_A
       ENDM

(Saving and restoring the accumulator here is of course optional, and to really imitate the 816's PEA, you would have to save and restore the status as well; but note that you can't simply bracket the macro with PHP, PHA and PLA, PLP for it, since pushing the data in between them would mean the PLA and PLP would pull the wrong bytes off the stack.)
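
One way to preserve the status as well, in line with the note above, is to park P in a variable rather than leaving it on the stack.  The following is only a sketch: the macro name PEA_P and the variable temp_P are made up for illustration, and temp_A and temp_P are assumed not to be used by any interrupt routine.

PEA_P: MACRO  data
       STA    temp_A    ; Save A in a variable,
       PHP              ; then copy P into A by way of the stack
       PLA
       STA    temp_P    ; and save it too.
          LDA  #>data   ; Push the data, high byte first.
          PHA
          LDA  #<data
          PHA
       LDA    temp_P    ; Put the saved status on top of the data,
       PHA
       LDA    temp_A    ; restore A (which disturbs N and Z),
       PLP              ; then restore P, leaving only the data stacked.
       ENDM

The PLP comes last because the LDA temp_A just before it changes the flags.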

PEI is a two-byte instruction that takes the 16-bit data at the direct-page address pointed to by the operand and pushes it onto the stack, without affecting the processor registers.  (Direct page on the '816 is like ZP on the 6502, except this 256-byte segment can be anywhere in the first 64K of memory space.  It is not locked to page 0, nor does it have to start on a page boundary.)  The instruction is written like PEI(ZP_addr); but it does not read and push the contents of the address pointed to by ZP_addr; instead, it reads and pushes the contents of ZP_addr itself.  The Fischer book, on p.216-218, says it's indirect, and the L&E manual also writes it PEI(DP); but the L&E text as well as my own experiments say that it's not really indirect.  The equivalent '816 instructions would be (in 16-bit accumulator mode):

       LDA  ZP_addr   ; Read 16-bit data,
       PHA            ; and push it.

except that again PEI does it without affecting the status or accumulator; so in order to synthesize that, you would have to temporarily store them in variables as above in the discussion on PEA.  For a 6502 to synthesize PEI, you would need a routine like this (again shown as a macro):

PEI:   MACRO  ZP_addr
       STA    temp_A
          LDA  ZP_addr+1   ; data high byte
          PHA
          LDA  ZP_addr     ; data  low byte
          PHA
       LDA    temp_A
       ENDM

Again, the storing and restoring of A and P are optional, and you might even want to do it with conditional assembly, using another macro parameter to determine whether or not to put them in.

The '816 references addresses in direct page when you do a PEI; but since you're synthesizing an instruction here with a macro, you could make it address data anywhere, not limited to ZP.

PER is probably the least rewarding of the three when trying to synthesize it on a 6502, because PER's main usefulness is in things that the 6502 is so poorly suited to, primarily relocatable code.  It is a three-byte instruction that adds the next instruction's runtime address to the signed offset given by the operand, and pushes the result onto the stack.  This makes it useful for synthesizing instructions that make the code relocatable (able to be loaded at different addresses each time, depending on what the next available memory segment is at load time), or even movable after it is loaded (for example, to be able to delete a program you are no longer using and move other programs in to close up the gap and leave all available memory together at the end of the 64K bank).  Its operation bears similarities to that of the 816's BRL (Branch Relative Long) instruction.  For a 6502 to synthesize PER would require a routine something like the following, which is again written as a macro.  The resulting stacked number is relative to the address of the first instruction following PER, in this case meaning the instruction following the LDA temp_A below:

PER:   MACRO   label
       STA     temp_A
          LDA  #>{label - * - 8}   ; Load and
          PHA                      ; push high byte first,
          LDA  #<{label - * - 5}   ; then low byte.
          PHA
          JSR  PER_6502            ; Add the run-time address to the stacked offset.
       LDA     temp_A
       ENDM

The 5 and 8 above may need to be adjusted depending on how your assembler handles "*" (or "$"), whether it gives the address of the LDA # op code or that of the operand.  They will also have to be adjusted if temp_A is not in ZP.  Subroutine PER_6502 referred to above is:

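PER_6502:             ; NOTE: PER_6502 must be at a consistent, known
       STX  temp_X    ; address, unlike the routines that call it.
       TSX            ; Point X at the stack.
       CLC
       LDA  $101,X    ; Take the low byte of the JSR's return address,
       ADC  $103,X    ; add the low byte of the stacked offset,
       STA  $103,X    ; and replace the offset with the sum.
       LDA  $102,X    ; Do the same for the high bytes,
       ADC  $104,X    ; including the carry from the low bytes.
       STA  $104,X
       LDX  temp_X
       RTS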

The STX temp_X and LDX temp_X above could be replaced with PHX and PLX to save a couple of bytes, but you'd lose 3 more clocks (which may be ok) on a routine that's already slow.  That's if you have temp_X in ZP.  If it's not in ZP, PHX and PLX will save four bytes but you'll only lose one more clock, meaning the speed expense will be negligible.   Regardless, don't forget to bump all the 100's numbers up by 1 if you use PHX and PLX.

That whole thing above, 18 instructions, 38 bytes, and 62 clocks (if temp_A and temp_X are in ZP, otherwise 64 clocks), is done in a single three-byte, six-clock PER instruction on the 65816, making the '816 more than ten times as efficient in both speed and memory for this operation.

PER can be used for referencing data in a relocatable data structure, to get an indirect address.
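
As a rough illustration on the '816 itself, the sketch below gets the run-time address of a table into a direct-page pointer and reads an entry through it.  The names TABLE and ptr are made up for the example, and it assumes native mode and a data bank register that matches the bank the code and table are running in:

        REP  #$20        ; 16-bit accumulator.
        PER  TABLE       ; Push TABLE's run-time address,
        PLA              ; pull it into A,
        STA  ptr         ; and keep it in a two-byte direct-page variable.
        LDA  (ptr),Y     ; Read the 16-bit entry at TABLE+Y.

Since the assembler turns TABLE into an offset from the PER, nothing here needs patching when the code and its table are moved together.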

Another thing it can be used for is a simulated JSR-relative, or BSR (ie, Branch to SubRoutine, or Branch, Saving Return address, "branch" meaning relative, and in this case with a 16-bit offset).  Here's the idea:

         PER  RETURN-1         ; Put the return addr on the stack,
         PER  <subroutine-1>   ; then the subroutine's addr, then
         RTS                   ; use RTS to branch to the subroutine.
RETURN:  <continue>            ; The subroutine's RTS will come back here.

or, in a higher-level macro (nesting the PER macro above) that does the same thing:

        BSR  <subroutine>
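
A sketch of what such a BSR macro might look like follows, built from the PER macro above.  It assumes your assembler allows nested macros and gives you a way to make the RETURN label local to each expansion; how that is done varies from one assembler to another.

BSR:    MACRO  subroutine
        PER    RETURN-1       ; Put the return addr on the stack,
        PER    subroutine-1   ; then the subroutine's addr, then
        RTS                   ; use RTS to branch to the subroutine.
RETURN:                       ; (This label must be unique per expansion.)
        ENDM
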

Again, it's pretty inefficient on the 6502, but it can be done.  This is for when the whole program needs to be relocatable.  If you don't need that, then just use JSR because you'll know at assembly time what the subroutine's address is.

The '816 itself doesn't have a branch-to-subroutine instruction, but it can be synthesized this way (which might as well be put in a macro):

        PER  $+5
        BRL  <subroutine_addr>

Yes, the '816 has a BRL (Branch Relative Long, ie, with a 16-bit offset) which is not really related to the stack except that we can use the stack to synthesize it on the 6502, much like the BSR above:

        PER  <target-1>
        RTS                ; Again RTS is used for a jump, not an actual return.

A where-am-I routine for the '816 would only be PER 0, PLA.  Really simple!

Side note:  The 6502.org wiki has a short article about using BRL and PER in relocatable 65816 code.

A nifty thing sometimes done with PER on the '816 is in conjunction with stack-relative indirect indexed addressing, which for example lets you have the address of a table on the stack at some arbitrary depth, and index into that table.  An example 65816 instruction might be LDA(3,S),Y, which gets the table address from the 3rd and 4th bytes on the stack, adds Y to the address, and uses the result to know where to load the accumulator from.  This is a two-byte, seven-clock (eight if you have the accumulator set to 16-bit) instruction on the '816.  Further improving the possibilities is that Y can also be 16-bit.  Synthesizing an instruction like LDA(3,S),Y on the 6502 might be done something like:

        TSX
        LDA  $103,X    ; Get the low byte of the table address from the stack
        STA  temp      ; (temp is in ZP)
        LDA  $104,X    ; and then the high byte.
        STA  temp+1
        LDA  (temp),Y

which takes 21-22 clocks for reading a single byte.  (The '816 can read a byte pair in 8 clocks with A set to 16-bit.)  If you need to save X, bracket the above with PHX and PLX, and increment the base numbers by 1.  (If you have an NMOS 6502, you'll have to use A to save and restore X, further complicating matters!)  Note that stack-relative indirect indexed addressing has double indexing and double indirection!
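
For the 65c02, the X-preserving version just described might look like the sketch below.  The only changes are the PHX/PLX bracket and the base numbers, which go up by 1 because the saved X now sits on top of the stack:

        PHX            ; Save X; everything on the stack is now one byte deeper.
        TSX
        LDA  $104,X    ; Low byte of the table address (was $103,X),
        STA  temp      ; (temp is in ZP)
        LDA  $105,X    ; and the high byte (was $104,X).
        STA  temp+1
        LDA  (temp),Y
        PLX            ; Restore X.

Note that the PLX leaves N and Z reflecting X rather than the byte just loaded, so test A afterward if you need flags for it.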

Someone on the forum lamented that the '02 does not have a JSR (addr,X) like the '816 does.  The 65c02 does however have a JMP (addr,X); so what you can do is push the return address onto the stack and then use that JMP addressing mode.  Here's the idea, presented as a macro:

JSR_Indx_IndrX:  MACRO  SubroutineAddr  ; JSR indexed indirect, X, ie, JSR (addr,X)
        LDA  #>{RetAdr - 1}       ; Get the high byte of the return address minus 1
        PHA                       ; and push it onto the stack, followed by
        LDA  #<{RetAdr - 1}       ; the low byte.
        PHA
        JMP  (SubroutineAddr,X)
 RetAdr:                          ; The subroutine's RTS will come back here.
        ENDM

Obviously it uses A and will affect the status register as well (unlike the '816); so if you need those saved, you'll need to take extra measures.  The easiest would be to use a second macro parameter to tell it whether to assemble using A, X, or Y so you have the option if for example you don't want A overwritten but Y is not being used at the moment.  (The 65c02 has PHX and PHY, unlike the NMOS 6502, so you can use X or Y without going through A to push the contents.)

The 65816 has other stack-related instructions and addressing modes which by their very nature are not applicable to the 6502, but here we are limiting the discussion to the ones whose synthesis on the 6502 involves the stack.


last updated Apr 24, 2017