For a peek at the 65816's added capabilities in the area of the hardware stack, consider these added instructions it has. (It also has a 16-bit stack pointer, allowing up to nearly 64K of stack space. Note also that the '816 does not require the hardware stack area and the direct page to be separate like the 6502 does, and one implication is that you can use the extra direct-page addressing modes on stack addresses.)
Let's look first at the ones that are easier to explain and synthesize.
First are TXY and TYX, which don't have any net effect on the stack, but we will use the stack to synthesize them. These two are too simple to give them their own listing. TXY can be done on the 65c02 with PHX, PLY; and TYX with PHY, PLX. If you don't need to preserve A, then of course TXA, TAY and TYA, TAX are faster.
REP and SEP are two-byte instructions for clearing and setting selected bits in the processor status register P according to a mask in the operand; but since you can't synthesize them on the '02 without ANDing or ORing, and the 6502's set/clear instructions are only two clocks each anyway, and there are no useful ones missing, there's probably no point in imitating REP and SEP directly. A related tactic, however, is useful inside an ISR to set the interrupt-disable bit (I) in the stacked status byte so the RTI at the end does not re-enable interrupts. Provided the ISR has not pushed anything else yet, so the status byte is still on top of the stack, do it like this:
PLA
ORA #4
PHA
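The three-instruction version above only works while the stacked status byte is still on top of the stack. As a sketch of the other case, assuming a 65c02 ISR that has already pushed A and then X (an assumed order, not something the text prescribes), the status byte can be reached with indexing instead:

```asm
        ; Assumes the ISR began with PHA, PHX (65c02), so the stack
        ; holds, from the top:  X, A, P, PCL, PCH.
        TSX             ; X = stack pointer.
        LDA  $103,X     ; $101,X = saved X, $102,X = saved A, $103,X = P.
        ORA  #4         ; Set the I bit in the stacked status byte.
        STA  $103,X
        PLX             ; Normal exit: restore X and A,
        PLA
        RTI             ; and RTI pulls P with the I bit still set.
```

If your ISR saves a different set of registers, the $103 offset has to change to match the number of bytes pushed.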
The '816's PEA, PEI, and PER stand for "Push Effective Absolute, Indirect, and Relative 16-bit address," respectively. ("Immediate" would have been better than "Absolute" for PEA, but the "I" was already taken for "Indirect.") They always push 16 bits, regardless of whether the accumulator is set to 8 or 16 bits' width, since they are normally used for addresses.
PEA is a three-byte instruction that pushes a two-byte literal (its
operand), which is typically an address but it can also be data, onto the stack, without affecting the processor registers. One use
of it is to pass data to a subroutine. For a 6502 to do it would require a routine like this (shown as a macro):
PEA: MACRO data
STA temp_A ; (Can't push this onto the hardware stack.)
LDA #>data ; data high byte
PHA
LDA #<data ; data low byte
PHA
LDA temp_A
ENDM
;------------------
(Saving and restoring the accumulator here is of course optional, and to really imitate the '816's PEA, you
would also have to save and restore the status; but note that you can't use PHP, PHA, PLA,
and PLP for it, since pushing the data in between them would mean the PLA
and PLP would pull the wrong bytes off the stack.)
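To illustrate passing data this way, here is a minimal sketch using the PEA macro above. The subroutine name use_word and the variables lo_copy and hi_copy are hypothetical, and the indexing assumes nothing else is pushed between the macro and the JSR:

```asm
        PEA  $1234      ; Push $12, then $34 (macro above).
        JSR  use_word
        PLA             ; Afterward the caller discards the two
        PLA             ; data bytes (this overwrites A).
        <continue>

use_word:
        TSX             ; $101,X and $102,X hold the return address,
        LDA  $103,X     ; so the low data byte ($34) is at $103,X
        STA  lo_copy
        LDA  $104,X     ; and the high data byte ($12) is at $104,X.
        STA  hi_copy
        RTS
```

Reading the data in place like this leaves the return address undisturbed, at the cost of the caller having to clean up the stack afterward.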
PEI is a two-byte instruction that takes the 16-bit data at the
direct-page address pointed to by the operand and pushes it onto the stack, without affecting the processor registers. (Direct page
on the '816 is like ZP on the 6502, except this 256-byte segment can be anywhere in the first 64K of memory space. It is not locked
to page 0, nor does it have to start on a page boundary.) The instruction is written like
PEI (ZP_addr); but it does not read and push the contents of the address pointed to by ZP_addr; instead, it reads and
pushes the contents of ZP_addr itself. The Fischer book, on pp. 216-218, says it's indirect, and the L&E manual also writes
it PEI (DP); but the L&E text as well as my own experiments say that it's not really
indirect. The equivalent '816 instructions would be (in 16-bit accumulator width):
LDA ZP_addr ; Read 16-bit data,
PHA ; and push it.
except that again PEI does it without affecting the status or accumulator; so in order to synthesize that,
you would have to temporarily store them in variables as above in the discussion on PEA. For a 6502
to synthesize PEI, you would need a routine like this (again shown as a macro):
PEI: MACRO ZP_addr
STA temp_A
LDA ZP_addr+1
PHA
LDA ZP_addr
PHA
LDA temp_A
ENDM
;------------------
Again, the storing and restoring of A and P are optional, and you might even want to do it with conditional assembly, using another macro
parameter to determine whether or not to put them in.
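One way that conditional assembly might look is sketched below. The IF and ENDIF directive names and the second parameter save_A are assumptions; conditional-assembly syntax varies between assemblers, so adapt it to yours.

```asm
PEI:  MACRO ZP_addr, save_A     ; save_A is a hypothetical parameter:
        IF save_A               ; non-zero means "preserve A."
           STA  temp_A
        ENDIF
        LDA  ZP_addr+1          ; High byte first,
        PHA
        LDA  ZP_addr            ; then low byte.
        PHA
        IF save_A
           LDA  temp_A
        ENDIF
      ENDM
```

With this arrangement, PEI ptr, 0 assembles only the four load-and-push instructions, while PEI ptr, 1 adds the save and restore of A.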
The '816 references addresses in direct page when you do a PEI; but since you're synthesizing an instruction here with a macro, you could make it address data anywhere, not limited to ZP.
PER is probably the least rewarding of the three when trying to
synthesize it on a 6502, because PER's main usefulness is in things that the 6502 is so poorly suited to, primarily
relocatable code. It is a three-byte instruction that adds the next instruction's runtime address to the signed offset given by
the operand, and pushes the result onto the stack. This makes it useful for writing code that can be
loaded at different addresses each time (depending on what the next available memory segment is at load time), or moved even after it
is loaded (for example, to be able to delete a program you are no longer using and move other programs in to close up the gap and leave
all available memory together at the end of the 64K bank). Its operation bears similarities to that of the '816's BRL
(Branch Relative Long) instruction. For a 6502 to synthesize PER would require a routine something
like the following, which is again written as a macro. The resulting stacked number is relative to the address of the first instruction
following PER, in this case meaning the instruction following the LDA temp_A below:
PER: MACRO label
STA temp_A
LDA #>{label - * - 8} ; Load and
PHA ; push high byte first,
LDA #<{label - * - 5} ; then low byte.
PHA
JSR PER_6502
LDA temp_A
ENDM
;------------------
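The 8 and 5 above are the byte distances from each LDA # op code to the last byte of the JSR, which is the
return address that PER_6502 adds in. They may need to be adjusted depending on how your assembler handles "*" (or
"$"), whether it gives the address of the LDA # op code or that of the
operand. They do not depend on whether temp_A is in ZP, since only the bytes between each "*" and the end of the JSR count. Subroutine
PER_6502 referred to above is: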
PER_6502: ; NOTE: PER_6502 must be at a consistent, known
STX temp_X ; address, unlike the routines that call it.
TSX
CLC
LDA $101,X
ADC $103,X
STA $103,X
LDA $102,X
ADC $104,X
STA $104,X
LDX temp_X
RTS
;------------------
The STX temp_X and LDX temp_X above could be replaced
with PHX and PLX (on the 65c02) to save a couple of bytes, but you'd lose one more clock (which
may be ok) on a routine that's already slow. That's if you have temp_X in ZP. If it's not
in ZP, PHX and PLX will save four bytes and also one clock. Regardless, don't forget to bump all the 100's numbers up by 1 if you use
PHX and PLX.
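For reference, a 65c02 version using PHX and PLX might look like this sketch, with every stack offset bumped up by 1 to step over the saved X:

```asm
PER_6502:               ; 65c02 only (uses PHX and PLX).
        PHX             ; The saved X is now at $101,X after the TSX.
        TSX
        CLC
        LDA  $102,X     ; Return address low byte
        ADC  $104,X     ; plus the stacked offset's low byte.
        STA  $104,X
        LDA  $103,X     ; Return address high byte
        ADC  $105,X     ; plus offset high byte (and carry).
        STA  $105,X
        PLX
        RTS
```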
That whole thing above, 18 instructions, 38 bytes, and 64 clocks (if temp_A and temp_X are in ZP, otherwise 68 clocks), is done in a single three-byte, six-clock PER instruction on the 65816, making the '816 more than ten times as efficient in both speed and memory for this operation.
PER can be used for referencing data in a relocatable data structure, to get an indirect address.
Another thing it can be used for is a simulated JSR-relative, or
BSR (ie, Branch to SubRoutine, or Branch, Saving Return address, "branch" meaning relative, and in this case with
a 16-bit offset). Here's the idea, using the PER macro in 6502 assembly language:
PER RETURN-1 ; Put the return addr on the stack,
PER <subroutine-1> ; then the subroutine's addr, then
RTS ; use RTS to branch to the subroutine.
RETURN: <continue> ; The subroutine's RTS will come back here.
or, in a higher-level macro (nesting the PER macro above) that does the same thing:
BSR <subroutine>
Again, it's pretty inefficient on the 6502, but it can be done. This is for when the whole program needs to be
relocatable. If you don't need that, then just use JSR because you'll know at assembly time what the subroutine's
address is.
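A sketch of how such a BSR macro could be written, nesting the PER macro above, follows. It assumes your assembler treats a label defined inside a macro (here the hypothetical BSR_ret) as local to each expansion, and that it allows one macro to invoke another:

```asm
BSR:  MACRO subroutine
        PER  BSR_ret-1      ; Push the return address minus 1,
        PER  subroutine-1   ; then the subroutine's address minus 1,
        RTS                 ; and let RTS "return" into the subroutine.
BSR_ret:                    ; The subroutine's own RTS comes back here.
      ENDM
```

The minus 1 in both places is there because RTS adds 1 to the address it pulls before resuming execution.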
The '816 itself doesn't have a branch-to-subroutine instruction, but it can be synthesized this way (which might as well be put
in a macro):
PER $+5
BRL <subroutine_addr>
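Put in a macro, it might look like the sketch below. A label is used in place of $+5 so the result does not depend on how your assembler defines "$"; the label name BSR_ret is hypothetical and is assumed to be local to each macro expansion.

```asm
BSR:  MACRO subroutine_addr
        PER  BSR_ret-1          ; Push the address of BRL's last byte,
        BRL  subroutine_addr    ; then take the long relative branch.
BSR_ret:                        ; The subroutine's RTS returns here.
      ENDM
```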
Yes, the '816 has a BRL (Branch Relative Long, ie, with a
16-bit offset) which is not really related to the stack except that we can use the stack to synthesize it on the 6502, much like
the BSR above:
PER <target-1>
RTS ; Again RTS is used for a jump, not an actual return.
A where-am-I routine for the '816 would only be PER * (or, depending on your assembler, PER $),
PLA. Really simple!
Side note: There's a short article about using BRL and PER in relocatable 65816 code in the 6502.org wiki.
A nifty thing sometimes done with PER on the '816 is in conjunction with stack-relative
indirect indexed addressing, which for example lets you have the address of a table on the stack at some arbitrary depth, and
index into that table. An example 65816 instruction might be LDA(3,S),Y, which gets the table address
from the 3rd and 4th bytes on the stack, adds Y to the address, and uses the result to know where to load the accumulator from. This is
a two-byte, seven-clock (eight if you have the accumulator set to 16-bit) instruction on the '816. Further improving the possibilities
is that Y can also be 16-bit. Synthesizing an instruction like LDA(3,S),Y on the 6502 might be done
something like:
TSX
LDA $103,X ; Table address low byte (3,S)
STA temp ; (temp is in ZP)
LDA $104,X ; Table address high byte (4,S)
STA temp+1
LDA (temp),Y
which takes 21-22 clocks for reading a single byte. (The '816 can read a byte pair in 8 clocks with A set to 16-bit.) If
you need to save X, bracket the above with PHX and PLX, and increment the base numbers
by 1. (If you have an NMOS 6502, you'll have to use A to save and restore X, further complicating matters!) Note that stack-relative
indirect indexed addressing has double indexing and double indirection!
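With X preserved on a 65c02, the sequence might look like the following sketch. Here the PLX is placed ahead of the final load rather than after it, an arrangement chosen so that the N and Z flags still reflect the byte loaded into A:

```asm
        PHX             ; Save X (65c02).  Everything is now 1 byte deeper,
        TSX
        LDA  $104,X     ; so 3,S and 4,S become $104,X and $105,X.
        STA  temp       ; (temp is in ZP)
        LDA  $105,X
        STA  temp+1
        PLX             ; Restore X before the load so that PLX does not
        LDA  (temp),Y   ; overwrite the flags from the loaded byte.
```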
Someone on the forum lamented that the '02 does not have a JSR (addr,X) like the '816 does. The 65c02
does have a JMP (addr,X), however; so what you can do is push the return address (minus 1, as RTS expects) onto the stack and then
use that JMP addressing mode. Here's the idea, presented as a macro:
JSR_Indx_IndrX: MACRO SubroutineAddr ; JSR indexed indirect, X, ie, JSR (addr,X)
LDA #>{RetAddr - 1} ; Get the high byte of the return address minus 1
PHA ; and push it onto the stack, followed by
LDA #<{RetAddr - 1} ; the low byte.
PHA
JMP (SubroutineAddr,X)
RetAddr:
ENDM
;------------------
Obviously it uses A and will affect the status register as well (unlike the '816); so if you need those saved, you'll need to take extra
measures. The easiest would be to use a second macro parameter to tell it whether to assemble using A, X, or Y so you have the option if
for example you don't want A overwritten but Y is not being used at the moment. (The 65c02 has PHX
and PHY, unlike the NMOS 6502, so you can use X or Y without going through A to push the contents.)
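As a usage sketch, a command dispatcher could call through a table of addresses like this. The names cmd_num, cmd_table, and do_cmd0 through do_cmd2 are hypothetical, and the DW (define word) directive name is an assumption that depends on your assembler:

```asm
        LDA  cmd_num        ; Command number, 0 to 2.
        ASL  A              ; Two bytes per table entry.
        TAX
        JSR_Indx_IndrX  cmd_table   ; Call the selected routine.
        <continue>          ; Each routine's RTS comes back here.

cmd_table:
        DW   do_cmd0, do_cmd1, do_cmd2
```

Since the macro overwrites A, anything the called routine needs from the caller would have to be passed in Y or in memory in this arrangement.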
The 65816 has other stack-related instructions and addressing modes which by their very nature are not applicable to the 6502, including: