Subroutines' input and output data is passed in four ways, the last two involving stacks:
Beginning programmers might hesitate to go beyond #2 above. But why (or when) might #3 or #4 sometimes be a better choice?
#3 above does not explicitly put any data on the stack but instead just uses the stacked return address for finding the data. Since the next section will be devoted to this, the only thing I will say about it here is that it is particularly well suited for when the data following the JSR does not change (for example, a string constant), or, if it does (which it only can if it's in RAM), it at least remains a constant length. If the data is needed in more than one place, it can have a label, but using it elsewhere then will of course work differently, requiring the address to be passed explicitly. I have used it many times in macros where the JSR and the data both get laid down by the single-line macro call, for example,
DISPLAY "Press Esc to end"which may assemble,
JSR DISP_IMM DFB "Press Esc to end", 0 ; DFB tells the C32 assembler, "DeFine Byte(s)." ; Other assemblers might use .DB, .BYTE, etc..If desired, you could have one or more fields in the string which get modified after the template is displayed, for example that part of the string may be a time (e.g., hours and minutes) which may start out as "X:XX" and the correct time will get calculated later and overwrite these few characters of the display. More on this #3 method in the next section.
I tell about the problem of #2 above (RAM variables) in this forum post. If you need a variable long-term, fine—leave it as a variable. Other variables however may be needed only intermittently, and their contents will be irrelevant between uses. If you don't want them taking up RAM space full time (as in a µC with very little RAM), you might be tempted to have data spaces that get used for different things at different times, even with multiple labels for the same space in order to have a label that's meaningful for each situation. Now you can see the danger of having two routines or sets of routines each thinking they have the variable all to themselves when actually they don't, and one overwrites data still needed by the other. It makes for a debugging nightmare. Been there, done that.
ZP RAM, with its benefits, may be in short supply in some systems, as far as variables go. So why would you want to devote some of that precious space to a data stack when it's already in short supply? As it turns out, it's a pretty good investment. It's like variables that only exist when they're needed, and afterwards, the space is released for something else to use it. This was described about 40% of the way down page 4, especially the part about needing lots of variables for intermediate pending results in calculations. Keeping those straight is no longer a problem if you use a stack for them. (Want meaningful names for them? No problem. We'll get to that in section 14, on local variables and environments.)
Lastly, if a piece of code needs to be re-entrant (ie, needs to call itself, or may be interrupted by something else that calls it), each level of nesting needs its own variables that are not stepped on by other levels, and a stack of some kind will be needed. When the stack is used for these, they use memory only while they're needed, and then they disappear from existence.
Let's address the matter of passing a parameter byte on the hardware stack, #4A above. For just one byte, it won't be nearly as efficient as just using a register (usually A), but we'll do this anyway just to show the method. Consider a subroutine that turns hex nybbles into ASCII bytes, so there's one byte of input, and one of output, and we'll pass them on the hardware stack.
<do_stuff> ; Get the nybble into A. Allowable value is 0-F. PHA ; Other code can be put between the PHA and the PLA, as long JSR NYB2ASCII ; as it doesn't care if NYB2ASCII overwrites A and X. The PLA ; subroutine's input and output are protected on the stack though. <do_stuff> ; Do something with the ASCII output, like add it to a string, display it, etc..and the subroutine it calls would do this:
NYB2ASCII: TSX LDA $103,X ; 103 reaches past the return address, to the input parameter. CMP #$0A ; Anything below $0A will end up in the $30's. CLC ; CLC before the BMI so we only have to do it once. BMI n2a1 ; For 9 or less, skip the next instruction. ADC #7 ; $0A becomes $11, $0B becomes $12, etc.. C is still clear. n2a1: ADC #$30 ; Whether the 7 got added above or not, this gives the ASCII. STA $103,X ; Put it on the stack, overwriting the input value. Note RTS ; that we read and overwrote the byte just behind the ;------------------ ; return address, leaving the return address undisturbed.Again, in this case since we use A anyway, putting the single number on the stack is not efficient; but the situation changes if we need to pass four parameters back and forth for example. Indexing into the stack requires X, so we should leave that available and then pass two parameters via the stack, while the other two could go through the stack or through A and Y, and we can do it without variables.
Here's an example for calling a multiply routine that takes two unsigned 16-bit numbers on the hardware stack and replaces them with the 32-bit unsigned product. First we'll push the two input numbers, each one getting pushed high byte first, so the high byte takes the higher address, as is normal for 6502:
<do_stuff> ; Get high byte of first input, PHA ; and push it. <do_stuff> ; Get low byte of first input, PHA ; and push it. <do_stuff> ; Get high byte of second input, PHA ; and push it. <do_stuff> ; Get low byte of second input, PHA ; and push it. JSR UM_STAR ; Now you can call the subroutine below that does the multiplying. ; If you pull the product off the stack now, the byte order will be: PLA ; 2nd-highest byte <do_stuff> PLA ; high byte <do_stuff> PLA ; low byte <do_stuff> PLA ; 2nd lowest byte <do_stuff>The order of the output bytes may seem strange, but will be explained later. If you don't like it, you could change it to first pull off the low byte, then 2nd lowest, then 2nd highest, and lastly the high byte, by changing the indexed numbers in the UM_STAR subroutine below. I do recommend keeping it so the two inputs get pushed high byte first though.
Note that depending on what the next operation is, you might not even want to pull the output off the stack yet, as it might be best to just leave it there for subsequent subroutines. An example that comes to mind is if you want to multiply the two numbers and then get the square root of the product (by calling subroutine SQRT for example), to get the geometric mean of the initial two 16-bit inputs. (The geometric mean is kind of like the average, but it's logarithmic, not linear; so the geometric mean of 1 and 100 for example is 10, not 50.5.)
If you want to access that output data that's on the stack before pulling it off, perhaps even in arbritrary order, remember that X is still valid from the subroutine using it for indexing into the stack, as shown below. If you've done TSX again, adjust your indexed numbers to account for the fact that the return address is no longer there, and in the case below, that there was a PLA after the TSX such that now after the subroutine return, the outputs are at 101,X to 104,X. If you have not done TSX again, the old indexed numbers will still be valid.
The example above calls the UM_STAR subroutine below which takes Bruce Clark's improvement on my commented bug fix on the UM* multiplication in fig-Forth, and it modifies it for I/O on the hardware stack. The structures are per my structure-macro article and source code linked there. They assemble exactly the same thing you would by hand, but make the conditions and branches in the source code clearer.
UM_STAR: LDA #0 ; Unsigned, mixed-precision (16-bit by 16-bit input, 32-bit output) PHA ; multiply. Add a variable byte to the stack, initializing it as 0. TSX ; Now 101,X holds that new variable, 102,X and 103,X hold the return LSR $107,X ; address, and 104,X to 107,X holds the inputs and later the outputs. ROR $106,X FOR_Y 16, DOWN_TO, 0 ; Loop 16x. The DEY, BNE in NEXT_Y below will drop through on 0. IF_CARRY_SET CLC PHA ; Note that the PHA (and PLA below) doesn't affect the indexing. LDA $101,X ADC $104,X STA $101,X PLA ADC $105,X END_IF ROR ROR $101,X ROR $107,X ROR $106,X NEXT_Y STA $105,X PLA ; Retrieve the variable byte we added at the top, cleaning up the stack. STA $104,X ; Again note that the PLA changed S but not X, so the 104 is still 104. RTS ;------------------
(Note: In many situations, it will be advantageous to give names to the items on the stack, using EQUates, so it's more clear what 101,X is, what 102,X is, 103,X, and so on. (But don't forget to still put the ,X after them.) More on that later.
Now suppose you have a subroutine that needs four input parameter bytes, but outputs six bytes. To use the hardware stack for two more outputs than inputs, if the subroutine puts the extra bytes on the stack without regard to the return address, the return address won't be at the top where the RTS needs it. The easy way to solve this problem is to prepare additional room on the stack before calling the subroutine, like this:
PHA ; Push two dummy bytes onto the stack to hold the PHA ; positions open for outputs of the subroutine called below. <do_stuff> ; Prepare the subroutine input bytes to be passed. JSR subroutine ; Without changing the stack pointer, the subroutine ; can now give you two more bytes of output than input.This is more efficient than having the subroutine itself push more bytes onto the stack and then juggle to get the return address back to the top of the stack for the RTS so it returns to the right place.
OTOH, what if you need more inputs than outputs. In that case, pull the dummy output bytes off the stack after the subroutine return.
A simple way to restore the stack to a previous depth is:
<do_stuff> ; Set up inputs, reserve output byte places in stack, etc.. TSX ; Mark the current stack position. <do_more_stuff> JSR <subroutine> <do_stuff> ; Handle outputs, etc.. TXS ; Restore stack to the marker set earlier, possibly also ; to put certain outputs at the top.X will need to be left intact; so if the subroutine needs it, it will have to save and restore it, most likely using PHX and PLX. Another reason to preserve X may be so the TSX...TXS don't interfere with other operations.
The next section (section 7) tells of a totally implicit way to pass a parameter, where the parameter resides just past the return address that was put on the hardware stack by the JSR instruction. This puts the entire work load on the subroutine.
The hardware stack is definitely not the only way to do a stack though. #4B above mentions a data stack in page 0 (or ZP). Why have a separate stack for data, when we have the hardware stack there with its own dedicated pointer (ie, register S), and automatic incrementing and decrementing of that pointer? As mentioned before, an advantage of having a virtual stack in ZP is the extra addressing modes. Note that in the example of passing parameters via the hardware stack above, I intentionally chose one that did not deal with addresses, like for strings, or memory-moving, because there's no (abs,X) addressing mode on the 65(c)02.
Another major reason however is that the separate data stack solves a particular problem that comes with trying to pass data on the hardware stack. We will still use the hardware stack for subroutine return addresses, interrupts, and other things, but generally pass data through our new virtual data stack in ZP, indexed by X.
Consider the problem of using the page-1 hardware stack for passing parameters to a subroutine, where that routine passes parameters to another, and now there's a subroutine-return address on top putting the target data farther down the stack than expected, so you get incorrect results. Let's build on the example above, writing a subroutine GEOMEAN to take the geometric mean of two numbers. (Geometric means of more than two can be taken, but I've never needed more, and this routine will make the point anyway.)
GEOMEAN: JSR UM_STAR ; Multiply two 16-bit unsigned inputs on the stack. Get 32-bit product. JSR SQRT ; Take a 32-bit input on the stack and get a 16-bit output, leaving two RTS ; dummy bytes on stack (problem: GEOMEAN's return addr is still on top!) ;------------------ ; (See the note on shortening JSR-RTS to just JMP, two paragraphs down.)Let's say you start by putting $1234 on the hardware stack (with the high byte taking the higher address), then $5678, then do a JSR GEOMEAN. When the program pointer reaches the first instruction after the PHA and TSX shown above inside UM_STAR, UM_STAR looks for the first number to multiply ($1234 in this example) to be at $106,X and $107,X, and the second number to multiply ($5678 in this example) to be at $104,X and $105,X, with the return address at $102,X and $103,X. Now however, there are two return addresses on top of the input numbers: one for calling UM_STAR and one for calling GEOMEAN; so UM_STAR will erroneously ignore the $1234 and multiply $5678 by the return address for GEOMEAN, and come up with a worthless answer, and it won't even matter since it just overwrote GEOMEAN's return address and the program will most likely be totally crashed!
If you're on top of it, you undoubtedly caught the JSR, RTS pair above which can normally be shortened to just JMP, saving 1 byte and 9 clocks (JMP, RTS versus JSR, RTS, RTS); but now you can see there's yet another incompatibility problem with passing the parameters on the hardware stack!
The separate data stack in ZP avoids these problems, because the subroutine return addresses don't go on it. ZP addressing itself is also more efficient than absolute addressing of course, and there are more addressing modes for ZP. Parameter-passing becomes implicit. Samuel Falvo (kc5tja on the 6502.org forum) has an excellent post about this, the second post in the topic "Stack Frames?" Key benefits he mentions are:
CALL REPLINE (A(4), M(), "Limit=", B, 600, T2)and all those parameters have to be specifically dealt with in the call. Several steps must be carried out at the call and return times, beyond just the call and return. However, with a separate data stack, input parameters that have been derived earlier can be left on the stack, just waiting there, without interfering with anything, until REPLINE is called, not storing them elsewhere, and not having to re-load them to put them in a stack frame now. Subroutine outputs can also be read from the stack and dealt with whenever you get around to it, again without having to store them right away as part of the return process. The resulting simpler, more efficient instruction might then be simply,
REPLINEor in the case of 6502 assembly language,
JSR REPLINESide note 1: Some compilers for high-level languages do a lot of optimizing, and on a processor with a lot of registers, might leave parameters derived earlier in registers and pass them to the subroutine that way if it's safe, rather than on the hardware stack. If they're available though, in many cases it will be because previous values were stacked. On the 6502, we could set aside bytes in memory (especially ZP) to act as processor registers, say 32 bytes for example; but if not treated as a stack, those can fill up and experience the same problems we went to the stack to get away from.
Side note 2: Another way to pass parameters is through a parameter block, and then you only put the address of the block on the hardware stack before calling the subroutine. Again Samuel Falvo describes this for us in the forum topic, "Passing Parameters: best practices?"
Here's how we'll make the transition to a separate ZP data stack in the routines above. We will use X as the stack pointer, initialized in the beginning of the program to point one byte past the highest byte of the (still empty) ZP data stack area with $00,X. If you need X for something else during the program, you can use the hardware stack to save and restore it. To put a byte on the stack, we do a DEX, then store the byte at 0,X. (Usually we'll handle two bytes at a time and call them cells; and to put a cell on the stack, we do DEX twice, then store the high byte at 1,X and the low byte at 0,X. You don't have to go in byte pairs, but we will see later that it simplifies certain operations.)
Now modifying the routines above for using the ZP data stack, we get the following. (I'll leave the first one handling a single byte instead of a byte pair, to compare apples to apples.)
<do_stuff> ; Get the nybble into A. Allowable value is 0-F. DEX STA 0,X ; Other code can be put between the STA and the LDA, and JSR NYB2ASCII ; NYB2ASCII's input and output are protected on the stack. LDA 0,X ; X must be preserved as the data stack pointer of course. INX <do_stuff> ; Do something with the ASCII output, like add it to a string, display it, etc..and the subroutine it calls would do this:
NYB2ASCII: ; (The initial TSX is no longer necessary.) LDA 0,X ; Get the input parameter from the data stack. CMP #$0A ; Anything below $0A will end up in the $30's. CLC ; CLC before the BMI so we only have to do it once. BMI n2a1 ; For 9 or less, skip the next instruction. ADC #7 ; $0A becomes $11, $0B becomes $12, etc.. C is still clear. n2a1: ADC #$30 ; Whether the 7 got added above or not, this gives the ASCII. STA 0,X ; Put it on the data stack, overwriting the input value. Note RTS ; that the return address is not on this stack to worry about. ;------------------Modifying the routine that called UM_STAR (the unsigned, mixed-precision multiplying routine), it might now use DEX, STA 0,X instead of PHA for each byte push; but since you're dealing with functions that need too many bytes to just pass them all at once via the registers, you'll probably be deriving values of 16 or more bits in other subroutines which will already contain the DEX's and STA's. More on that later, in section 8 on RPN operations.
UM_STAR itself becomes:
UM_STAR: DEX ; Add a stack byte to use as a temporary variable. LDA #0 STA 0,X ; 0,X addresses the temporary variable. These are in ZP. LSR 4,X ROR 3,X FOR_Y 16, DOWN_TO, 0 IF_CARRY_SET ; The 1st time thru the loop, A needs 0; so don't use STZ above. CLC PHA LDA 0,X ADC 1,X STA 0,X PLA ADC 2,X END_IF ROR ROR 0,X ROR 4,X ROR 3,X NEXT_Y STA 2,X LDA 0,X STA 1,X INX ; Take back the stack byte we used as a temporary variable. RTS ;------------------This time a byte was added on the data stack with DEX for a temporary variable (and later dropped it with INX), analogous to the PHA (and later PLA) used in the earlier version further up where I/O was on the hardware stack. It was put on the data stack instead of the hardware stack this time, because X is being used for the data stack and we don't want to have to keep saving and restoring it like we would have to if the temporary variable were on the hardware stack. ZP,X addressing is more efficient than abs,X anyway. The ZP,X numbers got incremented by 1 to make room for this new temporary variable byte. You could avoid the DEX and INX and address the new byte with $FF,X instead, as it wraps around to stay in ZP, and it would save two bytes and four clocks—just make sure you won't have any subroutines or interrupts cutting in that might overwrite that byte because X seemed to indicate that the location was free.
A slightly more-efficient way to do it which we'll get into later is to have a fixed scratchpad space (with no pointer, since indexing is not needed) in ZP of about eight bytes max, and call it N. This is in addition to the ZP data stack which is still used for the subroutine's inputs and outputs. The limitation is that a subroutine must be completely finished with N when it exits, and it cannot call another subroutine that might use N while N's contents are still needed. There are situations where this works out well; and the fact that indexing is not needed for temporary bytes in N saves cycles too.
Now GEOMEAN can be:
GEOMEAN: JSR UM_STAR ; Multiply two 16-bit unsigned inputs on the stack and get a 32-bit product. JMP SQRT ; (JSR, RTS) Take a 32-bit input on the stack and get a 16-bit output. ;------------------Advantages of having the separate data stack are:
Since the (ZP,X) addressing mode allows us to use addresses on the stack directly, let's look at an example of a memory-moving subroutine. We'll call it CMOVE, for "character move." Before calling it, we will put six bytes on the ZP data stack, in three 16-bit numbers (high byte taking the higher address in all cases): first the "from" address, then the "to" address, then the length which can be up to tens of thousands of bytes. I'll leave it without the structure macros this time. Note the earlier-mentioned (from 80% of the way down the page in section 4) stack-effect comment on the first line, inside parentheses, where the items to the left of the "--" are what the subroutine expects as inputs, with the top of stack being at the right, closest to the "--". If the subroutine had output data items, they would be shown to the right of the "--", again with the top of stack being the right-most number, this time nearest the ")".
CMOVE: LDA 0,X ; "See-move" Character (memory) move ( from to len -- ) ORA 1,X BEQ POP3 ; If remaining length is 0, branch to POP3 which is just ; 6 INX's, then RTS.) LDA (4,X) ; Get a byte and STA (2,X) ; transfer it. INC 4,X BNE cm1$ INC 5,X ; Increment the source addr cm1$: INC 2,X BNE cm2$ INC 3,X ; and the destination addr, cm2$: DEC 0,X ; decrement the count left, LDA 0,X CMP #$FF BNE CMOVE DEC 1,X ; and go back up for another loop. If we're done, BRA CMOVE ; that fact will be caught in the first three lines. ;------------------A version of this is in the accompanying StackOps.ASM file. If the two memory areas overlap, CMOVE can be used to move bytes down toward address 0; but if you need to move bytes up (toward address $FFFF) and the ranges overlap, use CMOVE_UP instead, which is also included there. BTW, the 65816 has two memory-move instructions (MVN and MVP) that do the same thing. It takes a few instructions to set up the registers first, then MVP or MVN take only seven clocks per byte moved; IOW, moving a thousand bytes would take half a millisecond at 14MHz. These two instructions can be interrupted too, and after the interrupt is serviced, the memory-moving process will resume.
I like to have another subroutine called QCMOVE (for "Quick CMOVE") on 6502 which is much faster than the regular CMOVE but has a maximum length of $FF bytes, being indexed with an 8-bit index register. I find most of my moves are few enough bytes to do with the shorter, faster QCMOVE.
The subroutine called FETCH, about 60% of the way down Section 4 which introduces virtual stacks, is an example of replacing a cell on the stack with the contents of the address the cell points to. It only deals with a 2-byte input and output, but shows a use of the ZP indexed indirect addressing, where absolute indexed indirect addressing is not available in many instructions for use with the hardware stack (only JMP (abs,X), and even that one is lacking on the NMOS 6502). The accompanying StackOps.ASM file has more code for strings and other things that will use lots of input and output bytes for the subroutines.
Taking stacks a step further, #4C near the top of this page mentions another stack anywhere in RAM, similar to #4B but may be specifically for something like floating-point numbers which will not be addresses that need the benefits of the added addressing modes available in ZP.
Pursuing the example of a separate stack for floating-point numbers, suppose you want 48-bit (six-byte) numbers. 48 bits is enough for 12 digits (when converted to decimal) plus exponent and signs. You could put these on the normal ZP data stack, but stack gymnastics become unwieldy, like if you want to rotate the third cell to the top but the cells are different sizes.
Continuing, suppose you want a stack space of eight 48-bit (six-byte) cells, and you set aside an area of RAM with 48 bytes (6x8), which is $30 in hex; so suppose you allot addresses $200 through $22F. It could be indexed by X, but we're already using X for the regular data stack in ZP, so that would require saving two different index values, saving one while the other is used, and vice-versa. We can avoid much of that by using Y, since you only need the absolute indexed addressing, no indirects. You still get the abs,Y addressing mode with ADC, AND, CMP, EOR, LDA, LDX, ORA, SBC, and STA. If you need ASL, BIT, DEC, INC, LDY, LSR, ROL, ROR, or STZ in an absolute indexed addressing mode, you'll need to use X. This will be less often, although the shifts and rotates will be needed for normalizing and de-normalizing floating-point numbers. (BIT abs,X and STZ are not on the NMOS 6502. I have an article on the differences between the NMOS 6502 and the CMOS 65c02 at http://wilsonminesco.com/NMOS-CMOSdif/.)
Now you can split the stack into parallel sections; and instead of doing DEY or INY a half-dozen times to add or drop a cell, respectively, we'll do it only once, and then have:
$200-207 hold the low bytes of the mantissas of the eight cells, $208-20F hold the next-to-lowest bytes of the mantissas, $210-217 hold the middle bytes, $218-21F hold the second-to-highest bytes, $220-227 hold the high bytes of the mantissas, and $228-22F hold the byte with the exponents and signs.
When there's nothing on the stack, Y holds 8. Immediately before putting the first number on the stack, you decrement Y to 7, and then:
$207 gets the low byte of the mantissa of the new number, $20F gets the next-to-lowest byte of the mantissa, $217 gets the middle byte, $21F gets gets the second-to-highest byte, $227 gets the high byte of the mantissa, and $22F gets the exponent and signs.
To read the various bytes of TOS (the top-of-stack number) without adding or dropping a number, do LDA $200,Y, LDA $208,Y, LDA $210,Y, etc.. So the low byte for example is always $200,Y, and at the other end, the exponent-and-signs byte is always $228,Y, and Y gets you to the right place in the stack space for however deep the stack is at the moment. To read the various bytes of NOS (the next-on-stack number, which is right under the top of the stack), do LDA $201,Y, LDA $209,Y, LDA $211,Y, etc.. You still do not change Y to select which byte of a cell you're accessing.
It's essentially like splitting the one stack into several—one for each byte of the size of cell you want. In this case, we chose six bytes per cell, and we have what amounts to six stacks in a sense, accessed in parallel, all sharing the same index register. (What you might have already anticipated then is that a stack could conceivably be many, many pages in size, as long as there are no more than 256 cells (so it can be indexed by Y), and it is split up this way.)
The next time you want to add a number on this stack, Y gets decremented again to 6, and the new number occupies $206, $20E, $216, etc..
When a number is dropped from the stack, Y gets incremented instead of decremented. If you ever get down to Y=0, the stack is full, and adding another cell will overflow the stack, writing the next bytes to $2FF, $207, $20F, $217, $21F, and $227, guaranteeing problems! Even if there's nothing important at $2FF, the other bytes will overwrite bytes of the number at the bottom of this stack.
This matter of extra stacks was discussed in this forum topic, in the context of the Forth programming language (but it still applies here).
Of course there's nothing saying these have to be floating-point numbers, only that the cells need to be the size you decided on, which in this case is six bytes. They could be high-precision scaled-integer numbers, short strings, or anything else you choose to make them.
So why not split high byte and low byte on the ZP data stack the same way in order to reduce the number of DEX and INX occurrences? The answer is that you need the two bytes of a cell together in order to do LDA(ZP,X) for example. That's not to say you can't address the problem another way; but I'm not convinced there's a net advantage. One way is to move the TOS to a fixed ZP address with the two bytes together when you need an indirect. Another is to keep the TOS at a fixed ZP address (basically like a virtual register) with the two bytes together instead of in the indexed area. This requires moving it to the indexed area when you add something else on top of it, and requires moving the previous NOS to the TOS area when the old TOS is dropped.
Bruce Clark says on the 6502.org forum that of the three varieties of BASIC he listed further up (Apple 1 BASIC and Integer BASIC being cut from the same cloth, as are Applesoft and EhBASIC), only the Applesoft/EhBASIC variety put intermediate values on the hardware stack. The Apple-1/Integer BASIC variety and Tiny BASIC use a Forth-like data stack, even using ZP,X addressing. For completeness, KIM-1 Focal uses a data stack that can span multiple pages (using (ZP),Y addressing).
When is the stack not a good way to pass subroutines' input and output data? Moving data to and from a stack takes time. If you have a lot of data, like a string or table or array, leave it somewhere else in memory and just put its address on the stack. Very large sets of data may run the 6502 out of stack space, and I understand Pascal tends to put very large things (even whole procedures) on the stack; but in most programming you're likely to do on the 6502, the one page (256 bytes) of memory that it assigns to the hardware stack space is well beyond what you will need. More on that later, in section 16.
And again, for inside a subroutine, there's N, the commonly 8-byte ZP variable space introduced in section 4 for subroutines like multiply and divide where the iterative process and constant indexing would take longer than moving the data to N and back. N is not for passing parameters between subroutines though.
And as you've probably noticed already, treating everything as two-byte cells is not efficient on the 6502 for dealing with only 8-bit quantities. (OTOH, the 65816 handles 16-bit quantities almost as efficiently as 8-bit.) But when you're using assembly language, you can mix to your heart's content, using the subroutines to facilitate especially the 16-bit operations, and revert to other methods for the 8-bit if you like. (You can do a data stack of single bytes too, but it's easier to keep things under control if you keep the stack cells a consistent size, typically 16-bit on the 6502.) Some of the many examples given in section 8 on RPN operations with 16-bit cells, as well as what's in Appendix A which is the source-code file StackOps.ASM, would be ridiculously inefficient to use for a long sequence of 8-bit operations. The prime one is probably the DO...LOOP with its 16-bit limit and index on the return (ie, hardware) stack. Many of these tools make the programmer more efficient; but if you only need an 8-bit counter for the loop, it will usually make a lot more sense from a standpoint of memory consumption and execution speed to forgo the DO...LOOP stack subroutines and just use something like the old
LDY #$3C ; (just an example value) label: <do_stuff> <do_stuff> DEY BNE labelwhich, BTW, if you're looking for programmer productivity, is the exact same thing as what you get in my 6502 structure macros with
FOR_Y $3C, DOWN_TO, 0 ; (just an example value) <do_stuff> <do_stuff> NEXT_Yand the only stacks involved to do that are in the assembler, not the machine code it produces. How to do that magic will be covered in section 17. In most cases there is actually zero penalty in runtime speed and memory consumption—only improved development speed, reduced bugs, and easier maintainability. In this particular example, the FOR_Y assembles the LDY #3C and makes the assembler remember the address of the top of the loop and tells it what to assemble at the end, and NEXT_Y uses that info to assemble the DEY, BNE <loop_top>. It may seem rather pointless for a simple loop; but it becomes extremely valuable when you have structures nested several levels deep, and the branches in the version without the structure macros go to a dizzying pile of labels, in what we call "spaghetti code."