virtual stacks

So far, we've only been talking about the 6502's hardware stack which:

resides in page 1 of the memory map,
has a dedicated processor register (S) as its stack pointer, and
built-in microprocessor instructions automatically take care of the indexing into the stack area and the incrementing and decrementing of the stack pointer.

You can implement virtual stacks too though. This just means you will synthesize the stack in software, typically using an index register as a pointer, and you'll have to use separate instructions to increment and decrement the index register and to store individual bytes of a multi-byte cell.

Ok, so why would you want to, if the 6502's built-in hardware stack operations already do this for you automatically? Well, it turns out that using an additional stack to handle data separately from return addresses has major attractions:

It makes parameter-passing between routines implicit and more efficient since routines won't have to keep stepping over return addresses. This becomes especially important when parameters are passed to a subroutine which may in turn pass one or more parameters to another one which won't know how many return addresses are sprinkled in unless each routine does more stack juggling!
It makes stack handling much easier than trying to put the data on the return (hardware) stack, partly for the reason above.
There are more addressing modes in ZP (zero page), so more operations can be done in place rather than having to do an intermediate move to ZP. There are plenty of things you can't do directly with the hardware stack anyway, like pull a value off to add to the contents of the accumulator.
The resulting easier stack operations reduce the number of variables needed along with the risk of overwriting data that another pending routine expects to find still intact. (This is true of the hardware stack too, but having the separate data stack carries it further.)
It improves nestability and re-entrancy of routines. (This is true of the hardware stack too, but having the separate data stack carries it further.)
It reduces bugs and crashes, improving development time.

The data stack becomes a workspace for arithmetic and logic operations.

HP RPN (reverse polish notation, discussed in a minute) calculators had a data stack that was separate from the return stack too; but they didn't carry it as far as we will discuss here.

The simplest illustration of handling data on a stack does not do a very good job of showing why it's so efficient. That will be covered in a little more depth in Section 9. Here's an idea of it though. To add two numbers, say 2 and 5, you put them on the stack and then tell the computer to add. It takes the two off the stack and adds them, replacing them with the answer. So you have for example:


  2 5 +

and the 2 and the 5 disappear and the answer 7 is the new top of stack. For slightly more complexity, consider √(6(2+5)):


  2 5 +   6 *   SQRT

or:


  6   2 5 +   *   SQRT

Slightly different, √6(2+5) (ie, 2+5, times the square root of 6) would be:


  2 5 +   6 SQRT   *


  6 SQRT   2 5 +   *

Any level of complexity can be expressed and carried out without parentheses; and after an initial familiarization, it becomes second-nature. An algebraic notation shows what you get, whereas RPN shows how you get there. If you really want the program to show the algebraic (which an assembly-language program won't anyway), you can still put the algebraic in the comments if you like, and then you'll have both.

The logic is called Reverse Polish Notation, or RPN. The "Polish" part comes from the Polish mathematician Jan Łukasiewicz who invented the parenthesis-free notation in the 1920's; and the "reverse" part comes from the fact that RPN puts the numbers on the stack and then tells what to do with them, instead of having the instructions precede the numbers. (So no, it did not come from some joke about Polacks doing things in a backward and senseless way.) Łukasiewicz introduced the idea of giving the operator first, followed by the numbers it would operate on, for example, + 3 4 instead of 3+4. In the 1950's and 60's, it was turned around (hence the name "reverse Polish notation") by several people inventing and re-inventing, turning the example into 3 4 +. The numbers are presented first, and then the operator tells what to do with them. (The only part turned around is that the operator comes last instead of first. The numbers' order remains if it matters, as in division, where / 3 4 in Polish becomes 3 4 / in reverse Polish, because swapping the 3 and the 4 would give a different answer.) The stack idea came to be in computers at the same time. RPN was used starting in the early 1960's to make computers more efficient.

To show the use of the data stack, suppose we had a stack with a 1 in the top cell (which was put on the stack last), a 3 in the next cell, and so on, to some arbitrary depth ("TOS" stands for "top of stack."):

If you tell the computer to add (perhaps by calling a subroutine or macro named PLUS), the top two cells will be taken off and replaced with a 4, since 3 + 1 is 4. The result will be:


TOS  4
     5
     7

Note that the 5 and the 7 (and anything else that might be under the 7) remains undisturbed, regardless of the depth. For the PLUS operation, you don't care what else is under there or how deep the stack is. It simply does not matter.

If next you tell the computer to multiply (perhaps by calling a subroutine named STAR), the top two remaining cells will be taken off and replaced with 20, since 5 * 4 is 20:


TOS 20
     7

Suppose you wanted to divide that 20 by 2 now. Put the 2 on the stack. It will get pushed onto the top:


TOS  2
    20
     7

Then divide (perhaps by calling a subroutine named SLASH). You'll get:


TOS 10
     7

Hewlett-Packard calculators called the top of the stack (TOS) the X register, the next one (next on stack, or NOS) Y, the next one (which I sometimes abbreviate 3OS for "3rd on stack") Z, and the last one T. The HP manuals always drew the stack up-side-down so X was always at the bottom and things get pushed up instead of down; but it's the same idea. (Side note: There's an RPN primer here that's oriented to RPN calculators with a 4-level stack.)

There is nothing however that says a stack has to be limited to four levels like the Hewlett-Packard calculators had. The memory allowed for the stack is the only limit on how complex you can go in RPN. When HP arrived at the 4-level stack, calculators were not programmable yet, so you had to keep a mental note of what was on each level. There was no program running the process and keeping track of subroutines and their data stack usage, and HP determined that it was unrealistic to mentally keep track of more than four. When we move to programming, the runtime limitation is gone, and the programmer only has to give his attention to the few stack levels (typically three or less) being used by the particular routine he's working on at the moment.

So no, letting the stack get really deep does not make it unwieldy. In fact, even the 7 in the diagrams above never got touched in these examples. The number of levels that are on the stack waiting for you to get back to is irrelevant, like a ship at sea with no concern for how deep the ocean is at that location as long as it is deep enough to avoid hitting the bottom. Also, when you call a subroutine which takes, say, two inputs and gives back one output, how many data-stack levels are used inside the subroutines it calls is of no consequence. It simply doesn't matter (again, assuming the memory allowed for the stack is enough that you won't ever run out).

Subroutines automatically leave other subroutines' "scratchpad space" on the stack undisturbed. In fact, "recursion" refers to when a routine calls itself, even multiple nested levels deep, which a stack allows it to do. (As you might expect, it is important to make sure that the condition to back out of the recursion is met before you overrun the available stack space.) We'll talk about recursion in section 15.

HP calculators of the 70's and 80's especially used RPN because of its efficiency at a time when registers and memory were still expensive. Later, stack processors came along which were very simple and yet screamed performance, routinely doing more Forth MIPS than MHz, having the stacks onboard so they didn't have to go out on the bus for stack access.

Looking only at how math is done with RPN makes for a very limited view though. The stack approach also handles parameters for logic operations, strings, array indexes, file pointers, controls for things like motors and audio and temperature, etc.. It makes it very easy for a function to return as many answers as needed, for example the address and length of two different strings. That's something that the C programming language does very poorly.

Section 8 has more on RPN examples and efficiency, with a link to accompanying code you can use.

The advantages of a data stack are particularly pronounced in higher-level languages. However, I did do an all-assembly-language application for a product in the late 1980's with a lot of math calculations, and found that managing variables and keeping various math routines' pending inputs and outputs from stepping on other routines' temporary storage was definitely a challenge. Implementing a data stack, if I hadn't still been so green and had thought of it, would have been very helpful. Six years later I did a control application in assembly with what amounted to long Boolean equations, and a stack of single-byte cells would have been helpful there too.

Why? I ran up quite a large inventory of ZP (ie, zero-page) variables in those projects; and although most of them were only needed at certain times (and most of the time they held data that was no longer needed), there was no way to know in advance which ones would be needed simultaneously; so they had to have their own spaces full time, like a toothbrush which is never shared even though it's usually sitting idle. It's possible to have the majority of your truly needed variables, even system variables, in higher RAM, as very few need to be in ZP full time. Then they're brought to the data stack in ZP there while they're being used, are put back in higher RAM if there are changes to save, and otherwise it's a type of automatic garbage collection, without the usual performance hit of garbage collection, and without needing any garbage-collection routines. Stuff lives in high RAM, but mostly gets operated on on-stack.

It's like hotel rooms. A hotel with 42 rooms can accommodate an unlimited number of guests, as long as no more than 42 parties show up at once. Any of them can go back as often as they like, but they'll probably get a different room each time (analogous to the address in the stack). There's no need to have a room (or a set of sheets or towels or anything else) reserved for Mr. & Mrs. Smith full time just because they sometimes stay at that hotel; yet that's what I was doing in the projects mentioned above.

A 6502 implementation of the Forth language makes excellent use of zero page (ie, addresses $0000-00FF, ZP for short) for a data stack, using X as the stack pointer, INX and DEX to increment and decrement that pointer, and indexed (and sometimes indexed indirect) addressing. Again, this is separate from the return stack (which uses the 6502 hardware stack), and it makes many operations much more practical than trying to pass parameters between routines on the hardware stack in page 1. Even though it's called the "data stack," note that byte pairs on this stack can be anything, whether for example a constant that you put there, an address of a variable or routine or string, a pointer into an array, math inputs and outputs, flags, etc.. We will be doing the same thing here in assembly.

It works particularly well in ZP because of the extra addressing modes, especially that most of the accumulator-oriented instructions have a (ZP,X) addressing mode but not an (abs,X) addressing mode. It's almost like the 6502 was designed for Forth (although I'm sure Chuck Peddle and Bill Mensch were not thinking of Forth when they designed the 6502). You can for example calculate an address on the data stack and then read the contents of that address with LDA(ZP,X). There is no corresponding LDA(abs,X), so obviously it works better to have the data stack in ZP. As a plus, using this method, X seldom needs to be saved and used for anything other than the stack pointer; so you don't need to lament the loss of your X. X won't often need to be saved and restored for other things. And yes, the stack takes precious ZP space; but again, it reduces the need for so many variables to be in ZP.

more than

all

Ok, so you want an assembly-language example. Expanding on the above, let's say that

address $1E04 is the base of an array, and
each record in the array has six bytes, and
we want to read the first byte of record 71 (in decimal, the first record being record number 0).
Let's say the 71 ($47) is already on the data stack,
and we're using two bytes per cell, so the high byte is 00 (for $0047 which is the desired record number), and
the stack pointer, X, contains $D9 at the moment. In this case, this is the last used stack byte, not the next available one like we had in the hardware stack; so our 71, or $0047, has the 00 at address $00DA, and the $47 at address $00D9.


        DEX             ; We'll start by adding a cell, or two bytes,
        DEX             ; to the stack, which we will be filled in next.
        STZ  1,X        ; Make the high byte of that cell to be 0,
        LDA  #6         ; and the low byte to be 6, as there are 6 bytes per record.
        STA  0,X

        JSR  STAR       ; Do the multiplication.  We will use a single-precision multiply here, so the result is 16-bit, not 32.
                        ; Since two (2-byte) cells were multiplied and a single cell answer was returned, STAR included INX INX
                        ; and the stack depth is back to where we started, except that now the top stack cell contains an address
                        ; offset into the array instead of a cell number.  That address offset in this case is $47*6, or $01AA.
        DEX             ; Again add a cell, or two bytes,
        DEX             ; which will be filled in, next.
        LDA  #>ARRAY
        STA  1,X        ; Put the address of the array (ie, $1E04), high byte and low byte, on the stack.
        LDA  #<ARRAY
        STA  0,X

        JSR  PLUS       ; Add the $01AA offset to the $1E04 array base address, resulting in $1FAE.  Since PLUS took in two 16-bit
                        ; numbers and gave only 16 bits of result, the routine included INX INX and the stack depth is back where
                        ; we started, except that now the top stack cell has the final address instead of an offset into the array.

        LDA  (0,X)      ; Now read the first byte of record #71 (or $47) in the array and
        STA  0,X        ; put it in the low byte of the cell at the top of the stack, replacing the address we no longer need.
        STZ  1,X        ; Put a 0 in the high byte of the top of the stack.

Note that the JSR did not foul up anything with the data stack operations, because it stores the return address on the return (hardware) stack in page 1, not the data stack in ZP.

The above was mostly written out longhand (except for the multiply and add routines) to avoid hiding any details. Ideally once you get going, you will use more subroutines and macros, and it might look more like this:


        PUT  6          ; Put the number of bytes per record on the stack, by which to
        JSR  STAR       ; multiply the number of records we want to go into the array.
        PUT  ARRAY      ; Put the base address of the array on the stack,
        JSR  PLUS       ; and add it to the offset to get the final address we want to read.
        JSR  FETCH      ; Fetch the data there.  (Fetch actually reads two bytes here though, not just one.)

Obviously that makes the whole thing much easier to handle, being just five lines. FETCH would look like this (and here I made it for 16-bit, instead of fetching just a single byte):


FETCH:  LDA  (0,X)      ; Get the first (ie, low) byte.
        PHA             ; Save it for six lines down since we still need A and the address at the top of the stack.
        INC  0,X        ; Increment the low byte of the address we're fetching.
        BNE  1$         ; If that turned it into 00,
        INC  1,X        ; then we have to increment the high byte too.
 1$:    LDA  (0,X)      ; Now get the second (ie, high) byte, and
        STA  1,X        ; store it in the high byte of the top-of-stack cell.
        PLA             ; Get the first (ie, low) byte back,
        STA  0,X        ; and store it in the low byte of the top-of-stack cell.
        RTS             ; Since you start and end with the same stack depth, you don't need INX or DEX.
 ;------------------

The 65816 does 16-bit operations such as this much more gracefully, by the way. Compare:


FETCH:  LDA  (0,X)
        STA  0,X
        RTS
 ;------------------

Without the RTS, it's only one byte longer than the JSR FETCH line, and much faster, so you could just make it a macro that inlines the LDA andSTA;then the JSR FETCH above becomes just FETCH. That way the 65816 does FETCH four times as fast as the 6502 does it (12 clocks versus usually 48 or sometimes 53, depending on the branch).

Note that the top of the stack (TOS) is always 0,X (low byte) and 1,X (high byte), regardless of how deep the stack is at the moment. Similarly, the second-from-top stack cell ("next on stack" or NOS) is always 2,X (low byte) and 3,X (high byte), again regardless of how deep the stack is at the moment, and so on.

You might be wondering why the DEX DEX after INX INX. It seems like a waste of everything. Well, in assembly you could get away with not dropping cells that you're just going to put right back; but in a higher-level language (or even in assembly with a lot of subroutines) it would get especially messy to have different versions of routines that do and don't shrink and increase the stack. You'd have more bugs than an ant hill! It generally works out best to just go ahead and drop cells you just used, even when it means DUPlicating a cell first if something will need it again later.

You would have a set of routines you can use as building blocks, for example:

STORE to store the second stack item to the address indicated by the top of stack (and drop both of them);
FETCH to replace an address at the top of the stack with the contents at that address (already shown above);
SWAP to swap the top stack item and the one below it;
DUP to duplicate the top stack cell;
DROP to drop a cell from the top of the stack, for example if a routine produces more outputs than you need;
OVER to make an additional copy of the second stack item, and put the copy on the top;
ROT to rotate the third cell to the top, so a-b-c (c being at the top) becomes b-c-a (now a being at the top);
EXECUTE to run the subroutine whose address is at the top of the stack (and drop that address from the stack);
TO_R to transfer a data-stack cell to the return stack;
R_FROM to transfer a return-stack cell to the data stack;
OFF to clear the byte pair (ie, store zero) at the address indicated by the top-of-stack cell (and drop that address from the stack);
ON to set the byte pair at the address indicated by the top-of-stack cell to $FFFF (and drop that address from the stack);
STAR (multiply), SLASH (divide), INCR, DECR,PLUS, MINUS, AND16,OR16(the "16" added to the name to prevent clashes with 6502 mnemonics);
etc.

(This will be discussed further in section 8, with a link to source code for these and other stack-oriented routines in Appendix A, file StackOps.ASM.)

You might also be wondering why always two bytes per cell when sometimes one is enough. It is again to reduce complexity to the programmer, and avoid bugs, all improving productivity. Making all the stack cells all two-byte allows you to use the same routines mentioned in the list above for everything. If you only need one, you can zero-out or ignore the high byte, or even just not use the stack for that portion of your assembly-language program. There's nothing that says you have to take an all-or-nothing approach.

Adding stack comments to the code adds clarity. Here's a simple example for STORE:


     ; ( value addr -- )

meaning that the second-from-top cell is the value you want to store, and the top stack cell has the address to store it in, and after the store, both items are deleted from the stack (signified by the absence of anything after the "--"). The comment should be put at least at the subroutine or macro itself,


FETCH:  LDA  (0,X)      ; ( addr -- value )
        <more code>
        RTS
 ;-------------

and sometimes also in the code that calls them. (In this example, it means we take the address in the top stack cell and replace it with the value read from that address.) Other things that might be used (besides "value" and "addr" above) are:

"n" to normally mean a two's complement signed number, where $8000 to $FFFF is treated as negative
"u" to specifically mean an unsigned number, meaning that $8000 to $FFFF will not be treated as negative
"d" for a double-precision (ie, two-cell, or 32-bit in the case of 6502) number, normally signed,
"ud" for an unsigned double-precision number
"f" for flag, ie, 0 is "false" and anything else is considered "true"
"c" for character, ie, byte, where the high cell is ignored or expected to be 0
or something that better describes the value, like "d-prod" for the double-precision product of multiplication, "°C" for a temperature, "mm" for millimeters of distance a machine moves, etc..

Much of this is just an imitation of a subroutine-threaded code (STC) Forth. In someone's musings (name withheld to protect the guilty) on language implementations on the 6502, he says that errors in stack manipulation are a serious problem in Forth. This shows his inexperience with it. Yes, a person who is new to Forth will crash it frequently; but that quickly ends with a little experience to get the hang of it, and he will become more productive.

The data stack examples above all assume 16-bit (two-byte) cells. Surprisingly, 16-bit scaled-integer math is more than adequate for most math operations you're likely to need on a 6502-based controller, including things like trig, log, and square-root functions, things that most people think you need floating-point for. I was very skeptical myself before I became experienced with scaled-integer which I now prefer for most things. Floating-point math puts a big overhead on a computer that doesn't have a floating-point coprocessor. I give the hows and whys of scaled-integer at http://wilsonminesco.com/16bitMathTables/. Do not be put off by the look-up table theme of the page. You don't need the look-up tables to take advantage—although if you do choose to go the look-up table route, look up values will often be hundreds of times as fast as having to actually calculate them, and they will be correct in all 16 bits. I will give routines for some of these ominous-sounding functions in Appendix A, file StackOps.ASM (routines that actually calculate the functions rather than look them up in the large tables).

I realize there are a few applications where floating-point is appropriate though, so read on.

I should mention here that some routines will make so many accesses to numbers on the stack that especially for the sake of speed (as well as for extra addressing modes, such as (ind),Y), it may be worth it to copy them into another temporary variable section of ZP to avoid so much indexing. We can call it "N", and experience says 8 bytes is generally enough. The routines above can use it, and when the individual routine finishes, N is completely freed up. It is never used to hold values between routines or to pass values from one routine to another. This type of routines does not call other ones anyway, so nesting is not a problem either. In a few cases, N-1 might be used, so don't run other variables right up against N.

Now suppose that you want yet another stack, say 32 bits (four bytes) per cell. It could be for floating-point numbers (if my last paragraph and the link there weren't enough to convince you you don't need them ;-) ), or for another use as we discussed (along with methods) on the 6502.org forum at http://forum.6502.org/viewtopic.php?f=9&t=493. Making it to be a third stack has some advantages. The bigger the cells get, the more pronounced the advantages to separating them from the main data stack.

If these large cells are manipulated on the main data stack, the stack gymnastics can get rather cumbersome, interfering with nimble operations there. Giving them their own stack prevents this. These large cells do not contain addresses (at least on the 6502), and that fact brings us a couple of benefits. I'll use the example of a 64-byte space in the address range of $200-$23F for 16 cells of 32 bits (four bytes) each:

They won't need the special ZP addressing modes, so they can be put elsewhere in memory and not use up precious ZP space. Most instructions available with the abs,X addressing mode are also available in abs,Y, so you will seldom need to save, change, and restore X as a stack pointer, as X will not need to be constantly shared between the normal ZP data stack and the floating-point stack.
The individual bytes of a given large cell do not have to be kept together (like they would for indirect addressing if they contained an address to use), so you can reduce the amount of incrementing and decrementing of the pointer by having for example addresses $200-20F for the low byte of all 16 cells, $210-21F for the second-to-lowest byte, $220-22F for the third, and $230-23F for the last (being possibly for signs and exponent of a floating-point number, not necessarily the high byte). Then dropping a cell takes a single increment of the pointer (one instruction), rather than four increments or getting it into the accumulator, clearing the carry flag, adding 4, and putting the pointer back.

Suppose we use Y for the stack pointer. We would initialize it at $10. We'll make this stack grow down as well, like the others. The process of pushing a cell onto the stack begins with decrementing Y (only once), so in the pushing, using STA $200,Y, the low byte of the first cell getting pushed would land at address $20F, and using STA $210,Y its 2nd-to-lowest byte would land at $21F, and using STA $220,Y the 3rd byte would land at $22F, and using STA $230,Y the fourth byte would land at $23F. If you push another cell on the stack, you would just decrement Y again (only once), and its low byte would go into address $20E, the second at $21E, and so on. Dropping a stack item requires only INY.

Some links:

The Wikipedia article on the IEEE floating point standard is at http://en.wikipedia.org/wiki/IEEE_754.
There's a 49-page paper on the IEEE floating-point standard, from UC Berkeley, at http://www.eecs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf.
There's a 6502 BCD floating-point scientific math package, 12-digit plus 3-digit exp, up to hyperbolic functions, at http://www.crbond.com/calc65.htm.
There are small 32-bit floating-point math co-processors (interfaced by two-wire synchronous serial) available at http://www.awce.com/pak1.htm. Although it looks like a good product, I must comment that what they say about scaled-integer math lacking precision is not correct. I give an explanation at http://wilsonminesco.com/16bitMathTables/, in approximately the 6th paragraph.

You might even want to make it eight bytes per cell so you can handle complex floating-point numbers (complex numbers being used all the time in electronics engineering) without having to use a separate cell for the real part and another for the imaginary part; then at the times you only need the real part, you either zero-out or ignore the imaginary part. (If you're not familiar with complex numbers, it basically just means you put the number on a graph, meaning there will be both an X and a Y component, instead of on a number line with only an X value. So actually, yes, you really can get the square root of -1, which is called "i" (or "j" in electronics), and it's on the graph at X=0 and Y=1.)

What other kinds of stack might one want? One I've thought of doing is a string stack, where the cell size would have to be at least enough to accommodate the longest string I would want to work with for the particular application. That could use quite a bit of memory, depending on how many cells I decide to allow and how long a string I decide to let a cell accommodate, but it may still be worth it in some cases. Since my own applications have not made heavy use of strings, what I have done so far is to just have two general-use string variables, and, when it's time to operate on them, just put addresses and lengths on the data stack. String constants still exist of course, and in the case of Forth, there's also the PAD where number-output strings are formed, so not every string goes through my two string variables; but their addresses and lengths still get handled on the data stack. If the strings exist elsewhere in memory and they can be manipulated in place as needed without overwriting data that another pending routine will be needing, then moving so many bytes to a string stack may be a waste of processor time.

3. interrupts <--Previous | Next--> 5. stack addressing

last updated Feb 10, 2023