home   |   stacks treatise index   |   1. Intro: stack basics   |   2. subroutine return addresses & nesting   |   3. interrupts   |   4. virtual stacks   |   5. stack addressing   |   6. passing parameters   |   7. inlined data   |   8. RPN operations   |   9. RPN efficiency   |   10. 65c02 added instructions   |   11. synth instructions w/ RTS/RTI/JSR   |   12. where-am-I routines   |   13. synthesizing 65816 stack instructions   |   14. local variables, environments   |   15. recursion   |   16. enough stack space?   |   17. forming program structures   |   18. stack potpourri   |   19. further reading   |   A: StackOps.ASM   |   B: 816StackOps.ASM   |   Appendix C


Stacks Potpourri

On this page:

Stack usage in a multitasking OS

Section 16 gives considerations on the amount of available hardware stack space on the 6502, so that will not be repeated here; but note that the test I gave it there shows that for many if not most applications, there's actually enough page-1 stack space to split it up and give each of several tasks its own section of page 1.  (This goes for page 0 too, if you use a separate data stack there.)  If you can determine that 51 bytes (hitch hiking on my experiment in section 16) gives plenty of space (including headroom) for each task, you could divide page 1 into five equal segments of 51 bytes each, like so:

       task 1      task 2      task 3      task 4      task 5 
      $100-132    $133-165    $166-198    $199-1CB    $1CC-1FE

It will be clear enough how to divide it into different numbers and sizes of segments.  If you know ahead of time roughly how much stack space each task needs, you might be able to allot different sizes of stack segments for each, allowing more tasks because ones that don't need much stack space don't take it away from the ones that do.  I expect that with care, you could in many cases safely do several more tasks than that.

Each task's stack pointer will be restored to it when it gets its turn to run again, taking its position from the appropriate variable which you load into X and then transfer to the stack pointer with TXS; and at the end of its turn to run, the reverse will be done to store the position away for the next time it runs.  (If you have one or more separate data stacks in ZP or elsewhere in memory, those will need to be handled similarly.)  If there's an actual OS handling the multitasking, its stack pointer will need to be saved and restored too.  (For simple ways to do multitasking without a multitasking OS, for systems that don't have the resources to run such an OS, or where hard realtime requirements may rule one out anyway, see my related article.)

What Jonathan Halliday is doing in his preemptive multitasking GUI OS for Atari 6502 computers (a few paragraphs down) allows more stack space than meets the eye, by leaving the major tasks' stack space in place full time, but moving stack memory for low-priority tasks so several of them can use the same portion of page 1, effectively expanding it.  He writes, "This is where scheduling strategies can become densely complex.  If the scheduler wants to context-switch to a task whose stack frame is cached out, it looks for sleeping tasks to evict.  If it finds one, it then looks for the lowest-priority task currently using the stack and evicts that instead.  Of course that sleeping task could just be on the verge of waking up in the next system tick, but there's a compromise to be found here, and too much complexity will just slow everything down anyway."  The user-interface process has the highest priority of all, ensuring that the UI updates remain fast.

On the NESDev forum, lidnariq posted, in the topic "6502 cooperative multitasking framework" a fascinating little chunk of code he found while reverse-engineering a dot-matrix impact printer's firmware.  He writes, "Yield saves A,X,Y,S, and PC of the calling thread and switches to another thread.  The scheduler appears to just be round-robin; if thread N yields, thread N+1 is resumed ... until all threads have yielded, in which case they're all marked as 'allowed to execute' again and it resumes with the first one in the table.  Specifically, this was running on a Mitsubishi 740-class microcontroller, which uses a superset of the original NMOS 6502 instruction set, and has many of the same instructions (although with different encodings) as the WDC 65C02."  See the code at the link.

On the Codebase 64 Commodore 64 site, Gregg has a short article called Threads on the 6502 where he writes,

Code follows.  It's relatively short and easy to understand.

Others' impressive 6502 multitasking OSs

To see some enthusiasm and skill in a 6502 multitasking OS, check out André Fachat's GeckOS/A65, a full-featured preemptive multitasking/multithreading operating system for the 6502, with dynamic memory management, implementing some Unix-like features, like signals, semaphores, relocatable file format, standard library, internet support via a kind of simplified sockets, and virtual consoles.  Glenn Holmer gives a very helpful talk, with slides, on the C64 version, at the Vintage Computer Festival MidWest (VCFMW) 2019, at https://www.youtube.com/watch?v=jtlAOdJmeDI.  See also Cenbe's commentary on it, at https://www.lyonlabs.org/commodore/onrequest/geckos-analysis.html [Shucks.  That last link has gone dead, and archive.org doesn't even have it.  If anyone can give me an updated link, please email me.]

Related, see the preemptive multitasking GUI OS for Atari 6502 computers.  At the time of this writing, it is not finished yet, but here are a couple of screenshots provided (I'm sure Jon won't mind my giving him free publicity):

    picture     picture

A little bit from the page:

Is it really pointless to try something like this on a 6502 in 2015?  I am attracted to the related philosophy in the articles of a very sharp, experienced professional programmer, Samuel Falvo, "Software Survivalism" and "Neo-Retro Computing."  And although I have not had any exposure to 6502 Atari computers, I am fascinated by the discipline and care and inventive techniques used to pull this off in a system with such limited memory and execution speed.  It's definitely a good practice, and forces one to be more efficient with the resources.  It's a valuable skill to transfer to any system, including modern ones where bloatware and programming sloppiness are unfortunately the norm.  Today as I write this, our son's smart phone crashed, and took several minutes to re-boot.  Why did it crash?  Bugs.  Why did it take so long to re-boot?  Bloat.  My hat is off to Jonathan Halliday, and also to Paul Fisher, Jörn Mika, and others I might have missed who contributed to his (so-far-nameless) Atari multitasking GUI OS!

Measuring stack depth, and reasons

Since you can always get the hardware stack pointer value with TSX, you can subtract to get the difference between that and where the stack started, to get the depth.  Since you cannot take the origin minus X or even A directly, you could use the stack itself for the temporary storage, to avoid using variable space.  (Actually there is a more-efficient way that I'll get to in a minute; but first I'll show how to do it this way.)  Remember that on the page-1 hardware stack, S points to the next available location to store something in, whereas in virtual stacks we usually have the index register point to the low byte (or first byte) of the top-of-stack cell.

        LDA  #<stack_origin>
        SBC  100,X            ; Note!  Not 101,X!  (PHX came after TSX.)
        PLX                   ; We're done with this on the stack, so drop it.
                              ; The result excludes the stack byte we kept X in.

If you have a ZP data stack and are using X as its pointer, the above will require saving and restoring X also, something like:

        LDA  #<stack_origin-1>   ; The -1 is because we first pushed
        SEC                      ; X as our data-stack pointer.
        SBC  100,X

In both cases, the result at the end is in A, and you can push that to the stack after the code fragments above if you need to.  Also in both cases, the result tells the number of bytes.  (On a data stack like below, the depth is normally reported as the number of cells, and there may be two or more bytes per cell.)

Assuming you use X or Y for a pointer into a virtual stack which gets a lot of attention in this treatise, the pointer is a little more directly accessible, but unfortunately you still can't subtract it directly, but you can temporarily store it elsewhere:

        STA  0,X
        LDA  #<stack_origin+1>   ; The +1 is because we first did
        SEC                      ; did a DEX.  (See note below.)
        SBC  0,X

This gives the number of bytes.  If you want the number of 2-byte cells, follow it with an LSR.

You could leave out the DEX and INX and shorten it a bit if you can be sure an interrupt won't overwrite the stacked pointer value:

        STA  $FF,X               ; This should give a ZP addr mode.
        LDA  #<stack_origin>
        SBC  -1,X

But here's the more-efficient method I said was coming.  It's a less-obvious solution, but shorter and quite a bit faster:

        EOR  #$FF
        ADC  #<stack_origin>

In all cases, you can leave out the SEC if you know the carry flag will already be set when arriving at this part of the code.

So why measure the stack depth?  After all, once you get the hang of stacks, it's rather uncommon to accidentally over- or underflow a stack even in development, and the danger is almost completely non-existent after that, so there's really no point in adding the overhead of always checking.  Well, actually, even with no such danger, there are good reasons to get the stack depth.

One application I've used in Forth extends the CASE structure's capabilities with SET_OF (and also RANGE_OF).  I think I got the idea from Forth Dimensions magazine in the 1990's.  A { (left curly brace) checks the data stack depth and stores it temporarily, and the } (right curly brace) checks the data stack depth again and returns the difference.  SET_OF uses the difference which } puts at the top of the data stack to know how many numbers to compare to.

                    0A         OF  <do_stuff>  END_OF   \ In the case of it being 0A, do this.
                    0E         OF  <do_stuff>  END_OF   \ In the case of it being 0E, do this.
        { 13 41 5B 20E }   SET_OF  <do_stuff>  END_OF   \ If it's one of these numbers, do this.
                15  28   RANGE_OF  <do_stuff>  END_OF   \ If it's in the range of 15 to 28, do this.

In another use, I had a situation where I wanted to type on the command line a list of numbers to be handled, and automatically allow varying numbers of inputs, then calculate a table of gear ratios and accompanying info and print a chart:

   {  6  7  8  9 18 19 }  DON'T_USE
   { 12 13 14 15 17 19 21 23 25 }  { 30 42 52 }  CALC_GEARS

The DON'T_USE (note that Forth allows using almost any character in names) set of ratios was to be left out of the calculations because of undesirable chain angles.  (This was for bicycle gearing.)  Note that since the } puts a depth number on the data stack, you can have one group after another, as long as you don't run out of stack space, and the routine that handles them can make sense of it.  There will not be any errors from having a wrong number of parameters on the stack.  The 12...25 sequence shows the numbers of teeth on each cog of a 9-speed cassette, but it could just as easily be a different number of cogs, say 8 or 10; and the 30...52 sequence shows the numbers of teeth on each of three chainrings, but it could just as easily be two chainrings.  So DON'T_USE and CALC_GEARS could just as easily handle the following, even though the numbers of parameters are different:

   {  9 10 11 12 }  DON'T_USE
   { 12 13 14 15 16 17 19 21 23 25 }  { 39 53 }  CALC_GEARS

Since it's all stack-based, the following would work the same (but it would be less intuitive):

   { 12 13 14 15 16 17 19 21 23 25 }  { 39 53 }   { 9 10 11 12 } DON'T_USE  CALC_GEARS

Right before the DON'T_USE, there would be 19 cells on the data stack in this case (plus any that were there before this line); 16 for the numbers in the three groups, plus one on top of each group, telling its size.  DON'T_USE removes the top 5 and puts them in an array, and CALC_GEARS removes the rest and prints the chart.

Hardware allowing RTS to be used as NEXT in 65816 Forth while allowing interrupts

Bruce Clark brought up on the 6502.org forum the possibility of using RTS as a 6-cycle NEXT in DTC Forth on the '816.  S, the 16-bit stack pointer, becomes the Forth's instruction pointer (ie, its program counter), IP!  The drawback is that S points within your program, and as a result, interrupts cannot be allowed, because part of your program would get overwritten during the interrupt sequence when the return address and status are stored.  Jeff Laughton shows a circuit to remedy that in the Six-cycle NEXT (with a h/w mod to accommodate interrupts) forum topic.  The circuit below detects RTS opcodes and forces the resulting "stack" memory accesses to remain in the program bank instead of going to Bank 0:


The version below lets RTS and PLA both pull from the program bank:


Basically, Forth runs in a non-0 bank, and most of bank 0 must be left available for the hardware stack, since it will not be known what the value in the Forth program counter (IP, held in register S) is when an interrupt hits.  Stack operations relating to the interrupt will write to and read from bank 0, while RTS serving as Forth's NEXT reads the bank that the program is in, rather than bank 0 where the hardware stack resides.  See the forum topic for explanation and discussion.  What was not discussed was how the Forth words that access the return stack would work, for example do, ?do, loop, +loop, leave, ?leave, unloop (ie, the internals compiled by DO, ?DO, LOOP, +LOOP...), I, J, >R, R>, and R@. 

Using S as an index register for fast data transfer to/from stacked buffer

blargg has an interesting write-up on using S as an index register in a sense for transferring a small buffer of data (small enough to fit in the available hardware stack area) to or from an output device.  Pushing or pulling the data is faster than using X- or Y-indexed addressing and incrementing or decrementing X or Y.  X and Y are both now left free for other uses.  The lack of any kind of handshaking or status-checking might rule it out for most applications, but see if it gives you any ideas.

This example quickly outputs a buffer of 0-terminated stacked data to a memory-mapped device outside of ZP.  Each byte takes 11 cycles to read from the buffer and output:

        LDA  #0
        PHA             ; Push nul terminator first.
        ...             ; Now push the data

        <When it's time to output the data, jump to "next" below.>

read:   STA  port
next:   PLA
        BNE  read

This next example quickly reads data from a device and stops when it receives a 0.  Each byte takes 10 cycles to input and write to the buffer:

        TSX             ; Save the current stack pointer.
        STX  end

write:  LDA  port       ; Read from the device
        PHA             ; and push the byte.
        BNE  write      ; Drop through when you read a 0.

read:   PLA             ; Jump here when it's time to read the data.
        ...             ; Use the data.
        CPX  end        ; Keep checking the stack pointer value
        BNE  read       ; and drop through when we've reached it.

By putting the buffer at the bottom of page 1, S can be used as both a counter and an index for a write buffer.  Each byte takes 12 cycles to input and write to the buffer:

        TSX             ; Save the stack pointer.
        STX  stack

        LDX  #size

loop:   LDA  port
        BNE  loop

        ...             ; Use the data.
        LDX  stack
        TXS             ; (He had TSX, but I fixed it.)

By putting the buffer at the top of page 1, S can be used as both a counter and an index for a read buffer.  The normal stack would need to be placed lower in page 1 to coexist with this scheme.  Each byte takes 13 cycles to read from the buffer and output:

        LDX  #0         ; Init stack.
        ...             ; Push data on stack.
loop:   PLA
        STA  port
        BNE  loop

Atari programmable RPN calculator application, with source code

Here is another floating-point library, in source form, by Carol Shaw, from 1979 (ie, for NMOS 6502), as well as the calculator source.


The calculator requires 24K of RAM.

17. forming program structures <--Previous   |   (Next--> 19. further reading)

last updated Jul 15, 2020