RPN operations

The basis for the separate data stack and its use were laid out in section 4, along with the term "Reverse Polish Notation," or "RPN," from the 1920's parentheses-free notation invention of Polish mathematician Jan Łukasiewicz and the fact that RPN puts the numbers on the stack first and then tells what to do with them, instead of vice-versa. RPN was used starting in the early 1960's to make computers more efficient. Philip J. Koopman, Jr.'s book "Stack Computers: The New Wave" came out in 1989, and is available to read online.

Is it better than algebraic notation? It may initially seem less intuitive, and it requires a short time to get the hang of it; but many users then come to prefer it for most things in programming. I'm one of them. My first programmable calculator, a TI-58c in 1981, used algebraic, and next was the HP-41 which used RPN. The algebraic required more keystrokes, but for a short time I could have gone either way. As I gained more experience with RPN, particularly in situations where I was using the calculator/hand-held computer to control lab equipment and take data, I came to prefer RPN. In equations, algebraic shows more directly what you get, while RPN shows how you get there. When I want both, I do the programming in RPN and then put the algebraic notation in the comments.

The argument in school calculators is "Algebraic lets you enter the equation just as you see it on paper!" Well, for one thing, a lot of programming work falls outside of actual equations. For another, even in the field of calculators, in real life we don't usually have an equation in front of us. I think through it, "Let's see—I need A, and then square that; now take B, multiply it by C and add that result to the earlier one. Now I need D raised to the power of E, and divide the earlier result by that. Done. Oops, no, I still need to take the log of that..." That's the way much of real engineering is. It's only in school that you get canned problems and have the equation in the book and you're supposed to just drop the numbers in the chute and turn the crank—then we wonder why we get graduates who passed all their classes and yet show a disconnect between that knowledge and any understanding of what the problem is in their circuit on the workbench! I've hired a lot of electronics technicians and a few engineers, and I gave all the applicants a circuit-analysis test with a dozen simple problems. Not one of them ever got it all correct. It was kind of frustrating.

It's not just about math of course. Take the analogous example from baseball, "Throw the ball to the second-baseman." In English, this sentence is expressed in algebraic order (although that's not true of all spoken languages.) It starts with what to do, which is to throw. Throw what? You're told to throw, but you haven't yet been told what to throw. Even if you did know, you haven't been told what direction yet. In RPN, "ball" would come first, then the direction, and finally after the needed parameters are in place, the instruction is given. So you have:

    ball  direction  throw

which makes a lot more sense in most programming situations, and in fact it's also much easier for compilers to handle. "Ball" is put on the data stack, followed by the direction, and finally there's a command to handle them. You might have the following to type (ie, display or print) a string for example:

    string_address  string_length  type

The inputs on the data stack are usually consumed by the command (so make a DUPlicate first if you'll need it again later). Many commands will return one or more results on the data stack when they're finished.

In this section, we will further develop some functions and show how to use them. Source code for hundreds of functions, dealing with two-byte cells on the data stack, is in Appendix A. (Some will appear to be quite inefficient in the number of instructions required. It's not because of RPN, but because the 6502 is not so efficient at handling 16-bit quantities which we will mostly be dealing with. The 65816 is far better at that.) A few of those functions will be repeated here with extra explanation; otherwise, we want to cover more application-oriented examples here with more explanation of the whys, to help become more fluent in RPN. I try to keep them short to avoid getting bogged down.

If you have an application where you only need single-byte cells on the data stack, many of the stack functions in Appendix A might become short enough to appropriately be made into short macros instead of subroutines. This would be a rather limited situation since you can't have full addresses in a single byte, but I have definitely had such occasions in the past, needing to handle a lot of boolean inputs and functions with intermediate results that had to be left pending for further handling later.

The subroutines below will serve as examples, with explanation, of various RPN operations. One thing that keeps coming up is the need to put a 16-bit literal on the data stack, so we make a macro, called LITERAL, which was introduced in the last section and just lays down the JSR PUSH_LIT instruction followed by the 16-bit number to put on the stack. (Several commonly used literals might be used often enough to justify making them their own subroutines so it's not necessary to put data after the JSR. These might include, but not necessarily be limited to, -1, 0, 1, 2, 3, 4, $20 (ASCII space), and $8000, plus many others that are specific to the application.) A list of basic stack operations was introduced briefly about 70% of the way down the page in section 4. Note also the stack-effect comments at the first line of each subroutine below, in the parentheses, with the dash, introduced about 80% of the way down the page in section 4. Again, the part to the left of the "--" is what the subroutine needs upon entry, and the part to the right of the "--" is what it leaves on the stack after the subroutine is executed. In both cases, the TOS (top-of-stack cell) is toward the right end of the set.

Let's look at an example to convert degrees Fahrenheit to degrees centigrade, scaled by ten so we have 0.1° resolution. °C=5/9(°F-32). We'll introduce a subroutineSTAR_SLASH. MINUS was mentioned earlier, and is self-explanatory:



DegFtoC:                   ; ( °F*10 -- °C*10 )
         LITERAL 320       ; Put 32.0 (in decimal) on the 
         JSR  MINUS        ; data stack, and subtract.
         LITERAL 5         ; Below, we'll multiply by 5
         LITERAL 9         ; and divide by 9.  Remember
         JMP  STAR_SLASH   ; (JSR, RTS)    */ has a 32-
 ;------------------       ;   bit intermediate result.

STAR_SLASH (ie, */) has a stack effect of ( n1 n2 n3 -- n1*n2/n3 ), with a double-precision (ie, 32-bit) intermediate result (to avoid overflow) after the multiplication and before the division. If you wanted to save a few bytes, you could put the 5 and 9 on the stack all at once as a double (ie, 32-bit) number instead, with D_LITERAL. Since a double has the high byte on top, 9 looks like the high byte of a double, so you would enter $90005, or 589,829 in decimal. (Be careful when doing that with one or both numbers being negative though, to keep the signs right.) Then DegFtoC will look like this:



DegFtoC: LITERAL 320       ; Put 32.0 (in decimal) on the data
         JSR  MINUS        ; stack, and subtract.
         D_LITERAL $90005  ; Like LITERAL 5 followed by LITERAL 9.
         JMP  STAR_SLASH   ; Multiply by 5 and divide by 9.
 ;------------------

Side note:

STAR_SLASH,



: U*/_RND       ( U1 U2 U3 -- U1*U2/U3 )  \ Fractional part of answer is rounded in, not truncated.
   >R  UM*
   R@  -1 SHIFT  0  D+
   R>  UM/MOD  NIP	;



U_STAR_SLASH_RND:   ; ( U1 U2 U3 -- U1*U2/U3 )  Fractional part of answer is rounded in, not truncated.
        JSR  to_R          ; Transfer the number to divide by (ie, U3) from data stack to return stack.
        JSR  UM_STAR       ; Multiply U1 by U2, getting a double-precision (32-bit) answer.
        JSR  Rfetch        ; Get a copy of U3 from the return stack and put it on the data stack.
        JSR  NEG1          ; Put -1 on the data stack because we want to logical shift right (not left,
        JSR  SHIFT         ; hence -1, not 1) 1 bit, ie, to get half of the 16-bit divisor's value.
        JSR  ZERO          ; Make it double-precision (32-bit) with a high cell of 0 put on the stack.
        JSR  DPLUS         ; Add those two double-precision numbers.
        JSR  Rfrom         ; Now get U3 off the return stack and put it back on the data stack.
        JSR  UM_SLASH_MOD  ; Do the standard UM/MOD, getting 16-bit quotient and 16-bit remainder,
        JMP  NIP           ; and get rid of the remainder which ended up just under the quotient.
 ;------------------

to_R, UM_STAR, Rfetch,

Rfrom

2/

_2SLASH

-1 SHIFT,

2/

The way to make it round the fractional part of the answer instead of truncating it is to add half the divisor's value to the numerator right before dividing. If you calculate with guard digits and then reduce the number of digits for display, you will normally want to do this trick again when converting for display. <end side note>

Again, the same idea can be carried further in scaled-integer applications for work that is often thought to be floating-point territory. I had situations in the automated test equipment (ATE) where for example a 12-bit A/D converter produced a 4096-count output for a 0-5V input (after signal conditioning and ranging), meaning each count is 1.221mV. You can adjust the output this way:



        LITERAL  1221      ; (in decimal, not hex)
        LITERAL  1000
        JSR  STAR_SLASH    ; Mult by 1221 and div by 1000 (w/ 32-bit intermediate result).

or shorten the literals with a double again, converting to hex to be able to separate the two 16-bit numbers visually into $4C5 and $3E8:



        D_LITERAL  $3E804C5   ; Like LITERAL 1221 followed by LITERAL 1000.
        JSR  STAR_SLASH       ; (Or use JMP if you make this a subroutine.)

(or you could just leave it in decimal, as 65,537,221, or 1000*65,536 + 1221.) Or, why not make another macro that does the same thing but makes it more clear:



        RATIONAL  1221, 1000  ; Numerator and denominator, to
        JSR  STAR_SLASH       ;        multiply by 1221/1000.

or even:



        MULT_BY_RATIONAL  1221, 1000

Rationals can give good, unexpected accuracy with few digits. For example, you can represent pi as 355/113, which is accurate to about 8 digits. You can represent √2 as 19601/13860, which is accurate to about nine digits, even though both numbers still fit comfortably in 16-bit cells, even as signed numbers. I have a list of commonly used rational numbers, along with their accuracies and how I calculated them, here.

Note: You can avoid the division by using more bits in the multiplication and multiplying by a number that will give the 16-bit answer in the high cell of a double-precision (ie, 32-bit) number instead of the low cell. Then you either dump the low cell, or round it into the high cell. So to multiply by the 1.221 above, you could multiply by 1.221*65,536 (which is 80,019, or $1:3893), and use JSR NIP to get rid of the low cell of the result.

The entire ATE project was done without floating-point, including logarithms. But to do the above in floating-point, you might have something like:



        FLOAT  1.221, "E0"
        JSR    FP_MULT

Now suppose you want a random number generator. Something that helps make it random is to use the free-running timer/counter in a 6522 VIA or similar I/O IC, since you don't know what its value will be when you read it at more-or-less random times. VIA1T1CL below stands for VIA1 (ie, the first 6522 VIA chip), Timer 1, Counter Low byte. If you read a 16-bit value onto the data stack, you'll get the high byte too, ie, VIA1T1CH. Read the byte pair and do some scrambling by multiplying by $7ABD (which will cause overflow and we don't care about losing those high bits, since we're after random numbers anyway) and adding $1B0F (which may cause addition overflow, and again, we don't care). The result will be MOD(T1CL*$7ABD+$1B0F, $10000).

This function by itself is not very good for randomizing if you were to input 1, 2, 3, 4, etc.; but keep in mind that even the multiply takes differing numbers of cycles based on the inputs, so T1's count becomes more random. It definitely won't be multiples of some number, even if you do a loop to generate an array of random numbers. You also don't know what T1's starting value will be.



                        ; ( -- u )  Produces a random number, range 0-$FFFF.
RANDOM: JSR  _VIA1T1CL  ; Get address of the counter byte pair on the stack.
        JSR  FETCH      ; Get the contents of the counter.  It's ok if T1CH
        LITERAL  $7ABD  ; increments after you read T1CL.  Put literal on
        JSR  STAR       ; the ZP data stack and multiply.  Overflow is ok.
        LITERAL  $1B0F  ; Now put another literal on the data stack, and
        JMP  PLUS       ; (JSR, RTS)            add.  Again, overflow is ok.
 ;------------------

You could of course optionally go further with the scrambling and with saving one or both bytes or portions of past outputs to involve in the formation of future ones.

Now when you want random numbers in a certain range, you start with the unsigned maximum number on the stack, and call CHOOSE (below). A subroutine mentioned above is UM_STAR ("you-emm-star," for "unsigned, mixed-precision multiply," "mixed precision" meaning that it takes in two 16-bit numbers and gives 32-bit result). Another is NIP which removes 2OS, ie, the second-on-stack cell, meaning the one under TOS, the top-of-stack cell. NIP does the same thing as JSR SWAP followed byINX INX (ie, what the DROP macro lays down), but is more efficient.



                        ; ( u1 -- u2 )  Produce random nr, range 0 to (u1-1).
CHOOSE:  JSR  RANDOM    ; Get the whole-range random number from above.
         JSR  UM_STAR   ; Multiply by your input number to scale the range,
         JMP  NIP       ; (JSR, RTS)  giving 32-bit output.  Dump the low 16.
 ;------------------

The reason for NIP instead of DROP here is that CHOOSE's input number is scaled by $10000, meaning $FFFF multiplies RANDOM's output by just a hair less than 1; so for example if you wanted a range of 0-$400 (minus 1, so 0-$3FF inclusive), and you enter $400, the output will have a maximum of $FFFF*$400, or $3FFFC00. DROP would take off the high byte and leave you with $FC00 which is not what you want. NIP OTOH takes off the low cell and leaves you with $3FF.

Let's take an iterative process: Square root. For the square root subroutine below, maximum input is $3FFFFFFE (half the maximum positive number for 32 bits, minus 1), or 1,073,741,822 in decimal. If you enter a higher number, it returns the root of this number, which is 32767d. The method used is r=(x+r²)/2r, repeated until you converge on an answer. The first guess is $7FFF. Lower first guesses can get by on fewer iterations on low input numbers, but can cause overflow errors with high input numbers. First I used aBEGIN...UNTILstructure to watch for when you get the same answer twice in a row and exit; but if the input number is a perfect square minus 1, like 99, the odd- and even-numbered iterations never agreed, and looping continued. The short-code solution was to always loop enough to finish any root (√0 took 15 iterations). The number of iterations has to be odd, because on the square-minus-1 problem, an even number of iterations gave the approximate root minus 1, which means √0 gave -1. This method is not lightning fast, but I went for short code.

A subroutine not introduced until now is DU_MIN which takes two double-precision (ie, 32-bit) unsigned numbers on the stack and deletes the higher one, keeping only the lower one. Another is PICK, which looks back in the stack to the depth specified, and puts a duplicate of that cell at the top. 0 PICK is the same as DUP, and 1 PICK is the same as OVER. Another is _2STAR (2*) which just uses shifting to quickly multiply the cell by two. The last one is UM_SLASH (UM/) which takes in an unsigned double number and an unsigned single, and outputs an unsigned single quotient, discarding the remainder. The stack effect is ( ud u -- u_quot ). MINUS_ROT does exactly the opposite of ROT, so its stack effect is ( a b c -- c a b ). The "^" below starts a comment on what's on the data stack after that line has executed.


                                ; ( ud -- u )
SQRT:   D_LITERAL $3FFFFFFE     ; Between $3FFFFFFE and the 32-bit input
        JSR  DU_MIN             ; number, take the lower of the two.

        LITERAL   $7FFF         ; Put initial estimate of $7FFF on stack.
                                ; Use Y to loop 15x.  It drops thru when
        FOR_Y  $0F, DOWN_TO, 0  ; DEY in NEXT_Y makes Y become 0.   ^ D n1
           JSR  DUP             ; To square the estimate, just put two
           JSR  DUP             ; duplicates of it on the stack, and
           JSR  UM_STAR         ; multiply them, getting a 32-bit result.
           LITERAL 4            ; 4 PICK twice reaches back into the stack
           JSR  PICK            ; and gets a full copy of the two cells of
           LITERAL 4            ; the double number input to put at the
           JSR  PICK            ; top of the stack.
           JSR  D_PLUS          ; Add it.  Now you have:     ^ D n1 (x+r²)
           JSR  ROT             ; ROT changes that to:       ^ D (x+r²) n1
           JSR  _2STAR          ; Now double the n1 (the new estimate)
           JSR  UM_SLASH        ; to divide the (x+r²) by.       ^ D n2
        NEXT_Y                  ; DEY, BNE.  Loop until Y becomes 0.
        JSR  MINUS_ROT          ;                                     ^ n2 D
        _2DROP                  ; This is a macro in StackOps.ASM.    ^ n2
        RTS
 ;------------------

The stack-effect comment on the JSR ROT line above makes it look like we just did aSWAP; but (x+r²) is a double, ie, 32 bits, which is why n1 became the TOS after the ROT.

I'm no algorithms expert, so there might be better ways to get a square root. I got this method from a magazine, probably Embedded Systems Programming. Lee Davison (RIP) has one here which is faster, but longer, and only handles 16-bit inputs (giving an 8-bit output), and requires variables (unlike the subroutine above).

The same article I got the above method from also had a very simple, short one that requires no multiplication or division, but is extreeeeeemely slow. There will be situations where that's ok though. If there were any reason to have a square-root function in a sprinkler controller or thermostat, I'm sure the slowness would be no problem. I tell about it, with an assembly-language program suitable for 15-bit inputs, here. It uses variables, but could be modified to use the data stack instead.

String manipulation can be handled in RPN too. The innards may get a bit tricky to follow, but we still eliminate the need for variables; and once the components (SWAP, CMOVE, etc.) are in place, the string subroutines will be shortened. Below are two subroutine samples, STR_STO (string store) and STR_ADD (to add one string to another), for counted strings, meaning that instead of using a null-byte delimiter, the first byte tells the length of the string. With a minor change, they could be made to handle strings with two bytes for telling the length, so strings could be thousands of bytes long, with very little added complexity.

If you have only the address of the string, COUNT takes that as an input and returns two cells: the address of the first data byte, and the length of the string; and it can be defined this way:



COUNT:  LDA  (0,X)    ; ( addr -- addr+1 len )   Get the length.
        INC  0,X      ; Increment the low addr byte, to get to the start of the data.
        BNE  c4       ; If that made it roll over to 00,
        INC  1,X      ; then increment the high byte too.  ^ addr+1 (Length is still in A.)
 c4:    JMP  C_PUSH   ; (JSR, RTS)  Push the length byte as a new TOS cell w/ high byte=0.
 ;------------------

Subroutines used below that have not been introduced until now are C_STO (8-bit character store), INCR (increment the data the TOS points to by 1), DECR (decrement the data the TOS points to by 1), and _2PICK which is justLITERAL 2and PICK merged since it gets enough use to justify that. (Again,PICKmakes a copy of the data-stack cell that's at the specified depth, put puts that copy at the top of the stack, replacing the depth specifier.) And again, the "^" below starts a comment on what's on the data stack after that line (or the last line of actual code) has executed.



STR_STO:              ; S!    ( str_addr length dest_addr -- )    Store counted string.
        JSR  _2DUP    ; DUPlicate the length & the destination address for C_STO, and
        JSR  C_STO    ; store the length byte (not word) at the desired destination.
        JSR  INCR     ; Go dest_adr+1, as text will start at the byte after the length.
        JSR  SWAP     ;               ^ str_source_adr dest_adr len
        JMP  CMOVE    ; (JSR, RTS)    ^ <empty>
 ;------------------

                      ; S+    Copy the 2nd counted string to the end of the 1st.
STR_ADD:              ; ( addr1 length1 addr2 length2  -- addr1 length1+length2 )
        JSR  SWAP     ; Get addr2 on TOS.                       ^ adr1 len1 len2 adr2
        JSR  _2OVER   ; Make a copy of adr1 & len1    ^ adr1 len1 len2 adr2 adr1 len1
        JSR  PLUS     ; and add these two to get addr where to move the 2nd string to.
                      ;                               ^ adr1 len1 len2 adr2 adr1+len1
        JSR  _2PICK   ; Copy length2.            ^ adr1 len1 len2 adr2 adr1+len1 len2
        JSR  CMOVE    ; Concatenate strings.  CMOVE's from,to,count inputs are the
                      ; last 3 param.s shown in the previous line.   ^ adr1 len1 len2
        JSR  PLUS     ; Add the lengths of the two strings to get total length.
        JSR  DUP      ; ^ adr1 tot_len tot_len
        JSR  _2PICK   ; ^ adr1 tot_len tot_len adr1
        JSR  DECR     ; ^ adr1 tot_len tot_len adr1-1
        JMP  C_STO    ; (JSR, RTS)  Store new length of string 1.    ^ adr1 len1+len2
 ;------------------

Note that the strings themselves do not go on either stack. Only the address and often the count does. This is normal for longer pieces of data. Some other languages (notoriously Pascal) put very large pieces of data on the stack, even whole procedures; but for those large sets, it is usually better to only put their addresses on the stack.

Appendices A and B * * * * * have various subroutines for handling strings, including forming formatted number output (called "pictured numeric output") in strings. The number is turned into text in a string variable called the "pad," and then you can add it into other strings you are forming. A nice thing about the way we'll do it is that it's just as easy to do it in one base as another; in fact, the same routines are used, unmodified, and a variable called BASE (which, not surprisingly, holds the base of the numbers to read from text, or form into a text string) makes the difference. You can even change the base between digits that are converted, which is practical for things like outputting a time string, like "12:41:23", where the seconds' and minutes' units are in base ten, but the tens are in base 6; so if you're converting a number of seconds, 45,683 or $B273 in this case, the representation in the variable is in hex but you convert a digit at a time (right to left) in base ten, then six, then ten again, the six again, the ten two more times, adding the "hold" characters (":" in this case) at the appropriate times.

How to do the conversion: We will do this all in hex, regardless of the desired output base. We will not use the 6502's decimal mode. Take your input number and divide by BASE. The integer part of the quotient goes back where the input number was, and the remainder, converted to ASCII, becomes the right-most digit. Now do it again, even changing BASE first (like going from 10 to 6 above) if desired. The remainder becomes the next digit, moving left. In our "12:41:23" example above, we would add the ":" after the seconds' digit (using subroutineHOLD), before continuing on to do the minutes' digit, and again between the minutes' pair and the hours' pair. If a sign in involved, you first push the sign on the return stack with toR, then take the absolute value of the number and convert it, then pull the sign back with Rfrom and add it to the left end of the pictured numeric output when you're done.

If you wanted to convert the degrees output near the top of this section, you would convert the first digit, then insert the radix ("." in the U.S., and usually "," in Europe) using HOLD, then continue until the number you're dividing becomes zero. If you get more than three digits to the left of the decimal point and want to separate groups of three, you can insert the separator (we use the comma in the U.S.), again using HOLD. HOLD takes an argument from the top of the data stack, the ASCII value of whatever you want in insert. It could be / , . : ' or anything else.

I've thought about using a string stack, but my own need for it has been too small to justify it. It seems best to just have two or three string variables to form strings in. Strings can of course be any length; so if you make a string stack, being able to find the next string on the stack requires you to either make every cell to be the maximum length, or examine the string-length byte (in the case of counted strings) of each string or the end-of-string byte (in the case of null-terminated strings). You would be more likely to do stack operations in place with strings than with numbers, since you can't pull whole strings into processor registers to concatenate, compare, search for the occurrence of one in the other, etc..

For more on forming numeric output strings, see chapter 7 of "Starting Forth," especially from about the middle of the page on down. (We're not really doing Forth here, but the trail has already been blazed in good stack operations, and we can do them in assembly language.) There's more on interpreting numbers in text input in chapter 10 of the same online book.

If you've been experimenting with ideas, you might have found that there's sometimes a need to get something at the TOS out of the way temporarily to make stack manipulation easier; but you don't want to use a variable. So what do you do? You can use the other stack—the return (ie, hardware) stack. toR removes a cell from the top of the data stack and transfers it to the return stack. Rfrom goes the other direction, removing a cell from the return stack and transferring it to the data stack. They can be defined in the ways shown below.

However! To shorten the routines, here we use a scratchpad space, called "N", of usually 8 ZP bytes, mentioned 75% of the way down section 6. So why not just use N to save the top data-stack cell(s)? It's because we set rules that N is to only be used within a subroutine, and any usage of N must be finished and N must be freed up before another subroutine is called, and before the return from the current subroutine. (Obviously then if an interrupt hits and the ISR uses N, it must preserve it; but nothing else should be having to.) If these rules are kept, it works out quite well.


 ;                                                  clocks:

                    ;         The JSR calling it takes 6.
toR:    PLA         ; First pull the return addr off   4
        STA  N      ; the return stack and put it in   3
        PLA         ; N, the temporary area mentioned  4
        STA  N+1    ; 3/4 of the way down section 6.   3

        LDA  1,X    ; Now push the top data-stack      4
        PHA         ; cell onto the return stack.      3
        LDA  0,X    ;                                  4
        PHA         ;                                  3

        INX         ; Get rid of the now-unused cell   2
        INX         ; that was at the top of the       2
                    ; data stack.
        LDA  N+1    ;                                  3
        PHA         ;                                  3
        LDA  N      ; Push the return addr back        3
        PHA         ; onto the return stack.           3

        RTS         ;                                  6
 ;------------------                          total:  56


                    ;         The JSR calling it takes 6.
R_from: PLY         ; Pull return addr low off stack   4
        PLA         ; followed by return addr high.    4
        INY         ; Increment return addr low and    2
        STY  N      ; store it for indirect JMP below. 3
        BNE  rf1    ; If INY made it roll over to 00,  3|2
        INA         ; increment return addr high, too. 0|2
 rf1:   STA  N+1    ;                                  3

        DEX         ; Prepare a new cell space on      2
        DEX         ; the data stack.                  2

        PLA         ; Take desired data off the return 4
        STA  0,X    ; stack and put in on the data     4
        PLA         ; stack, low byte then high byte.  4
        STA  1,X    ;                                  4

        JMP  (N)    ;                                  6
 ;------------------                  total:  usually 51

As I was discussing this with Jeff Laughton during the writing, he brought up some unorthodox ways of shortening it and speeding it up, as he has a knack for doing both in hardware and sofware. However, it's still not very efficient on 6502. It's much more efficient on the 65816. If you don't need them often, or you want better performance, you could make them macros, at 8 bytes each (which is 5 more than a JSR takes), and 18 and 20 clocks:



toR:    MACRO      ; ( n -- )                       clocks:
        LDA  1,X   ; Copy high byte to return          4
        PHA        ; (hardware) stack, then            3
        LDA  0,X   ; do the low byte.                  4
        PHA        ;                                   3
        INX        ; Finally, free-up that             2
        INX        ; cell on the data stack.           2
        ENDM       ;                           total: 18
 ;------------------

Rfrom:  MACRO      ; ( -- n )
        DEX        ; First, prepare a new cell         2
        DEX        ; space on the data stack.          2
        PLA        ; Get low byte off return stack     4
        STA  0,X   ; and put it on the data stack,     4
        PLA        ; then to the same with the         4
        STA  1,X   ; high byte.                        4
        ENDM       ;                           total: 20
 ;------------------

Let's look at using the data stack to calculate an address of a field of a record in an array, and then copy a string variable to that field. Suppose:

The array is 14KB long.
The records in the array are 23 bytes each.
There are four fields in each record:
- The 1st field is 2 bytes.
- The 2nd field is 3 bytes.
- The 3rd field is 16 bytes. We will copy string variable _STR1 to this record.
- The 4th field is 2 bytes.
We want record #469. (That number was set up before entering the subroutine below.)


File_The_String:        ; ( Record# -- )  (Record number previously calc'ed.)
        LITERAL 23      ; Multiply the record number by
        JSR  STAR       ; the number of bytes in each record.

        JSR  _ArrayBase ; Add that to the
        JSR  PLUS       ; array's base address.

        LITERAL 5       ; Add the field offset to get the
        JSR  PLUS       ; actual address of the desired field.

        JSR  _STR1      ; Put the address of STR1 on the stack,
        JSR  SWAP       ; then change the parameter order for CMOVE.
        LITERAL 16      ; ^ From_addr To_addr length  (the order CMOVE needs)
        JMP  CMOVE      ; (JSR, RTS)  Done.  ^ <empty>
 ;------------------

I just put in 16 as the number of bytes for CMOVE to move, as it's easier than finding the string count. If the string is shorter, extra bytes past the end will get moved, but it won't do any damage. There's no requirement that it be a string—it's just that that was the assumption here.

Modifications can be made, for example to have the offset into a record be an input parameter on the data stack:



File_The_String:        ; ( Record# offset -- )
        JSR  SWAP       ; Put the record number at the top to handle first.
        LITERAL 23      ; Multiply the record number by
        JSR  STAR       ; the number of bytes in each record.

        JSR  _ArrayBase ; Add that to the
        JSR  PLUS       ; array's base address.

                        ; Add the field offset to get the
        JSR  PLUS       ; actual address of the desired field.

        JSR  _STR1      ; Put the address of STR1 on the stack,
        JSR  SWAP       ; then change the parameter order for CMOVE.
        LITERAL 16      ; ^ From_addr To_addr length  (the order CMOVE needs)
        JMP  CMOVE      ; (JSR, RTS)  Done.  ^ <empty>
 ;------------------

or to merge the addition of the array base and the addition of the field offset into the record:



File_The_String:             ; ( Record# -- )  (Record number previously calc'ed.)
        LITERAL 23           ; Multiply the record number by
        JSR  STAR            ; the number of bytes in each record.

        LITERAL  ArrayBase+5 ; Add that to the array's base
        JSR  PLUS            ; address and the field offset.

        JSR  _STR1           ; Put the address of STR1 on the stack,
        JSR  SWAP            ; then change the parameter order for CMOVE.
        LITERAL 16           ; ^ From_addr To_addr length  (the order CMOVE needs)
        JMP  CMOVE           ; (JSR, RTS)  Done.  ^ <empty>
 ;------------------

Note that I'm differentiating the following this way:

ArrayBase (with no leading underscore) is the assembler label for the location of the array itself.
_ArrayBase (with leading underscore) is the label for a subroutine to put the address of the array on the data stack.

The same goes for STR1 and _STR1. You can show the difference any way you like of course. This is only a suggestion. If you use it enough times in the code, the subroutine (which takes three bytes to call) will save program bytes compared to LITERAL which is a macro that assembles aJSR PUSH_LITinstruction followed by the 16-bit number to put on the data stack (for a total of five bytes per occurrence). LITERAL will execute the most slowly, followed by a subroutine to put the number on the stack directly (without using LITERAL), and of course the fastest is to straight-line it, even if in a macro.

Or how about finding the address of a record in a two-dimensional array, and then reading it and putting its contents on the data stack. Suppose:

The array is 12KB long.
It has 68 rows (numbered 0-67) by 90 columns (numbered 0-89) of two-byte records.


                           ; ( row# col# -- n )
        JSR  SWAP          ; Get row# on top, and
        LITERAL 90         ; multiply it by the number
        JSR  STAR          ; of columns in a row, then
        JSR  PLUS          ; add the desired col# number
                           ; to get the record number.
        JSR  _2STAR        ; Now multiply by 2, since there
                           ; are two bytes per record.
        LITERAL ArrayBase  ; Add address of base of array
        JSR  PLUS          ; to get address of desired record.

        JSR  FETCH         ; Replace that addr with its contents.

Loops with 16-bit index, limit, and increment size can be done using the data stack too. The 6502 is not very efficient at the 16-bit stuff, but sometimes the job needs to be done and the added programmer productivity of having a ready-made set of tools for it is well worth it. The 65816 is of course much more efficient at handling 16-bit quantities, and you can have for example:



        LDY  #1240
loop:      <do_stuff>
           <do_stuff>
           INY
           CPY  #5804
        BNE  loop

It takes far more instructions to do it on the 6502 though. I have FOR...NEXT macros to do that in the section of this website on program-structure macros, at http://wilsonminesco.com/StructureMacros/ and the form that the above example would take on is:



        FOR_Y  1240, UP_TO, 5804   ; (C32 requires commas between parameters.)
           <do_stuff>
           <do_stuff>
        NEXT_Y

and the macros hide the ugly internal details (although you're still in complete control of them). But now suppose you want to use the data stack with 16-bit cells for loop parameters. There are tools in Appendix A for that, called DO and LOOP and a few related ones. The general form (again referring to the example above) is:



        LITERAL  5804   ; Put the loop limit on the data stack.
        LITERAL  1240   ; Put the initial loop index on the data stack.
        DO
           <do_stuff>
           <do_stuff>
        LOOP

where DO and LOOP are macros that assembledoand loop and the branch addresses, not needing labels. (How to have the assembler do this will be covered in section 17, and we will discuss how to set up a stack in the assembler itself to be able to nest structures.) Loop limit and index don't have to be literals of course. They will often be calculated on the data stack before arriving at DO instead. Note the order, which is like a tag on a gift, which might say, "To Emily, From Bob", with the loop limit going on the data stack before the initial loop index. The index will get incremented or decremented until it crosses the line between the limit and limit minus 1. We'll see a little more on this near the bottom of section 11. At run time, do takes the parameters off the data stack and puts them on the return stack, optionally following the loop end address. Since the looping controls are using the stacks and not registers or variables, loops can be nested several levels deep without trouble. Again it's the nature of stacks.

Appendix C is a slightly off-topic piece about RPN calculators, a production test application of controlling instrumentation on the workbench, plus assembly language, and Forth, taken from an email I sent to a smart nephew who was thinking about what direction he wanted to go in programming and robotics as he started college.

7. inlined data <--Previous | Next--> 9. RPN efficiency

last updated Feb 4, 2024