home   |   the links mine   |   6502 primer    |   large math look-up tables    |   65c02 assembly structure macros   |   simple multitask

intro: integer math & tables   |   math table files' descriptions   |   rational-number approximations   |   how Intel Hex files work   |   how the files were calc'd & formed



Large Look-up Tables for Hyperfast, Accurate, 16-Bit Fixed-Point/Scaled-Integer Math


On this page:

Introduction & Contents

Floating-point math puts a big overhead on a computer that doesn't have a floating-point coprocessor; but we really do need floating-point for all but the simplest operations-- or do we??  Can we really do things like trig, log, and square-root functions without floating point?

The answer of course is yes, we can do it, and very well.

In 1987 I wrote a small set of 7-digit decimal floating-point functions in 6502 assembly for a product at work.  What a lot of clock cycles it took!  A couple of years later I was introduced to fixed-point and scaled-integer math, how you typically use 16-bit cells (sometimes with 32-bit intermediate results) on 8-bit computers, and scale the needed range of numbers to what the cells can handle.  At first I was skeptical; but eventually I came to prefer it (in systems with no floating-point coprocessor).  I wrote the software for some automated test equipment without floating-point.  There will always be a place for floating-point, especially with calculators where the applicable ranges cannot be anticipated; but I have mostly quit using floating-point for data acquisition and control for everything up to and including the fast fourier transform (FFT) for spectral analysis.  Digital signal processing is frequently done without floating-point.

But if the reason to use fixed-point and scaled-integer math is to gain performance, what if we go further and make large look-up tables for the functions that would otherwise require a lot calculations to get?  Some functions can be done surprisingly accurately by interpolating from a smallish table; but the interpolation requires multiplications and divisions too.  Memory today is large enough and cheap enough to have lots of tables of 128KB each (64K cells of two bytes each), so the 16-bit input number times two (which can be done without a real multiplication) gives the index value into a table of 16-bit answers.  (A few of the tables I have included are 64KB and 256KB.)  With big look-up tables, we can get greater speed and get all 16 bits correct, with no interpolating, and no error from using less-than-perfect algorithms to calculate the functions with limited precision.  In some cases, the look-up can be nearly a thousand times as fast as calculating the function.

It's like having a fixed-point/scaled-integer math coprocessor!

If the processor doesn't have the address range for it, you could give your memory map a window of say 256 bytes into a much larger address space (using I/O to set the higher address bytes), or even go entirely through I/O.  Even if you were to use serial and go through a 6522's synchronous serial port and shift registers, with three wires (clock, data, and addr_latch/data_load), it is still much faster than actually calculating the answers.  At a minimum, you would have to shift out three bytes of address and shift in two bytes of data, or shift the address out twice (incremented in between) and read a single byte of data each time.  It would take three 74HC595's serial-to-parallel shift registers for the address, a '165 parallel-to-serial shift register for the data, and a '126 since the '165 doesn't have a tri-stratable serial output.

Unexpected accuracy and range without floating point

So how do you get the accuracy and range without floating point?  After all, if you divide 10 by 3, you'll get 3, which is a huge error, right?  There's a way around it.  Suppose the 10 is a voltage on a power supply that will never exceed 14V, and you measured it with a 10-bit A/D converter that gives its 1023 maximum output at 15V * 1023 / 1024.  (IOW, 1024 would be 15V, but that takes 11 bits and we only have 10, and %1111111111 is 14.985V.)  So each lsb represents 14.65mV.  10V will give an A/D converter output of 683.  Now dealing with bigger numbers, the accuracy and precision suddenly improve.

What if you want to take 60/63rds of that number?  If you divide by 63 you'll get 10, then multiply by 60 and you'll get 600, hardly even close to the 650 correct answer.  Instead, do the multiplication first, getting 40,980, then the division, getting 650.  If you're using signed numbers for some of your work, you'll notice that 40,980 is actually -24,556 (ie, negative) in 16-bit signed numbers; but for this kind of operation, we typically keep a double-precision intermediate result, and positive numbers in a 32-bit answer go up to +2,147,483,647, meaning there's a ton of headroom left-- more than enough for even a 12- or 14-bit A/D converter.  (Of course, there's nothing keeping you from using triple- or quad-precision on those rare occasions that warrant it.)

And by the way, real-world I/O in control situations is never floating-point, whether timers, counters, A/D and D/A converters, servos, etc..

This leads into rational numbers, or numbers that can be expressed with good accuracy as a ratio of two integers.  To multiply by π (3.1415926...) for example, you multiply your input number by 355 and get a 32-bit intermediate result, then divide that by 113 and get a 16-bit final result.  You can optionally handle the remainder after the division for more accuracy.  The error in this fraction is only 0.0000085%.

A chemist friend didn't think it would work because he frequently needs to use Avogadro's number (6.0221415x1023).  I explained that since he probably doesn't need more than 4 or 5 significant digits, he can just scale it.  You don't need 24 digits to represent it in fixed-point/scaled-integer any more than you do in floating-point!  Just moving the decimal point over, you can fit 60221 in 16 bits, or, if it has to be signed, use 30111.  The scale factor must be kept for when it's time for results output, but the burden is transferred to the programmer, freeing the processor to deliver better performance and reduced complexity.

Similarly, if you had to deal with pF (picoFarad, or .000000000001 Farad), if 1pF was the smallest unit you needed, you would scale it so 1pF is represented by 1, and 1000pF (or 1nF) is represented by 1,000, etc..  You definitely will not be needing more than five digits in measuring capacitance, regardless of the range you're in!  The same goes if you're measuring leakage current at the inputs of a FET-input op amp, in fA (femtoAmps, or .000000000000001 Amp).  Think of a digital multimeter which, although it has only 3.5 digits, can change ranges and get 0.1mV resolution in the 200mV range, and 1V resolution in the 2,000V range.  Even though it only has 3.5 digits, it can handle a 20,000,000:1 voltage range.  Or think of an analog multimeter with various scales marked on the meter.  One is marked "25" full-scale, and you know if that means 25µA, 0.25V, 250V, etc. by what range you're in.  IOW, the user sets the scale to make the best use of the range of readings the meter can give for the application.

Keep in mind that the scale factor doesn't necessarily have to be evenly divisible 16, 10, 5, 2, or even an integer at all.  See the situation with degrees and radians further down.  Also, the six log and antilog tables can be used for any base, by scaling.  In fixed-point, they work best in base 2; but from there, you get the natural (base e) log for example by multiplying the base-2 log by 7050/10171 (or just cranking that number into the scale factor).  The table descriptions page gives supporting information.

Of course you can make your standard single-precision cell to be more than 16 bits, like 20, 24, 32, etc.; but 16 works well for a lot of applications with 8-bit computers, and the tables provided here are for 16.

Dollars and cents accuracy even in hex

How about handling the cents accurately in money calculations?  You don't even need decimal.  Hexadecimal works just fine, contrary to the arguments of some.  A dollar is represented as 100 (64h), not 1.00, and a cent is just 1.  If you need tenths of cents, then represent the dollar as 1,000 (3E8h) instead.  This does not keep you from converting to a more-normal representation (like $28.39) when it's time for human-readable output.  In the internal representation, $.99 times two is still $1.98, $2.49 + $14.40 is still $16.89, etc..

Angles in trigonometry

Uh-oh, now how about angles for trig?  A 16-bit circle works out nicely, scaling 360° to be exactly 65,536 counts, with 1 lsb being .00493164°, or .32959 arc-minutes, or 19.7754 arc-seconds.  The MOD function doesn't even come up, as there's no need for it.  359° plus 2° is still 1°, and 45° minus 90° is still -45° or 315°.  You can consider it signed or unsigned, either way.  90° is represented by $4000, 180° is represented by $8000, and 270° or -90° is represented by $C000.  A radian is represented by $28BE.  When you take the sin or cos, the output range will be ±1, which, for best resolution, we can scale to as high as ±$7FFF or $8000 depending on how you want to work the trapping (since you can't get +$8000 in a 16-bit signed number).

There's a Wikipedia article on binary angular measure (BAM) and "brads" (binary radians) here, and Jack Crenshaw addresses it in chapter 5 of his fine book "Math Toolkit for Real-Time Programming."

What about tangent, since the function has a nasty habit of going to ±infinity and ±90°?  Just return the sin & cos both, as a rational number.  Even infinity is represented, because you can have the denominator (the cos portion) as 0.  What you do with infinity is your business <laugh>, but it can be represented!

Easy conversion to and from any base

You may be wondering about base conversions for when you want decimal input and output even though the computer internally uses hexadecimal.  After all, we're also talking about handling things like decimal points and signs for input and output.  It's not particularly difficult; and you can use the same routines to convert to and from any base, changing only the number in variable BASE.  It does require multiplication and division, but you only do it when it's time for human I/O, and otherwise let the computer do its business efficiently in hex.  The processor does not even need any decimal-mode arithmetic instructions like the 6502 & '816 have. 

Here's an explanation of how, simplified not quite to the point of lying.  It should give the basic understanding so you can write suitable routines.  For inputting numbers from a text string (e.g. typed in from a keyboard), initialize a number as 0 to build on, then take the first digit in the string and add its value to the hex number you're building.  Continue until you're out of digits, each time multiplying the build by the number in variable BASE and then adding the new digit's value to the build.  If you encounter a decimal point, keep track of how many digits were after it.  In Forth, the decimal point automatically makes the result double-precision; but you can convert back to single if you want to.  If there was a minus sign, record that too.

For converting hex numbers to other bases for output (which will normally be a string), initialize a blank string.  You will build it from right to left.  Divide your number by what's in variable BASE, and use the remainder to add a text digit to the build, even if it's a 0.  Keep doing that until there's 0 in the number.  You can add a decimal point or other characters between digits, e.g. 12.345 or 12:36:40 (actually you might want to change BASE from 10 to 6 and back for the time readout, if you started with a number of seconds!)

The way Forth does this output number formatting is somewhat explained starting at about the middle of the page of chapter 5 of Leo Brodie's "Starting Forth" (with ANS updates here, and they're mostly calling single-precision 32-bit, like I want for the 65Org32 with all 32-bit registers).

What's in the files

Subroutines or look-up tables could be made for virtually any function, scaling the inputs and outputs to take advantage of the resolution available in your standard cell size.  The tables here have a standard cell size of 16 bits, or two bytes in an 8-bit computer. /big>

I will improve this as I get feedback, but for now it should be complete enough that anyone who really wants to take advantage of it can.  I can supply 1Mx8 EPROM pairs with the tables pre-programmed into them.  Many thanks to Tony (Nightmaretony on the 6502.org forum) for helping get them programmed on his Needham's Electronics EMP-10 programmer, and for donating some ST Microelectronics M27C801-100's for this.  Until I run out of those, the price will only be what it costs to ship them.  I'm in kind of a race to get more material posted on my website while my job is slow.  If you want to use the data to program your own ROMs or load into RAM, be my guest.  That's what it's here for.  If you have questions or comments or find problems, please email me at the address at the bottom.


AboutHexFiles.html has a description of all the hex files, including differences between similarly named ones and how to change the bank number if desired.  (I'm calling 64KB a bank, as on the 65816.)  It could also be called "HowToUse.html" because it also includes some helps on how to get the extras, like

I might later provide more application code, but for now, any reader with some strength in math will be able to see what needs to be done to take advantage of the tables and meet his need, and the hard part (forming the tables themselves) is already done for him.  The page also has my first usage of a cool way to show equations in your html files, from this web page.

Intel Hex is of course a text file, and most of the hex files here are 360KB in length, for 128KB of ROM space.  The longest ones are INVERT.HEX and SQUARE.HEX at 720KB for 256KB of ROM space each.  INVERT.HEX is partly for getting inverses (which makes division faster, because you can mulitply by the inverse).  The byte order of two- and four-byte cells is reversed, ie, "little-endian," low byte first, for normal 6502 and 65816 operation, one of the things these processors do to improve performance.

   file name    table size    comments
   SQUARE.HEX     256KB    partly for multiplication.  32-bit output
   INVERT.HEX     256KB    partly for division, to mulitply by the inverse.  32-bit output.
   SIN.HEX        128KB    sines, also for cosines and tangents
   ASIN.HEX       128KB    arcsines, also for arccosines
   ATAN.HEX        64KB    ends at 1st cell of LOG2.HEX (next)
   LOG2.HEX       128KB    also for logarithms in other bases
   ALOG2.HEX      128KB    also for  antilogs  in other bases
   LOG2-A.HEX     128KB    logs of 1 to 1+65535/65536 (ie, 1.9999847), first range for LOG2(X+1) where X starts at 0
   ALOG2-A.HEX    128KB    antilogs of 0 to 65535/65536 (ie, .9999847), the first range for 2x-1
   LOG2-B.HEX     128KB    logs of 1 to 1+65535/1,048,576 (ie, 1.06249905), a 16x zoom-in range for LOG2(X+1)
   ALOG2-B.HEX    128KB    antilogs of 0 to 65535/1,048,576 (ie, .06249905), a 16x zoom-in range for 2x-1
   SQRT1.HEX       64KB    square roots,  8-bit truncated output
   SQRT2.HEX       64KB    square roots,  8-bit  rounded  output 
   SQRT3.HEX      128KB    square roots, 16-bit  rounded  output
   BITREV.HEX     128KB    set of bit-reversing tables, up to 14-bit, particularly useful for FFTs
   BITREV15.HEX   128KB    15-bit bit-reversing table (not included in EPROM)
   MULT.HEX       128KB    multiplication table like you had in 3rd grade, but up to 255x255

   MathTbls.zip            all the tables, zipped, including BITREV15.HEX which is not in the supplied EPROMs
   ROM0.HEX                a single Intel Hex file for ROM0 as I plan to supply it (also available zipped)
   ROM1.HEX                a single Intel Hex file for ROM1 as I plan to supply it (also available zipped)

                           --> The bottom of the table files' descriptions page has a summary of which
                               tables are in which EPROM, along with addresses in those EPROMs.

Since my own Needham's EPROM programmer software seems to have a bug in it for large files (and Needham's is out of business), I went on a quick search and found SRecord 1.60.  Initially I just wanted to confirm that my Intel Hex files were valid and error-free as far as Intel Hex goes; but this software also lets you transform EPROM file types, concatenate, split, etc..  Version 1.52 is in the Ubuntu (Linux) software center for one-click download and installation.  Enter "EPROM" for a search term and it comes right up.  SRecord is command-line-only, which initially made it confusing because I didn't see any new icons and couldn't find it under "Applications".  The voluminous .pdf manual could stand to have better command-line examples, but you'll figure it out.  Much of the manual is spent on telling about multitudes of file types you will never use, so there's not really that much that you have to read.

RationalApprox.html has some commonly needed rational-number (or fraction) approximations.  The example of π (pi) was given above.  Another example would be that you want to multiply by √2, so you multiply by 239, and divide the double-precision intermediate result by 169.  The error of this pair is only -.00088%.  If you want it closer, use 19601/13860.  The error of this pair is only +.00000013%.  And as always, integers still allow you to handle the remainder from the division for minimizing or avoiding rounding errors in especially the smaller numbers.  RationalApprox.html also shows how I found the numbers using my HP-71 hand-held computer.

The automated test equipment (ATE) shown in my project pages on 6502.org did not use any floating point, even for logarithms (for dB).  I didn't have the tables yet (and that much ROM took a lot more room back then), so I calculated them when necessary instead of looking them up.

CalcMethods.html, for the curious, tells how I calculated the tables and formed the files using my HP-71 hand-held computer. 


Ideas for interfacing

normal on-the-bus method:  The easiest way is of course to just put the EPROMs on the bus, or load the files into RAM, if you have a processor with megabytes of address space like the 65816.

memory-map window into larger address space:  In the case of a processor that only has a 16-bit data bus like the 6502, you could use a smaller window into a larger address space.  When that window is addressed, the larger address space it looks into gets enabled, and output bits from your I/O ICs complete the higher address bits needed by the table memory.  The window could be 256 bytes for example, for the low address byte of the EPROM bus on the other side of the window, and a couple of 8-bit output ports would feed the middle and high address bytes.

parallel I/O ports:  Another possibility of course it to put it all on I/O ports, and write the address to the output ports and read the data from an input port.  If the extra current drain from having one EPROM active all the time is not a problem, you won't even need to select and de-select them-- just set the address and read it.

If that takes up all your parallel I/O, there's another solution.  Read on.

Synchronous-serial:  Computer already built up?  Not enough memory space?  Parallel I/O already taken?  Don't want to wait until you can build your next computer to incorporate the look-up tables?  Serial is of course not the first choice for speed, but it may be your ticket to implement the tables, and it's still well over an order of magnitude faster (not to mention more accurate) than having to actually calculate the math functions on a 6502.  It takes 16 clocks' time to shift each byte of address and data, and, although we use delays for some of it, the processor can be getting the next byte ready to go while the current one is being shifted.

A solution that may not be so obvious is to use the synchronous-serial port of a 6522 VIA for example.  If you have 6522, chances are, you haven't used the CA and CB pins for anything else anyway.  To set the address, use SR (shift register) mode 110 (which you set in the ACR), shift out under control of Φ2, and to read the data, SR mode 010, shift in under control of Φ2.

Here's a diagram to show the idea, since the shift-register interface method is less well known among hobbyists:
three-wire synchronous-serial interface
This circuit is almost identical to one I have working.  It's msb-first in all cases, the 6522's SR, the 74HC165, and the 74HC595, so you don't have to reverse any bit orders, either in hardware or software.  Set direction/load line (which you can put on CA2 of the 6522) low to shift out the address into the 595's, then set it high to read the data from the '165.  The low-to-high transition transfers the '595 shift-register contents to the output latches, and the high level freezes the parallel input of the '165 and allows it to be shifted out.  The high or low output of CA2 is set in the PCR, bits 3, 2, and 1.  Use 110 for low output and 111 for high output.

The three 74HC595's give 24 output bits, a little more than any EPROM needs for address lines.  If you use 1Mx8 EPROMs, you will need a couple of EPROMs, and they have 20 address lines.  You could use a couple of the '595 output bits to select which EPROM you're reading instead of adding more address-decode logic, but I don't particularly recommend it because if you accidentally enable both EPROMs on power-up before the 595's are initialized, you'll have bus contention, along with possibly high currents and hot parts.  Using the one additional address bit (for 21) and feeding the CS\ and OE\ of one of the EPROMs through an inverter to make sure they can't both be enabled at once would probably be best.

The reason you see the 165's output run back around to its input is that it is putting out the first bit (msb) before the first rising clock edge, then that first edge makes it put out the second bit, meaning you'll lose the first bit if you don't put it back around to the input and make it rotate around again.  To deal with this, you can either connect the EPROM data lines to the 165's parallel input rotated over one bit position, or rotate it back into correct position in software with something like:

      LSR  A    ; Put lsb into C flag.
      PHP       ; Save the flag.
      ROL  A    ; Put the byte back where it was, so you can
      PLP       ; get the C flag back and
      ROR  A    ; rotate it into the high bit.
                ; Now bits 65432107 are 76543210 as they should be.
the five instructions combined taking 13 clocks on a 6502.


So how do we look up a value?

normal on-the-bus method:  The details will depend on the hardware; but take for example the first hardware senario above, where the memory containing the large look-up tables is directly on a 65816's bus.  Then take a sine function, where in 16-bit scaled-integer math the angle is scaled such that 65,536 counts represents the full circle of 360°.  The table has 65,536 answers pre-calculated, two bytes each, so the address of the answer will be twice the input value, plus the base address of the table (which, as supplied here, is $2:0000).  So one way to do it is:

                            ; Start with the input number in the 16-bit accumulator.                  number of clocks:
SIN:     ASL  A             ; Double the input number by shifting left one bit position,                     2
         STA  TBL_ADR       ; and store the low 16 bits in the DP variable that we'll use as a pointer.      4
         LDA  #SIN_TBL_BANK ; Get the bank number where the table starts.                                    3
         ADC  #0            ; If the ASL above left the C flag set, increment the bank number.               3
         STA  TBL_ADR+2     ; Store it in the bank byte of the pointer variable.                             4
         LDA  [TBL_ADR]     ; Read the sine value.  The two bytes of the answer will never straddle the      7
         RTS                ;                                                             bank boundary.
 ;------------------
The routine takes 23 clocks (2.3µs @ 10MHz), or 35 clocks (3.5µs @ 10MHz) if you include the JSR & RTS pair.  That is extremely fast compared to having to calculate the sine function with a lot of multiplications and divisions!

There are different ways to do it.  By not putting the accumulator temporarily into 8-bit mode, the above uses an extra byte in the direct-page variable (DP is like ZP on 6502, but movable on the 65816) for a total of four instead of three bytes, wasting a byte of DP memory in order to save a byte of program memory and three clocks' execution time.  (REP & SEP each take two bytes and three clocks, and you won't make up that much by using a 8-bit accumulator to streamline the handling of the bank byte.) 


memory-map window into larger address space:  This senario probably implies a 16-bit address bus and 8-bit data bus like the 65c02 has.  We can extend the above, something like this (on 65c02):

                             ; Start with the input number's low byte in A and high byte in Y.        number of clocks:
SIN:   ASL  A                ; Double the input number by shifting left one bit position.                    2
       TAX                   ; The low byte will be the index value in the 256-byte window into the          2
                             ;                                               larger table memory space.
       TYA                   ; Since you can't rotate in Y, transfer the input number's high byte to A,      2
       ROL  A                ; then rotate to continue the doubling into the next address byte.              2
       STA  TBL_ADR_MID_BYTE ; TBL_ADR_MID_BYTE is an 8-bit output port going to the ROM's address bits      4
                             ;                                                             8 through 15.
       LDA  #SIN_TBL_BANK    ; Get the bank number where the table starts.                                   2
       ADC  #0               ; If the ROL above left the C flag set, increment the bank number.              2
       STA  TBL_ADR_HI_BYTE  ; TBL_ADR_HI_BYTE is an 8-bit output port going to the ROM's address bits       4
                             ;                                                               16 and up.
       LDA  TBL_PAGE,X       ; TBL_PAGE is the address of the beginning of the page that is a window         4
                             ; into table memory.  Read the low byte of the answer into A,
       LDY  TBL_PAGE+1,X     ; and the high byte into Y.  (Doing the +1 on TBL_PAGE is faster than INX.)     4

       RTS
 ;------------------
The routine takes 28 clocks (2.8µs @ 10MHz), or 40 clocks (4µs @ 10MHz) if you include the JSR & RTS pair.  Curiously, it is not much slower than the senario above where the memory was actually on the µP's own wider bus!  (The hardware is more involved though.)


parallel I/O ports:  The code for this is very similar:

                             ; Start with the input number's low byte in A and high byte in Y.        number of clocks:
SIN:   ASL  A                ; Double the input number by shifting left one bit position.                    2
       STA  TBL_ADR_LO_BYTE  ; TBL_ADR_LO_BYTE is an 8-bit output port going to the ROM's A0:A7.             4
       TAX                   ; If it's a write-only port, keep a copy to increment to read the answer's      2
                             ;                                                       second byte later.
       TYA                   ; Since you can't rotate in Y, transfer the input number's high byte to A,      2
       ROL  A                ; then rotate to continue the doubling into the next address byte.              2
       STA  TBL_ADR_MID_BYTE ; TBL_ADR_MID_BYTE is an 8-bit output port going to the ROM's address bits      4
                             ;                                                             8 through 15.
       LDA  #SIN_TBL_BANK    ; Get the bank number where the table starts.                                   2
       ADC  #0               ; If the ROL above left the C flag set, increment the bank number.              2
       STA  TBL_ADR_HI_BYTE  ; TBL_ADR_HI_BYTE is an 8-bit output port going to the ROM's address bits       4
                             ;                                                               16 and up.
       LDA  TBL_DATA_PORT    ; Read the low byte of the answer, then                                         4
       INX                   ; increment the low address byte (remember that no answer will straddle a       2
       STX  TBL_ADR_LO_BYTE  ; page boudary, so carrying is not a concern), and store it                     4
       LDY  TBL_DATA_PORT    ; and then read the high byte of the answer into Y.                             4

       RTS
 ;------------------
The routine takes 38 clocks (3.8µs @ 10MHz), or 50 clocks (5µs @ 10MHz) if you include the JSR & RTS pair, so the speed is still not suffering much, and it is still extremely fast compared to having to calculate the sine function with a lot of multiplications and divisions!


Synchronous-serial:  First we do a one-time set up the 6522 VIA's SR and define a couple of subroutines (I have not tried this 6502 code exactly as is, but it's mostly like something else that's already working):


SR_SETUP:
        LDA  VIA_SR        ; Reset the 6522 VIA's shift register (SR),
        LDA  VIA_ACR
        AND  #11000011b
        STA  VIA_ACR       ; then move right into SR_OUT below.
 ;------------------
                           ;                                                                          number of clocks:
SR_OUT: LDA  #1100b        ; Put CA2 in output mode, outputting a low level to turn                          2
        STA  VIA_PCR       ; off the 74HC126 and put the '165 into load mode.                                4

        LDA  VIA_ACR                                                                                         4 
        ORA  #00011000b                                                                                      2
        AND  #11111011b                                                                                      2
        STA  VIA_ACR       ; Put the SR in mode 110, shifting out under control of Φ2.                       4
        RTS                ; The whole subroutine takes 30 clocks, including the JSR-RTS pair.
 ;------------------

SR_IN:  LDA  VIA_ACR                                                                                         4
        ORA  #00001000b                                                                                      2
        AND  #11101011b                                                                                      2
        STA  VIA_ACR       ; Put the SR in mode 101, shifting in under control of Φ2.                        4

        LDA  #1100b                                                                                          2
        STA  VIA_PCR       ; Make sure CA2 is outputting a low to load data into the '165,                   4

        LDA  #1110b                                                                                          2
        STA  VIA_PCR       ; then back high to enable the '126 and shift the data out to the VIA SR.         4
                           ; RCV puts CA2 low again at the end, to turn off the '126.
        LDA  VIA_SR        ; Do a dummy read to make the SR start shifting.  The byte you read will be       4
                           ; left from shifting the address out.  There's no need to keep it.
        RTS                ; The whole subroutine takes 40 clocks, including the JSR-RTS pair.
 ;------------------
                           ; Actually, you could use 164's instead of 595's, then you wouldn't have to
CA2_HI_PULSE:              ; strobe them.  Reading the data will mess up the addr with 164's, but by then
        LDA  #1110b        ; the data will already have been loaded into the 165, so it shouldn't matter.    2
        STA  VIA_PCR       ; Make CA2 high                                                                   4
        AND  #11111100b    ; and then low again                                                              2
        STA  VIA_PCR       ; to strobe the address into the 74HC595s' output latches.                        4
        RTS                ; The whole subroutine takes 18 clocks including the JSR-RTS pair, which is
                           ; fine because when it's needed, we need to let a byte finish shifting.
 ;------------------

TBL_ADL:    DFS  1         ; One-byte variables: table address low, middle, and high bytes.
TBL_ADM:    DFS  1         ; Since we have to read twice to get two bytes, and feed the updated address
TBL_ADH:    DFS  1         ; between, we don't want to have to re-calculate the address.
TBL_DA_LO:  DFS  1         ; Data low  byte read from the table
TBL_BANK:   DFS  1         ; Starting bank number for table of interest.  Set up before calling 128KB_LOOKUP below.

The main routine 128KB_LOOKUP below is for all tables that have 64K 2-byte answers, filling two banks and no more.  IOW, the one for the arctangent for example will need some adjustment because it has its last answer in the first cell of the log2 table which has no use for a log-of-zero answer.  The cycle counts assume the variables are not in ZP.  If you put them in ZP you'll save a few clocks but nothing significant.  A JSR-RTS is used quite a few times for delay, giving more delay than needed, so you could save some more if you replace them with the right number of NOPs.  As is, it takes 356 clocks, including JSR & RTS, if I counted right, which is under 36µs @ 10MHz, much slower than the other methods above but still well over an order of magnitude faster than actually calculating the answers on demand (except for multiplication).  Again, the routine is to give an idea of how to do it.  I have not built up all the interfacing method possibilities to actually test the code, but what's here is similar to another setup I already have running, so there should be little need for debugging.
                           ;                                                                          number of clocks:
128KB_LOOKUP:              ; Start with input number's low byte in A, high byte in Y, TBL_BANK set.
        TAX                ; Save A since SR_OUT writes over it.  (X is used again for something             2
        JSR  SR_OUT        ; else below too.)  Get ready to send the ROM address.                           30
        TXA                ; Restore the input number's low byte.                                            2

        ASL  A             ; Double the input number by shifting left one bit position.                      2
        STA  VIA_SR        ; Send out the low address byte of the ROM.                                       4
        STA  TBL_ADL       ; Also store it in a variable for later getting the 2nd data byte.                4

        TYA                ; Since you can't rotate in Y, transfer the input number's high byte to A,        2
        ROL  A             ; then rotate (with C flag) to continue the doubling into the next address byte.  2
        NOP                ; Give time to finish shifting low address byte out before giving it another.     2
        STA  TBL_ADM       ; While we're waiting, store the middle address byte for later getting the 2nd    4
        STA  VIA_SR        ; data byte.  Send out the middle address byte for the ROM.  It takes 16          4
                           ; clocks to shift 8 bits in or out.
        LDA  TBL_BANK      ; Get the bank number where the table starts.                                     4
        ADC  #0            ; If the ROL above left the C flag set, increment the bank number.                2
        JSR  end           ; Just JSR-RTS to give time to finish shifting this address byte out before      12
        STA  VIA_SR        ; giving it another.  Send out the high address byte for the ROM.  The JSR end    4
        STA  TBL_ADH       ; gave a little extra.  Now save the high addr byte for later getting the 2nd     4
        JSR  end           ; data byte.  JSR end again to give time to finish shifting hi addr byte out.    12

        JSR  CA2_HI_PULSE  ; Strobe the address into the 74HC595s' output latches.                          18

        JSR  SR_IN         ; Get ready to bring data in now, and do a dummy read to get data started.       40
        JSR  end           ; Give enough time for SR to finish shifting, then                               12
        LDA  VIA_SR        ; read the first (low) data byte.  (JSR end actually gives extra time.)           4
        STA  TBL_DA_LO     ; Store the first (low) byte of the answer.                                       4

        JSR  SR_OUT        ; Get ready to send the ROM address out again, although we'll increment it.      30
        LDA  TBL_ADL       ; Get back the low address byte we derived earlier,                               4
        INA                ; increment it to read the high byte of the answer, and                           2
        STA  VIA_SR        ; send it out the SR.  The address increment won't carry, because no answer       4
                           ; straddles a page or bank boundary.
        JSR  end           ; Again, give time to complete a byte shift before giving another byte to SR.    12
        LDA  TBL_ADM       ; Get back the middle address byte derived earlier, and                           4
        STA  VIA_SR        ; send it out to the SR.                                                          4

        JSR  end           ;                                                                                12
        LDA  TBL_ADH       ; Get back the high address byte derived earlier, and                             4
        STA  VIA_SR        ; send it out to the SR.                                                          4

        JSR  end           ; Need more time for shift to finish.  (This gives more than we need.)           12
        JSR  CA2_HI_PULSE  ; Strobe the address into the 74HC595s' output latches.                          18
        JSR  SR_IN         ; Get ready to bring data in now, and do a dummy read to get data started.       40
        JSR  end           ; Give enough time for SR to finish shifting, then                               12
        LDA  VIA_SR        ; read the first (low) data byte.  (JSR end actually gives extra time.)           4
                           ; (This listing is for data connections rotated one bit position as discussed
                           ; in the hardware section above.)
 end:   RTS                ; At the end, the high byte of the answer is in A and the low byte in TBL_DA_LO.
 ;------------------



last updated Nov 19, 2012               contact: Garth Wilson, wilsonmines@dslextreme.com