To quickly test bit 6 or 7 of a byte in memory, use the BIT instruction, which tests them without affecting A, X, or Y. Regardless of what's in A, BIT FOOBAR will put bit 7 of FOOBAR's contents into the N flag, and bit 6 into the V flag, so you can branch on these bits without loading first. A good example is flag variables, where you use a byte as just a "yes" or "no" record. "Branch if the 'ACKnowledge' record is a yes," without affecting A, X, or Y, is as simple as:
        BIT  ACK_FLAG
        BMI  _____
(If you have a CMOS 6502 made by Rockwell or WDC and the byte is in ZP, you can even do it in a single BBS7 instruction.)
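Here's a minimal sketch of that single-instruction form. It assumes a Rockwell or WDC 65c02 and an assembler that accepts the common `BBS7 zp,label` operand order (some assemblers write it differently). The zero-page address $20 and the label names are hypothetical, chosen only for illustration.

```asm
ACK_FLAG: EQU  $20               ; hypothetical zero-page address of the flag byte

        BBS7 ACK_FLAG, got_ack   ; branch if bit 7 of ACK_FLAG is set; A, X, Y untouched
        ; ...execution continues here when the record is a "no"...
got_ack:
        ; ...execution arrives here directly when the record is a "yes"...
```

BBS7 is three bytes, against four for the BIT, BMI pair, and unlike BIT it leaves the status flags alone.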
BTW, there's a summary of the differences between the NMOS 6502 and the CMOS 65c02 on this site, here. The CMOS really does offer a lot of improvements.
To clear the flag, use STZ (STore Zero) on the CMOS 6502, so you don't have to LDA #0 first:
        STZ  ACK_FLAG
To set it, if you know there's a byte in one of the registers that has its high bit set, use that, and you won't need to use LDA #$FF first:
STA ACK_FLAG ; (or store X or Y, whichever has the high bit set)
If you know for sure it's already 0, just decrement it:
        DEC  ACK_FLAG
Actually, it doesn't have to start out as 0; if the flag is already set, decrementing it again leaves the high bit set, provided you don't decrement it so many times that the high bit gets cleared.
BIT is really nice for testing port bits. If you need to quickly test an I/O line, such as the data line when you're bit-banging a synchronous-serial connection, put it on bit 6 or bit 7 of a parallel port, so you can just do:
        BIT  PORT_A
        BMI  ____       ; (or BPL or BVC or BVS, as appropriate)
This avoids the need to load the port byte first (which would affect one of the registers, probably the accumulator), or to load a bit mask as you would have to do for other bits with the BIT instruction, or even to load and AND as you'd have to do if there were no BIT instruction. (Obviously you'll change the port name as appropriate.)
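For comparison, here's a sketch of testing a bit other than 6 or 7 with BIT, which does need the mask in A first. The port address and the choice of bit 3 are assumptions made only for illustration.

```asm
PORT_A: EQU  $6001          ; hypothetical port address

        LDA  #%00001000     ; mask for bit 3 (the extra load that bits 6 and 7 avoid)
        BIT  PORT_A         ; Z = 1 if (A AND port) = 0; N and V still get port bits 7 and 6
        BNE  bit3_high      ; taken when bit 3 of the port reads 1
        ; ...bit 3 reads 0 here...
bit3_high:
        ; ...bit 3 reads 1 when arriving via the branch...
```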
For the CMOS 6502, TSB and TRB let you test and then set or clear bits in memory in a single instruction, after the desired mask is in the accumulator.
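A short sketch of how TSB and TRB might be used, assuming a 65c02; the port address and the choice of bit 2 are hypothetical. In both instructions the Z flag reports the AND of the accumulator with the memory byte as it was before the write, and A itself is not changed.

```asm
PORT_A: EQU  $6001          ; hypothetical port address

        LDA  #%00000100     ; mask selecting bit 2
        TSB  PORT_A         ; test bit 2 as it was, then set it
        BEQ  was_clear      ; Z = 1 means bit 2 was 0 before the TSB
        ; ...bit 2 was already 1...
was_clear:
        TRB  PORT_A         ; A still holds the mask: test bit 2 again, then clear it
```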
If you need to quickly toggle an I/O bit that has a known value, put it on bit 0 of a parallel port. If the bit normally sits at 0 (low voltage),
        INC  PORT_A
        DEC  PORT_A
will produce a fast positive pulse with only two instructions and without affecting A, X, Y, or the other bits of the port. Similarly, if it normally sits at 1 (high voltage),
        DEC  PORT_A
        INC  PORT_A
will produce a fast low pulse. For a bonus, the final INC or DEC will put bit 7 in the N flag, so you can pulse bit 0 and test bit 7 at the same time. (Note that bit directions don't have to all be the same for a port. You can have some pins be inputs while other ones are outputs, at the same time.) If you needed bits 0 and 1 toggling out of phase, you could INC and DEC the port between values 1 and 2 (01 and 10 in binary).
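The out-of-phase case and the bit-7 bonus can be sketched together as follows. The port address is hypothetical, and the sketch assumes that bits 0 and 1 are outputs currently holding %01, that bit 7 is an input, and that the I/O chip returns the pin level when an input bit is read.

```asm
PORT_A: EQU  $6001      ; hypothetical port address

        INC  PORT_A     ; %01 -> %10: bit 0 falls while bit 1 rises
        DEC  PORT_A     ; %10 -> %01: bit 0 rises while bit 1 falls
        BMI  bit7_high  ; N = bit 7 of the value the DEC read and wrote back
        ; ...bit 7 was low...
bit7_high:
        ; ...bit 7 was high when arriving via the branch...
```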
Use the single-byte instruction PHA, PHX, or PHY to save a register's value temporarily while using the register for something else. (PHX, PHY, PLX, and PLY exist only on the CMOS 65c02.) When you're ready to bring it back, use the corresponding single-byte PLA, PLX, or PLY. Remember it's a last-on first-off stack though, so mind the order if you're putting multiple values there. Make sure that every branch path pulls exactly what was pushed, or you'll get stack programming errors.
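As a small illustration of the ordering, here's a sketch that saves and restores two registers, assuming a 65c02 for the PHX and PLX:

```asm
        PHA             ; save A first
        PHX             ; then X
        ; ...code that is free to change A and X...
        PLX             ; pull in reverse order: X comes off first
        PLA             ; then A
```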
If you want to test for a 1 and don't need to keep the value, decrement it and then do BEQ or BNE. The decrement might be with DEC, DEA, DEX, or DEY. Decrementing a register takes the same amount of time as the compare-immediate instructions, but you'll save a byte. If the value is in memory, the DEC_abs is two bytes shorter than LDA_abs, CMP#, although it takes the same amount of time. Other addressing modes are available too of course.
If you want to test for an $FF and don't need to keep the value, increment it and then do BEQ or BNE. The increment might be with INC, INA, INX, or INY. Incrementing a register takes the same amount of time as the compare-immediate instructions, but you'll save a byte. If the value is in memory, the INC_abs is two bytes shorter than LDA_abs, CMP#, although it takes the same amount of time. Other addressing modes are available too of course.
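Both tests might look like this in practice. COUNT and the label names are hypothetical, and in each case the original value is lost, as noted above.

```asm
COUNT:  EQU  $0300      ; hypothetical RAM variable

        DEC  COUNT      ; COUNT was 1 only if it is now 0
        BEQ  was_one    ; replaces LDA COUNT, CMP #1, BEQ

        INX             ; X was $FF only if it is now 0
        BEQ  was_ff     ; replaces CPX #$FF, BEQ
        ; ...neither condition was true...
was_one:
was_ff:
```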
Need a SWN (SWap Nybble) instruction? Here it is, thanks to David Galloway. It takes 8 bytes and 12 clocks:
        ASL  A
        ADC  #$80
        ROL  A
        ASL  A
        ADC  #$80
        ROL  A
Need a slick delay? Take this one from Bruce Clark. The delay is 9*(256*A+Y)+8 cycles (plus 12 more for JSR & RTS if you make it a subroutine). This assumes that the BCS does not cross a page boundary.
loop:   CPY  #1
        DEY
        SBC  #0
        BCS  loop
He writes: "A and Y are the high and low bytes (respectively) of a 16-bit value; multiply that 16-bit value by 9, then add 8 and you get the cycle count. So the delay can range from 8 to 589823 cycles, with a resolution of 9 cycles. One of the nice things about this code is that it's easy to figure out what values to put in A and Y when you want a delay of, e.g. (approximately) 10000 cycles." Here's the same thing with my structure macros (the resulting machine code being identical):
        BEGIN
            CPY  #1
            DEY
            SBC  #0
        UNTIL_CARRY_CLEAR
There's more at http://6502org.wikidot.com/software-delay. In fact, that wiki, although not very big, has a lot of great resources for this kind of thing. Check out also the source code repository on 6502.org.
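To make the delay arithmetic concrete, here's a sketch for a target of roughly 10000 cycles using Bruce Clark's loop. (10000 - 8) / 9 is about 1110, which is $0456, so A = $04 and Y = $56, and the loop then takes 9*1110 + 8 = 9998 cycles. The label name is arbitrary, and the sketch assumes decimal mode is off and that the BCS does not cross a page boundary.

```asm
        LDA  #$04       ; high byte of 1110 ($0456)
        LDY  #$56       ; low byte of 1110
dly:    CPY  #1         ; C = 0 only when Y = 0, so the SBC borrows from A on that pass
        DEY
        SBC  #0
        BCS  dly        ; exits when A underflows; loop total is 9*1110 + 8 = 9998 cycles
```

The two immediate loads add 4 more cycles, for 10002 in all.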
Avoid commonly wasted instructions:
1. An automatic compare-to-zero is built into the following 65c02 instructions: LDA, LDX, LDY, INC, INX, INY, DEC, DEX, DEY, INA, DEA, AND, ORA, EOR, ASL, LSR, ROL, ROR, PLA, PLX, PLY, SBC, ADC, TAX, TXA, TAY, TYA, and TSX. This means that, for example, a CMP #0 after an LDA is redundant, a wasted instruction. The only time a 65c02 (CMOS) needs a compare-to-zero instruction after one of these is if you want to test a register that was not involved in the previous instruction; for example,
        DEY
        CPX  #0
(Note the Y and the X are not the same register.) If you can spare a register to which you can transfer the one you want to test, you can save a byte with the transfer instead of a compare instruction. The example above, if the contents of A don't need to be kept, could be changed to:
        DEY
        TXA
and then you can branch on the N or Z flag which tell if X was negative or zero. The TXA isn't any faster (both TXA and CPX# take two clocks); but TXA takes only one byte, whereas the CPX #0 takes two bytes.
The NMOS 6502 did have a bug in that the flags weren't always correct after a decimal-mode operation like ADC; so then you might have to follow it with the CMP #0 to get the N and Z flags right. It's best to just use the CMOS processor.
2. Similarly, if you want a compare to $80 strictly for branching on the N flag results, you can omit the compare-to-$80 instruction and branch on the opposite state of the N flag. For example
        DEA             ; (same thing as DEC A)
        CMP  #$80
        BMI  <label>
can be replaced with
        DEA             ; (same thing as DEC A)
        BPL  <label>
3. If you have a CMOS 6502 (65c02), take advantage of the extra CMOS instructions and addressing modes. The 65C02 is not just a low-power version of the NMOS 6502. Besides having more instructions and addressing modes, the CMOS version fixed all of the NMOS 6502 bugs. I compiled all the many differences between the CMOS and NMOS 6502 in this article.
4. When the end of a routine has JSR immediately followed by RTS, you can usually replace the pair with JMP, and put in the comments,
JMP <subroutine_addr> ; (JSR, RTS)
JSR, RTS takes 12 clocks. JMP absolute takes 3. The single jump above, together with the RTS at the end of the other subroutine, gives you the same execution effect in most circumstances but saves execution time and one byte (i.e., an RTS instruction). (The exception is covered in chapter 6, "Parameter-passing methods," of the stacks treatise, about 40% of the way down the page, in the paragraphs right before and after the short listing for the subroutine called "GEOMEAN".) Something else you can take advantage of is that there's also a JMP(addr) and, on the CMOS part, a JMP(addr,X).
5. The 6502 interrupt sequence automatically pushes the processor status register on the stack, and restores it as part of the return-from-interrupt (RTI) instruction. There is no need to start an interrupt-service routine (ISR) with PHP and end it with PLP. There is also no need to set the interrupt-disable flag at the beginning of an interrupt (using SEI). That is automatic too, part of the interrupt sequence, immediately after pushing the processor status register P onto the stack. And, since the previous status is restored by the RTI, do not re-enable interrupts just before RTI.
6. When practical, set up loops such that the counter ends at 00 or decrements to FF to finish the loop, so you can branch on the Z or N flag condition and don't need to add a compare-immediate instruction.
7. In ISRs, don't waste time saving and restoring registers the ISR itself won't use or disturb. Also, don't waste time polling interrupt sources that are not enabled. (This is covered much more thoroughly in my interrupts primer.)
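As an illustration of item 6, here's a sketch of a loop that counts down so the branch can use the N flag directly. BUF, its address, and its 16-byte length are hypothetical.

```asm
BUF:    EQU  $0300      ; hypothetical 16-byte buffer

        LDX  #15        ; start at the last index
        LDA  #0
clr:    STA  BUF,X      ; clear BUF+15 down through BUF+0
        DEX             ; sets N when X wraps from 0 to $FF
        BPL  clr        ; no CPX needed; works for starting indexes up to $7F
```

Counting up from 0 instead would need a CPX #16 before the branch on every pass.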
9/22/16: I came across a bunch of additional tips like these in the article "6502 Hacks" by Mark S. Ackerman on .pdf pages 111-114 of the voluminous (778MB!) scanned volume 12 of Dr. Dobb's Journal, available at http://6502.org/documents/publications/dr_dobbs_journal/dr_dobbs_journal_vol_12.pdf (Again, be warned: it's a huge download. There's a ton of other great material in this 1040-page scan too though.)
4/28/17: I came across a page on the Nesdev wiki about synthetic instructions, for example comparing A to X, here. Note that some of this is more efficient to do in self-modifying code if the program material is in RAM where you can write to instructions' operand space (or even to the op codes themselves). I hope to post an article in the future on self-modifying code, as I believe there's a significant amount of power available there that most of us have been ignoring. The Nesdev wiki "Programming guide" index is
And as with any programming language:
If you still have a dot-matrix impact printer that uses fanfold paper, it will be nice for printing long program listings. Get it out. You'll be glad you haven't gotten rid of it yet. When our daughter-in-law who just graduated in computer science complained about the page breaks with the school's laser printers being a pain for programming, I suggested cutting off the bottom margin of each page and taping the pages together, bottom of one to top of the next, which is what she ended up doing.
For explanation of the V (oVerflow) flag, see the tutorial on it on 6502.org.
You can find explanation of the B (Break) flag and its usage in this forum topic which also links to a couple of other discussions on BRK.
Bruce Clark has an extensive tutorial on doing comparisons on the 6502 here, including tricks, implications, signed & unsigned, decimal mode, multi-byte, and more.
BigEd on the forum observed, "With 6502, I suspect more than one beginner has wondered why they can't do arithmetic or logic operations on X or Y, or struggled to remember which addressing modes use which of the two. And then the intermediate 6502 programmer will be loading and saving X and Y while the expert always seems to have the right values already in place."
You can do temporary storage or pass parameters to subroutines by way of the hardware stack (in page 1) if it helps reduce RAM variable needs. The subroutine does not need to pull all the stack items off to access a byte some number of levels down. To index into the stack, you can do for example:
        TSX
        LDA  $105,X
Repeating TSX won't be necessary for continued accesses to various stack items throughout the routine. Just change the number before the ",X" above, and you won't have to keep incrementing and decrementing X either. For knowing what that number should be, it will be important to note that the stack pointer is decremented immediately after storing a byte onto the stack, so it points to the next available byte. (Remember that the stack grows down, not up.) LDA $101,X has the same effect as PLA PHA, getting the top stack byte into the accumulator without removing it from the stack; it is one byte longer (three bytes versus two, assuming you already did TSX) but faster (normally 4 clocks versus 7). The tactic becomes all the more valuable when you want to reach further into the stack. This technique is expanded upon in the stacks treatise, especially in chapters 5, 6, 13, 14, and 15.
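Here's a sketch of passing two byte parameters on the hardware stack and reading them by indexing. The routine name add2 and the values are hypothetical, and decimal mode is assumed to be off. After the JSR, the return address occupies $101,X and $102,X, so the parameter pushed last is at $103,X and the one pushed first is at $104,X.

```asm
        LDA  #$12
        PHA             ; first parameter
        LDA  #$34
        PHA             ; second parameter
        JSR  add2
        PLA             ; discard the second parameter
        PLA             ; A = $46, the sum left in the first parameter's slot
        ; ...continue with the result in A...

add2:   TSX
        CLC
        LDA  $103,X     ; second parameter (just above the return address)
        ADC  $104,X     ; plus the first parameter
        STA  $104,X     ; return the sum in place of the first parameter
        RTS
```

Writing the result back into a parameter's stack slot lets the caller clean up the stack and collect the answer with the same two PLAs.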
A common criticism of the 6502 is that the stack space is so limiting. A few higher-level languages (notoriously Pascal) do put very large pieces of data and entire functions and procedures on the stack instead of just their addresses. For most programming though, the 6502's stack is much roomier than you'll ever need. When you know you're accessing the stacks constantly but don't know what the maximum depth is you're using, the tendency is to go overboard and keep upping your estimate, "just to be sure." I did this for years myself, and finally decided to do some tests to find out. I filled the 6502 stack area with a constant value (maybe it was 00; I don't remember), ran a heavy-ish application with all the interrupts going too, did compiling, assembling, and interpreting while running other things in the background on interrupts, and after a while looked to see how much of the stack area had been written on. It wasn't really much: less than 20% of each of page 1 (return stack) and page 0 (data stack). This was in Forth, which makes heavy use of the stacks. The IRQ interrupt handlers were in Forth too, although the software RTC (run off a timer on NMI) was in assembly language.
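One way such a test might be set up for the page-1 stack is sketched below. It is not the code used for the test described above. The fill value $A5 and the label names are assumptions, the stack is assumed to start at $01FF, and a stack byte that happens to be written with the fill value will look unused, so the result is only an estimate.

```asm
FILL:   EQU  $A5            ; hypothetical fill value

; Run inline at reset, before any JSR or interrupt has used the stack.
        LDX  #0
        LDA  #FILL
fill1:  STA  $0100,X        ; paint all 256 bytes of page 1
        INX
        BNE  fill1

; Later, after exercising the application:
        LDX  #0
scan1:  LDA  $0100,X        ; scan upward from the bottom of the page
        CMP  #FILL
        BNE  scan2          ; first overwritten byte marks the deepest stack use
        INX
        BNE  scan1
scan2:  ; X = low byte of the deepest address written, so about $100 - X bytes were used.
        ; (X = 0 is ambiguous: either the whole page was used or nothing was overwritten.)
```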
10/1/15: I have a treatise on 6502 stacks (plural, not just the page-1 hardware stack), with 19 chapters plus appendices, starting from the basics and advancing through many unexpected aspects, stopping just as we get into multitasking, here.
5/16/14: I posted an article on simple methods to do multitasking on the 6502 without a multitasking OS, with methods that are suitable for hard realtime work and keep the outstanding interrupt performance (unlike the usual situation with pre-emptive multitasking OSs).
I'll take the opportunity here to make a few more recommendations that are my own opinion. I will warn you up front that even some experienced assembly-language programmers may disagree with me; but if you're just starting and are not entrenched in contrary habits yet, I think these will be beneficial:
TIMER: DFS 4
PHI_ACCUM: DFS 2
INC_RATE: DFS 2
THRESH_INC: DFS 2
ALT_TM: DFS 2
ALT_INTERVAL: DFS 2
PORTA_REC: DFS 1
becomes:
TIMER:         DFS  4     ; <comments>
PHI_ACCUM:     DFS  2     ; <comments>
INC_RATE:      DFS  2     ; <comments>
THRESH_INC:    DFS  2     ; <comments>
ALT_TM:        DFS  2     ; <comments>
ALT_INTERVAL:  DFS  2     ; <comments>
PORTA_REC:     DFS  1     ; <comments>
which is much more readable. ("DFS" in the C32 assembler stands for "DeFine Storage," i.e., a RAM variable, allotting as many bytes to it as the number following the "DFS" says.)