Mr Sprite Tech

(C) 2012-2013 by Antoine VIGNAU and Olivier ZARDINI

> What is Mr Sprite Tech ?

This article is a technical presentation of the sofware Mr Sprite and its inner algorithm.

We hope that the technical information given here and the available source code will help you in your own projects.
 


> Apple IIgs Hardware

CPU

The Apple IIgs cpu is the 65c816, a 16 bit microprocessor using a 8 bits data bus and a 24 bits address bus. The good news is that it can address MB of RAM and handle 16 bit numbers, the bad news is that all it can do is read or write one BYTE after the other. So writing 16 bits into memory will cost more cycles than just writing 8.

The 65816 is a low tech microprocessor, just a 16 bit '6502' :
- only one generic register (A), two index registers (X, Y) plus extra one-purpose registers (Stack, Status, Direct Page...).
- no multiplication, division or advanced processing instructions (Arithmetic Shift for more than one position...)
- no post-incrementation of the registers
- no trace mode or administrator mode
- the 16 bit registers keep the code into 64 KB boundaries ($01/FFFF + 1 = $01/0000). Up to you to jump from one bank to another one.
- the Stack and the Direct page are 'locked' into bank $00
- a very low frequency (2.8 Mhz for the IIgs, 3.5 Mhz for the Super Nintendo...)
- an Apple IIgs memory organization that forces a 1 MHz access for any writes in banks $E0 or $E1   8-[

A few good news :
- a very compact object code (one BYTE per Opcode)
- a number of cycles per instruction very low (2 for basic operations, 5 or 6 for memory reads / writes)
- a progressive
cycle cost for addressing modes (small for Direct Page or Stack, medium for current bank  and large for other banks)

Two miracles :
- The Shadowing lets you work in bank $01 (or others) and what you do there is mirrored to bank $E1. Not at full speed, but it is still better than working directly in a low speed bank.
-  For Auxiliary Ram compatibility purpose (I guess ?), the Stack and the Direct Page my be located anywhere in Bank $01. Amen.

The Apple IIgs is running a 65c816 microprocessor at 2.5 MHz (2.8 MHz from code located in Rom). Per second, we have 2,500,000 cycles. The screen is refreshed 60 times per seconds, so the number of cycles available per scan is about 41666.
Adding a accelerator card  helps a lot, but all the accesses to $E0 and $E1 are still at 1MHz.

GRAPHICS

Unlike the processor that was considered 'light' when it was delivered (yes, Macintosh and other folks were already running 32 bits 68000 at 8 MHz ...), the Apple IIgs graphic capabilities were cool.

The good:

- 320x200, 16 colors / line, 4096 in the palette / 32 KB size. No trick or complex memory organisation. We have 4 bits per pixels, 2 pixels / bytes. All linear, the begining of line N start at the end of Line N-1.
- 16 Palettes, so 256 colors on the screen without CPU usage or complex palettes switch.
- 1 SCB / line : how to quickly choose palette, resolution, Fill Mode and Interruption for each line.
- Fill Mode. Never really used in the past decades, but probably plenty of fun things to do with with it on a rainy day.
- Few softswitchs indicating the line
currently refreshed by the spot. We really need such information to obtain synchronized animations.
- All graphic related stuff packed together in memory (graphic page, scb, palettes) so dumping this memory area gives us a easy to use graphic file.
- 16 colors in 640 mode. Not real ones but the dithering process gives us a colorful Finder in 16 colors where Amiga guys had a 4 colors and the Atari ST folks only 3 !

The bad :
- The graphic page address is fixed, in $E1/2000. No possibility to choose a dynamic address. Good bye hardware scroll...
- Only ONE graphic page. Impossible to prepare a picture on Page #1 while Page #2 is displayed and switch to Page #1 when ready, unlike we did on the Apple IIe.

The ugly :
- The graphic page address is... $E1/2000, yes, in one of the two slow memory Banks ! Not only we have 32 KB to update with a 8 bit Data bus microprocessor but they decided to put the graphic page at a location we can only access at 1 Mhz    8-[

Conclusion

Is this as bad as it looks ?

No, hopefully. Of course we don't have all the cpu power we would dream of (Apple IIe is 1 Mhz for 8KB graphic Page, Apple IIgs is 2.5 Mhz / 32 KB) but the advantage of the IIgs is its simplicity. Everything is easy to do and designed in the easiest way for the programmer. If you remember how colors were coded in Apple II's HGR (1 bit to decide the color scheme, 1 bit for the pixel to be ON or OFF plus other funny things like the ODD / EVEN columns and the fact than the graphic page is not linear) we can say THANK YOU for the Apple IIgs graphic page organisation. Anyone, after one afternoon of work, can display something in the Apple IIgs graphic page.

Looking to other similar computers (Atari ST, Amiga, Macintosh & PC), the Apple IIgs is not  that far from the top. In 1986, a Macintosh is a black & white computer, with no joystick connector and nothing done for low level programming. The PC has a CGA / EGA graphic card, no sound and so many configurations (CPU frequencies) that we can't really speak about 'computer' but 'family of'. The closest machines were the Atari ST and the Amiga. With the Apple IIgs they were sharing the same vision, a multi-media computer for home, affordable, and multi-purposes. None of these computers received any professionnal software (Excel, Word, Lotus 123...) and all the three died in the late 90s, due to the move of PC computers as multi-media machines (Doom killed the market, showing that the PC was now THE game machine). Since the Apple IIe reign, for the first time, you have with a PC the multi-media and the professional software together on a single machine. Before that, the Apple IIgs was a winner for education in the USA. In Europe the Atari ST was very popular for musicians (due to built-in Midi connectors) and the Amiga was the best for demos and homebrew graphics (DeluxePaint) & musics (Modules). Games were very good of the 3 machines. The European market of the Atari ST & Amiga drives them into arcade style games and the USA market drives the Apple IIgs more into adventure style games.

For graphic capabilities, let's say games & animation, the Amiga comes first, followed by the Atari ST and so the Apple IIgs. Anyway, all are close together, despite Amiga dedicated graphic CPUs (Blitter, Coper). Even if their 68000 at 7/8 Mhz is much more powerful than our 65c816 at 2.5 Mhz, their graphic page organisation consumes a lot of power. For the Amiga (7 Mhz, 40 KB graphic page), the graphic page is divided into 5 different areas, each coding 1 of the 5 bit plan. With 32 colors on the screen, you need 5 bitmap area of 320x200 bit each (8 KB). So, if you want to display 1 pixel using color 1F, you have to update in 5 different memory locations, the 5 bits. Because you have a 68000, you must use the .B (BYTE) adressing mode because the other ones (.W and .L) don't want to write to ODD address. To summerize, you need 5 read / write to memory to switch on 1 pixel. On the Apple IIgs, you need juste 1 (LDA / [ORA] / STA). On the Atari ST (8 Mhz, 32 KB graphic page), they did not separate the bit plan into several areas but they mixed them together with an interleave of 1 WORD. So once again, if you want to display a pixel using color F (Atari ST has 16 colors), you have to update 4 consecutive WORD (4 read / write) : the 4 bits of 1 pixel is coded in 4 consecutive Words. The 65816 doesn't have issues with addressing ODD addresses so no need for a weird memory organisation to provide EVEN addresses only. The conclusion is that on Atari ST and Amiga, you spend your time and your energy to shift sprites bits to display them at the right position (as we did in the Apple IIe where we needed 7 sprites to cover all possible X positions). The other things that put the IIgs close to the 2 others is the 16 palettes / 256 colors mode we have. On the Atari ST or the Amiga, you have to program a palette switch, using interruptions. This is more complex and required few extra CPU power. The other advantage of the IIgs over the Atari ST and the Amiga is the expansion slots. You can expect from the Apple IIgs users to have 2 MB or more. The Atari STF is limited to 1 MB and most sold units were delivered with only 512 KB. Same thing with the Amiga 500 where 512 KB was the core market and few had 1 MB. The Atari ST / Amiga market was targeted for 512 KB machines. Some games were requiring 1 MB, but they were rare. With our 2 MB (or plus) plus an hard drive, the Apple IIgs can use pre-computed stuff to lighten cpu work where the Atari ST & Amiga guys have to compute in real time what they want to display.

So why games are much better on the Atari ST and the Amiga ?  At least 3 reasons. The first one, an probably the more important, has nothing to do with with technical considerations. Atari ST and Amiga were strong in the european market and were sold as game machines. So a lot of small companies entered into the business for creating games for these two low cost computers (in Europe the IIgs was 3 times more expensive than the ST or the Amiga). Competition is good for the quality and everyone was trying to do better than the other companies. Every new game was superior to the previous one. Because the two machines had similar hardware (68000 microprocessor, 320x200 resolution, same joystick, about the same number of colors on the screen), they were capable of limiting the effort by re-using the same code & graphics for the two computers (this also explains why the first Amiga games do not use the real capabilities of the machine but are lowered at the ST level).  More games = More gamers buying an Amiga or a ST computers so more companies entering in the market. So, if the games were very good on these platforms, it is firstly because of the number of qualified people trying hard to do so. How many good game programmers did we have on the IIgs ? Probably less than 10.  So how many games push the IIgs to its limit ? Probably the same number ! Have we really seen the limits of the IIgs ? Probably not.

Of course some technical criterias help to understand the gap between the 3 machines. Atari ST and Amiga can use any part of their memory as graphic page (the ST must starts its graphic page on 256 bytes boundaries = no hardware scroll possible). This means they have as many graphic pages as they want. You can prepare the picture to display on Page #2 while Page #1 is on the screen and immediately display Page #2 by modifying a vector. Impossible to do that on the IIgs because of the unique Graphic page. If you can switch from one page to another immediately, you can have smooth full screen scrolling (even if it is slow), not jerky scroll where you can see jaggies like we have on the Apple IIgs. The Amiga has a blitter, a hardware chip capable to copy memory blocks from one location to another applying logical operations (AND, OR...) on the fly. You can't imagine a better help to manipulate Sprites ! Even if the processor frequency is 10% lower than the Atari ST (7 Mhz for the Amiga, 8 Mhz for the Atari) and even if the graphic page is bigger on Amiga (40 KB for 320x200 in 32 colors) than on Atari ST (32 KB for 320x200 in 16 colors), the Amige keeps an advantage thanks to all its dedicated hardware. With the arrival of FPS (Wolfenstein 3D, Doom...), the fashion was 3D and texture mapping. For such rendering, the blitter can't help and the bit-plan organisation of the graphic memory is a nightmare (modifying 1 pixel = 4 or 5 read/writes). If Wolf 3D went to the Apple IIgs and not to the Atari ST or the Amiga, it is thanks to our linear 4 bits / pixels memory graphic organisation. It was the only one capable to handle this properly. Not a revenge, but a nice joke : the weakest computer was capable to score where the two champions failed. The legacy of Atari ST and Amiga games library (more than one thousand) is a very good thing for the Apple IIgs games programmers. We can pick up there what we need to build the Apple IIgs release of such games. The Atari ST version is helpful for graphics (320x200 16 colors) and the Amiga is helpful for sound effects (8 bit like the IIgs). In the past, many of Apple IIgs games were converted from Atari ST / Amiga versions (Space Harrier Demo, Morteville Manor Demo, Plotting, Teenage Queen, The Tinies, LemminGS...).

The Apple IIgs has not really been pushed to its limit in the time it was on the market (unlike the two other european folks). We have seen the light (Rastan, Sword of Sodam, Task Force...) but not really entered the room. This is why, two decades later, it is still fun to play with the IIgs to see what it can really do. At the opposite of our modern PCs which have nearly unlimited CPU Power, Memory size or Disk space, it is refreshing to go back to program on a 16 bit machine where you can touch the limit of the machine in everything you try to do.
    

> Displaying Sprites


A sprite is a visual object you want to display on a screen. It can be of any size and usually we define it by its width & height. We can draw a frame around it to define its limit (in sprite sheet) or isolate each sprite in its own picture :

Orginal Size (w=36, h=62) x4 Magnify x10 Magnify details with pixels separation enable (grid)
Even if a Sprite is defined by the frame containing all its pixels, the sprite itself is not a rectangle. Some pixels in the frame define the Sprites, the other ones are considered as 'background' (in light blue here).
The sprite is supposed to be displayed on a picture. The picture with no sprite is named 'background', as opposite to the sprites supposed to be displayed in the foreground.

+ =
Sprite + Background Picture = Foreground Sprite + Background Picture
If we directly cut / paste the sprite in the picture, we bring also the sprite background color (light blue), and we probably don't want that !
So we need extra information to define which pixels of the sprite picture is part of the sprite and which pixels are part of the background. We could use a color, like we did here, but because the Apple IIgs as only 16 colors for its screen, sacrifying one color to define the backgound area would force us to use only 15 colors for the sprites. We can do other way by defining a Mask, a negative version of the sprite :

Sprite Mask
Each black color pixel defines a valid sprite pixel and the white one define a background pixel. We can now use the Mask to properly insert the sprite into the background picture :

+ + =
Sprite + Mask + Background Picture = Foreground Sprite + Background Picture 

When you create Apple IIgs games, you provide the Sprite sheet and the Mask sheet together. To simplify Sprite & Mask identification, you organize them in the same order in both of your pictures. Here is a Sprite + Mask sheets example comming from LemminGS game :


Sometimes, you can compute the Mask picture from the Sprite sheet picture (if you are not using all the 16 colors). You save few KB on the disk because you only store one picture instead of two. But once in memory, you need both pictures to be able to display the Sprite as expected.

Because Apple IIgs use 16 colors / 4 bits per pixels graphic mode in 320x200 (resolution used by 99% of the games), each pixel is coded using a hexadecimal value from 0 to F. In our example, The Sprite pixels use values 1 (grey), 2 (light pink)  and 3 (black). The background color MUST use value 0 (light blue) :


The mask MUST use 2 values : 0 (black) for the Sprite pixels and F (white) for the background :


The background can use any 0-F values, we don't really care about :


If the pixel values for Sprite background and Mask are imperative, it is because the way the Masking process works :
- We use a AND to force the background pixels values to 0
- We use a OR to force the background pixels to the Sprite values

Let's see that :
      - First step, let's apply the Mask on the Background picture using a AND operation :  Intermediate = Background AND Mask

AND =
Background Picture AND Mask = Intermediate Picture (Background Picture with 0 values)
  The logical AND forces to 0 all pixels of the Background picture matching a 0 value from Mask (don't change values for pixels matching a F value). The AND operation 'clears' the background picture in the area where the sprite will be displayed.

      - Second step, we apply the Sprite on the Intermediate picture using a OR operation : Result = Intermediate OR Sprite

OR =
Intermediate Picture OR Sprite = Result Picture
 The logical OR changes all pixels of the Intermediate picture matching a non 0 value from Mask (don't change pixels matching a 0 value). The OR operation draws the sprite into the intermediate picture.

  To summarize, we obtain the Result Picture with a combinaison of 2 logical operations : Result = (Background AND Mask) OR Sprite

   Here we don't care about the colors (blue, green, white, black...). The only mandatory thing is to have 0 and F values for Mask and 0 for Sprite background. The AND / OR operations work only with these values.


> Classic Algorithm

The classic way to display a sprite on an Apple IIgs screen use the following Code sequence, for all the pixels of the sprites :

          LDA   BACKGROUND
          AND   MASK
          ORA   SPRITE
          STA   SCREEN

We draw usually directly in Bank $01 (faster because of Shadowing) from a unknown bank (the code displaying the sprite may be located anywhere, so we use LONG addressing mode) and we work in 16 bits to process 4 pixels / operations. We have to use indexed operations because we have to cover all sprites pixels. If we have Sprites in bank $04 and Mask in Bank $05, the code looks like :

          LDAL  $012000,X
          ANDL  $050000,X
          ORAL  $040000,X
          STAL  $012000,X

To draw 1 line of the Sprite, we have to loop over all pixels of the line, using sprite width as limit. We start the code by initializing the screen address where to display the sprite and the Sprite + Mask address :

Init      LDA   ScreenAddress    ; Screen Address ($2000-$9D00) where the sprite must be displayed
          STA   
SCREEN1+1
          STA   SCREEN2+1
          LDA   SpriteAddress    ; Initialiaze Sprite + Mask address
          STA   MASK+1
          LDA   SPRITE+1

          LDX   SpriteWidth-2    ; Sprite width in byte - 2 = (Sprite width in pixel / 2) - 2
SCREEN1   LDAL  $012000,X
MASK      ANDL  $050000,X
SPRITE    ORAL  $040000,X
SCREEN2   STAL  $012000,X
          DEX
          DEX
          BPL   SCREEN1

We have now to loop over all lines (sprite height) to draw the whole sprite :

Init      LDA   ScreenAddress    ; Screen Address ($2000-$9D00) where the sprite must be displayed
         
STA   SCREEN1+1
          STA   SCREEN2+1
          LDA   SpriteAddress    ; Initialiaze Sprite + Mask address
          STA   MASK+1
          LDA   SPRITE+1
          CLC

          LDY   SpriteHeight-1   ; Sprite height - 1

ONELINE   LDX   SpriteWidth-2    ; Sprite width in byte - 2 = (Sprite width in pixel / 2) - 2
SCREEN1   LDAL  $012000,X
MASK      ANDL  $050000,X
SPRITE    ORAL  $040000,X
SCREEN2   STAL  $012000,X
          DEX
          DEX
          BPL   SCREEN1

          LDA   SCREEN1+1        ; Update address for next line
          ADC   #$00A0
          STA   SCREEN1+1
          STA   SCREEN2+1
          LDA   MASK+1
          ADC   #$00A0
          STA   MASK+1
          STA   SPRITE+1
          DEY
         
BPL   ONELINE

          RTS

The green Code lines are only executed once.  The blue ones are executed once per Row (sprite height). The Red ones are executed for each 16 bits (4 pixels) values (sprite heigth * sprite width / 4).

If we choose to not count the green ones, the number of cycles required to display a sprite on the Apple IIgs screen is :

2            CLC                    ; Clear the carry because of the ADC #$00A0
3            LDY   SpriteHeight-1


3  ONELINE   LDX   SpriteWidth-2
6  SCREEN1   LDAL  $012000,X
6  MASK      ANDL  $050000,X
6  SPRITE    ORAL  $040000,X
6  SCREEN2   STAL  $012000,X
2            DEX
2            DEX
3|2          BPL   SCREEN1          ; 3 cycles if we branch, 2 if not

5            LDA   SCREEN1+1
3            ADC   #$00A0
5            STA   SCREEN1+1
5            STA   SCREEN2+1
5            LDA   MASK+1
3            ADC   #$00A0
5            STA   MASK+1
5            STA   SPRITE+1
2            DEY
3|2         
BPL   ONELINE          ; 3 cycles if we branch, 2 if not

The formula is :  #cycles = 5 + SpriteHeigth*(3+(SpriteWidth/4 * 31) - 1 + 41) - 1


For our Sprite 
(width=36, height=62), the result is 19968 cycles. A little bit more because we need to initialize the address for Sprite, Mask and screen. We have also to add the RTS at the end of the code.

This generic routine can be enhanced in many ways. First idea would be to move it into Bank $01, so the LDAL / STAL could be turned into LDA / STA. We would gain 2 cycles (one per instruction). The cycle number would drop to 18852, but a JSL / RTL + PHK / PLB cost few extra cycles. This is still about half of the number of cycles available per scan, if you expect 60 frames / second. 


> Mr Sprite's Algorithm

The weakness of the classic algorithm is linked to its heavy operations (LDA / AND / ORA / STA) for ALL 4 pixels blocks of the sprite. Furthermore, you have to add the post decrementation operations (DEX DEX) and the loop jump one (BPL). Bigger is the sprite, more important is the #cycles required to process the sprite. The important criteria is not the number of data pixels used by the sprite but the size of the frame around the sprite (sprite width x sprite heigth). If we count the number of cycles required for a 4 pixels block (16 bit), we found 6+6+6+6+2+2+3= 31 cycles, so about 31/4 = 7,75 cycles / pixels !

The purpose of a Sprite Compiler algorithm is to build a sprite specific code instead of a generic algorithm valid for any kind of sprite. A Sprite specific code packs together the sprite data (pixel color, width, height, background pixel) and the code used to display it. We do not have anymore the code in one location and the sprite data (+ Mask) in another place, we mix them both together as a code capable to display the sprite. In a sprite compiled code, there are no loop anymore. We save automatically the DEX DEX BPL part, so 7 cycles (22,5 % of the classic algorithm) !

Let's start with our Barbarian sprite. From a human point of view, the sprite is a set of colored pixels . We have chosen the green color to show the background part of the sprite :


Due to the Apple IIgs 16 colors mode, each pixel uses 4 bits, so we pack 2 pixels in a Byte.  
For the computer point of view, the sprite is a set of bytes :


There are 3 types of bytes :
        - The full background ones (2 green background pixels).
        - The mixed ones (1 green background pixel + 1 sprite data pixel).
        - The full sprite data ones (2 sprite data pixels).

We can remap our sprite picture by providing a new color mode :
        -  Green for the full background bytes
        -  Blue for the mixed bytes
        -  Red for the full sprite data bytes

 
This first approach helps us to define the work to do :
        - There is nothing to do for Green bytes : Let's ignore them.
        - We have to provide a full LDA / AND / ORA / STA for the Blue bytes because we have to take care of the background pixels (it is what we were doing in the classic algorithm).
        - We have to  provide a basic LDA / STA for the Red bytes : No need to take care of the background (no need for a Mask processing using AND / ORA).

Looking to our Barbarian sprite, it is easy to see than green and red take a huge % of the picture. Very good news ! Because green = nothing to do = no cycle needed = big save comparing to classic algorithm. This time, we only spend CPU cycles for sprite data pixels, not for background pixels.  For the Red ones, it is also a very good news because we replace our LDAL / ANDL / ORAL / STAL by a basic LDA / STAL :

3            LDA   #SpriteData    ; Sprite 4 pixels data (16 bit)
6            STAL  Screen,X


If we can process 2 red bytes together (16 bit), we drop down from 24 cycles (6+6+6+6) to only 3+6=9 (37,5 %) !

If we process the Red byte alone, we have to use 8 bits code :

2            LDA   #SpriteData    ; Sprite 2 pixels data (8 bit)
5            STAL  Screen,X


This time, we need only 2+5=7 cycles. As expected, it is more interesting to work with 16 bits code (9/4=2,25 cycles / pixel) than with 8 bit one (7/2=3,5 cycles / pixels). Everywhere we can, we have to use 16 bit code to optimize the ratio cycles / pixels (basic algorithm has a bad 7,75 cycles / pixel). We need 2 kinds of fast code :
        - A code to put sprite data on the screen without looking at the background (red area).
        - A code to provide AND / ORA processing between the screen and the sprite data (blue area).

Based on the the 65c816 instructions set, we can use :
        - For the red areas : STA / STX / STY / STZ instructions (Page Direct, Stack or X/Y indexed) or PHA / PHX / PHY / PHD / PEA / PEI / PER instructions.
        - For the blue areas : LDA / AND / ORA / STA instructions (Page Direct, Stack or X/Y indexed) or TSB / TRB (Direct Page) instructions.

The idea is of course to get the fastest instructions possible. Most of games we know (Sword of Sodam, Task Force...) use the LDA / STA Direct Page instructions. They are fast (from 3 to 5 cycles for a STA Direct Page) and easy to use. They both provide a solution for red and blue areas. We choose NOT to use them. We have prefered to use only Stack instructions :
        - PHA / PHX / PHY / PHD and PEA for the red areas.
        - LDA sr,S / AND / ORA / STA sr,S for the blue areas.

On red areas, the PHA / PHX / PHY and PHD (all 16 bit) use only 4 cycles ! The ratio is 1 cycle / pixel. You can't have a better one. The PEA instuction is only 5 cycles (for 16 bits), so the ratio is 5/4 = 1,25 cycle / pixel, still VERY good. The PEA instruction has the power of a LDA / STA because the Operand is the 16 bit value to put on the Stack. So a PEA $1234 acts like a LDA #$1234 / STA Screen. The LDA / STA uses at best 3+5=8cycles (16bits, Direct Page). The other big advantage of the stack related instruction set is the Post indexing. You push something at the top of the stack and the stack pointer is automatically updated. You don't have to care about incrementing an index.

On blue areas, we use the Stack related LDA $XX,S and the STA $XX,S where XX is a 00 to FF offset.  So we can address any memory location in a range of 256 bytes of the current Stack address. Looks like a LDA $Address,X where Address would be the Stack pointer address and X a value between 0 and 255. The LDA $XX,S instruction use 4 cycle in 8 bit mode and 5 cycles in 16 bit mode. The same number of cycles than the Direct Page LDA. The STA $XX,S uses also 4/5 cycles (as the Direct Page STA).  The AND / ORA use constant values (the Sprite Data), so the number of cycle is very low (2 cycles in 8 bit mode, 3 cycles in 16 bit mode).

So we understand easily that we can go VERY FAST on red areas pushing values on the stack and quite fast on blue areas where we can use LDA $XX,S / STA $XX,S instructions. The only thing we have to do is to put the stack pointer (Stack Register) at the right of the red area and call the PH* instructions (stack goes forward, from high to low address). Once at the left of the red area, we can use LDA $XX,S / STA $XX,S to process the remaining blue areas (now located at the right of the stack pointer, where the $XX,S can be used) :

Step 1 : Put the stack pointer at the end of red area (rightmost position)
Step 2 : Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 3Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 4Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 5Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 6Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 7Use a 8 bit LDA $0C,S / AND / ORA / STA $0C,S to process the rightmost blue area ($0C=14 bytes far from of the Stack pointer).

If there is empty parts (background bytes) in the middle of the sprite, we have to move the stack pointer over the gap :

Step 1 : Put the stack pointer at the end of the LEFTMOST red area (rightmost position)
Step 2 : Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 3Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 4Put the stack pointer at the end of NEXT RIGHT red area (rightmost position) : Stack Pointer = Stack Pointer + $0A
Step 5Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 6Use a PH* or a PEA instruction to put 16 bit red data on the screen. The stack pointer is automatically moved of 2 bytes to the left (lower address)
Step 7Use a 8 bit LDA $06,S / AND / ORA / STA $06,S to process the rightmost blue area ($06=6 bytes far from of the Stack pointer).

Once the line has been processed, we move the stack pointer to the next line and we process it. We always move the stack pointer forward (Stack Pointer = Stack Pointer + delta), never backward. So we only use ADD instructions in the algorithm, never SBC. SBC by itself is not a problem but if we alternate ADC (to jump to next line to process) and SBC (to move stack pointer on a line over the gap), we have everytime to provide a CLC (before the ADD instruction) and a SEC (before the SBC instruction). By choosing only one direction (ADD), we can clear the carry at the beginning of the algorithm (CLC) and never care about the carry anymore. We save 2 cycles (CLC) for every ADD operations. This may seems like small win but a good algorithm is full of this kind of small gain !

The stack usually works in Bank $00 but, because of retro-compatibility reason (Language card in the Apple II), it can also be put in Bank $01 (same thing for the Direct Page). Because the Graphic Page can also be put in Bank $01 if the Shadowing is enabled, the winning combination is :
        - Enable Shadowing (STZ $C035)
        - Disable Interrupts (SEI)
        - Put Stack and Direct Page in Bank $01 (LDA $C068 / ORA #$0030 / STA $C068)
        - Put Stack Pointer at the Target Screen location (LDA TargetScreen / TCS)

At the end of the code displaying the Sprite, you have to put everything
back  in order (Stack & Direct Page in Bank $00, enable interrupts, restore stack and Direct Page register...).

To optimize as much as we can our algorithm, we have to use carefully the 65816 instructions and the 8/16 bits mode :
        - Whenever it is possible, use 16 bit instead of 8 bit instructions for 2 bytes (or more) blue areas (LDA ,S / AND # / ORA # / STA ,S).
        - Use 8 bit  instructions for 1 byte area (red or blue).
        - Use the PH*/PEA instructions on large red areas.
        - Use LDA # / STA ,S on small (one or two bytes) read areas (a single LDA # / STA ,S  is less expensive than moving the stack pointer + using the PH* + moving the satck pointer to the next location).

To visualize the choices of the algorithm, we provide a new colored vision of our sprite, this time with 
OrangeYellow and Purple new color codes :

Orginal Byte Picture (36 x 62 pixels)
18 x 62 = 1116 Bytes
Color Code : Black = 0, Grey = 1, Light Pink = 2
Red / Blue Byte Picture
R=444 (39%), B=64 (5%), Background=608 (54%)
16 bit Patterns : 2222 (39), 0000 (39), 0022 ( 9)
Red / Orange / Purple / Yellow / Blue Byte Picture
R=
328 (29%), O=58 (5%), P=72 (6%), Y=25 (2%), B=25 (2%)
Background=608 (54%)

Here are the new color codes for Byte picture :

Color Code#Cycle / PixelColor Code explanation
 N / A  1 Background byte, nothing to do.
  (2+4) / 2 = 3 cycles / pixel    1 Sprite Byte with 2 pixels of Sprite Data. We use a 8 bit LDA #Data, STA $xx,S to update the screen.
  (4+2+2+4) / 2 = 6 cycles / pixel    1 Mixed Sprite /  Background Byte (1 pixel of each). We use a 8 bit LDA $xx,S / AND #Mask / ORA #Sprite / STA $xx,S to update the screen.
  (3+5) / 4 = 2 cycles / pixel    2 Sprite Bytes with 4 pixels of Sprite Data. We use a 16 bit LDA #Data, STA $xx,S to update the screen. There are not enough data to use PH* or PEA here.
  (5+3+3+5) / 4 = 4 cycles / pixel    2 Mixed Sprite /  Background Bytes (at least 1 pixel of background among these 4 pixels). We use a 16 bit LDA $xx,S / AND #Mask / ORA #Sprite / STA $xx,S to update the screen.
  (4 or 5) / 4= 1 or 1,25 cycle / pixel    4 Sprite Bytes with 16 pixels of Sprite Data. We can use PHx or PEA here because the number of bytes to update is large enough to legitimate the stack pointer update.

If we look at the ratio #cycles / byte, the Red bytes goes very fast due to the usage of 16 bits PH* / PEA.  So come the Orange (16 bit data storage) and the Yellow (8 bit data storage). Without any surprise, as soon as we have to deal with background data, the ratio is bad. The 16 bits (Purple) is nevertheless much better than the 8 bit one (Blue).  If we look to the % of each color bytes in the Barbarian Sprite, we can see than the group having a good #cycles / pixel ratio is the more important one (80 %) :

Color CodeNumber of Bytes% of Total Sprite Data Bytes
 328 64,5 %
 5811,4 %
 254,9 %
 7214,1 %
 254,9 %

The conversion of the Red / Blue Byte graphic into the Red / Orange / Yellow / Purple / Blue Byte graphic is done by applying few basic rules :
        - Two Blue close together become a double Purple (2 x 8 bit LDA / AND / ORA / STA = 1 x 16 bit LDA / AND / ORA / STA) 
        -  One Red surrounded by Green becomes a Yellow (8 bit Sprite data).
        - Two Red surrounded by Green or Blue become a double Orange (not enough Red to use the PHx / PEA).
        - A Blue and a Red close together become a double Purple if the Red can't be used in a better way with another close Red (to create a Orange or a large Red area)  
        - A Red area always requires a EVEN number of Red bytes (PHx and PEA push 16 bits on the stack). The extra one can be converted into Yellow or combine with a Blue to create a double Purple.
        - A Blue - Red - Red - Blue will be converted as Purple - Purple - Purple - Purple because it is a better choice than Blue - Orange - Orange - Blue (what we gain with the Orange is not enough compared to what we loose with the 2 Blue).

The optimization of the algorithm uses few tricks :
        - Because we don't use the Direct Page, we can use the D register, as well as X and Y (all 16 bits) , to store the most used 16 bits Patterns (4 pixels) of the Sprite. We will use them in the Red areas with a PHD / PHX / PHY but also in the Orange areas through a TDC, TXA, TYA (instead of a LDA #SpriteData).
        - If the color code of the sprite pixel is 0, we don't need the ORA in the code sequence LDA / AND / ORA / STA. The AND has already put the 0 value.
        - If the color code of the sprite pixel is F, we don't need the AND in the code sequence LDA / AND / ORA / STA. The ORA forces the value to F.
        -  We use only one CLC at the begining of the code. All the next to come ADC will work fine because the Carry is not used anywhere else.
        - Because the LDA / STA $XX,S adressing mode can cover 256 bytes from the stack pointer current position, we can process the current line but also the line below (1 line is $A0 long). So if there is no Red areas, we can process 2 lines before moving the stack pointer (2 lines below).
        - We gather all 8 bit instructions together to minimize the number of switchs between 16 bit and 8 bit modes (SEP #$20 / REP #$30).
        - We may use a LDA #SpriteData / PHA instead of a PEA $SpriteData if there are several other PHA with the same Pattern to insert in the current Red area.

We can see that the pixel color codes may be important to save few cycles by dropping an AND or an ORA in the LDA / AND / ORA / STA sequence : color codes 0 and F are the ones to choose for the pixels close to the sprite border. For the Barbarian sprite, the two colors would be Black and Light Pink.

The following animation shows you how the algorithm works to display the Barbarian picture :
        - It process the picture from the Top to the Bottom :
                - It process the Orange / Purple / Yellow / Blue bytes for the current line and the next one (if there are any) using LDA / STA or LDA / AND / ORA / STA. This is the cleanup step to ensure you only have Red areas on the current line and the next one.
                - It process all Red areas of the current line (if there are any) using high speed PH* / PEA.
                - It moves the Stack pointer to the next line (at the rightmost position of the first Red area).

 

The 2 BYTE graphics are created by Mr Sprite CODE command if you use the -V flag :

MrSprite.exe  CODE  -V  D:\Project\FullSprite.gif  639C42  000000  636363  FFBD9C

The Red / Blue Byte graphic file name suffix is  _B_rb and the Red / Orange / Purple / Yellow / Blue Byte graphic file name suffix is _B_ropyb. The width is half of the Sprite width because we use 1 pixel / Byte :

FullSprite.gifFullSprite_B_rb.gifFullSprite_B_ropyb.gif

Looking at the ropyb picture tells you immediatelly if the Sprite display will use a lot of cycles. If you see a lot of Red and very few Blue, this is good news. You can have a look to the Sprites available in the Performance section to compare each other. The bigger one is is not always the one requiring the largest number of cycles to be displayed.

The CODE command also provides some statistics about the number of colored Bytes and their %, for both rb and ropyb pictures :

  - Sprite BYTE rb    : 18x62 Total=1116, Red=444 (39%), Blue=64 (5%), Background=608 (54%).
  - Sprite BYTE ropyb : 18x62 Total=1116, Red=328 (29%), Orange=58 (5%), Purple=72 (6%), Yellow=25 (2%), Blue=25 (2%), Background=608 (54%).

The final line indicates the width*height of the Sprite, the size of the object code and the number of cycles required to run the code (=display the sprite) :

 => 36*62 pixels, 1483 bytes (2232), 2570 cycles (19968).

Is the sprite object code size (1483 bytes) very big compared to the memory space required to store the original Sprite picture (18*62 bytes) and its Mask (18*62 bytes) ? This is another good news from this study. We consume less space by coding a Sprite into 65816 instructions than having the Sprite as a picture in memory (because we also need the Mask picture, we have to count the size twice). Mr Sprite computes you both sizes = the size of the object code (1483) and the size of the Graphic + Mask (2232). With the Barbarian Sprite, we gain more than 30 % ! Bigger the background area (Green), the more you save.

We also provide a comparison between the number of cycles used by the Mr Sprite coding (2570 cycles) and the number of cycles used by the classic algorithm (19968 cycles).

In the source code file, we see again the Sprite information (widh*height, object code size, number of cycles) :

Spr_000  CLC                 ; 36x62, 1483 bytes, 2570 cycles
         SEI                 ; Disable Interrupts
         PHD                 ; Backup Direct Page
         TSC                 ; Backup Stack
         STA    StackAddress
         LDAL   $E1C068      ; Direct Page and Stack in Bank 01/
         ORA    #$0030
         STAL   $E1C068
         TYA                 ; Y = Sprite Target Screen Address (upper left corner)
         TCS                 ; New Stack address
         LDX    #$2222       ; Pattern #1 : 78
         LDY    #$0000       ; Pattern #2 : 78
         LDA    #$2200       ; Pattern #3 : 19
         TCD   
*--

We also provide the statistics for the 3 most used Patterns allocated to registers X, Y and D.


> Performance

Let's test a few Sprites to see how many cycles they need to be displayed. Remember, we have around 41,500 cycles available per scan if we want 60 pictures / second animation.

We have selected a set of Sprites from different famous Arcade games, displayed here in a 320x200 screen where you can see their relative sizes :


Each sprite has between 3 and 16 colors. There is a mix of different sizes. On a modern high resolution screen, the native 320x200 looks too small so we have magnified x2 all pictures for a better view. We provide the original Sprite but also the complexity analysis graphic provided by Mr Sprite.

Karate Champ
32x48 pixels
2168 cycles
Double Dragon
20x63 pixels
2479 cycles
Double Dragon
20x61 pixels
2472 cycles
Double Dragon
32x75 pixels
3591 cycles
Shinobi
32x57 pixels
2593 cycles
Shinobi
32x64 pixels
3134 cycles
Golden Axe
56x65 pixels
3693 cycles
Golden Axe
44x73 pixels
3168 cycles


Street Fighter 1
56x92 pixels
6045 cycles
Final Fight
84x93 pixels
7193 cycles
Final Fight
56x93 pixels
6463 cycles
Street Fighter II
116x94 pixels
7056 cycles
Metal Slug
164x75 pixels
10198 cycles
Metal Slug
156x121 pixels
15967 cycles


> Proof of Concept

Here is an animation of the Metal Slug UFO Sprite using 24 different pictures (widht=156, height=121) provided as a .2mg image disk (source code available on disk). The animation runs at 60 frames / second.

Click to run the animation, click again to quit. The red border shows the CPU usage (the sprite display is of course faster at 8 Mhz than at 2,5 !) . We start to draw the Sprite when the spot is refreshing the line located under the Sprite :

Real Apple IIgs at 2,5 MhzKEGS emulator at 2,5 Mhz KEGS emulator at 8 Mhz

We can notice than the simulation (KEGS 0.91 running on a PC) does not provide the same results than the real world ! The real Apple IIgs is slower than the virtual one. Not a real surprise. If the emulator is not accurate, it's probably because it does not simulate very well the Shadowing from $01 to $E1 (1 Mhz speed bank) effect. We write data into a fast speed bank ($01/2000) but behind, the IIgs have to update the slow speed bank ($E1/2000) with the same data and this operation requires extra cycles. All the numbers we have used in this technical article are purely theoretical. The Shadowing effect has not been taken into account. What we have tried here, with MrSprite's algorithm, is to provide the fastest code possible. Fastest in the virtual world is also the fastest in the real world !

We will go back into details in this aspect (real cycles / theoretical cycles) in a next article when we will look at sprites animation, where the synchronisation with the spot requires all our attention.

MrSprite Proof of Concept #1                     Mr Sprite Proof Of Concept #1