Fork me on GitHub

Of Lighting A Candle (And Casting A Shadow)

2002-06-23 (last edition of the initial revision)

I fear I will never find anyone
I know my greatest pain is yet to come
Will we find each other in the dark
My long lost love

— Nightwish
Beauty of the Beast

Hello again! I just got four Jaguar games from Aldebaran, he’s away over the midsummer feast so he was kind enough to leave me four Jaguar titles, since I own a Jaguar but have no games, heh …​ I thought I’d wait until tomorrow before knocking myself out though, and do some good right now (or perhaps it’s because the TV is blocked). Ahh, I’ve gotten hold of the new Nightwish CD; Century Child, if you don’t own it already, make sure to do so! They play Finnish metal combined with real cute songs and the main singer is schooled in classic opera, so they have a real cool sound and are probably my favourite musical artists.

We’re up to tutorial number 10; which makes me very glad and proud over the work achieved. This had not been possible without the support of readers and other VIP’s. To celebrate the tenth "anniversary", I’m going to give you something special; a putpixel routine. Actually, that statement was almost meant as a kind of ironic, funny statement. I mean, not until tutorial number 10 do we learn how to code a putpixel routine, in the PC’s MCGA mode for example, this is something you hardly have to learn, it’s implied. However, as anyone who knows this much about programming the ST will know, coding putpixel for the ST is a pain in the butt.

The putpixel will hopefully be a prequel to a tutorial on sprites. Sprites, by the way, are anything that moves on the screen, such as a little spaceship, a funny rabbit or just a bouncing ball. A putpixel routine is a routine designed to put a single pixel on the screen, so on the ST, it lights a single dot in one of 16 colours. To achieve this, four bits have to be changed in four different places in memory, and nothing else must be changed in the screen memory, else too much will be changed.

Let’s first see how we will know which pixel to change. We want to be able to provide information in coordinates, like pixel 160,100, which is about the middle of the screen. The ST would treat any such coordinates with a big question mark, so we have to find a way to translate the coordinates. All pixels come after one another in the screen memory, starting with the top left one, and ending with the bottom right.

This means that X value is of course worth 1 position, since pixel 2,0 is the third pixel on screen. Each Y value is worth as much as the number of X coordinates on one scan line, in the case of ST low resolution; 320. So if we have the coordinate 160,100, that would be the 160 + 100 × 320 = 32161th pixel on screen (one extra is added to the value since we start counting from 0). The total number of pixels on screen is 320 × 200 = 64000 pixels, which is about twice as much as 32160, so the formula seems to work. However, this won’t work on a ST, because we can’t simply count pixels that easily.

The information for a single pixel is contained in four bits in four consecutive words, one bit in each word. So, instead what we need to know is at what word the first bit is, and which bit exactly it is we want to deal with. There are 16 bits in each word, by dividing our X value with 16, we will get the number of word clusters to count in the result, and the exact bit in the remainder.

Here’s why: suppose we want the 19th pixel, this would mean jumping over the first word cluster, which contains information for 16 pixels, and then manipulate the third most significant bit in the next word cluster. 19 / 16 = 1.1875, which means we get the result 1 and the remainder 16 × 0.1875 = 3, it works! So, given the ST’s screen memory configuration, how exactly will we treat coordinate 160,100?

We assume the screen memory points to the start of the screen memory. First, the Y coordinate, which is the simplest; each scan line is 160 bytes, so multiply the Y coordinate with 160 and add it to the screen address. Then divide the X value with 16, multiply the result with 8 and add to the screen memory. Why 8? Because each word cluster that we are to jump over is 8 bytes; two bytes to a word and four consecutive words. Now, the screen memory points to the first word of the four consecutive words we need to alter, in each word we want to alter one bit. The exact bit is obtained from the remainder of the X division. There may be other intricate methods, but this one is robust, straightforward and good for learning. Supposing a0 points to the beginning of screen memory, d0 holds the X-coordinate and d1 holds the Y-coordinate, this is how it’s done.

                mulu.w  #160, d1                 ; 160 bytes to a scan line
                add.l   d1, a0                   ; add y value to screen memory
                divu.w  #16, d0                  ; number of clusters in low, bit in high
                clr.l   d1                       ; clear d1
                move.w  d0, d1                   ; move cluster part to d1
                mulu.w  #8, d1                   ; 8 bytes to a cluster
                add.l   d1, a0                   ; add the cluster part to screen memory
                clr.w   d0                       ; clear out the cluster value
                swap    d0                       ; put the bit to alter in low part of d0

There is some magic worked here with high and low parts of the data registers. With high and low part, we mean the two first, and two last bytes of the register, thus the high part is the first 16 bits, and the low part the last 16 bits (reading from left to right). By performing instructions with word size, you only affect the lower part of a register, and leave the higher part unchanged.

The divu instruction, leaves the result in the low part and the remainder in the high part of the register. After the divu, the move.w will only move out the lower part, the cluster part, of d0. Following is a multiplication on the cluster value, and an addition to screen memory. Finally, clear out the cluster part of d0, and a swap instruction. The swap instructions flips the high and low part of a register, so now d0 neatly holds only the value for the bit to be changed, and a0 points to the correct place.

So, now that we know where to change, how do we know what to change? We will have a value between 0 and 15, that is supposed to be put in those four bits in the screen memory (the colour of the pixel). We can’t just move in the data, we could devise some plan with bset and bclr instructions, but that may be clumsy and will probably involve branches for testing, which is slow. Instead, we will use our knowledge of masks and Boolean algebra to solve the problem.

By putting the colour data in the high part of a register, we can then rotate the least significant bit of the colour into the lower part, and then do a shift on only the lower part of the register, to put the colour bit in the correct place; mask prepared! Suppose we have the colour value in d2, and the number of bits to change in d0, obtained from the example above, this is how it works.

                swap    d2                       ; put colour value in high part
                lsr.l   #1, d2                   ; put one bit of colour in shiftable position
                lsr.w   d0, d2                   ; shift by number in d0

Memory will perhaps look something like this, d1 = 5.

High part Low part

d2

%0000000000000000

%0000000000001011

                swap    d2
High part Low part

d2

%0000000000001011

%0000000000000000

                lsr.l   #1, d2
High part Low part

d2

%0000000000000101

%1000000000000000

                lsr.w   d0, d2
High part Low part

d2

%0000000000000101

%0000010000000000

Ah, now the lower part of d2 will hold a terrific mask, we have one bit set, the one bit that we want to alter in screen memory, a simple ORing of the mask will make sure that the bit is set in screen memory, then we just add two to the screen pointer, and repeat the process. Wrong! Problem is, depending on our value, we want to either clear or set the bit, as you can see, on the third run through, the bit should be cleared, not set (the third bit counting from the left in the colour value is 0). If we just OR in the mask, and the bit we want to clear in the screen memory, is set to begin with, we will end up with a set bit where we want a cleared bit. Buh, that sounded awful, example! This is how it will look the third run through

High part Low part

d2

%0000000000000001

%0000000000000000

note how the fifth most significant bit in the lower part is not set.

Screen memory

%1111111111111111

By ORing d2 with the screen memory, we won’t be clearing out the fifth most significant bit in the screen memory, although we need it to be cleared in order for the pixel to have the correct value. So, before ORing in our mask, we need to make sure the bit is cleared. This is done by ANDing in a mask with all bits set except the one bit we want to change. Mask preparation looks like this.

                move.w  #%0111111111111111, d1
                ror.w   d0, d1

d1

%11111011111111111

The ror instruction, for ROtateRight, will rotate the register, making sure that whatever goes out the right (or left) will then come in to the left. Thus, if the least significant bit is 0, a 0 will me moved in the most significant bit, if it’s a 1, a 1 will be moved in. The difference between a logical shift and a rotate, is that a logical shift will move in 0’s, while rotate will move in whatever went out. Examine Motorola 68000 Instruction Execution Times if you wish to further your knowledge on this, there are also arithmetic shifts, but I don’t use them here. Now, by first ANDing in the clear mask, we can then safely or in our pixel mask, like so.

                swap    d2                       ; colour in the high part of d2
                move.w  #%0111111111111111, d1
                ror.w   d0, d1                   ; clear mask prepared

                lsr.l   #1,                      ; d2 shift in the next colour bit
                ror.w   d0, d2                   ; shift colour bit into position
                and.w   d1, (a0)                 ; prepare with mask (bclr)
                or.w    d2, (a0)+                ; or in the colour
                clr.w   d2                       ; clear the old used bit

Then just repeat that over and over, or rather, three times more. The only thing not covered before is the last line, clearing out the old used bit, without this, remnants might be left on the next time around. This is one way of putting a pixel to screen. Bit planes make a putpixel routine so incredibly slow and clumsy. This though, is just a generic putpixel routine, a pixel routine designed for a specific purpose might be much faster, involving only one bit plane perhaps. You don’t have to mess around with all four bit planes every time, say you only want to use four colours for your stuff, then just leave two bit planes alone, since they aren’t needed, this will speed up things. This is the entire putpixel routine.

putpixel:
                ; a0 screen address
                ; d0 X coordinate
                ; d1Y coordinate
                ; d2 colour

                mulu.w  #160, d1                 ; 160 bytes to a scan line
                add.l   d1, a0                   ; add y value to screen memory
                divu.w  #16, d0                  ; number of clusters in low, bit in high
                clr.l   d1                       ; clear d1
                move.w  d0, d1                   ; move cluster part to d1
                mulu.w  #8, d1                   ; 8 bytes to a cluster
                add.l   d1, a0                   ; add the cluster part to screen memory
                clr.w   d0                       ; clear out the cluster value
                swap    d0                       ; put the bit to alter in low part of d0

                ; now a0 points to the first word of the bitplane to use
                ; d0.w holds the bit number to be manipulated in the word

                swap    d2                       ; colour in the high part of d2
                move.w  #%0111111111111111, d1
                ror.w   d0, d1                   ; clear mask prepared

                rept    4                        ; do this 4 times
                lsr.l   #1, d2                   ; shift in the next colour bit
                ror.w   d0, d2                   ; shift colour bit into position
                and.w   d1, (a0)                 ; prepare with mask (bclr)
                or.w    d2, (a0)+                ; or in the colour
                clr.w   d2                       ; clear the old used bit
                endr

                rte return form putpixel

                ; end putpixel

That was that. A nice putpixel routine to use for our convenience, slow as hell because of two multiplications and one division. Because this is a tutorial, and I want to push against practical use, I’ve also written a stupid little program that puts 50 pixels a second on the screen, like a screen saver. However, that program also includes some nice tricks so read on!

The ST has a number of system variables, they are found at very low addresses, starting at $400 and ending at $516. Like the name suggests, these variables contain lots of special information on the system, and they can provide quite the shortcut to finding out some information. For example, $44e, called _v_bas_ad, is a long word containing a pointer to the screen memory (the logical screen memory). If you just want a quick and dirty program, like this one, and want to find out the screen address without traps, or hooking it up yourself, simply read the value here.

                move.l  $44e, a0                 ; a0 points to screen memory

There are some other useful system variables, which will be presented when the need arises. If we want to have a screen saver like program, we want to be able to output random pixels, right? So we need a way to generate a random number. Random numbers are usually obtained from reading the system clock, and then applying some algorithm to the obtained value. We don’t want to mess with that, especially not when there is a very nice trap that will do the job nicely for us. Trap number 17 in XBIOS will generate a 24-bit random number and put it in d0.

                move.w  #17, -(a7)               ; trap number 17, random
                trap    #14                      ; call XBIOS
                addq.l  #2, a7                   ; clean up stack
                ; random number in d0

Well, a random number of 24-bits is a number between 0 and 16777216. We want values in the range of 0 to a maximum of 319; 0-15 for the colour, 0-319 for the X coordinate and 0- 199 for the Y coordinate. So how can we get the random number "down to our level", so to speak? We could put the random call in a loop, and in the end of each loop check the random value, and if it is to big, just repeat the loop. While this would work, it would be so incredibly slow because the odds of the value falling within parameters are extremely small. However, by first lessening the value, we gain tremendous time.

As we so well know, the colour value consists of 4 bits. By ANDing the random value with %1111, we effectively set all bits to 0 except the first four (which may be 0 or 1, depending on the initial value). Thus, our initial random value of 24-bits has been reduced to a 4-bit value, making it perfect for our needs. The X and Y coordinates however, are a bit different, since their representation does not consist of a complete set of bits (numbers that do so are the ones immediately before the powers of 2, i.e. 1, 3, 7, 15, 31, 63, 127, 255, etc). We can’t simply mask off the X co-ordinate’s unnecessary bits like we did the colour, rather we have to keep one bit more than what is needed.

The value 319 uses 9 bits, so we will have to AND the X coordinate with %111111111, but %111111111 = 511, so after masking off the bits in our random value, we’ll have a number between 0 and 511. Now, we must use a loop to check the value, and make a new random number if it should prove to be over 319. The odds for the value of being within parameters are greatly increased though.

get_x
                move.w  #17, -(a7)
                trap    #14
                addq.l  #2, a7                   ; get random number
                and.l   #%111111111, d0          ; make it maximum 511
                cmp     #319, d0
                bgt     get_x                    ; loop until d0 < 320

The instruction bgt will branch if the value compared is greater than. This loop then, will loop until the value in d0 is not greater than 319. The Y coordinate is obtained by doing much the same thing, but we only need to AND with 8 bits, because the Y coordinate should be < 200, and 8 bits make up 255. There, all the theory we need, this is the complete source of the program.

                jsr     initialise

                move.l  $44e, a0                  ; a0 poins to screen memory
                move.w  #0, $ff8240               ; black background
                move.l  #7999, d0
clear
                clr.l   (a0)
                dbf     d0, clear                 ; clears screen to colour 0

main
                move.w  #37, -(a7)
                trap    #14
                addq.l  #2, a7                    ; wait retrace
  
get_x
                move.w  #17, -(a7)
                trap    #14
                addq.l  #2, a7                    ; get random number
                and.l   #%111111111, d0           ; make it maximum 511
                cmp     #319, d0      
                bgt     get_x                     ; loop until d0 < 320
                move.l  d0, d7                    ; store x coordinate

get_y
                move.w  #17, -(a7)
                trap    #14
                addq.l  #2, a7                    ; get random number
                and.l   #%11111111, d0            ; make it maximum 255
                cmp     #199, d0                  ; loop until d0 < 200
                bgt     get_y
                move.l  d0, d6                    ; store y coordinate
  
                move.w  #17, -(a7)
                trap    #14
                addq.l  #2, a7                    ; get random number
                and.l   #%1111, d0                ; make it maximum 15
                move.b  d0, d2                    ; put colour number in d2
                                
                move.l  d7, d0                    ; put x coordinate in d0
                move.l  d6, d1                    ; put y coordinate in d1
              
                move.l  $44e, a0                  ; a0 points to screen memory
                jsr     putpixel                  ; put pixel on screen
              
                cmp.b   #$39, $fffc02             ; space pressed?
                bne     main                      ; if not, repeat main
                
                jsr     restore
                
                clr.l  -(a7)                      ; clean
                trap   #1                         ; exit


putpixel:
; putpixel routine
; a0 screen adress
; d0 x-coordinate
; d1 y-coordinate
; d2 colour
                mulu.w  #160, d1                  ; 160 bytes to a scan line
                add.l   d1, a0                    ; add y value to screen memory
                divu.w  #16, d0                   ; number of clusters in low, bit in high
                clr.l   d1                        ; clear d1
                move.w  d0, d1                    ; move cluster part to d1
                mulu.w  #8, d1                    ; 8 bytes to a cluster
                add.l   d1, a0                    ; add cluster part to screen memory
                clr.w   d0                        ; clear out the cluster value
                swap    d0                        ; bit to alter in low part of d0

; now a0 points to the first word of the bitplane to use
; d0.w  holds the bit number to be manipulated in the word  

                swap    d2                        ; colour in the high part of d2
                move.w  #%0111111111111111, d1
                ror.w   d0, d1                    ; clear mask prepared
              
                rept    4                         ; do this 4 times
                lsr.l   #1, d2                    ; shift in the next colour bit
                ror.w   d0, d2                    ; shift colour bit into position
                and.w   d1, (a0)                  ; prepare with mask (bclr)
                or.w    d2, (a0)+                 ; or in the colour
                clr.w   d2                        ; clear the old used bit
                endr
                  
                rts
; end putpixel

                include initlib.s
tutorial 10 screenshot
Figure 11. Pixels

Yes, first a normal initialisation, then the neat trick of putting the screen address in a0, followed up by putting the background black and clearing the screen. Then, a main routine, the question here is why I didn’t I use the $70 as described in tutorial 9. The reason is a bit farfetched, but valid. Because of the random loops, there is a theoretical possibility of the main routine taking longer than 1/50th of a second, it’s virtually impossible, but it could happen. If this were to happen, the $70 vector would be called while the previous one were still being executed, resulting in a crash. With the method I use here, however, there is no danger of a crash.

Obtaining the X coordinate, as described above, only new thing is storing the coordinate in d7. This is because the coming random trap for the Y coordinate will destroy everything in register d0, and then some, register d7 however, is safe. Same goes for the Y coordinate. Finally, the colour value is obtained, and moved over to d2, then the X and Y coordinates are moved into their respective registers. These registers could be anything, or a variable or whatever storage possibility, but the putpixel routine is designed to have the X coordinate in d0 and the Y coordinate in d1, so this is how it’s supposed to be. After the screen address has been put in a0, all is set for the putpixel call.

The putpixel routine is exactly as described above, so nothing new there. Signing off with a check for a pressed spacebar, and that concludes the program. Note how I put the putpixel routine in its own subroutine, instead of including it in the main program, which could also have been done. This results in tidier code, the downside being that it takes more time to execute, but time is no issue here.

Speaking of time, I actually think that I’ll fill up some space here with a bit on optimisation, something that will have to come one day or another anyway. There are two multiplications and one division in the putpixel routine, horrible. These can be replaced with shift instructions, but it’s a bit tricky. Each shift either doubles or halves the value in the register. So how do we do a multiplication of 160 and a division by 16, where we also keep the remainder?

First, the Y part; here, we want to have a result equalling d1 × 160. 160 is not a value you may shift by, since all shift will produce multiplication results of 2, 4, 8, 16, 32, 64, 128, 256 and so on. However, 128 + 32 = 160, the value we want to multiply with, and when things come to multiplication, we are allowed to split the multiplication in two and add the result; d1 × 160 = d1 × 32 + d2 × 128. All we have to do is copy our Y coordinate into another register, shift one register with 5 (multiplication of 32), shift the other with 7 (multiplication of 128) and add the results together.

                move.w  d1, d3                   ; copy Y coordinate
                lsl.w   #7, d1                   ; mulu #128,d1
                lsl.w   #5, d3                   ; mulu #32,d3
                add.w   d3, d1                   ; add results together
                add.l   d1, a0                   ; add result to screen address

Note the word size used in all operations. There may still be garbage in the upper part of d3, but this is never touched in any of the operations. Since the maximum value we will handle is 199 × 160 = 31840 is less than the maximum for a word size, which is 216 = 65536, it’s ok to only use word size instructions, it also saves time. Our mulu instruction would take a maximum of 70 clock cycles, but in this case I think it’s 42. The technique of shifting takes

12 + 6 + 2 × 7 + 6 + 2 × 5 + 8 = 58

Heh, seems we wasted time rather than saving. Let that be an important lesson, sometimes the job’s just not worth doing. ☺

So now the X part, first, put the thing in the upper part of d0 with a swap. Now, with a right shift of 4, we will effectively divide the number by 16, which is what we want to achieve, the result will be in the upper part, and the remainder will be in the highest bits in the lower part. Now, what we need to do is simply to put the remainder down in the lowest bits in d0, so we right shift by 12. The reason for right shifting by 12 is that the remainder takes up a maximum of 4 bits (remainder maximum is 15, %1111), and 12 + 4 = 16 which is the number of bits in the lower part of a data register. Unfortunately though, you can’t shift by 12 when shifting with a number, so we’ll just have to divide the shift in one 8 and one 4 part, 8 being the highest number you may shift by.

Swap down the result in the lower part, and shift it left by 3 in order to multiply with 8. We make sure to keep the operation word size in order not to affect the remainder in the upper part. Then, add the result to the address register, but only use a move with word size, in order to only add the multiplied result, and leave the remainder well alone. Lastly, a clear out of the result part and a swap to put everything right for the next part of the putpixel.

                swap    d0                       ; put in upper part
                lsr.l   #4, d0                   ; divide by 16
                lsr.w   #8, d0                   ; shift down remainder ...
                lsr.w   #4, d0                   ; ... by 12 bits total
                swap    d0                       ; result in lower part
                lsl.w   #3, d0                   ; multiply with 8
                add.w   d0, a0                   ; add result to screen address
                clr.w   d0                       ; clear out result
                swap    d0                       ; put remainder in lower part

That was that, now let’s see if this optimisation did us some good. Unoptimized takes about

140 + 6 + 4 + 40 + 12 + 4 + 4 = 210

The division is an approximation. Also, I don’t think we really need to move some data to d1 to manipulate it, so the unoptimized could do some optimization too, but that’s not too important. Now let’s see what the shift-optimized part will take

4 + 8 + 4 × 2 + 6 + 8 × 2 + 4 + 6 + 3 × 2 + 8 + 4 + 4 = 88

Even though I’m a bit unsure of some values here, it’s obviously quite a save in any case. That was a little taste on how to optimize easy, just replace multiplications and divisions with shifts, sometimes quite a saving, but not always.