Optimizing in higher programming languages

Optimizing in higher programming languages.

Note: The examples in this text are compiled by hand and are a simplification of the reality, to make things easier to understand. The high level language listings are in C++/Java-like form. The ASM-listings (in Intel-like syntax) are an approximation of the outcome of a compiler with a few optimization algorithms. Real outcomes may vary. It's no problem if you don't understand ASM. The examples are just an illustration of what happens actually. Clk-indications are based on an Intel 80386-like processor. Things like pipelines and the difference between reg-reg, mem-reg, reg-mem and mem-mem access are not considered to make things less complicated. Although I start rather trivially, there should be something to learn for everyone in this text. And if you don't understand exactly what I'm doing in an example, then just draw out the bits on a piece of paper to get a better idea of what's going on. The Windows calculator can also help. Put it in scientific mode, and it can convert between decimals, hexadecimals and bits. It's important that you don't only know THAT it works, but also HOW it works. Anyway, enough chit-chat. Let's get started...

Introduction

Well, as you all know (or should know at least) ASM is dropping in popularity fast. Nearly all software written in the last few years is written in higher programming languages like C++ and Pascal/Delphi. But along with this development, another phenomenon appears: programmers don't understand the computer-system anymore. They just understand the higher language. Manipulating bits is no longer common knowledge, but is limited to an elite core of programmers.

And nowadays I keep hearing fables like: "Just program the algorithm. The compiler will take care of the optimization." Sure, today's compilers have some sort of optimization built in. But that's just limited to aligning the code for pipelines, and some other small tweaks to the code. There is still much to be gained by taking into account how a computer works, and adapting your algorithm to that. I will show this in the following text.

"So why is this interesting?" you might ask. Well, if you can't program ASM because there simply isn't any (like in Java), or you aren't familiar with ASM, then you can optimize in the higher programming language itself. And even if you do know ASM, it can still be interesting, simply because you can program it faster and easier in the high language than in ASM.

Accessing variables

Maybe you wonder why exactly ASM is faster than a higher language? Well, the extra speed comes mainly from the fact that you can use the registers the way you like it. You can keep a variable inside the processor for as long as you need it, eliminating unnecessary memory access. And remember this for the rest of your life: memory is slow! A compiler usually follows the following pattern: load, manipulate, store. This uses 2 memory accesses per operation. If you need the same variable again, it needs to be loaded again.

Modern compilers try to analyse the code and keep the variable in a register if possible. But this means that you need to write the code in a way that the compiler analyses it correctly. So if you need to do more than one operation on a variable, arrange your code so that the operations are done in sequence. I'll give you a simplified example:

This is wrong:

int value1 = 25;
int value2 = 30;
int value3 = 35;
int value4 = 40;

(...)

++value1;
--value2;
++value3;
--value4;

// Here you access the first value again, but it will no longer be in a
// register since you accessed other values in between
// So the CPU has to load it again
if (value1 > 25)
    (...);

if (value2 > 30)
    (...);

if (value3 > 35)
    (...);

if (value4 > 35)
    (...);
-----------------

And this is right:

int value1 = 25;
int value2 = 30;
int value3 = 35;
int value4 = 40;

++value1;

// Now you access it again directly, so the compiler will keep it in a
// register
if (value1 > 25)
    (...);

// In fact, in some languages you could combine the two commands now:
if (--value2 > 30)
    (...);

if (++value3 > 35)
    (...);

if (--value4 > 40)
    (...);

-----------------

Doing math without math instructions

Another reason why ASM is faster is that you can use all sorts of logical operations on bytes. Bytes are sets of bits. And with these bits there are nice things to be done. In most languages you can use these operations too. In C++ for example, we have these operators:

op:     ASM:    does:
---     ----    -----
>>      SAR     shift bits right, and extend the sign (arithmetic shift)
<<      SHL/SAL shift bits left (arithmetic and logic shift are the same)
&       AND     logical and-operation on two variables
|       OR      logical or-operation on two variables
^       XOR     logical exclusive or-operation on two variables
~       NOT     invert all bits of the variable

Java adds this one, as compensation for not having unsigned variables:

>>>     SHR     shift bits right, and add new 0's to the left.

So we have 'direct' access to the logical operations. Now why would we want that? Simple: logical operations are fast. Faster than arithmetic operations. And that's why they're interesting. So I'm going to explain some fundamental stuuf about electronics here.

ANDs, ORs, XORs and NOTs.

A CPU is basically made up of these simple logic units. So what are they, and more importantly, how do they work?

Binary, or Boolean logic works with only 2 values: TRUE and FALSE. Those are represented in bits like this: 1 for true, and 0 for false. In positive logic, the most common one, a logical 1 is represented by +5v. A 0 is represented by 0v. Sometimes, the designers decide to use negative logic, because it's easier, faster or cheaper to design the circuit that way. In that case, the 1 is represented by 0v, and the 0 is represented by +5v. Sometimes, a 1 signal is also called a HIGH signal, and 0 is called a LOW signal.

AND

A logical AND returns true only if both inputs are true. Else it returns false. So when you input 1 AND 1, you get 1. Input anything else, like 1 AND 0, and you get 0.

Let's draw up a TRUTH TABLE of this:

input 1 input 2 output
------- ------- ------
0       0       0
0       1       0
1       0       0
1       1       1

A logical OR returns true if either of the inputs is true. So this works exactly like the mathematical or. So when you input 0 OR 0, you get 0. Input anything else, like 1 OR 0, and you get 1.

The OR table looks like this:

input 1 input 2 output
------- ------- ------
0       0       0
0       1       1
1       0       1
1       1       1

XOR

A logical XOR returns true only if 1 of the 2 inputs is true. So, this is an EXCLUSIVE OR, hence the name. So when you input 1 XOR 0 (or 0 XOR 1), you get 1. Input anything else, like 1 OR 1, and you get 0.

The XOR table takes this shape:

input 1 input 2 output
------- ------- ------
0       0       0
0       1       1
1       0       1
1       1       0

NOT

The NOT function simply returns the inverted value of the input. So, enter 0, and you get 1. Enter 1, and you get 0.

This results in the following table:

input   output
-----   ------
0       1
1       0

Basically, that's all the CPU can do. With smartly designed circuits, you can easily create arithmetic functions like ADDing and SUBtracting. Use these new functions cleverly, and you can create even more functions, such as MULtiplying and DIViding.

But since the CPU designers have already done that for you, I'm not going to explain it here. If you're interested, then look it up in a book.

In a CPU, you always work with bytes, or sets of bytes. A byte is a set of 8 bits. These sets of bits are represented in HEXADECIMAL values (fire up your Windows calculator, put it in scientific mode, and you can convert between binary, hexadecimal and decimal values). So how do hex values and bits work?

Well, each bit in the byte represents a value. We look at the bitset from right to left in increasing level of importance, just like with normal decimal values. The rightmost bit is the LEAST SIGNIFICANT BIT (LSB). The leftmost bit is the MOST SIGNIFICANT BIT (MSB).

The first bit represents 1, the second bit represents the double of that: 2. The third bit is again the double of the that: 4. And so on.

So in general, the following relation between value and position holds: The bit at position n represents value 2 to the power n. (We start counting at 0, because we wouldn't want to waste precious memory... So the first bit is bit 0, the second bit is bit 1, etc.)

So, the value of a byte is made up from the sum of the bits in that byte. A few examples:

00001000

Bit 3 is set: 2^3 = 8

10010010

Bit 1 is set: 2^1 =   2
Bit 4 is set: 2^4 =  16
Bit 7 is set: 2^7 = 128
                    ---
Total value:        146

That's 92 in hex, which I'm going to explain next...

Hexadecimal values

In hex, each digit is represented by a NIBBLE. A nibble is a half byte, so it's a set of 4 bits. It can be very useful to remember the following table of hex values and their binary equivalent nibble. If you know this table by heart, back and forth, then you can convert any binary value to hex and vice versa from the top of your head. So here we go:

binary  hex
------  ---
0000    0
0001    1
0010    2
0011    3
0100    4
0101    5
0110    6
0111    7
1000    8
1001    9
1010    A
1011    B
1100    C
1101    D
1110    E
1111    F

This can be really useful when you need a BITMASK, which I will discuss later. For instance, you need a mask that looks like this:

11001011

But in most cases, you can only enter hex or decimal values, so you need to know the equivalent of this byte in hex or decimal.

We choose hex, because it converts easily. That's because the radix of hex (16) is the radix of binary (2) to a certain power (4 in this case).

So, we split up the byte in 2 separate nibbles, and convert each nibble separately, using the above table:

Binary: 1100 | 1011

Hex:       C |    B

Now we glue all nibbles back together, and we know the hex value of the byte. So the byte 11001011 has the value of CB in hex. This we can enter in the computer. Easy, isn't it? And the funny part is, that we didn't even need to know what the value was. CB is 203 decimal. But 203 doesn't make any sense, compared to the bitset. A lot of the time the shape of the number in bits is more important than the value. That's why hex numbers can be much more useful. The hex digits correspond directly to the nibbles in the bitset. Going the other way around works just as well ofcourse. Replace each hex digit with it's corresponding nibble, and you get the bitset.

But, we've only seen 8 bits so far... And our computer is 32 or maybe even 64 bits. How does that work then?

Well, the length of a bitset the computer can process is called a WORD. So in a 32 bit computer, we work with a WORD of 32 bits, or 4 bytes. On Intel PC's however there is a slight problem... The word size was defined for the original 8086 processor, which was 16 bit. So on any Intel PC, a word is still 16 bits, even though the CPU may actually be 32 bits or 64 bits. 32 bits is then called a DOUBLE WORD (DWORD). And similarly, 64 bits is called a QUADRUPLE WORD (QWORD). Let's use these strange, but widely used Intel measurements from now on in this essay. So from now on I will use the following terms:

term    length
----    ------
nibble   4 bits
byte     8 bits
word    16 bits
dword   32 bits
qword   64 bits

How do WORDs, DWORDs and QWORDs work then? Well, just like bytes actually. They consist of more bits. But we still work from right to left in increasing level of significance. So we just find some extra bytes 'glued in front of' the first byte. The LSB is still the rightmost bit, and the MSB is still the leftmost bit.

A small example to make it clear:

We have the following word:

1000111101110001

What's the hex value of this then? We can find it the same way as with the byte: split the nibbles...

Binary: 1000 | 1111 | 0111 | 0001

Hex:       8 |    F |    7 |    1

Glue it together and we get the hex value of the word: 8F71.

DWORDs, QWORDs work exactly the same way.

In this text, I will only use bytes from now on, so the bitsets will not be too long and too hard to follow. But with words, dwords and qwords it all works exactly the same. There are just more bits.

The logical functions are processed on bytes (or sets of bytes, like the aforementioned word, dword and qword). All bits in the bitsets are processed in parallel. So, let's look at some examples:

bits:           decimal:        hex:
-----           --------        ----
11001100        204             CC
00001111 AND     15 AND          F AND
------------    -------         ------
00001100         12              C

00001100         12              C
11000000 OR     192 OR          C0 OR
-----------     ------          -----
11001100        204             CC

11001100        204             CC
00001111 XOR     15 XOR          F XOR
------------    -------         ------
11000011        195             C3

00001111 NOT     15 NOT          F NOT
------------    -------         ------
11110000        240             F0

As you can see, the hex values make a lot more sense than the decimal values. You can immediately see the binary nibble and the corresponding hex digit.

So, what can you do with these functions? Well, this is the interesting part. With these functions, you can do all sorts of clever stuff, and very fast too, because they're basic for the CPU to execute. The CPU can do these operations in a single cycle.

Bitmasking

So, let's talk about a technique I already mentioned: BITMASKING. With this technique you put a 'mask' on the byte, which 'hides' the bits in which you are not interested.

Look at the above AND operation of 2 bytes, and think about how the AND operation works. You'll see that only the bits that the 2 bytes have in common will be in the output byte. Consider the second input byte: 00001111. Of any byte you AND with this, only the lower nibble will be in the output. So this byte effectively turns off, or 'masks out' the high 4 bits.

So, with a mask, you can turn off the bits you're not interested in. Just create a mask with 1's in the bits you're interested in, and 0's in the other bits.

A small example to illustrate this:

Let's say we have a certain value in which the third and fifth bits are set when an error has occured. We now want to check whether the error has occured or not. So we're only interested in bits 2 and 4.

First, we design our mask in binary form:

00010100

But we'll need hex or decimal to use it in our program. We choose hex, and convert the mask to hex:

0001 | 0100

   1 |    4

So the mask is 14 hex.

So we can write the check routine easily like this: int mask = 0x14; bool errorOccured(int value) { int check = value & mask; return (check == mask); } ----------------- Short isn't it? Bet it's fast too.

So what if we need to set some bits in a byte? Well, think about what the OR operation does, and you'll know. Look at the OR example above, and you'll see that the second byte, 11000000, will turn on the 7th and 8th bit of any byte you OR it with.

Let's extend the AND example with an OR example, that sets the error bits. We can use the same mask to set the error: 00010100, or 14 hex.

So it will look something like this:

int mask = 0x14;
int value;

void setError()
{
    value |= mask;
}
-----------------

That's all there is to it.

Modulo-operations

AND masking can also be useful when you need to do a modulo-operation. What we're going to do here, is a bit hard to explain, but I'll give it a shot anyway...

For example, take a digital clock with seconds display. Put your hand on the high digit, and see what happens: the low digit keeps going from 0 to 9. That's modulo 10.

Think about it, and you'll discover that by masking digits n and higher, you are effectively doing a modulo 10^n operation. Or, even better: modulo (radix)^n.

This is what we're going to do on bits now. We're going to mask out the high bits. And in binary numbers, the radix is 2. So this way, we can do a modulo 2^n operation.

For example, we have a counter that needs to run between 0 and 63, so modulo 64. For a modulo 2^n with an AND, we need a mask of 1 smaller: (2^n)-1. This will clear all the high bits of the counter.

The code will look like this:

int counter;

// Increase the counter
counter++;

// Mask out the high bits for the modulo 64 operation
counter &= 63;
-----------------

This is ofcourse much faster than using the %-operator, which requires an IDIV operation, which is very slow.

With OR masking there are also nifty things you can do:

For example, the BMP file format needs to have each scanline of the image aligned, or 'padded' to a DWORD. The BMP file format also stores the scanlines upside down for some odd reason, but let's not go there now.

So, our scanlines need to be a multitude of 4 bytes. We need to calculate how many extra bytes we need to add to our scanlines to align it correctly. We're going to do this with a smart OR operation...

We need a multitude of 4. So the lowest 2 bits should be like this: 00. Hmm... And with OR, you can only turn on bits. So how do we do that? Simple: we turn on the lowest 2 bits, and INCrement with 1.

But there's a tiny problem when the length was already padded... Because then it will be 1 DWORD too many after the OR. Well, no problem there: just DECrease the length 1 before you OR.

So the complete algorhytm will look something like this:

int alignToDWORD(int scanLength)
{
    scanLength--;

    scanLength |= 3;

    return ++scanLength;
}
-----------------

Elementary, is it not? And very fast too. It could also be done with an AND, or even with bitshifts, by the way (how? Try to figure those ones out for yourself). But now we can use the very simple INC and DEC instructions which will be just a bit faster...

About the XOR, there is not much to tell. This operation just flips the bits in the mask. That's nice when you want to toggle some bits in a byte. It's also very popular for encrypting stuff, because of the toggling effect. Just XOR an array of data with a certain mask, and it will be totally different. But, when you XOR again with the same mask, the original data will be restored. Think about it. Isn't that something?

Just a small example to illustrate the toggle:

We want to toggle a certain bit in a value, say bit 6.

So, we construct the mask:

01000000

In hex that would make 40 ofcourse.

So we get something like this:

int value;
int mask = 0x40;

void toggle()
{
    value ^= mask;
}
-----------------

And we're done. That was easy...

Now onto the final instruction: NOT. This one is deceivingly simple. It just inverts all the bits in the byte. But it can be of great help with all sorts of things, like creating masks for example.

Say we have a certain value, that needs the lower 5 bits set to 0, if the 6th bit is set.

We use the NOT to create a mask for an AND operation. Here you will see that a simple DEC can be very useful too. It will go something like this:

int value;

// Get the 6th bit of the value.
int mask=value & 0x20;

// Fill the lower 5 bits with 1, using a DEC.
mask--;

// Now invert the mask, and it will cut out the lowest 5 bits.
value &= ~mask;
-----------------

Hmmm, that wasn't hard at all, now was it? Just think of what trouble you would have to go through, if you weren't using the logical operations...

Divide and conquer

Now, let's start with the shifts. As you should know by now, the bits in a byte correspond with values. As a rule, the nth bit corresponds with a value of 2^(n-1). For example, the 4th bit corresponds with 8 (2^3). Now, what would happen if you would shift all bits 1 position to the right? An example to make it clear:

The number 8:

00001000

We shift one to the right:

int value = 8 >> 1;

You get:

00000100

The 3rd bit is now set, so the bitset now represents 2^(3-1). In other words: the number 4.

Well, will you look at that? We just divided 8 by 2, and we didn't use any arithmetic instruction for it. What have we gained by this?

SAR:     3 clks
IDIV:   19 clks

Hmm, that's 16 clks saved! 533% gained!

Another small example:

The number 47:

00101111

Shift 3 positions right:

00000101

That's

2^(3-1) + 2^(1-1) = 4 + 1 = 5

. So we effectively divided 47 by 8.

Apparently there is a relation between the number of bits shifted and the divider. After some thinking, you'll discover that relation has to be this: If you shift n bits to the right, you're dividing by 2^n.

And after some more thinking, you'll understand that you can do multiplications in the same way by shifting left. Another relation: If you shift n bits to the left, you're multiplying with 2^n.

The key to this is to write your code in a way that you can use a shift instead of a 'real' division or multiplication.

By the way, here's a quick tip if you need to use divisions: If you are dividing a large set of data by the same constant n, it will be faster to calculate 1/n once, and then multiply the set of data by the new constant. Why is this faster? Simple: A multiplication is a faster instruction than a division. Bear in mind though, that the accuracy of a multiplication is somewhat lower than that of a division, but usually that's no problem. A small example:

This is wrong:

float data[256];
float n = .323f;

for (int i = 0; i < 256; i++)
{
    data[i] /= n;
}
-----------------

And this would be right:

float data[256];
float n = .323f;

// Calculate our new constant here
float in = 1/n;

for (int i = 0; i < 256; i++)
{
    data[i] *= in;
}
-----------------

Yes, optimizing can be that simple. But, what have we gained? Let's see, suppose a FDIV is 17 clks, and a FMUL is 3 clks:

1st routine: 256 x 17        = 4352 clks
2nd routine: 17 + (256 x 3)  =  785 clks
			  ----------------
		      gained = 3567 clks

Hmm, let's see... About 454% gain? Not bad for such a small change in the code.

Accuracy or speed: what do you need?

With the bit-operations we've learnt, there's lots of nice algorhytms to make for special computer-related mathematics. What happens if you combine them?

A nice example is using transparency with two 24-bit pixels. Transparency works like this: Of the two pixels, you separately add the R, G and B values. So you 'unpack' the R,G, and B's and add them. Then you need to saturate them to a maximum of 8 bits, otherwise you can't pack the values correctly because of the 'overflowing' bits.

Normally, it would look something like this:

int pixel = 0xA9D3FF;
int transp = 0xFFFFFF;

// Get R-value from pixel
int r = pixel >> 16 & 0xFF;

// Get G-value from pixel
int g = pixel >> 8 & 0xFF;

// Get B-value from pixel
int b = pixel & 0xFF;

// Add transparent R-value
r += transp >> 16 & 0xFF;

// Add transparent G-value
g += transp >> 8 &0xFF;

// Add transparent B-value
b += transp & 0xFF;

// Saturate colors
if (r > 0xFF)
    r = 0xFF;

if (g > 0xFF)
    g = 0xFF;

if (b > 0xFF)
    b = 0xFF;

// Build new pixel
int newPixel = r << 16 | g << 8 | b;

(Yes I know the code is not optimized in the way I told you earlier, but this makes the code a little easier to read.) NB: You could make this a little faster in C++ by using a byte* pointer and reading out the RGB values directly from &pixel and &transp.

-----------------

So we have this, when we compile:

.data
pixel           dd 0A9D3FFh
transp          dd 0FFFFFFh
r               dw (?)
g               dw (?)
b               dw (?)
newPixel        dd (?)

.code
.386
; Get R-value from pixel
	mov     eax, [pixel]
	shr     eax, 16
        and     eax, 0FFh
        mov     [r], eax

; Get G-value from pixel
	mov     eax, [pixel]
	shr     eax, 8
        and     eax, 0FFh
        mov     [g], eax

; Get B-value from pixel
	mov     eax, [pixel]
        and     eax, 0FFh
        mov     [b], eax

; Add transparent R-value
	mov     eax, [transp]
	shr     eax, 16
        and     eax, 0FFh
        mov     ebx, [r]
        add     eax, ebx
        mov     [r], eax

; Add transparent G-value
	mov     eax, [transp]
	shr     eax, 8
        and     eax, 0FFh
        mov     ebx, [g]
        add     eax, ebx
        mov     [g], eax

; Add transparent B-value
	mov     eax, [transp]
        and     eax, 0FFh
        mov     ebx, [b]
        add     eax, ebx
        mov     [b], eax

; Saturate colors.
        mov     eax, [r]
        cmp     eax, 0FFh
	jle     noRclip
        mov     eax, FFh
noRclip:
        mov     [r], eax

        mov     eax, [g]
        cmp     eax, 0FFh
	jle     noGclip
        mov     eax, FFh
noGclip:
        mov     [g], eax

        mov     eax, [b]
        cmp     eax, 0FFh
	jle     noBclip
        mov     eax, FFh
noBclip:
        mov     [b], eax

; Build new pixel
	xor     eax, eax
        mov     ebx, [r]
	shl     ebx, 16
	or      eax, ebx
        mov     ebx, [g]
	shl     ebx, 8
	or      eax, ebx
        mov     ebx, [b]
	or      eax, ebx
	mov     [newPixel], eax
-----------------

And we're done. Now, how many clks would that be?

opcode: clks:   count:  total:
------- -----   ------  ------
MOV:    2       28      56
SHR:    3        4      12
AND:    2        6      12
ADD:    2        3       6
CMP:    2        3       6
JLE:    3        3       9
XOR:    2        1       2
SHL:    3        2       6
OR:     2        3       6
-------------------------------
total:          53      115

In pure ASM, you could ofcourse create a much better routine. But in Java for example, there is no ASM, so you have to make it better in Java itself. Or maybe you don't know how to use ASM, or you don't want to use ASM, because you want your C++ program to comply to the POSIX standard. What will you do then?

Most time is lost in the 'unpacking' of the R, G and B values of the pixels. So, we decide to use one bit less per colour, and add and saturate them all at the same time, using a 'carry' on the bit that is now freed. Looking something like this:

int pixel = 0xA9D3FF;
int transp = 0xFFFFFF;

// Shift bits 1 right and clear 8th bit of each byte in the dword
// This will kill the LSB in the color, and make the RGB-values 7 bits
// The freed 8th bit can now be used to saturate the colours
int newPixel = (pixel >> 1) & 0x7F7F7F;

// Same for the other pixel
int transpPixel = (transp >> 1) & 0x7F7F7F;

// Add the transparent colours
newPixel += transpPixel;

// Create a mask for saturation
// First get the carry bits of the newPixel
int mask = newPixel & 0x808080;

// Then here's a neat trick to saturate the pixel
// Move the carry-bits to the 1st bit in every byte of the dword
// Then subtract it from the mask
// This effectively fills the overflowed 7-bit RGB-values with 1's
// So, we have a mask for the RGB-values to saturate
mask -= (mask >> 7);

// Now or the newPixel with the mask, giving the overflowed bytes the correct
// value
newPixel |= mask;

// But it's still with 7 bit values, so we shift 1 to the left
newPixel <<= 1;
-----------------

Okay, so this routine doesn't look much like the one above, but the result is about the same. Only some small things are different, as a result of the 7 bit calculation. Because we shift one left at the end, the 1st bit of the byte will be the carry of the byte before that. This means there can be a 1 in the 25th bit. But, since we don't look at the 25th bit with 24 bit graphics, this is not important, so we don't have to turn it off (Pop question: How would you turn it off, if it was important?). And so we only have an inaccuracy in the 1st bits. But, since this is the LSB, it does not matter. You won't see it.

So, how would this look when you compile it?

.data
pixel           dd 0A9D3FFh
transp          dd 0FFFFFFh
transpPixel     dd (?)
mask            dd (?)
newPixel        dd (?)

.code
.386
; Shift bits 1 right and clear 8th bit of each byte in the dword
	mov     eax, [pixel]
	shr     eax, 1
	and     eax, 7F7F7Fh
	mov     [newPixel], eax
	
; Same for the other pixel
	mov     eax, [transp]
	shr     eax, 1
	and     eax, 7F7F7Fh
	mov     [transpPixel], eax

; Add the transparent colours
	mov     ebx, [newPixel]
	add     eax, ebx
	mov     [newPixel], eax

; Create a mask for saturation
	and     eax, 808080h
	mov     [mask], eax

	mov     ebx, eax
	shr     ebx, 7
	sub     eax, ebx
	mov     [mask], eax

; Now or the newPixel with the mask
	mov     ebx, [newPixel]
	or      eax, ebx

; But it's still with 7 bit values, so we shift 1 to the left
	shl     eax, 1
	mov     [newPixel], eax
-----------------

Hmm, looks much shorter... But is it faster? Let's have a look:

opcode: clks:   count:  total:
------- -----   ------  ------
MOV:    2       11      22
SHR:    3        4      12
AND:    2        3       6
ADD:    2        1       2
SUB:    2        1       2
OR:     2        1       2
SHL:    3        1       3
-------------------------------
total:          22      49

Wow!! Whaddayaknow... From 115 to 49. More than twice as fast! And, you're using less overhead, since you're using less variables. It will probably be even faster in reality, because you're using less memory access and more registers than before. (By the way... We're using bit 8 as a carry bit, but you may have already figured that you can use bit 9 too. This way you can eliminate some shifts. Can you convert this routine and optimize it even further?)

So you see, you should decide exactly what you want from a routine. This example shows that when you choose to use one bit less accuracy, you can gain more than 50% of CPU-speed. And, considering the fact that it is impossible to detect the inaccuracy with the naked eye, the inaccuracy does not matter at all.

So, when you realize once again, that you're dealing with a computer that only understands bits and bytes, you can use it to great advantage in higher programming languages. And if you would program the second routine directly in ASM, you couldn't gain much. The only way you could get it much faster, would be to create an ASM-specific routine, which uses powerful commands and memory access not possible in other languages.

Think about how the compiler will compile your source. By controlling the input of the compiler, you can control the output. If you want to be a really good coder, I suggest you learn some ASM. Coding large programs in ASM may be too time-consuming, but in most languages you can use inline-ASM. Combining the ease of a high level language with the raw power of ASM will result in very fast and compact programs.

Take some time to study the behavior of your processor, like pipelines and the number of clks of the instructions. And most compilers can output an ASM source code. Look at it sometime, so you'll know how your compiler behaves.

Now, to end this all, we will give you some small pure-ASM examples of the saturating transparency routine. After all, ASM is still the fastest and most versatile language around.

Using MMX:

.data
pixel           dd 0A9D3FFh
transp          dd 0FFFFFFh
newPixel        dd (?)

.code
.586
.mmx
	movd    mm0, [pixel]
	paddusb mm0, [transp]
	movd    [newPixel], mm0
-----------------

4 clks!!!

Or, if you use 2 pixels at the same time:

.data
pixels          dq 0A9D3FF029292h
transps         dq 0FFFFFF993399h
newPixels       dq (?)

.code
.586
.mmx
	movq    mm0, [pixels]
	paddusb mm0, [transps]
	movq    [newPixels], mm0
-----------------

Still uses 4 clks, but with 2 pixels at a time, so 2 clks per pixel! Note: after the paddusb, you get a pipeline 'slot' or 'stall'. So, you can load the next 2 pixels there in mm1 at no extra cost. This effectively gives you 1.5 clks per pixel!! In other words: MMX is really fast, as long as you don't use a compiler or DirectX...

And remember: Knowledge is CPU-power!!!

Z-Nith and X-Calibre