Did you know that an ‘if’ statement, or any other conditional expression for that matter, can take a toll on the performance of your program? This is essentially because conditionals branch out the flow of instructions. Luckily, there’s a branch in programming called branchless programming (ha!).
To understand this, let’s take a step back and delve into pipelined processors.

Typically, a CPU instruction cycle is split into four stages: fetch, decode, execute and write. In simple CPUs, this cycle is carried out serially. However, to better exploit the CPU’s resources, in most modern CPUs the instruction cycle is instead executed concurrently – this is called pipelining. Pipelining attempts to keep every part of the processor busy with some instruction by dividing the incoming instructions into different steps to be performed by different units of the processor. In other words, at each CPU cycle, each instruction is in a different processing stage – when one instruction is being executed, the next one is already being decoded, and the one after it is already being fetched. You can think of how a bicycle is produced in a factory: at any given time, one bike is being assembled, lights are being mounted on another, a third is being painted, etc. Pipelines can be short with a few stages (as short as three) or long with many stages (twenty and over).
So why do conditionals slow down the running time?
When a branch instruction enters the processor pipeline, the branch destination is naturally not known until the instruction has been decoded and its destination calculated. But in order to keep the pipeline full, the processor needs to know the branch destination even before the instruction has been decoded. Letting parts of the processor sit idle is a waste of its resources, so instead of stalling the pipeline until the branch instruction is decoded and executed, most processors will “take a bet” and load the next instructions from one of the possible branch destinations.
You can already see the problem. If it later turns out that the guess was wrong, the processor needs to flush the pipeline and start loading the instructions from the correct branch destination. This is especially wasteful for CPUs with long pipelines, since many more instructions have to be flushed. Such a stall may last for many tens of cycles.
Modern processors are equipped with branch predictors (special circuitry that remembers the previous outcomes of each branch instruction) and sometimes even with sophisticated branch prediction algorithms, but even so – in the case of a misprediction, performance can take quite a hit.

What can then be done to avoid such dreadful stalls?
Let’s take a look at a few tricks. In each of the following techniques the conditional is replaced with a branchless equivalent. I will write the examples in C but they are generally valid for any other programming language. Parameter validation and error handling will be omitted.
1. Turning the conditional into a boolean value
Let’s start with a simple function. It takes two integers and returns the smaller of the two:
int minimum (int a, int b)
{
    if (a < b)
        return a;
    else
        return b;
}
Without performing the comparison between a and b, the processor does not know which to return. For a branchless alternative, we could replace the logic above with the following:
int minimum (int a, int b)
{
    return a * (a < b) + b * (b <= a);
}
Now, if a is the minimum, (a < b) will evaluate to true, which the C standard defines to have the integer value 1. (b <= a) will evaluate to false, which has the value 0. In this case the function will return: a * 1 + b * 0 = a.
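As a quick sanity check (a minimal sketch of my own, not part of the original listing), the branchless version agrees with the straightforward one, including when the arguments are equal:

```c
#include <assert.h>

/* Branchless minimum: exactly one of the two comparison results is 1
 * (with <= breaking the tie when a == b), so exactly one term survives. */
static int minimum (int a, int b)
{
    return a * (a < b) + b * (b <= a);
}
```

Note that using <= for one of the comparisons is what keeps the two terms mutually exclusive when a equals b.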
This technique works for other operations as well. Here is an example of a conditioned assignment:
#include <stddef.h> // size_t
void constrain (int arr[], size_t size, int ceiling)
{
    for (size_t i = 0; i < size; ++i)
    {
        if (arr[i] > ceiling)
            arr[i] = ceiling;
    }
}
This function “constrains” the integers of an array to a given “ceiling” integer.
Here too, we can get rid of the ‘if’ statement with the technique of turning it into a boolean value:
#include <stddef.h>
void constrain (int arr[], size_t size, int ceiling)
{
    for (size_t i = 0; i < size; ++i)
    {
        arr[i] = ceiling * (arr[i] > ceiling) + arr[i] * !(arr[i] > ceiling);
    }
}
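To see the branchless constrain() in action, here is a small usage sketch (the sample values are my own):

```c
#include <stddef.h>

/* Branchless constrain: for each element, exactly one of the two
 * boolean factors is 1, selecting either the ceiling or the original value. */
static void constrain (int arr[], size_t size, int ceiling)
{
    for (size_t i = 0; i < size; ++i)
    {
        arr[i] = ceiling * (arr[i] > ceiling) + arr[i] * !(arr[i] > ceiling);
    }
}
```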
2. Replacing the conditional with an arithmetic operation
Let’s say we would like to run through an array of integers and determine if a given value is in it. A simple way to do this would be:
#include <stdbool.h>
#include <stddef.h>
bool isFound (int arr[], size_t size, int val)
{
    for (size_t i = 0; i < size; ++i)
    {
        if (arr[i] == val)
            return true;
    }
    return false;
}
Now consider the following alternative:
#include <stdbool.h>
#include <stddef.h>
bool isFound (int arr[], size_t size, int val)
{
    int result = 1;
    for (size_t i = 0; i < size; ++i)
    {
        result *= (arr[i] - val);
    }
    return result == 0;
}
In this case we took advantage of a simple arithmetic property: two numbers are equal if and only if their difference is 0. In each iteration we multiply the running product by the next difference – if the difference in any iteration is 0, the product becomes 0 and stays 0 for the rest of the loop. After exiting the loop, all that is left to check is whether the result is actually 0, meaning the value was found somewhere in the array.
The thing to watch out for here is integer overflow: the product can grow very quickly, and signed integer overflow is undefined behavior in C.
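If overflow is a concern, a variant of the same idea (a sketch of my own, not from the original text) accumulates the boolean comparison results with bitwise OR instead of multiplying differences – the loop body stays branch-free and the running value never exceeds 1:

```c
#include <stdbool.h>
#include <stddef.h>

/* Overflow-safe variant: OR together the result of each equality test.
 * result becomes 1 as soon as val is seen and stays 1 afterwards. */
static bool isFoundSafe (int arr[], size_t size, int val)
{
    int result = 0;
    for (size_t i = 0; i < size; ++i)
    {
        result |= (arr[i] == val);
    }
    return result != 0;
}
```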
3. Using bit manipulations
Consider this function, which sums all the integers in an array that are greater than or equal to a given value:
#include <stddef.h>
int sumGreater (int arr[], size_t size, int val)
{
    int sum = 0;
    for (size_t i = 0; i < size; ++i)
    {
        if (arr[i] >= val)
            sum += arr[i];
    }
    return sum;
}
Here we can replace the ‘if’ statement with a bit hack:
#include <stddef.h>
#include <limits.h>
int sumGreater (int arr[], size_t size, int val)
{
    int sum = 0;
    for (size_t i = 0; i < size; ++i)
    {
        int mask = (arr[i] - val) >> (sizeof(int) * CHAR_BIT - 1);
        sum += ~mask & arr[i];
    }
    return sum;
}
The trick here is that if arr[i] is greater than or equal to val, (arr[i] - val) will be non-negative and its sign bit will be 0. Otherwise, the sign bit will be 1. Shifting right by the number of bits in an int minus one “spreads” that sign bit across all the bits. This is called sign extension and it serves to preserve the sign of negative numbers when the right shift operator >> is used (as a matter of fact, this behavior is not defined by the standard but is implementation-defined, meaning it is not guaranteed by every compiler - see section 6.5.7 of the latest draft standard). The result is a mask that is either all 0s (when arr[i] >= val) or all 1s (when arr[i] < val). The following & operation then either leaves arr[i] unchanged, if ~mask is all 1s, or turns it into 0, if ~mask is all 0s. Note that the subtraction (arr[i] - val) itself can overflow, which is again undefined behavior for signed integers.
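To make the mask concrete, here is a small helper isolating just the mask computation (the helper name is my own, not from the original text). On the common two's-complement platforms with arithmetic right shift, the mask comes out as -1 (all 1s) when x < y and 0 otherwise:

```c
#include <limits.h>

/* Spread the sign bit of (x - y) across the whole int.
 * Assumes arithmetic right shift and no overflow in x - y. */
static int mask_of (int x, int y)
{
    return (x - y) >> (sizeof(int) * CHAR_BIT - 1);
}
```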
There are plenty of other branchless algorithms to be found on Bit Twiddling Hacks.
4. Using lookup tables
Occasionally it can be useful to replace branches with lookup tables (LUTs). A lookup table is an array that maps keys (the indices) to values (the elements). Indexing into an array is usually a simple, branch-free operation, thus saving valuable processing time.
To remove branches with this technique, simply use an array to store the branch destinations - these may include either data or actions (as function pointers).
The following:
static void foo () {}
static void bar () {}
enum choice_t
{
    FOO,
    BAR
};
void doSomething (enum choice_t choice)
{
    switch (choice)
    {
        case FOO:
            foo();
            break;
        case BAR:
            bar();
            break;
    }
}
Can be replaced with:
static void foo () {}
static void bar () {}
enum choice_t
{
    FOO,
    BAR,
    NUM_OF_ACTIONS
};
typedef void (*action_t) ();
const action_t actionLUT[NUM_OF_ACTIONS] =
{
    foo,
    bar
};
void doSomething (enum choice_t choice)
{
    actionLUT[choice]();
}
Just keep in mind that lookup tables occupy some space in memory, and in some environments - such as embedded systems - memory is meticulously managed and reducing space complexity is often more critical than runtime performance.
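LUTs are just as useful for data as for actions. As a small illustrative example (my own, not from the original text), a 12-case switch mapping a month to its length can collapse into a single indexed load:

```c
/* Data LUT: days in each month of a non-leap year.
 * month is a zero-based index, 0 = January .. 11 = December. */
static const int daysInMonth[12] =
{
    31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
};

static int monthLength (int month)
{
    return daysInMonth[month];
}
```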
5. Removing unnecessary comparisons
There is another trick to reduce the branches in the isFound() function from technique #2. Looking closely, there is actually another condition being checked at each iteration of the loop – the termination condition (i < size).
Let’s look at the function definition again:
#include <stdbool.h>
#include <stddef.h>
bool isFound (int arr[], size_t size, int val)
{
    for (size_t i = 0; i < size; ++i)
    {
        if (arr[i] == val)
            return true;
    }
    return false;
}
One way to get rid of this condition is to use val as a sentinel value:
a. Check if the last item of the array is equal to val. If so, terminate.
b. Otherwise, store the last item of the array in a separate variable and insert val as the last item of the array.
c. Loop over the array, breaking when the current item is equal to val.
d. Verify the iteration index. If the whole array was scanned, it means val was found only as the last item (which we inserted ourselves).
This clever method is known as “Elephant in Cairo”. Here is the result:
#include <stdbool.h>
#include <stddef.h>
bool isFound (int arr[], size_t size, int val)
{
    if (arr[size-1] == val)
    {
        return true;
    }
    int last = arr[size-1];
    arr[size-1] = val;
    size_t i;
    for (i = 0; arr[i] != val; ++i)
        ;
    arr[size-1] = last; // restore the original last item
    return i != size-1;
}
When should you go branchless?
The first thing to keep in mind is that the techniques presented above are considered micro-optimizations. Nowadays, most compilers already do a pretty good job of optimizing branches away (read about the cmov instruction). Let’s inspect the assembly generated by x86-64 gcc 10.2 (using godbolt). For the minimum() function above (with branches):
cmp edi, esi
mov eax, esi
cmovle eax, edi
ret
For the branchless version of the minimum() function:
mov eax, 0
cmp edi, esi
cmovge edi, eax
cmovge eax, esi
add eax, edi
ret
Notice that the compiler already replaced the branch with a conditional move (cmovle), so neither version actually branches – and the “branchy” source compiles to fewer instructions than the hand-optimized one (instruction count is only a rough proxy for performance, but it is telling here). Therefore, more often than not, making your code clear and simple to understand will go a long way. But what if your compiler doesn’t generate code that is as fast as necessary? This is the case for many compilers of low-end embedded systems. Knowing such tricks is certainly valuable for people who implement language infrastructure.
The place where branchless algorithms really shine is the case of performance-critical code: one or two functions that will run many many times on one particular processing unit.
A few areas where branchless code can really come in handy are:
Embedded systems (as mentioned);
GPU programming: GPUs have multiple cores sharing the same instruction pipeline (each core runs the same instruction, but on different data);
Cryptographic applications: attackers can gather information based on how long certain operations take (e.g. short passwords on login) – branchless algorithms, which run in constant time regardless of input, can thwart such attacks.
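To illustrate the cryptographic case, here is a sketch of a constant-time buffer comparison, a common building block in crypto code (the function name and signature are my own, not from the original text). It inspects every byte no matter where the first mismatch occurs, so its running time does not leak the mismatch position:

```c
#include <stddef.h>

/* Returns 0 iff the two buffers are byte-for-byte equal.
 * The loop has no data-dependent branch: differences are
 * accumulated with XOR and OR instead of an early return. */
static int ctCompare (const unsigned char *a, const unsigned char *b, size_t len)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < len; ++i)
    {
        diff |= (unsigned char)(a[i] ^ b[i]);
    }
    return diff;
}
```

Compare this with a naive memcmp-style loop, which returns at the first mismatch and therefore runs longer the more leading bytes match.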
In the future we should expect more complex processor designs (even in low-end CPUs), probably with longer pipelines. Branch misprediction penalties will thus become higher and higher. For a performance-aware developer, taking care of branches will become more and more important.