Laszlo

SIMD Sum 2

In this post, I'll explore how to optimize a simple array-summing operation using SIMD (Single Instruction, Multiple Data) operations in C#. The example is inspired by Matt Godbolt's GOTO 2024 talk, What Every Programmer Should Know about How CPUs Work, about CPU architecture and branch prediction. We'll see how leveraging SIMD instructions can dramatically improve performance by reducing branch mispredictions and processing multiple elements in parallel.

A part of the talk describes the CPU's branch prediction feature. It uses a simple task for demonstration: a method is given a large set of random numbers, sums all of them, and separately also sums the numbers below 128. The talk shows sample implementations in Python and C++ and explains the reasons for the observed performance difference.

In this post I will implement this example in C# with a single difference: the input numbers are bytes, not ints.

Naive Implementation

The naive implementation below mirrors the Python code shown in the talk, but with C# syntax. It iterates the input and aggregates the total sum and the sum of the numbers below 128. It uses the uint type for the aggregated values to avoid overflow: with 1,000,000 bytes of input, the total is at most 255,000,000, which fits comfortably in a uint.

public (uint, uint) Sum(byte[] input)
{
    uint sumSmall = 0;
    uint sumTotal = 0;
    for (var i = 0; i < input.Length; i++)
    {
        if (input[i] < 128)
            sumSmall += input[i];
        sumTotal += input[i];
    }
    return (sumSmall, sumTotal);
}

I use BenchmarkDotNet to measure the performance of this method with 1,000,000 bytes of input on the .NET 10 Preview 3 runtime.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2


| Method   | Mean        | Error     | StdDev    | BranchMispredictions/Op | BranchInstructions/Op |
|--------- |------------:|----------:|----------:|------------------------:|----------------------:|
| Sum      | 4,038.86 us | 76.923 us | 78.994 us |                 501,912 |             2,623,794 |

The result presented above is not comparable with the result shown in the talk, as the input sizes differ, as do the machines/platforms where the benchmarks were executed. One interesting observation is that the branch misprediction rate in this sample is ~20% (501,912 of 2,623,794), while the Python code in the talk showed <1%. After seeking clarification from Matt Godbolt: this could be explained by Python (being interpreted) having significantly more branching to perform than C#, where JITting is excluded from this measurement.
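Before reaching for SIMD, the data-dependent branch itself can be removed with a scalar sign-extension trick. The variant below is not part of this post's benchmarks; it is a hypothetical sketch showing where the mispredictions come from:

```csharp
using System;

// Branchless variant of the naive sum (illustration only, not benchmarked
// in this post). For b in [0, 255], (b - 128) >> 31 is all ones (-1) when
// b < 128 and all zeros otherwise, so the AND keeps or discards the value
// without a conditional jump.
static (uint, uint) SumBranchless(byte[] input)
{
    uint sumSmall = 0;
    uint sumTotal = 0;
    foreach (byte value in input)
    {
        int b = value;
        sumSmall += (uint)(b & ((b - 128) >> 31));
        sumTotal += (uint)b;
    }
    return (sumSmall, sumTotal);
}

var (small, total) = SumBranchless(new byte[] { 1, 127, 128, 255, 0, 200 });
Console.WriteLine($"{small} {total}"); // 128 711
```

The trade is one branch for a few ALU operations per element; on unpredictable data this already removes the data-dependent mispredictions, while SIMD goes further by processing many elements per instruction.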

SIMD Implementation

The Basic Approach

With modern .NET, better results can be achieved using SIMD instructions. This implementation uses types from the System.Numerics namespace, which provide a versatile wrapper over multiple vectorized instruction sets. On my test laptop, it leverages AVX2.

How It Works

The SIMD implementation follows these key steps:

  1. Load multiple bytes into a SIMD register as a vector
  2. Compare all elements with 128 in a single instruction
  3. Use the comparison result as a mask to select values from a zero vector and from the original vector
  4. Accumulate results in vector registers
  5. Periodically widen and sum to prevent overflow
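Steps 2 and 3 can be tried in isolation on a handful of values. A minimal sketch (the input values are chosen arbitrarily for illustration, not taken from the benchmark):

```csharp
using System;
using System.Numerics;

// A small demo of the compare + mask-select steps (illustrative values).
byte[] data = new byte[Vector<byte>.Count];
for (int i = 0; i < data.Length; i++)
    data[i] = (byte)(i * 20); // mixes values below and above 128

var vCurrent = new Vector<byte>(data);
var vLimit = Vector.Create<byte>(128);

// Lanes become all ones where the element is below 128, all zeros otherwise.
var vMask = Vector.LessThan(vCurrent, vLimit);

// Keep the original value where the mask is set, zero elsewhere.
var vSmall = Vector.ConditionalSelect(vMask, vCurrent, Vector<byte>.Zero);

for (int i = 0; i < data.Length; i++)
    Console.Write($"{vSmall[i]} ");
```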

For brevity the details are explained with code comments.

The Code

public (uint, uint) SumSimd(byte[] input)
{
    // Two vectors to sum the intermediate vector results (to avoid overflow).
    Vector<uint> vSumSmall = Vector<uint>.Zero;
    Vector<uint> vSumTotal = Vector<uint>.Zero;

    // Two vectors to contain intermediate results.
    Vector<ushort> vInterSmall = Vector<ushort>.Zero;
    Vector<ushort> vInterTotal = Vector<ushort>.Zero;
    byte b = 0;

    Vector<byte> vLimit = Vector.Create<byte>(128);
    int i = input.Length - Vector<byte>.Count;
    for (; i >= 0; i -= Vector<byte>.Count)
    {
        // 👇 Load the next vector of bytes from the input and compare it with 128.
        var vCurrent = Vector.LoadUnsafe(ref input[i]);
        var vMask = Vector.LessThan(vCurrent, vLimit);

        // Select 0 or the original value based on the mask into a new vector.
        var vDiff = Vector.ConditionalSelect(vMask, vCurrent, Vector<byte>.Zero);

        // Convert the vector of bytes to a vector of ushorts
        // and sum them into the intermediate results.
        Vector.Widen(vDiff, out var lower, out var upper);
        vInterSmall += lower + upper;
        Vector.Widen(vCurrent, out lower, out upper);
        vInterTotal += lower + upper;

        // Every 128th iteration, sum the intermediate vectors
        // into uint vectors to avoid overflow.
        unchecked { b += 2; }
        if (b == 0)
        {
            Vector.Widen(vInterSmall, out var lower2, out var upper2);
            vSumSmall += lower2 + upper2;
            Vector.Widen(vInterTotal, out lower2, out upper2);
            vSumTotal += lower2 + upper2;
            vInterSmall = Vector<ushort>.Zero;
            vInterTotal = Vector<ushort>.Zero;
        }
    }

    // Flushing the final values of the vectors into a uint sum.
    Vector.Widen(vInterSmall, out var flower2, out var fupper2);
    vSumSmall += flower2 + fupper2;
    Vector.Widen(vInterTotal, out flower2, out fupper2);
    vSumTotal += flower2 + fupper2;
    var sumSmall = Vector.Sum(vSumSmall);
    var sumTotal = Vector.Sum(vSumTotal);

    // Iterate the remaining numbers that do not fit into a full vector.
    for (i += Vector<byte>.Count - 1; i >= 0; i--)
    {
        if (input[i] < 128)
            sumSmall += input[i];
        sumTotal += input[i];
    }

    return (sumSmall, sumTotal);
}
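The flush cadence in the code above can be sanity-checked with a bit of arithmetic: each iteration adds lower + upper, so a ushort lane grows by at most 2 × 255 per iteration, and the byte counter b (incremented by 2) wraps to zero after 128 iterations.

```csharp
using System;

// Worst-case growth of a ushort accumulator lane between flushes.
const int maxPerIteration = 2 * 255;      // lower + upper, each lane <= 255
const int iterationsBetweenFlushes = 128; // byte counter b += 2 wraps after 128 steps
int worstCase = maxPerIteration * iterationsBetweenFlushes;

Console.WriteLine(worstCase);                    // 65280
Console.WriteLine(worstCase <= ushort.MaxValue); // True: no overflow possible
```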

Performance Measurements

Please note that the results presented below are not comparable with the results shown in the talk, as the size of the inputs differ as well as the machine/platform where the benchmarks were executed.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2


| Method  | Mean        | Error     | StdDev    | BranchMispredictions/Op | BranchInstructions/Op |
|-------- |------------:|----------:|----------:|------------------------:|----------------------:|
| Sum     | 4,038.86 us | 76.923 us | 78.994 us |                 501,912 |             2,623,794 |
| SumSimd |    49.69 us |  0.535 us |  0.418 us |                      50 |                63,586 |

The results show that the SIMD solution is significantly faster on a randomized (unsorted) input array with 1,000,000 elements. It gains performance benefits in two ways:

  • The vectorized SIMD operations process multiple elements at a time. This places an upper limit of 16-fold improvement on the test machine, as the vector registers hold 16 ushort elements.
  • Branch mispredictions (and branch instructions) are both significantly lower, so the CPU spends more cycles on useful work instead of rolling back mispredicted branches.
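The vector widths behind the 16-fold bound can be inspected at runtime; the counts in the comments are what AVX2's 256-bit registers give, and other machines will report different values:

```csharp
using System;
using System.Numerics;

// Inspect what Vector<T> maps to on the current machine.
Console.WriteLine(Vector.IsHardwareAccelerated);
Console.WriteLine(Vector<byte>.Count);   // 32 with AVX2 (256-bit registers)
Console.WriteLine(Vector<ushort>.Count); // 16 with AVX2
Console.WriteLine(Vector<uint>.Count);   // 8 with AVX2
```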

I ran the same benchmarks on Linux/Arm64 (a Raspberry Pi). With .NET 10 Preview 3, this device uses 64-bit SIMD registers, so a vector fits only 4 ushort values. The performance benefit beyond that 4-fold vector width comes from avoiding branch mispredictions.

BenchmarkDotNet v0.14.0, Debian GNU/Linux 12 (bookworm)
Unknown processor
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 10.0.0 (10.0.25.17105), Arm64 RyuJIT AdvSIMD


| Method  | Mean      | Error     | StdDev    |
|-------- |----------:|----------:|----------:|
| Sum     | 17.599 ms | 0.0144 ms | 0.0127 ms |
| SumSimd |  2.764 ms | 0.0123 ms | 0.0115 ms |

Conclusion

The performance comparison between the naive and SIMD implementations demonstrates significant benefits of using vectorized operations for this simple sum calculation:

  • On x64/Windows with AVX2, the SIMD version is ~80x faster, dropping execution time from 4ms to just 50us
  • On ARM64/Linux, we still see a ~6x speedup even with smaller vector registers
  • Branch mispredictions are reduced dramatically from ~500k to just 50 on x64

The improvements come from two main factors:

  1. Processing multiple elements in parallel through SIMD vector operations
  2. Drastically reducing branches and branch mispredictions by replacing the conditional logic with vectorized comparisons

While the SIMD implementation is more complex, the performance gains can be substantial for computation-heavy scenarios, especially on modern CPU architectures with wide vector registers. The results also highlight how branch prediction challenges in the naive implementation can significantly impact performance when working with unpredictable data.

This example demonstrates that understanding CPU architecture features like SIMD and branch prediction remains relevant for optimizing performance-critical code, even in managed languages like C#.