Laszlo

Hello, I am Laszlo

Software-Enginner, .NET developer

Contact Me

SIMD Gather and Scatter with Contains All ASCII

Introduction

One of the most difficult problems with SIMD is handling non-contiguous memory access. To address this challenge AVX-512 adds gather and scatter instructions to load and store memory in an array at non-adjacent indexes. These instructions enable a whole new set of algorithms to be vectorized using SIMD operations.

Gather is a single instruction that loads data from non-adjacent indexes of an array into a Vector register.Scatter is a single instruction that stores data at non-adjacent indexes to an array from a Vector register.

Both instructions have a source/destination register parameter, a reference to an array parameter, and another vector parameter containing the indexes for each lane to be loaded or stored.

In this blog post, I will explore one algorithm using gather and scatter instructions with .NET 10. Unfortunately, .NET 10 has not yet implemented these instructions for AVX family of instruction sets, although the API surface has already been approved. Another issue for using the AVX instruction set is that my current CPU does not support AVX-512 (or above). Fortunately, .NET implements the gather and scatter intrinsics for Scalable Vector Extension (SVE). SVE is introduced by Armv8-A architecture. As I also have a Snapdragon X Elite laptop, I thought this would be an excellent opportunity for the test. Unfortunately, while this CPU is derived from Armv8-A architecture, it does not implement SVE. Eventually, I settled on using SVE in the cloud. The Cobalt machines of Azure implement SVE, hence those are used for the performance measurements in this post.

Problem

Modern programming languages often come with a set of core libraries that provide the basic functionality for common types. One such type is string. In .NET string type comes with a wide variety of operations to build, parse, search or amend objects of this type. Most search methods such as Contains, IndexOf, etc. have already been vectorized. These methods help to find individual characters, patterns, and custom character groups. These operations can validate if an input string only contains characters in a certain range. However, they don't answer if an input string contains all characters of a character group. For example, does input x contain all characters of lower-case ASCII letters ([a-z])?

Note that, to answer this question the whole input needs to be read, as in worst case one character could be missing from the given character class.

In the example below, I make the following assumptions:

  • The input string only contains lower-case ASCII letters.

  • The distribution of chars in the input is non-uniform.

  • Most inputs do not have all characters present from [a-z] range.

Solutions

In this section, I show two implementations, one regular and one using SIMD with SVE.

Regular Approach

The first solution uses a backing array, where each element of the array corresponds to one of the lower-case characters in [a-z]. The algorithm iterates through the input and for each character in the [a-z] range sets the corresponding index of the array to true. Finally, it validates that every element of the array is set to true.

Depending on the expected input values (and the distribution of characters in the inputs), it might be worth checking if the buffer array contains only trues in every iteration of the main loop.

public bool ContainsAllAsciiLowerCaseLetters(ReadOnlySpan<char> a)
{
    // Inputs with less then 26 in length cannot
    // contain all lower-case characters.
    if (a.Length < 26)
        return false;
    bool[] buffer = new bool[26];
    for (int i = 0; i < a.Length; i++)
    {
        var c = a[i];
        // Consider a wider range of input: if (c >= 'a' && c <= 'z')
        buffer[c - 'a'] = true;
    
        // Consider checking if buffer
        // contains only true values.
    }

    foreach (var c in buffer)
        if (!c)
            return false;
    return true;
}

SVE Approach

With SVE, the implementation follows the same logic, but it handles Vector<uint>.Count number of chars of the input at a time.

  • The Scatter and LoadVectorUInt16ZeroExtendToUInt32 intrinsics work with pointer types, hence the method has an unsafe modifier and the csproj file allows for unsafe code.

  • The input parameter is Span<char> instead of being a ReadOnlySpan<char>, as it needs pinning (with fixed keyword), so the GC won't move the object while untracked pointers point to it.

  • The backing array is a stack allocated int[], because Scatter has no overloads for bools or smaller types.

  • The backing array is stack allocated (implicitly pinned).

  • LoadVectorUInt16ZeroExtendToUInt32 loads and expands the UTF-16 characters into uint.

  • Scatter sets the value 1 in the backing array at the indexes of currentV.

  • When handling a wide range of inputs, consider using the mask vector to filter characters of (c >= 'a' && c <= 'z')

public unsafe bool ContainsAllAsciiLowerCaseLettersSimd(Span<char> a)
{
    if (a.Length < 26)
        return false;

    // Using int for the backing array.
    int* buffer = stackalloc int[26];
    var offset = new Vector<uint>('a');

    // Pin the input.
    fixed (char* ptrA = a)
    {
        ushort* source = (ushort*)ptrA;
        int i = 0;
        for (; i < a.Length - Vector<uint>.Count; i += Vector<uint>.Count)
        {
            // Load chars into Vectors, also widening to uint. Loads 4 chars into a vector.
            var currentV = Sve.LoadVectorUInt16ZeroExtendToUInt32(Vector<uint>.One, &source[i]);
            
            // Consider a wider range of input by validating (c >= 'a' && c <= 'z')
            currentV -= offset;

            // Scatter sets 1 in the backing array at the indexes of currentV
            Sve.Scatter(Vector<int>.One, buffer, currentV, Vector<int>.One);
        }

        // Handling the remaining part of the input. 
        // With SVE consider using masking.
        for (; i < a.Length; i++)
            buffer[a[i] - 'a'] = 1;
    }

    for (int i = 0; i < 26; i++)
        if (buffer[i] != 1)
            return false;

    return true;
}

Performance Comparison

The SIMD version of the method handles 4 chars at a time (as 4 uint values fit in a vector on the current implementation of SVE in .NET 10). That gives a theoretical limit of 4 for the speed-up compared to the regular approach. In practice it is rare to observe such performance improvements, because the vectorized solution also needs to load the data in vector registers. However, less branching typically means less branch miss-predictions. In this case the gain is about x2, which is respectable in this case, however it also comes with the tradeoff of unsafe code.

BenchmarkDotNet v0.15.4, Windows 11 (10.0.26200.6584) (Hyper-V)
Cobalt 100 3.40GHz, 1 CPU, 2 logical and 2 physical cores
.NET SDK 10.0.100-rc.1.25451.107
[Host]     : .NET 10.0.0 (10.0.0-rc.1.25451.107, 10.0.25.45207), Arm64 RyuJIT armv8.0-a
DefaultJob : .NET 10.0.0 (10.0.0-rc.1.25451.107, 10.0.25.45207), Arm64 RyuJIT armv8.0-a


| Method                               | Input              | Mean     | Error   | StdDev  |
|------------------------------------- |------------------- |---------:| -------:| -------:|
| ContainsAllAsciiLowerCaseLetters     | aaaa(...)wxyz[387] | 258.4 ns | 0.45 ns | 0.42 ns |
| ContainsAllAsciiLowerCaseLettersSimd | aaaa(...)wxyz[387] | 112.0 ns | 0.05 ns | 0.04 ns |

The performance of strings without all desired characters in the input has matching performance.