SIMD Gather and Scatter with Contains All ASCII
10/13/2025 | 6 minutes to read
Introduction
One of the most difficult problems with SIMD is handling non-contiguous memory access. To address this challenge AVX-512 adds gather and scatter instructions to load and store memory in an array at non-adjacent indexes. These instructions enable a whole new set of algorithms to be vectorized using SIMD operations.
Gather is a single instruction that loads data from non-adjacent indexes of an array into a Vector register.Scatter is a single instruction that stores data at non-adjacent indexes to an array from a Vector register.
Both instructions have a source/destination register parameter, a reference to an array parameter, and another vector parameter containing the indexes for each lane to be loaded or stored.
In this blog post, I will explore one algorithm using gather and scatter instructions with .NET 10. Unfortunately, .NET 10 has not yet implemented these instructions for AVX family of instruction sets, although the API surface has already been approved. Another issue for using the AVX instruction set is that my current CPU does not support AVX-512 (or above). Fortunately, .NET implements the gather and scatter intrinsics for Scalable Vector Extension (SVE). SVE is introduced by Armv8-A architecture. As I also have a Snapdragon X Elite laptop, I thought this would be an excellent opportunity for the test. Unfortunately, while this CPU is derived from Armv8-A architecture, it does not implement SVE. Eventually, I settled on using SVE in the cloud. The Cobalt machines of Azure implement SVE, hence those are used for the performance measurements in this post.
Problem
Modern programming languages often come with a set of core libraries that provide the basic functionality for common types. One such type is string
. In .NET string
type comes with a wide variety of operations to build, parse, search or amend objects of this type. Most search methods such as Contains
, IndexOf
, etc. have already been vectorized. These methods help to find individual characters, patterns, and custom character groups. These operations can validate if an input string
only contains characters in a certain range. However, they don't answer if an input string
contains all characters of a character group. For example, does input x
contain all characters of lower-case ASCII letters ([a-z]
)?
Note that, to answer this question the whole input needs to be read, as in worst case one character could be missing from the given character class.
In the example below, I make the following assumptions:
The input
string
only contains lower-case ASCII letters.The distribution of chars in the input is non-uniform.
Most inputs do not have all characters present from
[a-z]
range.
Solutions
In this section, I show two implementations, one regular and one using SIMD with SVE.
Regular Approach
The first solution uses a backing array, where each element of the array corresponds to one of the lower-case characters in [a-z]
. The algorithm iterates through the input and for each character in the [a-z]
range sets the corresponding index of the array to true
. Finally, it validates that every element of the array is set to true
.
Depending on the expected input values (and the distribution of characters in the inputs), it might be worth checking if the buffer array contains only
true
s in every iteration of the main loop.
public bool ContainsAllAsciiLowerCaseLetters(ReadOnlySpan<char> a) { // Inputs with less then 26 in length cannot // contain all lower-case characters. if (a.Length < 26) return false; bool[] buffer = new bool[26]; for (int i = 0; i < a.Length; i++) { var c = a[i]; // Consider a wider range of input: if (c >= 'a' && c <= 'z') buffer[c - 'a'] = true; // Consider checking if buffer // contains only true values. } foreach (var c in buffer) if (!c) return false; return true; }
SVE Approach
With SVE, the implementation follows the same logic, but it handles Vector<uint>.Count
number of chars of the input at a time.
The
Scatter
andLoadVectorUInt16ZeroExtendToUInt32
intrinsics work with pointer types, hence the method has anunsafe
modifier and thecsproj
file allows for unsafe code.The input parameter is
Span<char>
instead of being aReadOnlySpan<char>
, as it needs pinning (withfixed
keyword), so the GC won't move the object while untracked pointers point to it.The backing array is a stack allocated
int[]
, becauseScatter
has no overloads forbool
s or smaller types.The backing array is stack allocated (implicitly pinned).
LoadVectorUInt16ZeroExtendToUInt32
loads and expands the UTF-16 characters intouint
.Scatter
sets the value 1 in the backing array at the indexes of currentV.When handling a wide range of inputs, consider using the mask vector to filter characters of
(c >= 'a' && c <= 'z')
public unsafe bool ContainsAllAsciiLowerCaseLettersSimd(Span<char> a) { if (a.Length < 26) return false; // Using int for the backing array. int* buffer = stackalloc int[26]; var offset = new Vector<uint>('a'); // Pin the input. fixed (char* ptrA = a) { ushort* source = (ushort*)ptrA; int i = 0; for (; i < a.Length - Vector<uint>.Count; i += Vector<uint>.Count) { // Load chars into Vectors, also widening to uint. Loads 4 chars into a vector. var currentV = Sve.LoadVectorUInt16ZeroExtendToUInt32(Vector<uint>.One, &source[i]); // Consider a wider range of input by validating (c >= 'a' && c <= 'z') currentV -= offset; // Scatter sets 1 in the backing array at the indexes of currentV Sve.Scatter(Vector<int>.One, buffer, currentV, Vector<int>.One); } // Handling the remaining part of the input. // With SVE consider using masking. for (; i < a.Length; i++) buffer[a[i] - 'a'] = 1; } for (int i = 0; i < 26; i++) if (buffer[i] != 1) return false; return true; }
Performance Comparison
The SIMD version of the method handles 4 chars at a time (as 4 uint
values fit in a vector on the current implementation of SVE in .NET 10). That gives a theoretical limit of 4 for the speed-up compared to the regular approach. In practice it is rare to observe such performance improvements, because the vectorized solution also needs to load the data in vector registers. However, less branching typically means less branch miss-predictions. In this case the gain is about x2, which is respectable in this case, however it also comes with the tradeoff of unsafe code.
BenchmarkDotNet v0.15.4, Windows 11 (10.0.26200.6584) (Hyper-V) Cobalt 100 3.40GHz, 1 CPU, 2 logical and 2 physical cores .NET SDK 10.0.100-rc.1.25451.107 [Host] : .NET 10.0.0 (10.0.0-rc.1.25451.107, 10.0.25.45207), Arm64 RyuJIT armv8.0-a DefaultJob : .NET 10.0.0 (10.0.0-rc.1.25451.107, 10.0.25.45207), Arm64 RyuJIT armv8.0-a | Method | Input | Mean | Error | StdDev | |------------------------------------- |------------------- |---------:| -------:| -------:| | ContainsAllAsciiLowerCaseLetters | aaaa(...)wxyz[387] | 258.4 ns | 0.45 ns | 0.42 ns | | ContainsAllAsciiLowerCaseLettersSimd | aaaa(...)wxyz[387] | 112.0 ns | 0.05 ns | 0.04 ns |
The performance of strings without all desired characters in the input has matching performance.