Laszlo

Hello, I am Laszlo

Software-Engineer, .NET developer

Contact Me

Distinct and HashSet

A common pattern in business-as-usual (BAU) applications is that a class in the business logic layer requests a set of entities by their integer IDs from a repository class. Here is one possible implementation of this pattern:

  1. A method in the business layer receives a list of integers as an argument (List<int>).
  2. It applies filtering and selection on these integers by creating a new enumerable.
  3. It extracts the unique values from the filtered and projected numbers.
  4. It passes the filtered, selected unique numbers to a query method in the repository class.
  5. This repository method processes the passed in integer enumerable:
    • If there aren't any numbers in the enumeration, it returns null (or empty results).
    • Otherwise, it copies the numbers into an SQL query and returns the query's result.

In this blog post, I will focus on two possible implementations to produce the unique set of integers:

  1. Using the Distinct LINQ extension method.
  2. Using the ToHashSet<T> method to materialize the result before passing it to the repository class.

Let's compare these implementations from a performance point of view.

Comparison

I will re-create the above scenario in a minimal reproducible example. The input argument is a list of integers:

var source = Enumerable.Sequence(0, 1000, 1).ToList();

In real-life the source integers are likely to be more random.

The business logic applies a Where and a Select to filter and project the input numbers. The predicates passed to these methods have a significant impact on overall performance. This blog post explores how.

var enumerable = source.Where(x => x < 50).Select(x => x);

Generating the unique set of integers is implemented in two ways:

var itemsToProcess = enumerable.Distinct();
// OR
var itemsToProcess = enumerable.ToHashSet();

TLDR;

The benefit of using the Distinct extension method is that it does not force the enumeration to materialize, hence it does not incur an allocation for an integer collection. The benefit of using ToHashSet is that it memoizes the results, making multiple iterations of it as an enumerable cheaper.

Both result objects are stored in the itemsToProcess variable, which is typed as IEnumerable<int> and can be passed to the repository method. However, the underlying objects are different: Distinct returns an iterator, while ToHashSet returns a HashSet object.

For the purpose of this blog post, I replace the body of the repository method with a simple 'copy' of the numbers to an array. This operation iterates each item of the input in a similar way to how an SQL query creation would serialize the numbers into a WHERE column IN (value1, value2, value3, ...); clause.

[MethodImpl(MethodImplOptions.NoInlining)]
public int[] Process(IEnumerable<int> source)
{
    if (!source.Any())
        return null; // To shortcut, return null when the input is empty.
    return WorkProtected(source);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public int[] WorkProtected(IEnumerable<int> source)
{
    return source.ToArray();
}

In the above snippet, the Process method 'iterates' the source enumerable twice: once for the Any() method call and once for the ToArray() method call. Let's dive into what this iteration means.

Performance Comparison

Let's compare the performance of these implementations using the .NET 10 runtime and BenchmarkDotNet. This post explores three scenarios:

  • The Where predicate filters in 50 numbers (as shown above in the sample).
  • Using a Where predicate that results in an empty collection: Where(x => x > 1000).
  • Using a Where predicate that returns only the last item: Where(x => x > 999).

50 Items

Test results with 50 items Where(x => x < 50):

| Method       | Mean     | Error     | StdDev    | Gen0   | Gen1   | Allocated |
|------------- |---------:|----------:|----------:|-------:|-------:|----------:|
| WithHashSet  | 1.739 us | 0.0255 us | 0.0213 us | 0.4921 | 0.0019 |   3.02 KB |
| WithDistinct | 1.513 us | 0.0202 us | 0.0179 us | 0.5035 | 0.0038 |   3.09 KB |

With 50 items, the results are close to equal in both mean execution time and allocated memory.

Empty Collection

With the Where predicate: Where(x => x > 1000)

| Method       | Mean     | Error   | StdDev  | Gen0   | Allocated |
|------------- |---------:|--------:|--------:|-------:|----------:|
| WithHashSet  | 976.5 ns | 8.99 ns | 7.51 ns | 0.0229 |     144 B |
| WithDistinct | 251.6 ns | 1.33 ns | 1.24 ns | 0.0100 |      64 B |

With an empty collection, the WithDistinct benchmarks run significantly faster while using less than half the memory.

Last Item

When only the last item of the enumeration is included (Where(x => x > 999)):

| Method       | Mean     | Error     | StdDev    | Gen0   | Allocated |
|------------- |---------:|----------:|----------:|-------:|----------:|
| WithHashSet  | 1.018 us | 0.0100 us | 0.0084 us | 0.0439 |     280 B |
| WithDistinct | 1.238 us | 0.0072 us | 0.0060 us | 0.0534 |     344 B |

In this case, the DistinctIterator allocates more and runs longer. The mean execution time difference is influenced by the predicate invoked by the Where extension method.

These results might not meet the usual expectations: shouldn't the Any() call cost twice as much with the Distinct implementation? Yet, it is significantly faster in certain cases and somewhat slower in others.

ToHashSet Extension Method

In the case of the ToHashSet call, a new set is created. The Any() and the ToArray() calls operate on this type. The Any() call becomes cheap because the set has a known count property. The ToArray() call will copy the items of the HashSet to a contiguous array.

Distinct Iterator

To further understand the results, let's explore the internals of the Distinct call. The first thing to notice is that the expression source.Where(x => x < 50).Select(x => x) returns a specialized iterator. The Where extension method returns a ListWhereIterator, and the Select extension method returns a ListWhereSelectIterator. This means that instead of separate iterators, a single iterator is used.

The Distinct method returns a DistinctIterator that also wraps the previous ListWhereSelectIterator as a source. When the underlying collection is an empty array, it also returns an empty array, so no distinct call is needed. The code snippet above does not benefit from this fast path because the underlying collection is not an array.

The next call is the Any() method. This method branches execution based on the underlying type of the IEnumerable<T> source. In the current case, it finds the source is an iterator, DistinctIterator. This is useful because the Iterator base class provides a few fast path operations. For example, when a count/length property is not directly available in a cheap way, the iterator tries to fetch the first item from the underlying iterator.

The ListWhereSelectIterator uses the ArrayWhereSelectIterator to find the first item.

Exploring Behaviors

When there are no items in the enumeration, the Process method becomes cheap as there is no HashSet allocated at all (as opposed to the ToHashSet solution).

When there are some items allocated, the Any() only needs to iterate the input until the first item, so the performance characteristics are driven by the input data and the Where precondition:

  • With dense results, the cost of this iteration becomes negligible, as shown with Where(x => x < 50).
  • With sparse results, this iteration may become expensive, as shown with Where(x => x > 999).

With numbers in the filtered enumeration (Where(x => x < 50)), the DistinctIterator's ToArray call will create a HashSet to produce the distinct items before copying them into the final array. This is the same operation that the ToHashSet executes, hence the common case is very close in terms of mean execution time and allocated memory.

However, evaluating this multiple times will become costly as a new set is created every time.

I assume that if the repository method knowingly iterates the complete source multiple times, then it declares an ICollection<T> or IReadOnlyCollection<T>, etc., as the input parameter in its signature.

Conclusion

In conclusion, neither solution is inherently better or worse; they have different performance characteristics depending on the input in edge cases. The key observation is that the Any() call does not force a full evaluation of the Distinct operation, which provides an advantage with 'empty' results. In other cases, using ToHashSet is a good choice if lazy evaluation is not required. There is no universally "better" solution; the choice depends on the specific requirements of the application. Understanding the performance characteristics of Distinct and ToHashSet allows developers to make informed decisions and optimize their code for different scenarios.