LINQ and time complexity and data structures, oh my!

LINQ is a wonderful thing. It significantly enhances the expressiveness of C#, and provides lots of other benefits as well. Unfortunately, it comes with a cost.

It can hide things from you.

One of the real dangers is out-of-control time complexity. How many times will the following code perform the .Distinct() operation?

var rnd = new Random();
var values = Enumerable.Range(1, 1000).Select(r => rnd.Next(10));

var uniqueValues = values.Announce("Performing Distinct() operation...").Distinct();
if (uniqueValues.Count() > 2) uniqueValues.First().Dump();
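Announce() isn't a LINQ operator – it's a little extension method of mine, and its implementation isn't shown here. A minimal sketch, assuming it just writes its message each time the source sequence is actually enumerated, would be:

public static class MyExtensions
{
     // Sketch only: because this is an iterator block, the WriteLine runs lazily –
     // once per enumeration of the sequence, not once when Announce() is called.
     public static IEnumerable<T> Announce<T>(this IEnumerable<T> source, string message)
     {
          Console.WriteLine(message);
          foreach (var item in source)
               yield return item;
     }
}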

If you answered ‘two’, congratulations! My output looks like this:

Performing Distinct() operation...
Performing Distinct() operation...
3

Of course, ReSharper warns you about this: “Possible multiple enumeration of IEnumerable”. ReSharper won’t catch all sins, though. I ran into something similar to this recently:

// Example data
var rnd = new Random();
var values = Enumerable.Range(1, 10000).Select(r => rnd.Next(10)).ToList();
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(20)).ToList();
var counter = new object();

// Problem code
var uniqueValues = values.CountCalls(counter).Distinct();
var otherDataWhichMatchesValues = otherData.Where(od => uniqueValues.Contains(od));
otherDataWhichMatchesValues.Count().Dump();
MyExtensions.GetCallCount(counter).Dump();
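CountCalls() and GetCallCount() are custom helpers too, and their internals aren't shown either. A minimal sketch – assuming CountCalls() simply bumps a tally, keyed by the counter object, each time the source sequence is enumerated – would sit alongside the Announce() sketch above:

// Inside the same MyExtensions class sketched earlier – a guess at the helpers,
// not the original code.
private static readonly Dictionary<object, int> CallCounts = new Dictionary<object, int>();

// Increments the tally for 'counter' every time the source sequence is enumerated.
public static IEnumerable<T> CountCalls<T>(this IEnumerable<T> source, object counter)
{
     if (!CallCounts.ContainsKey(counter)) CallCounts[counter] = 0;
     CallCounts[counter]++;
     foreach (var item in source)
          yield return item;
}

public static int GetCallCount(object counter)
{
     return CallCounts.ContainsKey(counter) ? CallCounts[counter] : 0;
}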

That took 19 seconds to run, and made 10,000 calls to Distinct()! Can you see what’s going on? The Where() operation is testing each entry in otherData against uniqueValues – enumerating uniqueValues once for every entry in otherData – and ReSharper 8 doesn’t warn you about it! If you’re used to seeing that warning whenever you try to enumerate an IEnumerable more than once, you might be tempted to think that it won’t happen. You would, of course, be wrong.

A Distinct() operation runs in O(n) time, and Where() iterates otherData in O(n). The killer is the Contains() inside the predicate: uniqueValues is still a deferred query over a List, so every Contains() call re-runs the Distinct() from scratch, enumerating the whole source list in O(n). Nested inside the Where(), that means our overall operation is running in O(n^2) – a real performance killer on any non-trivial data set.
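To make the nesting concrete, here's roughly what evaluating otherDataWhichMatchesValues.Count() boils down to – a conceptual rewrite, not the code LINQ actually generates:

// Conceptual equivalent of the Where().Count() above.
var matches = 0;
foreach (var od in otherData)                 // outer loop: O(n) over otherData
{
     // Contains() enumerates the deferred uniqueValues query until it finds a match,
     // and every enumeration re-runs Distinct() over the whole 10,000-element list.
     foreach (var v in values.Distinct())     // inner loop: O(n) over values, every time
     {
          if (v == od) { matches++; break; }
     }
}

That inner loop runs once for every element of otherData, which is where the 10,000 Distinct() enumerations – and the 19 seconds – come from.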

Adding a .ToList() after the .Distinct() reduces our run time to three thousandths of a second, and reduces our Distinct() operations to one. Much better! (Well, at least, it seems that way for now.)
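In other words, the only change is to materialise the distinct values once, up front:

// Materialise the distinct values once, so each Contains() now runs against an
// in-memory List instead of re-running the deferred Distinct() query.
var uniqueValues = values.CountCalls(counter).Distinct().ToList();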

I have a habit which ReSharper doesn’t like much. I usually avoid methods like Count(predicate), and prefer to chain a Count() on the end of a Where(predicate). One of the reasons I do this is that I think it makes it clearer which calls are LINQ queries, subject to multiple evaluations, and which calls will cause evaluations. Of course, that doesn’t help if you don’t spot that .Distinct() is in the LINQ namespace in the first place!
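For example, these two lines count the same thing, using the data from above:

// ReSharper's preferred form: the predicate goes straight into Count().
var matching = otherData.Count(od => uniqueValues.Contains(od));

// My preferred form: the Where() is visibly a deferred LINQ query, and the Count()
// chained on the end is visibly the call that forces evaluation.
var alsoMatching = otherData.Where(od => uniqueValues.Contains(od)).Count();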

It’s easy to forget about things like time and space complexity, but there’s a reason you learned that stuff: It’s important! Whenever you write a loop or make a LINQ call, something in the back of your mind should be thinking about how nested the current operation is, and how much time and space complexity you’re layering on top. That’s not to say that you should necessarily optimise everything, but it may help you to spot problems before they get out of control.

There are two bigger-picture lessons to learn from this.

Algorithms and Data Structures

The real root of the problem, in this case, came from not thinking through the algorithm or selecting appropriate data structures. Rather than deciding on the desired outcome first, the programmer has worked through a series of operations against the available data until it ended up in the right shape. You can picture the thought process:
– I need to get the entries in otherData which match entries in the values list.
– That will be slow, so I’ll call distinct on values.

The fix – adding a call to ToList() after calling distinct – has unfortunately introduced a more subtle performance bug. It works well for the test data set, but it won’t perform as well if values is sparse: if Distinct() removes a lot of duplicates, then we’ll see a performance improvement, but if there are few duplicates, the original problem the programmer was trying to fix will remain. Let’s measure.

// Example data
var rnd = new Random();
var dataRange = 10;
var values = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var counter = new object();

// Operation
for (int i = 1; i < 100; i++) {
     var uniqueValues = values.CountCalls(counter).Distinct().ToList();
     var otherDataWhichMatchesValues = otherData.Where(od => uniqueValues.Contains(od));
     otherDataWhichMatchesValues.Count().Dump();
}

We’ve now introduced a variable – dataRange – which will roughly control how many duplicates we’ll get. This code roughly parallels our original, with the ToList() fix (run through numerous iterations to exaggerate the timings). As is, it completes in 0.6s, but if we change dataRange to 1000, the run-time increases to 5.4s.

Consider the operations we actually need. We build the ‘values’ dataset once, and then make many calls to .Contains() against it. While an insert into a list is (amortised) O(1), its Contains() is O(n). What we really want is a data structure with O(1) inserts and O(1) lookups – and that’s a hash-based set. So the fix we should really make is to change the values dataset to a HashSet, and drop the Distinct() operation altogether:

var rnd = new Random();
var dataRange = 10;
var values = new HashSet<int>(Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)));
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var counter = new object();

for (int i = 1; i < 100; i++) {
     var otherDataWhichMatchesValues = otherData.Where(od => values.Contains(od));
     otherDataWhichMatchesValues.Count().Dump();
}

Now the run time is around 0.1s, regardless of the value of dataRange. As we can see, it’s not only faster in the sparse case, but it’s faster even than the ideal case with many duplicates – which we timed at 0.6s with 100 iterations.

I see a lot of developers who have a favourite data structure – usually a list or an array – and rarely or never make use of other data structures. If a piece of code has emerged as a target for optimisation, you may as well optimise it properly the first time around. Throwing a .ToList() onto the end of a LINQ expression is kind of a code smell: it indicates that you may not have selected the right data structure in the first place. Unless, of course, you’ve looked at the time complexities of the operations you need, and a list is a good fit: in which case, by all means ToList() it! If you’re feeling particularly lazy, want to ignore everything I’ve said so far, and roll the dice, you could even just .ToRandomCollectionType() it.

Over-reliance on tools

As a final thought, I want to caution people against over-reliance on tools. The problem is not that the developer relied on ReSharper to spot multiple enumerations; this is an unusual enough pattern that even someone used to spotting them unaided could miss it (which is why the ReSharper rule didn’t catch it – and yes, there is an open issue to address exactly this situation). The problem emerges when the developer not only starts to rely on ReSharper to pick up rules like this one, but starts to assume that code without ReSharper warnings is good code. That goes hand-in-hand with the fact that ReSharper isn’t always able to fix the warnings appropriately: in this case, even if ReSharper had caught the problem, its solution – to enumerate the IEnumerable to a list – wouldn’t have been the appropriate (or at least, the best) solution.

One thought on “LINQ and time complexity and data structures, oh my!”

  1. I don’t understand how you’re getting Distinct to call that many times in your second example (before putting the .ToList on). I don’t have the internals of what your CountCalls(object) method is doing so I’ve tried to improvise and I only get 99 calls to Distinct.

    I believe I understand the compounding complexity of having .Contains inside the .Where evaluation, but I’m having a hard time getting the measurement piece to work in my own example.

    Thanks,
