LINQ and time complexity and data structures, oh my!

LINQ is a wonderful thing. It significantly enhances the expressiveness of C#, and provides lots of other benefits as well. Unfortunately, it comes with a cost.

It can hide things from you.

One of the real dangers is out-of-control time complexity. How many times will the following code perform the .Distinct() operation?

var rnd = new Random();
var values = Enumerable.Range(1, 1000).Select(r => rnd.Next(10));

var uniqueValues = values.Announce("Performing Distinct() operation...").Distinct();
if (uniqueValues.Count() > 2) uniqueValues.First().Dump();

If you answered ‘two’, congratulations! My output looks like this:

Performing Distinct() operation...
Performing Distinct() operation...
3
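(Announce(), CountCalls() and GetCallCount(), by the way, aren't LINQ methods – they're little diagnostic helpers of my own, and Dump() is LINQPad's output method. The original implementations aren't shown here, but a minimal sketch that behaves the way these examples need might look like this:)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MyExtensions
{
    private static readonly Dictionary<object, int> CallCounts =
        new Dictionary<object, int>();

    // Writes a message each time the sequence is actually enumerated.
    public static IEnumerable<T> Announce<T>(this IEnumerable<T> source, string message)
    {
        // Because this is an iterator method, this line runs once per enumeration,
        // not once per call.
        Console.WriteLine(message);
        foreach (var item in source)
            yield return item;
    }

    // Increments a counter each time the sequence is actually enumerated.
    public static IEnumerable<T> CountCalls<T>(this IEnumerable<T> source, object counter)
    {
        CallCounts[counter] = GetCallCount(counter) + 1;
        foreach (var item in source)
            yield return item;
    }

    public static int GetCallCount(object counter)
    {
        int count;
        return CallCounts.TryGetValue(counter, out count) ? count : 0;
    }
}
```

The iterator-method deferral is exactly what makes these useful: the side effect fires once per enumeration, so multiple enumerations become visible.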

Of course, ReSharper warns you about this: “Possible multiple enumeration of IEnumerable”. ReSharper won’t catch all sins, though. I ran into something similar to this recently:

// Example data
var rnd = new Random();
var values = Enumerable.Range(1, 10000).Select(r => rnd.Next(10)).ToList();
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(20)).ToList();
var counter = new object();

// Problem code
var uniqueValues = values.CountCalls(counter).Distinct();
var otherDataWhichMatchesValues = otherData.Where(od => uniqueValues.Contains(od));
otherDataWhichMatchesValues.Count().Dump();
MyExtensions.GetCallCount(counter).Dump();

That took 19 seconds to run, and made 10,000 calls to Distinct()! Can you see what’s going on? The Where() operation is testing each entry in otherData against uniqueValues – enumerating uniqueValues once for every entry in otherData – and ReSharper 8 doesn’t warn you about it! If you’re used to seeing that warning whenever you enumerate an IEnumerable more than once, you might be tempted to assume that no warning means no multiple enumeration. You would, of course, be wrong.

A Distinct() operation runs in O(n) time, and the cost of .Contains() depends on what it’s called against – here it’s the lazy uniqueValues query, so each call re-enumerates the whole Distinct() pipeline at a cost of O(n). The Where() performs one of those lookups for each of the n entries in otherData, so the overall operation is running in O(n²) – a real performance killer on any non-trivial data set.

Dropping a .ToList() after the .Distinct() reduces our run time to 3 thousandths of a second, and reduces our Distinct() operations to one. Much better! (Well, at least, it seems that way for now.)

I have a habit which ReSharper doesn’t like much. I usually avoid methods like Count(predicate), and prefer to chain a Count() on the end of a Where(predicate). One of the reasons I do this is that I think it makes it clearer which calls are LINQ queries, subject to multiple evaluations, and which calls will cause evaluations. Of course, that doesn’t help if you don’t spot that .Distinct() is in the LINQ namespace in the first place!
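To illustrate the style difference – both forms produce the same result; the second just makes the boundary between building the lazy query and evaluating it more visible:

```csharp
using System;
using System.Linq;

var numbers = Enumerable.Range(1, 100).ToList();

// ReSharper's preferred form: pass the predicate straight to Count().
var evens = numbers.Count(n => n % 2 == 0);

// My preferred form: Where() builds the (lazy) query; Count() forces evaluation.
var evensChained = numbers.Where(n => n % 2 == 0).Count();

Console.WriteLine(evens == evensChained); // True - both are 50
```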

It’s easy to forget about things like time and space complexity, but there’s a reason you learned that stuff: It’s important! Whenever you write a loop or make a LINQ call, something in the back of your mind should be thinking about how nested the current operation is, and how much time and space complexity you’re layering on top. That’s not to say that you should necessarily optimise everything, but it may help you to spot problems before they get out of control.

There are two bigger-picture lessons to learn from this.

Algorithms and Data Structures

The real root of the problem, in this case, came from not thinking through the algorithm or selecting appropriate data structures. Rather than deciding on an outcome first, the programmer has worked through a series of operations against the available data until it ended up in the right shape. You can picture the thought process:
- I need to get the entries in otherData which match entries in the values list.
- That will be slow, so I’ll call Distinct() on values first.

The fix – adding a call to ToList() after calling Distinct() – has unfortunately introduced a more subtle performance bug. It works well for the test data set, but it won’t perform as well if values is sparse: if Distinct() removes a lot of duplicates, we’ll see a performance improvement, but if there are few duplicates, the original problem the programmer was trying to fix will remain. Let’s measure.

// Example data
var rnd = new Random();
var dataRange = 10;
var values = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var counter = new object();

// Operation
for (int i = 0; i < 100; i++) {
    var uniqueValues = values.CountCalls(counter).Distinct().ToList();
    var otherDataWhichMatchesValues = otherData.Where(od => uniqueValues.Contains(od));
    otherDataWhichMatchesValues.Count().Dump();
}

We’ve now introduced a variable – dataRange – which will roughly control how many duplicates we’ll get. This code roughly parallels our original, with the ToList() fix (run through numerous iterations to exaggerate the timings). As is, it completes in 0.6s, but if we change dataRange to 1000, the run-time increases to 5.4s.

Consider the operations we want to do. We build the ‘values’ dataset once, and then we make many calls to .Contains() against it. While an insert into a list is amortized O(1), Contains() on a list is O(n). What we really want is a data structure which is O(1) for both insert and contains operations – and that’s a hash-based set. So the fix we should really make is to change the values dataset to a HashSet, and drop the distinct operation altogether:

var rnd = new Random();
var dataRange = 10;
var values = new HashSet<int>(Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)));
var otherData = Enumerable.Range(1, 10000).Select(r => rnd.Next(dataRange)).ToList();
var counter = new object();

for (int i = 0; i < 100; i++) {
    var otherDataWhichMatchesValues = otherData.Where(od => values.Contains(od));
    otherDataWhichMatchesValues.Count().Dump();
}

Now the run time is around 0.1s, regardless of the value of dataRange. As we can see, it’s not only faster in the sparse case, but it’s faster even than the ideal case with many duplicates – which we timed at 0.6s with 100 iterations.
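You can also see the difference in isolation with a quick (and rough) micro-benchmark – the absolute numbers will vary by machine, but the gap between the two lookup strategies is the point:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

var rnd = new Random();
var data = Enumerable.Range(1, 10000).Select(r => rnd.Next(1000)).ToList();
var asList = data.Distinct().ToList();
var asSet = new HashSet<int>(data);

var sw = Stopwatch.StartNew();
var viaList = data.Count(d => asList.Contains(d)); // O(n) scan per lookup
var listMs = sw.ElapsedMilliseconds;

sw.Restart();
var viaSet = data.Count(d => asSet.Contains(d));   // O(1) hash lookup
var setMs = sw.ElapsedMilliseconds;

// Both counts are 10000 (every element is in both collections by construction),
// but the list version does roughly 10,000 x 1,000 comparisons to get there.
Console.WriteLine($"List: {listMs}ms, HashSet: {setMs}ms");
```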

I see a lot of developers who have a favourite data structure – usually a list or an array – and rarely or never make use of other data structures. If a piece of code has emerged as a target for optimisation, you may as well optimise it properly the first time around. Throwing a .ToList() onto the end of a LINQ expression is kind of a code smell: it indicates that you may not have selected the right data structure in the first place. Unless, of course, you’ve looked at the time complexities of the operations you need, and a list is a good fit: in which case, by all means ToList() it!

Over-reliance on tools

As a final thought, I want to caution people against over-reliance on tools. The problem is not that the developer relied on ReSharper to spot multiple enumerations; even if they’re used to spotting them without it, this is an unusual operation (which is why the ReSharper rule didn’t catch it – and yes, there is an open issue to address exactly this situation). The problem emerges when the developer not only starts to rely on ReSharper to pick up rules like this one, but starts to assume that code without ReSharper warnings is good code. That goes hand-in-hand with the fact that ReSharper isn’t always able to fix the warnings appropriately: in this case, even if ReSharper had caught the problem, its solution – to enumerate the IEnumerable to a list – wouldn’t have been the appropriate (or at least, the best) solution.

Shared session factories in NHibernate

NHibernate really is a fantastic ORM… unless you use it badly. Or unless you use it kinda OK. Or unless you use it almost-but-not-quite-perfectly. Then it can be a right pain in the neck. There are a lot of things you can get wrong, and those things can cause you a world of pain.

The particular pain I dealt with recently was memory pressure.

One of the most crippling things you can do to a server is to fill up its RAM. Once your RAM is full, your OS has to start copying things out of RAM onto disk, and then back into RAM when you need to use them again. Disks are slow – much, much slower than RAM – and this is going to hurt your performance badly.

Picture this. You’ve done all the right things. Your software is built using a loosely-coupled Service-Oriented Architecture. You have a website, and it hands all sorts of tasks off to a separate service layer. You have a second service handling various data import tasks. As your load increases, it’s going to be very easy to scale horizontally: you can move your services off to separate servers, and the only thing you need to do is update a few network addresses. Once you expand beyond what you can handle with four servers (those three functions plus a separate database server), you can load-balance each function individually.

You’ve also decided to handle multiple tenants with multiple databases. This one isn’t the right decision in every situation, but there are lots of times when it makes sense, particularly if you’re holding a lot of data for each client. It makes it trivial to archive individual clients off. It makes it easy to offer different tiers of backup. It stops your row-counts from getting too high, and it isolates small clients from the performance headaches of the massive data sets maintained for larger clients.

NHibernate is going to kick you in the teeth for doing this.

We’ve been watching the problem approach for a while now. The base memory overhead for each process soared past a gigabyte some time ago. As our client list headed towards a hundred, our memory overhead headed towards two gigabytes per process. I didn’t need to run a memory profiler to know where the problem was (although I did use one to confirm my suspicions). The culprit was the NHibernate session factories.

A single session factory can run towards 20 MB. With fifty clients, that means you have a full gigabyte of RAM filled with nothing but session factories – lots and lots of copies of 20 MB structures which were identical except for a single string: the database connection string. That’s horribly wasteful. (Actually, there were other differences, but we’ll get to those.)

I didn’t want to have to start scaling horizontally early just because of this. I also couldn’t start disposing of session factories once they hadn’t been used for a little while: these things take a while to construct, and we can’t let our users sit around for several seconds when they log in for the first time in a while. I needed to start re-using our session factories.

There are at least two approaches you can take here. The one I chose has two caveats: firstly, that you’re using NHibernate.Cfg.Environment.ReleaseConnections = "on_close", and secondly that you’re not using stateless sessions at all. We’ve been moving towards ditching stateless sessions for some time anyway, because stateless sessions don’t support listeners, so the second requirement wasn’t a problem for us. The first setting is a bit more troubling, because it’s legacy behaviour: rather than letting NHibernate manage connections using one of its newer strategies, it forces NHibernate to provide a connection when a session is first opened, and use that connection for the duration of the session. This was acceptable because we were already using the legacy setting, for reasons undocumented in either code comments or our source control history. I haven’t looked into the costs and benefits of this legacy mode compared to the other strategies.
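(For reference, that first setting is a single line of configuration – the constant lives in NHibernate.Cfg.Environment, alongside the connection provider property we’re about to set:)

```csharp
// Legacy behaviour: acquire the connection when the session opens,
// and hold it until the session closes.
cfg.SetProperty(NHibernate.Cfg.Environment.ReleaseConnections, "on_close");
```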

So, let’s dive into some code. First of all, you’re going to need to set your connection provider:

       cfg.SetProperty(Environment.ConnectionProvider,
                       typeof(SharedCompanyConnectionProvider).AssemblyQualifiedName);

Then, seeing as there’s no such thing as a SharedCompanyConnectionProvider, you’ll need to implement it!

        public class SharedCompanyConnectionProvider : DriverConnectionProvider
        {
            protected override string ConnectionString
            {
                get { return NHibernateSessionManager.Instance.DatabaseSettings.GetCurrentDatabaseConnectionString(); }
            }
        }

If that looks a bit scary, good. If not, let me explain. Your connection provider is no longer thread-safe! It’s relying on a singleton which serves up a connection string. This is dangerous code, and you need to be careful how you use it. (Don’t even think of using this without putting some tests around it – see later in this post.)

Now, on to wherever it is you build your sessions. Mine looks something like this:

            private static readonly object CompanySessionFactoryLockObject = new object();
            ...
            lock (CompanySessionFactoryLockObject)
            {
                var sessionFactory = NHibernateSessionManager.Instance.GetSessionFactory();
                NHibernateSessionManager.Instance.DatabaseSettings.SetCurrentDatabaseConnectionString(databaseGUID);
                ISession session = sessionFactory.OpenSession();
            }

I’ve removed a lot of the detail, but that should give you the gist of what’s going on. The key component here is the lock() line. Now that our connection provider isn’t thread-safe, we have to ensure no other threads interrupt between setting the connection string on the singleton, and creating the actual session (at which time the connection provider will provide a session with the current connection string).
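Wrapped up as a method, this is roughly what the GetNewSession(Guid) that the test code below calls might look like – a sketch only, with NHibernateSessionManager’s internals assumed rather than shown:

```csharp
public ISession GetNewSession(Guid databaseGuid)
{
    // The connection provider reads the "current" connection string from the
    // singleton, so setting the string and opening the session must be atomic.
    lock (CompanySessionFactoryLockObject)
    {
        var sessionFactory = GetSessionFactory();
        DatabaseSettings.SetCurrentDatabaseConnectionString(databaseGuid);
        return sessionFactory.OpenSession();
    }
}
```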

The final step in the process is to make sure you have some thorough testing around what you’re doing. The risk of getting it wrong is that your session factory hands you a connection to the wrong database, and that could be very bad. I’m not going to run through the entire test setup, but it’s certainly not a unit test – this thing runs in a test suite which uses a real database instance and creates (in the case of this test) five complete databases which we’ll be accessing from various threads.

        private volatile static string _assertFailed;
        private const int NumThreadsPerDb = 2;

        [Test]
        public void HammerMultipleDatabasesSimultaneously_BehavesWell()
        {
            List<Thread> runningThreads = new List<Thread>();
            foreach (var coGuid in companyGuids)
            {
                for (int i = 0; i < NumThreadsPerDb; i++)
                {
                    var thread = new Thread(StartHammering);
                    runningThreads.Add(thread);
                    thread.Start(coGuid);
                }
            }
            while (runningThreads.Any(thread => thread.IsAlive))
            {
                if (_assertFailed != null)
                    runningThreads.ForEach(thread => thread.Abort());
                else
                    Thread.Sleep(1000);
            }
            if (_assertFailed != null) Assert.Fail(_assertFailed);
        }

        public void StartHammering(object companyGUIDObj)
        {
            // nb don't assert on a thread. We're set up to set a message into _assertFailed instead.
            var CompanyGUID = (Guid)companyGUIDObj;
            string expectedDbName = CoDatabaseNames[companyGuids.IndexOf(CompanyGUID)];
            try
            {
                Entity entity = new Entity();
                using (var session = NHibernateSessionManager.Instance.GetNewSession(CompanyGUID))
                {
                    // Set up the entity with some unique data
                    session.Save(entity);
                }
                for (int i = 0; i < NumTests; i++)
                {
                    using (var session = NHibernateSessionManager.Instance.GetNewSession(CompanyGUID))
                    {
                        if (!session.Connection.ConnectionString.Contains(expectedDbName))
                            throw new Exception("Got a connection for the wrong database!");
                        var ent = session.Get<Entity>(entity.GUID);
                        // Check some unique thing about the entity. Change it to something else for the next iteration.
                    }
                }
            }
            catch (ThreadAbortException) { }
            catch (Exception ex)
            {
                if (!ex.ToString().Contains("ThreadAbortException"))
                    _assertFailed = ex.ToString();
            }
        }

There’s a lot going on there. The key theme is that we’re creating a bunch of threads, and each thread is assigned to a particular database. New sessions are continuously created, and then queried to ensure they contain the expected object. If the object is not found, or the session has the wrong connection string, then something has gone wrong, and the whole system isn’t behaving in a thread-safe fashion.

Note that in a multi-threaded test situation, you cannot just throw an exception if something goes wrong – you need to pass information about the failure to your primary thread.

One final (and important) step is to ensure the test does fail appropriately if the system doesn’t behave as expected. Remove the lock statement around your session creation code and run the test; you should see it fail. Adding the lock back in should fix it.