Practical Concurrency: Some Rules
When Tinderbox started out, computers had a single processor. When it was time to update agents, Tinderbox interrupted your work for a moment, updated agents, and then allowed you to resume. We tried hard to guess when it would be convenient to do this, but of course that’s not always something your computer can anticipate.
Nowadays, your computer has somewhere between 4 and 24 processors. Starting with Tinderbox 6, agents no longer interrupted you; agents do their work on one processor while you do your work on another. The two tasks run concurrently.
Concurrent operations can be tricky to get right. For example, suppose one operation is writing “The truth will set you free” into a buffer, and another operation is writing “Donald Trump” into the same buffer. You might end up with “The tr Trump”, or “Donald will set you free”, or “Toearl…” or something else. If one processor reads while another is writing, it might see a partial result, and that might be complete nonsense. This means you need to take great care whenever processors share results.
Getting concurrency right by the book is one thing, and getting it right in the trenches is something else entirely. I’ve recently changed my approach to concurrency; here are my newly-revised rules.
- You can get away with murder. Going by the book, you’ve got to use extreme caution and you’ve always got to get it right. In practice, Tinderbox Six took all sorts of risks and accepted that it was Doing It Wrong in order to get stuff done. That worked remarkably well for a long time. Naive Concurrency blows up when two operations step on each others’ toes: a lot of the time, they’ll just be lucky and will go for hours (or years) without getting in each others’ way.
- You can often get away without concurrency. (This is the Law of Brent Simmons, named for the developer of Vesper who pledged to do everything on the main thread.) Computers are fast: much of the time, just ask one processor to do everything and you’ll be fine. You can’t always do without concurrency: some things like network access do require it. But if you just do everything on the main thread, you’ll often find that everything is fast enough.
- The profiler is now good. It wasn’t always. In the Tinderbox 4 era, firing up the Profiler meant recompiling the world, and that took 20 minutes. Then, you'd get slow and inconclusive results, and throw up your hands. Life today is better, recompiling the world only takes a minute or two. For Tinderbox, ruthless refactoring has eliminated lots of classes that had a zillion dependencies, and that means I seldom need to recompile the world anyway.
- Placekicks are easy. The placekick concurrency pattern is a way to offload a complex calculation or network access if you don't need the answer right away. In the placekick patterns, the “ball” is a bundle of work that needs to be done; you set it up, you kick it to another processor, and then you forget all about it. For example, Tinderbox notes may need to display miniature thumbnails of their text, and because those thumbnails might have lots of different fonts, styles, and pictures, they can be slow to build. So, when we add the note’s view, we fire up a new operation to build the thumbnail in the background, and leave a note behind to say that there's no thumbnail yet but it's under construction. If we need to draw the note before the thumbnail's ready, we simply skip the thumbnail; when it’s finally ready, we leave the thumbnail the appropriate place and it’ll get drawn the next time we update the screen. Placekicks are hard to get wrong: you kick the ball, and you're done. When the operation has done what it was asked to do, it leaves the result in an agreed-upon place; it doesn't need to coordinate with anything else or ask permission. If you can, use placekicks and only placekicks.
- The placekicker shouldn’t tackle. The concurrent operation has one responsibility: kick the ball. It does its task. It may also need to do something to let the system know that it’s finished its work, but that last thing should be trivial. Post a notification, or set a flag, or send one object a single message. Don’t mix responsibilities.
- Never put anything but a placekick on a global queue. Placekicks can’t deadlock. You know they can’t deadlock. Any idiot can see they can’t deadlock. If there’s any question, make a private queue instead.
- Queues are light and cheap. Operations are light and cheap, too. It takes a few microseconds to make a GCD dispatch queue, and scarcely longer to make an NSOperationQueue. Adding an operation to a queue is just as fast. It’s not necessary to be confident that all your tasks are big enough to demand concurrent processing: if some are, there's not much overhead to simply spinning everything off.
- Focused queues are easier to use. If a queue has one clear purpose, it’s easier to be confident it won’t deadlock. Dispatch queues are cheap. Don’t share queues, don’t reuse queues, don’t worry about making queues.
- Classes should encapsulate their queues. This is a big change: Tinderbox used to depend heavily on a bunch of queues that were public knowledge. That’s a bad idea, First, we're sharing disposable objects — I had no idea how disposable dispatch queues are, but there’s no reason to conserve them. Second, when lots of classes are sharing a queue, then any badly-behaved class might cause untold trouble for unrelated classes that share the work queue. Placekick concurrency is an implementation detail: no one needs to know that there’s a queue in use, and they certainly don't need the details or identity of the queue.
- Test the kick, not the queue. Unit testing concurrent systems is better than it used to be, but clever factoring makes it unnecessary to unit-test placekicks. Instead, make the task operation available as its own method or method object, and let the test system test that. You’ll also want to do some integration testing on the whole concurrent system, but that’s another story.
- Classes should clean their queues. Be sure that any objects and resources that your tasks require remain available until the tasks are done. Closing the document and quitting the application requires special care that tasks be completed before we dispose of their tools.
- To clean a queue, cancel all pending operations, then wait for the current operation to finish. Do not suspend the queue, but do make sure no new tasks are added! It’s easy for a class to be sure that it doesn't add anything to its own private queue, but hard for a system to be confident that no one is adding tasks to a queue shared with lots of other objects. That’s a big advantage of private queues.
- Use queue hierarchies to consolidate bursts of work. When we open a new Tinderbox document, there's a bunch of low-level indexing that needs to be done. It’s not urgent, and typically we need only a few milliseconds per note, but some notes will take more work and there might be 10,000 notes. So, we placekick the indexing. The system sees that we want 10,000 things done! “Gadzooks!” it says to itself, “I’d better roll up my sleeves!” This can make the system s[ion up a bunch of worker threads. But we know better: it looks like a pile of work, but it’s not that much and it’s not urgent. So, we tell the system to route all 10,000 tasks to another queue with limited concurrency: now, the system says “I have 10,000 things to do, but I can only do two of then at a time: piece of cake!”
- Read sync, write async. When you read a shared object, you need a result. Read synchronously: that shows your intent and, if the wait is long, you'll see the problem at once as waiting for the read, rather than some mysterious lock or stall. Write asynchronously; it’s the classic placekick and there's no need to wait. The exception here is where we’re writing to say that the old value is dead, defunct, and not to be used; in that case, write needs to block all the readers, and asynchronous reading can get you back some time. Often, the easiest approach remains serial access managed by a single dedicated serial queue that can be used with the read sync/write async rule.