Now what is a journal club? For this course it means:
- You will read the literature for each session in advance.
- There will be a test at the beginning of every session.
- The test results will be discussed together.
- Open questions about the literature will be discussed together.
- There will be NO LECTURE in those sessions. If you don't read the literature, you won't learn a thing.
- You will have problems understanding the papers. This is OK for the first sessions.
- You will learn to understand the papers better and to discover the critical sections (which usually become questions in the tests :-)).
- You will learn to ask questions about papers.
- You will participate in a workshop on concurrency and parallelism, together with the industry (one Friday afternoon).
At the end of the journal club, you should have learned two things: First, a lot about current concepts and technologies for multicore programming. Second, how to really use papers to get an understanding of advanced engineering topics. |
We will be using two books: The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit, and the seminal Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson. It would not hurt to order them from an Indian bookstore at a good price...
Concurrency and parallelism are core concepts of computing. Peter Van Roy has created an overview of the most important concepts underlying our programming languages, and it is worth taking a look at them: Programming Paradigms for Dummies: What Every Programmer Should Know, Peter Van Roy. And it is surprisingly hard to define what concurrency, parallelism and asynchrony really mean. Please read the paper for this session.
In this session we will start with an understanding of memory models and why they are needed. We will uncover the secrets of volatile, thread-local storage, race conditions etc. We will read two sections from the Herlihy book: chapter 3.8 (the Java Memory Model) and Appendix A.1 and A.2. Start with A.2 and test your understanding of multicore code!
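To see why the memory model matters, here is a minimal Java sketch (all names are mine, for illustration): without the volatile keyword, the reader thread may legally spin forever, because nothing forces the writer's update to become visible.

```java
/** Visibility sketch under the Java Memory Model (illustrative names). */
public class VisibilityDemo {
    static volatile boolean ready = false; // remove 'volatile' and the loop may spin forever
    static int payload = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { /* busy-wait */ }
            // The volatile happens-before rule guarantees we see payload == 42:
            // the write to payload happens-before the write to ready,
            // which happens-before our read of ready.
            System.out.println(payload);
        });
        reader.start();
        payload = 42;
        ready = true;
        reader.join();
    }
}
```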
You might have been surprised by the freedom languages and compilers have with your sequential statements. It gets worse when we look at how the hardware treats your code. This session is about all things multi-threading on current multicore processors, and you will understand NOTHING without diving into the gory details. We will read Chapter 5: Thread-Level Parallelism. Many of you will notice that they are missing a lot of background on caching, CPUs etc. The other chapters of this book, written by the "gods of RISC CPUs", provide all that information.
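As a taste of the gory details, here is a small illustrative micro-benchmark of my own (assuming 64-byte cache lines and that the JVM keeps field declaration order): two threads incrementing neighbouring fields fight over the same cache line, and a bit of padding makes the coherence "ping pong" disappear.

```java
/** False-sharing sketch: timings are machine-dependent, assumptions noted above. */
final class FalseSharingDemo {
    // Two counters on the same cache line: writes by different threads
    // keep invalidating each other's copy of the line.
    static final class Shared { volatile long a; volatile long b; }

    // Padding pushes b onto a different cache line (64-byte lines assumed).
    static final class Padded { volatile long a; long p1,p2,p3,p4,p5,p6,p7; volatile long b; }

    static long runMillis(Runnable r1, Runnable r2) throws InterruptedException {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.nanoTime();
        t1.start(); t2.start(); t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        final long N = 100_000_000L;
        Shared s = new Shared();
        Padded p = new Padded();
        System.out.println("shared line: " + runMillis(
            () -> { for (long i = 0; i < N; i++) s.a++; },
            () -> { for (long i = 0; i < N; i++) s.b++; }) + " ms");
        System.out.println("padded:      " + runMillis(
            () -> { for (long i = 0; i < N; i++) p.a++; },
            () -> { for (long i = 0; i < N; i++) p.b++; }) + " ms");
    }
}
```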
There is not just one form of consistency, and developments like NoSQL databases and asynchronous processing in large-scale sites are proof of it. In this session we are taking a big gulp of theory about how parallel processing works, both logically and physically. The session is based on chapter 3 of the Herlihy book. If you find this challenging (and it is), you might IN ADDITION read A foo walks into a bar. Take your time with reading the papers! If you have trouble understanding sequential consistency and linearizability, take a look at the excellent Wikipedia page on linearizability, with examples for both. My take on it is simple: do not move new requests before responses, as that violates real-time constraints (the new request depends on a complete and valid parent state). Then take a pen and put dots on the time bands of the requests. If you manage to place those dots (which signify the moment a request becomes effective) in an order compatible with the program logic, you have shown linearizability of the requests.
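In code, the "dot" is the linearization point. A minimal Java sketch of my own: in this counter, every increment takes effect atomically at the instant its compareAndSet succeeds, and that instant always lies between the operation's invocation and its response.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative counter: the successful CAS is the linearization point. */
final class LinearizableCounter {
    private final AtomicInteger value = new AtomicInteger();

    int incrementAndGet() {
        while (true) {
            int current = value.get();
            // Linearization point: this CAS either takes effect here,
            // atomically, or fails and we retry with a fresh value.
            if (value.compareAndSet(current, current + 1)) return current + 1;
        }
    }
}
```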
Fiddling around with weakened consistency rules is one thing. Something else is trying to find algorithms which do not require blocking. This has become a field of current research due to the bad scalability of locking algorithms on multicore architectures. I would like to look at the following algorithms (a small CAS sketch follows after the lists):
Session 1:
- Chapter 5.8 in Herlihy: the compare-and-set operation (needed for wait-free consensus between threads)
- Chapter 6.4 in Herlihy: universal wait-free construction (to see how complex wait-free algorithms can become)
- Chapter 7.5 in Herlihy: queue locks (scalable spin-locks)
- Chapter 8.3 in Herlihy: reader-writer locks (because they are needed so often)

Session 2:
- Chapter 9.3 in Herlihy: concurrent reasoning (how to use invariants for proofs)
- Chapter 9.8 in Herlihy: non-blocking synchronization (lazy and optimistic methods for lists). Sets implemented as non-locking lists are an important concept for the integration of new hardware as well: the morning paper has a discussion of "Efficient lock-free durable sets" in the context of new NVRAM non-volatile storage systems.
- Chapters 10.5 and 10.6 in Herlihy: unbounded lock-free queue and the ABA problem
- Chapter 16.5 in Herlihy: work-stealing dequeues (needed for the Fork/Join framework in Java)
All of these can be found in the Herlihy book. For atomic snapshots, the paper by Chandy and Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, is relevant too.
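As a foretaste of the compare-and-set and ABA items above, here is a sketch of Treiber's classic lock-free stack in Java (class names are mine): no locks, just retried CAS operations on the top pointer.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Treiber's lock-free stack: push/pop synchronize only via compareAndSet. */
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T item; Node<T> next;
        Node(T item) { this.item = item; }
    }
    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T item) {
        Node<T> node = new Node<>(item);
        Node<T> oldTop;
        do {
            oldTop = top.get();
            node.next = oldTop;
        } while (!top.compareAndSet(oldTop, node)); // retry on contention
    }

    T pop() {
        Node<T> oldTop, newTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null; // empty
            newTop = oldTop.next;
            // In Java, garbage collection prevents the classic ABA problem here;
            // in C/C++ the same CAS loop needs tagged pointers or hazard pointers.
        } while (!top.compareAndSet(oldTop, newTop));
        return oldTop.item;
    }
}
```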
Well, do we really need all that lock/wait-free stuff? Here we are looking for real-world applications of those techniques, and we might even discover some new ways to build our applications (being an older guy has some advantages: I remember discussions around user-level threads, green threads, non-preemption, kernel bypass etc. many, many years ago). The first topic is Thread-Per-Core (TPC), which mandates that the number of threads and the number of cores in a multicore are kept equal. This of course has consequences for scheduling etc. (non-preemptive) and brings a lot of new concepts with it, like thread-pinning, I/O-only cores etc. It also mandates the use of sharding: because we do not want locks to keep threads apart, we try not to share data. A light intro to TPC is given in Introducing Glommio - a Thread-per-Core Crate for Rust & Linux. This blog entry contains a lot of other interesting papers, e.g. a discussion of How io_uring and eBPF Will Revolutionize Programming in Linux. Glauber Costa comes from ScyllaDB - a high-speed distributed DB! And Efficient IO with io_uring. A short glance at those papers shows that wait- and lock-free ring buffers are a core concept to avoid context switches, and that even CAS techniques can be avoided to achieve the highest throughput. Lucas Cräer wrote a nice example of such a ring buffer on GitHub; a sketch of the idea follows below.
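To make the ring-buffer idea concrete, here is a minimal single-producer/single-consumer sketch of my own in Java (not Cräer's code): because each index has exactly one writer, ordered writes suffice, and neither locks nor CAS are needed.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal SPSC ring buffer sketch: one producer thread, one consumer thread. */
final class SpscRingBuffer<T> {
    private final Object[] buffer;
    private final int mask;                           // capacity must be a power of two
    private final AtomicLong head = new AtomicLong(); // next slot to read (consumer only)
    private final AtomicLong tail = new AtomicLong(); // next slot to write (producer only)

    SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Called only by the single producer. Returns false if full. */
    boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = value;
        tail.lazySet(t + 1); // ordered (release) write: no CAS, no lock
        return true;
    }

    /** Called only by the single consumer. Returns null if empty. */
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        T value = (T) buffer[(int) (h & mask)];
        buffer[(int) (h & mask)] = null;  // allow GC
        head.lazySet(h + 1);
        return value;
    }
}
```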
Here are some more short links on applications of TPC that I found useful: TPC for Kafka, Reducing tail latencies with automatic cooperative task yielding
Measuring Mutexes, Spinlocks and how Bad the Linux Scheduler Really is by Malte Skarupke (there was a heated discussion with Linus Torvalds on the topic of spin-locks and yielding).
C is not a low level language (or better: C does not map very well to modern hardware), and An analysis of performance evolution of Linux's core operations
There are still many places where shared-state multithreading has no alternative. The java.util.concurrent package has many helpful algorithms and data structures for this case. Java 101: Java concurrency without the pain, Part 1 and Java 101: Java concurrency without the pain, Part 2
Also relevant is Doug Lea's paper on the fork/join framework. To get a better understanding of the library, you can look at some animations from javaConcurrentAnimated.jar. To see how the library elements are used, take a look at the talk by Jessica Kerr, Concurrency Options on the JVM. All resources are in my owncloud.
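A minimal fork/join sketch (names are mine) to go with Lea's paper: a task splits the work in two, forks one half into the work-stealing pool and computes the other half itself.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Fork/join sketch: sum an array by recursively splitting it into subtasks. */
final class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, compute directly
    private final long[] data; private final int from, to;

    SumTask(long[] data, int from, int to) { this.data = data; this.from = from; this.to = to; }

    @Override protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          // left half goes to the work-stealing pool
        return right.compute() + left.join(); // right half runs in this thread
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // 1000000
    }
}
```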
A nice talk from Thompson: https://2016.javazone.no/program/a-quest-for-predictable-latency-with-java-concurrency
A programming paradigm that has few problems with concurrency is FP. The paper contains some basic information on it: Functional Programming For The Rest of Us. We will focus on continuations, currying, side-effects etc. The session is a preparation for the next sessions on Java 8 streams and Reactive Programming.
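As a tiny appetizer for currying, a sketch in plain Java (FP languages make this far more natural; the names are mine): a two-argument function rewritten as a chain of one-argument functions, enabling partial application without side effects.

```java
import java.util.function.Function;

/** Currying sketch: add(x)(y) instead of add(x, y). */
public class CurryDemo {
    public static void main(String[] args) {
        // Each application returns a new pure function; nothing is mutated.
        Function<Integer, Function<Integer, Integer>> add = x -> y -> x + y;
        Function<Integer, Integer> addFive = add.apply(5); // partial application
        System.out.println(addFive.apply(3)); // 8
    }
}
```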
Finally Java has got higher-order functions. We will look at the requirements of stream.parallel(), distinguish .reduce from .collect, and look at the fork/join engine behind them.
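A small sketch of the distinction (my own example): .reduce combines immutable partial results with an associative function, while .collect lets each fork/join worker fill its own mutable container and merges the containers afterwards.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.LongStream;

/** parallel(): reduce needs an associative function; collect needs a combinable container. */
public class ParallelStreams {
    public static void main(String[] args) {
        // reduce: associativity makes the parallel split/merge order irrelevant.
        long sum = LongStream.rangeClosed(1, 1_000_000)
                             .parallel()
                             .reduce(0L, Long::sum);

        // collect: each worker fills its own list; the combiner merges them.
        List<Integer> evens = IntStream.rangeClosed(1, 100)
                                       .parallel()
                                       .filter(n -> n % 2 == 0)
                                       .boxed()
                                       .collect(Collectors.toList());

        System.out.println(sum + " / " + evens.size()); // 500000500000 / 50
    }
}
```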
GO has been designed for parallelism from the start. Does this mean there are no concurrency bugs possible, because we do not have shared state? The paper Concurrency bugs in the Go language introduces the concurrency features of GO and shows that concurrency bugs are still possible. There is a good discussion on Hacker News. And just a short glance at GO from a Rust perspective highlights a different way of thinking. Finally, there is the famous paper on Communicating Sequential Processes (CSP). You can start with the Wikipedia link for a better overview of the history of CSP.
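Java has no channels, but the CSP flavour can be approximated with a bounded BlockingQueue. A sketch of my own, for intuition only: the threads share nothing and communicate solely by message passing; as the Go bug study shows, even this style can go wrong, e.g. if the sentinel is never sent, the consumer blocks forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** CSP-flavoured sketch: a BlockingQueue standing in for a Go channel. */
public class CspStyle {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 10; i++) channel.put(i); // blocks when full
                channel.put(-1);                              // sentinel: "channel closed"
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int v; (v = channel.take()) != -1; )     // blocks when empty
                    System.out.println("received " + v);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
    }
}
```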
I admit that the concept of coroutines has always been a bit alien to me. Now is the time to take a closer look at the alternative to futures in Kotlin: Coroutines. A good slide set and talk are available on InfoQ.
There are plans to use the coroutine concept behind Swift to implement async/await syntax. The Wikipedia article on coroutines is also not bad.

What if we cannot afford synchronization but need to parallelize our data processing? Are there data structures which support some form of "eventual consistency", e.g. because the results can be aggregated later without problems? This exciting research area is actually a really hot topic at Google, Facebook and co. We can start with an excellent Strange Loop video from Peter Bourgon at SoundCloud: Consistency without consensus in production systems. There is also a nice introduction to CRDTs and the CALM principles in Distributed Systems for fun and profit. Deeper information can be found here: CRDTs: Consistency without concurrency control, Mihai Letia, Nuno Preguiça, Marc Shapiro, and Data Structures in the Multicore Age
An actual filesystem supporting CRDTs is ElmerFs.
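A minimal CRDT sketch of my own in Java, the classic grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the per-replica maximum, which is commutative, associative and idempotent, so replicas converge without any coordination.

```java
import java.util.HashMap;
import java.util.Map;

/** G-Counter sketch: a grow-only CRDT counter (illustrative, not production code). */
final class GCounter {
    private final String replicaId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String replicaId) { this.replicaId = replicaId; }

    /** Each replica only ever increments its own slot. */
    void increment() { counts.merge(replicaId, 1L, Long::sum); }

    /** The counter's value is the sum over all replica slots. */
    long value() { return counts.values().stream().mapToLong(Long::longValue).sum(); }

    /** Merge state from another replica: per-entry maximum, safe in any order. */
    void merge(GCounter other) {
        other.counts.forEach((id, c) -> counts.merge(id, c, Math::max));
    }
}
```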
This would be a topic in itself, and there is a nice intro in the Herlihy book. But I'd like to see it covered together with Clojure in our workshop on C&P in January. We sometimes need the concept of an atomic "unit of work" (ACID). But we don't want programmers to deal with locks. The STM concept offers a solution where the compiler controls updates to data structures and automatically rolls back if something went wrong. We use the chapter on STM in Herlihy to understand the algorithm. And there is this talk by Maurice Herlihy on Transactional Memory that we could use.
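This is NOT Herlihy's algorithm, just a toy sketch of my own to convey the programming model: a transaction runs on a snapshot, tries to commit with a CAS, and on conflict is rolled back and retried automatically; the programmer never touches a lock.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Toy STM-flavoured reference (single ref only; real STMs coordinate many refs). */
final class TxRef<T> {
    private final AtomicReference<T> value;

    TxRef(T initial) { value = new AtomicReference<>(initial); }

    T read() { return value.get(); }

    /** Run the transaction until it commits; conflicts cause automatic retry. */
    void atomically(UnaryOperator<T> transaction) {
        while (true) {
            T snapshot = value.get();                 // begin: take a snapshot
            T updated = transaction.apply(snapshot);  // work on the snapshot
            if (value.compareAndSet(snapshot, updated)) return; // commit
            // conflict: another commit won; the work is discarded and retried
        }
    }
}
```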
We did have an introduction by Benjamin Hensle at the beginning, and we won't be able to do much more this term. The LMAX Disruptor is certainly a topic for next time: asynchronous processing pipelines with backpressure. How does the multi-threading work under the hood?
If we find some time after the workshop, we could use it to discuss two small papers. One describes an interesting experiment by the well-known database enfant terrible Michael Stonebraker et al.: take the most commonly used technologies for concurrency and parallelism and run them on a simulated 1000-core machine. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores is a sobering experience, as it shows that NO concurrency control technology today is able to deal efficiently with 1000 cores.
Finally, the Linux legend Torvalds questions the use of multi-core architectures in many use cases in his post on Avoiding Ping Pong. Some quotes from there: "What's the advantage? You won't get scaling for much longer, and current trends are actually for lower power anyway. So what's the upside of pushing the whole parallelism snake-oil? We know that we need fairly complex OoO CPU's anyway, because people want reasonable performance and it turns out OoO is actually more efficient than slow in-order. The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage. Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics)". Ok, enough. Go and read the post... And this is not all by far: I got a real surprise reading Scalability! But at what COST?. A wonderful comparison of single-threaded local processing with concurrent distributed stuff: do not underestimate the cost of communication!
Latency and how to deal with it: Effectively Prefetching Remote Memory with Leap
On 15th January we are going to have a workshop on concurrency and parallelism, together with the industry. I suggest the following topics for short presentations. A final discussion of the different approaches will hopefully increase our understanding of this complicated topic. We will also discuss patterns of applicability with respect to the different technologies.
Some of the ideas for the workshop follow the book Seven Concurrency Models in Seven Weeks by Paul Butcher, and some were suggestions from the course.
Clojure (agents) and Akka (actors) are well known for this architectural style, and we should take a look at both; a minimal sketch of the core idea follows below. The Clojure approach is well documented in a chapter of Paul Butcher's book, and Vanessa Werner is currently working at Exxeta on Akka as part of her master's thesis.
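To show the core idea behind agents and actors in plain Java (a sketch of my own, nothing like Akka's real API): state is confined to a single thread that processes one mailbox message at a time, so no locks are needed. Usage: new Actor<String>(System.out::println).send("hello").

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Minimal actor sketch: a mailbox plus one dedicated worker thread. */
final class Actor<M> {
    private final LinkedBlockingQueue<M> mailbox = new LinkedBlockingQueue<>();

    Actor(Consumer<M> behavior) {
        Thread worker = new Thread(() -> {
            try {
                while (true) behavior.accept(mailbox.take()); // one message at a time
            } catch (InterruptedException e) { /* actor stopped */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Asynchronous send: never blocks the caller. */
    void send(M message) { mailbox.add(message); }
}
```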
Hidden behind the rather arcane topic of "Communicating Sequential Processes", message passing (MP) has been a core architectural feature of Erlang for many years. And not just in the telco industry: large-scale sites use many Erlang programs too. The GO language made the concept even more popular. The Butcher book has a chapter on CSP with Clojure, and we could use it as an intro to the concept.
State and updates of state are usually the most critical parts of concurrent systems. But it is rather hard to imagine a system where you can have variables without the state problem. Clojure seems to separate identity from state using STM mechanisms. A good reason to look at how this works by reading the chapter in Paul Butcher's book. There is also a chapter on STM in the Herlihy book if you need more background on the algorithms behind it. BTW: what level of isolation is provided by the algorithm in the Herlihy book? Is it sequential consistency? Linearizability? Some form of snapshot isolation?
This has not been covered in our course, but it gives us a good chance to better understand the differences between thread-level parallelism (our topic) and what can be done with different kinds of problems, e.g. massively parallel algorithms. The goal would be to clearly distinguish use-cases for both technologies.
Just about the hottest topic in Big Data and large-scale sites: real-time processing of data in pipelines. There is an interesting architectural pattern called the "lambda architecture" (not to be confused with Amazon's new Lambda, which is equally exciting...) which allows both batch and real-time processing of data. It is worth taking a look at this to see what we can do with this kind of system.