Participants of the course "ULS" simulate an architecture group dealing with a large-scale application for some corporation. The atmosphere is more like a workshop than a seminar. The very heterogeneous skills and knowledge in a master course make it quite hard to define hard topics. For this reason the course will start with participants taking a look at existing large-scale sites. They present those sites using EXISTING papers, blogs etc. (explaining things). This usually results in a large number of technologies being mentioned that are new to most participants. In the next step we define large architectural areas, and students take over responsibility for them and start diving into papers and blogs (see below). A strategy group, for example, deals with different approaches and their price/performance/learning curve etc. A serverless group might create ideas for using this technology compared to e.g. a microservice approach. Step by step the understanding of ULS components will improve and the groups will notice dependencies between the architectural parts.
As we discuss large-scale components we will invariably discover some core distributed technologies which we do not yet understand: things like Paxos and Raft, special distributed transactional modes and more. We will pick some for a deeper inspection and a group will try to give an intro. There are also some special topics included in the course, like System Design Interviews (SDI). Once we have a basic understanding of ULS, we will do a workshop where participants will have to fill in an SDI on some topic. In the following session we will discuss the results.
The discussion topics in the ULS course will be largely decided by the participants; work can be done in groups. The following list gives some ideas for topics.
Participants are required to read papers and articles around their topic.
If you don't like to read, don't take this course!!! There is a Google Drive folder where we will collect presentations and other resources.
What can you do if you detect some larger gaps in your distributed systems skills? The following literature might be helpful in this case:
Martin Kleppmann, Designing Data-Intensive Applications
Distributed Systems Reading List
Good conferences are Velocity, Strange Loop and USENIX.
Optional topics for the course (feel free to offer your own ideas): How about serverless at the edge (Cloudflare?), model-driven programming for the cloud? WebAssembly for ULS?
The strategy group deals with the overall architecture, prices, disasters etc.
Head in the cloud - a migration story
MVPs and $100k AWS Bills: Reflections on the launch of Octopus Cloud 1.0
Our journey to type checking 4 million lines of Python
Systems @Scale 2019 New York recap
Killing Kafka : the pitfalls of Over-Architecting
Automated Disaster Recovery using CloudEndure
serverless - the kubernetes killer?
https://tech.channable.com/posts/2019-10-04-why-we-decided-to-go-for-the-big-rewrite.html
Moving HPC to the Cloud: A Guide for 2020
Edge computing with AWS local zones
System design hack: Postgres is a great pub/sub and job server
Fault Tree Analysis Applied to Apache Kafka
How Yelp uses Flink for predicting store visits in real time
Estimating systems with napkin math
Serverless: 15% slower and 8x more expensive
Formal Foundations of Serverless Computing
https://medium.com/@sbrisals
Episode #20: The Serverless Journey of LEGO.com with Sheen Brisals
Episode #21: Getting Started with Serverless (Special Episode)
Operations
Scaling in the presence of errors—don’t ignore them
Building A Scalable Monitoring System
The 2019 Accelerate State of DevOps: Elite performance, productivity, and scaling
Dapper — Google’s Secret Weapon
How Deep Systems Broke Observability — and What We Can Do About It | LightStep
Procella: unifying serving and analytical data at YouTube
Help! My Azure Site Performance Sucks! — Part 1
Finding a Needle in a Call Stack - Intro to Distributed Tracing
It looks like both topics are rather hard to have at the same time...
Building Secure and Reliable Systems
Resilience Engineering The What and How - devopsdays Washington, DC 2019
What Breaks Our Systems A Taxonomy of Black Swans
Amazon Web Services’ Approach to Operational Resilience in the Financial Sector and Beyond
The Post-Incident Review Issue 1: Autumn/Winter 2019
Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems
How Jennifer Aniston broke Instagram
On Eliminating Error in Distributed Software Systems
Closing Loops and Opening Minds: How to take control of systems, big and small
wrong workload placement causing data center death
Down The Rabbit Hole of Performance Monitoring
Engineering Esports The Tech That Powers Worlds | Riot Games Technology
How Much of a Genius-Level Move Was Using Binary Space Partitioning in Doom?
TinyLFU: A Highly Efficient Cache Admission Policy
Introducing Ristretto: A High-Performance Go Cache
Splash the cache: how caching improved our reliability
Enhancing Bandaid load balancing at Dropbox by leveraging real-time backend server load information
7 mistakes when using Apache Cassandra
SLOG: Serializable, Low-latency, Geo-replicated Transactions
Building a Large-scale Distributed Storage System Based on Raft - Cloud Native Computing Foundation
GRIT: a Protocol for Distributed Transactions across Microservices
Build with DynamoDB | S1 E5 – A Data Modeling Use Case Deep Dive - YouTube
Parallel Commits: An Atomic Commit Protocol For Globally Distributed Transactions - Cockroach Labs
Spotify’s Event Delivery – Life in the Cloud
How Slack Built Shared Channels
Datacenter RPCs can be General and Fast
MinIO | Enterprise Grade, High Performance Object Storage
Scribe: Transporting petabytes per hour - Facebook Engineering
Building and Running Applications at Scale in Zalando
PostgreSQL Connection Pooling: Part 1 – Pros and Cons
The boring technology behind a one-person Internet company
Here we are interested in how AI can be scaled and how AI might support ULS sites internally, e.g. by driving intelligent caches. But customer-facing AI is OK as well.
At Scale - @Scale 2019 Keynote AI - The Next Big Scaling Frontier | Facebook
Workshop with inovex?
Assignment problem: solve it with AI?
http://highscalability.com/blog/2020/9/22/snakes-in-a-facebook-datacenter.html
Workload placement