Ultra Large Scale Systems Winter 2021

Ultra Large Scale Systems (ULS)

Participants of the course "ULS" simulate an architecture group dealing with a large-scale application for some corporation. The atmosphere is more like a workshop than a seminar. The very heterogeneous skills and knowledge in a master's course make it quite hard to define fixed topics. For this reason the course starts with participants taking a look at existing large-scale sites. They present those sites using EXISTING papers, blogs etc. (explaining things). This usually results in a large number of technologies being mentioned that are new to most participants. In the next step we define large architectural areas, and students take responsibility for them and start diving into papers and blogs (see below). A strategy group, for example, deals with different approaches and their price, performance, learning curve etc. A serverless group might develop ideas for using this technology compared to e.g. a microservice approach. Step by step the understanding of ULS components will improve, and the groups will notice dependencies between the architectural parts.

As we discuss large-scale components we will invariably discover some core distributed technologies which we do not yet understand: things like Paxos and Raft, special distributed transactional modes and more. We will pick some for a deeper inspection, and a group will try to give an intro. There are also some special topics included in the course, like System Design Interviews (SDI). Once we have a basic understanding of ULS, we will do a workshop where participants will have to fill in an SDI on some topic. In the following session we will discuss the results.

The discussion topics in the ULS course will largely be decided by the participants, and work can be done in groups. The following list gives some ideas for topics.

Participants are required to read papers and articles around their topic.

Note

If you don't like to read, don't take this course!!! There is a Google Drive folder where we will collect presentations and other resources.

What can you do if you detect some larger gaps in your distributed systems skills? The following literature might be helpful in this case:

Distributed Systems for fun and profit
Martin Kleppmann, Designing Data-Intensive Applications
Distributed Systems Reading List

Good conferences are Velocity, Strange Loop and USENIX.

Optional topics for the course (feel free to offer your own ideas): How about serverless at the edge (Cloudflare?), model-driven programming for the cloud, or WebAssembly for ULS?

Strategy

The strategy group deals with the overall architecture, prices, disasters etc.

Head in the cloud - a migration story

Layers in Software, Jessitron

MVPs and $100k AWS Bills: Reflections on the launch of Octopus Cloud 1.0

Our journey to type checking 4 million lines of Python

Systems @Scale 2019 New York recap

Killing Kafka : the pitfalls of Over-Architecting

Automated Disaster Recovery using CloudEndure

serverless - the kubernetes killer?

https://tech.channable.com/posts/2019-10-04-why-we-decided-to-go-for-the-big-rewrite.html

Moving HPC to the Cloud: A Guide for 2020

serverless game changer

migrating to microservices

Edge computing with AWS local zones

A solution looking for a problem

Streams

System design hack: Postgres is a great pub/sub and job server

Fault Tree Analysis Applied to Apache Kafka

How Yelp uses Flink for predicting store visits in real time

Mantis

Analyzing Efficient Stream Processing on Modern Hardware

System Design Interview and napkin math

https://www.educative.io/courses/grokking-the-system-design-interview?aid=5082902844932096&utm_source=google&utm_medium=cpc&utm_campaign=grokking-manual&gclid=EAIaIQobChMI98PTn4G07AIVg-F3Ch10CAcMEAAYAiAAEgLv0fD_BwE

napkin math

Estimating systems with napkin math

napkin math

hardware changes and scalability

async processing and queues

Serverless

Serverless: 15% slower and 8x more expensive

Formal Foundations of Serverless Computing

https://medium.com/@sbrisals

Episode #20: The Serverless Journey of LEGO.com with Sheen Brisals

Episode #21: Getting Started with Serverless (Special Episode)

webhooks, service bus etc.

Operations

Hot SRE trends in 2019

Scaling in the presence of errors—don’t ignore them

Building A Scalable Monitoring System

The 2019 Accelerate State of DevOps: Elite performance, productivity, and scaling

Dapper — Google’s Secret Weapon

How Deep Systems Broke Observability - and What We Can Do About It | LightStep

Procella: unifying serving and analytical data at YouTube

Help! My Azure Site Performance Sucks! — Part 1

Finding a Needle in a Call Stack - Intro to Distributed Tracing

Security and Reliability

It looks like both properties are rather hard to achieve at the same time...

Building Secure and Reliable Systems

Resilience Engineering: The What and How - devopsdays Washington, DC 2019

What Breaks Our Systems: A Taxonomy of Black Swans

Amazon Web Services’ Approach to Operational Resilience in the Financial Sector and Beyond

The Post-Incident Review Issue 1: Autumn/Winter 2019

Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

How Jennifer Aniston broke Instagram

On Eliminating Error in Distributed Software Systems

Resilience engineering papers

Closing Loops and Opening Minds: How to take control of systems, big and small

wrong workload placement causing data center death

Fault tolerance through optimal workload placement

Networking

Snap: a microkernel approach to host networking

Games

Down The Rabbit Hole of Performance Monitoring

Engineering Esports: The Tech That Powers Worlds | Riot Games Technology

The Future of League's Engine

How Much of a Genius-Level Move Was Using Binary Space Partitioning in Doom?

VALORANT'S 128-TICK SERVERS

Caching and Load Balancing

TinyLFU: A Highly Efficient Cache Admission Policy

Introducing Ristretto: A High-Performance Go Cache

Splash the cache: how caching improved our reliability

Enhancing Bandaid load balancing at Dropbox by leveraging real-time backend server load information

Global DBs

7 mistakes when using Apache Cassandra

SLOG: Serializable, Low-latency, Geo-replicated Transactions

Building a Large-scale Distributed Storage System Based on Raft - Cloud Native Computing Foundation

GRIT: a Protocol for Distributed Transactions across Microservices

Build with DynamoDB | S1 E5 – A Data Modeling Use Case Deep Dive - YouTube

Parallel Commits: An Atomic Commit Protocol For Globally Distributed Transactions - Cockroach Labs

Spanner, TrueTime and the CAP Theorem

Architectures

Spotify’s Event Delivery – Life in the Cloud

How Slack Built Shared Channels

Datacenter RPCs can be General and Fast

MinIO | Enterprise Grade, High Performance Object Storage

Scribe: Transporting petabytes per hour - Facebook Engineering

Building and Running Applications at Scale in Zalando

PostgreSQL Connection Pooling: Part 1 – Pros and Cons

The boring technology behind a one-person Internet company

GitLab DB architecture and components. Migration doc.

Uber

AI

Here we are interested in how AI can be scaled and how AI might support ULS sites internally, e.g. by driving intelligent caches. But customer-facing AI is OK as well.

At Scale - @Scale 2019 Keynote AI - The Next Big Scaling Frontier | Facebook

AI at Uber

Service Meshes