1. Failure Trends in a Large Disk Drive Population [ PDF ] A popular Google paper from 14 months ago, but still a good read if disk failures are something you may have to deal with. Fairly readable language too. Summary: Really burn your drives in for the first few months. Few metrics are good indicators of failure.
2. Pinpoint: Problem Determination in Large, Dynamic Internet Services [ PDF ] Tracing requests and statistical analysis to find faults. They wrote a framework for root cause analysis. Summary: Complex systems fail in complex ways. "What went wrong?" is easy to ask, hard to answer.
3. Deconstructing Paxos [ Google | Wikipedia ] Protocol for achieving group consensus over unreliable links. The PDF itself is hard to digest; the Wikipedia page does a much better job. Summary: I still think clustering sucks.
4. The importance of understanding distributed system configuration [ PDF ] An easy read, claiming (and I'd agree) that human error when adjusting system configuration is an extremely common cause of system outages. The paper discusses why that is and some ways to minimize it.
5. Addressing Human Error with Undo [ PDF ] A set of PowerPoint slides in MS Comic from 2001. They discuss solving human error problems by providing an undo facility. Summary: Font selection aside, this is a really good, simple read. It pleads for you to stop trying to prevent human error and instead give people ways to recover from it. I 100% agree. (Explodie was born from this.)
6. Release It! Design and Deploy Production-Ready Software [ Site ] Wow. This book (PDF download, $22) reads like a history of the company I'm at. Cascading failures, database contention, developers not understanding the Big Picture, simple changes with dire consequences, the works. If your software runs in a production environment, skip the movie and popcorn and buy this book instead. It's broken into patterns and anti-patterns with some great war stories. Summary: There's a good reason this won a Jolt award.
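
To make the undo idea from item 5 a bit more concrete, here is a minimal sketch of what "give people a way to recover" can look like for configuration changes. It is not taken from the slides; the class name `UndoableConfig`, its methods, and the `max_connections` setting are all hypothetical, and a real system would need persistence, auditing, and a lot more care.

```python
class UndoableConfig:
    """Hypothetical key/value settings store where every change can be rolled back."""

    _MISSING = object()  # sentinel meaning "the key did not exist before this change"

    def __init__(self):
        self._values = {}
        self._history = []  # stack of (key, previous value) pairs

    def set(self, key, value):
        # Remember what was there before, so the change can be reverted later.
        previous = self._values.get(key, self._MISSING)
        self._history.append((key, previous))
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def undo(self):
        """Revert the most recent change; return False if there is nothing to undo."""
        if not self._history:
            return False
        key, previous = self._history.pop()
        if previous is self._MISSING:
            self._values.pop(key, None)  # the key was new, so remove it again
        else:
            self._values[key] = previous
        return True


config = UndoableConfig()
config.set("max_connections", 100)
config.set("max_connections", 1)      # the fat-fingered change
config.undo()                         # recover from it instead of trying to prevent it
print(config.get("max_connections"))  # 100
```

The point of the design is that the mistake is allowed to happen; the history stack just makes it cheap to back out.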