Highlights from a deep dive into data engineering
Posted in data engineering, distributed systems
These were some of my favorite reads from a summer-long reading spree on data engineering and related topics.
One of the very few upsides of the lockdown in my personal life was the hefty amount of additional free time that I suddenly found myself with, to do whatever I liked. Provided it did not involve going out or meeting people, that is. Trying to invest some of that time in something productive, I decided it was high time to do some serious study of data engineering and big data technologies in general. Coming from a background in applied mathematics with little in the way of software engineering, I found it difficult at the beginning to really understand some of the technologies that come together in data science products, and at times I felt there were conceptual gaps I needed to fill in order to really get all of this stuff. Doing an MSc in HPC helped to close some of these gaps, but you can only learn so much in a one-year program that also covers a lot of other aspects; besides, HPC is not really the same as distributed systems. I wanted to get into the gory details to really understand them, and ultimately to become a better ML engineer.
There wasn't any single comprehensive resource (not that I found, anyway), so my learning experience came from a mixture of online courses on a few platforms and lots of books, especially from O'Reilly Media, as they have plenty of books explaining almost any software technology one can think of. Every one of them taught me something, but some really stood out, and I now think they are fantastic resources for anyone trying to have a go at these topics, even for those who come from backgrounds other than computer science or systems engineering. Below, I describe the ones that helped me the most.
Designing Data-Intensive Applications by Martin Kleppmann
This is a book that explains, at a high level, the why and how behind the architecture of the distributed systems that power some of the largest platforms on the planet. Starting from first principles (why do we even need a database? how would we design and organize one?), it goes all the way to explaining how modern, massive distributed systems are choreographed to serve millions of users around the globe, and how they are optimized to bear such a heavy load. It finishes with a very interesting discussion on the future of distributed system architectures. The book is brilliantly written, making every topic crystal clear without overwhelming you with technical details; this was perhaps my favorite one on the subject, and I would recommend it as a great introduction for any aspiring data scientist or data engineer.
As a side note, I recently discovered that the author posted recordings from his lectures at Cambridge; in them, he discusses some of the topics treated in the book in a more rigorous mathematical manner. Definitely another great resource.
Architecting Modern Data Platforms by Jan Kunigk et al.
In my reading spree I ended up reading multiple books devoted to different aspects of Hadoop, the distributed processing system par excellence, and playing with it a fair bit as well. Even though the ubiquity of public clouds may make on-premises Hadoop a less popular choice than, say, 5 years ago, and even though its main processing engine, MapReduce, has been superseded by more modern and performant engines such as Spark, I think it's truly fascinating to read about all of the aspects that one has to keep in mind when a behemoth such as Hadoop needs to go into production.
It turns out it's a serious engineering challenge, and this book goes into all the gory details of it, and I mean all of it. It covers how to take advantage of the processor's architecture, optimally access disks, choose appropriate hardware for the servers, spread racks in a fault-tolerant manner in a data center, design a resilient inter-rack network topology, design secure cluster access, manage encryption keys, deploy secondary services (such as edge nodes and metadata storage) around the cluster, make the cluster services fault tolerant, and virtualize the infrastructure, among other things.
It really is a lot of detail, and some of it went over my head (though I'm fairly certain I'm never going to design security protocols or virtualize a data center), but I absolutely enjoyed it because it clarified so many aspects of systems engineering in my head; suddenly these massive systems don't look like some sort of abstract black magic anymore, but like engineering solutions to well-defined problems. This was definitely one of my favorite reads as well.
Web Scalability for Startup Engineers by Artur Ejsmont
This one is, in my opinion, a more practical version of the first book I mentioned, explaining the most common scalability problems and the solutions that today's digital platforms implement for them.
The book works at the schematic and architectural level, explaining, for instance, how and why infrastructure should be separated into front and back ends, how to scale a database horizontally, how to put a distributed cache in front of a database to offload work, and how to design good RESTful APIs, among other things.
While the first book discusses the implications of building a distributed system in a more academic fashion, this one is about telling you how to design software that scales. Ultimately it makes for an enlightening read.
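To make the idea of putting a cache in front of a database a bit more concrete, here is a minimal sketch of the cache-aside pattern. It is my own toy illustration, not code from the book: the class name CacheAside, the db_reads counter, and the plain dictionaries standing in for the cache and the database are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Check the cache first; on a miss, read the database and cache the result."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db  # hypothetical database lookup
        self._cache: Dict[str, str] = {}   # stand-in for a distributed cache
        self.db_reads = 0                  # counts how often the database is hit

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]        # cache hit: the database is not touched
        self.db_reads += 1
        value = self._load_from_db(key)    # cache miss: fall back to the database
        if value is not None:
            self._cache[key] = value       # remember the value for later reads
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read fetches fresh data."""
        self._cache.pop(key, None)


# A dictionary plays the role of the database in this sketch.
fake_db = {"user:1": "Ada"}
store = CacheAside(fake_db.get)
first = store.get("user:1")   # goes to the database
second = store.get("user:1")  # served from the cache
```

In a real deployment the dictionary would be replaced by a networked cache, and entries would usually also carry an expiry time so that stale data does not live forever.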
Database Internals by Alex Petrov
After reading the books mentioned so far, along with others, I felt like I had a clear understanding of how different database technologies worked at a high level, and what characteristics and use cases set them apart; but I was still curious about how all of this worked under the hood, at the level of memory and CPU. At the end of the day, in order to work properly, databases perform a whole series of tasks, each of which is fairly impressive: indexing, query parsing and optimization, concurrency control, garbage collection, data compression, consensus protocols, failover protocols, and so on.
This book explains just that: it reviews how databases are organized at the disk block level, how they are updated efficiently, how old data is purged in background processes and how data consistency is kept, either by locks or by multi-version concurrency control mechanisms. Overall I thought it complemented more conceptual, high-level books really well.
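As a rough illustration of the multi-version idea, here is a toy sketch of my own (not the book's code, and far simpler than any real engine). Every write appends a new timestamped version instead of overwriting, readers only see versions that existed when they took their snapshot, and a vacuum step purges versions that no snapshot can see anymore. The names MVCCStore, snapshot, and vacuum are assumptions made for the example.

```python
from typing import Dict, List, Tuple


class MVCCStore:
    """Toy multi-version store: writes append versions, reads use a snapshot."""

    def __init__(self) -> None:
        # key -> list of (commit timestamp, value), oldest first
        self._versions: Dict[str, List[Tuple[int, str]]] = {}
        self._clock = 0  # logical clock, incremented on every write

    def write(self, key: str, value: str) -> int:
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self) -> int:
        """A snapshot is simply the timestamp of the latest write so far."""
        return self._clock

    def read(self, key: str, snapshot_ts: int):
        # Return the newest version that is not newer than the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, oldest_active_snapshot: int) -> None:
        """Drop versions that no active snapshot can still see."""
        for key in list(self._versions):
            versions = self._versions[key]
            keep_from = 0
            for i, (ts, _) in enumerate(versions):
                if ts <= oldest_active_snapshot:
                    keep_from = i  # newest version visible to the oldest snapshot
            self._versions[key] = versions[keep_from:]


store = MVCCStore()
store.write("x", "v1")
snap = store.snapshot()                       # a reader starts here
store.write("x", "v2")                        # a later write does not disturb it
old_view = store.read("x", snap)              # the reader still sees "v1"
new_view = store.read("x", store.snapshot())  # a fresh reader sees "v2"
```

The point of the design is that readers never block writers: the old version stays around until the vacuum step decides that nobody needs it.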
The books above, and all the others I read, helped me get a solid understanding of data engineering and distributed systems, topics that I had been wanting to study properly for a long time but hadn't had the chance to until recently; all of this came in really handy when studying for Google's Data Engineering certification. Finally, I have to mention that LinuxAcademy (which has since merged with acloudguru.com) is a great platform for getting your hands dirty with distributed technologies. Sadly, they don't have much on Hadoop, but they do cover plenty of other modern distributed technologies such as Kubernetes, and their courses are great.