Are your frameworks bulletproof?

Imagine that you're a developer in a large company working to solve a high-profile architectural problem in an app. Your team has compiled all of the company's use cases for that particular problem and developed a framework that solves not only all of them but also some additional ones that you expect developers will need in the near future as the project grows. The company's developers are on board with this change, and over a long period of time, you've written extensive documentation on how to use this new framework, instructed developers on where to find said docs, and migrated the project to use it.

Fast-forward a couple of years, and the project is now in such a bad state that you essentially need to throw everything away and start from scratch. To make matters worse, not only is the architectural problem your framework was supposed to fix still unsolved, it's now arguably in much worse shape and considerably harder to fix than before. What in holy heaven happened here?

I've witnessed projects end up in this situation over and over throughout the years, and it's always because the developers made the same mistake: they didn't account for human ingenuity. It's important to always keep in mind that the main priority of most developers is to save time. When facing a coding challenge in which the correct way forward is to completely refactor their current code, even if you provide extensive documentation on how to solve that particular problem, most people's first instinct will be to avoid the refactor as much as possible by coming up with clever workarounds.

This is not necessarily done out of malice or ignorance. It could very well be that the developer is being pressured into following this path by external factors such as unreasonable product managers, but if you do not account for this when designing systems that will be used by other developers, you're bound to have issues in the long run.

While hacky code is generally not much of a big deal in smaller companies, where it's common practice for all developers to be aware of what everyone else is doing (thus leaving more space for feedback and collaboration), in the world of large companies, where teams are designed to run independently from each other with minimal collaboration, workarounds are ticking time bombs. As years go by and the project evolves, these hacks and workarounds get lodged so deep into the code that your project now has entire systems written on top of them. Those systems become so unpredictable and fragile that at some point you are outright unable to make any further improvements to them without first removing the hacks from the code. However, in large companies, the amount of work necessary to fix situations like this can be so infeasible that there's often no choice other than to write even more hacks and hope for the best, leading to borderline unmanageable systems with several years of unfixable legacy code that will haunt the project and its engineers for the rest of its life.

In over a decade of software engineering, I have never gotten to know a single large product that didn't have this problem. It seems that the nature of how large companies operate makes it inevitable that at some point someone will merge a dangerous workaround that will have severe consequences in the long run, but it's possible for you as a platform engineer to reduce the likelihood of this happening by designing your systems to be as bulletproof as possible.

What this means exactly depends on what you’re developing, but if we take the original example and pretend that we're developing a large-scale framework for one of these large companies, your objective should be to make it as hard as possible for people to misuse your APIs by empowering static linters and implementing runtime safety checks.
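To make the idea of runtime safety checks a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the `ServiceRegistry` class, its methods, and the `FrameworkMisuseError` exception are not from any real framework). The point is only that every unsupported usage fails immediately and loudly instead of silently "working" in a way that invites workarounds.

```python
class FrameworkMisuseError(Exception):
    """Raised when the (hypothetical) framework is used in an unsupported way."""


class ServiceRegistry:
    """Hypothetical dependency registry that refuses ambiguous usage."""

    def __init__(self):
        self._factories = {}
        self._sealed = False

    def register(self, name, factory):
        # Late registration is a classic workaround target, so it is rejected.
        if self._sealed:
            raise FrameworkMisuseError(
                f"Cannot register '{name}': the registry is sealed after startup."
            )
        # Silently overriding an existing service would hide conflicts.
        if name in self._factories:
            raise FrameworkMisuseError(
                f"'{name}' is already registered; overriding is not supported."
            )
        self._factories[name] = factory

    def seal(self):
        # After this call the set of services can no longer change.
        self._sealed = True

    def resolve(self, name):
        # Resolving before startup has finished would depend on call order.
        if not self._sealed:
            raise FrameworkMisuseError(
                f"Cannot resolve '{name}' before the registry is sealed."
            )
        if name not in self._factories:
            raise FrameworkMisuseError(f"'{name}' was never registered.")
        return self._factories[name]()


registry = ServiceRegistry()
registry.register("logger", lambda: "console-logger")
registry.seal()
print(registry.resolve("logger"))  # prints "console-logger"
```

The design choice worth noting is that there is no "escape hatch" parameter: a developer who wants to override a service or register one late has to change the framework itself, which makes the decision visible in code review.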

The best open-source example I can give of a framework that does this is Bazel. It can be very hard to configure Bazel projects, as the build system has very strict rules you need to follow in order for it to work as intended, but no matter how hard you try, you simply cannot get them wrong without Bazel telling you about it. I've tried to work around Bazel rules multiple times in moments where I was very unwilling to refactor my files to match what Bazel wanted me to do, but I never succeeded, as Bazel has linters and safety checks that stop you dead in your tracks whenever you try to do anything funny. This can feel a bit excessive at first, but it makes complete sense, as one of the main features of Bazel is the ability to hermetically seal your project. If I had been allowed to go forward with any of the clever workarounds I was planning to write, I would have introduced issues that in the long term would have completely ruined those projects.
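As a rough illustration of what a static check in this spirit could look like, the sketch below uses Python's standard `ast` module to flag imports of modules that a file has not explicitly declared as dependencies. This is not how Bazel is implemented; the function name `find_undeclared_imports` and the idea of passing declared dependencies as a set are assumptions made purely for the example.

```python
import ast


def find_undeclared_imports(source, declared_deps):
    """Return the top-level modules imported by `source` that are not declared.

    `declared_deps` is a set of top-level module names the file is allowed
    to depend on (a hypothetical stand-in for a build file's dependency list).
    """
    undeclared = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Only absolute imports are checked; relative imports are skipped.
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level not in declared_deps:
                undeclared.add(top_level)
    return sorted(undeclared)


source = "import os\nimport requests\nfrom payments.core import charge\n"
print(find_undeclared_imports(source, {"os", "payments"}))  # prints ['requests']
```

Wired into CI as a blocking step, a check like this would turn "I'll just import it directly for now" from a silent shortcut into a failed build.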

This practice has a clear downside: it massively reduces the team's productivity when facing issues involving these frameworks. On the other hand, doing this can open up the way for you to develop intelligent systems that increase developer productivity in the long run, and the most interesting example of this happening is how Xcode's Automatic Reference Counting system came to be.

At the beginning of the platform, reference counting on iOS used to be a completely manual process that was very easy to get wrong. Apple started helping developers by creating a series of static checks in Xcode that were able to detect and block some of the most common mistakes, and as they improved this system, they realized at one point that it had become so good at reference counting that it wasn't necessary to have developers do it manually anymore. You still need to be careful with other forms of memory mistakes like retain cycles, but when it comes to general memory management, the process is now so automated and abstracted away from you that it's often deemed unnecessary for new developers to even learn that this is a thing.

While we cannot ignore the fact that these safety procedures can be extremely detrimental to productivity in the beginning, when it comes to large established projects, I'd argue that this is a much-needed trade-off to ensure the project can continue to evolve smoothly as the years go by. Dangerous legacy code is one of the biggest walls to overcome when maintaining large projects, and actively preventing it from being introduced can save you and the company countless hours of work in the long run.