tl;dr: monorepos are a hot topic, but very rational and practical reasons drove us to one; they may not translate well to bigger organizations.
Nanoteam is the hypest-sounding neologism I have found to describe my permanent situation: working in a team of two engineers, of which I am one half, with no plans to raise millions to change this by an order of magnitude.
The subconscious multirepo approach
The multirepo is a choice you did not know you made. You wake up one morning realizing that even though it felt like a no-brainer, it was a choice, and you picked the wrong side.
Every project starts with one specific piece of software, in our case a website/webapp. At some point something else pops up: we are going to need another entry point, another website, another webapp. It feels natural to share code, at least a backend library and a front-end library. After a bit of googling you make composer and yarn work with private repositories, and add a few lines of rsync to synchronize files in development. In the blink of an eye, you end up with four different repositories.
Sprinkle a couple of extra automation, market research, or product exploration projects over this, and after a few years you will have at least ten different repositories.
What felt wrong with multirepos
It is not bad per se. In a bigger corporation, you could have site-a, site-b, backendlib and frontendlib, with one team behind each, and the teams would synchronize from time to time. It would actually feel healthy to have such a clear separation of concerns.
I have discussed the matter with friends. Some told me that most of the problems we were encountering were a symptom of bad architecture in the dependencies between the repositories. It is a possibility, but I can’t think of anything better, nor have I seen anything better elsewhere, although I don’t have a very large sample.
Most of those repositories had continuous integration and deployment (CI/CD). As soon as you run tests that are not perfectly isolated, everything starts to fall apart, because the dependencies are often out of sync when your tests are run in the CI. You get used to seeing red in your CI/CD on a normal day, which is a bad thing to get used to: it means that you have lost trust in it.
To fix this, you could point each dependency at the HEAD of its git repository. But this costs you the ability to have a version of your repository that represents the whole application at a given point in time.
Furthermore, when you review pull requests, a single feature is often spread across different repositories. When every repository has its own “feature/new-cool-feature” branch, simply testing the feature on your local machine gets really messy.
The expectations behind a monorepo
The key problem when switching to a monorepo is that CI/CD suddenly gets trickier. The expectations could be the following:
- Build result cache: a change in one directory should not trigger the tests of unrelated directories.
- Build artifact cache: artifacts from previous builds should be available to the current one.
- Dependency aware build: a change in a dependency should trigger the build of the component using it.
- Parallel execution: different projects should be executed in parallel.
Since our dependencies are always reused by injecting their code somewhere, I have not spent a second on item 2 (the build artifact cache), and I perceived item 4 (parallel execution) as a luxury.
It felt wrong to put every project flat at the root of the monorepo, and it was very tricky to find examples of a proper categorization. We ended up with the following folder hierarchy:
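The build result cache and the dependency-aware build can both hang on a single idea: give every project a fingerprint that covers its own files plus the fingerprints of the projects it depends on. The following is a minimal sketch of that idea and not how drkns actually works; the function names, the `roots` and `deps` mappings, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def directory_hash(directory: Path) -> str:
    """Hash the relative path and content of every file under `directory`."""
    digest = hashlib.sha256()
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        digest.update(path.relative_to(directory).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def project_hash(name: str, roots: dict[str, Path], deps: dict[str, list[str]]) -> str:
    """Combine a project's own hash with the hashes of its dependencies.

    Hypothetical inputs: `roots` maps a project name to its directory and
    `deps` maps a project name to the names it depends on (no cycles).
    """
    digest = hashlib.sha256(directory_hash(roots[name]).encode())
    for dep in sorted(deps.get(name, [])):
        digest.update(project_hash(dep, roots, deps).encode())
    return digest.hexdigest()
```

With this, a change inside a library changes the fingerprint of every project that uses it, while unrelated projects keep theirs and can be skipped.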
- consumer: projects that are accessed by someone or something (like an API) from outside the company.
- internal: projects that are used internally.
- library: projects that are only imported by projects in one of the previous two categories.
I think that relying on a CI/CD provider could also have helped, as many of them enable some level of caching. GitLab actually seems to have something exactly for this.
I ended up rewriting something that is CI/CD and language agnostic, and that only supports Amazon S3 as a backend for execution cache storage. Writing drkns obviously took me longer than initially planned, and also left me wondering whether I had a serious case of not-invented-here syndrome.
I initially ignored the need for parallelization completely, as it was not a key feature. Lucky as we are, we almost got it for free. I extended drkns to support generating a CI/CD configuration file from a template, and suddenly everything was parallelized, because every single project was executed inside its own job and the jobs were dependency-aware.
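To give an idea of what an execution cache does, here is a minimal sketch. It is not the drkns interface: `CacheStore`, `InMemoryStore` and `run_cached` are hypothetical names, the in-memory store merely stands in for a remote bucket such as S3, and the content hash of the project is assumed to be computed elsewhere.

```python
import subprocess
from typing import Protocol


class CacheStore(Protocol):
    """Hypothetical storage interface; a real backend would talk to a bucket."""

    def has(self, key: str) -> bool: ...
    def put(self, key: str, value: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for remote storage, good enough to show the control flow."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def has(self, key: str) -> bool:
        return key in self._items

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value


def run_cached(project: str, content_hash: str, command: list[str], store: CacheStore) -> str:
    """Run `command` unless a successful run is recorded for this exact hash."""
    key = f"{project}/{content_hash}"
    if store.has(key):
        return "skipped"
    result = subprocess.run(command, capture_output=True)
    if result.returncode != 0:
        return "failed"  # failures are not recorded, so the next run retries
    store.put(key, result.stdout)
    return "executed"
```

Only successful runs are recorded in this sketch, so a red build is never served from the cache.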
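Generating one job per project from a template can be sketched as follows. This is an illustration under assumptions, not the drkns template format: the YAML-like layout with a `needs` list is loosely modeled on what several CI/CD providers accept, and `./ci/run.sh` as well as the project names and paths are hypothetical.

```python
from string import Template

# One job per project; `needs` lists the jobs that must finish first.
JOB_TEMPLATE = Template(
    "$name:\n"
    "  script: ./ci/run.sh $path\n"
    "  needs: [$needs]\n"
)


def render_pipeline(projects: dict[str, dict]) -> str:
    """Render one job per project, in a stable (alphabetical) order."""
    jobs = []
    for name in sorted(projects):
        spec = projects[name]
        jobs.append(
            JOB_TEMPLATE.substitute(
                name=name,
                path=spec["path"],
                needs=", ".join(sorted(spec.get("needs", []))),
            )
        )
    return "\n".join(jobs)


if __name__ == "__main__":
    print(render_pipeline({
        "backendlib": {"path": "library/backendlib"},
        "site-a": {"path": "consumer/site-a", "needs": ["backendlib"]},
        "site-b": {"path": "consumer/site-b", "needs": ["backendlib"]},
    }))
```

Because site-a and site-b only wait for backendlib and not for each other, a scheduler that honors the dependency list is free to run them at the same time, which is where the parallelism comes from.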
Benefits, expected and unexpected
This solved the problems mentioned in the first part of this text through meaningful multiproject, multidependency versioning, and we also got some unexpected benefits.
Versioning the simplest piece of code is now easier. While you can create a new repository for two lines of code, it is not free: you have to git init, then go to your GitHub / Bitbucket / GitLab to create a remote counterpart, set the access rights, push… This overhead is gone: create a directory inside the monorepo and you are all set.
Setting up CI/CD is also simpler. We used to have one provider for git hosting and another for CI/CD, which meant creating each pipeline on said provider and configuring its Slack notifications every time. None of this is needed anymore: just create the right config file and we are good to go.
Because things are simpler, sharing code is also easier. There were a few deployment scripts that were very similar, if not identical, across projects. We haven’t consolidated everything yet, but it now feels very natural to do so.
If there is one thing you lose here, it is fine-grained access control. While no secret should be stored inside the code, some organizations will surely be paranoid enough not to be too keen on this.
Every time you finish that kind of refactoring / huge quality-of-life project, you feel very satisfied and vindicated in your initial intentions. Is it a form of confirmation bias? If I rationalize it, I think it strengthened our organization:
- More reliable CI/CD implies faster product cycles.
- Easier code review also implies faster product cycles.
- More spontaneous code storage means that even the smallest utility is versioned and easily accessible. It is also easier to search for.