It’s been a while since the last post. We have switched to Unreal Engine, grown a bit, started on a new game, and reinvented a lot of base tech. Our investment in build systems is beginning to pay off!
We have a proof-of-concept that builds UE engine code and an example game, all using GitHub Actions. The build jobs are run by self-hosted runners in Google Cloud. The build system automatically spins up/down the runners so we only pay for what we use.
This is well suited for making custom-built UE versions to distribute within the team. It is not yet useful for building game applications.
- Clean UE build job: 2 hours to build + package + upload
- No-changes UE build job: 30 minutes
- Clean example-game build job: 10 minutes to fetch UE4, 10 minutes to build + upload game
- No-changes example-game build job: 3 minutes
Fetch UE build: downloads 2.7GB of data, expands to 40GB on-disk, takes 10 minutes
The UE build produced is a full installed build for Win64. Editor, game, client, server, all source code, all debug symbols, templates. The no-changes UE build job runs 800 compilation actions even though nothing should have changed. Not sure why.
How do you set up an effective CI system for Unreal Engine based games? There is no ready-to-use template available. It is difficult to design a turn-key solution that works for small projects. The UE source code is not available publicly. Large projects usually use Perforce and their tooling often relies on internal company infrastructure. Therefore, every company is doomed to reinvent it more-or-less from scratch.
We have a crude solution today: a single beefy physical machine that runs Jenkins. All software has been installed manually. It has served its purpose so far; we have been able to iterate quickly on our game, and we haven’t even tried building & distributing our own engine. The core problem is that the approach does not scale. Adding extra build machines is error-prone. Most build system development is done on the production system, and test runs & configuration mistakes impact production negatively. Every single step, such as building and distributing a custom UE4 version, is a huge effort. It shouldn’t be that hard.
This is what we expect from a modern solution:
All code should be managed in cloud-based, branch-capable SCMs. We keep game code & content in Plastic SCM. We would like to have engine code there as well, but Git would also be acceptable for engine code.
There should be a direct correlation between the amount of change and the amount of build work done. One engine code change should not result in all workstations rebuilding the engine code. One game code change should not result in all workstations rebuilding the game code. One content change should not result in all workstations rebuilding the game content.
There should be a direct correlation between the amount of change and developers’ wait time. Changes with little impact should build quicker than changes with large impact. Incremental builds, shared caches, differential uploads/downloads – use any means necessary to ensure people don’t need to spend a lot of time waiting for the result of small changes.
There should be a direct correlation between the amount of build work done, and the $$$ spent. Adding an infrequent build job should incur a small cost; adding a frequent build job should incur a larger cost. It should be possible to introduce small extra jobs and only have to pay a small extra cost to have these not impact the existing jobs in any way.
It should be straight-forward to spin up a replica of the build system. Developers should iterate on their own replicas, and then merge locally-tested changes back to the production system.
The build system setup should be available as Open Source. Companies keep reinventing the wheel in this space, and many do it poorly. Enough of that - better share implementations and enable each other to focus on building great games.
The build system should be effective for users spread across the Internet. Support people working from home, accelerate operations for people in offices. Don’t rely solely on VPN for securing the infrastructure as a whole.
Why GitHub + GitHub Actions?
We already know that we can set up a build system that spits out engine and game builds quickly. It is doable with enough $$$ and ops/sysadmin manpower. What if we look at it like a non-games developer would?
Embrace Open Source. If we are serious about open source, the tooling should be on GitHub - that’s where people are most likely to find it. Also, nobody wants to put crap onto the Internet with their name on it … so open sourcing automatically enforces a high minimum quality bar :)
Embrace GitOps. All infrastructure and application configuration should originate from Git repositories. Change your infrastructure by committing to repo(s), then wait for build jobs to reconfigure the build system.
Embrace forking. It should be possible to set up a replica of the build system by forking repo(s) and changing some configuration settings.
GitOps requires a build system for the build system. GitHub Actions is the obvious choice for that; it is freely available and has minimal administrative overhead.
Epic makes the Unreal Engine source code available via GitHub. The simplest way of maintaining a copy of the engine source and gradually importing changes to it is to mirror their repository into a local, private repository. GitHub Actions is readily available, so why not use it for making engine builds too? (We expect engine changes to be a lot less frequent than game changes, and want to build the engine separately from the game.)
Git and Git LFS are not ideal for game content, granted - so we would still like to keep real game code/content in Plastic SCM. However, it is useful to have an example game in GitHub. It allows quick iterations on the build system before bringing changes to the much larger build system that handles the real game.
All code and configuration resides in GitHub. GitHub Actions reacts to changes in the repositories, and triggers builds. Builds are performed by VMs in Google Cloud. Build results are stored in Storage Buckets in Google Cloud. The custom build VMs give us high build throughput. We keep VM costs low by starting and stopping VMs on-demand. We keep storage costs low by performing differential transfers.
The build system itself is managed via the Build system repository. This contains a Packer script that creates a VM image with the GHA agent, Visual Studio 2019 Build Tools, and other software that is necessary to build UE code. It also contains two Terraform scripts that bring up all the infrastructure in Google Cloud - Storage buckets that hold the build outputs, VMs that will register with GitHub Actions, and some Cloud Function logic that observes GitHub Actions activity and starts/stops the VMs as necessary.
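The start/stop decision made by that Cloud Function logic can be sketched roughly as follows. This is a minimal illustration, not the code from the Build system repository: the `Runner` type, the `plan_actions` function and the idea of feeding it a queued-job count are all assumptions made for the example, and the real calls to GitHub and Google Cloud are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    """Hypothetical view of one builder VM (not a real GitHub or GCP type)."""
    name: str
    running: bool  # is the VM currently started?
    busy: bool     # is it currently executing a job?


def plan_actions(runners, queued_jobs):
    """Return (to_start, to_stop): names of VMs to start and to stop.

    Queued jobs are first matched against idle running VMs. Any shortfall
    is covered by starting stopped VMs; any surplus idle VMs are stopped.
    Busy VMs are never touched.
    """
    idle_running = [r for r in runners if r.running and not r.busy]
    stopped = [r for r in runners if not r.running]

    shortfall = max(0, queued_jobs - len(idle_running))
    surplus = max(0, len(idle_running) - queued_jobs)

    to_start = [r.name for r in stopped[:shortfall]]
    to_stop = [r.name for r in idle_running[len(idle_running) - surplus:]]
    return to_start, to_stop


fleet = [
    Runner("builder-a", running=True, busy=True),
    Runner("builder-b", running=True, busy=False),
    Runner("builder-c", running=False, busy=False),
]
print(plan_actions(fleet, queued_jobs=2))
```

Keeping the decision a pure function of observed state makes it easy to test on a replica without touching any real VMs.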
GitHub Actions offers builder VMs. At the time of writing, those VMs all have 2-core CPUs with 7GB RAM and 14GB disk. The disk contents are not preserved between builds. That doesn’t match what UE needs. Good UE build throughput requires lots of cores, 1-2GB RAM per core, 300GB of disk (for the engine) and persistent disk to allow UAT/UBT to perform incremental builds. Therefore we run self-hosted VMs. However, those VMs become expensive. Therefore, we start/stop them on-demand. That gets us close to the “pay for what you use only” ideal; the only significant cost for a stopped VM is the disk, say … $50/month for 300GB of SSD space? The startup time is about 60 seconds.
The Unreal Engine build process is located in the Engine build script repository. This links to the UE source repository (which is private due to license requirements) and invokes UAT to build installed builds, and then uses Longtail to upload finished engine builds to Google Cloud Storage.
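To make the “pay for what you use” argument concrete, here is a back-of-the-envelope comparison. The hourly VM rate and the number of build hours per month are hypothetical placeholders, not measured values or real Google Cloud prices; only the roughly $50/month disk figure comes from the estimate above.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(vm_hours, vm_rate_per_hour, disk_cost_per_month):
    """Compute is billed only while the VM runs; the disk is billed all month."""
    return vm_hours * vm_rate_per_hour + disk_cost_per_month


VM_RATE = 3.0     # USD per hour for a many-core VM (hypothetical)
DISK_COST = 50.0  # USD per month for 300GB of SSD (ballpark from the text)
BUILD_HOURS = 40  # hours of actual building per month (hypothetical)

always_on = monthly_cost(HOURS_PER_MONTH, VM_RATE, DISK_COST)
on_demand = monthly_cost(BUILD_HOURS, VM_RATE, DISK_COST)

print(f"always on: ${always_on:.2f}/month")
print(f"on demand: ${on_demand:.2f}/month")
```

With these made-up numbers the disk dominates the on-demand bill only when the VM is nearly idle, which is the situation for an infrequent engine build job.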
The game build process is located in the example game repository. This fetches a pre-built UE4 version using Longtail, invokes UAT to build the game, and uploads the resulting binary to another bucket in Google Cloud Storage.
Longtail allows us to deduplicate builds when uploading them to Google Cloud Storage. Deduplication means that storage costs do not go up that much if we build often; it will just result in a large number of manifests. Deduplication + local caching also keeps the increase in network transfer (and associated time & cost) manageable, if people fetch new builds often.
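The idea behind deduplicated uploads can be illustrated with a toy content-addressed store. This is not how Longtail is implemented; it is a simplified sketch that uses fixed-size chunks and an in-memory dict as the "bucket", and the names `upload_build` and `fetch_build` are invented for the example.

```python
import hashlib

CHUNK_SIZE = 4  # absurdly small, for illustration only


def upload_build(data, store):
    """Split a build into chunks and store only the chunks not yet present.

    Returns (manifest, new_bytes): the manifest is the ordered list of chunk
    hashes that describes this build; new_bytes is how much was transferred.
    """
    manifest = []
    new_bytes = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
            new_bytes += len(chunk)
        manifest.append(digest)
    return manifest, new_bytes


def fetch_build(manifest, store):
    """Reassemble a build from its manifest."""
    return b"".join(store[digest] for digest in manifest)


bucket = {}
manifest_1, sent_1 = upload_build(b"AAAABBBBCCCC", bucket)
manifest_2, sent_2 = upload_build(b"AAAABBBBDDDD", bucket)
print(sent_1, sent_2)
```

The second build shares two of its three chunks with the first, so only the changed chunk is stored again; each build still gets its own manifest, which is why frequent builds mostly add manifests rather than data.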
Is this useful in real game projects?
Some parts of it can be. We are going to use this flow for building UE installed builds for our team. It suits us well because we are not yet planning to make frequent changes to the engine source code. 2 hours of lead time for engine code->binary changes is acceptable to us.
We are going to use Longtail for distributing UE installed builds. It looks very promising.
We are going to use Packer+Terraform for controlling build system configuration. It is light years ahead of old-school build system management.
We are not going to use this for building the game - not yet. We need to keep game code and content in Plastic SCM; the developer experience is worth it to stay there. We need to use Jenkins or something custom to be able to talk to Plastic. We need to use physical machines for the frequent jobs to keep cost under control.
The Packer image generation does not lock the compiler version of MSVC. Code is compiled in two steps; first engine code on the engine builder VM; then game code either on the game builder VM or on a developer’s workstation. If the MSVC version on the engine builder VM is newer than in the other locations, then there is a risk of weird linker errors. The MSVC version really should be locked down by the VM image build scripts. (The VS installer just makes this very difficult.)
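Until the version is locked, a cheap guard would be to record the toolchain version alongside each engine build and compare it before compiling game code. The sketch below assumes the versions are available as dotted strings; how they would be recorded and read back is left open, and the function names and sample version strings are purely illustrative.

```python
def parse_version(text):
    """Turn a dotted version string like '14.29.30133' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def toolchain_compatible(engine_msvc, local_msvc):
    """The risky case is an engine built with a newer MSVC than the local one."""
    return parse_version(local_msvc) >= parse_version(engine_msvc)


# Sample version strings, for illustration only.
print(toolchain_compatible("14.29.30133", "14.28.29910"))
print(toolchain_compatible("14.28.29910", "14.29.30133"))
```

Comparing tuples of integers avoids the classic string-comparison trap where "14.9" sorts after "14.10".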
GitHub Actions is not 100% reliable with self-hosted runners that start and stop. Sometimes it is slow to notice that a runner has woken up. Sometimes it does not reflect job status changes (queued => in progress) in the web UI. There have been a couple of instances of GitHub breaking self-hosted runner functionality in the backend and then fixing it some days later. There are efforts underway to support one-shot jobs natively, but nothing is available yet.
GitHub Actions will “forget” a runner that has been offline for 30 days. Re-registration is manual.
Making any changes to a builder VM (such as, updating the MSVC version) will interrupt - and fail - an in-progress build job.
How do we create and operate an effective build system for game builds? Jenkins is the obvious candidate. Custom controller logic + Nomad is candidate #2. Is it feasible to deploy & do configuration management via GitOps?
Is it feasible to have a combination of VMs and physical machines for the agents? Can we use the same CM scripts for installing software on VMs as for physical machines?
Can we make builder updates without impacting in-progress jobs? Perhaps with a model where we first drain the build system, then update, then re-enable builds? Even better, perhaps we can enqueue the VM updates? The simplistic model where Terraform instantly destroys/creates VMs will not be sufficient.
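One possible shape for the drain-then-update model is a small state machine per builder. This is a thought experiment rather than something we have built; the `Builder` class and its methods are hypothetical, and the actual image swap is reduced to changing a version number.

```python
from enum import Enum


class State(Enum):
    ACCEPTING = "accepting"  # builder takes new jobs
    DRAINING = "draining"    # update requested; waiting for jobs to finish


class Builder:
    """Hypothetical builder that defers updates until it is idle."""

    def __init__(self, image_version):
        self.state = State.ACCEPTING
        self.active_jobs = 0
        self.image_version = image_version
        self.pending_version = None

    def request_update(self, version):
        """Enqueue an update; it is applied once no job is running."""
        self.pending_version = version
        self.state = State.DRAINING
        self._apply_update_if_idle()

    def try_start_job(self):
        """Refuse new jobs while draining, so the builder eventually goes idle."""
        if self.state is not State.ACCEPTING:
            return False
        self.active_jobs += 1
        return True

    def finish_job(self):
        self.active_jobs -= 1
        self._apply_update_if_idle()

    def _apply_update_if_idle(self):
        if self.state is State.DRAINING and self.active_jobs == 0:
            self.image_version = self.pending_version  # stand-in for the real update
            self.pending_version = None
            self.state = State.ACCEPTING


builder = Builder(image_version=1)
builder.try_start_job()
builder.request_update(2)      # job still running, so nothing changes yet
print(builder.image_version)
builder.finish_job()           # last job done, update is applied
print(builder.image_version)
```

The in-progress job is never interrupted; the cost is that new jobs wait (or go to another builder) while the drain is in progress.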
How do we distribute content build results? Perhaps a DDC per location can be used to speed things up - one in Google Cloud, one near the physical machines, one near developers in the office?
Can we isolate compilation, cooking and packaging into separate jobs, and use a graph model instead of a sequential model to describe each final product’s build workflow?
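As a rough picture of what a graph model buys over a sequential one, the sketch below describes a build as jobs with dependencies and groups them into waves that could run in parallel. The job names and dependencies are a hypothetical example, not our actual workflow; the scheduling uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOBS = {
    "compile_editor": set(),
    "compile_game": set(),
    "cook_content": {"compile_editor"},
    "package": {"compile_game", "cook_content"},
}


def schedule_waves(graph):
    """Group jobs into waves; all jobs within a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(schedule_waves(JOBS), start=1):
    print(number, wave)
```

In this example the two compile jobs have no dependency on each other and land in the same wave, whereas a sequential pipeline would run them one after the other.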
Can we enable content creators to work with 100% pre-built binaries (both editor and game) when they work in a branch-oriented SCM?
How far can we trim Engine builds? Not everyone and every build agent needs those 40GB of files…
This has been a lot of fun to R&D, and it is great to see tangible results. There is lots more to be done still.