Migrating BBM Android Continuous Integration to Cloud with Genymotion Cloud and GCP — Part 2

This article is the second in a series of three. It describes in detail the problems we faced and how we went about finding a solution for each of them. If you haven’t read the first article, I suggest reading it first for context. The next article will describe the actual work of building the CI infrastructure in the cloud with Genymotion Cloud and GCP.

In the previous article, we described the problems we were facing with our Continuous Integration (CI) setup:

  1. Job queue time
  2. Pipeline job runtime
  3. Network speed
  4. Emulator stability
  5. Power outage and manual agent nodes launching process

Problem #1: Job queue time

The BBM Android team had around 30 engineers by the end of 2017, spread across Indonesia, Singapore, Canada, and Vietnam. Since pair programming is the norm at our company, that means around 15 pairs, each working on a separate stream of work at the same time.

Each pair works on its own branch and creates a pull request (PR). A Multibranch Pipeline job then automatically generates the job that runs automated tests for their changes. Because we only have a limited number of agent nodes to run tests for branches with PRs, the test job often sits in the queue for 30 to 60 minutes. If a problem with the agent nodes requires a restart, the queue grows further, as no agent nodes are available to run jobs. After such a restart, jobs sometimes start only after 1 to 2 hours.
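To see how quickly a small pool of agents turns into waiting time, here is a minimal sketch of the queue, not a model of our actual Jenkins scheduler. It assumes the 3 agent nodes and the roughly 65-minute job runtime mentioned elsewhere in this article; the arrival pattern (six PRs opened 10 minutes apart) is purely hypothetical.

```python
import heapq


def queue_waits(arrivals, runtime, agents):
    """Return how long each job waits before an agent picks it up.

    arrivals: job submission times in minutes, in ascending order.
    runtime:  minutes each job occupies an agent (assumed constant).
    agents:   number of agent nodes available.
    """
    free_at = [0] * agents  # time at which each agent becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals:
        earliest = heapq.heappop(free_at)  # agent that frees up first
        start = max(arrival, earliest)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + runtime)
    return waits


# Hypothetical morning: six PRs opened 10 minutes apart.
waits = queue_waits([0, 10, 20, 30, 40, 50], runtime=65, agents=3)
print(waits)
```

In this made-up scenario the first three jobs start immediately, and each of the next three waits 35 minutes for an agent to free up.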

This queue problem prevents pairs from taking on the next task in the backlog effectively, as they need to keep checking whether the tests for their PR are passing. If the changes break any functionality, the pair needs to switch context back to the previous task and fix it. After making the necessary changes, the pair will most likely have to wait quite a while again before the tests run.

Goal: Start job as soon as possible!

Problem #2: Pipeline job runtime

The pipeline job for running automated tests consists of 14 stages. The stages are:

  1. Initialization (2–3 minutes). This stage includes checking out source code, acquiring and preparing Genymotion emulators, and downloading necessary libraries to build BBM Android.
  2. Unit tests (15–18 minutes). This stage compiles and runs all unit tests in the main application module and the 13 other library modules.
  3. Instrumentation tests for library modules (12–15 minutes). This stage sequentially compiles and runs all instrumentation tests in 7 library modules.
  4. Build APKs to run instrumentation tests for application module (6 minutes). This stage builds the application module APK and its testing application APK.
  5. Install main and testing application APKs (1 minute). This stage installs both application module APK and testing APK to all Genymotion emulators.
  6. Instrumentation tests for application module (12 minutes). This stage splits all instrumentation tests across all Genymotion emulators and runs them in parallel.
  7. Lint checking (7 minutes). This stage runs lint checking on application module and one library module.
  8. Compile error checking for release build (7 minutes). This stage checks for compile errors when building the release build type.

In total, the pipeline job took a little over an hour to complete.

Goal: Shorten job run time by half!
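As a quick check on the "little over an hour" figure, the durations of the eight stages listed above can simply be added up. The sketch below does that and also shows what halving the total would mean. The shortened stage names are ours, and the sum assumes the stages run strictly one after another.

```python
# (stage name, minimum minutes, maximum minutes), taken from the list above
STAGES = [
    ("Initialization", 2, 3),
    ("Unit tests", 15, 18),
    ("Library instrumentation tests", 12, 15),
    ("Build APKs", 6, 6),
    ("Install APKs", 1, 1),
    ("Application instrumentation tests", 12, 12),
    ("Lint checking", 7, 7),
    ("Release compile check", 7, 7),
]


def total_runtime(stages):
    """Sum the lower and upper bounds of all stage durations."""
    low = sum(minimum for _, minimum, _ in stages)
    high = sum(maximum for _, _, maximum in stages)
    return low, high


low, high = total_runtime(STAGES)
print(f"Sequential runtime: {low} to {high} minutes")
print(f"Half of that: {low / 2} to {high / 2} minutes")
```

The listed stages add up to 62 to 69 minutes, so the goal translates to a pipeline of roughly 31 to 35 minutes.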

Problem #3: Network speed

As all agent nodes are located in the KMK headquarters in Jakarta, Indonesia, the network connection used by those nodes is shared with hundreds of other employees from all departments.

Sometimes, jobs are aborted automatically because the agent node cannot download the required source code or third-party libraries. Isn’t it frustrating to see the job for your PR aborted after it waited an hour in the queue just to get started? On better days the test job does complete, but only after 1.5 hours, because the download speed is slow.

Our office uses a few internet service providers to cater to everyone’s needs. But from time to time, one or more providers have a problem with their service, and the network slows down as a result.

Goal: Stable and faster download speed!

Problem #4: Emulator stability

At times, Genymotion emulators running on our GNU/Linux boxes become unstable for unknown reasons. After being used and reused multiple times, one or more emulators become unusable. Sometimes an emulator is stuck and you can’t interact with it. Other times, a random Activity under test is stuck in the foreground, though it can be dismissed. As a result, tests fail randomly here and there. We’re still unsure what causes this issue. Our theories point to CPU usage or memory management on the agent nodes.

Goal: Fresh emulators, please?

Problem #5: Power outage and manual agent nodes launching process

Yes, this has happened a few times. When it does, someone needs to physically visit the place where all agent nodes are located and make sure they all start up and run. Otherwise, CI won’t be available to any engineer.

Did I mention that we only have one display for all machines? Imagine the trouble of plugging and unplugging the display cable for each machine.

Goal: Automatically launch agent nodes!

Planned Solutions

Let’s take a look at the five goals we’re trying to achieve.

  1. Start job execution as soon as possible. If we can reserve an agent node for each job, the job no longer has to compete with others for resources.
  2. Shorten job runtime by half. If we can upgrade our machines easily from time to time and run multiple things at the same time, we can surely reduce the time required to complete the job.
  3. Stable and faster download speed. A fast download speed will help reduce the build time.
  4. Fresh emulators for each job execution. If each job is given fresh emulators, the number of flaky tests will go down.
  5. Automatically launch agent nodes. If agent nodes can be launched without human intervention, recovery will be faster, and no one needs to waste time dealing with it.

It is obvious that we cannot scale our existing CI infrastructure to achieve all that without making a radical change. So, what options do we have? First, we could explore a managed CI service provided by a well-known company. Or, we could rebuild our CI infrastructure in the cloud and have fun while doing that. We decided to go with the latter!

There are two key decisions in moving to the cloud. First, the cloud provider where the master and agent nodes run. Second, the Android emulators used by all agent nodes to run instrumentation tests.

Cloud Provider

Deciding on the cloud provider was easy. As KMK is already a Google Cloud Platform (GCP) customer, it was the obvious choice. But when it came to how and where to run our Android emulators, we had a lot of homework to do.

Android Emulator

We had a few options for the Android emulator:

  1. AWS Device Farm or Firebase Test Lab
  2. Headless Android emulator on top of Google Compute Engine (GCE) instance
  3. Genymotion

AWS Device Farm and Firebase Test Lab do not fit our needs, as they are designed to run the same tests on multiple devices. In our case, we split our automated tests according to the number of emulators we have and run them all in parallel using adb shell commands. We crossed this option off and moved on to the next one.
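To make the splitting idea concrete, here is a minimal sketch rather than our actual pipeline code. It assigns test classes to emulators round-robin and builds one `adb shell am instrument` command per emulator. The test class names, emulator serials, and the `com.example.app.test` package are hypothetical placeholders, and the round-robin strategy is an assumption, since any split that balances the load would do.

```python
def split_tests(test_classes, serials):
    """Assign test classes to emulator serials round-robin."""
    shards = {serial: [] for serial in serials}
    for index, test_class in enumerate(test_classes):
        serial = serials[index % len(serials)]
        shards[serial].append(test_class)
    return shards


def instrument_command(serial, test_classes, package, runner):
    """Build the adb command that runs the given classes on one emulator."""
    return [
        "adb", "-s", serial, "shell", "am", "instrument", "-w",
        "-e", "class", ",".join(test_classes),
        f"{package}/{runner}",
    ]


# Hypothetical test classes and emulator serials, for illustration only.
tests = ["com.example.LoginTest", "com.example.ChatTest", "com.example.FeedTest"]
serials = ["192.168.56.101:5555", "192.168.56.102:5555"]

for serial, classes in split_tests(tests, serials).items():
    command = instrument_command(
        serial, classes,
        package="com.example.app.test",            # hypothetical test APK package
        runner="androidx.test.runner.AndroidJUnitRunner",
    )
    print(" ".join(command))
```

Each printed command targets a single emulator, so the commands can be launched concurrently (for example with `subprocess.Popen`) and then waited on together.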

Next, we tried running the Android emulator in headless mode with xvfb. During our testing period, we found that the tests were not stable: at times, tests just got stuck and never completed. So we parked this option and explored the last option on our list.

And finally, we turned to Genymotion. Genymotion is a cross-platform Android emulator that comes with a set of sensors and features designed to help app developers and testers run their automated tests in a virtual environment. Besides the desktop version, which is what our existing CI configuration uses, Genymotion also provides a cloud version called Genymotion Cloud.

Genymotion Cloud comes in two options: PaaS and SaaS. Genymotion Cloud PaaS lets you spin up an instance on well-known cloud providers such as GCP, AWS, or Alibaba. Genymotion Cloud SaaS lets you create emulators and start using them right away, all managed by the Genymotion team.

From the spike we did, we found Genymotion Cloud SaaS to be very reliable. We could create, start, and destroy emulators quickly and easily. Across the dozens of sessions in which we ran instrumentation tests, all of them ran perfectly fine. It also provides tools that are easy to use and have all the functionality we need to achieve what we want to do.

Genymotion Cloud + GCP = 💖

How does combining Genymotion Cloud and GCP solve our problems, in theory?

  1. Job waiting time. This is related to the limited number of agent nodes available. In our existing CI infrastructure, we only have 3 agent nodes. By moving to GCP, we can create as many agent nodes as needed at any time. The idea is that when a job is executed, we start a new VM instance on GCP, register it as an agent node, and then run our automated tests on it.
  2. Pipeline job runtime. This is related to the specs of the machines used as agent nodes. By moving to GCP, we can start new VM instances with higher specs, so they can run more processes at the same time. If we need even higher specs, we can just change the instance type and be done with it! And by using Genymotion Cloud, we can start as many emulators as needed and run multiple tests in parallel.
  3. Network speed. This one is obvious! Agent nodes running on GCP no longer share the office internet connection.
  4. Emulator stability. As we can start Genymotion Cloud emulators at will, we create a bunch of emulators when the job starts and destroy them upon job completion. This ensures each job uses freshly started emulators.
  5. Power outage and manual agent nodes launching process. This one is obvious, too! Agent nodes in the cloud do not depend on the power supply in our office, and nobody has to start them by hand.
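The "create emulators when the job starts, destroy them on completion" idea from the list above can be sketched as a context manager. This is only an illustration of the lifecycle, not our pipeline code: `start` and `stop` are hypothetical callables standing in for whatever tool actually creates and destroys a Genymotion Cloud emulator, and the fake versions below just record what happened.

```python
import itertools
from contextlib import contextmanager


@contextmanager
def fresh_emulators(count, start, stop):
    """Start `count` emulators, yield them, and always destroy them afterwards.

    `start` and `stop` are hypothetical callables, injected so the sketch
    does not depend on any real emulator tooling.
    """
    emulators = []
    try:
        for _ in range(count):
            emulators.append(start())
        yield emulators
    finally:
        # Runs even if the tests raise, so no emulator is ever reused.
        for emulator in emulators:
            stop(emulator)


# Fake start/stop functions that only log their calls.
log = []
counter = itertools.count(1)


def fake_start():
    name = f"emulator-{next(counter)}"
    log.append(("start", name))
    return name


def fake_stop(name):
    log.append(("stop", name))


with fresh_emulators(2, fake_start, fake_stop) as emulators:
    log.append(("run tests", tuple(emulators)))

print(log)
```

Putting the cleanup in `finally` is the important design choice: a failed or aborted test run still destroys its emulators, so the next job cannot inherit a stuck one.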

Here is the simplified version of the pipeline process.

In the next article, I will go through the step-by-step process of rebuilding our CI infrastructure in the cloud in detail.

Thanks to Ellinor Kwok. 

