Administration of Coin

This part of the documentation is for people maintaining an instance of Coin.

If you have never built or interacted with Coin, see the Launching Coin section below to get started.

Restarting Coin

It is important to be aware of the consequences of one’s actions. When Coin gets interrupted, for whatever reason, it does not keep any state for work in progress. Coin only tracks “state” in the form of files for finished steps (aka Work Items). So if a build of qtbase was successful, Coin remembers it and re-uses the binaries when building other modules. On the other hand, shutting down Coin while an integration is still running means that the build is counted as invalid and discarded. Coin attempts to remove all stale virtual machines on shutdown and startup to keep resources in order.

Coin can run in three ways. When running with OpenNebula, clones of VM templates are owned by the OpenNebula user launching the system, and OpenNebula’s API is used to allocate VMs. This makes it possible to run one central “definitive” instance (“real”), which has the actual connection to Gerrit as input and approves changes when integrations finish. At the same time, several developers can run “private” instances under their own accounts on the Coin server; these share resources with the “real” system without getting in its way. The third option is a “local” build: you run everything on your own Linux system without contacting any hypervisor for shared resources.

All in all, these options allow for sensible development environments that do not interfere with the live system.

If you run your own “private” instance of Coin, you probably do not want it to make any real changes in Gerrit. For that reason, by default only the user vmbuilder listens to Gerrit and applies changes to it.

To start Coin run:

systemctl --user start coin

The service is controlled by the usual systemd commands, such as:

systemctl --user status coin
systemctl --user stop coin

For logging, use journald:

journalctl --user -u coin -f

See journalctl’s options to learn more (-f follows the log; you can instead use time ranges or many other options to read the logs).

Alternatively, you can run the run_ci.py script directly, but be aware that this is not an option in production, since the instance has to keep running instead of being killed when you log out. Only one instance of the script may run at a time: terrible things will happen if two instances run at the same time! Run your own instance with:

pipenv run ./src/run_ci.py

For everyone who does not have access to OpenNebula, there is a light version of run_ci that runs on your local Linux system:

pipenv run ./src/run_ci.py --hypervisor local

Launching Coin

The master runs on Linux only. It requires Python >= 3.6, Go >= 1.8, Git >= 2.0.0 and npm (from nodejs) installed. While it needs many other dependencies, a Makefile automates installing them and makes running the system easy.

Dependencies

python virtualenv
python3 (>= 3.5)
python3-dev (for some of the packages installed by pip)
python3-sphinx
git (>= 2.0.0)
go (>= 1.8)
npm and nodejs (or nodejs-legacy on Debian) v20.9.0+
pkg-config
flex
bison
zip
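
As a quick sanity check before running make, the command-line tools from the list above can be looked up on the PATH. The following is a minimal sketch, not part of Coin: the helper name missing_tools and the executable names in TOOLS are assumptions derived from the list, and your distribution may name some of them differently (for example nodejs instead of node). It does not check versions.

```python
import shutil

# Executable names are assumptions derived from the dependency list above;
# adjust them to what your distribution actually installs.
TOOLS = ("python3", "git", "go", "npm", "node", "pkg-config", "flex", "bison", "zip")


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools() returns an empty list when every tool was found. The which parameter only exists so that the lookup can be replaced in tests.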

On Debian/Ubuntu (>=18.04) the needed packages are:

apt install nodejs npm python3-dev python3-sphinx cmake \
            pkg-config flex bison zip pigz p7zip-full \
            python3-pip zlib1g-dev libffi-dev libssl-dev \
            openssl libsqlite3-dev

For an OpenNebula setup, you need to install additional packages:

apt install opennebula-common opennebula-tools ruby-opennebula

To use the OpenNebula hypervisor locally, you need to set these environment variables:

ONE_XMLRPC
QT_CI_OWN_IP
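
A forgotten variable is easy to catch before launching Coin. The sketch below only checks that the two variables named above are set and non-empty. The helper missing_opennebula_vars is hypothetical and not part of Coin, and it makes no assumption about what values the variables should hold.

```python
import os

# Variable names are taken from the text above; the helper is hypothetical.
REQUIRED_VARS = ("ONE_XMLRPC", "QT_CI_OWN_IP")


def missing_opennebula_vars(environ=os.environ):
    """Return the required variables that are unset or empty in environ."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

An empty result means both variables are present in the current environment.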

The Makefile uses pipenv to manage the Python virtual environment. Install it with:

python3 -m pip install --user pipenv

Make sure to have pipenv in the PATH.

In addition, you need to install Go from https://golang.org/doc/install. The packages shipped with Linux distributions are unfortunately often not suitable, because they do not support cross-compilation, which Coin uses.

First time setup

The Coin root directory contains a Makefile that helps set things up quickly. The recommended way to get started is to just run:

make -j1

in the top level directory. Running make will:

update git submodules

install needed node modules for the website

build the Go parts

Of course it’s possible to do all the steps manually; just peek inside the Makefile to understand what’s going on.

To double-check that everything works, running at least these two tests is recommended:

src/test_storage.py
src/test_repositorymanager.py

Coin is a collection of scripts and binaries that can be run individually. To make common tasks easy, a few helper scripts can be used to start the needed parts.

To read this documentation as a website, launch the CI:

./run_ci --hypervisor local

and head over to http://localhost:8080. The documentation will be at http://localhost:8080/coin/doc.

Provisioning

The goal is to automate this away, but for the time being, the VMs used for testing need to be provisioned with a few additional packages and patches. Make sure that the CI is running (e.g. ./run_ci private), otherwise the provisioning will fail since it expects that the webserver is running. Then run:

./bin/coin provision

This command will create a clone of all templates found in the base directory and install additional required software.

Web Server

The web server is a quick way to understand the CI status. During startup it prints the URL at which it can be accessed; it will look like this: http://localhost:8080

To make the web server publicly available, we recommend setting up your primary web server as a reverse proxy. For Apache, that means you need three modules:

mod_proxy
mod_proxy_http
mod_proxy_wstunnel

All three modules need to be enabled and loaded. Assuming the Coin web server is running on port 8080, the Apache configuration could look something like this:

ProxyPass /coin/websocket ws://localhost:8080/coin/websocket
ProxyPass /coin http://localhost:8080/coin

Runtime configuration

Some aspects of Coin can be adjusted while Coin is running. These settings are stored in coin_configuration.json in ci-working-dir. Example file:

{
    "gerrit_monitor": {
        "disable_cherry_pick_delay": false,  # Disables cherrypick accumulation delay.
        "serialize_dependency_updates": true,
        "parallel_integration_delay": 1800,  # Delay between parallel integrations.
        "parallel_project_rules": [
            "qt/qtbase/dev"  # Projects which are allowed to run in parallel. Regex support.
        ]
    },
    "opennebula": {
        "sched_rules": [
            {
                "SCHED_REQUIREMENTS": "NESTED_VIRT = true",
                "match": {
                    "target.os": "QNX|Android"
                }
            }
        ]
    }
}
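
To illustrate the structure, here is a minimal sketch of reading such a file with Python's standard json module. It is not Coin's own loader. Because json.loads rejects "#" comments, the sketch uses a comment-free copy of part of the example; the function name load_config is hypothetical.

```python
import json

# Comment-free copy of part of the example above: the standard json module
# does not accept "#" comments.
EXAMPLE = """
{
    "gerrit_monitor": {
        "disable_cherry_pick_delay": false,
        "serialize_dependency_updates": true,
        "parallel_integration_delay": 1800,
        "parallel_project_rules": ["qt/qtbase/dev"]
    },
    "opennebula": {"sched_rules": []}
}
"""


def load_config(text):
    """Parse the configuration text and return it as a dict."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("top-level configuration must be a JSON object")
    return config


config = load_config(EXAMPLE)
# Missing sections fall back to an empty dict instead of raising KeyError.
monitor = config.get("gerrit_monitor", {})
delay = monitor.get("parallel_integration_delay")
```

Reading values with get() keeps the sketch tolerant of files that omit a section or a key.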

Gerrit monitor

Some attributes of parallel integrations are configurable at runtime.

disable_cherry_pick_delay

By default, cherry-picks by the cherrypick_bot@qt-project.org user are accumulated for 6h. Setting this option to true disables that delay. The delay can also be bypassed by restaging a change manually.

serialize_dependency_updates

If true, dependency updates are not allowed to run in parallel with other integrations.

parallel_project_rules

Defines which projects are allowed to run in parallel. Rules are regular expressions matched against strings of the form “<project>/<branch>”.
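
The matching can be pictured with the following sketch. It is not Coin's implementation: the function name allowed_in_parallel is hypothetical, and it assumes that a rule has to match the whole “<project>/<branch>” string (re.fullmatch). Coin may anchor its patterns differently.

```python
import re


def allowed_in_parallel(project, branch, rules):
    """Return True if any rule matches "<project>/<branch>".

    Assumption: whole-string matching via re.fullmatch.
    """
    key = f"{project}/{branch}"
    return any(re.fullmatch(rule, key) for rule in rules)
```

With the rule from the example file, allowed_in_parallel("qt/qtbase", "dev", ["qt/qtbase/dev"]) is true, while the same project on another branch is not matched.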

OpenNebula

Allows creating SCHED_REQUIREMENTS rules that are applied to virtual machines based on the workitemconfiguration. This makes it possible, for example, to select only specific hosts for some work items.

SCHED_REQUIREMENTS

Corresponds directly to OpenNebula’s SCHED_REQUIREMENTS. The given rule is added and passed as is to OpenNebula.

match

Defines the configurations for which the rule is applied. All entries defined under match must match for the rule to be applied. It consists of pairs of a workitemconfiguration attribute and a regex value. Nested attributes can be used; the rule below matches against workitemconfiguration.target.os:

"target.os": "QNX|Android"

If the value regex matches, the match rule is considered fulfilled.
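
The rule evaluation described above can be sketched as follows. This is an illustration, not Coin's code. It assumes the work item configuration is available as nested dictionaries and that values are tested with re.search; the helper names lookup, rule_applies and sched_requirements are hypothetical.

```python
import re


def lookup(config, dotted_name):
    """Resolve a dotted name such as "target.os" against nested dicts."""
    value = config
    for part in dotted_name.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rule_applies(rule, config):
    """Return True if every attribute/regex pair under "match" matches.

    Assumption: re.search semantics; Coin may anchor patterns differently.
    """
    for name, pattern in rule.get("match", {}).items():
        value = lookup(config, name)
        if value is None or not re.search(pattern, str(value)):
            return False
    return True


def sched_requirements(rules, config):
    """Collect the SCHED_REQUIREMENTS strings of all rules that apply."""
    return [r["SCHED_REQUIREMENTS"] for r in rules if rule_applies(r, config)]
```

In this sketch a rule without any match entries applies to every configuration, and a missing attribute counts as a failed match.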

Updating production

We want to deploy new code as fast as possible, but without risking releases or blocking the CI for a longer period of time. Keep in mind that updating requires a full restart: all work items will be interrupted or canceled.

General rules

1. When restarting the CI system, communicate this ahead of time in the #qtci channel to allow for objections and discussion.

2. Despite our continuous release process, there may be times that are more “critical” than others in terms of stability and the ability to create packages for the releases. We should allow for short periods of freeze in terms of updating the CI system in production if we agree on it up-front.

3. Updating the production instance to a new version (not just restarting it) happens during a maintenance break. Exceptions to this can be hotfixes.

Instructions sequence

Stop the currently running Coin instance:

systemctl --user stop coin

The maintenance web server is started automatically.

Apply the update:

git pull

There should not be any conflicts. If there are, the production checkout was not clean. The state in production must always follow the state of the branch in Gerrit. Resolve the situation in the Gerrit branch and check out the current production branch.

Run Coin again:

systemctl --user start coin

The Coin service will automatically stop the maintenance web server.
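
The sequence above could be scripted so that Coin is not restarted when git pull fails. The sketch below is an assumption about how one might do that, not an existing Coin tool. The function update_production and its cwd parameter (the Coin checkout directory) are hypothetical, and the run parameter only exists so that the commands can be replaced in a dry run.

```python
import subprocess

# The three commands from the sequence above, in order.
UPDATE_STEPS = (
    ["systemctl", "--user", "stop", "coin"],
    ["git", "pull"],
    ["systemctl", "--user", "start", "coin"],
)


def update_production(run=subprocess.run, cwd=None):
    """Run the update steps in order and stop at the first failure.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit status, so a failed "git pull" prevents the restart.
    """
    for step in UPDATE_STEPS:
        run(step, check=True, cwd=cwd)
```

If git pull fails, Coin stays stopped and the situation has to be resolved by hand as described above.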

After update monitoring

As Coin is brought up, it will continue working on the existing work queue that got interrupted at the update restart. Immediately after the update it’s good practice to keep an eye on the journald outputs. Check for any errors or deviations in the logs.

Creation of new work items is done when Coin is triggered to run a new build. This is the point where errors can occur while creating those new work items.

Next, it’s good to check that agents really reach the running state. This can be checked in Coin’s web interface under Agents. If Coin has problems communicating with OpenNebula, or we have issues cloning virtual machines, the agents will never reach the running state. It could also be any kind of problem setting the machine up before provisioning is launched.

Then we wait a while for post-build events, such as uploading logs. Possible errors here don’t appear until builds have finished and logs are being uploaded. This, combined with other post-build events, can be checked by looking at the end result of builds. If builds start appearing green in the web interface, most likely everything is OK. If red builds start appearing, check what’s going on. Compile and test failures are still OK as far as the CI is concerned (unless you see every build failing with the same errors, which shouldn’t be caused by the commits being built), but system failures are something one should pick up on.

Also don’t expect the builds to finish completely in seconds. Unless you manually trigger a build that has already been dealt with, you should expect some time spent on compiling and perhaps testing.

The Daily monitoring routines are also a good checklist to go through at this point. They cover checking the web services and Gerrit syncs. Check them a few times over several hours to find “hung” states; these may show that Gerrit is out of sync.

Daily monitoring

First, log on to the Coin master machine with your user credentials using ssh:

ssh <username>@coin

Now you can grep for errors and warnings with the following command:

journalctl --user -u coin -S "1h ago" | grep -iE "traceback|warning|error|critical"

You can also tail the log in real-time:

journalctl --user -u coin -f
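
When the filtered output is long, a per-keyword count gives a quicker overview than reading every line. The following sketch applies the same case-insensitive pattern as the grep command above to an iterable of log lines; the function name keyword_counts is hypothetical.

```python
import re
from collections import Counter

# Same alternation as the grep -iE pattern above.
PATTERN = re.compile(r"traceback|warning|error|critical", re.IGNORECASE)


def keyword_counts(lines):
    """Count how often each keyword occurs, ignoring case."""
    counts = Counter()
    for line in lines:
        for hit in PATTERN.findall(line):
            counts[hit.lower()] += 1
    return counts
```

The lines can come from any source, for example journalctl output piped into a script that passes sys.stdin to keyword_counts.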

Any errors printed should be addressed by creating a bug report in JIRA that includes as much information and log output about the error as possible. The ticket should include:

Summary of the conditions that caused the issue

Link to build log

Link to integration which failed

https://bugreports.qt.io/projects/COIN

If you start seeing multiple error logs, it’s time to notify the CI team via the #qtci IRC channel or in person. Pay attention to the timestamps in front of the log messages, since old logs will also be printed.

You can use Grafana to monitor Coin test results to detect flakiness in tests:

https://testresults.qt.io/grafana/dashboard/db/overview

Check that there is enough disk space on the file system:

df /

If there is not enough space, it might be that the garbage collector is not freeing enough space or that individual users are using too much disk space.
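
The same check can be done from Python with shutil.disk_usage, for instance as part of a monitoring script. This is a minimal sketch. The helper names are hypothetical, and the 10% default threshold is an arbitrary placeholder, not a value used by Coin or its garbage collector.

```python
import shutil


def free_fraction(path="/", disk_usage=shutil.disk_usage):
    """Return the fraction of the file system at path that is free."""
    usage = disk_usage(path)
    return usage.free / usage.total


def low_on_space(path="/", threshold=0.10, disk_usage=shutil.disk_usage):
    """Return True if less than threshold of the file system is free.

    The default threshold is a placeholder; pick one that fits your setup.
    """
    return free_fraction(path, disk_usage) < threshold
```

The disk_usage parameter only exists so that the measurement can be replaced in tests.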

Check the number of agents in agentpool:

http://coin/coin/agentpool

and check if there are many (hundreds of) agents in “waiting” state with none or only a handful in “running” state. Such situations usually indicate an error state that should also be noticeable in the journald logs. If there are no errors in coin.log, check if the current builds were all freshly started. If not, contact the CI team.
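
The rule of thumb above can be written down as a small predicate. The sketch assumes you already have the list of agent states as strings; how to obtain them from the agentpool page is not covered here. The function name and both limits are hypothetical stand-ins for “hundreds” and “a handful”.

```python
from collections import Counter


def agent_pool_looks_stuck(states, waiting_limit=100, running_limit=5):
    """Return True if many agents wait while almost none are running.

    Both limits are placeholder values, not thresholds defined by Coin.
    """
    counts = Counter(states)
    return counts["waiting"] >= waiting_limit and counts["running"] <= running_limit
```

A true result is only a hint to look at the logs, as described above; freshly started builds can produce the same picture for a short while.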

Check that all the OpenNebula servers are up by logging in to OpenNebula as the “oneadmin” user. Then go to:

http://one-master.ci.qt.io/#hosts-tab

and check that all the servers show “ON” in the “Status” column.

Check that web servers are running:

https://testresults.qt.io/coin/

https://ci-files01-hki.ci.qt.io:8081/coin/

http://coin/coin/

Also check that, while they are running, they show the latest build results. If they don’t appear to be up to date, we have a problem somewhere.

Check out Gerrit’s status and match it with Coin’s status. They should converge.

https://codereview.qt-project.org/#/q/status:integrating,n,z

https://codereview.qt-project.org/#/q/status:staged,n,z

If these steps all seem OK, grab a cup of coffee and try again :)

Troubleshooting

Some uncommon administration tasks may be required, especially during development.

Debugging with gdb

Deadlocks and hung threads can be debugged after installing the Python debug interpreter and recreating the pipenv environment with it:

sudo apt install python3-dbg
pipenv --rm
pipenv --python=/usr/bin/python3-dbg

Attach gdb to the running Python process:

gdb python3-dbg -p 12345

Useful commands

py-list: Show current code in execution
py-bt: Inspect Python call stack

If you get a “(no debugging symbols found)” message, you can load the gdb helper script manually:

(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.6-gdb.py

Running garbage collection

Normally gc is executed automatically, keeping a safe threshold of free disk space, so there is no need to run it manually. Nevertheless, it is possible to force gc by running:

bin/coin gc

Artifacts invalidation

It might be necessary to invalidate all artifacts, for example if Coin code changes in an incompatible way. The solution is relatively easy: stop the whole system and give the buildKey() function in storage.py a new revision number (grep for “COIN revision”). After a restart, everything will work as it does after a storage cleanup. Old binaries may be removed manually or by garbage collection.
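
Why a new revision number invalidates everything can be shown with a small sketch. This is not the code of buildKey() in storage.py. The function build_key, the REVISION constant and the choice of SHA-256 are all assumptions made for illustration. The point is only that a revision mixed into the key changes every key at once, so previously stored artifacts are no longer found.

```python
import hashlib

# Hypothetical stand-in for the revision number mentioned above.
REVISION = 1


def build_key(inputs, revision=REVISION):
    """Derive a cache key from the build inputs and a revision number."""
    digest = hashlib.sha256()
    digest.update(f"revision:{revision}\n".encode("utf-8"))
    for item in inputs:
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With identical inputs, build_key(inputs, 1) and build_key(inputs, 2) differ, so artifacts stored under the old keys are simply never looked up again and can be cleaned up later.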