Administration of Coin
This part of the documentation is for people maintaining an instance of Coin.
If you have never built or interacted with Coin, start with the Launching Coin section below.
Restarting Coin
Be aware of the consequences before restarting. When Coin is interrupted, for whatever reason, it keeps no state about work in progress. The only “state” Coin tracks is the set of files written for finished steps (aka Work Items). So if a build of qtbase was successful, Coin remembers it and re-uses the binaries when building other modules. Shutting down Coin while an integration is still running, on the other hand, means that the build is counted as invalid and discarded. Coin attempts to remove all stale virtual machines on shutdown and startup to keep resources in order.
When running with OpenNebula, clones of VM templates are owned by the OpenNebula user that launched the system, and VMs are allocated through OpenNebula’s API. This makes it possible to run one central, definitive instance (“real”), which has the actual connection to Gerrit as input and approves changes when integrations finish. At the same time, several developers can run “private” instances under their own accounts on the Coin server; these share resources with the “real” system without getting in its way.
The third option is a “local” build: you run everything on your own Linux system without contacting any hypervisor for shared resources.
Together, these options allow for sensible development environments that do not interfere with the live system.
If you run your own “private” instance of Coin, you probably do not want it to make real changes in Gerrit. For that reason, by default only the user vmbuilder listens to Gerrit and applies changes to it.
To start Coin run:
systemctl --user start coin
The service is controlled by the usual systemd commands, such as:
systemctl --user status coin
systemctl --user stop coin
For logging, use journald:
journalctl --user -u coin -f
Check journalctl’s options to learn more: -f follows the log, but you can also use time ranges or many other options to read the logs.
Alternatively, you can run the run_ci.py script directly. This is not an option in production, because the instance has to keep running instead of being killed when you log out. The script must never be started twice; terrible things will happen if two instances run at the same time! Run your own instance with:
pipenv run ./src/run_ci.py
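How Coin itself protects against a second copy is not described here. As a generic illustration of guarding a script against running twice, the following minimal sketch uses an advisory lock file. The helper name and the lock path are hypothetical and not part of Coin, and the approach assumes a Linux system.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    The caller must keep the returned file object alive for as long as the
    process runs; the lock is released when the file is closed.
    """
    lock_file = open(lock_path, "a+")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    return lock_file
```

A wrapper script could call this once at startup and exit with a message when it returns None.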
For everyone who does not have access to OpenNebula, there is a light version of run_ci that runs on your local Linux system:
pipenv run ./src/run_ci.py --hypervisor local
Launching Coin
The master runs on Linux only. It requires Python >= 3.6, Go >= 1.8, Git >= 2.0.0 and npm (from nodejs) installed. While it needs many other dependencies, a Makefile automates installing them and makes running the system easy.
Dependencies
python virtualenv
python3 (>= 3.5)
python3-dev (for some of the packages installed by pip)
python3-sphinx
git (>= 2.0.0)
go (>= 1.8)
npm and nodejs (or nodejs-legacy on Debian) v20.9.0+
pkg-config
flex
bison
zip
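Before running make, it can help to confirm that the listed command-line tools are on the PATH. The following is a minimal sketch, not part of Coin: the tool names are the executables assumed to correspond to the list above, and the check covers presence only, not the minimum versions.

```python
import shutil

# Executables assumed to correspond to the dependency list above.
REQUIRED_TOOLS = ["python3", "git", "go", "npm", "pkg-config", "flex", "bison", "zip"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```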
On Debian/Ubuntu (>=18.04) the needed packages are:
apt install nodejs npm python3-dev python3-sphinx cmake \
pkg-config flex bison zip pigz p7zip-full \
python3-pip zlib1g-dev libffi-dev libssl-dev \
openssl libsqlite3-dev
For an OpenNebula setup, you need to install additional packages:
apt install opennebula-common opennebula-tools ruby-opennebula
To use the OpenNebula hypervisor locally, you need to set these environment variables:
ONE_XMLRPC
QT_CI_OWN_IP
The Makefile uses pipenv to manage the Python virtual environment. Install it with:
python3 -m pip install --user pipenv
Make sure to have pipenv in the PATH.
In addition, you need to install Go from https://golang.org/doc/install . The packages shipped with Linux distributions are unfortunately often unsuitable, because they do not support cross-compilation, which Coin relies on.
First time setup
The Coin root directory contains a Makefile that helps set things up quickly. The recommended way to get started is to just run:
make -j1
in the top level directory. Running make will:
update git submodules
install needed node modules for the website
build the Go parts
Of course it’s possible to do all the steps manually; just peek inside the Makefile to understand what’s going on.
To double-check that everything works, run at least these two tests:
src/test_storage.py
src/test_repositorymanager.py
Coin is a collection of scripts and binaries that can be run individually. To make common tasks easy, a few helper scripts can be used to start the needed parts.
To read this documentation as a website, launch the CI:
./run_ci --hypervisor local
and head over to http://localhost:8080. The documentation will be at http://localhost:8080/coin/doc.
Provisioning
The goal is to automate this away, but for the time being the VMs used for testing need to be provisioned with a few additional packages and patches. Make sure that the CI is running (e.g. ./run_ci private); otherwise provisioning will fail, because it expects the web server to be running. Then run:
./bin/coin provision
This command will create a clone of all templates found in the base directory and install additional required software.
Web Server
The web server is a quick way to understand the CI status. During startup Coin prints the URL for accessing it, which looks like this: http://localhost:8080
To make the web server publicly available, we recommend setting up your primary web server as a reverse proxy. For Apache, that means you need three modules:
mod_proxy
mod_proxy_http
mod_proxy_wstunnel
All three modules need to be enabled and loaded. Assuming the Coin web server is running on port 8080, the Apache configuration could look something like this:
ProxyPass /coin/websocket ws://localhost:8080/coin/websocket
ProxyPass /coin http://localhost:8080/coin
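The order of the two rules matters: Apache checks ProxyPass rules in configuration order and the first match wins, so the more specific websocket path has to come first. The sketch below only models that first-match behaviour to show the effect; it is a simplification, and the function name is made up for illustration.

```python
# Same rules, in the same order, as the Apache snippet above.
PROXY_RULES = [
    ("/coin/websocket", "ws://localhost:8080/coin/websocket"),
    ("/coin", "http://localhost:8080/coin"),
]


def resolve_backend(path, rules=PROXY_RULES):
    """Return the backend URL for a request path, or None if nothing matches.

    Mimics first-match-wins prefix matching and ignores details such as
    trailing-slash handling.
    """
    for prefix, backend in rules:
        if path.startswith(prefix):
            return backend + path[len(prefix):]
    return None
```

With the rules reversed, a request for /coin/websocket would be caught by the plain /coin rule and forwarded over HTTP instead of as a websocket.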
Runtime configuration
Some aspects of Coin can be adjusted while Coin is running. These settings are in coin_configuration.json in ci-working-dir. Example file (the # comments are annotations for the reader; standard JSON has no comment syntax):
{
    "gerrit_monitor": {
        "disable_cherry_pick_delay": false, # Disables cherrypick accumulation delay.
        "serialize_dependency_updates": true,
        "parallel_integration_delay": 1800, # Delay between parallel integrations.
        "parallel_project_rules": [
            "qt/qtbase/dev" # Projects which are allowed to run in parallel. Regex support.
        ]
    },
    "opennebula": {
        "sched_rules": [
            {
                "SCHED_REQUIREMENTS": "NESTED_VIRT = true",
                "match": {
                    "target.os": "QNX|Android"
                }
            }
        ]
    }
}
Gerrit monitor
Some attributes of parallel integrations are configurable at runtime.
Disable_cherry_pick_delay
By default, cherry-picks made by the cherrypick_bot@qt-project.org user are accumulated for 6 hours. Setting this to true disables that delay. The delay can also be bypassed by restaging a change manually.
Serialize_dependency_updates
If true, dependency updates are not allowed to run in parallel with other integrations.
Parallel_project_rules
Defines which projects are allowed to run in parallel. Each rule is a regular expression that is matched against a string of the form “<project>/<branch>”.
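As an illustration of how such rules could be evaluated, here is a minimal sketch. It is not Coin's implementation: the function name is hypothetical, and requiring the regex to match the whole “<project>/<branch>” string is an assumption.

```python
import re


def may_run_in_parallel(project, branch, rules):
    """Return True if "<project>/<branch>" matches any rule.

    Assumption: a rule must match the whole string (re.fullmatch).
    """
    candidate = f"{project}/{branch}"
    return any(re.fullmatch(rule, candidate) for rule in rules)
```

With the example file above, may_run_in_parallel("qt/qtbase", "dev", ["qt/qtbase/dev"]) is true, while the same call for qt/qtdeclarative is not.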
Opennebula
Allows creating SCHED_REQUIREMENTS rules that are applied to virtual machines based on the workitemconfiguration. This makes it possible, for example, to select only specific hosts for some work items.
SCHED_REQUIREMENTS
Corresponds directly to OpenNebula’s SCHED_REQUIREMENTS. The given rule is added and passed as is to OpenNebula.
Match
Defines the configurations to which the rule applies. All entries under match must match for the rule to be applied. Each entry pairs a workitemconfiguration attribute with a regex value. Nested attributes can be addressed with dots; the entry below matches against workitemconfiguration.target.os:
"target.os": "QNX|Android"
If the regex matches the attribute’s value, the entry is considered fulfilled.
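The matching described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Coin's code: the workitemconfiguration is modelled as a nested dict, the helper names are hypothetical, and an unanchored re.search stands in for whatever regex call Coin actually uses.

```python
import re


def lookup(config, dotted_key):
    """Follow a dotted key such as "target.os" through nested dicts."""
    value = config
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rule_applies(rule, workitem_config):
    """Return True if every entry under "match" matches the configuration."""
    for dotted_key, pattern in rule["match"].items():
        value = lookup(workitem_config, dotted_key)
        # Assumption: an unanchored search; Coin may anchor the regex instead.
        if value is None or re.search(pattern, str(value)) is None:
            return False
    return True


def sched_requirements(rules, workitem_config):
    """Collect the SCHED_REQUIREMENTS strings of all rules that apply."""
    return [r["SCHED_REQUIREMENTS"] for r in rules if rule_applies(r, workitem_config)]
```

For the example rule, a configuration with target.os set to Android yields ["NESTED_VIRT = true"], while one with target.os set to Linux yields an empty list.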
Updating production
We want to deploy new code as fast as possible, but without risking releases or blocking the CI for a longer period of time. Keep in mind that updating requires a full restart: all work items will be interrupted or canceled.
General rules
1. When restarting the CI system, communicate this ahead of time in the #qtci channel to allow for objections and discussion.
2. Despite our continuous release process there may be times that are more “critical” than others in terms of stability and the ability to create packages for the releases. We should allow for short periods of freeze in terms of updating the CI system in production if we agree on it up-front.
3. When updating the production instance to a new version (not just re-starting), this happens during maintenance break. Exceptions to this can be hotfixes.
Instructions sequence
Stop the currently running Coin instance:
systemctl --user stop coin
The maintenance web server is started automatically.
Apply the update:
git pull
There should not be any conflicts. If there are, the production checkout was not clean. The state in production must always follow the state of the branch in Gerrit. Fix the situation in the Gerrit branch, then check out the current production branch.
Run Coin again:
systemctl --user start coin
The Coin service will automatically stop the maintenance web server.
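The sequence above could be wrapped in a small script so that a failed step stops the update instead of continuing. This is only a sketch: the function name and the repo_dir parameter (the path of the production checkout) are assumptions, and the commands are the three shown above.

```python
import subprocess

# The three steps from the sequence above, in order.
UPDATE_STEPS = [
    ["systemctl", "--user", "stop", "coin"],
    ["git", "pull"],
    ["systemctl", "--user", "start", "coin"],
]


def run_update(repo_dir, run=subprocess.run):
    """Run the update steps in order and stop at the first failure."""
    for command in UPDATE_STEPS:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed "git pull" never reaches the restart step.
        run(command, cwd=repo_dir, check=True)
```

If git pull fails, Coin stays stopped, which leaves time to clean up the checkout before starting the service again by hand.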
After update monitoring
As Coin is brought up, it will continue working on the existing work queue that got interrupted at the update restart. Immediately after the update it’s good practice to keep an eye on the journald outputs. Check for any errors or deviations in the logs.
Creation of new work items is done when Coin is triggered to run a new build. This is the point where errors can occur while creating those new work items.
Next, check that agents really reach the running state. This can be checked in Coin’s web interface under Agents. If Coin has problems communicating with OpenNebula, or there are issues cloning virtual machines, the agents will never reach the running state. The cause could also be any kind of problem setting the machine up before provisioning is launched.
Then we wait a while for post build events, such as uploading logs. Possible errors here don’t appear until builds have finished and logs are being uploaded. This combined with other post build events can be checked by looking at the end result of builds. If builds start appearing green in the web interface, most likely everything is OK. If we start getting red builds you should check what’s going on. Compile and test failures are still OK regarding the CI (unless you see every build failing on same errors that shouldn’t be caused by the commits being built), but system failures are something one should pick up on.
Also don’t expect the builds to finish completely in seconds. Unless you manually trigger a build that has already been dealt with, you should expect some time spent on compiling and perhaps testing.
The Daily monitoring routines are a good checklist to also go through at this point. That will cover the checking of web services and Gerrit syncs. Check them a few times over several hours to find “hung” states; these may show that Gerrit is out of sync.
Daily monitoring
First, log on to the Coin master machine with your user credentials using ssh:
ssh <username>@coin
Now you can grep for errors and warnings with the following command:
journalctl --user -u coin -S "1h ago" | grep -iE "traceback|warning|error|critical"
You can also tail the log in real-time:
journalctl --user -u coin -f
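When the grep output gets long, a per-keyword tally can make trends easier to spot. The following sketch applies the same case-insensitive keywords to a list of log lines; the function name is made up for illustration, and how the lines are obtained (for example by capturing the journalctl output) is left to the caller.

```python
import re
from collections import Counter

# Same keywords as the grep -iE pattern above.
PROBLEM_PATTERN = re.compile(r"traceback|warning|error|critical", re.IGNORECASE)


def count_problems(lines):
    """Count log lines per keyword, using the first keyword found in a line."""
    counts = Counter()
    for line in lines:
        match = PROBLEM_PATTERN.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts
```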
Any errors printed should be addressed by creating a bug report in JIRA that includes as much information and log output about the error as possible. The ticket should include:
Summary of the conditions that caused the issue
Link to build log
Link to integration which failed
https://bugreports.qt.io/projects/COIN
If you start seeing multiple error logs, it’s time to notify the CI team via the IRC #qtci channel or in person. Pay attention to the timestamps in front of the log messages, since old logs will also be printed.
You can use Grafana to monitor Coin test results to detect flakiness in tests:
https://testresults.qt.io/grafana/dashboard/db/overview
Check that there is enough disk space on the file system:
df /
If there is not enough space, it might be that the garbage collector is not freeing enough space or that individual users are using too much disk space.
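The same check can be scripted, for example to run it regularly. This is a minimal sketch; the 10% threshold is an illustrative value and not one taken from Coin or its garbage collector.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the file system at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path="/", minimum_free=0.10):
    # minimum_free is an illustrative threshold, not a value taken from Coin.
    return free_fraction(path) < minimum_free
```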
Check the number of agents in agentpool:
and check if there are many (hundreds of) agents in “waiting” state with none or only a handful in “running” state. Such situations usually indicate an error state that should also be noticeable in the journald logs. If there are no errors in coin.log, check if the current builds were all freshly started. If not, contact the CI team.
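The pattern to look for can be written down as a simple rule of thumb. In this sketch the input format (a plain list of state names) and both thresholds are assumptions chosen to stand in for “hundreds” and “a handful”; Coin does not necessarily expose agent states in this form.

```python
from collections import Counter


def agents_look_stuck(agent_states, min_waiting=100, max_running=5):
    """Flag the pattern described above: many waiting agents, few running.

    agent_states is an iterable of state names such as "waiting" or "running".
    Both thresholds are illustrative guesses, not values used by Coin.
    """
    counts = Counter(agent_states)
    return counts["waiting"] >= min_waiting and counts["running"] <= max_running
```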
Check that all the OpenNebula servers are up by logging in to OpenNebula as the “oneadmin” user. Then go to:
http://one-master.ci.qt.io/#hosts-tab
and check that all the servers show “ON” in the “Status” column.
Check that web servers are running:
https://testresults.qt.io/coin/
https://ci-files01-hki.ci.qt.io:8081/coin/
Also check that, while they are running, they show the latest build results. If the results don’t appear up to date, we have a problem somewhere.
Check out Gerrit’s status and match it with Coin’s status. They should converge.
https://codereview.qt-project.org/#/q/status:integrating,n,z
https://codereview.qt-project.org/#/q/status:staged,n,z
If these steps all seem OK, grab a cup of coffee and try again :)
Troubleshooting
Some uncommon administration tasks may be required, especially during development.
Debugging with gdb
Deadlocks and hung threads can be debugged by installing the Python debug interpreter and recreating the pipenv environment with it:
sudo apt install python3-dbg
pipenv --rm
pipenv --python=/usr/bin/python3-dbg
Attach gdb to the running Python process (12345 stands for its process ID):
gdb python3-dbg -p 12345
Useful commands
py-list    Show current code in execution
py-bt      Inspect python callstack
If you get a (no debugging symbols found) message, you can load the Python gdb helpers manually:
(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.6-gdb.py
Running garbage collection
Normally gc is executed automatically, keeping a safe threshold of free disk space, so it does not need to be run manually. Nevertheless, it is possible to force gc by running:
bin/coin gc
Artifacts invalidation
It might be necessary to invalidate all artifacts, for example if the Coin code changes in an incompatible way. The solution is relatively easy: stop the whole system and give the buildKey() function in storage.py a new revision number (grep for “COIN revision”). After the restart everything will work as it does after a storage cleanup. Old binaries may be removed manually or by garbage collection.
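Why bumping one number invalidates everything can be shown with a small sketch. This is not the code in storage.py: the function, its inputs and the hashing scheme are hypothetical. It only illustrates the general idea that a revision number mixed into every cache key changes all keys at once.

```python
import hashlib

# Hypothetical stand-in for the "COIN revision" number mentioned above.
COIN_REVISION = 1


def build_key(inputs, revision=COIN_REVISION):
    """Derive a cache key from build inputs plus a global revision number."""
    digest = hashlib.sha256()
    digest.update(f"COIN revision {revision}\n".encode("utf-8"))
    for item in inputs:
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With the same inputs, revision 1 and revision 2 produce different keys, so artifacts stored under the old keys are simply never looked up again and can be cleaned up later.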