Creating a Cluster Pool
ci-operator allows CI jobs that need an OCP cluster for testing to claim a pre-installed one from so-called “cluster pools” using Hive. This document describes how to set up custom pools backed by cloud platform accounts owned by users or teams that want their jobs to use clusters provisioned with those accounts.
Prerequisites
- Verify that the cloud platform you want to use is supported by Hive
- Configure the cloud account: Hive does not require any more configuration of the cloud account than installing an OpenShift cluster does (including a hosted DNS zone for the base domain the pool's clusters will use). Make sure that the account has enough quota for the pools to contain the desired number of clusters.
Instructions
The OpenShift CI Hive instance is deployed on a dedicated OpenShift cluster called hosted-mgmt, and all resources involved in creating and running a custom cluster pool need to be created there. This is done via GitOps in the openshift/release repository.
Prepare Your Cloud Platform Credentials
First, you need to make sure the cloud platform credentials that will be used to install clusters for the pool are available on the hosted-mgmt cluster. If you are not familiar with OpenShift CI custom secret management, please consult the Adding a New Secret to CI document first.
- Select a suitable collection in Vault to hold your cluster pool secret. Alternatively, create a new suitable collection in collection self-service.
- In the selected collection, create a secret with the necessary keys and values. The specific keys depend on the cloud platform; consult the Hive Cloud Credentials document.
- Set the secretsync/target-clusters key to hosted-mgmt to make sure your credentials are synced to the necessary cluster.
- Set the secretsync/target-namespace key to the name of the namespace that will hold your pools (${team}-cluster-pool is a good baseline name).
- Set the secretsync/target-name key to the name under which the secret will be accessible in the cluster ($platform-credentials is a good baseline name).
At the end, you should have a secret similar to the following in Vault:
selfservice/dptp-demo-collection/dptp-demo-pool-credentials:
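The exact credential keys depend on the platform (see the Hive Cloud Credentials document); the following is a minimal sketch assuming an AWS-backed pool, with illustrative target names:

```yaml
# Sketch of the keys stored under the Vault secret shown above (AWS example).
aws_access_key_id: <REDACTED>
aws_secret_access_key: <REDACTED>
# secretsync metadata controlling where the secret is mirrored:
secretsync/target-clusters: hosted-mgmt
secretsync/target-namespace: dptp-demo-cluster-pool
secretsync/target-name: aws-credentials
```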
Create a Directory for Your Manifests
In the openshift/release repository:
- Create a folder in the clusters/hosted-mgmt/hive/pools directory that will contain the manifests of all your resources (see openshift-ci as an example).
- Create an OWNERS file in the directory to allow your teammates to make and approve changes.
- Create a manifest for the namespace that will hold your Hive resources (the namespace name must match the one where you instructed Vault to sync your secret) and set up RBAC for the pool owners to debug on the hosted-mgmt cluster:
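A minimal sketch of such a manifest, assuming the illustrative namespace from the earlier examples and a hypothetical dptp-demo-team group as the pool owners:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dptp-demo-cluster-pool
---
# Give the pool owners admin access to the namespace so they can inspect
# ClusterPool and ClusterClaim resources on the hosted-mgmt cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dptp-demo-pool-admins
  namespace: dptp-demo-cluster-pool
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: dptp-demo-team
```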
Create a Manifest for Your Cluster Pool
Create a manifest for your cluster pool. The ClusterPool resource specification is available in Hive's documentation; consult that document for more information about individual fields.
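A minimal sketch of a ClusterPool manifest for a hypothetical AWS-backed pool; the names, labels, base domain, region, image set reference, and sizes are all illustrative and should be adapted to your setup:

```yaml
apiVersion: hive.openshift.io/v1
kind: ClusterPool
metadata:
  name: dptp-demo-ocp-4-15-amd64-aws
  namespace: dptp-demo-cluster-pool
  labels:
    # Jobs select a pool by matching these labels in their cluster_claim stanza.
    architecture: amd64
    cloud: aws
    owner: dptp-demo
    product: ocp
    version: "4.15"
spec:
  baseDomain: dptp-demo.example.com    # domain with a hosted zone in the cloud account
  imageSetRef:
    name: ocp-release-4.15.0-x86-64    # an existing ClusterImageSet
  installConfigSecretTemplateRef:
    name: install-config-template      # see the install config template section below
  platform:
    aws:
      credentialsSecretRef:
        name: aws-credentials          # the secret synced from Vault
      region: us-east-1
  pullSecretRef:
    name: pull-secret                  # populated by OpenShift CI; keep this name
  size: 2
  maxSize: 4
```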
Pay attention to the following stanzas:
- metadata.labels: These labels will be used by jobs to specify what cluster they want to obtain for their testing.
- spec.baseDomain: A base domain for all clusters created in this pool. This is the domain for which you created a hosted zone when configuring the cloud platform account (see prerequisites).
- spec.imageSetRef: A reference to a ClusterImageSet in the cluster that determines the exact version of the clusters created in the pool. ClusterImageSets are cluster-scoped resources and their manifests are present in the clusters/hosted-mgmt/hive/pools directory. Either select one of those already available or create a new one. DPTP maintains a set of ClusterImageSets that are regularly bumped to the most recent released OCP versions.
- spec.installConfigSecretTemplateRef: A reference to a Secret that serves as an installation config template. See the section below for more information.
- spec.platform.$CLOUD.credentialsSecretRef: A reference to the secret you created in the Prepare Your Cloud Platform Credentials section.
- pullSecretRef.name: Must be kept pull-secret. OpenShift CI will populate your namespace with this secret, which contains all pull secrets necessary to install an OCP cluster.
Sizing Your Cluster Pool
Hive maintains the number of clusters in the pool as specified by its size. A provisioned cluster is hibernated after staying idle for some time and is woken up when a job claims it. Once a cluster is claimed, Hive removes it from the pool and creates a new cluster to maintain the pool's size. A claimed cluster is destroyed 4 hours after it is claimed. If several jobs file claims against one ClusterPool simultaneously, Hive fulfills all of them until the number of living clusters reaches the pool's maxSize.
All live and hibernating clusters consume resources in the cloud account, so your maxSize should be set according to your cloud platform limits and quotas, taking into account other cluster pools and any other resource consumption in your cloud platform account.
Create a Manifest for Your Install Config Template Secret
This secret is referenced from the ClusterPool resource and allows customizing the clusters, such as setting the number of workers or the instance types. It is usually not necessary to keep this manifest actually secret, as it often does not contain anything sensitive.
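A minimal sketch of such a template secret, assuming an AWS pool; Hive manages the cluster-specific fields (such as the cluster name, base domain, and pull secret) when creating clusters from the pool, so the template mainly customizes things like worker counts and instance types. All values below are illustrative:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: install-config-template
  namespace: dptp-demo-cluster-pool
type: Opaque
stringData:
  install-config.yaml: |
    apiVersion: v1
    metadata:
      name: placeholder        # per-cluster names are managed by Hive
    baseDomain: placeholder    # taken from the ClusterPool's spec.baseDomain
    controlPlane:
      name: master
      replicas: 3
      platform:
        aws:
          type: m5.xlarge
    compute:
    - name: worker
      replicas: 3
      platform:
        aws:
          type: m5.xlarge
    platform:
      aws:
        region: us-east-1
    pullSecret: ""             # injected from the pull-secret referenced by the pool
```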
Submit a PR to openshift/release
Submit a PR to openshift/release with all your new manifests. After the PR merges, the manifests are applied to the hosted-mgmt cluster and Hive starts installing clusters for your pool.
Use the Cluster Pool from a CI Job
After the pool manifests are applied on the hosted-mgmt cluster, the cluster pool can be used by CI jobs by setting a cluster_claim stanza with values matching the labels on the pool:
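For illustration, a sketch of a ci-operator test claiming a cluster from the hypothetical pool above; the cluster_claim values must match the pool's labels, and the test step is only a placeholder:

```yaml
tests:
- as: e2e-pool-demo
  cluster_claim:
    architecture: amd64
    cloud: aws
    owner: dptp-demo
    product: ocp
    timeout: 1h0m0s
    version: "4.15"
  steps:
    test:
    - as: verify
      cli: latest                     # makes the oc binary available in the step
      commands: oc get clusterversion
      from: src
      resources:
        requests:
          cpu: 100m
```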
For more details about tests that run on claimed clusters, see the testing with a cluster from a cluster pool document.
Troubleshooting Cluster Pools
See the upstream documentation for general troubleshooting information for ClusterPools.
For information specific to ClusterPools in OpenShift CI, read on.
Accessing Cluster Installation Logs
The cluster pools are maintained by Hive behind the scenes, so installation failures, cloud platform account misconfigurations and similar issues are not exposed to actual CI jobs: the jobs will simply never successfully claim a cluster if Hive fails to install them.
The installation logs can be found in the hive
container logs with the following commands:
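A sketch of how the logs can be located, assuming a kubeconfig context named hosted-mgmt; the ClusterDeployment namespace and provision pod names will differ for each cluster:

```console
# List the ClusterDeployments created for your pool; each lives in its own namespace.
$ oc --context hosted-mgmt get clusterdeployments --all-namespaces

# Find the provision pod for a failing ClusterDeployment ...
$ oc --context hosted-mgmt -n <clusterdeployment-namespace> get pods

# ... and read the installer output from its hive container.
$ oc --context hosted-mgmt -n <clusterdeployment-namespace> logs <provision-pod> -c hive
```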
Rotating Cloud Credentials
If you need to rotate cloud credentials, the best practice is:
- Scale down the size of each pool using the credentials to 0 and wait for all ClusterDeployments from those pools to go away.
- Rotate the credentials on the cloud platform and modify the secret containing the credentials.
- Scale the pools back up to their original sizes.
If the secret has already been updated with the new credentials, the old ones are no longer valid, and some clusters were provisioned with the old credentials, Hive is not able to deprovision those clusters and they keep counting against the pool size. To fix this, the pool owner has to manually update the cloud credentials secret in the namespace created for each of those clusters.
Configuring Install Attempts Limit
Hive keeps retrying to provision a new cluster until it succeeds. As a result, the retries could go on indefinitely, for example, if there is a bug in the OpenShift installer. The ongoing ClusterDeployments are counted in the size of the underlying cluster pool, which could lead to a situation where all ClusterDeployments are perma-failing and thus no tests get a cluster via a claim.
By configuring ClusterPool.spec.installAttemptsLimit, Hive stops retrying after the limit is reached and deletes the failed ClusterDeployment; a new ClusterDeployment will be created to satisfy the pool's size.
A side effect of this configuration is that the installation logs are deleted together with the failed ClusterDeployment, even though they are useful, e.g., for filing installer bugs.
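For example, a fragment of the ClusterPool spec with an illustrative limit:

```yaml
spec:
  # Give up on a failing ClusterDeployment after 3 installation attempts
  # and let Hive replace it with a fresh one.
  installAttemptsLimit: 3
```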
Configuring Timeout on Awakening a Hibernating Cluster
By configuring ClusterPool.spec.hibernationConfig.resumeTimeout, Hive stops waiting for a hibernating cluster to wake up after the specified time, considers it broken, and replaces it. If not set, Hive will keep waiting until the wake-up succeeds.
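For example, a fragment of the ClusterPool spec with an illustrative timeout:

```yaml
spec:
  hibernationConfig:
    # Consider a hibernating cluster broken and replace it if it does not
    # resume within 20 minutes.
    resumeTimeout: 20m
```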
Renaming a Cluster Pool
After the PR that renames a cluster pool gets merged, the new pool will be created while the old one still exists on the cluster. Because the automation that applies the manifests via GitOps does not support removal, please contact us if you need to delete a cluster pool that is no longer needed.
To avoid hitting resource limits on the cloud, it is suggested to downscale the cluster pool to zero in a separate PR before renaming it.
Existing Cluster Pools
The following table shows the existing cluster pools that a user can claim a cluster from. Each pool defines a set of characteristics of the clusters that are provisioned out of it. For instance, the cluster pool ci-ocp-4-6-amd64-aws-us-east-1 is composed of OCP 4.6 clusters in AWS’s us-east-1 region. The values of READY, STANDBY, SIZE and MAX SIZE are taken from the status and the specification of each pool. Clicking a row shows more details of the cluster pool, such as the release image that is used for provisioning a cluster and the labels defined on the pool. The Search box can filter the pools according to a given keyword.
The cluster pools owned by openshift-ci
are for OpenShift workloads, maintained by DPTP,
and they can be used by any tests in the openshift
org. Pools with different owners should be used only with
knowledge and approval of their owner. This is not currently programmatically enforced, but it will be soon.
| NAMESPACE | NAME | READY | STANDBY | SIZE | MAX SIZE | IMAGE SET | OWNER |
|---|---|---|---|---|---|---|---|
(Table rows are generated from the live state of the pools on the hosted-mgmt cluster.)
Info
The cluster pools in namespace fake-cluster-pool are for DPTP’s internal usage, such as e2e tests for the pool feature of ci-operator. For example, the claims against the pool fake-ocp-4-7-amd64-aws are annotated with hive.openshift.io/fake-cluster: "true", which tells Hive to return a syntactically correct kubeconfig right away without provisioning any cluster.