Dynamic Scheduling of Prowjobs

How the Prowjob scheduler and the Prowjob dispatcher work together to provide dynamic scheduling of prowjobs

Dynamic scheduling was introduced in response to frequent cluster failures observed with a growing cluster fleet. The goal is to enable the rescheduling of jobs by taking into account Prometheus data (job volumes over the last two weeks) and available clusters. Dynamic scheduling consists of two components:

  • A reworked Prowjob dispatcher, running as a daemon
  • An external plugin to the scheduler (upstream component)

Prowjob Dispatcher

The dispatcher uses two files:

  1. Dispatcher Configuration File: the pre-existing dispatcher config file, which the dispatcher edits during rescheduling events. This file has long been in use and contains job-to-cluster assignments and scheduling mechanisms (such as manual job assignments and colocation); a sketch of its shape follows this list.

  2. Cluster Configuration File: the second config file and the more critical one for scheduling; it contains information about the clusters enabled in the system.
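
The first file's shape is illustrated by the sketch below. The field names and values are assumptions for illustration only; the authoritative schema is the dispatcher configuration file itself in openshift/release.

default: build01
groups:
  build02:
    jobs:
      - periodic-ci-example-org-example-repo-main-e2e
    paths:
      - .*example-org/example-repo/.*presubmits.yaml$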

The dispatcher reacts to changes in the cluster configuration, loads the first configuration file, and performs the scheduling work. It also retains its previous behavior of rescheduling jobs every Sunday based on recent Prometheus data.

The dispatcher maintains its own copy of the job-to-cluster assignments. After successful scheduling, it opens a PR (pull request) against openshift/release, which should be merged by the triage role as soon as possible. The PR keeps file-based scheduling up to date as a backup for dynamic scheduling and also carries any other necessary config changes.

The dispatcher functions as a REST server that responds to requests containing job names. If a matching job is found in the database, the dispatcher returns the cluster assignment. The job-to-cluster assignment data is also stored in a Persistent Volume Claim (PVC) to prevent data loss in case of pod failures.
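
As an illustration, a cluster assignment could be queried with curl; the endpoint path, query parameter, and response shape below are assumptions rather than the documented API (only the port matches the dispatcher example later on this page):

$ curl "http://localhost:8080/?job=periodic-ci-example-org-example-repo-main-e2e"
{"cluster":"build05"}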

Cluster Configuration File

aws:
  - name: build01
  - name: build03
  - name: build05
    capabilities:
      - vpn
  - name: build09
  - name: build10
    capabilities:
      - arm64
    capacity: 20
gcp:
  - name: build02
  - name: build04
    blocked: true

The cluster configuration file supports several additional configuration fields:

  • blocked: true completely excludes the cluster from scheduling. An alternative is to remove the cluster from the configuration entirely; in that case, manual assignments to the cluster are still respected.
  • capacity is a number from 1 to 100 (default: 100) that indicates the desired capacity percentage for the cluster’s load. The algorithm takes this value into account but may not always strictly adhere to it because of other factors, such as manual assignments and overall job distribution. In practice, values between 1 and 25 significantly reduce the cluster’s load, values between 26 and 50 reduce it slightly, and anything above 50 does not guarantee a noticeable decrease in load.
  • capabilities is a list of capabilities assigned to the cluster. Some jobs may use these capabilities to be scheduled on the appropriate cluster.
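
Putting these fields together, an annotated version of the example above might look like this (the comments are explanatory only):

aws:
  - name: build10
    capacity: 20          # aim for roughly 20% of this cluster's usual load
    capabilities:
      - arm64             # jobs that require the arm64 capability can be scheduled here
gcp:
  - name: build04
    blocked: true         # excluded from scheduling entirely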

External Plugin for the Scheduler

The external plugin queries a specified URL (in the case of the Test Platform deployment, this is the dispatcher) to determine the cluster assignment for a given job name. The plugin includes a configurable cache to avoid querying the same data repeatedly within a short period.

Troubleshooting

I am triage, I want to merge a PR created by the dispatcher, but tests are failing. What should I do?

If the failing tests are related to cluster incompatibility, the sanitizer may be out of sync with the dispatcher. Since merging the PR takes priority, the failing tests should be overridden and the issue reported in Jira.
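
The failing contexts can be overridden with the standard Prow /override command in a PR comment; replace the placeholder below with the actual failing context name:

/override ci/prow/<failing-context-name>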

I want to force the dispatcher to schedule a job on a specific cluster. I submitted a PR with the change, and it was merged, but the dispatcher is not respecting it. How can I achieve that?

At this time, it’s not possible without using hacks. The dispatcher’s database takes priority over assignments in config files, which are considered a backup. To reschedule jobs, a PR changing the cluster configuration should be submitted. If the cluster is a special cluster (e.g., app.ci), the dispatcher will respect that. If the cluster is manually assigned, the dispatcher may respect it, provided the cluster is not blocked.

I want to merge a PR created by the dispatcher, but merging is blocked due to a conflict in openshift/release. What should I do?

Restart the dispatcher pod. This will trigger the dispatcher to update the PR on top of the latest openshift/release main branch.
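
For example, assuming the dispatcher runs in the ci namespace (as in the commands further below), deleting the pod lets its deployment recreate it and refresh the PR:

$ oc get pods -n ci | grep dispatcher
prowjob-dispatcher-574d7744f9-j2kll                                         2/2     Running                  1 (25h ago)      4d2h
$ oc delete pod -n ci prowjob-dispatcher-574d7744f9-j2kll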

How can I re-trigger a dispatching event without causing an outage?

Try executing the following commands:

$ oc get pods -n ci | grep dispatcher
prowjob-dispatcher-574d7744f9-j2kll                                         2/2     Running                  1 (25h ago)      4d2h
$ oc exec -n ci -it prowjob-dispatcher-574d7744f9-j2kll -c prowjob-dispatcher -- /bin/sh

After logging into the pod’s container, execute the following curl command, which will force a dispatch event:

$ curl -X GET "http://localhost:8080/event?dispatch=true"