Aggregated Disruption
This section describes how disruption test results are aggregated in aggregated prow jobs, how to read the results, and how to disable the aggregation when necessary.
There are two types of disruption tests:
- The first type consists of these 18 tests defined in the openshift/origin repo:
```
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
[sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
[sig-api-machinery] disruption/oauth-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test
[sig-network-edge] ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new should be available throughout the test
[sig-network-edge] ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/reused should be available throughout the test
[sig-network-edge] ns/openshift-console route/console disruption/ingress-to-console connection/new should be available throughout the test
[sig-network-edge] ns/openshift-console route/console disruption/ingress-to-console connection/reused should be available throughout the test
[sig-api-machinery] disruption/cache-kube-api connection/new should be available throughout the test
[sig-api-machinery] disruption/cache-kube-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
[sig-api-machinery] disruption/cache-oauth-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test
[sig-api-machinery] disruption/cache-openshift-api connection/reused should be available throughout the test
[sig-trt] disruption/ci-cluster-network-liveness connection/new should be available throughout the test
[sig-trt] disruption/ci-cluster-network-liveness connection/reused should be available throughout the test
```
For these tests, the results are aggregated in the same way as other junit tests.
- The second type consists of tests like the examples below, defined in the openshift/ci-tools repo:
```
cache-kube-api-new-connections disruption P70 should not be worse
cache-kube-api-new-connections disruption P85 should not be worse
cache-openshift-api-new-connections disruption P95 should not be worse
cache-kube-api-new-connections mean disruption should be less than historical plus five standard deviations
cache-kube-api-new-connections zero-disruption should not be worse
```
The aggregation is a little more involved for this type. For these tests, we collect disruption numbers
from all job runs, calculate disruption thresholds from the collected data, and then compare each job run
against those thresholds. Each comparison yields a status (success or failure) for the job run, and those
statuses are then aggregated in the same way as other junit tests.
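To make that flow concrete, here is a minimal sketch of the per-backend comparison loop in Go (the language of the other examples in this section). Every identifier in it is a hypothetical illustration, not the actual openshift/ci-tools API:
```go
// Hypothetical sketch of the per-backend flow: compare every job
// run's disruption against a historically derived threshold, then
// feed the pass/fail statuses into the usual junit aggregation.
package main

import "fmt"

// evaluateBackend returns a pass/fail status per job run ID, given
// the total seconds of disruption observed in each run.
func evaluateBackend(disruptionByRun map[string]float64, threshold float64) map[string]bool {
	statuses := make(map[string]bool, len(disruptionByRun))
	for runID, seconds := range disruptionByRun {
		statuses[runID] = seconds <= threshold
	}
	return statuses
}

func main() {
	// Example values taken from the sample output later in this
	// section: a P70 threshold of 6.00s and two of the failing runs.
	statuses := evaluateBackend(map[string]float64{
		"1638769364150784000": 7,
		"1638769365845282816": 24,
	}, 6.0)
	passed := 0
	for _, ok := range statuses {
		if ok {
			passed++
		}
	}
	// The count of passing runs is then tested (via Fisher's Exact
	// Probability Test) against the historical pass rate to decide
	// whether the aggregated junit test passes.
	fmt.Printf("%d of %d runs passed\n", passed, len(statuses))
}
```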
There are three types of comparisons:
- Mean disruption check: a disruption threshold is calculated as the historical mean plus 5 times the
standard deviation from historical data (between 10 days ago and 3 days ago) collected from prow jobs
as described in Architecture Data Flow. If the total
disruption seconds in a prow job exceed this disruption threshold, the test for that backend fails for that job run.
Based on Fisher’s Exact Probability Test, a certain number of job runs have to pass for the test to be considered passing
for the aggregated job that the prow job is part of.
- Percentile distribution check: this is similar to the mean disruption check, except that the disruption
threshold is based on one of three percentiles: P70, P85, or P95. There is one junit test each for P70,
P85, and P95 for each backend being tested.
- Zero-disruption check: this is similar to the percentile distribution check, except that we use the maximum historical
percentile that has zero seconds of disruption. A sketch of all three threshold calculations follows this list.
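The following is a rough illustration of the three threshold calculations, assuming historical data is available as a slice of per-run disruption seconds; it is a sketch, not the actual ci-tools implementation, which differs in detail:
```go
// Illustrative versions of the three threshold calculations.
// Assumes samples is non-empty.
package main

import (
	"fmt"
	"math"
	"sort"
)

// meanThreshold: historical mean plus five standard deviations.
func meanThreshold(samples []float64) float64 {
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / float64(len(samples))
	var sumSq float64
	for _, s := range samples {
		sumSq += (s - mean) * (s - mean)
	}
	return mean + 5*math.Sqrt(sumSq/float64(len(samples)))
}

// percentileThreshold: the p-th percentile (70, 85, or 95) of the
// historical samples, using the nearest-rank method.
func percentileThreshold(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// maxZeroPercentile: the maximum historical percentile that still has
// zero seconds of disruption, i.e. the share of historical runs that
// saw no disruption at all. The zero-disruption check then behaves
// like a percentile check at this percentile.
func maxZeroPercentile(samples []float64) float64 {
	zeros := 0
	for _, s := range samples {
		if s == 0 {
			zeros++
		}
	}
	return 100 * float64(zeros) / float64(len(samples))
}

func main() {
	historical := []float64{0, 0, 0, 1, 2, 4, 6, 9, 15, 30}
	fmt.Println("mean+5sigma:", meanThreshold(historical))
	fmt.Println("P70:", percentileThreshold(historical, 70))
	fmt.Println("max zero percentile:", maxZeroPercentile(historical))
}
```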
The comparisons are run for each of these backends:
- cache-kube-api (new connections)
- cache-kube-api (reused connections)
- cache-oauth-api (new connections)
- cache-oauth-api (reused connections)
- cache-openshift-api (new connections)
- cache-openshift-api (reused connections)
- kube-api (new connections)
- kube-api (reused connections)
- oauth-api (new connections)
- oauth-api (reused connections)
- openshift-api (new connections)
- openshift-api (reused connections)
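Each backend/connection pair gets all five checks, so the junit tests form a backend × connection × check matrix. A hypothetical sketch that enumerates the resulting test names, following the name formats shown earlier (illustrative only, not ci-tools code):
```go
// Enumerate the hypothetical backend x connection x check matrix of
// junit test names used by the second type of disruption test.
package main

import "fmt"

func main() {
	backends := []string{
		"cache-kube-api", "cache-oauth-api", "cache-openshift-api",
		"kube-api", "oauth-api", "openshift-api",
	}
	checks := []string{
		"disruption P70 should not be worse",
		"disruption P85 should not be worse",
		"disruption P95 should not be worse",
		"mean disruption should be less than historical plus five standard deviations",
		"zero-disruption should not be worse",
	}
	for _, backend := range backends {
		for _, conn := range []string{"new", "reused"} {
			for _, check := range checks {
				fmt.Printf("%s-%s-connections %s\n", backend, conn, check)
			}
		}
	}
}
```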
Observing the aggregated disruption results
On the prow spyglass page for an aggregated job, click on “aggregate-testrun-summary.html” and open the accordion.
In this sample output
```
4. Failed: suite=[BackendDisruption], cache-oauth-api-new-connections disruption P70 should not be worse
Zero successful runs, we require at least one success to pass (P70=6.00s failures=[1638769365845282816=24s 1638769363324506112=23s
1638769361634201600=20s 1638769360858255360=33s 1638769359960674304=19s 1638769362481451008=11s 1638769364150784000=7s 1638769364989644800=9s 1638769359117619200=16s])
1. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769365845282816
2. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769363324506112
3. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769361634201600
4. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769360858255360
5. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769359960674304
6. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769362481451008
7. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769364150784000
8. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769364989644800
9. Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1638769359117619200
```
you can see that the “suite=[BackendDisruption], cache-oauth-api-new-connections disruption P70 should not be worse”
test failed. The P70 threshold is 6 seconds, and the prow job run IDs follow along with the amount of disruption each run
recorded. This lets you see how close the underlying job runs came to passing (in the
example above, the values are well over the threshold except for one run, which missed it by only 1 second: 7s against 6.00s).
Disabling disruption for a specific set of tests
In the past, disruption testing has uncovered some very subtle and difficult-to-find defects. In those cases,
it may take several days to find the root cause of the problem. In a situation like this, it may make sense to
write a bug, mark it as “blocking”, and get Accepted payloads flowing again in an effort to avoid blocking development.
In this case, you can temporarily turn off disruption testing for certain aggregated jobs.
To do this, make a modification similar to the following code (as was done in a past PR):
```go
// Temporarily skip all azure disruption aggregation due to https://issues.redhat.com/browse/TRT-889
if strings.Contains(o.jobName, "azure") {
	status = testCaseSkipped
}
```
The idea is to set the resulting status
of the aggregation to “skipped”. In the example above, the code was added to
skip disruption aggregation for any job whose name contains the string “azure”. A similar approach could
be used to skip disruption for other types of jobs, as in the hypothetical sketch below.
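For instance, the guard could be extended to cover several platforms at once. This fragment assumes the same surrounding context as the snippet above (o.jobName, status, and testCaseSkipped), and the platform list is purely illustrative:
```go
// Hypothetical: temporarily skip disruption aggregation for several
// platforms while blocking bugs are investigated.
for _, platform := range []string{"azure", "metal"} {
	if strings.Contains(o.jobName, platform) {
		status = testCaseSkipped
		break
	}
}
```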
NOTE
Exercise caution when doing this, because you are removing a key quality signal (in aggregated jobs), which
means that other disruption-related defects could be introduced without you knowing. Ensure there is a plan to
re-enable the aggregated disruption testing as soon as possible, with the understanding that once you remove
the skip, you may uncover another defect.
Disabling aggregated disruption
Aggregated disruption tests can be disabled altogether like this:
```go
failedJobRunIDs, successfulJobRunIDs, _, message, err := disruptionCheckFn(ctx, jobRunIDToAvailabilityResultForBackend, backendName)
...
// Disable aggregated disruption testing.
status := testCaseSkipped
```
The idea is to set the resulting status
to “skipped”. The need to disable aggregated disruption entirely should be extremely rare; prefer
limiting the skip to a certain set of tests or jobs, as described in the previous section.