Architecture Data Flow

A high-level look at how historical disruption data is gathered and updated.

Resources

Disruption Data Architecture

High Level Diagram

(Diagram: e2e tests run on the e2e test cluster and store their results as job artifacts in GCP buckets. The Job Run Uploader and Disruption Uploader CronJobs on the DPCR cluster fetch the list of jobs to gather from the Jobs table, fetch the jUnit results for those job runs, and update the Disruption table in the openshift-ci-data-analysis BigQuery project. The Historical Data Analyzer (Job Analyzer CronJobs on an infra cluster) queries the JSON results from BigQuery, compares them with the current data, and commits query_results.json to GitHub in openshift/origin.)

How The Data Flows

  1. To initially set up disruption data collection, the command ./job-run-aggregator create-tables --google-service-account-credential-file <credJsonFile> is run to create the Jobs, JobRuns, and TestRuns tables in BigQuery. That command is idempotent; it can be run at any time, regardless of whether the tables already exist, and it is part of the job-table-updater CronJob. Each of the “Uploader” CronJobs used in disruption data collection (alert-uploader, disruption-uploader, job-run-uploader, and job-table-updater) requires the Jobs table to exist. (A query for verifying the tables exist is sketched after this list.)

    The Jobs, JobRuns, and TestRuns tables will already exist, so no one should have to run that command unless the Jobs table needs to be deleted and re-created. This is rare and only happens when we need to correct something in the Jobs table (because BigQuery does not allow updates to tables). The JobRuns and TestRuns tables should generally be preserved because they contain historical disruption data.

    If someone ever has to delete the Jobs table, delete it right before the job-table-updater CronJob triggers. This way, the Jobs table will immediately be re-created for you.

  2. The Disruption Uploader is a CronJob that runs every 4 hours. All of the Uploader jobs (disruption-uploader, alert-uploader, job-run-uploader, and job-table-updater) run in the DPCR cluster in the dpcr-ci-job-aggregation namespace. The current configuration can be found in the openshift/continuous-release-jobs private repo under config/clusters/dpcr/services/dpcr-ci-job-aggregation.

  3. When the e2e tests finish, the results are uploaded to GCS and can be viewed in the artifacts folder for a particular job run.

    Clicking the artifacts link at the top right of a Prow job and navigating to the openshift-e2e-test folder will show you the disruption results (e.g., .../openshift-e2e-test/artifacts/junit/backend-disruption_[0-9]+-[0-9]+.json).

  4. We only pull disruption data for job names specified in the Jobs table in BigQuery (see Job Primer for more information on this process). An example query against the Jobs table is sketched after this list.

  5. The disruption uploader parses the results from the e2e run backend-disruption JSON files and pushes them to the BackendDisruption table in the openshift-ci-data-analysis BigQuery project (see the example query after this list).

  6. We currently run a periodic disruption data analyzer job in the app.ci cluster. It gathers the recent disruption data and commits the results back to openshift/origin. The PR it generates also includes a report showing the differences between the previous and current disruption values in a table format (example PR).

    Note: the read-only BigQuery secret used by this job is stored in Vault using the process described in this HowTo.

  7. The static query_results.json in openshift/origin is then used by the matchers that the samplers invoke to find the best match for a given test (typically one with “remains available using new/reused connections” or “should be nearly zero single second disruptions” in its name) and to check whether we are seeing noticeably worse disruption during the run.
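
To verify that the tables created in step 1 exist, you can list the dataset's tables through BigQuery's INFORMATION_SCHEMA. This is a minimal sketch, assuming the ci_data dataset used by the queries further below:

-- List the tables (Jobs, JobRuns, TestRuns, etc.) in the ci_data dataset
SELECT
    table_name
FROM
    `openshift-ci-data-analysis.ci_data.INFORMATION_SCHEMA.TABLES`
ORDER BY
    table_name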
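
To see which job names disruption data is gathered for (step 4), the Jobs table can be queried directly. A minimal sketch, using only columns that appear in the percentile query further below:

-- List the jobs whose runs are gathered, with their variant metadata
SELECT
    JobName,
    Release,
    FromRelease,
    Platform,
    Architecture,
    Network,
    Topology
FROM
    `openshift-ci-data-analysis.ci_data.Jobs`
ORDER BY
    JobName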
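
To spot-check what the disruption uploader has pushed recently (step 5), the raw disruption rows can be joined to their job runs. Again a sketch, reusing the tables, columns, and join condition from the percentile query further below:

-- Raw disruption rows uploaded for job runs that started in the last day
SELECT
    BackendDisruption.BackendName,
    BackendDisruption.DisruptionSeconds,
    JobRuns.JobName,
    JobRuns.StartTime
FROM
    `openshift-ci-data-analysis.ci_data.BackendDisruption` AS BackendDisruption
INNER JOIN
    `openshift-ci-data-analysis.ci_data.BackendDisruption_JobRuns` AS JobRuns ON JobRuns.Name = BackendDisruption.JobRunName
WHERE
    JobRuns.StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
    JobRuns.StartTime DESC
LIMIT 100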

How To Query The Data Manually

The process for gathering and updating disruption data is fully automated; however, if you wish to explore the BigQuery data set, below are some of the queries you can run. If you also want to run the job-run-aggregator locally, the README.md for the project provides guidance.

Once you have access to BigQuery in the openshift-ci-data-analysis project, you can run the queries below to fetch the latest disruption and alert results.

Query

SELECT
    BackendName,
    Release,
    FromRelease,
    Platform,
    Architecture,
    Network,
    Topology,
    ANY_VALUE(P95) AS P95,
    ANY_VALUE(P99) AS P99,
FROM (
    SELECT
        Jobs.Release,
        Jobs.FromRelease,
        Jobs.Platform,
        Jobs.Architecture,
        Jobs.Network,
        Jobs.Topology,
        BackendName,
        PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.95) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P95,
        PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.99) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P99,
    FROM
        openshift-ci-data-analysis.ci_data.BackendDisruption as BackendDisruption
    INNER JOIN
        openshift-ci-data-analysis.ci_data.BackendDisruption_JobRuns as JobRuns on JobRuns.Name = BackendDisruption.JobRunName
    INNER JOIN
        openshift-ci-data-analysis.ci_data.Jobs as Jobs on Jobs.JobName = JobRuns.JobName
    WHERE
        JobRuns.StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 21 DAY)
)
GROUP BY
BackendName, Release, FromRelease, Platform, Architecture, Network, Topology

Alert Query

SELECT * FROM openshift-ci-data-analysis.ci_data.Alerts_Unified_LastWeek_P95
WHERE
  AlertName = "etcdMembersDown" OR
  AlertName = "etcdGRPCRequestsSlow" OR
  AlertName = "etcdHighNumberOfFailedGRPCRequests" OR
  AlertName = "etcdMemberCommunicationSlow" OR
  AlertName = "etcdNoLeader" OR
  AlertName = "etcdHighFsyncDurations" OR
  AlertName = "etcdHighCommitDurations" OR
  AlertName = "etcdInsufficientMembers" OR
  AlertName = "etcdHighNumberOfLeaderChanges" OR
  AlertName = "KubeAPIErrorBudgetBurn" OR
  AlertName = "KubeClientErrors" OR
  AlertName = "KubePersistentVolumeErrors" OR
  AlertName = "MCDDrainError" OR
  AlertName = "PrometheusOperatorWatchErrors" OR
  AlertName = "VSphereOpenshiftNodeHealthFail"
ORDER BY
  AlertName, Release, FromRelease, Topology, Platform, Network

Downloading

Once the query is run, you can download the data locally.

BigQuery Download
