Troubleshooting Failures

A guide to tools and strategies used to diagnose prow job failures.

Problem Statement

A large amount of data is collected during a job run, and there are multiple tools to assist in analyzing it. We can’t expect ‘always passing’ tests in an eventually consistent distributed environment, so what we look for is a change in the signals generated over multiple runs. When we see a failure, we first check whether the cause is identifiable. If it isn’t clear, we review the trends for the failure and then take a deeper dive into the data collected. However, the data is not complete, and some tools are better at capturing and analyzing particular types or subsets of data than others. Additionally, with the number of jobs, job failures and individual tests, it can be hard to determine how and where to start. There is no step-by-step guide, but we hope to document the benefits of the tools we have available and approaches to help investigate payload failures.

Debug Tools

Expanding the Debug Tools panel in the prow job UI will show buttons linking to several tools, preconfigured to show data for your current job run.

  • Loki - Grafana logging stack allowing you to search logs and narrow time ranges. It contains ephemeral pod logs that are not present in the must-gather or gather-extra artifacts collected at the end of the CI run, as well as the kubelet journal logs (also found under gather-extra) and other logs to aid in searching. Loki works well when filtered to a single job run, and can also be used to search globally across CI runs if you’re careful with your search labels and time windows.

  • PromeCIeus - Metrics from the cluster during the test run loaded in Prometheus.

  • KaaS - Representation of the last known state of the cluster. Can be useful for navigating the cluster via oc or k9s and interrogating namespaces, logs, etc. (see the sketch after this list).

  • Intervals - Link to the newer Sippy UI for displaying the intervals from the job run. See below for more on intervals.
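
As a quick illustration of the KaaS option above, here is a minimal sketch that inspects the cluster with the Python kubernetes client rather than oc or k9s. The kubeconfig filename and the namespace below are placeholders for whatever you actually download from KaaS and want to look at.

# A minimal sketch, not an official workflow: inspect the KaaS-provided
# cluster state with the Python kubernetes client instead of oc/k9s.
# "kaas-kubeconfig" is a placeholder for wherever you saved the kubeconfig.
from kubernetes import client, config

config.load_kube_config(config_file="kaas-kubeconfig")
core = client.CoreV1Api()

# Print the name and phase of every pod in a namespace of interest.
for pod in core.list_namespaced_pod("openshift-etcd").items:
    print(pod.metadata.name, pod.status.phase)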

Intervals

Intervals are an important debugging tool. Generated by our monitortests framework in origin, they record and categorize “interesting” things we observed during the job run. Monitortests are initialized when openshift-tests launches, prior to actually running e2e tests. After e2e testing is complete, they are given a chance to analyze the intervals gathered so far by all other monitortests, as well as generate junit results based on that analysis.

The result is a JSON file which you can view in the job artifacts, typically at a path like /artifacts/e2e-gcp-ovn/openshift-e2e-test/artifacts/junit/e2e-events_20240621-141649.json. Two e2e-events files are common in upgrade jobs, where we run openshift-tests once to perform and test the upgrade itself, and then again to run a full conformance test suite against the resulting cluster.

Example of a specific interval:

{
    "level": "Info",
    "source": "Disruption",
    "locator": {
        "type": "Disruption",
        "keys": {
            "backend-disruption-name": "host-to-host-new-connections",
            "connection": "new",
            "disruption": "host-to-host-from-node-ci-op-6pxiizhm-9825d-lzxrn-master-2-to-node-ci-op-6pxiizhm-9825d-lzxrn-worker-c-scvf8-endpoint-10.0.128.3"
        }
    },
    "message": {
        "reason": "DisruptionEnded",
        "cause": "",
        "humanMessage": "backend-disruption-name/host-to-host-new-connections connection/new disruption/host-to-host-from-node-ci-op-6pxiizhm-9825d-lzxrn-master-2-to-node-ci-op-6pxiizhm-9825d-lzxrn-worker-c-scvf8-endpoint-10.0.128.3 started responding to GET requests over new connections",
        "annotations": {
            "reason": "DisruptionEnded"
        }
    },
    "from": "2024-05-17T14:24:04Z",
    "to": "2024-05-17T14:25:23Z"
},
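
If you want to slice an e2e-events file locally rather than rely on the charts, a short script works well. The following is a minimal sketch that assumes the file wraps interval objects like the one above in a top-level "items" array; adjust the field access if your copy is structured differently.

# A minimal sketch for filtering an e2e-events file locally. It assumes the
# file wraps interval objects like the one above in a top-level "items"
# array; adjust if your copy is structured differently.
import json

with open("e2e-events_20240621-141649.json") as f:
    intervals = json.load(f).get("items", [])

# Show every non-Info Disruption interval with its time range and message.
for interval in intervals:
    if interval.get("source") == "Disruption" and interval.get("level") != "Info":
        message = interval.get("message", {})
        print(interval.get("from"), interval.get("to"), message.get("humanMessage"))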

Intervals are then charted on a timeline which you will see on the main prow job UI by expanding the “Intervals - spyglass” panel. This is a powerful tool for understanding what happened in a cluster when a problem occurred.

NOTE: we are gradually phasing out these spyglass-embedded intervals charts in favor of a Sippy UI for displaying them. This will improve prow page load times and allow us to improve the UI retroactively for old job runs, as well as share links directly to filtered views in Slack and Jira. The Sippy UI is available now under Debug Tools, but is not quite ready to replace the typical spyglass view.

Intervals are also stored in a massive BigQuery database, which can be useful for scanning for certain symptoms across all CI runs.
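
If you have been granted access to that dataset, the BigQuery Python client can scan it. The sketch below is illustrative only: the project, dataset, table, and column names are placeholders, not the real schema.

# A minimal sketch using the BigQuery Python client. The project, dataset,
# table, and column names below are placeholders, not the real schema.
from google.cloud import bigquery

client = bigquery.Client(project="my-ci-project")  # placeholder project
query = """
    SELECT from_time, to_time, message             -- placeholder columns
    FROM `my-ci-project.ci_data.intervals`         -- placeholder table
    WHERE source = 'Disruption'
      AND from_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    LIMIT 100
"""
for row in client.query(query).result():
    print(row.from_time, row.to_time, row.message)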

Artifacts

The top of the prow job UI will have a link to the artifacts collected during the run. Artifacts are grouped based on the steps that compose the test run. Some of the highlights contained in the artifacts are:

  • The main build-log.txt in the root of the folder; each step also includes its own build-log.txt
  • ipi-install-install-stableinitial
    • Includes .openshift_install.log
  • gather-must-gather [src]
    • Includes event-filter.html which can be used to search for events that occurred during the run
    • Includes must-gather.tar that can be downloaded, untarred and searched / viewed locally
  • gather-extra
    • Includes the nodes directory containing the kubelet journal logs, which can be downloaded and untarred for searching and viewing locally
    • events.json - a collection of all of the events collected during the run (see the sketch after this list)
    • pods - most recent pod logs (but not ephemeral pods)
  • gather-network
    • Includes network.tar that can be downloaded, untarred and searched / viewed locally
  • gather-aws-console, gather-gcp-console, etc.
    • If you need to see what is going on in the base environment (checking RHCOS version, etc.)
  • test artifacts junit (artifacts/e2e-aws-serial/openshift-e2e-test/artifacts/junit/)
    • e2e-timelines_openshift-control-plane - Timelines for the control plane pods
    • e2e-timelines_operators - Timelines for operators
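
As noted in the gather-extra entry above, events.json can be scanned locally once downloaded. The following minimal sketch assumes it is a standard Kubernetes event list with an "items" array, and tallies Warning events as a quick first pass.

# A minimal sketch for a first pass over gather-extra's events.json. It
# assumes the file is a standard Kubernetes event list with an "items"
# array; adjust field names if your copy differs.
import json
from collections import Counter

with open("events.json") as f:
    events = json.load(f).get("items", [])

warnings = [e for e in events if e.get("type") == "Warning"]
print(f"{len(warnings)} Warning events out of {len(events)} total")

# Tally the most common (namespace, reason) pairs to spot hotspots.
counts = Counter(
    (e.get("metadata", {}).get("namespace", ""), e.get("reason", ""))
    for e in warnings
)
for (namespace, reason), count in counts.most_common(10):
    print(count, namespace, reason)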

Troubleshooting Steps

  • Investigate the error message for obvious clues.
  • Navigate to Sippy > 4.17 (or your release) > Tests and search for this specific test. See how often it’s failing and on which jobs. Was there a recent change in pass rate, or is this a common flake? Is it failing payloads? Is it only affecting certain variants?
    • This page will also show any linked jiras for this test. Linked jiras are established by mentioning the test name in the description of the jira or in any comment.
  • If multiple tests failed, is there a pattern visible? Does that pattern present itself on other jobs where this test failed?
  • Analyze intervals charts to see what else was going on in the cluster at the time of the problem.
  • Can the failure be traced to a recent commit?
    • If you arrived at this job due to a failure on a payload, check the changelog on that payload.
    • If the test failures have a well defined start time, the Presubmit Pull Requests page in Sippy can be used to try to identify what merged around a specific time.
  • Search Slack for the test name to see if any recent discussion turns up.
  • Use Search.CI to see if there are any known bugs related to the test.
  • Review the test code to try to get a better understanding of what it is checking.