Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Problem:
Certain Insights Advisor features differentiate between RHEL and OCP advisor
Goal:
Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.
Scope:
Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432
This contains all the Insights Advisor widget deliverables for the OCP release 4.11.
Scope
It covers only minor bug fixes and improvements:
Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis Given: OCP WebConsole UI and the cluster dashboard is accessible And: CCX external data pipeline is in a working state And: administrator A1 has access to his cluster's dashboard And: Insights Operator for this cluster is sending archives When: administrator A1 clicks on the Insights Advisor widget Then: the results of the last analysis are showed in the Insights Advisor widget And: the time of the last analysis is shown in the Insights Advisor widget
Acceptance criteria:
max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.
Expected for 4.11 OCP release.
Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.
Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.
CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.
DoD
CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.
DoD
Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet.
The prometheus-operator exposes retentionSize.
Field | Description |
---|---|
retentionSize | Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB. |
This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.
Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.
DoD:
DoD
DoD
Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]
Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS
What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.
Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.
Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/
Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.
The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.
We will extend CMO's configuration API to support the following authentications with remote write:
Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).
Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
Authorization:
type: Bearer
credentials:
name: credentials
key: token
DoD:
Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
sigv4:
accessKey:
name: aws-credentialss
key: access
secretKey:
name: aws-credentials
key: secret
profile: "SomeProfile"
roleArn: "SomeRoleArn"
DoD:
As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
Let the Cluster Authentication Operator deliver the policy to OAuthServer.
In order to know if authn events should be logged, OAuthServer needs to be aware of it.
* Stanislav LázničkaCreate an observer to deliver the audit policy to the oauth server
Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.
Right now there's no way to generate audit logs from this.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.
While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.
For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This is a ticket meant to track all the all the OCP PRs that are involved in the implementation of the SNO + workers enhancement
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Rebase openshift/builder to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Feature --->
<--- Remove the descriptive text as appropriate --->
Problem
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, the openshift-tests is reporting an incorrect counter for the "total" field.
In the example below, after the 1127th test, the total follows the same counter of executed. I also would assume that the total is incorrect before that point as the test continues the execution increases both counters.
openshift-tests output format: [failed/executed/total]
started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]" started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]" skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping Ginkgo exit error 3: exit with code 3 skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]"
OPCT output format [executed/total (failed failures)]
Tue, 09 Aug 2022 14:12:13 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1112/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:23 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1120/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:33 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1139/1139 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:43 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1185/1185 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:53 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1188/1188 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.
The PR should include the following:
User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.
To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.
Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.
The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.
In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".
Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.
PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).
The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.
It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).
Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.
Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)
This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.
It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.
Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.
During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider to backport this to 4.10
Goal
Add support for PDB (Pod Disruption Budget) to the console.
Requirements:
Designs:
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them.
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an OpenShift engineer,
I want to initialize a validating admission webhook for the shared resource CSI driver
So that I can eventually require readOnly: true to be set on all pods that use the Shared Resource CSI Driver
None.
None.
None.
This is a prerequisite for implementing the validating admission webhook.
We need to have ART build the container image downstream so that we can add the correct image references for the CVO.
If we reference images in the CVO manifests which do not have downstream counterparts, we break the downstream build for the payload.
CI is capable of producing multiple images for a GitHub repository. For example, github.com/openshift/oc produces 4-5 images with various capabilities.
We did similar work in BUILD-234 - some of these steps are not required.
See also:
Tasks:
As a developer using SharedSecrets and ConfigMaps
I want to ensure all pods set readOnly; true on admission
So that I don't have pods stuck in the "Pending" state because of a bad volume mount
QE will need to verify the new Pod Admission behavior
Docs will need to ensure that readOnly: true is required and must be set to true.
None.
QE testing/verification of the feature - require readOnly to be true
Actions:
1. Create smoke test and submit to GitHub
2. Run script to integrate smoke test with Polarion
As an OpenShift engineer
I want the shared resource CSI Driver webhook to be installed with the cluster storage operator
So that the webhook is deployed when the CSI driver is deployed
None - no new functional capabilities will be added
None - we can verify in CI that we are deploying the webhook correctly.
None - no new functional capabilities will be added
The scope of this story is to just deploy the "hello world" webhook with the Cluster Storage Operator.
Adding the live ValidatingWebhook configuration and service will be done in a separate story.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.
This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift
Enhancement - https://github.com/openshift/enhancements/pull/1010
ingress-operator must comply to pod security. The current audit warning is:
{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
dns-operator must comply to restricted pod security level. The current audit warning is:
{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.
showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.
See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:
To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).
Summary of changes to the overview page:
Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:
For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)
Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:
In general, anything that changes a cluster version should be read only.
Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit
Admin Console -> Workloads & Pods
Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run
As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.
Acceptance criteria:
As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.
Acceptance criteria:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.
Currently, you need to navigate to
Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins
to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.
AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:
In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.
In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654
AC:
Follow up of https://issues.redhat.com/browse/CONSOLE-3159
We need to provide a base for running integration tests using the dynamic plugins. The tests should initially
Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.
https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk
We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart.
This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.
AC:
Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.
The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.
AC:
Goal
Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267
High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema
Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.
The oc implementation landed via https://github.com/openshift/oc/pull/961.
Design
See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#
See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932
Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.
The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.
Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.
Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html
Acceptance criteria:
Open Questions:
As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.
Remove Jenkins from the OCP Payload.
See epic linking - need alternative non payload image available to provide relatively seamless migration
Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md
PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.
assuming none
As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.
As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.
As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.
High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)
NONE ... QE should wait until JNKS-254
NONE
NONE
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Possible staging
1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image
2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set
3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster
After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.
DoD:
Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.
The limits that we want to look into for OCP are the following ones:
# Per-scrape limit on number of labels that will be accepted for a sample. If # more than this number of labels are present post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_limit: <int> | default = 0 ] # Per-scrape limit on length of labels name that will be accepted for a sample. # If a label name is longer than this number post metric-relabeling, the entire # scrape will be treated as failed. 0 means no limit. [ label_name_length_limit: <int> | default = 0 ] # Per-scrape limit on length of labels value that will be accepted for a sample. # If a label value is longer than this number post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_value_length_limit: <int> | default = 0 ]
We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.
DoD:
When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.
Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.
Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.
We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.
Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.
I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.
The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.
The potential target ServiceMonitors are:
As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).
See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit
As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis
To the track the QE progress at one place in 4.10 Release Confluence page
Acceptance criteria:
This epic covers a number of customer requests(RFEs) as well as increases usability.
Customer satisfaction as well as improved usability.
None
As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.
Form component https://github.com/openshift/console/pull/11227
As a user, I want to use a form to create Deployments
Edit deployment form ODC-5007
Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.
In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.
// JS type
telemetry?: Record<string, string>
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy ./bin/bridge --telemetry CONSOLE_LOG=debug
Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support
tl;dr
oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).
In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).
If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.
More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla
Rename Provider to Infrastructure Provider
Add GPU Provider
https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post, potentially?
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
This is a clone of issue OCPBUGS-8491. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in ap-southeast-4 AWS region
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Deploy OCP in AWS ap-southeast-4 region
Steps to Reproduce:
Deploy OCP in AWS ap-southeast-4 region
Actual results:
panic: Invalid region provided: ap-southeast-4
Expected results:
Image registry pods should come up with no errors
Additional info:
This is a clone of issue OCPBUGS-676. The following is the description of the original issue:
—
the machine approver isn't recognizing hostnames that use capital letters as valid even though DNS is case-insensitive
an example of this is in OHSS-14709:
I0822 19:04:51.587266 1 controller.go:114] Reconciling CSR: csr-vdtpv I0822 19:04:51.600941 1 csr_check.go:156] csr-vdtpv: CSR does not appear to be client csr I0822 19:04:51.603648 1 csr_check.go:542] retrieving serving cert from ip-100-66-119-117.ec2.internal (100.66.119.117:10250) I0822 19:04:51.604003 1 csr_check.go:181] Failed to retrieve current serving cert: dial tcp 100.66.119.117:10250: connect: connection refused I0822 19:04:51.604017 1 csr_check.go:201] Falling back to machine-api authorization for ip-100-66-119-117.ec2.internal E0822 19:04:51.604024 1 csr_check.go:392] csr-vdtpv: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com I0822 19:04:51.604033 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com I0822 19:04:51.606777 1 controller.go:199] csr-vdtpv: CSR not authorized
This can be worked around by manually approving the CSR
The relevant line in the machine approver appears to be here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L378
Description of problem:
During an upgrade from 4.10.0-0.nightly-2022-09-06-081345 -> 4.11.0-0.nightly-2022-09-06-074353 DNS operator stuck in progressing state with following error in dns-defalt pod 2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather-136172-050448657.tar.gz
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-09-06-074353
How reproducible:
Intermittent
Steps to Reproduce:
1. Perform upgrade from 4.10->4.11 2. 3.
Actual results:
Upgrade unsuccessful due to API svc unreachability
Expected results:
Upgrade should be successful
Additional info:
$ omg get co | grep -v "True.*False.*False" NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE dns 4.10.0-0.nightly-2022-09-06-081345 True True False 3h4m network 4.11.0-0.nightly-2022-09-06-074353 True True False 3h5m $ omg logs dns-default-5mqz4 -c dns -n openshift-dns /home/anusaxen/Downloads/must-gather-136208-318613238/must-gather.local.4033693973186012455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d995e1608ba83a1f71b5a1c49705aa3cef38618e0674ee256332d1b0e2cb0e23/namespaces/openshift-dns/pods/dns-default-5mqz4/dns/dns/logs/current.log 2022-09-07T00:45:36.355161941Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:36.854403897Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:37.354367197Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:37.861427861Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:38.355095996Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:38.855063626Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:39.355170472Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:39.855296624Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:40.355481413Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server 2022-09-07T00:45:40.855455283Z [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API 2022-09-07T00:45:40.856166731Z .:5353 2022-09-07T00:45:40.856166731Z [INFO] plugin/reload: Running configuration SHA512 = 7c3db3182389e270fe3af971fef9e7157a34bde01eb750d58c2d4895a965ecb2e2d00df34594b6a819653088b6efa2590e748ce23f9e16cbb0f44d74a4fc3587 2022-09-07T00:45:40.856166731Z CoreDNS-1.9.2 2022-09-07T00:45:40.856166731Z linux/amd64, go1.18.4, 2022-09-07T00:46:05.957503195Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout 2022-09-07T00:47:25.444876534Z [INFO] SIGTERM: Shutting down servers then terminating 2022-09-07T00:47:25.444984367Z [INFO] plugin/health: Going into lameduck mode for 20s 2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout
This fix contains the following changes coming from updated version of kubernetes up to v1.24.12:
Changelog:
This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:
—
Description of problem:
Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters
Steps to Reproduce:
1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name
Actual results:
See error message: encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]
Expected results:
Create a cluster/machineset using the existing and valid DiskEncryptionSet
Additional info:
I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513
Description of problem:
The 4.11 version of openshift-installer does not support the mon01 zone
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-669. The following is the description of the original issue:
—
Description of problem:
This is an OCP clone of https://bugzilla.redhat.com/show_bug.cgi?id=2099794 In summary, NetworkManager reports the network as being up before the ipv6 address of the primary interface is ready and crio fails to bind to it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
NodePort port not accessible
Version-Release number of selected component (if applicable):
OCP 4.8.20
How reproducible:
$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry
$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
In a new-created namespaces the same deployment works:
[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000
[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s
[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3117. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3084. The following is the description of the original issue:
—
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603
Long log lines get corrupted when using '--timestamps' by the Kubelet.
The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turrned on the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.
apiVersion: v1
kind: Pod
metadata:
name: logs
spec:
restartPolicy: Never
containers:
- name: logs
image: fedora
args:
- bash
- -c
- 'for i in `seq 1 10000000`; do echo -n $i; done'
kubectl logs logs --timestamps
This is a clone of issue OCPBUGS-1522. The following is the description of the original issue:
—
Description of problem:
Normal user cannot open the debug container from the pods(crashLoopbackoff) they created, And would be got error message: pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z
How reproducible:
Always
Steps to Reproduce:
1. Login OCP as a normal user eg: flexy-htpasswd-provider 2. Create a project, go to Developer prespective -> +Add page 3. Click "Import from Git", and provide below data to get a Pods with CrashLoopBackOff state Git Repo URL: https://github.com/sclorg/nodejs-ex.git Name: nodejs-ex-git Run command: star a wktw 4. Navigate to /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to it details page -> Logs Tab 5. Click the link of "Debug container" 6. Check if the Debug container can be opened
Actual results:
6. Error message would be shown on page, user cannot open debug container via UI pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Expected results:
6. Normal user could use debug container without any error message
Additional info:
The debug container could be created for the normal user successfully via CommandLine $ oc debug <crashloopbackoff pod name> -n <project name>
The kube-state-metric pod inside the openshift-monitoring namespace is not running as expected.
On checking the logs I am able to see that there is a memory panic
~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic(
)
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1(
)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~
Logs are attached to the support case
This is a clone of issue OCPBUGS-2079. The following is the description of the original issue:
—
Description of problem:
The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected.
Version-Release number of selected component (if applicable):
4.10.z, may exist on other OCP versions as well.
How reproducible:
always
Steps to Reproduce:
1. Create a KubeletConfig on the node: apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: system-reserved-config spec: machineConfigPoolSelector: matchLabels: pools.operator.machineconfiguration.openshift.io/master: "" kubeletConfig: systemReserved: cpu: 500m memory: 500Mi ephemeral-storage: 10Gi 2. Check node allocatable storage with command: oc describe node |grep -C 5 ephemeral-storage
Actual results:
The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
Expected results:
The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
Additional info:
The root cause might be: process argument '--system-reserved=cpu=500m,memory=500Mi' overwrote the setting in /etc/kubernetes/kubelet.conf, one example: root 6824 1 27 Sep30 ? 1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s
This is a clone of issue OCPBUGS-4486. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-95. The following is the description of the original issue:
—
In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.
The bug is 100% reproducible.
Steps for reproducing the issue are:
1. Install a cluster with OpenShiftSDN network plugin.
2. Configure egressip for a project.
3. Install NMstate operator.
4. Create a NodeNetworkConfigurationPolicy.
5. Identify on which node the egressIP is present.
6. Restart the nmstate-handler pod running on the identified node.
7. Verify that the egressIP is no more present.
Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.
This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.
The expectation is that NMstate operator should not interfere with SDN configuration.
This is a clone of issue OCPBUGS-6831. The following is the description of the original issue:
—
Description of problem:
The console crashes when it used with a user settings ConfigMap that is created with a 4.13+ console. This version saves "null" for the key "console.pinnedResources" which doesn't happen before and the old console version could not handle this well.
Version-Release number of selected component (if applicable):
4.8-4.12
How reproducible:
Always, but only in the edge case that someone used a newer console first and then downgraded.
This can happen only by manually applying the user settings ConfigMap or when downgrading a cluster.
Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinedResources" to "null" (with quotes as all ConfigMap values needs to be strings)
Or run this patch command:
oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'
Open console...
Actual results:
Console crashes
Expected results:
Console should not crash
This is a clone of issue OCPBUGS-5185. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5165. The following is the description of the original issue:
—
Currently, the Dev Sandbox clusters sends the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.
Open Dev Sandbox and browser console and inspect window.SERVER_FLAGS.telemetry
Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.
Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry
Deploy git workload with devfile from topology page: A-04-TC01
Version-Release number of selected component (if applicable):
How reproducible: Intermittent
Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01
Actual results:
Expected results: Show always pass
Additional info:
Description of problem:
Dummy bug that is needed to track backport of https://github.com/ovn-org/ovn-kubernetes/pull/2975/commits/816e30a1fbb5beb8b20fe3e96906285762dd8eb6 which is already merged in 4.12
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When adding new nodes to the existing cluster, the newly allocated node-subnet can be overlapped with the existing node.
Version-Release number of selected component (if applicable):
openshift 4.10.30
How reproducible:
It's quite hard to reproduce but there is a possibility it can happen any time.
Steps to Reproduce:
1. Create a OVN dual-stack cluster 2. add nodes to the existing cluster 3. check the allocated node subnet
Actual results:
Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.
Expected results:
Should have duplicated node-subnet and ovn-k8s-mp0 IP
Additional info:
Additional info can be found at the case 03329155 and the must-gather attached(comment #1) % omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857' 2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020 1 master.go:1422] Did not find any logical switches with other-config 2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local 2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105 1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local 2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local 2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local 2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156 1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local 2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local 2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local 2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local 2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local 2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local 2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258 1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local 2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local 2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314 1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local 2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329 1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local 2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local 2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local
Description of problem:
For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results on the networkpolicy not being applied to the pod and communication not possible.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always with a concrete pod at customer environment.
Steps to Reproduce:
(not known exactly yet)
Actual results:
LSP not in port group. ACL not applied. Netpol not in effect.
Expected results:
LSP in port group. ACL applied. Netpol in effect.
Additional info:
Details in private comments, as they involve sensitive data. Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSPs UUIDs are different in each incarnation).
Description of problem: Backport owners for 4.11 (only)
This is a clone of OCPBUGS-853.
Description of problem:
Large OpenShift Container Platform 4.10.24 - Cluster is failing to update router-certs secret in openshift-config-managed namespace as the given secret is too big. 2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"} The OpenShift Container Platform 4 - Cluster has 180 IngressController configured with endpointPublishingStrategy set to private. Now the default certificate needs to be replaced but is not properly replicated to openshift-authentication namespace and potentially other location because of the problem mentioned (since the required secret can not be updated)
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.10.24
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.10 2. Create 180 IngressController with specific certificates 3. Check openshift-ingress-operator logs to see how it fails to update/create the necessary secret in openshift-config-managed
Actual results:
2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}
Expected results:
No matter how many IngressController is created, secret management taken care by Operators need to work, even if data exceed 1 MB size limitation. In that case an approach needs to exist to split data into multiple secrets or handle it otherwise.
Additional info:
This is a clone of issue OCPBUGS-4897. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:
—
Description of problem:
When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.
Version-Release number of selected component (if applicable):
How reproducible:
Always occurs
Steps to Reproduce:
1.Create a project with at least 12 components (Apps, Operators, knative Brokers) 2. Go to the Administrator Viewpoint 3. Switch to Developer Viewpoint/Topology 4. No components displayed 5. Click on 'fit to screen' 6. All components appear
Actual results:
Topology renders with all controls but no components visible (see screenshot 1)
Expected results:
All components should be visible
Additional info:
This is a clone of issue OCPBUGS-6672. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4684. The following is the description of the original issue:
—
Description of problem:
In DeploymentConfig both the Form view and Yaml view are not in sync
Version-Release number of selected component (if applicable):
4.11.13
How reproducible:
Always
Steps to Reproduce:
1. Create a DC with selector and labels as given below spec: replicas: 1 selector: app: apigateway deploymentconfig: qa-apigateway environment: qa strategy: activeDeadlineSeconds: 21600 resources: {} rollingParams: intervalSeconds: 1 maxSurge: 25% maxUnavailable: 25% timeoutSeconds: 600 updatePeriodSeconds: 1 type: Rolling template: metadata: labels: app: apigateway deploymentconfig: qa-apigateway environment: qa 2. Now go to GUI--> Workloads--> DeploymentConfig --> Actions--> Edit DeploymentConfig, first go to Form view and now switch to Yaml view, the selector and labels shows as app: ubi8 while it should display app: apigateway selector: app: ubi8 deploymentconfig: qa-apigateway environment: qa template: metadata: creationTimestamp: null labels: app: ubi8 deploymentconfig: qa-apigateway environment: qa 3. Now in yaml view just click reload and the value is displayed as it is when it was created (app: apigateway).
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-12914. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-12878. The following is the description of the original issue:
—
We want to add the dual-stack tests to the CNI plugin conformance test suite, for the currently supported releases.
(This has no impact on OpenShift itself. We're just modifying a test suite that OCP does not use.)
Description of problem: Issue described in following issue: https://github.com/openshift/multus-admission-controller/issues/40
Fixed in: https://github.com/openshift/cluster-network-operator/pull/1515
Version-Release number of selected component (if applicable): OCP 4.10
Official Red Hat tracker. Issue has been merged already.
This is a clone of issue OCPBUGS-7445. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:
—
At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.
But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information when the procedure is still in progress making it to use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small" blocking the procedure itself or causing network disruption until the procedure is over.
However, this was not a problem for the CI job as it doesn't use the migration procedure as documented for the sake of saving limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress and hiding this issue.
This was probably not detected in QE as well for the same reason as CI.
This is a clone of issue OCPBUGS-7732. The following is the description of the original issue:
—
Description of problem:
When services are deleted, the services controller cache should also remove the service from its top level cache to avoid growing forever. While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only. [1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387 This is the location where alreadyApplied is not deleting the removal: https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269 It should do the similar changes depicted here (currently merged upstream): https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. create service -- use unique name 2. remove service 3. notice how alreadyApplied grows and never gets smaller 4. repeat
Actual results:
^^
Expected results:
alreadyApplied should not grow forever
Additional info:
4.12 will have an option in cri-o: add_inheritable_capabilities which will allow a user to opt-out of dropping inheritable capabilities (which comes as a fix for CVE-2022-27652). We should add it by default as a drop-in in 4.11 so clusters that upgrade from it inherit the old behavior
As discussed previously by email, customer support case 03211616 requests a means to use the latest patch version of a given X.Y golang via imagestreams, with the blocking issue being the lack of X.Y tags for the go-toolset containers on RHCC.
The latter has now been fixed with the latest version also getting a :1.17 tag, and the imagestream source has been modified accordingly, which will get picked up in 4.12. We can now fix this in 4.11.z by backporting this to the imagestream files bundled in cluster-samples-operator.
/cc Ian Watson Feny Mehta
This is a clone of issue OCPBUGS-683. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
{ 4.12.0-0.nightly-2022-08-21-135326 }How reproducible:
Steps to Reproduce:
{See https://bugzilla.redhat.com/show_bug.cgi?id=2118563#c5, The following messages here are "normal" on startup, but it is very misleading with error statement, suggest suppress them or update them to some more clear context that we can know they are in normal process. E0818 02:18:53.709223 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.715530 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.735885 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.775984 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.790449 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.856911 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.950782 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.017583 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.271967 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.338944 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.916988 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.982211 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue}
Actual results:
Expected results:
Additional info:
Created attachment 1905034 [details]
Plugin page with error
Steps to reproduce:
1. Install a plugin with a page that has a runtime error. (Demo Plugin -> Dynamic Nav 1 currently has an error for me, but you can reproduce by editing any plugin and introducing an error.)
2. Observe the "something went wrong" error message.
3. Navigate to any other page (e.g. Workloads -> Pods)
Expected result:
The pods page is displayed.
Action result:
The error message persists. There is no way to clear except to refresh the browser.
Description of problem:
Customer is facing issue similar to https://github.com/devfile/api/issues/897
Version-Release number of selected component (if applicable):
OCP 4.10.17
How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Tried working around it with ALL_PROXY but it did not help. Note because the console operator reverts changes pretty quickly testing this was a bit of a PITA
This is a clone of issue OCPBUGS-2495. The following is the description of the original issue:
—
Failures like:
$ oc login --token=... Logged into "https://api..." as "..." using the token provided. Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)
break login, which tries to gather information before saving the configuration, including a giant project list.
Ideally login would be able to save the successful login credentials, even when the informative gathering had difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use-cases where the context was not needed.
During initial backporting, due to a number of other colliding commits in upstream, the cobra commands facilitating caching did not get downstreamed.
This is to downstream those two lines.
Description of problem:
When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
Version-Release number of selected component (if applicable):
How reproducible:
with big exclude ranges, 100%
Steps to Reproduce:
1. create network-attachment-definition with a large range: $ cat <<EOF| oc apply -f - apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: name: nad-w-excludes spec: config: |- { "cniVersion": "0.3.1", "name": "macvlan-net", "type": "macvlan", "master": "ens3", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "fd43:01f1:3daa:0baa::/64", "exclude": [ "fd43:01f1:3daa:0baa::/100" ], "log_file": "/tmp/whereabouts.log", "log_level" : "debug" } } EOF 2. create a pod with the network attached: $ cat <<EOF|oc apply -f - apiVersion: v1 kind: Pod metadata: name: pod-with-exclude-range annotations: k8s.v1.cni.cncf.io/networks: nad-w-excludes spec: containers: - name: pod-1 image: openshift/hello-openshift EOF 3. check pod status, event log and whereabouts logs after a while: $ oc get pods NAME READY STATUS RESTARTS AGE pod-with-exclude-range 0/1 ContainerCreating 0 2m23s $ oc get events <...> 6m39s Normal Scheduled pod/pod-with-exclude-range Successfully assigned default/pod-with-exclude-range to <worker-node> 6m37s Normal AddedInterface pod/pod-with-exclude-range Add eth0 [10.129.2.49/23] from openshift-sdn 2m39s Warning FailedCreatePodSandBox pod/pod-with-exclude-range Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded $ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log Starting pod/<worker-node>-debug ... To use host binaries, run `chroot /host` 2022-10-27T14:14:50Z [debug] Finished leader election 2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil> 2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf 2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 2022-10-27T14:14:59Z [debug] Started leader election 2022-10-27T14:14:59Z [debug] OnStartedLeading() called 2022-10-27T14:14:59Z [debug] Elected as leader, do processing 2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range 2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff
Actual results:
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Expected results:
additional network gets attached to the pod
Additional info:
This is a clone of issue OCPBUGS-10497. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10213. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:
—
Description of problem:
RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861) but aws-sdk-go need to be bumped to recognize those regions
Version-Release number of selected component (if applicable):
master/4.14
How reproducible:
always
Steps to Reproduce:
1. openshift-install create install-config 2. Try to select ap-south-2 as a region 3.
Actual results:
New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.
Expected results:
Installer supports and displays the new regions in the Survey
Additional info:
See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23
Our Prometheus alerts are inconsistent with both upstream and sometimes our own vendor folder. Let's do a clean update run before the next release is branched off.
This is a clone of issue OCPBUGS-3889. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3744. The following is the description of the original issue:
—
Description of problem:
Egress router POD creation on Openshift 4.11 is failing with below error. ~~~ Nov 15 21:51:29 pltocpwn03 hyperkube[3237]: E1115 21:51:29.467436 3237 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy_c965a287-28aa-47b6-9e79-0cc0e209fcf2_0(72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344): error adding pod stage-wfe-proxy_stage-wfe-proxy-ext-qrhjw to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw/c965a287-28aa-47b6-9e79-0cc0e209fcf2:openshift-sdn]: error adding container to network \\\"openshift-sdn\\\": CNI request failed with status 400: 'could not open netns \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": unknown FS magic on \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": 1021994\\n'\"" pod="stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw" podUID=c965a287-28aa-47b6-9e79-0cc0e209fcf2 ~~~ I have checked SDN POD log from node where egress router POD is failing and I could see below error message. ~~~ 2022-11-15T21:51:29.283002590Z W1115 21:51:29.282954 181720 pod.go:296] CNI_ADD stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw failed: could not open netns "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": unknown FS magic on "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": 1021994 ~~~ Crio is logging below event and looking at the log it seems the namespace has been created on node. ~~~ Nov 15 21:51:29 pltocpwn03 crio[3150]: time="2022-11-15 21:51:29.307184956Z" level=info msg="Got pod network &{Name:stage-wfe-proxy-ext-qrhjw Namespace:stage-wfe-proxy ID:72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344 UID:c965a287-28aa-47b6-9e79-0cc0e209fcf2 NetNS:/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}" ~~~
Version-Release number of selected component (if applicable):
4.11.12
How reproducible:
Not Sure
Steps to Reproduce:
1. 2. 3.
Actual results:
Egress router POD is failing to create. Sample application could be created without any issue.
Expected results:
Egress router POD should get created
Additional info:
Egress router POD is created following below document and it does contain pod.network.openshift.io/assign-macvlan: "true" annotation. https://docs.openshift.com/container-platform/4.11/networking/openshift_sdn/deploying-egress-router-layer3-redirection.html#nw-egress-router-pod_deploying-egress-router-layer3-redirection
This is a clone of issue OCPBUGS-8261. The following is the description of the original issue:
—
An update from 4.13.0-ec.2 to 4.13.0-ec.3 stuck on:
$ oc get clusteroperator machine-config NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE machine-config 4.13.0-ec.2 True True True 30h Unable to apply 4.13.0-ec.3: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 105, ready 105, updated: 105, unavailable: 0)]
The worker MachineConfigPool status included:
Unable to find source-code formatter for language: node. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml type: NodeDegraded - lastTransitionTime: "2023-02-16T14:29:21Z" message: 'Failed to render configuration for pool worker: Ignoring MC 99-worker-generated-containerruntime generated by older version 8276d9c1f574481043d3661a1ace1f36cd8c3b62 (my version: c06601510c0917a48912cc2dda095d8414cc5182)'
4.13.0-ec.3. The behavior was apparently introduced as part of OCPBUGS-6018, which has been backported, so the following update targets are expected to be vulnerable: 4.10.52+, 4.11.26+, 4.12.2+, and 4.13.0-ec.3.
100%, when updating into a vulnerable release, if you happen to have leaked MachineConfig.
1. 4.12.0-ec.1 dropped cleanUpDuplicatedMC. Run a later release, like 4.13.0-ec.2.
2. Create more than one KubeletConfig or ContainerRuntimeConfig targeting the worker pool (or any pool other than master). The number of clusters who have had redundant configuration objects like this is expected to be small.
3. (Optionally?) delete the extra KubeletConfig and ContainerRuntimeConfig.
4. Update to 4.13.0-ec.3.
Update sticks on the machine-config ClusterOperator, as described above.
Update completes without issues.
We will want to establish some basic metrics we can report back to Telemetry.
Let's consider:
Below is some background info from MTC when we added Telemetry support that may help
See: https://github.com/konveyor/metrics-queries/blob/master/README.md
Design/Development info:
OpenShift Monitoring Integration Guide
Monitoring integration with OLM operators
https://www.openshift.com/blog/observability-superpower-correlation
Source Code:
https://github.com/konveyor/mig-controller/blob/master/pkg/controller/migmigration/metrics.go
This is a clone of issue OCPBUGS-1678. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
This is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
This is a clone of issue OCPBUGS-1226. The following is the description of the original issue:
—
We added server groups for control plane and computes as part of OSASINFRA-2570, except for UPI that only creates server group for the control plane.
We need to update the UPI scripts to create server group for computes to be consistent with IPI and have the instruction at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.
Related to OCPCLOUD-1135.
This is a clone of issue OCPBUGS-11404. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:
—
Description of problem:
according to PR: https://github.com/openshift/cluster-monitoring-operator/pull/1824, startupProbe for UWM prometheus/platform prometheus should be 1 hour, but startupProbe for UWM prometheus is still 15m after enabled UWM, platform promethues does not have issue, startupProbe is increased to 1 hour
$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20 startupProbe: exec: command: - sh - -c - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi failureThreshold: 60 periodSeconds: 15 successThreshold: 1 timeoutSeconds: 3 ... $ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20 startupProbe: exec: command: - sh - -c - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi failureThreshold: 240 periodSeconds: 15 successThreshold: 1 timeoutSeconds: 3
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-19-052243
How reproducible:
always
Steps to Reproduce:
1. enable UWM, check startupProbe for UWM prometheus/platform prometheus 2. 3.
Actual results:
startupProbe for UWM prometheus is still 15m
Expected results:
startupProbe for UWM prometheus should be 1 hour
Additional info:
since startupProbe for platform prometheus is increased to 1 hour, and no similar bug for UWM prometheus, won't fix the issue is OK.
Description of problem:
Each LB created for a Service type LoadBalancer results in 1 client rule and <# of public subnets> health rules being created. The rules per SG quota in AWS is quite small; 60 by default, and 200 hard max. OCP has about 40 rules OOTB. Assuming an HA cluster in 3 AZs, that is 4 rules per LB. With default AWS quota, only ~5 LBs can be create and with the hard max of 200, only ~40 LBs can be created.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create Service type LoadBalancer and observe increase in master-sg and worker-sg rules sets 2. 3.
Actual results:
4 rules are created
Expected results:
1 rules is created when the client rule is a superset of the per-subnet health rules
Additional info:
This ~4x the number of Services of type LoadBalancer. This is required for Hypershift.
This is a clone of issue OCPBUGS-4851. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:
—
Description of problem:
Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a warning Event will be generated causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-268. The following is the description of the original issue:
—
The linux kernel was updated:
https://lkml.org/lkml/2020/3/20/1030
to include steal
accounting
This would greatly assist in troubleshooting vSphere performance issues
caused by over-provisioned ESXi hosts.
Description of problem:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
Version-Release number of selected component (if applicable): 4.10.22
How reproducible: Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Steps to Reproduce:
Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Actual results: master nodes do come up.
Expected results: master nodes should come up despite that the text "master" is not in their hostname.
Additional info:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
The code for the cluster-baremetal-operator at the following link:
The following condition is concerning:
if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0
The packages reveal that bmh.Name references the name inside the metadata of the BMH object.
Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :
macs = append(macs, bmh.Spec.BootMACAddress)
This is a clone of issue OCPBUGS-10372. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10220. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:
—
Description of problem:
When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.
Version-Release number of selected component (if applicable):
4.12.3
How reproducible:
Consistent
Steps to Reproduce:
1. On a long lived cluster, add a new machineset
Actual results:
Machines reach "Provisioned" but don't join the cluster
Expected results:
Machines join cluster as nodes
Additional info:
This is a clone of issue OCPBUGS-5093. The following is the description of the original issue:
—
Description of problem:
When users try to deploy an application from git method on dev console it throws warning message for specific public repos `URL is valid but cannot be reached. If this is a private repository, enter a source secret in Advanced Git Options.`. If we ignore the warning and go ahead the build will be successful although the warning message seems to be misleading.
Actual results:
Getting a warning for url while trying to deploy an application from git method on dev console from a public repo
Expected results:
It should show validated
This fix contains the following changes coming from updated version of kubernetes up to v1.24.10:
Changelog:
v1.24.11: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v12410
v1.24.10: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1249
v1.24.9: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1248
v1.24.8: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1247
v1.24.7: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1246
This is a clone of issue OCPBUGS-6913. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-186. The following is the description of the original issue:
—
Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.
Actual results:
Status text is overlapped by the task status bar
Expected results:
Status text breaks to a newline or gets shortened by "..."
Description of problem:
This bugs purpose is to enable a feature backport of https://issues.redhat.com/browse/MON-1949.
This is a clone of issue OCPBUGS-6764. The following is the description of the original issue:
—
Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".
But the permission section shows nothing when open this second expandable section, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or BitBucket URL.
Version-Release number of selected component (if applicable):
4.11-4.13
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-224. The following is the description of the original issue:
—
Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf
OpenShift 4.8.29
ENV: OpenStack IAD2, IPI installation. Connected cluster.
Version-Release number of selected component (if applicable):
OCP v4.9.31
How reproducible:
Always
Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.
Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.
Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.
Additional info:
set -eo pipefail
DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \
grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \ |
tr '\n' ' ')" |
>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"
sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp
mv /etc/resolv.tmp /etc/resolv.conf
~~~
This is a clone of issue OCPBUGS-2181. The following is the description of the original issue:
—
Description of problem:
E2E test Installs Red Hat Integration - 3scale operator test is failing due to change of Operator name
This is a copy of Bugzilla bug 2117524 for backport to 4.11.z
Original Text:
Description of problem:
On routers configured with mTLS and CRL defined in the CA with a CDP ; new CRL is downloaded only when restarting the ingress-operator.
2022-07-20T23:36:26.943Z INFO operator.clientca_configmap_controller controller/controller.go:298 reconciling {"request": "openshift-ingress-operator/service-bdrc"} 2022-07-20T23:36:26.943Z INFO operator.crl crl/crl_configmap.go:69 certificate revocation list has expired {"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"} 2022-07-20T23:36:26.943Z INFO operator.crl crl/crl_configmap.go:69 retrieving certificate revocation list {"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"} 2022-07-20T23:36:26.943Z INFO operator.crl crl/crl_configmap.go:169 retrieving CRL distribution point {"distribution point": "http://crl.domain.com/der/CN=XXXX,OU=XXX,O=XXX,C=XXX"}
Version-Release number of selected component (if applicable):
4.9.33
How reproducible:
Enable mTLS with a CRL
Actual results:
CRL is not download when expired
Clients get "SSL client certificate not trusted" errors while accessing resources
Expected results:
ingress-operator triggers CRL download when approaching expiration date so that the configmap is updated without manual action
This is a clone of issue OCPBUGS-4696. The following is the description of the original issue:
—
Description of problem:
metal3 pod does not come up on SNO when creating Provisioning with provisioningNetwork set to Disabled The issue is that on SNO, there is no Machine, and no BareMetalHost, it is looking of Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway. You can work around this issue by populating provisioningMacAddresses with a dummy address, like this: kind: Provisioning metadata: name: provisioning-configuration spec: provisioningMacAddresses: - aa:aa:aa:aa:aa:aa provisioningNetwork: Disabled watchAllNamespaces: true
Version-Release number of selected component (if applicable):
4.11.17
How reproducible:
Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled apiVersion: metal3.io/v1alpha1 kind: Provisioning metadata: name: provisioning-configuration spec: provisioningNetwork: Disabled watchAllNamespaces: true
Steps to Reproduce:
1. 2. 3.
Actual results:
controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"
Expected results:
metal3 pod should be deployed
Additional info:
This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307 See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599
Changelog between 3.5.5 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v355-tbd
Changelog between 3.5.3 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v354-2022-04-24
Description of problem:
In a complete disconnected cluster, the dev catalog is taking too much time in loading
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.
Actual results:
Taking too much time too load
Expected results:
Time taken should be reduced
Additional info:
Attached a gif for reference
This is a clone of issue OCPBUGS-6850. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:
—
Description of problem:
While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.
Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired' Jan 6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...
Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Version-Release number of selected component (if applicable):
4.11+ openshift-tests
How reproducible:
nondeterministic, wild guess is ~30% of upgrade jobs
Steps to Reproduce:
1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test
Actual results:
Jan 23 00:47:43.842: INFO: Admin Ack verified Jan 23 00:57:43.836: INFO: Admin Ack verified Jan 23 01:07:43.839: INFO: Admin Ack verified Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8
No check done after the upgrade
Expected results:
Jan 23 00:57:37.894: INFO: Admin Ack verified Jan 23 01:07:37.894: INFO: Admin Ack verified Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d Jan 23 01:17:57.937: INFO: Admin Ack verified
One or more checks done after upgrade
This is a clone of issue OCPBUGS-5019. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:
—
Description of problem: This is a follow-up to OCPBUGS-3933.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
OCPBUGS-1678 is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
Description of problem:
Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':
test-ip-removal-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "port1", "namespace": "default" } ] test-ip-removal-net-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "net-port1", "namespace": "default" } ]
2. IP allocated in the IPPool:
kind: IPPool ... spec: allocations: "16": id: ... podref: test-ecoloma-1/test-ip-removal-port1 "17": id: ... podref: test-ecoloma-1/test-ip-removal-net-port1
3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:
[13:29][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 14m 11d
[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus 2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
allocations:
"17":
id: ...
podref: test-ecoloma-1/test-ip-removal-net-port1
range: 2001:1b70:820d:2610::/64
[13:30][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 9s 11d
Actual results:
The network interface with a name that doesn't have a 'net' prefix is removed from the ip-reconciler cronjob.
Expected results:
The network interface must not be removed, regardless of the name.
Additional info:
Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94
Tracker bug for bootimage bump in 4.11. This bug should block bugs which need a bootimage bump to fix.
Description of problem:
Sometimes we see VMs fail to power on when the land on a host that does not have enough resources. The current power on does not retry or leverage DRS to power on the node on a suitable host.
https://github.com/vmware/govmomi/issues/1026
Our code is still making calls to PowerOnVM_Task which, according to the vsphere docs, is deprecated and we should use PowerOnMultiVM_Task instead.
PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.
https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.
Version-Release number of selected component (if applicable):
4.8.x
How reproducible:
Always
Steps to Reproduce:
1.
2.
3.
Actual results:
Sometimes powers on fails requiring manual intervention.
Expected results:
PowerOn should use DRS to ensure it's always successful.
Additional info:
This is a clone of issue OCPBUGS-10496. The following is the description of the original issue:
—
Description of problem:
Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:
error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid
In the image registry Pod we can see the following error message:
time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..] time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]
Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698
Version-Release number of selected component (if applicable):
Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21: $ oc version Client Version: 4.12.0 Kustomize Version: v4.5.7 Server Version: 4.11.21 Kubernetes Version: v1.24.6+5658434
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform cluster 4.11.21 on AWS 2. Confirm registry storage is on AWS S3 3. Create a new build including a 10GB file using the following command: `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -` 4. Wait for some time for the build to run
Actual results:
Pushing the new build fails with the following error message: error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid
Expected results:
Push of large container image layers succeeds
Additional info:
This bug represents a backport of CCO-222 to release-4.11.
This is a clone of issue OCPBUGS-10943. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:
—
Description of problem:
Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag either due to a missing or invalid VERSION_OVERRIDE(https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed tot he build. This is resulting in all jobs invoked by the 4.12 nightlies failing to install.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-03-13-172313 and later
How reproducible:
consistently in 4.12 nightlies only(ci builds do not seem to be impacted).
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log
Description of problem:
In a 4.11 cluster with only openshift-samples enabled, the 4.12 introduced optional COs console and insights are installed. While upgrading to 4.12, CVO considers them to be disabled explicitly and skips reconciling them. So these COs are not upgraded to 4.12. Installed COs cannot be disabled, so CVO is supposed to implicitly enable them. $ oc get clusterversion -oyaml { "apiVersion": "config.openshift.io/v1", "kind": "ClusterVersion", "metadata": { "creationTimestamp": "2022-09-30T05:02:31Z", "generation": 3, "name": "version", "resourceVersion": "134808", "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb" }, "spec": { "capabilities": { "additionalEnabledCapabilities": [ "openshift-samples" ], "baselineCapabilitySet": "None" }, "channel": "stable-4.11", "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34", "desiredUpdate": { "force": true, "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "" } }, "status": { "availableUpdates": null, "capabilities": { "enabledCapabilities": [ "openshift-samples" ], "knownCapabilities": [ "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples" ] }, "conditions": [ { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel", "reason": "VersionNotFound", "status": "False", "type": "RetrievedUpdates" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Capabilities match configured spec", "reason": "AsExpected", "status": "False", "type": "ImplicitlyEnabledCapabilities" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"", "reason": "PayloadLoaded", "status": "True", "type": "ReleaseAccepted" }, { "lastTransitionTime": "2022-09-30T05:23:18Z", "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419", "status": "True", "type": "Available" }, { "lastTransitionTime": "2022-09-30T07:05:42Z", "status": "False", "type": "Failing" }, { "lastTransitionTime": "2022-09-30T07:41:53Z", "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419", "status": "False", "type": "Progressing" } ], "desired": { "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "4.12.0-0.nightly-2022-09-28-204419" }, "history": [ { "completionTime": "2022-09-30T07:41:53Z", "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "startedTime": "2022-09-30T06:42:01Z", "state": "Completed", "verified": false, "version": "4.12.0-0.nightly-2022-09-28-204419" }, { "completionTime": "2022-09-30T05:23:18Z", "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631", "startedTime": "2022-09-30T05:02:33Z", "state": "Completed", "verified": false, "version": "4.11.0-0.nightly-2022-09-29-191451" } ], "observedGeneration": 3, "versionHash": "CSCJ2fxM_2o=" } } $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.12.0-0.nightly-2022-09-28-204419 True False False 93m cloud-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h56m cloud-credential 4.12.0-0.nightly-2022-09-28-204419 True False False 3h59m cluster-autoscaler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m config-operator 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m console 4.11.0-0.nightly-2022-09-29-191451 True False False 3h45m control-plane-machine-set 4.12.0-0.nightly-2022-09-28-204419 True False False 117m csi-snapshot-controller 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m dns 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m etcd 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m image-registry 4.12.0-0.nightly-2022-09-28-204419 True False False 3h46m ingress 4.12.0-0.nightly-2022-09-28-204419 True False False 151m insights 4.11.0-0.nightly-2022-09-29-191451 True False False 3h48m kube-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m kube-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-scheduler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-storage-version-migrator 4.12.0-0.nightly-2022-09-28-204419 True False False 91m machine-api 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m machine-approver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m machine-config 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m monitoring 4.12.0-0.nightly-2022-09-28-204419 True False False 3h44m network 4.12.0-0.nightly-2022-09-28-204419 True False False 3h55m node-tuning 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m openshift-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-samples 4.12.0-0.nightly-2022-09-28-204419 True False False 116m operator-lifecycle-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m service-ca 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m storage 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-28-204419
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.11 cluster with only openshift-samples enabled 2. Upgrade to 4.12 3.
Actual results:
The 4.12 introduced optional CO console and insights are not upgraded to 4.12
Expected results:
All the installed COs get upgraded
Additional info:
This is a clone of issue OCPBUGS-7633. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1125. The following is the description of the original issue:
—
(originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=1983200)
test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]
is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1413625606435770368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1415075413717159936
—
some brief triaging from Thomas Jungblut on:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.11/1568747321334697984
it seems the last guard pod doesn't come up, etcd operator installs this properly and the revision installer also does not spout any errors. It just doesn't progress to the latest revision. At first glance doesn't look like an issue with etcd itself, but needs to be taken a closer look at for sure.
Description of problem:
Setting disableNetworkDiagnostics: true does not persist when network-operator pod gets re-created.
Version-Release number of selected component (if applicable):
4.11.0-rc.0
How reproducible:
100%
Steps to Reproduce:
1. $ oc patch network.operator.openshift.io/cluster --patch '{"spec":{"disableNetworkDiagnostics":true}}' --type=merge
network.operator.openshift.io/cluster patched
2. $ oc -n openshift-network-operator delete pods network-operator-9b68954c6-bclx6
pod "network-operator-9b68954c6-bclx6" deleted
3. $ oc get network.operator.openshift.io cluster -o json | jq .spec.disableNetworkDiagnostics
false
Actual results:
disableNetworkDiagnostics set to false, not set to true as configured in step 1
Expected results:
disableNetworkDiagnostics set to true
Additional info:
Attaching must-gather.
This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:
—
Description of problem:
It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.
Version-Release number of selected component (if applicable):
Openshift 4.8, Serverless Operator 1.27
Additional info:
https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019
See these threads https://coreos.slack.com/archives/G01F05P2PTL/p1645982017061749?thread_ts=1645970469.871559&cid=G01F05P2PTL for more information
OCPBUGS-1251 landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:
$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs $ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostr ing))' | sort | uniq -c parse error: Invalid numeric literal at line 29, column 6 28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]
Finding the source for that container:
$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image' quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed $ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9 io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler
Poking about in the source:
$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor
Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:
$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1 cluster-autoscaler/core/scaledown/actuation/actuator_test.go: policyv1beta1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/actuator_test.go: eviction := createAction.GetObject().(*policyv1beta1.Eviction) cluster-autoscaler/core/scaledown/actuation/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/legacy.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/wrapper.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/static_autoscaler_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"
The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.
Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic excercising the autoscaler that might have turned up this alert and issue.
This is a clone of issue OCPBUGS-8044. The following is the description of the original issue:
—
rhbz#2089199, backported to 4.11.5, shifted the etcd Grafana dashboard from the monitoring operator to the etcd operator. During the shift, the ConfigMap was renamed from grafana-dashboard-etcd to etcd-dashboard. However, we did not include logic for garbage-collecting the obsolete dasboard, so clusters that update from 4.11.1 and similar into 4.11.>=5 or 4.12+ currently end up with both the obsolete and new ConfigMaps. We should grow code to remove the obsolete ConfigMap.
4.11.>=5 and 4.12+ are currently exposed.
100%
1. Install 4.11.1.
2. Update to a release that defines the etcd-dashboard ConfigMap.
3. Check for etcd dashboards with oc -n openshift-config-managed get configmaps | grep etcd.
Both etcd-dashboard and grafana-dashboard-etcd exist:
$ oc -n openshift-config-managed get configmaps | grep etcd etcd-dashboard 1 196d grafana-dashboard-etcd 1 2y282d
Another example is 4.11.1 to 4.11.5 CI:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1570415394001260544/artifacts/e2e-aws-upgrade/configmaps.json | jq -r '.items[].metadata | select(.namespace == "openshift-config-managed" and (.name | contains("etcd"))) | .name' etcd-dashboard grafana-dashboard-etcd
Only etcd-dashboard still exists.
A new manifest for the outgoing ConfigMap that sets the release.openshift.io/delete: "true" annotation would ask the cluster-version operator to reap the obsolete ConfigMap.
libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2118318](https://bugzilla.redhat.com/show_bug.cgi?id=2118318). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2117569 +++
Description of problem:
The garbage collector resource quota controller must ignore ALL events; otherwise, if a rogue controller or a workload causes unbound event creation, performance will degrade as it has to process the events.
Fix: https://github.com/kubernetes/kubernetes/pull/110939
This bug is to track fix in master (4.12) and also allow to backport to 4.11.1
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
— Additional comment from Michal Fojtik on 2022-08-11 10:52:28 UTC —
I'm using FastFix here as we need to backport this to 4.11.1 to avoid support churn for busy clusters or clusters doing upgrades.
— Additional comment from ART BZ Bot on 2022-08-11 15:13:32 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from zhou ying on 2022-08-12 03:03:34 UTC —
checked the payload commit id , the payload 4.12.0-0.nightly-2022-08-11-191750 has container the fixed pr .
oc adm release info registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-08-11-191750 --commit-urls |grep hyperkube
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
hyperkube https://github.com/openshift/kubernetes/commit/da80cd038ee5c3b45ba36d4b48b42eb8a74439a3
commit da80cd038ee5c3b45ba36d4b48b42eb8a74439a3 (HEAD -> master, origin/release-4.13, origin/release-4.12, origin/master, origin/HEAD)
Merge: a9d6306a701 055b96e614a
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date: Thu Aug 11 15:13:05 2022 +0000
Merge pull request #1338 from benluddy/openshift-pick-110888
Bug 2117569: UPSTREAM: 110888: feat: fix a bug thaat not all event be ignored by gc controller
While running a PerfScale test we noticed that the hosted ovnkube-master pods always initially error on deployment. They eventually succeed on retry however.
This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster.
An example of the error in the ovnkube-master container:
```
F1102 13:27:51.935600 1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluster.local:9642: failed to open connection: dial tcp 10.131.8.25:9642: connect: connection refused. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.clusters-perf-pqd-0021.svc.cluste
```
This is a clone of issue OCPBUGS-855. The following is the description of the original issue:
—
Description of problem:
When setting the allowedregistries like the example below, the openshift-samples operator is degraded: oc get image.config.openshift.io/cluster -o yaml apiVersion: config.openshift.io/v1 kind: Image metadata: annotations: release.openshift.io/create-only: "true" creationTimestamp: "2020-12-16T15:48:20Z" generation: 2 name: cluster resourceVersion: "422284920" uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5 spec: additionalTrustedCA: name: registry-ca allowedRegistriesForImport: - domainName: quay.io insecure: false - domainName: registry.redhat.io insecure: false - domainName: registry.access.redhat.com insecure: false - domainName: registry.redhat.io/redhat/redhat-operator-index insecure: true - domainName: registry.redhat.io/redhat/redhat-marketplace-index insecure: true - domainName: registry.redhat.io/redhat/certified-operator-index insecure: true - domainName: registry.redhat.io/redhat/community-operator-index insecure: true registrySources: allowedRegistries: - quay.io - registry.redhat.io - registry.rijksapps.nl - registry.access.redhat.com - registry.redhat.io/redhat/redhat-operator-index - registry.redhat.io/redhat/redhat-marketplace-index - registry.redhat.io/redhat/certified-operator-index - registry.redhat.io/redhat/community-operator-index oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.10.21 True False False 5d13h baremetal 4.10.21 True False False 450d cloud-controller-manager 4.10.21 True False False 94d cloud-credential 4.10.21 True False False 624d cluster-autoscaler 4.10.21 True False False 624d config-operator 4.10.21 True False False 624d console 4.10.21 True False False 42d csi-snapshot-controller 4.10.21 True False False 31d dns 4.10.21 True False False 217d etcd 4.10.21 True False False 624d image-registry 4.10.21 True False False 94d ingress 4.10.21 True False False 94d insights 4.10.21 True False False 104s kube-apiserver 4.10.21 True False False 624d kube-controller-manager 4.10.21 True False False 624d kube-scheduler 4.10.21 True False False 624d kube-storage-version-migrator 4.10.21 True False False 31d machine-api 4.10.21 True False False 624d machine-approver 4.10.21 True False False 624d machine-config 4.10.21 True False False 17d marketplace 4.10.21 True False False 258d monitoring 4.10.21 True False False 161d network 4.10.21 True False False 624d node-tuning 4.10.21 True False False 31d openshift-apiserver 4.10.21 True False False 42d openshift-controller-manager 4.10.21 True False False 22d openshift-samples 4.10.21 True True True 31d Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "} operator-lifecycle-manager 4.10.21 True False False 624d operator-lifecycle-manager-catalog 4.10.21 True False False 624d operator-lifecycle-manager-packageserver 4.10.21 True False False 31d service-ca 4.10.21 True False False 624d storage 4.10.21 True False False 113d After applying the fix as described here( https://access.redhat.com/solutions/6547281 ) it is resolved: oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}' But according the the BZ this should be fixed in 4.10.3 https://bugzilla.redhat.com/show_bug.cgi?id=2027745 but the issue is still occur in our 4.10.21 cluster: oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.21 True False 31d Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Upgrade to 4.10 is stuck looping in syncEgressFirewall We see transacting operations with context deadline exceeded. It looks to be trying to process 2.8 million records is one go. 2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180 a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8 b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d 91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded 2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127 1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2 Everything is built into one operation via: https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243 TrandactAndCheck is being called with a 10s timeout and this operation never completes.
Version-Release number of selected component (if applicable):
4.10.50
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Upgrade completes
Additional info:
This is a clone of issue OCPBUGS-7409. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000
The controllers sometimes get stuck on listing members in failure scenarios, this is known and can be mitigated by simply restarting the CEO.
similar BZ 2093819 with stuck controllers was fixed slightly different in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103
This fix was contributed by lance5890, thanks a lot!
This is a clone of issue OCPBUGS-5879. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:
—
The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.
discovered in 4.10 -> 4.11 upgrade jobs
The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.
Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:
$ ./check-cvo.py cvo.log && echo PASS || echo FAIL
Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333 FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251 FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521 FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604 FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531 Upgradeability checks: 5 Upgradeability check cache hits: 12 FAIL
Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL Upgradeability checks: 12 Upgradeability check cache hits: 11 PASS
The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go ... I1227 06:50:59.023190 1 upgradeable.go:122] Cluster current version=4.10.46 I1227 06:50:59.042735 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:51:14.024345 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:53:23.080768 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:56:59.366010 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640 $ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12' Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ... Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.
Some context in Slack thread
This is a clone of issue OCPBUGS-212. The following is the description of the original issue:
—
Description of problem:
oc --context build02 get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-ec.1 True False 45h Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded oc --context build02 get co kube-controller-manager NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE kube-controller-manager 4.12.0-ec.1 True False True 2y87d GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.
This is a clone of issue OCPBUGS-858. The following is the description of the original issue:
—
Description of problem:
In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 in upgraded to 4.9, the packageserver stays stuck in v0.17.0, which is the version in OCP 4.8, and v0.18.3 does not roll out, which is the version in OCP 4.9
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.8 2. Upgrade to OCP 4.9 $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.8.0-0.nightly-2022-08-31-160214 True True 50m Working towards 4.9.47: 619 of 738 done (83% complete) $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.9.47 True False 4m26s Cluster version is 4.9.47
Actual results:
Check packageserver CSV. It's in v0.17.0 $ oc get csv NAME DISPLAY VERSION REPLACES PHASE packageserver Package Server 0.17.0 Succeeded
Expected results:
packageserver CSV is at 0.18.3
Additional info:
packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12 packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8
Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Create Loadbalancer type service within the OCP 4.11.x OVNKubernetes cluster to expose the api server endpoint, the service does not response for normal oc request. But some of them are working, like "oc whoami", "oc get --raw /api"
Version-Release number of selected component (if applicable):
4.11.8 with OVNKubernetes
How reproducible:
always
Steps to Reproduce:
1. Setup openshift cluster 4.11 on AWS with OVNKubernetes as the default network 2. Create the following service under openshift-kube-apiserver namespace to expose the api ---- apiVersion: v1 kind: Service metadata: annotations: service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1800" finalizers: - service.kubernetes.io/load-balancer-cleanup name: test-api namespace: openshift-kube-apiserver spec: allocateLoadBalancerNodePorts: true externalTrafficPolicy: Cluster internalTrafficPolicy: Cluster ipFamilies: - IPv4 ipFamilyPolicy: SingleStack loadBalancerSourceRanges: - <my_ip>/32 ports: - nodePort: 31248 port: 6443 protocol: TCP targetPort: 6443 selector: apiserver: "true" app: openshift-kube-apiserver sessionAffinity: None type: LoadBalancer 3. Setup the DNS resolution for the access xxx.mydomain.com ---> <elb-auto-generated-dns> 4. Try to access the cluster api via the service above by updating the kubeconfig to use the custom dns name
Actual results:
No response from the server side. $ time oc get node -v8 I1025 08:29:10.284069 103974 loader.go:375] Config loaded from file: bmeng.kubeconfig I1025 08:29:10.294017 103974 round_trippers.go:420] GET https://rh-api.bmeng-ccs-ovn.3o13.s1.devshift.org:6443/api/v1/nodes?limit=500 I1025 08:29:10.294035 103974 round_trippers.go:427] Request Headers: I1025 08:29:10.294043 103974 round_trippers.go:431] Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json I1025 08:29:10.294052 103974 round_trippers.go:431] User-Agent: oc/openshift (linux/amd64) kubernetes/e40bd2d I1025 08:29:10.365119 103974 round_trippers.go:446] Response Status: 200 OK in 71 milliseconds I1025 08:29:10.365142 103974 round_trippers.go:449] Response Headers: I1025 08:29:10.365148 103974 round_trippers.go:452] Audit-Id: 83b9d8ae-05a4-4036-bff6-de371d5bec12 I1025 08:29:10.365155 103974 round_trippers.go:452] Cache-Control: no-cache, private I1025 08:29:10.365161 103974 round_trippers.go:452] Content-Type: application/json I1025 08:29:10.365167 103974 round_trippers.go:452] X-Kubernetes-Pf-Flowschema-Uid: 2abc2e2d-ada3-4cb8-a86f-235df3a4e214 I1025 08:29:10.365173 103974 round_trippers.go:452] X-Kubernetes-Pf-Prioritylevel-Uid: 02f7a188-43c7-4827-af58-5ebe861a1891 I1025 08:29:10.365179 103974 round_trippers.go:452] Date: Tue, 25 Oct 2022 08:29:10 GMT ^C real 17m4.840s user 0m0.567s sys 0m0.163s However, it has the correct response if using --raw to request, eg: $ oc get --raw /api/v1 --kubeconfig bmeng.kubeconfig {"kind":"APIResourceList","groupVersion":"v1","resources":[{"name":"bindings","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"componentstatuses","singularName":"","namespaced":false,"kind":"ComponentStatus","verbs":["get","list"],"shortNames":["cs"]},{"name":"configmaps","singularName":"","namespaced":true,"kind":"ConfigMap","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["cm"],"storageVersionHash":"qFsyl6wFWjQ="},{"name":"endpoints","singularName":"","namespaced":true,"kind":"Endpoints","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ep"],"storageVersionHash":"fWeeMqaN/OA="},{"name":"events","singularName":"","namespaced":true,"kind":"Event","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ev"],"storageVersionHash":"r2yiGXH7wu8="},{"name":"limitranges","singularName":"","namespaced":true,"kind":"LimitRange","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["limits"],"storageVersionHash":"EBKMFVe6cwo="},{"name":"namespaces","singularName":"","namespaced":false,"kind":"Namespace","verbs":["create","delete","get","list","patch","update","watch"],"shortNames":["ns"],"storageVersionHash":"Q3oi5N2YM8M="},{"name":"namespaces/finalize","singularName":"","namespaced":false,"kind":"Namespace","verbs":["update"]},{"name":"namespaces/status","singularName":"","namespaced":false,"kind":"Namespace","verbs":["get","patch","update"]},{"name":"nodes","singularName":"","namespaced":false,"kind":"Node","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["no"],"storageVersionHash":"XwShjMxG9Fs="},{"name":"nodes/proxy","singularName":"","namespaced":false,"kind":"NodeProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"nodes/status","singularName":"","namespaced":false,"kind":"Node","verbs":["get","patch","update"]},{"name":"persistentvolumeclaims","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pvc"],"storageVersionHash":"QWTyNDq0dC4="},{"name":"persistentvolumeclaims/status","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["get","patch","update"]},{"name":"persistentvolumes","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pv"],"storageVersionHash":"HN/zwEC+JgM="},{"name":"persistentvolumes/status","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["get","patch","update"]},{"name":"pods","singularName":"","namespaced":true,"kind":"Pod","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["po"],"categories":["all"],"storageVersionHash":"xPOwRZ+Yhw8="},{"name":"pods/attach","singularName":"","namespaced":true,"kind":"PodAttachOptions","verbs":["create","get"]},{"name":"pods/binding","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"pods/ephemeralcontainers","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"pods/eviction","singularName":"","namespaced":true,"group":"policy","version":"v1","kind":"Eviction","verbs":["create"]},{"name":"pods/exec","singularName":"","namespaced":true,"kind":"PodExecOptions","verbs":["create","get"]},{"name":"pods/log","singularName":"","namespaced":true,"kind":"Pod","verbs":["get"]},{"name":"pods/portforward","singularName":"","namespaced":true,"kind":"PodPortForwardOptions","verbs":["create","get"]},{"name":"pods/proxy","singularName":"","namespaced":true,"kind":"PodProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"pods/status","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"podtemplates","singularName":"","namespaced":true,"kind":"PodTemplate","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"LIXB2x4IFpk="},{"name":"replicationcontrollers","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["rc"],"categories":["all"],"storageVersionHash":"Jond2If31h0="},{"name":"replicationcontrollers/scale","singularName":"","namespaced":true,"group":"autoscaling","version":"v1","kind":"Scale","verbs":["get","patch","update"]},{"name":"replicationcontrollers/status","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["get","patch","update"]},{"name":"resourcequotas","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["quota"],"storageVersionHash":"8uhSgffRX6w="},{"name":"resourcequotas/status","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["get","patch","update"]},{"name":"secrets","singularName":"","namespaced":true,"kind":"Secret","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"S6u1pOWzb84="},{"name":"serviceaccounts","singularName":"","namespaced":true,"kind":"ServiceAccount","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["sa"],"storageVersionHash":"pbx9ZvyFpBE="},{"name":"serviceaccounts/token","singularName":"","namespaced":true,"group":"authentication.k8s.io","version":"v1","kind":"TokenRequest","verbs":["create"]},{"name":"services","singularName":"","namespaced":true,"kind":"Service","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["svc"],"categories":["all"],"storageVersionHash":"0/CO1lhkEBI="},{"name":"services/proxy","singularName":"","namespaced":true,"kind":"ServiceProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"services/status","singularName":"","namespaced":true,"kind":"Service","verbs":["get","patch","update"]}]}
Expected results:
The normal oc request should be working.
Additional info:
There is no such issue for clusters with openshift-sdn with the same OpenShift version and same LoadBalancer service. We suspected that it might be related to the MTU setting, but this cannot explain why OpenShiftSDN works well. Another thing might be related is that the OpenShiftSDN is using iptables for service loadbalancing and OVN is dealing that within the OVN services.
Please let me know if any debug log/info is needed.
This bug is a backport clone of [Bugzilla Bug 2094174](https://bugzilla.redhat.com/show_bug.cgi?id=2094174). The following is the description of the original bug:
—
Created attachment 1887340
CVO log file
Description of problem:
Clearing upgrade after signature verification fails, ReleaseAccepted=False keeps complaining about the update cannot be verified blah blah.
,
,
,
,
,
{ "lastTransitionTime": "2022-06-07T01:56:17Z", "message": "Cluster version is 4.11.0-0.nightly-2022-06-06-025509", "status": "False", "type": "Progressing" }]
Version-Release number of the following components:
4.11.0-0.nightly-2022-06-06-025509
How reproducible:
1/1
Steps to Reproduce:
1. Upgrade to a fake release
2. Check ReleaseAccepted=False due to target image signature verification failure
ReleaseAccepted=False
Reason: RetrievePayload
Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100" failure=The update cannot be verified: unable to verify sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 against keyrings: verifier-public-key-redhat
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-06-04-014713 not found in the "stable-4.11" channel
3. Clear the upgrade
4. Check oc adm upgrade info
ReleaseAccepted=False
Reason: RetrievePayload
Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100" failure=The update cannot be verified: unable to verify sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f100 against keyrings: verifier-public-key-redhat
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-06-04-014713 not found in the "stable-4.11" channel
Actual results:
After upgrade is cleared, cv condition ReleaseAccepted keeps to false with message The update cannot be verified
Expected results:
After upgrade is cleared, cv condition ReleaseAccepted should stop complaining about the target image
Additional info:
Please attach logs from ansible-playbook with the -vvv flag
Description of problem:
Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.
Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.
Version-Release number of selected component (if applicable):
OCP 4.10.16
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Actual results:
provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.
Expected results:
provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)
Additional info:
As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.
Further, the kernel boot line for the first master node was updated for a static ip addresss assignment for further confirmation that the master node would successfully image this way which further confirming that the issue is the provisioning interface not receiving an ipv4 ip address from the dhcp server.
Description of problem:
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Enable UWM + dedicated UWM Alertmanager
2. Deploy an application + service monitor + alerting rule which fires always
3. Go to the OCP dev console and silence the alert.
Actual results:
Nothing happens
Expected results:
The alert notification is muted.
Additional info:
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2100860
When a thin provisioned COW format disk is created on OCP on RHV via CSI driver (a PVC -
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
But this is thin provisioned disk, so the initial size of the disk should be default of the engine and then grow as needed, it shouldn't be this big.
This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.
How reproducible: 100%
Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage i.e. 100Gi
$ oc create -f storage-claim.yaml
2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)
Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.
Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).
Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139
This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:
—
Description of problem:
When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message: the service account does not have any API secret sa=testx-ns/testx-sa This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303 - which was resolved in 4.11.0 - however, the new element now, is that the presence of a secret in the namespace is causing the issue. The name of the secret seems significant - suggesting something somewhere is depending on the order that secrets are listed in. For example, If the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an opaque secret present in the namespace, the operator install fails as described. I aught to be allowed to have an opaque secret present in the namespace where I am installing the operator.
Version-Release number of selected component (if applicable):
4.11.0 & 4.11.1
How reproducible:
100% reproducible
Steps to Reproduce:
1.Create namespace: oc new-project testx-ns 2. oc apply -f api-secret-issue.yaml
Actual results:
Expected results:
Additional info:
API YAML:
cat api-secret-issue.yaml
apiVersion: v1
kind: Secret
metadata:
name: zsecret
namespace: testx-ns
annotations:
kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
mykey: mypass
—
apiVersion: v1
kind: ServiceAccount
metadata:
name: testx-sa
namespace: testx-ns
—
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
name: testx-og
namespace: testx-ns
spec:
serviceAccountName: "testx-sa"
targetNamespaces:
- testx-ns
—
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: testx-role
namespace: testx-ns
rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: testx-rolebinding
namespace: testx-ns
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: testx-role
subjects:
—
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: etcd-operator
namespace: testx-ns
spec:
channel: singlenamespace-alpha
installPlanApproval: Automatic
name: etcd
source: community-operators
sourceNamespace: openshift-marketplace
OVS 2.17+ introduced an optimization of "weak references" to substantially speed up database snapshots. in some cases weak references may leak memory; to aforementioned commit fixes that and has been pulled into ovs2.17-62 and later.
Description of problem:
During restart egress firewall acls will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during restart
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-2438. The following is the description of the original issue:
—
Description of problem:
On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to Observe > Alerting pages 2. Click on an alert (or go to the rules tab then click on a rule) 3. Click on one of the underlined fields (those that have a popover help)
Actual results:
Expected results:
Additional info:
Since 4.11 OCP comes with OperatorHub definition which declares a capability
and enables all catalog sources. For OKD we want to enable just community-operators
as users may not have Red Hat pull secret set.
This commit would ensure that OKD version of marketplace operator gets
its own OperatorHub manifest with a custom set of operator catalogs enabled
This is a clone of issue OCPBUGS-6816. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6799. The following is the description of the original issue:
—
Description of problem:
The pipelines -> repositories list view in Dev Console does not show the running pipelineline as the last pipelinerun in the table.
Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408
This is a clone of issue OCPBUGS-8286. The following is the description of the original issue:
—
Description of problem:
mapi_machinehealthcheck_short_circuit is not properly reconciling the state, when a MachineHealthCheck is failing because of unhealthy Machines but then is removed. When doing two MachineSet (called blue and green and only one has running Machines at a specific point in time) with MachineAutoscaler and MachineHealthCheck, the mapi_machinehealthcheck_short_circuit will continue to report 1 for MachineHealth that actually was removed because of a switch from blue to green. $ oc get machineset | egrep 'blue|green' housiocp4-wvqbx-worker-blue-us-east-2a 0 0 2d17h housiocp4-wvqbx-worker-green-us-east-2a 1 1 1 1 2d17h $ oc get machineautoscaler NAME REF KIND REF NAME MIN MAX AGE worker-green-us-east-1a MachineSet housiocp4-wvqbx-worker-green-us-east-2a 1 4 2d17h $ oc get machinehealthcheck NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY machine-api-termination-handler 100% 0 0 worker-green-us-east-1a 40% 1 1 { "name": "machine-health-check-unterminated-short-circuit", "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml", "rules": [ { "state": "firing", "name": "MachineHealthCheckUnterminatedShortCircuit", "query": "mapi_machinehealthcheck_short_circuit == 1", "duration": 1800, "labels": { "severity": "warning" }, "annotations": { "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n", "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes" }, "alerts": [ { "labels": { "alertname": "MachineHealthCheckUnterminatedShortCircuit", "container": "kube-rbac-proxy-mhc-mtrc", "endpoint": "mhc-mtrc", "exported_namespace": "openshift-machine-api", "instance": "10.128.0.58:8444", "job": "machine-api-controllers", "name": "worker-blue-us-east-1a", "namespace": "openshift-machine-api", "pod": "machine-api-controllers-779dcb8769-8gcn6", "service": "machine-api-controllers", "severity": "warning" }, "annotations": { "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n", "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes" }, "state": "firing", "activeAt": "2022-12-09T15:59:25.1287541Z", "value": "1e+00" } ], "health": "ok", "evaluationTime": 0.000648129, "lastEvaluation": "2022-12-12T09:35:55.140174009Z", "type": "alerting" } ], "interval": 30, "limit": 0, "evaluationTime": 0.000661589, "lastEvaluation": "2022-12-12T09:35:55.140165629Z" }, As we can see above, worker-blue-us-east-1a is no longer available and active but rather worker-green-us-east-1a. But worker-blue-us-east-1a was there before the switch to green has happen and was actuall reporting some unhealthy Machines. But since it's now gone, mapi_machinehealthcheck_short_circuit should properly reconcile as otherwise this is a false/positive alert.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous version)
How reproducible:
- Always
Steps to Reproduce:
1. Setup OpenShift Container Platform 4 on AWS for example 2. Create blue and green MachineSet with MachineAutoScaler and MachineHealthCheck 3. Have active Machines for blue only 4. Trigger unhealthy Machines in blue MachineSet 5. Switch to green MachineSet, by removing MachineHealthCheck, MachineAutoscaler and setting replicate of blue MachineSet to 0 6. Create green MachineHealthCheck, MachineAutoscaler and scale geen MachineSet to 1 7. Observe how mapi_machinehealthcheck_short_circuit continues to report unhealthy state for blue MachineHealthCheck which no longer exists.
Actual results:
mapi_machinehealthcheck_short_circuit reporting problematic MachineHealthCheck even though the faulty MachineHealthCheck does no longer exist.
Expected results:
mapi_machinehealthcheck_short_circuit to properly reconcile it's state and remove MachineHealthChecks that have been removed on OpenShift Container Platform level
Additional info:
It kind of looks like similar to the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 respectively https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be super relevant)
Description of problem:
When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update. However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.
Steps to Reproduce:
1. 2. 3.
Actual results:
Port still exists in OVN, IP remains allocated for a deleted pod.
Expected results:
IP should be freed, port should be removed from OVN.
Additional info:
Description of problem:
When scaling down the machineSet for worker nodes, a PV(vmdk) file got deleted.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
N/A
Steps to Reproduce:
1. Scale down worker nodes 2. Check VMware logs and VM gets deleted with vmdk still attached
Actual results:
After scaling down nodes, volumes still attached to the VM get deleted alongside the VM
Expected results:
Worker nodes scaled down without any accidental deletion
Additional info:
This is a clone of issue OCPBUGS-3021. The following is the description of the original issue:
—
Description of problem:
me-west1 is not listed in the survey
Steps to Reproduce:
1. run survey (openshift-install create install-config without an install config file) 2. go through prompts until regions 3.
Actual results:
me-west1 region is missing
Expected results:
me-west1 region is listed (and install succeeds in the region)
Just like kube proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:
—
Description of problem:
etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with Telco DU profile applied 2. Create multiple pods with local storage volumes attached(attaching yaml manifest) 3. Force delete and re-create pods 10 times
Actual results:
etcd and kube-apiserver pods get restarted, making to cluster unavailable for a period of time
Expected results:
etcd and kube-apiserver do not get restarted
Additional info:
Attaching must-gather. Please let me know if any additional info is required. Thank you!
Description of problem:
release-4.11 of openshift/cloud-provider-openstack is missing some commits that were backported in upstream project into the release-1.24 branch. We should import them in our downstream fork.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4489. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:
—
Description of problem:
Prometheus continuously restarts due to slow WAL replay
Version-Release number of selected component (if applicable):
openshift - 4.11.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: Issue described in following issue: https://github.com/openshift/multus-admission-controller/issues/40
Fixed in: https://github.com/openshift/cluster-network-operator/pull/1515
Version-Release number of selected component (if applicable): OCP 4.10
Official Red Hat tracker. Issue has been merged already.
This is a clone of issue OCPBUGS-1629. The following is the description of the original issue:
—
Description of problem:
It is a disconnected cluster on AWS. There is an issue configuring Egress IP where the cluster uses STS. While looking into cloud-network-config-controller pod it is trying to connect to the global sts service "https://sts.amazonaws.com/" rather it should connect to the regional one "https://ec2.ap-southeast-1.amazonaws.com".
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a disconected OCP cluster on AWS.
$ oc get netnamespace | grep egress egress-ip-test 2689387 ["172.16.1.24"]
$ oc get hostsubnet NAME HOST HOST IP SUBNET EGRESS CIDRS EGRESS IPS ip-172-16-1-151.ap-southeast-1.compute.internal ip-172-16-1-151.ap-southeast-1.compute.internal 172.16.1.151 10.130.0.0/23 ip-172-16-1-53.ap-southeast-1.compute.internal ip-172-16-1-53.ap-southeast-1.compute.internal 172.16.1.53 10.131.0.0/23 ["172.16.1.24"] ip-172-16-2-15.ap-southeast-1.compute.internal ip-172-16-2-15.ap-southeast-1.compute.internal 172.16.2.15 10.128.0.0/23 ip-172-16-2-77.ap-southeast-1.compute.internal ip-172-16-2-77.ap-southeast-1.compute.internal 172.16.2.77 10.128.2.0/23 ip-172-16-3-111.ap-southeast-1.compute.internal ip-172-16-3-111.ap-southeast-1.compute.internal 172.16.3.111 10.129.0.0/23 ip-172-16-3-79.ap-southeast-1.compute.internal ip-172-16-3-79.ap-southeast-1.compute.internal 172.16.3.79 10.129.2.0/23
$ oc logs sdn-controller-6m5kb -n openshift-sdn I0922 04:09:53.348615 1 vnids.go:105] Allocated netid 2689387 for namespace "egress-ip-test" E0922 04:24:00.682018 1 egressip.go:254] Ignoring invalid HostSubnet ip-172-16-1-53.ap-southeast-1.compute.internal (host: "ip-172-16-1-53.ap-southeast-1.compute.internal", ip: "172.16.1.53", subnet: "10.131.0.0/23"): related node object "ip-172-16-1-53.ap-southeast-1.compute.internal" has an incomplete annotation "cloud.network.openshift.io/egress-ipconfig", CloudEgressIPConfig: <nil>
$ oc logs cloud-network-config-controller-5c7556db9f-x78bs -n openshift-cloud-network-config-controller E0922 04:26:59.468726 1 controller.go:165] error syncing 'ip-172-16-2-77.ap-southeast-1.compute.internal': error retrieving the private IP configuration for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: error: cannot list ec2 instance for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout, requeuing in node workqueue
$ oc get Infrastructure -o yaml apiVersion: v1 items: - apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2022-09-22T03:28:15Z" generation: 1 name: cluster resourceVersion: "598" uid: 994da301-2a96-43b7-b43b-4b7c18d4b716 spec: cloudConfig: name: "" platformSpec: aws: serviceEndpoints: - name: sts url: https://sts.ap-southeast-1.amazonaws.com - name: ec2 url: https://ec2.ap-southeast-1.amazonaws.com - name: elasticloadbalancing url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com type: AWS status: apiServerInternalURI: https://api-int.openshiftyy.ocpaws.sadiqueonline.com:6443 apiServerURL: https://api.openshiftyy.ocpaws.sadiqueonline.com:6443 controlPlaneTopology: HighlyAvailable etcdDiscoveryDomain: "" infrastructureName: openshiftyy-wfrpf infrastructureTopology: HighlyAvailable platform: AWS platformStatus: aws: region: ap-southeast-1 serviceEndpoints: - name: ec2 url: https://ec2.ap-southeast-1.amazonaws.com - name: elasticloadbalancing url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com - name: sts url: https://sts.ap-southeast-1.amazonaws.com type: AWS kind: List metadata: resourceVersion: ""
$ oc get secret aws-cloud-credentials -n openshift-machine-api -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-machine-api-aws-cloud-credentials web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credential-operator-iam-ro-creds -n openshift-cloud-credential-operator -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-credential-operator-cloud-creden web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-image-registry-installer-cloud-credent web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-ingress-operator -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-ingress-operator-cloud-credentials web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-cloud-network-config-controller -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-network-config-controller-cloud- web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret ebs-cloud-credentials -n openshift-cluster-csi-drivers -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cluster-csi-drivers-ebs-cloud-credenti web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
Actual results:
Egress IP not configured properly and cloud-network-config-controller trying to connect to global STS service.
Expected results:
Egress IP should get configured and cloud-network-config-controller should connect to regional STS service instead of global.
Additional info:
This is a clone of issue OCPBUGS-1237. The following is the description of the original issue:
—
job=pull-ci-openshift-origin-master-e2e-gcp-builds=all
This test has started permafailing on e2e-gcp-builds:
[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]
The error in the test says
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container
This bug is a backport clone of [Bugzilla Bug 2111686](https://bugzilla.redhat.com/show_bug.cgi?id=2111686). The following is the description of the original bug:
—
Description of problem:
While using the console web UI with a nanokube cluster at our hack week in June 2022, we found different NPEs when the project or build status wasn't set.
Version-Release number of selected component (if applicable):
4.12 (but maybe all)
How reproducible:
Always, but only when running a nanokube cluster
Steps to Reproduce:
1. Clone and build https://github.com/RedHatInsights/nanokube
export KUBEBUILDER_ASSETS=`~/go/bin/setup-envtest use 1.21 -p path`
./nanokube
2. Run our console web UI against this cluster!
Actual results:
Crashes while navigate to the project list or details page.
Crash on the build config list page.
Expected results:
No crashes, or at least not the described.
Additional info:
This is a clone of issue OCPBUGS-262. The following is the description of the original issue:
—
github rate limit failures for upi image downloading govc.
https://search.ci.openshift.org/?search=build+error%3A+error+building+at+STEP+%22RUN+curl+-L+-o&ma