Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add e2e test case for SecondaryNetwork of SR-IOV type #6922

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wenqiq
Copy link
Contributor

@wenqiq wenqiq commented Jan 14, 2025

  • Add a script to create a Kubernetes testbed on AWS VM Nodes with SR-IOV interface configured.

  • Add e2e test codes for SR-IOV case to validate the traffic between two Pods with secondary network enabled.

  • Add a Jenkins job named cloud-antrea-test-aws-sriov-secondary-network in Jenkins service, and add github comment trigger to trigger the e2e tests for SecondaryNetwork with SR-IOV on demand.

The complete test process record:
https://jenkins.antrea.io/job/cloud-antrea-test-aws-sriov-secondary-network/16/

@luolanzone luolanzone added this to the Antrea v2.3 release milestone Jan 15, 2025
@luolanzone luolanzone added the area/secondary-network Issues or PRs related to support for secondary networks in Antrea label Jan 15, 2025
@wenqiq wenqiq force-pushed the add-sriov-e2e branch 4 times, most recently from 758c687 to 1f77538 Compare January 20, 2025 03:06
@wenqiq wenqiq changed the title [WIP]Add e2e test case for SecondaryNetwork of SR-IOV type Add e2e test case for SecondaryNetwork of SR-IOV type Jan 20, 2025
@wenqiq wenqiq marked this pull request as ready for review January 21, 2025 00:59
@luolanzone luolanzone requested review from antoninbas and tnqn January 21, 2025 07:33
Comment on lines 101 to 103
- cidr: "172.31.22.0/24"
subnetInfo:
gateway: "172.31.16.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are these IPs from? or it's totally a private CIDR which can be defined by us without any limitation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Comment on lines 422 to 432
t.Run("testCreateTestPodOnNode", func(t *testing.T) {
err := testData.createPods(t, e2eTestData.GetTestNamespace())
assert.NoError(t, err)
})
t.Run("testAssignPodIP", func(t *testing.T) {
err := testData.assignIP()
if err != nil {
t.Fatalf("Error when assign IP to ec2 instance: %v", err)
}
})
t.Run("testpingBetweenInterfaces", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are 3 steps of a single test, shouldn't be 3 tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I saw the testSecondaryNetwork function use two tests and updated them.

@wenqiq wenqiq force-pushed the add-sriov-e2e branch 2 times, most recently from aea2610 to 62000e7 Compare January 27, 2025 08:04
install_kubernetes() {
local node_ip=$1
echo "Installing Kubernetes on node $node_ip..."
ssh -o StrictHostKeyChecking=no -i "$AWS_EC2_SSH_KEY_NAME" ubuntu@$node_ip << EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be realistic to use the Ansible playbook we already have (over SSH) instead of having that long list of commands?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that integrating Ansible involves significant changes and requires considerable effort. Given the scope of these modifications, it would be more efficient to implement them in a separate pull request (PR).

return out.String(), nil
}

func (data *testData) assignIP() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good if we could find a way to "isolate" the AWS-specific stuff in a single function and have the test script provide an "aws" flag when running the tests.

The function can iterate over the Pods, retrieve the Node, lookup the eni label and perform the IP assignment, all in one place. We don't need to build the mapping from Pod to eni ahead of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

}

// createNetworkAttachmentDefinition creates a NetworkAttachmentDefinition in the specified namespace.
func createNetworkAttachmentDefinition(client networkclient.Interface, namespace, name, config string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we let the script create the NetworkAttachmentDefinition with kubectl, like we do in ci/kind/test-secondary-network-kind.sh. This is also how we create the IPPool. I don't see anything "dynamic" here so I am not sure why we are doing it programmatically?

We should probably follow the existing model and create it in the default namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Namespace where the test Pod is located is randomly generated, the NetworkAttachmentDefinition needs to be in the same Namespace. Otherwise, it won`t work.

@wenqiq wenqiq force-pushed the add-sriov-e2e branch 2 times, most recently from 277ec59 to b5e18b1 Compare February 11, 2025 05:37
RUN_ALL=false
shift
;;
--cleanup-only)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another cleanup job in job yaml?

Copy link
Contributor Author

@wenqiq wenqiq Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no separate cleanup job. When the cleanup-only parameter is specified, it allows for deleting specified EC2 resources in an AWS cluster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then how can we ensure that the cleanup task always runs, regardless of whether the test function exits as expected? I didn't see any trap-like design either. We should prevent cloud resource leaks especially for unexpected job failure, as they may cause continuous costs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running tests, the EC2 resources are automatically deleted if they succeed. However, if the tests fail, it's necessary to retain the testing environment for troubleshooting purposes and manually delete the EC2 resources afterward.

Copy link
Contributor

@XinShuYang XinShuYang Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the normal way to run Antrea CI, we had related discussion in #5734 (comment) . I’m not sure if I missed any discussion about this feature, which allows users to log in to the testbed for debugging instead of collecting logs? And it's difficult to ensure that the developer can manually delete the resource each time unexpected failures occur. cc @tnqn @antoninbas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that the clean up always happens by default, even when the test fails. Otherwise we are very likely to leak resources in the AWS account when running the Jenkins job.
It is ok to request for clean up to be skipped when running the script manually. That should already be possible , as I can do:

./ci/test-sriov-secondary-network-aws.sh --setup-only
# Inspect the testbed, maybe run the tests manually with go test again
./ci/test-sriov-secondary-network-aws.sh --cleanup-only

"ippools": ["pool1"]
}
}`
err = awsTestData.createNetworkAttachmentDefinition(e2eTestData.GetTestNamespace(), "sriov-net1", configJSON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revisit my earlier comment about creating the NetworkAttachmentDefinition from sriov-secondary-networks.yml and removing createNetworkAttachmentDefinition. I don't see any reason why this would work for the VLAN tests and not for the SRIOV tests. It should be possible to have the NetworkAttachmentDefinition in the default Namespace and the Pods in the test Namespace. This is what we have for the VLAN e2e tests, and the "k8s.v1.cni.cncf.io/networks" annotation should work the same independently of the network type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -1608,3 +1608,60 @@
default-excludes: true
fingerprint: false
only-if-success: false
- 'cloud-{name}-{test_name}':
disabled: false
triggers: []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Jenkins job triggered by GitHub comments or a timer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a relatively resource-intensive pipeline, triggering it via comments or scheduled tasks may not be ideal. Would it be more appropriate for the project maintainers to trigger it as needed? cc @antoninbas @tnqn

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let @luolanzone comment as well but maybe we could the following:

  • run the test weekly - of course it's very important for resource cleanup to always work consistently as we don't want to leak EC2 instances
  • trigger the job manually from the Jenkins UI prior to merging a PR which impacts secondary network support, and potentially prior to releasing a new Antrea version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a comment trigger to allow users to trigger the e2e tests on demand. Can we validate who trigger it and only pass through when the name is in the whitelist If we are concerning about the resource impact?
@XinShuYang any suggestion for the name validator?

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments, in particular for the bash script, but the Go code is mostly looking good to me

@@ -1608,3 +1608,60 @@
default-excludes: true
fingerprint: false
only-if-success: false
- 'cloud-{name}-{test_name}':
disabled: false
triggers: []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let @luolanzone comment as well but maybe we could the following:

  • run the test weekly - of course it's very important for resource cleanup to always work consistently as we don't want to leak EC2 instances
  • trigger the job manually from the Jenkins UI prior to merging a PR which impacts secondary network support, and potentially prior to releasing a new Antrea version

RUN_ALL=false
shift
;;
--cleanup-only)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that the clean up always happens by default, even when the test fails. Otherwise we are very likely to leak resources in the AWS account when running the Jenkins job.
It is ok to request for clean up to be skipped when running the script manually. That should already be possible , as I can do:

./ci/test-sriov-secondary-network-aws.sh --setup-only
# Inspect the testbed, maybe run the tests manually with go test again
./ci/test-sriov-secondary-network-aws.sh --cleanup-only

Comment on lines +351 to +363
chmod -R g-w build/images/ovs
chmod -R g-w build/images/base
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not have to run these commands on the local machine. What are they for? Are they required for Jenkins?

cc @XinShuYang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas It seems they were first introduced in #2009. Do you still remember the reason behind it? According to the comments, it was implemented to address potential Docker caching issues.

@@ -1608,3 +1608,137 @@
default-excludes: true
fingerprint: false
only-if-success: false
- 'cloud-{name}-{test_name}-aws':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use cloud-{name}-{test_name}-cleanup for all related templates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The template uses the CLUSTERNAME to delete a cluster, which is not working for deleting the EC2 in AWS.

- job-template:
     name: 'cloud-{name}-{test_name}-cleanup'
     node: '{node}'
     triggers: '{triggers}'
     builders: '{builders}'
     parameters:
     - string:
         default: ''
         description: The cluster name of the last finished build.
         name: CLUSTERNAME
         trim: 'false'

@@ -1608,3 +1608,60 @@
default-excludes: true
fingerprint: false
only-if-success: false
- 'cloud-{name}-{test_name}':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the comment to trigger this job? I didn't see it's being defined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1608,3 +1608,60 @@
default-excludes: true
fingerprint: false
only-if-success: false
- 'cloud-{name}-{test_name}':
disabled: false
triggers: []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a comment trigger to allow users to trigger the e2e tests on demand. Can we validate who trigger it and only pass through when the name is in the whitelist If we are concerning about the resource impact?
@XinShuYang any suggestion for the name validator?

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov

@XinShuYang
Copy link
Contributor

/test-sriov-secondary-network-e2e

4 similar comments
@XinShuYang
Copy link
Contributor

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 19, 2025

/test-sriov-secondary-network-e2e

org_list: '{antrea_org_list}'
white_list: '{antrea_white_list}'
only_trigger_phrase: true
trigger_permit_all: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trigger_permit_all: true means all users can trigger this job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated.

Copy link
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, a few nits

# Function to install Kubernetes on a node
function install_kubernetes() {
local node_ip=$1
until ssh -o StrictHostKeyChecking=no -i "$AWS_EC2_SSH_KEY_NAME" "ubuntu@$node_ip" exit; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the script will be blocked here if there is a ssh issue, please consider to use timeout and fail the installation if a reasonable time is passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@luolanzone
Copy link
Contributor

@antoninbas @tnqn can you take a look again? Thanks

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

1 similar comment
@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

1 similar comment
@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

1 similar comment
@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

@wenqiq
Copy link
Contributor Author

wenqiq commented Feb 20, 2025

/test-sriov-secondary-network-e2e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/secondary-network Issues or PRs related to support for secondary networks in Antrea
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants