-
Notifications
You must be signed in to change notification settings - Fork 389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add e2e test case for SecondaryNetwork of SR-IOV type #6922
base: main
Are you sure you want to change the base?
Conversation
758c687
to
1f77538
Compare
- cidr: "172.31.22.0/24" | ||
subnetInfo: | ||
gateway: "172.31.16.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are these IPs from? or it's totally a private CIDR which can be defined by us without any limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
t.Run("testCreateTestPodOnNode", func(t *testing.T) { | ||
err := testData.createPods(t, e2eTestData.GetTestNamespace()) | ||
assert.NoError(t, err) | ||
}) | ||
t.Run("testAssignPodIP", func(t *testing.T) { | ||
err := testData.assignIP() | ||
if err != nil { | ||
t.Fatalf("Error when assign IP to ec2 instance: %v", err) | ||
} | ||
}) | ||
t.Run("testpingBetweenInterfaces", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are 3 steps of a single test, shouldn't be 3 tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I saw the testSecondaryNetwork
function use two tests and updated them.
aea2610
to
62000e7
Compare
install_kubernetes() { | ||
local node_ip=$1 | ||
echo "Installing Kubernetes on node $node_ip..." | ||
ssh -o StrictHostKeyChecking=no -i "$AWS_EC2_SSH_KEY_NAME" ubuntu@$node_ip << EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be realistic to use the Ansible playbook we already have (over SSH) instead of having that long list of commands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that integrating Ansible involves significant changes and requires considerable effort. Given the scope of these modifications, it would be more efficient to implement them in a separate pull request (PR).
return out.String(), nil | ||
} | ||
|
||
func (data *testData) assignIP() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good if we could find a way to "isolate" the AWS-specific stuff in a single function and have the test script provide an "aws" flag when running the tests.
The function can iterate over the Pods, retrieve the Node, lookup the eni label and perform the IP assignment, all in one place. We don't need to build the mapping from Pod to eni ahead of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
} | ||
|
||
// createNetworkAttachmentDefinition creates a NetworkAttachmentDefinition in the specified namespace. | ||
func createNetworkAttachmentDefinition(client networkclient.Interface, namespace, name, config string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we let the script create the NetworkAttachmentDefinition with kubectl, like we do in ci/kind/test-secondary-network-kind.sh
. This is also how we create the IPPool. I don't see anything "dynamic" here so I am not sure why we are doing it programmatically?
We should probably follow the existing model and create it in the default namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Namespace where the test Pod is located is randomly generated, the NetworkAttachmentDefinition needs to be in the same Namespace. Otherwise, it won`t work.
277ec59
to
b5e18b1
Compare
RUN_ALL=false | ||
shift | ||
;; | ||
--cleanup-only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another cleanup job in job yaml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no separate cleanup job. When the cleanup-only
parameter is specified, it allows for deleting specified EC2 resources in an AWS cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then how can we ensure that the cleanup task always runs, regardless of whether the test function exits as expected? I didn't see any trap-like design either. We should prevent cloud resource leaks especially for unexpected job failure, as they may cause continuous costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running tests, the EC2 resources are automatically deleted if they succeed. However, if the tests fail, it's necessary to retain the testing environment for troubleshooting purposes and manually delete the EC2 resources afterward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the normal way to run Antrea CI, we had related discussion in #5734 (comment) . I’m not sure if I missed any discussion about this feature, which allows users to log in to the testbed for debugging instead of collecting logs? And it's difficult to ensure that the developer can manually delete the resource each time unexpected failures occur. cc @tnqn @antoninbas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure that the clean up always happens by default, even when the test fails. Otherwise we are very likely to leak resources in the AWS account when running the Jenkins job.
It is ok to request for clean up to be skipped when running the script manually. That should already be possible , as I can do:
./ci/test-sriov-secondary-network-aws.sh --setup-only
# Inspect the testbed, maybe run the tests manually with go test again
./ci/test-sriov-secondary-network-aws.sh --cleanup-only
c0b1757
to
4b2a25a
Compare
"ippools": ["pool1"] | ||
} | ||
}` | ||
err = awsTestData.createNetworkAttachmentDefinition(e2eTestData.GetTestNamespace(), "sriov-net1", configJSON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revisit my earlier comment about creating the NetworkAttachmentDefinition from sriov-secondary-networks.yml
and removing createNetworkAttachmentDefinition
. I don't see any reason why this would work for the VLAN tests and not for the SRIOV tests. It should be possible to have the NetworkAttachmentDefinition in the default Namespace and the Pods in the test Namespace. This is what we have for the VLAN e2e tests, and the "k8s.v1.cni.cncf.io/networks"
annotation should work the same independently of the network type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
9299b31
to
0610d8c
Compare
ci/jenkins/jobs/projects-cloud.yaml
Outdated
@@ -1608,3 +1608,60 @@ | |||
default-excludes: true | |||
fingerprint: false | |||
only-if-success: false | |||
- 'cloud-{name}-{test_name}': | |||
disabled: false | |||
triggers: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this Jenkins job triggered by GitHub comments or a timer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is a relatively resource-intensive pipeline, triggering it via comments or scheduled tasks may not be ideal. Would it be more appropriate for the project maintainers to trigger it as needed? cc @antoninbas @tnqn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let @luolanzone comment as well but maybe we could the following:
- run the test weekly - of course it's very important for resource cleanup to always work consistently as we don't want to leak EC2 instances
- trigger the job manually from the Jenkins UI prior to merging a PR which impacts secondary network support, and potentially prior to releasing a new Antrea version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have a comment trigger to allow users to trigger the e2e tests on demand. Can we validate who trigger it and only pass through when the name is in the whitelist If we are concerning about the resource impact?
@XinShuYang any suggestion for the name validator?
55a8ec8
to
4508b2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments, in particular for the bash script, but the Go code is mostly looking good to me
ci/jenkins/jobs/projects-cloud.yaml
Outdated
@@ -1608,3 +1608,60 @@ | |||
default-excludes: true | |||
fingerprint: false | |||
only-if-success: false | |||
- 'cloud-{name}-{test_name}': | |||
disabled: false | |||
triggers: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let @luolanzone comment as well but maybe we could the following:
- run the test weekly - of course it's very important for resource cleanup to always work consistently as we don't want to leak EC2 instances
- trigger the job manually from the Jenkins UI prior to merging a PR which impacts secondary network support, and potentially prior to releasing a new Antrea version
RUN_ALL=false | ||
shift | ||
;; | ||
--cleanup-only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure that the clean up always happens by default, even when the test fails. Otherwise we are very likely to leak resources in the AWS account when running the Jenkins job.
It is ok to request for clean up to be skipped when running the script manually. That should already be possible , as I can do:
./ci/test-sriov-secondary-network-aws.sh --setup-only
# Inspect the testbed, maybe run the tests manually with go test again
./ci/test-sriov-secondary-network-aws.sh --cleanup-only
chmod -R g-w build/images/ovs | ||
chmod -R g-w build/images/base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not have to run these commands on the local machine. What are they for? Are they required for Jenkins?
cc @XinShuYang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas It seems they were first introduced in #2009. Do you still remember the reason behind it? According to the comments, it was implemented to address potential Docker caching issues.
073f95c
to
323e957
Compare
ci/jenkins/jobs/projects-cloud.yaml
Outdated
@@ -1608,3 +1608,137 @@ | |||
default-excludes: true | |||
fingerprint: false | |||
only-if-success: false | |||
- 'cloud-{name}-{test_name}-aws': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use cloud-{name}-{test_name}-cleanup
for all related templates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The template uses the CLUSTERNAME
to delete a cluster, which is not working for deleting the EC2 in AWS.
- job-template:
name: 'cloud-{name}-{test_name}-cleanup'
node: '{node}'
triggers: '{triggers}'
builders: '{builders}'
parameters:
- string:
default: ''
description: The cluster name of the last finished build.
name: CLUSTERNAME
trim: 'false'
ci/jenkins/jobs/projects-cloud.yaml
Outdated
@@ -1608,3 +1608,60 @@ | |||
default-excludes: true | |||
fingerprint: false | |||
only-if-success: false | |||
- 'cloud-{name}-{test_name}': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the comment to trigger this job? I didn't see it's being defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test-sriov-secondary-network-e2e
https://jenkins.antrea.io/job/antrea-sriov-secondary-network-e2e-for-pull-request/2
ci/jenkins/jobs/projects-cloud.yaml
Outdated
@@ -1608,3 +1608,60 @@ | |||
default-excludes: true | |||
fingerprint: false | |||
only-if-success: false | |||
- 'cloud-{name}-{test_name}': | |||
disabled: false | |||
triggers: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have a comment trigger to allow users to trigger the e2e tests on demand. Can we validate who trigger it and only pass through when the name is in the whitelist If we are concerning about the resource impact?
@XinShuYang any suggestion for the name validator?
/test-sriov-secondary-network-e2e |
/test-sriov |
/test-sriov-secondary-network-e2e |
4 similar comments
/test-sriov-secondary-network-e2e |
/test-sriov-secondary-network-e2e |
/test-sriov-secondary-network-e2e |
/test-sriov-secondary-network-e2e |
0ab381e
to
614ae06
Compare
/test-sriov-secondary-network-e2e |
ci/jenkins/jobs/projects-cloud.yaml
Outdated
org_list: '{antrea_org_list}' | ||
white_list: '{antrea_white_list}' | ||
only_trigger_phrase: true | ||
trigger_permit_all: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trigger_permit_all: true
means all users can trigger this job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Updated.
614ae06
to
2d21d24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, a few nits
# Function to install Kubernetes on a node | ||
function install_kubernetes() { | ||
local node_ip=$1 | ||
until ssh -o StrictHostKeyChecking=no -i "$AWS_EC2_SSH_KEY_NAME" "ubuntu@$node_ip" exit; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the script will be blocked here if there is a ssh issue, please consider to use timeout
and fail the installation if a reasonable time is passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@antoninbas @tnqn can you take a look again? Thanks |
/test-sriov-secondary-network-e2e |
1 similar comment
/test-sriov-secondary-network-e2e |
a93c5d2
to
a31d393
Compare
/test-sriov-secondary-network-e2e |
1 similar comment
/test-sriov-secondary-network-e2e |
48eefe5
to
6ad8de2
Compare
/test-sriov-secondary-network-e2e |
1 similar comment
/test-sriov-secondary-network-e2e |
0f1e295
to
7137b41
Compare
Signed-off-by: Wenqi Qiu <[email protected]>
7137b41
to
6ab5756
Compare
/test-sriov-secondary-network-e2e |
Add a script to create a Kubernetes testbed on AWS VM Nodes with SR-IOV interface configured.
Add e2e test codes for SR-IOV case to validate the traffic between two Pods with secondary network enabled.
Add a Jenkins job named
cloud-antrea-test-aws-sriov-secondary-network
in Jenkins service, and add github comment trigger to trigger the e2e tests for SecondaryNetwork with SR-IOV on demand.The complete test process record:
https://jenkins.antrea.io/job/cloud-antrea-test-aws-sriov-secondary-network/16/