
General Retrospective for January 2025 Releases #64

Open
8 tasks
adamfarley opened this issue Nov 18, 2024 · 7 comments

Summary

A retrospective for all efforts surrounding the titular releases.

All community members are welcome to contribute to the agenda via comments below.

This will be a virtual meeting after the release, with at least a week of notice in the #release Slack channel.

On the day of the meeting we'll review the agenda and add a list of actions at the end.

Invited: Everyone.

Time, Date, and URL

Time:
Date:
URL:

Details

Retrospective Owner Tasks (in order):

  • Post the retro URL in #release around the start of the new release.
  • Wait until most builds are released, with no signs of a respin.
  • Announce the retrospective's date and time in #release a week in advance.
  • Host the retrospective:
    • Go through the agenda.
    • Create a list of actions.
  • Process each action:
    • Create a "WIP" issue including the source comment.
    • Add the issue to the current iteration.
    • Add an issue link to the action list.
  • Create a new retrospective issue for the next release.
  • Set a calendar reminder so you remember to do step 1 before the next release.
  • Close this issue.

TLDR

Add proposed agenda items as comments below.

@adamfarley adamfarley self-assigned this Nov 18, 2024
@adamfarley adamfarley changed the title General Retrospective for January 2024 Releases General Retrospective for January 2025 Releases Nov 18, 2024
@Haroon-Khel (Contributor) commented Jan 15, 2025

Release pipelines with errors in their downstream jobs cannot generate the release summary report in TRSS.
As an example https://trss.adoptium.net/resultSummary?parentId=6780fa67f66194006d2f37b1 which corresponds to https://ci.adoptium.net/job/build-scripts/job/release-openjdk11-pipeline/51/

SL/Jan15: I think this is related to the number of failures: it grows to the point where the report essentially exceeds the character limit. Pipelines with some errors can still generate a report; pipelines with a massive amount of information to 'share' exceed the limit. aqa-test-tools/issues/xxxx is to print out a message if the limit is hit (so the user knows no report will be generated).
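A minimal sketch of the guard suggested above, assuming a hypothetical character limit and report variable (the real limit lives in the TRSS report generator, not here):

```shell
# Hypothetical guard: warn instead of silently skipping the report when the
# generated summary exceeds a posting limit. LIMIT and the report content are
# illustrative stand-ins, not TRSS's actual values.
LIMIT=4000
report=$(printf 'x%.0s' $(seq 1 5000))   # stand-in for a 5000-char report
if [ ${#report} -gt $LIMIT ]; then
  echo "Release summary (${#report} chars) exceeds the ${LIMIT}-char limit; no report will be generated."
fi
```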

@Haroon-Khel (Contributor) commented Jan 15, 2025

https://adoptium.slack.com/archives/CLCFNV2JG/p1736788317222669?thread_ts=1736429650.249329&cid=CLCFNV2JG

Regarding the release trigger getting confused by the tags, the picture below shows an example of how this can happen:

[image: screenshot of the repository's tag list]

The top three entries are fine: we have a tag jdk-23.0.2+22 which we want to use as a dry run, so we push a dryrun-ga tag, jdk-23.0.2-dryrun-ga, with the same commit SHA as jdk-23.0.2+22, and jdk-23.0.2+22 has a corresponding jdk-23.0.2+22_adopt tag present. All is good, except that jdk-23.0.1-dryrun-ga is also present and shares the same commit SHA as jdk-23.0.2+22. This will confuse the trigger and/or the downstream release pipeline it kicks off. The solution is to delete the jdk-23.0.1-dryrun-ga tag.
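The cleanup is plain git. A runnable sketch using a throwaway local bare repo as "origin" (paths are illustrative; against the real repo only the last two git commands apply):

```shell
# Demonstrate deleting a stale dryrun tag locally and on the remote,
# using a temporary bare repo standing in for the real origin.
set -e
tmp=$(mktemp -d)
git init --bare -q "$tmp/origin.git"
git init -q "$tmp/work"
cd "$tmp/work"
git -c user.email=ci@example.org -c user.name=ci commit -q --allow-empty -m init
git remote add origin "$tmp/origin.git"
git tag jdk-23.0.1-dryrun-ga
git push -q origin --tags

# The actual fix: remove the stale tag from both the clone and the remote.
git tag -d jdk-23.0.1-dryrun-ga
git push -q origin :refs/tags/jdk-23.0.1-dryrun-ga
git ls-remote --tags origin    # the stale tag is no longer listed
```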

@smlambert (Contributor) commented:

I have manually increased the TIME_LIMIT on https://ci.adoptium.net/job/Test_openjdk23_hs_extended.openjdk_riscv64_linux/ as it appears to be hitting its 25-hour limit and aborting before completion.

FYI @Haroon-Khel

We will need to investigate why it's timing out (my guess is that certain test case failures hang and run long until each one hits its timeout). If there is time, we should figure that out during the dry run assessment; if not, a longer time limit will hopefully allow the jobs to finish without aborting, even with the timeouts.

@Haroon-Khel (Contributor) commented Jan 21, 2025

https://adoptium.slack.com/archives/C09NW3L2J/p1737466419332739

https://ci.adoptium.net/view/git-mirrors/job/git-mirrors/job/adoptium/job/git-skara-jdk23u/2688/console

+ git push origin master --tags
To github.com:adoptium/jdk23u
 ! [rejected]                jdk-23.0.2-dryrun-ga -> jdk-23.0.2-dryrun-ga (already exists)
error: failed to push some refs to 'github.com:adoptium/jdk23u'
hint: Updates were rejected because the tag already exists in the remote.

Steps to resolve this from Andrew:

The problem was that we manually pushed the jdk-23.0.2+00_adopt tag, rather than letting the mirror job do it, so when the job tried to tag it got a mismatch.
To resolve, I:
  • Deleted the local cache on the jenkins-worker: rm -rf /home/jenkins/workspace/git-mirrors/adoptium/git-skara-jdk23u/workspace/jdk23u
  • Deleted the jdk-23.0.2+00_adopt tag from the mirror repo
  • Re-ran the mirror job

@sophia-guo (Contributor) commented Jan 23, 2025

Some test jobs timed out during this release and dry run even with timeout=25 hours. The reason is that TRSS was unavailable, so test durations fell back to assumed values (very small, essentially random numbers) and the test list count was set to 1, i.e. no parallelism at all for the JDK 21, 17, 11, and 8 releases. Only the JDK 23 tests ran in parallel. This is why tests ran slowly this release, even on some primary platforms.

The timeouts caused the builds to fail with no test results archived, so we had to rerun the jobs to get the test results.

For example:
https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_x86-64_mac/81/
https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_x86-64_linux/233/
https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_aarch64_linux/127/

19:00:04  Starting to generate parallel test lists.
19:00:04  
19:00:05  Parsing /home/jenkins/workspace/Test_openjdk17_hs_extended.openjdk_x86-64_linux/aqa-tests/TKG/../openjdk/playlist.xml
19:00:06  Attempting to get test duration data from TRSS.
19:00:06  curl --silent --max-time 120 -L -k https://trss.adoptopenjdk.net/api/getTestAvgDuration?limit=10&jdkVersion=17&impl=hs&platform=x86-64_linux&group=openjdk&level=extended
19:00:06  Warning: cannot parse data from TRSS.
19:00:06  Unexpected character (e) at position 0.
19:00:06  	at org.json.simple.parser.Yylex.yylex(Yylex.java:610)
19:00:06  	at org.json.simple.parser.JSONParser.nextToken(JSONParser.java:269)
19:00:06  	at org.json.simple.parser.JSONParser.parse(JSONParser.java:118)
19:00:06  	at org.json.simple.parser.JSONParser.parse(JSONParser.java:92)
19:00:06  	at org.testKitGen.TestDivider.parseDuration(TestDivider.java:162)
19:00:06  	at org.testKitGen.TestDivider.getDataFromTRSS(TestDivider.java:252)
19:00:06  	at org.testKitGen.TestDivider.createDurationQueue(TestDivider.java:281)
19:00:06  	at org.testKitGen.TestDivider.divideTests(TestDivider.java:404)
19:00:06  	at org.testKitGen.TestDivider.generateLists(TestDivider.java:425)
19:00:06  	at org.testKitGen.MainRunner.genParallelList(MainRunner.java:74)
19:00:06  	at org.testKitGen.MainRunner.main(MainRunner.java:38)
19:00:06  Attempting to get test duration data from cached files.
19:00:06  
19:00:06  TEST DURATION
19:00:06  ====================================================================================
19:00:06  Total number of tests searched: 86
19:00:06  Number of test durations found: 0
19:00:06  No test duration data found.
19:00:06  (Default duration assigned, executed tests: 40s; not executed tests: 0s.)
19:00:06  ====================================================================================
19:00:06  
19:00:06  Test target is split into 1 lists.
19:00:06  Reducing estimated test running time from 28m40s to 28m40s.
19:00:06  
19:00:06  -------------------------------------testList_0-------------------------------------
19:00:06  Number of tests: 86
19:00:06  Estimated running time: 28m40s

It may be better to fall back to a pre-defined number of parallel test lists when this happens (even though it rarely happens).
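A hedged sketch of that fallback idea; the function and variable names are invented for illustration (TKG's real decision lives in TestDivider.java):

```shell
# pick_lists decides how many parallel test lists to build: split by measured
# durations when the TRSS response looks like JSON, otherwise fall back to a
# fixed count instead of a single serial list.
pick_lists() {
  resp=$1
  fallback=$2
  case $resp in
    \[*|\{*) echo "durations" ;;   # JSON payload: split by measured durations
    *)       echo "$fallback" ;;   # TRSS down or garbage: fixed parallel split
  esac
}

pick_lists '<html>503 Service Unavailable</html>' 4   # -> 4
pick_lists '[{"avgDuration":40}]' 4                   # -> durations
```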

@sxa (Member) commented Jan 27, 2025

Should "Re-run in Grinder" links always set PARALLEL=None by default?
Reason: the "blocks" in Jenkins display differently depending on whether parallel is on or off, so if there is a mix it won't show both types, only the most recent one.
If everything were set to non-parallel, the Jenkins blocks for the pipeline stages would be more visible to the different people running tests, which would make it easier to monitor progress.

For example at the time of originally posting this comment it's only showing two jobs in the main display:

[image: Jenkins main display showing only two jobs]

despite there being other jobs prior to that still running which were initiated with the re-run links, which have PARALLEL=Dynamic. I usually try to switch mine to use PARALLEL=None, but it would be preferable to have that as the default (since it's usually a small number of targets, you don't benefit much from running in parallel, particularly when it makes collating the results more complex).

@sxa (Member) commented Jan 30, 2025

Should we cherry-pick cacerts updates that land in the master branch between the branching for the dry run and the final GA builds?
OpenJ9 noticed an issue where their cacerts differed from ours because they are not currently basing things on our release branches.
References:
