Status of smartOS support and what future holds #1663

Closed
anonrig opened this issue Dec 11, 2024 · 39 comments

Comments

@anonrig
Member

anonrig commented Dec 11, 2024

Hi everybody,

I'm unfortunately opening this issue in the TSC repository rather than the nodejs/build repository, in order to get more visibility and share the urgency/state with the rest of the @nodejs/tsc members. Unfortunately, even though I tagged @nodejs/platform-smartos team multiple times, I haven't seen any progress (meaning updating the GCC version) which concerned me enough to create this issue. My goal is not to offend or blame anyone but just discuss our path forward for operating systems that are blocking future improvements of node.js due to the lack of support or personnel working on them.

The current smartOS machines use old GCC versions even though the "Supported toolchains" section lists higher versions. For example, our documentation says we support GCC 12 or higher, but our smartOS machine is using GCC 10.

There is an active pull-request of mine from 3 months ago that is still failing to build: nodejs/node#54990. This concerns me the most, because 3 months is long enough to expect a compiler version update for a Tier 1 supported OS.

Today, I opened another pull-request to test whether updating Ada to its main branch breaks anything, and it seems smartOS (as well as macOS) failed to build due to an old compiler. (Ref: nodejs/node#56218)

As a result, I recommend lowering SmartOS to "Tier 3/Experimental" to unblock existing pull-requests. That is my recommendation, but I'm open to all suggestions.

Happy holidays!

cc @nodejs/build @nodejs/platform-smartos @nodejs/tsc

@anonrig changed the title from "Status of smartOS machines and what future holds" to "Status of smartOS support and what future holds" on Dec 11, 2024
@bahamat

bahamat commented Dec 11, 2024

I'm on the smartos team. I did receive the notification for nodejs/node#54990. You received an answer from @richardlau that the migration was in progress. This migration was being handled by Ryan Aslett. As of August 22, to my knowledge, Ryan was able to make progress without further assistance from me. I can say that after that date no further assistance was requested. I don't have access to the Jenkins build infra, so aside from assisting in designating which image versions should be used, and provisioning in MNX (which we provide for free), there's not anything I can do. And since I don't have access to the build system, I am unable to even verify whether the task was completed.

As for the compiler version, I discussed this with you today on Slack.

To recap what we discussed on Slack (and to expand a bit):

As of August, I had arranged with Ryan Aslett (via Slack) to have SmartOS build agents for v21.4.1 and v23.4.0.

As far as a newer version of gcc goes:

  • 21.4.1 will be unsupported as of next month when 24.4.0 is released
  • 22.4.0 has gcc12, which is newer than the requested gcc11
  • 23.4.0 has gcc13, which is newer than the requested gcc11
  • The upcoming 24.4.0 will have gcc13, which is newer than the requested gcc11

Given this, the best thing to do right now is to replace the 21.4.1 image with 22.4.0, which will be supported for another 13 months. Then, once 24.4.0 is released, the 23.4.0 image can be replaced with 24.4.0, or kept as-is. To the extent that I am able to assist with this, I am ready, able, and willing to provide that assistance.
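The reasoning above can be sketched as a small check of which images satisfy a minimum GCC major version. This is only an illustration: the minimum of 12 is taken from the issue description ("GCC 12 or higher"), the names `IMAGE_GCC`, `MIN_GCC_MAJOR`, and `images_meeting_minimum` are hypothetical, and the 21.4.1 image is left out because its GCC version is not stated in this comment.

```python
# GCC major versions per SmartOS image, as listed in the comment above.
IMAGE_GCC = {
    "22.4.0": 12,
    "23.4.0": 13,
    "24.4.0": 13,  # upcoming release at the time of the comment
}

# Assumption: the minimum comes from the issue description ("GCC 12 or higher").
MIN_GCC_MAJOR = 12


def images_meeting_minimum(image_gcc, minimum):
    """Return the image versions whose GCC major version is at least `minimum`."""
    return sorted(version for version, gcc in image_gcc.items() if gcc >= minimum)


print(images_meeting_minimum(IMAGE_GCC, MIN_GCC_MAJOR))
```

Under these assumptions, every image from 22.4.0 onward passes the check, which matches the suggestion to replace the 21.4.1 image with 22.4.0.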

I think it's a mischaracterization that we have been unresponsive. We've addressed every issue that's been asked of us (to the extent that we have permission to contribute) and in a timely manner. We have provided fixes for bugs in node and v8, and also helped uncover bugs in other operating systems before those bugs could land in a release (i.e., bugs were discovered in a failed build on SmartOS, but further analysis showed that other operating systems were affected by the same bug, even though they did not exhibit a build failure).

As I stated earlier, for the one issue you mentioned that we didn't directly respond to, you had already received an answer from someone more directly involved and responsible for the task than we are, so it didn't seem necessary to make a specific reply.

@mcollina
Member

cc @ryanaslett what's the status of the migration @bahamat is referring to?

@ryanaslett

Apologies for the delay, and lack of understanding of the urgency and impact.

I have provisioned a Smartos23 instance, and got as far as evaluating whether or not the instances were ready to put into the testing matrix.

When running the tests, they were failing intermittently, or, when they did succeed, the tests were taking over 7 hours to complete, even on subsequent runs.

The next step is to investigate whether they're misconfigured, or underprovisioned, or what is causing them to intermittently fail, as well as underperform.

I had intended to begin that investigation myself, so I had better questions to ask, but other priorities got my attention, and I hadn't yet returned to this initiative to finish it off.

Jenkins had automatically cleaned up the old builds I was testing with in the meantime, so I'll have to kick off another build to start troubleshooting this. (running here: https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/nodes=smartos23-64/9/consoleFull)

I should have communicated the status of this much sooner and tried to involve more assistance, as I'm neither fully versed in the intricacies of SmartOS, nor do I have deep experience with building node, specifically. As such I was struggling with where to begin investigating to get these runners into a state that resembles the existing node18/20 performance (30-40 minutes).

@mcollina
Member

@ryanaslett, could @bahamat help you understand what the problems could be, and help solve them?

My understanding is that we do not have enough volunteers to maintain a Tier 2 status for SmartOS, so either more volunteers provide help in a timely manner, or I think we would need to move SmartOS to Tier 3 for Node v24.

I propose a deadline of 2025/02/14 so there is enough time for the work to be completed before we stabilize Node.js v24.

@mhdawson
Member

We discussed this in the TSC meeting, and there seemed to be agreement that setting the deadline of 2025/02/14 to get the machines updated makes sense; otherwise testing on SmartOS would be temporarily disabled (and documented as Tier 3) until they are. The assumption is that it would go back to Tier 2 once the machines were re-enabled and testing was passing.

As an FYI, the deadline might have to be moved up if we need to do a Jenkins security upgrade which forces the need for a newer Java.

@bahamat

bahamat commented Dec 11, 2024

@mcollina What is the minimum number of volunteers? Is that published anywhere?

Are there any cases where we did not respond in a timely manner (aside from the one already mentioned, where the question was answered, just not by us)?

@ryanaslett As for performance issues in the build, this is probably a resource constraint. I can help you diagnose and identify the issue(s), as well as recommend changes necessary to address it. I'm not familiar with the build process (especially the way it's done with the Jenkins agents) but @jperkin from our team is very good at that kind of thing. If we could get someone who is more familiar with the node build pipeline, we can get this taken care of pretty quickly.

@bahamat

bahamat commented Dec 11, 2024

@mhdawson Moving to v22.4.0 as the minimum SmartOS build version will also address any Java issues. We had to do the same thing for our Jenkins environment.

@mhdawson
Member

@bahamat in terms of your question:

What is the minimum number of volunteers? Is that published anywhere?

It is enough that a platform is not blocking progress of the project. The best way to make sure of that is for there to be people active and involved in the build working group who will prioritize fixing problems with machines, do the required upgrades etc. Otherwise you are hoping other volunteers prioritize problems on smartOS over other work which may not align with their interests or priorities. We do have some cycles from Ryan from the Linux Foundation, but prioritization by the larger project also applies to what he will work on.

It's not just about responding to questions, but instead being proactive and leading resolution of problems that occur on smartOS machines.

This is my experience as a person from another company which has a vested interest in operating systems and architectures which are not macOS, Windows, Linux on x86 and ARM.

@anonrig
Member Author

anonrig commented Dec 11, 2024

I think it's a mischaracterization that we have been unresponsive. We've addressed every issue that's been asked of us (to the extent that we have permission to contribute) and in a timely manner. We have provided fixes for bugs in node and v8, and also helped uncover bugs in other operating systems before those bugs could land in a release (i.e., bugs were discovered in a failed build on SmartOS, but further analysis showed that other operating systems were affected by the same bug, even though they did not exhibit a build failure).

@bahamat I've tried to be clear in the issue description:

Unfortunately, even though I tagged @nodejs/platform-smartos team multiple times, I haven't seen any progress (meaning updating the GCC version) which concerned me enough to create this issue. My goal is not to offend or blame anyone but just discuss our path forward for operating systems that are blocking future improvements of node.js due to the lack of support or personnel working on them.

I apologize for this misunderstanding, as it was not my intention. Rather than focusing on the issue, which is an extremely old GCC version that contradicts our BUILDING.md documentation, I see that the conversation has shifted to a different place, a place I'm not comfortable with. I've specifically mentioned "my goal is not to offend or blame anyone but just discuss our path forward".

I'm not familiar with issues around smartOS development, but IMHO:

Even before the communication came to this point, we should have made smartOS experimental, and once it complied with the supported toolchains, we should have made it Tier 2 again.

I propose a deadline of 2025/02/14 so there is enough time for the work to be completed before we stabilize Node.js v24.

We discussed in the TSC meeting, and there seemed to be agreement that setting the deadline of 2025/02/14 to get the machines updated makes sense, otherwise testing on SmartOS would be temporarily disabled (and documented as Tier 3) until they are.

@mcollina @mhdawson We originally changed the supported compiler versions to GCC 12.2 on August 5. (nodejs/node@046343e). If we go with February 14, 2025, the deadline lands more than six months (193 days) after that change.

I think more than six months to update a Tier 2 platform is too much, and it delays lots of improvements to the project. This is going to stall 3+ pull-requests, updating of Ada, and the addition of URLPattern to the project, which I'm not comfortable with.
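As a quick sanity check on that span, here is a minimal sketch that counts the calendar days between the two dates mentioned above. It assumes the August 5 date refers to 2024, and the variable names are illustrative only.

```python
from datetime import date

# Assumption: "August 5" in the comment above means August 5, 2024.
toolchain_change = date(2024, 8, 5)
proposed_deadline = date(2025, 2, 14)

span = proposed_deadline - toolchain_change

# Calendar days between the toolchain change and the proposed deadline.
print(span.days)  # 193

# Rough month count, using an average month length of 30.44 days.
print(round(span.days / 30.44, 1))
```

Counted this way, the gap is a little over six months.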

@joyeecheung
Member

joyeecheung commented Dec 11, 2024

IIUC the situation is:

  1. Our CI infrastructure is using EOL/soon-EOL SmartOS versions, where the GCC versions are also EOL/soon-EOL
  2. The newer SmartOS versions have higher versions of GCC that could make the problem go away and an upgrade in our CI infra has been overdue
  3. Those who have access to the SmartOS machines in the CI don't have the expertise/priority to finish the migration ASAP
  4. Those who have the expertise/priority to finish the migration ASAP don't have access to the machines in our CI infra

It seems the problem primarily comes from 4. IMO resolving 4 and setting a deadline makes sense. It seems SmartOS is not the only one blocking PRs using new C++ features and there's also macOS. Unless we are considering demoting macOS at the same time (which shares 1-3 it seems, except s/gcc/clang/), it doesn't seem urgent to demote SmartOS alone. We can revisit demoting smartOS when it becomes the sole blocker.

@anonrig
Member Author

anonrig commented Dec 11, 2024

It seems SmartOS is not the only one blocking PRs using new C++ features and there's also macOS.

This is not true. Here's a pull-request that's blocked by smartOS and not by macOS: nodejs/node#54990

@anonrig
Member Author

anonrig commented Dec 11, 2024

I think this is also a good opportunity to mention and highlight that the macOS infrastructure update is in a similar position. cc @mhdawson @targos nodejs/build#3686

@jasnell
Member

jasnell commented Dec 12, 2024

@mhdawson:

I propose a deadline of 2025/02/14 so there is enough time for the work to be completed before we stabilize Node.js v24.

I'm good with this approach as long as (a) there is a special call out that the deadline may be moved up in case we need to get an urgent release out and (b) an exception is made that if a PR fails CI on SmartOS due to out of date compiler issues it will not block merging of that PR. And yes, I understand that means that legitimate bugs that only surface on SmartOS might sneak their way through but those would just need to be addressed later once that platform is finally updated.

@mcollina
Member

And yes, I understand that means that legitimate bugs that only surface on SmartOS might sneak their way through but those would just need to be addressed later once that platform is finally updated.

@jasnell that's exactly what "Tier 3" is. So are you confirming you are OK with lowering SmartOS to Tier 3 until the CI situation can be resolved?

@jasnell
Member

jasnell commented Dec 12, 2024

Yes

@ryanaslett

Update: I've got a fresh smartos23 instance running with more CPU, hoping to bring build times down to something reasonable. I will continue to post updates on the smartos infra progress in the original issue: nodejs/build#3731 (comment)

@bnoordhuis
Member

I note that this is the umpteenth time this pattern plays out with smartos:

  • problems build up
  • we get to the point where we're this close to demoting it
  • at the last moment some smartos people pop out of the woodwork
  • things quiet down for a bit
  • the cycle starts anew

I don't think there's been another OS or arch where that's been such a recurring theme. Neither do I think this time is going to be different because why would it?

@rvagg
Member

rvagg commented Dec 13, 2024

It's like the seasons @bnoordhuis, don't you like it when summer shows up?

Download stats should tell an interesting story here. They never were very high for smartos, and we held on for longer than we otherwise would have post-4.0 simply out of deference to Joyent. It's kind of surprising that it's still here! Are there real users?

@richardlau
Member

richardlau commented Dec 13, 2024

Downloads will be 0 because we stopped releasing smartos binaries with Node.js 14 but kept testing on it: nodejs/build#2168 (comment)

I assume there must be users for MNX to be willing to sponsor the x64 machines (not just smartos) we were asked to move out of Equinix Metal, but they would be getting Node.js from the SmartOS repositories.

@bahamat

bahamat commented Dec 14, 2024

Usage of node on SmartOS comes primarily through pkgsrc, the package manager we share with NetBSD. We do have a significant amount of usage, which is why we have a vested interest in ensuring builds still work.

We have no problem at all being the source of binaries.

I think what might help here, is if we had direct access to the build reports for SmartOS, or even better if we could configure a web hook for push notifications of build failures. Up until now, we haven't been granted that access so we've been at the mercy of being explicitly notified via GitHub issues by a human.

@ryanaslett

Another update: We were able to make our way through the performance issues of the nodes, provision them all, warm the caches, and put them into the rotation (see nodejs/build#3731 (comment)).

nodejs/node#56106 is the first pull request to be tested on the new smartos22/23 nodes.

@anonrig
Member Author

anonrig commented Dec 15, 2024

The builds seem to pass. Thank you for your hard work. I still want to discuss the status of smartOS in the next TSC meeting. Please keep it on the agenda. @bnoordhuis @rvagg would you be interested in joining?

@bahamat

bahamat commented Dec 15, 2024

Will there be a representative for SmartOS invited?

@mcollina
Member

Sure thing, happy to have you in the meeting!
Are you on the OpenJS slack @bahamat?

@bahamat

bahamat commented Dec 15, 2024

Yes, I am.

@bnoordhuis
Member

would you be interested in joining?

Interested, yes; capable, no. They're always at a horrible day/time for me.

@anonrig
Member Author

anonrig commented Dec 15, 2024

Interested, yes; capable, no. They're always at a horrible day/time for me.

@bnoordhuis I see. No worries! I would appreciate it if you could share your experiences (the previous issues with smartos), and some info on how it became a tier 2 platform. The original issue, which made smartos tier 2, has a comment by @Trott asking why it's a tier 2 platform rather than an experimental one, but I couldn't find any public info regarding it.

@bnoordhuis
Member

bnoordhuis commented Dec 16, 2024

I would appreciate if you could share your experiences (the previous issues with smartos)

It goes back at least 10 years; it predates io.js.

Node-on-smartos was always kind of janky but after sunsetting no.de and TJ (Fontaine) stepping down, Joyent themselves pretty much lost interest in the port, let alone the rest of the world. It's been in a state of disrepair ever since.

I could list individual issues but that's kind of pointless; it's just never been in good shape. I'm sure everyone who's been working on node for a long time will agree.

The original issue, which made smartos tier 2, has a comment by @Trott asking why it's a tier 2 platform rather than an experimental one, but I couldn't find any public info regarding it.

The way I remember it: a tier 2 platform needs 1-2 FTE with quick turnaround times. That was pledged at the time by interested parties but it never happened.

My personal opinion: the smartos CI is a drag on everyone, no one is interested in maintaining it, the user base is essentially zero. Any of those would already be enough to disqualify it from tier 2 status, never mind all three.

@jclulow

jclulow commented Dec 20, 2024

G'day! I'm a member of the illumos core team. While there's a lot of discussion about the "SmartOS" port, here, really we're talking about the illumos port. This is consumed by users in a number of illumos distributions in addition to SmartOS; e.g., OmniOS, OpenIndiana, Tribblix, etc. While we are obviously not aiming for world domination this quarter, it's absolutely not true that "the user base is essentially zero"!

As a community we have had great success in assisting in maintenance for a variety of modern toolchains; e.g., we're at tier 2 status in both of the Go and Rust projects. In both cases we answer questions people raise, and attempt to help with debugging. Some of the companies in our ecosystem also in some cases fund and manage CI environments for popular distributions like OmniOS and SmartOS.

I think most of what we have here is a communication challenge. Similar challenges have arisen in the past in another related project, libuv, which we've also tried to work through there. In all of these cases, there are absolutely people willing to assist with issues as they arise, and no doubt also to chip in for infrastructure if it will help. We have to hear about the problems, though!

Could we look at renaming the "SmartOS" port to just "illumos", and get some folks from communities like OmniOS, etc, into a GitHub team that could be notified about issues directly relating to our platform?

I'm also hoping we can set aside the more subjective assessments; e.g.,

Node-on-smartos was always kind of janky
it's just never been in good shape

And focus on what we can do to help with maintenance in the future!

@anonrig
Member Author

anonrig commented Dec 20, 2024

Could we look at renaming the "SmartOS" port to just "illumos", and get some folks from communities like OmniOS, etc, into a GitHub team that could be notified about issues directly relating to our platform?

I think that is a different discussion that can be better addressed in GitHub.com/nodejs/build repository.

@jclulow

jclulow commented Dec 20, 2024

I think that is a different discussion that can be better addressed in GitHub.com/nodejs/build repository.

FWIW, I think it's actually at the core of the issue here: i.e., getting engagement on the maintenance of what is fundamentally the illumos (not just SmartOS) platform support in Node, so that it's not a burden for folks.

@anonrig
Member Author

anonrig commented Dec 20, 2024

FWIW, I think it's actually at the core of the issue here: i.e., getting engagement on the maintenance of what is fundamentally the illumos (not just SmartOS) platform support in Node, so that it's not a burden for folks.

I understand but the reason I've opened this issue is that 4-5 pull-requests have been blocked due to the old "smartOS" labeled VM. The timeline to update GCC and unblocking collaborators is the issue here.

@jclulow

jclulow commented Dec 20, 2024

I understand but the reason I've opened this issue is that 4-5 pull-requests have been blocked due to the old "smartOS" labeled VM. The timeline to update GCC and unblocking collaborators is the issue here.

The title of the issue is "Status of smartOS support and what future holds", but if indeed the focus here is that narrow: I have just listened to the TSC meeting recording from 18 DEC, and I agree with @richardlau and @bahamat that it seems like this has ultimately been a communication breakdown.

It also sounds like the SmartOS builder GCC upgrade issue is resolved, now? How can I, and the broader illumos community, help out to avoid the long delays next time this comes up?

@bahamat

bahamat commented Dec 20, 2024

I understand but the reason I've opened this issue is that 4-5 pull-requests have been blocked due to the old "smartOS" labeled VM. The timeline to update GCC and unblocking collaborators is the issue here.

I would like to point out (again) that every issue over the past 2-3 years, aside from the gcc11 issue, has been resolved by us within a day or two.

Yes, this one caused a problem that we shouldn't have had. I wholeheartedly agree. But I only understood it in the context of migrating machines out of equinix metal. It wasn't until Dec 12 that I even knew about a toolchain blocker, which I responded to within minutes, and we had builds working with gcc12 in less than 48 hours.

This just underscores the fact that I (and others) see this as a communication issue, not a technical one, nor of reliability/responsiveness.

@mcollina
Member

@jclulow would you like to join the @nodejs/platform-smartos team?

@jclulow

jclulow commented Dec 20, 2024

@jclulow would you like to join the @nodejs/platform-smartos team?

Yes, please! Sign me up!

@bahamat

bahamat commented Dec 20, 2024

Can we also add @jperkin from MNX, please?

@mcollina
Member

@bahamat @jclulow done.

@anonrig
Member Author

anonrig commented Jan 22, 2025

Removing from tsc-agenda. I've already spent a considerable amount of time on this. I hope the recent discussions regarding smartOS will make a difference in the project, and that the current state/decision of the TSC is to the benefit of the project.

@anonrig closed this as not planned on Jan 22, 2025