-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unified semantic conventions for tasks, workflows, pipelines, jobs #1688
Comments
I think(?) this is related: open-telemetry/opentelemetry-java-instrumentation#12830 cc @cb645j |
Related issue for long running spans: #1648 |
Thanks for the feedback here and during the SIG Call. Especially @lmolkova statement on not over-generalizing made me think more about this topic. Before digging into that, let's ignore the "long running" issue for now, as shared in #1648 this is a spec issue and not a semantic convention issue, and for the sake of simplicity, let's assume that a workflow run with it's sub tasks can be represented by a trace with the run being the parent and the tasks the child spans. Let me begin with where I am coming from: in my previous job as solution engineer for APM I created lots of dashboards and visualizations, and a thing that annoyed me was that I had to do the same thing again and again because things carried slightly different names and by that required me reworking a lot of things. That's why I am a huge fan of the semantic conventions, they create consistence! So, the leading question I had in mind is: how can I visualize workflow telemetry in my preferred visualization tool consistently? To answer that question, let's take a look how workflows are visualized across different solutions today:
So what all workflows have in common is that they can be represented by a graph where the vertices represent tasks, the edges the dependencies between tasks. There are "start vertices" and "end vertices", and a run is a flow through that graph from a start to an end. From an observability standpoint, what I want to understand from a highlevel is:
These questions are independent of the workflow I am looking at. This is how I would troubleshoot a CICD workflow, a AI Agent workflow, a business workflow, a build workflow, etc. What is different is when I want to drill into a specific task and investigate further why there are errors, or why that task my be slow. So, similar to a service map, a way to represent them for monitoring is something like the following: flowchart TD
style A fill:#0f0
style B fill:#ff0
style C fill:#0f0
style D fill:#0f0
style E fill:#0f0
style F fill:#f00
style G fill:#ff0
subgraph "WF1, sr: 62/100, art: 4min"
A[Setup<br>0 errors/min<br>15s AVG runtime] -->|100 runs/min| B(Step 1<br>2 errors/min<br>3min AVG runtime)
B -->|98 runs/min| C(Step 2<br>0 errors/min<br>1.3s AVG runtime)
C -->|40 runs/min| D[Step 3a<br>0 errors/min<br>12s AVG runtime]
C -->|20 runs/min| E[Step 3b<br>0 errors/min<br>17s AVG runtime]
C -->|40 runs/min| F[Step 3c<br>0 errors/min<br>3.7min AVG runtime]
D --->|40 runs/min| G
E --->|20 runs/min| G
F --->|2 runs/min| G(Tear Down<br>2 errors/min<br>13s AVG runtime)
end
If different workflows now do not share a common set of attributes, engineers building visualizations have either to rebuild the same thing over and over again, or they have to add some logic to select the right attributes depending on the workflow type they are looking at. This means, from my point of view, there are a few common attributes:
If available, the For metrics, there are also some common examples:
Additionally, the following existing attributes may be used:
There might be other common attributes, but these are the ones that would enable such a unified visualization. Other attributes would not be common, e.g. cicd.pipeline.task.run.url.full since they are domain specific/not available in all workflows (e.g. a tool like |
Area(s)
area:new
Is your change request related to a problem? Please describe.
While reviewing the AI Agent Span Semantic Convention I made the comment, that the definition of
ai_agent.workflow.*
andai_agent.task.*
should not be unique, since there tasks & workflows that can be modeled similarly outside of theai_agent
scope. Same is true for the existing experimental CICD pipeline attributes. Other examples might be cronjobs, business processes, build tools, like make or goyek, which defines their owngoyek.flow.*
tasks in https://pkg.go.dev/github.com/goyek/x/otelgoyek#example-package (thanks to @pellared for pointing me to goyek)Describe the solution you'd like
The solution I would like to see is a unified set of attributes that describe such "workflow" and then only to have the unique attributes in the related specifications, similar as we have it today for the HTTP SemConv, where
server.*
orurl.*
attributes are used where applicable.So instead of the current proposals, the future may look like the following:
and
and
Both @open-telemetry/semconv-genai-approvers @open-telemetry/semconv-cicd-approvers are very active working groups, pushing semantic conventions forward, and it would be great to see some broader thinking about "workflows" (or however this should be named in a unified way)
Describe alternatives you've considered
No response
Additional context
Note, that this may also require to make progress on the long standing question of "long running 'spans'"1:
open-telemetry/opentelemetry-specification#373,
open-telemetry/opentelemetry-specification#2692 and other previous issues touched upon this topic, but so far there is no solution for long-running (++minutes, hours, days) or even "infinite" spans
Footnotes
the term "span" might not be correct in this context. ↩
The text was updated successfully, but these errors were encountered: