Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correlate client and orchestrator functions for DF .NET Isolated Distributed Tracing #2998

Draft
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

bachuv
Copy link
Collaborator

@bachuv bachuv commented Dec 17, 2024

These changes help correlate client and orchestrator functions in DF .NET Isolated.

Gantt chart (shows HTTP trigger and orchestrator function in one trace)
image

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.
  • My changes should be added to v3.x branch.
    • Otherwise: This change only applies to Durable Functions v2.x and will not be merged to branch v3.x.

@@ -154,8 +155,22 @@ public override Task<Empty> Hello(Empty request, ServerCallContext context)
{
try
{
ActivitySource activityTraceSource = new ActivitySource("DurableTask.WebJobs");
Copy link
Collaborator Author

@bachuv bachuv Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, the source name is "DurableTask.WebJobs" because we load the DurableTelemetryModule and it looks for sources that start with "DurableTask" (related code link). I'm working on updating the listeners to add a new one ("WebJobs.Extensions.DurableTask") when we initialize the module so we can remove this "DurableTask.WebJobs" that's just used to test for now.

Comment on lines 80 to 84
Activity? currActivity = Activity.Current;
if (options == null && options?.ParentTraceId == null && currActivity != null)
{
options = new StartOrchestrationOptions(ParentTraceId: currActivity?.Id?.ToString());
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the ParentTraceId information to the StartOrchestrationOptions, but I'm wondering if there's a better place to store this information since StartOrchestrationOptions can be configured by the customer and we don't necessarily need customers to have access to ParentTraceId.

Copy link
Contributor

@jviau jviau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid having a new activity source and span added. We should try and having the trace ID captured from the worker directly added to the enqueued orchestration. I think to accomplish this we will want to hydrate a shim activity which has its ID set to the trace context from the gRPC payload.

@bachuv bachuv changed the title Add gRPC span and correlate client and orchestrator functions for DF .NET Isolated Distributed Tracing Correlate client and orchestrator functions for DF .NET Isolated Distributed Tracing Jan 15, 2025
@bachuv bachuv marked this pull request as draft January 15, 2025 19:55
Comment on lines +171 to +176
if (newActivity != null)
{
typeof(Activity).GetField("_traceId", BindingFlags.Instance | BindingFlags.NonPublic)?.SetValue(newActivity, traceId);
typeof(Activity).GetField("_spanId", BindingFlags.Instance | BindingFlags.NonPublic)?.SetValue(newActivity, spanId);
}
}
Copy link
Collaborator Author

@bachuv bachuv Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used reflection here to set the Activity's trace ID and span ID, but don't set the rehydrated Activitys parent Ids since that information isn't passed through the CreateInstanceRequest type which ends up not correlating the client and orchestration functions in the distributed trace. I think we need to pass additional information to successfully correlate the client and orchestrator functions.

As a reference, the code below is how we rehydrated an Activity for .NET In-Proc in DTFx:

  1. Get the parent context which is saved in the ExecutionStartedEvent and set it as the parentContext
if (!startEvent.TryGetParentTraceContext(out ActivityContext activityContext))
...
Activity? activity = ActivityTraceSource.StartActivity(
          activityName,
          kind: activityKind,
          parentContext: activityContext,
          startTime: startTime);

https://github.com/Azure/durabletask/blob/d7f33b07107eef66898538f195c184378bee8745/src/DurableTask.Core/Tracing/TraceHelper.cs#L78

  1. Set the rehydrated Activity with the original Activitys trace ID and span ID:
if (startEvent.ParentTraceContext.Id != null && startEvent.ParentTraceContext.SpanId != null)
{
    activity.SetId(startEvent.ParentTraceContext.Id!);
    activity.SetSpanId(startEvent.ParentTraceContext.SpanId!);
}

https://github.com/Azure/durabletask/blob/d7f33b07107eef66898538f195c184378bee8745/src/DurableTask.Core/Tracing/TraceHelper.cs#L107

In the example above, we had the following information:

  • ParentTraceContext to point to the parent
  • Id for the rehydrated Activity
  • spanId for the rehydrated Activity

With the .NET Isolated approach, we only have the last two items, not the information about its parent. Would it be worth updating CreateInstanceRequest to have the following fields?

  • TraceContext parentTraceContext**
  • TraceContext currentTraceContext

TraceContext also has a spanId field which would be another approach to help pass this information since we could create the Id using the traceId (from the parent) and spanId, but it's deprecated. Is it possible to add that field back or create a new one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we might be over complicating things a bit. Instead of reconstructing the trace activity we got from the client using reflection, can we start a new trace activity and have the incoming trace context be its parent? Something like this (this code won't compile, but is just illustrative):

Activity? activity = ActivityTraceSource.StartActivity(
          "gRPC create orchestration",
          kind: activityKind,
          parentContext: request.ExecutionStarted.ParentTraceContext,
          startTime: DateTime.UtcNow);

I feel like this would be cleaner and would help capture the fact that there is a gRPC network hop between the language worker and our host extension.

Copy link
Contributor

@jviau jviau Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect you to need the parent information here. The incoming trace-context is the parent information.

The scenario is, we are about to call TaskHubClient to schedule an orchestration. In that call, TaskHubClient is going to create a send_orchestration activity with kind=client. At time of this activity creation, we want to make sure it has access to whatever span was active when the customer's code asked to schedule this orchestration. This is done by ensuring Activity.Current is set where its id (traceId, spanId) are set to what we want the parent of the mentioned activity to be. This process should not look at Activity.Current's parent information.

Creating a new Activity.Current with traceId and spanId set should be sufficient. We shouldn't need parent information. If this isn't working, we may need debug into TaskHubClient and its activity creation to see what is going on.

@@ -154,8 +156,34 @@ public override Task<Empty> Hello(Empty request, ServerCallContext context)
{
try
{
Activity? newActivity = new Activity("gRPC start orchestration");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this span name match the convention we're using for other spans?

@@ -154,8 +156,34 @@ public override Task<Empty> Hello(Empty request, ServerCallContext context)
{
try
{
Activity? newActivity = new Activity("gRPC start orchestration");

string traceParentContext = request.ParentTraceContext.TraceParent.ToString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming: Should this variable be named differently? It doesn't seem to be the trace context. Rather, it seems to be the traceparent value.

Suggested change
string traceParentContext = request.ParentTraceContext.TraceParent.ToString();
string traceParentString = request.ParentTraceContext.TraceParent.ToString();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants