monitorization_OpenTelemetry.txt

# OpenTelemetry

* Also known as OTel, is a vendor-neutral open source Observability 
  framework for instrumenting, generating, collecting, and exporting 
  telemetry data such as traces, metrics, and logs.
* Replaces the now-deprecated OpenTracing and OpenCensus projects, which merged to form OpenTelemetry under the Cloud Native Computing Foundation (CNCF).
* Vendors who natively support OpenTelemetry via OTLP include:<br/>
    Apache SkyWalking, Fluent Bit, Jaeger, OpenLIT, ClickHouse, 
  Embrace, Grafana Labs, GreptimeDB, Highlight, HyperDX, observIQ, 
  OneUptime, Ope, qryn, Red Hat, Sig, Tracetest, Uptrace, 
  VictoriaMetrics, Alibaba Cloud, AppDynamics (Cisco), Aria by VMware 
  (Wavefront), Aspecto, AWS, Axiom, Azure, Better Stack, Bonree, 
  Causely, Chro, Control Plane, Coralogix, Cribl, DaoCloud, Datadog, 
  Dynatrace, Elastic, F5, Google Cloud Platform, Helios, Honeycomb, 
  Immersive Fusion, Instana, ITRS, KloudFuse, KloudMate, LogicMonitor, 
  LogScale by Crowdstrike (Humio), Logz.io, Lumigo, Middleware, New 
  Relic, Observe, Inc., ObserveAny, OpenText, Oracle, Sentry, Sentry 
  Software, Seq, Service, ServicePilot, SolarWinds, Splunk, Sumo Logic, 
  TelemetryHub, TingYun, Traceloop, VuNet Systems

## Concepts:

* Architecture:
  ```
  Microservices
  OTel Auto.Inst.
  OTel API
  OTel SDK

  Shared Infra
  Kubernetes
  L7 Proxy
  AWS//...
  ```

* <https://opentelemetry.io/docs/concepts/signals/logs/>
* A log is a timestamped text record, either structured (recommended) 
  or unstructured, with metadata. Of all telemetry signals, logs have 
  the biggest legacy.
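
A minimal Python sketch of the structured vs. unstructured distinction, using only the standard library. The field names (`timestamp`, `severity`, `body`, `attributes`) are illustrative assumptions loosely modelled on a log record, not the exact OTLP schema.

```python
import json
from datetime import datetime, timezone


def unstructured_log(message: str, now: datetime) -> str:
    """Free-form text: easy to write, hard to query reliably."""
    return f"{now.isoformat()} INFO {message}"


def structured_log(message: str, now: datetime, **attributes) -> str:
    """One JSON object per record: every field is machine-readable."""
    record = {
        "timestamp": now.isoformat(),
        "severity": "INFO",
        "body": message,
        "attributes": attributes,  # the metadata travelling with the record
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical event, logged both ways with a fixed timestamp.
now = datetime(2024, 11, 12, 9, 30, tzinfo=timezone.utc)
plain = unstructured_log("user 42 logged in from 10.0.0.7", now)
rich = structured_log("user logged in", now, user_id=42, client_ip="10.0.0.7")
```

With the structured form, a backend can filter on `attributes.user_id` directly instead of parsing the message text.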

# OpenTelemetry + Java Micrometer
* <https://grafana.com/blog/2022/05/04/how-to-capture-spring-boot-metrics-with-the-opentelemetry-java-instrumentation-agent/>


## OpenTelemetry talks at KubeCon 2024

* <https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/program/schedule/>


### Why is OpenTelemetry so complicated? is a question that we hear 

### Minimizing Data Loss Within the OpenTelemetry (OTel) Collector - Alex Kats, Capital One


The OTel Collector is meant to serve as a ... data pipeline. However, as
a single component in a wider observability architecture, it is only
as reliable as the downstream platforms/services it exports data to.

The OTel Collector has several built-in mechanisms that aim to 
minimize the impact of unhealthy downstream exporters, including an 
out-of-the-box sending queue with an additional configuration 
parameter for persistent queueing. 

... Failover Connector allows for dynamic routing or “failover” of
telemetry data based on downstream exporter health. 
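
The interplay between a bounded sending queue and failover routing can be sketched in plain Python. This is a conceptual model only, not the Collector's implementation: the `Exporter` and `FailoverPipeline` classes, the `healthy` flag, and the `queue_size` parameter are all hypothetical names, and the queue here is in memory rather than persistent.

```python
from collections import deque
from typing import Deque, List


class Exporter:
    """Hypothetical exporter: `healthy` stands in for a real health signal."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.sent: List[str] = []

    def export(self, item: str) -> bool:
        if not self.healthy:
            return False
        self.sent.append(item)
        return True


class FailoverPipeline:
    """Bounded sending queue in front of exporters tried in priority order."""

    def __init__(self, exporters: List[Exporter], queue_size: int) -> None:
        self.exporters = exporters
        self.queue: Deque[str] = deque()
        self.queue_size = queue_size
        self.dropped = 0

    def enqueue(self, item: str) -> None:
        if len(self.queue) >= self.queue_size:
            self.dropped += 1  # queue full: this is where data loss happens
            return
        self.queue.append(item)

    def flush(self) -> None:
        while self.queue:
            item = self.queue[0]
            # Try the highest-priority exporter first, then fail over.
            if any(exp.export(item) for exp in self.exporters):
                self.queue.popleft()
            else:
                break  # nobody is healthy: keep the data queued for a retry


primary = Exporter("primary", healthy=False)
backup = Exporter("backup")
pipeline = FailoverPipeline([primary, backup], queue_size=2)
for item in ["span-1", "span-2", "span-3"]:
    pipeline.enqueue(item)
pipeline.flush()
```

In this run the unhealthy primary receives nothing, the backup receives the two queued items, and the third item is dropped because the queue was already full. A larger or persistent queue trades memory or disk for fewer drops.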


### Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry

 While metrics, traces, and logs each provide valuable insights, 
their true power is realized when they are correlated.  ... exemplars 
using the OpenTelemetry SDK and Collector, and showcase the results 
in Grafana. Attendees will learn how to leverage OpenTelemetry to 
create exemplars which will allow them to navigate from either logs 
or metrics to their traces.

Kruthika Prasanna Simha & Charlie Le, Apple
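
The exemplar idea from this abstract can be illustrated with a toy histogram: each bucket keeps one sample measurement together with the trace that produced it, so a slow bucket links back to a concrete trace. This is a sketch under assumptions, not the OpenTelemetry SDK API; the `Exemplar` and `Histogram` classes and the bucket bounds are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Exemplar:
    """One raw measurement plus the trace that produced it."""
    value: float
    trace_id: str
    span_id: str


class Histogram:
    """Toy latency histogram keeping at most one exemplar per bucket."""

    def __init__(self, bounds: List[float]) -> None:
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # last bucket is the overflow
        self.exemplars: Dict[int, Exemplar] = {}

    def record(self, value: float, trace_id: str = "", span_id: str = "") -> None:
        # Index of the first upper bound that fits, else the overflow bucket.
        bucket = next(
            (i for i, bound in enumerate(self.bounds) if value <= bound),
            len(self.bounds),
        )
        self.counts[bucket] += 1
        if trace_id and span_id:  # only measurements taken inside a trace
            self.exemplars[bucket] = Exemplar(value, trace_id, span_id)

    def trace_for_slowest(self) -> Optional[str]:
        """Jump from the metric to a trace: exemplar of the highest bucket."""
        if not self.exemplars:
            return None
        return self.exemplars[max(self.exemplars)].trace_id


latency = Histogram(bounds=[0.1, 0.5, 1.0])
latency.record(0.05, trace_id="a" * 32, span_id="1" * 16)
latency.record(0.72, trace_id="b" * 32, span_id="2" * 16)
latency.record(0.30)  # recorded outside any trace: counted, no exemplar
slow_trace = latency.trace_for_slowest()
```

A dashboard can render each exemplar as a point on the latency chart that links to the trace view for its `trace_id`.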


### OpenTelemetry Project Update

OpenTelemetry started with distributed traces and metrics, ... This session will focus on what's coming next, including new signals and sources. ... new logging functionality, including its two logging paths, the benefits of each, and real-world production examples ... next wave of OpenTelemetry enhancements, including profiling and the insights that this unlocks in combination with distributed traces, and how we're extending your observability to client applications.
Alolita Sharma, Apple; Juraci Paixão Kröhling, Grafana Labs; Ted Young, ServiceNow; Morgan Mclean, Splunk; Daniel Dyla, Dynatrace

### Using OpenTelemetry for Deep Observability Within Messaging Queues 

... recent changes in OpenTelemetry have introduced new semantic 
conventions and agent changes to better monitor messaging queues such 
as Kafka, RabbitMQ, and Amazon SQS. ... we'll discuss how those 
semantic conventions are standardizing the telemetry collected from 
producers, consumers, and the messaging queues, and how in-depth 
observability can be achieved by correlating producer-to-consumer 
spans with the metrics collected from Kafka. Additionally, we will 
demonstrate how, with Kafka Java client-side instrumentation enabled 
and JMX metrics collected from Kafka, OpenTelemetry instrumentation 
enables metrics-to-trace and trace-to-metrics correlation and helps 
spot the reasons for anomalies such as increased consumer lag, 
partition failures, and time spent in messaging queues. This also 
surfaces the corresponding traces for a given time window, helping 
end users dig into their infrastructure and optimize their 
asynchronous applications.
Shivanshu Raj Shrivastava & Ekansh Gupta, SigNoz
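
Producer-to-consumer correlation rests on carrying the trace context inside the message itself. The sketch below shows the idea with a W3C-style `traceparent` header and a simple consumer-lag calculation. It is a stdlib-only illustration, not the OpenTelemetry API: `new_span`, `inject`, `extract`, and `consumer_lag` are hypothetical helpers, the span names and offsets are made up, and the attribute keys follow the messaging semantic conventions as I understand them, which may evolve.

```python
import secrets
from typing import Dict, Tuple


def new_span(name: str, trace_id: str = "", parent_span_id: str = "") -> Dict[str, object]:
    """Create a toy span; pass trace_id to continue an existing trace."""
    return {
        "name": name,
        "trace_id": trace_id or secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),                # 16 hex characters
        "parent_span_id": parent_span_id,
        "attributes": {},
    }


def inject(span: Dict[str, object], headers: Dict[str, str]) -> None:
    """Producer side: copy the trace context into the message headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract(headers: Dict[str, str]) -> Tuple[str, str]:
    """Consumer side: recover trace id and parent span id from the headers."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id


def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to a partition but not yet processed by the group."""
    return max(log_end_offset - committed_offset, 0)


# Producer: start a span, describe the destination, attach the context.
headers: Dict[str, str] = {}
producer_span = new_span("orders publish")
producer_span["attributes"] = {
    "messaging.system": "kafka",
    "messaging.destination.name": "orders",
}
inject(producer_span, headers)

# Consumer: continue the same trace using the headers from the message.
trace_id, parent_span_id = extract(headers)
consumer_span = new_span("orders process", trace_id, parent_span_id)
lag = consumer_lag(log_end_offset=1500, committed_offset=1420)
```

Because both spans share one trace id, a spike in the lag metric can be followed to the traces of the messages that waited longest.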


### How We Made OpenTelemetry Be Our Fitness Tracker for Your CI/CD Pipelines!

... In this session, engineers from Clario will demonstrate how they 
leverage OpenTelemetry to observe, validate, report and optimize 
their CI/CD pipelines, keeping their deployments healthy despite 
increased scale and unlocking the full potential of modern software 
delivery on Kubernetes with GitLab.
Speakers: 
* Nicolas Woerner, Associate DevOps Engineer, Clario
* Andreas (Andi) Grabner, CNCF Ambassador and DevRel, Dynatrace


[[{PM.low_code,PM.TODO]]
### Low-Overhead, Zero-Instrumentation, Continuous Profiling for OpenTelemetry 

... Elastic has recently donated its whole-system continuous profiling agent to OpenTelemetry.
... Leveraging eBPF, the profiling agent provides unprecedented visibility into the runtime behavior of all applications:
* it builds stack traces that go from the kernel through userspace native code, all the way into code running in higher-level runtimes, enabling users to:
  * identify performance regressions.
  * reduce wasteful computations.
  * debug complex issues faster.

 This session will explore:
* Benefits of eBPF-based continuous profiling compared to conventional approaches that rely on application instrumentation 
* How the agent builds profiles that seamlessly span kernel, native code and most widely used application runtimes 
* Integration with the rest of OpenTelemetry: OTLP and Collector   [[101]]

Speakers: 
* Christos Kalkanis Principal Software Engineer, Elastic
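
Once stack traces have been sampled, finding hot spots is mostly counting. The sketch below shows only that post-processing step, not the eBPF sampling itself: it folds identical stacks into the semicolon-separated format commonly used as flame graph input and ranks leaf frames. The frame names and the `fold` and `hottest_frames` helpers are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Stack = Tuple[str, ...]  # frames ordered from outermost caller to leaf


def fold(samples: Sequence[Stack]) -> Counter:
    """Count identical stacks; each key looks like 'root;child;leaf'."""
    return Counter(";".join(stack) for stack in samples)


def hottest_frames(samples: Sequence[Stack], top: int) -> List[Tuple[str, int]]:
    """Rank leaf frames by the number of samples that ended in them."""
    leaves = Counter(stack[-1] for stack in samples if stack)
    return leaves.most_common(top)


# Hypothetical samples spanning runtime code, native code, and the kernel.
samples: List[Stack] = [
    ("app:handle_request", "app:parse_json"),
    ("app:handle_request", "app:parse_json"),
    ("app:handle_request", "libc:read", "kernel:tcp_recvmsg"),
    ("app:handle_request", "app:parse_json"),
]
folded = fold(samples)
hot = hottest_frames(samples, top=1)
```

Comparing folded counts between two releases is one simple way to spot a performance regression: a stack whose share of samples grows is a candidate.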
[[PM.low_code}]]


### Mastering OpenTelemetry Collector Configuration - Steve Flanders, Cisco
Configuring the OpenTelemetry Collector can be a daunting task for both novices and seasoned professionals alike. Yet, mastering this crucial aspect is essential for unlocking the full potential of your observability stack. In this session, you will embark on a journey to gain the knowledge and skills needed to conquer common OpenTelemetry Collector configuration challenges. This session will draw from real-world experiences and best practices and provide live demonstrations to navigate the intricacies of OpenTelemetry Collector configuration. Whether you are a novice looking to get started or a seasoned veteran seeking to level up your skills, this session promises to empower you with the knowledge and confidence needed to properly and efficiently configure the OpenTelemetry Collector.


### Inspektor Gadget: eBPF for Observability, Made Easy and Approachable | Project Lightning Talk
eBPF is a powerful tool for observability. But better tooling can make it even more powerful and, importantly, more approachable.
In this short talk, we’ll use the mechanisms Inspektor Gadget has for distributing and deploying eBPF programs to quickly build a data collection pipeline with eBPF that can be integrated with popular observability tools or one's own applications.
By the end of the talk, the audience should feel empowered to work with eBPF using the high-level tooling and integrate it into their systems and tooling.


### gRPC: The gRPC "Standard Library" | Project Lightning Talk
gRPC has found widespread adoption in organizations around the world. You've probably written a protobuf yourself to define your own API. But did you know that the gRPC project actually defines several standard gRPC services that are generally applicable? In this talk, you will learn about gRPC's reflection, health, channelz, and status protos and how you can use them to get more out of your gRPC-based system.


### wasmCloud: Declarative WebAssembly Orchestration for Cloud Native Applications | Project Lightning Talk
wasmCloud released its 1.0 version in April of this year. Since then, the project has done everything but slow down. Maintainer Brooks Townsend demonstrates how wasmCloud enables users to build and orchestrate WebAssembly (Wasm) applications across distributed infrastructure. Learn how wasmCloud integrates the latest developments in WebAssembly standards to help users create and deploy applications “building block” style—connecting portable, interoperable Wasm components so they can focus on business logic. In this lightning project update, Brooks discusses wasmCloud’s component support, distributed networking, declarative orchestration, OpenTelemetry observability, the project roadmap, and more.


### SIG Auth & SIG Storage: Secret Guardians - (Secrets Store) CSI Driver and Sync Controller | Project Lightning Talk
Applications running on Kubernetes require access to sensitive information (passwords, SSH keys and authentication tokens). But how do you configure your applications when the source of truth for these secrets is an external secret store? What if you need to store, retrieve and perform zero touch rotation of these secrets securely? Meet the (Secrets Store) CSI Driver and Sync Controller, sig-auth subprojects providing a simple way to retrieve secrets from enterprise-grade external stores such as Azure Key Vault, Google Secret Manager and HashiCorp Vault.
In this lightning talk, Anish will introduce you to the (Secrets Store) CSI driver and Sync controller and discuss trade-offs of the CSI driver versus Sync controller.


### Eraser: Cleaning Up Vulnerable Images from Kubernetes Nodes | Project Lightning Talk
Supply chain security is an increasingly important issue in cloud-native computing. It is common for pipelines to build and push images to the cluster, but uncommon for those images to be removed from a node’s local store once a CVE has been disclosed. Kubernetes has no built-in solution to this problem: its garbage collection only responds to disk pressure. As images become outdated, they present a risk as users may run a vulnerable container. Eraser, a CNCF sandbox project, is an open source solution that automates the scanning and removal of non-running images.

### Envoy: Highlights of Envoy Gateway v1.1.0 - What’s New and Improved | Project Lightning Talk
Envoy Gateway (EG) released its latest version, 1.1.0, on July 22. This update marks the first feature release since the 1.0.0 GA (General Availability) version and includes multiple new features and improvements. In this lightning talk, I will highlight some of the most important new features, including Wasm extension, non-k8s support, IP allow/deny list, stateful service support, etc.


### Harbor: Harbor and the World of SBOMs | Project Lightning Talk
Discover how integrating SBOM (Software Bill of Materials) with Harbor enhances your software supply chain security. In this lightning talk, we'll cover:
- What is SBOM?: Quick overview of its role in software transparency.
- Integration with Harbor: Highlights of the SBOM integration in Harbor v2.11.
- Security Best Practices: Using SBOM to identify and address vulnerabilities.
Perfect for software engineers, DevOps professionals, and security enthusiasts looking to strengthen their software supply chain.


### SlimToolkit: Improving DX with Containers - Making it Easy to Understand, Optimize, and Debug Your Containers | Project Lightning Talk
This talk will introduce the key capabilities in SlimToolkit: inspecting, minifying, and debugging containers that will enhance your developer experience with containerized applications.
We'll walk through a number of short examples showing how common container related problems can be addressed using various commands provided by the tool.
* Are the popular recommendations to create production-ready containers not possible in your environment, or is it just too much work?

### Open Policy Agent (OPA): That's One Small Bump for OPA, but One Giant Leap for Policy as Code | Project Lightning Talk
At last, OPA's made it to v1! Let's take a whistle-stop tour of what's involved in cutting a v1 release for a project with over 3.5 billion downloads, its own language, and a large community. Get the latest updates, and glimpse into the future in this light speed overview!


### Falco: Evolution of Real Time Cloud Security with Falco | Project Lightning Talk
Falco, the CNCF runtime security project, can continuously monitor your entire environment looking for suspicious activity. From bare metal servers to massive Kubernetes clusters made of hundreds of thousands of nodes to your cloud provider activity, Falco and its powerful detection rule system have you covered. In this Lightning Talk, Luca and Melissa will focus on how the Falco project is constantly evolving to meet defenders' needs by providing rich libraries of detection rules, making it easier to customize them, catch bypass attempts and bring light to every dark corner of modern cloud infrastructures.


### Copa: Project Copacetic - Directly Patch Container Image Vulnerabilities | Project Lightning Talk
Maintaining secure container images and addressing new vulnerabilities quickly is a major challenge. To patch images, users face two options: wait for third-party authors to release updates, which can take weeks, or perform a full image rebuild, a time and resource-intensive process.
Project Copacetic (Copa) enhances the image patching process, reducing turnaround time and complexity. It integrates easily into existing build infrastructure, giving users greater control over their patching timeline and reducing costs.


### OpenFGA: The Cloud Native Way to Implement Fine Grained Authorization | Project Lightning Talk
This talk will be a short introduction to OpenFGA, a report on the state of the project and an exploration of different adoption use cases from companies of all sizes.


### Meshery: Visualizing Kubernetes Resource Relationships with Meshery | Project Lightning Talk
Meshery and its extensions empower you to navigate cloud native infrastructure in complex environments. This lightning talk delves into the human-computer interaction (HCI) principles that underpin MeshMap's intuitive visualization of Kubernetes resources and the various forms of inter/relationships with other CNCF projects' resources.

Human-Computer Interaction Principles in Meshery:

- Cognitive Load: How Meshery reduces cognitive load by presenting complex information in a structured and visually digestible manner.
- Mental Models: How Meshery aligns with users' mental models of Kubernetes environments, facilitating comprehension and navigation.
- Visual Perception: How Meshery leverages visual cues, colors, and layout to guide users' attention and highlight critical information.


### Flux: What's Flux and What's New? | Project Lightning Talk
Get a quick intro to GitOps and progressive delivery using Flux, how to get started, and new capabilities with the last release of 2024.
We'll walk you through key features of Flux (a graduated project and GA) such as being multi-everything (multi-tenant, multi-cluster, etc.). And Flux works with your existing tools (like CI and Kubernetes tools).
We'll cover reliability and security reasons that Flux is the GitOps tool of choice for cloud vendors, global enterprises, and other companies.


### Lightning Talk: `Kubectl Debug` Lacks an `IDE` Option. Let’s Fix That! - Mario Loriedo, Red Hat
Don't get me wrong. `kubectl debug` is one of my favorite `kubectl` commands. But probably because I like it so much, I am convinced it deserves more love! This talk will present a `kubectl debug` extension that starts an IDE in an ephemeral container for debugging purposes. This extension uses the DevWorkspace operator, which is capable of running lightweight cloud development environments, including the IDE, in containers. If you like debugging by adding breakpoints in an IDE rather than inspecting your application's logs, you should attend this talk.


### Effortless, Sidecar-Less Mutual TLS and Rich Authorization Policies up and Running in 5 Minutes - Lin Sun, solo.io
Do you need zero trust or mutual TLS (mTLS) among your application pods? You may be able to manage certificates within your applications, but how would you handle automatic periodic certificate rotation? The evolution of sidecar-less service mesh technology enables mTLS among application pods with just a simple namespace label. No sidecars or application pod restarts are required. This approach provides immediate benefits, including cryptographic identity for application pods, and ensures session-based data confidentiality and integrity in pod communications. In just 5 minutes, Lin will demonstrate live how developers and operators can effortlessly enforce mTLS and rich Layer 7 (L7) authorization policies without any sidecars!


### Is Everyone O-KEDA? “Exciting” Lessons Learned in Our Journey to Use KEDA Pod Autoscaling - Brian Davis, Red Canary
We thought that changing our Kubernetes pod autoscaler seemed like a really straightforward thing to do. With relative ease, we yanked out our old custom pod autoscaler and replaced it with KEDA. We were impressed with the flexibility and control we now had in our cluster, but then discovered a set of really hard lessons that no one had anticipated. In this lightning talk, I’ll hit the highlights of secondary issues we encountered due to such a seemingly simple change, such as Docker Hub rate limits, Kubernetes metrics server failures and their exciting impact on our cluster, AWS rate limits, and late night fights with Argo CD for control of pod maximums. Lastly, I’ll share my personal favorite topic: the “Night Club Theory” of autoscaling tuning. If you or someone you love is thinking of changing your autoscaler, I recommend spending 5 minutes with me to learn the things you should be aware of before you make the switch!


### Keynote: Multicluster Batch Jobs Dispatching with Kueue at CERN - Ricardo Rocha, Lead Platforms Infrastructure, CERN & Marcin Wielgus, Staff Software Engineer, Google
With the skyrocketing demand for GPUs and problems with obtaining the hardware in requested quantities in desired locations, the need for multicluster batch jobs is stronger than ever.
During this talk we will show how you can automatically find the needed capacity across multiple clusters, regions or clouds, dispatch the jobs there and monitor their status. We will discuss the setup in both fixed-size on-prem environments, fully autoscaled clusters running on clouds and mixed, hybrid environments. In the end we will present what a recent effort for a multicluster setup looks like at CERN, do a quick (but impressive) demo, and share the lessons learned during the deployment.

### Advanced Model Serving Techniques with Ray on Kubernetes - Andrew Sy Kim, Google & Kai-Hsun Chen, Anyscale
With the proliferation of Large Language Models, Ray, a distributed open-source framework for scaling AI/ML, has developed many advanced techniques for serving LLMs in a distributed environment. In this session, Andrew Sy Kim and Kai-Hsun Chen will provide an in-depth exploration of advanced model serving techniques using Ray, covering model composition, model multiplexing and fractional GPU scheduling. Additionally, they will discuss ongoing initiatives in Ray focused on GPU-native communication, which, when combined with Kubernetes DRA, offers a scalable approach to tensor parallelism, a technique used to fit large models across multiple GPUs. Finally, they will present a live demo, demonstrating how KubeRay enables the practical application of these techniques to real-world LLM deployments on Kubernetes. The demo will showcase Ray’s powerful capabilities to scale, compose and orchestrate popular open-source models across a diverse set of hardware accelerators and failure domains.


### TUF: Secure Distribution Beyond Software - Marina Moore, Independent
As organizations improve their software supply chain, they may encounter an influx of metadata: attestations, SBOMs, VEX statements, and more. Have you ever wondered how to securely distribute all of this information to end users? Enter TUF! The Update Framework (TUF), has paved the way for secure software updates throughout the cloud native ecosystem and beyond, and is being expanded to securely distribute signing keys, attestations, and more. TUF allows organizations to ensure that all of this data is up-to-date and resilient to tampering. The TUF project is constantly improving and this talk will highlight some of these improvements, from recent integrations by groups such as Docker and Github to an effort to provide conformance testing across various TUF implementations. The TUF project has an active team of maintainers and contributors that make all of these improvements possible, and we will discuss how you can get involved to keep making the project better.


### Unlocking Cost Savings & New Possibilities: Your Guide to Prometheus Remote Write 2.0 - Callum Styan, Grafana Labs & Bartłomiej Płotka, Google
Prometheus Remote Write is the protocol used to send Prometheus metrics from Prometheus or any other metric source to compatible remote storage endpoints such as Thanos and Cortex. Remote Write is generally used for metric long term storage, centralization, and cloud services. It also enables users to run Prometheus in an agent mode, reducing local storage requirements. Welcome to Remote Write 2.0! In this talk, Bartek and Callum, Prometheus maintainers and RW 2.0 spec co-authors, will introduce you to the next iteration of the popular protocol, which adds more functionality while cutting your egress costs by up to 60% and keeps the previous version's easy-to-implement stateless design! The audience will learn what's changed in the second version of Remote Write, what it unlocks, and how easy it is to update or adopt. Finally, the speakers will share the latest benchmarks and differences with the common alternatives.


### AuthZEN: The “OpenID Connect” for Authorization - Omri Gazitt, Aserto
Today, the authorization world is fractured - each vendor supports its own APIs & protocols. But this is about to change. AuthZEN, a new OpenID Foundation working group, was created in late 2023 to establish authorization standards. OIDF is the home of OpenID Connect, the ubiquitous standard for federated login, and that’s where we’re setting our sights. In this talk, I'll describe the current state of cloud-native authorization, including the policy-as-code and policy-as-data approaches, and the various open source projects in each camp. I'll also share the progress we’ve made creating a single authorization API that works across both policy-as-code (OPA, Topaz) and policy-as-data (Zanzibar-style projects), present the API specs we've created so far, and show off the various interoperable implementations. With this foundation in place, engineering teams can be more confident in externalizing their authorization and picking a provider without being locked in to a proprietary API.


### Best Friends Keep No Secrets: Going Secretless with Cert-Manager - Ashley Davis & Tim Ramlot, Venafi
In today's complex Kubernetes environments, managing secrets securely is a challenge. Traditional methods often involve complex configurations with secret vaults, secret syncing and secret backups. Regardless of which fancy technology is used, secrets always come with a risk of being leaked. Most of the secrets used in traditional applications can be replaced by short-lived certificates. Applications can prove ownership of a certificate without sharing any secrets. In Kubernetes, cert-manager can be used to provision these certificates to all applications without sharing any secret information.

Table of contents:

- Do we actually need secrets? Comparing authentication methods: static secrets vs short-lived secrets and proof of ownership
- How to issue certificates using cert-manager without using [S|s]ecrets
- Compatibility and other challenges


### The Hard Truth About GitOps and Database Rollbacks - Rotem Tamir, Ariga
For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis. In this talk, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.


### Breaking Free from Vulnerability Scanning Noise: Automated VEX Aggregation for Accuracy - Teppei Fukuda, Aqua Security Software Ltd.
Vulnerability scanners detect known vulnerabilities in software dependencies, but often produce inaccurate results (false-positives) due to their inability to automatically determine if a vulnerability is actually exploitable. Vulnerability Exploitability eXchange (VEX) is an industry-wide initiative that aims to address this issue, but the lack of standardized distribution hinders its effective utilization. This talk introduces VEX Hub, a central repository that automatically aggregates VEX documents published by open-source projects. VEX Hub’s unique architecture makes it easy and practical for software maintainers to start adopting VEX, while at the same time making it seamless for scanners and users to incorporate VEX in their workflow. The presentation showcases a practical use case of VEX Hub with Trivy, an open-source security scanner that popularizes VEX thanks to VEX Hub and delivers more accurate and actionable scanning results to its users.


### Cilium, eBPF, WireGuard: Can We Tame the Network Encryption Performance Gap? - Daniel Borkmann & Anton Protopopov, Isovalent
To increase data security for cloud and hybrid cloud deployments, many companies, governments, standards, and tenders require data in transit to be protected. However, network encryption comes at a cost - what is the performance impact and how can we reduce it? In this session, we explore how network encryption can be efficiently enforced with Cilium, eBPF, and WireGuard. We dive deep into Cilium’s integration of WireGuard and elaborate on both the management plane and Cilium’s eBPF datapath. We analyze and benchmark what performance cost one can expect and explore opportunities in the Linux kernel to reduce that price. This talk is for operators and security teams that need to encrypt network traffic, but also want to minimize its overhead. The audience will walk away understanding whether network encryption needs to come at a high toll and whether there are opportunities for optimizations.


### Kubernetes Data Protection WG Deep Dive - Dave Smith-Uchida, Veeam
Data Protection WG is dedicated to promoting data protection support in Kubernetes. The Working Group is working on identifying missing functionalities and collaborating across multiple SIGs to design features to enable data protection in Kubernetes. In this session, we will discuss what is the current state of data protection in Kubernetes and where it is heading in the future. We will also talk about how interested parties (including storage and backup vendors, cloud providers, application developers, and end users, etc.) can join this WG and contribute to this effort. Details of the WG can be found here: https://github.com/kubernetes/community/tree/master/wg-data-protection.


### Bridging Clouds: TikTok’s Blueprint for Unified OIDC Access on Multi-Cloud Kubernetes - Naveen Mogulla, TikTok
As businesses embrace increasingly complex multi-cloud environments, managing access across diverse Kubernetes setups becomes paramount. At TikTok, we faced the challenge of unifying OpenID Connect (OIDC) access for Kubernetes clusters across GKE, EKS, OKE and on-prem clusters, each providing different levels of support and integration. This talk will detail our journey to develop a scalable, centralized OIDC framework using a reverse proxy approach, ensuring seamless authentication and authorization across different cloud providers. We will discuss our architectural strategy, highlighting how we leveraged Envoy for request handling and dynamic configuration with external authorization filters to accommodate diverse OIDC implementations. Discover how TikTok went from identifying OIDC discrepancies among providers to implementing a unified solution that not only simplifies k8s access management but also reinforces security and compliance across our global, multi-cloud infrastructure.


### Tutorial: Confidential Containers 101: A Hands-on Workshop - Archana Choudhary & Suraj Deshmukh, Microsoft
As traditional enterprises with stringent data protection requirements become cloud-native and migrate to Kubernetes on public clouds, they are wondering: “Is my data secure on this shared hardware? Can someone with host access snoop on my data?” Especially with the upcoming Digital Operational Resilience Act (DORA) in Europe mandating data protection in use, it’s crucial for users to familiarize themselves with solutions like Confidential Containers (CoCo), a CNCF sandbox project. In this first-of-its-kind hands-on workshop, we’ll dive deep into using CoCo with k8s. We’ll explore real-world challenges, such as ensuring data confidentiality from platform owners (cloud providers), and show you how to overcome them. Through practical exercises, you’ll learn to set up CoCo and secure your containerized workloads, turning theory into practice. Attendees will discover streamlined practices, find robust protection mechanisms, and gain strategic insights into adopting CoCo.


### Extending the Gateway API: The Power and Challenges of Policies - Kate Osborn, NGINX
From the beginning, the Gateway API has been designed to be extensible. With over 25 implementations to date, it’s crucial that these implementations have a way to support implementation-specific features without resorting to annotations. Among the various ways to extend the Gateway API, the Policy Attachment mechanism stands out as the most potent and challenging. In this session, we will explain what Policy Attachment is and share the lessons we learned at NGINX when implementing our own Policies. You will learn about:

- The difference between direct and inherited policies.
- How policy inheritance and merging work.
- Corner cases, such as conflicting policies and invalid target refs.
- Techniques to verify if a policy has been successfully applied.
- Strategies for troubleshooting policies.

We will show you examples of Gateway API policies as well as policies from multiple Gateway API implementations.


### Mastering ApplicationSet: Advanced Argo CD Automation - Alexander Matyushentsev, Akuity
Argo CD has become an essential deployment tool that engineers use to automate various infrastructure management use cases across hundreds of clusters. This presents a new challenge of managing Argo CD applications at scale. The Argo CD team has explored multiple approaches to solving this, resulting in the creation of ApplicationSet. Over time, ApplicationSet has gained many features, becoming sophisticated and quite complex to use. In this session, we will dive into advanced ApplicationSet features: orchestrating complex rollouts of ingress controllers across multiple clusters and accommodating snowflake clusters. We will enable the audience to answer these and many other questions about using ApplicationSet. Finally, we will demonstrate an effective way to debug ApplicationSet specifications without digging through logs and altering production Argo CD settings.


Ekansh Gupta
SDE, SigNoz
Shivanshu Raj Shrivastava
Founding Engineer, SigNoz


Global Payments: Setting New Standards for Reliability in Cloud Native Multi-Region Applications - Trey Caliva, Global Payments
As a multinational FinTech provider, processing over 32 billion card transactions for 816 million accounts, Global Payments requires globally available architectures with quick disaster recovery while maintaining subsecond latencies. In addition, these workloads require strict adherence to compliance standards. This session will explore the high-level architectural decisions implemented in a cloud-native redesign and cloud migration of a mission critical legacy .NET application. Key cloud native tools leveraged include Kubernetes on GCP, and the use of CockroachDB as a cloud native database solution. We will explore how leveraging these cloud native technologies achieved extreme fault tolerance in a multi-region deployment, setting new standards for performance and reliability.
Jim Hatcher Solutions Engineer, Cockroach Labs
Trey Caliva Principal Cloud Architect, Global Payments


Scale Job Triggering with a Distributed Scheduler - Cassie Coyle & Artur Souza, Diagrid
Imagine scheduling thousands or millions of jobs that are persisted, triggered on time, and resilient to downtime. Some jobs might be triggered every second while others need to be reliably triggered on the first day of the month. Achieving high throughput and reliability is critical for the performance and operational efficiency of modern distributed systems. How can traditional cron job scheduling be extended? How can distributed systems handle job scheduling with minimal downtime? What challenges arise when scaling job scheduling to thousands or millions of jobs? In this session, Artur and Cassie will delve into the design of Dapr’s distributed Scheduler and how users can start using it today. You will gain a comprehensive understanding of how Dapr’s Scheduler unblocks scalability of actors and workflows while also enabling new capabilities, like delayed pubsub and the schedule job API.
Speakers
Artur Souza Head of Engineering, Diagrid
Cassie Coyle Software Engineer, Diagrid


CEL-Ebrating Simplicity: Mastering Kubernetes Policy Enforcement - Kevin Conner, Getup Cloud & Anish Ramasekar, Microsoft
As Kubernetes deployments grow increasingly complex, robust policy enforcement is crucial. The Common Expression Language (CEL) provides a powerful solution, enabling the creation of sophisticated, human-readable expressions for Kubernetes policies. This session explores CEL's integration with Kubernetes, simplifying policy definition and enforcement. Key takeaways:

* Fundamentals of CEL and its Kubernetes integration.
* Practical use cases for CEL in admission control, resource management, and security.
* Enhancing policy expressiveness and flexibility with CEL.
* Introduction to CEL Playground for testing and validating CEL expressions.

Through live demos, learn to leverage CEL and CEL Playground for streamlined policy management in Kubernetes. Ideal for administrators, developers, and DevOps professionals, this session equips you to enhance your Kubernetes policies using CEL. Join us to discover how CEL and CEL Playground can transform your Kubernetes policy management.
Anish Ramasekar Principal Software Engineer, Microsoft
Kevin Conner Chief Engineer, Getup Cloud


Linkerd Update: Ingress, Egress, IPv6, Enhanced Multicluster, Rust, and More - William Morgan, Buoyant
The pace of feature delivery in Linkerd has never been higher. In this whirlwind project update by Linkerd maintainers and directors, you'll learn about the latest developments and upcoming features. We'll discuss new support for egress traffic control and visibility, ingress traffic handling, UX improvements to multicluster, new support for IPv6, and more. Come prepared to learn about the world's fastest, lightest service mesh!


SIG Instrumentation Introduction and Deep Dive - Han Kang, David Ashpole & Richa Banker, Google; Damien Grisonnet, Red Hat
Kubernetes SIG Instrumentation is responsible for ensuring high quality and consistent instrumentation across the Kubernetes project. We will begin with an introductory overview of the efforts the SIG Instrumentation has worked on in the past and is currently working on. This deep dive session will go into detail about currently ongoing efforts happening within SIG Instrumentation to share with the audience concrete pieces of work to encourage future collaboration. Software engineering and operations are both disciplines practiced in SIG Instrumentation, and any experience will help the special interest group's mission. Join this session to learn how to get involved in SIG Instrumentation to make instrumentation even better!


Container Image Workflows at Scale with Buildpacks - Jesse Brown & Terence Lee, Heroku
Buildpacks transform source applications into images that run on any cloud. Each output image contains a full Software Bill of Materials which allows platform developers to know precisely what software is deployed. This makes them an excellent solution where a container runtime is provided to untrusted or semi-trusted development teams. There are wider use-cases where many application development teams share a common runtime, like Kubernetes. In this talk we look at using Buildpacks to deploy web applications at scale, consider batch processing in large workflows (particularly AI/machine learning workflows), and look at an example Functions as a Service platform that uses Buildpacks.


Poster Session (PS06): What's Happening with SPIFFE and WIMSE? - Daniel Feldman, Qusaic
This session will be a very brief overview of what's going on with the SPIFFE and WIMSE identity standards projects. SPIFFE is a CNCF effort to standardize workload identity implementations. That is, a SPIFFE implementation can grant services unique identities and credentials. WIMSE is an IETF effort to build on the SPIFFE foundation. In particular, it adds a new, unique token format that allows securely recording multi-hop identity information. Implementors will be able to use this token format to build complete, end-to-end, cryptographically auditable identity records.


Keynote: Open Source Security Is Not A Spectator Sport - Justin Cappos, Professor, NYU & Santiago Torres Arias, Assistant Professor, Purdue University
The CNCF has been a trailblazer in resilient open source software security by enabling innovation, coordination and community building. We will highlight some of the efforts and resources provided by TAG Security including security assessments for CNCF projects, one of the first supply chain security recommendations, A Reference Architecture to Securing the Software Supply Chain, and the Cloud Native Security Whitepaper.
We’ve done this all by fostering an open and welcoming community of security professionals. Come and join our community and help us improve cloud-native security for all!
Speakers
Justin Cappos, Professor, NYU
Santiago Torres Arias, Assistant Professor of Electrical and Computer Engineering, Purdue University


GitOps at Production Scale with Flux - Leigh Capili, Flox & Priyanka Ravi, G-Research
In this session, Leigh and Pinky will cover best practices when running Flux at scale in production. We'll start with an overview of the scaling capabilities of Flux controllers:

* Vertical Scaling
* Horizontal Scaling
* Sharding

We'll dive deeply into each method and explain when and how to use them considering multi-tenancy, cluster fleet size, and workload complexity. We'll also introduce the Mean Time To Production benchmarking tool the Flux team has developed using CUE lang and Timoni. The benchmark measures the time it takes for Flux to deploy thousands of Helm charts and Kustomize overlays on Kubernetes clusters. We'll explain the benchmark results and share lessons from running it on different Kubernetes distributions and providers. The session will conclude with the Flux roadmap and our API promises now that Flux is GA.

Leigh Capili Senior DevRel Engineer, Flox
Priyanka Ravi Platform Tech Advocate, G-Research


What's New with Kubectl and Kustomize … and How You Can Help! - Eddie Zaneski, Defense Unicorns & Arda Guclu, Red Hat


SPIRE: Intro & In-Depth Exploration of the Upcoming Forced Rotation and Revocation Feature - Agustín Martínez Fayó & Marcos Yacob, Hewlett Packard Enterprise
Join us for an insightful session on the SPIRE project, where we’ll provide a comprehensive introduction covering the foundational aspects of SPIRE, detailing its architecture, capabilities, and the problems it solves. Additionally, we’ll delve into the exciting upcoming updates for the project, with a special focus on the highly anticipated forced rotation and revocation feature that will provide a rapid, reliable, and automated mechanism for recovering from key compromise. Whether you’re new to SPIRE or an experienced user, this talk will equip you with the knowledge of current developments and prepare you for the future enhancements that will further strengthen your infrastructure to provide secure identities for workloads.


How to Move from Ingress to Gateway API with Minimal Hassle - Keith Mattix, Microsoft
For many, the Ingress resource was one of the first Kubernetes APIs they used, adding HTTP routing rules and SSL certs for cluster-external traffic. These APIs are used for production in clusters across the world today, configuring ingress gateways serving hundreds of thousands of connections per second. As of October 2023, the Ingress API has been superseded by the Gateway API, a new set of Kubernetes resources with over 20 implementations that enforces security best practices by design. However, migrating networking APIs is an intimidating task, and doing so safely is every company’s primary concern. Join this session to learn how to make this migration safe by identifying the best migration path, implementing Gateway API best practices, and utilizing community-supported migration tools such as ingress2gateway.


What Agent to Trust with Your K8s: Falco, Tetragon or KubeArmor? - Henrik Rexed, Dynatrace
In the CNCF landscape we have plenty of eBPF-based security solutions that help us protect our k8s clusters from runtime vulnerabilities. On paper, though, Falco, Tetragon and KubeArmor look very similar. Eventually you have to make a choice on which one best fits your needs. Join this session for additional insights to help you decide. We have run extensive benchmarks against those three solutions and will answer the following questions that came out of our testing:

* What are the different feature sets?
* What about the performance impact of each agent?
* Which privileges does each solution need?
* What are the pros and cons across the three options?

Speakers
Henrik Rexed Cloud Native Advocate, Dynatrace


What Istio Got Wrong: Learnings from the Last Seven Years of Service Mesh - Christian Posta & Louis Ryan, Solo.io
Building complex systems often requires simplicity in components—a lesson the Istio project has learned throughout its seven(plus)-year journey. Although Istio offers a lot of powerful features for application networking, crucial for many organizations, the path to maturity and broader adoption was fraught with challenges. In this talk, we explore the key mistakes made during Istio's development, including its initially complex architecture, an overload of features, premature release of version 1.0, difficulties faced by contributors, and delays in joining the CNCF. We will discuss the impact of these mistakes, how these missteps were addressed, and how they have positioned Istio as a leader in the service mesh market. This presentation will detail how Istio's evolution reflects a shift towards simpler, more modular components that together offer effective solutions for managing APIs and service-to-service communication regardless of platform.
Speakers
Louis Ryan CTO, Solo.io
Christian Posta Global Field CTO, Solo.io


How the Tables Have Turned: Kubernetes Says Goodbye to Iptables - Casey Davenport, Tigera & Dan Winship, Red Hat
For decades, iptables has been the preferred packet filtering system in the Linux kernel. Used extensively across the Kubernetes networking ecosystem, iptables is now on the way out and is expected to be removed from the next generation of Linux distributions. With iptables past its prime, where does that leave Kubernetes? The successor to iptables -- nftables -- is ready to carry the torch instead, with a newly released beta kube-proxy implementation in v1.31 and network policy using Calico’s nftables backend. In this talk, Dan and Casey will share what they have learned building Kubernetes Service and NetworkPolicy implementations using nftables. They will cover the history and current status of iptables usage in Kubernetes, the capabilities and performance characteristics of Kubernetes networks running on nftables, and why eBPF may not be the right tool for the job.
Speakers
Casey Davenport, Tigera
Dan Winship Senior Principal Software Engineer, Red Hat


Contribfest: Enhancing Kubernetes Debugging and Observability with Inspektor Gadget
Let’s dive into the world of Kubernetes observability and debugging by joining the Inspektor Gadget Contribfest. Inspektor Gadget is both a collection of eBPF tools (Gadgets) and a systems inspection framework for Kubernetes, containers, and Linux hosts. In this session, maintainers will give a quick introduction to the Inspektor Gadget project and will guide participants to set up their development environment. The gadgets concept will be introduced, and we’ll guide participants to create a simple hello world gadget. Then, participants will be able to contribute in different ways:

* By building gadgets for new use cases
* By extending the existing gadgets
* By brainstorming ideas for new features

Speakers
Mauricio Vásquez Bernal Principal Software Engineer, Microsoft
Jose Blanquicet Senior Software Engineer, Microsoft


Contribfest: Helm 4: The Next Generation of the Kubernetes Package Manager
Love it or hate it, there is little argument that Helm remains a popular choice for packaging Kubernetes applications. As the project embarks on its first new major version since 2019, Helm 4, anyone who makes use of Helm, whether as a producer or consumer, has the opportunity to help shape its future and direction. Join members of the Helm community for a unique opportunity to take part in the development of Helm 4 so it can provide the next generation of Kubernetes applications and users the package manager for today and tomorrow. In this session, attendees will learn about:

* The key features being considered
* Support for Helm 3 before, during and after Helm 4 is released
* How to get involved in the Helm project, including the various roles and responsibilities
* The process for contributing to the Helm codebase

This is a session no Kubernetes contributor will want to miss.
Speakers
Andrew Block Distinguished Architect, Red Hat
Matt Farina Distinguished Engineer, SUSE


What Containerd 2.0 Means for You - Samuel Karp, Google
containerd 2.0 is the first major new version of containerd since 1.0.0 was released in 2017. This new version of containerd introduces new features, new extension points, and new backends for image operations and CRI with the goal of increased flexibility and better efficiency for certain types of workloads. containerd 2.0 also removes some previously-deprecated features in favor of modern replacements. This talk will discuss how to prepare for containerd 2.0 in your production environments, including strategies for incorporating containerd 2.0's new functionality and detecting/remediating any impact of removed features prior to upgrading.


Object Storage Is All You Need - Justin Cormack, Docker
When Jeff Bezos commissioned Amazon S3 he called it "malloc for the web"; since then many people have considered cloud object storage to be a weird kind of non-POSIX filesystem, but also a great backing store for websites or for storing lots of data.


Nothing but NATS - Going Beyond Cloud Native - Byron Ruth & Kevin Hoffman, Synadia
These days building so-called cloud-native apps involves assembling a custom stack of tools 10x bigger than the app we're building. Additionally, applications increasingly need to expand out to the edge and cloud-native stacks simply don't work in those environments. Fortunately with NATS, we don't need a stack. In this session you'll see how we can leverage compute, storage, and connectivity to build cloud-to-edge native apps more powerful than ever, with less code, effort, and frustration.
Speakers
Byron Ruth, Synadia
Kevin Hoffman Engineering Director, Cloud Platform, Synadia


Pushing Authorization Further: CEL, Selectors and Maybe RBAC++ - Mo Khan & Rita Zhang, Microsoft; Jordan Liggitt, Google
Significant changes have been made to authorization in recent versions of Kubernetes. For example, common expression language (CEL) in validating admission policy (VAP) can access the authorizer to perform runtime checks during admission. Authorization has also been made aware of label and field selectors, which are available as extra info to be used by webhooks and CEL expressions in VAP. Looking forward, Kubernetes RBAC could be enhanced to take advantage of this new info. RBAC++ is a proof of concept design to combine CEL with RBAC to allow for conditional bindings at runtime. Thinking about even more experimental changes: what if authorization (and RBAC++) could directly assert conditions at admission time?
Speakers
Rita Zhang Principal software engineer, Kubernetes SIG Auth co-chair, Security Response Committee, Microsoft
Mo Khan Software Engineer, Microsoft
Jordan Liggitt Software Engineer, Google


Why Perfect Compliance Is the Enemy of Good Kubernetes Security - Michele Chubirka, Google
Technology organizations often struggle over who should manage the security of their Kubernetes environment. This task usually falls to platform or cloud engineering teams, but they often feel abandoned by their security counterparts, uncertain of which requirements will deliver real security value. While published benchmarks and security guides for Kubernetes are helpful, not all recommendations work for every use-case. They may require Kubernetes alpha or beta features which could cause issues with platform stability. Our desire to prioritize “perfect” security over having a functional platform that addresses relevant risks can leave us with nothing, frustrating everyone. Kubernetes is meant to increase application delivery velocity, but when overly strict compliance prevents a team from moving forward, they will subvert security requirements. Let’s stop obsessing over the red in our security and compliance dashboards and focus on what adds real value by reducing risk.
Speakers
Michele Chubirka Cloud Security Advocate, Google


Rook: Intro and Deep Dive with Ceph Storage - Travis Nielsen, Annette Clewett, Blaine Gardner & Subham Rai, IBM
The Rook project will be introduced to attendees of all levels and experience. Rook is an open source cloud-native storage operator for Kubernetes, providing the platform, framework, and support for Ceph to natively integrate with Kubernetes. The panel will discuss various scenarios to show how Rook configures Ceph to provide stable block, shared file system, and object storage for your production data. Rook was accepted as a graduated project by the Cloud Native Computing Foundation in October 2020.


SPIFFE the Easy Way: Universal X509 and JWT Identities Using Cert-Manager - Tim Ramlot & Ashley Davis, Venafi
SPIFFE is incredible. Each workload is assigned its own universal identity, simplifying the security and management of communications in distributed systems. While SPIRE (the reference SPIFFE implementation) is exceptionally powerful, it is also quite complex. Deploying SPIRE on Kubernetes requires StatefulSets, which can be challenging and frustrating. Many cloud vendors are starting to offer turnkey SPIFFE solutions, but that comes with risk of vendor lock-in. In this talk, we will demonstrate how to use the Cloud Native cert-manager solution to implement SPIFFE (x509 and JWT) with low operational overhead for all Kubernetes workloads. The session includes all you need to know to issue X.509 SVIDs, use them and validate them. Additionally, we will introduce an experimental solution to convert x509 SVIDs into JWT SVIDs. The demo will highlight how to authenticate to third-party APIs (such as AWS, GCP, Azure, and others) using these JWT SVIDs.
Ashley Davis Staff Software Engineer, Venafi
Tim Ramlot Senior Software Engineer - cert-manager maintainer, Venafi


## OpenTelemetry vs syslog-ng for Logging Pipeline

* <https://www.infoq.com/news/2024/10/cloudflare-opentelemetry/>

...Cloudflare documented how it significantly upgraded its logging pipeline by moving from syslog-ng to OpenTelemetry Collector.

... it collects and processes millions of log events per second from every server in its network. 


...several motivations for this change:
   * Language compatibility: OpenTelemetry Collector is written in Go, which is more familiar to Cloudflare's
     engineering team than the C used by syslog-ng.
   * Easier integration with internal libraries: Building syslog-ng with Cloudflare's internal Post-Quantum cryptography 
     libraries was a challenge. 
   * Enhanced metrics: OpenTelemetry Collector supports Prometheus metrics, allowing the team to gather more detailed
     telemetry data about the collectors' performance.
   * Unified telemetry infrastructure: Cloudflare already uses OpenTelemetry Collectors for some of its tracing 
     infrastructure. Consolidating different types of telemetry into a single system reduces complexity
     for the engineering team.

... As part of the migration, the engineers developed custom components to maintain compatibility
... including a custom exporter for Cloudflare's own log format, a modified file exporter for
additional output formats, processors to incorporate external source JSON data into log entries,
and rate-limiters for individual services to prevent them from overwhelming the logging pipeline.
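
The per-service rate limiting mentioned above can be illustrated with a token bucket kept per service name. This is only a minimal sketch of the general technique, not Cloudflare's actual component (which would be a Go processor inside the Collector); the class names, the `rate`/`capacity` parameters and the injectable `clock` are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an idle service can burst
        self.updated = now

    def allow(self, now):
        # Refill according to the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerServiceRateLimiter:
    """Hypothetical limiter: one independent bucket per service name."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock  # injectable so the behaviour can be tested
        self._buckets = {}

    def allow(self, service):
        now = self._clock()
        bucket = self._buckets.get(service)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[service] = bucket
        return bucket.allow(now)
```

A log pipeline would call `limiter.allow(service_name)` for each record and drop (or count) the record when it returns `False`, so that one noisy service exhausts only its own bucket.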

... Cloudflare is far from alone in moving to OpenTelemetry, with other big companies such as Shopify, Splunk, Google
and GitHub also adopting the technology. 

...In a Google Cloud webinar, some of these organisations detailed their OpenTelemetry usage.
* Google is using OpenTelemetry in several products, such as using the collector in Google Kubernetes Engine
  and Google Compute Engine, and replacing OpenCensus SDKs in Cloud Monitoring and Cloud Trace.
* Splunk is adopting OpenTelemetry internally and contributing extensively to the project by using the
  collector for infrastructure monitoring, moving to OpenTelemetry client libraries, and contributing
  to collector and auto-instrumentation development.
* Shopify is migrating its trace collection infrastructure to OpenTelemetry, and implementing PII redaction,
  sampling, and span renaming in the collector.
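
To make the three collector-side steps named for Shopify (PII redaction, sampling, span renaming) concrete, here is a minimal sketch over plain dictionaries. It is not Shopify's configuration and does not use any Collector API: the span layout, the `PII_KEYS` set, the regular expressions and the hash-based sampling rule are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical attribute keys that are always treated as PII.
PII_KEYS = {"user.email", "http.request.header.authorization"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Purely numeric path segments, e.g. "/42" in "/users/42/orders".
ID_SEGMENT_RE = re.compile(r"/\d+(?=/|$)")


def redact_attributes(attributes):
    """Return a copy of the attributes with PII values masked."""
    clean = {}
    for key, value in attributes.items():
        if key in PII_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean


def rename_span(name):
    """Collapse numeric path segments so span names stay low-cardinality."""
    return ID_SEGMENT_RE.sub("/{id}", name)


def keep_trace(trace_id, ratio):
    """Deterministic sampling: the same trace id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(ratio * 2**64)


def process_span(span, ratio):
    """Drop unsampled spans; otherwise return a renamed, redacted copy."""
    if not keep_trace(span["trace_id"], ratio):
        return None
    return {
        **span,
        "name": rename_span(span["name"]),
        "attributes": redact_attributes(span["attributes"]),
    }
```

Hashing the trace id rather than drawing a random number keeps all spans of one trace together, which is why the sampling decision is made per trace and not per span.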

...GitHub is adopting OpenTelemetry to standardise its telemetry practices (vs incompatible tools
   like statsd for metrics, syslog for text logs, and OpenTracing for request traces)
... using OTLP (OpenTelemetry Protocol) as a standard, vendor-neutral format for telemetry signals.
... auto-instrumentation for Ruby and Postgres distributed tracing ...
... The open standards also allow GitHub to create automatic correlation between different signals,
    using OpenTelemetry tracing as the root.
... automatically calculating metrics and converting tracing events to detailed logs.
...  They're also contributing back to the OpenTelemetry project.
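
The idea of using traces as the root signal, then calculating metrics from spans and turning span events into logs, can be sketched as below. This is an illustration under assumed data shapes, not GitHub's implementation: the dictionary layout of spans and events and both function names are hypothetical. The one grounded detail is that the OpenTelemetry log data model has trace id and span id fields, which is what lets a log record be correlated with its trace.

```python
from collections import defaultdict


def spans_to_metrics(spans):
    """Aggregate request count, error count and total duration per span name."""
    metrics = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        entry = metrics[span["name"]]
        entry["count"] += 1
        if span.get("status") == "ERROR":
            entry["errors"] += 1
        entry["total_ms"] += span["end_ms"] - span["start_ms"]
    return dict(metrics)


def span_events_to_logs(span):
    """Turn each span event into a log record that carries the trace context."""
    return [
        {
            "timestamp_ms": event["time_ms"],
            "body": event["name"],
            "attributes": dict(event.get("attributes", {})),
            # Copying the ids is what allows logs and traces to be correlated.
            "trace_id": span["trace_id"],
            "span_id": span["span_id"],
        }
        for event in span.get("events", [])
    ]
```

Because every derived log record keeps the `trace_id` and `span_id` of its parent span, a backend can jump from a log line to the full trace without any extra bookkeeping.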