DevOps
From The Register: Datadog digs down into GPU efficiency as AI costs soar
Datadog introduces GPU monitoring to enhance visibility and cost management for AI-driven organizations.
Tracy is compatible with Kotlin from version 2.0.0 and Java from version 17. It integrates with the OpenAI, Anthropic, and Gemini SDKs, and works with common Kotlin/LLM stacks, including OkHttp and Ktor clients.
The data that feeds your observability tools is out of control: too much of it, low quality, unmanaged, and growing faster than anyone budgeted for. When Sawmills' founders started building the company two years ago, this was already a serious pain point. Costs were climbing. Signal-to-noise was degrading. Teams were drowning in telemetry that told them less and less while costing more and more.
"A central issue here is the fact that, as systems scale, telemetry scales even faster," explained Azulay. "Every service creates metrics. Every request generates traces, and logs multiply as the velocity of deployment increases. This is the structural reality of distributed systems." He points to research from Omdia that suggests organisations consistently "under-instrument" their environments, not because they lack the tools to do so, but because they can't afford to fully use them.
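The "telemetry scales faster than the system" claim can be made concrete with back-of-envelope arithmetic. The sketch below uses hypothetical numbers (not figures from the article): trace volume grows with requests multiplied by the services each request touches, so decomposing a workload into more services inflates span volume far beyond the growth in traffic itself.

```python
# Illustrative only: hypothetical traffic numbers, not data from the article.

def daily_spans(requests_per_day: int, services_per_request: int) -> int:
    """Each service hop in a request typically emits at least one trace span."""
    return requests_per_day * services_per_request

# A monolith: 1M requests/day, each handled by a single "service".
before = daily_spans(1_000_000, 1)

# The same product after growth: traffic doubles, and each request now
# fans out across 8 microservices.
after = daily_spans(2_000_000, 8)

print(before, after, after // before)  # prints: 1000000 16000000 16
```

Traffic merely doubled, but span volume grew 16x, and per-service logs and metrics stack on top of that. That multiplicative structure is why teams end up under-instrumenting: the bill scales with architecture, not just with usage.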
The Old Way (Siloed Tools): The application team opens their APM tool. They see slow transaction times but no obvious errors in their code. They create a ticket for the infrastructure team. The infrastructure team checks their dashboards. Server CPU and memory look fine. They blame the network. The network team checks their monitoring tools. Bandwidth is normal, and latency is low. They declare, "It's not the network!" Hours, or even days, are lost in a painful cycle of finger-pointing while the business loses revenue.
The real cost of poor observability isn't just downtime; it's lost trust, wasted engineering hours, and the strain of constant firefighting. But most teams are still working across fragmented monitoring tools, juggling endless alerts, dashboards, and escalation systems that barely talk to one another, a setup that amounts to chaos disguised as control. The result is alert storms without context, slow incident response times, and engineers burned out from reacting instead of improving.
Dynatrace has launched Dynatrace Intelligence, an agentic operations system that combines deterministic AI and agentic AI. Taking center stage at the observability company's Perform conference, the platform is built to observe and optimize dynamic AI workloads, and is designed to help organizations transition from reactive to autonomous operations, build more resilient applications, and improve customer experiences.
Lead without authority. You may not have direct reports, yet you shape architecture, quality, and the roadmap. Your leverage comes from artifacts, reviews, and clear standards, not from title. I started by publishing a lightweight architecture template and a rollout checklist that the team could copy. That reduced ambiguity during design and cut review cycles by nearly 30 percent.
I once transitioned from a SaaS CTO role to become a business unit CIO at a Fortune 100 enterprise that aimed to bring startup development processes, technology, and culture into the organization. The executives recognized the importance of developing customer-facing applications, game-changing analytics capabilities, and more automated workflows. Let's just say my team and I did a lot of teaching on agile development and nimble architectures.
AI is no longer a research experiment or a novelty in the IDE: it is part of the software delivery pipeline. Teams are learning that integrating AI into production is less about model performance and more about architecture, process, and accountability. In this article series, we examine what happens after the proof of concept and how AI changes the way we build, test, and operate systems.
The more attributes you add to your metrics, the more complex and valuable questions you can answer. Every additional attribute provides a new dimension for analysis and troubleshooting. For instance, adding an infrastructure attribute, such as region, can help you determine whether a performance issue is isolated to a specific geographic area or is widespread. Similarly, adding business context, like a store location attribute for an e-commerce platform, allows you to understand if an issue is specific to a particular set of stores.
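The idea above can be shown with a minimal in-memory sketch. This is not a real metrics SDK: it models a counter keyed by attribute tuples (the `region` and `store_id` names are illustrative), which is conceptually how dimensional metric systems work; each attribute you record becomes a dimension you can slice on later.

```python
from collections import Counter

# Minimal in-memory "metric" keyed by attribute tuples. Conceptually,
# each unique attribute combination is a separate series you can query.
checkout_errors = Counter()

def record_error(region: str, store_id: str) -> None:
    checkout_errors[(region, store_id)] += 1

# Simulated error events.
record_error("eu-west-1", "store-17")
record_error("eu-west-1", "store-17")
record_error("eu-west-1", "store-42")
record_error("us-east-1", "store-99")

# Slice by region: is the issue geographic?
by_region = Counter()
for (region, _), n in checkout_errors.items():
    by_region[region] += n
print(by_region)  # Counter({'eu-west-1': 3, 'us-east-1': 1})

# Slice by store: is it confined to particular stores?
by_store = Counter()
for (_, store), n in checkout_errors.items():
    by_store[store] += n
print(by_store)  # Counter({'store-17': 2, 'store-42': 1, 'store-99': 1})
```

The same raw events answer both the infrastructure question and the business question, precisely because both attributes were recorded. The flip side, worth noting, is cardinality: every extra attribute multiplies the number of distinct series, which is the cost dimension the surrounding articles are wrestling with.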
In 2025, nearly every security conversation circled back to AI. In 2026, the center of gravity will shift from raw innovation to governance. DevOps teams that rushed to ship AI capabilities are now on the hook for how those systems behave, what they can reach, and how quickly they can be contained when something goes wrong. At the same time, observability, compliance, and risk are converging.
Dynatrace started collecting trace data from applications in 2005. Organizations wanted to know why an application was slow and what was happening exactly. That first generation was mainly manual and technical. "It was about collecting data and understanding what was going on," explains Spitzbart. APM has remained the company's foundation for a long time and remains a core component today.
On-call engineers spend hours manually investigating incidents across multiple observability tools, logs, and monitoring systems. This process delays incident resolution and impacts business operations, especially when teams need to correlate data across different monitoring platforms. AWS DevOps Agent (in preview) is a frontier agent that resolves and proactively prevents incidents, continuously improving reliability and performance of applications in AWS, multicloud, and hybrid environments.