DevOps
from InfoQ
17 hours ago
Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes
Yelp upgraded its Apache Cassandra infrastructure across 1,000 nodes without downtime, showcasing effective management of stateful systems at scale.
Blackbox Hosting has consolidated storage from two full racks down to just 8U of rack space following migration to Everpure FlashArray hardware, achieving a 10:1 data reduction ratio and an 85% reduction in power utilization.
Uber's engineering team has transformed its data replication platform to move petabytes of data daily across hybrid cloud and on-premises data lakes, addressing scaling challenges caused by rapidly growing workloads. Built on Hadoop's open-source DistCp framework, the platform now handles over one petabyte of daily replication and hundreds of thousands of jobs with improved speed, reliability, and observability.
There is a growing emphasis on database compliance today due to the stricter enforcement of compliance rules and regulations to safeguard user privacy. For example, GDPR fines can reach £17.5 million or 4% of annual global turnover (the higher of the two applies). Besides the direct monetary implications, companies also need to prioritize compliance to protect their brand reputation and achieve growth.
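The "higher of the two applies" rule above can be illustrated with a minimal sketch. The function name and the sample turnover figures are hypothetical and chosen only for illustration; the cap and percentage are the ones quoted in the teaser, and actual penalties depend on the regulator and the case.

```python
FIXED_CAP_GBP = 17_500_000   # fixed cap quoted in the teaser
TURNOVER_PERCENT = 4         # percentage of annual global turnover quoted in the teaser


def max_fine_gbp(annual_global_turnover_gbp: int) -> float:
    """Return the upper bound on a fine: the higher of the fixed cap
    and the given percentage of annual global turnover."""
    turnover_based = annual_global_turnover_gbp * TURNOVER_PERCENT / 100
    return max(FIXED_CAP_GBP, turnover_based)


# Hypothetical turnover of 100 million: 4% is 4 million, so the fixed cap applies.
small_company = max_fine_gbp(100_000_000)

# Hypothetical turnover of 1 billion: 4% is 40 million, which exceeds the fixed cap.
large_company = max_fine_gbp(1_000_000_000)
```

For smaller organizations the fixed cap is the binding figure; for large ones the turnover-based figure takes over.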
"The job didn't fail. It just... never finished." That was the worst part. No errors. No stack traces. Just a Spark job running forever in production - blocking downstream pipelines, delaying reports, and waking up on-call engineers at 2 AM. This is the story of how I diagnosed a real Spark performance issue in production and fixed it drastically, not by adding more machines - but by understanding Spark properly.
A future-proof IT infrastructure is often positioned as a universal solution that can withstand any change. However, such a solution does not exist. Nevertheless, future-proofing is an important concept for IT leaders navigating continuous technological developments and security risks, all while ensuring that daily business operations continue. The challenge is finding a balance between reactive problem solving and proactive planning, because overlooking a change can cost your organization. So, how do you successfully prepare for the future without that one-size-fits-all solution?
Developers have spent the past decade trying to forget databases exist. Not literally, of course. We still store petabytes. But for the average developer, the database became an implementation detail: an essential but staid utility layer we worked hard not to think about. We abstracted it behind object-relational mappers (ORMs). We wrapped it in APIs. We stuffed semi-structured objects into columns and told ourselves it was flexible.
As businesses contend with ever-increasing data volumes and performance-intensive applications such as AI model training, AI inferencing and high-performance computing, they need infrastructure that delivers speed, scalability and efficiency without added complexity.
Databricks today announced the general availability of Lakebase on AWS, a new database architecture that separates compute and storage. The managed serverless Postgres service is designed to help organizations build faster without worrying about infrastructure management. When databases link compute and storage, every query must use the same CPU and memory resources. This can cause a single heavy query to affect all other operations. By separating compute and storage, resources automatically scale with the actual load.
Core components: According to the documentation, the core components of DynamoDB are tables, items, and attributes. This is accurate as a description of what you can act on through the API, but it is deceptively simple and leaves out two other equally important aspects: what you can do with it (the logical model) and how it scales (the physical model).
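As a rough illustration of the three components named above, the sketch below models a table, its items, and their attributes with plain Python data structures. It does not call the DynamoDB API; the table name, attribute names, and helper function are hypothetical.

```python
# A table is a named collection of items (hypothetical example data).
music_table = {
    "name": "Music",
    "items": [
        # Each item is a set of attributes (name -> value).
        {"Artist": "Artist A", "SongTitle": "Song X", "Year": 2020},
        # Items in the same table need not carry the same non-key attributes.
        {"Artist": "Artist B", "SongTitle": "Song Y"},
    ],
}


def attribute_names(table: dict) -> list[str]:
    """Collect every attribute name that appears on any item in the table."""
    names = set()
    for item in table["items"]:
        names.update(item.keys())
    return sorted(names)


all_attributes = attribute_names(music_table)
```

This only covers the "what you can act on" view; it says nothing about the logical or physical models the teaser goes on to mention.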
By replacing repeated fine‑tuning with a dual‑memory system, MemAlign reduces the cost and instability of training LLM judges, offering faster adaptation to new domains and changing business policies. Databricks' Mosaic AI Research team has added a new framework, MemAlign, to MLflow, its managed machine learning and generative AI lifecycle development service. MemAlign is designed to help enterprises lower the cost and latency of training LLM-based judges, in turn making AI evaluation scalable and trustworthy enough for production deployments.
When I manage infrastructure for major events (whether it is the Olympics, a Premier League match or a season finale), I am dealing with a "thundering herd" problem that few systems ever face. Millions of users log in, browse and hit "play" within the same three-minute window. But this challenge isn't unique to media. It is the same nightmare that keeps e-commerce CTOs awake before Black Friday or financial systems architects up during a market crash. The fundamental problem is always the same: How do you survive when demand exceeds capacity by an order of magnitude?
The main advantage of going the Multi-Cloud way is that organizations can "put their eggs in different baskets" and be more versatile in their approach to how they do things. For example, they can mix it up and opt for a cloud-based Platform-as-a-Service (PaaS) solution when it comes to the database, while going the Software-as-a-Service (SaaS) route for their application endeavors.