#dataset-engineering

[ follow ]
Artificial intelligence
fromWIRED
12 hours ago

AI Could Democratize One of Tech's Most Valuable Resources

Nvidia faces potential competition as startups like Wafer optimize AI code for various chips, challenging its dominance in AI hardware.
Data science
fromInfoWorld
2 days ago

Google Cloud introduces QueryData to help AI agents create reliable database queries

QueryData enhances AI agents' accuracy in querying databases by translating natural language into precise database queries.
Careers
fromFast Company
2 days ago

4 myths about AI in hiring, debunked

AI in hiring can reduce bias compared to human recruiters, challenging common misconceptions about its fairness.
#ai-adoption
Marketing tech
fromAdExchanger
2 days ago

AI Is Nothing Without Data Fidelity. Here's A Four-Step Approach to Protect It | AdExchanger

Data integrity is crucial for effective AI in advertising, as flawed data leads to poor outcomes.
#aws
DevOps
fromAmazon Web Services
2 days ago

Troubleshooting environment with AI analysis in AWS Elastic Beanstalk | Amazon Web Services

AWS Elastic Beanstalk simplifies web application deployment and scaling, now enhanced with AI Analysis for troubleshooting environment health issues.
DevOps
fromTheregister
6 days ago

AWS: Agents shouldn't be secret, so we built a registry

AWS Agent Registry enhances visibility and control over AI agents in corporate environments.
DevOps
fromAmazon Web Services
2 days ago

Troubleshooting environment with AI analysis in AWS Elastic Beanstalk | Amazon Web Services

AWS Elastic Beanstalk simplifies web application deployment and scaling, now enhanced with AI Analysis for troubleshooting environment health issues.
DevOps
fromTheregister
6 days ago

AWS: Agents shouldn't be secret, so we built a registry

AWS Agent Registry enhances visibility and control over AI agents in corporate environments.
Agile
fromAP News
1 month ago

Quantum Agile By Codewave Becomes India's first framework for software teams working with AI agents.

Quantum Agile™ is India's first framework for software teams utilizing AI agents, positioning the country as a leader in next-gen software development methodologies.
Web frameworks
fromInfoQ
6 days ago

Tiger Teams, Evals and Agents: The New AI Engineering Playbook

Sam Bhagwat is a co-founder and CEO of Mastra, an open source JavaScript/Typescript framework for building AI agents.
#ai-agents
Software development
fromDevOps.com
5 days ago

Google's Scion Gives Developers a Smarter Way to Run AI Agents in Parallel - DevOps.com

Scion is an experimental orchestration testbed for managing concurrent AI agents, preventing conflicts and enhancing collaboration.
Data science
fromMedium
1 week ago

15 Datasets for Training and Evaluating AI Agents

Datasets for training and evaluating AI agents are essential for building reliable agentic systems and preventing execution failures.
Software development
fromDevOps.com
5 days ago

Google's Scion Gives Developers a Smarter Way to Run AI Agents in Parallel - DevOps.com

Scion is an experimental orchestration testbed for managing concurrent AI agents, preventing conflicts and enhancing collaboration.
Business intelligence
fromTechzine Global
2 days ago

Celonis and Oracle collaborate on identifying automation and AI opportunities

Celonis Process Intelligence Platform integrates with Oracle Cloud Infrastructure to enhance AI capabilities and optimize business processes.
Data science
fromMedium
1 week ago

15 Datasets for Training and Evaluating AI Agents

Datasets for training and evaluating AI agents are essential for building reliable agentic systems and preventing execution failures.
Python
fromMathspp
6 days ago

uv skills for coding agents

Utilizing uv workflows enhances Python code execution and script management for coding agents, ensuring proper handling of dependencies and sandboxing.
#generative-ai
Software development
fromInfoWorld
6 days ago

How agile practices ensure quality in GenAI-assisted development

Generative AI enhances coding speed but increases technical debt without Agile practices like pair programming and automated tests.
Software development
fromInfoWorld
6 days ago

How agile practices ensure quality in GenAI-assisted development

Generative AI enhances coding speed but increases technical debt without Agile practices like pair programming and automated tests.
Information security
fromTechzine Global
2 days ago

How AI could drive cyber investigation tools from niche to core stack

The rise of AI presents new cybersecurity risks, necessitating a shift from traditional defensive strategies to proactive measures against sophisticated threats.
fromTechzine Global
5 days ago

Cisco strengthens AI observability Splunk by acquiring Galileo

Galileo provides AI teams with tools to evaluate the quality of AI outputs, detect errors before they reach end users, and continuously improve the behavior of AI agents in production.
DevOps
Careers
fromeLearning Industry
6 days ago

13 AI Skills To Equip Your Workforce For An AI-Driven Future

AI is rapidly changing the skills landscape, making traditional competencies obsolete and creating a demand for new skills in the workforce.
#snowflake
Django
fromMedium
2 weeks ago

Snowflake Supports Directory Imports

Easier package imports into Snowflake functions and procedures from stage directories and SnowGit directories streamline development and deployment.
Artificial intelligence
fromTheregister
3 weeks ago

Snowflake's ongoing pitch: bring AI to data, not vice versa

Snowflake is enhancing its platform for AI integration through strategic partnerships and acquisitions, focusing on customer ROI and data management efficiency.
Django
fromMedium
2 weeks ago

Snowflake Supports Directory Imports

Easier package imports into Snowflake functions and procedures from stage directories and SnowGit directories streamline development and deployment.
Artificial intelligence
fromTheregister
3 weeks ago

Snowflake's ongoing pitch: bring AI to data, not vice versa

Snowflake is enhancing its platform for AI integration through strategic partnerships and acquisitions, focusing on customer ROI and data management efficiency.
Software development
fromInfoQ
1 week ago

Google Brings MCP Support to Colab, Enabling Cloud Execution for AI Agents

Google's Colab MCP Server allows AI agents to interact with Colab, enabling offloading of compute-intensive tasks to a cloud environment.
#agentic-ai
Information security
fromTechzine Global
3 weeks ago

Databricks launches Lakewatch: agentic SIEM on the Lakehouse

Lakewatch is an open SIEM platform that consolidates security, IT, and business data, enabling rapid threat detection and response using AI agents.
Information security
fromTechzine Global
3 weeks ago

Databricks launches Lakewatch: agentic SIEM on the Lakehouse

Lakewatch is an open SIEM platform that consolidates security, IT, and business data, enabling rapid threat detection and response using AI agents.
Artificial intelligence
fromMedium
3 days ago

Mastra AI - The Modern Framework for Building Production-Ready AI Agents

Creating reliable, scalable AI systems requires more than simple prompts; it involves building infrastructure and managing complex workflows.
Scala
fromMedium
2 weeks ago

Data Extraction and Classification Using Structural Pattern Matching in Scala

Scala pattern matching enhances code readability and extensibility in real-world data engineering use cases.
#postgresql
DevOps
fromInfoQ
6 days ago

Google Cloud Highlights Ongoing Work on PostgreSQL Core Capabilities

Google Cloud has made significant technical contributions to PostgreSQL, enhancing logical replication, upgrade processes, and system stability.
DevOps
fromInfoQ
6 days ago

Google Cloud Highlights Ongoing Work on PostgreSQL Core Capabilities

Google Cloud has made significant technical contributions to PostgreSQL, enhancing logical replication, upgrade processes, and system stability.
Data science
fromFast Company
1 week ago

Data, not infrastructure, must drive your AI strategy

Data centricity is essential for effective AI strategies, enabling collaboration and problem-solving across business units by making data accessible.
#ai
Data science
fromTheregister
3 weeks ago

Datadog bets DIY AI will mean it dodges the SaaSpocalypse

Datadog is releasing an AI model to enhance its observability tools and mitigate risks from customers building their own solutions.
Data science
fromTheregister
3 weeks ago

Datadog bets DIY AI will mean it dodges the SaaSpocalypse

Datadog is releasing an AI model to enhance its observability tools and mitigate risks from customers building their own solutions.
fromInfoWorld
2 weeks ago

How Apache Kafka flexed to support queues

Apache Kafka has cemented itself as the de facto platform for event streaming, often referred to as the 'universal data substrate' due to its extensive ecosystem that enables connectivity and processing capabilities.
Scala
Business intelligence
fromZDNET
6 days ago

I asked 5 data leaders about how they use AI to automate - and end integration nightmares

Strong processes and AI integration are essential for businesses to effectively utilize data.
DevOps
fromInfoQ
1 week ago

Uber's Hive Federation Decentralizes 16K Datasets and 10+ PB for Zero-Downtime Analytics at Scale

Uber redesigned its Hive data warehouse to decentralize datasets, enhancing scalability, security, and operational autonomy for teams.
Science
fromNature
3 weeks ago

Drowning in data sets? Here's how to cut them down to size

The Square Kilometre Array Observatory will generate massive data, but storage and retention pose significant challenges for researchers.
fromInfoWorld
6 days ago

Bringing databases and Kubernetes together

Automating Kubernetes workloads with Operators can provide the same level of functionality as DBaaS, while still avoiding lock-in to a specific provider.
DevOps
#observability
DevOps
fromTechzine Global
1 week ago

Observability warehouses, the next structural evolution for telemetry

Observability is essential for real-time insights in cloud systems, helping to reduce downtime and improve performance.
DevOps
fromTechzine Global
1 week ago

Observability warehouses, the next structural evolution for telemetry

Observability is essential for real-time insights in cloud systems, helping to reduce downtime and improve performance.
Business intelligence
fromTheregister
2 weeks ago

Microsoft Fabric Database Hub dubbed 'partial' solution

Microsoft's Fabric Database Hub offers a centralized management solution for its database services but lacks support for non-Microsoft databases.
fromInfoWorld
2 weeks ago

Databricks pitches Lakewatch as a cheaper SIEM - but is it really?

"This is more likely to complement existing SIEMs than replace them. Early adoption will come from large enterprises already committed to Databricks, especially those seeking flexibility or cost control."
Information security
Data science
fromMedium
4 weeks ago

Building Consistent Data Foundations at Scale

Building consistent data foundations through intentional architecture, engineering, and governance is essential to prevent fragmentation, support AI adoption, ensure regulatory compliance, and enable reliable organizational decisions at scale.
Data science
fromMedium
1 month ago

Migrating to the Lakehouse Without the Big Bang: An Incremental Approach

Query federation enables safe, incremental lakehouse migration by allowing simultaneous queries across legacy warehouses and new lakehouse systems without risky big bang cutover approaches.
Software development
fromMedium
1 month ago

Unified Databricks Repository for Scala and Python Data Pipelines

Databricks repositories require structured setup with Gradle for multi-language support, dependency management, and version control to scale beyond manual notebook maintenance.
Business intelligence
fromInfoWorld
4 weeks ago

Snowflake's new 'autonomous' AI layer aims to do the work, not just answer questions

Project SnowWork is Snowflake's autonomous AI layer that automates data analysis tasks like forecasting, churn analysis, and report generation without requiring data team intervention.
Artificial intelligence
fromInfoWorld
1 month ago

Databricks launches Genie Code to automate data science and engineering tasks

Databricks launched Genie Code, an AI agent that automates data science and engineering tasks within its lakehouse platform to accelerate ML workflows and enterprise data operations.
DevOps
fromInfoWorld
4 weeks ago

Update your databases now to avoid data debt

Multiple major open source databases reach end-of-life in 2026, requiring teams to plan upgrades and migrations to avoid security risks and higher costs.
Python
fromTreehouse Blog
1 month ago

Python for Data: A SQL + Pandas Mini-Project That Actually Prepares You for Real Work

Effective data analysis requires combining SQL and Python skills in integrated projects that mirror real-world workflows, not learning them in isolation.
Startup companies
fromInfoQ
2 months ago

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

Managed Iceberg pipeline platform unifies ingestion, transformation, orchestration, and table operations inside customers' VPCs, enabling enterprise Iceberg adoption without building custom stacks.
Data science
fromMedium
1 month ago

100 Scala Interview Questions and Answers for Data Engineers

Structured Scala and Apache Spark interview preparation requires understanding distributed systems, performance trade-offs, and pipeline design beyond theoretical knowledge.
fromInfoQ
2 months ago

350PB, Millions of Events, One System: Inside Uber's Cross-Region Data Lake and Disaster Recovery

Uber has built HiveSync, a sharded batch replication system that keeps Hive and HDFS data synchronized across multiple regions, handling millions of Hive events daily. HiveSync ensures cross-region data consistency, enables Uber's disaster recovery strategy, and eliminates inefficiency caused by the secondary region sitting idle, which previously incurred hardware costs equal to the primary, while still maintaining high availability. Built initially on the open-source Airbnb ReAir project, HiveSync has been extended with sharding, DAG-based orchestration, and a separation of control and data planes.
Tech industry
Miscellaneous
fromTechzine Global
2 months ago

Klarrio uses open source expertise to build foundational data platforms

Klarrio builds compliant, scalable open-source data platforms and platform-engineering foundations, integrating and securing underlying infrastructure so customers can focus on analytics and data science.
DevOps
fromEntrepreneur
1 month ago

How AI Is Revolutionizing Disaster Recovery

AI can transform static disaster recovery runbooks into continuously validated, automatically updated procedures that keep pace with evolving infrastructure and prevent costly recovery delays.
Business intelligence
fromTechzine Global
1 month ago

Dataiku introduces platform for scalable enterprise AI

Dataiku launches Platform for AI Success with three new products designed to move AI initiatives from pilots to measurable business outcomes through unified orchestration across cloud providers.
Artificial intelligence
fromInfoWorld
1 month ago

Why AI requires rethinking the storage-compute divide

AI workloads require continuous processing of unstructured multimodal data, causing redundant data movement and transformation that wastes infrastructure costs and data scientist time.
Data science
fromInfoWorld
1 month ago

The revenge of SQL: How a 50-year-old language reinvents itself

SQL has experienced a major comeback driven by SQLite in browsers, improved language tools, and PostgreSQL's jsonb type, making it both traditional and exciting for modern development.
fromInfoWorld
2 months ago

AI is changing the way we think about databases

Developers have spent the past decade trying to forget databases exist. Not literally, of course. We still store petabytes. But for the average developer, the database became an implementation detail; an essential but staid utility layer we worked hard not to think about. We abstracted it behind object-relational mappers (ORM). We wrapped it in APIs. We stuffed semi-structured objects into columns and told ourselves it was flexible.
Software development
fromTechzine Global
2 months ago

Databricks makes serverless Postgress service Lakebase available

Databricks today announced the general availability of Lakebase on AWS, a new database architecture that separates compute and storage. The managed serverless Postgres service is designed to help organizations build faster without worrying about infrastructure management. When databases link compute and storage, every query must use the same CPU and memory resources. This can cause a single heavy query to affect all other operations. By separating compute and storage, resources automatically scale with the actual load.
Software development
fromTechzine Global
2 months ago

Sumo Logic launches data pipeline apps for Snowflake and Databricks

Snowflake offers a fully managed data platform, but Sumo Logic users often lack insight into performance, login activity, and operational health. The Sumo Logic Snowflake Logs App analyzes login and access activity to identify anomalies or suspicious behavior. It also optimizes data pipelines with insights into long-running or failing queries. Teams can centralize log data to facilitate correlation across applications, cloud services, and data platforms.
Information security
Software development
fromInfoWorld
2 months ago

Why your next microservices should be streaming SQL-driven

Streaming SQL with UDFs, materialized results, and ML/AI integrations enables continuous, stateful processing of event streams for microservices.
Business intelligence
fromTechzine Global
2 months ago

ClickHouse, the open-source challenger to Snowflake and Databricks

ClickHouse is a high-performance columnar OLAP database rapidly adopted by AI and enterprise users, now valued at $15B and acquiring Langfuse.
Information security
fromSecuritymagazine
2 months ago

Product Spotlight on Analytics

Taelor Sutherland is Associate Editor at Security magazine covering enterprise security, coordinating digital content, and holding a BA in English Literature from Agnes Scott College.
fromMedium
3 months ago

How I Fixed a Critical Spark Production Performance Issue (and Cut Runtime by 70%)

"The job didn't fail. It just... never finished." That was the worst part. No errors.No stack traces.Just a Spark job running forever in production - blocking downstream pipelines, delaying reports, and waking up-on-call engineers at 2 AM. This is the story of how I diagnosed a real Spark performance issue in production and fixed it drastically, not by adding more machines - but by understanding Spark properly.
Data science
fromInfoQ
2 months ago

Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems

Data warehouses like BigQuery perform well initially but become slow, costly, and disorganized at scale, undermining low-latency operational use and innovation.
fromInfoWorld
2 months ago

AI-augmented data quality engineering

SHAP for feature attribution SHAP quantifies each feature's contribution to a model prediction, enabling: LIME for local interpretability LIME builds simple local models around a prediction to show how small changes influence outcomes. It answers questions like: "Would correcting age change the anomaly score?" "Would adjusting the ZIP code affect classification?" Explainability makes AI-based data remediation acceptable in regulated industries.
Artificial intelligence
Data science
fromCIO
2 months ago

5 perspectives on modern data analytics

Data/business analytics is the top IT investment priority, yet analytics projects often fail due to poor data, vague objectives, and one-size-fits-all solutions.
fromMedium
2 months ago

Why "Data Scientist" is Becoming "AI Engineer" and What That Actually Means

The title "data scientist" is quietly disappearing from job postings, internal org charts, and LinkedIn headlines. In its place, roles like "AI engineer," "applied AI engineer," and "machine learning engineer" are becoming the norm. This Data Scientist vs AI Engineer shift raises an important question for practitioners and leaders alike: what actually changes when a data scientist becomes an AI engineer, and what stays the same? More importantly, what skills matter if you want to make this transition intentionally rather than by accident?
Artificial intelligence
Business intelligence
fromInfoWorld
2 months ago

Google tests BigQuery feature to generate SQL queries from English

Google allows natural language expressions inside SQL comments to speed translation of intent into executable queries, reducing query-writing time and easing analytics workflows.
Data science
fromInfoQ
1 month ago

Databricks Introduces Lakebase, a PostgreSQL Database for AI Workloads

Databricks Lakebase is a serverless PostgreSQL OLTP database that separates compute from storage and unifies transactional and analytical capabilities.
Artificial intelligence
fromTechzine Global
2 months ago

Snowflake launches Cortex Code agent for understanding data context

Cortex Code is an AI agent that converts complex data engineering, ML, and analytics tasks into natural-language workflows integrated into Snowflake and developer tools.
Data science
fromInfoWorld
2 months ago

Snowflake debuts Cortex Code, an AI agent that understands enterprise data context

Cortex Code enables developers to use natural language to build, optimize, and deploy governed, production-ready data pipelines, analytics, ML workloads, and AI agents.
fromTreehouse Blog
1 month ago

Portfolio Projects for Entry-Level Data Roles

Most beginner data portfolios look similar. They include: A few cleaned datasets Some charts or dashboards A notebook with code and commentary Again, nothing here is wrong. But hiring teams don't review portfolios to check whether you can follow instructions. They review them to see whether you can think like a data analyst. When projects feel generic, reviewers are left guessing:
Data science
Data science
fromDevOps.com
2 months ago

Why Data Contracts Need Apache Kafka and Apache Flink - DevOps.com

Data contracts formalize schemas, types, and quality constraints through early producer-consumer collaboration to prevent pipeline failures and reduce operational downtime.
Artificial intelligence
fromInfoQ
2 months ago

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

A Q-learning agent autonomously learns and generalizes optimal Spark configurations by discretizing dataset features and combining with Adaptive Query Execution for superior performance.
Artificial intelligence
fromInfoWorld
2 months ago

Teradata unveils enterprise AgentStack to push AI agents into production

Teradata positions Enterprise AgentStack as a vendor-agnostic execution layer across hybrid environments, contrasting platform-tied AI approaches from Snowflake and Databricks.
fromMedium
2 months ago

From Graphs to Generative AI: Building Context That Pays-Part 1

Every year, poor communication and siloed data bleed companies of productivity and profit. Research shows U.S. businesses lose up to $1.2 trillion annually to ineffective communication, that's about $12,506 per employee per year. This stems from breakdowns that waste an average of 7.47 hours per employee each week on miscommunications. The damage isn't only interpersonal; it's structural. Disconnected and fragmented data systems mean that employees spend around 12 hours per week just searching for information trapped in those silos.
Data science
Artificial intelligence
fromInfoQ
2 months ago

Google BigQuery Adds SQL-Native Managed Inference for Hugging Face Models

BigQuery lets data teams deploy and run Hugging Face or Vertex AI open models with plain SQL, auto-provisioning compute and managing endpoints.
[ Load more ]