Data science
From TNW | Opinion: AI amplifies whatever you feed it, including confusion
Organizations struggle with AI due to confusion over relevant data, leading to overwhelmed teams and a disconnect between ambition and execution.
Neo4j Aura Agent is an end-to-end platform for creating agents, connecting them to knowledge graphs, and deploying to production in minutes. In this post, we'll explore the features of Neo4j Aura Agent that make this all possible, along with links to code examples so you can get hands-on with the platform.
Our customers, ranging from large enterprises to AI research labs, are no longer just asking for AI features. They need a way to collect high-fidelity, synchronized robot and vision data to train AI models on the same robots they intend to deploy. Our AI Trainer is the industry's first direct lab-to-factory solution for AI model training.
Since its release in 2021, this repository has become a bedrock of discovery and a first port of call for research projects that try to understand life at the molecular level. But previous iterations of the database lacked predictions of how proteins form complexes, which can be indispensable for their function.
Google tasked Gemini with sorting through 5 million news articles from around the world and isolating flood reports. It transformed this data into a geo-tagged series of chronological events. Next, researchers trained a model to ingest current weather forecasts and leverage this ground-truth data to determine the likelihood of a flash flood in a given area.
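The two stages described above, filtering a news corpus down to flood reports and turning the survivors into a geo-tagged timeline, can be sketched in miniature. This is an illustrative toy, not Google's actual pipeline: the keyword heuristic stands in for Gemini's classification, and all field names are hypothetical.

```python
from datetime import date

# Toy stand-in for the pipeline: a keyword filter plays the role of the
# LLM classifier, and events are sorted into chronological order.
FLOOD_TERMS = {"flood", "flooding", "inundation", "flash flood"}

def is_flood_report(article: dict) -> bool:
    """Stage 1 stand-in: keep only articles that mention flooding."""
    return any(term in article["headline"].lower() for term in FLOOD_TERMS)

def to_event_timeline(articles: list[dict]) -> list[dict]:
    """Stage 2 stand-in: geo-tagged flood events in chronological order."""
    events = [
        {"date": a["date"], "lat": a["lat"], "lon": a["lon"]}
        for a in articles
        if is_flood_report(a)
    ]
    return sorted(events, key=lambda e: e["date"])

articles = [
    {"headline": "Flash flood sweeps valley", "date": date(2024, 6, 2), "lat": 27.7, "lon": 85.3},
    {"headline": "Election results announced", "date": date(2024, 5, 1), "lat": 48.8, "lon": 2.3},
    {"headline": "River flooding displaces hundreds", "date": date(2024, 5, 20), "lat": 23.8, "lon": 90.4},
]
timeline = to_event_timeline(articles)
```

The forecasting model then consumes such timelines alongside weather data; that training step is outside the scope of this sketch.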
A comprehensive analysis of Google search patterns related to birds explores what species people seek information about most frequently. The investigation spans six interconnected analyses examining bird variety, taxonomic classifications, information sharing behaviors, birder sighting correlations with search trends, regional popularity differences across states, and temporal patterns in search interest.
Hedge funds and other money managers spent $2.8 billion on alternative data in 2025, a 17% jump from the year before, according to a new report from consultancy Neudata. That is more than double what asset managers spent in 2021 on alternative data, a category that spans a wide range of non-traditional information sources. The report projects that total spend on alternative datasets could exceed $23 billion by 2030 in the consultancy's bull case, or come in just under $8 billion in the bear case.
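The report's percentages imply two figures it doesn't state outright, which we can back out with a line of arithmetic each: a 17% jump to $2.8 billion puts the prior year around $2.4 billion, and "more than double" the 2021 figure bounds 2021 spend below $1.4 billion.

```python
# Back out the figures implied by the report's stated percentages.
spend_2025 = 2.8  # USD billions, per Neudata

# A 17% year-over-year jump implies roughly this prior-year spend:
spend_2024 = spend_2025 / 1.17
print(round(spend_2024, 2))  # ~2.39

# "More than double the 2021 spend" bounds 2021 below this figure:
spend_2021_upper = spend_2025 / 2
print(spend_2021_upper)  # 1.4
```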
AI was everywhere, but I wasn't focused on product launches. I was looking at how companies think about data itself: how it's shared, governed and ultimately turned into decisions. And across conversations with executives and sessions on security and compliance, a pattern emerged: the technical limitations that once justified locking data down have largely been solved. What remains difficult is human. Alignment, trust and confidence inside organizations are now the true barriers.
Every search, purchase, loyalty swipe, location ping and scroll feeds systems that now shape pricing, product decisions, hiring and marketing strategies. Most founders understand this in theory, but few grasp the practical consequence: whether they intend to or not, they and their customers are already casting votes with their data. And those votes? They're usually cast passively, on someone else's terms.
I wrote a book for O'Reilly specifically on scaling machine learning with Spark. My second book, a second edition on high-performance Spark, is coming out. I started my career in machine learning 15 years ago, moved into data infrastructure and batch processing, and a year and a half ago I moved into data streaming, which I think is what's going to help us pave the future of data.
Most beginner data portfolios look similar. They include:

- A few cleaned datasets
- Some charts or dashboards
- A notebook with code and commentary

Again, nothing here is wrong. But hiring teams don't review portfolios to check whether you can follow instructions. They review them to see whether you can think like a data analyst. When projects feel generic, reviewers are left guessing:
Every year, poor communication and siloed data bleed companies of productivity and profit. Research shows U.S. businesses lose up to $1.2 trillion annually to ineffective communication, or about $12,506 per employee per year. This stems from breakdowns that waste an average of 7.47 hours per employee each week on miscommunications. The damage isn't only interpersonal; it's structural. Disconnected and fragmented data systems mean that employees spend around 12 hours per week just searching for information trapped in those silos.
For a brief moment in October, Alejandro Quintero thought he had made it big in China. The Bogotá-based data analyst owns and manages a website that publishes articles about paranormal activities, like ghosts and aliens. The content is written in "Spanglish," he says, and was never intended for an Asian audience. But last fall, Quintero's site suddenly began receiving a large volume of visits from China and Singapore.
Next Gen Stats began in 2015, when the National Football League deployed RFID chips in player shoulder pads and even in the football itself, enabling the league to capture location data multiple times per second through sensors installed throughout stadiums.
This is the conundrum of elite chess. The stronger the players, the greater the odds of the match ending in a draw. "What ended up happening," said Mark Glickman, senior lecturer in the Department of Statistics and longtime chess enthusiast, "is that these top players were not having their ratings change very much, just because the games would be drawn all the time."
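The effect Glickman describes falls straight out of the rating math. In a standard Elo update (used here for illustration; the article doesn't specify which system the ratings use), a player's rating moves in proportion to actual score minus expected score. Two equally rated players each have an expected score of 0.5, so a draw, actual score 0.5, produces a zero update, and perpetual draws leave elite ratings frozen.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 16.0) -> float:
    """One-game Elo update for player A; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)

# Two equally rated elite players draw: expected score 0.5, actual 0.5,
# so the update term k * (0.5 - 0.5) vanishes and the rating stays put.
unchanged = elo_update(2780.0, 2780.0, 0.5)
print(unchanged)  # 2780.0

# By contrast, a lower-rated player gains rating points from a draw,
# since their expected score against a stronger opponent is below 0.5.
underdog_after_draw = elo_update(2700.0, 2800.0, 0.5)
```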
A new PhD track is being added to the Walter S. and Lucienne Driskill Graduate Program in Life Sciences (DGP) for the 2026 application cycle, to enhance student learning and build community around computational biology and bioinformatics at Feinberg. The computational biology and bioinformatics (CBB) track in the graduate program will prepare students through coursework and lectures to use modern computational approaches, including machine learning and artificial intelligence, to extract biological insight from large-scale datasets to address complex biological problems.
That local exodus is documented by Cornell-led research that mapped annual moves between U.S. neighborhoods from 2010 to 2019 at a level of detail 4,600 times finer than standard public data. Called MIGRATE, the new, publicly available dataset revealed that most of those displaced remained within the affected county, moves not captured in county-level public migration data aggregated every five years.
"Instead of treating each prompt as a one-off request, the new agent remembers what was asked earlier, including datasets, filters, time ranges, and assumptions, and uses that context when answering follow-up questions. This lets users refine an analysis progressively rather than starting from scratch each time," Satapathy added. Satapathy pointed out that this eases the pressure on developers to prebuild dashboards or predefined business logic for every possible question that a data analyst or business user could ask.
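The carry-over behavior Satapathy describes can be sketched as a session object that remembers parameters from earlier turns and lets each new prompt override only what it specifies. This is a hypothetical illustration of the general pattern; the class, method, and field names here are invented, not the vendor's actual API.

```python
# Minimal sketch of prompt-to-prompt context carry-over. Structure and
# names (AnalysisSession, ask, dataset, time_range) are hypothetical.
class AnalysisSession:
    def __init__(self) -> None:
        # Remembered context: dataset, filters, time range, assumptions.
        self.context: dict = {}

    def ask(self, question: str, **params) -> dict:
        # Parameters named in this turn override remembered ones;
        # everything else carries over from earlier turns.
        self.context.update(params)
        return {"question": question, **self.context}

session = AnalysisSession()
first = session.ask("Total sales?", dataset="sales", time_range="2025-Q1")
followup = session.ask("And by region?")  # reuses dataset and time range
```

The follow-up question inherits `dataset="sales"` and `time_range="2025-Q1"` without restating them, which is what lets a user refine an analysis progressively instead of respecifying everything per prompt.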
With the introduction of Live Query for BigQuery and Alteryx One: Google Edition, users no longer need to move data to run workflows. Companies that standardize on cloud platforms for analytics and AI often see a gap between where data is stored and how it is prepared and used. Alteryx wants to change that by bringing analytics workflows directly to BigQuery. The promise: from data to insight to action, without compromising on security or scalability.