#vision-language-models

fromTNW | Health-Tech
4 days ago

Cedars-Sinai's AI beats specialist models at reading heart scans

EchoPrime, a video-based vision-language model, analyses echocardiogram footage and generates a written report of cardiac form and function. Its findings were published in Nature (volume 650, pages 970-977) in February 2026, under the title 'Comprehensive echocardiogram evaluation with view primed vision language AI.'
Medicine
Software development
fromTechzine Global
1 week ago

Microsoft introduces open-source multimodal Phi-4 reasoning model

Microsoft's Phi-4-reasoning-vision-15B combines vision and reasoning capabilities using a mid-fusion architecture, outperforming larger models on mathematical and scientific benchmarks while maintaining efficiency through selective multimodal layer processing.
fromNature
1 week ago

Merlin: a computed tomography vision-language foundation model and dataset - Nature

The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports.
Medicine
Privacy technologies
fromPrivacy International
1 week ago

Nowhere to Hide? Privacy Risks and Policy Implications of AI Geolocation

Vision-Language Models can accurately determine photo locations without GPS data, creating serious privacy and human rights risks including surveillance, doxxing, and discriminatory policing.
fromTechCrunch
1 month ago

Ex-Googlers are building infrastructure to help companies understand their video data | TechCrunch

Businesses are generating more video than ever. From years of broadcast archives to thousands of store cameras and countless hours of production footage, most of it just sits unused on servers, unwatched and unanalyzed. This is dark data: a massive, untapped resource that companies collect automatically but almost never use in a meaningful way. To tackle the problem, Aza Kai (CEO) and Hiraku Yanagita (COO), two former Googlers who spent nearly a decade working together at Google Japan, decided to build their own solution.
Artificial intelligence
fromNature
1 month ago

Multimodal learning with next-token prediction for large multimodal models - Nature

Since AlexNet, deep learning has replaced heuristic hand-crafted features by unifying feature learning with deep neural networks. Later, Transformers and GPT-3 further advanced sequence learning at scale, unifying structured tasks such as natural language processing. However, multimodal learning, spanning modalities such as images, video and text, has remained fragmented, relying on separate diffusion-based generation or compositional vision-language pipelines with many hand-crafted designs.
Artificial intelligence
fromZDNET
2 months ago

Nvidia's physical AI models clear the way for next-gen robots - here's what's new

Nvidia released open Cosmos and GR00T physical-AI models to accelerate robot development, enabling realistic world understanding, simulation, reasoning, and reduced pretraining effort.
Artificial intelligence
fromTechCrunch
3 months ago

Nvidia announces new open AI models and tools for autonomous driving research | TechCrunch

Nvidia released Alpamayo-R1, an open vision-language reasoning model plus Cosmos Cookbook resources to accelerate level-4 autonomous driving and physical AI development.
Wearables
fromZDNET
5 months ago

These Halo smart glasses just got a major memory boost, thanks to Liquid AI

Brilliant Labs will integrate Liquid AI's vision–language foundation models into Halo AI smart glasses to improve real-time scene understanding and agentic memory.
Artificial intelligence
fromComputerworld
6 months ago

Microsoft researchers develop new tech for video AI agents

Microsoft is developing MindJourney, a video-AI framework that explores 3D spaces using world models, VLMs, video generation, and reasoning to predict surroundings and movement.
Philosophy
fromTheregister
6 months ago

Vision AI models see optical illusions when none exist

Vision-language models such as GPT-5 misinterpret simple images as optical illusions, reflecting a cognitive bias similar to that seen in humans.
Artificial intelligence
fromHackernoon
2 years ago

Researchers Push Vision-Language Models to Grapple with Metaphors, Idioms, and Sarcasm | HackerNoon

The V-FLUTE dataset enhances understanding of figurative language in AI, assessing the performance of vision-language models.
Artificial intelligence
fromHackernoon
2 years ago

Can AI Understand a Joke? New Dataset Tests Bots on Metaphors, Sarcasm, and Humor | HackerNoon

Large AI models struggle with figurative language, which presents challenges due to its implicit meanings.
#idefics2
Bootstrapping
fromHackernoon
56 years ago

The Artistry Behind Efficient AI Conversations | HackerNoon

The cross-attention architecture outperforms fully autoregressive models on vision-language tasks, albeit at a higher computational cost.
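The blurb contrasts two ways of fusing image features into a language model: cross-attention, where text hidden states attend to image features, versus the fully autoregressive approach of concatenating image tokens into the text sequence. A minimal NumPy sketch of the cross-attention side, with all shapes, seeds, and weight matrices purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_tokens, d_k):
    # Text hidden states form the queries; image features supply
    # keys and values. This is the cross-attention fusion style,
    # as opposed to feeding image tokens through the same
    # autoregressive stack as the text.
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((text_tokens.shape[-1], d_k))
    W_k = rng.standard_normal((image_tokens.shape[-1], d_k))
    W_v = rng.standard_normal((image_tokens.shape[-1], d_k))
    Q = text_tokens @ W_q          # (n_text, d_k)
    K = image_tokens @ W_k         # (n_image, d_k)
    V = image_tokens @ W_v         # (n_image, d_k)
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (n_text, n_image)
    return scores @ V              # one fused vector per text token

text = np.random.default_rng(1).standard_normal((4, 8))    # 4 text tokens
image = np.random.default_rng(2).standard_normal((16, 8))  # 16 image patches
out = cross_attention(text, image, d_k=8)
print(out.shape)  # (4, 8)
```

The extra cost the blurb mentions comes from these added attention layers and their weights, which a fully autoregressive design avoids by reusing the existing language-model layers for both modalities.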
#machine-learning
Artificial intelligence
fromPyImageSearch
9 months ago

Content Moderation via Zero Shot Learning with Qwen 2.5 - PyImageSearch

Digital platforms face growing content-moderation challenges as user-generated content scales. Qwen 2.5's multimodal understanding enables zero-shot moderation, classifying images against a policy without task-specific training.