#vision-language-models

from TechCrunch
1 week ago

Ex-Googlers are building infrastructure to help companies understand their video data | TechCrunch

Businesses are generating more video than ever. From years of broadcast archives to thousands of store cameras and countless hours of production footage, most of it just sits unused on servers, unwatched and unanalyzed. This is dark data: a massive, untapped resource that companies collect automatically but almost never use in a meaningful way. To tackle the problem, Aza Kai (CEO) and Hiraku Yanagita (COO), two former Googlers who spent nearly a decade working together at Google Japan, decided to build their own solution.
Artificial intelligence
from Nature
3 weeks ago

Multimodal learning with next-token prediction for large multimodal models - Nature

Since AlexNet, deep learning has replaced heuristic hand-crafted features by unifying feature learning with deep neural networks. Later, Transformers and GPT-3 further advanced sequence learning at scale, unifying structured tasks such as natural language processing. However, multimodal learning, spanning modalities such as images, video and text, has remained fragmented, relying on separate diffusion-based generation or compositional vision-language pipelines with many hand-crafted designs.
Artificial intelligence
from ZDNET
1 month ago

Nvidia's physical AI models clear the way for next-gen robots - here's what's new

Nvidia released open Cosmos and GR00T physical-AI models to accelerate robot development, enabling realistic world understanding, simulation, reasoning, and reduced pretraining effort.
Artificial intelligence
from TechCrunch
2 months ago

Nvidia announces new open AI models and tools for autonomous driving research | TechCrunch

Nvidia released Alpamayo-R1, an open vision-language reasoning model plus Cosmos Cookbook resources to accelerate level-4 autonomous driving and physical AI development.
Wearables
from ZDNET
5 months ago

These Halo smart glasses just got a major memory boost, thanks to Liquid AI

Brilliant Labs will integrate Liquid AI's vision-language foundation models into Halo AI smart glasses to improve real-time scene understanding and agentic memory.
Artificial intelligence
from Computerworld
5 months ago

Microsoft researchers develop new tech for video AI agents

Microsoft is developing MindJourney, a video-AI framework that explores 3D spaces using world models, VLMs, video generation, and reasoning to predict surroundings and movement.
Philosophy
from The Register
6 months ago

Vision AI models see optical illusions when none exist

Vision-language models such as GPT-5 misinterpret simple images as complex optical illusions, exhibiting a cognitive bias similar to that seen in humans.
Artificial intelligence
from HackerNoon
2 years ago

Researchers Push Vision-Language Models to Grapple with Metaphors, Idioms, and Sarcasm | HackerNoon

The V-FLUTE dataset assesses how well vision-language models understand figurative language.
Artificial intelligence
from HackerNoon
2 years ago

Can AI Understand a Joke? New Dataset Tests Bots on Metaphors, Sarcasm, and Humor | HackerNoon

Large AI models struggle with figurative language, whose implicit meanings make it difficult to interpret.
#idefics2
Bootstrapping
from HackerNoon

The Artistry Behind Efficient AI Conversations | HackerNoon

The cross-attention architecture outperforms fully autoregressive models on vision-language tasks, albeit at a higher computational cost.
#machine-learning
Artificial intelligence
from PyImageSearch
8 months ago

Content Moderation via Zero Shot Learning with Qwen 2.5 - PyImageSearch

Digital platforms face mounting content-moderation challenges as user-generated content grows.
Qwen 2.5 models can enhance content moderation through advanced multimodal understanding, enabling zero-shot classification without task-specific training.