
"Want to hear just the guitar riff from a song? How about cutting out the train noise from a voice recording? Meta says its new SAM Audio model can separate and edit sounds using simple prompts, cutting down on the manual work typical of audio-editing tools. The release of the Segment Anything Model (SAM) Audio follows the previous release of Meta-made segmentation models for visual assets."
"By 'multimodal,' Meta is referring to SAM Audio's ability to interpret three types of prompts for audio segmentation: text prompts, time-segment markings, and visual selections in video used to isolate or remove specific sounds. Take a video of a band playing, for example, and select the guitarist to have SAM Audio automatically isolate that player. Highlight the waveform of a barking dog in an outdoor recording, tell SAM to remove that sound, and it can trace and eliminate those interruptions throughout the entire file."
"'SAM Audio performs reliably across diverse, real-world scenarios - using text, visual, and temporal cues,' Meta said in its SAM Audio announcement. 'This approach gives people precise and intuitive control over how audio is separated.' The company said it sees a number of use cases for SAM Audio, like cleaning up an audio file, removing background noise, and other tasks that previously required hands-on work in audio-editing software or dedicated sound-mixing tools."
SAM Audio separates and edits individual sounds within audio and video files using three kinds of prompts: text prompts, time-segment markings, and visual selections. The model can isolate instruments or voices and remove unwanted noises across entire recordings. It is available on the Segment Anything Playground and for download. Meta describes it as the first unified multimodal model for audio separation, combining capabilities that previously existed only in fragmented, single-purpose tools. Typical use cases include cleaning up audio files, removing background noise, and simplifying tasks that formerly required hands-on work in audio-editing or sound-mixing software.
Read at The Register