'Google for DNA' brings order to biology's big data
Briefly

"It's a huge achievement," says Rayan Chikhi, a biocomputing researcher at the Pasteur Institute in Paris. "They set a new standard" for analysing raw biological data - including DNA, RNA and protein sequences - from databases that can contain millions of billions of DNA letters, amounting to 'petabases' of information, more entries than all the webpages in Google's vast index.
Although MetaGraph is tagged as 'Google for DNA', Chikhi likens the tool to a search engine for YouTube, because the tasks are more computationally demanding. In the same way that YouTube searches can retrieve every video that features, say, red balloons even when those key words don't appear in the title, tags or description, MetaGraph can uncover genetic patterns hidden deep within expansive sequencing data sets without needing those patterns to be explicitly annotated in advance.
The motivation behind MetaGraph was to address an accessibility problem in sequencing data sets. The size of these repositories has risen at a blistering pace in the past few decades, but this growth has presented challenges for the scientists using the data they contain. Raw sequencing reads are fragmented, noisy and too numerous to search directly. "The volume of the data, paradoxically, is the main inhibitor of us actually using the data," says Babaian.
MetaGraph is a high-performance search engine that quickly sifts the massive volumes of biological sequencing data stored in public repositories. It analyses raw DNA, RNA and protein sequence reads from databases containing petabases of information, and it can uncover genetic patterns hidden deep within those data sets even when the patterns have not been annotated in advance. Chikhi compares the workload to searching video content rather than web pages: the task is more computationally demanding than keyword search, and it retrieves matches that keyword search would miss. The project addresses an accessibility problem: raw reads are fragmented, noisy and too numerous to search directly. Indexing the repositories lets researchers query public sequence archives and extract biological insights from them.
Read at Nature