Large genome model: Open source AI trained on trillions of bases

"After training on trillions of base pairs of DNA, Evo 2 developed internal representations of key features in even complex genomes like ours, including things like regulatory DNA and splice sites, which can be challenging for humans to spot."

"Bacterial genomes are organized along relatively straightforward principles. Any genes that encode proteins or RNAs are contiguous, with no interruptions in the coding sequence. Genes that perform related functions, like metabolizing a sugar or producing an amino acid, tend to be clustered together, allowing them to be controlled by a single, compact regulatory system."

"Eukaryotes are not like that. The coding sections of genes are interrupted by introns, which don't encode for anything. They're regulated by a sequence that can be scattered across hundreds of thousands of base pairs."

Evo 2 is an open-source AI system trained on trillions of base pairs of DNA from bacteria, archaea, and eukaryotes. Its predecessor, Evo, worked only with bacterial genomes and struggled with more complex genome structures; Evo 2 identifies key features in complex genomes as well. It has developed internal representations of regulatory DNA, splice sites, and other elements that are hard for humans to spot. Eukaryotic genomes are organized very differently from bacterial ones: their coding sequences are interrupted by introns, and their regulatory elements can be scattered across hundreds of thousands of base pairs.
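The structural difference described above can be made concrete with a toy sketch. This is only an illustration of contiguous versus interrupted coding sequences, not a description of how Evo 2 works; the sequences, the coordinates, and the `splice` helper are all invented for the example.

```python
# Toy illustration only: sequences and coordinates are made up, not real genes.

def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) into one coding sequence."""
    return "".join(gene[start:end] for start, end in exons)

# Bacterial-style gene: the coding sequence is a single uninterrupted stretch.
bacterial_gene = "ATGGCTAAATGA"
bacterial_cds = splice(bacterial_gene, [(0, len(bacterial_gene))])

# Eukaryotic-style gene: two exons (uppercase) separated by an intron (lowercase).
# The invented intron starts with "gt" and ends with "ag", the common splice-site pattern.
eukaryotic_gene = "ATGGCT" + "gtaagtttcag" + "AAATGA"
eukaryotic_cds = splice(eukaryotic_gene, [(0, 6), (17, 23)])

# Both genes yield the same coding sequence once the intron is removed.
print(bacterial_cds == eukaryotic_cds)
```

In the bacterial-style case the coding sequence can be read straight off the DNA. In the eukaryotic-style case the exon boundaries have to be known first, and locating them in real genomes is the kind of task the article says is hard for humans.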

#ai-genome-analysis #eukaryotic-genome-structure #gene-identification #regulatory-sequences #machine-learning-biology

Read at Ars Technica
