In the previous lesson, you learned how to turn text into embeddings - dense, high-dimensional vectors that capture semantic meaning. By computing cosine similarity between these vectors, you could find which sentences or paragraphs were most alike. That worked beautifully for a small handcrafted corpus of 30-40 paragraphs. But what if your dataset grows to millions of documents or billions of image embeddings? Suddenly, your brute-force search breaks down - and that's where Approximate Nearest Neighbor (ANN) methods come to the rescue.
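To see why brute force breaks down, here is a minimal sketch of exact nearest-neighbor search with cosine similarity, using NumPy and randomly generated vectors as stand-ins for real embeddings (the corpus size, dimension, and function names are illustrative, not from the lesson):

```python
import numpy as np

def brute_force_search(query, corpus, k=5):
    """Exact top-k cosine-similarity search: one dot product per document."""
    # Normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                    # O(n * d) work: every document is scored
    top_k = np.argsort(-sims)[:k]   # exact answer, sorted by similarity
    return top_k, sims[top_k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128))  # 10k hypothetical document embeddings
query = rng.normal(size=128)

indices, scores = brute_force_search(query, corpus, k=3)
```

The cost is linear in the corpus size: 10,000 vectors is instant, but at millions or billions of vectors every query must still touch every document, which is exactly the bottleneck ANN methods trade a little accuracy to avoid.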
Tools such as MetaGraph set a new standard for analysing raw biological data, including DNA, RNA and protein sequences, from databases that can contain millions of billions of DNA letters, amounting to petabases of information: more entries than all the webpages in Google's vast index. Although MetaGraph has been tagged as a 'Google for DNA', Chikhi likens the tool to a search engine for YouTube, because the tasks are more computationally demanding.