Introducing MS MARCO Web Search: A New Era for LLM and IR Data | HackerNoon
Briefly

The MS MARCO Web Search dataset is a new resource for web information retrieval research. It contains millions of clicked query-document labels and uses the ClueWeb22 document set, so it reflects authentic web document distributions and real query behavior. With its large-scale, high-quality components, the dataset is intended to advance research in large language models (LLMs) and information retrieval algorithms, and to help overcome the biases and limitations of existing datasets.
The MS MARCO Web Search dataset consists of a high-quality set of web pages that mirrors the highly skewed web document distribution and a query set that reflects the real web query distribution.
We use ClueWeb22 as our document set since it is the largest and newest open web document dataset, meeting the requirements of large scale and realistic document distributions.