Generally, sources that filter higher on Google tend to be of higher quality and trustworthiness.
“Major LLM developers no longer disclose their training data as they once did.
They are now more commercial and less transparent,” Wukoson and Fortuna wrote.
OpenAI, Google, Meta and Anthropic didn’t immediately respond to requests for comment.
Investors currently value startups OpenAI and Anthropic at$157 billionand$40 billion, respectively.
News publishers, meanwhile, are struggling and have been forced intowaves of layoffsover the past few years.
Meanwhile, some AI companies have inked licensing deals with publishers to feed their LLMs with up-to-date news articles.
Ziff Davis hasn’t signed a similar deal.
Neither OpenAI, Google, Anthropic nor Meta have disclosed training data used to train their most recent models.
Correction, Nov. 11:An earlier version of this article misstated the dataset used to train GPT-3.
It was trained on WebText 2.