The copyright lawsuits against OpenAI are piling up as the tech company seeks data to train its AI

OpenAI uses any and all publicly available data to train ChatGPT, including books and articles from the internet. Now, those who own them want to be paid for their work.

Training data is an essential part of creating the AI models that are taking over the tech world. Leading tech companies like Google, Meta, OpenAI, Anthropic, and Microsoft are all scrambling to find new sources of data. Meta at one point even considered buying Simon & Schuster, one of the world’s biggest publishing houses.

Part of the problem is that publishers are increasingly accusing these companies of hoovering up copyrighted data. They’d like to be paid for their work. Meta and OpenAI have argued in comments to the US Copyright Office that putting copyrighted material on the internet makes it “publicly available” and thus covered by fair use.

But OpenAI will still have to make that argument in court, as the company faces lawsuits from several groups over its use of copyrighted material.

The Center for Investigative Reporting, a news nonprofit sometimes known by its acronym, CIR, which merged with Mother Jones and Reveal earlier this year, sued OpenAI and Microsoft last week in federal court. The lawsuit accuses OpenAI of being “built on the exploitation of copyrighted works belonging to creators around the world, including CIR.”

Lawyers for CIR accused OpenAI and Microsoft of using copyrighted material from Mother Jones to train their GPT and Copilot AI models.

“OpenAI and Microsoft started vacuuming up our stories to make their product more powerful, but they never asked for permission or offered compensation, unlike other organizations that license our material,” Monika Bauerlein, CEO of the Center for Investigative Reporting, said in an announcement about the lawsuit. “This free rider behavior is not only unfair, it is a violation of copyright.”

The lawsuit says that “16,793 distinct URLs from Mother Jones’s web domain” appeared in a published list of the top web domains present in the company’s WebText training set.

In another class action lawsuit from the Authors Guild, two authors claimed that the company used information from their books to train ChatGPT. The New York Times also filed a similar lawsuit against the company in December 2023.

In May, court documents in the Authors Guild lawsuit revealed that OpenAI deleted two huge datasets used to train GPT-3. Lawyers for the guild said the two sets likely contained “more than 100,000 published books.”

The two employees responsible for putting together the data no longer work for OpenAI, court documents say.

OpenAI has begun signing licensing agreements with news organizations to use their work. The company has signed such agreements with The Associated Press, the publishers of The Wall Street Journal and New York Post, The Atlantic, Prisa Media, Le Monde newspaper, Financial Times, and Business Insider parent Axel Springer.

But the volume of content these bots need to keep learning will take far more than a handful of licensing agreements.

One solution is synthetic data, which is produced artificially rather than collected from the real world and can easily be generated by machine learning algorithms.

OpenAI has considered synthetic data as an option to train its models, but CEO Sam Altman has raised concerns about producing quality data.

“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” Altman said at a tech conference in May 2023. The company has also explored a process in which AI models work together — one AI system produces data, while another judges it.

OpenAI did not immediately respond to a request for comment from Business Insider.
