OpenAI lawsuit reignites privacy debate over data scraping
The lawsuit filed this week in California against OpenAI, the artificial intelligence company behind the wildly popular ChatGPT app, is rekindling a decade-old debate over the legal and ethical implications of tech companies scraping as much information as possible about everything, and everyone, on the web.
The suit, filed on behalf of 16 clients, alleges that OpenAI’s data collection practices caused an array of harms, from copyright violations to wiretapping. It adds to a growing list of legal challenges against companies repurposing or reusing images, personal information, code and other data for their own purposes.
Last November, coders sued GitHub along with its parent company Microsoft and partner OpenAI over a tool known as Copilot that uses AI to generate code. The coders argued the companies violated the licensing agreements for the code. In February, Getty Images sued Stability AI for allegedly infringing the copyright of more than 12 million images.
As the lawsuit notes, AI companies deploy data scraping technology at a massive scale. The race between every major tech company and a growing pack of startups to develop new AI technologies, experts say, has also accelerated not just the scale of web scraping but the potential harms that come with it. Experts note that while web scraping can have benefits to society, such as business transparency and academic research, it can also come with harms, such as cybersecurity risks and scammers harvesting sensitive information for fraud.
“The volume with which they’re going out across the web and scraping code and scraping data and using that to train their algorithms raises an array of legal issues,” said Lee Tiedrich, distinguished faculty fellow in ethical technology at Duke University. “Certainly, to the extent that privacy and other personally identifiable information are involved, it raises a whole host of privacy issues.”
Those privacy concerns are the centerpiece of the recent California lawsuit, which accuses OpenAI of scraping the web to steal “private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge.”
“They’re taking personal data that has been shared for one purpose and using it for a completely different purpose without the consent of those who shared the data,” said Timothy Edgar, professor of practice of computer science at Brown University. “It is, by definition, a privacy violation, or at least an ethical violation, and it might be a legal violation.”
The ways that AI companies may use that data to train their models could lead to unforeseen consequences for those whose privacy has been violated, such as having that information surface in a generated response, said Edgar. And it will be very hard for those whose privacy has been violated to claw back that data.
“It’s going to become a whack-a-mole situation where people are trying to go after each company collecting our information to try to do something about it,” said Megan Iorio, senior counsel at the Electronic Privacy Information Center. “It will be very similar to a situation we have with data brokers where it’s just impossible to control your information.”
Data scraping cases have a long history in the U.S. and go all the way up to the Supreme Court. In November 2022, the court heard a six-year-long case from LinkedIn accusing data company HiQ Labs of violating the Computer Fraud and Abuse Act by scraping profiles from the networking website to build its product. The high court denied the claim that the scraping amounted to hacking and sent the case back to a lower court where it was eventually resolved. Clearview AI, a facial recognition company, has been sued for violating privacy laws in Europe and the state of Illinois for its practice of trawling the web to build its database of more than 20 billion images. It settled the ACLU’s lawsuit in Illinois in May 2022 by promising to stop selling the database to private companies.
Now, LinkedIn’s parent company Microsoft is on the other side of the courtroom, named as a defendant in three different related lawsuits against OpenAI. “The whole issue of data scraping and code scraping was like a crescendo that kept getting louder. It kept growing and growing,” said Tiedrich, who called the AI lawsuit “inevitable.”
The California suit against OpenAI combines the arguments of many of these lawsuits in a whopping 157-page document. Tiedrich says that while there have been recent court cases weighing in on fair use of materials, something relevant to the copyright aspects of the OpenAI lawsuit, the legality of data scraping is full of grey areas for courts and lawmakers to resolve.
“A lot of the big AI companies are doing data scraping, but data scraping has been around. There are cases, going back 20 years ago, to scraping airline information,” said Tiedrich. “So I think it’s fair to say that the decision could have broader implications, if it gets to a judicial decision, than just AI.”
The OpenAI lawsuit’s privacy arguments might be even more difficult to uphold. Iorio, who with EPIC filed a friend-of-the-court brief in the LinkedIn case, said the plaintiffs suing OpenAI are in a better position to show privacy harms since they are individuals, not a company. However, the limitations of federal privacy laws make it hard to bring a data scraping case on those grounds, she said. Of the three privacy statutes cited by the lawsuit, only the Illinois privacy law covers publicly available information of all users. (The lawsuit also cites the Children’s Online Privacy Protection Rule, which protects users under 13.)
That leaves scrapers, whether they are tech giants or cyber criminals, with a lot of leeway. “Without a comprehensive privacy law that does not have a blanket exemption for publicly available data, we have the danger here of this country becoming a safe haven for malicious web scrapers,” said Edgar.