Who’s liable for using open-source data?
A group of authors in the US, Richard Kadrey, Sarah Silverman and Christopher Golden, claim that Meta torrented and processed their books to train its artificial intelligence models. The authors originally filed a copyright infringement case against Meta in 2023. In that lawsuit, they said that the company used their books as part of its training dataset “without consent, without credit and without compensation”.
According to a recent discovery order in the case, Meta CEO Mark Zuckerberg directed the company’s AI team to use the LibGen dataset. Fact discovery in the case also reveals that Meta engineers filtered out lines of copyright management information (CMI) when training its AI model Llama. This is despite Zuckerberg having said, in response to piracy discussions during the case, that such activity “raises lots of red flags.”
What is LibGen?
Library Genesis, or LibGen, is a file-sharing project that provides access to copyrighted works like academic books and journal articles. Besides the current lawsuit, publishers have in the past also filed court cases against LibGen for facilitating online piracy.
For instance, Elsevier, the American Chemical Society, and Wiley filed a copyright infringement case in India in 2020 against LibGen and SciHub, another site offering free access to academic papers. The publishers sought a permanent block on both sites in India. Internet Service Providers (ISPs) in several countries, including the United Kingdom, France, and Germany, block access to LibGen links.
Details of the authors’ lawsuit:
In their 2023 lawsuit, the authors mentioned that in a table describing the contents of its AI models’ training dataset, Meta had listed that 85 gigabytes (GB) came from books. These books come from two sources:
- Project Gutenberg: an online archive of approximately 70,000 books that are out of copyright
- The Books3 section of The Pile, a publicly available training dataset for large language models.
The authors say that The Pile’s Books3 dataset includes a mix of fiction and non-fiction books from the “shadow library” Bibliotik, which is similar to other free book sites like Zlib, LibGen, and SciHub. “These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host,” the lawsuit says.
Why it matters:
While it seems that Zuckerberg knew about the presence of LibGen sources in the company’s training data, this lawsuit raises some key questions about liability in the AI ecosystem. Meta acquired this data through an open-source dataset, which was compiled from the Bibliotik private tracker, one of many shadow libraries like LibGen. If an AI company obtains data through open-source datasets, which in turn obtained it from shadow libraries, who bears liability at each step of this chain? While publishers have filed cases against piracy websites in the past, should open-source dataset creators also bear liability for infringement?
Copyright concerns with AI model training:
Meta isn’t the only company to face a lawsuit for allegedly using copyrighted material to train its models. In the recent past, Microsoft and OpenAI have also been hit with a series of lawsuits from authors and news publications alleging that the two used their content to train AI models. In what appears to be a response to these cases, OpenAI signed an agreement with News Corp in May last year that gave it access to the content of major news publications like The Wall Street Journal, New York Post, and The Daily Telegraph. Microsoft similarly signed an agreement with HarperCollins in November last year to train AI on its books. Meta, too, signed a content licensing deal with Reuters in October last year.