OpenAI declared before the UK Parliament that it was “impossible” to train leading artificial intelligence (AI) models without using copyrighted materials. This is a popular position in the world of AI, in which this company and other major players have leveraged content on the internet to prepare and develop the models that power chatbots and image generators, triggering a series of lawsuits for violating intellectual property. Two announcements today provide evidence that it is, in fact, possible to train large language models (LLMs) without using copyrighted materials without permission.
Training an AI fairly
A group of researchers backed by the French government published what is considered the largest AI training data set composed entirely of public domain texts. The nonprofit Fairly Trained announced that it has awarded its first certification to an LLM built without copyright infringement, proving that technology like the one behind ChatGPT can be built in a different way than the controversial industry custom.
“There is no fundamental reason why someone would not be able to train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the organization in January 2024, after leaving his leadership position at the imaging startup Stability AI, because he did not agree with its policy of scraping content from the internet without authorization. Fairly Trained offers certification to companies willing to demonstrate that they trained their AI models with data they own, have licensed, or that is in the public domain. When the nonprofit began operations, some critics noted that it had not yet identified an LLM that met those requirements. Today, Fairly Trained announced that it has certified its first LLM. It is called KL3M and was developed by the legal technology consulting startup 273 Ventures, based in Chicago (USA), from a training data set of legal, financial, and regulatory documents.
Jillian Bommarito, co-founder of the company, says the decision to train KL3M in this way came from the company’s “risk-averse” clients, such as law firms. “They are concerned about provenance and need to know that the results are not based on corrupt data,” she says. “We do not rely on fair use.” Customers were interested in using generative AI for tasks like summarizing legal documents and drafting contracts, but they didn’t want to be dragged into intellectual property litigation, as has happened to OpenAI, Stability AI, and others.
Bommarito says that 273 Ventures had not worked with an LLM before, but decided to train one as an experiment. “Our test to see if it was possible,” she says. The company created its own training data set, called “Kelvin Legal DataPack,” which includes thousands of legal documents reviewed for compliance with copyright law.
Although the data set is minuscule (about 350 billion tokens, or units of data) compared to those collected by OpenAI and others who have scraped the internet en masse, Bommarito maintains that the KL3M model performed much better than expected, something she attributes to the care with which the sources of information had been previously verified. “Having clean, high-quality data means not having to make the model so large,” she says. Carefully selecting a data set helps a finished AI model specialize in the task for which it was designed. 273 Ventures is now offering reservations on a waiting list to clients who want to purchase access to this data.
Large language models for AI created legally
Companies looking to emulate KL3M will have more help in the future in the form of freely available, infringement-free data sets. On Wednesday, researchers published what they consider the largest AI data set for language models composed exclusively of public domain content. Common Corpus, as it is called, is a collection of written materials roughly the same size as those used to train OpenAI’s GPT-3 text generation model, and it has been made available on the open-source AI platform Hugging Face.
The data set was constructed from sources such as public domain newspapers digitized by the US Library of Congress and the French National Library. Pierre-Carl Langlais, coordinator of the Common Corpus project, calls it a “set large enough to train a state-of-the-art LLM.” In the jargon of big AI, it contains 500 billion tokens; OpenAI’s most capable model is believed to have been trained on several trillion.
Common Corpus is a collaboration coordinated by the French startup Pleias, in partnership with other artificial intelligence groups, such as Allen AI, Nomic AI, and EleutherAI. It is supported by the French Ministry of Culture and boasts the largest open data set in French to date. However, it aims to be multicultural and multi-use, a way to offer researchers and startups from a wide variety of fields access to a verified training suite, free of concerns about potential infringement.
The new data set also has its limitations. Much public domain content is outdated; in the United States, for example, copyright protection typically lasts more than seventy years from the death of the creator, so the data set won’t be able to teach an AI model about current affairs or, say, how to write a blog post using today’s slang. On the other hand, such a model could write an excellent pastiche of Proust.
“As far as I know, this is currently the largest public domain dataset to date for training LLMs,” says Stella Biderman, CEO of EleutherAI, a collective open-source project that publishes AI models. “It is an invaluable resource.”
Projects like this are also extremely rare. No LLM other than the one from 273 Ventures has been submitted to Fairly Trained for certification. But some who want artificial intelligence to be fairer for artists whose works have been swallowed up by systems like GPT-4 hope that Common Corpus and KL3M will show that there is a sector of the AI world skeptical of the arguments used to justify scraping data without permission.
“It’s a sales pitch,” says Mary Rasenberger, CEO of the Authors Guild, which represents book writers. “We are starting to see a lot more licenses and permit applications. It is a growing trend.” The Authors Guild, along with SAG-AFTRA and other professional groups, was recently named an official Fairly Trained contributor.
Although it has no more LLMs on its roster, Fairly Trained recently certified its first company offering AI voice models, the Spanish voice-changing startup VoiceMod, as well as its first “AI band,” a heavy metal project called Frostbite Orckings.