This French start-up may have solved AI's copyright training issues

Original article seen at: www.euronews.com on April 2, 2024

tldr

  • 📚 Pleias unveils the Common Corpus, the largest public dataset for training large language models.
  • ⚖️ The initiative comes amid legal battles over the use of copyrighted material for AI training.
  • 🌐 The Corpus includes 180 billion words in English and the largest open datasets in several other languages.
  • 🔒 The dataset is based on non-copyrighted material, which limits its coverage of newer material.
  • 🤝 Pleias cofounder believes the Corpus can encourage cooperation and competition in the AI industry.

summary

French start-up Pleias, in collaboration with other open science AI companies, has unveiled the Common Corpus, the largest public dataset for training large language models (LLMs). The release comes amid legal battles over the use of copyrighted material for AI training, including the New York Times' lawsuit against OpenAI and Microsoft. The Common Corpus, supported by Langu:IA, a project run by the French culture ministry's French language unit, includes 180 billion words in English and the largest open datasets in French, German, Spanish, Dutch, and Italian. Because the dataset is built from non-copyrighted material, it does not include newer works. Pleias cofounder Pierre-Carl Langlais believes the Corpus can level the playing field by lowering the value of copyrighted data, and that it can encourage both cooperation and competition in the AI industry.

starlaneai's full analysis

The unveiling of the Common Corpus is a significant development in the AI industry. It addresses the ongoing legal and ethical issues surrounding the use of copyrighted material for AI training by offering a large-scale, multilingual dataset based on non-copyrighted material. However, its reliance on older material could limit its applicability to newer content, which may affect adoption and make it challenging to keep the dataset up to date and relevant. The Common Corpus could influence the AI investment landscape by attracting investors interested in legal and ethical AI training solutions. It is also the product of a successful collaboration among several entities, which highlights the potential for further collaboration in the AI industry.

* All content on this page may be partially written by a clever AI so always double check facts, ratings and conclusions. Any opinions expressed in this analysis do not reflect the opinions of the starlane.ai team unless specifically stated as such.

starlaneai's Ratings & Analysis

Technical Advancement

70 The Common Corpus represents a significant technical advancement in the AI industry, providing a large-scale, multilingual dataset for training LLMs. However, its reliance on non-copyrighted material limits its applicability to newer content.

Adoption Potential

80 Given the ongoing legal issues surrounding the use of copyrighted material for AI training, the Common Corpus has high adoption potential as it offers a legal alternative.

Public Impact

60 The public impact of the Common Corpus is moderate. While it can potentially lead to the development of more advanced AI tools, its impact on the general public is indirect.

Innovation/Novelty

85 The Common Corpus is a novel solution to the copyright issues plaguing the AI industry. Its multilingual and large-scale nature sets it apart from other datasets.

Article Accessibility

50 The article is moderately accessible to a general audience, with some technical jargon that may be difficult for non-experts to understand.

Global Impact

75 The global impact of the Common Corpus is high, given its multilingual nature and potential to influence AI development worldwide.

Ethical Consideration

65 The article discusses the ethical issue of using copyrighted material for AI training, which the Common Corpus addresses by using non-copyrighted material.

Collaboration Potential

90 The Common Corpus is a result of collaboration among several entities and has high potential for further collaboration in the AI industry.

Ripple Effect

80 The Common Corpus could have a significant ripple effect in the AI industry, potentially influencing how AI models are trained and how copyright issues are addressed.

Investment Landscape

70 The unveiling of the Common Corpus could influence the AI investment landscape by attracting investors interested in legal and ethical AI training solutions.

Job Roles Likely To Be Most Interested

AI Researchers
Data Scientists
AI Engineers
AI Legal Experts
