Unbundling AI: A List of Public Training Data Deals (October 2025)
Oct 15, 2025
A concise map of the 2025 AI content licensing market. Who licensed what to whom, how the money works, and what could happen next.


TL;DR: Media owners, social platforms, and content libraries are signing large multi-year licenses with AI providers. Most public deals involve OpenAI and cover both training and display with attribution. Reddit disclosed about $203 million in aggregate data licensing value. Reports place the News Corp deal near a quarter of a billion dollars over five years. Perplexity is pushing revenue-share grounding programs with live attribution. This guide catalogs the main deals, explains deal types, and gives leaders a checklist to evaluate new offers.
What counts as a training data deal
Working definition: A multi-year license that grants an AI provider rights to use a content owner’s archives, feeds, or API for one or both of the following:
Model training to improve models with historical or ongoing data
Grounding and display in product answers with attribution and links
Three common scopes today:
Display with attribution
Training or data access
Hybrid that covers both, often with product collaboration
Why it matters: Licensed, high-quality corpora improve reliability and reduce legal risk. They can also make a company’s data more visible in AI answers, either directly through guaranteed inclusion via licensed feeds or indirectly through stronger credibility signals. Terms such as refresh cadence, correction flows, and attribution UX shape AI search visibility and referral traffic, so companies working on AI Search Optimization should pay close attention to these deals.
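The three scopes above can be captured in a small data model when cataloging deals. The following is a minimal sketch, not a standard schema: the `Scope` and `Deal` names and their fields are hypothetical, and the sample entries only restate scopes described elsewhere in this guide.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    DISPLAY = "display with attribution"
    TRAINING = "training or data access"
    HYBRID = "hybrid (training and display)"


@dataclass(frozen=True)
class Deal:
    """Hypothetical record for one public licensing deal."""
    licensor: str
    provider: str
    covers_training: bool
    covers_display: bool

    @property
    def scope(self) -> Scope:
        # The working definition requires at least one of the two rights.
        if self.covers_training and self.covers_display:
            return Scope.HYBRID
        if self.covers_training:
            return Scope.TRAINING
        if self.covers_display:
            return Scope.DISPLAY
        raise ValueError("a deal must cover training, display, or both")


# Sample entries taken from the deal descriptions in this guide.
deals = [
    Deal("Shutterstock", "OpenAI", covers_training=True, covers_display=False),
    Deal("Condé Nast", "OpenAI", covers_training=True, covers_display=True),
    Deal("Reuters", "Meta AI", covers_training=False, covers_display=True),
]

for deal in deals:
    print(f"{deal.licensor} x {deal.provider}: {deal.scope.value}")
```

Deriving the scope from two booleans, rather than storing it directly, keeps a record from claiming "hybrid" while listing only one right.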
Confirmed publisher deals with OpenAI
OpenAI holds the broadest public portfolio across news and magazines.
Publisher | Announced | Scope | Public notes |
---|---|---|---|
News Corp | May 2024 | Training and display | Multi-year access to current and archived content with attribution. Reports suggest more than $250 million across five years. Figures not confirmed. |
Associated Press | Jul 2023 | Training and collaboration | Licensed portions of the AP text archive. Financials undisclosed. |
Axel Springer | Dec 2023 | Training and display | ChatGPT can summarize and link with attribution. Training use referenced. |
Financial Times | Apr 2024 | Display and product work | FT content appears in ChatGPT with attribution. Training scope less explicit. |
Dotdash Meredith | May 2024 | Training and display | Product and ad collaboration. Industry reporting cites a fixed component of about $16 million per year. |
Pattern: Archives plus live feeds exchanged for compensation, attribution, and product collaboration.
More OpenAI publisher partners
Vox Media and The Atlantic for content surfacing and product work
Condé Nast for display and training plus early SearchGPT tests
Time for a century of archives with links and product collaboration
Le Monde, Prisa Media, and Future plc for multi-year licenses with attribution
Beyond OpenAI: other provider plays
Perplexity x Gannett and a broader Publisher Program with revenue share on grounded answers and live attribution.
Meta AI x Reuters for news summarization and links.
The New York Times x Amazon for licensing to Amazon AI products.
Social and developer platforms
Reddit x OpenAI and Google
Licensed API and structured content for training and live use
Press coverage puts the Google deal near $60 million per year
Reddit disclosed about $203 million in aggregate data licensing value in its filings
Stack Overflow x OpenAI and Google
API and data to surface vetted answers with attribution inside assistants
Financials undisclosed
Images, video, and music
Shutterstock x OpenAI
Six-year license for images, video, music, and metadata for training
Priority access to product integrations
Getty Images
Active litigation track with some generators and separate licensing options
No single public OpenAI training license announcement
Major labels
Negotiations toward AI licensing frameworks and possible micropayments
Many terms remain private or in flight
Deal structures and money
Rights scope: display, training, hybrid, or grounding with revenue share.
Payment patterns: fixed fees, fixed plus variable usage, or pure revenue share.
Operational terms to watch: refresh cadence, correction and retention rules, attribution UX, opt-outs, and API integration details.
Reality check: most contracts are confidential; public dollar figures are directional.
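To compare offers across the three payment patterns, it can help to reduce each one to the same formula. The sketch below assumes a simple additive model (fixed fee, plus usage times a per-unit rate, plus a share of attributable revenue). The function name, parameters, and all sample numbers are hypothetical and are not taken from any real contract.

```python
def annual_payment(
    fixed_fee: float = 0.0,
    usage_units: int = 0,
    rate_per_unit: float = 0.0,
    attributable_revenue: float = 0.0,
    revenue_share: float = 0.0,
) -> float:
    """Estimate one year of licensor income under a simple additive model.

    All inputs are illustrative assumptions; real contracts may add
    minimum guarantees, caps, or tiered rates that this sketch ignores.
    """
    if min(fixed_fee, usage_units, rate_per_unit, attributable_revenue) < 0:
        raise ValueError("amounts must be non-negative")
    if not 0.0 <= revenue_share <= 1.0:
        raise ValueError("revenue_share must be between 0 and 1")
    variable = usage_units * rate_per_unit
    shared = attributable_revenue * revenue_share
    return fixed_fee + variable + shared


# Hypothetical offers, one per payment pattern.
fixed_only = annual_payment(fixed_fee=5_000_000)
fixed_plus_variable = annual_payment(
    fixed_fee=2_000_000, usage_units=1_000_000, rate_per_unit=0.5
)
pure_revenue_share = annual_payment(
    attributable_revenue=10_000_000, revenue_share=0.25
)

print(f"fixed only:          ${fixed_only:,.0f}")
print(f"fixed plus variable: ${fixed_plus_variable:,.0f}")
print(f"pure revenue share:  ${pure_revenue_share:,.0f}")
```

In this made-up example the last two offers pay the same amount, but only the fixed component is guaranteed. That is the kind of trade-off the operational terms above determine in practice.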
Gaps, standards, and signals to monitor
Vendors with fewer public media licenses: Anthropic and xAI
Standardization attempts: efforts like Really Simple Licensing remain at an early stage
Curated data brokers: Bright Data, Scale AI, and others are part of the supply chain but sit outside one-to-one publisher deals
Legal pressure: lawsuits and settlements continue to shape negotiation leverage and appetite for explicit licenses.
FAQ
Training vs display vs grounding
Training improves the model with archives. Display shows attributed excerpts and links. Grounding uses licensed content in real time to inform answers with citation.
Do licenses stop unlicensed use?
They reduce risk for covered content. Enforcement and opt-out mechanisms vary by vendor. Many publishers pair deals with active legal strategies.
How big are these deals?
From single-digit millions per year to nine-figure multi-year packages. Only a few numbers are public. Treat press figures as directional.
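Multi-year totals and per-year figures are easier to compare once they are on the same basis. The sketch below annualizes the reported News Corp total and lists it next to the per-year figures cited in this guide. The `annualize` helper is hypothetical, the inputs are press-reported and unconfirmed, and because the News Corp figure is reported as "more than" $250 million, its annualized value is best read as a floor.

```python
def annualize(total_usd: float, years: float) -> float:
    """Spread a multi-year total evenly across the term (a simplification)."""
    if years <= 0:
        raise ValueError("years must be positive")
    return total_usd / years


# Figures as cited in this guide; all are directional, not confirmed.
reported_per_year = {
    "News Corp x OpenAI (reported total over five years)": annualize(250_000_000, 5),
    "Reddit x Google (press coverage)": 60_000_000,
    "OpenAI May 2024 deal, fixed component (industry reporting)": 16_000_000,
}

# Print from largest to smallest per-year value.
for name, usd in sorted(
    reported_per_year.items(), key=lambda item: item[1], reverse=True
):
    print(f"{name}: about ${usd / 1e6:.0f} million per year")
```

Reddit's roughly $203 million aggregate figure is left out because this guide does not state the term it covers.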
Who is most active?
OpenAI holds the largest set of public publisher deals. Perplexity leads on grounding with revenue share. Meta, Amazon, and Google have selective agreements. Anthropic and xAI have fewer public media licenses.