Where Do LLMs Learn From: Training Data Analysis

Last updated January 15, 2025


GPT-3 Training Data Sources

Breakdown of GPT-3's training data sources and their relative proportions

GPT-3: The Foundation of Modern LLMs

Where do these large language models get their knowledge? Let's analyze the groundbreaking datasets used by OpenAI's GPT-3, which marked the beginning of the LLM revolution.

Core Training Sources:

  1. Common Crawl (60%)

    • Largest single source: 60% of training data
    • Monthly-updated snapshot of the Internet
    • Contains 400+ billion tokens (≈ 6 million books)

  2. WebText2 (22%)

    • Based on Reddit-curated content
    • 15 years of upvoted links
    • High-quality, human-filtered content

  3. Additional Sources

    • Books and Wikipedia articles
    • Structured, verified information
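Taken at face value, the percentages above translate into concrete token budgets. Here is a quick back-of-the-envelope sketch in Python; note that the 300-billion-token total is an assumed figure (widely reported for GPT-3's training run, not stated in this article), and the remaining 18% is lumped together as "Books + Wikipedia":

```python
# Rough per-source token counts implied by the stated training mix.
# ASSUMPTION: a 300-billion-token total training budget.
TOTAL_TRAINING_TOKENS = 300_000_000_000  # assumed, not from this article

mix = {
    "Common Crawl": 0.60,       # from the article
    "WebText2": 0.22,           # from the article
    "Books + Wikipedia": 0.18,  # remainder of the mix
}

# Convert each share of the mix into an absolute token count.
tokens_by_source = {
    name: round(share * TOTAL_TRAINING_TOKENS) for name, share in mix.items()
}

for name, tokens in tokens_by_source.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```

Even under these assumed figures, the takeaway holds: the bulk of what a model like GPT-3 "knows" comes from the open web, with curated sources making up a much smaller slice.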

Evolution of Training Data

OpenAI Content Licensing Partners

Major Publishers

  • News Corp
  • Axel Springer
  • TIME
  • The Atlantic
  • The Wall Street Journal
  • Financial Times

Online Platforms

  • Reddit - Community discussions
  • Stack Overflow - Technical knowledge
  • Shutterstock - Visual content

Impact on Search and SEO

Key Implications

  1. Quality standards are rising as models learn from verified sources
  2. Technical accuracy is increasingly important due to specialized dataset inclusion
  3. Community engagement may influence content value in training sets
  4. Visual content description is becoming more relevant

Future Implications

  • Growing importance of authoritative content
  • Increased value of technical accuracy
  • Rising significance of community engagement
  • Enhanced integration of multimedia content




Part of "The Future of SEO in the Age of AI-Driven Search" series.




About The Author

Eugene Mi

SEO marketer with 15+ years of experience, having worked with both Fortune 500 companies and startups.
