How to Opt Out of AI Training on Major Platforms

Last updated December 13, 2024

"You have the right to remain silent. Anything you say can and will be used .. for training AI!"

These days, nearly every popular online service collects your data to train its own AI by default. If you are uncomfortable with your data being used for AI training, there is usually a way to opt out! Below is a list of popular services and instructions for opting out of each:

LinkedIn

You can disable the use of your data for AI training on this page. Note: if you don't see this option and you are located in the European Union, your data is not being used for training because of your location and the European Artificial Intelligence Act (AI Act) adopted in 2024.

Screenshot of LinkedIn AI training opt-out page

Meta (Facebook and Instagram)

There is no dedicated off switch, but you can submit an objection request through this form.

Screenshot of Meta AI training opt-out form

X (aka Twitter)

You can disable the use of your data for training the Grok AI model on this page.

Screenshot of X (Twitter) AI training opt-out page

OpenAI ChatGPT

IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.

  1. Log in to your account at https://chatgpt.com and click your profile icon (top-right corner).
  2. Choose "Settings".
  3. Select "Data Controls", then set "Improve the model for everyone" to "Off".

Screenshot of ChatGPT AI training opt-out settings

Anthropic Claude

IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.

Anthropic does not provide a separate switch to turn off the use of your data for AI training. Note, however, that Anthropic's commercial terms currently (as of September 2024) state that "Anthropic may not train models on Customer Content from paid Services," so this applies if you use their paid products (see their commercial terms).

Anthropic also reserves the right to store your inputs and outputs for 2 to 7 years if they are automatically flagged by its trust and safety classifiers as violating Anthropic's Usage Policy. More information here.

If you delete a conversation, Anthropic automatically deletes the associated prompts and outputs from its backend within 30 days; you can also submit a separate data deletion request. More information here.

Screenshot of Anthropic Claude AI training information

Perplexity AI

IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.

Go to Perplexity AI settings and turn off the "AI Data Retention" switch.

Screenshot of Perplexity AI data retention settings

Reddit

IMPORTANT: If you use the service without logging in, then your inputs and outputs are collected and used for AI training by default.

Reddit does not have a separate switch to exclude your data from AI training. According to Reddit's Terms of Use, by using Reddit you grant them a "...worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit."

Reddit licenses user data to third parties that may use it to train AI (source).

However, you can delete your individual posts and comments through this page.

Screenshot of Reddit data access information
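
If you have a long posting history, deleting items one by one through the web interface is tedious. Below is a minimal sketch of how this can be automated with the third-party PRAW library; the credentials are placeholders you create yourself (via a "script" app at https://www.reddit.com/prefs/apps), and Reddit's listings only return roughly the most recent 1,000 items of each type, so very old content may be out of reach of this approach.

        import praw  # third-party library: pip install praw

        # Placeholder credentials: create a "script" app at https://www.reddit.com/prefs/apps
        reddit = praw.Reddit(
            client_id="YOUR_CLIENT_ID",
            client_secret="YOUR_CLIENT_SECRET",
            username="YOUR_USERNAME",
            password="YOUR_PASSWORD",
            user_agent="content-cleanup script by u/YOUR_USERNAME",
        )

        me = reddit.user.me()

        # Delete your comments, newest first (listings are capped at ~1,000 items)
        for comment in me.comments.new(limit=None):
            comment.delete()

        # Delete your posts (submissions), newest first
        for submission in me.submissions.new(limit=None):
            submission.delete()

Keep in mind that deletion only helps going forward: it cannot recall copies of your content that have already been scraped or licensed to third parties.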

WordPress.com

If you host your website on WordPress.com, your content is used by WordPress.com for AI training by default (and is also shared with third-party services).

  • To turn this off, go to the admin panel (dashboard) of your WordPress.com-hosted website (for example, https://YOURWEBSITE.wordpress.com/wp-admin/). Then navigate to Settings → General (or Hosting → Site Settings if you use WP-Admin).
  • Scroll down to the Personalization section and enable the "Prevent third-party sharing for YOURWEBSITE.wordpress.com" option.
  • If you run multiple websites on WordPress.com, you need to enable this option for each of them, one by one!

Screenshot of the WordPress.com dashboard with the option controlling access to your website for AI training

For more details about excluding your content from AI training, please visit this page: https://wordpress.com/support/privacy-settings/make-your-website-public/#prevent-third-party-sharing.

GitHub Copilot

IMPORTANT: GitHub Copilot uses public repositories with permissive licenses for training its AI models. Here are three ways to prevent GitHub Copilot from using your code for training:

  1. Do not use GitHub: The most effective way to ensure your code is not used for AI training is to avoid using GitHub altogether.
  2. Keep repositories private: By keeping your repositories private, you can prevent GitHub Copilot from accessing and using your code for training purposes.
  3. Use restrictive licenses for public repositories: For public repositories, you can use a restrictive license or even no license to avoid giving explicit permission for your code to be used for AI training. This can help limit the use of your code by GitHub Copilot.

If your organization uses GitHub Copilot, you can exclude certain content from being used by it. For more details, visit this page.

GitHub - Your Repositories In Datasets Used By Popular LLMs

Many LLMs (Large Language Models) are trained on The Stack v2 dataset, a collection of public source code in 600 programming languages, about 67 terabytes in size. It includes source code from GitHub repositories with permissive licenses.

You can prevent your code (from your public GitHub repositories) from being included in The Stack v2 dataset and thereby exclude it from being used for AI training.

If you want to learn more about The Stack v2 dataset, please visit https://huggingface.co/datasets/bigcode/the-stack-v2.

Substack

IMPORTANT: By default, Substack does not block AI bots from using your content for training purposes.

To prevent AI bots from using your Substack content for training, you need to enable the "Block AI training" option, which is available on the "Settings" page of your Substack account.

Follow these steps to enable the "Block AI training" option:

  1. Log in to your Substack account and navigate to the "Settings" page.
  2. Scroll down to the "Block AI training" section.
  3. Enable the "Block AI training" option by toggling the switch.

Please note that this option only updates the public rules for AI bots (robots.txt and similar rules defining access). Some AI bots have been observed ignoring these rules, so if your Substack page is public, some AI bots may still access your content despite this option. In other words, this option does not guarantee that your content will not be used for AI training.
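
As noted above, this option works by updating the public crawler rules in your publication's robots.txt file. You can inspect the result at https://YOURNAME.substack.com/robots.txt (substitute your own subdomain; custom domains serve the file at their own root). The exact list of user agents is Substack's choice and may change, but the blocked entries generally look like this:

        User-agent: GPTBot
        Disallow: /

        User-agent: CCBot
        Disallow: /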

Add rules for AI bots to your website

If you run a commercial website to promote your business, sell products, or provide a service, you may want to allow your website to be indexed by AI search engines so that regular users can easily discover it via AI when needed.

However, if you don't want your content to be scraped by AI bots (or you want that content to be available exclusively through your own website), you first need to create a robots.txt file, which defines rules for web bots. If your website has no robots.txt file, web bots (including those used by AI) will assume they are allowed to read and learn from your website without any restrictions.

Here are some examples of how to use the robots.txt file to control the access of web bots, including AI bots, to your website:

Example 1: Disallow All Bots

        User-agent: *
        Disallow: /

This rule tells all bots that they are not allowed to access any part of your website.

Example 2: Allow All Bots

        User-agent: *
        Disallow:

This rule tells all bots that they are allowed to access all parts of your website.

Example 3: Disallow Specific Bots

        User-agent: CCBot
        Disallow: /

        User-agent: anthropic-ai
        Disallow: /

These rules tell specific bots that they are not allowed to access any part of your website. Here, "CCBot" is the Common Crawl crawler (whose data is used by ChatGPT and many others) and "anthropic-ai" is a user agent associated with Anthropic.
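
If your goal is specifically to keep AI training crawlers out while leaving ordinary search indexing untouched, you can combine several such entries. The user agents below are documented by their respective operators (GPTBot by OpenAI, ClaudeBot by Anthropic, Google-Extended by Google, PerplexityBot by Perplexity, CCBot by Common Crawl), but crawler names change over time, so treat this as a starting point rather than a complete list:

        User-agent: GPTBot
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /

        User-agent: Google-Extended
        Disallow: /

        User-agent: PerplexityBot
        Disallow: /

        User-agent: CCBot
        Disallow: /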

Example 4: Disallow All Bots from Specific Directories

        User-agent: *
        Disallow: /private/
        Disallow: /tmp/

This rule tells all bots that they are not allowed to access the "/private/" and "/tmp/" directories of your website.

IMPORTANT: this rule is voluntary, and some web and AI bots may ignore it and read these directories anyway.

Example 5: Allow Specific Bots

        User-agent: Googlebot
        Disallow:

        User-agent: *
        Disallow: /

This rule tells Googlebot (used for Google Search indexing) that it is allowed to access all parts of your website, while all other bots (matched by *) are not allowed to access any part of it.

By using these examples, you can customize the robots.txt file to control which bots can access your website and which parts of your website they can access.

That's why we've created this free tool that allows you to quickly create a robots.txt file that disallows AI bots from reading and learning from your website.

IMPORTANT: while the robots.txt file is the standard way to define rules for web bots, it is not a perfect solution. Its rules are voluntary, and some AI bots have been caught ignoring them, accessing and capturing content from websites despite the rules defined in robots.txt. For more protection (if you need it), please consider adding login/password protection to your website.
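
If you control your web server, you can also go beyond voluntary rules and refuse requests from known AI crawlers outright. Here is a minimal sketch using Python and the Flask framework; the user-agent list is an illustrative assumption (reuse the names from the robots.txt examples above), and since user-agent strings can be spoofed, this is best-effort rather than airtight:

        from flask import Flask, abort, request

        app = Flask(__name__)

        # Illustrative list of AI crawler user-agent substrings; extend as needed
        AI_BOT_SIGNATURES = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

        @app.before_request
        def block_ai_bots():
            # Reject any request whose User-Agent matches a known AI crawler
            ua = request.headers.get("User-Agent", "")
            if any(bot in ua for bot in AI_BOT_SIGNATURES):
                abort(403)  # Forbidden

        @app.route("/")
        def index():
            return "Hello, humans!"

The same idea can be implemented in any web server or framework, for example with user-agent matching rules in your reverse proxy.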

About The Author

E Miao

Experienced SEO marketer with 15+ years of experience. Worked with Fortune 500 companies and startups.
