There is concern about the lack of an easy way to opt out of having one’s content used to train large language models (LLMs) like ChatGPT. There is a way to do it, but it’s neither straightforward nor guaranteed to work.

How AIs Learn From Your Content

Large Language Models (LLMs) are trained on data that originates from multiple sources. Many of these datasets are open source and are freely used for training AIs.

Some of the sources used are:

  • Wikipedia
  • Government court records
  • Books
  • Emails
  • Crawled websites

There are actually portals and websites offering datasets that are giving away vast amounts of information.

One of the portals is hosted by Amazon, offering thousands of datasets at the Registry of Open Data on AWS.

How to Block ChatGPT From Using Your Website ContentScreenshot from Amazon, January 2023

The Amazon portal with thousands of datasets is just one portal out of many others that contain more datasets.

Wikipedia lists 28 portals for downloading datasets, including the Google Dataset and the Hugging Face portals for…

Source link