How ChatGPT and our foundation models are developed

Learn more about how we develop our models and apply them in products like ChatGPT

OpenAI’s foundation models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information that we partner with third parties to access, and (3) information that our users, human trainers, and researchers provide or generate.

This article provides an overview of the publicly available information we use to help develop these models and how we collect and use that information in compliance with privacy laws. To understand how we collect and use information from users of our services, including how to opt out of having ChatGPT conversations used to help teach our models, please see our Privacy Policy and this help center article.

What is ChatGPT and how does it work?

ChatGPT is an artificial intelligence-based service that you can access via the internet. You can use ChatGPT for a wide range of tasks, including organizing and summarizing information, assisting with translations, analyzing or generating images, inspiring creativity and ideas, and other everyday activities. ChatGPT is designed to understand and respond to user questions and instructions by learning patterns from large amounts of information, including text, images, audio, and video. During training, the model analyzes relationships within this data—such as how words typically appear together in context—and uses that understanding to predict the next most likely word when generating a response, one word at a time. Similarly, models that generate other forms of content, like images, learn patterns in how pixels relate to each other and to associated captions in the training data.

For example, during the model’s learning process (known as “training”), the model might be tasked with completing a sentence like: “Instead of turning left, she turned ___.” Early in training, its responses are largely random. However, as the model processes and learns from a large volume of text, it becomes better at recognizing patterns and predicting the most likely next word. This process is repeated across millions of sentences to refine its understanding and improve its accuracy.
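The fill-in-the-blank idea above can be illustrated with a toy word-pair counter. This is only a sketch of "predict the next word from observed patterns"; real foundation models use neural networks trained on vastly more data, not simple counts.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus standing in for training text.
corpus = [
    "instead of turning left she turned right",
    "instead of turning left she turned back",
    "instead of turning left she turned right",
]

# Count how often each word follows another.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def predict_next(word):
    """Return the most frequently observed next word, if any."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("turned"))  # "right" was seen twice, "back" once
```

Early on (with little data) such predictions are unreliable; as more text is processed, the counts better reflect real usage, which mirrors the training progression the article describes.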

Because there are multiple plausible ways to complete a sentence—such as “Instead of turning left, she turned right,” “around,” or “back”—there is an inherent element of randomness in how the model responds. As a result, the same question may yield different answers across different queries.
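The randomness described above can be sketched as sampling from a probability distribution over candidate next words. The probabilities here are invented for illustration; actual models compute them from their learned parameters.

```python
import random

# Hypothetical learned probabilities for completing "she turned ___".
next_word_probs = {"right": 0.5, "around": 0.3, "back": 0.2}

def sample_completion(rng):
    """Pick one completion at random, weighted by its probability."""
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random()
samples = [sample_completion(rng) for _ in range(10)]
print(samples)  # the same prompt can yield different completions each time
```

Because the choice is weighted but random, repeated runs with the same prompt produce varying answers, which is why identical questions to ChatGPT can receive different responses.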

Machine learning models consist of large sets of numbers, known as “weights” or “parameters,” along with code that interprets and uses those numbers. These models do not store or retain copies of the sentences, images, or audio they are trained on. Instead, as a model learns, the values of its parameters are adjusted slightly to reflect patterns it has identified. In the earlier example, the model improved from predicting random words to making more accurate predictions not by storing the training sentences, but by updating its internal parameters. ChatGPT does not “copy and paste” from its training data, much as a teacher, after extensive study, can explain concepts by understanding the relationships between ideas without memorizing or reproducing the original materials verbatim. When generating a response to a user request, the model uses these learned weights to predict and create new content.
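The "adjust parameters, don't store data" point can be shown with the smallest possible model: a single number fit to a simple pattern. This is a deliberately minimal sketch; foundation models have billions of parameters, but the principle is the same.

```python
# The "model" is one parameter w, predicting y ≈ w * x.
# Training nudges w slightly toward lower error on each example;
# the examples themselves are never stored inside the model.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # underlying pattern: y = 2x

w = 0.0      # the parameter starts out uninformed
lr = 0.05    # size of each small adjustment

for _ in range(200):
    for x, y in data:
        error = w * x - y
        w -= lr * error * x  # update the parameter, not a copy of the data

print(round(w, 3))  # ends up near 2.0: the pattern is retained, not the examples
```

After training, the model consists only of the final value of `w`; the training pairs are gone, yet the model can now make accurate predictions for inputs it never saw, such as `x = 10`.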

What type of public information is used to teach ChatGPT?

For publicly available internet content, we use only information that is freely and openly accessible on the internet. We do not intentionally gather data from sources known to be behind paywalls or from the dark web. Additionally, we apply filters to remove material we do not want our models to learn from, such as hate speech, adult content, sites that aggregate personal information, and spam. The remaining information is then used to train our models.
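Conceptually, the filtering step above is a pass over candidate documents that drops anything from excluded sources or containing unwanted content. Everything in this sketch (the domain names, the keyword markers, the document format) is hypothetical; OpenAI's actual filters are far more sophisticated than a blocklist.

```python
# Hypothetical blocklists for illustration only.
BLOCKED_DOMAINS = {"example-data-broker.com", "example-paywall.com"}
BLOCKED_KEYWORDS = {"spam-offer", "hate-speech-marker"}

def keep_document(doc):
    """Return True if a document passes the (toy) content filters."""
    if doc["domain"] in BLOCKED_DOMAINS:
        return False
    text = doc["text"].lower()
    return not any(keyword in text for keyword in BLOCKED_KEYWORDS)

documents = [
    {"domain": "example-encyclopedia.org", "text": "A public article about rivers."},
    {"domain": "example-data-broker.com", "text": "Aggregated personal records."},
    {"domain": "example-forum.net", "text": "Amazing spam-offer just for you!"},
]

training_set = [d for d in documents if keep_document(d)]
print(len(training_set))  # only the first document survives filtering
```

Only documents that pass every filter are retained, matching the article's description that the remaining information, after removal of unwanted material, is what gets used for training.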

Is personal information used to teach ChatGPT?

A significant portion of online content involves information about people, so our training data may incidentally include personal information. However, we do not intentionally collect personal information for the purpose of training our models.

We use training data solely to develop the model’s capabilities—such as prediction, reasoning, and problem-solving—not to build user profiles, contact individuals, advertise or market to them, or sell personal information.

In some cases, models may learn from personal information to understand how elements like names and addresses function in language, or to recognize public figures and well-known entities. This helps the model generate more accurate and contextually appropriate responses.

We take active steps to limit the processing of personal information during training. For example, we exclude sources that aggregate large amounts of personal data, and we train our models to avoid responding to requests for private or sensitive information about individuals.

How does the development of ChatGPT comply with privacy laws?

We use training information lawfully. Our foundation models power a wide range of beneficial applications—from content creation and customer support to software development, personalized education, and scientific research. These capabilities depend on large-scale training data. The information used to train our models is publicly available and is not intended to cause harm to individuals. We base our collection and use of personal information that is included in training information on legitimate interests under privacy laws like the GDPR, as explained in more detail in our Privacy Policy. We have completed a data protection impact assessment to help ensure we are collecting and using this information legally and responsibly.

We respond to objection requests and similar rights requests. As a result of learning language, ChatGPT responses may sometimes include personal information about individuals whose personal information appears multiple times on the public internet (for example, public figures). Individuals in certain jurisdictions can object to the processing of their personal information by our models or make other data subject rights requests through our Privacy Portal. You can also exercise these rights by reaching out to dsar@openai.com.

Please be aware that, in accordance with privacy laws, some rights may not be absolute. We may decline a request if we have a lawful reason for doing so. However, we strive to prioritize the protection of personal information and to comply with all applicable privacy laws. If you feel we have not adequately addressed an issue, you have the right to lodge a complaint with your local supervisory authority.

For more information about OpenAI’s practices with respect to personal information we collect from or about you when you use our website, applications, and services, please see our Privacy Policy.
