OpenAI

Wie ChatGPT und unsere Foundation-Modelle entwickelt werden

Mehr erfahren darüber, wie wir unsere Modelle entwickeln und sie in Produkten wie ChatGPT einsetzen.

Aktualisiert: 1 hour ago

OpenAI’s foundation models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information that we partner with third parties to access, and (3) information that our users, human trainers, and researchers provide or generate.

Developing foundation models like those used in ChatGPT involves several stages, including preparing training data, pre-training, and post-training, as well as ongoing evaluation and improvement after deployment. Different types of information may be used at these stages for various purposes, including improving model performance, reliability, and safety. 

This article provides an overview of the information we use to help develop these models, how we collect and use that information in compliance with privacy laws, and the safeguards we apply throughout the training process. To understand how we collect and use information from users of our services, including how to opt out of having ChatGPT conversations used to help improve our models, please see our Privacy Policy and this help center article.

What is ChatGPT and how does it work?

ChatGPT is an artificial intelligence-based service that you can access via the internet or app. You can use ChatGPT for a wide range of tasks, including organizing and summarizing information, assisting with translations, supporting coding, research and analysis, completing multi-step tasks across tools, analyzing or generating images, inspiring creativity and ideas, and other everyday activities. ChatGPT is designed to understand and respond to user questions and instructions by learning patterns from large amounts of information, including text, images, audio, and video. 

During training, the model analyzes relationships within this data—such as how words typically appear together in context—and uses that understanding to predict the next most likely word when generating a response, one word at a time. Text may be converted into smaller units, sometimes called “tokens” which may represent whole words, parts of words, or punctuation. Tokens are the building blocks of text that the model processes. Similarly, models that generate other forms of content, like images, learn patterns in how pixels relate to each other and to associated captions in the training data.

For example, during the model’s learning process (known as “training”), the model might be tasked with completing a sentence like: “Instead of turning left, she turned ___.” Early in training, its responses are largely random. However, as the model processes and learns from a large volume of text, it becomes better at recognizing patterns and predicting the most likely next word. This process is repeated across millions of sentences to refine its understanding and improve its accuracy.

Because there are multiple plausible ways to complete a sentence—such as “Instead of turning left, she turned right,” “around,” or “back”—there is an inherent element of randomness in how the model responds. As a result, the same question may yield different answers across different queries.

Machine learning models consist of large sets of numbers, known as “weights” or “parameters,” along with code that interprets and uses those numbers. These models do not store or retain copies of the data they are trained on. Instead, as a model learns, the values of its parameters are adjusted slightly to reflect patterns it has identified. In the earlier example, the model improved from predicting random words to making more accurate predictions—not by storing the training sentences, but by updating its internal parameters. The model does not retain copies of the sentences, images, or audio it processes during training. ChatGPT does not “copy and paste” from its training data—similar to how a teacher, after extensive study, can explain concepts by understanding the relationships between ideas without memorizing or reproducing the original materials verbatim. When generating a response to a user request, the model uses these learned weights to predict and create new content.

What type of information is used to teach ChatGPT?

For publicly available internet content, we use only information that is freely and openly accessible on the internet. This may include publicly available webpages, public forums, public blogs, public posts, and other publicly available online content. For example, if you participate in a publicly available online discussion forum or post a public blog or other post, we may use that publicly accessible content for model training purposes. However, we take steps to reduce the processing of personal information in our training process.  When collecting publicly available internet content, we do not intentionally gather data from sources known to be behind paywalls or from the dark web. Additionally, we apply filters to remove material we do not want our models to learn from, such as hate speech, adult content, sites that aggregate personal information, and spam. The remaining information is then used to train our models. 

Website owners can manage whether publicly available content from their sites may be accessed for use in training by using standard web controls such as robots.txt to disallow GPTBot, which may crawl publicly available content to help train our models. We provide guidance to help website owners manage how their sites and content interact with our AI systems. 

We also use information from third-party partners to help train and improve our models. This may include information in datasets that we access through agreements with third parties, as well as information provided or generated by human trainers and researchers where permitted under our policies and agreements. This helps improve the quality, safety, and performance of our models. These sources may include text, images, audio, video, or other data types, depending on the dataset.

We also increasingly use synthetic data in some training processes. For example, we may use information and our models to generate synthetic prompts, multilingual examples, or other training materials. Synthetic data can help improve model performance, including by supplementing training data in areas where data is sparse or imbalanced, and may also support privacy-enhancing approaches to model development.

Is personal information used to teach ChatGPT?

A significant portion of online content involves information about people, so our training data may incidentally include personal information. However, we take steps to reduce the processing of personal information in our training process.

We use training data to develop the model’s capabilities—such as prediction, reasoning, and problem-solving—not to build profiles of individuals, contact them, or personalize ads to them.

In some cases, models may learn from personal information to understand how elements like names and addresses function in language, or to recognize public figures and well-known entities. This helps the model generate more accurate and contextually appropriate responses.

How is personal information protected during training?

We take active steps to limit the processing of personal information during training. For example, we exclude known sources that aggregate large amounts of personal data, apply filtering to reduce personal information in the training process, and take steps to identify and remove duplicate content to reduce the risk of repeating training data. In addition, we train our models to avoid responding to requests for private or sensitive information about individuals. 

How long we retain information

We retain information in training data only for as long as reasonably necessary for the purposes described in this article and our Privacy Policy, including to develop and improve our models and for related scientific research purposes. Retention is subject to periodic review to ensure continued necessity, and varies depending on the type of information and how it is used. In determining retention, we consider factors such as our purpose for processing the information, the amount, nature and sensitivity of the information, the potential risk of harm from unauthorised use or disclosure and any legal obligations we are subject to.

How does the development of ChatGPT comply with privacy laws?

We use training information lawfully. Our foundation models power a wide range of beneficial applications—including accessibility tools, customer support, software development, personalized education, and scientific research. These capabilities depend on large-scale training data, including publicly available information, and information from third-party partners. We apply safeguards throughout the training process, including steps designed to reduce the processing of personal information in the training process and to mitigate risks, as described in this article. We base our collection and use of personal information that is included in training information on legitimate interests under privacy laws like the GDPR, including to train and improve our models for users and broader society in line with our mission to ensure that artificial general intelligence benefits everyone, as explained in more detail in our Privacy Policy. We have completed a data protection impact assessment to help ensure we are collecting and using this information legally and responsibly.

When information may be shared or transferred

We don’t “sell” personal information, and only disclose personal information in training data in the limited circumstances described in our Privacy Policy. For example, we may share information with affiliates, vendors, and service providers who support the development, testing, and improvement of our models. We may also disclose information in the good faith belief that such action is necessary to comply with a legal obligation or to protect our rights, safety, and security and those of our users, employees, or the public, as described in our Privacy Policy.

As our infrastructure is global, personal information in training data may be processed in countries outside the EEA, Switzerland or the UK (including in the United States). Where this occurs, we apply appropriate safeguards, such as adequacy decisions or standard contractual clauses, as described in our Privacy Policy.

Your rights and how to exercise them

We respond to objection requests and similar rights requests. As a result of learning language, ChatGPT responses may sometimes include personal information about individuals whose personal information appears multiple times on the public internet (for example, public figures). Individuals in certain jurisdictions can object to the processing of their personal information by our models or make other data subject rights requests through our Privacy Portal. You can also exercise these rights by reaching out to privacy@openai.com

To help us assess and respond to your request, please provide enough information for us to understand what personal information your request relates to, such as your name, relevant URLs, specific examples of model outputs, or other details that help identify the issue. In some cases, we may ask you to verify your identity or confirm that the information relates to you before we can take action. More information about how to submit these requests, including best practices and how requests are reviewed, is available in our Help Center article on personal data removal from ChatGPT. We review requests in accordance with applicable privacy laws and respond within applicable legal timelines.

Please be aware that, in accordance with privacy laws, some rights may not be absolute. For example, we may not be able to fulfil a request where we cannot verify the relevant information, where the request does not relate to personal information processed by OpenAI, where an exemption applies, or where we have another lawful reason for doing so. Requests are assessed on a case-by-case basis and may involve balancing privacy rights against other important considerations, such as freedom of expression and the public interest. 

However, we strive to prioritize the protection of personal information, and comply with all applicable privacy laws. If you feel we have not adequately addressed an issue, you have the right to lodge a complaint with your local supervisory authority.

For more information about OpenAI’s practices with respect to personal information we collect from or about you when you use our website, applications, and services, please see our Privacy Policy.

War dieser Artikel hilfreich?