Understanding Codex training data and outputs
Written by Michael Schade

OpenAI cares deeply about developers and is committed to respecting their rights. Our hope is that Codex will lower barriers to entry and increase opportunities for beginner programmers, make expert programmers more productive, and create new code-generation tools.

The Codex model was trained on tens of millions of public repositories, which served as training data for research purposes in the design of Codex. We believe this is an instance of transformative fair use.

The source material from those public repositories is intended for these research and training purposes only; it is not intended to be included verbatim in Codex outputs. Analysis has shown that, even at this early stage of development, the vast majority of output (>99%) does not match the training data. Of course, like all computer programs, some source material contains common, widely used solutions that are standard and/or functionally mandated.

During this early, developmental stage of Codex, we continue to refine the product in numerous ways. We welcome feedback from developers during our free beta period, including any questions or concerns about the generated output.
