The following topics and questions summarize the areas of research in which OpenAI is interested:
Alignment: How can we determine what objective, if any, a model is best understood as pursuing? How do we increase the extent to which that objective is aligned with human preferences, such as via prompt design or fine-tuning?
Fairness and Representation: How should performance criteria be established for fairness and representation in language models? How can language models be improved in order to effectively support the goals of fairness and representation in specific, deployed contexts?
Interdisciplinary Research: How can AI development draw on insights from other disciplines such as philosophy, cognitive science, and sociolinguistics?
Interpretability / Transparency: How do these models work, mechanistically? Can we identify what concepts they’re using, extract latent knowledge from the model, make inferences about the training procedure, or predict surprising future behavior?
Misuse Potential: How can systems like the API be misused? What sorts of ‘red teaming’ approaches can we develop to help us and other AI developers think about responsibly deploying technologies like this?
Model Exploration: Models like those served by the API have a variety of capabilities which we have yet to explore. We’re excited by investigations in many areas, including model limitations, linguistic properties, commonsense reasoning, and potential applications to many other problems.
Robustness: Generative models have uneven capability surfaces, with the potential for surprisingly strong and surprisingly weak areas of capability. How robust are large generative models to “natural” perturbations in the prompt, such as phrasing the same idea in different ways or with/without typos? Can we predict the kinds of domains and tasks for which large generative models are more likely to be robust (or not robust), and how does this relate to the training data? Are there techniques we can use to predict and mitigate worst-case behavior? How can robustness be measured in the context of few-shot learning (e.g. across variations in prompts)? Can we train models so that they satisfy safety properties with a very high level of reliability, even under adversarial inputs?
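As one illustration of the robustness question about "natural" perturbations, the sketch below scores how consistently a model answers across rephrasings and typo variants of the same prompt. It is a minimal sketch under stated assumptions, not a method prescribed by OpenAI: the names `add_typo`, `consistency`, and `toy_model` are hypothetical, the model is assumed to be any callable that maps a prompt string to an answer string, and agreement with the most common answer is only one of many possible robustness measures.

```python
import random
from collections import Counter
from typing import Callable, List


def add_typo(prompt: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position (a simple typo model)."""
    if len(prompt) < 2:
        return prompt
    i = rng.randrange(len(prompt) - 1)
    chars = list(prompt)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def consistency(model: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompt variants whose answer matches the most common answer."""
    if not prompts:
        raise ValueError("need at least one prompt variant")
    # Normalize answers so trivial case/whitespace differences do not count.
    answers = [model(p).strip().lower() for p in prompts]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model (assumption for illustration):
    it answers correctly only when 'capital' is spelled correctly."""
    return "Paris" if "capital" in prompt.lower() else "unknown"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the typo variants are reproducible
    base = "What is the capital of France?"
    variants = [base, "Name the capital city of France."]
    variants += [add_typo(base, rng) for _ in range(3)]
    print(f"consistency across {len(variants)} variants:",
          consistency(toy_model, variants))
```

To evaluate a real system, `toy_model` would be replaced by a function that sends the prompt to the model under study and returns its text output. A score near 1.0 suggests the answer is stable under these perturbations; it says nothing about whether the stable answer is correct, so in practice it would be paired with an accuracy check.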