
Every organization is bound by rate limits, which determine how many requests can be sent in a given period. This error indicates that your request exceeded one of those limits.

Rate limits can be quantized, meaning they are enforced over shorter windows of time (e.g., a limit of 60,000 requests/minute may be enforced as 1,000 requests/second). Sending short bursts of requests, or requests with long contexts (prompt + max_tokens), can therefore trigger rate limit errors even when you are technically below the per-minute limit.

How can I fix it?

  • Include exponential back-off logic in your code to catch and retry failed requests. Waiting progressively longer between retries gives the rate limiter time to reset.
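    A minimal sketch of that back-off loop in Python. The wrapped callable, the `RuntimeError` used to signal a rate-limit error, and the delay parameters are all illustrative; substitute your client library's call and its specific rate-limit exception.

    ```python
    import random
    import time

    def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
        """Retry request_fn with exponential back-off plus jitter.

        request_fn is any callable that raises on a rate-limit error;
        which exception it raises depends on your client library.
        """
        for attempt in range(max_retries):
            try:
                return request_fn()
            except RuntimeError:  # substitute your client's rate-limit error
                if attempt == max_retries - 1:
                    raise  # out of retries; surface the error to the caller
                # Wait base_delay * 2^attempt seconds, plus random jitter so
                # that many clients do not all retry in lockstep.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                time.sleep(delay)
    ```

    The jitter matters in practice: without it, every client that failed at the same moment retries at the same moment, recreating the burst that caused the error.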

  • For token limits

    • Reduce max_tokens to match the size of your completions. Usage is estimated from this value, so lowering it reduces the chance that you unexpectedly receive a rate limit error. For example, if your prompt produces completions of around 400 tokens, set max_tokens to roughly that size.

    • Optimize your prompts. Shorten your instructions, remove redundant words, and trim unnecessary examples. You may need to iterate on the prompt and retest it after these changes to confirm it still performs well. A shorter prompt has the added benefit of reduced cost. If you need help, let us know.

  • For request limits

    • Batch your prompts in an array. This reduces the number of requests you need to make: the prompt parameter can hold up to 20 unique prompts per request.
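A sketch of the batching tip: instead of 20 separate requests, send one request whose prompt field is a list. The 20-prompt cap comes from the article above; the payload shape and model name are illustrative, not a specific client's API.

```python
# Twenty prompts batched into a single hypothetical request payload,
# rather than twenty separate requests counted against the rate limit.
prompts = [f"Write a tagline for product {i}." for i in range(20)]

batched_request = {
    "model": "example-model",  # placeholder model name
    "prompt": prompts,         # up to 20 unique prompts per request
    "max_tokens": 60,          # sized to the expected completion length
}
```

Batching trades request count for context size, so keep the earlier token-limit advice in mind: a batch of long prompts can still trip the token side of the rate limit.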
