The first time you use your fine-tuned model after a period of inactivity, it may take a while to load. Until loading completes, requests can fail with a 429 error code and a message that reads "the model is still being loaded".
How long loading takes depends on shared traffic and on the size of the model. A larger model like gpt-4, for example, might take up to a few minutes to load, while smaller models might load much faster.
Once the model is loaded, ChatCompletion requests should be much faster and timeouts are less likely.
We recommend handling these errors programmatically with retry logic: retry failing calls with exponential backoff until they succeed, then continue as normal (see the "Retrying with exponential backoff" section of this notebook for examples).
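As a minimal sketch of that retry logic, the helper below retries a callable with exponentially growing delays. The function name, parameters, and the stubbed request are illustrative, not part of the OpenAI SDK; in practice you would wrap your actual ChatCompletion call and catch the SDK's rate-limit exception instead of the generic one shown here.

```python
import random
import time


def retry_with_exponential_backoff(
    func,
    max_retries=5,          # give up after this many retries
    initial_delay=1.0,      # seconds to wait before the first retry
    backoff_factor=2.0,     # multiply the delay by this after each failure
    jitter=True,            # randomize delays to avoid synchronized retries
    retryable_exceptions=(Exception,),  # narrow this to your SDK's error type
):
    """Call func(), retrying with exponential backoff on failure."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable_exceptions:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            time.sleep(delay * (1 + random.random()) if jitter else delay)
            delay *= backoff_factor


# Hypothetical usage: wrap the request that may fail while the model loads.
def make_request():
    # e.g. return client.chat.completions.create(...)
    return "response"


response = retry_with_exponential_backoff(make_request)
```

Catching only the specific "model is still being loaded" error (rather than every exception) keeps genuine bugs from being silently retried.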