How can I tell how many tokens a string will have before I try to embed it?
For V2 embedding models, as of Dec 2022, there is not yet a way to split a string into tokens locally. The only way to get a token count is to submit an API request.
If the request succeeds, you can extract the number of tokens from the response: `response["usage"]["total_tokens"]`
If the request fails because the input has too many tokens, you can extract the number of tokens from the error message, e.g.: `This model's maximum context length is 8191 tokens, however you requested 10000 tokens (10000 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.`
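As an illustration, here is a minimal Python sketch that tries both routes; it assumes the 2022-era `openai` library (with an API key in the `OPENAI_API_KEY` environment variable), and the regex is keyed to the error message format shown above:

```python
import re

import openai  # 2022-era library; reads OPENAI_API_KEY from the environment


def count_tokens_via_api(text: str, model: str = "text-embedding-ada-002") -> int:
    """Count tokens by submitting an embedding request.

    Returns the count from the response if the request succeeds, or parses it
    out of the too-many-tokens error message if the request fails.
    """
    try:
        response = openai.Embedding.create(input=text, model=model)
        return response["usage"]["total_tokens"]
    except openai.error.InvalidRequestError as e:
        # Error text looks like: "... however you requested 10000 tokens ..."
        match = re.search(r"requested (\d+) tokens", str(e))
        if match:
            return int(match.group(1))
        raise  # some other invalid-request error; re-raise it
```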
For V1 embedding models, which are based on GPT-2/GPT-3 tokenization, you can count tokens in a few ways:
- For one-off checks, the OpenAI tokenizer page is convenient
- In Python, `transformers.GPT2TokenizerFast` (the GPT-2 tokenizer is the same as GPT-3's); see the sketch after this list
- In JavaScript, the `gpt-3-encoder` package
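For example, counting tokens in Python (a minimal sketch, assuming the `transformers` package is installed):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def num_tokens_from_string(string: str) -> int:
    """Return the number of GPT-2/GPT-3 tokens in a text string."""
    return len(tokenizer.encode(string))


num_tokens_from_string("Hello, world!")  # -> 4
```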
How can I retrieve K nearest embedding vectors quickly?
For searching over many vectors quickly, we recommend using a vector database.
Vector database options include Pinecone, Weaviate, Milvus, Qdrant, and Redis, among others.
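For collections small enough to fit in memory, exact nearest-neighbor search with numpy can be sufficient before reaching for a database; a minimal sketch (the array names and sizes are illustrative):

```python
import numpy as np

# Illustrative data: each row is one unit-length embedding vector
embeddings = np.random.randn(10_000, 1536)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)


def k_nearest(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k rows most similar to `query`.

    Because the vectors are unit length, a dot product equals cosine similarity.
    """
    similarities = embeddings @ query
    return np.argsort(-similarities)[:k]


k_nearest(embeddings[0], embeddings, k=5)  # the query is most similar to itself
```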
Which distance function should I use?
We recommend cosine similarity. The choice of distance function typically doesn’t matter much.
OpenAI embeddings are normalized to length 1, which means that:
- Cosine similarity can be computed slightly faster using just a dot product
- Cosine similarity and Euclidean distance will result in identical rankings
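Both points follow from a simple identity for unit-length vectors: squared Euclidean distance equals 2 minus twice the dot product, so it is a monotone decreasing function of cosine similarity and ranks neighbors in the same order. A quick numeric check in Python:

```python
import numpy as np

# Two illustrative unit-length vectors
a = np.random.randn(1536)
a /= np.linalg.norm(a)
b = np.random.randn(1536)
b /= np.linalg.norm(b)

cosine = a @ b  # for unit vectors, dot product == cosine similarity
euclidean_sq = np.sum((a - b) ** 2)

# ||a - b||^2 = 2 - 2 * (a . b) when ||a|| = ||b|| = 1,
# so ranking by either measure gives the same order
assert np.isclose(euclidean_sq, 2 - 2 * cosine)
```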