Before sending a string for embedding, you can estimate how many tokens it will use with OpenAI's tiktoken tokenizer library. This is especially useful because embedding models (like text-embedding-3-small) have maximum token limits you'll need to stay within.
How to Count Tokens with Tiktoken
You can use the tiktoken Python package to calculate the number of tokens a string will generate. Here's a sample code snippet:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Example usage
num_tokens = num_tokens_from_string("tiktoken is great!", "cl100k_base")
print(num_tokens)
Important:
For third-generation embedding models (e.g., text-embedding-3-small or text-embedding-3-large), you should use the "cl100k_base" encoding. Different models may require different encodings, so always refer to the model documentation if unsure.
Why Token Counting Matters
If your string exceeds the model’s maximum input size, your API request will fail.
Accurately counting tokens ahead of time ensures smoother embedding workflows and prevents errors during processing.
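As a sketch of this pre-flight check, the helpers below filter out strings that exceed a token budget before any API call is made. The function names and the default limit are assumptions for illustration: `split_for_embedding` is a hypothetical helper, and 8191 is OpenAI's published input cap for the text-embedding-3-* models. The counting function is passed in as a parameter, so you can plug in the tiktoken-based counter from above.

```python
from typing import Callable, List, Tuple

# Published maximum input size for text-embedding-3-* models (assumption:
# check the current model documentation before relying on this value).
EMBEDDING_TOKEN_LIMIT = 8191

def fits_token_limit(text: str,
                     count_tokens: Callable[[str], int],
                     limit: int = EMBEDDING_TOKEN_LIMIT) -> bool:
    """Return True if `text` is within the model's token budget."""
    return count_tokens(text) <= limit

def split_for_embedding(texts: List[str],
                        count_tokens: Callable[[str], int],
                        limit: int = EMBEDDING_TOKEN_LIMIT
                        ) -> Tuple[List[str], List[str]]:
    """Partition texts into those safe to embed and those needing chunking."""
    ok, too_long = [], []
    for text in texts:
        (ok if fits_token_limit(text, count_tokens, limit) else too_long).append(text)
    return ok, too_long
```

In practice you would pass a tiktoken-based counter (e.g. a lambda wrapping `num_tokens_from_string`) and chunk or truncate anything that lands in the `too_long` list before embedding it.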