The Search endpoint (/search) returns a relevance score for each document ranked by semantic search. The higher the relevance score for a document, the more semantically similar the document is to the query. Note that if the documents are referenced through the file parameter and the keyword search returns no results, the search fails with InvalidRequestError and semantic search is not performed. Returned results are not sorted according to their relevance scores.
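Because results come back unsorted, a common first step is to sort them by score and map each one back to its document. The following is a minimal sketch, not the library's own method. It assumes each entry in the response's `data` list carries a `score` and a `document` index pointing into the submitted list. The `rank_results` helper and the `mock_response` values are hypothetical and only illustrate the shape of the data.

```python
def rank_results(response, documents):
    """Return (title, score) pairs sorted from most to least relevant."""
    # Sort the result entries by score, highest first.
    ranked = sorted(response["data"], key=lambda r: r["score"], reverse=True)
    # Assumption: each entry's "document" field is an index into `documents`.
    return [(documents[r["document"]], r["score"]) for r in ranked]


# Hypothetical payload; these scores are made up for illustration.
mock_documents = ["Moby-Dick", "The Lord of the Rings", "Frankenstein"]
mock_response = {
    "data": [
        {"document": 0, "score": 12.5},
        {"document": 1, "score": 253.0},
        {"document": 2, "score": 98.1},
    ]
}

for title, score in rank_results(mock_response, mock_documents):
    print(f"{score:7.1f}  {title}")
```

Sorting on the client side keeps the original `data` list untouched, so the unsorted order stays available if it is needed later.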

Each search query produces a different distribution of scores for a fixed group of documents. Let’s say we are searching through a list of titles of 10 must-read classic books.

import openai
import pandas as pd

# API Key
openai.api_key = MY_KEY

# Creating a list of 10 classic book titles
documents = ['Pride and Prejudice by Jane Austen (1813)',
             'To Kill a Mockingbird by Harper Lee (1960)',
             'The Great Gatsby by F. Scott Fitzgerald (1925)',
             'Moby-Dick by Herman Melville (1851)',
             'The Lion, the Witch and the Wardrobe by C.S. Lewis (1950)',
             'To the Lighthouse by Virginia Woolf (1927)',
             'Frankenstein by Mary Shelley (1823)',
             'The Lord of the Rings by J. R. R. Tolkien (1954)',
             'The Adventures of Huckleberry Finn by Mark Twain (1884)',
             'Great Expectations by Charles Dickens (1860)']

# Search with query 'fantasy'
results = openai.Engine("ada").search(documents=documents, query="fantasy")

# Sort the results by score and print them to see the scores
response_df = pd.DataFrame(results["data"])
sorted_df = response_df.sort_values(by=['score'], ascending=False)
print(sorted_df)

The resulting scores (when sorted) look like this

The query ‘fantasy’ returns a maximum relevance score of 253 for The Lord of the Rings by J. R. R. Tolkien (Mean: 77, Standard Deviation: 103). By contrast, the query ‘drama’ returns a maximum relevance score of 43 for Pride and Prejudice by Jane Austen (Mean: -28, Standard Deviation: 50). The variation is a consequence of the search setup, where the query's probability is conditioned on the document's probability.


Suppose we want to determine how ‘The Great Gatsby by F. Scott Fitzgerald’ ranks against other titles in the list in their relevance to ‘tragedy’ vs ‘drama’. The query ‘tragedy’ returns a relevance score of 40.5, while ‘drama’ returns 41.4 for The Great Gatsby. Comparing the raw scores doesn’t tell us much about how the title ranks within each distribution. For such comparisons, it helps to normalize the score in order to use a common scale without distorting the differences in the range of values.

One way to normalize the score is to compute the z-score by calculating the difference between the similarity score and the sample mean, and dividing the difference by the sample standard deviation.
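The z-score calculation described above can be sketched in plain Python as follows. The `z_scores` helper and the `example_scores` values are hypothetical and are not real API output. The sketch uses the sample standard deviation (n - 1 denominator), as the description specifies.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Normalize scores: (score - sample mean) / sample standard deviation."""
    mu = mean(scores)
    sigma = stdev(scores)  # sample standard deviation (n - 1 denominator)
    return [(s - mu) / sigma for s in scores]


# Hypothetical raw scores for one query; not real API output.
example_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
print(z_scores(example_scores))
```

After normalization, the scores for a query have a mean of 0 and a sample standard deviation of 1, which is what puts different queries on a common scale. pandas' `Series.std()` also defaults to the sample standard deviation, so this matches a pandas-based calculation.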

For instance, with our documents containing the top 10 classic book titles:

import pandas as pd

search_results = pd.DataFrame()

# Save the book titles under the 'title' column
search_results["title"] = documents

# We'll be comparing 2 queries
queries = ["tragedy", "drama"]

# Call the Search API once per query
for query in queries:
    results = openai.Engine("ada").search(documents=documents, query=query)

    # Save results in a DataFrame
    response_df = pd.DataFrame(results["data"])

    # Calculate the sample mean and standard deviation
    mean = response_df["score"].mean()
    std = response_df["score"].std()

    # Save raw scores, e.g. for 'drama' in the 'drama similarity score' column
    search_results[query + " similarity score"] = response_df["score"]

    # Calculate and save the normalized score in a separate column
    search_results[query + " normalized score"] = (response_df["score"] - mean) / std

print(search_results)

Normalizing the scores is useful because it helps us compare scores that are from distributions with different scales.

While the raw similarity scores for ‘tragedy’ and ‘drama’ are close, with normalization we can infer that The Great Gatsby ranks less than 1 standard deviation above the mean for ‘tragedy’, whereas it’s more than 1 standard deviation above the mean for ‘drama’.

Similarly, normalized scores for the query ‘romance’ help us compare how the scores for two different titles fall within the same distribution.

With normalized scores, we can see that, for ‘romance’, Pride and Prejudice ranks more than 2 standard deviations above the mean, whereas The Great Gatsby is within 1 standard deviation of the mean.
