Introduction
In recent years, Natural Language Processing (NLP) and Machine Learning (ML) have made significant strides in search, especially in e-commerce. Building a product search engine that understands natural language and returns relevant results can be a game-changer for businesses. ChatGPT, a powerful language model, has the potential to transform the way people search for products online. In this article, we will walk through the steps required to build a ChatGPT-powered product search engine and show how it can enhance the search experience for customers. We will also cover the benefits of using ChatGPT and the technical requirements for implementing this cutting-edge technology via the OpenAI API. By the end of this article, with the help of code snippets and examples, you will have a clear understanding of how to leverage ChatGPT to build an advanced product search engine for your e-commerce business.

About OpenAI API
OpenAI, the leading AI research company behind ChatGPT, offers an API that gives developers access to state-of-the-art machine learning models and tools. By integrating the OpenAI API into your application, you can enhance it with cutting-edge language processing and other machine learning capabilities. However, integrating an AI API can be a daunting task for developers who are unfamiliar with the technology. Below, we will guide you through the process: the prerequisites, authentication, and API requests needed to get started with the OpenAI API.
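As a quick taste, here is a minimal sketch of the authentication step, assuming you have installed the openai Python package and exported your secret key in an OPENAI_API_KEY environment variable:

import os
import openai

# The openai package attaches this key to every API request
openai.api_key = os.environ["OPENAI_API_KEY"]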
Functions (of ChatGPT/OpenAI) we used
Now let's get down to business. To build our product recommendation tool we are going to use the following OpenAI endpoints:
- Text embeddings. Embeddings are a way to measure the relatedness of two pieces of text. An embedding is a vector (list) of floating-point numbers, and the distance between two vectors measures their relatedness: small distances suggest high relatedness, large distances suggest low relatedness (see the short sketch after this list).
- Chat completion. Calls the GPT-3.5 model, which can understand and generate natural language.
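To make "distance measures relatedness" concrete, here is a toy sketch of the relatedness measure we will use later. The vectors are made-up three-dimensional values; real OpenAI embeddings have on the order of a thousand dimensions:

from scipy import spatial

# Made-up toy "embeddings"; real ones come from the embeddings endpoint
headphones = [0.8, 0.1, 0.3]
earbuds = [0.7, 0.2, 0.4]
blender = [0.1, 0.9, 0.2]

def relatedness(a, b):
    # Cosine similarity = 1 - cosine distance, in the [-1, 1] range
    return 1 - spatial.distance.cosine(a, b)

print(relatedness(headphones, earbuds))  # ~0.98: closely related products
print(relatedness(headphones, blender))  # ~0.29: unrelated products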
Data preparation

As input for our product recommendation tool, we are going to use a light version of a product catalog, represented as a CSV with the following basic columns: product name, product description, and vendor name. Of course, in a real-world example this would be the item database of your e-commerce website, or the result of crawling many sources, loading the data into a database, and then preparing it in the format described.
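For illustration, here is what such a CSV could look like, using the column names the code below expects (the rows are the sample products from the results section at the end):

,p_name,p_description,store_name
0,Wireless Noise-Canceling Headphones,Over-ear headphones with noise-cancellation technology and 20-hour battery life,The Audio Store
1,Portable Bluetooth Speaker,A compact Bluetooth speaker with a powerful sound and long battery life,The Audio Gear
2,Wireless Bluetooth Earbuds,Comfortable and lightweight earbuds with excellent sound quality and touch controls,The Audio Gear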
The Algo (step by step)
The algorithm we've implemented is the following:
1. Prepare the data on startup:
- Parse our products CSV.
- Transform each product with the embeddings endpoint into a vector of numbers. As input for the embeddings endpoint, we build a single human-readable string from all the columns we have in the CSV, as GPT models prove to work better this way.
- Save the products with their embeddings into a new file or database for reusability. While in this example we only use ~30 products, in a real-life case calculating the embeddings for all your products can take hours and imply significant costs. For large datasets, we suggest using a vector database (see the sketch further below).
2. Search and suggest an answer to the user's question:
- Given a user question, generate an embedding vector for the query.
- Calculate the cosine similarity between the user question's vector and all of our existing vectors. Sort them by similarity and take the top results. As the scores fall in the [-1, 1] range, we could set a minimum threshold for relatedness (e.g. 0.8).
- Take the top vector(s) and pass their corresponding product text to the chat completion input message as context.
- Prepare the chat completion prompt so it gives us well-defined answers.
- Return GPT's answer.
Here are some code snippets from the Python tool we built to showcase this idea:
import ast  # for converting embeddings saved as strings back to arrays
import os  # for checking and creating the local cache directory
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff
The following function covers everything in point 1: parsing the initial products list and generating the embeddings. We use a pandas DataFrame to store the products and their vectors:
EMBEDDING_MODEL = 'text-embedding-ada-002'  # OpenAI embedding model (assumed; not named in the original snippets)
GPT_MODEL = 'gpt-3.5-turbo'  # chat completion model
EMBEDDINGS_FILE = 'processed/embeddings.csv'  # save for reusability
PRODUCTS_FILE = 'processed/products.csv'  # initial products list


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def embedding_with_backoff(**kwargs):
    """Call the OpenAI embeddings endpoint with exponential backoff to prevent rate limiting."""
    return openai.Embedding.create(**kwargs)
def get_or_create_embeddings_df() -> pd.DataFrame:
    """Given a CSV with products, either create a new CSV with embeddings or use an existing one."""
    if os.path.isfile(EMBEDDINGS_FILE):
        df = pd.read_csv(EMBEDDINGS_FILE, index_col=0)
        df['embeddings'] = df['embeddings'].apply(ast.literal_eval)
        return df
    df = pd.read_csv(PRODUCTS_FILE, index_col=0)
    # Build a single human-readable string per product for the embeddings endpoint
    df['formatted_text'] = df.apply(
        lambda r: f'Product name: {r["p_name"]}, Product Description: {r["p_description"]}, Store name: {r["store_name"]}.',
        axis=1)
    df['embeddings'] = df.formatted_text.apply(
        lambda s: embedding_with_backoff(input=s, model=EMBEDDING_MODEL)['data'][0]['embedding'])
    # Create a directory to store the csv files
    if not os.path.exists("processed"):
        os.mkdir("processed")
    df.to_csv(EMBEDDINGS_FILE, columns=['formatted_text', 'embeddings'])
    return df
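Using it is then a one-liner. The first run computes the embeddings and caches them in processed/embeddings.csv; every run after that loads them straight from disk:

df = get_or_create_embeddings_df()
print(df[['formatted_text']].head())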
The next function generates the embedding vector for the user's question, computes its relatedness to all existing product vectors, and returns the top N most similar products:
def strings_ranked_by_relatedness(
        query: str,
        df: pd.DataFrame,
        relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
        top_n: int = 3
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row['formatted_text'], relatedness_fn(query_embedding, row["embeddings"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
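For example, a quick sanity check (the exact scores will depend on the embeddings):

strings, relatednesses = strings_ranked_by_relatedness('headset', df, top_n=2)
for string, relatedness in zip(strings, relatednesses):
    print(f'{relatedness:.3f}  {string}')

Note that this function does a linear scan over the whole DataFrame, which is fine for ~30 products. For a large catalog, this is where the vector database mentioned earlier would come in. As a rough sketch of the idea, here is how the scan could be replaced with a FAISS index (FAISS is our illustration, not part of the original tool; inner product on L2-normalized vectors equals cosine similarity):

import faiss  # assumed extra dependency for this sketch
import numpy as np

# Index all product embeddings once
matrix = np.array(df['embeddings'].tolist(), dtype='float32')
faiss.normalize_L2(matrix)  # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(matrix.shape[1])
index.add(matrix)

# Embed the query and fetch the top 3 most related products
query_embedding = openai.Embedding.create(model=EMBEDDING_MODEL, input='headset')["data"][0]["embedding"]
query_vector = np.array([query_embedding], dtype='float32')
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 3)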
The next snippet prepares the input message for the chat completion endpoint. We need to instruct ChatGPT to only give recommendations if it found a match in our list of products, and to answer with 'I could not find an answer' otherwise. Note that we try to pack in as much context as we can, given the limit on the number of tokens: with the gpt-3.5-turbo model it is 4,096 tokens for the input and the response combined.
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
        query: str,
        df: pd.DataFrame,
        model: str,
        token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below contexts to answer the subsequent question. If the answer cannot be found in the contexts, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nContext:\n"""\n{string}\n"""'
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        else:
            message += next_article
    return message + question
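With our sample catalog, the assembled message for a user asking about headsets would look roughly like this (illustrative, trimmed to a single context):

Use the below contexts to answer the subsequent question. If the answer cannot be found in the contexts, write "I could not find an answer."

Context:
"""
Product name: Wireless Noise-Canceling Headphones, Product Description: Over-ear headphones with noise-cancellation technology and 20-hour battery life, Store name: The Audio Store.
"""

Question: Can you recommend me a headset?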
The final function makes the chat completion call to OpenAI, with the user's question and our matching products as context:
def ask(
        query: str,
        df: pd.DataFrame,
        model: str = GPT_MODEL,
        token_budget: int = 4096 - 500,  # gpt-3.5-turbo limit, leaving room for the answer
        print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system",
         "content": "You answer questions about my products. If you can please include product description and store."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0  # deterministic answers
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
And just like that, your product recommendation tool is ready for use.
The results
Here are some examples from running the tool:
Given the following products:
- Wireless Noise-Canceling Headphones | Over-ear headphones with noise-cancellation technology and 20-hour battery life | The Audio Store
- Portable Bluetooth Speaker | A compact Bluetooth speaker with a powerful sound and long battery life | The Audio Gear
- Wireless Bluetooth Earbuds | Comfortable and lightweight earbuds with excellent sound quality and touch controls | The Audio Gear
In [1]: ask('Can you recommend me a headset?', df=df)
Out[1]: 'Yes, I would recommend the Wireless Noise-Canceling Headphones from The Audio Store. They have noise-cancellation technology and a 20-hour battery life.'
In [2]: ask('Do you have a cordless sound player?', df=df)
Out[2]: 'Yes, we have a Portable Bluetooth Speaker at The Audio Gear store.'
In [3]: ask('Do you have any TVs?', df=df)
Out[3]: 'I could not find an answer.'
About the author
Milen is a Tech Lead at Looming Tech and the key engineer and software architect behind the Hintsa Performance wellbeing platform. He previously worked as a senior engineer for companies like the Financial Times and Expedia London.
Learn more about the great Hintsa app Milen's team built here: https://www.looming.tech/post/hintsa-transforming-the-wellbeing-app-space-with-looming-tech
Please get in touch with us if you need help integrating ChatGPT into your products, or if you want Milen's help building scalable software solutions.
This post is heavily influenced by this article, and some of the functions are copied from there. We recommend reading it if you found this post interesting.