Every Token Counts: The Art of (Dynamic) OpenAI API Cost Optimization
Have you started developing with OpenAI and found yourself wondering about the costs? If so, you’re in good company. In this guide, we’ll explore:
- Estimating Token Usage: How to determine token usage before making an API call.
- Predicting Costs: How to forecast the costs based on token count.
- Dynamically Selecting Models: Choosing the most cost-effective model without compromising performance.
Understanding token usage and its costs is essential, especially for frequent or large-scale API users. It helps you extract the maximum value from the OpenAI API.
Token Estimation with tiktoken
Tokens are at the heart of cost management when working with OpenAI. But how do we count them accurately? That’s where `tiktoken` comes in — a Python library from OpenAI.
What is `tiktoken`?
`tiktoken` lets you determine the number of tokens in a text string without an API call. Think of it as a token counter in your toolkit, helping you gauge and predict costs more effectively.
Setting Up `tiktoken`
Getting started is simple:
pip install tiktoken
How Does It Work?
Unlike basic word counters, `tiktoken` breaks text into tokens, which can range from a single character to an entire word. For instance, “ChatGPT is great!” encodes into six tokens: [“Chat”, “G”, “PT”, “ is”, “ great”, “!”].
Here’s a basic usage example:
import tiktoken

def count_tokens_with_tiktoken(text, model_name="gpt-3.5-turbo"):
    """Return the number of tokens in a text string using tiktoken."""
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(text))

text_sample = "ChatGPT is great!"
tokens = count_tokens_with_tiktoken(text_sample)
print(f"'{text_sample}' consists of {tokens} tokens.")
Note: Always use the encoding specific to your target model. Different models might tokenize text differently!
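As a quick, purely illustrative check, you can ask `tiktoken` which encoding it maps a model to and compare counts for the same text (the model names here are just examples):

import tiktoken

sample = "ChatGPT is great!"
for model in ["gpt-3.5-turbo", "text-davinci-003"]:
    # encoding_for_model resolves each model name to its underlying encoding
    encoding = tiktoken.encoding_for_model(model)
    print(f"{model} -> {encoding.name}: {len(encoding.encode(sample))} tokens")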
Estimating Chat Completion Tokens
To choose the most cost-effective model, you must accurately estimate token usage, since it directly drives the overall cost of your API calls.
Methodology:
1. Break Down the Messages: Messages consist of content and potentially a role and name. Each element has associated tokens.
2. Factor in Response Tokens: Account for the model’s response tokens.
3. Calculate Total Token Count: Add the input message tokens and expected response tokens.
By following this approach, you can accurately predict costs before initiating an API call.
def num_tokens_from_messages(messages, model="gpt-3.5-turbo"):
    """Return the number of tokens used by a list of messages."""
    # Get the encoding for the specified model
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print(f"Warning: model ({model}) not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")

    # Define token overhead for messages and names
    tokens_per_message = 3
    tokens_per_name = 1

    # Calculate total tokens
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name

    # Add tokens for the assistant's reply primer
    num_tokens += 3
    return num_tokens
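A quick sanity check with a small, made-up conversation:

example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
]
# Prints the estimated prompt token count for this message list
print(num_tokens_from_messages(example_messages))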
Demystifying OpenAI Chat Completion Pricing
Grasping OpenAI’s chat completion pricing might feel like understanding a complex board game. However, with some insight, it’s pretty straightforward.
The fundamental concept is that OpenAI bills for both the input (your messages) and the output (the model’s response), based on their respective token counts and per-token rates. Your input tokens are charged once at the input rate; they are not billed again as part of the response.
To clarify, let’s work through an example:
Imagine sending a 200-token message to the gpt-3.5-turbo model and expecting an 850-token response. With rates quoted per 1,000 tokens (at the time of writing, $0.0015 per 1K input tokens and $0.002 per 1K output tokens for gpt-3.5-turbo), the cost breakdown would look something like this:
input_cost = (200 / 1000) * 0.0015     # $0.0003 for the prompt
output_cost = (850 / 1000) * 0.002     # $0.0017 for the response
total_cost = input_cost + output_cost  # $0.0020 in total
Here’s a helper function that implements this pricing math (we’ll reuse it below when combining it with token estimation):
def estimate_cost(input_tokens, output_tokens, input_cost_per_k=0.0015, output_cost_per_k=0.002):
    """
    Estimate the cost of an OpenAI API call from token counts.

    Parameters:
    - input_tokens (int): Number of tokens in the input messages.
    - output_tokens (int): Expected number of tokens in the model's response.
    - input_cost_per_k (float): Cost per 1,000 input tokens for the specific model.
    - output_cost_per_k (float): Cost per 1,000 output tokens for the specific model.

    Returns:
    - float: Estimated cost in dollars.
    """
    input_cost = (input_tokens / 1000) * input_cost_per_k
    output_cost = (output_tokens / 1000) * output_cost_per_k
    return input_cost + output_cost
Dynamic Cost Estimation
By combining token estimation with pricing, you can gauge your costs:
def estimate_chat_cost(messages, response_target_tokens, input_cost_per_k=0.0015, output_cost_per_k=0.002):
    """
    Estimate the cost of an OpenAI API chat completion.

    Parameters:
    - messages (list): List of message dictionaries as used in OpenAI chat completion.
    - response_target_tokens (int): Expected number of tokens in the model's response.
    - input_cost_per_k (float): Cost per 1,000 input tokens for the specific model.
    - output_cost_per_k (float): Cost per 1,000 output tokens for the specific model.

    Returns:
    - float: Estimated cost in dollars.
    """
    input_tokens = num_tokens_from_messages(messages)

    # Combine input tokens and target response tokens using the pricing helper
    cost = estimate_cost(input_tokens, response_target_tokens, input_cost_per_k, output_cost_per_k)
    return cost
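For instance, using the default (illustrative) gpt-3.5-turbo rates:

sample_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about autumn."},
]
estimated = estimate_chat_cost(sample_messages, response_target_tokens=100)
print(f"Estimated cost: ${estimated:.6f}")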
Cost-Efficient Model Selection
Always opting for the model with the largest context window may seem enticing, but it can be costly. For instance, the 16K-context `gpt-3.5-turbo` variant costs roughly twice as much per token as the 4K-context one!
Thus, a strategic approach to model selection is necessary. This section breaks down the strategy and offers a code implementation.
Strategy:
1. Estimate: Begin by predicting the token usage of your chat messages.
2. Consider the Response: Decide the response’s length and account for those tokens.
3. Make an Informed Choice: Assess the total token count. If it fits within the smaller model’s limit (allowing for a threshold buffer), choose that model. Otherwise, opt for the larger one.
The code provided here helps to automate this decision-making process:
def select_model(small_model, large_model, messages, response_size, buffer_percent=10, threshold=0.9):
    """
    Determine the most appropriate model based on the predicted token count.

    Args:
    - small_model (tuple): Information about the smaller model in the format (model_name, max_tokens).
    - large_model (tuple): Information about the larger model in the format (model_name, max_tokens).
    - messages (list): List of message dictionaries with 'role' and 'content'.
    - response_size (int): Expected token count for the response.
    - buffer_percent (int, optional): Buffer percentage for the response size. Defaults to 10.
    - threshold (float, optional): Threshold as a fraction to trigger switching to the larger model. Defaults to 0.9.

    Returns:
    - tuple: The selected model name and the maximum token count for the conversation.

    Raises:
    - ValueError: If the combined tokens exceed the token limit of both models.
    """
    # Calculate the response size with buffer
    response_tokens_with_buffer = int(response_size * (1 + buffer_percent / 100))

    # Calculate total tokens for the small model
    total_tokens_small = num_tokens_from_messages(messages, small_model[0]) + response_tokens_with_buffer
    total_tokens_small_threshold = int(small_model[1] * threshold)

    # If the total tokens for the small model are within the threshold, return the small model details
    if total_tokens_small <= total_tokens_small_threshold:
        return small_model[0], total_tokens_small

    # If not, calculate total tokens for the large model
    total_tokens_large = num_tokens_from_messages(messages, large_model[0]) + response_tokens_with_buffer

    # If the total tokens for the large model are within its limit, return the large model details
    if total_tokens_large <= large_model[1]:
        return large_model[0], total_tokens_large

    # If neither model can accommodate the total tokens, raise an error
    raise ValueError("The combined tokens exceed the token limit of both models.")
Example Usage:
import openai

small = ('gpt-3.5-turbo', 4000)
large = ('gpt-3.5-turbo-16k', 16000)
messages = [{"role": "system", "content": "You are a helpful assistant."}]
response_target = 100

model, max_tokens = select_model(small, large, messages, response_target)
print(f"Chosen Model: {model} using {max_tokens} tokens.")

response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=1,
    max_tokens=max_tokens,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
Closing Thoughts
Cost management in AI is an art and science. It involves judiciously using resources, understanding available tools, and making data-driven decisions.
Beyond picking the right model, the right tools let you take a hands-on approach to optimizing your OpenAI interactions:
- Tune your response size: Setting an appropriate response length (e.g., by dynamically calculating max_tokens, as sketched after this list) helps you manage your costs and improves the consistency of your responses.
- Trim Your Prompts: Assigning token costs to parts of your prompts can help you focus on where to optimize. Concise prompts save tokens and expenses.
- Monitor and Alert: Use OpenAI’s Dashboard to monitor your API usage and set alerts for nearing limits.
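As a minimal sketch of that first point (assuming the num_tokens_from_messages helper from earlier; the 4,096-token context window below is just an illustrative figure for gpt-3.5-turbo), you could cap max_tokens at whatever room remains after the prompt:

def dynamic_max_tokens(messages, model="gpt-3.5-turbo", context_window=4096, desired_response=500):
    """Cap the response size at the space remaining after the prompt tokens (illustrative helper)."""
    prompt_tokens = num_tokens_from_messages(messages, model)
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("The prompt alone already fills the model's context window.")
    return min(desired_response, remaining)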
Dive in, experiment, and share your insights!