Caching, Model Selection & Cost Strategies — How We Routinely Save Our Clients ~50% on OpenAI API💰


With the ever-increasing dependence on generative AI models, it’s easy to overlook just how much is being spent. Current projections show the generative AI market growing from $67.18 billion USD in 2024 to $967.65 billion USD 💹 by 2032, a CAGR of 39.6%. Although open-source contributions are at an all-time high, they have yet to reach a level that can compete with closed models as an industry standard.

OpenAI, a leader in this space, operates as a for-profit organisation and charges for each API call depending on the model used.

In this guide, I’ll walk you through the cost-saving strategies that work for us daily and can work for you too. Let’s dive in✨.

How OpenAI API Pricing Works

OpenAI charges based on the number of tokens processed (input and output combined). A "token" can be as short as a single character or as long as a word, depending on the language. We will talk more about this later. Here’s what impacts your total costs:

  1. Model Type: The latest, most advanced models tend to be more expensive to use than earlier versions.
  2. Token Usage: More tokens mean higher costs, so the complexity and length of the text you process directly affect pricing.
  3. API Calls: The number of calls you make adds up quickly; high-volume usage drives total costs higher.

Additional services like fine-tuning, embedding generation, or combining multiple features can lead to extra charges.

For more details on API pricing for each model, be sure to check out the OpenAI Pricing page. They also have a great cost estimator that can give you a clear idea of what to expect. If you’re looking for more options, there are other calculators out there as well. I came across this one and thought it was pretty helpful; give it a try!

Each model has a different price structure for input and output tokens.

Just to put things into perspective, let’s talk money and tokens.

To help visualise the token costs, consider that o1-preview costs around $15 💵 per million input tokens and $60 💵 per million output tokens. For models like Whisper (audio), you’ll pay $0.006 💵 per minute. Keeping these figures in mind is essential to managing your budget effectively.
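
To make that concrete, here’s a quick back-of-the-envelope estimate in Python using the o1-preview prices quoted above. Treat the price constants as illustrative placeholders and check the current pricing page before budgeting.

INPUT_PRICE_PER_M = 15.00   # USD per 1M input tokens (o1-preview, illustrative)
OUTPUT_PRICE_PER_M = 60.00  # USD per 1M output tokens (o1-preview, illustrative)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of a single request, ignoring cached-token discounts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a 2,000-token prompt that produces a 500-token answer
print(f"${estimate_cost(2_000, 500):.4f}")  # -> $0.0600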

What You Need to Know About Tokens

You can think of tokens as pieces of words, punctuation marks, or even entire words in some cases. A token is roughly 4 characters of English text. A short word like "hello" is a single token, but longer or more complex words can be split into multiple tokens. It all comes down to the type of tokenizer being used.

What is Tokenisation and Byte Pair Encoding?

When language models process text, they don’t really "read" it like we do. Instead, they break it down into tokens using processes like Byte Pair Encoding (BPE). BPE splits text into smaller units, making it easier for the model to recognise patterns, like common word parts (e.g., "ing" in "encoding"). Each token corresponds to a numerical representation of a piece of the text.

How tiktoken Makes Token Counting Easy

tiktoken, a BPE tokenizer developed by OpenAI, is used to work with GPT models.


OpenAI uses a different encoding for different models. The encoding determines how text is split into tokens, and therefore how many tokens you are billed for. To see it for yourself, try out the interactive tokenizer tool on their webpage.
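
Here’s a minimal sketch of counting tokens locally with tiktoken. It assumes a recent version of the library that knows about the GPT-4o family; older versions can fall back to loading the o200k_base encoding directly.

import tiktoken

# Look up the encoding used by a specific model; fall back to o200k_base
# (the encoding used by the GPT-4o family) on older tiktoken versions.
try:
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")

text = "Tokenisation decides how much every request costs."
tokens = enc.encode(text)

print(tokens)       # the numerical token IDs the model actually sees
print(len(tokens))  # the count you are billed for on the input side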

Step-by-Step Savings: How We Cut API Costs by 50% with OpenAI Features

We’ve discovered strategies that consistently save our clients around 50%, and the best part is, you can implement them too. From Prompt Caching to Predicted Outputs, these tactics are simple but powerful. Here’s how you can start saving right away.

What is Prompt Caching, and How Can It Save You?

Repetitive API calls are a pain, and you know how quickly costs can pile up when your prompts share a lot of boilerplate. That’s where Prompt Caching steps in: it reduces costs by ~50% for long prompts and reduces latency by up to 80%.

So, how does it work? When you send a request, OpenAI checks whether the beginning of your prompt (the “prefix”) matches a previously sent prompt. If it finds a match, instead of reprocessing the entire prefix, OpenAI reuses the cached version. The result? Faster response times and drastically reduced API costs. Caching is triggered automatically once the prompt token count reaches 1024 or more.

But Here’s the Catch

Just because Prompt Caching kicks in automatically doesn’t mean you’re guaranteed to save a bundle💰. It's about ensuring your prompts are optimised for caching. Here are the principles you should keep in mind:

  • Structure your Prompts: Keep the static content at the beginning of your prompt and place the dynamic content at the end. This helps OpenAI identify matches and cache them more effectively. A “match” here is called a Cache Hit.

Prompt Caching by OpenAI

OpenAI explains this concept well with the diagram above, showing how caching works more effectively when the structure is optimised. These Cache Hits lead to a reduction in costs and latency.

  • Maximise Token Count Where Possible: Caching is triggered for prompts of 1024 tokens or more, so if you’re close to that threshold, adding a little more content can push you over the line. Just make sure you're not padding your prompts unnecessarily; only add meaningful tokens that contribute to the context or task.

Here's why this matters: cached input tokens come with a 50% discount from OpenAI. If you can structure prompts to trigger caching, you’ll pay half the cost for the cached tokens, which can significantly reduce your overall API costs in the long run.

  • Keep Track of the Cached Tokens: OpenAI includes a cached_tokens field in the usage.prompt_tokens_details section of the API response. This field indicates how many prompt tokens were retrieved from the cache. If the request is under 1024 tokens, cached_tokens will be zero.

Here’s an example of a request where the prompt token count exceeds 1024 and the static prefix has been sent before, triggering a cache hit:


from openai import OpenAI

client = OpenAI()

# Dynamic content interpolated at the end of the long static prompt below
user_input = "us elections"

completion = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "system", "content": f"""
    This detailed exploration will cover a broad spectrum of perspectives on a central theme, including its historical, cultural, psychological, societal, political, economic, ethical, global, technological, and future implications. The goal is to understand how this theme not only influences but is also shaped by different aspects of life, resulting in an in-depth analysis that uncovers its many layers. Each section of this examination is designed to provide a comprehensive understanding, laying the groundwork for insights into the far-reaching effects of this theme. This approach will reflect its various impacts on humanity, culture, policy, and innovation. By analyzing the implications across multiple domains, this exploration will help in identifying the core dynamics that drive the evolution and interpretation of the theme under review.

Historical Context: Examine the origins of this theme, tracking its emergence and the factors that catalyzed its development. Identify key figures, movements, and transformative events that shaped its history. Reflect on how the course of history molded its significance and evolution, from its initial inception to its current form. Evaluate the historical events that created pivotal turning points, noting how this theme influenced or was influenced by the political, social, and economic landscapes of the time. Discuss the broader historical trends that have impacted its development and what lessons can be drawn from its historical trajectory.

Cultural Implications: Explore the intersection of this theme with various cultures, norms, and values across the world. Investigate the diverse ways in which different cultures interpret, celebrate, or challenge this theme. Consider cultural symbols, rituals, and artifacts that have been born from or shaped by this theme. What role does this theme play in the arts, in literature, in music, and in other forms of cultural expression? Delve into how this theme interacts with identity formation, both personal and collective, and the ongoing cultural narratives it sustains or disrupts. Examine how cultural production, consumption, and distribution are influenced by this theme and whether it represents a catalyst for cultural change.

Psychological and Emotional Dimensions: Analyze how this theme resonates on an emotional and psychological level with individuals and groups. What specific emotional responses does it evoke, and how do those feelings manifest in different contexts? Reflect on the psychological effects of engaging with this theme, from personal transformation to collective trauma or joy. How does it impact decision-making, thought patterns, and social behaviors? Investigate the range of emotional responses—positive and negative—that arise from experiencing or contemplating this theme. Consider how emotions such as fear, joy, anger, or nostalgia play a role in shaping how people connect with this theme.

Societal Impact: Investigate how this theme affects society at large. How does it influence social structures, relationships, and the broader community dynamics? Consider its impact on social identity, and how it reshapes the way groups interact with each other. Explore whether this theme fosters unity or division, and how it affects group solidarity, collective actions, and movements. Assess the implications for social justice, equity, and community well-being as shaped by this theme. Reflect on the ways this theme influences the creation of societal norms, and whether it poses challenges to or reinforces traditional social hierarchies and power structures.

Political Implications: Reflect on how this theme has influenced or been influenced by politics, governance, and policy. What political movements, ideologies, or parties have been shaped by this theme, and how has it driven legislative or global changes? Investigate whether it has had any role in sparking political discourse or reshaping international relations. Look into the ways in which world leaders, political systems, and nations react to the influence of this theme. What new alliances, tensions, or political divides have emerged as a result? Consider the policy implications at local, national, and international levels, and the strategies that have been employed to address or incorporate this theme in political decision-making.

Economic Consequences: Consider the economic implications of this theme across industries, markets, and financial systems. How has it reshaped economic structures or created new sectors? Examine how businesses and workers have adapted to the challenges or opportunities posed by this theme. Investigate the long-term financial consequences for both global markets and local economies. What investments, innovations, or economic shifts have been driven by this theme? Evaluate its impact on economic development, inequality, and sustainability. Consider its role in global trade, resource distribution, and the rise of new economic paradigms.

Ethical Considerations: Engage with the ethical questions that arise when considering this theme. What moral dilemmas or questions are provoked by its existence or evolution? How do different ethical frameworks, such as utilitarianism, deontology, or virtue ethics, approach these dilemmas? Reflect on whether the pursuit of this theme conflicts with any ethical principles or requires trade-offs that challenge moral values. Investigate the global ethical implications, considering the potential benefits and harms to various groups or environments. Consider whether the theme has sparked ethical debates and how such discussions have influenced policies or public opinion.

Global Perspective: Examine the global significance of this theme and its impact across international borders. How does this theme relate to global challenges like sustainability, human rights, and climate change? Investigate whether it fosters international cooperation or sparks global conflict. Explore the actions taken by international organizations, such as the United Nations, or non-governmental organizations to address the implications of this theme. How does this theme shape policies, agreements, and international cooperation on a global scale? Consider the diplomatic challenges it presents and how different countries approach the issues it raises.

Technological Impact: Investigate how this theme intersects with emerging technologies. Does it drive technological innovation, or does it face challenges from advancements such as artificial intelligence, automation, biotechnology, and others? How might new technologies enhance or impede the development of this theme? What role do research institutions, tech companies, and governments play in advancing or regulating this theme in light of technological changes? Explore how this theme interacts with technological risks, opportunities, and disruptions, and what future technological innovations may arise in response to it.

Future Trajectory: Speculate on the future evolution of this theme. Given current trends, what scenarios might unfold in the next 10, 20, or 50 years? Will this theme lead to radical societal changes, or will it evolve gradually? Reflect on how future generations might engage with this theme, and whether it will become a defining moment in history or fade into obscurity. Consider potential technological, political, or cultural changes that could significantly alter its trajectory and the way humanity addresses its future implications.

Dynamic Content (User Input Focused):
Now, we will explore how {user_input} specifically manifests in the above aspects. What unique characteristics, impacts, and challenges does {user_input} present within each of these categories? Let’s look deeper into its relevance, implications, and the ways it interacts with these different dimensions.

        """},
    {"role": "user", "content": "us elections"}
  ]
)
print(completion.usage)

The output will show how many prompt tokens came from the cache:


CompletionUsage(completion_tokens=1487, prompt_tokens=1386, total_tokens=2873, 
completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, 
audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), 
prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=1152))

In this case, the cached_tokens value shows a cache hit of 1152 tokens, meaning that those 1152 tokens were pulled from cache, speeding up the process and saving costs.
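
If you want to put a number on the savings, here is a small sketch that estimates the effective input cost from the usage object above, assuming cached input tokens are billed at the 50% discount. The price constant is a placeholder; substitute your model's current input price.

INPUT_PRICE_PER_M = 0.15  # placeholder: your model's input price in USD per 1M tokens

usage = completion.usage
cached = usage.prompt_tokens_details.cached_tokens  # 1152 in the example above
uncached = usage.prompt_tokens - cached             # 1386 - 1152 = 234

# Cached tokens are assumed to cost half the normal input rate
effective_cost = (uncached + 0.5 * cached) * INPUT_PRICE_PER_M / 1_000_000

print(f"Cache hit ratio: {cached / usage.prompt_tokens:.0%}")  # ~83%
print(f"Effective input cost: ${effective_cost:.6f}")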

Models Supporting Prompt Caching

Prompt Caching is enabled for the following models:

  • GPT-4o (excluding GPT-4o-2024-05-13 and ChatGPT-4o-latest)
  • GPT-4o-mini
  • o1-preview
  • o1-mini

These models benefit from the ability to reuse prompt tokens, saving both time and costs.

Pricing Table for Different Models

Cached input tokens are priced at a 50% discount compared to non-cached input tokens. For a model-by-model comparison, see the pricing table on the OpenAI Pricing page.

Cache Accessibility and Expiry

It's important to note that Prompt Caches are not shared between organisations. Only users within the same organisation can access the cache for identical prompts, which keeps cache data private and secure.

Additionally, manual cache clearing is not available. OpenAI automatically clears caches for prompts that have not been recently accessed. Typically, caches will expire after 5-10 minutes of inactivity, but during off-peak periods, this can sometimes extend up to one hour.

How You Could Reduce Response Time by Using Predicted Outputs

OpenAI’s recent release, Predicted Outputs, is designed to reduce model latency and costs in certain cases. This feature is based on the Speculative Decoding principle.

Large autoregressive models like GPT are slow because they generate tokens serially: to generate K tokens, you need K forward passes of the model. These models use sequential decoding.

Generating just one sentence might take multiple passes through the model, each time doing a step-by-step computation. It’s a bottleneck.

Speculative Decoding is a clever way to speed things up without altering the model or sacrificing output quality. It uses a draft model, a smaller model with faster inference, to guess the next few tokens. These guesses are then verified by the larger, slower model. Instead of checking each token sequentially, several candidate tokens are evaluated in parallel.

If the guesses from the smaller model are good enough, they’re accepted. If not, the larger model can quickly correct them in the next pass. This allows multiple tokens to be generated at once, significantly reducing the number of model runs. The end result? The same output in much less time.

There are some drawbacks, though. The quality of the draft (predicted) tokens affects the response time, and if predicted tokens are incorrect, OpenAI charges for any tokens not included in the final completion at the completion token rate.💰💲 Plus, the feature only supports the GPT-4o and GPT-4o-mini series of models.

The primary purpose of this feature is to speed up generation by minimising model runs and, in some cases, unnecessary token usage and costs.
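
To make the draft-and-verify idea concrete, here is a toy sketch of a single speculative decoding step. The draft_next_tokens and target_next_token callables are hypothetical stand-ins for a small fast model and the large model; this is only an illustration of the principle, not OpenAI's implementation.

def speculative_step(tokens, draft_next_tokens, target_next_token, k=4):
    # 1. The cheap draft model guesses the next k tokens in one shot.
    guesses = draft_next_tokens(tokens, k)

    # 2. The large model verifies the guesses (in practice, in parallel) and
    #    keeps the longest prefix it agrees with.
    accepted = []
    for guess in guesses:
        verified = target_next_token(tokens + accepted)
        if verified == guess:
            accepted.append(guess)
        else:
            # 3. On the first mismatch, take the large model's own token instead.
            accepted.append(verified)
            break

    # When the guesses are good, several tokens are produced per large-model step.
    return tokens + accepted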

To show you how predicted outputs can really cut down response time, I decided to test it using one of my own blog posts. Here’s how I did it.

Step 1: Scraping the Blog

First things first, I needed to grab the content. Using BeautifulSoup, I scraped one of my blog posts, Building an Agentic Framework with O1 and GPT-4O, to use as the test data.


import requests
from bs4 import BeautifulSoup

def scrape_blog(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
        return '\n'.join(paragraphs)
    except requests.exceptions.RequestException:
        return None

url = 'https://www.ionio.ai/blog/building-an-agentic-framework-with-o1-and-gpt-4o'
content = scrape_blog(url)

if content:
    print(content)

Once I had the content, I passed it into GPT-4o mini with a prompt asking it to revise the first two sub-headings of the blog. Here's the prompt I used:


prompt = f"""As an expert writer and editor, for the given blog content: {content}, make revisions to the
sub-headings and make them more interesting by including as many real-life examples as you can. Only make changes to the first two subheadings; leave the rest as it is. Properly format and return the output."""

Step 2: Running the Model Without Predicted Outputs

I started by running the model the traditional way, without any predictions, just to establish a baseline for how long the process would take. The code for running the model was simple:


from openai import OpenAI
import time
client = OpenAI()

start_time = time.time()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
total_time = time.time() - start_time

Once it finished, I checked the results and noted a few key metrics:


# print(completion.usage)  # full usage object, if you want all the execution attributes
print(completion.usage.completion_tokens)  # completion token count
print(total_time)  # time it took to complete
tps = completion.usage.completion_tokens / total_time
print(tps)  # tokens per second

Here’s what I got:

  • Completion Tokens: 602
  • Total Time: 29.76 seconds
  • Tokens Per Second: 20.22

Step 3: Introducing Predicted Outputs

Now, for the exciting part! I re-ran the test, this time using the Predicted Outputs feature. By passing the existing blog content as the prediction, the model can verify large chunks of the expected output instead of generating them from scratch, effectively reducing response time.

Here’s how I added the predicted output feature:


start_time = time.time()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    prediction={"type": "content", "content": content},  # the scraped blog acts as the prediction
)
total_time = time.time() - start_time

Step 4: Checking the Results with Predicted Outputs

After the model ran with predicted outputs, I checked the metrics again to see the difference. And here’s what I found:


print(completion.usage)  # Completion usage
print(completion.usage.completion_tokens)  # Token usage
print(total_time)  # Time it took to complete

The results were impressive:

  • Completion Tokens: 751
  • Total Time: 10.27 seconds
  • Tokens Per Second: 73.12

The Outcome

Predicted Outputs cut the response time from 29.76 seconds to just 10.27 seconds, roughly a two-thirds reduction. Impressive, right?

However, there was an increase in completion tokens, from 602 to 751, which is expected.


CompletionTokensDetails(accepted_prediction_tokens=16, rejected_prediction_tokens=116)

This increase reflects the model generating additional predictions to accelerate the process. Specifically, 16 predicted tokens were accepted and included in the final output, while 116 were rejected. The rejected tokens represent predictions that were close but didn’t quite match the final output; that's part of the trade-off.

Extra Cost-Cutting Strategies That Worked for Us and Will Work for You Too

Let’s talk real savings. I’m sure you’re tired of seeing those AI costs rack up with every request. Here's what actually worked for me, and will work for you too.

  1. Set Usage Limits: I learned this the hard way when I ran the o1-mini model excessively without realising the cost; I ended up with a $325 💸 charge and a negative account balance. Ouch. Lesson learned. To avoid surprises, set limits based on your actual usage and keep an eye on them so you know when you’re getting close. Simple, but effective.
  2. Optimise those Prompts: Prompts are input tokens you get charged for. Keep them precise yet effective and avoid redundancy. Effective prompt-writing strategies will save you money in the long run.
  3. Limit the Max Output Tokens: Setting a max_tokens limit caps the response length, preventing unnecessary token usage and helping you manage costs (see the sketch after this list).
  4. Choose your Model wisely: Not all tasks require advanced capabilities, and using simpler models in such situations can be a game changer. For simple text-related tasks, consider GPT-3.5 instead of the GPT-4 model series.
  5. Adjust the Temperature Setting accordingly: The temperature determines the randomness of responses. Setting it to 0 gives more predictable, structured answers that tend to use fewer tokens, which can help you save on API costs.
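
Here’s a minimal sketch that combines points 3-5 above in a single call: a cheaper model, a capped output length, and temperature 0. The model name and the 150-token cap are illustrative choices, not recommendations for every workload.

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # pick the simplest model that handles the task
    messages=[{"role": "user", "content": "Summarise this support ticket in two sentences: ..."}],
    max_tokens=150,        # hard cap on billable completion tokens
    temperature=0,         # predictable, structured answers
)

print(completion.choices[0].message.content)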

Conclusion and Final Thoughts

Optimising your OpenAI API usage doesn’t have to be complicated. These strategies cut down our expenses by around 50% across various client projects. We’ve shared everything we’ve learned, from setting usage limits to prompt caching and using Predicted Outputs.

By applying these methods, you can save on costs too, no gimmicks, just practical, straightforward techniques that work. The goal is simple - get the most out of the OpenAI API without overspending.

If you’re looking to build AI solutions or need expert consultation, feel free to reach out; we’d love to help. Happy reading!

Book an AI consultation

Looking to build AI solutions? Let's chat.

Schedule your consultation today. This is not a sales call, so feel free to come prepared with your technical queries.

You'll be meeting Rohan Sawant, the Founder.
Behind the Blog 👀
Shivam Mitter
Writer

The guy on coffee who can do AI/ML.

open source for the win!
Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.