With the ever-increasing dependence on generative AI models, it’s easy to overlook just how much is being spent. Current projections show the generative AI market growing from $67.18 billion USD in 2024 to $967.65 billion USD 💹 by 2032, at a CAGR of 39.6%. Although open-source contributions are at an all-time high, they have yet to reach a level that can compete with closed models as an industry standard.
OpenAI, as a leader in this space, operates as a for-profit organisation and charges for every API call, with the price depending on the model used.
In this guide, I’ll walk you through the cost-saving strategies that work for us daily and can work for you too. Let’s dive in ✨.
How OpenAI API Pricing Works
OpenAI charges based on the number of tokens processed (input and output combined). A "token" can be as short as a single character or as long as a word, depending on the language. We will talk more about this later. Here’s what impacts your total costs:
- Model Type: Latest and advanced models tend to be more expensive to use than earlier versions.
- Token Usage: More tokens mean higher costs, so the complexity and length of the text you process directly affect pricing.
- API calls: The number of API calls you make directly affects the total cost. High-volume usage increases costs.
Additional services like fine-tuning, embedding generation, or combining multiple features can lead to extra charges.
For more details on API pricing for each model, be sure to check out the OpenAI Pricing page. They also have a great cost estimator that can give you a clear idea of what to expect. If you’re looking for more options, there are other calculators out there as well. I came across this one and thought it was pretty helpful, so give it a try!
Each model has a different price structure for input and output tokens.
Just to put things into perspective, let’s talk money and tokens.
To help visualise the token costs, consider that o1-preview costs around $15 💵 per million input tokens and $60 💵 for output tokens. For models like Whisper (audio), you’ll pay $0.006 💵 per minute. Keeping these in mind is essential to managing your budget effectively.
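To see how that translates into an actual bill, here’s a quick back-of-the-envelope calculation using the o1-preview rates above. The token volumes are purely illustrative:

```python
# Rough cost estimate at the o1-preview rates quoted above (USD per 1M tokens).
INPUT_RATE = 15.00
OUTPUT_RATE = 60.00

input_tokens = 100_000   # illustrative volume
output_tokens = 20_000

cost = (input_tokens / 1_000_000) * INPUT_RATE + (output_tokens / 1_000_000) * OUTPUT_RATE
print(f"Estimated cost: ${cost:.2f}")  # $1.50 + $1.20 = $2.70
```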
What You Need to Know About Tokens
You can think of tokens as pieces of words, punctuation marks, or even entire words in some cases. A token is roughly 4 characters of English text. A short word like "hello" is a single token, but longer or more complex words can be split into multiple tokens. It all comes down to the type of tokenizer being used.
What is Tokenisation and Byte Pair Encoding?
When language models process text, they don’t really "read" it like we do. Instead, they break it down into tokens using processes like Byte Pair Encoding (BPE). BPE splits text into smaller units, making it easier for the model to understand patterns, like common word parts (e.g., "ing" in "encoding"). Each token corresponds to a numerical ID, which is the representation the model actually works with.
How tiktoken Makes Token Counting Easy
tiktoken is a fast BPE tokenizer developed by OpenAI for use with its GPT models.
OpenAI uses a different encoding for different models. The encoding determines how your text is split into tokens, and therefore how many tokens you get billed for. To see it for yourself, try out the interactive tokenizer tool on their webpage.
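You can also count tokens locally before sending a request. Here’s a minimal sketch using tiktoken (assuming a recent release installed with `pip install tiktoken` that knows about the gpt-4o family):

```python
import tiktoken

# encoding_for_model picks the correct encoding for a given model family.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

text = "Tokens are what you actually pay for."
token_ids = encoding.encode(text)

print(token_ids)                  # the numerical IDs the model sees
print(len(token_ids), "tokens")   # what you get billed for
```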
Step-by-Step Savings: How We Cut API Costs by 50% with OpenAI Features
We’ve discovered strategies that consistently save our clients around 50%, and the best part is, you can implement them too. From Prompt Caching to Predicted Outputs, these tactics are simple but powerful. Here’s how you can start saving right away.
What is Prompt Caching, and How Can It Save You?
Repetitive API calls are a pain. You know how quickly costs can pile up, especially when your prompts start getting repetitive. That’s where Prompt Caching steps in. This clever feature reduces costs by ~50% for long prompts and reduces latency by up to 80%.
So, how does it work? When you send a request, OpenAI checks whether the beginning of your prompt (the “prefix”) matches a prompt it has processed recently. If it finds a match, instead of reprocessing that portion of the request, OpenAI reuses the cached version. The result? Faster response times and drastically reduced API costs. Caching is triggered automatically once the prompt token count reaches 1024 or more.
But Here’s the Catch
Just because Prompt Caching kicks in automatically doesn’t mean you’re guaranteed to save a bundle💰. It's about ensuring your prompts are optimised for caching. Here are the principles to keep in mind:
- Structuring your Prompts : Keep the static content at the beginning of your prompt and the dynamic content at the end. This helps OpenAI identify matches and cache them more effectively. A “match” here is called a Cache Hit.
OpenAI illustrates this concept well with a diagram in its documentation, showing how caching works more effectively when the prompt structure is optimised. These cache hits lead to lower costs and latency.
- Maximise Token Count Where Possible : Caching is triggered for prompts of 1024 tokens or more, and if you’re close to that threshold, adding a little more content can push you over the line. Just make sure you're not padding your prompts unnecessarily; only add meaningful tokens that contribute to the context or task.
Here's why this matters: cached input tokens come with a 50% discount from OpenAI. If you can push the token count over the threshold and trigger caching, you pay half the price for the cached tokens, which can significantly reduce your overall API costs in the long run.
- Keep track of the Cached Tokens : OpenAI includes a cached_tokens field in the usage.prompt_tokens_details section of the API response. This field indicates how many tokens from the prompt were retrieved from the cache. If the request is under 1024 tokens, cached_tokens will be zero.
Here’s an example of a request where the prompt token count exceeds 1024, triggering a cache hit:
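Below is a minimal sketch of what such a call might look like with the openai Python SDK; the long static system prompt, the user question, and the model choice are placeholders for your own content, and OPENAI_API_KEY is assumed to be set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Static instructions go first so the cached prefix matches across calls.
static_instructions = "You are a support assistant for Acme Corp. ..."  # imagine ~1,000+ tokens of policies and examples here
user_question = "How do I reset my password?"  # dynamic content goes last

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": static_instructions},
        {"role": "user", "content": user_question},
    ],
)

print(response.usage)
```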
The output will show how many prompt tokens came from the cache:
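The exact numbers will vary with your prompt, but the relevant slice of the response looks roughly like this (illustrative shape, with the cache hit described next):

```
"usage": {
  "prompt_tokens": ...,
  "completion_tokens": ...,
  "total_tokens": ...,
  "prompt_tokens_details": {
    "cached_tokens": 1152
  }
}
```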
In this case, the cached_tokens value shows a cache hit of 1152 tokens, meaning those 1152 tokens were pulled from the cache, speeding up the process and saving costs.
Models Supporting Prompt Caching
Prompt Caching is enabled for the following models:
- GPT-4o (excluding GPT-4o-2024-05-13 and ChatGPT-4o-latest)
- GPT-4o-mini
- o1-preview
- o1-mini
These models benefit from the ability to reuse prompt tokens, saving both time and costs.
Pricing Table for Different Models
Cached input tokens are billed at a 50% discount compared to non-cached input tokens. For example, o1-preview’s $15 per million input tokens drops to $7.50 per million for cached tokens. For the full per-model comparison, see the OpenAI Pricing page mentioned earlier.
Cache Accessibility and Expiry
It's important to note that Prompt Caches are not shared between organisations. Only users within the same organisation can access the cache for identical prompts. This ensures that cache data is kept private and secure.
Additionally, manual cache clearing is not available. OpenAI automatically clears caches for prompts that have not been recently accessed. Typically, caches will expire after 5-10 minutes of inactivity, but during off-peak periods, this can sometimes extend up to one hour.
How You Could Reduce Response Time by Using Predicted Outputs
OpenAI’s recent release, Predicted Outputs, is designed to reduce model latency and costs in certain cases. This feature is based on the Speculative Decoding principle.
Large autoregressive models like GPT are slow because they generate tokens serially: to generate K tokens, you need K runs of the model. These models use sequential decoding.
Generating just one sentence might take multiple passes through the model, each time doing a step-by-step computation. It’s a bottleneck.
Speculative Decoding is a clever way to speed things up without altering the model or sacrificing output quality. It relies on a draft model, a smaller model with faster inference, to guess the next few tokens. These guesses are then verified by the larger, slower model. Instead of checking each token sequentially, several potential tokens are evaluated in parallel.
If the guesses from the smaller model are good enough, they’re accepted. If not, the larger model corrects them in the next pass. This allows multiple tokens to be generated at once, significantly reducing the number of model runs. The end result? The same output in much less time.

That said, this approach has some drawbacks. The quality of the tokens proposed by the draft model affects the response time, and if predicted tokens are incorrect, OpenAI charges for any tokens not included in the final completion at the completion token rate.💰💲 Plus, it only supports the GPT-4o and GPT-4o-mini series of models. The primary purpose of this feature is to speed up generation by minimising model runs, and in some cases it also trims unnecessary token usage and costs.
To show you how predicted outputs can really cut down response time, I decided to test it using one of my own blog posts. Here’s how I did it.
Step 1 : Scraping the Blog
First things first, I needed to grab the content. Using BeautifulSoup, I scraped one of my blog posts, Building an Agentic Framework with O1 and GPT-4O, to use as the test data.
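Nothing fancy here; a sketch of the scraping step looks something like the following (the URL is a placeholder for the actual post):

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/building-an-agentic-framework-with-o1-and-gpt-4o"  # placeholder URL

html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Pull the article body out of the paragraph tags.
blog_content = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
print(f"Scraped {len(blog_content)} characters")
```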
Once I had the content, I passed it into GPT-4o mini with a prompt asking it to revise the first two sub-headings of the blog. Here's the prompt I used:
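The exact wording isn’t critical; an illustrative version of the instruction looks like this (not the verbatim original):

```python
# Illustrative version of the revision prompt (not the verbatim original).
prompt = (
    "Below is a blog post. Revise the first two sub-headings and the text under them "
    "for clarity, and return the full post with everything else unchanged.\n\n"
    f"{blog_content}"
)
```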
Step 2: Running the Model Without Predicted Outputs
I started by running the model the traditional way without any predictions just to establish a baseline for how long the process would take. The code for running the model was simple:
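Something along these lines, reusing the prompt built above and timing the call:

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.time()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.time() - start

completion_tokens = completion.usage.completion_tokens
print(f"Completion tokens: {completion_tokens}")
print(f"Total time: {elapsed:.2f} s")
print(f"Tokens per second: {completion_tokens / elapsed:.2f}")
```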
Once it finished, I checked the results and noted a few key metrics:
- Completion Tokens: 602
- Total Time: 29.76 seconds
- Tokens Per Second: 20.22
Step 3: Introducing Predicted Outputs
Now, for the exciting part! I re-ran the test, this time using the predicted outputs feature. By doing this, the model would predict some of the tokens before the main processing, effectively reducing response time.
Here’s how I added the predicted output feature:
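The only change from the baseline call is the prediction parameter, which passes the original blog text as the expected output. A sketch, reusing the same client and prompt as before:

```python
start = time.time()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    # The original text doubles as the prediction: most of it should survive the revision unchanged.
    prediction={"type": "content", "content": blog_content},
)
elapsed = time.time() - start

usage = completion.usage
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total time: {elapsed:.2f} s")
print(f"Tokens per second: {usage.completion_tokens / elapsed:.2f}")
print(usage.completion_tokens_details)  # includes accepted_prediction_tokens / rejected_prediction_tokens
```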
Step 4: Checking the Results with Predicted Outputs
After the model ran with Predicted Outputs, I checked the metrics again to see the difference. The results were impressive:
- Completion Tokens: 751
- Total Time: 10.27 seconds
- Tokens Per Second: 73.12
The Outcome
Predicted Outputs cut the response time from 29.76 seconds to just 10.27 seconds, a reduction of roughly 65%. Impressive, right?
However, there was an increase in completion tokens, from 602 to 751, which is expected.
This increase reflects the model generating additional predictions in parallel to accelerate the process. Specifically, 16 predicted tokens were accepted and included in the final output, while 116 were rejected. The rejected tokens represent predictions that were close but didn’t quite match the final output; that's part of the trade-off.
Extra Cost-Cutting Strategies That Worked for Us and Will Work for You Too
Let’s talk real savings. I’m sure you’re tired of seeing those AI costs rack up with every request. Here's what actually worked for me, and will work for you too.
- Set Usage Limits : I learned this the hard way when I ran the o1-mini model excessively without realising the cost; I ended up with a $325 💸 charge and a negative account balance. Ouch. Lesson learned. To avoid surprises, set limits based on your actual usage and keep an eye on them so you know when you’re approaching them. Simple, but effective.
- Optimise those Prompts : Prompts are input tokens, and you get charged for them. Keep them precise yet effective, avoiding redundancy. Effective prompt-writing strategies will save you money in the long run.
- Limit the Max Tokens Output : Setting a max_tokens limit caps the response length, preventing unnecessary token usage and helping you manage costs.
- Choose your Model wisely : Not all tasks require advanced capabilities. Using simpler models in such situations can be a game changer; for simple text-related tasks, consider GPT-3.5 instead of the GPT-4 model series.
- Adjust the Temperature Setting accordingly : The temperature determines the randomness of responses. Setting it to 0 gives you more predictable, structured answers that tend to use fewer tokens, which can help you save on API costs (see the sketch after this list).
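Here’s a minimal sketch that combines two of the settings above, capping the output length and keeping responses deterministic. The model, prompt, and limits are illustrative:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise prompt caching in three bullet points."}],
    max_tokens=150,   # hard cap on output tokens, so a runaway response can't inflate the bill
    temperature=0,    # deterministic, structured output
)
print(response.choices[0].message.content)
```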
Conclusion and Final Thoughts
Optimising your OpenAI API usage doesn’t have to be complicated. These strategies cut down our expenses by around 50% across various client projects. We’ve shared everything we’ve learned, from setting usage limits to prompt caching and using Predicted Outputs.
By applying these methods, you can save on costs too, no gimmicks, just practical, straightforward techniques that work. The goal is simple - get the most out of the OpenAI API without overspending.
If you’re looking to build AI solutions or need expert consultation, feel free to reach out; we’d love to help. Happy reading!