
Interconnects
GPT-4.5: "Not a frontier model"?
Podcast summary created with Snipd AI
Quick takeaways
- Despite being the largest publicly available model, GPT-4.5 is not classified as a frontier model, sparking curiosity about its release rationale.
- While it features improvements such as reduced hallucinations and better emotional intelligence, its gains over earlier models are hard to distinguish in practice.
Deep dives
Understanding GPT-4.5's Release
GPT-4.5 is not classified as a frontier model, yet it is the largest model made publicly available to date, raising questions about its development and release strategy. While it features improvements, distinguishing significant enhancements from earlier versions remains challenging. The jump from GPT-3.5 to GPT-4 was a noticeable upgrade, taking the user experience from okay to good, whereas the step from GPT-4o to GPT-4.5 advances it only from great to "really great." This reflects the ongoing complexities of scaling AI models and the game theory that may have influenced OpenAI's decision to release it.
Performance and Capabilities of GPT-4.5
GPT-4.5 boasts notable features such as reduced hallucinations and enhanced emotional intelligence, although measuring these improvements can be somewhat subjective. Benchmarks like SimpleQA and GPQA indicate that GPT-4.5 performs exceptionally well, outstripping previous models in certain evaluations. Many users have reported a smoother experience with GPT-4.5, enhancing its appeal in practical applications like writing and technical tasks. However, feedback suggests that earlier models still hold advantages in certain writing comparisons, indicating a nuanced relationship between model size, functionality, and user preferences.
The Future of Scaling and Pricing Concerns
The pricing structure for GPT-4.5 signifies a strategic approach given its high initial costs, which mirror those of its predecessor, GPT-4, at launch. OpenAI appears to anticipate demand fluctuations, hinting that GPT-4.5 may not remain in production if user engagement is low. Predictions indicate that future models could see significant improvements in speed and efficiency, potentially stemming from emerging hardware like NVIDIA's Blackwell GPUs. Overall, GPT-4.5 represents a critical juncture in AI development, signaling the need for a refined understanding of scaling dynamics while integrating advancements into broader applications.
More: https://www.interconnects.ai/p/gpt-45-not-a-frontier-model
As GPT-4.5 was being released, the first material the public got access to was OpenAI’s system card for the model that details some capability evaluations and mostly safety estimates. Before the live stream and official blog post, we knew things were going to be weird because of this line:
GPT-4.5 is not a frontier model.
The updated system card in the launch blog post does not include this line. The original system card is still there if you need a reference.
Regardless, someone at OpenAI felt the need to put that in. The peculiarity here summarizes a lot of the release. Some questions are still really not answered, like “Why did OpenAI release this?” That game theory is not in my purview.
The main contradiction to the claims that it isn’t a frontier model is that this is the biggest model the general public has ever gotten to test. Scaling to this size of model did NOT make a clear jump in capabilities we are measuring. To summarize the arc of history, the jump from GPT-3.5 to GPT-4 made the experience with the models go from okay to good. The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great.
Feeling out the differences in the latest models is so hard that many who are deeply invested and excited by AI’s progress are just as likely to lie to themselves about the model being better as they are to perceive real, substantive improvements. In this vein, I almost feel like I need to issue a mea culpa. I expected this round of scaling’s impacts to still be obvious before the brutal economic trade-offs of scaling kicked in.
While we got this model, Anthropic has also unintentionally confirmed that their next models will be trained with approximately "10X the compute," via a correction on Ethan Mollick's post about Claude 3.7.
Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger.
GPT-4.5 is a data point on the graph showing that scaling is still coming, but trying to make sense of it in a day-by-day transition is hard. In many ways, zooming out, GPT-4.5 will be referred to in the same breath as o1, o3, and R1, where it was clear that scaling pretraining alone was not going to give us the same level of breakthroughs. Now we really know what Ilya saw.
All of this marks GPT-4.5 as an important moment in time for AI, rounding out other stories we've been seeing. GPT-4.5 likely finished training a long time ago — highlighted by its training data cutoff still being in 2023 — and OpenAI has been using it internally to help train other models, but didn't see much of a need to release it publicly.
What GPT-4.5 is good for
In the following, I am going to make some estimates on the parameter counts of GPT-4.5 and GPT-4o. These are not based on any leaked information and should be taken with big error bars, but they are very useful for context.
GPT-4.5 is a very big model. I’d bet it is well bigger than Grok 3. We have seen this story before. For example, GPT-4 was roughly known to be a very big mixture of experts model with over 1T parameters total and ~200B active parameters. Since then, rumors have placed the active parameters of models like GPT-4o or Gemini Pro at as low as 60B parameters. This type of reduction, along with infrastructure improvements, accounts for massive improvements in speed and price.
Estimates place GPT-4.5 at about an order of magnitude more compute than GPT-4. These are not based on any released numbers, but given a combination of a bigger dataset and more parameters (5X parameters + 2X dataset size = 10X compute), the model could be in the ballpark of 5-7T parameters total, which, with a similar sparsity factor to GPT-4, would be ~600B active parameters.
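To make that arithmetic concrete, here is a minimal back-of-envelope sketch using the standard pretraining cost approximation C ≈ 6·N·D (N active parameters, D training tokens). The GPT-4 baseline figures below (~200B active, ~1.8T total, ~13T training tokens) are rumored values, not official ones, so treat the outputs with the same big error bars as the estimates above.

```python
# Back-of-envelope check of the "5X parameters + 2X data = 10X compute" claim,
# using the standard transformer pretraining approximation C ≈ 6 * N * D.
# All baseline numbers below are rumored/assumed, not official figures.

def pretrain_flops(active_params: float, tokens: float) -> float:
    """Approximate pretraining compute in FLOPs: C ≈ 6 * N * D."""
    return 6 * active_params * tokens

GPT4_ACTIVE = 200e9   # rumored GPT-4 active parameters
GPT4_TOTAL = 1.8e12   # rumored GPT-4 total parameters (mixture of experts)
GPT4_TOKENS = 13e12   # rumored GPT-4 training tokens

gpt4_c = pretrain_flops(GPT4_ACTIVE, GPT4_TOKENS)
gpt45_c = pretrain_flops(5 * GPT4_ACTIVE, 2 * GPT4_TOKENS)
print(f"compute ratio: {gpt45_c / gpt4_c:.0f}x")  # 5 * 2 = 10x

# If GPT-4.5 kept a similar sparsity factor (active / total), a ~6T-total
# model lands near the ~600B active parameters estimated above.
sparsity = GPT4_ACTIVE / GPT4_TOTAL                          # ~0.11
print(f"active params at 6T total: {6e12 * sparsity:.2e}")   # ~6.7e11
```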
With all of these new parameters, actually seeing performance improvements is hard. This is where things got very odd. The two “capabilities” OpenAI highlighted in the release are:
* Reduced hallucinations.
* Improved emotional intelligence.
Both of these have value but are hard to vibe test.
For example, SimpleQA is a benchmark we at Ai2 are excited to add to our post-training evaluation suite to improve world knowledge of our models. OpenAI made and released this evaluation publicly. GPT-4.5 makes huge improvements here.
In another one of OpenAI's evaluations, PersonQA, which asks questions about individuals, the model is also state of the art.
And finally, there is GPQA, the Google-proof knowledge evaluation that reasoning models actually excel at.
At the time of release, many prominent AI figures online were touting how GPT-4.5 is much nicer to use and better at writing. These takes should be weighed against your own testing; it's not that simple. GPT-4.5 is also being measured as middle of the pack in most code and technical evaluations relative to Claude 3.7, R1, and the like.
For an example on the writing and style side, Karpathy ran some polls comparing GPT-4.5's writing to GPT-4o-latest, and most people preferred the smaller, older model. Given what we know about post-training and the prevalence of distilling from the most powerful model you have access to, GPT-4o-latest is likely distilled from this new model, previously called Orion. Its drastically smaller size makes a night-and-day difference in iteration speed, allowing for better post-training.
More on the character of that GPT-4o-latest model was covered in our previous post on character training.
All of this is a big price to pay to help OpenAI reclaim their top spot on ChatBotArena — I expect GPT-4.5 to do this, but the results are not out yet.
I’ve been using GPT-4.5 in preparation for this. It took a second to get used to the slower speed, but it’s fine. I will keep using it for reliability, but it’s not worth paying more for. o1 Pro and the other paid offerings from OpenAI offer far more value than GPT-4.5.
Making sense of GPT-4.5’s ridiculous price
When the original GPT-4 launched, it was extremely expensive; in fact, it was comparable in price to GPT-4.5 at launch. Here's a help post on the OpenAI forums, conveniently found by OpenAI Deep Research with GPT-4.5, that captures the context. GPT-4 launched in March 2023.
We are excited to announce GPT-4 has a new pricing model, in which we have reduced the price of the prompt tokens.
For our models with 128k context lengths (e.g. gpt-4-turbo), the price is:
* $10.00 / 1 million prompt tokens (or $0.01 / 1K prompt tokens)
* $30.00 / 1 million sampled tokens (or $0.03 / 1K sampled tokens)
For our models with 8k context lengths (e.g. gpt-4 and gpt-4-0314), the price is:
* $30.00 / 1 million prompt token (or $0.03 / 1K prompt tokens)
* $60.00 / 1 million sampled tokens (or $0.06 / 1K sampled tokens)
For our models with 32k context lengths (e.g. gpt-4-32k and gpt-4-32k-0314), the price is:
* $60.00 / 1 million prompt tokens (or $0.06 / 1K prompt tokens)
* $120.00 / 1 million sampled tokens (or $0.12 / 1K sampled tokens)
GPT-4.5’s pricing launched at:
* Input: $75.00 / 1M tokens
* Cached input: $37.50 / 1M tokens
* Output: $150.00 / 1M tokens
OpenAI included language in the release that they may not keep this model in the API, likely forecasting low demand, as they wanted to hear from users if it enabled entirely new use-cases.
Many analysts think that Nvidia's next GPU generation, Blackwell, which offers far more memory per FLOP (enabling bigger models to be served), is not yet priced into this. We can expect to see the same arc of pricing with 4.5 as we did from GPT-4 to GPT-4 Turbo to GPT-4o.
* GPT-4 Turbo launched in November 2023 at $10 / 1M input and $30 / 1M output.
* GPT-4o launched in May 2024 at $2.5 / 1M input and $10 / 1M output.
These are huge reductions, about 10X.
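To see what those launch prices mean in practice, here is a small illustrative script using the per-token prices quoted above; the workload size is hypothetical, chosen just to show the relative costs.

```python
# Illustrative per-request cost across the pricing arc quoted above
# (USD per 1M tokens at each model's launch). The workload is made up.

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gpt-4 (Mar 2023, 8k)":   (30.00, 60.00),
    "gpt-4-turbo (Nov 2023)": (10.00, 30.00),
    "gpt-4o (May 2024)":      (2.50, 10.00),
    "gpt-4.5 (launch)":       (75.00, 150.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the given model's launch pricing."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# Hypothetical workload: 100k input tokens, 20k output tokens.
for model in PRICES:
    print(f"{model:24s} ${request_cost(model, 100_000, 20_000):7.2f}")

# gpt-4 -> gpt-4o is roughly a 10x reduction at launch prices, while
# gpt-4.5 resets the clock above even the original gpt-4 launch price.
```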
These are products that OpenAI makes a healthy margin on, and there are no signs that has changed. The AI community has collectively grown so accustomed to incredible progress in making the technology more efficient that even a blip in the process, where bigger models are available, feels potentially bubble-popping.
The future of scaling
Scaling language models is not dead. Still, reflecting on why this release felt so weird is crucial to staying sane in the arc of AI’s progress. We’ve entered the era where trade-offs among different types of scaling are real.
If forced to summarize all of this curtly, it would be: GPT-4.5 is, oddly, ahead of its time.
This means that the progression of AI needs to take a different tack, but we already knew this with the rapid progress of reasoning models. The true impact of GPT-4.5 is when it is integrated with multiple lines of rapid progress.
One of the flagship results in the DeepSeek R1 paper and related RL follow-up work in the AI community is that scaling RL training works better on bigger models. There is a lot of work to do to know all the domains that’ll be absorbed into this umbrella. Future models like o4 could be distilled from a reasoning model trained on GPT-4.5. In fact, this may already be the case. OpenAI’s current models likely would not be so good without GPT-4.5 existing.
In as soon as a year, most of the models we are working with will be GPT-4.5 scale, and they will be fast. The "well-rounded" improvements they offer are going to help make many more applications more robust, but OpenAI and the other AI labs have pushed scaling a bit further than the current serving infrastructure can support.
Frontier labs are not taking enough risk if they don't try to push the limits of every direction of scaling available to them. Though releasing the model wasn't strictly necessary, we can only guess why OpenAI actually wanted to do it. It's likely that GPT-4.5 is being used in other internal systems for now and other external products soon, so releasing it is a natural step on the way to the next thing, rather than a detour.
GPT-4.5 is a frontier model, but its release is not an exciting one. AI progress isn’t free, and it takes a lot of hard work. Most people should only care when GPT-4.5 is integrated into more than just chat.