The State of AI Report 2024, published at stateof.ai, is a comprehensive analysis of the latest advances in artificial intelligence. It covers many aspects of the field, including foundation models, AI hardware, AI safety and ethics, and AI applications. This article offers an interpretation of the report.
Breakthroughs in large language models (LLMs)
For much of the past year, GPT-4 was far ahead of other AI language models, with benchmarks and community leaderboards showing a significant gap between GPT-4 and its competitors. However, with the release of new models such as Claude 3.5 Sonnet, Gemini 1.5, and Grok 2, these competitors have improved significantly and have now largely closed the gap. According to the State of AI report, today's large language models generally show strong programming ability and perform well in factual recall and mathematics, but they still lag in open-ended Q&A and multimodal problem solving. For example, GPT-4o outperformed Claude 3.5 Sonnet on the MMLU benchmark, but scored slightly lower on MMLU-Pro, a benchmark designed to be more challenging.
Given the relatively subtle technical differences between architectures and the large overlap that can exist in pre-training data, model builders now increasingly have to compete on new capabilities and product features.
Figure: Newly released models Claude 3.5 Sonnet, Gemini 1.5, and Grok 2 close the gap with OpenAI
On September 13, 2024, OpenAI released the o1 model. The name "o1" signifies "resetting the counter to 1," meaning that OpenAI hopes to redefine the reasoning ability of artificial intelligence with this model and open a new era. Also known by its codename Strawberry, o1 pushes the limits of LLM reasoning with a built-in chain-of-thought (CoT) process that addresses the math, science, and coding weaknesses of previous models.
According to the State of AI report, by shifting compute from the pre-training and post-training phases to the inference phase, the o1 model works through complex prompts step by step in a chain-of-thought (CoT) fashion and applies reinforcement learning to optimize the CoT and the strategies it uses. This unlocks multi-step math, science, and programming problems that have historically been difficult for large language models to solve due to the inherent limitations of next-word prediction.
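To make the contrast concrete, the sketch below uses the OpenAI Python SDK to compare the two approaches: a conventional model where the chain of thought has to be elicited explicitly in the prompt, and an o1-class model where the step-by-step reasoning happens internally at inference time. The model names, prompt, and example problem are illustrative assumptions, not taken from the report.

```python
# A minimal sketch, assuming the OpenAI Python SDK (`pip install openai`)
# and an OPENAI_API_KEY in the environment. Models and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

problem = (
    "A train leaves at 9:00 travelling at 80 km/h; a second train leaves at "
    "10:00 at 100 km/h on the same track. When does the second train catch up?"
)

# Conventional model: the chain of thought must be requested in the prompt.
manual_cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Solve this step by step, then state the final answer.\n\n{problem}",
    }],
)
print(manual_cot.choices[0].message.content)

# o1-class model: the step-by-step reasoning runs internally during inference,
# so the prompt only states the problem; the hidden reasoning still consumes
# (billed) output tokens before the final answer is returned.
internal_cot = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": problem}],
)
print(internal_cot.choices[0].message.content)
```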
OpenAI reports that o1 makes significant progress over GPT-4o on reasoning-focused benchmarks, most prominently the AIME 2024 math competition, where it scores 83.3 compared to GPT-4o's 13.4. However, this capability boost comes at a high cost: o1-preview costs $15 per million input tokens and $60 per million output tokens, making it roughly 3 to 4 times more expensive than GPT-4o.
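The price gap is easy to quantify. The short calculation below uses the o1-preview prices quoted above; the GPT-4o prices ($5 input / $15 output per million tokens, roughly the list prices around the time of the o1 launch) are an assumption for illustration.

```python
# Back-of-the-envelope cost comparison. o1-preview prices are from the text;
# GPT-4o prices are assumed for illustration and may differ from current pricing.
O1_PREVIEW = {"input": 15.00, "output": 60.00}   # USD per 1M tokens
GPT_4O     = {"input": 5.00,  "output": 15.00}   # assumed, USD per 1M tokens

def request_cost(prices, input_tokens, output_tokens):
    """Dollar cost of a single request at the given per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Example: a 2,000-token prompt with a 1,000-token completion.
for name, prices in [("o1-preview", O1_PREVIEW), ("gpt-4o", GPT_4O)]:
    print(f"{name}: ${request_cost(prices, 2_000, 1_000):.4f}")
# Prints roughly $0.0900 vs $0.0250 -- a 3-4x premium, and o1 also bills its
# hidden reasoning tokens as output, which widens the gap further in practice.
```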
OpenAI makes it clear in its API documentation that o1 is not a drop-in replacement for GPT-4o, and that it is not the best model for tasks that require consistently fast responses, image inputs, or function calling.
Figure: OpenAI o1 large language model benchmark results
According to the State of AI report, developers quickly put o1 to the test and found that it performed significantly better than other large language models on certain logic problems and puzzles. Its real strengths, however, lie in complex math and science tasks, as illustrated by a widely circulated video in which a Ph.D. researcher is surprised to find that o1 reproduces, in about an hour, code that had taken him a year of doctoral work to write. The model is still weak at some spatial reasoning tasks, however, and like GPT-4o before it, it still cannot play chess.
Figure: Comparison of the Llama 3.1 large language model's parameters with other large language models (Source: State of AI)
In addition to OpenAI, other tech giants have also made progress on their large language models. For example, Meta released the Llama 3 series in April of this year, followed by Llama 3.1 in July and Llama 3.2 in September. Among them, Llama 3.1 is Meta's largest model to date, with 405 billion (405B) parameters, and it can compete with GPT-4o and Claude 3.5 Sonnet on tasks such as reasoning, math, multilingual understanding, and long-context processing. This marks the first time an open-source model has narrowed the gap with proprietary frontier models, an important milestone for the open-source field.
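Because the Llama 3.1 weights are openly available, they can be run locally rather than only through a hosted API. The sketch below uses the Hugging Face transformers library; the repo id, prompt, and generation settings are assumptions for illustration, access to Meta's Llama repositories is gated and must be requested on Hugging Face first, and the 8B variant is shown because the 405B model requires a multi-GPU server.

```python
# A minimal sketch of running an open-weight Llama 3.1 model locally with the
# Hugging Face `transformers` library. Repo id and settings are assumptions;
# the 405B model follows the same pattern but needs far more hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed repo id (gated access)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": "Summarize the idea of chain-of-thought reasoning in two sentences.",
}]
# Build the chat-formatted prompt and generate a completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```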