According to the State of AI report, as more and more large language models achieve excellent benchmark performance, researchers are increasingly concerned about data contamination, that is, test or validation data leaking into the training set. Data contamination is a significant problem in machine learning: as large language models and deep learning have advanced, datasets have grown in size and complexity, and the risk of contamination has grown with them. Contamination occurs when data from the test set used to evaluate a model leaks into the dataset on which the model is pre-trained.
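To make the notion concrete, one common way to screen for this kind of leakage is to flag test items that share long n-grams with the pre-training corpus. The sketch below is purely illustrative and is not a method prescribed by the report; the corpus and test strings are hypothetical placeholders.

```python
# Illustrative sketch only: a word-level n-gram overlap check for contamination.
# Not a method from the State of AI report; the example strings are hypothetical.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_docs: Iterable[str], test_items: Iterable[str], n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the training corpus."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    test_items = list(test_items)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_ngrams)
    return flagged / max(len(test_items), 1)

if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    test = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "a completely different sentence with no overlapping window of words at all here",
    ]
    print(f"Contaminated fraction: {contamination_rate(train, test):.2f}")  # 0.50
```

In practice, contamination audits vary the n-gram length and normalization, but the core idea is the same: any exact overlap between evaluation items and pre-training text inflates the measured benchmark score.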
Researchers are currently working to correct problems in widely used benchmarks. But the challenge cuts both ways: high error rates in some of the most popular benchmarks can lead to underestimating model capabilities and carry safety implications, while at the same time the temptation to overfit to these benchmarks is strong.
Chinese AI models are booming
According to the State of AI report, Chinese large language models continue to shine despite US sanctions. Models developed by companies such as DeepSeek, 01.AI (Zero One), Zhipu AI, and Alibaba have figured prominently on the LMSYS leaderboard, particularly in mathematics and coding. China's strongest models are competitive with the second tier of frontier models produced in the United States and match the current state of the art (SOTA) on some subtasks. To compensate for limited access to GPUs, Chinese R&D teams have prioritized computational efficiency to make better use of their resources, and they have developed distinct strengths of their own. DeepSeek, for example, has reduced memory requirements during inference by pioneering techniques such as Multi-head Latent Attention and by enhancing the Mixture of Experts (MoE) architecture.
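As a rough illustration of the MoE idea referenced above, the following is a minimal, generic top-2 expert-routing layer in PyTorch. It is only a sketch of the general technique, not DeepSeek's actual architecture, which adds shared experts, Multi-head Latent Attention, and other refinements; all class and parameter names here are made up for the example.

```python
# Generic top-2 Mixture-of-Experts routing sketch (illustrative only,
# NOT DeepSeek's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token for each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(tokens[mask])
        return out.reshape_as(x)

if __name__ == "__main__":
    layer = SimpleMoE(d_model=64, d_ff=256)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

The efficiency argument is that only a small subset of experts runs for each token, so total parameters can grow without a proportional increase in per-token compute.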
Figure: Chinese AI models are booming
In addition, China's open-source projects have gained popularity and recognition around the world and have made significant contributions to the global technology ecosystem. DeepSeek, for example, has become a standout in the coding community for its combination of speed and accuracy; its deepseek-coder-v2 model is a strong contender for coding tasks.
Alibaba's recently released Qwen-2 series has also impressed the community, especially in terms of visual capabilities. From challenging OCR tasks to analyzing complex works of art, Qwen-2 has demonstrated its versatility and prowess in computer vision. Compared with its predecessor, Qwen 1.5, the Qwen-2 series achieves a generational leap in overall performance, with greatly improved abilities in code, mathematics, reasoning, instruction following, and multilingual understanding. The series includes pre-trained and instruction-fine-tuned models in five sizes to meet different computational and application needs.
Among smaller projects, Tsinghua University's Natural Language Processing Laboratory backed the OpenBMB project, which gave birth to MiniCPM. With fewer than 2.5 billion parameters, these small models can run on-device, making them highly accessible and practical. The 2.8-billion-parameter vision model lags only slightly behind GPT-4V on some metrics, while the 8.5-billion-parameter Llama 3-based version surpasses GPT-4V on some metrics, demonstrating the impressive capabilities of these Chinese open-source projects.
In addition, Tsinghua University's Knowledge Engineering Group has created CogVideoX, one of the most powerful text-to-video models available. This innovation further strengthens China's leading position in artificial intelligence and open-source contributions.