When Artificial Intelligence Hits Its Limits (Vietnamese below)

January 27, 2025

Why Humanity’s Struggles to Test AI Highlight a Growing Concern (Vietnamese below)

The race to measure artificial intelligence has reached a critical impasse. Researchers are finding that some of the world’s brightest minds are struggling to devise tests that cutting-edge AI systems can’t conquer—a development that underscores both the promise and peril of AI’s rapid evolution.

For years, AI progress was charted through a series of standardized benchmarks, from solving SAT-level problems to tackling advanced challenges in math, science, and logic. These tests served as yardsticks, offering a glimpse into how far AI had come. Yet, as AI systems from OpenAI, Google, and Anthropic advanced, these benchmarks became increasingly obsolete.

The latest generation of AI has mastered many graduate-level problems, raising an unsettling question: Are these systems now advancing beyond our ability to evaluate them effectively?

The Hardest Test Yet

Enter “Humanity’s Last Exam,” a new evaluation designed to probe the limits of AI. Developed by Dan Hendrycks, director of the Center for AI Safety, in collaboration with Scale AI, the test aims to address a pressing concern: how to measure AI’s intellectual capabilities when conventional tests fall short.

The exam, consisting of 3,000 rigorously crafted questions, spans subjects as diverse as analytic philosophy and rocket engineering. Experts from academia and industry—including professors and award-winning mathematicians—contributed questions designed to stump even the most sophisticated AI models. Those whose questions made the cut received between $500 and $5,000, reflecting the value of their expertise.

These questions underwent a meticulous two-step filtering process. First, they were tested on advanced AI systems. Second, the questions the models failed, sometimes with scores worse than random guessing, were passed to human reviewers for refinement.

The test, while designed to push the boundaries of AI, also highlights its uneven capabilities. For instance, it includes questions like this one from the realm of physics:

A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end…

(Answering it might require not only mathematical precision but also a deep understanding of mechanics—a level of nuance that even advanced AI models may struggle to replicate consistently.)

A Snapshot of AI’s Progress

When leading AI systems, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet, tackled Humanity’s Last Exam, they performed dismally. The highest score, achieved by OpenAI’s o1 system, was a mere 8.3%. Yet researchers, including Hendrycks, believe these scores will climb, potentially reaching 50% or more within a year.

The prospect of AI systems achieving such expertise raises important questions about how to measure their impact beyond traditional exams. “We need methods that go beyond standardized tests,” Hendrycks said. “The ultimate goal is to determine whether AI can contribute to intellectual breakthroughs, whether it can solve problems humans haven’t yet cracked.”

Summer Yue, director of research at Scale AI and an organizer of the exam, echoes this sentiment. She envisions a future where AI doesn’t just answer questions but helps humanity uncover new knowledge.

The Uneven Landscape of AI Capabilities

The trajectory of AI’s development remains anything but smooth. While some models can diagnose diseases more accurately than doctors, win medals in math competitions, and out-code top programmers, they still falter in tasks like basic arithmetic or composing structured poetry.

This inconsistency complicates efforts to evaluate AI’s true capabilities. Kevin Zhou, a theoretical particle physicist at UC Berkeley who contributed questions to the exam, underscores this point. While impressed by AI’s ability to tackle challenging problems, he notes that research involves far more than producing correct answers.

“There’s a big gulf between taking an exam and being a practicing physicist,” Zhou said. “Research demands creativity, intuition, and the ability to navigate unstructured problems, areas where AI still struggles.”

The Broader Implications

The limitations of AI testing reflect a larger issue: the difficulty of understanding AI’s broader societal and economic impacts. Some researchers argue that as AI systems evolve, their contributions may be better assessed through real-world outcomes, such as economic productivity or breakthroughs in science and engineering.

In the meantime, initiatives like Humanity’s Last Exam serve as valuable tools for benchmarking progress, even as they underscore the growing gap between AI’s capabilities and our ability to measure them.

As Hendrycks puts it, “The question isn’t just whether AI can pass our tests, but whether it can do the things we’ve yet to imagine, and what that means for humanity.”

As AI systems advance at an unprecedented pace, even the brightest minds are grappling with designing tests sophisticated enough to challenge these systems. This struggle highlights both the promise and peril of AI’s relentless growth.

Measuring AI

Testing AI’s Progress Is Highlighting a Growing Concern

Some of the world’s brightest minds are struggling to design new tests that advanced AI systems cannot pass. This development shows both the promise and the peril that may lie ahead as AI develops so quickly.

Advancing Too Fast

For many years, AI’s visible progress was recorded through a series of benchmarks, from solving problems on the SAT (Scholastic Assessment Test), which some universities use to screen incoming students, to tackling advanced mathematics and high-level science and logic tests. Tests of this kind served as yardsticks for how far AI had come. In practice, they have become increasingly obsolete because the systems from OpenAI, Google, and Anthropic have advanced so far.

The latest generation of AI has become highly proficient at graduate-level problems. This raises a troubling question: are these systems advancing beyond humans’ ability to evaluate them effectively?

Dan Hendrycks, director of the Center for AI Safety, in collaboration with Scale AI, has introduced “Humanity’s Last Exam,” a new evaluation designed to test the limits of AI. It aims to address a pressing concern: how to measure AI’s intellectual capabilities.

The exam comprises 3,000 questions across many subjects, from analytic philosophy to rocket engineering, all intended to stump AI. Although designed to push AI to the very edge of its current boundaries, it has also highlighted the uneven capabilities of AI systems.

Hard to Test

When leading AI systems, including Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and ChatGPT, faced Humanity’s Last Exam, they performed very poorly. The highest score was just 8.3%, achieved by OpenAI’s o1 system. However, researchers, including Hendrycks, believe the scores will rise, possibly reaching 50% or more within the next year.

The prospect of AI systems attaining such a high level of expertise raises important questions about how to measure their impact. According to Hendrycks, “We need methods that go beyond standardized tests. The ultimate goal is to determine whether AI can contribute to intellectual breakthroughs, whether it can solve problems that humans have not yet been able to solve.”

Summer Yue, director of research at Scale AI and a co-organizer of the exam, agrees with this view. As she sees it, AI will not only answer questions but also help humanity discover new knowledge in the future.

Even so, AI’s development has been anything but smooth. While some models can diagnose diseases more accurately than doctors, win medals in math competitions, and out-code top programmers, they still struggle with basic tasks such as simple arithmetic or composing simple rhyming folk verse.

Ngọc Trân (based on The New York Times)
