Measuring AI: Progress and Challenges (Vietnamese below)

January 26, 2025

The rapid evolution of artificial intelligence (AI) has sparked both excitement and growing concerns worldwide. As AI systems advance at an unprecedented pace, even the brightest minds are grappling with designing tests sophisticated enough to challenge these systems. This struggle highlights both the promise and peril of AI’s relentless growth.

Breakneck Progress

Over the years, AI has made significant strides, as demonstrated by its ability to tackle a variety of challenges. These range from solving complex SAT (Scholastic Assessment Test) questions—used by some universities for admissions—to mastering advanced problems in logic and science. Such benchmarks have long served as tools to measure AI’s progress.

However, these benchmarks are quickly becoming obsolete. Giants like OpenAI, Google, and Anthropic are developing AI systems that surpass current standards with ease. The latest AI models have achieved near-expert proficiency at postgraduate levels, raising an unsettling question: Are these systems advancing beyond the limits of human evaluation?

A Test for the Future

To address these concerns, Dan Hendrycks, director of the Center for AI Safety, collaborated with Scale AI to develop a groundbreaking assessment called Humanity’s Last Exam. This test aims to push the boundaries of AI’s capabilities and explore the limits of its intellectual potential.

The exam consists of 3,000 challenging questions across diverse subjects, including analytical philosophy and rocket engineering. Designed to be exceptionally difficult, it seeks to probe the limits of AI while uncovering its uneven abilities.

Mixed Results

When leading AI systems such as Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and ChatGPT were subjected to the test, their performance was underwhelming. OpenAI’s o1 system scored the highest, at a mere 8.3%. Still, researchers, including Hendrycks, anticipate rapid improvement. They predict these systems could achieve scores of 50% or higher within a year, signaling their increasing capability to tackle even the most complex problems.

Beyond Standard Metrics

The potential for AI systems to reach such expertise presents profound implications for how their impact is assessed. Hendrycks emphasizes the need for innovative evaluation methods beyond traditional testing. He notes, “Our ultimate goal is to determine whether AI can contribute to breakthroughs in intellectual challenges—solving problems that have long eluded human comprehension.”

Summer Yue, director of research at Scale AI and co-organizer of Humanity’s Last Exam, shares a similar vision. She believes AI’s potential lies not just in answering questions but in unlocking new realms of knowledge and innovation.

Challenges and Limitations

Despite its remarkable achievements, AI development is far from smooth sailing. Some models can diagnose diseases more accurately than doctors, win medals in math competitions, and outperform elite programmers in coding tasks. Yet they still struggle with basic tasks, such as simple arithmetic or composing basic rhymes.

This inconsistency underscores the complexity of AI’s evolution. As these systems continue to improve, so too must our tools to evaluate their capabilities.

The Road Ahead

The rapid advancements in AI bring both optimism and caution. On one hand, AI holds the promise of solving problems beyond human capacity, contributing to groundbreaking discoveries. On the other, its rapid development raises questions about its implications for humanity and how we can responsibly guide its growth.

Measuring AI

Testing AI’s progress is highlighting a growing concern

Some of the world’s brightest minds are struggling to design new tests that advanced AI systems cannot pass. This development shows both the promise and the peril that AI’s rapid growth may bring.

Advancing Too Fast

Over the years, AI has made clear progress, recorded through a series of benchmarks, from solving problems on the SAT (Scholastic Assessment Test), which some universities use for admissions, to tackling advanced math problems and high-level science and logic tests. These tests serve as yardsticks for how far AI has come. In practice, they are becoming increasingly obsolete because the systems built by OpenAI, Google, and Anthropic have advanced so far.

The latest generation of AI is already highly proficient at the postgraduate level. This raises an unsettling question: are these systems advancing beyond humans’ ability to evaluate them effectively?

Dan Hendrycks, director of the Center for AI Safety, worked with Scale AI to release Humanity’s Last Exam, a new assessment designed to test the limits of AI. It aims to address a pressing concern: how to measure AI’s intellectual capabilities.

The exam comprises 3,000 questions on many topics, from analytical philosophy to rocket engineering, all intended to stump AI. Although it was designed to push to the very boundary of today’s AI, it has also highlighted the uneven abilities of AI systems.

Hard to Test

When leading AI systems, including Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and ChatGPT, faced Humanity’s Last Exam, they performed very poorly. The highest score was just 8.3%, achieved by OpenAI’s o1 system. However, researchers, including Hendrycks, believe scores will rise, possibly reaching 50% or more within the next year.

The prospect of AI systems reaching such expertise raises important questions about how to measure their impact. According to Hendrycks, “we need methods that go beyond standardized tests. The ultimate goal is to determine whether AI can contribute to intellectual breakthroughs, and whether it can solve problems that humans have not yet been able to solve.”

Summer Yue, director of research at Scale AI and a co-organizer of the exam, agrees. In her view, AI will not only answer questions but also help humanity discover new knowledge in the future.

Even so, AI development has not been smooth. While some models can diagnose diseases more accurately than doctors, win medals in math competitions, and out-code top programmers, they still struggle with basic tasks such as simple arithmetic or composing simple rhyming verse.

Ngọc Trân (based on The New York Times)
