[Paper Review] GAIA : A Benchmark for General AI Assistants

💡

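GAIA's questions are drawn from real-world problems. Solving them requires a broad set of core abilities: reasoning, multimodality (handling combined inputs such as text and images), web browsing, and tool use.

GAIA questions are conceptually simple for humans, yet very hard for today's most advanced AI. In the paper's experiments, humans answered 92% of the questions correctly, while GPT-4 equipped with plugins (tools) reached only 15%.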

Recent large language models can write and retrieve knowledge, and they can also write code and analyze data. This leaves us with a new dilemma: how should we measure what an AI can really do? Older datasets can no longer gauge the potential of LLMs.

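Clear limitations of existing benchmarks

Hard only for humans → Even on demanding exams such as MMLU and GSM8k, recent LLMs already score close to human experts. In some cases the AI even outscores an ordinary person who is not an expert in the field. This suggests that such questions test memorization of specific knowledge rather than general intelligence.

An unfair exam, 'data contamination' → A bigger problem is data contamination. It is as if the AI had peeked at the exam sheet in advance. If the evaluation questions and answers were already present in the training data, we have to suspect that the score reflects simple memorization rather than real ability.

Who can even grade it? → Take a novel written by an AI, or a math problem that even world-class scholars struggle to solve. Who could grade these objectively, and how? The task is too complex for human graders, and handing it to another AI leads to a contradiction, because the grading then depends on an even more capable AI.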

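A new perspective: measure 'robustness'

The conclusion is that existing methods struggle to measure an AI's real intelligence. At this point the authors of the paper ask a fundamental question: should progress in AI be measured only by abilities that surpass humans?

Instead, they argue, we should look at how robustly an AI solves ordinary, everyday problems the way a human does. The GAIA benchmark grew out of this concern, as an attempt to measure the general intelligence and robustness of AI.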

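Level	Question summary	What the AI must do (required abilities)	Why it is hard for AI
Level 1 (basic)	"What was the actual enrollment count of the H. pylori clinical trial in acne patients, covering January-May 2018, as listed on the NIH website?"	[1] Accurate web search: navigate a government site (NIH) [2] Key information extraction: identify the data for a specific trial in a specific period	The AI has to pick out one exact value, the actual enrollment count, from a large amount of information. A wrong search keyword or a misread table leads straight to a wrong answer.
Level 2 (applied)	(Shown a photo of ice cream) "By what percentage is this ice cream's butterfat content above or below the US federal standard, as reported by Wikipedia in 2020?"	[1] Reading information from an image: recognize the product's nutrition label [2] Finding the reference value: look up the 'US federal standard' on Wikipedia [3] Calculation and reasoning: combine both values into a percentage	This is a compound task: look at a photo, find the standard on the web, and then do the arithmetic. If any one step fails, no answer can be produced.
Level 3 (advanced)	"Of the two astronauts in NASA's 'Picture of the Day' for January 21, 2006, who appears smaller? And among the members of the group he belonged to, which colleague spent the least time in space, and how long was it?"	[1] Photo analysis and person identification: identify the people in the photo for a specific date [2] Chained search: person → group → time in space of each group member [3] Data comparison and processing: compare many members' data to find the minimum	Several clues have to be combined into a single conclusion. One search is never enough, and the process is long and complex, so a single slip at any step means failure.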

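GAIA's design philosophy: "hard to solve, easy to grade"

GAIA follows a principle similar to the 'Proof of Work' algorithm used in cryptocurrencies: a solution is costly to find but cheap to verify.

The solving process is deliberately hard. To find the answer, the AI has to search the web, analyze images, and work through several steps of complex reasoning. The final answer, however, must be a clear and concise fact such as "90", "+4.6", or "White; 5,876".

This design has several consequences. The questions are real problems that reflect the real world, and the cause of a failure can be analyzed clearly. The questions are not multiple choice, so an answer cannot be guessed, and the answers do not appear verbatim on the internet, so data contamination is much less of a concern. Evaluation needs no complex setup beyond asking the question, so the process is very simple and the results are reliable.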
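To make "easy to grade" concrete, here is a minimal sketch of a scorer for short factual answers such as "90", "+4.6", or "White; 5,876". It is not GAIA's official scoring code. The function names and the normalization rules (split on semicolons, compare numbers as floats, compare strings case-insensitively) are assumptions chosen for illustration.

```python
import re


def _normalize(element: str):
    """Turn one answer element into a float when possible, else a cleaned string."""
    text = element.strip()
    # Drop thousands separators and unit symbols before trying a numeric parse.
    numeric = re.sub(r"[,$%]", "", text)
    try:
        return float(numeric)
    except ValueError:
        # Compare strings case-insensitively, keeping only ASCII letters and digits.
        return re.sub(r"[^a-z0-9]", "", text.lower())


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Assumed rule: answers are ';'-separated lists whose elements must all match."""
    pred_parts = [_normalize(p) for p in prediction.split(";")]
    true_parts = [_normalize(t) for t in ground_truth.split(";")]
    return pred_parts == true_parts


print(is_correct("white; 5876", "White; 5,876"))
print(is_correct("89", "90"))
```

Because the answer is a single value or a short list, grading reduces to a normalized equality check, with no human judge or grader model involved.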

Taming the AI: a clear prompt and evaluation method

💡

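You are an AI assistant. For every question I ask, first describe your reasoning, then submit your final answer in exactly this format: 'FINAL ANSWER: [final answer]'.

With explicit rules like these, the AI can make full use of its abilities within a fixed format. The following example shows how GPT-4 responded to one such question.

GAIA question: "The attached Excel file contains the sales records of a fast-food chain. What were the total sales from food, excluding drinks? (in USD, to two decimal places)"

GPT-4's solution process:

Enable the code execution tool.

Load the Excel file with the pandas library.

Add up the sales of every food item, excluding the 'drinks' entries.

Print the result in the required format.

FINAL ANSWER: 89706.00 (Correct! ✓)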
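The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the code GPT-4 actually ran. The post says GPT-4 loaded the file with pandas, but the sketch uses plain Python to stay dependency-free. The rows, column names, and the `DRINK_COLUMNS` set are hypothetical, since the real spreadsheet's layout is not shown here.

```python
import re

# Hypothetical rows standing in for the spreadsheet; the real columns are not shown in this post.
rows = [
    {"location": "A", "burgers": 1200.50, "fries": 800.25, "soda": 950.00},
    {"location": "B", "burgers": 1500.00, "fries": 650.75, "soda": 700.00},
]
DRINK_COLUMNS = {"soda"}  # assumption: which columns count as drinks


def food_total(table):
    """Sum every numeric column that is not listed as a drink."""
    return sum(
        value
        for row in table
        for column, value in row.items()
        if column not in DRINK_COLUMNS and isinstance(value, (int, float))
    )


def extract_final_answer(response: str):
    """Return the text after the last 'FINAL ANSWER:' marker, or None if it is missing."""
    matches = re.findall(r"FINAL ANSWER:\s*(.+)", response)
    return matches[-1].strip() if matches else None


response = (
    "I summed the food columns and skipped drinks.\n"
    f"FINAL ANSWER: {food_total(rows):.2f}"
)
print(extract_final_answer(response))
```

The fixed 'FINAL ANSWER:' marker is what lets an evaluator ignore the free-form reasoning and read off only the value to be graded.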

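How GAIA is organized: abilities, difficulty, and hidden intent

Required abilities: solving the questions requires web browsing (the most important), coding, multimodality, and the handling of diverse file types.

Difficulty: the questions are systematically divided into three levels according to the number of steps and the kinds of tools they require.

Level 1 - warm-up questions that can be solved with a single simple tool in no more than 5 steps
Level 2 - full-fledged questions that combine several tools and take roughly 5 to 10 steps
Level 3 - very long and complex 'final boss' questions that only a near-perfect AI can solve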

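The horizontal axis shows the number of steps needed to solve a question, and the vertical axis shows the number of different tools used. The color of a dot encodes the level, and its size encodes how many questions fall at that position.

This plot shows that GAIA's difficulty is designed systematically. It starts with easy questions and moves toward complex ones that take more steps and more kinds of tools, which, according to the paper, makes it possible to measure the limits of an AI.

In the chart comparing scores, the first thing that stands out is the red bar for Human. Regardless of level, humans reached an accuracy of around 90%, far ahead of every model. This indicates how intuitive and simple GAIA's questions are for people.

At the hardest level, every AI failed with a score of zero. Only humans kept an accuracy in the high 80s, so the gap is clear. The chart shows that even the most capable generative models are still taking their first steps when measured against the general problem-solving ability of humans.

The base GPT-4 without plugins scored close to zero on questions that require file reading (Diverse filetype reading) or multimodality (Multi-modality). The likely reason is that generative models at the time could not open an Excel file or read text in an image on their own. With suitable plugins attached, however, performance in these areas improved meaningfully.

Likewise, once plugins gave the model access to tools such as a web browser or a code interpreter, it solved questions that the base model could not.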

A closer look at the GAIA dataset: taking apart the JSON metadata


{"task_id": "c61d22de-5f6c-4958-a7f6-5e9707bd3466", "Question": "A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?", "Level": 2, "Final answer": "egalitarian", "file_name": "", "Annotator Metadata": {"Steps": "1. Go to arxiv.org and navigate to the Advanced Search page.\n2. Enter \"AI regulation\" in the search box and select \"All fields\" from the dropdown.\n3. Enter 2022-06-01 and 2022-07-01 into the date inputs, select \"Submission date (original)\", and submit the search.\n4. Go through the search results to find the article that has a figure with three axes and labels on each end of the axes, titled \"Fairness in Agreement With European Values: An Interdisciplinary Perspective on AI Regulation\".\n5. Note the six words used as labels: deontological, egalitarian, localized, standardized, utilitarian, and consequential.\n6. Go back to arxiv.org\n7. Find \"Physics and Society\" and go to the page for the \"Physics and Society\" category.\n8. Note that the tag for this category is \"physics.soc-ph\".\n9. Go to the Advanced Search page.\n10. Enter \"physics.soc-ph\" in the search box and select \"All fields\" from the dropdown.\n11. Enter 2016-08-11 and 2016-08-12 into the date inputs, select \"Submission date (original)\", and submit the search.\n12. Search for instances of the six words in the results to find the paper titled \"Phase transition from egalitarian to hierarchical societies driven by competition between cognitive and social constraints\", indicating that \"egalitarian\" is the correct answer.", "Number of steps": "12", "How long did this take?": "8 minutes", "Tools": "1. Web browser\n2. Image recognition tools (to identify and parse a figure with three axes)", "Number of tools": "2"}}
{"task_id": "17b5a6a3-bc87-42e8-b0fb-6ab0781ef2cc", "Question": "I\u2019m researching species that became invasive after people who kept them as pets released them. There\u2019s a certain species of fish that was popularized as a pet by being the main character of the movie Finding Nemo. According to the USGS, where was this fish found as a nonnative species, before the year 2020? I need the answer formatted as the five-digit zip codes of the places the species was found, separated by commas if there is more than one place.", "Level": 2, "Final answer": "34689", "file_name": "", "Annotator Metadata": {"Steps": "1. Search the web for \u201cfinding nemo main character\u201d.\n2. Note the results, which state that the main character is a clownfish.\n3. Search the web for \u201cusgs nonnative species database\u201d.\n4. Click result for the Nonindigenous Aquatic Species site.\n5. Click \u201cMarine Fishes\u201d.\n6. Click \u201cSpecies List of Nonindigenous Marine Fish\u201d.\n7. Scroll through the list until I find the clown anenomefish, and click \u201cCollection info\u201d.\n8. Note the place that a clown anenomefish was found, in Fred Howard Park at the Gulf of Mexico.\n9. Search the web for \u201cfred howard park florida zip code\u201d.\n10. Note the zip code, 34689. Since only one clownfish was found before the year 2020, this is the answer.", "Number of steps": "10", "How long did this take?": "5 minutes", "Tools": "1. Search engine\n2. Web browser", "Number of tools": "2"}}
{"task_id": "04a04a9b-226c-43fd-b319-d5e89743676f", "Question": "If we assume all articles published by Nature in 2020 (articles, only, not book reviews/columns, etc) relied on statistical significance to justify their findings and they on average came to a p-value of 0.04, how many papers would be incorrect as to their claims of statistical significance? Round the value up to the next integer.", "Level": 2, "Final answer": "41", "file_name": "", "Annotator Metadata": {"Steps": "1. Find how many articles were published in Nature in 2020 by Googling \"articles submitted to nature 2020\"\n2. Click through to Nature's archive for 2020 and filter the results to only provide articles, not other types of publications: 1002\n3. Find 4% of 1002 and round up: 40.08 > 41", "Number of steps": "3", "How long did this take?": "5 minutes", "Tools": "1. search engine\n2. calculator", "Number of tools": "2"}}
{"task_id": "14569e28-c88c-43e4-8c32-097d35b9a67d", "Question": "In Unlambda, what exact charcter or text needs to be added to correct the following code to output \"For penguins\"? If what is needed is a character, answer with the name of the character. If there are different names for the character, use the shortest. The text location is not needed. Code:\n\n`r```````````.F.o.r. .p.e.n.g.u.i.n.si", "Level": 2, "Final answer": "backtick", "file_name": "", "Annotator Metadata": {"Steps": "1. Searched \"Unlambda syntax\" online (optional).\n2. Opened https://en.wikipedia.org/wiki/Unlambda.\n3. Note that the hello world program is very similar in syntax to the code in this question.\n4. Go to the source referenced by the hello world program.\n5. From the referenced source, read what the components of the program do to understand that each period needs a backtick after the initial `r.\n6. Observe that in the given code, there are 12 periods but only 11 backticks after the initial `r, so the missing character is a backtick.", "Number of steps": "6", "How long did this take?": "15 minutes", "Tools": "1. Web browser\n2. Search engine\n3. Unlambda compiler (optional)", "Number of tools": "3"}}
{"task_id": "e1fc63a2-da7a-432f-be78-7c4a95598703", "Question": "If Eliud Kipchoge could maintain his record-making marathon pace indefinitely, how many thousand hours would it take him to run the distance between the Earth and the Moon its closest approach? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Round your result to the nearest 1000 hours and do not use any comma separators if necessary.", "Level": 1, "Final answer": "17", "file_name": "", "Annotator Metadata": {"Steps": "1. Googled Eliud Kipchoge marathon pace to find 4min 37sec/mile\n2. Converted into fractions of hours.\n3. Found moon periapsis in miles (225,623 miles).\n4. Multiplied the two to find the number of hours and rounded to the nearest 100 hours.", "Number of steps": "4", "How long did this take?": "20 Minutes", "Tools": "1. A web browser.\n2. A search engine.\n3. A calculator.", "Number of tools": "3"}}
{"task_id": "32102e3e-d12a-4209-9163-7b3a104efe5d", "Question": "The attached spreadsheet shows the inventory for a movie and video game rental store in Seattle, Washington. What is the title of the oldest Blu-Ray recorded in this spreadsheet? Return it as appearing in the spreadsheet.", "Level": 2, "Final answer": "Time-Parking 2: Parallel Universe", "file_name": "32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx", "Annotator Metadata": {"Steps": "1. Open the attached file.\n2. Compare the years given in the Blu-Ray section to find the oldest year, 2009.\n3. Find the title of the Blu-Ray disc that corresponds to the year 2009: Time-Parking 2: Parallel Universe.", "Number of steps": "3", "How long did this take?": "1 minute", "Tools": "1. Microsoft Excel", "Number of tools": "1"}}
{"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.", "Level": 1, "Final answer": "3", "file_name": "", "Annotator Metadata": {"Steps": "1. I did a search for Mercedes Sosa\n2. I went to the Wikipedia page for her\n3. I scrolled down to \"Studio albums\"\n4. I counted the ones between 2000 and 2009", "Number of steps": "4", "How long did this take?": "5 minutes", "Tools": "1. web browser\n2. google search", "Number of tools": "2"}}
{"task_id": "3627a8be-a77f-41bb-b807-7e1bd4c0ebdf", "Question": "The object in the British Museum's collection with a museum number of 2012,5015.17 is the shell of a particular mollusk species. According to the abstract of a research article published in Science Advances in 2021, beads made from the shells of this species were found that are at least how many thousands of years old?", "Level": 2, "Final answer": "142", "file_name": "", "Annotator Metadata": {"Steps": "1. Use search engine to search for \"British Museum search collection\" and navigate to the British Museum's collection search webpage.\n2. Select \"Museum number\" as search field and \"2012,5015.17\" in text box, then run search.\n3. Open the page for the single result and note that the description says that this is the shell of an individual of the Nassa gibbosula species.\n4. Use search engine to search for \"Nassa gibbosula\".\n5. Note that according to the search result from the World Register of Marine Species website, Nassa gibbosula is not an accepted species name.\n6. Open the page for Nassa gibbosula on the World Register of Marine Species website.\n7. Scan the page and note that the accepted species name is Tritia gibbosula.\n8. Use search engine to search for \"Science Advances 2021 Tritia gibbosula\".\n9. Find that the top result is an article from 2021 in Science Advances titled \"Early Middle Stone Age personal ornaments from Bizmoune Cave, Essaouira, Morocco\".\n10. Scan abstract and note that the article discusses beads made from Tritia gibbosula shells that date to at least 142 thousand years ago, giving a final answer of 142.", "Number of steps": "10", "How long did this take?": "12 minutes", "Tools": "1. Web browser\n2. Search engine", "Number of tools": "2"}}
{"task_id": "7619a514-5fa8-43ef-9143-83b66a43d7a4", "Question": "According to github, when was Regression added to the oldest closed numpy.polynomial issue that has the Regression label in MM/DD/YY?", "Level": 2, "Final answer": "04/15/18", "file_name": "", "Annotator Metadata": {"Steps": "1. Searched \"numpy github\" on Google search.\n2. Opened the NumPy GitHub page.\n3. Clicked \"Issues\" in the repo tabs.\n4. Clicked \"Closed\" on the filter bar.\n5. Set the filter to the \"numpy.polynomial\" label.\n6. Set the filter to the \"06 - Regression\" label.\n7. Opened the oldest Regression post.\n8. Scrolled down to find when the Regression label was added (Apr 15, 2018).\n9. Converted to MM/DD/YY (04/15/18).", "Number of steps": "9", "How long did this take?": "10 minutes", "Tools": "1. Web browser\n2. Search engine", "Number of tools": "2"}}
{"task_id": "ec09fa32-d03f-4bf8-84b0-1f16922c3ae4", "Question": "Here's a fun riddle that I think you'll enjoy.\n\nYou have been selected to play the final round of the hit new game show \"Pick That Ping-Pong\". In this round, you will be competing for a large cash prize. Your job will be to pick one of several different numbered ping-pong balls, and then the game will commence. The host describes how the game works.\n\nA device consisting of a winding clear ramp and a series of pistons controls the outcome of the game. The ramp feeds balls onto a platform. The platform has room for three ping-pong balls at a time. The three balls on the platform are each aligned with one of three pistons. At each stage of the game, one of the three pistons will randomly fire, ejecting the ball it strikes. If the piston ejects the ball in the first position on the platform the balls in the second and third position on the platform each advance one space, and the next ball on the ramp advances to the third position. If the piston ejects the ball in the second position, the ball in the first position is released and rolls away, the ball in the third position advances two spaces to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform. If the piston ejects the ball in the third position, the ball in the first position is released and rolls away, the ball in the second position advances one space to occupy the first position, and the next two balls on the ramp advance to occupy the second and third positions on the platform.\n\nThe ramp begins with 100 numbered ping-pong balls, arranged in ascending order from 1 to 100. The host activates the machine and the first three balls, numbered 1, 2, and 3, advance to the platform. Before the random firing of the pistons begins, you are asked which of the 100 balls you would like to pick. 
If your pick is ejected by one of the pistons, you win the grand prize, $10,000.\n\nWhich ball should you choose to maximize your odds of winning the big prize? Please provide your answer as the number of the ball selected.", "Level": 1, "Final answer": "3", "file_name": "", "Annotator Metadata": {"Steps": "Step 1: Evaluate the problem statement provided in my user's prompt\nStep 2: Consider the probability of any ball on the platform earning the prize.\nStep 3: Evaluate the ball in position one. The probability of it earning the prize, P1, is 1/3\nStep 4: Using a calculator, evaluate the ball in position two. The probability of it earning the prize, P2, is the difference between 1 and the product of the complementary probabilities for each trial\nP2 = 1 - (2/3)(2/3)\nP2 = 5/9\nStep 5: Using a calculator, evaluate the ball in position three. The probability of it earning the prize, P3, is the difference between 1 and the product of the complementary probabilities for each trial\nP3 = 1 - (2/3)(2/3)(2/3)\nP3 = 19/27\nStep 6: Consider the possible outcomes of numbers higher than 3.\nStep 7: For each trial, either 1 or 2 balls from the ramp will advance to the platform. For any given selection, there is a 50% chance that the ball advances to position 2 or position 3.\nStep 8: As position three holds the highest chance of earning the prize, select the only ball known to occupy position three with certainty, ball 3.\nStep 9: Report the correct answer to my user, \"3\"", "Number of steps": "9", "How long did this take?": "1 minute", "Tools": "None", "Number of tools": "0"}}

The records above are GAIA dataset entries with their original metadata. Taking the first one apart gives:


{
  "task_id": "c61d22de-5f6c-4958-a7f6-5e9707bd3466",
  "Question": "A paper about AI regulation that was originally submitted to arXiv.org in June 2022 shows a figure with three axes, where each axis has a label word at both ends. Which of these words is used to describe a type of society in a Physics and Society article submitted to arXiv.org on August 11, 2016?",
  "Level": 2,
  "Final answer": "egalitarian",
  "file_name": "",
  "Annotator Metadata": {
    "Steps": "1. Go to arxiv.org and navigate to the Advanced Search page...",
    "Number of steps": "12",
    "How long did this take?": "8 minutes",
    "Tools": "1. Web browser\n2. Image recognition tools (to identify and parse a figure with three axes)",
    "Number of tools": "2"
  }
}

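What do the JSON keys and values mean?

This data structure splits into two parts: the problem itself and the human solution process. Let's look at what each field means.

Key	Role (what is it?)	Hidden meaning (why does it matter?)
task_id	Unique ID	A unique identifier that distinguishes each of the 466 questions. It is used to manage and track the data.
Question	The actual question given to the AI	"Among the words in a specific figure of an AI regulation paper submitted to arXiv in June 2022, which one is used to describe a type of society in another arXiv paper submitted on August 11, 2016?" Read in context, this is not a simple search. It requires a complex process: ① find a specific paper → ② extract information from a figure → ③ find a second paper under entirely different conditions → ④ cross-check the extracted words against the second paper.
Level	Difficulty of the question	A higher level means that more tools and more steps are needed to solve the question.
Final answer	The single correct answer	The answer is the single word "egalitarian". Because the answer is this clear and concise, grading is easy and objective no matter how complex the AI's path to the answer was.
file_name	Name of the attached file	The empty value means this question has no attachment. Every clue has to be found on the open web (arXiv.org).
Annotator Metadata	Record of the human solution	Holds all the information on how a human annotator solved the question. It is analyzed in more detail in the next section.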
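As a small sketch of how these fields can be read programmatically, the snippet below parses the first record with the standard `json` module. The record is trimmed to the fields used here (Question and Steps are left out for brevity), and the `summarize` helper is a hypothetical name introduced for illustration. Note that "Number of steps" and "Number of tools" are stored as strings in the data, so they are cast to integers.

```python
import json
import re

# First record from the dataset, trimmed to the fields used below.
raw = """{"task_id": "c61d22de-5f6c-4958-a7f6-5e9707bd3466", "Level": 2,
"Final answer": "egalitarian", "file_name": "",
"Annotator Metadata": {"Number of steps": "12", "How long did this take?": "8 minutes",
"Tools": "1. Web browser\\n2. Image recognition tools (to identify and parse a figure with three axes)",
"Number of tools": "2"}}"""


def summarize(record: dict) -> dict:
    """Flatten the fields that describe how demanding a question is."""
    meta = record["Annotator Metadata"]
    # "Tools" is one string with a numbered item per line; strip the "1. " prefixes.
    tools = [re.sub(r"^\d+\.\s*", "", line) for line in meta["Tools"].split("\n")]
    return {
        "task_id": record["task_id"],
        "level": record["Level"],
        "has_attachment": bool(record["file_name"]),
        "n_steps": int(meta["Number of steps"]),
        "n_tools": int(meta["Number of tools"]),
        "tools": tools,
    }


summary = summarize(json.loads(raw))
print(summary)
```

An empty `file_name` becomes `has_attachment: False`, which matches the table's reading that all clues for this question live on the open web.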

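'Annotator Metadata' in depth

Measuring AI performance requires a clear human reference point. Annotator Metadata supplies it: it works like a log file of how a human solved the problem.

1. The human solution roadmap - the counterpart of a 'Golden Action'

Go to the arXiv Advanced Search page.

Search for 'AI regulation' and restrict the submission date to June 2022.

In the results, find the paper whose figure has three axes.

Note the six label words in the figure (deontological, egalitarian, and so on).

Go back to arXiv Advanced Search.

This time, search the 'Physics and Society' category with the submission date set to August 11, 2016.

In the results, look for a paper that contains one of the six words noted earlier.

Find the paper containing the word "egalitarian" and confirm that this is the answer.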

2. Tools, Number of steps, How long did this take?

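Tools: "Web browser", "Image recognition tools"

This states that the AI needs more than the ability to navigate the internet. It also needs multimodal ability, that is, the visual skill to look at a figure in a paper and read the text inside it.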

Number of steps: "12"

This expresses the complexity of the question as an objective number: 12 steps.

How long did this take?: "8 minutes"

This shows that the task is far from trivial: it takes even a skilled human 8 minutes.
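The same three fields can be aggregated across records. The sketch below uses the level, "Number of steps", and "How long did this take?" values copied from the ten records shown earlier in this post and averages them per level. The helper name `per_level_means` is hypothetical, and ten records are far too few to say anything about the full dataset. The point is only to show how the string-typed metadata can be turned into numbers.

```python
from collections import defaultdict
from statistics import mean

# (Level, "Number of steps", "How long did this take?") for the ten records shown earlier.
records = [
    (2, "12", "8 minutes"),
    (2, "10", "5 minutes"),
    (2, "3", "5 minutes"),
    (2, "6", "15 minutes"),
    (1, "4", "20 Minutes"),
    (2, "3", "1 minute"),
    (1, "4", "5 minutes"),
    (2, "10", "12 minutes"),
    (2, "9", "10 minutes"),
    (1, "9", "1 minute"),
]


def per_level_means(rows):
    """Return {level: (mean steps, mean minutes)} after casting the string fields."""
    steps, minutes = defaultdict(list), defaultdict(list)
    for level, n_steps, duration in rows:
        steps[level].append(int(n_steps))
        # Durations look like "8 minutes" or "20 Minutes"; the leading token is the number.
        minutes[level].append(int(duration.split()[0]))
    return {lvl: (mean(steps[lvl]), mean(minutes[lvl])) for lvl in sorted(steps)}


stats = per_level_means(records)
for level, (avg_steps, avg_minutes) in stats.items():
    print(f"Level {level}: {avg_steps:.2f} steps, {avg_minutes:.2f} minutes on average")
```

In this small sample the Level 2 records take more steps on average than the Level 1 records, while the annotators' reported times are similar.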

Latest GAIA leaderboard (as of September 26, 2025)

Agent name	Model family	Average score (%)	Level 1 score (%)	Level 2 score (%)	Level 3 score (%)	Submission date	Organisation
🥇 Co-Sight v2.0.1	ZTE Nebula LLM, Claude Sonnet 4, Gemini 2.5 Pro	84.39	95.7	83.02	67.35	2025-09-24	ZTE-AICloud
🥈 Co-Sight v2.0.0	Claude Sonnet 4, Gemini 2.5 Pro	84.05	95.7	83.02	65.31	2025-08-30	ZTE-AICloud
🥉 Skywork Deep Research Agent	skywork-agent-model, GPT-4.1, Claude-3.7-sonnet, Gemini-2.5-pro	83.39	93.55	83.02	65.31	2025-08-13	Skywork AI
Agent v0.1.4	gpt-4.1	83.06	93.55	83.02	63.27	2025-08-11	(not specified)
ShawnAgent v1.1	GPT5, o3, Claude-Sonnet-4, Gemini-2.5-Pro	82.39	92.47	83.65	59.18	2025-09-05	(not specified)
ShawnAgent v1.2	GPT5, o3, Claude-Sonnet-4, Gemini-2.5-Pro	82.39	91.4	83.02	63.27	2025-09-09	(not specified)
Agent v0.1.3	gpt-4.1	82.06	92.47	81.76	63.27	2025-08-08	(not specified)
AWorld (Run Instantly)	GPT-4o, DeepSeek V3, Claude-Sonnet-4, Gemini-2.5-Pro	81.73	95.7	81.13	57.14	2025-08-06	inclusionAI
ShawnAgent v1.0	o3, GPT5, Claude 3.7 Sonnet, Gemini 2.5 Pro	81.4	95.7	82.39	51.02	2025-09-04	(not specified)
Agent v0.1.2	gpt-4.1	81.06	92.47	80.5	61.22	2025-08-05	(not specified)
agent-0904	(not specified)	81.06	92.47	78.62	67.35	2025-09-04	(not specified)
Su Zero Ultra	(not specified)	80.4	93.55	77.36	65.31	2025-06-26	Suzhou AI Lab
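The Average column is not the plain mean of the three level scores. It behaves like the share of all test questions answered correctly. The sketch below checks this for two rows of the table, assuming the test set holds 93 Level 1, 159 Level 2, and 49 Level 3 questions. That split is an assumption of this sketch and is not stated in this post, and `average_score` is a hypothetical helper name.

```python
# Assumed number of test-set questions per level (not stated in this post).
QUESTIONS_PER_LEVEL = {1: 93, 2: 159, 3: 49}


def average_score(level_scores: dict) -> float:
    """Recover solved-question counts per level, then score them over all questions."""
    solved = sum(
        round(level_scores[level] * count / 100)
        for level, count in QUESTIONS_PER_LEVEL.items()
    )
    total = sum(QUESTIONS_PER_LEVEL.values())
    return round(100 * solved / total, 2)


# Co-Sight v2.0.1 and ShawnAgent v1.1, using the level scores from the table.
print(average_score({1: 95.7, 2: 83.02, 3: 67.35}))
print(average_score({1: 92.47, 2: 83.65, 3: 59.18}))
```

Under this assumption, Level 2 carries more than half of the weight, so a large gain on Level 3 moves the average less than the raw percentages suggest.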

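The era of the 'single model' is over → the rise of ensembles and agents

The most striking point is that most of the top agents do not rely on a single LLM. The Model family column shows that agents such as ZTE's Co-Sight and inclusionAI's AWorld combine several recent models, including GPT, Claude, and Gemini.

'Level 3' is where the ranking is decided

Level 1 scores have leveled off at the top, with most entries above 90%. Level 2 scores are also close to one another, clustering around 80%. The real difference shows at the hardest level, Level 3. Even Co-Sight v2.0.1, currently in first place, reaches only 67.35% on Level 3. This shows that AI still has clear limits when it has to combine several tools and carry out very long chains of reasoning.

A breathless pace of development and competition

Most of the entries were submitted only days or weeks apart. One notable case is ZTE-AICloud, which submitted v2.0.0 on August 30 and the improved v2.0.1 on September 24, less than a month later.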

Conclusion

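We now live in a time when AI passes the bar exam and writes expert-level code, yet it is still immature at the everyday problem solving that humans take for granted. Finding information on the web, picking up clues in a photo, and combining several tools to reach a conclusion are ordinary tasks, and on these tasks AI still struggles to find the right path and remains meaningfully behind humans.

For future research, building a more powerful 'brain' still matters, but I was reminded of something my professor said in a recent class: AI performance is now converging to some extent, and the focus is shifting to 'stability' and 'speed'. In that vein, research should probably concentrate on how skillfully AI uses tools to solve problems in direct contact with the real world. In addition, as the AutoGPT case showed, we need to think carefully about how to secure the 'autonomy' and 'robustness' required to plan independently and carry out many steps persistently.

In the end, for AI to become a general-purpose assistant that is truly part of our lives, what it needs first is not intellectually superhuman ability but flexible and robust problem solving like that of a human. GAIA can be called the first meaningful yardstick for measuring that possibility.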