
Apple Engineers Reveal the Fragility of AI ‘Reasoning’

Companies such as OpenAI and Google have recently been promoting the “reasoning” capabilities of their latest artificial intelligence models as a significant advance. However, a new study by six Apple engineers reveals that the mathematical “reasoning” of these advanced large language models (LLMs) can be extremely fragile and unreliable when common benchmark problems are altered in minor ways.

The fragility observed in the study supports previous research suggesting that LLMs rely on probabilistic pattern matching rather than the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. The researchers propose that “current LLMs are not capable of genuine logical reasoning,” and instead attempt to replicate reasoning steps observed in their training data.

In their paper titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” available as a preprint, the Apple researchers started with GSM8K’s standardized set of over 8,000 grade-school level mathematical word problems, often used as a benchmark for modern LLMs’ complex reasoning capabilities. The researchers employed a novel approach by modifying a portion of that testing set, dynamically replacing certain names and numbers with new values. For example, a question about Sophie getting 31 building blocks might be altered to ask about Bill getting 19 building blocks.

This method helps prevent “data contamination,” which can occur when static GSM8K questions are incorporated into an AI model’s training data. Since these incidental changes do not alter the actual difficulty of the mathematical reasoning, models should theoretically perform just as well on GSM-Symbolic as on GSM8K.
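To make that kind of substitution concrete, here is a minimal, hypothetical sketch (not code from the paper) of how a GSM8K-style question could be treated as a template whose name and numbers are re-rolled while the underlying reasoning stays fixed:

```python
import random

# Hypothetical illustration of templated substitution: a grade-school word
# problem becomes a template, and its name and numbers are replaced with
# fresh values that leave the required reasoning steps unchanged.
TEMPLATE = ("{name} has {start} building blocks and buys {extra} more. "
            "How many building blocks does {name} have now?")
NAMES = ["Sophie", "Bill", "Priya", "Omar"]  # example names, not from the benchmark

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(NAMES)
    start = rng.randint(5, 60)
    extra = rng.randint(5, 60)
    question = TEMPLATE.format(name=name, start=start, extra=extra)
    answer = start + extra  # ground truth follows the same steps for any values
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Each fresh draw produces a question the model is unlikely to have memorized verbatim, which is the point of the exercise.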

However, when over 20 state-of-the-art LLMs were tested on GSM-Symbolic, researchers found that average accuracy decreased across the board compared to GSM8K, with performance drops ranging from 0.3 percent to 9.2 percent, depending on the model. The results also exhibited high variance across 50 separate runs of GSM-Symbolic with varying names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model. Additionally, altering the numbers resulted in worse accuracy compared to changing the names.

This variance is noteworthy since, according to the researchers, “the overall reasoning steps needed to solve a question remain the same.” The fact that minor changes led to such variable results suggests to the researchers that these models are not performing “formal” reasoning but instead are trying to match patterns, aligning questions and solution steps with those seen in the training data.

Despite this, the overall variance in the GSM-Symbolic tests was relatively small in the grand scheme of things. For example, OpenAI’s ChatGPT-4o saw a marginal drop from 95.2 percent accuracy on GSM8K to 94.9 percent on GSM-Symbolic. This high success rate is impressive on either benchmark, even if the model is not using “formal” reasoning (though accuracy for many models decreased significantly when researchers added just one or two additional logical steps to the problems).

The tested LLMs performed poorly, however, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. In this “GSM-NoOp” benchmark set, a question might include irrelevant details, such as kiwis being a bit smaller than average.
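As a rough illustration, with hypothetical wording and numbers rather than the paper’s actual test items, a GSM-NoOp-style modification might look like this:

```python
# Hypothetical sketch of the GSM-NoOp idea: append a clause that sounds
# relevant but does not change the arithmetic, then compare model answers.
base_question = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
                 "How many kiwis does Oliver have?")
red_herring = "Five of Saturday's kiwis were a bit smaller than average."

noop_question = base_question.replace("How many", red_herring + " How many")
print(noop_question)
# The correct answer is still 44 + 58 = 102; a model that subtracts the five
# smaller kiwis is matching patterns in the extra clause rather than reasoning
# about whether it actually matters.
```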

These red herrings led to what researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, with declines ranging from 17.5 percent to 65.7 percent, depending on the model. These substantial accuracy drops highlight the limitations of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers conclude.