Surprisingly enough, it seems some AI agents aren’t quite up to scratch on some basic business tests

Lovabledaniels, June 17, 2025

Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
Reasoning models like gemini-2.5-pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark – CRMArena-Pro – which uses synthetic enterprise data to assess LLM agent performance in different CRM scenarios.

It found LLM agents achieved around 58% success on tasks which can be completed in a single step, with tasks that require multiple interactions dropping in effectiveness to just 35% – barely more than one in three.

Although models like gemini-2.5-pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting they might not quite be up to scratch after all.

Are AI agents actually that good?

The paper, entitled ‘Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions’, explained that LLM agents displayed near-zero inherent confidentiality awareness, noting that their performance in handling sensitive information is only improved with explicit prompting (which often came at the expense of task success).

They also criticized previous and existing benchmarks for failing to capture multi-turn interactions, address B2B scenarios or confidentiality, and reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering B2B and B2C settings.

In terms of analysis results, reasoning models like gemini-2.5-pro and o1 outperformed lighter models most of the time – Salesforce’s researchers concluded that models that seek more clarifications generally perform better, especially in multi-turn tasks.

For example, the nine models tested (three each from OpenAI, Google and Meta) averaged a score of 35.1%, while gemini-2.5-pro scored 54.5%.

“These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future advancements in developing more sophisticated, reliable, and confidentiality-aware LLM agents for professional use,” the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff views AI agents as a high-margin opportunity, with major corporate clients, including governments, betting on AI agents for boosted efficiency and further cost savings.
