NeurIPS 2025 Workshop on Multi-Turn Interactions in LLMs (MTI-LLM)

Graph Evaluation Framework

WebGraphEval

Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation

Aggregating 4,768 WebArena trajectories into a unified action graph to reveal strategy diversity, efficiency, and cross-agent behavior beyond binary outcomes.

Trajectories: 4,768
Agents Benchmarked: 6
Tasks: 812
Abstract

//Overview

View on arXiv: 2510.19205

Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.

Graph Abstraction

Canonicalized, merged actions

LLM Support

Judge + necessity labeling

Structural Insights

Reward propagation & edge classes

Key Takeaways

  • Captures cross-agent strategy diversity without altering benchmark environments.
  • Identifies redundancy, bottlenecks, and critical decisions missed by success-only metrics.
  • LLM-assisted annotations deliver scalable canonicalization and necessity judgments.

Pipeline

  1. Canonicalize heterogeneous actions with LLM support and align outcomes (a minimal sketch follows this list).
  2. Merge trajectories into a consensus graph with success-conditioned edge weights.
  3. Propagate rewards, classify edges, and analyze efficiency across agents.
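
A minimal sketch of the canonicalization step, assuming raw actions arrive as loose dictionaries and that a hypothetical llm_canonicalize callable stands in for the LLM-backed normalization described above (the schema fields are illustrative, not the paper's exact encoding):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CanonicalAction:
        # Shared schema across agents: action type, normalized target, optional value.
        kind: str
        target: str
        value: str = ""

    def canonicalize(raw_action: dict, llm_canonicalize) -> CanonicalAction:
        # llm_canonicalize is a hypothetical stand-in for the LLM-backed
        # normalization; it only needs to return 'kind', 'target', 'value' keys.
        norm = llm_canonicalize(raw_action)
        return CanonicalAction(
            kind=norm["kind"].strip().lower(),
            target=norm["target"].strip().lower(),
            value=norm.get("value", "").strip(),
        )

Identical CanonicalAction values produced by different agents then map to the same node in the consensus graph built in the next step.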

Dataset Snapshot

Successful Trajectories: 2,180
Failed Trajectories: 2,588
Average Steps: 8.6

//Graph-Based Evaluation Pipeline

The methodology section introduces WebGraphEval's pipeline: canonicalising trajectories, building a consensus action graph, analysing it structurally, and reading out behavioural insights. The summary below mirrors the narrative that accompanies Figure 1 in the paper.

Figure 1

Pipeline Overview

Figure 1 in the paper walks through the pipeline: canonicalising raw trajectories, building a consensus graph, and analysing it to surface structure-aware evaluation signals that feed subsequent tables.

  • Canonicalisation. LLM-backed normalisation groups heterogeneous actions into a shared schema.
  • Consensus graph. Canonical actions merge into nodes with edge weights encoding frequency and outcomes (see the sketch after this list).
  • Structural analysis. Reward propagation, edge typing, and bottleneck detection reveal critical decisions.
  • Behaviour read-outs. Graph statistics drive efficiency, diversity, and cross-agent evaluations.
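
A minimal sketch of the consensus-graph step plus an illustrative edge typing, assuming each trajectory is a list of canonical action keys with a success flag; the edge class names are placeholders, not the paper's taxonomy:

    from collections import defaultdict

    def build_consensus_graph(trajectories):
        # trajectories: iterable of (actions, success) pairs, where actions is a
        # list of canonical action keys and success is a bool. Edge weights count
        # transitions separately for successful and failed trajectories.
        nodes = set()
        edges = defaultdict(lambda: {"success": 0, "failure": 0})
        for actions, success in trajectories:
            nodes.update(actions)
            outcome = "success" if success else "failure"
            for src, dst in zip(actions, actions[1:]):
                edges[(src, dst)][outcome] += 1
        return nodes, dict(edges)

    def classify_edge(weights):
        # Illustrative success-conditioned edge typing (placeholder labels).
        if weights["failure"] == 0:
            return "success-only"
        if weights["success"] == 0:
            return "failure-only"
        return "mixed"
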
Figure 1

WebGraphEval Pipeline

The image replicates Figure 1 from the paper, illustrating how trajectories from multiple agents are canonicalised, merged into a consensus action graph, and analysed to surface structural signals for evaluation.

//Performance & Graph Statistics

Tables 2-4 in the paper cover llm_judge results, dataset scale, and repeatability across frameworks. The metrics below summarise the dataset before diving into each table.

Trajectories: 4,768
Agents Benchmarked: 6
Actions Encoded: 40,888
Tasks Covered: 812
Table 2

Framework-Level Performance Comparison

Table 2 aggregates llm_judge outcomes; the summary row mirrors the totals reported in the paper.

Framework         | Success | Failure | Success Rate | Avg Steps | Avg Confidence | Necessity Rate
Zeta Labs Jace.AI |     526 |     286 |       64.78% |       5.6 |          0.934 |          73.7%
IBM CUGA          |     477 |     331 |       59.03% |       5.3 |          0.972 |          80.6%
Learn by Interact |     440 |     372 |       54.19% |       6.0 |          0.967 |          72.9%
UI-TARS           |     296 |     468 |       38.74% |      13.2 |          0.976 |          82.0%
OpenAI-CUA        |     226 |     582 |       27.97% |       6.4 |          0.932 |          74.1%
BrowserUse        |     215 |     549 |       28.14% |      15.6 |          0.926 |          74.7%
Total             |   2,180 |   2,588 |       45.75% |         - |              - |              -
Takeaway: Jace.AI delivers the highest raw success, while IBM CUGA pairs strong success with one of the highest necessity rates, exposing efficiency trade-offs that the graph analysis highlights throughout Section 4.
Table 3

Overall Graph Dataset Statistics

Table 3 quantifies the consensus graph extracted from 4,768 trajectories; these values anchor the structural analyses discussed in Section 4.2 of the paper.

Metric                       | Value
Total Graph Nodes            | 40,431
Total Graph Edges            | 45,656
Average Nodes per Task       | 49.79 ± 21.83
Average Edges per Task       | 56.23 ± 22.86
Average Steps per Trajectory | 8.58
Scale: Over forty thousand canonical actions and forty-five thousand transitions underpin the evaluation, providing the structural canvas for the behavioural analyses in the paper (a small aggregation sketch follows).
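
A minimal sketch of how such per-task statistics could be aggregated, assuming a mapping from task ids to the (nodes, edges) pairs produced by a builder like the one sketched earlier; all names here are illustrative:

    from statistics import mean, stdev

    def graph_summary(task_graphs):
        # task_graphs: dict mapping task id -> (nodes, edges), where nodes is a
        # set of canonical actions and edges is a dict keyed by (src, dst) pairs.
        node_counts = [len(nodes) for nodes, _ in task_graphs.values()]
        edge_counts = [len(edges) for _, edges in task_graphs.values()]
        return {
            "total_nodes": sum(node_counts),
            "total_edges": sum(edge_counts),
            "avg_nodes_per_task": (mean(node_counts), stdev(node_counts)),
            "avg_edges_per_task": (mean(edge_counts), stdev(edge_counts)),
        }
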
Table 4

Success Rates (Mean ± Std)

Table 4 evaluates repeatability. Variance stays below one percentage point, confirming that differences across frameworks stem from strategy rather than evaluation noise.

Framework         | Success Rate (mean ± std) | Observation
Zeta Labs Jace.AI | 64.86% ± 0.43             | Stable high performer.
IBM CUGA          | 59.70% ± 0.14             | Minimal variance across evaluations.
Learn by Interact | 53.74% ± 0.43             | Resilient to task shifts.
UI-TARS           | 38.88% ± 0.35             | Consistent but lower ceiling.
OpenAI-CUA        | 28.55% ± 0.07             | Highly repeatable performance.
BrowserUse        | 27.66% ± 0.15             | Detects hard tasks reliably.
Consistency: Stable repeat evaluations confirm the reliability of the structural signals extracted by WebGraphEval.

//What the Graph Reveals

Key findings from Sections 4.4-4.6 synthesise how WebGraphEval surfaces actionable insights beyond raw success metrics. Each card distils text accompanying Figures 2 and 3.

Figure 2

Necessity is a learnable signal

Necessity rates climb from 68% on first attempts to over 83% after ten tries, showing that agents reduce redundant actions with experience.

  • Current frameworks span 72.9%-82.0% necessity despite similar success confidence (0.926-0.976).
  • Focused action sequences alone do not guarantee success: UI-TARS matches IBM CUGA's necessity but underperforms by 20 points in success rate.
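
A minimal sketch of how the per-attempt necessity rate above could be tallied, assuming each record carries an attempt index and per-action necessity labels from the LLM judge (names and record layout are illustrative):

    from collections import defaultdict

    def necessity_by_attempt(records):
        # records: iterable of (attempt_index, necessity_labels) pairs, where
        # necessity_labels is a list of booleans, one per action in the trajectory.
        needed, total = defaultdict(int), defaultdict(int)
        for attempt, labels in records:
            needed[attempt] += sum(labels)
            total[attempt] += len(labels)
        return {a: needed[a] / total[a] for a in sorted(total)}
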
Section 4.2

Outcomes diverge across frameworks

Only 3.8% of the 761 shared tasks are solved by every framework, while 83.2% have mixed outcomes, underscoring the need for integrated evaluation.

  • Balanced coverage of 2,180 successful vs. 2,588 failed trajectories captures both effective and ineffective strategies.
  • Action anomalies (91 one-step successes, 13%) are retained but treated separately to avoid skewing efficiency results.
Figure 3

Structural complexity predicts difficulty

Success peaks on medium tasks (54%) and drops to 31% on very complex ones; the 7.9% degradation tracks the (nodes × edges) / trajectories complexity metric (sketched below).

  • IBM CUGA and Learn by Interact produce compact graphs, whereas UI-TARS and BrowserUse generate exploratory, branched structures.
  • Graph merges occur in 87.4% of tasks but only 5.6% of nodes, capturing overlap without erasing behavioural diversity.
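
A minimal sketch of the complexity score referenced above; the example call uses the dataset-level averages from Table 3 purely for illustration:

    def structural_complexity(num_nodes, num_edges, num_trajectories):
        # Complexity metric referenced above: nodes x edges / trajectories.
        return num_nodes * num_edges / max(num_trajectories, 1)

    # Roughly 49.79 nodes and 56.23 edges per task, ~5.9 trajectories per task (4,768 / 812).
    print(structural_complexity(49.79, 56.23, 4768 / 812))
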
Section 4.1

Canonicalisation quality is validated

LLM-driven annotations reach 91% agreement for canonicalisation and 78% for necessity, supported by triple llm_judge runs per framework and human spot checks.

  • Confidence scores remain high across agents (0.926-0.976), confirming stable judgments even when outcomes differ.
  • Merge threshold θ = 0.9 balances abstraction with fidelity, ensuring shared strategies are captured while preserving unique behaviours (a sketch of this merge rule follows below).
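
A minimal sketch of a θ = 0.9 merge rule, using difflib's SequenceMatcher as a stand-in similarity measure; the paper's actual similarity function is not reproduced here:

    from difflib import SequenceMatcher

    MERGE_THRESHOLD = 0.9  # theta from Section 4.1

    def should_merge(action_a: str, action_b: str, threshold: float = MERGE_THRESHOLD) -> bool:
        # Merge two canonical actions into one node only when they are near-duplicates.
        return SequenceMatcher(None, action_a, action_b).ratio() >= threshold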

//How to Cite

How to Cite WebGraphEval

@inproceedings{qian2025webgrapheval,
  title={WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation},
  author={Yaoyao Qian and Yuanli Wang and Jinda Zhang and Yun Zong and Meixu Chen and Hanhan Zhou and Jindan Huang and Yifan Zeng and Xinyu Hu and Chan Hee Song and Danqing Zhang},
  booktitle={First Workshop on Multi-Turn Interactions in Large Language Models},
  year={2025},
  url={WebGraphEval.pdf}
}

//Gratitude

We thank Cookie (Yaoyao's dog) and Lucas (Yaoyao's cat) for their comforting presence during this work.