NeurIPS 2025 Workshop on Multi-Turn Interactions in LLMs (MTI-LLM)

Graph Evaluation Framework

WebGraphEval

Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation

Aggregating 4,768 WebArena trajectories into a unified action graph to reveal strategy diversity, efficiency, and cross-agent behavior beyond binary outcomes.

Trajectories: 4,768
Agents Benchmarked: 6
Tasks: 812
Abstract

//Overview

View on arXiv: 2510.19205

Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.

Graph Abstraction

Canonicalized, merged actions

LLM Support

Judge + necessity labeling

Structural Insights

Reward propagation & edge classes

Key Takeaways

  • Captures cross-agent strategy diversity without altering benchmark environments.
  • Identifies redundancy, bottlenecks, and critical decisions missed by success-only metrics.
  • LLM-assisted annotations deliver scalable canonicalization and necessity judgments.

Pipeline

  1. Canonicalize heterogeneous actions with LLM support and align outcomes (a minimal sketch follows this list).
  2. Merge trajectories into a consensus graph with success-conditioned edge weights.
  3. Propagate rewards, classify edges, and analyze efficiency across agents.
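
A minimal sketch of the canonicalization step, assuming raw actions arrive as loose dictionaries and that a hypothetical llm_canonicalize callable stands in for the LLM-backed normalization described above (the schema fields are illustrative, not the paper's exact encoding):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CanonicalAction:
        # Shared schema across agents: action type, normalized target, optional value.
        kind: str
        target: str
        value: str = ""

    def canonicalize(raw_action: dict, llm_canonicalize) -> CanonicalAction:
        # llm_canonicalize is a hypothetical stand-in for the LLM-backed
        # normalization; it only needs to return 'kind', 'target', 'value' keys.
        norm = llm_canonicalize(raw_action)
        return CanonicalAction(
            kind=norm["kind"].strip().lower(),
            target=norm["target"].strip().lower(),
            value=norm.get("value", "").strip(),
        )

Identical CanonicalAction values produced by different agents then map to the same node in the consensus graph built in the next step.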

Dataset Snapshot

Successful Trajectories: 2,180
Failed Trajectories: 2,588
Average Steps: 8.6

//Graph-Based Evaluation Pipeline

The methodology section introduces WebGraphEval's pipeline: canonicalising trajectories, building a consensus action graph, analysing it structurally, and reading out behavioural insights. The summary below mirrors the narrative that accompanies Figure 1 in the paper.

Figure 1

Pipeline Overview

Figure 1 in the paper walks through the pipeline: canonicalising raw trajectories, building a consensus graph, and analysing it to surface structure-aware evaluation signals that feed subsequent tables.

  • Canonicalisation. LLM-backed normalisation groups heterogeneous actions into a shared schema.
  • Consensus graph. Canonical actions merge into nodes with edge weights encoding frequency and outcomes (see the sketch after this list).
  • Structural analysis. Reward propagation, edge typing, and bottleneck detection reveal critical decisions.
  • Behaviour read-outs. Graph statistics drive efficiency, diversity, and cross-agent evaluations.
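
A minimal sketch of the consensus-graph step plus an illustrative edge typing, assuming each trajectory is a list of canonical action keys with a success flag; the edge class names are placeholders, not the paper's taxonomy:

    from collections import defaultdict

    def build_consensus_graph(trajectories):
        # trajectories: iterable of (actions, success) pairs, where actions is a
        # list of canonical action keys and success is a bool. Edge weights count
        # transitions separately for successful and failed trajectories.
        nodes = set()
        edges = defaultdict(lambda: {"success": 0, "failure": 0})
        for actions, success in trajectories:
            nodes.update(actions)
            outcome = "success" if success else "failure"
            for src, dst in zip(actions, actions[1:]):
                edges[(src, dst)][outcome] += 1
        return nodes, dict(edges)

    def classify_edge(weights):
        # Illustrative success-conditioned edge typing (placeholder labels).
        if weights["failure"] == 0:
            return "success-only"
        if weights["success"] == 0:
            return "failure-only"
        return "mixed"
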
Figure 1

WebGraphEval Pipeline

The image replicates Figure 1 from the paper, illustrating how trajectories from multiple agents are canonicalised, merged into a consensus action graph, and analysed to surface structural signals for evaluation.

//Performance & Graph Statistics

Tables 2-4 in the paper cover llm_judge results, dataset scale, and repeatability across frameworks. The metrics below summarise the dataset before diving into each table.

Trajectories: 4,768
Agents Benchmarked: 6
Actions Encoded: 40,888
Tasks Covered: 812
Table 2

Framework-Level Performance Comparison

Table 2 aggregates llm_judge outcomes; the summary row mirrors the totals reported in the paper.

Framework         | Success | Failure | Success Rate | Avg Steps | Avg Confidence | Necessity Rate
Zeta Labs Jace.AI |     526 |     286 |       64.78% |       5.6 |          0.934 |          73.7%
IBM CUGA          |     477 |     331 |       59.03% |       5.3 |          0.972 |          80.6%
Learn by Interact |     440 |     372 |       54.19% |       6.0 |          0.967 |          72.9%
UI-TARS           |     296 |     468 |       38.74% |      13.2 |          0.976 |          82.0%
OpenAI-CUA        |     226 |     582 |       27.97% |       6.4 |          0.932 |          74.1%
BrowserUse        |     215 |     549 |       28.14% |      15.6 |          0.926 |          74.7%
Total             |   2,180 |   2,588 |       45.75% |         - |              - |              -
Takeaway: Jace.AI delivers the highest raw success, while IBM CUGA pairs strong success with one of the highest necessity rates, exposing efficiency trade-offs that the graph analysis highlights throughout Section 4.
Table 3

Overall Graph Dataset Statistics

Table 3 quantifies the consensus graph extracted from 4,768 trajectories; these values anchor the structural analyses discussed in Section 4.2 of the paper.

Metric                       | Value
Total Graph Nodes            | 40,431
Total Graph Edges            | 45,656
Average Nodes per Task       | 49.79 ± 21.83
Average Edges per Task       | 56.23 ± 22.86
Average Steps per Trajectory | 8.58
Scale: Over forty thousand canonical actions and forty-five thousand transitions underpin the evaluation, providing the structural canvas for the behavioural analyses in the paper (a small aggregation sketch follows).
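
A minimal sketch of how such per-task statistics could be aggregated, assuming a mapping from task ids to the (nodes, edges) pairs produced by a builder like the one sketched earlier; all names here are illustrative:

    from statistics import mean, stdev

    def graph_summary(task_graphs):
        # task_graphs: dict mapping task id -> (nodes, edges), where nodes is a
        # set of canonical actions and edges is a dict keyed by (src, dst) pairs.
        node_counts = [len(nodes) for nodes, _ in task_graphs.values()]
        edge_counts = [len(edges) for _, edges in task_graphs.values()]
        return {
            "total_nodes": sum(node_counts),
            "total_edges": sum(edge_counts),
            "avg_nodes_per_task": (mean(node_counts), stdev(node_counts)),
            "avg_edges_per_task": (mean(edge_counts), stdev(edge_counts)),
        }
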
Table 4

Success Rates (Mean ± Std)

Table 4 evaluates repeatability. Variance stays below one percentage point, confirming that differences across frameworks stem from strategy rather than evaluation noise.

Framework         | Success Rate (mean ± std) | Observation
Zeta Labs Jace.AI | 64.86% ± 0.43             | Stable high performer.
IBM CUGA          | 59.70% ± 0.14             | Minimal variance across evaluations.
Learn by Interact | 53.74% ± 0.43             | Resilient to task shifts.
UI-TARS           | 38.88% ± 0.35             | Consistent but lower ceiling.
OpenAI-CUA        | 28.55% ± 0.07             | Highly repeatable performance.
BrowserUse        | 27.66% ± 0.15             | Detects hard tasks reliably.
Consistency: Stable repeat evaluations confirm the reliability of the structural signals extracted by WebGraphEval.

//What the Graph Reveals

Key findings from Sections 4.4-4.6 synthesise how WebGraphEval surfaces actionable insights beyond raw success metrics. Each card distils text accompanying Figures 2 and 3.

Figure 2

Necessity is a learnable signal

Necessity rates climb from 68% on first attempts to over 83% after ten tries, showing that agents reduce redundant actions with experience.

  • Current frameworks span 72.9%-82.0% necessity despite similar success confidence (0.926-0.976).
  • Focused action sequences alone do not guarantee success: UI-TARS matches IBM CUGA's necessity but underperforms by 20 points in success rate.
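
A minimal sketch of how the per-attempt necessity rate above could be tallied, assuming each record carries an attempt index and per-action necessity labels from the LLM judge (names and record layout are illustrative):

    from collections import defaultdict

    def necessity_by_attempt(records):
        # records: iterable of (attempt_index, necessity_labels) pairs, where
        # necessity_labels is a list of booleans, one per action in the trajectory.
        needed, total = defaultdict(int), defaultdict(int)
        for attempt, labels in records:
            needed[attempt] += sum(labels)
            total[attempt] += len(labels)
        return {a: needed[a] / total[a] for a in sorted(total)}
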
Section 4.2

Outcomes diverge across frameworks

Only 3.8% of the 761 shared tasks are solved by every framework, while 83.2% have mixed outcomes, underscoring the need for integrated evaluation.

  • Balanced coverage of 2,180 successful vs. 2,588 failed trajectories captures both effective and ineffective strategies.
  • Action anomalies (91 one-step successes, 13%) are retained but treated separately to avoid skewing efficiency results.
Figure 3

Structural complexity predicts difficulty

Success peaks on medium tasks (54%) and drops to 31% on very complex ones; the 7.9% degradation tracks the (nodes × edges) / trajectories complexity metric (sketched below).

  • IBM CUGA and Learn by Interact produce compact graphs, whereas UI-TARS and BrowserUse generate exploratory, branched structures.
  • Graph merges occur in 87.4% of tasks but only 5.6% of nodes, capturing overlap without erasing behavioural diversity.
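
A minimal sketch of the complexity score referenced above; the example call uses the dataset-level averages from Table 3 purely for illustration:

    def structural_complexity(num_nodes, num_edges, num_trajectories):
        # Complexity metric referenced above: nodes x edges / trajectories.
        return num_nodes * num_edges / max(num_trajectories, 1)

    # Roughly 49.79 nodes and 56.23 edges per task, ~5.9 trajectories per task (4,768 / 812).
    print(structural_complexity(49.79, 56.23, 4768 / 812))
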
Section 4.1

Canonicalisation quality is validated

LLM-driven annotations reach 91% agreement for canonicalisation and 78% for necessity, supported by triple llm_judge runs per framework and human spot checks.

  • Confidence scores remain high across agents (0.926-0.976), confirming stable judgments even when outcomes differ.
  • Merge threshold θ = 0.9 balances abstraction with fidelity, ensuring shared strategies are captured while preserving unique behaviours (a sketch of this merge rule follows below).
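
A minimal sketch of a θ = 0.9 merge rule, using difflib's SequenceMatcher as a stand-in similarity measure; the paper's actual similarity function is not reproduced here:

    from difflib import SequenceMatcher

    MERGE_THRESHOLD = 0.9  # theta from Section 4.1

    def should_merge(action_a: str, action_b: str, threshold: float = MERGE_THRESHOLD) -> bool:
        # Merge two canonical actions into one node only when they are near-duplicates.
        return SequenceMatcher(None, action_a, action_b).ratio() >= threshold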

//How to Cite

How to Cite WebGraphEval

@inproceedings{qian2025webgrapheval,
  title={WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation},
  author={Yaoyao Qian and Yuanli Wang and Jinda Zhang and Yun Zong and Meixu Chen and Hanhan Zhou and Jindan Huang and Yifan Zeng and Xinyu Hu and Chan Hee Song and Danqing Zhang},
  booktitle={First Workshop on Multi-Turn Interactions in Large Language Models},
  year={2025},
  url={WebGraphEval.pdf}
}

//Gratitude

We thank Cookie (Yaoyao's dog) and Lucas (Yaoyao's cat) for their comforting presence during this work.