Total Runs

0

Session

Avg Quality Score

0.000

live

Auto-Approved

0.0%

CI gate threshold: default

Avg Risk Score

0.00

live

Task Difficulty Distribution

1,284 runs

Easy (missing + dupes)

45%

Medium (type + category)

34%

Hard (conflicts + budget)

21%

Quality score ≥ 0.80

68%

Steps within budget

81%

Deduplication accuracy

93%

Recent Runs

run_0001

No runs yet

WAITING

0.000

Select Task to Run

● Easy

Missing Values & Duplicates

Clean NaN entries across 3 columns, identify and remove duplicate rows with matching customer IDs.

500 rows

12 defects

40-step budget
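The Easy task can be pictured with a minimal, standard-library sketch. The function names `clean_missing` and `deduplicate` mirror the action names listed on this page, but the bodies below are illustrative assumptions, not the environment's actual implementation; the median strategy and the "keep first row per key" rule are likewise assumed.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_missing(rows, column):
    """Fill missing values in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if not is_missing(r[column])]
    fill = median(observed)
    return [{**r, column: fill} if is_missing(r[column]) else r for r in rows]


def deduplicate(rows, key):
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    return unique


# Hypothetical sample rows, loosely modelled on the dataset preview.
rows = [
    {"customer_id": "CUST_0041", "revenue": 890.50},
    {"customer_id": "CUST_0041", "revenue": 890.50},
    {"customer_id": "CUST_0089", "revenue": float("nan")},
    {"customer_id": "CUST_0391", "revenue": 1450.00},
]
cleaned = deduplicate(clean_missing(rows, "revenue"), key="customer_id")
```

Deduplicating on `customer_id` alone would also drop legitimate repeat orders from the same customer, so a real cleaner would probably compare more columns before removing a row.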

◆ Medium

Type Errors & Category Drift

Cast mistyped numeric fields, resolve inconsistent category labels (e.g. "NY" vs "New York").

1.2k rows

28 defects

60-step budget
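A rough sketch of the two Medium-task operations follows, assuming numeric fields arrive as strings with thousands separators and that category drift can be handled with a lookup table. `cast_numeric`, `normalize_category`, and `REGION_ALIASES` are hypothetical names introduced for illustration only.

```python
def cast_numeric(value):
    """Parse values such as '1,240.00' into floats; return None if unparseable."""
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return None


# Hypothetical alias table: lower-cased variant -> canonical label.
REGION_ALIASES = {
    "ny": "New York",
    "new york": "New York",
    "cali": "California",
    "california": "California",
}


def normalize_category(label, aliases):
    """Map a label to its canonical form; unknown labels are only trimmed."""
    key = label.strip().lower()
    return aliases.get(key, label.strip())


amount = cast_numeric("1,240.00")
region = normalize_category("NY", REGION_ALIASES)
```

Returning `None` for unparseable input keeps the failure visible so it can be flagged for review rather than silently coerced.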

■ Hard

Conflicts & Budget Constraints

Resolve field-level conflicts across merged sources, handle outliers within strict step budget.

3k rows

54 defects

80-step budget

Playground

Step · Reset · Get state

Click any action below, then press Step. Watch the step budget, quality report, and governance warning update live.

Raw JSON response

{ "message": "Raw JSON response will appear here" }

Status

● ready

task: easy_missing_and_dupes

last action: none

step: 0

submitted: false

Environment Observation

Level: Easy · Step 0 / 0 · ● Ready

{
  "status": "running",
  "step": 7,
  "rows_remaining": 2847,
  "issues_found": {
    "missing_values": 34,
    "duplicates": 12,
    "type_errors": 8,
    "outliers": 5
  },
  "reward": 0.412,
  "budget_used": "7 / 80",
  "risk_flag": null
}

Step Log

Live

00:00.12 reset() → env initialized, task: hard_conflicts

00:01.04 inspect_column("revenue") → 34 nulls detected

00:01.88 clean_missing("revenue", strategy="median") → 34 rows imputed, reward +0.08

00:02.55 deduplicate(key="customer_id") → 12 dupes removed, reward +0.06

00:03.20 cast_type("order_date", target="datetime") ⚠ 3 unparseable values — flagged

00:04.01 resolve_conflict("region", source_priority=[A,B]) → 18 conflicts resolved, reward +0.11

00:05.74 flag_outliers("amount", method="iqr") ⚠ 5 outliers flagged for review

00:06.90 inspect_column("category") → 4 inconsistent labels found

Available Actions

Cleaning

◎

clean_missing()

Impute or drop null values

⧉

deduplicate()

Remove duplicate rows by key

⊞

cast_type()

Convert column to target dtype

Validation

✓

validate_constraints()

Check all columns against spec

△

cap_outliers()

Cap outliers detected by IQR / z-score

⇄

normalize_categories()

Unify inconsistent category labels

Submission

→

submit()

Finalize and grade result

Live Reward Tracker

Quality 0.000

Efficiency 0/0 steps

Risk Score 0.00

step reward trace will appear here

Dataset Preview

3,000 rows · 8 columns · 54 defects

hard_conflicts_and_budget.json

#	customer_id (str)	order_date (datetime)	region (str)	category (str)	amount (float)	revenue (float)	status flags
1	CUST_0041	2024-03-12	New York	Electronics	1,240.00	890.50	✓ clean
2	CUST_0041	2024-03-12	NY	Electronics	1,240.00	890.50	⧉ duplicate
3	CUST_0089	n/a	California	Apparel	320.00	NaN	○ missing
4	CUST_0112	2024-01-28	Texas	electronics	98,400.00	72,100.00	△ outlier
5	CUST_0204	not-a-date	Florida	Home Goods	540.00	310.00	⊞ type error
6	CUST_0310	2024-02-14	Cali	Home Goods	780.00	490.00	⊞ category
7	CUST_0391	2024-02-19	New York	Electronics	2,100.00	1,450.00	✓ clean

Missing Values

34

across 3 columns

Duplicates

12

by customer_id key

Type Errors

8

order_date, category

Outliers

5

IQR method, amount
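The IQR method named above can be sketched as follows. The conventional 1.5 × IQR fences and the use of `statistics.quantiles` with its default quartile method are assumptions; the dashboard does not state which multiplier or quantile definition it uses. The sample values are the seven `amount` entries from the dataset preview.

```python
from statistics import quantiles


def flag_outliers_iqr(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# The seven `amount` values shown in the dataset preview.
amounts = [1240.00, 1240.00, 320.00, 98400.00, 540.00, 780.00, 2100.00]
flagged = flag_outliers_iqr(amounts)
```

On this small sample only the 98,400.00 row falls outside the fences, which matches the row marked as an outlier in the preview.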

Per-Step Risk Scores

run_1284 · gpt-4o

00

No steps yet

Run actions to populate governance trace

0 low

CI Gate Status

Waiting for evaluation

Run and evaluate an episode to see CI gates.

PENDING

Risk Flags & Recommendations

No governance flags yet

Execute actions to populate recommendations.

Evaluation Leaderboard

Hard task · 1,284 runs

#	Agent	Policy	Score bar	Score	Verdict

1

session-agent

No evaluated runs yet

0.000

WAITING

Evaluation Payload

run_1284 · gpt-4o

Quality Scores

Missing handled: 1.00

Duplicates removed: 1.00

Type accuracy: 0.82

Category consistency: 0.94

Conflict resolution: 0.91

Efficiency

Steps used: 7 / 80

Budget utilization: 8.7%

Redundant actions: 0

Avg risk / step: 0.31

High-risk steps: 1

Final Score

0.000

CI gate threshold: 0.75 ● Waiting

Full Run Audit

Complete step-by-step trace with timestamps, actions, observations, and risk scores. Suitable for compliance review and policy debugging.

{
  "run_id": "run_1284",
  "agent": "gpt-4o-0613",
  "task": "hard_conflicts_and_budget",
  "total_steps": 7,
  "final_score": 0.891,
  "verdict": "AUTO_APPROVED",
  "risk_flags": ["step_7_high_risk_drop"],
  "gate_results": {
    "quality": "PASS",
    "budget": "PASS",
    "max_risk": "FLAG"
  }
}
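One way the three gate results in this audit record could be derived is sketched below. Only the 0.75 quality threshold appears on this page; the risk threshold of 0.7, the sample `max_step_risk` of 0.8, the `NEEDS_REVIEW` verdict, and the rule that a FLAG is advisory while only a FAIL blocks approval are all assumptions made for illustration.

```python
def evaluate_gates(final_score, steps_used, step_budget, max_step_risk,
                   quality_threshold=0.75, risk_threshold=0.7):
    """Return (gate_results, verdict) for one evaluated run."""
    gates = {
        "quality": "PASS" if final_score >= quality_threshold else "FAIL",
        "budget": "PASS" if steps_used <= step_budget else "FAIL",
        # Assumption: a risky step raises an advisory FLAG, not a FAIL.
        "max_risk": "FLAG" if max_step_risk >= risk_threshold else "PASS",
    }
    # Assumption: only a FAIL blocks auto-approval; NEEDS_REVIEW is hypothetical.
    verdict = "NEEDS_REVIEW" if "FAIL" in gates.values() else "AUTO_APPROVED"
    return gates, verdict


# Score and step counts come from the audit record; max_step_risk is made up.
gates, verdict = evaluate_gates(
    final_score=0.891, steps_used=7, step_budget=80, max_step_risk=0.8
)
```

Under these assumptions the sketch reproduces the record above: quality and budget pass, max risk is flagged, and the run is still auto-approved.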

Connection & Runtime

● unknown

Base URL: /

Selected Task: easy_missing_and_dupes

Step Budget: 0

Steps Used: 0

Validation Passed: false

API Endpoints

Loading endpoint map...

Quick Actions

Configuration Notes

This environment uses deterministic tasks and fixed evaluator thresholds by default. Use /evaluate with custom threshold payloads for stricter CI policies.
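A hedged sketch of preparing a stricter-threshold request for /evaluate is shown below. Only the /evaluate path comes from this page; the base URL, the `thresholds` key, and the individual threshold field names are assumptions, and the real payload schema should be taken from the endpoint map. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_evaluate_request(base_url, thresholds):
    """Build (but do not send) a POST request to /evaluate with custom thresholds."""
    # Assumption: the endpoint accepts a JSON object with a "thresholds" key.
    body = json.dumps({"thresholds": thresholds}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical base URL and threshold names, stricter than the 0.75 default.
request = build_evaluate_request(
    "http://localhost:8000",
    {"quality": 0.90, "max_risk": 0.50},
)
```

Passing the prepared object to `urllib.request.urlopen` would send it once the actual schema has been confirmed.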

Current CI decision

WAITING

Task Library

3 tasks

01

Loading tasks...

Preparing task catalog

ready

Active Task

easy_missing_and_dupes

Used by next reset/run

Recommended Workflow

inspect → clean → validate → submit

Keeps risk and invalid actions low