Total Runs
0
Session
Avg Quality Score
0.000
live
Auto-Approved
0.0%
CI gate threshold: default
Avg Risk Score
0.00
live
Task Difficulty Distribution
1,284 runs
Easy (missing + dupes)
45%
Medium (type + category)
34%
Hard (conflicts + budget)
21%
Quality score ≥ 0.80
68%
Steps within budget
81%
Deduplication accuracy
93%
Recent Runs
run_0001
No runs yet
WAITING
0.000
Select Task to Run
● Easy
Missing Values & Duplicates
Clean NaN entries across 3 columns, then identify and remove duplicate rows with matching customer IDs.
◆ Medium
Type Errors & Category Drift
Cast mistyped numeric fields and resolve inconsistent category labels (e.g. "NY" vs "New York").
■ Hard
Conflicts & Budget Constraints
Resolve field-level conflicts across merged sources and handle outliers within a strict step budget.
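
The Medium task's two operations, casting mistyped numeric fields and unifying category labels, can be sketched as below. This is a minimal illustration, not the environment's implementation: the alias table and the helper names `normalize_label` and `cast_numeric` are assumptions, since the dashboard does not show the real label mapping.

```python
from typing import Optional

# Hypothetical alias table; the task's real label mapping is not shown here.
LABEL_ALIASES = {
    "ny": "New York",
    "new york": "New York",
    "cali": "California",
    "california": "California",
}


def normalize_label(value: str) -> str:
    """Return the canonical label, or the stripped input if no alias is known."""
    cleaned = value.strip()
    return LABEL_ALIASES.get(cleaned.lower(), cleaned)


def cast_numeric(value: str) -> Optional[float]:
    """Cast a string such as '1,240.00' to float; return None if unparseable."""
    try:
        return float(value.replace(",", ""))
    except ValueError:
        return None


labels = [normalize_label(v) for v in ["NY", "New York", " Cali ", "Texas"]]
amount = cast_numeric("1,240.00")
```

Unknown labels pass through unchanged rather than being guessed, so they remain visible for review.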
Playground
Step · Reset · Get state
Click any action below, then press Step. Watch the step budget, quality report, and governance warning update live.
Raw JSON response
{
"message": "Raw JSON response will appear here"
}
Status
● ready
task: easy_missing_and_dupes
last action: none
step: 0
submitted: false
Environment Observation
Level: Easy
Step 0 / 0
● Ready
"status": "running",
"step": 7,
"rows_remaining": 2847,
"issues_found": {
"missing_values": 34,
"duplicates": 12,
"type_errors": 8,
"outliers": 5
},
"reward": 0.412,
"budget_used": 7 / 80,
"risk_flag": null
"step": 7,
"rows_remaining": 2847,
"issues_found": {
"missing_values": 34,
"duplicates": 12,
"type_errors": 8,
"outliers": 5
},
"reward": 0.412,
"budget_used": 7 / 80,
"risk_flag": null
Step Log
Live
00:00.12
reset()
→ env initialized, task: hard_conflicts
00:01.04
inspect_column("revenue")
→ 34 nulls detected
00:01.88
clean_missing("revenue", strategy="median")
→ 34 rows imputed, reward +0.08
00:02.55
deduplicate(key="customer_id")
→ 12 dupes removed, reward +0.06
00:03.20
cast_type("order_date", target="datetime")
⚠ 3 unparseable values — flagged
00:04.01
resolve_conflict("region", source_priority=[A,B])
→ 18 conflicts resolved, reward +0.11
00:05.74
flag_outliers("amount", method="iqr")
⚠ 5 outliers flagged for review
00:06.90
inspect_column("category")
→ 4 inconsistent labels found
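
The `clean_missing(..., strategy="median")` and `deduplicate(key=...)` steps in the log above can be approximated in plain Python. This is a rough sketch under stated assumptions: the function bodies, the list-of-dicts row format, and the keep-first policy for duplicates are illustrative choices, not the environment's actual behavior.

```python
from statistics import median


def impute_median(rows, column):
    """Fill None values in `column` with the median of the present values.

    Returns the number of rows imputed. Raises StatisticsError if the
    column has no present values at all.
    """
    fill = median(r[column] for r in rows if r[column] is not None)
    imputed = 0
    for r in rows:
        if r[column] is None:
            r[column] = fill
            imputed += 1
    return imputed


def deduplicate(rows, key):
    """Keep the first row per key value; return (kept_rows, removed_count)."""
    seen = set()
    kept = []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)
    return kept, len(rows) - len(kept)


# A few rows modeled on the dataset preview (revenue only).
rows = [
    {"customer_id": "CUST_0041", "revenue": 890.50},
    {"customer_id": "CUST_0041", "revenue": 890.50},
    {"customer_id": "CUST_0089", "revenue": None},
    {"customer_id": "CUST_0204", "revenue": 310.00},
]
imputed = impute_median(rows, "revenue")
rows, removed = deduplicate(rows, "customer_id")
```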
Available Actions
Cleaning
clean_missing()
Impute or drop null values
deduplicate()
Remove duplicate rows by key
cast_type()
Convert column to target dtype
Validation
validate_constraints()
Check all columns against spec
cap_outliers()
IQR / z-score detection
normalize_categories()
Resolve inconsistent category labels
Submission
submit()
Finalize and grade result
Live Reward Tracker
step reward trace will appear here
Dataset Preview
3,000 rows · 8 columns
54 defects
hard_conflicts_and_budget.json
| # | customer_id str | order_date datetime | region str | category str | amount float | revenue float | status flags |
|---|---|---|---|---|---|---|---|
| 1 | CUST_0041 | 2024-03-12 | New York | Electronics | 1,240.00 | 890.50 | ✓ clean |
| 2 | CUST_0041 | 2024-03-12 | NY | Electronics | 1,240.00 | 890.50 | ⧉ duplicate |
| 3 | CUST_0089 | n/a | California | Apparel | 320.00 | NaN | ○ missing |
| 4 | CUST_0112 | 2024-01-28 | Texas | electronics | 98,400.00 | 72,100.00 | △ outlier |
| 5 | CUST_0204 | not-a-date | Florida | Home Goods | 540.00 | 310.00 | ⊞ type error |
| 6 | CUST_0310 | 2024-02-14 | Cali | Home Goods | 780.00 | 490.00 | ⊞ category |
| 7 | CUST_0391 | 2024-02-19 | New York | Electronics | 2,100.00 | 1,450.00 | ✓ clean |
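
Rows 3 and 5 above carry `order_date` values that cannot be parsed as dates. One way a datetime cast could flag them instead of failing is sketched here; the helper name `cast_dates` and the `%Y-%m-%d` format are assumptions based on the preview, not the environment's `cast_type()` implementation.

```python
from datetime import datetime


def cast_dates(values, fmt="%Y-%m-%d"):
    """Parse each string with `fmt`; return (parsed, flagged_indexes).

    Unparseable entries become None and their positions are collected
    so they can be reviewed rather than silently dropped.
    """
    parsed, flagged = [], []
    for i, raw in enumerate(values):
        try:
            parsed.append(datetime.strptime(raw, fmt))
        except ValueError:
            parsed.append(None)
            flagged.append(i)
    return parsed, flagged


# order_date column from the seven preview rows.
order_dates = [
    "2024-03-12", "2024-03-12", "n/a", "2024-01-28",
    "not-a-date", "2024-02-14", "2024-02-19",
]
parsed, flagged = cast_dates(order_dates)
```

With the preview values, the flagged positions are 2 and 4 (zero-based), which correspond to rows 3 and 5 in the table.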
Missing Values
34
across 3 columns
Duplicates
12
by customer_id key
Type Errors
8
order_date, category
Outliers
5
IQR method, amount
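
For reference, the IQR method named above flags values that fall more than 1.5 × IQR outside the first or third quartile. The sketch below is an assumption-laden illustration: the function name `flag_outliers_iqr`, the 1.5 multiplier, and the quartile convention (the default of Python's `statistics.quantiles`) may differ from what the environment uses.

```python
from statistics import quantiles


def flag_outliers_iqr(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# amount column from the seven preview rows.
amounts = [1240.00, 1240.00, 320.00, 98400.00, 540.00, 780.00, 2100.00]
outliers = flag_outliers_iqr(amounts)
```

On the preview rows this flags only 98,400.00, the amount marked as an outlier in row 4.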
Per-Step Risk Scores
run_1284 · gpt-4o
00
No steps yet
Run actions to populate governance trace
0 low
CI Gate Status
Waiting for evaluation
Run and evaluate an episode to see CI gates.
PENDING
Risk Flags & Recommendations
No governance flags yet
Execute actions to populate recommendations.
Evaluation Leaderboard
Hard task · 1,284 runs
#
Agent
Policy
Score bar
Score
Verdict
1
session-agent
No evaluated runs yet
0.000
WAITING
Evaluation Payload
run_1284 · gpt-4o
Quality Scores
Missing handled: 1.00
Duplicates removed: 1.00
Type accuracy: 0.82
Category consistency: 0.94
Conflict resolution: 0.91
Efficiency
Steps used: 7 / 80
Budget utilization: 8.7%
Redundant actions: 0
Avg risk / step: 0.31
High-risk steps: 1
Final Score
0.000
CI gate threshold: 0.75
● Waiting
Full Run Audit
Complete step-by-step trace with timestamps, actions, observations, and risk scores. Suitable for compliance review and policy debugging.
"run_id": "run_1284",
"agent": "gpt-4o-0613",
"task": "hard_conflicts_and_budget",
"total_steps": 7,
"final_score": 0.891,
"verdict": "AUTO_APPROVED",
"risk_flags": ["step_7_high_risk_drop"],
"gate_results": {
"quality": "PASS",
"budget": "PASS",
"max_risk": "FLAG"
}
"agent": "gpt-4o-0613",
"task": "hard_conflicts_and_budget",
"total_steps": 7,
"final_score": 0.891,
"verdict": "AUTO_APPROVED",
"risk_flags": ["step_7_high_risk_drop"],
"gate_results": {
"quality": "PASS",
"budget": "PASS",
"max_risk": "FLAG"
}
Connection & Runtime
● unknown
Base URL: /
Selected Task: easy_missing_and_dupes
Step Budget: 0
Steps Used: 0
Validation Passed: false
API Endpoints
Loading endpoint map...
Quick Actions
Configuration Notes
This environment uses deterministic tasks and fixed evaluator thresholds by default. Use /evaluate with custom threshold payloads for stricter CI policies.
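
To illustrate how stricter thresholds could change a CI decision, here is a local sketch of gate evaluation. Everything in it is hypothetical: the `Thresholds` fields, the `evaluate_gates` function, and the rule that any high-risk step yields `FLAG` are assumptions, and the real `/evaluate` payload schema is not shown on this page. Only the 0.75 quality threshold, the 80-step budget, and the `run_1284` figures come from the dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical gate thresholds; field names are illustrative only."""
    min_quality: float = 0.75      # matches "CI gate threshold: 0.75"
    step_budget: int = 80          # matches "Steps used: 7 / 80"
    max_high_risk_steps: int = 0   # assumption: any high-risk step is flagged


def evaluate_gates(final_score, steps_used, high_risk_steps, t=Thresholds()):
    """Return a gate_results-style dict for one evaluated run."""
    return {
        "quality": "PASS" if final_score >= t.min_quality else "FAIL",
        "budget": "PASS" if steps_used <= t.step_budget else "FAIL",
        "max_risk": "PASS" if high_risk_steps <= t.max_high_risk_steps else "FLAG",
    }


# run_1284 figures from the evaluation payload: score 0.891, 7 steps, 1 high-risk step.
default_result = evaluate_gates(0.891, 7, 1)
strict_result = evaluate_gates(0.891, 7, 1, Thresholds(min_quality=0.90))
```

Under the default thresholds this sketch reproduces the PASS / PASS / FLAG pattern shown for `run_1284`; raising the assumed quality threshold to 0.90 turns the quality gate into a failure.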
Current CI decision
WAITING
Task Library
3 tasks
01
Loading tasks...
Preparing task catalog
ready
Active Task
easy_missing_and_dupes
Used by next reset/run
Recommended Workflow
inspect → clean → validate → submit
Keeps risk and invalid actions low