§ 07 Self-evaluating AI

Every answer, graded in production.

Every response from my AI gets scored by a separate judge model (Claude Haiku) on four dimensions — groundedness, accuracy, helpfulness, calibration. Failures land here in public, not buried in a private log.

Last 48 evals·Judge: Claude Haiku·Scored ≤ 60s after each answer

Aggregate · last 48 evals

Overall

8.8/10

Grounded

8.9/10

Accurate

9.3/10

Helpful

8.4/10

Calibrated

9.0/10

§ 01 Recent evals

The last 5graded answers — question, response, judge’s reasoning, and per-dimension breakdown.

9.23 days ago
§ Visitor asks
How did Grammarly OS actually speed up feature delivery by 300%?
§ Robert’s AI replies
Great question — the 300% figure needs honest context, not just a headline. **The core problem before Grammarly OS:** Three separate client codebases — Mac (native), Windows (native), and the browser Extension — each required features to…
Judge
The AI grounded its answer directly in the resume and knowledge notes, accurately citing the 3x velocity metric and explaining the architectural rationale. It appropriately hedged the '300%' framing (the source says '3x', not '300%'), clarified what the metric actually measures, and acknowledged implementation complexity without overconfidence.
Grounded9.0Accurate9.0Helpful9.5Calibrated9.5
9.06 days ago
§ Visitor asks
What did the team structure actually look like inside those 50?
§ Robert’s AI replies
Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…
Judge
The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration and honesty about knowledge boundaries.
Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
9.06 days ago
§ Visitor asks
What did the team structure actually look like inside those 50?
§ Robert’s AI replies
Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…
Judge
The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.
Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
9.0May 28
§ Visitor asks
What did the team structure actually look like inside those 50?
§ Robert’s AI replies
Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…
Judge
The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.
Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
9.0May 28
§ Visitor asks
What did the team structure actually look like inside those 50?
§ Robert’s AI replies
Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…
Judge
The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.
Grounded9.0Accurate10.0Helpful8.0Calibrated9.5

How did Grammarly OS actually speed up feature delivery by 300%?

What did the team structure actually look like inside those 50?

What did the team structure actually look like inside those 50?

What did the team structure actually look like inside those 50?

What did the team structure actually look like inside those 50?