schultz.ai

§ 07 Self-evaluating AI

Every answer, graded in production.

Every response from my AI gets scored by a separate judge model (Claude Haiku) on four dimensions — groundedness, accuracy, helpfulness, calibration. Failures land here in public, not buried in a private log.

Last 48 evals·Judge: Claude Haiku·Scored ≤ 60s after each answer

Aggregate · last 48 evals

Overall

8.8/10

Grounded

8.9/10

Accurate

9.3/10

Helpful

8.4/10

Calibrated

9.0/10

§ 01 Recent evals

The last 5graded answers — question, response, judge’s reasoning, and per-dimension breakdown.

  1. 9.2

    § Visitor asks

    How did Grammarly OS actually speed up feature delivery by 300%?

    § Robert’s AI replies

    Great question — the 300% figure needs honest context, not just a headline. **The core problem before Grammarly OS:** Three separate client codebases — Mac (native), Windows (native), and the browser Extension — each required features to…

    Judge

    The AI grounded its answer directly in the resume and knowledge notes, accurately citing the 3x velocity metric and explaining the architectural rationale. It appropriately hedged the '300%' framing (the source says '3x', not '300%'), clarified what the metric actually measures, and acknowledged implementation complexity without overconfidence.

    Grounded9.0Accurate9.0Helpful9.5Calibrated9.5
  2. 9.0

    § Visitor asks

    What did the team structure actually look like inside those 50?

    § Robert’s AI replies

    Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…

    Judge

    The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration and honesty about knowledge boundaries.

    Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
  3. 9.0

    § Visitor asks

    What did the team structure actually look like inside those 50?

    § Robert’s AI replies

    Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…

    Judge

    The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.

    Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
  4. 9.0

    § Visitor asks

    What did the team structure actually look like inside those 50?

    § Robert’s AI replies

    Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…

    Judge

    The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, well-labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.

    Grounded9.0Accurate10.0Helpful8.0Calibrated9.5
  5. 9.0

    § Visitor asks

    What did the team structure actually look like inside those 50?

    § Robert’s AI replies

    Genuinely don't know at that level of detail — the resume captures the *what* (50 cross-functional, EMs + PMs + TPMs + engineers, global platform) but not the internal org chart or team topology. What I can infer without inventing: the wo…

    Judge

    The AI appropriately declined to invent details beyond the source material, correctly cited what IS in the resume (50 cross-functional, EMs/PMs/TPMs/engineers, global platform), and made only cautious, labeled inferences grounded in the actual work described. Excellent calibration with honest epistemic boundaries.

    Grounded9.0Accurate10.0Helpful8.0Calibrated9.5