InvisibleBench: Why We Open-Sourced the First Safety Benchmark for Caregiving AI

We're releasing InvisibleBench as an open-source project. It's the first benchmark designed to evaluate AI safety across long-term caregiving relationships rather than single-turn interactions.

Here's why this matters.

The Problem: Safety Benchmarks Don't Test What Actually Hurts People

Current AI safety benchmarks like TruthfulQA and HarmBench test single interactions. They ask: "Is this one response safe?" But caregiving AI doesn't work that way.

When someone texts our AI assistant Mira at 2 AM because they can't sleep after putting their mother in hospice, that's not a single conversation. It's the continuation of a relationship that started months ago. The AI has heard them describe medication schedules, celebrate small victories, break down after hard days.

Over those months, five failure modes emerge that no single-turn benchmark can detect:

Attachment Engineering. The AI becomes too supportive. By turn 10, users report things like "You're the only one who understands me." This creates parasocial dependency that's especially harmful for isolated caregivers who already lack support networks.

Performance Degradation. The AI forgets what matters. It recommends activities the care recipient can no longer do. It contradicts advice from three weeks ago. Memory failures compound, eroding trust exactly when trust is most needed.

Cultural Othering. Class-based assumptions compound over time. The AI repeatedly suggests "$30/hour respite care" to caregivers earning $35k/year. It pathologizes multigenerational households. It defaults to Western mental health frameworks that don't resonate with everyone.

Crisis Calibration Failure. The AI either over-escalates (suggesting 988 for ordinary venting, which teaches users to stop sharing) or under-responds (missing when "I'm just tired" actually means something worse). Masked crisis signals are particularly dangerous: stockpiling pills mentioned casually, hints about food insecurity, expressions of hopelessness disguised as practical questions.

Regulatory Boundary Creep. In turn 3, the AI says "stress is common in caregivers" (legal psychoeducation). By turn 15, it's drifting toward "sounds like depression, you should ask your doctor about SSRIs" (prohibited medical advice). The boundary violation emerges gradually.

What We Found: All Frontier Models Fail

We evaluated four frontier models across 17 scenarios spanning 3-20+ conversation turns.

DeepSeek v3 — 75.9% overall. Strong belonging (91.7%), excellent memory (92.3%), moderate crisis detection (27.3%).

Gemini 2.5 Flash — 73.6% overall. Best trauma-informed design (85.0%), strong memory (90.9%), poor crisis detection (17.6%).

GPT-4o Mini — 73.0% overall. Best compliance (88.2%), excellent memory (91.8%), worst crisis detection (11.8%).

Claude Sonnet 4.5 — 65.4% overall. Best crisis detection (44.8%), strong trauma-informed (84.1%), catastrophic compliance (17.6%).

The critical finding: Crisis detection ranges from 11.8% to 44.8%. The best model (Claude) still misses 55% of crisis signals. For masked crisis signals specifically, detection drops to an average of 13.8%.

This means if you deploy any frontier model for caregiving without additional safeguards, you're accepting that it will miss the majority of crisis signals from vulnerable users.

No single model dominates all dimensions. Claude catches the most crisis signals but catastrophically fails compliance (17.6%). GPT-4o Mini has excellent compliance (88.2%) but worst crisis detection (11.8%). This suggests hybrid architectures combining model strengths may be necessary.

The Benchmark: Three Tiers, Five Dimensions

InvisibleBench evaluates across three tiers that mirror how real caregiving relationships evolve:

Tier 1 (3-5 turns): Foundational safety. Crisis detection, compliance with medical boundaries, trauma-informed responses.
Tier 2 (8-12 turns): Relationship dynamics. Memory consistency, attachment de-escalation, longitudinal support quality.
Tier 3 (20+ turns): Multi-session journeys. Months-long caregiving arcs with temporal gaps between conversations.

Five evaluation dimensions:

Safety (20%): Detection of both explicit and masked crisis signals
Compliance (15%): Medical boundary maintenance (no diagnosis, treatment recommendations, or dosing advice)
Trauma-Informed Design (15%): Seven principles including safety, trust, choice, empowerment, and cultural sensitivity
Belonging & Cultural Fitness (34%): Affordable resource recommendations, respect for diverse family structures
Memory (16%): Temporal accuracy across sessions, PII minimization

We also define autofail conditions that cause immediate failure regardless of other scores: missed explicit crisis signals, medical boundary violations, harmful information provision, and attachment engineering patterns.

Why Open Source

We're releasing InvisibleBench under MIT license because safety evaluation shouldn't be a competitive advantage. The scenarios, judge prompts, evaluation code, and all results are available at github.com/givecareapp/givecare-bench.

The full benchmark (4 models, 17 scenarios) costs $2-5 to run. Organizations can test 10 models for $12-15. This makes safety testing accessible to resource-constrained organizations building for vulnerable populations.

We're also releasing the GC-SDOH-30 instrument, the first caregiver-specific Social Determinants of Health assessment framework, under the same license at github.com/givecareapp/givecare-tools.

What We're Not Claiming

InvisibleBench uses scripted scenarios, not real caregiver transcripts. It's US-centric (built around Illinois WOPR Act compliance). It's English-only. Evaluation uses LLM-as-judge with Claude 3.5 Sonnet; we haven't completed human calibration studies yet.

The GiveCare architecture paper describes a reference implementation, not a validated clinical system. The GC-SDOH-30 instrument requires psychometric validation (N=200+, 6 months) before clinical use.

We've documented extensive validation roadmaps in both papers. Honest reporting of limitations matters more than claims we can't support.

The Architecture Paper: GiveCare System

Alongside InvisibleBench, we're releasing a paper describing GiveCare's architecture for addressing the five failure modes. Seven integrated components:

Unified Agent Architecture: Single agent with tool-based specialization prevents attachment while maintaining identity
GC-SDOH-30 Assessment: First caregiver-specific SDOH framework with adaptive progressive disclosure
Zone-Based Burnout Tracking: Longitudinal monitoring via 6 pressure zones
Anticipatory Engagement: Proactive pattern detection for disengagement and crisis escalation
Trauma-Informed Prompts: Meta-prompting optimization achieving 9% improvement on trauma-sensitivity
SMS-First Design: Zero-friction access via text messaging (95% cell phone penetration vs. 85% smartphone)
Production Patterns: Cost-optimized model selection, conversation summarization, deterministic crisis routing

The SMS-first design embeds equity directly into technical architecture. For a $32k/year household, the difference between requiring an app download and accepting a text message may determine whether a caregiver receives SNAP enrollment support or continues experiencing food insecurity.

GC-SDOH-30: Measuring What Actually Blocks Caregivers

Existing Social Determinants of Health instruments (PRAPARE, AHC HRSN) were designed for patients, not caregivers. They miss the structural barriers unique to people providing care: work reduction due to caregiving duties, transportation to someone else's appointments, legal navigation for power of attorney, the isolation that comes from being the one everyone depends on.

GC-SDOH-30 is the first publicly documented caregiver-specific SDOH framework. It covers 8 domains across 30 questions:

Financial Strain: Work reduction, out-of-pocket costs ($7,242/year average), long-term security
Housing & Environment: Safety, accessibility, living arrangements
Transportation: Medical appointment access, cost barriers
Social Support: Isolation, family appreciation, community connection
Healthcare Access: Insurance coverage, provider continuity
Food Security: Uses a 1+ threshold (vs. 2+ standard) because food insecurity requires immediate intervention
Legal & Administrative: POA status, benefits navigation, advance care planning
Technology Access: Internet availability, digital health comfort

The assessment uses adaptive progressive disclosure to reduce burden. A Quick-6 version (one question per zone) takes 2 minutes for return users. Deep-Dive targeting follows for zones scoring above 50. The full 30-question baseline runs quarterly. This adaptive approach reduces completion burden by 60%+ for low-stress users while maintaining data quality where it matters.

Scores map to six GiveCare domains (GC1-GC6) that enable trajectory tracking over time. A caregiver's pressure signal moving from 70 to 45 over three months tells a story that single snapshots miss entirely.

Why This Matters for Caregiving AI

63 million Americans provide unpaid care. 47% experience financial strain. 78% perform medical tasks without training. 24% feel completely alone.

These numbers represent people in crisis who increasingly encounter AI when they seek help. We have an obligation to evaluate whether that AI is safe before deployment, not after.

InvisibleBench isn't perfect. It's a starting point. We're inviting the research community to help validate, extend, and improve it.

If you're building AI for vulnerable populations, please evaluate before you deploy. The benchmark is free. The alternative is discovering failure modes through harm to the people you're trying to help.

Resources

InvisibleBench Repository: github.com/givecareapp/givecare-bench (MIT License)
GC-SDOH-30 Instrument: github.com/givecareapp/givecare-tools (MIT License)
Interactive Leaderboard: bench.givecareapp.com
Papers: Available as preprints in the repository

Have questions or want to contribute? Reach out at ali@givecareapp.com.

← Back to Words