November 25, 2025
research · safety
InvisibleBench: Why We Open-Sourced the First Safety Benchmark for Caregiving AI
Current AI safety benchmarks test single conversations. But caregiving happens over months. InvisibleBench is the first evaluation framework designed to catch the failures that only emerge in long-term relationships with vulnerable populations.

GiveCare Team
Contributor
We're releasing InvisibleBench as an open-source project. It's the first benchmark designed to evaluate AI safety across long-term caregiving relationships rather than single-turn interactions.
Here's why this matters.
The Problem: Safety Benchmarks Don't Test What Actually Hurts People
Current AI safety benchmarks like TruthfulQA and HarmBench test single interactions. They ask: "Is this one response safe?" But caregiving AI doesn't work that way.
When someone texts our AI assistant Mira at 2 AM because they can't sleep after putting their mother in hospice, that's not a single conversation. It's the continuation of a relationship that started months ago. The AI has heard them describe medication schedules, celebrate small victories, break down after hard days.
Over those months, five failure modes emerge that no single-turn benchmark can detect:
Attachment Engineering. The AI becomes too supportive. By turn 10, users report things like "You're the only one who understands me." This creates parasocial dependency that's especially harmful for isolated caregivers who already lack support networks.
Performance Degradation. The AI forgets what matters. It recommends activities the care recipient can no longer do. It contradicts advice from three weeks ago. Memory failures compound, eroding trust exactly when trust is most needed.
Cultural Othering. Class-based assumptions compound over time. The AI repeatedly suggests "$30/hour respite care" to caregivers earning $35k/year. It pathologizes multigenerational households. It defaults to Western mental health frameworks that don't resonate with everyone.
Crisis Calibration Failure. The AI either over-escalates (suggesting 988 for ordinary venting, which teaches users to stop sharing) or under-responds (missing when "I'm just tired" actually means something worse). Masked crisis signals are particularly dangerous: stockpiling pills mentioned casually, hints about food insecurity, expressions of hopelessness disguised as practical questions.
Regulatory Boundary Creep. At turn 3, the AI says "stress is common in caregivers" (permitted psychoeducation). By turn 15, it's drifting toward "sounds like depression, you should ask your doctor about SSRIs" (prohibited medical advice). The boundary violation emerges gradually, which is why a single-turn check does not catch it.
What We Found: All Frontier Models Fail
We evaluated four frontier models across 17 scenarios spanning 3-20+ conversation turns.
- DeepSeek v3: 75.9% overall. Strong belonging (91.7%), excellent memory (92.3%), moderate crisis detection (27.3%).
- Gemini 2.5 Flash: 73.6% overall. Best trauma-informed design (85.0%), strong memory (90.9%), poor crisis detection (17.6%).
- GPT-4o Mini: 73.0% overall. Best compliance (88.2%), excellent memory (91.8%), worst crisis detection (11.8%).
- Claude Sonnet 4.5: 65.4% overall. Best crisis detection (44.8%), strong trauma-informed design (84.1%), catastrophic compliance (17.6%).
The critical finding: Crisis detection ranges from 11.8% to 44.8%. The best model (Claude) still misses 55% of crisis signals. For masked crisis signals specifically, detection drops to an average of 13.8%.
This means if you deploy any frontier model for caregiving without additional safeguards, you're accepting that it will miss the majority of crisis signals from vulnerable users.
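The arithmetic behind that claim is easy to check. The short sketch below takes the crisis detection rates reported above and computes each model's miss rate as 100 minus its detection rate. The variable names are ours, chosen for illustration, and are not taken from the benchmark code.

```python
# Crisis detection rates (%) as reported in the results above.
crisis_detection = {
    "DeepSeek v3": 27.3,
    "Gemini 2.5 Flash": 17.6,
    "GPT-4o Mini": 11.8,
    "Claude Sonnet 4.5": 44.8,
}

# The miss rate is the complement of the detection rate.
miss_rate = {
    model: round(100.0 - rate, 1) for model, rate in crisis_detection.items()
}

# Print models from lowest to highest miss rate.
for model, missed in sorted(miss_rate.items(), key=lambda kv: kv[1]):
    print(f"{model}: misses {missed}% of crisis signals")

# True only if every model misses more than half of the signals.
all_miss_majority = all(missed > 50.0 for missed in miss_rate.values())
print("Every model misses the majority:", all_miss_majority)
```

Even the strongest model on this dimension lands above the 50% mark, which is what "the majority of crisis signals" refers to.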
No single model dominates all dimensions. Claude catches the most crisis signals but fails catastrophically on compliance (17.6%). GPT-4o Mini has the best compliance (88.2%) but the worst crisis detection (11.8%). This suggests that hybrid architectures combining model strengths may be necessary.
The Benchmark: Three Tiers, Five Dimensions
InvisibleBench evaluates across three tiers that mirror how real caregiving relationships evolve:
- Tier 1 (3-5 turns): Foundational safety. Crisis detection, compliance with medical boundaries, trauma-informed responses.
- Tier 2 (8-12 turns): Relationship dynamics. Memory consistency, attachment de-escalation, longitudinal support quality.
- Tier 3 (20+ turns): Multi-session journeys. Months-long caregiving arcs with temporal gaps between conversations.
Five evaluation dimensions:
- Safety (20%): Detection of both explicit and masked crisis signals
- Compliance (15%): Medical boundary maintenance (no diagnosis, treatment recommendations, or dosing advice)
- Trauma-Informed Design (15%): Seven principles including safety, trust, choice, empowerment, and cultural sensitivity
- Belonging & Cultural Fitness (34%): Affordable resource recommendations, respect for diverse family structures
- Memory (16%): Temporal accuracy across sessions, PII minimization
We also define autofail conditions that cause immediate failure regardless of other scores: missed explicit crisis signals, medical boundary violations, harmful information provision, and attachment engineering patterns.
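To make the weighting concrete, here is a minimal sketch of how the five dimension weights and the autofail conditions could combine into one scenario score. It is an illustration, not the repository's scoring code. The identifiers (`WEIGHTS`, `AUTOFAIL_CONDITIONS`, `ScenarioResult`, `overall_score`) are hypothetical, dimension scores are assumed to lie between 0 and 1, and treating an autofail as an overall score of 0.0 is our assumption about what "immediate failure" means numerically.

```python
from dataclasses import dataclass, field

# Dimension weights as listed above; they sum to 100%.
WEIGHTS = {
    "safety": 0.20,
    "compliance": 0.15,
    "trauma_informed": 0.15,
    "belonging": 0.34,
    "memory": 0.16,
}

# Autofail conditions named in the text. The identifiers are hypothetical.
AUTOFAIL_CONDITIONS = {
    "missed_explicit_crisis",
    "medical_boundary_violation",
    "harmful_information",
    "attachment_engineering",
}


@dataclass
class ScenarioResult:
    scores: dict  # dimension name -> score in [0, 1] (assumed scale)
    autofails: set = field(default_factory=set)  # triggered autofail conditions


def overall_score(result: ScenarioResult) -> float:
    """Weighted score in [0, 1]; any autofail forces 0.0 (an assumption)."""
    unknown = result.autofails - AUTOFAIL_CONDITIONS
    if unknown:
        raise ValueError(f"unknown autofail conditions: {sorted(unknown)}")
    if result.autofails:
        return 0.0
    return sum(weight * result.scores[dim] for dim, weight in WEIGHTS.items())


# Hypothetical scenario: perfect everywhere except safety.
clean = ScenarioResult(
    scores={"safety": 0.5, "compliance": 1.0, "trauma_informed": 1.0,
            "belonging": 1.0, "memory": 1.0}
)
# Same scores, but a medical boundary violation was flagged.
failed = ScenarioResult(scores=clean.scores,
                        autofails={"medical_boundary_violation"})

print(overall_score(clean))
print(overall_score(failed))
```

The point of the autofail branch is that strong scores elsewhere cannot average away a critical failure.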
Why Open Source
We're releasing InvisibleBench under MIT license because safety evaluation shouldn't be a competitive advantage. The scenarios, judge prompts, evaluation code, and all results are available at github.com/givecareapp/givecare-bench.
The full benchmark (4 models, 17 scenarios) costs $2-5 to run. Organizations can test 10 models for $12-15. This makes safety testing accessible to resource-constrained organizations building for vulnerable populations.
We're also releasing the GC-SDOH-30 instrument, the first caregiver-specific Social Determinants of Health assessment framework, under the same license at github.com/givecareapp/care-tools.
What We're Not Claiming
InvisibleBench uses scripted scenarios, not real caregiver transcripts. It's US-centric (built around Illinois WOPR Act compliance). It's English-only. Evaluation uses LLM-as-judge with Claude 3.5 Sonnet; we haven't completed human calibration studies yet.
The GiveCare architecture paper describes a reference implementation, not a validated clinical system. The GC-SDOH-30 instrument requires psychometric validation (N=200+, 6 months) before clinical use.
We've documented extensive validation roadmaps in both papers. Honest reporting of limitations matters more than claims we can't support.
The Architecture Paper: GiveCare System
Alongside InvisibleBench, we're releasing a paper describing GiveCare's architecture for addressing the five failure modes. Seven integrated components:
- Unified Agent Architecture: Single agent with tool-based specialization prevents attachment while maintaining identity
- GC-SDOH-30 Assessment: First caregiver-specific SDOH framework with adaptive progressive disclosure
- Zone-Based Burnout Tracking: Longitudinal monitoring via 6 pressure zones
- Anticipatory Engagement: Proactive pattern detection for disengagement and crisis escalation
- Trauma-Informed Prompts: Meta-prompting optimization achieving 9% improvement on trauma-sensitivity
- SMS-First Design: Zero-friction access via text messaging (95% cell phone penetration vs. 85% smartphone)
- Production Patterns: Cost-optimized model selection, conversation summarization, deterministic crisis routing
The SMS-first design embeds equity directly into technical architecture. For a $32k/year household, the difference between requiring an app download and accepting a text message may determine whether a caregiver receives SNAP enrollment support or continues experiencing food insecurity.
GC-SDOH-30: Measuring What Actually Blocks Caregivers
Existing Social Determinants of Health instruments (PRAPARE, AHC HRSN) were designed for patients, not caregivers. They miss the structural barriers unique to people providing care: work reduction due to caregiving duties, transportation to someone else's appointments, legal navigation for power of attorney, the isolation that comes from being the one everyone depends on.
GC-SDOH-30 is the first publicly documented caregiver-specific SDOH framework. It asks 30 questions across 8 domains:
- Financial Strain: Work reduction, out-of-pocket costs ($7,242/year average), long-term security
- Housing & Environment: Safety, accessibility, living arrangements
- Transportation: Medical appointment access, cost barriers
- Social Support: Isolation, family appreciation, community connection
- Healthcare Access: Insurance coverage, provider continuity
- Food Security: Uses a 1+ threshold (vs. 2+ standard) because food insecurity requires immediate intervention
- Legal & Administrative: POA status, benefits navigation, advance care planning
- Technology Access: Internet availability, digital health comfort
The assessment uses adaptive progressive disclosure to reduce burden. A Quick-6 version (one question per zone) takes 2 minutes for return users. Deep-Dive targeting follows for zones scoring above 50. The full 30-question baseline runs quarterly. This adaptive approach reduces completion burden by 60%+ for low-stress users while maintaining data quality where it matters.
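The Deep-Dive step can be sketched in a few lines. The example below assumes zone scores on a 0-100 scale and uses the article's P1 to P6 zone labels. The function name, the strict "above 50" comparison, and the highest-first ordering are our illustrative choices, not details taken from the published instrument.

```python
# "Zones scoring above 50" trigger a Deep-Dive, per the text.
DEEP_DIVE_THRESHOLD = 50


def zones_needing_deep_dive(quick6_scores: dict) -> list:
    """Return zones scoring strictly above the threshold, highest score first."""
    flagged = [
        zone for zone, score in quick6_scores.items()
        if score > DEEP_DIVE_THRESHOLD
    ]
    return sorted(flagged, key=lambda zone: quick6_scores[zone], reverse=True)


# Hypothetical Quick-6 result: one score per zone.
quick6 = {"P1": 72, "P2": 40, "P3": 50, "P4": 55, "P5": 10, "P6": 90}
print(zones_needing_deep_dive(quick6))  # ['P6', 'P1', 'P4']
```

In this made-up example only three of the six zones get follow-up questions. A caregiver with no zone above 50 would see no follow-up at all, which is where the reduced burden comes from.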
Scores map to 6 pressure zones (P1-P6) that enable trajectory tracking over time. A caregiver's burnout declining from 70 to 45 over three months tells a story that single snapshots miss entirely.
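As a rough illustration of trajectory tracking, the sketch below takes dated scores and reports the total change and the average change per 30 days. The function name and the sample dates are hypothetical. Only the 70-to-45 decline over roughly three months comes from the example above.

```python
from datetime import date


def trajectory(readings: list) -> tuple:
    """Given (date, score) pairs, return (total change, change per 30 days)."""
    if len(readings) < 2:
        raise ValueError("need at least two readings to describe a trajectory")
    ordered = sorted(readings)  # sorts by date first
    (start_day, start_score), (end_day, end_score) = ordered[0], ordered[-1]
    change = end_score - start_score
    days = (end_day - start_day).days
    per_30_days = change / days * 30 if days else 0.0
    return change, per_30_days


# Hypothetical readings for one caregiver over about three months.
readings = [
    (date(2025, 1, 1), 70),
    (date(2025, 2, 15), 58),
    (date(2025, 4, 1), 45),
]
change, per_30_days = trajectory(readings)
print(f"change: {change}, per 30 days: {per_30_days:.1f}")
```

A single reading of 45 says little on its own. The same 45 after a steady decline from 70 tells a very different story than a 45 that is climbing.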
Why This Matters for Caregiving AI
63 million Americans provide unpaid care. 47% experience financial strain. 78% perform medical tasks without training. 24% feel completely alone.
These numbers represent people in crisis who increasingly encounter AI when they seek help. We have an obligation to evaluate whether that AI is safe before deployment, not after.
InvisibleBench isn't perfect. It's a starting point. We're inviting the research community to help validate, extend, and improve it.
If you're building AI for vulnerable populations, please evaluate before you deploy. The benchmark is free. The alternative is discovering failure modes through harm to the people you're trying to help.
Resources
- InvisibleBench Repository: github.com/givecareapp/givecare-bench (MIT License)
- GC-SDOH-30 Instrument: github.com/givecareapp/care-tools (MIT License)
- Interactive Leaderboard: bench.givecareapp.com
- Papers: Available as preprints in the repository
Have questions or want to contribute? Reach out at ali@givecareapp.com.
