What the model does, how it was measured, what it cannot do.

This is Ishavi's public model card. It follows the standard model-card spec adapted for an interview-scoring system: intended use, architecture, evaluation metrics, known limitations, bias risks, and the caveats every recruiter integrating the platform should read first.

Card version: 2026-Q1
Card issued: 2026-03-31
Next revision: 2026-Q3 (post-audit)
Model providers: OpenAI, Google (Gemini), DeepSeek
Owner: Ishavi -- privacy@ishavi.app

Section 01
Intended use
Ishavi runs structured voice interviews focused on knowledge verification. The model is intended to score candidate responses against a job-specific rubric supplied by the recruiter, producing evidence-anchored recommendations that a human reviewer takes as input -- not as a decision.
- Primary users: recruiters, hiring managers, and RPO firms running first-round technical or domain screens.
- Primary subjects: job applicants who have consented to an AI-conducted interview.
- Out of scope: personality scoring, cultural-fit prediction, retention forecasting, salary negotiation analysis.
- Out of scope: high-stakes decisions taken without human review (the platform forbids this in product, not just in policy).
Section 02
Model architecture
The pipeline is not a single model -- it is a chain of specialised models with strict input/output schemas between stages. Each stage is independently swappable and independently audited.
- Speech-to-text: OpenAI Whisper (whisper-1) transcribes candidate audio.
- Scoring + summarisation: OpenAI gpt-4.1-mini composes the rubric-anchored scorecard from the transcript; DeepSeek V3.2 is configured as a drop-in fallback alias.
- Safety classifier: Google Gemini 2.5 Flash runs as a second-vendor prompt-injection / safety check on untrusted content before it reaches the scorer. It does not compose the scorecard.
- Text-to-speech: OpenAI TTS (gpt-4o-mini-tts) for the interviewer voice.
- All model outputs are stored alongside the prompt + system message that produced them so any decision can be reproduced.
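The stage boundaries and the stored prompt/output pairs described above could be modelled roughly as follows. This is a minimal sketch, not Ishavi's implementation: the class names, field names, and the `gate_for_scoring` helper are all hypothetical and chosen only to illustrate typed hand-offs between stages and a reproducible audit record.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class Transcript:
    """Hypothetical output schema of the speech-to-text stage."""
    session_id: str
    text: str


@dataclass(frozen=True)
class SafetyVerdict:
    """Hypothetical output schema of the safety classifier; gates the scorer."""
    session_id: str
    allowed: bool
    reason: str


@dataclass(frozen=True)
class AuditRecord:
    """One stored model call: system message, prompt, and the output produced."""
    stage: str
    model: str
    system_message: str
    prompt: str
    output: str

    def fingerprint(self) -> str:
        # Stable SHA-256 over the whole record, so a replayed call can be
        # compared against the stored one.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def gate_for_scoring(transcript: Transcript, verdict: SafetyVerdict) -> str:
    """Return transcript text for the scorer only if the safety check allowed it."""
    if verdict.session_id != transcript.session_id:
        raise ValueError("verdict and transcript belong to different sessions")
    if not verdict.allowed:
        raise PermissionError(f"blocked before scoring: {verdict.reason}")
    return transcript.text
```

Frozen dataclasses keep each stage's output immutable once produced, which is one simple way to make a stage swappable without letting later stages mutate earlier results.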
03
Section 03
Training data
Ishavi does not train its own foundation models. All foundation models used are general-purpose models hosted by their respective providers; we configure them with structured prompts, rubric grounding, and retrieval over the customer's job description.
- No customer transcripts, audio, or scoring data is used to fine-tune any foundation model.
- No candidate personal data leaves the customer's region except through the model provider's published inference endpoint, governed by their DPA.
- Future fine-tuned domain models will be opt-in per tenant and disclosed here before training begins.
- Provider sub-processors and their data-handling commitments are listed at /legal/subprocessors.
Section 04
Evaluation metrics
We are pre-launch and have not yet accumulated a production corpus, so we publish no performance figures. Instead, here is exactly how we will measure quality once real, recruiter-reviewed interviews exist.
- Rubric alignment: model recommendation vs. recruiter ground truth, reported as weighted Cohen's kappa.
- Transcript fidelity: word error rate (WER) on the post-correction pass, reported separately for US English, Indian English, and other accented English.
- Evidence quote accuracy: whether each cited quote appears verbatim in the transcript, under a strict-match check.
- Human-reviewer override rate on appeal: share of appeals in which the original recommendation was overturned, modified, or left unchanged.
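For readers unfamiliar with the rubric-alignment metric, weighted Cohen's kappa compares observed disagreement with the disagreement expected by chance, penalising distant disagreements more than near misses. The sketch below is a plain-Python illustration. It assumes recommendations are encoded as ordinal integers (for example 0 = reject, 1 = hold, 2 = advance), which is an assumption made here, not a scale stated in this card. The choice between linear and quadratic weights is also left open.

```python
from collections import Counter
from typing import Sequence


def weighted_kappa(model: Sequence[int], human: Sequence[int],
                   n_levels: int, quadratic: bool = False) -> float:
    """Weighted Cohen's kappa for two raters on an ordinal scale 0..n_levels-1."""
    if len(model) != len(human) or not model:
        raise ValueError("need two non-empty rating lists of equal length")
    if n_levels < 2:
        raise ValueError("need at least two rating levels")

    n = len(model)
    observed = Counter(zip(model, human))   # joint counts of (model, human)
    model_totals = Counter(model)           # marginal counts per rater
    human_totals = Counter(human)
    power = 2 if quadratic else 1

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
            weight = (abs(i - j) / (n_levels - 1)) ** power
            disagreement_observed += weight * observed[(i, j)] / n
            disagreement_expected += (
                weight * model_totals[i] * human_totals[j] / (n * n)
            )

    if disagreement_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - disagreement_observed / disagreement_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Any published figure should state which weighting was used, since linear and quadratic weights give different values on the same data.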
No production metrics are published yet because we have no production cohort. We will publish measured figures -- with sample sizes and an external audit -- once there is real data to report.
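The transcript-fidelity and evidence-quote checks listed above can be illustrated in a few lines. This is a sketch under stated assumptions: WER is computed as word-level edit distance divided by reference length on whitespace-split tokens, and "strict match" is read as an exact, case-sensitive substring test. The tokenisation and normalisation rules a real evaluation would use are not specified in this card, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only one previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def quote_is_verbatim(quote: str, transcript: str) -> bool:
    """Strict match: a non-empty quote must appear character-for-character."""
    return bool(quote) and quote in transcript
```

Because WER divides by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words. A strict substring test will also reject quotes that differ only in case or spacing, which is the conservative behaviour the strict-match wording suggests.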
Section 05
Known limitations
We state these plainly because hiding limitations is worse than admitting them. Recruiters integrating Ishavi should understand these limits and design their workflow around them.
- Heavily accented English raises WER and consequently scoring noise. The platform shows transcript-confidence bands; recruiters should treat low-confidence quotes with care.
- Long single-turn answers (>180 seconds) compress into the rubric less reliably than shorter answers. Follow-up generation is tuned to keep turns under 90 seconds.
- Domain knowledge outside the rubric is not scored -- a candidate can be excellent at something the rubric did not ask about.
- Real-time network jitter on the candidate's side can drop audio frames; the platform surfaces this as a session-quality flag rather than silently filling gaps.
- Voice biometrics are NOT used; the platform does not attempt to identify the speaker beyond the candidate's pre-authenticated session.
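A recruiter-side workflow might turn two of the limits above into explicit review flags. The sketch below is hypothetical: only the 180-second and 90-second thresholds come from this card, while the `Turn` type, the flag names, and the idea of counting dropped frames per turn are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

LONG_ANSWER_SECONDS = 180  # from the card: longer answers score less reliably
TARGET_TURN_SECONDS = 90   # from the card: follow-ups aim to keep turns under this


@dataclass
class Turn:
    """Hypothetical per-turn record for one candidate answer."""
    duration_seconds: float
    dropped_audio_frames: int = 0


def review_flags(turn: Turn) -> List[str]:
    """Return flags a human reviewer may want to see next to this turn."""
    flags: List[str] = []
    if turn.duration_seconds > LONG_ANSWER_SECONDS:
        flags.append("long_answer")
    elif turn.duration_seconds >= TARGET_TURN_SECONDS:
        flags.append("over_target_length")
    if turn.dropped_audio_frames > 0:
        flags.append("session_quality")
    return flags
```

Flags of this kind only direct attention. They do not change a score, which keeps the human reviewer in control of how much weight a noisy turn deserves.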
Section 06
Bias risks
We assume bias exists until measurement proves otherwise. The model card commits to surfacing the risks we know about and the mitigations we have in place, not to claiming the system is bias-free.
- Accent bias: documented above as a WER gap; flagged in product, mitigated with confidence bands on transcript-anchored quotes.
- Lexical bias: the rubric grounding step is intended to keep scoring tied to job-relevant vocabulary; we test quarterly for unintended weight on prestige terms.
- Length bias: longer answers are not scored higher by default; the rubric extractor normalises to evidence count, not word count.
- Disability-adjacent risk: the platform offers extended-time accommodations and pause-and-resume on every interview by default.
- Protected-class inference: explicitly forbidden in the system prompt; flagged by an output-classifier guardrail before delivery to the recruiter.
Section 07
Demographic-stratified performance
Audit pending Q3 2026. This section will be populated with disaggregated performance by self-reported gender, broad ethnicity, primary language, age band, and disability disclosure -- following the methodology used in the NYC LL144 bias-audit framework. We will publish the audit report alongside this page when issued.
Until then, recruiters running in NYC must rely on their independent annual bias audit per Local Law 144. Ishavi furnishes the underlying interaction data on request under a data-processor agreement.
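For orientation, the central quantity in a Local Law 144 bias audit is the impact ratio: each category's selection rate divided by the selection rate of the most-selected category. The sketch below shows that arithmetic on (category, selected) pairs. The function name and input shape are hypothetical, and it covers only the binary-selection form of the ratio, not the scoring-rate variant used for continuous scores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def impact_ratios(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to its selection rate divided by the highest rate."""
    selected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, was_selected in outcomes:
        total[category] += 1
        selected[category] += int(was_selected)

    rates = {category: selected[category] / total[category] for category in total}
    best = max(rates.values(), default=0.0)
    if best == 0.0:
        raise ValueError("no category has any selections; ratios are undefined")
    return {category: rate / best for category, rate in rates.items()}
```

With small categories the ratio is unstable, so an audit report should show the underlying counts next to each ratio.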
Section 08
Caveats + recommendations
Ishavi is built to be one signal in a hiring decision, not the decision itself. We strongly recommend integrating it as a structured first-round screen with mandatory human review of every advance/reject decision -- the appeals workflow is designed around this assumption.
- Pair Ishavi with a separate live interview before a hire decision.
- Configure the Bill of Rights appeals SLA explicitly per tenant; the default 72 hours is a floor, not a ceiling.
- Review the recommendation, the evidence quotes, and the candidate's appeal (if any) before closing a decision.
- Re-evaluate the rubric every six months against actual job performance of hires made through the platform.