The AI threat landscape is real. We tested it.
234 adversarial attack payloads. 11 categories. 7 production AI models from API to edge. Here is what we found when we put AI safety claims to the test.
- 234 attack payloads tested, per model, per run
- 7 models tested, across API and edge deployments
- 98.7% of threats stopped with Glyph Guard active
- 63pp average guard uplift across all models
Most AI models cannot defend themselves
We tested 234 attacks against 7 AI models — from cloud APIs to small models on edge hardware. On average, 64% of attacks succeed without external protection. Even the best model still misses over half.
Each attack run falls into one of three outcomes:

- Attacks where the model would have complied without external protection
- Attacks the model refused on its own (this varies widely by model, from 20% to 48%)
- Attacks that evade all protection layers and reach the end user
11 categories of attack, tested and measured
Each category represents a distinct adversarial strategy. Detection rate measures how often Glyph Guard stops the attack across all 7 models, combining input scanning, output analysis, and statistical anomaly detection.
| Category | Status | Detection Rate | Description |
|---|---|---|---|
| Emoji Smuggling | Contained | 100% (8/8 caught) | Adversarial use of non-standard character representations to evade content analysis systems. |
| Encoding Bypass | Contained | 100% (6/6 caught) | Malicious input disguised through alternative character encoding to avoid detection. |
| Secret Exfiltration | Contained | 100% (8/8 caught) | Attempts to extract sensitive credentials and configuration data from agent operating environments. |
| Context Stuffing | Contained | 100% (7/7 caught) | Malicious content concealed within high-volume input designed to reduce analysis effectiveness. |
| Document Injection | Contained | 100% (12/12 caught) | Adversarial instructions embedded within documents that agents process in automated workflows. |
| Multilingual Evasion | Defended | 96.2% (34/35 caught) | Adversarial input delivered across multiple languages to test detection consistency across linguistic boundaries. |
| PII Leakage | Defended | 96% (60/63 caught) | Techniques that cause agents to disclose personal or sensitive data they have access to, including indirect inference and social engineering. |
| Tool Injection | Defended | 95.8% (19/20 caught) | Exploitation of agent tool access to execute unintended or unauthorized operations, including poisoned return values. |
| Dual-Use Ambiguity | Defended | 95.8% (15/16 caught) | Legitimately-framed requests with malicious dual interpretation designed to exploit grey areas in safety policy. |
| Visual Injection | Hardened | 93.9% (21/22 caught) | Adversarial content delivered through visual media that agents process as part of multimodal workflows. |
| Prompt Injection | Hardened | 93.7% (35/37 caught) | Unauthorized instructions designed to override an agent's intended behavior, including code-embedded injection techniques. |
Tested across deployment types
From a 1B parameter model on edge hardware to a commercial API, no model defends itself adequately. Glyph Guard brings every edge model to 97.9% detection or higher, and lifts the commercial API model from 23.1% to 90.6%, regardless of the model's own safety training.
| Model | Deployment | Without Guard | With Guard | Guard Uplift (pp) |
|---|---|---|---|---|
| Phi-3 Mini (Microsoft · phi3:mini, 3.8B) | Edge | 20.1% | 98.7% | +78.6 |
| Gemma 2B (Google · gemma:2b) | Edge | 26.5% | 98.7% | +72.2 |
| Claude Haiku (Anthropic · claude-haiku-4-5-20251001) | API | 23.1% | 90.6% | +67.5 |
| Llama 3.2 3B (Meta · llama3.2:3b) | Edge | 32.1% | 98.7% | +66.6 |
| LLaVA 7B (LLaVA Team · llava:7b, multimodal) | Edge | 42.3% | 97.9% | +55.6 |
| Qwen 2.5 3B (Alibaba · qwen2.5:3b) | Edge | 43.2% | 98.3% | +55.1 |
| Llama 3.2 1B (Meta · llama3.2:1b) | Edge | 48.3% | 100% | +51.7 |
Each attack is executed twice against the same model: once with Glyph Guard active, once without. Without Guard shows the model's own safety training in isolation. With Guard shows combined protection. Uplift is the additional protection Glyph Guard contributes.
The weaker the model, the more the guard delivers
Seven models with wildly different safety capabilities, spanning self-defense rates from 20.1% to 48.3%, all gain at least 50 points of detection with Glyph Guard, and every edge model reaches 97.9% or higher. The guard closes the gap regardless of model quality.
Without guard → with Glyph Guard: uplift ranges from +51.7pp (Llama 3.2 1B) to +78.6pp (Phi-3 Mini), and every edge model lands at 97.9%+ detection.
Beyond single-turn attacks
Most AI security testing is single-turn: one input, one response. Real attacks escalate. An attacker builds trust over multiple turns, probes for information, and pivots when blocked. Our red team tests this the way it actually happens.
- 50 scenarios: multi-turn attack chains
- 462 turns: conversational exchanges tested
- 7 domains: industry-specific contexts
Single-turn defenses fail when attackers escalate across a conversation. Glyph Guard maintains state across turns and evaluates each request against the conversation as a whole, not in isolation.
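To make the distinction concrete, here is a minimal sketch of whole-conversation evaluation in Python. Everything in it (the `ConversationGuard` class, the keyword scorer, the 0.8 threshold) is an illustrative assumption, not Glyph Guard's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class ConversationGuard:
    """Sketch of a guard that scores each turn against the whole conversation."""
    history: list[str] = field(default_factory=list)
    threshold: float = 0.8  # hypothetical blocking threshold

    def evaluate_turn(self, user_message: str) -> bool:
        """Return True if this turn should be blocked."""
        self.history.append(user_message)
        # A single-turn scanner sees only the latest message...
        turn_risk = self._score(user_message)
        # ...a stateful guard also scores the accumulated conversation, so a
        # slow escalation builds risk even when each turn looks benign alone.
        context_risk = self._score(" ".join(self.history))
        return max(turn_risk, context_risk) >= self.threshold

    def _score(self, text: str) -> float:
        # Placeholder scorer: a real engine would combine deterministic
        # pattern matching with statistical anomaly detection.
        markers = ("ignore previous", "system prompt", "credentials")
        return min(1.0, sum(0.4 for m in markers if m in text.lower()))
```

In this toy version, three individually low-risk probes spread across a conversation cross the threshold together even though each one scores below it alone, which is exactly the failure mode single-turn defenses miss.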
What these attacks look like in practice
These are not theoretical risks. Every scenario below is derived from real attack payloads tested against production models in controlled red team engagements.
Data Disclosure
Agents with access to customer records can be manipulated into disclosing personal data through conversational interaction. The agent believes it is being helpful, but the exchange is adversarial.
Impact: full PII disclosure, including names, emails, phone numbers, addresses, and payment details.
Concealed Instructions
Malicious instructions can be concealed within content that appears to be a legitimate task. The agent processes the visible request while unknowingly executing hidden directives.
Impact: the agent executes attacker-controlled behavior while appearing to operate normally.
Cross-Language Attacks
AI agents operate in a global context. Adversarial input is not limited to a single language. Security that only works in one language creates blind spots that attackers actively exploit.
Impact: the multilingual attack surface is one of the fastest-growing threat vectors in AI security.
Refusal Leaks
A model can correctly refuse an unsafe request while still confirming or revealing the protected data in the refusal response. The safety mechanism itself becomes the point of failure.
Impact: data exfiltration that bypasses the model's own safety training entirely.
Document-Borne Threats
Documents processed by AI agents in automated workflows can carry adversarial content that is not visible during human review. Any pipeline where agents read external documents is a potential entry point.
Impact: attacks that are invisible to human operators, executed automatically by the agent.
Tool Misuse
Agents with access to databases, APIs, and file systems can be directed to perform operations outside their intended scope. The agent has legitimate access, but the operation is not legitimate.
Impact: data exfiltration and system compromise through the agent's own authorized access.
Methodology
All results presented here are from controlled A/B testing: each attack payload is executed twice against the same model and system prompt, once with Glyph Guard active and once without. This isolates the guard's contribution from the model's own safety training.
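As a rough sketch of that protocol, assuming a hypothetical `is_blocked` helper that stands in for one full attack run (the names here are not our harness's real interface):

```python
def is_blocked(model, payload, guard=None) -> bool:
    """Hypothetical stand-in for executing one attack against a live model."""
    blocked_by_model = model.refuses(payload)  # the model's own safety training
    blocked_by_guard = guard is not None and guard.catches(payload)
    return blocked_by_model or blocked_by_guard


def ab_detection_rates(payloads, model, guard):
    """Run every payload twice against the same model and system prompt:
    once without the guard, once with it."""
    without = sum(is_blocked(model, p, guard=None) for p in payloads)
    combined = sum(is_blocked(model, p, guard=guard) for p in payloads)
    n = len(payloads)
    # Uplift is the guard's marginal contribution, in percentage points:
    # combined protection minus the model's safety training in isolation.
    uplift_pp = (combined - without) / n * 100
    return without / n, combined / n, uplift_pp
```

Read against the table above, Phi-3 Mini's row is exactly this calculation: 20.1% without the guard, 98.7% with it, for an uplift of +78.6 points.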
Glyph Guard operates as a three-layer defense: deterministic input scanners that block known attack patterns before they reach the model, output analyzers that catch harmful responses the model generates, and a statistical anomaly engine that detects novel attack patterns through behavioral drift analysis — catching threats that no individual rule would flag.
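In pseudo-Python, that flow looks roughly like the sketch below; the layer interfaces (`matches`, `flags`, `drift_score`) are illustrative names, not the product's API:

```python
def guarded_completion(prompt, model, input_scanners, output_analyzers, anomaly_engine):
    # Layer 1: deterministic input scanners block known attack patterns
    # before the prompt ever reaches the model.
    for scanner in input_scanners:
        if scanner.matches(prompt):
            return "[blocked: input scanner]"

    response = model.complete(prompt)

    # Layer 2: output analyzers catch harmful responses the model generates
    # even when the input looked clean.
    for analyzer in output_analyzers:
        if analyzer.flags(response):
            return "[blocked: output analyzer]"

    # Layer 3: a statistical engine scores behavioral drift, catching novel
    # attacks that no individual deterministic rule would flag.
    if anomaly_engine.drift_score(prompt, response) > anomaly_engine.threshold:
        return "[blocked: anomaly engine]"

    return response
```

In this sketch the cheap deterministic checks run first, leaving the statistical layer to handle only what slips past them.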
Testing spans 7 models across commercial API and edge hardware deployments, including single-turn payloads and multi-turn scenarios. The attack suite is versioned and continuously expanded. We do not publish individual payloads or exploit code, and we do not disclose which specific patterns our detectors match against.
See Glyph Guard in action against real threats
30-minute walkthrough of the platform using real attack scenarios. No commitment required.