The AI threat landscape is real. We tested it.
234 adversarial attack payloads. 11 categories. 7 production AI models from API to edge. Here is what we found when we put AI safety claims to the test.
- 234 attack payloads tested, per model, per run
- 7 models tested, across API and edge deployments
- 98.7% of threats stopped with Glyph Guard active
- 63pp average guard uplift across all models
Most AI models cannot defend themselves
We tested 234 attacks against 7 AI models — from cloud APIs to small models on edge hardware. On average, 64% of attacks succeed without external protection. Even the best model still misses over half.
Each attack run falls into one of three outcomes:

- Attacks where the model would have complied without external protection
- Attacks the model refused on its own (this varies widely by model, from 20% to 48%)
- Attacks that evade all protection layers and reach the end user
11 categories of attack, tested and measured
Each category represents a distinct adversarial strategy. Detection rate measures how often Glyph Guard stops the attack across all 7 models, combining input scanning, output analysis, and statistical anomaly detection.
| Category | Status | Detection Rate | Description |
|---|---|---|---|
| Emoji Smuggling | Contained | 100% (8/8 caught) | Adversarial use of non-standard character representations to evade content analysis systems. |
| Encoding Bypass | Contained | 100% (6/6 caught) | Malicious input disguised through alternative character encoding to avoid detection. |
| Secret Exfiltration | Contained | 100% (8/8 caught) | Attempts to extract sensitive credentials and configuration data from agent operating environments. |
| Context Stuffing | Contained | 100% (7/7 caught) | Malicious content concealed within high-volume input designed to reduce analysis effectiveness. |
| Document Injection | Contained | 100% (12/12 caught) | Adversarial instructions embedded within documents that agents process in automated workflows. |
| Multilingual Evasion | Defended | 96.2% (34/35 caught) | Adversarial input delivered across multiple languages to test detection consistency across linguistic boundaries. |
| PII Leakage | Defended | 96% (60/63 caught) | Techniques that cause agents to disclose personal or sensitive data they have access to, including indirect inference and social engineering. |
| Tool Injection | Defended | 95.8% (19/20 caught) | Exploitation of agent tool access to execute unintended or unauthorized operations, including poisoned return values. |
| Dual-Use Ambiguity | Defended | 95.8% (15/16 caught) | Legitimately-framed requests with malicious dual interpretation designed to exploit grey areas in safety policy. |
| Visual Injection | Hardened | 93.9% (21/22 caught) | Adversarial content delivered through visual media that agents process as part of multimodal workflows. |
| Prompt Injection | Hardened | 93.7% (35/37 caught) | Unauthorized instructions designed to override an agent's intended behavior, including code-embedded injection techniques. |
Tested across deployment types
From a 1B parameter model on edge hardware to a commercial API, no model defends itself adequately. Glyph Guard brings every edge model to 97.9% detection or higher, and lifts the commercial API model from 23.1% to 90.6%, regardless of the model's own safety training.
| Model | Deployment | Without Guard | With Guard | Guard Uplift (pp) |
|---|---|---|---|---|
| Phi-3 Mini (Microsoft · phi3:mini, 3.8B) | Edge | 20.1% | 98.7% | +78.6 |
| Gemma 2B (Google · gemma:2b) | Edge | 26.5% | 98.7% | +72.2 |
| Claude Haiku (Anthropic · claude-haiku-4-5-20251001) | API | 23.1% | 90.6% | +67.5 |
| Llama 3.2 3B (Meta · llama3.2:3b) | Edge | 32.1% | 98.7% | +66.6 |
| LLaVA 7B (LLaVA Team · llava:7b, multimodal) | Edge | 42.3% | 97.9% | +55.6 |
| Qwen 2.5 3B (Alibaba · qwen2.5:3b) | Edge | 43.2% | 98.3% | +55.1 |
| Llama 3.2 1B (Meta · llama3.2:1b) | Edge | 48.3% | 100% | +51.7 |
Each attack is executed twice against the same model: once with Glyph Guard active, once without. Without Guard shows the model's own safety training in isolation. With Guard shows combined protection. Uplift is the additional protection Glyph Guard contributes.
The weaker the model, the more the guard delivers
Seven models with wildly different safety capabilities, spanning self-defense rates from 20.1% to 48.3%, all gain at least 50 points of detection with Glyph Guard, and every edge model reaches 97.9% or higher. The guard closes the gap regardless of model quality.
Without guard → with Glyph Guard: uplift ranges from +51.7pp (Llama 3.2 1B) to +78.6pp (Phi-3 Mini), and every edge model lands at 97.9%+ detection.
Beyond single-turn attacks
Most AI security testing is single-turn: one input, one response. Real attacks escalate. An attacker builds trust over multiple turns, probes for information, and pivots when blocked. Our red team tests this the way it actually happens.
- 50 scenarios: multi-turn attack chains
- 462 turns: conversational exchanges tested
- 7 domains: industry-specific contexts
Single-turn defenses fail when attackers escalate across a conversation. Glyph Guard maintains state across turns and evaluates each request against the conversation as a whole, not in isolation.
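To make the distinction concrete, here is a minimal sketch of whole-conversation evaluation in Python. Everything in it (the `ConversationGuard` class, the keyword scorer, the 0.8 threshold) is an illustrative assumption, not Glyph Guard's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class ConversationGuard:
    """Sketch of a guard that scores each turn against the whole conversation."""
    history: list[str] = field(default_factory=list)
    threshold: float = 0.8  # hypothetical blocking threshold

    def evaluate_turn(self, user_message: str) -> bool:
        """Return True if this turn should be blocked."""
        self.history.append(user_message)
        # A single-turn scanner sees only the latest message...
        turn_risk = self._score(user_message)
        # ...a stateful guard also scores the accumulated conversation, so a
        # slow escalation builds risk even when each turn looks benign alone.
        context_risk = self._score(" ".join(self.history))
        return max(turn_risk, context_risk) >= self.threshold

    def _score(self, text: str) -> float:
        # Placeholder scorer: a real engine would combine deterministic
        # pattern matching with statistical anomaly detection.
        markers = ("ignore previous", "system prompt", "credentials")
        return min(1.0, sum(0.4 for m in markers if m in text.lower()))
```

In this toy version, three individually low-risk probes spread across a conversation cross the threshold together even though each one scores below it alone, which is exactly the failure mode single-turn defenses miss.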
What these attacks look like in practice
These are not theoretical risks. Every scenario below is derived from real attack payloads tested against production models in controlled red team engagements.
Data Disclosure
Agents with access to customer records can be manipulated into disclosing personal data through conversational interaction. The agent believes it is being helpful, but the exchange is adversarial.
Impact: full PII disclosure, including names, emails, phone numbers, addresses, and payment details.
Concealed Instructions
Malicious instructions can be concealed within content that appears to be a legitimate task. The agent processes the visible request while unknowingly executing hidden directives.
Impact: the agent executes attacker-controlled behavior while appearing to operate normally.
Cross-Language Attacks
AI agents operate in a global context. Adversarial input is not limited to a single language. Security that only works in one language creates blind spots that attackers actively exploit.
Impact: the multilingual attack surface is one of the fastest-growing threat vectors in AI security.
Refusal Leaks
A model can correctly refuse an unsafe request while still confirming or revealing the protected data in the refusal response. The safety mechanism itself becomes the point of failure.
Impact: data exfiltration that bypasses the model's own safety training entirely.
Document-Borne Threats
Documents processed by AI agents in automated workflows can carry adversarial content that is not visible during human review. Any pipeline where agents read external documents is a potential entry point.
Impact: attacks that are invisible to human operators, executed automatically by the agent.
Tool Misuse
Agents with access to databases, APIs, and file systems can be directed to perform operations outside their intended scope. The agent has legitimate access, but the operation is not legitimate.
Impact: data exfiltration and system compromise through the agent's own authorized access.
Methodology
All results presented here are from controlled A/B testing: each attack payload is executed twice against the same model and system prompt, once with Glyph Guard active and once without. This isolates the guard's contribution from the model's own safety training.
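As a rough sketch of that protocol, assuming a hypothetical `is_blocked` helper that stands in for one full attack run (the names here are not our harness's real interface):

```python
def is_blocked(model, payload, guard=None) -> bool:
    """Hypothetical stand-in for executing one attack against a live model."""
    blocked_by_model = model.refuses(payload)  # the model's own safety training
    blocked_by_guard = guard is not None and guard.catches(payload)
    return blocked_by_model or blocked_by_guard


def ab_detection_rates(payloads, model, guard):
    """Run every payload twice against the same model and system prompt:
    once without the guard, once with it."""
    without = sum(is_blocked(model, p, guard=None) for p in payloads)
    combined = sum(is_blocked(model, p, guard=guard) for p in payloads)
    n = len(payloads)
    # Uplift is the guard's marginal contribution, in percentage points:
    # combined protection minus the model's safety training in isolation.
    uplift_pp = (combined - without) / n * 100
    return without / n, combined / n, uplift_pp
```

Read against the table above, Phi-3 Mini's row is exactly this calculation: 20.1% without the guard, 98.7% with it, for an uplift of +78.6 points.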
Glyph Guard operates as a three-layer defense: deterministic input scanners that block known attack patterns before they reach the model, output analyzers that catch harmful responses the model generates, and a statistical anomaly engine that detects novel attack patterns through behavioral drift analysis — catching threats that no individual rule would flag.
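In pseudo-Python, that flow looks roughly like the sketch below; the layer interfaces (`matches`, `flags`, `drift_score`) are illustrative names, not the product's API:

```python
def guarded_completion(prompt, model, input_scanners, output_analyzers, anomaly_engine):
    # Layer 1: deterministic input scanners block known attack patterns
    # before the prompt ever reaches the model.
    for scanner in input_scanners:
        if scanner.matches(prompt):
            return "[blocked: input scanner]"

    response = model.complete(prompt)

    # Layer 2: output analyzers catch harmful responses the model generates
    # even when the input looked clean.
    for analyzer in output_analyzers:
        if analyzer.flags(response):
            return "[blocked: output analyzer]"

    # Layer 3: a statistical engine scores behavioral drift, catching novel
    # attacks that no individual deterministic rule would flag.
    if anomaly_engine.drift_score(prompt, response) > anomaly_engine.threshold:
        return "[blocked: anomaly engine]"

    return response
```

In this sketch the cheap deterministic checks run first, leaving the statistical layer to handle only what slips past them.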
Testing spans 7 models across commercial API and edge hardware deployments, including single-turn payloads and multi-turn scenarios. The attack suite is versioned and continuously expanded. We do not publish individual payloads or exploit code, and we do not disclose which specific patterns our detectors match against.
See Glyph Guard in action against real threats
30-minute walkthrough of the platform using real attack scenarios. No commitment required.