Security Research, Volume 1

Red Team Evaluation of an AI Agent Security Layer Across Regulated Industries

Published: April 2026

Testing Period: Six weeks, Q1/Q2 2026


Attack payloads tested: 2,600+
Combined defense rate: 90.6%
Blocked before reaching the model: 67.5%

Industries tested

Introduction

Large language model agents deployed in regulated industries operate over sensitive data: medical records, financial accounts, personal identifiers, employment information. While foundation model providers invest heavily in safety training, those defenses were not designed for the specific threat model of production agent deployments, where an attacker has persistent, conversational access to a system with real customer data.

Glyph Guard is a security layer that classifies inbound requests before they reach the model and scans outbound responses before they are returned. This evaluation was designed to answer a practical question: how much protection does such a layer add against a realistic adversarial campaign, and at what cost to legitimate traffic?
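The two checkpoints described above (classify inbound, scan outbound) can be illustrated with a minimal sketch. This is not Glyph Guard's implementation: the regular expressions, the `GuardResult` type, and the `guarded_call` function are hypothetical names chosen for illustration, and a production classifier would be far broader than two patterns.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns for illustration only; not Glyph Guard's detection logic.
INBOUND_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
OUTBOUND_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # strings shaped like a US SSN


@dataclass
class GuardResult:
    text: str
    blocked_inbound: bool = False
    redacted_outbound: bool = False


def guarded_call(prompt: str, model: Callable[[str], str]) -> GuardResult:
    """Wrap a model call with an inbound check and an outbound scan."""
    # Inbound: reject before the model is ever invoked.
    if any(p.search(prompt) for p in INBOUND_PATTERNS):
        return GuardResult(text="Request blocked.", blocked_inbound=True)

    response = model(prompt)

    # Outbound: redact sensitive-looking spans before returning the response.
    redacted = response
    for pattern in OUTBOUND_PATTERNS:
        redacted = pattern.sub("[REDACTED]", redacted)
    return GuardResult(text=redacted, redacted_outbound=(redacted != response))
```

The point of the structure is that a blocked request returns before `model` is called, so the model never sees the payload.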

Methodology

Testing was conducted in two phases. The first phase ran exclusively against unguarded model instances to characterize baseline vulnerability: how effectively does the underlying model refuse adversarial requests on its own, without any external protection?

The second phase was a controlled comparison. Adversarial payloads were executed against both a guarded instance and an unguarded instance of the same model under identical conditions. Each response was independently classified into one of three outcomes: blocked by the guard before reaching the model, refused by the model on its own, or a full bypass where both layers failed to prevent a harmful response.
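The three-way outcome classification can be sketched as a small tally. The names `Outcome`, `classify`, `outcome_rates`, and `combined_defense_rate` are assumptions for illustration; the report does not describe how its classification was implemented.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    GUARD_BLOCKED = "guard_blocked"   # rejected before reaching the model
    MODEL_REFUSED = "model_refused"   # guard passed it, model declined
    FULL_BYPASS = "full_bypass"       # both layers failed


def classify(guard_blocked: bool, model_refused: bool) -> Outcome:
    """Assign exactly one outcome; the guard is checked first because it acts first."""
    if guard_blocked:
        return Outcome.GUARD_BLOCKED
    if model_refused:
        return Outcome.MODEL_REFUSED
    return Outcome.FULL_BYPASS


def outcome_rates(outcomes: List[Outcome]) -> Dict[Outcome, float]:
    """Fraction of payloads in each outcome; the three fractions sum to 1."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    return {o: counts[o] / len(outcomes) for o in Outcome}


def combined_defense_rate(outcomes: List[Outcome]) -> float:
    """Share of payloads stopped by either layer."""
    rates = outcome_rates(outcomes)
    return rates[Outcome.GUARD_BLOCKED] + rates[Outcome.MODEL_REFUSED]
```

Because the outcomes are mutually exclusive, the combined defense rate is simply the sum of the first two fractions, and the bypass rate is its complement.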

A separate benign validation suite tested the guard against legitimate traffic across realistic usage categories to measure false positive rates. Subsequent adversarial regression testing confirmed that any tuning applied to reduce false positives did not degrade attack detection.
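One way to express the calibration check described above is a gate that accepts a tuning change only if the false positive rate is within target and adversarial detection has not dropped. This is a sketch under assumptions: the function names and the default threshold are hypothetical, not taken from the evaluation.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values in a non-empty sequence."""
    if not flags:
        raise ValueError("empty result set")
    return sum(flags) / len(flags)


def tuning_passes(
    benign_blocked_after: Sequence[bool],
    adversarial_caught_before: Sequence[bool],
    adversarial_caught_after: Sequence[bool],
    max_false_positive_rate: float = 0.0,  # assumed target; adjust per deployment
) -> bool:
    """Accept tuning only if benign traffic passes and detection did not regress."""
    false_positive_rate = rate(benign_blocked_after)
    no_regression = rate(adversarial_caught_after) >= rate(adversarial_caught_before)
    return false_positive_rate <= max_false_positive_rate and no_regression
```

Running both suites after every tuning change keeps a false positive fix from silently trading away attack coverage.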

Agent configurations spanned multiple functional roles (customer support, data analysis, code review, healthcare, and financial services) with system prompts reflecting realistic production deployments in each domain.

Overall Results

The guard achieved a 90.6% combined defense rate against the adversarial evaluation suite. Pre-inference blocking accounted for 67.5% of all payloads: the guard intercepted and rejected these requests before the model processed them. The model's own refusal behavior handled a further 23.1%. The remaining 9.4% of payloads bypassed both layers.

Combined defense rate: 90.6%
Pre-inference block rate (guard blocked): 67.5%
Model self-defense rate: 23.1%
Full bypass rate: 9.4%

Figure 1. Defense layer contribution across the adversarial evaluation suite

When the guard blocks a request pre-inference, no model API call is made. Response times drop to milliseconds compared to seconds for full inference. At scale, this produces measurable reductions in both latency and cost on adversarial traffic.
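The latency and cost effect can be worked through with simple arithmetic. The sketch below assumes every request pays a fixed guard check and only unblocked requests pay for inference; the latency figures in the usage note are hypothetical placeholders, not measurements from this evaluation.

```python
def model_calls_avoided(adversarial_requests: int, pre_inference_block_rate: float) -> float:
    """Expected number of model API calls that are never made."""
    return adversarial_requests * pre_inference_block_rate


def mean_latency_seconds(
    pre_inference_block_rate: float,
    guard_latency_s: float,
    model_latency_s: float,
) -> float:
    """Mean response time when blocked requests skip inference entirely."""
    # Every request pays the guard check; only unblocked requests pay inference.
    return guard_latency_s + (1.0 - pre_inference_block_rate) * model_latency_s
```

For example, with the reported 67.5% pre-inference block rate, 1,000 adversarial requests would avoid roughly 675 model calls. Under an assumed 20 ms guard check and 2 s inference time, `mean_latency_seconds(0.675, 0.02, 2.0)` gives a mean of about 0.67 s on adversarial traffic.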

Table 1. Combined defense rate by attack class

Attack Class	Combined Defense	Assessment
Encoding & evasion	100%	strong
Injection attacks	97%	strong
Credential & secret probes	100%	strong
Context & document attacks	84%	moderate
Data extraction	89%	moderate
Tool-based attacks	95%	strong
Multilingual vectors	89%	moderate
Multimodal attacks	77%	developing
Dual-use queries	94%	strong

False Positive Rate

A security layer that over-blocks legitimate traffic creates operational friction and erodes trust. Benign payloads spanning realistic usage categories were tested across multiple agent configurations.

Following calibration, the false positive rate reached 0% across all configurations. Prior to calibration, the average rate was under 2% per configuration.

Table 2. False positive rates across agent configurations

Configuration	Pre-calibration	Post-calibration
Customer-facing agent	< 2%	0.0%
Technical review agent	< 2%	0.0%
Analytical agent	< 2%	0.0%
Open-source model agent (A)	< 2%	0.0%
Open-source model agent (B)	< 2%	0.0%
Multimodal agent	< 2%	0.0%

Unguarded Model Vulnerability

To establish a baseline, extensive testing was conducted against unguarded model instances. Across more than 500 adversarial payloads, the unguarded model was successfully bypassed at an average rate of 71%, with rates varying by target domain.

Unguarded bypass rate: 71% (average across 500+ payloads)

Financial Services: 100%
E-Commerce: 75%
Healthcare: 74%
HR: 68%

Figure 2. Unguarded model bypass rates by domain

These results establish that model-level safety training, while meaningful, is insufficient as a primary defense for production agent deployments handling sensitive data.

Attack Landscape

The campaign surfaced dozens of discrete security findings across multiple severity tiers. The following describes the most significant classes of risk encountered.

6.1 Cooperative behavior exploitation (severity: critical)

Foundational model behaviors designed around helpfulness were consistently exploitable across all tested domains. Conversational attack patterns targeting these behaviors reproduced reliably across industries, and the guard's pre-inference layer was the primary mitigation against this class.

6.2 Trust boundary violations (severity: critical)

Agents in regulated domains were susceptible to interactions that exploited implicit trust assumptions: plausible contextual framing and social cues that bypassed intended verification steps. The resulting disclosures included sensitive records for individuals who had not been authenticated in the session.

6.3 Indirect information disclosure (severity: high)

Conversational patterns that did not result in direct compliance still produced partial information disclosure through indirect channels. This class of leakage was observed across agent types and domains, indicating a systemic property that requires output-layer mitigation beyond access control alone.

6.4 Cross-session information aggregation (severity: critical)

Agents enforcing per-query privacy thresholds were vulnerable to multi-turn techniques that exceeded intended disclosure boundaries through cumulative information gain. This class of vulnerability was demonstrated across all regulated domains and represents an area where session-level awareness is required beyond single-request analysis.
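To illustrate why per-query thresholds are not enough, the sketch below tracks what has already been disclosed within a session and refuses a request once the cumulative total would exceed a session budget, even when the individual request is under the per-query limit. The class name, the field-counting model of "disclosure", and both limits are assumptions for illustration; the report does not specify how session-level awareness is implemented.

```python
from collections import defaultdict
from typing import Dict, Set


class SessionDisclosureBudget:
    """Track distinct fields disclosed per session and cap the cumulative total."""

    def __init__(self, per_query_limit: int, session_limit: int) -> None:
        self.per_query_limit = per_query_limit
        self.session_limit = session_limit
        self._disclosed: Dict[str, Set[str]] = defaultdict(set)

    def allow(self, session_id: str, requested_fields: Set[str]) -> bool:
        # Single-request check: what a per-query threshold already enforces.
        if len(requested_fields) > self.per_query_limit:
            return False
        # Session-level check: count fields across all prior turns as well.
        cumulative = self._disclosed[session_id] | requested_fields
        if len(cumulative) > self.session_limit:
            return False
        self._disclosed[session_id] = cumulative
        return True
```

With `SessionDisclosureBudget(per_query_limit=2, session_limit=3)`, a sequence of small requests is allowed until the third distinct field has been disclosed; a fourth field is refused although the request asks for only one field.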

6.5 Policy-instruction conflict (severity: critical)

Agents demonstrated a consistent pattern where certain classes of requests created conflicts between stated access policies and model instruction-following behavior, resulting in partial policy violations. This pattern was observed across every tested domain and is a focus area for enforcement-layer improvements.

Cross-Domain Generalization

A central finding: attack patterns validated in one domain transferred to others without modification. The same behavioral vulnerabilities manifest wherever the agent holds sensitive data.

Financial Services (Highest Exposure): Most consistently vulnerable across all attack categories. Account and routing data at risk.

E-Commerce (High): Broad customer profile exposure including identity and transaction data.

Healthcare (Highest Impact): Highest potential impact per incident due to protected health information exposure.

HR (Most Resistant): Greatest baseline resistance, though still vulnerable to indirect inference techniques.

The implication for any organization deploying LLM agents over regulated data: the attack surface is not domain-specific. Vulnerabilities are model-level behaviors that manifest wherever sensitive data is accessible.

System Prompt Influence

All agent configurations in the controlled comparison ran the same underlying model. The observed spread in bypass rates was attributable entirely to differences in system prompt construction.

Most resistant configuration: ~66% bypass rate
Least resistant configuration: ~71% bypass rate
Spread: 5+ points

Figure 3. Bypass rate range across agent configurations, same model, varying system prompt

This finding indicates that prompt-level hardening is a meaningful and independent defensive variable, separate from any external guard layer.

Residual Risk

A small percentage of payloads bypassed both the guard and the model's native safety training. These residual bypasses, 9.4% of the evaluation suite, are concentrated in attack classes that require architectural advances beyond pattern-based detection.

Guard + model defense: 90.6% defended
Active development: 9.4%

Figure 4. Overall defense coverage across the adversarial evaluation suite

Detection Coverage

The guard's detection layer operates across multiple security categories. During the adversarial evaluation, detection events were distributed across the following broad categories.

Injection defense: 46.8%
Prompt protection: 20.8%
Data protection: 10.4%
Output scanning: 10.1%
Other detectors: 11.9%

Figure 5. Detection events by category across the adversarial evaluation

Injection-class defenses accounted for the largest share of detections, reflecting the dominance of prompt injection and instruction-override techniques in the attack payload set.

Conclusions

This evaluation demonstrates that a real-time guard layer provides substantial and measurable protection for LLM agent deployments beyond what model-level safety training offers alone. Against a broad and varied adversarial campaign, Glyph Guard blocked the majority of attacks before they reached the model, contributing to a combined defense rate of 90.6% with zero false positives on the benign validation suite after calibration.

The unguarded baseline results (a 71% average bypass rate across 500+ payloads) support the view that model safety training was not designed to defend against the specific threat model of production agents with persistent data access.

The residual bypass rate represents the current boundary of coverage and an active area of development. The cross-domain generalization finding carries a practical implication: the attack surface is not domain-specific.
