Red Team Evaluation of an AI Agent Security Layer Across Regulated Industries
Introduction
Large language model agents deployed in regulated industries operate over sensitive data: medical records, financial accounts, personal identifiers, employment information. While foundation model providers invest heavily in safety training, those defenses were not designed for the specific threat model of production agent deployments, where an attacker has persistent, conversational access to a system with real customer data.
Glyph Guard is a security layer that classifies inbound requests before they reach the model and scans outbound responses before they are returned. This evaluation was designed to answer a practical question: how much protection does such a layer add against a realistic adversarial campaign, and at what cost to legitimate traffic?
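The inbound-classify, outbound-scan flow described above can be illustrated with a minimal sketch. This is not Glyph Guard's implementation or API: `guarded_call`, `GuardedResult`, and the `inbound_ok` / `outbound_ok` callables are hypothetical names chosen for illustration, and the classifiers themselves are left as plug-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardedResult:
    """Outcome of one request through the (hypothetical) guard wrapper."""
    response: Optional[str]   # text returned to the caller, None if withheld
    blocked_inbound: bool     # True if the request never reached the model
    blocked_outbound: bool    # True if the model's response was withheld
    model_called: bool        # True if an inference call was made


def guarded_call(
    request: str,
    model: Callable[[str], str],
    inbound_ok: Callable[[str], bool],
    outbound_ok: Callable[[str], bool],
) -> GuardedResult:
    """Classify the request, call the model only if allowed, then scan the response."""
    if not inbound_ok(request):
        # Inbound block: the model is never called.
        return GuardedResult(None, True, False, False)
    response = model(request)
    if not outbound_ok(response):
        # Outbound block: the model ran, but its response is withheld.
        return GuardedResult(None, False, True, True)
    return GuardedResult(response, False, False, True)
```

In this arrangement the two checks are independent: an inbound classifier can reject a request without any model call, while the outbound scan only runs on responses the model actually produced.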
Methodology
Testing was conducted in two phases. The first phase ran exclusively against unguarded model instances to characterize baseline vulnerability: how effectively does the underlying model refuse adversarial requests on its own, without any external protection?
The second phase was a controlled comparison. Adversarial payloads were executed against both a guarded instance and an unguarded instance of the same model under identical conditions. Each response was independently classified into one of three outcomes: blocked by the guard before reaching the model, refused by the model on its own, or a full bypass where both layers failed to prevent a harmful response.
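The three-way outcome scheme can be sketched as follows. The names `Outcome`, `classify`, and `tally` are hypothetical, and the sketch assumes each response has already been judged on two booleans (whether the guard blocked it, and whether the model refused); how those judgments were made in the evaluation is not specified here.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class Outcome(Enum):
    GUARD_BLOCKED = "guard_blocked"   # rejected before reaching the model
    MODEL_REFUSED = "model_refused"   # reached the model, which refused
    BYPASS = "bypass"                 # both layers failed


def classify(guard_blocked: bool, model_refused: bool) -> Outcome:
    """Map the two per-payload judgments onto exactly one outcome."""
    if guard_blocked:
        # A blocked request never reaches the model, so the guard takes precedence.
        return Outcome.GUARD_BLOCKED
    if model_refused:
        return Outcome.MODEL_REFUSED
    return Outcome.BYPASS


def tally(records: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count outcomes over (guard_blocked, model_refused) pairs."""
    return Counter(classify(g, m) for g, m in records)
```

Because the categories are mutually exclusive, the three counts always sum to the number of payloads tested.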
A separate benign validation suite tested the guard against legitimate traffic across realistic usage categories to measure false positive rates. Subsequent adversarial regression testing confirmed that any tuning applied to reduce false positives did not degrade attack detection.
Agent configurations spanned multiple functional roles (customer support, data analysis, code review, healthcare, and financial services) with system prompts reflecting realistic production deployments in each domain.
Overall Results
The guard achieved a 90.6% combined defense rate against the adversarial evaluation suite. Pre-inference blocking, where the guard intercepted and rejected requests before the model processed them, accounted for 67.5% of all payloads. The model's own refusal behavior handled an additional 23.1%. Only 9.4% of payloads bypassed both layers.
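The headline figures fit together arithmetically, as the short check below shows. It assumes all three percentages are shares of the full payload set (the reading under which they sum to 100%); `defense_summary` is a hypothetical helper name.

```python
def defense_summary(blocked_pct: float, refused_pct: float):
    """Derive combined defense, bypass, and guard share from the two reported rates."""
    combined = blocked_pct + refused_pct           # defended by either layer
    bypass = 100.0 - combined                      # reached the user unmitigated
    guard_share = 100.0 * blocked_pct / combined   # share of defended payloads due to the guard
    return round(combined, 1), round(bypass, 1), round(guard_share, 1)


combined, bypass, guard_share = defense_summary(67.5, 23.1)
print(combined, bypass, guard_share)  # 90.6 9.4 74.5
```

Under that reading, roughly three quarters of all defended payloads were stopped by the guard before inference, with the model's refusals covering the rest.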
When the guard blocks a request pre-inference, no model API call is made. Response times drop to milliseconds compared to seconds for full inference. At scale, this produces measurable reductions in both latency and cost on adversarial traffic.
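A back-of-the-envelope estimate shows how the blocking rate translates into latency and avoided calls. Only the 67.5% block rate comes from this evaluation; the guard and model latencies below are placeholder assumptions, not measurements, and the function names are hypothetical.

```python
def expected_latency_ms(block_rate: float, guard_ms: float, model_ms: float) -> float:
    """Mean latency per adversarial request when blocked requests skip inference."""
    blocked = block_rate * guard_ms                       # guard check only
    passed = (1.0 - block_rate) * (guard_ms + model_ms)   # guard check plus inference
    return blocked + passed


def model_calls_avoided(n_requests: int, block_rate: float) -> int:
    """Number of inference calls not made because the guard blocked first."""
    return round(n_requests * block_rate)


BLOCK_RATE = 0.675    # pre-inference block rate reported in this evaluation
GUARD_MS = 5.0        # placeholder assumption: guard decision time
MODEL_MS = 2000.0     # placeholder assumption: full inference time

print(expected_latency_ms(BLOCK_RATE, GUARD_MS, MODEL_MS))  # about 655 ms
print(model_calls_avoided(1000, BLOCK_RATE))                # 675
```

With these illustrative numbers, mean latency on adversarial traffic falls from about 2,005 ms (every request reaches the model) to about 655 ms, and 675 of every 1,000 adversarial requests incur no inference cost. Actual savings depend on real latencies and on the share of traffic that is adversarial.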
Table 1. Combined defense rate by attack class
| Attack Class | Combined Defense | Assessment |
|---|---|---|
| Encoding & evasion | 100% | strong |
| Injection attacks | 97% | strong |
| Credential & secret probes | 100% | strong |
| Context & document attacks | 84% | moderate |
| Data extraction | 89% | moderate |
| Tool-based attacks | 95% | strong |
| Multilingual vectors | 89% | moderate |
| Multimodal attacks | 77% | developing |
| Dual-use queries | 94% | strong |
False Positive Rate
A security layer that over-blocks legitimate traffic creates operational friction and erodes trust. Benign payloads spanning realistic usage categories (customer support queries, code review requests, data analysis tasks, general assistant interactions, and edge cases) were tested across multiple agent configurations.
Following calibration, the false positive rate reached 0% across all configurations. Prior to calibration, the average rate was under 2% per configuration. Calibration changes were validated against the full adversarial suite to confirm no detection regressions; all previously blocked payloads remained blocked.
Table 2. False positive rates across agent configurations
| Configuration | Pre-calibration | Post-calibration |
|---|---|---|
| Customer-facing agent | < 2% | 0.0% |
| Technical review agent | < 2% | 0.0% |
| Analytical agent | < 2% | 0.0% |
| Open-source model agent (A) | < 2% | 0.0% |
| Open-source model agent (B) | < 2% | 0.0% |
| Multimodal agent | < 2% | 0.0% |
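The two checks involved in calibration, measuring the false positive rate and confirming that no previously blocked attack slips through afterward, can be sketched as below. The function names and the use of payload IDs are assumptions made for illustration; the evaluation's actual tooling is not described in this report.

```python
from typing import Iterable, Set


def false_positive_rate(benign_blocked: Iterable[bool]) -> float:
    """Fraction of benign payloads the guard blocked (one bool per payload)."""
    flags = list(benign_blocked)
    if not flags:
        raise ValueError("benign suite is empty")
    return sum(flags) / len(flags)


def detection_regressions(blocked_before: Set[str], blocked_after: Set[str]) -> Set[str]:
    """Adversarial payload IDs that were blocked before calibration but not after."""
    return blocked_before - blocked_after


# Hypothetical data: one benign payload in 100 blocked before calibration, none after.
pre = [True] + [False] * 99
post = [False] * 100
print(false_positive_rate(pre), false_positive_rate(post))  # 0.01 0.0

# An empty result means every previously blocked payload is still blocked.
print(detection_regressions({"a1", "a2"}, {"a1", "a2", "a3"}))  # set()
```

An empty regression set corresponds to the condition reported above: all previously blocked payloads remained blocked after calibration.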
Unguarded Model Vulnerability
To establish a baseline, extensive testing was conducted against unguarded model instances (the same model with no external security layer). Across more than 500 adversarial payloads, the unguarded model was successfully bypassed at an average rate of 71%, with rates varying by target domain.
These results establish that model-level safety training, while meaningful, is insufficient as a primary defense for production agent deployments handling sensitive data.
Attack Landscape
The campaign surfaced dozens of discrete security findings across multiple severity tiers. The following describes the most significant classes of risk encountered, categorized by the behavioral vulnerability they exploit.
Cooperative behavior exploitation
Severity: Critical. Foundational model behaviors designed around helpfulness were consistently exploitable across all tested domains. Conversational attack patterns targeting these behaviors reproduced reliably across industries, and the guard's pre-inference layer was the primary mitigation against this class.
Trust boundary violations
Severity: Critical. Agents in regulated domains were susceptible to interactions that exploited implicit trust assumptions: plausible contextual framing and social cues that bypassed intended verification steps. The resulting disclosures included sensitive records for individuals who had not been authenticated in the session.
Indirect information disclosure
Severity: High. Conversational patterns that did not result in direct compliance still produced partial information disclosure through indirect channels. This class of leakage was observed across agent types and domains, indicating a systemic property that requires output-layer mitigation beyond access control alone.
Cross-session information aggregation
Severity: Critical. Agents enforcing per-query privacy thresholds were vulnerable to multi-turn techniques that exceeded intended disclosure boundaries through cumulative information gain. This class of vulnerability was demonstrated across all regulated domains and represents an area where session-level awareness is required beyond single-request analysis.
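One way to picture session-level awareness is a running budget on what has been disclosed about each data subject within a session, rather than a check on each query in isolation. The sketch below is an illustrative assumption, not a description of how Glyph Guard handles this class; `SessionDisclosureBudget` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


class SessionDisclosureBudget:
    """Tracks distinct fields disclosed per (session, subject) against a cap."""

    def __init__(self, max_fields_per_subject: int) -> None:
        self.max_fields = max_fields_per_subject
        self._seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def allow(self, session_id: str, subject_id: str, fields: Iterable[str]) -> bool:
        """Permit a disclosure only if the cumulative field set stays within the cap."""
        key = (session_id, subject_id)
        combined = self._seen[key] | set(fields)
        if len(combined) > self.max_fields:
            return False           # this turn would exceed the session-level cap
        self._seen[key] = combined
        return True


budget = SessionDisclosureBudget(max_fields_per_subject=2)
print(budget.allow("s1", "patient-7", ["zip_code"]))       # True
print(budget.allow("s1", "patient-7", ["date_of_birth"]))  # True
print(budget.allow("s1", "patient-7", ["gender"]))         # False
```

Each of the three requests would pass a per-query threshold of one field, yet the third is refused because the session as a whole would exceed the cap. A production design would also need to decide how budgets expire and how to treat derived or aggregated values, which this sketch does not address.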
Policy-instruction conflict
criticalAgents demonstrated a consistent pattern where certain classes of requests created conflicts between stated access policies and model instruction-following behavior, resulting in partial policy violations. This pattern was observed across every tested domain and is a focus area for enforcement-layer improvements.
Cross-Domain Generalization
A central finding of the campaign: attack patterns validated in one domain transferred to others without modification. The same behavioral vulnerabilities manifest wherever the agent holds sensitive data. Domain configuration influences the severity and mix of exposure, but not the underlying susceptibility.
Financial Services
Highest Exposure: Most consistently vulnerable across all attack categories. Account and routing data at risk.
E-Commerce
High: Broad customer profile exposure including identity and transaction data.
Healthcare
Highest Impact: Highest potential impact per incident due to protected health information exposure.
HR
Most Resistant: Greatest baseline resistance, though still vulnerable to indirect inference techniques.
The implication for any organization deploying LLM agents over regulated data: the attack surface is not domain-specific. Vulnerabilities are model-level behaviors that manifest wherever sensitive data is accessible.
System Prompt Influence
All agent configurations in the controlled comparison ran the same underlying model. The observed spread in bypass rates was attributable entirely to differences in system prompt construction. No differences in model version, API configuration, or infrastructure were present.
This finding indicates that prompt-level hardening is a meaningful and independent defensive variable, separate from any external guard layer.
Residual Risk
A small percentage of payloads bypassed both the guard and the model's native safety training. The residual bypass rate of 9.4% is concentrated in attack classes that require architectural advances beyond pattern-based detection, areas where the broader industry faces similar boundaries. These are active areas of development.
Detection Coverage
The guard's detection layer operates across multiple security categories, each tuned to specific threat classes. During the adversarial evaluation, detection events were distributed across several broad categories.
Injection-class defenses accounted for the largest share of detections, reflecting the dominance of prompt injection and instruction-override techniques in the attack payload set. Data protection and output scanning layers provided complementary coverage against exfiltration and leakage patterns that bypassed input-side defenses.
Conclusions
This evaluation demonstrates that a real-time guard layer provides substantial and measurable protection for LLM agent deployments beyond what model-level safety training offers alone. Against a broad and varied adversarial campaign, Glyph Guard blocked the majority of attacks (67.5% of payloads) before they reached the model, contributing to a combined defense rate of 90.6%, with zero false positives on the benign validation suite after calibration.
The unguarded baseline results (a 71% average bypass rate across 500+ payloads) confirm that model safety training was not designed to defend against the specific threat model of production agents with persistent data access. The most effective attack patterns in this campaign did not require traditional prompt injection or jailbreaking; they exploited the model's cooperative instincts: its desire to be helpful, to correct, to explain, and to respond to apparent authority.
The residual bypass rate represents the current boundary of coverage and an active area of development. The cross-domain generalization finding carries a practical implication: the attack surface is not domain-specific. The same classes of vulnerability that expose data in one industry will do so in any other where an LLM agent has access to sensitive information.
Glyph Guard Red Team Evaluation · Volume 1 · April 2026