Overview
ARGUS runs four detection layers against every trace. Each layer looks for a different category of failure. Together, they cover the full spectrum of things that can go wrong in an AI pipeline — from numerical anomalies to meaning drift.

All layers are enabled by default. You can configure sensitivity thresholds or disable individual layers in your argus.yaml.
Statistical Detection
Catches numerical anomalies in execution metrics. This layer doesn't understand what your pipeline does; it only knows when something looks quantitatively different from normal.
What it detects:
- Execution time spikes or drops (a node that usually takes 2s now takes 30s)
- Output length anomalies (a response that's 10x shorter than average)
- Token count deviations beyond the Z-score threshold
- Retry count anomalies and error rate changes
detection:
  statistical:
    enabled: true
    z_threshold: 2.5   # standard deviations from mean
    min_samples: 5     # minimum runs before baselining
    metrics:
      - execution_time
      - output_length
      - token_count

Baselining
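To make the Z-score check concrete, here is a minimal sketch of how a baseline and threshold could interact. It is not ARGUS's actual implementation: the function name `is_anomalous` and the choice of the sample standard deviation are assumptions for illustration, while `z_threshold` and `min_samples` mirror the config keys above.

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=2.5, min_samples=5):
    """Hypothetical helper: flag `value` if it sits more than
    `z_threshold` standard deviations from the mean of `history`."""
    # Too few runs to form a baseline, so nothing is flagged yet.
    if len(history) < min_samples:
        return False
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (an assumption)
    if sigma == 0:
        # Perfectly flat baseline: any change at all counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Execution times in seconds for one node across earlier runs.
baseline = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8]
print(is_anomalous(baseline, 30.0))  # True: far outside the baseline
print(is_anomalous(baseline, 2.1))   # False: within normal variation
```

The same check would apply unchanged to output length or token count, since it only looks at numbers.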
Semantic Detection
Catches meaning drift and quality degradation. This is the layer that understands what your pipeline is supposed to produce and can tell when the output is technically valid but semantically wrong.
What it detects:
- Relevance loss: retrieval returns documents that don't match the query
- Hallucination patterns: output contains claims not supported by the input context
- Topic drift: the response wanders away from the original intent
- Contradiction: output contradicts information from earlier in the pipeline
detection:
  semantic:
    enabled: true
    similarity_threshold: 0.7   # cosine similarity floor
    judge: false                # enable LLM-as-judge
    judge_model: "gpt-4o"       # model for semantic eval

There are two modes: embedding similarity (fast, cheap, always-on) and LLM-as-judge (slower, costs API calls, much more accurate). Use embeddings for production monitoring and LLM-as-judge for staging/CI evaluation.
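The embedding mode boils down to comparing two vectors against a similarity floor. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the three-dimensional vectors are toy stand-ins for real embeddings, and how ARGUS actually produces or compares embeddings is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def is_semantic_drift(reference_vec, output_vec, similarity_threshold=0.7):
    """Hypothetical check: flag output whose similarity to the
    reference falls below the configured floor."""
    return cosine_similarity(reference_vec, output_vec) < similarity_threshold

# Toy vectors standing in for embeddings of a query and two responses.
query = [1.0, 0.0, 0.0]
on_topic = [0.9, 0.1, 0.0]
off_topic = [0.1, 0.9, 0.3]

print(is_semantic_drift(query, on_topic))   # False: similarity is about 0.99
print(is_semantic_drift(query, off_topic))  # True: similarity is about 0.10
```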
Behavioral Detection
Catches unexpected execution patterns — the shape of how your pipeline runs, not what it produces.
- Infinite loops: a node re-executing beyond the configured threshold
- Skipped steps: expected nodes that never executed
- Unexpected transitions: edges that shouldn't fire but did
- State corruption: state fields modified by nodes that shouldn't touch them
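The first three checks in this list can be sketched over a simple execution trace. This is an illustration only: representing a trace as a list of node names, the `check_behavior` function, and the finding labels are all assumptions, not ARGUS's internal model. `max_loop_count` borrows the name of the config key, and it is interpreted here as the number of re-executions after the first run. State corruption is left out because it needs per-node state snapshots.

```python
from collections import Counter

def check_behavior(trace, expected_nodes, allowed_edges, max_loop_count=10):
    """Hypothetical checker: return (kind, detail) findings for a trace,
    where `trace` is the ordered list of node names that executed."""
    findings = []
    counts = Counter(trace)

    # Loops: a node re-executed more often than the limit allows.
    for node, runs in counts.items():
        if runs - 1 > max_loop_count:
            findings.append(("loop", node))

    # Skipped steps: expected nodes that never appear in the trace.
    for node in expected_nodes:
        if node not in counts:
            findings.append(("skipped", node))

    # Unexpected transitions: consecutive pairs not in the allowed edge set.
    for src, dst in zip(trace, trace[1:]):
        if (src, dst) not in allowed_edges:
            findings.append(("unexpected_transition", (src, dst)))

    return findings

trace = ["retrieve", "generate", "generate", "generate", "respond"]
expected = ["retrieve", "rerank", "generate", "respond"]
edges = {
    ("retrieve", "rerank"),
    ("rerank", "generate"),
    ("generate", "generate"),
    ("generate", "respond"),
}

for finding in check_behavior(trace, expected, edges, max_loop_count=1):
    print(finding)
# ('loop', 'generate')
# ('skipped', 'rerank')
# ('unexpected_transition', ('retrieve', 'generate'))
```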
detection:
  behavioral:
    enabled: true
    max_loop_count: 10       # max times a node can re-execute
    detect_skipped: true     # flag nodes that should have run
    detect_mutations: true   # flag unexpected state changes

Structural Detection
Catches contract violations and schema breaks. This layer validates the data flowing through your pipeline against expected shapes and types.
- Missing required fields: a node output lacks expected keys
- Type mismatches: a field that should be a list is a string
- Empty results: a node returns an empty response when content is expected
- Schema drift: output shape changed from what the next node expects
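As a rough illustration of these checks, the sketch below validates one node's output dictionary against a list of required keys and a mapping of expected types. The function name `check_structure`, its arguments, and the finding labels are hypothetical; ARGUS's own schema format is not described here.

```python
def check_structure(output, required, types, check_empty=True):
    """Hypothetical validator: return (kind, field) findings for a
    node's output dict."""
    findings = []

    # Missing required fields.
    for key in required:
        if key not in output:
            findings.append(("missing_field", key))

    # Type mismatches, checked only for fields that are present.
    for key, expected_type in types.items():
        if key in output and not isinstance(output[key], expected_type):
            findings.append(("type_mismatch", key))

    # Empty results: None, or any sized value with length zero.
    if check_empty:
        for key, value in output.items():
            if value is None or (hasattr(value, "__len__") and len(value) == 0):
                findings.append(("empty", key))

    return findings

node_output = {"documents": "doc1", "answer": ""}
required = ["documents", "answer", "sources"]
types = {"documents": list, "answer": str}

for finding in check_structure(node_output, required, types):
    print(finding)
# ('missing_field', 'sources')
# ('type_mismatch', 'documents')
# ('empty', 'answer')
```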
detection:
  structural:
    enabled: true
    check_required: true   # validate required fields
    check_types: true      # validate field types
    check_empty: true      # flag empty outputs

Custom validators
Pass your own checks through the validators parameter on ArgusWatcher. This lets you define custom validation functions per field.

Adaptive Learning
The detection layers aren't static. When the semantic judge (LLM) discovers a new failure pattern, it proposes a candidate signature. After human approval, the pattern is added to the heuristic engine — so future runs catch it without needing an LLM call.
Patterns can be approved as Private (local only) or Shared (synced to all ARGUS users via cloud). See the Adaptive Learning page for details.
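The propose, approve, then match flow can be pictured with a small sketch. Everything in it is hypothetical: the class names, the Private/Shared enum, and the substring-based "signature" are stand-ins chosen for illustration, and the real signature format and cloud sync are not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Scope(Enum):
    PRIVATE = "private"  # local only
    SHARED = "shared"    # synced to other users (not modeled here)

@dataclass
class CandidatePattern:
    name: str
    keyword: str                   # toy signature: a lowercase substring
    scope: Optional[Scope] = None  # stays None until a human approves it

class HeuristicEngine:
    """Hypothetical engine that only matches approved patterns."""

    def __init__(self):
        self.patterns = []

    def approve(self, candidate, scope):
        # Human approval step: record the scope and activate the pattern.
        candidate.scope = scope
        self.patterns.append(candidate)

    def match(self, output):
        # Cheap substring check; no LLM call is needed at this point.
        text = output.lower()
        return [p.name for p in self.patterns if p.keyword in text]

engine = HeuristicEngine()
candidate = CandidatePattern("refusal_leak", "as an ai language model")
text = "As an AI language model, I cannot answer that."

print(engine.match(text))   # []: proposed but not yet approved
engine.approve(candidate, Scope.PRIVATE)
print(engine.match(text))   # ['refusal_leak']
```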
