Anomaly Detection: Isolation Forest & One-Class SVM

Introduction: The Needle in the Haystack

Imagine you're a fraud detection analyst at a bank. Among millions of normal transactions, you need to find the suspicious ones. Or you're monitoring server infrastructure – most metrics are normal, but you need to catch the anomalies before they cause outages.

Anomaly detection is the art of finding rare items, events, or observations that differ significantly from the majority of the data. It's crucial for:

  • Fraud detection: Unusual credit card transactions
  • Network security: Detecting intrusions
  • Healthcare: Identifying rare diseases
  • Manufacturing: Finding defective products
  • System monitoring: Detecting performance issues

Key Insight: Anomalies are rare, but impactful. Catching them early can save millions of dollars or even lives!

Learning Objectives

  • Understand different types of anomalies
  • Master Isolation Forest algorithm
  • Apply One-Class SVM for anomaly detection
  • Use statistical methods for outlier detection
  • Handle high-dimensional anomaly detection
  • Evaluate anomaly detection performance
  • Choose the right method for different scenarios

1. Types of Anomalies

Point Anomalies

Individual instances that deviate from the norm.

Example: A credit card transaction of $10,000 when typical transactions are $20–$100.

Contextual Anomalies

Normal in one context but anomalous in another.

Example: Temperature of 80°F is normal in summer but anomalous in winter.

Collective Anomalies

A collection of instances is anomalous together.

Example: A series of small withdrawals that together drain an account.


2. Isolation Forest

The Intuition

Isolation Forest is based on a beautiful idea: anomalies are easier to isolate than normal points.

Analogy: Imagine you're at a party. To isolate a popular person, you need many questions ("Are you near the food?" "Are you talking to Sarah?"). But to isolate someone standing alone in the corner? Just one question: "Are you in the corner?"

Key Idea: Anomalies require fewer random splits to isolate than normal points.

How Isolation Forest Works

  1. Build isolation trees: Randomly select a feature and split value
  2. Measure path length: How many splits to isolate each point?
  3. Score anomalies: Shorter paths = more anomalous

Anomaly Score:

$$s(x) = 2^{-\frac{E(h(x))}{c(n)}}$$

Where:

  • $h(x)$ = path length needed to isolate point $x$ in one tree
  • $E(h(x))$ = average of $h(x)$ across all trees
  • $c(n)$ = average path length of an unsuccessful search in a binary tree with $n$ points (a normalizing constant)
  • $s(x)$ close to 1 = anomaly
  • $s(x)$ well below 0.5 = normal
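
To make the score concrete, here is a minimal sketch that computes $s(x)$ from an average path length, using the harmonic-number approximation of $c(n)$ from the original Isolation Forest paper (the path-length values below are illustrative, not measured):

```python
import math

EULER_GAMMA = 0.5772156649

def c(n: int) -> float:
    """Average path length of an unsuccessful search in a binary tree with n points."""
    if n <= 1:
        return 0.0
    # Harmonic number H(i) ≈ ln(i) + Euler's constant
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    """s(x) = 2^(-E(h(x)) / c(n)); shorter paths give scores closer to 1."""
    return 2 ** (-avg_path_length / c(n))

# With n = 256 samples per tree, c(n) ≈ 10.2
print(anomaly_score(3.0, 256))   # ≈ 0.82 → short path, likely anomaly
print(anomaly_score(12.0, 256))  # ≈ 0.44 → long path, likely normal
```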

Interactive Exploration

[Interactive demo: Isolation Forest anomaly detection]

Try this:

  1. Generate data with "Add Anomalies" button
  2. Watch how Isolation Forest isolates outliers
  3. Adjust Contamination – expected % of anomalies
  4. Try different datasets – when does it work best?

Implementation Example

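As a minimal sketch (the synthetic dataset and parameter values here are illustrative choices), scikit-learn's IsolationForest can be used like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense normal cluster
X_outliers = rng.uniform(low=-6, high=6, size=(15, 2))     # scattered anomalies
X = np.vstack([X_normal, X_outliers])

# contamination = the expected fraction of anomalies in the data
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = iso.fit_predict(X)        # +1 = normal, -1 = anomaly
scores = iso.decision_function(X)  # lower scores = more anomalous

print(f"Flagged {np.sum(labels == -1)} of {len(X)} points as anomalies")
```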

Isolation Forest Strengths & Limitations

| Strengths | Limitations |
| --- | --- |
| ✅ Fast and scalable | ❌ Performance depends on the contamination parameter |
| ✅ Works well in high dimensions | ❌ Struggles with local anomalies in dense clusters |
| ✅ No distance metric needed | ❌ Less interpretable than statistical methods |
| ✅ Handles mixed data types | ❌ May miss subtle anomalies |

3. One-Class SVM

The Intuition

One-Class SVM learns a boundary around normal data. Anything outside that boundary is anomalous.

Analogy: Like drawing a fence around your property. Anything inside the fence is yours, anything outside is not.

Key Idea: Find a hyperplane that separates normal data from the origin with maximum margin.

How One-Class SVM Works

  1. Map data to a high-dimensional space (using the kernel trick)
  2. Find hyperplane with maximum margin from origin
  3. Normal data is on one side, anomalies on the other

Decision Function:

  • Positive score = normal
  • Negative score = anomaly

Interactive Exploration

[Interactive demo: One-Class SVM vs. Isolation Forest]

Try this:

  1. Compare One-Class SVM with Isolation Forest
  2. Adjust nu parameter – controls boundary tightness
  3. Try different kernel functions (RBF, linear, polynomial)
  4. Notice how SVM creates smooth decision boundaries

Implementation Example

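A minimal sketch with scikit-learn's OneClassSVM, trained only on normal data as in the usual one-class setting (the dataset and the nu value are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # normal data only

# nu is an upper bound on the fraction of training points outside the boundary
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_train)

X_test = np.array([[0.1, -0.2],   # near the training cluster
                   [4.0, 4.0]])   # far outside it
print(ocsvm.predict(X_test))            # typically [ 1 -1 ]
print(ocsvm.decision_function(X_test))  # positive = normal, negative = anomaly
```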


4. Statistical Methods

Z-Score Method

Idea: Points far from the mean (measured in standard deviations) are anomalous.

$$z = \frac{x - \mu}{\sigma}$$

Rule of Thumb: If $|z| > 3$, treat the point as an outlier (99.7% of the data lies within 3σ of the mean in a normal distribution).

IQR (Interquartile Range) Method

Idea: Use quartiles to define outlier thresholds.

Outliers: Points outside $[Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR]$

where $IQR = Q_3 - Q_1$.

Interactive Exploration

[Interactive demo: statistical outlier detection]

Try this:

  1. Compare statistical methods with ML approaches
  2. Adjust the threshold (Z-score or IQR multiplier)
  3. Notice how statistical methods assume distribution shape
  4. See when they fail on complex data

Implementation Example

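A minimal sketch of both methods with NumPy (the dataset and the injected outliers are illustrative):

```python
import numpy as np

rng = np.random.RandomState(1)
data = np.concatenate([rng.normal(50, 5, size=500),   # normal values
                       [95.0, 110.0, 2.0]])           # injected outliers

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print("Z-score outliers:", z_outliers)  # should flag the three injected values
print("IQR outliers:", iqr_outliers)    # may also catch a few extreme normals
```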


5. Comparing Methods

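A minimal side-by-side sketch on the same synthetic data, using ground-truth labels to report simple precision and recall on the anomaly class (the data, parameters, and metric computation are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(7)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),     # normal cluster
               rng.uniform(-6, 6, size=(15, 2))])   # scattered anomalies
y_true = np.array([1] * 300 + [-1] * 15)            # -1 marks true anomalies

methods = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=7),
    "One-Class SVM": OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"),
}

for name, model in methods.items():
    y_pred = model.fit_predict(X)  # both APIs return +1 (normal) / -1 (anomaly)
    tp = np.sum((y_pred == -1) & (y_true == -1))
    precision = tp / max(np.sum(y_pred == -1), 1)
    recall = tp / np.sum(y_true == -1)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```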

Decision Guide

Use Isolation Forest when:

  • High-dimensional data
  • Need scalability
  • Anomalies are global (far from normal data)
  • No assumptions about data distribution

Use One-Class SVM when:

  • Need smooth decision boundaries
  • Complex, non-linear patterns
  • Small to medium datasets
  • Can tune hyperparameters

Use Statistical Methods when:

  • Data follows known distribution (e.g., Gaussian)
  • Interpretability is crucial
  • Simple, univariate outlier detection
  • Need explainable thresholds

Key Takeaways

Anomaly detection finds rare, unusual patterns in data

Isolation Forest: Fast, scalable, isolates anomalies efficiently

One-Class SVM: Creates decision boundaries around normal data

Statistical Methods: Simple, interpretable, assume distribution

Evaluation: Use precision, recall, and F1-score when labels are available

Real-world: Choose method based on data characteristics and constraints


What's Next?

Next lesson: Gaussian Mixture Models – probabilistic clustering and soft anomaly detection!