The History
When I joined the infrastructure team, the backup monitoring process was manual and reactive. A senior engineer would scan through 100+ Rubrik alert emails every morning, mentally filtering which ones were "real" and which were transient noise. This person had become the single point of failure: they'd built an intuition over years for which alerts to worry about, but that knowledge lived entirely in their head.
The team had tried two prior solutions: a PowerShell script that scraped Rubrik's REST API and dumped results to a CSV (abandoned because nobody looked at the CSVs), and a ServiceNow integration that auto-created tickets for every failure (abandoned because it generated so many tickets that the NOC team demanded it be turned off within a week).
Both approaches failed for the same reason: they treated all failures equally. A VM that was briefly unreachable during a storage migration got the same treatment as a VM with a corrupted VSS writer that would fail every backup until someone intervened. The missing piece was intelligence: the ability to distinguish noise from signal automatically.
The Problem
Align Technology's backup infrastructure monitors 3,000+ weekly jobs across 5 global sites (Israel, Poland, Costa Rica, Mexico, and domestic US). Before API Voyager, this generated 100+ daily alert emails, the vast majority caused by transient issues like momentary VM unreachability, locked files, or temporary network hiccups.
The result was classic alert fatigue. Engineers stopped reading the alerts entirely, which meant genuine persistent failures (the ones indicating real infrastructure problems) were buried in noise and went unaddressed until users reported data loss or compliance audits flagged gaps.
The core insight: The problem wasn't monitoring; it was intelligence. We had all the data. We just couldn't tell signal from noise.
The Architecture
API Voyager is a reusable three-phase serverless pipeline built on Azure Functions. Each phase operates independently, communicating through Azure Table Storage rather than direct function calls. This loose coupling means a failure in one phase never cascades to the others, and each phase can be tested, deployed, and scaled independently.
```mermaid
flowchart LR
RSC["Rubrik Security\nCloud"]
COL["Phase 1\nData Collector\n(Hourly)"]
TBL[("Azure\nTable Storage")]
ANA["Phase 2\nFailure Analyzer\n(Daily 8 AM)"]
OAI["Azure\nOpenAI\n(GPT-4)"]
DASH["Phase 3\nDashboard\n(Daily 8:30 AM)"]
BLOB[("Azure Blob\nStatic Site")]
SNOW["ServiceNow\nTicket"]
RSC -->|"GraphQL API\n+ Pagination"| COL
COL -->|"Upsert Events\n(Deduplication)"| TBL
TBL -->|"7-Day\nFailure Window"| ANA
ANA <-->|"Structured Prompt\n+ Analysis"| OAI
ANA -->|"Store Results"| TBL
ANA -->|"2+ Consecutive\nFailures Only"| SNOW
TBL --> DASH
DASH -->|"HTML Report"| BLOB
style RSC fill:#f1eeea,stroke:#e0dbd5,color:#162032
style COL fill:#0d7377,stroke:#0a5e61,color:#fff
style TBL fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style ANA fill:#0d7377,stroke:#0a5e61,color:#fff
style OAI fill:#162032,stroke:#2a3f5f,color:#e8edf3
style DASH fill:#0d7377,stroke:#0a5e61,color:#fff
style BLOB fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style SNOW fill:#c27a3c,stroke:#a06530,color:#fff
```
Phase 1: Data Collection (Hourly)
An Azure Function fires every hour, authenticating to Rubrik Security Cloud via OAuth2 (credentials stored in Key Vault, accessed through Managed Identity, zero hardcoded secrets). It executes a GraphQL query against the activitySeriesConnection endpoint with automatic cursor-based pagination, fetching all backup events regardless of volume.
Events are upserted to Azure Table Storage using Rubrik's event ID as the row key, providing automatic deduplication. Hourly runs that encounter the same event simply update rather than duplicate.
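The collection loop is straightforward in outline. Here is a minimal sketch of the cursor-walk and upsert logic; the GraphQL call and Table Storage client are stubbed out, and names like `fetch_page` and the dict-based table are illustrative, not Rubrik's or Azure's actual APIs:

```python
def collect_events(fetch_page):
    """Walk a paginated GraphQL connection until hasNextPage is False.

    fetch_page(cursor) stands in for one activitySeriesConnection query.
    """
    events, cursor = [], None
    while True:
        page = fetch_page(cursor)
        events.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return events
        cursor = page["pageInfo"]["endCursor"]


def upsert_events(table, events):
    """Upsert keyed on event id: a re-seen event overwrites, never duplicates."""
    for e in events:
        table[e["id"]] = e  # dict stands in for the Table Storage upsert
    return table
```

Because the event ID is the row key, an hourly run that re-fetches an overlapping window is idempotent by construction.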
Phase 2: Intelligent Analysis (Daily at 8 AM)
This is where the magic happens. The analyzer queries 7 days of failure data and applies the two-consecutive-failures algorithm before sending anything to Azure OpenAI. GPT-4 then generates natural language analysis: root cause categorization, priority ranking, and specific remediation recommendations.
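The pre-AI step, selecting the trailing week of failures and grouping them per resource, could look something like the following sketch (field names `status`, `time`, and `resource` are assumptions for illustration, not the actual schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def failures_in_window(events, now, days=7):
    """Keep only failure events from the trailing window, grouped per resource."""
    cutoff = now - timedelta(days=days)
    grouped = defaultdict(list)
    for e in events:
        if e["status"] == "Failure" and e["time"] >= cutoff:
            grouped[e["resource"]].append(e)
    return dict(grouped)
```

Grouping per resource matters because the consecutive-failures check and the AI prompt both operate on a resource's failure history, not on isolated events.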
Phase 3: Dashboard & ServiceNow (Daily at 8:30 AM)
The dashboard generator pulls analysis results from Table Storage and produces a password-protected HTML report uploaded to Azure Blob Storage's static website hosting. Only resources with consecutive failures are surfaced, ensuring every item on the dashboard requires action.
Confirmed persistent failures are simultaneously routed to ServiceNow for automated ticket creation, ensuring SOC 2 and Dekra audit compliance with a documented remediation trail.
The Key Innovation: Two-Consecutive-Failures
This algorithm is what makes the entire system operationally viable. Without it, every component downstream (AI analysis, dashboard, ServiceNow) would be drowning in noise.
```mermaid
flowchart TD
A["Backup Event\nReceived"] --> B{"Did it\nfail?"}
B -->|"Success"| C["Reset failure\ncounter for\nthis resource"]
B -->|"Failed"| D["Increment failure\ncounter for\nthis resource"]
D --> E{"Counter\n>= 2?"}
E -->|"No (first failure)"| F["Log but\ndo NOT escalate\n(likely transient)"]
E -->|"Yes (consecutive)"| G["ESCALATE"]
G --> H["AI Analysis\n(Azure OpenAI)"]
G --> I["ServiceNow\nTicket Creation"]
G --> J["Dashboard\nAlert"]
style A fill:#f1eeea,stroke:#e0dbd5,color:#162032
style B fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style C fill:#e6f7ef,stroke:#0d7a50,color:#0d7a50
style D fill:#fef3e2,stroke:#b06d1e,color:#b06d1e
style E fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style F fill:#e6f7ef,stroke:#0d7a50,color:#0d7a50
style G fill:#0d7377,stroke:#0a5e61,color:#fff
style H fill:#162032,stroke:#2a3f5f,color:#e8edf3
style I fill:#c27a3c,stroke:#a06530,color:#fff
style J fill:#0d7377,stroke:#0a5e61,color:#fff
```
Why this matters for ServiceNow: Without this filter, the system would create hundreds of tickets daily and quickly become the queue everyone ignores. With it, each ServiceNow ticket represents a genuine, persistent problem that requires action. Clean signal in, clean tickets out.
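The flowchart above reduces to a few lines of state per resource. A minimal sketch (the counter dict stands in for whatever state store the real analyzer uses):

```python
THRESHOLD = 2  # two consecutive failures before anything escalates


def process_event(counters, resource, failed):
    """Update the per-resource failure streak; return True when escalation fires."""
    if not failed:
        counters[resource] = 0  # any success resets the streak
        return False
    counters[resource] = counters.get(resource, 0) + 1
    return counters[resource] >= THRESHOLD
```

The reset-on-success rule is what absorbs transient noise: a VM that fails once during a storage migration and succeeds the next run never reaches the threshold.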
Security Architecture
API Voyager uses a zero-trust, Managed Identity security model. No credentials are stored in code, environment variables, or app settings. All authentication flows through Azure RBAC with least-privilege role assignments.
```mermaid
flowchart TD
FA["Azure Function App\n(System-Assigned\nManaged Identity)"]
KV["Key Vault\n(Rubrik Credentials)"]
TS[("Table Storage\n(Backup Events)")]
OAI["Azure OpenAI\n(GPT-4)"]
BS[("Blob Storage\n(Dashboard)")]
RUB["Rubrik API"]
FA -->|"Key Vault\nSecrets User"| KV
FA -->|"Table Data\nContributor"| TS
FA -->|"Cognitive Services\nOpenAI User"| OAI
FA -->|"Blob Data\nOwner"| BS
KV -->|"OAuth2 Client\nCredentials"| RUB
style FA fill:#0d7377,stroke:#0a5e61,color:#fff
style KV fill:#162032,stroke:#2a3f5f,color:#e8edf3
style TS fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style OAI fill:#162032,stroke:#2a3f5f,color:#e8edf3
style BS fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style RUB fill:#f1eeea,stroke:#e0dbd5,color:#162032
```
Why It's a Framework, Not a Script
API Voyager is designed to be reusable. The three-phase pattern (collect, analyze, act) applies to any periodic data processing task:
| Use Case | Phase 1 (Collect) | Phase 2 (Analyze) | Phase 3 (Act) |
|---|---|---|---|
| Backup Monitoring | Rubrik GraphQL | Consecutive-failures + AI | Dashboard + ServiceNow |
| Cloud Cost Optimization | Azure Cost API | Anomaly detection | Teams alert + budget ticket |
| Security Alert Triage | Defender API | Severity correlation | Incident response workflow |
| Compliance Monitoring | Azure Policy API | Drift detection | Remediation task creation |
The critical architectural pattern: functions communicate via shared storage, not direct calls. This loose coupling means each phase can fail, retry, or scale independently without cascading failures.
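To make the pattern concrete, here is a toy sketch of three phases coupled only through a shared store (a plain dict standing in for Azure Table Storage; the phase names and fields are illustrative). Each phase reads whatever its predecessor left behind and tolerates its absence:

```python
def phase1_collect(store):
    """Pull raw events from an upstream API (stubbed) into shared storage."""
    store["events"] = [{"id": "evt-1", "failed": False},
                       {"id": "evt-2", "failed": True}]


def phase2_analyze(store):
    """Read whatever Phase 1 left behind; an empty store is not an error."""
    store["findings"] = [e for e in store.get("events", []) if e["failed"]]


def phase3_act(store):
    """Act on findings; nothing to read just means nothing to do."""
    store["tickets"] = [f"ticket:{f['id']}" for f in store.get("findings", [])]
```

Because no phase calls another, a failed or delayed phase leaves the others runnable; the next scheduled run simply picks up the latest state in storage.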