The History
When I joined the infrastructure team, the backup monitoring process was manual and reactive. A senior engineer would scan through 100+ Rubrik alert emails every morning, mentally filtering which ones were "real" and which were transient noise. This person had become the single point of failure: they'd built an intuition over years for which alerts to worry about, but that knowledge lived entirely in their head.
The team had tried two prior solutions: a PowerShell script that scraped Rubrik's REST API and dumped results to a CSV (abandoned because nobody looked at the CSVs), and a ServiceNow integration that auto-created tickets for every failure (abandoned because it generated so many tickets that the NOC team demanded it be turned off within a week).
Both approaches failed for the same reason: they treated all failures equally. A VM that was briefly unreachable during a storage migration got the same treatment as a VM with a corrupted VSS writer that would fail every backup until someone intervened. The missing piece was intelligence: the ability to distinguish noise from signal automatically.
The Problem
Align Technology's backup infrastructure monitors 3,000+ weekly jobs across 5 global sites (Israel, Poland, Costa Rica, Mexico, and domestic US). Before API Voyager, this generated 100+ daily alert emails, the vast majority caused by transient issues like momentary VM unreachability, locked files, or temporary network hiccups.
The result was classic alert fatigue. Engineers stopped reading the alerts entirely, which meant genuine persistent failures (the ones indicating real infrastructure problems) were buried in noise and went unaddressed until users reported data loss or compliance audits flagged gaps.
The core insight: The problem wasn't monitoring; it was intelligence. We had all the data. We just couldn't tell signal from noise.
The Architecture
API Voyager is a reusable three-phase serverless pipeline built on Azure Functions. Each phase operates independently, communicating through Azure Table Storage rather than direct function calls. This loose coupling means a failure in one phase never cascades to the others, and each phase can be tested, deployed, and scaled independently.
```mermaid
flowchart LR
RSC["Rubrik Security\nCloud"]
COL["Phase 1\nData Collector\n(Hourly)"]
TBL[("Azure\nTable Storage")]
ANA["Phase 2\nFailure Analyzer\n(Daily 8 AM)"]
OAI["Azure\nOpenAI\n(GPT-4)"]
DASH["Phase 3\nDashboard\n(Daily 8:30 AM)"]
BLOB[("Azure Blob\nStatic Site")]
SNOW["ServiceNow\nTicket"]
RSC -->|"GraphQL API\n+ Pagination"| COL
COL -->|"Upsert Events\n(Deduplication)"| TBL
TBL -->|"7-Day\nFailure Window"| ANA
ANA <-->|"Structured Prompt\n+ Analysis"| OAI
ANA -->|"Store Results"| TBL
ANA -->|"2+ Consecutive\nFailures Only"| SNOW
TBL --> DASH
DASH -->|"HTML Report"| BLOB
style RSC fill:#f1eeea,stroke:#e0dbd5,color:#162032
style COL fill:#0d7377,stroke:#0a5e61,color:#fff
style TBL fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style ANA fill:#0d7377,stroke:#0a5e61,color:#fff
style OAI fill:#162032,stroke:#2a3f5f,color:#e8edf3
style DASH fill:#0d7377,stroke:#0a5e61,color:#fff
style BLOB fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style SNOW fill:#c27a3c,stroke:#a06530,color:#fff
```
Phase 1: Data Collection (Hourly)
An Azure Function fires every hour, authenticating to Rubrik Security Cloud via OAuth2 (credentials stored in Key Vault, accessed through Managed Identity, zero hardcoded secrets). It executes a GraphQL query against the activitySeriesConnection endpoint with automatic cursor-based pagination, fetching all backup events regardless of volume.
Events are upserted to Azure Table Storage using Rubrik's event ID as the row key, providing automatic deduplication. Hourly runs that encounter the same event simply update rather than duplicate.
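The collection loop is straightforward in outline. Here is a minimal sketch of the cursor-walk and upsert logic; the GraphQL call and Table Storage client are stubbed out, and names like `fetch_page` and the dict-based table are illustrative, not Rubrik's or Azure's actual APIs:

```python
def collect_events(fetch_page):
    """Walk a paginated GraphQL connection until hasNextPage is False.

    fetch_page(cursor) stands in for one activitySeriesConnection query.
    """
    events, cursor = [], None
    while True:
        page = fetch_page(cursor)
        events.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return events
        cursor = page["pageInfo"]["endCursor"]


def upsert_events(table, events):
    """Upsert keyed on event id: a re-seen event overwrites, never duplicates."""
    for e in events:
        table[e["id"]] = e  # dict stands in for the Table Storage upsert
    return table
```

Because the event ID is the row key, an hourly run that re-fetches an overlapping window is idempotent by construction.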
Phase 2: Intelligent Analysis (Daily at 8 AM)
This is where the magic happens. The analyzer queries 7 days of failure data and applies the two-consecutive-failures algorithm before sending anything to Azure OpenAI. GPT-4 then generates natural language analysis: root cause categorization, priority ranking, and specific remediation recommendations.
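The pre-AI step, selecting the trailing week of failures and grouping them per resource, could look something like the following sketch (field names `status`, `time`, and `resource` are assumptions for illustration, not the actual schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def failures_in_window(events, now, days=7):
    """Keep only failure events from the trailing window, grouped per resource."""
    cutoff = now - timedelta(days=days)
    grouped = defaultdict(list)
    for e in events:
        if e["status"] == "Failure" and e["time"] >= cutoff:
            grouped[e["resource"]].append(e)
    return dict(grouped)
```

Grouping per resource matters because the consecutive-failures check and the AI prompt both operate on a resource's failure history, not on isolated events.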
Phase 3: Dashboard & ServiceNow (Daily at 8:30 AM)
The dashboard generator pulls analysis results from Table Storage and produces a password-protected HTML report uploaded to Azure Blob Storage's static website hosting. Only resources with consecutive failures are surfaced, ensuring every item on the dashboard requires action.
Confirmed persistent failures are simultaneously routed to ServiceNow for automated ticket creation, ensuring SOC 2 and Dekra audit compliance with a documented remediation trail.
The Key Innovation: Two-Consecutive-Failures
This algorithm is what makes the entire system operationally viable. Without it, every component downstream (AI analysis, dashboard, ServiceNow) would be drowning in noise.
```mermaid
flowchart TD
A["Backup Event\nReceived"] --> B{"Did it\nfail?"}
B -->|"Success"| C["Reset failure\ncounter for\nthis resource"]
B -->|"Failed"| D["Increment failure\ncounter for\nthis resource"]
D --> E{"Counter\n>= 2?"}
E -->|"No (first failure)"| F["Log but\ndo NOT escalate\n(likely transient)"]
E -->|"Yes (consecutive)"| G["ESCALATE"]
G --> H["AI Analysis\n(Azure OpenAI)"]
G --> I["ServiceNow\nTicket Creation"]
G --> J["Dashboard\nAlert"]
style A fill:#f1eeea,stroke:#e0dbd5,color:#162032
style B fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style C fill:#e6f7ef,stroke:#0d7a50,color:#0d7a50
style D fill:#fef3e2,stroke:#b06d1e,color:#b06d1e
style E fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style F fill:#e6f7ef,stroke:#0d7a50,color:#0d7a50
style G fill:#0d7377,stroke:#0a5e61,color:#fff
style H fill:#162032,stroke:#2a3f5f,color:#e8edf3
style I fill:#c27a3c,stroke:#a06530,color:#fff
style J fill:#0d7377,stroke:#0a5e61,color:#fff
```
Why this matters for ServiceNow: Without this filter, the system would create hundreds of tickets daily and quickly become the queue everyone ignores. With it, each ServiceNow ticket represents a genuine, persistent problem that requires action. Clean signal in, clean tickets out.
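The flowchart above reduces to a few lines of state per resource. A minimal sketch (the counter dict stands in for whatever state store the real analyzer uses):

```python
THRESHOLD = 2  # two consecutive failures before anything escalates


def process_event(counters, resource, failed):
    """Update the per-resource failure streak; return True when escalation fires."""
    if not failed:
        counters[resource] = 0  # any success resets the streak
        return False
    counters[resource] = counters.get(resource, 0) + 1
    return counters[resource] >= THRESHOLD
```

The reset-on-success rule is what absorbs transient noise: a VM that fails once during a storage migration and succeeds the next run never reaches the threshold.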
Security Architecture
API Voyager uses a zero-trust, Managed Identity security model. No credentials are stored in code, environment variables, or app settings. All authentication flows through Azure RBAC with least-privilege role assignments.
```mermaid
flowchart TD
FA["Azure Function App\n(System-Assigned\nManaged Identity)"]
KV["Key Vault\n(Rubrik Credentials)"]
TS[("Table Storage\n(Backup Events)")]
OAI["Azure OpenAI\n(GPT-4)"]
BS[("Blob Storage\n(Dashboard)")]
RUB["Rubrik API"]
FA -->|"Key Vault\nSecrets User"| KV
FA -->|"Table Data\nContributor"| TS
FA -->|"Cognitive Services\nOpenAI User"| OAI
FA -->|"Blob Data\nOwner"| BS
KV -->|"OAuth2 Client\nCredentials"| RUB
style FA fill:#0d7377,stroke:#0a5e61,color:#fff
style KV fill:#162032,stroke:#2a3f5f,color:#e8edf3
style TS fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style OAI fill:#162032,stroke:#2a3f5f,color:#e8edf3
style BS fill:#f8f6f3,stroke:#e0dbd5,color:#162032
style RUB fill:#f1eeea,stroke:#e0dbd5,color:#162032
```
Why It's a Framework, Not a Script
API Voyager is designed to be reusable. The three-phase pattern (collect, analyze, act) applies to any periodic data processing task:
| Use Case | Phase 1 (Collect) | Phase 2 (Analyze) | Phase 3 (Act) |
|---|---|---|---|
| Backup Monitoring | Rubrik GraphQL | Consecutive-failures + AI | Dashboard + ServiceNow |
| Cloud Cost Optimization | Azure Cost API | Anomaly detection | Teams alert + budget ticket |
| Security Alert Triage | Defender API | Severity correlation | Incident response workflow |
| Compliance Monitoring | Azure Policy API | Drift detection | Remediation task creation |
The critical architectural pattern: functions communicate via shared storage, not direct calls. This loose coupling means each phase can fail, retry, or scale independently without cascading failures.
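To make the pattern concrete, here is a toy sketch of three phases coupled only through a shared store (a plain dict standing in for Azure Table Storage; the phase names and fields are illustrative). Each phase reads whatever its predecessor left behind and tolerates its absence:

```python
def phase1_collect(store):
    """Pull raw events from an upstream API (stubbed) into shared storage."""
    store["events"] = [{"id": "evt-1", "failed": False},
                       {"id": "evt-2", "failed": True}]


def phase2_analyze(store):
    """Read whatever Phase 1 left behind; an empty store is not an error."""
    store["findings"] = [e for e in store.get("events", []) if e["failed"]]


def phase3_act(store):
    """Act on findings; nothing to read just means nothing to do."""
    store["tickets"] = [f"ticket:{f['id']}" for f in store.get("findings", [])]
```

Because no phase calls another, a failed or delayed phase leaves the others runnable; the next scheduled run simply picks up the latest state in storage.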