Dark Model Endpoints for Regulated AI Workloads

Banks, defense contractors, and government agencies are all adopting LLMs. Most run on managed services such as Azure OpenAI with BAA, Bedrock with PrivateLink, Vertex AI in a customer VPC. Some also run their own inference for workloads that can't leave the building, typically vLLM, SGLang, or TGI on internal GPUs, with Ollama covering lighter-weight cases. The models work fine, but there's a problem with the plumbing that connects clients to them.

The routing problem

Consider a financial services firm with the following AI deployment:

On-prem vLLM cluster - PII-sensitive summarization (loan documents, customer records, internal memos)
Anthropic Claude - coding assistance for the engineering team
OpenAI GPT-4o-mini - fast lookups and classification tasks
AWS Bedrock fine-tuned model - regulatory document analysis

Each of these four model backends has its own connection method, auth mechanism, and policy enforcement point.

Without a unified policy point, each team's client makes its own routing decisions. When there's a misrouted request, unexpected cost spike, or a payload that shouldn't have crossed an org boundary, the platform team is reconstructing what happened across four separate logging systems with no shared identity to correlate against.

Why the workarounds don't work

At NetFoundry, the three common patterns we run into:

VPN for on-prem model access. Grants broad network access when all you need is a path to the inference endpoint. Every VPN user can potentially reach every service on the network. Scaling is painful. Split tunneling introduces its own risks, and you need a separate solution for the cloud providers.
Direct connections to self-hosted models. No governance, failover, or cost tracking. Each team configures its own client to point to its own model endpoint. When the on-prem cluster goes down, there's no automatic fallback. When a new team needs access, someone needs to configure a firewall rule.
Separate infrastructure per deployment topology. One proxy for on-prem models, a different gateway for cloud providers, and a third thing for the air-gapped environment. Three sets of configs, auth systems, and places to check when something breaks.

For cloud-only deployments, there is a fourth pattern that does work: provider private endpoints (Azure OpenAI with Private Endpoint, Bedrock with PrivateLink, Vertex AI in a customer VPC) fronted by an existing API gateway like Kong or Apigee, with Entra or Okta for identity. Traffic stays on the cloud provider's network, identity is centralized, and there's no overlay to deploy.

The story breaks down when on-prem inference comes into play, or when the same workload needs to span cloud, on-prem, and air-gapped environments. None of the patterns above gives you what you actually need across that cross-environment surface: one place to enforce policy, regardless of where the model lives.

A single API across every backend

OpenZiti's LLM Gateway is an open-source (Apache 2.0) gateway that presents a single OpenAI-compatible API to your clients. Behind that API, it routes to any combination of providers - OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, Ollama, vLLM, or any OpenAI-compatible endpoint. Your clients don't need to know or care where the model lives.

What makes this important for regulated environments is how it connects to those backends. LLM Gateway runs on OpenZiti, which means your on-prem model endpoints don't need open ports or VPNs, and aren't visible to anything that isn't explicitly authorized to reach them.

We call these "dark model endpoints." The vLLM cluster in your data center has no listening ports on any network. OpenZiti creates an overlay path from the gateway to the model only for explicitly authorized identities.

Without an authorized identity, there's nothing on the network to connect to.

Routing by sensitivity

LLM Gateway supports semantic routing - a three-layer cascade that classifies requests and routes them to the appropriate model:

Heuristic layer (0ms) - Pattern matching on keywords, system prompts, message length, tool presence. Fast and deterministic. If the prompt contains "summarize this loan document," it matches a pattern and routes to the on-prem model immediately.
Embedding layer (~20-50ms) - Handles the ambiguous cases that keywords miss. Runs against a local model, so the classification itself never leaves your environment.
LLM classifier (~200-500ms) - Optional, for edge cases. A fast local model classifies the request intent. Again, runs locally.

You can define routes like "anything involving PII goes to the on-prem cluster, coding tasks go to Anthropic, fast lookups go to GPT-4o-mini," and the gateway handles it automatically per request, without the client needing to know which model to ask for.

For a financial services team, this means a developer can point their IDE at one endpoint and write code all day using Claude, while a compliance analyst using the same endpoint gets routed to the on-prem model for document summarization. Same API; the model and security posture for each request are determined by policy, not by which config the client loaded.

Security properties for compliance

The security model goes deeper than routing:

End-to-end encryption - In private share mode, traffic between the client and the model is encrypted end-to-end using libsodium. The gateway operator cannot inspect the payload - mTLS handles authentication, libsodium handles the data path. This is relevant and important when the platform team running the gateway and the consuming team are on opposite sides of an information barrier - trading desks, M&A, legal, sell-side research, or any tenant in a multi-team gateway where the operator is in a different trust zone than the user.
FIPS-compliant cryptographic modes - NetFoundry supports standard, FIPS-compliant, and pluggable cryptographic modes across the overlay. For organizations that require FIPS 140 validated crypto, this is a checkbox you can actually check - not a "we use AES-256 so we're probably fine" hand-wave.
Dark-by-default model endpoints - Zero listening ports. The vLLM cluster in your data center, the fine-tuned model running on dedicated GPU infrastructure, the small Ollama instance on a GPU box in the lab - none of them have open ports. They initiate outbound connections to the OpenZiti network. Nothing can reach them without an authorized identity and an explicit service policy.
Identity-based access with virtual API keys - Each user or team gets a virtual API key tied to their OpenZiti identity (X.509 certificate, not a shared secret). That identity controls which models they can access, what their budget limit is, and what shows up in the audit log. Revoke the identity and access stops immediately - no key rotation race.
Per-identity cost tracking and budgets - Know exactly which team is spending what on which provider. Set budget limits per identity. When the intern's experiment starts burning through GPT-4 tokens at 3am, the budget cap stops it.

A concrete deployment: financial services

Going back to the four-backend scenario from earlier - what does that same firm look like with LLM Gateway in front?

The platform team deploys LLM Gateway on their internal infrastructure and connects it to the OpenZiti overlay. The on-prem vLLM cluster runs Llama 3.3 70B (dark, connected via OpenZiti). The Anthropic, OpenAI, and Bedrock backends each have their commercial API access registered to the gateway.

Semantic routing rules direct PII-sensitive requests to the vLLM cluster. Coding requests go to Claude. Quick lookups go to GPT-4o-mini. Regulatory analysis goes to the Bedrock model.

Three teams get access:

Engineering - can use Claude and GPT-4o-mini. Budget: e.g., $2k/month.
Compliance - can use the vLLM cluster and the Bedrock model. No external API budget cap (internal compute, tracked separately).
Research - can use all four backends. Budget: e.g., $5k/month.

Each team authenticates with their OpenZiti identity. The virtual API key they use maps to that identity. The gateway enforces model access, budget limits, and audit logging per identity. The compliance team can't send a request to Anthropic even if they try - the gateway won't route it.

Identity at the gateway enforces the policy, instead of relying on every client to use the right config file.

Air-gapped deployment

Some environments can't touch the internet at all (e.g., defense contractors, classified government systems, certain healthcare environments).

NetFoundry Self-Hosted supports fully air-gapped deployment - the entire control plane, the overlay network, and LLM Gateway all run inside your perimeter with no external connectivity required, using FIPS-compliant crypto and production installers bundled for the environment.

The deployment optionally includes a break-glass vendor access path - a temporary, auditable connection that lets you grant NetFoundry engineers access for troubleshooting or upgrades from inside a virtually air-gapped network, then revoke it when the work is done. You keep the virtual air-gap, but you can pull NetFoundry in when you actually need them.

In these environments, the LLM Gateway routes exclusively to on-prem models. The API and identity model are identical to a non-air-gapped deployment - only the routing table changes.

The full regulated-industry stack

LLM Gateway handles model access, but models aren't the only thing applications need. Agents and applications also need tools (e.g., databases, APIs, file systems, and knowledge bases).

MCP Gateway does for tool access what LLM Gateway does for model access: aggregates multiple MCP servers behind a single dark endpoint, with structural permission filtering (filtered tools don't exist in the registry - they're not checked at runtime, they're removed), per-client isolation, and the same zero-trust identity model.

Agora extends this to agent-to-agent communication - cryptographic identity per agent, workgroup-scoped visibility (agents outside a workgroup can't even discover agents inside it), engagement contracts that bind every session, and a full audit trail.

All three share a single identity model. The same OpenZiti identity that controls which models a team can access also controls which MCP tools they can use and which agents they can interact with.

For a regulated enterprise, this means one infrastructure layer for model access, tool access, and agent communication - with cryptographic identity, end-to-end encryption, and audit trails - instead of separate products and separate auth systems for each surface.

Where to start

The open-source repos are all Apache 2.0:

LLM Gateway - multi-provider model routing with semantic routing, identity-based access, and dark model endpoints
MCP Gateway - governed MCP tool access with aggregation and structural permission filtering
Agora - zero-trust agent-to-agent communication network
OpenZiti - the zero-trust overlay network underneath all of it

If you're building AI infrastructure for a regulated environment and the problems described here sound familiar, we at NetFoundry are running an AI Accelerator design partner program with a small number of early adopters. We'd like to hear about your use case.

Dark Model Endpoints: Private LLM Meshes for Regulated Industries

Comments

Zero Trust for AI Infrastructure

Bake It In: Building Agent Runtimes on Zero Trust from Day One

More from this blog

Secure Your Kubernetes Workloads with Ephemeral Zero-Trust Identities

Bake It In: Building Agent Runtimes on Zero Trust from Day One

You Can't Govern What You Can't See

The Gap Between "Agents Can Talk" and "Agents Should Talk"

The routing problem

Why the workarounds don't work

A single API across every backend

Routing by sensitivity

Security properties for compliance

A concrete deployment: financial services

Air-gapped deployment

The full regulated-industry stack

Where to start

Command Palette

Comments

Zero Trust for AI Infrastructure

Bake It In: Building Agent Runtimes on Zero Trust from Day One

More from this blog

The routing problem

Why the workarounds don't work

A single API across every backend

Routing by sensitivity

Security properties for compliance

A concrete deployment: financial services

Air-gapped deployment

The full regulated-industry stack

Where to start