If you can call an API, you can run a state-of-the-art AI model. That's the whole promise of inference as a service. This guide explains what's actually happening, the words you'll keep hearing, and how to pick the right setup for what you're building.
What is AI inference?
An AI model has two phases in its life. Training is when the model learns, it's shown enormous amounts of data and slowly adjusts its internal weights until it's good at a task. Training is expensive, slow, and done rarely. Inference is when the finished model is put to work: you give it an input, a question, a document, an image, and it produces an output. Every time ChatGPT answers, a coding assistant completes a line, or a support bot replies, that's inference.
Put simply: training builds the model, inference uses it. The vast majority of what an AI feature costs over its lifetime is inference, because training happens once but inference happens on every single request, forever.
Training teaches the model. Inference is the model doing its job in production, and it's the part you pay for again and again.
Inference as a service, explained
Running inference yourself means buying or renting GPUs, installing a serving engine, loading models into GPU memory, batching requests efficiently, and keeping it all online. That's a real engineering project. Inference as a service hands all of that to a provider. You send a request to an API, the provider's infrastructure runs the model, and you get a result back. You pay only for what you use.
Under the hood there are three layers, but you only ever touch the first:
- The request layer, the API endpoint you call. It authenticates your key, meters usage for billing, and routes your request to the right model.
- The engine layer, the serving software that loads the model, manages memory, batches requests together, and runs the actual computation. This is where most of the speed comes from.
- The hardware layer, the accelerators and networking that do the math. The provider manages all of it; you never configure a thing.
The win is leverage: a small team can ship a feature on a frontier model in an afternoon, instead of spending weeks standing up infrastructure they'll then have to babysit.
Open models and the OpenAI-compatible API
Most inference providers serve open-weight models, models like DeepSeek, Llama, Qwen, and others whose weights are published so anyone can host them. Because many teams already wrote their code against OpenAI's API format, the industry settled on an OpenAI-compatible API as the common standard. In practice that means switching providers is usually a one-line change: point your existing OpenAI client at a new base URL and drop in a new key.
# the same OpenAI SDK, pointed at a different endpoint client = OpenAI( base_url="https://api.cloud.baysn.ai/v1", api_key="sk-baysn-...", )
That compatibility matters because it kills lock-in. Your prompts, your tooling, and your evaluation harness all keep working, you're free to choose a provider on price, speed, and privacy rather than on how much rewriting it would cost to leave.
Open vs closed-weight models
Not all models are equal in how you can use them. A closed-weight model (sometimes called proprietary) keeps its weights secret. You can only reach it through the vendor's own API, and your requests run on their servers. GPT, Claude, and Gemini are closed-weight. An open-weight model publishes its weights, so anyone can download and host it. DeepSeek, Llama, Qwen, MiniMax, and gpt-oss are open-weight.
A quick note on wording: people say "open source," but most open models release the trained weights under a license rather than the full training data and code. "Open-weight" is the precise term, and it's the part that matters in production, because it's the weights you run.
Open-weight models win on the things that compound over a product's life: control, privacy, price, and the freedom to move. You can run them privately, fine-tune them, and switch hosts without rewriting anything. Closed models still lead on a few frontier tasks and ask nothing of your infrastructure, so the honest answer is that many teams use both. The shift worth noticing is that open models have closed most of the quality gap while keeping every other advantage.
Baysn serves open-weight models, so you keep the control and the privacy. Your prompts run on capacity that can be yours alone, and your data is never used to train anyone's model.
How per-token pricing works
Serverless inference is usually billed per token, a token being roughly three-quarters of a word. You're charged separately for input tokens (the prompt and context you send) and output tokens (what the model generates), and output is typically the pricier of the two. Prices are quoted per million tokens, so a model at "$0.20 in / $0.60 out" costs twenty cents per million tokens you send and sixty cents per million it returns.
Two numbers tell you most of what you need about speed: time-to-first-token (how long until the answer starts streaming, which is what users feel) and throughput (tokens per second once it's going, which determines how much you can serve). A good provider optimizes both with techniques like speculative decoding, quantization, and continuous batching.
Serverless vs dedicated vs private
There are three common ways to consume inference. Most teams start at the top and move down as they scale.
Serverless is the default starting point: no provisioning, scales automatically, and you pay only for the tokens you use. Dedicated capacity gives you reserved hardware billed by the hour, once you're past roughly ten thousand sustained requests a day, this usually beats per-token pricing and gives you tighter latency control. Private deployments run in your own region, on-prem, or fully air-gapped, so sensitive data never leaves your boundary, the right call for regulated industries.
Start serverless to validate. Move to dedicated when volume is steady and predictable. Go private when compliance or data residency requires it.
How to choose a provider
Beyond the marketing, a few things actually differentiate inference providers:
- Performance, ask for time-to-first-token and throughput on your model and prompt shape, not a generic benchmark.
- Model coverage, do they serve the models you need, across chat, reasoning, code, vision, and embeddings?
- Privacy and data residency, where does your data go, is it ever used for training, and can they deploy in your region?
- A clean path to scale, can you move from serverless to dedicated to private without rewriting anything?
Frequently asked questions
Is inference the same as running a model locally?
Conceptually yes, both produce outputs from a trained model. The difference is who manages the hardware and serving stack. Inference as a service means a provider runs it behind an API, so you don't manage GPUs at all.
Do I need to know machine learning to use it?
No. If you can make an HTTP request or use an SDK, you can call an inference API. The model, hardware, and optimization are all handled for you.
What's a token, exactly?
A token is a chunk of text, on average about ¾ of a word. Models read and write in tokens, and serverless inference is billed by how many you send (input) and receive (output).
When should I stop using serverless?
When your traffic is steady and high, roughly past 10,000 sustained requests per day, reserved dedicated capacity billed per GPU-hour usually costs less and gives you tighter latency control.
Ready to make your first call?
Generate a free API key and run an open model in about five minutes