vMira 5.1, our most capable Russian-language model, is available today across all our products. It is faster, more accurate, and longer-context than vMira 5. This post covers what is new in this release, how it performs, and how to deploy it.
vMira 5.1 reads and writes fluent Russian and English, accepts images alongside text, and ships an optional reasoning mode that shows its work on math, code, and structured analysis. Native context is 262,144 tokens; with our context-extension method, requests can run to one million tokens for specialised workloads. Inference runs on our hosted API, in a private deployment inside the Russian Federation, or in a compact on-device build that runs on a single Linux server.
The reasoning model — vMira Thinking — ships at the same time and is described in detail in a companion post. The rest of this page covers the base model: what changed in this release, how it performs, where it is weak, and how to integrate it.
What is new in 5.1
vMira 5.1 is faster on the prompts that already worked, and meaningfully better on the prompts that did not. The largest gains are in three areas. **Russian-language reasoning** is materially stronger: with the optional reasoning mode, grade-school math accuracy is up 38%, competitive-programming accuracy is up 24%, and confidently-wrong answers on our factuality evaluation are down 61%. **Long-context behaviour** holds quality through 256,000 tokens and degrades gracefully beyond — we recommend a retrieval pipeline above that point for mission-critical work. **Image input** is now a first-class modality across the full context window: scans, photos of documents, diagrams, and handwritten notes all work alongside text. Deployment is unchanged in shape — hosted, private region, on-device — but every layer of the stack is faster.
Training
Training was a multi-stage process. The first stage built broad command of Russian and English on a large corpus weighted toward Russian-language press, fiction, technical documentation, statutes, code, and conversational data. The second stage was a supervised pass on a curated mix split four ways: reward-filtered instruction data; reasoning examples drawn from olympiad-grade math, long chain-of-thought traces, and structured analysis; targeted examples for the categories of public Russian-language evaluation where models historically under-perform; and Russian cultural and conversational data without translation artefacts. The third stage is alignment — a step that teaches the model **when** to stop reasoning rather than just **which** answer to prefer.
Long context
Native context is 262,144 tokens — long enough for a complete book, a multi-file repository, or a research paper with its references. With our context-extension method, requests can run to one million tokens for specialised workloads such as long-form contract analysis or full-codebase audits. Two practical notes for production. First, above roughly 256K tokens, retrieval quality starts to depend on where in the context the relevant information sits — for mission-critical work we recommend a retrieval pipeline rather than raw long context. Second, pricing scales linearly across context lengths; there is no premium tier above 256K.
Image and vision input
vMira 5.1 accepts images as part of the conversation: screenshots, scans, photos of documents, diagrams, charts, hand-written notes. The model describes what it sees, extracts Russian and English text, answers questions about the contents, and reasons about page layout. It does not generate images — that is what vMira Studio is for — and it does not accept video as a single input. For frame-based reasoning, extract frames first and pass them as an image sequence; the model treats them as ordered context.
Deployment
Three deployment paths cover ninety-five percent of what teams actually need. The **hosted API** runs in our data centre with rolling capacity, billed per token in rubles or dollars. **Private deployment in the Russian Federation region** is available for enterprise customers under a separate service-level agreement, with primary collection of personal data inside Russia in line with Federal Law 152-FZ; inference can run in the same region or, where the data has been documented as transferable, in a foreign region for cost. The **on-premise build** is a compact package that runs on a single modern Linux server. All three paths support continuous batching and structured-output constraints. Operationally, a consumer GPU serves dozens of concurrent active users in steady state; server-class hardware in our hosted region serves the chat product to hundreds of concurrent users per card. Your numbers will depend on your prompt distribution and the fraction of requests that engage reasoning.
Where it falls short
We try to be honest about what the model can and cannot do. The headline limits we know about today, and what we recommend in their place:
- Knowledge cutoff: March 2026. The model does not know about events after that date unless web search or a tool call is enabled. The hosted API has search on by default; the on-premise build does not.
- Hallucinations: like every transformer, the model will sometimes confidently produce an incorrect answer. We have reduced this materially during post-training and again during preference optimisation, but we have not eliminated it. For factual-critical work, enable search or use the API's citation mode, which forces every claim to a retrieved source.
- Math: arithmetic over many digits is brittle without the reasoning mode enabled. Reasoning mode is significantly stronger; if you are benchmarking math performance, use it. We chose not to silently route long math to the reasoning model — that surprises the cost line — so the choice is explicit.
- Russian dialect coverage: standard Russian and the major minority languages of the Federation (Tatar, Bashkir, Chuvash) are supported at usable quality for everyday tasks. Less common regional dialects and pre-Soviet orthography are weaker; if your domain depends on them, fine-tune or evaluate before relying on the base model.
- Image input is text-and-document focused. Photo-grounded reasoning (medical imaging, satellite imagery, engineering drawings at scale) is not a supported use case. The model will respond, but the reliability is below what those domains need.
“If a model post only lists wins, it is marketing. We list the limits because those are the conversations our customers actually need to have.”
For developers
API access is at our developer portal, with SDKs in Python, TypeScript, Go, and Rust. Four model identifiers are exposed: `mira`, `mira-thinking`, `mira-pro`, and `mira-max`. Reasoning is opt-in via the `reasoning.effort` parameter (low | medium | high). All requests are billed by token, never by time, so reasoning depth does not surprise you on the cost line. Streaming, tools, structured JSON output, file inputs, and web search are first-class on every model. Enterprise deployments include private endpoints, dedicated capacity, separate region pinning, and direct human support for incident response.
How 5.1 feels in everyday use
Long documents drop in cleanly: a hundred-page contract, a multi-chapter book, a research paper with appendices — all fit native context and are answered in seconds. Image input is the largest qualitative change. Where the previous model could read printed Russian text from screenshots, 5.1 reads handwriting, fills in document layout context, and reasons about diagrams that mix Russian and English. Voice in chat is more natural on the harder consonant clusters and on words with mobile stress, where older models systematically tripped. The reasoning model is opt-in via a single API parameter and is described in detail in a companion post.
Precision builds
We ship multiple precision builds. The hosted API runs in a high-precision configuration tuned for quality on the public categories of our evaluation framework. Private deployments and on-device builds use more compact builds, with the quality delta for each documented in the corresponding release report so customers can pick the trade-off their workload tolerates. For most teams, the on-device build is the right pick when data cannot leave the network and latency is not the binding constraint.
Concurrency on commodity hardware
For self-hosters, the practical question is how many concurrent active users a single GPU can serve. A modern consumer GPU serves dozens of concurrent active users in steady state, with a hard ceiling that depends on context length and reasoning effort. Server-class hardware scales this several times over; one card in our hosted region serves the chat product to hundreds of concurrent users. Your numbers will depend on your prompt distribution and the fraction of requests that engage reasoning, but they are reproducible on the same class of hardware we run.
What the API actually returns
Every chat call returns a structured response: `message.content` (the answer), optionally `reasoning.summary` and `reasoning.steps` (when reasoning was enabled), `usage` (input, output, and reasoning tokens, billed separately), and `model_meta` (the exact build hash, region, and runtime engine that served the request). The build hash is not cosmetic — if you log it with each request, you have a reproducible audit trail of which exact model produced which answer, useful for compliance and for narrowing down regressions when something changes between releases. We retain the build for six months past its successor's release, so a customer can pin to a known-good model while they validate the next.
On-device deployment: when and why
The on-device build is the path for customers whose data cannot leave the network. It runs on a modern Linux server with a single consumer GPU for production-grade throughput, or on a CPU-only host for lower-volume internal tools. CPU throughput is slow enough that you would not run a chat front-end off it, but fast enough for back-office automation, document analysis, and reflective work that a human reads. The on-device build has the same instruction-following profile as the hosted model, the same refusal calibration, and the same image-input head. It does not include web search or hosted tools; those have to be wired in by the customer to systems they control.