Today we are shipping vMira 5.1, including a new model — vMira Thinking — that no longer just answers: it drafts a chain of reasoning, verifies its own steps, and stops when it doubts. Reasoning out loud is not a feature bolted on top; it is a different way of using vMira, and a different cost model.
Over the last eighteen months we trained our models to separate the thinking from the answer. Internally we represent the thought process as a structured trace, then collapse it into a clean response. The result is a brief pause before each output where the model drafts its reasoning, checks for arithmetic mistakes and contradictions, and polishes the final reply. That pause translates into more useful answers, fewer numerical errors, and explanations you can audit.
The interesting research question was not how to make a model think more. Models that think more are easy. The question was how to make a model that knows when enough thinking has happened — that does not burn through a thousand reasoning tokens for a question that needs ten, and does not give up on a hard question after fifty. Three techniques combine to give that behaviour today, and a fourth lands in the next training run.
How the model knows when to stop
Three things make vMira Thinking different from a long-chain-of-thought model that fills its budget by default. First, the training data was difficulty-aware: short reasoning traces were paired with easy problems and long traces with hard problems, so the model learned that brevity is a positive signal on simple tasks. Second, supervised fine-tuning embedded an explicit budget signal: the model sees its own thinking budget as a token in context and learns to compress when the budget is tight. Third, the alignment stage replaced standard preference learning with a budget-aware reinforcement-learning objective that rewards stopping early when the answer is already correct and penalises stopping early when it is not. The combined effect on our internal evaluations is roughly a sixty-seven percent reduction in reasoning tokens at equal accuracy, and a two-thirds reduction in dollar cost for a typical reasoning workload at the same quality.
Available today
vMira Thinking, exposed in the API as `mira-thinking`, is available on every Plus, Pro, and Teams plan at no extra subscription cost. Token pricing is $1.50 per million input tokens and $5.00 per million output tokens, with a 50% surcharge applied to tokens generated inside the thinking trace itself. Thinking the model skips is never billed, because tokens, not seconds, are the meter.
```ts
import { vmira } from "vmira";

const result = await vmira.chat.create({
  model: "mira-thinking",
  reasoning: { effort: "medium" }, // low | medium | high
  messages: [{ role: "user", content: "..." }],
});

console.log(result.reasoning.summary); // server-side summary of the thinking trace
console.log(result.message.content); // final answer
```

“The interesting part is not that the model thinks more. It is that we know when it is not thinking enough.”
What is next
Two follow-ups land in the next training run. The first is a deep-reasoning mode that spends minutes, not seconds, on hard problems: contract analysis, multi-file debugging, large-scale code review, mathematical proof drafting. It is gated behind a separate API parameter so it never fires by accident on a chat request. The second is a set of serving-side optimisations for long reasoning workloads. Together they should give us another multi-fold throughput improvement on those workloads. If you would like early access to deep-reasoning mode for an enterprise pilot, write to us.
Why thinking is a different model, not a flag
The first wave of "reasoning" implementations was a system-prompt template bolted onto an existing chat model — think step by step before answering. That gets you partway: the model becomes more deliberate on math and structured tasks and a little less prone to confidently-wrong arithmetic. But the costs scale linearly with the template, verbosity is the same on a one-step question and a twenty-step proof, and there is no clean way to distinguish what the model thought from what it ultimately decided. We took a different path. vMira Thinking is a separately post-trained model with a structured thought-trace as a first-class output, a budget signal woven into the input format, and an alignment stage whose reward is shaped specifically around the cost-quality trade-off of reasoning. The user gets one knob — `reasoning.effort` — and the rest is the model's job to decide.
Inside the reasoning training data
Reasoning data was assembled from three sources. The first is long Russian-language chain-of-thought traces: step-by-step explanations of physics and chemistry word problems, structured legal analyses, and competitive-programming solutions where the trace itself shows alternative approaches considered and discarded. The second is Olympiad-grade math problems with curated solutions, so the model learns the shape of a careful proof, not just the arithmetic. The third is long-CoT traces distilled from a much larger frontier model, deduplicated and quality-filtered before they entered our set. Each example is paired with metadata about its difficulty, so the model learns that short responses are correct on easy problems and long responses are correct on hard ones, not the other way around. The combination of difficulty-aware curation and a budget signal during training is what gives the model the brevity that other reasoning systems lack.
Embedded budget signal: how the model sees how much it can think
Every supervised-fine-tuning example is prefixed with a token that tells the model the budget for its thinking trace. The model learns this token is a real instruction, not decoration: when the budget is tight it compresses, when the budget is loose it expands. By the end of training the model has internalised a pricing-aware writing style — short reasoning when the question merits it, long when the question demands it, and a graceful failure mode ("I do not have enough budget to verify this" rather than a confidently-wrong answer) when neither is possible. At inference, the budget can be set explicitly per request or derived implicitly from the `reasoning.effort` parameter, with `low` collapsing the trace to a few tokens, `medium` the default, and `high` opening the budget wide for problems that need it.
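To make the two ways of setting a budget concrete, here is a minimal sketch of how an explicit budget could take precedence over one derived from `reasoning.effort`, and how the result could be prefixed to a prompt. The budget values, the `budgetTokens` field, and the `<budget:N>` token format are all illustrative assumptions; the announcement does not publish the real ones.

```typescript
type Effort = "low" | "medium" | "high";

// Hypothetical budgets in thinking tokens; the real values are not published.
const EFFORT_BUDGETS: Record<Effort, number> = {
  low: 16,
  medium: 1024,
  high: 8192,
};

interface ReasoningOptions {
  effort?: Effort;
  budgetTokens?: number; // hypothetical name for an explicit per-request budget
}

// An explicit budget wins; otherwise fall back to the effort level,
// with "medium" as the default.
function resolveBudget(options: ReasoningOptions): number {
  if (options.budgetTokens !== undefined) {
    if (!Number.isInteger(options.budgetTokens) || options.budgetTokens < 0) {
      throw new RangeError("budgetTokens must be a non-negative integer");
    }
    return options.budgetTokens;
  }
  return EFFORT_BUDGETS[options.effort ?? "medium"];
}

// Hypothetical prefix format: the budget appears as a marker ahead of the prompt.
function withBudgetPrefix(prompt: string, options: ReasoningOptions): string {
  return `<budget:${resolveBudget(options)}>\n${prompt}`;
}

console.log(withBudgetPrefix("What is 2 + 2?", { effort: "low" }));
```

Keeping the resolution in one function means the rest of a pipeline only ever deals with a single number, whichever way the caller expressed the budget.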
Budget-aware preference optimisation, not vanilla preference learning
In the alignment stage we replaced standard preference learning with a reinforcement-learning objective whose reward function explicitly accounts for thinking-token cost. Two answer trajectories that arrive at the same correct answer, one in fifty tokens and one in eight hundred, are treated as preference-equivalent on quality but differentiated on cost — and the policy learns to prefer the cheap-and-correct trajectory. Crucially, the same reward penalises under-thinking on hard problems: a model that gives up too early on a question it could have solved is corrected during the RL stage. The combined effect, on our internal evaluations, is roughly a sixty-seven-percent reduction in average reasoning tokens at constant accuracy, plus a meaningful improvement on the hardest tier of problems where short chains-of-thought are not enough.
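The shape of such a reward can be illustrated with a toy function. This is a sketch under stated assumptions, not the objective used in training: the two weights and the linear form are invented for illustration. It reproduces the two properties described above: correct answers earn the same quality reward and differ only on cost, and a wrong answer is penalised more the more budget it left unused.

```typescript
interface Trajectory {
  correct: boolean;
  thinkingTokens: number;
}

// Hypothetical weights, chosen only so that any correct answer outranks any wrong one.
const COST_WEIGHT = 0.2;
const UNDER_THINKING_WEIGHT = 1.0;

function budgetAwareReward(t: Trajectory, budget: number): number {
  if (budget <= 0) {
    throw new RangeError("budget must be positive");
  }
  const usedFraction = Math.min(t.thinkingTokens / budget, 1);
  if (t.correct) {
    // Same quality reward for every correct answer; token cost breaks the tie.
    return 1 - COST_WEIGHT * usedFraction;
  }
  // Wrong answer: stopping with budget left over is penalised hardest.
  return -UNDER_THINKING_WEIGHT * (1 - usedFraction);
}

const budget = 1000;
const shortCorrect = budgetAwareReward({ correct: true, thinkingTokens: 50 }, budget);
const longCorrect = budgetAwareReward({ correct: true, thinkingTokens: 800 }, budget);
const gaveUpEarly = budgetAwareReward({ correct: false, thinkingTokens: 50 }, budget);
const triedFully = budgetAwareReward({ correct: false, thinkingTokens: 1000 }, budget);

console.log({ shortCorrect, longCorrect, gaveUpEarly, triedFully });
```

With these illustrative weights the fifty-token correct trajectory scores above the eight-hundred-token one, and giving up early on a miss scores below a miss that used the full budget.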
What the developer actually sees
On the API, a `reasoning` block is streamed alongside the message content. You can render the reasoning as a collapsible "show your work" panel, summarise it with a second model call, hide it entirely, or post-process it to extract structured intermediate values. The summary, which we produce server-side, is roughly one-tenth the length of the raw trace and is what most product surfaces choose to show end-users. Structured-output mode, which constrains the final answer to a JSON schema you provide, composes cleanly with thinking — the model deliberates freely, then conforms its answer to your schema. Tool-calling composes too, but with a subtlety: the model uses thinking to decide whether a tool is needed, then makes the call, then continues thinking with the tool result in context. Each of these compositions is covered by streamed events, so a UI can show progress without polling.
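A UI that renders this stream usually folds events into a single view state. The sketch below shows one way to do that. The event names and payload shapes are assumptions made for illustration, since the announcement does not specify the wire format, and the sample events are invented.

```typescript
// Hypothetical event shapes; the real stream format is not described here.
type StreamEvent =
  | { type: "reasoning.delta"; text: string }
  | { type: "tool.call"; name: string }
  | { type: "message.delta"; text: string }
  | { type: "done" };

interface PanelState {
  reasoning: string; // text for the collapsible "show your work" panel
  answer: string; // final answer text
  toolCalls: string[]; // names of tools called so far
  done: boolean;
}

const emptyPanel: PanelState = { reasoning: "", answer: "", toolCalls: [], done: false };

// Pure reducer: each event produces a new state, so a UI can re-render per event.
function applyEvent(state: PanelState, event: StreamEvent): PanelState {
  switch (event.type) {
    case "reasoning.delta":
      return { ...state, reasoning: state.reasoning + event.text };
    case "tool.call":
      return { ...state, toolCalls: [...state.toolCalls, event.name] };
    case "message.delta":
      return { ...state, answer: state.answer + event.text };
    case "done":
      return { ...state, done: true };
  }
}

// Invented sample: think, call a tool, think again with the result, then answer.
const sampleEvents: StreamEvent[] = [
  { type: "reasoning.delta", text: "Need the exchange rate. " },
  { type: "tool.call", name: "lookup_rate" },
  { type: "reasoning.delta", text: "Rate found, multiply." },
  { type: "message.delta", text: "The total is " },
  { type: "message.delta", text: "84 EUR." },
  { type: "done" },
];

const panel = sampleEvents.reduce(applyEvent, emptyPanel);
console.log(panel.answer);
```

Because the reducer keeps reasoning and answer text in separate fields, hiding or collapsing the trace is a rendering decision rather than a parsing problem.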
Pricing the thinking phase fairly
We charge for tokens, not seconds. A thinking phase that finishes early bills less than one that needed three rewrites, and a question the model decides does not need thinking at all bills like a regular generation. The fifty-percent surcharge applies only to tokens generated inside the thinking trace, typically the most expensive part of the response. A typical reasoning request on the API breaks down to roughly ten percent input, sixty percent thinking, and thirty percent answer; tightening the thinking budget shrinks the middle of that breakdown, so a larger share of each dollar goes to the answer itself. For high-volume workloads, the effort-aware billing means a small drop in average effort delivers a large drop in cost without a measurable drop in quality on the easy tier of your traffic.
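The arithmetic is easy to check with the published prices. The sketch below makes two assumptions that the announcement does not state outright: the surcharge is applied on top of the output-token price, and the ten/sixty/thirty breakdown is by token count. The request sizes are invented for illustration.

```typescript
// Prices from the announcement, in US dollars per million tokens.
const INPUT_PER_MILLION = 1.5;
const OUTPUT_PER_MILLION = 5.0;
const THINKING_SURCHARGE = 0.5; // 50%, assumed to apply on top of the output price

interface TokenUsage {
  input: number;
  thinking: number;
  answer: number;
}

function estimateCostUsd(usage: TokenUsage): number {
  const thinkingPerMillion = OUTPUT_PER_MILLION * (1 + THINKING_SURCHARGE);
  const total =
    usage.input * INPUT_PER_MILLION +
    usage.thinking * thinkingPerMillion +
    usage.answer * OUTPUT_PER_MILLION;
  return total / 1_000_000;
}

// Hypothetical 10,000-token request split 10% / 60% / 30%.
const typical: TokenUsage = { input: 1_000, thinking: 6_000, answer: 3_000 };
// The same request with the thinking trace cut to a third.
const tightened: TokenUsage = { input: 1_000, thinking: 2_000, answer: 3_000 };

console.log(estimateCostUsd(typical).toFixed(4));
console.log(estimateCostUsd(tightened).toFixed(4));
```

Under these assumptions the first request costs $0.0615 and the second $0.0315. Thinking accounts for $0.0450 of the first figure, which is why trimming the trace is the largest lever on the bill.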
Failure modes and how we catch them
Thinking adds new failure surfaces: traces that loop on a wrong assumption, traces that contradict the final answer, and traces that invent intermediate facts the answer then relies on. We catch these in three places. In training, the preference data was labelled with explicit trace-consistency criteria, so the model learned to prefer traces whose intermediate claims survive into the final answer. At inference, the trace and the answer are scored against each other server-side; if they materially disagree, the model is asked to rewrite the answer to match the trace (rare in practice). In the public eval framework, a dedicated trace-fidelity category measures, on every release, whether the published trace actually supports the answer. If a release scores below our threshold for that category, we do not ship it. The same category is the entry point for community-contributed test cases.
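For teams that want a client-side sanity check on top of this, a very rough stand-in for trace-versus-answer scoring is to compare the last number each one states. The sketch below is a toy heuristic of our own devising, not the server-side scorer described above, and it only says anything about numeric answers.

```typescript
type Consistency = "consistent" | "inconsistent" | "unknown";

// Returns the last number mentioned in the text, or null if there is none.
function lastNumber(text: string): number | null {
  const matches = text.match(/-?\d+(?:\.\d+)?/g);
  if (matches === null) {
    return null;
  }
  return Number(matches[matches.length - 1]);
}

// Toy check: does the final figure in the trace match the final figure in the answer?
function checkTraceAgainstAnswer(trace: string, answer: string): Consistency {
  const fromTrace = lastNumber(trace);
  const fromAnswer = lastNumber(answer);
  if (fromTrace === null || fromAnswer === null) {
    return "unknown"; // nothing numeric to compare
  }
  return fromTrace === fromAnswer ? "consistent" : "inconsistent";
}

console.log(checkTraceAgainstAnswer("12 * 7 = 84", "The total is 84."));
console.log(checkTraceAgainstAnswer("so x = 15", "x is 51"));
console.log(checkTraceAgainstAnswer("The clause is ambiguous.", "It is ambiguous."));
```

A heuristic like this will miss most real disagreements, so treat an "inconsistent" result as a prompt for closer review rather than a verdict.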
Composability with the rest of the platform
Reasoning is not a separate API surface; it is a parameter on the same chat endpoint that powers every other vMira product. The same model identifier, the same auth, the same rate-limit policy, the same audit log. That means an existing tool-calling agent can opt into reasoning by adding one parameter. A retrieval-augmented system can run with reasoning on for the synthesis step and off for the retrieval step. A long-context analysis can spend its thinking budget once at the start of a session and reuse the conclusions across follow-up turns. We expose every intermediate so developers can build flows we have not anticipated; the only thing the platform asks for in return is that you do not paste a thinking trace back into the model verbatim — the model already knows what it thought, and showing it the trace twice produces lower-quality follow-up answers.
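In practice this means a follow-up turn should carry forward only the final answer. The sketch below shows one way to build the next `messages` array. The result shape mirrors the `result.reasoning.summary` and `result.message.content` fields used earlier in this post; the helper name and the sample data are illustrative.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Mirrors the fields read from the response earlier in the post.
interface ThinkingResult {
  reasoning: { summary: string };
  message: { content: string };
}

// Builds the next request's messages, deliberately leaving the reasoning out.
function appendTurn(
  history: ChatMessage[],
  result: ThinkingResult,
  nextUserInput: string,
): ChatMessage[] {
  return [
    ...history,
    { role: "assistant", content: result.message.content },
    { role: "user", content: nextUserInput },
  ];
}

// Invented sample data for illustration.
const history: ChatMessage[] = [{ role: "user", content: "What is 12 * 7?" }];
const previous: ThinkingResult = {
  reasoning: { summary: "Multiplied 12 by 7 step by step." },
  message: { content: "12 * 7 = 84." },
};

const nextMessages = appendTurn(history, previous, "And divided by 4?");
console.log(nextMessages.map((m) => `${m.role}: ${m.content}`).join("\n"));
```

The trace and its summary stay available for display or logging; they are simply never written into `content`.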