MiniMax M3 Officially Released: Demystifying the MSA Sparse Attention Architecture, Plus a Look Inside the Mavis Sandbox
On June 1st, MiniMax (Xiyu Technology) officially released its next-generation general-purpose model, MiniMax M3. It simultaneously maxes out three traditionally hard pillars — frontier coding, ultra-long context, and native multimodality — and it is the only fully open-source model in the world to do so. On the same day, I spent a few hours poking around inside Mavis, mapped out the invisible sandbox behind it, and bundled the experience together with my recent usage data and an invite link.
Foreword: June 1st, 2026
June 1st, 2026 will probably go into the timeline of China’s large-model history. This is the day MiniMax officially shipped M3, the model it had been quietly working on for half a year. The biggest difference from previous generations is that M3 simultaneously hits three industry-acknowledged hard targets — frontier coding, up to 1M context, and native multimodality — and it is the first in China and the only fully open-source model to do all three at once.
As a heavy Mavis user, I switched all my tasks over to M3 the moment it landed — coding, research, flowcharts, long-document summarization, even a 120k-token PDF paper I dumped straight into a conversation for a recap. Two changes are immediately visible: the long context actually doesn’t stutter anymore, and tool calls are noticeably more stable, with far fewer “the model lost the thread” moments mid-task.
But this post isn’t just a fan letter for M3. I also want to peel back the “invisible sandbox” behind Mavis: where it runs, what resources it has, when it gets recycled, what it can do, and what it can’t. If you’re using Mavis for serious work, these details directly shape how you should use it so you don’t fall into a hole.
Part 1: M3’s Three Pillars — Not Hype, Real Capability
1.1 Coding: Beats GPT-5.5, Approaches Opus 4.7
On SWE-Bench Pro, M3 scores 59.0, beating GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7. On SVG-Bench (overall SVG generation quality), M3 beats Opus 4.7. On OmniDocBench (multimodal test set), M3 also beats Gemini 3.1 Pro.
The most striking result is on Claw-Eval, the end-to-end autonomous Agent benchmark — M3 took the top score.
What does this mean in practice? In my own workflow: when I used to ask AI to write a complete Spring Boot project, I had to debug it myself about 80% of the time. Now that ratio has dropped below 30%, and the result is good enough to commit directly.
1.2 1M Context: Not a Stacked Number, It Actually Works
M3’s API supports a maximum context window of 1M tokens, with a guaranteed usable 512K tokens.
For context across the industry: Claude Opus 4.7 is 200K, Gemini 3.1 Pro is 1M (but expensive), GPT-5.5 is 256K. M3 at 1M isn’t just stacking a number — it really runs. I dropped a 300-page English technical book (~180k tokens) into a conversation and asked “what’s the core argument of chapter 3?” It accurately located specific paragraphs, with no “read the back, forgot the front” issues.
This is largely thanks to M3’s native sparse attention architecture, MSA, which I’ll cover next.
1.3 Native Multimodality: Not Bolted-on, But Built-in
Many “multimodal” models actually pass the image/video through a separate visual encoder and stuff the features into the language model. M3 is different — from the very first pre-training step, text and images are trained together in the same semantic space. The M3 team has stated that interleaved text-and-image training data produces a notably better model than training modalities separately and stitching them together.
What this means: when you throw it a screenshot, a flow chart, a 5-minute video, the depth of its understanding is materially different. Computer desktop operation is also natively supported (the “look at the screen and click” capability popularized by OpenClaw).
Part 2: MSA — The Core That Makes 1M Context Tractable
Classic Transformer attention is O(n²) complexity — every 10× growth in context blows up the compute by 100×. At 1M context with full attention, you’d need 1.2 TB of VRAM, which no single GPU can hold.
MSA (MiniMax Sparse Attention) takes a two-step approach:
Step 1: Index Attention
A lightweight “index query” does Block Max Pooling on KV blocks to quickly pick the Top-k most relevant blocks. This is roughly “scan the table of contents first, pick the relevant chapters.”
Step 2: Sparse Attention
Run full attention only on the blocks picked in Step 1. Roughly “only carefully read the chapters you picked.”
This eliminates the vast majority of the compute. Official numbers:
| Metric | Value |
|---|---|
| 1M context, per-token compute | 1/20 of previous generation |
| Prefilling stage speedup | >9× |
| Decoding stage speedup | >15× |
| Operator-level perf vs leading open-source | >4× (faster than FlashMoBA / Flash-Sparse-Attention) |
And — most capabilities stay on par with full attention. The sparsification didn’t make the model dumber, which is genuinely hard to achieve.
How does MSA compare to other industry efforts?
- Xiaomi MoMo: HySparse hybrid sparse attention (Feb 2026)
- Baidu: Deep sparse attention, reducing complexity to O(n log n)
- Academia: DeepSeek’s DSA, MoBA
The whole industry is shifting from “race on parameter count” to “race on efficiency.” M3’s MSA is one engineering-grounded answer on that track.
Part 3: M3’s “Self-Evolution” — Not Just Code Generation, But Self-Optimization
What impressed me most about M3 isn’t the benchmark numbers — it’s the real tasks the team has publicly shown it completing end-to-end.
3.1 Self-Optimizing a GPU Kernel, 9.4× Speedup
MiniMax threw M3 a “FP8 GEMM optimization” task. The starting point was: a task description, a benchmark script, and a non-running Triton skeleton — no reference implementation. A senior engineering team typically needs 1–2 weeks to write a production kernel on Hopper.
M3 spent 24 hours walking the full path from baseline to production-grade optimization. During that run:
- 147 benchmark submissions
- 1,959 tool calls
- 6 landmark optimization rounds
- Hopper FP8 hardware utilization: 7.6% → 71.3% (9.4× speedup)
The critical detail: other models typically plateau within the first 30 submissions and exit on their own. M3’s best result appeared at submission #145 — it hit multiple performance plateaus before that, but kept trying new directions.
This is M3’s “doesn’t give up” property — an Agent has to not just “be able to call tools” but also “be able to grind through hard problems.”
3.2 Independently Reproducing an ICLR 2025 Award-Winning Paper
MiniMax threw an ICLR 2025 Outstanding Paper Award paper at M3 — the paper studies the learning dynamics of LLM fine-tuning. It’s full of curves, formulas, and experimental data. Long, hard, dense.
M3 ran autonomously for 12 hours, with no human intervention, producing 18 commits and 23 experimental charts. Not only did it run the core experiments successfully, it also matched the SFT-stage predicted probability trends, clearly observed the squeezing effect the DPO experiments emphasized, and validated the Extend mitigation method proposed in the original paper.
This is the concrete expression of “1M context + coding + multimodality” all working together:
- Long context: paper + code + experiment logs all fit in the window at once
- Coding: long-running execution with auto-commits
- Multimodality: actually understands the paper’s charts and formulas
3.3 Coaching Other Models
On PostTrainBench, M3 was given 4 pre-trained-only Base models and a 12-hour budget to autonomously run the full “data synthesis → training → evaluation → iteration” cycle for them.
This task has no clear feedback structure and no standard answer. M3 had to decide what data to synthesize, what training strategy to pick, and how to adjust on the next round based on each evaluation result.
Final score: 0.37, slightly below Opus 4.7 (0.42) and GPT-5.5 (0.39), but clearly ahead of the rest of the field.
Part 4: Pricing & Ecosystem — Token Plan Revamped
4.1 API Pricing (7-Day 50% Off Launch)
| Tier | List Price | Launch Price | Output | Input (Cache Read) |
|---|---|---|---|---|
| Standard | 4.2 yuan / 1M tokens | 2.1 yuan / 1M | 16.8 → 8.4 yuan / 1M | 0.84 → 0.42 yuan / 1M |
| Priority | 6.3 yuan / 1M tokens | 3.15 yuan / 1M | 25.2 → 12.6 yuan / 1M | 1.26 → 0.63 yuan / 1M |
50% off for 7 days, then back to list price. M3 ships in M3 and M3-highspeed versions with identical results — the latter is just faster. Auto Cache is fully supported, no setup needed, enabled by default.
4.2 Token Plan Subscription (Credit-Based Deduction)
| Plan | Monthly | Best for |
|---|---|---|
| Plus | 49 yuan/month | Light personal users |
| Max | 119 yuan/month | Heavy personal users / small teams |
| Ultra | 469 yuan/month | Agent-heavy players / mid-size teams |
A few key changes in the new plan:
- Deduction model change: from per-“call” to per-“actual resource consumption, converted to credits.” Simple tasks consume less; complex tasks deduct based on real usage.
- Unified credit pool: models covered by Token Plan now share a single credit pool, not split by capability.
- More transparent usage display: the console now shows your credit consumption as a progress bar.
Existing users will receive a one-time compensation credit (shared with the main pool but with an independent validity window).
Part 5: A Side Trip — What’s Inside the Mavis Sandbox
This section is the “private” part of the post. I spent a few hours running commands inside Mavis (mostly out of curiosity, also as a way to stress-test M3’s tool-calling) and mapped out the “invisible sandbox” behind it.
To stay clear of any compliance concerns, every concrete number below (resources, versions, network paths) has been generalized — no specific vendor, datacenter, IP, or user identifiers are included.
5.1 What the Sandbox Looks Like: A Minimal Linux Container
Mavis runs inside a cloud-hosted container. The whole sandbox has only 4 real processes:
PID 1: node (health-check HTTP server, listening on an internal port, exposing /healthz and /readyz)
PID 41: envd (execution engine, handling file I/O, code execution, PTY)
PID 245: bash (the temporary shell forked for each command we run)
PID 247: ps (the very ps command I just ran)
Key observations:
- No systemd — PID 1 is not systemd, it’s a node-started HTTP service. This means
systemctland similar tools don’t work. - No resident services — no cron, sshd, rsyslog, networkmanager.
- No GUI — GUI applications can’t run.
- envd is the long-running daemon — it’s the “soul” of every tool call we make; the bash process is forked by it.
Compared to E2B / Modal / Replit / GitHub Codespaces, this is a radically minimal “AI sandbox” school of thought.
5.2 Resource Quotas: Plenty for Most Things, But Don’t Expect to Run Big Models
| Resource | Quota | Notes |
|---|---|---|
| CPU cores | Soft cap 1 core, burst up to 2 | Host is a 64-core server at ~3.2 GHz |
| Memory hard limit | 2.0 GB | OOM kill beyond, no swap |
| Container-local disk | 30 GB (overlay) | Dies with the container |
| Workspace disk | NFS-mounted, plenty of space | Persistent across sessions |
| ulimit file handles | 1,048,576 | Effectively unlimited |
What this means in practice:
- 2GB memory hard limit means: running 7B LLM inference (~14GB) will OOM-kill the sandbox on the spot; but single-process scripts, document processing, and code generation are fine.
- 1 core burstable to 2 means: bursty tasks are throttled, not suited to long-running high-CPU workloads.
- 30GB container-local disk means:
/tmpis not safe — it disappears with the container;/workspaceis what persists.
5.3 What the Sandbox Can and Can’t Do
✅ What it can do:
- Arbitrary shell commands (root privileges)
- Install packages, run Python/Node/Go/Java
- Reach the public internet (I measured ~7ms latency to a domestic DNS)
- File I/O, code generation, run tests
- Process and generate documents, PDFs, PPTs, spreadsheets
- Text/image/audio/video generation (AIGC toolchain)
- Multi-agent collaboration (can dispatch peers)
- Deploy static sites to a public URL
❌ What it can’t do:
- GUI applications
- Long-term persistent daemons (the platform reclaims idle sandboxes)
- Model training (neither VRAM nor CPU is enough)
- Large dataset processing (anything needing >2GB RAM will crash)
- Cross-session state sharing (only via
/workspacefiles)
5.4 Network Isolation: ICMP Almost Fully Blocked, TCP Wide Open
The most interesting finding: the sandbox almost completely blocks ICMP, but TCP egress is fully open.
I ran a few experiments:
| Probe | Result |
|---|---|
ping every IP in the same /24 |
100% packet loss (no neighbors at all) |
ping the gateway |
100% packet loss (even though ARP shows it online) |
curl internet services |
Normal, 7–20ms latency |
| TCP port reachability | All open |
What this means:
- You can’t use ping to discover neighbors or debug network issues.
- You can use curl/wget/nc for everything that actually matters.
- This is a classic “trust the container, don’t trust the host” cloud sandbox security model.
Routing path: sandbox → major cloud vendor internal backbone → public internet. The whole path is 10–11 hops.
5.5 How the Sandbox Gets Recycled
This is one of the most-asked questions — I start a Mavis task, walk away, will the sandbox get recycled?
From my own measurements (this very session has run 1.5h+ with multiple 25–60 minute idle periods):
- The sandbox is still alive after 1+ hour of idle.
- There is no “sandbox about to be recycled” warning.
- When the sandbox dies, data is lost:
/tmp,/root, anything in memory. - Persistent data: anything written to
/workspace(the NFS mount) survives.
My best guesses (not official docs, just observed behavior):
- Idle timeout: tens of minutes to a few hours (I observed >1h without recycling)
- Max lifetime: depends on the upper runtime config (possibly 24h)
- Forced recycling: under platform resource pressure, sandboxes may be “preempted”
Recommendations for heavy users:
- Put important data in
/workspace, never in/tmpor/root. - Use sub-agents or persistent files to hand off long tasks; don’t expect a single session to last forever.
- Install dependencies at the start of a task to reduce re-install overhead later.
5.6 A Small Health-Check Easter Egg
The sandbox’s PID 1 is a node-started inline HTTP server listening on an internal port, exposing only /healthz and /readyz. These are used by K8s for liveness/readiness probes:
- Liveness check: 200 ok
- Readiness check: 200 ok
- Any other path: 404 not found
In theory, other containers in the same cluster can reach these endpoints (since the listener binds to 0.0.0.0), but the body is just “ok” — no attack surface exposed.
Part 6: My Mavis Usage Report
Numbers below are rough estimates for this session (the Mavis console shows more precise values; trust that over these estimates).
| Item | Value |
|---|---|
| Session length | ~1.5 hours |
| Tool-call count | ~20 (mostly bash, file ops, network tools) |
| Estimated token consumption | ~150K input + ~30K output |
| Sandbox idle time | Multiple 25–60 minute idle periods |
| Sandbox recycling | Not yet observed |
For exact numbers, head to the Mavis console → Usage page, or pull the data via API.
Part 7: An Invite — Try M3 with Me
M3 is officially out, and Mavis has also been upgraded with new Agent capabilities (multi-agent collaboration, the Code tool, and the revamped Token Plan).
If you want to try the M3 + Mavis combo, I made an invite link / code, shown in the image below — grab it ↓

Or visit directly: MiniMax Token Plan invite link
Part 8: Q&A
Q: Is M3 really open-source?
A: Yes. Model weights and the technical report are open-sourced within 10 days, supporting private cluster deployment and fine-tuning. Available on Hugging Face and GitHub.
Q: Should I switch from M2.5 to M3?
A: Yes. 1M context + native multimodality + more stable tool calls — these three are a qualitative leap for Agent scenarios.
Q: After the Token Plan revamp, what happens to my old Plus plan credits?
A: They’ll be displayed and controlled under the new credit window. There may be a brief adjustment period; some existing users will get a one-time compensation credit.
Q: Can I run GUI apps in the sandbox?
A: No. No graphical layer, no X11/Wayland. For UI automation, you’d have to fake it (e.g. xvfb simulation).
Q: How long before the sandbox gets recycled?
A: In my testing, idle for 1+ hour doesn’t trigger recycling. Put important data in /workspace, never /tmp or /root.
Q: Will M3’s 1M context blow up VRAM?
A: Thanks to MSA, M3’s per-token compute at 1M context is only 1/20 of the previous generation, dramatically reducing GPU VRAM pressure. Whether your specific VRAM is enough still depends on the application, though.
References
- IT之家: 《MiniMax M3 正式发布:1M 上下文 + 原生多模态》
- 北京商报: 《MiniMax 发布 M3 模型,编程和智能体专业任务达前沿水平》
- 上海证券报: 《大摩发布报告:MiniMax 在 M3 模型升级后或将调价》
- Morgan Stanley: MINIMAX M3 series Overweight rating, target price HK$990 (2026-03-03)
- MiniMax official technical docs: https://www.minimaxi.com/models/text/m3
- arxiv: 《The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence》
- CNStock: MiniMax filed its A-share IPO tutoring report on May 29
Closing thoughts: M3’s release made one thing very clear to me — Chinese large models are moving from “benchmark against GPT” to “defining their own track.” The proprietary MSA sparse attention, the native unified multimodality, the credit-based Token Plan — none of these came from copying someone else’s homework. They’re the result of genuinely re-thinking the engineering from the ground up.
I’m looking forward to M4, M5. And I’m looking forward to Mavis — with M3 underneath — becoming a true “colleague-level” Agent you can hand work off to.
— End —