Qwen3.5 35B Local Deployment: Complete Ollama Cluster Guide

A complete guide to deploying Qwen3.5 35B locally with Ollama on a Mac Studio cluster, covering quantization choices, troubleshooting, and performance benchmarks.

Tags: ollama, qwen3.5, Local Deployment, macos, ai-infrastructure

Why We Deployed 35B Models Locally

In April 2026, SFD Lab decided to migrate its core inference tasks from the cloud to local hardware. The reason was not distrust of APIs but simple math: 9 daily articles plus 15 Agent collaborations add up to 500K+ API calls per month. Comparing the cost of two Mac Studios against cloud fees, local deployment pays for itself in about 18 months.
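The payback claim is a break-even calculation, which can be sketched as below. The article does not list the exact inputs behind the 18-month figure, so the numbers in the usage line are hypothetical placeholders in a single currency, and `payback_months` is an illustrative name rather than anything from our tooling.

```python
import math

def payback_months(hardware_cost: float,
                   monthly_cloud_cost: float,
                   monthly_local_cost: float) -> int:
    """Months until cumulative savings cover the hardware purchase."""
    monthly_savings = monthly_cloud_cost - monthly_local_cost
    if monthly_savings <= 0:
        raise ValueError("local deployment never pays back at these rates")
    return math.ceil(hardware_cost / monthly_savings)

# Hypothetical inputs, not SFD Lab's actual figures.
print(payback_months(hardware_cost=18_000,
                     monthly_cloud_cost=1_300,
                     monthly_local_cost=300))  # prints 18
```

Local running costs such as electricity belong in `monthly_local_cost`; leaving them out makes the payback look shorter than it is.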

More importantly, data stays in-house. User conversations, skill configs, and memory fragments never need to be sent to third parties.

Hardware Choice: Why Mac Studio

We chose two Mac Studio M3 Ultra machines:

  • MS01: 96GB unified memory, primary inference node
  • MS02: 96GB unified memory, backup + dedicated to coding

Why not H100? Simple: VRAM is expensive. A single 80GB H100 card costs 250K RMB, while a complete 96GB Mac Studio costs 30K RMB. Ollama's optimization on Apple Silicon is mature: Qwen3.5 35B at Q8 quantization needs only 38GB, which MS01 handles easily.

Ollama Deployment Flow

Step 1: Install Ollama

brew install ollama
# macOS requires manual service start
ollama serve
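Before pulling models, it helps to confirm the server is actually listening. The sketch below polls Ollama's `/api/version` endpoint on the default port 11434; the function name `ollama_version` and the timeout value are our own assumptions.

```python
import json
import urllib.error
import urllib.request

def ollama_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version",
                                    timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None

if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version
          else "Ollama is not reachable")
```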

Step 2: Pull Models

# MS01: Qwen3.5 35B Q8
ollama pull qwen3.5:35b-q8_0

# MS02: Qwen3-Coder-Next (dedicated to coding)
ollama pull qwen3-coder-next:latest
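Once a model is pulled, a quick smoke test is to send it one prompt over Ollama's HTTP API. This is a minimal sketch using the `/api/generate` endpoint with streaming disabled; the helper names, the prompt, and the 300-second timeout are illustrative assumptions, and the model tag is the one pulled above.

```python
import json
import urllib.request

def build_generate_request(prompt: str,
                           model: str = "qwen3.5:35b-q8_0",
                           base_url: str = "http://localhost:11434"
                           ) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str, **kwargs) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    try:
        print(generate("Reply with one word: ready?"))
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        print(f"Ollama request failed: {exc}")
```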

Quantization Guide

Quant    Size   RAM     Quality Loss
Q8_0     38GB   ~42GB   Negligible
Q6_K     30GB   ~34GB   Minimal
Q4_K_M   23GB   ~27GB   Acceptable
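The quantization figures above translate into a simple selection rule: take the highest-quality option whose RAM requirement fits the memory you can spare. A rough sketch, where `best_quant` and the `reserve_gb` allowance for macOS and other apps are hypothetical:

```python
# (name, approximate RAM in GB) from the figures above, best quality first.
QUANTS = [("Q8_0", 42), ("Q6_K", 34), ("Q4_K_M", 27)]

def best_quant(available_gb: float, reserve_gb: float = 0.0):
    """Return the highest-quality quantization that fits, or None."""
    budget = available_gb - reserve_gb
    for name, ram_gb in QUANTS:
        if ram_gb <= budget:
            return name
    return None

# reserve_gb is a hypothetical allowance, not a measured value.
print(best_quant(96, reserve_gb=16))  # Q8_0
print(best_quant(48, reserve_gb=16))  # Q4_K_M
```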

Lessons Learned

Issue 1: Download Interruption. The 38GB download takes 20-30 minutes, and network fluctuations can interrupt it. Ollama supports resuming, but make sure the permissions on ~/.ollama/models are correct.
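The permissions check can be scripted before starting a long pull. A minimal sketch, assuming the default model directory mentioned above; `models_dir_ok` is an illustrative name:

```python
import os
from pathlib import Path

def models_dir_ok(path: Path = Path.home() / ".ollama" / "models") -> bool:
    """True if the directory exists and the current user can read,
    write, and enter it."""
    return path.is_dir() and os.access(path, os.R_OK | os.W_OK | os.X_OK)

if __name__ == "__main__":
    if models_dir_ok():
        print("models directory OK")
    else:
        print("check ownership and permissions on ~/.ollama/models")
```

A `False` result on an existing directory usually points to ownership problems, for example after running Ollama once under a different user.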

Issue 2: OOM. The first run on MS01 crashed macOS. Fix: limit Ollama's maximum memory with launchctl setenv OLLAMA_MAX_VRAM 40000000000.

Issue 3: Request Queuing. A single instance handles one request at a time. Solution: run two instances on different ports (11434 and 11435).
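With two instances running, the client has to spread requests across them. One way to do that is a small round-robin picker, sketched below. It assumes the second instance was started on the other port (for example by setting the `OLLAMA_HOST` environment variable to `127.0.0.1:11435` before `ollama serve`) and that both run on the local machine; `RoundRobin` is a hypothetical helper, not part of Ollama.

```python
import itertools
import threading

class RoundRobin:
    """Hand out base URLs in rotation; safe to share across threads."""

    def __init__(self, base_urls):
        urls = list(base_urls)
        if not urls:
            raise ValueError("at least one base URL is required")
        self._cycle = itertools.cycle(urls)
        self._lock = threading.Lock()

    def next_url(self) -> str:
        with self._lock:
            return next(self._cycle)

# Ports from the article; the host is assumed to be the local machine.
pool = RoundRobin(["http://localhost:11434", "http://localhost:11435"])
for _ in range(3):
    print(pool.next_url())
```

Round-robin ignores how busy each instance is, so a long generation on one port can still delay the request queued behind it.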

Performance Benchmarks

With a 1000-token prompt and a 500-token output:

  • Q8_0 (MS01): First token 1.2s, 28 tokens/s
  • Q4_K_M (MS02): First token 0.8s, 35 tokens/s
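These two numbers combine into an approximate end-to-end latency: time to first token plus output length divided by decode speed. The sketch below works that out for the figures above and also shows how decode speed can be derived from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns with a completed response. The function names are our own, and this is not necessarily how the figures above were measured.

```python
def tokens_per_second(metrics: dict) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (ns)."""
    return metrics["eval_count"] / (metrics["eval_duration"] / 1e9)

def total_latency(first_token_s: float, output_tokens: int,
                  tok_per_s: float) -> float:
    """Approximate wall time: wait for the first token, then decode."""
    return first_token_s + output_tokens / tok_per_s

print(f"Q8_0:   {total_latency(1.2, 500, 28):.1f}s")  # 19.1s
print(f"Q4_K_M: {total_latency(0.8, 500, 35):.1f}s")  # 15.1s
```

For a 500-token answer, the Q4_K_M configuration finishes roughly four seconds sooner, most of which comes from decode speed rather than first-token time.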

SFD Editor Note

This dual-machine cluster has run for 2 weeks, serving 50+ inference requests per day with zero failures. Electricity costs about 300 SGD/month, versus 2000+ SGD in cloud API savings. Next: explore Ollama distributed inference for 72B models.
