After Model Deployment, These Five Issues Are the Real Nightmare
Last year, we helped a startup specializing in intelligent customer service deploy a RAG (Retrieval-Augmented Generation) system. The "presentable" tasks—model

After Model Deployment, These Five Issues Are the Real Nightmare
Last year, we helped a startup specializing in intelligent customer service deploy a RAG (Retrieval-Augmented Generation) system. The "presentable" tasks—model selection, data cleaning, and prompt tuning—took three weeks. On day one of launch, the client said, "The performance is acceptable."
Then the nightmare began.
Issue 1: The Token Bill Was Three Times Higher Than Expected
The client estimated a QPS (Queries Per Second) of 50, but the actual peak surged to 200. To make matters worse, they stuffed the entire user conversation history into the context window—averaging 4,000 input tokens per request, with the model returning 800 output tokens.
We took three actions:
- Used a sliding window to truncate the context to the most recent 20 turns of conversation.
- Summarized and compressed historical data older than three days.
- Added QPS rate limiting and queuing mechanisms at the gateway layer.
Result: The token count per request dropped to around 1,200, reducing the monthly bill from an estimated 8,000 RMB to 3,500 RMB.
Lesson: Always budget costs at 1.5 times the peak load. Never trust the client's "normal" estimates.
Issue 2: GPU Memory Leaks Causing Weekly Crashes
We deployed an open-source model using vLLM. Everything ran smoothly until day seven—GPU memory usage slowly climbed from 14GB to 24GB, leading to an Out-of-Memory (OOM) crash.
Investigation revealed that vLLM's KV cache was not being properly released under specific batch sizes. Our temporary fix was a monitoring script that automatically restarted the service when memory usage exceeded 20GB. The permanent solution involved upgrading to the latest version of vLLM and adjusting the max_num_batched_tokens parameter.
Lesson: Deploying open-source models isn't just about "getting it to run." You must implement memory monitoring and automatic recovery mechanisms.
Issue 3: The Client’s Data Format Changed Daily
On day one, they provided JSON. On day two, it switched to CSV. On day three, they dumped an Excel file containing merged cells.
We were forced to build a "universal parser" at the data ingestion layer, supporting JSON, CSV, Excel, and even PDF table extraction. We also added data schema validation; any data failing to meet the format requirements was intercepted at the ingestion layer, returning clear error messages.
Lesson: Clients won’t follow your specifications. The ingestion layer must be tolerant, while the validation layer must be strict.
Issue 4: No One Noticed the Performance Degradation
In the third week after launch, the customer satisfaction score for the support service dropped from 85% to 72%. The client didn’t inform us; we only discovered it during the end-of-month reconciliation.
The cause was that the client had switched data sources, and the labeling quality of the new data was significantly poorer. We implemented a performance monitoring dashboard that automatically evaluates accuracy on 200 sample cases daily, triggering alerts if the results fall below a set threshold.
Lesson: Model performance doesn’t end at deployment. Continuous monitoring is essential, and monitoring metrics must align with the business metrics that matter to the client.
Issue 5: No Rollback Plan Was Prepared
Once, we updated the referenced model version, but the new version changed the output format, causing all downstream parsing to fail. Lacking a rollback script, it took us two hours to restore service.
Subsequently, we established a standard procedure: before every model update, we conduct a canary release on a small traffic subset (5%). If no issues arise after 24 hours, we roll out to full traffic. We also retain images and configurations of the previous version to enable one-click rollbacks.
Lesson: Deploying without a rollback plan is gambling.
Final Thoughts
In AI project delivery, the model itself may account for only 30% of the workload. The remaining 70% is engineering and operations—cost control, stability assurance, data governance, performance monitoring, and change management.
These tasks aren’t glamorous, but they determine whether a project survives its first month.
Comments
Share your thoughts!
Loading comments…