Day 63 | From Auto-Accepting Orders to Long-Context Stress Testing: The System Left Evidence Today
Today was not a day for wrapping up the “UI Audit.” What actually happened was far more hardcore, resembling a production system drill: the communication bridge went live, the mobile entry point was connected, router strategies were corrected, local models underwent stress testing, DGX01 suddenly went offline, and finally, I fixed a misjudgment in the daily update pipeline’s status and republished today’s journal.
The most significant change today is that CC and I no longer rely entirely on manual relaying. `claw-bridge` is now running, and the Telegram private chat entry point has been integrated. The boss only needs to assign tasks in Chinese via mobile; messages enter the `owner-room`, where CX or CC claims them based on their roles. We also identified a critical misconception today: the watchdog can only alert, not execute. Messages are automatically consumed only when the executor is truly online; otherwise, the phone merely receives an alert stating “Unhandled for over 3 minutes.”
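The "alert, not execute" distinction can be made concrete. Below is a minimal sketch of that watchdog behavior, assuming an illustrative message shape (`ts`, `claimed_by`, `text`); the real claw-bridge schema is not shown in the journal.

```python
ALERT_AFTER_SECONDS = 180  # "unhandled for over 3 minutes"

def check_room(messages, now, send_alert):
    """Alert-only watchdog sketch: it can warn, but never claims or executes.

    `messages` is a list of dicts with `ts` (arrival time in seconds),
    `claimed_by` (executor name or None), and `text`. These field names
    are illustrative, not the actual claw-bridge schema.
    """
    for msg in messages:
        unclaimed = msg["claimed_by"] is None
        stale = now - msg["ts"] > ALERT_AFTER_SECONDS
        if unclaimed and stale:
            send_alert(f"Unhandled for over 3 minutes: {msg['text']}")

# Example: one stale unclaimed message, one already claimed by CC.
alerts = []
check_room(
    [{"ts": 0, "claimed_by": None, "text": "deploy router"},
     {"ts": 100, "claimed_by": "CC", "text": "smoke test"}],
    now=200,
    send_alert=alerts.append,
)
```

Only the unclaimed, stale message triggers an alert; claimed messages are the executor's responsibility, not the watchdog's.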
New rules were also established for the Router. When local concurrency is maxed out, requests must queue rather than immediately returning a 429 error. Fallback to the cloud is permitted only if the local model fails. This rule is crucial because it determines whether “local-first” is just a slogan or actual system behavior. The Go router upgrade is complete, and the DashScope key has been moved from the configuration file to launchd environment variables, reducing the risk of plaintext leakage.
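The difference between "slogan" and "system behavior" comes down to where the fallback branch sits. A minimal sketch of the rule, with hypothetical `local_call`/`cloud_call` callables standing in for the real Go router's backends:

```python
import threading

class LocalFirstRouter:
    """Sketch of the routing rule: saturation queues, only failure falls back.

    `local_call` and `cloud_call` are hypothetical callables; the real
    router is written in Go, this just encodes the decision logic.
    """
    def __init__(self, local_call, cloud_call, max_concurrency=4):
        self.local_call = local_call
        self.cloud_call = cloud_call
        self.slots = threading.Semaphore(max_concurrency)

    def handle(self, request):
        # Saturation: block (i.e. queue) for a free slot instead of
        # returning 429 immediately.
        with self.slots:
            try:
                return self.local_call(request)
            except Exception:
                # Only an actual local failure may reach the cloud.
                return self.cloud_call(request)

router = LocalFirstRouter(lambda r: "local:" + r, lambda r: "cloud:" + r)
```

Note that a full queue never routes to the cloud here; the cloud path is reachable only through the exception branch, which is exactly the "local-first" guarantee.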
Model stress testing was the main thread of the day. RedHatAI Qwen3.5-122B-A10B-NVFP4 has been downloaded to both DGX01 and DGX02, allowing the 256K service to start with TP=2 across both machines. The short smoke test passed, as did the 240K needle recall test: 228,044 prompt tokens processed in 219.34 seconds, approximately 1,039.68 tok/s. The quality JSON test also passed: 205,137 prompt tokens processed in 198.42 seconds, approximately 1,033.87 tok/s, with correct fact extraction.
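As a sanity check, the quoted throughput figures follow directly from prompt tokens divided by wall-clock seconds:

```python
def tok_per_s(prompt_tokens, seconds):
    """Throughput is just prompt tokens over wall-clock time."""
    return prompt_tokens / seconds

needle = tok_per_s(228_044, 219.34)   # needle recall test, ~1,039.7 tok/s
quality = tok_per_s(205_137, 198.42)  # quality JSON test, ~1,033.9 tok/s
```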
However, supplementary testing near the 256K limit exposed a larger issue. When pushed closer to the limit, DGX01 became unreachable at the host level. Pings from the Mac failed, and DGX02 could not ping DGX01 either; SSH returned timeouts or “no route” errors. Since DGX02 itself remained online, the entire network was not down. Instead, the issue was isolated to DGX01 itself or its link. The conclusion is clear: RedHatAI NVFP4 can serve as a candidate for long contexts exceeding 200K tokens, but it cannot yet be routed heavily by default in production. Tomorrow, I must go to the office to inspect DGX01’s machine status, network interfaces, switch ports, kernel logs, and container logs.
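That elimination argument (DGX02 up, so the network is fine; DGX01 dark from two vantage points, so the fault is local to it) can be written down as a tiny triage function. The host names come from the journal; the function itself is an illustration, not a monitoring tool:

```python
def triage(reachable):
    """Encode the Day 63 fault-isolation reasoning.

    `reachable` maps (source, target) pairs to ping success.
    Illustrative only; not part of the actual tooling.
    """
    if not reachable[("mac", "dgx02")]:
        return "wider network problem"           # DGX02 also dark
    if not reachable[("mac", "dgx01")] and not reachable[("dgx02", "dgx01")]:
        return "dgx01 host or link fault"        # isolated to DGX01
    return "no isolated fault detected"

# The observed state during the outage:
observed = {
    ("mac", "dgx02"): True,
    ("mac", "dgx01"): False,
    ("dgx02", "dgx01"): False,
}
```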
In the evening, I handled the SFD daily update. Initially, I thought today’s journal hadn’t been generated, but later found four drafts actually residing in `content/drafts/2026-05-08/`. The real bug was that the evening check did not perform host-side reconciliation, causing the queue to stall at `READY_FOR_AGENT_DRAFTING`. Consequently, subsequent stages—QA, cover image generation, trilingual translation, and publishing gates—were never triggered. This has now been fixed: the evening check now scans for actual drafts, validates frontmatter and file sizes, mirrors them to reports, and updates the queue status to `READY_FOR_DRAFT_QA`.
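The fix boils down to one principle: advance the queue based on files that actually exist on disk, not on what the previous stage claimed. A minimal sketch of such a reconciliation pass, with illustrative validation thresholds and a simplified queue dict (the production pipeline's real checks and schema are not shown in the journal):

```python
from pathlib import Path

def reconcile_drafts(drafts_dir, queue, min_bytes=500):
    """Host-side reconciliation sketch: trust evidence on disk, not queue state.

    Scans for actual draft files, applies minimal frontmatter/size checks,
    and advances the queue only when real drafts are found. Field names and
    the `min_bytes` threshold are illustrative.
    """
    valid = []
    for path in sorted(Path(drafts_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        has_frontmatter = text.startswith("---")
        big_enough = path.stat().st_size >= min_bytes
        if has_frontmatter and big_enough:
            valid.append(path.name)
    if valid and queue.get("status") == "READY_FOR_AGENT_DRAFTING":
        queue["status"] = "READY_FOR_DRAFT_QA"
        queue["drafts"] = valid
    return queue
```

With no valid drafts on disk, the status stays put and the stall is visible, rather than silently claiming progress.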
However, fixing the status is not the same as publishing. Today’s Day 63 journal still had not gone live: the publishing gate was stuck on a missing `cover_image`, and the old publishing script would not run because the local Python environment lacked the `requests` library. Ultimately, I wrote a V4 publishing tool that uses only the standard library. It first generated the cover image, then pushed the trilingual Day 63 content to the V4 API, and finally confirmed that the zh-cn, zh-tw, and en versions were all reachable from the public internet.
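The stdlib-only approach can be sketched with `urllib.request` in place of `requests`. The endpoint path, bearer auth, and payload shape below are assumptions for illustration; only the "standard library instead of `requests`" point comes from the journal:

```python
import json
import urllib.request

def build_publish_request(base_url, token, locale, title, body):
    """Build a POST request using only the standard library.

    Endpoint path, auth header, and payload fields are illustrative,
    not the actual V4 API contract.
    """
    payload = json.dumps(
        {"locale": locale, "title": title, "body": body}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v4/posts",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token},
        method="POST",
    )

def publish(req):
    # urlopen performs the request; the caller inspects the decoded reply.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

One such request per language version (zh-cn, zh-tw, en) covers the trilingual publish, and a follow-up GET against the public URL serves as the smoke test.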
Today’s lesson is straightforward: the system cannot trust “I finished it”; it can only trust evidence. Messages require acknowledgments, tasks require owners, models require smoke tests, publishing requires API queries, and pages require public verification. The true mark of mature automation is not that it never fails, but that when it does fail, it leaves sufficiently clear clues so that the next steps do not rely on guesswork.
Tomorrow’s focus involves three key tasks. First, go to the office to inspect DGX01 and determine whether the issue is a host crash, network card anomaly, or link failure. Second, advance the SFD daily update process from “draft recognition” to a fully automated closed loop covering cover images, trilingual translation, publishing, and smoke testing. Third, continue refining local model routing to solidify the responsibility boundaries for DeepSeek v4-flash, Qwen3-Coder-Next, and Qwen3.5 NVFP4.
The conclusion of Day 63 is not that a specific project was completed, but that the system became more honest today: where things flow, where they block, and where things only appear complete—the evidence is all laid out.