TCP remains the main workhorse in today’s large-scale data centers. However, increasingly demanding performance expectations, driven by advances in hardware (e.g., 25 to 100 Gbps link speeds) and software (e.g., Intel DPDK support), make the kernel-based TCP stack no longer a favorable option. Over the past decade, multiple parties have proposed user-space TCP stacks that offer TCP support as usual but with significant performance improvements. Unfortunately, we find that these proposals may not work well in the field, especially in large-scale deployments. In this paper, we present Luna, a user-space TCP stack that has been successfully serving nearly 1 million nodes in the Aliyun cloud for the last five years. We discuss our lessons on the design tradeoffs, with an emphasis on three unique features in the thread, memory, and traffic models. Extensive microbenchmark evaluations and performance statistics collected from the field demonstrate that Luna can outperform the kernel and other user-space TCP stacks by up to 3.5X in throughput and reduce latency by up to 53%.