
Foundational Model System Researcher

Singapore

Key Responsibilities

  • System Development & Optimization: Lead the development and optimization of large-scale model training and inference systems. Leverage cutting-edge technologies such as hybrid parallelism, automatic parallelization, high-performance operator development, and communication optimization to significantly improve training speed and efficiency, accelerating model iteration.

  • Tackling Technical Challenges: Focus on solving complex challenges in machine learning systems, including high concurrency, high reliability, and high scalability. Ensure stable and efficient system operation under diverse scenarios, providing strong technical support for continuous business growth.

  • Comprehensive Coverage Across Domains: Take responsibility for multiple critical sub-domains of machine learning systems, including resource scheduling, model training, model inference, and reinforcement learning training. Drive overall system performance improvement and functional enhancement.

  • Performance Analysis & Innovation: Conduct in-depth analysis of performance metrics during large-model training, and accurately identify and resolve bottlenecks to maximize training efficiency. Stay at the forefront of emerging machine learning system technologies, actively research and adopt new methods, fully unlock hardware potential, and drive continuous innovation and upgrades.

Preferred Qualifications

  • Programming & Framework Skills: Proficiency in at least one programming language (C, C++, Python) or experience in CUDA development. Familiarity with at least one distributed training framework such as PyTorch FSDP, DeepSpeed, or Megatron-LM. Candidates with awards in international programming competitions (e.g., ACM-ICPC, Codeforces) will be given priority.

  • Technical Solution Excellence: Ability to design solutions that meet strict standards across multiple dimensions, such as machine performance and system stability, and to deliver sound, well-reasoned, and efficient outcomes.

  • Domain Expertise & Passion: Substantial practical experience and strong interest in one or more of the following areas:

      • Parallel Systems: Deep research in distributed training of foundation models, efficient fine-tuning, reinforcement learning training, and inference engine optimization, including but not limited to parallel strategy design, quantization & compression techniques, and operator optimization.

      • High-Performance Operators: Familiarity with parallel computing (e.g., Triton, CUDA), communication technologies (e.g., NCCL, NVSHMEM), and AI compilers (e.g., MLIR, TVM, Triton, LLVM), with relevant development and optimization experience.