
Foundational Model System Researcher

Singapore

Key Responsibilities

  • System Development & Optimization: Lead the development and optimization of large-scale model training and inference systems. Leverage cutting-edge technologies such as hybrid parallelism, automatic parallelization, high-performance operator development, and communication optimization to significantly improve training speed and efficiency, accelerating model iteration.

  • Tackling Technical Challenges: Focus on solving complex challenges in machine learning systems, including high concurrency, high reliability, and high scalability. Ensure stable and efficient system operation under diverse scenarios, providing strong technical support for continuous business growth.

  • Comprehensive Coverage Across Domains: Take responsibility for multiple critical sub-domains of machine learning systems, including resource scheduling, model training, model inference, and reinforcement learning training. Drive overall system performance improvement and functional enhancement.

  • Performance Analysis & Innovation: Conduct in-depth analysis of performance metrics during large-model training, and accurately identify and resolve bottlenecks to maximize training efficiency. Stay at the forefront of emerging machine learning system technologies, actively research and adopt new methods, fully unlock hardware potential, and drive continuous innovation and upgrades.

Preferred Qualifications

  • Programming & Framework Skills: Proficiency in at least one programming language (C, C++, Python) or experience in CUDA development. Familiarity with at least one distributed training framework such as PyTorch FSDP, DeepSpeed, or Megatron-LM. Candidates with awards in international programming competitions (e.g., ACM-ICPC, Codeforces) will be given priority.

  • Technical Solution Excellence: Ability to design solutions that meet strict standards across multiple dimensions, such as machine performance and system stability, and to deliver sound, well-reasoned, and efficient outcomes.

  • Domain Expertise & Passion: Substantial practical experience and strong interest in one or more of the following areas:

      • Parallel Systems: Deep research in distributed training of foundation models, efficient fine-tuning, reinforcement learning training, and inference engine optimization, including but not limited to parallel strategy design, quantization & compression techniques, and operator optimization.

      • High-Performance Operators: Familiarity with parallel computing (e.g., Triton, CUDA), communication technologies (e.g., NCCL, NVSHMEM), and AI compilers (e.g., MLIR, TVM, Triton, LLVM), with relevant development and optimization experience.