Responsibilities:
Participate in the architecture design and core component development of AI training clusters, building high-performance and highly available computing platforms.
Develop observability systems for training and inference tasks and resources, enhancing cluster monitoring, alerting, and log analysis capabilities.
Optimize key components such as compute scheduling, RDMA, and container runtimes to ensure efficient and stable execution of training and inference workloads.
Support large-scale cluster automation for deployment, operations, and troubleshooting, improving system maintainability and availability.
Requirements:
Bachelor’s degree or higher in Computer Science, Software Engineering, Electronic Information, or a related field, from a 211 (or higher-tier) university.
Solid foundation in operating systems; familiarity with the Linux kernel, networking, storage, and performance tuning.
Proficiency in Golang with strong coding ability; Kubernetes development experience is a plus.
Familiarity with cloud-native technologies such as Kubernetes, Docker, Prometheus, and Grafana.
Experience in operations or development of large-scale distributed systems, with the ability to quickly identify and resolve complex issues.