DevOps Engineer

Singapore

Responsibilities:

  • Participate in the architecture design and core component development of AI training clusters, building high-performance and highly available computing platforms.

  • Develop observability systems for training and inference tasks and resources, enhancing cluster monitoring, alerting, and log analysis capabilities.

  • Optimize key components such as compute scheduling, RDMA, and container runtimes to ensure efficient and stable execution of training and inference workloads.

  • Support large-scale cluster automation for deployment, operations, and troubleshooting, improving system maintainability and availability.

Requirements:

  • Bachelor’s degree or above from a Project 211 (or higher-tier) university, in Computer Science, Software Engineering, Electronic Information, or a related field.

  • Solid foundation in operating systems; familiarity with the Linux kernel, networking, storage, and performance tuning.

  • Proficiency in Golang with strong coding ability; Kubernetes development experience is a plus.

  • Familiarity with cloud-native technologies such as Kubernetes, Docker, Prometheus, and Grafana.

  • Experience in operations or development of large-scale distributed systems, with the ability to quickly identify and resolve complex issues.