Key Responsibilities
Data System Development: Build large-scale data processing systems to support the training and evaluation of trillion-parameter foundation models, ensuring efficiency, stability, and scalability across the entire data pipeline.
High-Quality Data Construction: Lead the collection, cleaning, deduplication, annotation, and augmentation of data for training foundation models (language, multimodal, agent, etc.), continuously improving data quality and diversity.
Intelligent Data Tools: Develop intelligent tools for data generation, synthesis, filtering, and automated evaluation to accelerate data iteration and closed-loop optimization, supporting model capability expansion and alignment training.
Basic Requirements
Strong programming skills and proficiency in Python/C++, with solid system design abilities and the ability to independently develop large-scale data processing modules.
Familiarity with data processing and storage frameworks such as Spark, Flink, Ray, or Hadoop, with hands-on experience in building and optimizing data pipelines.
Understanding of foundation model training workflows and data quality requirements, with awareness of data-driven model iteration and evaluation practices.
Excellent problem-solving, engineering execution, and teamwork abilities.
Preferred Qualifications
Experience in constructing training datasets for large models or leading the cleaning and management of million-scale high-quality data.
Familiarity with data augmentation and synthesis techniques (e.g., Self-Instruct, RLAIF, synthetic QA generation, image-text alignment augmentation), or experience with agent-based data generation.
Knowledge of web-scale data collection, crawler development, deduplication, information extraction, and web structure parsing.
Familiarity with interactive log data construction and feedback data mining in reinforcement learning environments.
Contributions to well-known open-source datasets (e.g., OpenWebMath, RefinedWeb, RedPajama, LAION, COYO) in the form of data processing tools or cleaning strategies.
Strong performance in competitions such as ACM/ICPC, NOI/IOI, data mining, or data-centric AI.