ML Systems Engineer (Distributed Training) — From-Scratch Model (Auxerta)
Location: United States, with required travel to Japan (Tokyo–Yokohama)
Job Type: Full-time
Start: ASAP / flexible
Sponsorship: Not available
About Auxerta
Auxerta is a startup building its own model from scratch (custom architecture and custom training loop). This role focuses on training stability, performance, and reproducibility.
Responsibilities
Own distributed training for a custom model: stability, correctness checks, performance
Scale training across GPUs (DDP/FSDP/DeepSpeed or similar)
Build robust checkpointing/resume and experiment reproducibility
Profile and improve throughput (step time, dataloader/I/O, comms bottlenecks)
Write clear documentation/postmortems and collaborate closely with founders
Requirements
5+ years of professional experience in ML systems / distributed training / deep learning engineering
Strong PyTorch experience training large models end-to-end
Hands-on with multi-GPU training (DDP/FSDP/DeepSpeed or equivalent)
Comfortable debugging NaNs, divergence, OOMs, and NCCL issues/deadlocks
Strong Linux and software engineering fundamentals; production-quality code
Comfortable working with non-standard / from-scratch model architectures
Travel & work authorization (required)
Willing and able to travel to Japan (Tokyo–Yokohama) as needed
Must have U.S. work authorization via U.S. passport (citizen) or U.S. green card (permanent resident)
No visa sponsorship now or in the future
Preferred
CUDA/Triton or other performance engineering experience
Inference/serving optimization (vLLM, TensorRT-LLM, quantization)
Data pipeline scaling (sharding/streaming/dedupe/PII filtering)
How to apply
Submit your resume and a short cover note (6–12 sentences is fine). We read both.
In your cover note, please include:
Who you are and what you’ve built (specific examples help)
Why you’re interested in this role and startup environment