ML Systems Engineer

ML Systems Engineer (Distributed Training) — From-Scratch Model (Auxerta)
 

Location: United States, with required travel to Japan (Tokyo–Yokohama)

Job Type: Full-time

Start: ASAP / flexible

Sponsorship: Not available

 

About Auxerta
Auxerta is a startup building its own model from scratch (custom architecture + custom training loop). This role focuses on training stability, performance, and reproducibility.

 

Responsibilities

Own distributed training for a custom model: stability, correctness checks, performance

Scale training across GPUs (DDP/FSDP/DeepSpeed or similar)

Build robust checkpointing/resume and experiment reproducibility

Profile and improve throughput (step time, dataloader/I/O, comms bottlenecks)

Write clear documentation/postmortems and collaborate closely with founders

 

Requirements

5+ years professional experience in ML systems / distributed training / deep learning engineering

Strong PyTorch experience training large models end-to-end

Hands-on with multi-GPU training (DDP/FSDP/DeepSpeed or equivalent)

Comfortable debugging NaNs, divergence, OOMs, and NCCL hangs/deadlocks

Strong Linux and software engineering fundamentals; production-quality code

Comfortable working with non-standard / from-scratch model architectures

 

Travel & work authorization (required)

Willing and able to travel to Japan (Tokyo–Yokohama) as needed

Must have U.S. work authorization via U.S. passport (citizen) or U.S. green card (permanent resident)

No visa sponsorship now or in the future

 

Preferred

CUDA/Triton or performance engineering

Inference/serving optimization (vLLM, TensorRT-LLM, quantization)

Data pipeline scaling (sharding/streaming/dedupe/PII filtering)

 

How to apply

Submit your resume and a short cover note (6–12 sentences is fine). We read both.

 

In your cover note, please include:

 

Who you are and what you’ve built (specific examples help)

Why you’re interested in this role and in a startup environment