Data Engineer — AI & Data Infrastructure
We’re looking for a Data Engineer with a strong foundation in data pipelines and a meaningful edge in AI-native data infrastructure — specifically RAG pipelines, vector search, embedding workflows, and semantic retrieval systems.
You’ll work on two interconnected problem sets:
The first is foundational: consolidating eight legacy systems into a unified, reliable data platform — ETL pipelines, a data warehouse, and cross-system client identity resolution.
The second is where the work gets genuinely interesting: transforming three decades of institutional research into an intelligent, searchable, interactive knowledge layer that clients can query in ways that weren’t possible two years ago.
This is a small, senior team. You’ll work directly with the CTO, have real architectural ownership, and build systems that are in production — not in a sandbox.
What You’ll Work On
Data Foundation & Migration
- Lead the data engineering work for our research portal migration — extracting, transforming, and loading data from legacy systems into modern cloud infrastructure
- Build and maintain ETL/ELT pipelines across multiple integration points: CRM, research distribution platforms, trading systems, and third-party APIs
- Design and implement our “Golden Record” initiative — cross-system client identity resolution across eight legacy databases with no unified identifiers (a matching sketch follows this list)
- Implement event-driven data flows using AWS EventBridge as the central routing layer, treating each source system as a swappable adapter
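To make the event-routing idea concrete, here is a minimal sketch of a source adapter publishing a normalized event onto a central EventBridge bus. The bus name, source names, and event shape are illustrative assumptions, not our actual schema:

```python
# Minimal sketch of the adapter pattern: each legacy source system publishes
# normalized change events onto a central EventBridge bus, so downstream
# consumers never couple to source-specific schemas.
# Bus name, source names, and event shape are illustrative assumptions.
import json
import boto3

events = boto3.client("events")

def publish_client_update(source_system: str, client_record: dict) -> None:
    """Publish a normalized 'client updated' event from any source adapter."""
    events.put_events(
        Entries=[{
            "EventBusName": "data-platform-bus",    # assumed custom bus name
            "Source": f"adapter.{source_system}",   # e.g. adapter.crm
            "DetailType": "ClientRecordUpdated",    # assumed event type
            "Detail": json.dumps(client_record),
        }]
    )

# Swapping a source system means swapping the adapter that emits this event;
# the routing rules and downstream consumers stay unchanged.
publish_client_update("crm", {"client_id": "C-1024", "email": "jane@example.com"})
```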
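And for the Golden Record item, a hedged sketch of one common resolution approach: deterministic matching on normalized identifiers first, with fuzzy name matching as a fallback. Field names and the threshold are illustrative, not our actual rules:

```python
# Hedged sketch of pairwise identity resolution across systems with no
# shared identifier. Field names and the fuzzy threshold are illustrative.
from difflib import SequenceMatcher

def normalize_email(email: str | None) -> str | None:
    return email.strip().lower() if email else None

def same_client(a: dict, b: dict, fuzzy_threshold: float = 0.92) -> bool:
    # 1. Deterministic pass: normalized email or account number.
    email_a, email_b = normalize_email(a.get("email")), normalize_email(b.get("email"))
    if email_a and email_a == email_b:
        return True
    if a.get("account_no") and a.get("account_no") == b.get("account_no"):
        return True
    # 2. Fuzzy fallback: similar names across systems that share nothing else.
    name_a, name_b = a.get("full_name", "").lower(), b.get("full_name", "").lower()
    return bool(name_a and name_b) and SequenceMatcher(None, name_a, name_b).ratio() >= fuzzy_threshold

# Records judged to be the same client get merged under a single surrogate
# "golden" ID; survivorship rules decide which field values win.
```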
AI-Native Data Infrastructure (RAG & Search)
- Design and build production-grade RAG (Retrieval-Augmented Generation) pipelines over AGCO’s research archive — ingestion, chunking strategy, embedding generation, vector storage, and retrieval
- Implement hybrid search approaches that combine semantic (vector) search with keyword and metadata filtering, appropriate for structured financial research queries
- Build and maintain embedding pipelines that keep the vector store current as new research is published, with full observability and freshness guarantees
- Evaluate and implement emerging retrieval strategies as the space evolves:
  - Re-ranking with cross-encoders
  - Hypothetical Document Embeddings (HyDE)
  - Query expansion and decomposition
  - Graph-based retrieval (e.g., GraphRAG) for analyst relationship mapping
  - Structured metadata retrieval for faceted financial queries
- Wire retrieval layers into LLM interfaces for research summarization, analyst Q&A, and recommendation-change tracking across the archive
- Enable client queries such as: “Show me all emerging market buy recommendations from analysts with 10+ years of coverage who changed their view in the last 6 months”
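The example query above maps naturally onto hybrid retrieval: structured metadata filters narrow the candidate set, then semantic similarity ranks what remains. A minimal sketch, assuming an in-memory store and illustrative field names; the production version would sit on the real vector store:

```python
# Hybrid retrieval sketch: metadata filters first (faceted financial query),
# then vector similarity ranks the survivors. Field names and the in-memory
# store are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
import numpy as np

@dataclass
class ResearchChunk:
    text: str
    embedding: np.ndarray                       # assumed to be L2-normalized
    metadata: dict = field(default_factory=dict)  # e.g. region, rating, analyst_tenure_years

def hybrid_search(query_emb: np.ndarray, chunks: list[ResearchChunk],
                  filters: dict, top_k: int = 5) -> list[ResearchChunk]:
    # 1. Structured filter pass.
    candidates = [c for c in chunks
                  if all(pred(c.metadata) for pred in filters.values())]
    # 2. Semantic ranking pass (dot product of normalized embeddings = cosine).
    scored = sorted(candidates,
                    key=lambda c: float(np.dot(query_emb, c.embedding)),
                    reverse=True)
    return scored[:top_k]

# The example client query expressed as metadata predicates plus a semantic query.
six_months_ago = datetime.now() - timedelta(days=182)
filters = {
    "region":      lambda m: m.get("region") == "emerging_markets",
    "rating":      lambda m: m.get("rating") == "buy",
    "tenure":      lambda m: m.get("analyst_tenure_years", 0) >= 10,
    "view_change": lambda m: m.get("rating_changed_at", datetime.min) >= six_months_ago,
}
# hybrid_search(embedded_query, chunks, filters) would then rank the filtered set.
```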
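Upstream of retrieval, the ingestion and embedding-pipeline items above come down to chunk, embed, upsert. A hedged sketch; `embed_texts` and `vector_store.upsert` are hypothetical stand-ins for the real embedding model and store:

```python
# Ingestion-side sketch of a RAG pipeline: split a research note into
# overlapping chunks, embed each, and upsert into the vector store.
# embed_texts() and vector_store.upsert() are hypothetical stand-ins.

def chunk_text(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap, so context isn't cut mid-thought."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def ingest_research_note(note_id: str, text: str, metadata: dict,
                         embed_texts, vector_store) -> None:
    chunks = chunk_text(text)
    vectors = embed_texts(chunks)          # hypothetical embedding call
    vector_store.upsert([                  # hypothetical store API
        {"id": f"{note_id}:{i}", "vector": v, "text": c, "metadata": metadata}
        for i, (c, v) in enumerate(zip(chunks, vectors))
    ])
```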
DevOps & Data Infrastructure
- Apply DataOps practices across all pipelines: version control, CI/CD, environment parity across dev/staging/production, and infrastructure as code
- Monitor pipeline health, embedding freshness, retrieval quality, and LLM call latency — build alerting that catches problems before users do (a freshness-check sketch follows this list)
- Work within AGCO’s AWS environment (App Runner, EventBridge, CDK) and contribute to IaC best practices
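For the freshness monitoring mentioned above, one lightweight pattern is to compute the lag between the newest published research item and the newest embedded one, then emit it as a CloudWatch metric an alarm can watch. A sketch, assuming hypothetical warehouse and vector-store lookups:

```python
# Embedding-freshness check: lag between newest published research and newest
# embedded document, emitted as a CloudWatch metric. The two lookup helpers
# are hypothetical placeholders for real warehouse / vector-store queries.
from datetime import datetime, timezone
import boto3

def latest_published_at() -> datetime:
    # Placeholder: query the warehouse for the newest research publication time.
    return datetime.now(timezone.utc)

def latest_embedded_at() -> datetime:
    # Placeholder: query the vector store's ingestion watermark.
    return datetime.now(timezone.utc)

def report_embedding_freshness() -> None:
    lag_seconds = (latest_published_at() - latest_embedded_at()).total_seconds()
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataPlatform/RAG",    # assumed metric namespace
        MetricData=[{
            "MetricName": "EmbeddingFreshnessLagSeconds",
            "Value": max(lag_seconds, 0.0),
            "Unit": "Seconds",
        }],
    )
```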
Collaboration & Documentation
- Partner with the CTO, product team, and application developers to translate business requirements into sound data and retrieval architecture decisions
- Document data flows, schema designs, chunking strategies, and retrieval logic so systems are maintainable and not a black box
- Contribute to evaluation frameworks for retrieval quality — precision, recall, answer faithfulness — so we know when the system is actually working
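A minimal sketch of what such an evaluation harness can look like, assuming a small hand-labeled set of query-to-relevant-document pairs and a `retrieve()` callable (both illustrative):

```python
# Retrieval-quality sketch: precision@k and recall@k over a hand-labeled set.
# The labeled-set format and the retrieve() callable are illustrative assumptions.
from typing import Callable

def precision_recall_at_k(
    retrieve: Callable[[str, int], list[str]],   # returns ranked doc ids for a query
    labeled_queries: dict[str, set[str]],        # query -> relevant doc ids
    k: int = 5,
) -> tuple[float, float]:
    if not labeled_queries:
        return 0.0, 0.0
    precisions, recalls = [], []
    for query, relevant in labeled_queries.items():
        retrieved = set(retrieve(query, k))
        hits = len(retrieved & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Answer faithfulness needs a separate judgment step (human or LLM-as-judge)
# over generated answers, so it isn't shown here.
```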