Member of Technical Staff
OpenAI
I work on the systems layer behind frontier-scale AI, helping keep six-figure GPU fleets and the very large pre-training runs they support healthy, stable, and making forward progress.
My work covers the path from cluster readiness before launch to job health during long-running training: GPU fleet health, high-speed fabrics and cross-connects, cluster validation, hardware-health automation, and the control planes that keep large runs reliable when individual nodes, links, or failure domains misbehave.
I operate at the boundary between infrastructure and ML engineering, where topology, workload behavior, and fast operational judgment all have to align for frontier-scale systems to deliver real progress.