Sujith Katakam – DevNetwork

I work on the systems layer behind frontier-scale AI, helping ensure that six-figure GPU fleets, and the very large pre-training runs they support, stay healthy and stable and keep making forward progress.

My work covers the path from cluster readiness before launch to job health during long-running training: GPU fleet health, high-speed fabrics and cross-connects, cluster validation, hardware-health automation, and the control planes that keep large runs reliable when individual nodes, links, or failure domains misbehave.

I operate at the boundary between infrastructure and ML engineering, where topology, workload behavior, and fast operational judgment all have to align for frontier-scale systems to deliver real progress.