Meta’s AI training and inference infrastructure is growing exponentially to support an ever-increasing number of AI use cases. This creates a dramatic scaling challenge that our engineers deal with daily. We need to build and evolve the network infrastructure that connects large numbers of training accelerators, such as GPUs. We also need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect with minimal latency. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries, and scheduling infrastructure.

AI/HPC Systems Performance Engineer Responsibilities:
