Meta’s AI training and inference infrastructure is growing exponentially to support an ever-increasing number of AI use cases. This creates a dramatic scaling challenge that our engineers deal with on a daily basis. We need to build and evolve the network infrastructure that connects vast numbers of GPUs. We also need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect. To enhance the performance of these systems, we continuously seek opportunities for improvement across our infrastructure stack, including the network fabric, host networking, communication libraries, and scheduling infrastructure.

AI/HPC Network Engineer Responsibilities: