AI/ML workloads in data centers generate a distinct traffic pattern known as “elephant flows”: a small number of very large, long-lived flows, typically remote direct memory access (RDMA) traffic produced by the graphics processing units (GPUs) in AI servers. Keeping fabric bandwidth utilization efficient is especially challenging with such low-entropy workloads. Juniper’s Arun Gandhi, Mahesh Subramaniam, and Himanshu Tambakuwala discuss load balancing techniques for the AI data center fabric and weigh their pros and cons.
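To see why low-entropy elephant flows stress conventional per-flow load balancing, consider a minimal Python sketch of static hash-based ECMP (the flow tuples, hash choice, and link count here are illustrative assumptions, not Juniper's implementation): with only a handful of flows, the hash can easily map several elephants onto the same uplink while other links sit idle.

```python
import hashlib
from collections import Counter

def ecmp_link(five_tuple, num_links):
    # Static per-flow ECMP: hash the five-tuple, pick an uplink.
    # Every packet of a flow sticks to one link, so a few huge
    # RDMA flows can pile onto the same uplink.
    digest = hashlib.sha256(repr(five_tuple).encode()).hexdigest()
    return int(digest, 16) % num_links

# Hypothetical low-entropy AI workload: 8 long-lived RDMA flows
# (UDP destination port 4791 is the standard RoCEv2 port).
flows = [("10.0.0.1", f"10.0.1.{i}", 49152 + i, 4791, "UDP")
         for i in range(8)]

# Distribute the flows over 4 equal-cost uplinks.
links = Counter(ecmp_link(f, 4) for f in flows)
print(dict(links))  # with so few flows, the spread is often uneven
```

With thousands of short flows the hash averages out, but eight elephants over four links frequently land three or four flows on one link, which is the imbalance the episode's load balancing techniques aim to avoid.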
Managing the Elephant in the Room for AI Data Centers:
RDMA Over Converged Ethernet Version 2 for AI Data Centers:
https://www.juniper.net/us/en/the-feed/topics/ai-and-machine-learning/rdma-over-converged-ethernet-version-2-for-ai-data-centers.html
AI Data Center Networking:
https://www.juniper.net/us/en/solutions/data-center/ai-infrastructure.html