Millions of people are using artificial intelligence every month across Meta’s platforms, including Facebook, and the company is upgrading its data-center equipment to handle the growing computing load required by AI.
AI is a critical piece of Meta’s goal to serve more relevant content to users across its platforms, said Alexis Black Bjorlin, Meta’s vice president for infrastructure, at the AI Hardware Summit in Santa Clara, California.
“That gives us deeper insights. It gives us a better ability to predict user behavior, and thus a better ability to serve content that’s meaningful and relevant for our nearly 3 billion active daily users,” Black Bjorlin said during a keynote on Wednesday.
The hardware upgrades will also push AI into more applications and services, and will help Meta pursue its long-term pivot to a business strategy built around the metaverse, which is already underway. Close to 700 million people use augmented reality across Meta’s platforms on a monthly basis, Black Bjorlin said.
“In particular, AI can detect and take down more than 95% of objectionable content before you even see it. In Q2 alone, our AI systems removed nearly 250 million pieces of content that violated our platform safety policies across Facebook and Instagram,” Black Bjorlin said.
By 2025, Meta plans to build mega clusters containing over 4,000 accelerators, Black Bjorlin said. The network of cores will be organized as a mesh, with bandwidth of 1 terabyte per second among accelerators. Black Bjorlin didn’t elaborate on the type of accelerators the company plans to use, but the company uses Nvidia GPUs extensively, and is planning an AI supercomputer based on Nvidia’s GPUs.
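The talk gave only the topology and link speed, not how Meta schedules communication across the mesh. As a rough, hypothetical illustration of why per-accelerator bandwidth is the binding figure at this scale, the standard ring all-reduce cost model estimates the pure communication time of one gradient synchronization step; the model size and fp16 gradient format below are assumptions, not figures from the talk.

```python
# Back-of-envelope: time to synchronize gradients across a large cluster.
# Only the 4,000-accelerator count and 1 TB/s link speed come from the
# talk; the model size and fp16 gradient format are assumptions.

def ring_allreduce_seconds(model_bytes: float, n_devices: int,
                           link_bytes_per_s: float) -> float:
    """Standard ring all-reduce cost model: each device sends and
    receives about 2 * (N - 1) / N * model_bytes."""
    traffic = 2 * (n_devices - 1) / n_devices * model_bytes
    return traffic / link_bytes_per_s

# Assume a 175B-parameter model with fp16 gradients (~350 GB).
t = ring_allreduce_seconds(model_bytes=175e9 * 2,
                           n_devices=4000,
                           link_bytes_per_s=1e12)
print(f"~{t:.2f} s of pure communication per gradient synchronization")
```

Under these assumptions each synchronization step spends roughly 0.7 seconds moving data, which is why interconnect bandwidth, not just accelerator count, dominates cluster design.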
“Sometimes you’ll see us talking about the size of scale in terms of thousands of accelerators. What we really also have to design towards is megawatts,” Black Bjorlin said.
Meta has data centers across 20 regions worldwide, with each region having about five data center buildings. The company has over 50 million square feet of data center footprint across the globe, Black Bjorlin said.
A typical small-scale AI training cluster draws about eight megawatts, but Meta sees the need to scale to a 64-megawatt total power envelope.
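Scaling from an 8 MW cluster to a 64 MW envelope is an eightfold jump. The talk did not break the budget down by component; the sketch below uses purely hypothetical per-accelerator wattage and overhead figures to show how quickly such an envelope is consumed.

```python
# Hypothetical power budget for an AI training cluster.
# The 8 MW and 64 MW envelopes come from the talk; the per-accelerator
# wattage and overhead fraction below are illustrative assumptions only.

ACCEL_W = 700.0   # assumed power per accelerator, watts
OVERHEAD = 0.6    # assumed fractional overhead: network, CPUs, cooling

def cluster_power_mw(n_accel: int) -> float:
    """Total draw in megawatts for n_accel accelerators plus overhead."""
    return n_accel * ACCEL_W * (1 + OVERHEAD) / 1e6

def max_accelerators(envelope_mw: float) -> int:
    """How many accelerators fit in a given power envelope."""
    return int(envelope_mw * 1e6 / (ACCEL_W * (1 + OVERHEAD)))

print(f"4,000-accelerator cluster: {cluster_power_mw(4000):.2f} MW")
print(f"8 MW envelope fits  ~{max_accelerators(8):,} accelerators")
print(f"64 MW envelope fits ~{max_accelerators(64):,} accelerators")
```

With these assumed figures, a 4,000-accelerator cluster sits comfortably inside an 8 MW envelope, while 64 MW would support an order of magnitude more devices, before accounting for the network's growing share of the budget.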
“A big portion of this power budget is going to be dedicated to the network,” Black Bjorlin said. AI typically needs superfast network bandwidth to move data between computing cores, memory and storage for machine learning.
That entails understanding the system as a whole, identifying what adds value, and stripping out unnecessary components. Black Bjorlin said the idea is to shrink hardware at both the system and chip levels. She gave the example of optical interconnects, which Meta is researching for use in its data centers.
“It gives us a significant way to reduce the power consumption that’s taken from the optics. And when I talk about this, it’s not just about switch-to-switch on the higher-level network. It’s actually optical interconnects down to the accelerators themselves,” Black Bjorlin said.
She applauded the work being done by the CXL Consortium, which last month released version 3.0 of the Compute Express Link specification, a standard that establishes communication links between chips, memory, and storage in systems.
Meta’s current data center infrastructure serves 3.65 billion monthly active users across its services, including 2.91 billion on Facebook. Beyond blocking objectionable content with 95% accuracy, the company’s AI systems can translate between 200 languages. Meta also uses the OPT-175B natural language processing model, which has 175 billion parameters and has been open sourced to developers.
The company is building its AI infrastructure around the PyTorch machine-learning toolkit, which is emerging as a framework of choice for AI alongside TensorFlow. There are more than 150,000 PyTorch projects on GitHub from more than 2,400 authors.
Meta this week spun off its PyTorch project into the newly formed PyTorch Foundation, which will be managed by the Linux Foundation. Foundation members include the top cloud providers Amazon Web Services, Google Cloud, and Microsoft Azure.
The new operating model for AI at Meta prioritizes the velocity of moving models into production, which in some cases matters more than traditional system metrics such as performance per watt.
“We’re trying to find a way to capture the best of both worlds – to maintain developer efficiency and using quick time to production and achieving high performance. Ideally, we’d have hardware that supports native Ethernet,” Black Bjorlin said.