
Researchers at the Institute of Science Tokyo, Japan, have developed BingoCGN, a scalable and efficient graph neural network accelerator that enables real-time inference on large-scale graphs through graph partitioning. This breakthrough framework combines an innovative cross-partition message quantization technique with a novel training algorithm to significantly reduce memory demands and increase computational and energy efficiency.
Graph neural networks (GNNs) are powerful artificial intelligence (AI) models designed for analyzing complex, unstructured graph data. In such data, entities are represented as nodes and relationships between them are the edges. GNNs have been successfully employed in many real-world applications, including social networks, drug discovery, autonomous driving, and recommendation systems. Despite their potential, achieving real-time, large-scale GNN inference, critical for tasks like autonomous driving, remains challenging.
Large graphs require extensive memory, often overflowing on-chip buffers, which are memory regions integrated into a chip. This forces the system to rely on slower off-chip memory. Since graph data is stored irregularly, this leads to irregular memory access patterns, degrading computational efficiency and increasing energy consumption.
One promising solution is graph partitioning, where large graphs are divided into smaller graphs, each assigned its own on-chip buffer. This results in more localized memory access patterns and smaller buffer size requirements as the number of partitions increases.
However, this is only partially effective. As the number of partitions grows, the number of edges that cross partition boundaries grows substantially. Fetching node data across these inter-partition edges requires increased off-chip memory access, limiting scalability.
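This trade-off is easy to see in a toy simulation. The sketch below (an illustration, not the paper's partitioner) hash-partitions a random graph and counts the edges that cross partition boundaries: as the number of partitions grows, so does the fraction of cut edges.

```python
import random

# Hypothetical toy setup: a random graph, partitioned into k parts.
# Edges whose endpoints land in different partitions would require
# off-chip communication in a partitioned accelerator.
random.seed(0)
num_nodes = 1000
edges = [(random.randrange(num_nodes), random.randrange(num_nodes))
         for _ in range(5000)]

def count_cut_edges(num_parts):
    # Simple hash partitioning: node i goes to partition i % num_parts.
    part = [i % num_parts for i in range(num_nodes)]
    return sum(1 for u, v in edges if part[u] != part[v])

for k in (2, 8, 32, 128):
    print(k, count_cut_edges(k))
```

With random edges, roughly a fraction 1 − 1/k of edges are cut at k partitions, which is why finer partitioning alone does not scale.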
To address this issue, a research team led by Associate Professor Daichi Fujiki from Institute of Science Tokyo, Japan, developed a novel, scalable and efficient GNN accelerator called BingoCGN. “BingoCGN employs a new technique called cross-partition message quantization (CMQ) that summarizes inter-partition message flow, eliminating irregular off-chip memory access, and a new training algorithm that significantly boosts computational efficiency,” explains Fujiki. Their findings will be presented at the 52nd Annual International Symposium on Computer Architecture (ISCA ’25), held June 21–25, 2025.
CMQ uses a technique called vector quantization, which clusters inter-partition nodes and represents them using points called centroids. Nodes are clustered based on their distance, with each node assigned to its nearest centroid. For a given partition, these centroids replace the inter-partition nodes, effectively compressing node data. The centroids are stored in tables called codebooks, which reside directly in the on-chip buffer.
CMQ therefore allows inter-partition communication without the need for irregular and costly off-chip memory access. Additionally, because this approach requires frequent reading and writing of nodes and centroids, CMQ organizes the codebooks in a hierarchical tree-like structure with parent and child centroids, reducing computational demands while maintaining accuracy.
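The clustering step described above can be sketched with plain k-means-style vector quantization. The example below is illustrative only; the feature sizes, centroid count, and update loop are assumptions, not BingoCGN's actual implementation.

```python
import numpy as np

# Assumed toy data: feature vectors of 256 inter-partition (boundary) nodes.
rng = np.random.default_rng(0)
boundary_feats = rng.normal(size=(256, 16))
num_centroids = 8

# Build the codebook: a few k-means iterations cluster boundary nodes
# and represent each cluster by its centroid.
centroids = boundary_feats[rng.choice(256, num_centroids, replace=False)]
for _ in range(10):
    # Assign each node to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(
        boundary_feats[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Move each centroid to the mean of its assigned nodes.
    for c in range(num_centroids):
        members = boundary_feats[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# A receiving partition reads only the small codebook: each remote node
# is replaced by its centroid, compressing 256 vectors into 8.
quantized = centroids[assign]
print(quantized.shape)
```

Because the codebook is tiny, it can live entirely in the on-chip buffer, which is what lets CMQ avoid irregular off-chip reads.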
While CMQ solves the memory bottleneck, it shifts the burden to computation. To counter this, the researchers developed a novel training algorithm based on strong lottery ticket theory. In this method, the GNN is initialized with random weights, generated on-chip using random number generators.
Then, unnecessary weights are pruned using a mask, forming a smaller, sparser sub-network that has comparable accuracy to the full GNN but is significantly cheaper to compute. Further, this method incorporates fine-grained (FG) structured pruning, which uses multiple masks with different sparsity levels to construct an even smaller and more efficient sub-network.
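The masking mechanics can be sketched as follows. This is a minimal illustration of lottery-ticket-style pruning, assuming a per-weight score and simple threshold masks; it is not the paper's training algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random, frozen weights: in hardware these can be regenerated on-chip
# from a seed by random number generators, so they need not be stored.
weights = rng.normal(size=(64, 64))

# A score per weight decides which connections survive; real SLT
# training learns these scores, here they are random for illustration.
scores = rng.uniform(size=weights.shape)

def prune(sparsity):
    # Keep only the highest-scoring (1 - sparsity) fraction of weights.
    threshold = np.quantile(scores, sparsity)
    mask = scores >= threshold
    return weights * mask, mask

# Multiple masks at different sparsity levels, as in fine-grained
# structured pruning, yield progressively smaller sub-networks.
for s in (0.5, 0.75, 0.9):
    _, mask = prune(s)
    print(s, mask.mean())
```

The key point is that only the seed and the masks must be kept, not the dense weight matrix, which is what shifts the cost away from memory.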
“Through these techniques, BingoCGN achieves high-performance GNN inference even on finely partitioned graph data, which was previously considered difficult,” remarks Fujiki. “Our hardware implementation, tested on seven real-world datasets, achieves up to 65-fold speedup and up to 107-fold increase in energy efficiency compared to the state-of-the-art accelerator FlowGNN.”
This breakthrough opens the door to real-time processing of large-scale graph data, paving the way for diverse real-world applications of GNNs.
More information:
Jiale Yan et al, BingoGCN: Towards Scalable and Efficient GNN Acceleration with Fine-Grained Partitioning and SLT, Proceedings of the 52nd Annual International Symposium on Computer Architecture (2025). DOI: 10.1145/3695053.3731115
Citation:
New framework reduces memory usage and boosts energy efficiency for large-scale AI graph analysis (2025, June 23)