Celebrate PyTorch 2.0 with New AI Developer Performance Features


PyTorch 2.0 arrives with new AI developer performance features; read on for the details.

TorchInductor CPU FP32 Inference Optimized

This article walks through the new AI developer performance features that ship with PyTorch 2.0, starting with the TorchInductor CPU backend.

As a component of the PyTorch 2.0 compilation stack, the TorchInductor CPU backend delivers significant performance gains over PyTorch eager mode through graph compilation.

The TorchInductor CPU backend is accelerated by leveraging PyTorch ATen CPU kernels for memory-bound operations, with explicit vectorization on top of OpenMP*-based thread parallelization, and the Intel® Extension for PyTorch* for Conv/GEMM ops with post-op fusion and weight prepacking.

These improvements, combined with the powerful loop fusions in TorchInductor code generation, deliver up to a 1.7x FP32 inference performance boost across three representative deep learning benchmark suites: TorchBench, HuggingFace, and timm. Low-precision support and training support are under development.
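As a rough illustration of how this backend is reached from user code, the sketch below compiles a small model with torch.compile, which defaults to TorchInductor in PyTorch 2.0. The toy MLP and input shapes are placeholders and are not taken from the benchmarks above.

# Minimal sketch: FP32 inference through the TorchInductor CPU backend.
# The model and shapes are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# torch.compile uses the "inductor" backend by default in PyTorch 2.0.
compiled_model = torch.compile(model)

example_input = torch.randn(64, 128)
with torch.no_grad():
    out = compiled_model(example_input)  # the first call triggers graph compilation
print(out.shape)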

See the Improvements

The TorchInductor CPU Performance Dashboard tracks these performance improvements across multiple backends.

Make Graph Neural Networks (GNN) in PyG Perform Better for Inference and Training on CPU

GNNs are an effective approach for analyzing graph-structured data. This feature is intended to improve GNN inference and training performance on Intel® CPUs, including the new 4th Gen Intel® Xeon® Scalable processors.

The popular PyTorch Geometric (PyG) library is built on top of PyTorch to carry out GNN workflows. Currently, PyG's GNN models run slowly on CPU because a crucial kernel-level optimization, SpMM_reduce, is missing, along with optimizations for other GNN-related sparse matrix operations (scatter/gather, etc.).

To address this, optimizations for message passing between adjacent neural network nodes are provided (a short sketch follows this list):

scatter_reduce: a performance hotspot in message passing when the edge index is stored in coordinate format (COO).

gather: a variant of scatter_reduce tuned specifically for the GNN compute when the index is an expanded tensor.

torch.sparse.mm with reduce flag: a performance hotspot in message passing when the edge index is stored in compressed sparse row (CSR) format. Supported reduce flags: sum, mean, amax, and amin.
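As a rough sketch of the primitives listed above, the snippet below aggregates neighbor features on a tiny made-up graph twice: once with scatter_reduce over a COO edge index, and once with torch.sparse.mm and a reduce flag over the same graph in CSR form. All shapes, values, and the graph itself are illustrative placeholders.

# Minimal sketch of the message-passing primitives named above.
# The graph, shapes, and values are made up for illustration.
import torch

num_nodes, feat_dim = 4, 8
x = torch.randn(num_nodes, feat_dim)                  # node features
edge_index = torch.tensor([[0, 1, 1, 2, 3],           # source nodes (COO layout)
                           [1, 0, 2, 3, 0]])          # destination nodes
src, dst = edge_index

# COO path: gather source features, then scatter_reduce them into destination nodes.
messages = x[src]                                     # gather step
out_coo = torch.zeros(num_nodes, feat_dim)
out_coo.scatter_reduce_(0, dst.unsqueeze(-1).expand_as(messages), messages,
                        reduce="sum", include_self=False)

# CSR path: the same aggregation expressed as sparse matrix multiplication with a
# reduce flag ("sum", "mean", "amax", and "amin" are supported).
adj = torch.sparse_coo_tensor(torch.stack([dst, src]),
                              torch.ones(src.numel()),
                              (num_nodes, num_nodes)).to_sparse_csr()
out_csr = torch.sparse.mm(adj, x, reduce="sum")       # matches out_coo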

Accelerating PyG on Intel CPUs discusses the end-to-end performance benchmark results for both inference and training on the 3rd Gen Intel® Xeon® Scalable processor 8380 platform and the 4th Gen 8480+ platform.

Unified Quantization Backend to Improve INT8 Inference for X86 CPU Platforms

The new X86 quantization backend, which replaces FBGEMM as the default quantization backend for X86 systems, combines the FBGEMM (Facebook General Matrix-Matrix Multiplication) and oneAPI Deep Neural Network Library (oneDNN) backends. The result is better end-to-end INT8 inference performance compared with FBGEMM alone.

For X86 platforms, the X86 quantization backend is the default entry point for users, and kernel selection is handled automatically behind the scenes. The selection rules are based on performance test results from Intel's earlier feature development.

Accordingly, the X86 backend takes the place of FBGEMM and, depending on the use case, may deliver better performance.

The criteria for selection are as follows (a usage sketch follows the list):

On platforms without VNNI (such as those with Intel® Core™ i7 processors), FBGEMM is always used.

On platforms with VNNI support (such as 2nd-4th Gen Intel® Xeon® Scalable processors and future platforms), FBGEMM is used for linear ops.

For depth-wise convolution with more than 100 layers, FBGEMM is used; otherwise, oneDNN is used.
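As a rough usage sketch, the snippet below quantizes a toy model post-training with the FX workflow and the unified "x86" backend; the model, calibration data, and shapes are placeholders. The backend dispatches to FBGEMM or oneDNN kernels automatically per the criteria above.

# Minimal sketch: post-training static quantization with the unified x86 backend.
# The model and calibration data are illustrative placeholders.
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.backends.quantized.engine = "x86"            # select the unified x86 backend

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

qconfig_mapping = get_default_qconfig_mapping("x86")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# Calibrate with representative data (random here, purely for illustration).
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(1, 16))

quantized = convert_fx(prepared)                   # INT8 model; kernels are picked automatically
out = quantized(torch.randn(1, 16))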

Use the oneDNN Graph API to Speed Up CPU Inference

The oneDNN Graph API extends oneDNN with a flexible graph API, widening the possibilities for optimized code generation on Intel® AI hardware. It automatically identifies the graph partitions that should be accelerated through fusion. The fusion patterns focus on fusing compute-intensive operations such as convolution and matmul with their neighboring operations, for both inference and training use cases.

At this time, only inference workloads can be optimized, and only the BFloat16 and Float32 data types are supported. BF16 optimization applies only to devices that support BF16 through Intel® AVX-512.

Little to no change is required in PyTorch programs to take advantage of the newer oneDNN Graph fusions and optimized kernels. To enable oneDNN Graph, users can (see the sketch after this list):

Call the API torch.jit.enable_onednn_fusion(True) before JIT-tracing a model, OR…

Use torch.jit.fuser("fuser3") as a context manager.
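For illustration, the sketch below traces and freezes a small convolutional model with oneDNN Graph fusion enabled for Float32 inference; the model, shapes, and warm-up count are placeholders rather than a recommended recipe.

# Minimal sketch: enabling oneDNN Graph fusions for Float32 JIT inference on CPU.
# The model and input shapes are illustrative placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).eval()
example_input = torch.randn(8, 3, 224, 224)

# Option 1: enable oneDNN Graph fusion globally before JIT tracing.
torch.jit.enable_onednn_fusion(True)

with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)
    traced(example_input)        # warm-up runs let the fusion passes kick in
    traced(example_input)
    out = traced(example_input)

# Option 2 (alternative): scope the fuser selection with a context manager.
# with torch.jit.fuser("fuser3"):
#     traced = torch.jit.trace(model, example_input)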
