Wednesday, November 17, 2021

Vector math made easy: John Rose and Paul Sandoz on Java’s Vector API


The Vector API provides a mechanism for writing cross-platform data-parallel algorithms in Java, such as complex mathematical and array-based operations.


The Vector API provides a portable API for expressing vector mathematics computations. The first iteration of the API was proposed by JEP 338 and integrated into Java 16. The second incubator, JEP 414, is part of Java 17. A third incubator is in progress and is currently targeted for Java 18 as JEP 417.

This work is part of Java’s Project Panama, which strengthens many of the connections between the JVM and non-Java APIs (also known as foreign APIs), including making vector math intrinsics part of the HotSpot JVM (and thus, part of core Java).

As JEP 414’s documentation explains:

A vector computation consists of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values, where the scalar values correspond to the number of hardware-defined vector lanes. A binary operation applied to two vectors with the same number of lanes would, for each lane, apply the equivalent scalar operation on the corresponding two scalar values from each vector. This is commonly referred to as Single Instruction Multiple Data (SIMD).

Vector operations express a degree of parallelism that enables more work to be performed in a single CPU cycle and thus can result in significant performance gains. For example, given two vectors, each containing a sequence of eight integers (i.e., eight lanes), the two vectors can be added together using a single hardware instruction. The vector addition instruction operates on sixteen integers, performing eight integer additions, in the time it would ordinarily take to operate on two integers, performing one integer addition.

HotSpot already supports auto-vectorization, which transforms scalar operations into superword operations which are then mapped to vector instructions. The set of transformable scalar operations is limited, and also fragile with respect to changes in code shape. Furthermore, only a subset of the available vector instructions might be utilized, limiting the performance of generated code.

Today, a developer who wishes to write scalar operations that are reliably transformed into superword operations needs to understand HotSpot’s auto-vectorization algorithm and its limitations in order to achieve reliable and sustainable performance. In some cases, it may not be possible to write scalar operations that are transformable. For example, HotSpot does not transform the simple scalar operations for calculating the hash code of an array (thus the Arrays::hashCode methods), nor can it auto-vectorize code to lexicographically compare two arrays (thus we added an intrinsic for lexicographic comparison).

The Vector API aims to improve the situation by providing a way to write complex vector algorithms in Java, using the existing HotSpot auto-vectorizer but with a user model which makes vectorization far more predictable and robust. Hand-coded vector loops can express high-performance algorithms, such as vectorized hashCode or specialized array comparisons, which an auto-vectorizer may never optimize. Numerous domains can benefit from this explicit vector API including machine learning, linear algebra, cryptography, finance, and code within the JDK itself.
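The eight-lane addition described in the JEP text can be sketched with the incubating API. This is a minimal illustration rather than code from the JEP: the class and method names are hypothetical, and it assumes a JDK 16 or later with the `jdk.incubator.vector` module enabled.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Illustrative class name (not from the article).
// Compile and run with: --add-modules jdk.incubator.vector
class EightLaneAdd {
    // A 256-bit shape holds eight 32-bit int lanes.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    // Adds the first eight elements of a and b with a single vector add.
    static int[] addEight(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        int[] result = new int[SPECIES.length()];
        va.add(vb).intoArray(result, 0);
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] b = {10, 20, 30, 40, 50, 60, 70, 80};
        // prints [11, 22, 33, 44, 55, 66, 77, 88]
        System.out.println(Arrays.toString(addEight(a, b)));
    }
}
```

On hardware without 256-bit vector registers the same code still produces the same result; only the instructions chosen by the JIT compiler differ.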

The documentation continues:

A vector is represented by the abstract class Vector<E>. The type variable E is instantiated as the boxed type of the scalar primitive integral or floating point element types covered by the vector. A vector also has a shape which defines the size, in bits, of the vector. The shape of a vector governs how an instance of Vector<E> is mapped to a hardware vector register when vector computations are compiled by the HotSpot C2 compiler. The length of a vector, i.e., the number of lanes or elements, is the vector size divided by the element size.

and:

Operations on vectors are classified as either lane-wise or cross-lane.

A lane-wise operation applies a scalar operator, such as addition, to each lane of one or more vectors in parallel. A lane-wise operation usually, but not always, produces a vector of the same length and shape. Lane-wise operations are further classified as unary, binary, ternary, test, or conversion operations.

A cross-lane operation applies an operation across an entire vector. A cross-lane operation produces either a scalar or a vector of possibly a different shape. Cross-lane operations are further classified as permutation or reduction operations.
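To make the distinction concrete, here is a small sketch that combines a lane-wise multiplication with a cross-lane reduction to form a dot product, and also checks the lane-count rule (vector size divided by element size). The class and method names are made up for illustration, and a fixed 128-bit shape is assumed so that the lane count is the same on every platform.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative class name (not from the article).
// Compile and run with: --add-modules jdk.incubator.vector
class LaneWiseVsCrossLane {
    // 128-bit shape with 32-bit float lanes: 128 / 32 = 4 lanes.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_128;

    // Lane count derived from the shape size and element size, both in bits.
    static int laneCount() {
        return SPECIES.vectorBitSize() / SPECIES.elementSize();
    }

    // Dot product of the first SPECIES.length() elements of a and b.
    static float dotOfFirstLanes(float[] a, float[] b) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, 0);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, 0);
        FloatVector products = va.mul(vb);                // lane-wise binary operation
        return products.reduceLanes(VectorOperators.ADD); // cross-lane reduction to a scalar
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {5f, 6f, 7f, 8f};
        System.out.println(laneCount());           // prints 4
        System.out.println(dotOfFirstLanes(a, b)); // prints 70.0
    }
}
```

The multiplication yields another four-lane vector, while the reduction collapses the four lanes into one `float`.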

Sample code is listed near the end of this article.

To flesh out the Java team’s intentions behind the Vector API, Java Magazine spoke with two of the top Java architects working on those JEPs, John Rose and Paul Sandoz.

Java Magazine: Can you provide a quick overview of the Vector API?

Rose: Project Panama gives Java programmers better access to all the modern capabilities of a CPU. One of the exciting things CPUs can do is SIMD (single instruction, multiple data) processing, which provides a multilane data flow through your program. There might be four lanes or eight lanes or any number of lanes through which individual data elements flow—and the CPU is organizing operations in parallel on all lanes at once. This greatly increases throughput, as you would expect.

With the Vector API, the Java team is working to give Java programmers direct access to SIMD processing using Java code; in the past, they had to program vector math at the assembly-code level.

Until now this wasn’t a big deal, because the SIMD microprocessor features were not significant when Java was first being designed 25 years ago, but they have become more standard and widespread today. Being able to work with SIMD instructions and multiple lanes operating in parallel is a requirement now if you’re going to get the full benefit of a modern CPU. With the Vector API, Java is entering into that space in a new way, using native Java code.

Java Magazine: How does that work in practice?

Sandoz: The HotSpot compiler translates all the method calls that you do on the vector instances using this API into hardware and vector structures that match closely with the capabilities of the particular hardware platform. This lets you program in a data-parallel way using SIMD processing in a platform-independent manner; you don’t have to know or understand the underlying hardware.

I like to think of it as a “what you see is what you get” API in the sense that when your methods match that of the platform, you would expect the HotSpot compiler to translate to the optimal hardware instructions that are available on the platform.

Today, you might write your code using scalar loops, which are performed over an element at a time. That’s slow. Now, you can transform your scalar algorithms into much faster data-parallel algorithms using the Vector API, so you get a very clear, well-written application that performs well across multiple platforms.

Basically, the Vector API delivers both performance and portability.

Java Magazine: Who should look to the Vector API right away?

Rose: If you have used the Intel Intrinsics API to hand-code your own vector loops, you should look at Java’s Vector API. In fact, it is my hope that you will enjoy the Vector API more than using the vector intrinsics based on C and C++.

The Vector API tends to be a cleaner experience because instead of using ad hoc header file functions that magically open code into vector instructions, the API uses a style that is modeled cleanly with Java objects and Java interfaces and has Java’s deservedly admired rigor of definition.

Your operations can be specified to operate predictably, and that’s important, because even if the vector hardware isn’t in your runtime’s CPU, your algorithms will behave as if it has those capabilities—although, of course, runtime performance probably will be slower if the CPU doesn’t support SIMD processing.

It’s important to emphasize that what you get from the vector unit will not be merely something close to what you intended; that would not be good enough. Rather, the results will always be exactly what the Java code is designed to do, regardless of the underlying hardware implementation. The object modeling goes all the way down to the bit level, so you know which bits are going to come out based on the Java code, and yet the processing goes through the vector unit.

Java Magazine: Sounds great! So, what are the sweet spots for Vector API applications?

Rose: Parsing and sorting are top of mind. There are also many interesting algorithms that are waiting to be further developed, which right now require heroic work in assembly code but with the Vector API will not be so heroic.

Developers can be—will be—more inventive in their rethinking of algorithms that used to be sequential and scalar but can now be vectorized, which makes the code faster and easier to understand; parsing is definitely one of them.

Sandoz: Simple array operations, array mismatch, array comparison, array equality, even hash code of arrays—those are all easy wins in the JDK itself.
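As an illustration of the array mismatch case, the following sketch finds the first index at which two int arrays differ. It is not the JDK's own implementation; the class name is hypothetical, and the return convention is simply chosen to mirror `java.util.Arrays.mismatch`.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative class name (not from the article).
// Compile and run with: --add-modules jdk.incubator.vector
class VectorMismatch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Returns the index of the first differing element, or -1 if the arrays
    // are equal; if one array is a proper prefix of the other, returns the
    // shorter length (the same convention as Arrays.mismatch).
    static int mismatch(int[] a, int[] b) {
        int length = Math.min(a.length, b.length);
        int i = 0;
        int upperBound = SPECIES.loopBound(length);
        for (; i < upperBound; i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            // Lane-wise test: a mask lane is set where the elements differ.
            VectorMask<Integer> differs = va.compare(VectorOperators.NE, vb);
            if (differs.anyTrue()) {
                return i + differs.firstTrue();
            }
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : length;
    }

    public static void main(String[] args) {
        int[] a = new int[37];
        int[] b = a.clone();
        b[29] = 1;
        System.out.println(mismatch(a, b)); // prints 29
    }
}
```

Array equality follows directly: two arrays are equal when `mismatch` returns -1.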

Rose: Also, machine learning. I’m looking forward to using more bitwise operations and bit scrambling operations on vectors.

There are interesting algorithms you can build out of things such as the parallel extract and deposit operations, which aren’t vectorized yet on all platforms, or out of the Advanced Encryption Standard (AES) primitives or other cryptographic primitives.

You can do some beautiful algorithm work if vector math is part of your toolkit—and if you’re not penalized for reaching for them because you had to hand-code in C or assembly language at the expense of portability.

Java Magazine: Which vector operations are supported by the API?

Rose: Calculator functions to add, subtract, multiply, and divide across vectors and matrices, as well as floating-point and fixed-point comparisons.

Sandoz: Logical functions, transcendental functions, sine, cosine, log, and all those types of things.

I have to give a shout-out to Intel here because in the first round of the Vector API there were no optimizations for sine, cosine, log, and transcendental functions. In the second incubator, in JEP 414 with Java 17, Intel contributed the vectorized optimizations for those functions based on the Intel Short Vector Math Library (SVML). That’s a great contribution from Intel, providing high-performance vectorized mathematical functions for Java and JVM developers. There are even more optimizations coming with Java 18.
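A brief sketch of two of the operation families mentioned above: a transcendental function applied lane-wise, and a comparison whose resulting mask selects which lanes to replace. The class and method names are hypothetical. Whether the sine call is backed by SVML depends on the JDK version and platform, so the result is only assumed to be close to `Math.sin`, not bit-identical.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative class name (not from the article).
// Compile and run with: --add-modules jdk.incubator.vector
class LanewiseMath {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Computes out[i] = sin(in[i]) for every element.
    static float[] sine(float[] in) {
        float[] out = new float[in.length];
        int i = 0;
        int upperBound = SPECIES.loopBound(in.length);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, in, i);
            v.lanewise(VectorOperators.SIN).intoArray(out, i);
        }
        for (; i < in.length; i++) {
            out[i] = (float) Math.sin(in[i]); // scalar tail
        }
        return out;
    }

    // Replaces every negative element with 0 and keeps the others.
    static float[] clampNegativesToZero(float[] in) {
        float[] out = new float[in.length];
        int i = 0;
        int upperBound = SPECIES.loopBound(in.length);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, in, i);
            // Comparison yields a mask; blend overwrites only the masked lanes.
            VectorMask<Float> negative = v.compare(VectorOperators.LT, 0.0f);
            v.blend(0.0f, negative).intoArray(out, i);
        }
        for (; i < in.length; i++) {
            out[i] = in[i] < 0.0f ? 0.0f : in[i]; // scalar tail
        }
        return out;
    }
}
```

For example, `clampNegativesToZero(new float[] {-1f, 2f, -3f})` returns an array holding 0, 2, and 0.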

Java Magazine: Final thoughts?

Rose: SIMD programming is growing in importance, and every developer should have better ways to program in the SIMD style. The Vector API is our latest effort to contribute to that style of programming. I’m excited that we’re getting these capabilities into users’ hands, even if the API is still in the incubator phase. It’s something you may want to experiment with. It’s good stuff.

Sample comparison of scalar and vector code

Here is a simple scalar computation over elements of arrays, assuming that the array arguments are of the same length.

void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

Here is an equivalent vector computation, using the Vector API, which will run in parallel using the CPU’s SIMD capabilities (if they are present in the hardware).

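import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Preferred float vector shape for the current platform.
static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    // Largest multiple of SPECIES.length() that is <= a.length.
    int upperBound = SPECIES.loopBound(a.length);
    // Main loop: process SPECIES.length() elements per iteration.
    for (; i < upperBound; i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i);
    }
    // Scalar tail: handle the elements that do not fill a whole vector.
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}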
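The scalar tail loop in the example above can also be replaced by masked loads and stores, a variant the JEP documentation describes as well. The sketch below is one possible way to write it, with a hypothetical class name; whether it is faster than the explicit tail loop depends on how well the platform supports masked operations.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative class name (not from the article).
// Compile and run with: --add-modules jdk.incubator.vector
class MaskedTail {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Same computation as the example above, assuming a, b and c have equal length.
    static void vectorComputation(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            // Lanes whose index i + lane is still below a.length are set;
            // in the final iteration the out-of-range lanes are switched off.
            VectorMask<Float> inRange = SPECIES.indexInRange(i, a.length);
            FloatVector va = FloatVector.fromArray(SPECIES, a, i, inRange);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i, inRange);
            FloatVector vc = va.mul(va)
                               .add(vb.mul(vb))
                               .neg();
            // Only the in-range lanes are written back.
            vc.intoArray(c, i, inRange);
        }
    }
}
```

Because the incubator module is not resolved by default, all of these examples need `--add-modules jdk.incubator.vector` on both the `javac` and `java` command lines.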

Source: oracle.com
