Monday, June 14, 2021

Java Spark RDD reduce() Examples – sum, min and max operations


A quick guide to the Spark RDD reduce() method in Java, with examples that find the sum, min, and max values of a data set.

1. Overview

In this tutorial, we will learn how to use the Spark RDD reduce() method with the Java programming language. Many developers use the equivalent reduce() method in PySpark, but in this article we will see how to perform sum, min, and max operations on a Java RDD.

2. Java Spark RDD – reduce() method

First, let us understand the signature of Spark's reduce() method in Java.

public T reduce(scala.Function2<T,T,T> f)

This signature comes from the underlying Scala RDD class, which is why the parameter type is scala.Function2. When you work with a JavaRDD, as in the examples below, reduce() accepts org.apache.spark.api.java.function.Function2<T,T,T>, a functional interface in the spirit of Java 8 that can be implemented with a lambda expression.

Function2 takes two arguments as input and returns one value. For reduce(), the input types and the output type must always be the same type T.
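To see the contract concretely, here is a minimal, self-contained sketch (the class name Function2SyntaxSketch and its sample data are ours, not from the original article) showing both an explicit Function2 implementation and the equivalent Java 8 lambda:

package com.oraclejavacertified.rdd.reduce;
 
import java.util.Arrays;
 
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
 
public class Function2SyntaxSketch {
 
    public static void main(String[] args) {
 
        SparkConf conf = new SparkConf().setAppName("Function2 Syntax Sketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
 
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
 
        // Explicit implementation of the Function2 functional interface.
        Integer sum1 = rdd.reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer value1, Integer value2) {
                return value1 + value2;
            }
        });
 
        // Equivalent Java 8 lambda expression.
        Integer sum2 = rdd.reduce((value1, value2) -> value1 + value2);
 
        // Both print 6.
        System.out.println(sum1 + " " + sum2);
 
        sc.close();
    }
}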

3. Java Spark RDD reduce() Example to find sum


In the example below, we first create the SparkConf and JavaSparkContext in local mode for testing purposes.

Each step is explained in the comments of the program.

We pass a lambda expression to the reduce() method.

The logic behind the reduce() method may surprise you, so its internals are explained below. As a developer, you should have a basic understanding of what is going on under the hood.

The reduce() method is called on the RDD with the logic value1 + value2. This function is applied repeatedly to the values within each partition until each partition holds only a single value.

If there is more than one partition, the per-partition outputs are moved to one node, and the same value1 + value2 logic is applied again to produce the final result.

If the input file or data set has only one partition, the output of that single partition is returned as the final result.
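To make this two-phase behaviour concrete, here is a plain-Java sketch (no Spark involved; the partition boundaries are hypothetical) of per-partition reduction followed by a final combine over the same 1–9 data set:

import java.util.Arrays;
import java.util.List;
 
public class ReduceInternalsSketch {
 
    public static void main(String[] args) {
 
        // Hypothetical partitions of the data set 1..9.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3, 4),    // partition 0
                Arrays.asList(5, 6, 7, 8, 9)  // partition 1
        );
 
        // Phase 1: apply value1 + value2 within each partition
        // until each partition is reduced to a single value.
        int[] partials = new int[partitions.size()];
        for (int p = 0; p < partitions.size(); p++) {
            List<Integer> partition = partitions.get(p);
            int partial = partition.get(0);
            for (int i = 1; i < partition.size(); i++) {
                partial = partial + partition.get(i); // value1 + value2
            }
            partials[p] = partial; // 10 and 35
        }
 
        // Phase 2: apply the same logic to the per-partition results.
        int result = partials[0];
        for (int i = 1; i < partials.length; i++) {
            result = result + partials[i];
        }
 
        System.out.println(result); // 45, matching the Spark example below
    }
}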

package com.oraclejavacertified.rdd.reduce;
 
import java.util.Arrays;
import java.util.List;
 
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
 
public class RDDReduceExample {
 
    public static void main(String[] args) {
         
        // Raise the org.apache log level to WARN to reduce noise in the output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);
 
        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();
         
        // Creating the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");
 
        // Creating the JavaSparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
         
        // Converting List into JavaRDD.
        JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
         
        // Getting the sum of all numbers using reduce() method
        Integer sumResult = integersRDD.reduce( (value1, value2) -> value1 + value2);
 
        // Printing the sum
        System.out.println("Sum of RDD numbers using reduce() : " + sumResult);
         
        // closing Spark Context
        sc.close();
         
    }
 
    /**
     * Returns a sample list of integer numbers.
     *
     * @return a list of integer numbers
     */
    private static List<Integer> getSampleData() {
 
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
 
    }
 
}

Output:

Sum of RDD numbers using reduce() : 45
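As a quick check, the sum of 1 through 9 is 9 × 10 / 2 = 45. Note also that, on Java 8 or later, the same lambda can be written as a method reference (a small variation, not shown in the original listing):

Integer sumResult = integersRDD.reduce(Integer::sum);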

4. Java Spark RDD reduce() Examples to find min and max


Next, let us find the min and max values from the RDD.

package com.oraclejavacertified.rdd.reduce;
 
import java.util.Arrays;
import java.util.List;
 
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
 
public class RDDReduceMinMaxExample {
 
    public static void main(String[] args) {
         
        // Raise the org.apache log level to WARN to reduce noise in the output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);
 
        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();
         
        // Creating the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");
 
        // Creating the JavaSparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
         
        // Converting List into JavaRDD.
        JavaRDD<Integer> integersRDD =  sc.parallelize(numbersList);
         
        // Finding Min and Max values using reduce() method
         
        Integer minResult = integersRDD.reduce( (value1, value2) -> Math.min(value1, value2));
         
        System.out.println("Min of RDD numbers using reduce() : "+minResult);
         
        Integer maxResult = integersRDD.reduce( (value1, value2) -> Math.max(value1, value2));
         
        System.out.println("Max of RDD numbers using reduce() : "+maxResult);
         
        // closing Spark Context
        sc.close();
         
    }
 
    /**
     * Returns a sample list of integer numbers.
     *
     * @return a list of integer numbers
     */
    private static List<Integer> getSampleData() {
 
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
 
    }
 
}

Output:

Min of RDD numbers using reduce() : 1
Max of RDD numbers using reduce() : 9
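As a side note (this variation is ours, not part of the original example), JavaRDD also exposes min() and max() convenience methods that take a java.util.Comparator. With the natural-order comparator, which is serializable, they should return the same results:

// Requires: import java.util.Comparator;
Integer minValue = integersRDD.min(Comparator.naturalOrder());
Integer maxValue = integersRDD.max(Comparator.naturalOrder());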

Source: javacodegeeks.com
