A quick guide to the Spark RDD reduce() method in Java, used to find the sum, minimum, and maximum values in a data set.
1. Overview
In this tutorial, we will learn how to use the Spark RDD reduce() method from the Java programming language. Many developers are familiar with reduce() from PySpark; in this article, we will see
how to compute the sum, min, and max of a Java RDD.
2. Java Spark RDD – reduce() method
First, let us understand the signature of the Spark reduce() method.
public T reduce(scala.Function2<T,T,T> f)
This method takes a Function2, a functional interface along the lines of those introduced in Java 8, although this particular Function2 is defined in Scala. When calling reduce() on a JavaRDD, you pass an org.apache.spark.api.java.function.Function2<T,T,T> instead, typically as a lambda expression.
Function2 takes two arguments as input and returns one value. For reduce(), the input and output types must always be the same.
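To make the contract concrete, here is a minimal sketch of what the lambda expressions used later in this article expand to, shown as an explicit anonymous class (requires import org.apache.spark.api.java.function.Function2):

Function2<Integer, Integer, Integer> sumFunction = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer value1, Integer value2) {
        // Both inputs and the output share the same type: Integer.
        return value1 + value2;
    }
};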
3. Java Spark RDD reduce() Example to find sum
In the examples below, we first create the SparkConf and JavaSparkContext in local mode for testing purposes.
The meaning of each step is explained with comments in the program.
We pass a lambda expression to the reduce() method.
You might be curious about the logic behind the reduce() method, so below is an explanation of its internals; as a developer, it is worth knowing what is going on under the hood.
reduce() is called on the RDD with the logic value1 + value2. This function is applied repeatedly to the values within each partition until each partition is left with a single value.
If there is more than one partition, the per-partition results are then brought together (at the driver) and the same value1 + value2 logic is applied again to produce the final result.
If the input file or data set occupies only a single partition, the result of that one partition is returned as the final output.
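To illustrate, here is a plain-Java sketch of this two-phase reduction (no Spark involved; the split into two partitions is a made-up assumption, since Spark decides the actual partitioning at runtime):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoPhaseReduceSketch {
    public static void main(String[] args) {
        // Hypothetical partitions; Spark chooses the real split at runtime.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3, 4),      // partition 0
                Arrays.asList(5, 6, 7, 8, 9));  // partition 1
        // Phase 1: reduce each partition down to a single value.
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            int partial = partition.get(0);
            for (int i = 1; i < partition.size(); i++) {
                partial = partial + partition.get(i); // value1 + value2
            }
            partials.add(partial); // 10 and 35
        }
        // Phase 2: combine the per-partition results with the same logic.
        int result = partials.get(0);
        for (int i = 1; i < partials.size(); i++) {
            result = result + partials.get(i); // value1 + value2
        }
        System.out.println(result); // prints 45
    }
}

Now the full example as a Spark program: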
package com.oraclejavacertified.rdd.reduce;

import java.util.Arrays;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDReduceExample {

    public static void main(String[] args) {

        // Remove the unwanted loggers from the output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();

        // Creating the SparkConf object.
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");

        // Creating the JavaSparkContext object.
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Converting the List into a JavaRDD.
        JavaRDD<Integer> integersRDD = sc.parallelize(numbersList);

        // Getting the sum of all numbers using the reduce() method.
        Integer sumResult = integersRDD.reduce((value1, value2) -> value1 + value2);

        // Printing the sum.
        System.out.println("Sum of RDD numbers using reduce() : " + sumResult);

        // Closing the Spark context.
        sc.close();
    }

    /**
     * @return a list of integer numbers
     */
    private static List<Integer> getSampleData() {
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
    }
}
Output:
Sum of RDD numbers using reduce() : 45
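Since the lambda simply adds two boxed integers, the same result can be obtained with a method reference (assuming the same integersRDD as above):

Integer sumResult = integersRDD.reduce(Integer::sum);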
4. Java Spark RDD reduce() min and max Examples
Next, let us find the min and max values from the RDD.
package com.oraclejavacertified.rdd.reduce;

import java.util.Arrays;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDReduceMinMaxExample {

    public static void main(String[] args) {

        // Remove the unwanted loggers from the output.
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        // Getting the numbers list.
        List<Integer> numbersList = getSampleData();

        // Creating the SparkConf object.
        SparkConf sparkConf = new SparkConf().setAppName("Java RDD_Reduce Example").setMaster("local");

        // Creating the JavaSparkContext object.
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Converting the List into a JavaRDD.
        JavaRDD<Integer> integersRDD = sc.parallelize(numbersList);

        // Finding the min and max values using the reduce() method.
        Integer minResult = integersRDD.reduce((value1, value2) -> Math.min(value1, value2));
        System.out.println("Min of RDD numbers using reduce() : " + minResult);

        Integer maxResult = integersRDD.reduce((value1, value2) -> Math.max(value1, value2));
        System.out.println("Max of RDD numbers using reduce() : " + maxResult);

        // Closing the Spark context.
        sc.close();
    }

    /**
     * @return a list of integer numbers
     */
    private static List<Integer> getSampleData() {
        return Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
    }
}
Output:
Min of RDD numbers using reduce() : 1
Max of RDD numbers using reduce() : 9
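As an aside, JavaRDD also offers built-in min() and max() actions that take a java.util.Comparator, so the same values can be obtained without writing the reduce() lambdas yourself (assuming the same integersRDD as above; requires import java.util.Comparator):

Integer minResult = integersRDD.min(Comparator.naturalOrder());
Integer maxResult = integersRDD.max(Comparator.naturalOrder());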
Source: javacodegeeks.com