docs: updates for MapReduce

Ilkka Seppälä
2025-04-12 16:40:17 +03:00
parent 8ca487e96c
commit 4b06dc2a2e
---
title: "MapReduce Pattern in Java"
shortTitle: MapReduce
description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
category: Functional
language: en
tag:
- Concurrency
- Data processing
- Data transformation
- Functional decomposition
- Immutable
- Multithreading
- Scalability
---
## Also known as
* Split-Apply-Combine Strategy
* Scatter-Gather Pattern
* Map-Reduce
* Divide and Conquer for Data Processing
## Intent of Map Reduce Design Pattern
To efficiently process large-scale datasets by dividing computation into two phases: map and reduce, which can be executed in parallel and distributed across multiple nodes.
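The two phases named in the intent can be illustrated in a few lines of plain Java streams. This is a minimal single-machine sketch; the class and method names are illustrative and not part of the example project:

```java
import java.util.List;

// Minimal illustration of the two phases in the intent:
// map transforms each element, reduce aggregates the partial results.
public class MapReduceIntent {
  public static int totalWordCount(List<String> lines) {
    return lines.stream()
        .map(line -> line.trim().split("\\s+").length) // map: line -> word count
        .reduce(0, Integer::sum);                      // reduce: sum partial counts
  }

  public static void main(String[] args) {
    System.out.println(totalWordCount(List.of("hello world", "one two three"))); // 5
  }
}
```

In a distributed setting the `map` calls would run on different worker nodes and the `reduce` would combine their answers on a coordinating node.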
## Detailed Explanation of Map Reduce Pattern with Real-World Examples
In plain words
Wikipedia says
> MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce consists of two main steps:
> The "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
> The "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.
> This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
Flowchart
![MapReduce flowchart](./etc/mapreduce-flowchart.png)
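The multi-level tree of sub-problems described above can be sketched on a single machine with Java's fork/join framework. Everything here (class name, splitting strategy, input) is an illustrative assumption, not part of the example project:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: recursively split the input (the "map" tree),
// then merge partial word counts on the way back up (the "reduce").
public class ForkJoinWordCount extends RecursiveTask<Map<String, Integer>> {
  private final List<String> lines; // assumed non-empty

  ForkJoinWordCount(List<String> lines) { this.lines = lines; }

  @Override
  protected Map<String, Integer> compute() {
    if (lines.size() == 1) {
      Map<String, Integer> counts = new HashMap<>();
      for (String word : lines.get(0).toLowerCase().split("\\s+")) {
        counts.merge(word, 1, Integer::sum);
      }
      return counts;
    }
    int mid = lines.size() / 2;
    ForkJoinWordCount left = new ForkJoinWordCount(lines.subList(0, mid));
    ForkJoinWordCount right = new ForkJoinWordCount(lines.subList(mid, lines.size()));
    left.fork();                                   // scatter one half to a worker
    Map<String, Integer> merged = right.compute(); // process the other half here
    left.join().forEach((w, c) -> merged.merge(w, c, Integer::sum)); // gather
    return merged;
  }

  public static void main(String[] args) {
    List<String> input = List.of("hello world", "hello fork join");
    System.out.println(ForkJoinPool.commonPool().invoke(new ForkJoinWordCount(input)));
  }
}
```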
## Programmatic Example of Map Reduce in Java
### 1. Map Phase (Splitting & Processing Data)
* Each input string is split into words, normalized, and counted.
* Output: A map `{word → count}` for each input string.
#### `Mapper.java`
```java
public class Mapper {
  public static Map<String, Integer> map(String input) {
    Map<String, Integer> wordCount = new HashMap<>();
    for (String word : input.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        wordCount.merge(word, 1, Integer::sum);
      }
    }
    return wordCount;
  }
}
```
Example Input: `"Hello world hello"`
Output: `{hello=2, world=1}`
### 2. Shuffle Phase (Grouping Words Across Inputs)
* The Shuffler collects key-value pairs from multiple mappers and groups values by key.
* Output: A map `{word → list of counts}`.
#### `Shuffler.java`
```java
public class Shuffler {
  public static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> mapped) {
    // LinkedHashMap keeps first-seen key order, matching the example below.
    Map<String, List<Integer>> grouped = new LinkedHashMap<>();
    for (Map<String, Integer> map : mapped) {
      for (Map.Entry<String, Integer> entry : map.entrySet()) {
        grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
      }
    }
    return grouped;
  }
}
```
Example Input:
```
[
{"hello": 2, "world": 1},
{"hello": 1, "java": 1}
]
```
Output:
```
{
"hello": [2, 1],
"world": [1],
"java": [1]
}
```
### 3. Reduce Phase (Aggregating Counts)
* The Reducer sums the list of counts for each word.
* Output: A sorted list of word counts in descending order.
#### `Reducer.java`
```java
public class Reducer {
  public static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
    Map<String, Integer> reduced = new LinkedHashMap<>();
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      reduced.put(entry.getKey(), entry.getValue().stream().mapToInt(Integer::intValue).sum());
    }
    // Sort word counts in descending order.
    List<Map.Entry<String, Integer>> result = new ArrayList<>(reduced.entrySet());
    result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
    return result;
  }
}
```
Example Input:
```
{
"hello": [2, 1],
"world": [1],
"java": [1]
}
```
Output:
```
[
{"hello": 3},
{"world": 1},
{"java": 1}
]
```
### 4. MapReduce Coordinator (Running the Whole Pipeline)
* The MapReduce class coordinates the map, shuffle, and reduce steps.
#### `MapReduce.java`
```java
public class MapReduce {
  public static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
    List<Map<String, Integer>> mapped = new ArrayList<>();
    for (String input : inputs) {
      mapped.add(Mapper.map(input));
    }
    return Reducer.reduce(Shuffler.shuffleAndSort(mapped));
  }
}
```
### 5. Main Execution (Example Usage)
* The Main class executes the MapReduce pipeline and prints the final word count.
#### `Main.java`
```java
public class Main {
  public static void main(String[] args) {
    List<String> inputs = Arrays.asList(
        "Hello world hello"
        // remaining input lines elided in this excerpt
    );
    MapReduce.mapReduce(inputs)
        .forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));
  }
}
```
Output:
```
hello: 4
world: 2
...
fun: 1
```
## When to Use the Map Reduce Pattern in Java
* When processing large datasets that can be broken into independent chunks.
* When data operations can be naturally divided into map (transformation) and reduce (aggregation) phases.
* When horizontal scalability and parallelization are essential, especially in distributed or big data environments.
* When leveraging Java-based distributed computing platforms like Hadoop or Spark.
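As a rough single-machine approximation of the scalability point above, the map work can run in parallel across inputs with a parallel stream. `ParallelWordCount` is a hypothetical name for illustration; a genuinely distributed setup would use a platform such as Hadoop or Spark instead:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: the per-line "map" work runs concurrently on the common
// fork/join pool; the concurrent collector plays the role of reduce.
public class ParallelWordCount {
  public static Map<String, Long> count(List<String> inputs) {
    return inputs.parallelStream()
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
        .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
  }

  public static void main(String[] args) {
    System.out.println(count(List.of("hello world", "hello scale")));
  }
}
```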
## Map Reduce Pattern Java Tutorials
## Benefits and Trade-offs of Map Reduce Pattern
Benefits:
* Enables massive scalability by distributing processing across nodes.
* Encourages a functional style, promoting immutability and stateless operations.
* Simplifies complex data workflows by separating transformation (map) from aggregation (reduce).
* Fault-tolerant due to isolated, recoverable processing tasks.
Trade-offs:
* Requires a suitable problem structure; not all tasks fit the map/reduce paradigm.
* Data shuffling between map and reduce phases can be performance-intensive.
* Higher complexity in debugging and optimizing distributed jobs.
* Intermediate I/O can become a bottleneck in large-scale operations.
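A standard mitigation for the shuffle cost noted above is a combiner: pre-aggregating each node's map output locally before it is shuffled. The sketch below is a hypothetical single-JVM illustration (the `Combiner` class is not part of the example project):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative combiner: collapse the map outputs produced on one node
// into a single count map, so fewer (word, count) pairs are shuffled.
public class Combiner {
  public static Map<String, Integer> combine(List<Map<String, Integer>> localOutputs) {
    Map<String, Integer> combined = new HashMap<>();
    for (Map<String, Integer> partial : localOutputs) {
      partial.forEach((word, count) -> combined.merge(word, count, Integer::sum));
    }
    return combined;
  }

  public static void main(String[] args) {
    // Two map outputs from the same node become one pre-aggregated map.
    System.out.println(combine(List.of(Map.of("hello", 2), Map.of("hello", 1, "java", 1))));
  }
}
```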
## Real-World Applications of Map Reduce Pattern in Java
* Hadoop MapReduce: Java-based framework for distributed data processing using MapReduce.
* Apache Spark: Utilizes similar map and reduce transformations in its RDD and Dataset APIs.
* Elasticsearch: Uses MapReduce-style aggregation pipelines for querying distributed data.
* Google Bigtable: Underlying storage engine influenced by MapReduce principles.
* MongoDB Aggregation Framework: Conceptually applies MapReduce in its data pipelines.
## Related Java Design Patterns
* [Master-Worker](https://java-design-patterns.com/patterns/master-worker/): Similar distribution of tasks among workers, with a master coordinating job execution.
* [Pipeline](https://java-design-patterns.com/patterns/pipeline/): Can be used to chain multiple MapReduce operations into staged transformations.
* [Iterator](https://java-design-patterns.com/patterns/iterator/): Often used under the hood to process input streams lazily in map and reduce steps.
## References and Credits
* [Scalable Distributed Data Processing Solutions](https://tcpp.cs.gsu.edu/curriculum/?q=system%2Ffiles%2Fch07.pdf)
* [Big Data: Principles and Paradigms](https://amzn.to/3RJIGPZ)
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://amzn.to/3E6VhtD)
* [Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale](https://amzn.to/4ij2y7F)
* [Java 8 in Action: Lambdas, Streams, and functional-style programming](https://amzn.to/3QCmGXs)
* [Java Design Patterns: A Hands-On Experience with Real-World Examples](https://amzn.to/3HWNf4U)
* [Programming Pig: Dataflow Scripting with Hadoop](https://amzn.to/4cAU36K)
* [What is MapReduce (IBM)](https://www.ibm.com/think/topics/mapreduce)
* [Why MapReduce is not dead (Codemotion)](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)