mirror of
https://github.com/tiennm99/java-design-patterns.git
synced 2026-05-14 08:58:26 +00:00
docs: updates for MapReduce
+77
-40
@@ -2,20 +2,26 @@
title: "MapReduce Pattern in Java"
shortTitle: MapReduce
description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
category: Structural
category: Functional
language: en
tag:
- Delegation
- Concurrency
- Data processing
- Data transformation
- Functional decomposition
- Immutable
- Multithreading
- Scalability
---

## Also known as

* Split-Apply-Combine Strategy
* Scatter-Gather Pattern
* Map-Reduce
* Divide and Conquer for Data Processing

## Intent of Map Reduce Design Pattern

MapReduce aims to process and generate large datasets with a parallel, distributed algorithm on a cluster. It divides the workload into two main phases: Map and Reduce, allowing for efficient parallel processing of data.
To efficiently process large-scale datasets by dividing computation into two phases: map and reduce, which can be executed in parallel and distributed across multiple nodes.

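The two phases named in the intent can be illustrated in a single process with Java's `Stream` API, which exposes `map` and `reduce` directly. This is only a minimal sketch, not the distributed form of the pattern: the class name `TwoPhaseSketch` and the word-length task are assumptions chosen for illustration.

```java
import java.util.List;

class TwoPhaseSketch {
    // Map phase: transform each word into its length (independent per element).
    // Reduce phase: fold the lengths into one total with an associative sum,
    // which is what allows the runtime to combine partial results in parallel.
    static int totalLength(List<String> words) {
        return words.parallelStream()
                .map(String::length)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(totalLength(List.of("map", "reduce", "java"))); // prints 13
    }
}
```

The reduce function has to be associative with a true identity value (here `0` and integer addition); otherwise a parallel run could produce a different result from a sequential one.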
## Detailed Explanation of Map Reduce Pattern with Real-World Examples

@@ -29,19 +35,22 @@ In plain words

Wikipedia says

> "MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster".
MapReduce consists of two main steps:
The "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
The "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
> MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce consists of two main steps:
The Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. The Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.

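
The master/worker description above can be approximated on one machine with a thread pool standing in for the worker nodes. The sketch below is an assumption-laden illustration: `MasterWorkerSketch`, `sumOfSquares`, the chunk size, and the sum-of-squares task are all hypothetical and do not come from the pattern's source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MasterWorkerSketch {
    // Hypothetical task: sum the squares of all numbers. chunkSize must be > 0.
    static long sumOfSquares(List<Integer> numbers, int chunkSize, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Map step: the "master" splits the input into sub-problems
            // and hands each chunk to a worker thread.
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < numbers.size(); start += chunkSize) {
                List<Integer> chunk =
                        numbers.subList(start, Math.min(start + chunkSize, numbers.size()));
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int n : chunk) {
                        sum += (long) n * n;
                    }
                    return sum;
                }));
            }
            // Reduce step: the master collects the partial answers and combines them.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4, 5), 2, 3)); // prints 55
    }
}
```

A real cluster adds what this sketch leaves out: moving data between machines, retrying failed workers, and the multi-level splitting mentioned in the quote.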
Flowchart

## Programmatic Example of Map Reduce in Java

### 1. Map Phase (Splitting & Processing Data)

* The Mapper takes an input string, splits it into words, and counts occurrences.
* Output: A map {word → count} for each input line.
* Each input string is split into words, normalized, and counted.
* Output: A map `{word → count}` for each input string.

#### `Mapper.java`

```java
public class Mapper {
    public static Map<String, Integer> map(String input) {
@@ -57,13 +66,17 @@ public class Mapper {
    }
}
```

Example Input: ```"Hello world hello"```
Output: ```{hello=2, world=1}```

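The diff hides the body of `map`, so the following is only a sketch of what a map phase matching the bullets above might look like. The class name `MapperSketch` is hypothetical, and the normalization (lower-casing and splitting on whitespace) is an assumption, not the repository's actual implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MapperSketch {
    static Map<String, Integer> map(String input) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed normalization: lower-case, trim, then split on runs of whitespace.
        for (String word : input.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // merge() inserts 1 for a new word or adds 1 to the existing count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Contains hello=2 and world=1; HashMap iteration order is unspecified.
        System.out.println(map("Hello world hello"));
    }
}
```

Because each call touches only its own input string and its own result map, separate inputs can be mapped independently, which is what makes this phase easy to parallelize.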
### 2. Shuffle Phase (Grouping Data by Key)
### 2. Shuffle Phase – Grouping Words Across Inputs

* Takes results from all mappers and groups values by word.
* Output: A map `{word → list of counts}`.

* The Shuffler collects key-value pairs from multiple mappers and groups values by key.
#### `Shuffler.java`

```java
public class Shuffler {
    public static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> mapped) {
@@ -78,14 +91,18 @@ public class Shuffler {
    }
}
```

Example Input:

```
[
    {"hello": 2, "world": 1},
    {"hello": 1, "java": 1}
]
```

Output:

```
{
    "hello": [2, 1],
@@ -94,10 +111,13 @@ Output:
}
```

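Since the body of `shuffleAndSort` is elided in the diff, here is a hedged sketch of a grouping step that turns the example input above into lists of counts per word. `ShufflerSketch` and `shuffle` are hypothetical names, and the use of a `TreeMap` to keep keys sorted is an assumption made for deterministic output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShufflerSketch {
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapped) {
        // Assumption: a TreeMap keeps the words sorted alphabetically.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : mapped) {
            // Append each mapper's count to the list kept for that word.
            partial.forEach((word, count) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> mapped = List.of(
                Map.of("hello", 2, "world", 1),
                Map.of("hello", 1, "java", 1));
        System.out.println(shuffle(mapped)); // prints {hello=[2, 1], java=[1], world=[1]}
    }
}
```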
### 3. Reduce Phase (Aggregating Results)
### 3. Reduce Phase – Aggregating Counts

* Sums the list of counts for each word.
* Output: A sorted list of word counts in descending order.

* The Reducer sums up occurrences of each word.
#### `Reducer.java`

```java
public class Reducer {
    public static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
@@ -112,7 +132,9 @@ public class Reducer {
    }
}
```

Example Input:

```
{
    "hello": [2, 1],
@@ -120,7 +142,9 @@ Example Input:
    "java": [1]
}
```

Output:

```
[
    {"hello": 3},
@@ -129,10 +153,12 @@ Output:
]
```

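The reducer body is likewise hidden by the diff. A minimal sketch of a reduce step consistent with the bullets above (sum each list, sort descending) could look like this. `ReducerSketch` is a hypothetical name, and breaking ties alphabetically is an assumption the source does not state. `Map.entry` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReducerSketch {
    static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
        List<Map.Entry<String, Integer>> totals = new ArrayList<>();
        grouped.forEach((word, counts) -> {
            int sum = 0;
            for (int count : counts) {
                sum += count;
            }
            totals.add(Map.entry(word, sum));
        });
        // Descending by count; ties broken alphabetically (an assumption).
        totals.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return totals;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = Map.of(
                "hello", List.of(2, 1),
                "world", List.of(1),
                "java", List.of(1));
        System.out.println(reduce(grouped)); // prints [hello=3, java=1, world=1]
    }
}
```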
### 4. Running the Full MapReduce Process
### 4. MapReduce Coordinator – Running the Whole Pipeline

* Coordinates map, shuffle, and reduce phases.

* The MapReduce class coordinates the three steps.
#### `MapReduce.java`

```java
public class MapReduce {
    public static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
@@ -148,10 +174,12 @@ public class MapReduce {
}
```

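To show how the three phases chain together, the sketch below inlines simplified versions of each one in a single hypothetical class, `PipelineSketch`. The real `MapReduce` class presumably delegates to `Mapper`, `Shuffler`, and `Reducer`; because those bodies are not visible in this diff, everything inside the method is an assumption, as are the sample inputs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PipelineSketch {
    static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
        // 1. Map: build one {word -> count} map per input string.
        List<Map<String, Integer>> mapped = new ArrayList<>();
        for (String input : inputs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : input.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            mapped.add(counts);
        }

        // 2. Shuffle: group the per-input counts by word (sorted keys).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : mapped) {
            partial.forEach((word, count) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }

        // 3. Reduce: sum each list, then sort by total in descending order.
        List<Map.Entry<String, Integer>> totals = new ArrayList<>();
        grouped.forEach((word, counts) ->
                totals.add(Map.entry(word, counts.stream().mapToInt(Integer::intValue).sum())));
        // List.sort is stable, so equal totals keep the alphabetical order from step 2.
        totals.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not the ones used in the repository's Main class.
        for (Map.Entry<String, Integer> e : mapReduce(List.of("Hello world hello", "Hello Java"))) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Only step 1 works on each input in isolation. Steps 2 and 3 need every mapper's output, which is why the shuffle is usually the point where data has to move between workers.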
### 4. Main Execution (Calling MapReduce)
### 5. Main Execution – Example Usage

* Runs the MapReduce process and prints results.

* The Main class executes the MapReduce pipeline and prints the final word count.
#### `Main.java`

```java
public static void main(String[] args) {
    List<String> inputs = Arrays.asList(
@@ -168,6 +196,7 @@ public class MapReduce {
```

Output:

```
hello: 4
world: 2
@@ -183,10 +212,11 @@ fun: 1
## When to Use the Map Reduce Pattern in Java

Use MapReduce when:
* Processing large datasets that don't fit into a single machine's memory
* Performing computations that can be parallelized
* Dealing with fault-tolerant and distributed computing scenarios
* Analyzing log files, web crawl data, or scientific data

* When processing large datasets that can be broken into independent chunks.
* When data operations can be naturally divided into map (transformation) and reduce (aggregation) phases.
* When horizontal scalability and parallelization are essential, especially in distributed or big data environments.
* When leveraging Java-based distributed computing platforms like Hadoop or Spark.

## Map Reduce Pattern Java Tutorials

@@ -197,32 +227,39 @@ Use MapReduce when:

Benefits:

* Scalability: Can process vast amounts of data across multiple machines
* Fault-tolerance: Handles machine failures gracefully
* Simplicity: Abstracts complex distributed computing details
* Enables massive scalability by distributing processing across nodes.
* Encourages a functional style, promoting immutability and stateless operations.
* Simplifies complex data workflows by separating transformation (map) from aggregation (reduce).
* Fault-tolerant due to isolated, recoverable processing tasks.

Trade-offs:

* Overhead: Not efficient for small datasets due to setup and coordination costs
* Limited flexibility: Not suitable for all types of computations or algorithms
* Latency: Batch-oriented nature may not be suitable for real-time processing needs
* Requires a suitable problem structure; not all tasks fit the map/reduce paradigm.
* Data shuffling between map and reduce phases can be performance-intensive.
* Higher complexity in debugging and optimizing distributed jobs.
* Intermediate I/O can become a bottleneck in large-scale operations.

## Real-World Applications of Map Reduce Pattern in Java

* Google's original implementation for indexing web pages
* Hadoop MapReduce for big data processing
* Log analysis in large-scale systems
* Genomic sequence analysis in bioinformatics
* Hadoop MapReduce: Java-based framework for distributed data processing using MapReduce.
* Apache Spark: Utilizes similar map and reduce transformations in its RDD and Dataset APIs.
* Elasticsearch: Uses MapReduce-style aggregation pipelines for querying distributed data.
* Google Bigtable: Distributed storage system that can serve as both an input source and an output target for MapReduce jobs.
* MongoDB Aggregation Framework: Conceptually applies MapReduce in its data pipelines.

## Related Java Design Patterns

* Chaining Pattern
* Master-Worker Pattern
* Pipeline Pattern
* [Master-Worker](https://java-design-patterns.com/patterns/master-worker/): Similar distribution of tasks among workers, with a master coordinating job execution.
* [Pipeline](https://java-design-patterns.com/patterns/pipeline/): Can be used to chain multiple MapReduce operations into staged transformations.
* [Iterator](https://java-design-patterns.com/patterns/iterator/): Often used under the hood to process input streams lazily in map and reduce steps.

## References and Credits

* [What is MapReduce](https://www.ibm.com/think/topics/mapreduce)
* [Wy MapReduce is not dead](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)
* [Scalabe Distributed Data Processing Solutions](https://tcpp.cs.gsu.edu/curriculum/?q=system%2Ffiles%2Fch07.pdf)
* [Big Data: Principles and Paradigms](https://amzn.to/3RJIGPZ)
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://amzn.to/3E6VhtD)
* [Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale](https://amzn.to/4ij2y7F)
* [Java 8 in Action: Lambdas, Streams, and functional-style programming](https://amzn.to/3QCmGXs)
* [Java Design Patterns: A Hands-On Experience with Real-World Examples](https://amzn.to/3HWNf4U)
* [Programming Pig: Dataflow Scripting with Hadoop](https://amzn.to/4cAU36K)
* [What is MapReduce (IBM)](https://www.ibm.com/think/topics/mapreduce)
* [Why MapReduce is not dead (Codemotion)](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)

Binary file not shown. Size after change: 88 KiB