diff --git a/map-reduce/README.md b/map-reduce/README.md
index d73f513fc..d832616ba 100644
--- a/map-reduce/README.md
+++ b/map-reduce/README.md
@@ -2,20 +2,26 @@
 title: "MapReduce Pattern in Java"
 shortTitle: MapReduce
 description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
-category: Structural
+category: Functional
 language: en
 tag:
-  - Delegation
+  - Concurrency
+  - Data processing
+  - Data transformation
+  - Functional decomposition
+  - Immutable
+  - Multithreading
+  - Scalability
 ---
 
 ## Also known as
 
-* Split-Apply-Combine Strategy
-* Scatter-Gather Pattern
+* Map-Reduce
+* Divide and Conquer for Data Processing
 
 ## Intent of Map Reduce Design Pattern
 
-MapReduce aims to process and generate large datasets with a parallel, distributed algorithm on a cluster. It divides the workload into two main phases: Map and Reduce, allowing for efficient parallel processing of data.
+To efficiently process large-scale datasets by dividing computation into two phases: map and reduce, which can be executed in parallel and distributed across multiple nodes.
 
 ## Detailed Explanation of Map Reduce Pattern with Real-World Examples
 
@@ -29,19 +35,22 @@ In plain words
 
 Wikipedia says
 
-> "MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster".
-MapReduce consists of two main steps:
-The "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
-The "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
-This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
+> MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce consists of two main steps:
+The Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. The Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
+
+Flowchart
+
+![MapReduce flowchart](./etc/mapreduce-flowchart.png)
 
 ## Programmatic Example of Map Reduce in Java
 
 ### 1. Map Phase (Splitting & Processing Data)
 
-* The Mapper takes an input string, splits it into words, and counts occurrences.
-* Output: A map {word → count} for each input line.
+* Each input string is split into words, normalized, and counted.
+* Output: A map `{word → count}` for each input string.
+
 #### `Mapper.java`
+
 ```java
 public class Mapper {
     public static Map<String, Integer> map(String input) {
@@ -57,13 +66,17 @@ public class Mapper {
     }
 }
 ```
+
 Example Input: ```"Hello world hello"```
 Output: ```{hello=2, world=1}```
 
-### 2. Shuffle Phase (Grouping Data by Key)
+### 2. Shuffle Phase – Grouping Words Across Inputs
+
+* Takes results from all mappers and groups values by word.
+* Output: A map `{word → list of counts}`.
 
-* The Shuffler collects key-value pairs from multiple mappers and groups values by key.
 #### `Shuffler.java`
+
 ```java
 public class Shuffler {
     public static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> mapped) {
@@ -78,14 +91,18 @@ public class Shuffler {
     }
 }
 ```
+
 Example Input:
+
 ```
 [
     {"hello": 2, "world": 1},
     {"hello": 1, "java": 1}
 ]
 ```
+
 Output:
+
 ```
 {
     "hello": [2, 1],
@@ -94,10 +111,13 @@ Output:
 }
 ```
 
-### 3. Reduce Phase (Aggregating Results)
+### 3. Reduce Phase – Aggregating Counts
+
+* Sums the list of counts for each word.
+* Output: A sorted list of word counts in descending order.
 
-* The Reducer sums up occurrences of each word.
 #### `Reducer.java`
+
 ```java
 public class Reducer {
     public static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
@@ -112,7 +132,9 @@ public class Reducer {
     }
 }
 ```
+
 Example Input:
+
 ```
 {
     "hello": [2, 1],
@@ -120,7 +142,9 @@ Example Input:
     "java": [1]
 }
 ```
+
 Output:
+
 ```
 [
     {"hello": 3},
@@ -129,10 +153,12 @@ Output:
 ]
 ```
 
-### 4. Running the Full MapReduce Process
+### 4. MapReduce Coordinator – Running the Whole Pipeline
+
+* Coordinates map, shuffle, and reduce phases.
 
-* The MapReduce class coordinates the three steps.
 #### `MapReduce.java`
+
 ```java
 public class MapReduce {
     public static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
@@ -148,10 +174,12 @@ public class MapReduce {
 }
 ```
 
-### 4. Main Execution (Calling MapReduce)
+### 5. Main Execution – Example Usage
+
+* Runs the MapReduce process and prints results.
 
-* The Main class executes the MapReduce pipeline and prints the final word count.
 #### `Main.java`
+
 ```java
     public static void main(String[] args) {
         List<String> inputs = Arrays.asList(
@@ -168,6 +196,7 @@ public class MapReduce {
 ```
 
 Output:
+
 ```
 hello: 4
 world: 2
@@ -183,10 +212,11 @@ fun: 1
 ## When to Use the Map Reduce Pattern in Java
 
 Use MapReduce when:
-* Processing large datasets that don't fit into a single machine's memory
-* Performing computations that can be parallelized
-* Dealing with fault-tolerant and distributed computing scenarios
-* Analyzing log files, web crawl data, or scientific data
+
+* Processing large datasets that can be broken into independent chunks.
+* Data operations can be naturally divided into map (transformation) and reduce (aggregation) phases.
+* Horizontal scalability and parallelization are essential, especially in distributed or big data environments.
+* Leveraging Java-based distributed computing platforms like Hadoop or Spark.
 
 ## Map Reduce Pattern Java Tutorials
 
@@ -197,32 +227,39 @@ Use MapReduce when:
 
 Benefits:
 
-* Scalability: Can process vast amounts of data across multiple machines
-* Fault-tolerance: Handles machine failures gracefully
-* Simplicity: Abstracts complex distributed computing details
+* Enables massive scalability by distributing processing across nodes.
+* Encourages a functional style, promoting immutability and stateless operations.
+* Simplifies complex data workflows by separating transformation (map) from aggregation (reduce).
+* Fault-tolerant due to isolated, recoverable processing tasks.
 
 Trade-offs:
 
-* Overhead: Not efficient for small datasets due to setup and coordination costs
-* Limited flexibility: Not suitable for all types of computations or algorithms
-* Latency: Batch-oriented nature may not be suitable for real-time processing needs
+* Requires a suitable problem structure; not all tasks fit the map/reduce paradigm.
+* Data shuffling between map and reduce phases can be performance-intensive.
+* Higher complexity in debugging and optimizing distributed jobs.
+* Intermediate I/O can become a bottleneck in large-scale operations.
 
 ## Real-World Applications of Map Reduce Pattern in Java
 
-* Google's original implementation for indexing web pages
-* Hadoop MapReduce for big data processing
-* Log analysis in large-scale systems
-* Genomic sequence analysis in bioinformatics
+* Hadoop MapReduce: Java-based framework for distributed data processing using MapReduce.
+* Apache Spark: Utilizes similar map and reduce transformations in its RDD and Dataset APIs.
+* Elasticsearch: Uses MapReduce-style aggregation pipelines for querying distributed data.
+* Google Bigtable: Underlying storage engine influenced by MapReduce principles.
+* MongoDB Aggregation Framework: Conceptually applies MapReduce in its data pipelines.
 
 ## Related Java Design Patterns
 
-* Chaining Pattern
-* Master-Worker Pattern
-* Pipeline Pattern
+* [Master-Worker](https://java-design-patterns.com/patterns/master-worker/): Similar distribution of tasks among workers, with a master coordinating job execution.
+* [Pipeline](https://java-design-patterns.com/patterns/pipeline/): Can be used to chain multiple MapReduce operations into staged transformations.
+* [Iterator](https://java-design-patterns.com/patterns/iterator/): Often used under the hood to process input streams lazily in map and reduce steps.
 
 ## References and Credits
 
-* [What is MapReduce](https://www.ibm.com/think/topics/mapreduce)
-* [Wy MapReduce is not dead](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)
-* [Scalabe Distributed Data Processing Solutions](https://tcpp.cs.gsu.edu/curriculum/?q=system%2Ffiles%2Fch07.pdf)
+* [Big Data: Principles and Paradigms](https://amzn.to/3RJIGPZ)
+* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://amzn.to/3E6VhtD)
+* [Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale](https://amzn.to/4ij2y7F)
+* [Java 8 in Action: Lambdas, Streams, and functional-style programming](https://amzn.to/3QCmGXs)
 * [Java Design Patterns: A Hands-On Experience with Real-World Examples](https://amzn.to/3HWNf4U)
+* [Programming Pig: Dataflow Scripting with Hadoop](https://amzn.to/4cAU36K)
+* [What is MapReduce (IBM)](https://www.ibm.com/think/topics/mapreduce)
+* [Why MapReduce is not dead (Codemotion)](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)
diff --git a/map-reduce/etc/mapreduce-flowchart.png b/map-reduce/etc/mapreduce-flowchart.png
new file mode 100644
index 000000000..dade8f7cc
Binary files /dev/null and b/map-reduce/etc/mapreduce-flowchart.png differ
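
The hunks in this patch show only the signatures of `Mapper`, `Shuffler`, and `Reducer`; the method bodies fall outside the diff context. As a rough illustration of the three phases the README describes, here is a minimal single-file sketch. The class name `WordCountSketch`, the method names, and every method body are assumptions made for illustration, not the repository's code; in particular, "normalized" is assumed to mean lower-casing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the word-count pipeline; not the classes from the patch.
class WordCountSketch {

  // Map phase: count the words of a single input string.
  // Assumption: normalization = lower-case, split on runs of non-word characters.
  static Map<String, Integer> map(String input) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String word : input.toLowerCase().split("\\W+")) {
      if (!word.isEmpty()) {
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Shuffle phase: group the per-input counts by word.
  static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapped) {
    Map<String, List<Integer>> grouped = new LinkedHashMap<>();
    for (Map<String, Integer> partial : mapped) {
      partial.forEach((word, count) ->
          grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
    }
    return grouped;
  }

  // Reduce phase: sum each word's counts, then sort by total in descending order.
  static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
    List<Map.Entry<String, Integer>> totals = new ArrayList<>();
    grouped.forEach((word, counts) ->
        totals.add(Map.entry(word, counts.stream().mapToInt(Integer::intValue).sum())));
    // List.sort is stable, so words with equal totals keep first-seen order.
    totals.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    return totals;
  }

  // Coordinator: run map over every input, then shuffle, then reduce.
  static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
    List<Map<String, Integer>> mapped = new ArrayList<>();
    for (String input : inputs) {
      mapped.add(map(input));
    }
    return reduce(shuffle(mapped));
  }

  public static void main(String[] args) {
    List<String> inputs = Arrays.asList("Hello world hello", "hello java");
    for (Map.Entry<String, Integer> entry : mapReduce(inputs)) {
      System.out.println(entry.getKey() + ": " + entry.getValue());
    }
  }
}
```

With the two sample strings above, which match the shuffle example in the README, `main` prints `hello: 3`, `world: 1`, and `java: 1`, one per line. The sketch runs every phase sequentially in one thread; the map calls are independent of each other, which is what would let a real implementation distribute them.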