Paper Reading Note: MapReduce


Summary

  • Introduction: MapReduce is a programming model centered on two functions, Map and Reduce
  • Significance: Programs written in this model can be automatically parallelized across large clusters of commodity machines.
    • The MR architecture lets programmers with no experience in parallel computing or distributed systems make effective use of the resources of a distributed system

1 Introduction

  • Problem: Google faces computations over massive data sets. To finish tasks within an acceptable time, the computation has to be distributed across hundreds or thousands of hosts.
    • Issues such as how to parallelize the computation, how to distribute the data, and how to handle failures are all intertwined and require large amounts of code. To hide this complexity, the authors designed this abstraction.
  • Source of inspiration:
    • The map and reduce primitives present in many functional languages
    • Most of Google's computations involve operations of this form:
      • Apply a map operation to each logical record of the input to compute a set of intermediate key/value pairs
      • Apply a reduce operation to all the values that share the same key, in order to combine the derived data appropriately
  • Table of contents:
    • Section 2: Basic programming model, simple example
    • Section 3: MapReduce interface customized based on cluster computing environment
    • Section 4: Some useful improvements
    • Section 5: Performance test under different tasks
    • Section 6: Application of MapReduce in Google
    • Section 7: Related work and future development

2 Programming model

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce library express the computation as two functions: Map and Reduce.

  • Map: written by the user; takes one input pair and produces a set of intermediate key/value pairs.
    • The MapReduce library automatically groups all intermediate values associated with the same intermediate key and passes them to Reduce
  • Reduce: written by the user; accepts an intermediate key and the corresponding set of values, and merges them into a smaller set of values. Typically the Reduce function produces zero or one output value per invocation.
    • Reduce receives the intermediate values through an iterator, so it can handle lists of values too large to fit in memory

2.1 Example

map(String key, String value):
	// key:  document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, "1");

reduce(String key, Iterator values):
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));

The map function emits each word together with an associated count of occurrences (just "1" in this example). The reduce function sums all the counts emitted for a particular word.
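The pseudocode above can be turned into a runnable single-process sketch in Python (a toy: names like `word_count` are illustrative and not part of the MapReduce library, and the shuffle step is simulated with a dictionary):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: the list of counts emitted for it
    return (key, sum(values))

def word_count(documents):
    # Simulated shuffle: group all intermediate values by key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Apply reduce to each (key, list-of-values) group.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())
```

For example, `word_count({"d1": "a b a"})` returns `{"a": 2, "b": 1}`.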

In addition, the user fills in a mapreduce specification object with the names of the input and output files and optional tuning parameters, then invokes the MapReduce function, passing it this specification object. The user's code is linked against the MapReduce library (implemented in C++).

2.2 Types of key-value pairs

Although the example above uses strings for input and output, conceptually the map and reduce functions supplied by the user have associated types:

$map: (k_1, v_1) \rightarrow list(k_2, v_2)$

$reduce: (k_2, list(v_2)) \rightarrow list(v_2)$

Note that the input keys and values are drawn from a different domain than the output keys and values, while the intermediate keys and values are from the same domain as the output keys and values.

3 Implementation

MapReduce (hereinafter MR) can have many different implementations; the right choice depends on the specific environment.

3.1 Execution overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into M splits, which can then be processed in parallel by different machines. The Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions R and the partitioning function are specified by the user.

The figure below shows the overall flow of an execution. When the user program calls the MR function, the following sequence of actions occurs:

(Figure omitted: execution overview from the paper)

Master: assigns tasks to worker machines

  1. The MapReduce library splits the input files into M pieces (the user can control the split size via a parameter)
  2. A worker assigned a map task reads the corresponding input split, parses key/value pairs out of the input data, passes each pair to the map function, and buffers the intermediate key/value pairs it produces in memory
  3. Periodically, the buffered pairs are written to the map worker's local disk, partitioned into R regions. The locations of these regions are passed back to the master, which forwards them to the reduce workers.
  4. When a reduce worker is notified of these locations by the master, it uses RPC (Remote Procedure Call) to read the buffered data from the map workers' local disks. After it has read all of its intermediate data, it sorts the records by intermediate key so that all occurrences of the same key are grouped together. If the total amount of intermediate data is too large to fit in memory, an external sort is used.
  5. The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and the corresponding set of intermediate values to the reduce function. The output of the reduce function is appended to the final output file for this partition (stored in the global file system)
  6. When all map and reduce tasks have completed, the MapReduce call ends and returns to the user code

After successful completion, the output of the MapReduce execution is available in R output files (one per reduce task, with file names specified by the user). Users generally do not need to combine these R files into one: they are typically used as input to another MapReduce call, or consumed by another distributed application that can handle input partitioned into multiple files.
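The six steps above can be sketched as a toy in-process simulation (illustrative names only; M is implied by the number of splits, the partition function is hash(key) mod R, and each partition is sorted by key before reduce, as in step 4):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R):
    # Steps 2-3: each "map worker" processes one input split and
    # buckets its intermediate pairs into R partitions.
    partitions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in split:
            for k, v in map_fn(key, value):
                partitions[hash(k) % R][k].append(v)
    # Steps 4-5: each "reduce worker" sorts its partition by key and
    # applies reduce_fn; one output "file" (a list) per partition.
    return [[reduce_fn(k, part[k]) for k in sorted(part)]
            for part in partitions]
```

Running word-count map/reduce functions over two splits with R = 2 yields two sorted output lists, mirroring the R output files of a real run.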

3.2 Host data structure

The master maintains several data structures. For each map task and reduce task it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).

The master is the conduit through which the locations of intermediate files are propagated from map tasks to reduce tasks. For each completed map task, the master stores the locations and sizes of the R intermediate file regions it produced. It receives updates to this information as map tasks complete, and pushes the information incrementally to workers with in-progress reduce tasks (each map task produces R regions, one destined for each reduce worker).

3.3 Fault tolerance

Worker failure

The master pings every worker periodically. If no response is received within a timeout, the worker is marked as failed. Any map tasks it has completed are rolled back to the idle state and become eligible for rescheduling; any map or reduce task it was executing is likewise reset to idle and reassigned to another worker.

  • Completed Map tasks must be re-executed because their output is stored on the failed machine's local disk and is therefore inaccessible (the output of a completed Reduce task is stored in the global file system, so it does not need to be redone)
  • When a map task is first executed by worker A and later re-executed by worker B, all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not yet read the data from worker A will read it from worker B instead

Master failure

The master could periodically write checkpoints of the data structures described above; if the master task died, a new copy could be restarted from the last checkpointed state. However, given that there is only a single master, failure is unlikely, so in this implementation the MapReduce computation is simply aborted if the master fails. Clients can detect this condition and manually retry the MapReduce operation if they need to.

Mechanism for handling failure

When the master receives a completion message for a map task that is not yet marked completed, it records the names and locations of the R files in its data structures.

A completion message for an already-completed map task is ignored (this can happen when a map task is executed on more than one machine).

If the same reduce task is executed on multiple machines, the output file names would conflict; the underlying file system guarantees that the final state contains the output of exactly one execution of the reduce task.

3.4 Locality

In the authors' computing environment, network bandwidth is a relatively scarce resource, so they conserve it by exploiting the fact that the input data is already stored on the local disks of the cluster machines.

Several copies of the input data (typically three) are distributed across the local disks of the cluster, so each machine holds some fraction of the input. When the master assigns a map task, it tries to schedule it on a machine that contains a replica of the corresponding input data; failing that, it tries a machine near such a replica.

As a result, when a large MR operation runs on a sufficiently large cluster, most input data is read locally and relatively little network bandwidth is consumed.

3.5 Task granularity

The map phase is subdivided into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines; this improves dynamic load balancing and speeds up recovery when a worker fails.

In practice, there are limits on how large M and R can be: the master must make O(M+R) scheduling decisions and keep O(M∗R) states in memory. (The memory footprint is small, however: the O(M∗R) piece of state only needs roughly one byte per map/reduce task pair.)

In addition, R is often constrained by users, because the output of each reduce task ends up in a separate output file. In practice, M is chosen so that each individual task has roughly 16 MB-64 MB of input data (which makes the locality optimization most effective), and R is set to a small multiple of the number of worker machines expected to be used.

3.6 Standby tasks

Some machines may compute extremely slowly because of failing hardware, software bugs, and so on, lengthening the running time of the entire MR operation.

  • Solution:
    • When the MR operation is close to completion, the master schedules backup executions of the remaining in-progress tasks; a task is marked completed whenever either the primary or the backup execution finishes. This typically increases computational resource usage only slightly, but can significantly reduce the time to complete large MR operations.

4 Refinements

4.1 Partition function

A hash of the key is generally used, but depending on the situation a special-purpose partitioning function can better serve the task.
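As a sketch: the default scheme hashes the whole key, while a custom partition can hash only part of it. The paper's example hashes the hostname of a URL key so that all URLs from the same host end up in the same output file (the Python functions here are illustrative stand-ins, not the library's API):

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default scheme: hash(key) mod R.
    return hash(key) % R

def hostname_partition(url_key, R):
    # Paper's example: hash only the hostname, so all URLs from the
    # same host land in the same partition / output file.
    return hash(urlparse(url_key).hostname) % R
```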

4.2 Ensure order

Within a given partition, the intermediate key/value pairs are guaranteed to be processed in increasing key order. This ordering makes it easy to generate a sorted output file for each partition, which is very useful for applications that need efficient random-access lookups by key in the output files, and convenient for data sets that need to be sorted.
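One payoff of the sorted per-partition output is cheap random access by key. A sketch, assuming an output partition is available as a key-sorted list of pairs:

```python
from bisect import bisect_left

def lookup(sorted_pairs, key):
    # Because each output partition is sorted by key, a record can be
    # found by binary search instead of a full scan.
    keys = [k for k, _ in sorted_pairs]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sorted_pairs[i][1]
    return None
```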

4.3 Combiner

A combiner function partially merges the map output locally, before the result is sent over the network.

The only difference between a combiner and a reduce function is how the MR library handles the function's output: the output of a reduce function is written to the final output file, while the output of a combiner function is written to an intermediate file that will later be sent to a reduce task.

Partial merging of intermediate results like this can significantly speed up certain classes of MR operations.
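For word count, the combiner can apply the same merging logic as reduce to the map worker's local buffer before anything crosses the network. A sketch (illustrative names, not the library's API):

```python
from collections import defaultdict

def combine(local_pairs):
    # Partially merge this map worker's local (word, count) pairs
    # before they are written to disk and shipped to reduce workers.
    merged = defaultdict(int)
    for word, count in local_pairs:
        merged[word] += count
    return list(merged.items())
```

For example, `combine` shrinks many repeated `("the", 1)` pairs into a single `("the", n)` pair bound for the network.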

4.4 Type expansion of input and output

Although a few predefined input types meet the needs of most MapReduce users, anyone can add support for a new input type by providing an implementation of a simple Reader interface.

Reader does not have to read data from files. For example, we can easily implement a Reader that reads records from a database or a Reader that reads data from a data structure in memory.

Similarly, we provide some predefined output data types, through which data in different formats can be generated. The user adds a new output type in a manner similar to adding a new input data type.
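A Reader only needs to produce a stream of key/value records from its input source, which need not be a file. A sketch of a reader over an in-memory dict (the real library defines a C++ interface; this Python class is only illustrative):

```python
class InMemoryReader:
    # A "Reader" over an in-memory data structure: yields one
    # (key, value) record at a time, like a file-backed reader would.
    def __init__(self, table):
        self._items = iter(sorted(table.items()))

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._items)
```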

4.5 Side effects (a bit confusing)

In some cases, users of MapReduce find it convenient to produce auxiliary files as additional outputs from their Map and/or Reduce operators. We rely on the application writer to make such "side effects" atomic and idempotent (idempotent: an operation that produces the same result no matter how many times it is applied). Typically, the application first writes the output to a temporary file, and once all the data has been output, renames it using the file system's atomic rename operation.

If a single task produces multiple output files, no atomic operation like two-phase commit is provided to support this case. Therefore, tasks that produce multiple output files with cross-file consistency requirements must be deterministic. In practice, this restriction has never been a problem.

4.6 Skip damaged records

In many cases, it is acceptable to ignore a few problematic records, for example when performing statistical analysis on a huge data set. An optional execution mode is provided in which, to guarantee forward progress, the MapReduce library detects which records cause deterministic crashes and skips these records without processing them.

Each worker process installs signal handlers that catch segmentation violations and bus errors. Before invoking a user Map or Reduce operation, the MapReduce library stores the sequence number of the record in a global variable. If the user code generates a signal, the signal handler sends a "last gasp" UDP packet containing the sequence number of the record being processed to the master. When the master has seen more than one failure on a particular record, it marks the record as one to be skipped, and the record is skipped the next time the corresponding Map or Reduce task is re-executed.

4.7 Local execution

The authors developed a local implementation of the MapReduce library so that developers can debug and test on their own machine (debugging a program spread across thousands of computers is difficult).

4.8 Status information

The authors developed a set of status information pages that show:

  • The progress of the computation: how many tasks have completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, and so on.
  • Links to the stderr and stdout files of each task. Users can use this data to predict how long the computation will take and whether additional computing resources are needed; these pages also help figure out when the computation is running slower than expected.
  • In addition, the top-level status page shows which workers have failed, along with the Map and Reduce tasks they were running when they failed. This information is very helpful when debugging bugs in user code.

4.9 Counter

Counters count occurrences of various events, such as the number of key/value pairs processed and output. Users can use them to check, say, the total number of words processed or the number of German documents indexed.
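The paper's counter facility (user code obtains a named counter and increments it; workers propagate counts to the master, which aggregates them) can be mimicked in a single process with a shared Counter. This is a toy stand-in, not the real propagation mechanism:

```python
from collections import Counter

counters = Counter()  # stands in for the master's aggregated counters

def map_fn(key, value):
    # Alongside the normal output, bump a named counter, mirroring
    # the paper's GetCounter("uppercase")->Increment() idiom.
    for word in value.split():
        if word.isupper():
            counters["uppercase-words"] += 1
        yield (word, 1)
```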

5 Performance evaluation


Origin blog.csdn.net/Kaiser_syndrom/article/details/106222849