How to Read New Map Reduce Output
If you have gone through our previous post, then you have seen the components that make up a basic MapReduce job; now we can see how everything works together at a higher level:
MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all our nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in our cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them. Therefore, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.
When the mapping stage has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange data with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.
A Closer Look
The previous figure described the high-level view of Hadoop MapReduce. From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail.
The next figure shows the pipeline with more of its mechanics exposed. While only two nodes are depicted, the same pipeline can be replicated across a very large number of nodes. The next several paragraphs describe each of the stages of a MapReduce program more precisely.
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS. The format of these files is arbitrary; while line-based log files can be used, we could also use a binary format, multi-line input records, or something else entirely. It is typical for these input files to be very large -- tens of gigabytes or more.
InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
* Selects the files or other objects that should be used for input
* Defines the InputSplits that break a file into tasks
* Provides a factory for RecordReader objects that read the file
Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class. When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each. You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job. The standard InputFormats are described below.
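To make this concrete, here is a minimal sketch of the input side of a job driver using the JobConf API; the class name and input path are placeholders, not values from this tutorial.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Minimal sketch: pointing FileInputFormat at an input directory and choosing
// an InputFormat for the job. The path and class name are illustrative only.
public class InputSetup {
  public static JobConf configureInput() {
    JobConf conf = new JobConf(InputSetup.class);
    // Every file under this directory becomes input to the job.
    FileInputFormat.setInputPaths(conf, new Path("/user/someone/wordcount/input"));
    // TextInputFormat is already the default; it is set explicitly here for clarity.
    conf.setInputFormat(TextInputFormat.class);
    return conf;
  }
}
```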
The default InputFormat is the TextInputFormat. This treats each line of each input file as a separate record, and performs no parsing. This is useful for unformatted data or line-based records like log files. A more interesting input format is the KeyValueInputFormat. This format also treats each line of input as a separate record. While the TextInputFormat treats the entire line as the value, the KeyValueInputFormat breaks the line itself into the key and value by searching for a tab character. This is particularly useful for reading the output of one MapReduce job as the input to another, as the default OutputFormat (described in more detail below) formats its results in this manner. Finally, the SequenceFileInputFormat reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
InputSplits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits. Custom input formats are described in Module 5.
The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
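As a rough sketch, a single job can override the split size through its JobConf; the 128 MB value below is just an example, and the per-node task cap is a TaskTracker-side setting rather than something you would normally set here.

```java
import org.apache.hadoop.mapred.JobConf;

// Sketch: overriding the minimum split size for one job. The 128 MB figure is an
// arbitrary example. mapred.tasktracker.map.tasks.maximum, by contrast, is read by
// each TaskTracker from hadoop-site.xml and is not usually set per job.
public class SplitTuning {
  public static void tune(JobConf conf) {
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);  // 128 MB splits
  }
}
```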
RecordReader: The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
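Conceptually, the framework drives the Mapper with a loop like the sketch below; this illustrates the RecordReader contract and is not the actual Hadoop source.

```java
import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Conceptual sketch of how a RecordReader feeds the Mapper: next() is called
// until the InputSplit is exhausted, and every record triggers one map() call.
public class MapLoopSketch {
  public static <K1, V1, K2, V2> void run(RecordReader<K1, V1> reader,
                                          Mapper<K1, V1, K2, V2> mapper,
                                          OutputCollector<K2, V2> output,
                                          Reporter reporter) throws IOException {
    K1 key = reader.createKey();
    V1 value = reader.createValue();
    while (reader.next(key, value)) {   // false once the split has been consumed
      mapper.map(key, value, output, reporter);
    }
    reader.close();
  }
}
```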
Mapper: The Mapper performs the interesting user-defined work of the first phase of the MapReduce program. Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the Reducers. A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit) that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine. The map() method receives two parameters in addition to the key and the value (a short Mapper sketch follows the list below):
* The OutputCollector object has a method named collect() which will forward a (key, value) pair to the reduce phase of the job.
* The Reporter object provides information about the current task; its getInputSplit() method will return an object describing the current InputSplit. It also allows the map task to provide additional information about its progress to the rest of the system. The setStatus() method allows you to emit a status message back to the user. The incrCounter() method allows you to increment shared performance counters. You may define as many arbitrary counters as you wish. Each mapper can increment the counters, and the JobTracker will collect the increments made by the different processes and aggregate them for later retrieval when the job ends.
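Putting these pieces together, a Word Count mapper for the org.apache.hadoop.mapred API might look roughly like this; it is a sketch, not the tutorial's exact listing, and the counter at the end exists only to illustrate incrCounter().

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a Word Count mapper: emits (word, 1) for every token in the line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // With TextInputFormat: key = byte offset of the line, value = the line itself.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);                          // forward (word, 1) to the reduce phase
    }
    reporter.incrCounter("WordCount", "input lines", 1);  // example user-defined counter
  }
}
```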
Partitioning & Shuffle: After the kickoff map tasks accept completed, the nodes may still exist performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map chore may emit (key, value) pairs to whatsoever partition; all values for the same key are always reduced together regardless of which mapper is its origin. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate information. The Partitioner class determines which partition a given (cardinal, value) pair will go to. The default partitioner computes a hash value for the key and assigns the sectionalisation based on this event. Custom partitioners are described in more detail in Module 5.
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
Reduce: A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives as parameters OutputCollector and Reporter objects; they are used in the same manner as in the map() method.
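A Word Count reducer for this API might look roughly like the following sketch, summing the values handed to it by the iterator:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a Word Count reducer: reduce() is called once per key with an
// iterator over all of that key's values; here we simply sum them.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();            // values arrive in no particular order
    }
    output.collect(key, new IntWritable(sum)); // written out via the job's OutputFormat
  }
}
```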
OutputFormat: The (key, value) pairs provided to this OutputCollector are then written to output files. The way they are written is governed by the OutputFormat. The OutputFormat functions much like the InputFormat class described earlier. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS; they all inherit from a common FileOutputFormat. Each Reducer writes a separate file in a common output directory. These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set by the FileOutputFormat.setOutputPath() method. You can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job.
Hadoop provides some OutputFormat instances to write to files. The basic (default) instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file. This can be easily re-read by a later MapReduce task using the KeyValueInputFormat class, and is also human-readable. A better intermediate format for use between MapReduce jobs is the SequenceFileOutputFormat, which rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next Mapper in the same manner as it was emitted by the previous Reducer. The NullOutputFormat generates no output files and disregards any (key, value) pairs passed to it by the OutputCollector. This is useful if you are explicitly writing your own output files in the reduce() method, and do not want additional empty output files generated by the Hadoop framework.
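The output side of the driver might be configured roughly as follows; the output path is a placeholder, and TextOutputFormat is set explicitly even though it is the default.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

// Sketch of the output side of a job driver: choose where the part-nnnnn files
// go and which OutputFormat writes them. The path is illustrative only.
public class OutputSetup {
  public static void configureOutput(JobConf conf) {
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // Each reducer writes its part-nnnnn file under this directory.
    FileOutputFormat.setOutputPath(conf, new Path("/user/someone/wordcount/output"));
    // TextOutputFormat is the default; SequenceFileOutputFormat suits job-to-job chaining.
    conf.setOutputFormat(TextOutputFormat.class);
  }
}
```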
RecordWriter: Much like how the InputFormat actually reads individual records through the RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers are then left in HDFS for your use, either by another MapReduce job, by a separate program, or for human inspection.
Source: http://mapreduce-tutorial.blogspot.com/2011/04/mapreduce-data-flow.html