This tab covers MapReduce settings. Here you can set properties for the JobTracker and TaskTrackers, as well as some general and advanced properties. Click the name of a group to expand or collapse its display.
Table 3.6. MapReduce Settings: JobTracker
Name | Notes |
---|---|
JobTracker host | This value is prepopulated based on your choices on previous screens |
JobTracker new generation size | Default size of Java new generation for JobTracker (Java option -XX:NewSize) |
JobTracker maximum new generation size | Maximum size of Java new generation for JobTracker (Java option -XX:MaxNewSize) |
JobTracker maximum Java heap size | Maximum Java heap size for JobTracker in MB (Java option -Xmx) |
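
To illustrate how the three JobTracker sizing fields relate to the JVM flags named in the table, here is a minimal sketch that assembles an option string from them. The function name `jobtracker_jvm_opts` and the sample sizes are hypothetical, and the sketch assumes all three values are given in MB (the table states this only for the heap size). It is not how the installer itself builds the command line.

```python
def jobtracker_jvm_opts(new_size_mb, max_new_size_mb, max_heap_mb):
    """Build a JVM option string from the three sizing fields (assumed to be in MB)."""
    # The initial new generation cannot be larger than its own maximum.
    if new_size_mb > max_new_size_mb:
        raise ValueError("new generation size exceeds its maximum")
    # The new generation is a part of the heap, so it must be smaller than -Xmx.
    if max_new_size_mb >= max_heap_mb:
        raise ValueError("new generation must be smaller than the heap")
    return "-XX:NewSize={}m -XX:MaxNewSize={}m -Xmx{}m".format(
        new_size_mb, max_new_size_mb, max_heap_mb
    )


# Hypothetical sample values, chosen only for illustration.
print(jobtracker_jvm_opts(200, 200, 1024))
```

The two checks only guard against combinations the JVM would reject or silently adjust; suitable sizes depend on the cluster.
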
Table 3.7. MapReduce Settings: TaskTracker
Name | Notes |
---|---|
TaskTracker hosts | This value is prepopulated based on your choices on previous screens |
MapReduce local directories | Directories for MapReduce to store intermediate data files |
Number of Map slots per node | Number of slots available on a TaskTracker for Map tasks that run simultaneously |
Number of Reduce slots per node | Number of slots available on a TaskTracker for Reduce tasks that run simultaneously |
Java options for MapReduce tasks | Java options for the TaskTracker child processes |
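
The per-node slot settings determine how many tasks the whole cluster can run at once. The following sketch shows that arithmetic, assuming every TaskTracker host uses the same slot counts. The function name `cluster_slots` and the host names are hypothetical.

```python
def cluster_slots(tasktracker_hosts, map_slots_per_node, reduce_slots_per_node):
    """Return the total Map and Reduce slots across all TaskTracker hosts."""
    node_count = len(tasktracker_hosts)
    return {
        "map": node_count * map_slots_per_node,
        "reduce": node_count * reduce_slots_per_node,
    }


# Hypothetical three-node cluster with 4 Map slots and 2 Reduce slots per node.
hosts = ["tt1.example.com", "tt2.example.com", "tt3.example.com"]
print(cluster_slots(hosts, 4, 2))  # {'map': 12, 'reduce': 6}
```
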
Table 3.8. MapReduce Settings: General
Name | Notes |
---|---|
MapReduce Capacity Scheduler | The scheduler to use for scheduling MapReduce jobs |
Cluster's Map slot size (virtual memory) | The virtual memory size of a single Map slot in the MapReduce framework. Use -1 for no limit |
Cluster's Reduce slot size (virtual memory) | The virtual memory size of a single Reduce slot in the MapReduce framework. Use -1 for no limit |
Upper limit on virtual memory for single Map task | Upper limit on virtual memory for a single Map task. Use -1 for no limit. |
Upper limit on virtual memory for single Reduce task | Upper limit on virtual memory for a single Reduce task. Use -1 for no limit. |
Default virtual memory for a job’s map-task | Default virtual memory for a single Map task of a job. Use -1 for no limit. |
Default virtual memory for a job's reduce-task | Default virtual memory for a single Reduce task of a job. Use -1 for no limit. |
Map-side sort buffer memory | The total amount of Map-side buffer memory to use while sorting files (Expert-only configuration) |
Limit on buffer | Percentage of sort buffer used for record collection (Expert-only configuration) |
Job log retention (hours) | The maximum time, in hours, that user logs are retained after job completion. |
Maximum number tasks for a Job | Maximum number of tasks for a single Job. Use -1 for no limit. |
LZO compression | Check to enable LZO compression in addition to Snappy |
Snappy compression | Check to enable Snappy compression |
Enable Job Diagnostics | Check to enable tools for tracing the path and troubleshooting the performance of MapReduce jobs |
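
The slot size, per-task upper limit, and per-job default in the table above interact. The sketch below shows one way to reason about them, assuming the Hadoop 1.x memory-based scheduling convention in which a task that requests more virtual memory than one slot provides occupies several slots. That convention, the MB units, and the name `slots_for_task` are assumptions of this example, not statements from the table.

```python
import math

NO_LIMIT = -1  # the table uses -1 to mean "no limit"


def slots_for_task(task_vmem_mb, slot_vmem_mb, max_task_vmem_mb=NO_LIMIT):
    """Estimate how many slots one task occupies under memory-based scheduling."""
    # With no limit on either value, memory does not affect slot usage.
    if NO_LIMIT in (task_vmem_mb, slot_vmem_mb):
        return 1
    # A request above the configured upper limit is treated as invalid here.
    if max_task_vmem_mb != NO_LIMIT and task_vmem_mb > max_task_vmem_mb:
        raise ValueError("task memory exceeds the configured upper limit")
    # Round up: a task slightly larger than one slot still needs two.
    return math.ceil(task_vmem_mb / slot_vmem_mb)


# Hypothetical values: 1536 MB slots, 4096 MB upper limit.
print(slots_for_task(1024, 1536, 4096))  # 1
print(slots_for_task(3072, 1536, 4096))  # 2
```
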
Table 3.9. MapReduce Settings: Advanced
Name | Notes |
---|---|
MapReduce system directories | MapReduce system directories |
io.sort.factor | |
mapred.tasktracker.tasks.sleeptime-before-sigkill | Normally this is the amount of time, in milliseconds, to wait before killing a process, and the recommended default is 5 seconds (a value of 5000). In this case it is used solely to signal tasks before killing them, and they are killed very quickly (0.25 second) to guarantee that no JVMs are left around for later jobs. |
mapred.job.tracker.handler.count | The number of server threads for the JobTracker. This should be roughly 4% of the number of TaskTracker nodes. |
mapreduce.cluster.administrators | ACL for MapReduce administrators by group |
mapred.reduce.parallel.copies | |
tasktracker.http.threads | |
mapred.map.tasks.speculative.execution | If true, then multiple instances of some map tasks may be executed in parallel |
mapred.reduce.tasks.speculative.execution | If true, then multiple instances of some reduce tasks may be executed in parallel |
mapred.reduce.slowstart.completed.maps | |
mapred.inmem.merge.threshold | The threshold, in terms of the number of files, for triggering the in-memory merge process. When the threshold is hit, we initiate the merge and spill to disk. A value of less than or equal to 0 means no threshold is set and ramfs's memory consumption triggers the merge. |
mapred.job.shuffle.merge.percent | The threshold, expressed as a percentage of the total memory allocated to storing in-memory map outputs (defined in mapred.job.shuffle.input.buffer.percent), for triggering the in-memory merge process. |
mapred.job.shuffle.input.buffer.percent | The percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle. |
mapred.output.compression.type | If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Acceptable values are: NONE, RECORD, or BLOCK. |
mapred.jobtracker.completeuserjobs.maximum | |
mapred.jobtracker.restart.recover | A value of true enables job recovery on restart; false starts afresh |
mapred.job.reduce.input.buffer.percent | The percentage of memory relative to the maximum heap size. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin. |
mapreduce.reduce.input.limit | The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that no limit is set. |
mapred.task.timeout | The number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string. |
jetty.connector | |
mapred.child.root.logger | |
mapred.max.tracker.blacklists | If a node is reported blacklisted by this number of successful jobs within the timeout window, it will be graylisted. |
mapred.healthChecker.interval | |
mapred.healthChecker.script.timeout | |
mapred.job.tracker.persist.jobstatus.active | Indicates if persistency of job status is active or not |
mapred.job.tracker.persist.jobstatus.hours | The number of hours job status information is persisted in DFS. Job status information is available after it drops off the memory queue and between JobTracker restarts. A value of zero means that job status information is not persisted at all. |
mapred.jobtracker.retirejob.check | |
mapred.jobtracker.retirejob.interval | |
mapred.job.tracker.history.completed.location | |
mapreduce.fileoutputcommitter.marksuccessfuljobs | |
mapred.job.reuse.jvm.num.tasks | The number of tasks to run per JVM. A value of -1 indicates no limit. |
hadoop.job.history.user.location | |
mapreduce.jobtracker.staging.root.dir | The path prefix for the staging directories. The next level is always the user's name. It is a path in the default file system. |
mapreduce.tasktracker.group | The group that the TaskTracker uses for accessing the task controller. The mapred user must be a member and users should not be members. |
mapreduce.jobtracker.split.metainfo.maxsize | If the size of the split metainfo file is larger than this value, the JobTracker will fail the job during initialization. |
mapred.jobtracker.blacklist.fault-timeout-window | Sliding window in minutes |
mapred.jobtracker.blacklist.fault-bucket-width | Bucket size in minutes. |
mapred.queue.names | Comma-separated list of queues configured for this JobTracker |
Custom MapReduce Configs | Use this text box to enter values for mapred-site.xml properties not exposed by the UI. Enter in "key=value" format, with a newline as a delimiter between pairs. |
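
The "key=value" format accepted by the Custom MapReduce Configs box can be checked before you paste it in. The sketch below parses newline-delimited pairs and renders them as `<property>` elements in the style used by mapred-site.xml. The function names and the sample values are hypothetical, and the UI's own parser may treat whitespace or duplicate keys differently.

```python
from xml.sax.saxutils import escape


def parse_custom_configs(text):
    """Parse newline-delimited key=value pairs into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between pairs
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError("expected key=value, got: {!r}".format(line))
        props[key.strip()] = value.strip()
    return props


def to_property_xml(props):
    """Render each pair as a <property> element, escaping XML special characters."""
    return "\n".join(
        "<property><name>{}</name><value>{}</value></property>".format(
            escape(name), escape(value)
        )
        for name, value in props.items()
    )


# Hypothetical sample input; the values are for illustration only.
custom = "mapred.task.timeout=600000\nio.sort.factor=100"
print(to_property_xml(parse_custom_configs(custom)))
```

Splitting on the first `=` keeps values such as Java option strings intact, and a line with no `=` raises an error instead of being silently dropped.
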