
An Analysis of the Mapper Input Split (InputSplit) Count

 

What is an InputSplit

An InputSplit is a split: the smallest unit of input handed to a single map task in a MapReduce job. A split is a concept built on top of files; intuitively, it describes how a file is cut into segments, each carrying <file name, start offset, length, hosts holding the data>. The number of map tasks is determined by the total input size and the split size. In Hadoop 2.2 the default HDFS block size is 128 MB: a file larger than 128 MB is broken into several blocks, while a file smaller than 128 MB is stored as a single block of its actual size.
This also explains why HDFS favors a moderate number of large files over a huge number of small files:
(1) A huge number of small files forces the NameNode to keep far more metadata to record block locations (if a 64 MB block of a large file needs about 1 KB of metadata, a 1 MB block of a small file still needs about 1 KB).
(2) HDFS is designed for streaming access to large files; reading a large number of small files means constantly hopping from one DataNode to the next, which severely hurts performance.
Also note that an HDFS block always belongs to exactly one file.
This raises another question: why is it best for the map task split size to match the HDFS block size?
Two considerations answer it:
(1) Number of map tasks = total input size (totalSize) / split size (splitSize). The larger the split size, the fewer map tasks there are, and the smaller the overhead of launching tasks and managing splits.
(2) Network transfer: if a split is so large that it spans several HDFS blocks, a single map task must pull data from multiple blocks over the network, so the HDFS block size is the natural upper bound for the split size.
Therefore, setting the split size equal to the HDFS block size is the best choice.
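
To make the arithmetic concrete, here is a minimal stand-alone sketch (the 300 MB input size is a made-up value) of how the map task count falls out when the split size equals the 128 MB block size:

    public class SplitCountSketch {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;   // HDFS default block size in Hadoop 2.2
            long fileSize  = 300L * 1024 * 1024;   // hypothetical 300 MB input file
            long splitSize = blockSize;            // recommended: split size == block size

            // Ignoring the 1.1 "slop" rule discussed later, the split count is a ceiling division.
            long numSplits = (fileSize + splitSize - 1) / splitSize;
            System.out.println("map tasks = " + numSplits); // 3 (128 MB + 128 MB + 44 MB)
        }
    }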

Workflow

FileInputFormat.getSplits returns the list of input splits for a job (and therefore determines the split count). This part walks through its workflow; the annotated source is pasted afterwards.

  1. listStatus() builds the input file list files, walking the subdirectories of the input directories and filtering out certain files such as _SUCCESS.
  2. Compute the total size of all files, totalSize.
  3. goalSize = totalSize / numMaps, where numMaps is the user-requested number of map tasks.
  4. Take one file file from files.
  5. Compute splitSize = max(minSplitSize, min(file.blockSize, goalSize)), where minSplitSize is the smallest allowed split size (default 1 byte). (This is the classic mapred-API formula; the new-API source pasted below instead clamps the block size between the configured minimum and maximum split sizes.)
  6. Cut file into splits of splitSize bytes. While splitting, if the remaining data is no larger than splitSize * 1.1 and larger than 0 bytes, the whole remainder becomes a single split; this prevents a mapper from being handed a tiny amount of data (see the sketch after this list).
  7. Add file's splits to splits.
  8. Go back to step 4 until every file in files has been processed.
  9. Finish and return splits.
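
The 1.1 "slop" rule in step 6 deserves a concrete illustration. The following stand-alone sketch (file and split sizes are made-up values) mirrors the tail handling of the splitting loop shown in the source further down:

    public class SplitSlopSketch {
        // Mirrors FileInputFormat's tail handling: a remainder of at most 10% of a split
        // (SPLIT_SLOP = 1.1) is folded into the last split instead of becoming its own.
        static int countSplits(long length, long splitSize) {
            final double SPLIT_SLOP = 1.1;
            int splits = 0;
            long bytesRemaining = length;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                splits++;
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) {
                splits++; // the (possibly slightly enlarged) tail split
            }
            return splits;
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // 260 MB with 128 MB splits: the 132 MB tail is within 128 MB * 1.1 = 140.8 MB,
            // so it stays a single split -> 2 splits instead of 3.
            System.out.println(countSplits(260 * mb, 128 * mb)); // 2
            // 300 MB: after one split, 172 MB remains, which exceeds the slop -> 3 splits.
            System.out.println(countSplits(300 * mb, 128 * mb)); // 3
        }
    }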

 

 


  •  Source code (Hadoop 2.2.0)

     

    The workflow is not particularly complicated, so the walkthrough is done directly through comments in the code.

     

    The following methods are involved:

    1. public List<InputSplit> getSplits(JobContext job): called by the client to obtain all splits for the current job and send them to the JobTracker (in the new API, the ResourceManager), which then assigns map tasks to TaskTrackers according to where those splits are stored. The method uses listStatus (described next) and, for each returned file, pulls the locations of the blocks that make up the file (BlockLocation) from the FileSystem via getFileBlockLocations(file, start, len), whose behavior depends on the FileSystem implementation in use (FileSystem, LocalFileSystem, DistributedFileSystem).

    /** 
     * Generate the list of files and make them into FileSplits.
     * @param job the job context
     * @throws IOException
     */
    public List<InputSplit> getSplits(JobContext job) throws IOException {
      // Minimum split size: the larger of the format's floor and the job's configured minimum
      long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
      // Maximum split size as configured on the job
      long maxSize = getMaxSplitSize(job);

      // The generated splits are collected here
      List<InputSplit> splits = new ArrayList<InputSplit>();
      List<FileStatus> files = listStatus(job); // the job's configured input files
      // Split each file in turn
      for (FileStatus file: files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
          BlockLocation[] blkLocations;
          if (file instanceof LocatedFileStatus) {
            blkLocations = ((LocatedFileStatus) file).getBlockLocations();
          } else {
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            blkLocations = fs.getFileBlockLocations(file, 0, length);
          }
          if (isSplitable(job, path)) { // some formats (e.g. certain compressed files) cannot be split
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(blockSize, minSize, maxSize); // compute the split size

            long bytesRemaining = length;
            // Carve off splitSize-sized splits while more than SPLIT_SLOP (1.1) splits' worth of data remains
            while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
              int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining); // block holding the split's start offset
              splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                                       blkLocations[blkIndex].getHosts()));
              bytesRemaining -= splitSize;
            }
            // The tail, smaller than 1.1 split sizes, becomes one final split
            if (bytesRemaining != 0) {
              int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
              splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                         blkLocations[blkIndex].getHosts()));
            }
          } else { // not splitable
            // An unsplittable file becomes a single split covering the whole file
            splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
          }
        } else { // a zero-length file still gets one (empty) split
          //Create empty hosts array for zero length files
          splits.add(makeSplit(path, 0, length, new String[0]));
        }
      }
      // Save the number of input files for metrics/loadgen
      job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
      LOG.debug("Total # of splits: " + splits.size());
      return splits;
    }

     2. protected List<FileStatus> listStatus(JobContext job): first reads the "mapred.input.dir" setting to obtain every Path the user specified, then gets the FileSystem (most likely a DistributedFileSystem) from the JobContext's Configuration. Finally it applies the PathFilter the user may have set and asks the FileSystem for the FileStatus of every file those paths represent. Note: there is quite a lot going on inside this method. (A small usage sketch follows the listing below.)

    /** List input directories.
       * Subclasses may override to, e.g., select only files matching a regular
       * expression. 
       * 
       * @param job the job to list input paths for
       * @return array of FileStatus objects
       * @throws IOException if zero items.
       */
      protected List<FileStatus> listStatus(JobContext job
                                            ) throws IOException {
        List<FileStatus> result = new ArrayList<FileStatus>();
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
          throw new IOException("No input paths specified in job");
        }
         
        // get tokens for all the required FileSystems..
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, 
                                            job.getConfiguration());
     
        // Whether we need to recursive look into the directory structure
        boolean recursive = getInputDirRecursive(job);
         
        List<IOException> errors = new ArrayList<IOException>();
         
        // creates a MultiPathFilter with the hiddenFileFilter and the
        // user provided one (if any).
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
          filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters);
         
        for (int i=0; i < dirs.length; ++i) {
          Path p = dirs[i];
          FileSystem fs = p.getFileSystem(job.getConfiguration()); 
          FileStatus[] matches = fs.globStatus(p, inputFilter);
          if (matches == null) {
            errors.add(new IOException("Input path does not exist: " + p));
          } else if (matches.length == 0) {
            errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
          } else {
            for (FileStatus globStat: matches) {
              if (globStat.isDirectory()) {
                RemoteIterator<LocatedFileStatus> iter =
                    fs.listLocatedStatus(globStat.getPath());
                while (iter.hasNext()) {
                  LocatedFileStatus stat = iter.next();
                  if (inputFilter.accept(stat.getPath())) {
                    if (recursive && stat.isDirectory()) {
                      addInputPathRecursively(result, fs, stat.getPath(),
                          inputFilter);
                    } else {
                      result.add(stat);
                    }
                  }
                }
              } else {
                result.add(globStat);
              }
            }
          }
        }
     
        if (!errors.isEmpty()) {
          throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size()); 
        return result;
      }
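
     Because listStatus combines the built-in hidden-file filter with any user-supplied PathFilter, the input set can be narrowed per job. A small sketch of that wiring (the TxtOnlyFilter class is illustrative, and the recursive-listing property name is assumed from the getInputDirRecursive call above, so verify it against your Hadoop version):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class InputFilterSketch {
        /** Illustrative filter: only .txt files reach the job's input. */
        public static class TxtOnlyFilter implements PathFilter {
            @Override
            public boolean accept(Path path) {
                return path.getName().endsWith(".txt");
            }
        }

        public static void configure(Job job) {
            // Combined by listStatus with the built-in hiddenFileFilter (skips "_" and "." prefixes).
            FileInputFormat.setInputPathFilter(job, TxtOnlyFilter.class);
            // Descend into subdirectories; assumed property key matching getInputDirRecursive above.
            job.getConfiguration().setBoolean(
                "mapreduce.input.fileinputformat.input.dir.recursive", true);
        }
    }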
     3. protected long computeSplitSize(long blockSize, long minSize, long maxSize): computes the split size actually used for a file, namely the block size clamped between the job's configured minimum and maximum split sizes (a few worked numbers follow the listing).
    protected long computeSplitSize(long blockSize, long minSize,
                                      long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }
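
     Plugging a few illustrative numbers into this formula shows how the clamp behaves; the 10 KB case matches the test run further down:

    public class ComputeSplitSizeDemo {
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize)); // same formula as above
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;
            // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size equals the block size.
            System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));    // 134217728
            // maxSize lowered to 10 KB (as in the test below): many small splits.
            System.out.println(computeSplitSize(blockSize, 1L, 10240L));            // 10240
            // minSize raised above the block size: splits that span more than one block.
            System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE)); // 268435456
        }
    }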
     4. protected int getBlockIndex(BlockLocation[] blkLocations, long offset): the block information for the file is already in hand, so this simply finds the block that contains the given offset (an illustration follows the listing).
      protected int getBlockIndex(BlockLocation[] blkLocations, 
                                  long offset) {
        for (int i = 0 ; i < blkLocations.length; i++) {
          // is the offset inside this block?
          if ((blkLocations[i].getOffset() <= offset) &&
              (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
            return i;
          }
        }
        BlockLocation last = blkLocations[blkLocations.length -1];
        long fileLength = last.getOffset() + last.getLength() -1;
        throw new IllegalArgumentException("Offset " + offset + 
                                           " is outside of file (0.." +
                                           fileLength + ")");
      }
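
     For instance, with hypothetical 128 MB blocks at offsets 0, 128 MB and 256 MB, a split starting at offset 200 MB falls inside the second block, so that block's hosts are attached to the split. A stand-alone illustration of the same lookup (block layout and host names are made up):

    import org.apache.hadoop.fs.BlockLocation;

    public class BlockIndexDemo {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // A hypothetical 300 MB file stored as three blocks of 128, 128 and 44 MB.
            BlockLocation[] blocks = {
                new BlockLocation(new String[]{"dn1:50010"}, new String[]{"dn1"}, 0,        128 * mb),
                new BlockLocation(new String[]{"dn2:50010"}, new String[]{"dn2"}, 128 * mb, 128 * mb),
                new BlockLocation(new String[]{"dn3:50010"}, new String[]{"dn3"}, 256 * mb,  44 * mb)
            };
            long offset = 200 * mb; // a split starting at 200 MB
            for (int i = 0; i < blocks.length; i++) {
                if (blocks[i].getOffset() <= offset
                    && offset < blocks[i].getOffset() + blocks[i].getLength()) {
                    System.out.println("block index = " + i); // prints 1 -> hosts of the second block
                }
            }
        }
    }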
     5. The source analysis shows that the split size, and with it the number of map tasks, can be adjusted by setting "mapred.max.split.size" and "mapred.min.split.size" in the Configuration, or by setting the maximum/minimum input split size on the Job via FileInputFormat, as sketched below.
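
     Concretely, any of the following does the job; the new property names appear in the deprecation warnings of the run logs below, and the 10240-byte value is just the one used in the test (a sketch of the knobs, not the only way to set them):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfigSketch {
        public static void configure(Configuration conf, Job job) {
            // Old property names (still honored in 2.2, but logged as deprecated):
            conf.setLong("mapred.min.split.size", 10240L);
            conf.setLong("mapred.max.split.size", 10240L);
            // New property names:
            conf.setLong("mapreduce.input.fileinputformat.split.minsize", 10240L);
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 10240L);
            // Or the type-safe helpers on FileInputFormat:
            FileInputFormat.setMinInputSplitSize(job, 10240L);
            FileInputFormat.setMaxInputSplitSize(job, 10240L);
        }
    }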

     

Example

Below is a simple test.

Three input files: mapperstest1-01.txt, mapperstest01-02.txt, mapperstest1-03.txt


 
1. No minSplitSize or maxSplitSize configured, i.e. the defaults are used: splitSize = blockSize = 128 MB. Each of the three files is smaller than 128 MB, so each becomes exactly one split and the log below reports number of splits:3.
     Code
package com.ljf.mr.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MappersTest {

	public static class Mappers extends Mapper<LongWritable,Text,IntWritable,Text> {
		
		private static IntWritable intKey = new IntWritable(1);
		
		/**
		 * Text input is read by LineRecordReader by default:
		 * one key/value pair per line, where the key is the byte offset
		 * of the line and the value is the line's text.
		 * @throws InterruptedException 
		 * @throws IOException 
		 */
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		    context.write(intKey, value);
		}
	}
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		if(args.length < 2) {
			System.exit(2);
		}
		Configuration conf = new Configuration();
		
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

		@SuppressWarnings("deprecation")
		Job job = new Job(conf, "MappersTest");
		
		job.setJarByClass(MappersTest.class);
		job.setMapperClass(Mappers.class);
		
		job.setMapOutputKeyClass(IntWritable.class);
		job.setMapOutputValueClass(Text.class);
		
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		
		System.exit(job.waitForCompletion(true)? 0:1);
	}

}
     Run log
14/09/25 19:21:22 INFO input.FileInputFormat: Total input paths to process : 3
14/09/25 19:21:22 INFO mapreduce.JobSubmitter: number of splits:3
14/09/25 19:21:22 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/09/25 19:21:22 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/09/25 19:21:22 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/09/25 19:21:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1411613494956_0004
14/09/25 19:21:23 INFO impl.YarnClientImpl: Submitted application application_1411613494956_0004 to ResourceManager at /0.0.0.0:8032
14/09/25 19:21:23 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1411613494956_0004/
14/09/25 19:21:23 INFO mapreduce.Job: Running job: job_1411613494956_0004
14/09/25 19:21:50 INFO mapreduce.Job: Job job_1411613494956_0004 running in uber mode : false
14/09/25 19:21:50 INFO mapreduce.Job:  map 0% reduce 0%
14/09/25 19:22:29 INFO mapreduce.Job:  map 100% reduce 0%
14/09/25 19:22:29 INFO mapreduce.Job: Task Id : attempt_1411613494956_0004_m_000001_0, Status : FAILED
14/09/25 19:23:32 INFO mapreduce.Job:  map 100% reduce 0%
14/09/25 19:23:33 INFO mapreduce.Job:  map 100% reduce 100%
14/09/25 19:23:34 INFO mapreduce.Job: Job job_1411613494956_0004 completed successfully
14/09/25 19:23:34 INFO mapreduce.Job: Counters: 46
	File System Counters
		FILE: Number of bytes read=19412
		FILE: Number of bytes written=354065
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=12524
		HDFS: Number of bytes written=14550
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Failed map tasks=4
		Killed map tasks=1
		Launched map tasks=8
		Launched reduce tasks=1
		Other local map tasks=5
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=246514
		Total time spent by all reduces in occupied slots (ms)=23106
	Map-Reduce Framework
		Map input records=1214
		Map output records=1214
		Map output bytes=16978
		Map output materialized bytes=19424
		Input split bytes=402
		Combine input records=0
		Combine output records=0
		Reduce input groups=1
		Reduce shuffle bytes=19424
		Reduce input records=1214
		Reduce output records=1214
		Spilled Records=2428
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=925
		CPU time spent (ms)=2850
		Physical memory (bytes) snapshot=621813760
		Virtual memory (bytes) snapshot=2639740928
		Total committed heap usage (bytes)=377892864
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=12122
	File Output Format Counters 
		Bytes Written=14550
 
2. maxSplitSize configured via FileInputFormat.setMaxInputSplitSize: maxSplitSize = 10240 bytes (10 KB). The split count rises from 3 to 4 (number of splits:4 below) because at least one of the input files is larger than 10 KB and is therefore cut into more than one split.
     Code
package com.ljf.mr.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MappersTest {

	public static class Mappers extends Mapper<LongWritable,Text,IntWritable,Text> {
		
		private static IntWritable intKey = new IntWritable(1);
		
		/**
		 * Text input is read by LineRecordReader by default:
		 * one key/value pair per line, where the key is the byte offset
		 * of the line and the value is the line's text.
		 * @throws InterruptedException 
		 * @throws IOException 
		 */
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		    context.write(intKey, value);
		}
	}
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		if(args.length < 2) {
			System.exit(2);
		}
		Configuration conf = new Configuration();
//		conf.setStrings("mapred.max.split.size", "10240");
//		conf.setStrings("mapred.min.split.size", "10240");
		
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

		@SuppressWarnings("deprecation")
		Job job = new Job(conf, "MappersTest");
		
		job.setJarByClass(MappersTest.class);
		job.setMapperClass(Mappers.class);
		
		job.setMapOutputKeyClass(IntWritable.class);
		job.setMapOutputValueClass(Text.class);
		
		FileInputFormat.setMaxInputSplitSize(job, 10240);//10KB
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		
		System.exit(job.waitForCompletion(true)? 0:1);
	}

}

     Run log

14/09/25 19:29:53 INFO input.FileInputFormat: Total input paths to process : 3
14/09/25 19:29:53 INFO mapreduce.JobSubmitter: number of splits:4
14/09/25 19:29:53 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/09/25 19:29:53 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/09/25 19:29:53 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/09/25 19:29:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1411613494956_0005
14/09/25 19:29:54 INFO impl.YarnClientImpl: Submitted application application_1411613494956_0005 to ResourceManager at /0.0.0.0:8032
14/09/25 19:29:54 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1411613494956_0005/
14/09/25 19:29:54 INFO mapreduce.Job: Running job: job_1411613494956_0005
14/09/25 19:30:16 INFO mapreduce.Job: Job job_1411613494956_0005 running in uber mode : false
14/09/25 19:30:16 INFO mapreduce.Job:  map 0% reduce 0%
14/09/25 19:30:56 INFO mapreduce.Job: Task Id : attempt_1411613494956_0005_m_000000_0, Status : FAILED
Container killed on request. Exit code is 137
Killed by external signal

14/09/25 19:30:57 INFO mapreduce.Job:  map 75% reduce 0%
14/09/25 19:31:21 INFO mapreduce.Job:  map 100% reduce 0%
14/09/25 19:31:23 INFO mapreduce.Job:  map 100% reduce 100%
14/09/25 19:31:23 INFO mapreduce.Job: Job job_1411613494956_0005 completed successfully
14/09/25 19:31:23 INFO mapreduce.Job: Counters: 46
	File System Counters
		FILE: Number of bytes read=19412
		FILE: Number of bytes written=433651
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=14420
		HDFS: Number of bytes written=14550
		HDFS: Number of read operations=15
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Failed map tasks=1
		Killed map tasks=1
		Launched map tasks=6
		Launched reduce tasks=1
		Other local map tasks=2
		Data-local map tasks=4
		Total time spent by all maps in occupied slots (ms)=188177
		Total time spent by all reduces in occupied slots (ms)=23271
	Map-Reduce Framework
		Map input records=1214
		Map output records=1214
		Map output bytes=16978
		Map output materialized bytes=19430
		Input split bytes=536
		Combine input records=0
		Combine output records=0
		Reduce input groups=1
		Reduce shuffle bytes=19430
		Reduce input records=1214
		Reduce output records=1214
		Spilled Records=2428
		Shuffled Maps =4
		Failed Shuffles=0
		Merged Map outputs=4
		GC time elapsed (ms)=1385
		CPU time spent (ms)=4280
		Physical memory (bytes) snapshot=582656000
		Virtual memory (bytes) snapshot=3298025472
		Total committed heap usage (bytes)=498614272
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=13884
	File Output Format Counters 
		Bytes Written=14550

 

Reference: http://my.oschina.net/KingPan/blog/288549 — "The number of mappers and the number of splits in a MapReduce Application"

 
