This is the documentation for CDH 5.1.0.
Documentation for other versions is available at Cloudera Documentation.

HdfsFindTool

HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees and finds all HDFS files that match the specified expression and applies selected actions to them. By default, it simply prints the list of matching HDFS file paths to stdout, one path per line. The output file list can be piped into the MapReduceIndexerTool using the MapReduceIndexerTool --inputlist option.

More details are available through the command line help:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
 org.apache.solr.hadoop.HdfsFindTool -help

Usage: hadoop fs [generic options]
	[-find <path> ... <expression> ...]
	[-help [cmd ...]]
	[-usage [cmd ...]]

-find <path> ... <expression> ...:	Finds all files that match the specified expression and applies selected actions to them.

		The following primary expressions are recognised:
		  -atime n
		  -amin n
		    Evaluates as true if the file access time subtracted from
		    the start time is n days (or minutes if -amin is used).

		  -blocks n
		    Evaluates to true if the number of file blocks is n.

		  -class classname [args ...]
		    Executes the named expression class.

		  -depth
		    Always evaluates to true. Causes directory contents to be
		    evaluated before the directory itself.

		  -empty
		    Evaluates as true if the file is empty or directory has no
		    contents.

		  -exec command [argument ...]
		  -ok command [argument ...]
		    Executes the specified Hadoop shell command with the given
		    arguments. If the string {} is given as an argument then
		    is replaced by the current path name.  If a {} argument is
		    followed by a + character then multiple paths will be
		    batched up and passed to a single execution of the command.
		    A maximum of 500 paths will be passed to a single
		    command. The expression evaluates to true if the command
		    returns success and false if it fails.
		    If -ok is specified then confirmation of each command shall be
		    prompted for on STDERR prior to execution.  If the response is
		    'y' or 'yes' then the command shall be executed else the command
		    shall not be invoked and the expression shall return false.

		  -group groupname
		    Evaluates as true if the file belongs to the specified
		    group.

		  -mtime n
		  -mmin n
		    Evaluates as true if the file modification time subtracted
		    from the start time is n days (or minutes if -mmin is used)

		  -name pattern
		  -iname pattern
		    Evaluates as true if the basename of the file matches the
		    pattern using standard file system globbing.
		    If -iname is used then the match is case insensitive.

		  -newer file
		    Evaluates as true if the modification time of the current
		    file is more recent than the modification time of the
		    specified file.

		  -nogroup
		    Evaluates as true if the file does not have a valid group.

		  -nouser
		    Evaluates as true if the file does not have a valid owner.

		  -perm [-]mode
		  -perm [-]onum
		    Evaluates as true if the file permissions match that
		    specified. If the hyphen is specified then the expression
		    shall evaluate as true if at least the bits specified
		    match, otherwise an exact match is required.
		    The mode may be specified using either symbolic notation,
		    eg 'u=rwx,g+x+w' or as an octal number.

		  -print
		  -print0
		    Always evaluates to true. Causes the current pathname to be
		    written to standard output. If the -print0 expression is
		    used then an ASCII NULL character is appended.

		  -prune
		    Always evaluates to true. Causes the find command to not
		    descend any further down this directory tree. Does not
		    have any affect if the -depth expression is specified.

		  -replicas n
		    Evaluates to true if the number of file replicas is n.

		  -size n[c]
		    Evaluates to true if the file size in 512 byte blocks is n.
		    If n is followed by the character 'c' then the size is in bytes.

		  -type filetype
		    Evaluates to true if the file type matches that specified.
		    The following file type values are supported:
		    'd' (directory), 'l' (symbolic link), 'f' (regular file).

		  -user username
		    Evaluates as true if the owner of the file matches the
		    specified user.

		The following operators are recognised:
		  expression -a expression
		  expression -and expression
		  expression expression
		    Logical AND operator for joining two expressions. Returns
		    true if both child expressions return true. Implied by the
		    juxtaposition of two expressions and so does not need to be
		    explicitly specified. The second expression will not be
		    applied if the first fails.

		  ! expression
		  -not expression
		    Evaluates as true if the expression evaluates as false and
		    vice-versa.

		  expression -o expression
		  expression -or expression
		    Logical OR operator for joining two expressions. Returns
		    true if one of the child expressions returns true. The
		    second expression will not be applied if the first returns
		    true.

-help [cmd ...]:	Displays help for given command or all commands if none
		is specified.

-usage [cmd ...]:	Displays the usage for given command or all commands if none
		is specified.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For example, to find all files that:

  • Are contained in the directory tree hdfs:///user/$USER/solrloadtest/twitter/tweets
  • Have a name matching the glob pattern sample-statuses*.gz
  • Were modified less than 60 minutes ago
  • Are between 1 MB and 1 GB

You could use the following:

$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find \
hdfs:///user/$USER/solrloadtest/twitter/tweets -type f -name \
'sample-statuses*.gz' -mmin -60 -size -1000000000c -size +1000000c