Reposted from: http://www.cnblogs.com/xuxm2007/archive/2011/06/29/2092145.html
Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via URLs are already present on the FileSystem at the path specified by the URL and are accessible by every machine in the cluster.
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may be optionally added to the classpath of the tasks, a rudimentary software distribution mechanism. Files have execution permissions. Optionally users can also direct it to symlink the distributed cache file(s) into the working directory of the task.
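For instance, caching an archive and adding a jar to the task classpath look roughly like this on the 0.20.x API (a minimal sketch; the HDFS paths and file names below are placeholders, not part of the original post):
// Cache an archive; it is un-archived into the task's local cache directory.
DistributedCache.addCacheArchive(
    new URI("hdfs://localhost:9000/foo/bar/dictionaries.tgz"), conf);
// Add a jar that already sits on HDFS to the classpath of every task.
DistributedCache.addFileToClassPath(
    new Path("/foo/bar/mylib.jar"), conf);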
DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.
=================================================================================================
When setting up your Job configuration:
// Create symlinks in the job's working directory using the link name
// provided below
DistributedCache.createSymlink(conf);
// Add a file to the cache. It must already exist on HDFS. The text after
// the hash is the link name.
DistributedCache.addCacheFile(
new URI("hdfs://localhost:9000/foo/bar/baz.txt#baz.txt"), conf);
Now that we’ve cached our file, let’s access it:
// Direct access by name
File baz = new File("baz.txt");
// prints "true" since the file was found in the working directory
System.out.println(baz.exists());
// We can also get a list of all cache files
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
for (int i = 0; i < cached.length; i++) {
Path path = cached[i];
String filename = path.toString();
}
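On the task side, this lookup typically happens in the mapper's configure() method. Here is a rough sketch against the old org.apache.hadoop.mapred API, assuming the baz.txt symlink created above (the class name is made up for illustration):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheAwareMapper extends MapReduceBase {
    private String firstLine;

    @Override
    public void configure(JobConf job) {
        try {
            // "baz.txt" is the symlink placed in the task's working directory
            // by DistributedCache.createSymlink() plus the #baz.txt fragment.
            BufferedReader reader = new BufferedReader(new FileReader("baz.txt"));
            firstLine = reader.readLine();
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read cached file baz.txt", e);
        }
    }
}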
=================================================================================================
http://www.chetanislazy.com/blog/2010/12/29/distributing-jars-for-mapreduce-jobs-via-hdfs/
Hadoop has a built-in feature for easily distributing JARs to your worker nodes via HDFS but, unfortunately, it’s broken. There are a couple of tickets open with patches against 0.18 and 0.21 (trunk), but for some reason they still haven’t been committed yet. We’re currently running 0.20, so the patch does us no good anyway. So here’s my simple solution: