Hadoop Resources

From D4Science Wiki
Revision as of 10:44, 4 October 2012 by Andrea.manzi (Talk | contribs)

Hadoop provides a distributed filesystem (HDFS) that can store data across thousands of servers, and a means of running work (Map/Reduce jobs) across those machines, so that the work executes near the data it operates on.

MapReduce Architecture

The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
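The flow described above can be illustrated with a minimal, single-process sketch. It is not Hadoop code and uses no Hadoop API: the function names (map_task, shuffle, reduce_task, run_job), the word-count job and the input data are all assumptions made for illustration. A deque stands in for the jobtracker's queue of pending jobs, and the shuffle step stands in for the data motion between the map and reduce phases.

```python
from collections import defaultdict, deque


def map_task(split):
    # Map phase: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    # Data motion between the phases: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    # Reduce phase: collapse all values for one key into a single count.
    return key, sum(values)


def run_job(splits):
    # Run every map task, shuffle the output, then run one reduce per key.
    intermediate = []
    for split in splits:
        intermediate.extend(map_task(split))
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Toy stand-in for the jobtracker's pending-job queue (hypothetical job names).
pending = deque()
pending.append(("job-1", ["a b a", "b c"]))
pending.append(("job-2", ["x x"]))

# Jobs are taken from the queue first come, first served.
results = {}
while pending:
    name, splits = pending.popleft()
    results[name] = run_job(splits)

print(results)
```

In a real cluster the map and reduce tasks run in parallel on the tasktrackers; here they run sequentially in one process, which is enough to show the order of the phases.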

Hadoop Filesystem

Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. HDFS stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
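The block layout can be worked through with a small sketch. The helper names (split_into_blocks, place_replicas), the node names, and the file size, block size and replication factor in the example are illustrative assumptions, not values taken from the clusters described on this page. The round-robin placement is also only a stand-in: the actual HDFS placement policy is more involved.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    # Return the size of each block: all full-sized except possibly the last.
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication):
    # Hypothetical round-robin placement: each block gets `replication`
    # distinct nodes. This is NOT the real HDFS placement policy.
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Example: a 300 MB file, 128 MB blocks, replication factor 3 (assumed values).
sizes = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(sizes), ["n1", "n2", "n3", "n4"], 3)
raw_bytes = sum(sizes) * 3  # total space consumed across all replicas

print([s // MB for s in sizes])
print(placement)
print(raw_bytes // MB)
```

Under these assumptions the file occupies two full blocks and one shorter final block, and the raw space consumed in the cluster is the file size multiplied by the replication factor.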

Hadoop & gCube

Hadoop nodes are exploited by gCube services, which in turn provide higher-level functionality through the iMarine VREs. gCube services can execute Hadoop Map/Reduce jobs using the gCube Execution Engine, which implements a dedicated adaptor to interface with the Hadoop jobtracker.

The following Hadoop clusters are available in the D4Science Ecosystem thanks to iMarine:


Partner | Jobtracker and Slaves | HDFS size | Total RAM | Total virtual CPU cores
CNR | 12 "Yes" status icons in total | 1.4 TB | 34 GB | 46