Hadoop Resources
Latest revision as of 18:31, 5 February 2020
Hadoop provides a distributed filesystem (HDFS) that can store data across thousands of servers, and a means of running work (Map/Reduce jobs) across those machines, running the work near the data.
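As a rough illustration of the Map/Reduce model mentioned above, the following pure-Python sketch counts words in three steps: a map step that emits key/value pairs, a grouping step standing in for the framework's shuffle, and a reduce step that aggregates each group. It runs in a single process and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop runs jobs near data"]))
```

On a real cluster the map and reduce functions would run as separate tasks on different nodes, ideally on the nodes that hold the input blocks.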
MapReduce Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a highly available (HA) job orchestrator (YARN) and several worker servers, one per node in the cluster. YARN is the point of interaction between users and the framework. Users submit map/reduce jobs to YARN, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. YARN also manages the assignment of map/reduce and Spark tasks to the workers.
A Hue frontend, Oozie and Hive are also available.
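The first-come/first-served queueing described above can be pictured with a minimal sketch. This is only a toy model of a pending-job queue, not YARN's scheduler or any part of its API; the class `FifoJobQueue` and the job names are hypothetical.

```python
from collections import deque


class FifoJobQueue:
    """Toy model of a pending-job queue served in submission order."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job_name):
        """Append a job to the end of the pending queue."""
        self._pending.append(job_name)

    def next_job(self):
        """Return the oldest pending job, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


def drain(queue):
    """Take jobs from the queue until it is empty; return the execution order."""
    order = []
    job = queue.next_job()
    while job is not None:
        order.append(job)
        job = queue.next_job()
    return order


if __name__ == "__main__":
    queue = FifoJobQueue()
    for name in ["wordcount", "index-build", "spark-etl"]:
        queue.submit(name)
    print(drain(queue))
```

Jobs leave the queue in exactly the order in which they were submitted, which is the behaviour the paragraph above describes.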
Hadoop Filesystem
Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. HDFS stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
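The block layout described above can be worked through with a small sketch. The block size (128 MiB) and replication factor (3) used in the example are illustrative assumptions, not values taken from this cluster's configuration, and the helper names `block_layout` and `raw_storage` are hypothetical.

```python
MIB = 1024 ** 2


def block_layout(file_size, block_size):
    """Return the sizes of the blocks a file is split into.

    All blocks are block_size bytes long except possibly the last one,
    which holds whatever remains.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication):
    """Return the bytes consumed across the cluster when every block is replicated."""
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return file_size * replication


if __name__ == "__main__":
    # Assumed values for illustration only: 300 MiB file, 128 MiB blocks, 3 replicas.
    sizes = block_layout(300 * MIB, 128 * MIB)
    print([size // MIB for size in sizes])    # block sizes in MiB
    print(raw_storage(300 * MIB, 3) // MIB)   # raw storage in MiB
```

Under these assumptions a 300 MiB file occupies two full 128 MiB blocks plus one 44 MiB block, and its replicas take 900 MiB of raw cluster storage.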
Hadoop & gCube
Hadoop nodes are exploited by gCube services, which then provide higher-level functionality through the iMarine VREs. gCube services can execute Hadoop Map/Reduce jobs using the gCube Execution Engine, which implements a dedicated adaptor to interface with the Hadoop jobtracker. In addition, a new framework called WPS-Hadoop has been developed to allow different types of environmental and geospatial algorithms to be executed in Hadoop.
A Spark 2 environment is also available on the same Hadoop cluster.
The following Hadoop clusters are available on the D4Science infrastructure (an e-Infrastructure operated by the D4Science.org initiative) thanks to iMarine:
Partner | Distribution | YARN | Worker Nodes | HDFS Size | total RAM | total virtual CPU cores |
---|---|---|---|---|---|---|
CNR | Cloudera 5 | 3 nodes | 20 | 4.6 TB | 640 GB | 320 |