Here

Transcription

Here
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
Sneha D.Borkar1 , Prof.Chaitali S.Surtakar2
Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com
Assistant Professor, Information Technology, J.D.I.E.T, chaitali.suratkar61@gmail.com
ABSTRACT
Hadoop is a software framework that supports data intensive distributed application. Hadoop creates clusters of
machine and coordinates the work among them. It include two major component, HDFS (Hadoop Distributed
File System) and Map Reduce. HDFS is designed to store large amount of data reliably and provide high
availability of data to user application running at client. It creates multiple data blocks and store each of the
block redundantly across the pool of servers to enable reliable, extreme rapid computation. Map Reduce is
software framework for the analyzing and transforming a very large data set in to desire d output. This paper
focus on how the replicas are managed in HDFS for providing high availability of data under extreme
computational requirement .this paper focus on possible failure that will affect the Hadoop cluster and which are
failover mechanism can be deployed for protecting the cluster.
Keywords : Hadoop, HDFS, Map Reduce, Hadoop Technology, High Availability.
----------------------------------------------------------------------------------------------------------------------------- ---------------
1. INTRODUCTION
The Hadoop is in the parallel access to data that can reside on a single node or on thousands of nodes .The
Hadoop Distributed File System (HDFS) is one of many different components and project contained within the
community Hadoop ecosystem. The Apache Hadoop defines HDFS as: “the primary storage system used by Hadoop
applications”. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations .Hadoop utilizes a scale-out architecture that makes use of
commodity servers configured as a cluster, where each server possesses inexpensive internal disk drives. Data in
Hadoop is broken down into blocks and spread throughout a cluster. Once that happens, Map Reduce tasks can be
carried out on the smaller subsets of data that may make up a very large dataset overall, thus accomplishing the type
of scalability needed for big data processing [1].It Has The ability to carry out some degree of automatic redundancy
and failover make it popular for modern businesses looking for data warehouse batch analytics solutions.
1.1What is Hadoop?
Hadoop is an Open Source implementation of a large-scale batch processing system. It uses the
Map Reduce framework introduced by Google by leveraging the concept o f map and reduce functions well
known used in Functional Programming. Although the Hadoop framework is written in Java, it allows
developers to deploy custom- written programs coded in Java or any other language to process data in a
parallel fashion across hundreds or thousands of commodity s ervers. It is optimized for contiguous read
requests (streaming reads), where processing consists of scanning all the data. Depending on the
complexity of the process and the volume of data, response time can vary from minutes to hours. While
Hadoop can processes data fast, its key advantage is its massive scalability.
Hadoop leverages a cluster of nodes to run Map Reduce programs massively in parallel [2]. This program
consists of two steps: the Map step processes input data and the Reduce step assembles intermediate results into a
final result. Each cluster node has a local file system and local CPU on which to run the Map Reduce programs. The
local files contain the file system called Hadoop Distributed File System (HDFS). The number of nodes in each
cluster varies from hundreds to thousands of machines. Hadoop can also allow for a certain set of fail-over
scenarios.
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
Fig.1:- A high level overview of Hadoop cluster
Hadoop is currently being used for index web searches, email spam detection, recommendation engines,
prediction in financial services, genome manipulation in life sciences, and for analysis of unstructured dat a such as
log, text, and click stream. While many of these applications could in fact be implemented in a relational database,
the main role of the Hadoop framework is functionally different from RDBMs
1.2 Map Reduce
Map Reduce has emerged as a popular way to harness the power of large clusters of computers. It allows
programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, and
allow the details of distributed execution, network communication and fault tolerance to be handled by the Map
Reduce framework. And also Map Reduce is typically applied to large batch-oriented computations that are
concerned primarily with time to job completion. The Google Map Reduce framework and open-source Hadoop
system reinforce this usage model through a batch-processing implementation strategy: the entire output of each
map and reduce task is materialized to a local file before it can be consumed by the next stage. Materialization
allows for a simple and elegant checkpoint/restart fault tolerance mechanism that is critical in la rge deployments,
which have a high probability of slowdowns or failures at worker nodes. We propose a modified MapReduce
architecture in which intermediate data is pipelined between operators, while preserving the programming interfaces
and fault tolerance models of previous Map Reduce frameworks. To validate this design, R. Abbott and H. GarciaMolina develop the Hadoop Online Prototype (HOP), a pipelining version of Hadoop [4]. Pipelining provides
several important advantages for Map Reduce framework; it also raises new design challenges.
2. Design of Hadoop Technology
2.1 Software Architectural Bottlenecks
Sometimes HDFS is not utilized to its full potential due to scheduling delays in the Hadoop architecture
that result in cluster nodes waiting for new tasks. Instead of using the disk in a streaming manner, the access pattern
is periodic. Some Data perfecting is not employed to improve performance, even though the typical Map Reduce
streaming access pattern is highly predictable.
2.2 Portability Limitations:
The performance enhancing features in the native file system are not available in Java in a platformindependent manner. It includes options such as bypassing the file system page cache and transferring data directly
from disk into user buffers .The HDFS implementation runs less efficiently and has higher processor usage than
would otherwise be necessary [3].
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
2.3 Portability Assumptions:
The classic notion of software portability is simple: does the application run on multiple platforms? But,
a broader notion of portability is: does the application perform well on multiple platforms? While HDFS is strictly
portable, its performance is highly dependent on the behavior of underlying software layers, spe cifically the OS I/O
scheduler and native file system allocation algorithm. Here, author quantify the impact and significance of these
HDFS bottlenecks.
Fig-2:Hadoop Architecture
The study of Hadoop Architecture, understanding of its design and implementation and reasoning why
hadoop possess has excellent scalability, good fault tolerance capability but moderate performance.
Fig-3. Hadoop Architecture Design:
The particular aspect of database design application level I/O scheduling exploits application access
patterns to maximize storage bandwidth in a way that is not similarly exploitable by HDFS. Application -level I/O
scheduling is frequently used to improve database performance by reducing seeks in systems with large numbers of
concurrent queries. Because database workloads often have data re-use, storage usage can be reduced by sharing
data between active queries. Here, part or the entire disk is continuously scanned in a sequential manner. Clients join
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
the scan stream in-flight, leave after they have received all necessary data and never interrupt the stream by
triggering immediate seeks. The highest overall throughput can be maintained for all queries. This particular type of
scheduling is only beneficial when multiple clients each access some portion of shared data, which is not common in
many HDFS workloads [5].
3. TECHNOLOGY
Hadoop, is also called as Apache Hadoop [4], It is an Apache Software Foundation project and open source
software platform for scalable, distributed computing. Also provide fast and reliable analysis of both structured data
and unstructured data. Given its capabilities to handle large data sets, it's often associated with the phrase big data.
Performing computation on large volumes of data has been done before, usually in a distributed setting but writing
distributed systems is notoriously hard. Map Reduce systems such as Hadoop are used in large- scale deployments
.Eliminating HDFS bottlenecks will not only boost application performance, but also improve overall cluster
efficiency, thereby reducing power and cooling costs and allowing more computation to be accomplished with the
same number of cluster nodes. These solutions include improved I/O scheduling, adding pipelining and prefetching
to both task scheduling and HDFS clients, pre-allocating file space on disk, and modifying or eliminating the local
file system, among other methods. Apache Hadoop controls costs by storing data more affordably per terabyte than
other platforms. Instead of thousands to tens of thousands per terabyte Hadoop delivers compute and storage for
hundreds of dollars per terabyte. Fault tolerance is one of the most important advantages of using Hadoop. Even if
individual nodes experience high rates of failure when running jobs on a large cluster, data is replicated across a
cluster so that it can be recovered easily in the face of disk, node or track failures. The flexible way that data is
stored in Apache Hadoop is one of its biggest assets –enabling businesses to generate value from data that was
previously considered too expensive to be stored and processed databases.
3.1 HDFS analysis
After the analysis of the Hadoop with Architect, here’s the dependency graph of project HDFS use mostly
hadoop-common and profound libraries. When external libraries are used, it’s better to check if we can easily
change a third party lib by another one without impacting the whole application, there are many reasons that can
encourage to change a third party libraries. As an example of jetty lib and search which methods from hdfs use it
directly.
Fig-4. HDFS Analysis
5. CONCLUSION
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
Hadoop is very flexible and permit us to change the behavior without changing the source code. Hadoop Map
Reduce is a large scale, open source software framework dedicated to scalable, distributed, data -intensive
computing. Using frameworks as user is very interesting, but going inside this framework could give us more info
suitable to understand it better, and adapt it to our needs easily. Hadoop is a powerful framework used by many
companies, and most of them need to customize it.
REFERENCES
[1] R. Abbott and H. Garcia-Molina. “Scheduling I/O requests with dead-lines:” A performance evaluation. In
Proceedings of the 11th Real-Time Systems Symposium, pages 113–124, Dec 1990.
[2] G. Candea, N. Polyzotis, and R. Vingralek. “A scalable, predictable join Operator for highly concurrent data
warehouses”. In 35th International Conference on Very Large Data Bases (VLDB), 2009.
[3] “Understanding Hadoop Clusters and the Network.” Available at http://bradhedlund.com. Accessed on June 1,
2013.
[4] In-Database Map-Reduce
[5]“Yahoo!
Hadoop
Tutorial.”
Yahoo!
Developer
Network.
Available
at
http://developer.yahoo.com/hadoop/tutorial/. Accessed on June 4, 2013.
[6]HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
[7]HDFS source code: http://hadoop.apache.org/hdfs/version_control.html
[8] “Hadoop HDFS over HTTP - Documentation Sets 2.0.4-alpha.” Apache Software Foundation. Available at
http://hadoop.apache.org/docs/r2.0.4-alpha/hadoop-hdfs-httpfs/index.html. Accessed on June 5, 2013.
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 1
IJRISE| www.ijrise.org|editor@ijrise.org [211-216]
e-ISSN: 2394-8299
p-ISSN: 2394-8280