A Novel Data Replication Policy in Data Grid
World Applied Sciences Journal 28 (11): 1847-1852, 2013
ISSN 1818-4952 © IDOSI Publications, 2013
DOI: 10.5829/idosi.wasj.2013.28.11.1934

Y. Nematii
Department of Computer Engineering, Islamic Azad University, branch of biayza, Iran

Abstract: Data grids provide services for sharing and managing large data files around the world. A challenging problem in a data grid is how to reduce network traffic. One common method is to replicate files at different sites and to select the best replica in order to reduce access latency: many copies of a file are created and stored at suitable places so that the time needed to get a file is shortened. Building on these two ideas, this paper proposes a dynamic data replication strategy. Simulation results obtained with OptorSim show that our algorithm performs better than earlier ones.

Key words: Data Replication, Data Grid, OptorSim

INTRODUCTION

The data grid is a major type of grid used in data-intensive applications, where the size of data files reaches tera- or sometimes petabytes. High Energy Physics (HEP), genetics and Earth observation are examples of such applications. In these applications, managing and storing such huge volumes of data is very difficult. These massive amounts of data are shared among researchers around the world, so managing these large-scale distributed data resources is difficult; the data grid is a solution to this problem [1]. A data grid is an integrating architecture that connects a group of geographically distributed computers and storage resources, enabling users to share data and resources [2]. When a user requests a file, a huge amount of bandwidth can be spent transferring the file from server to client. Data replication is therefore an important optimization step for managing large data, replicating data at different, globally distributed sites. Replication can be static or dynamic.
In static replication, a replica is kept until it is deleted by the client or its lifetime expires. Dynamic replication, by contrast, creates and deletes replicas according to changes in access patterns or in the behavior of the environment. In data replication there are three issues: (1) replica management, (2) replica selection and (3) replica location. Replica management is the process of creating, deleting, moving and modifying replicas in a data grid. Replica selection is the process of choosing the most appropriate replica from the copies geographically spread across the grid. Finally, replica location is the problem of usefully selecting the physical locations of the several replicas of the desired data in a large-scale, wide-area data grid system. Several methods have been proposed for dynamic replication in grids [3-5]. The motivation for replication is that it can enhance data availability, network performance, load balancing and file-access performance. In a data grid, the latency of a job depends on the computing resource selected for job execution and on the location of the data file(s) the job needs to access. Scheduling jobs at proper executing nodes and placing data at proper locations are therefore important from the user's point of view, and they are difficult for the following reasons: each job needs a large amount of data, so data locality has to be taken into account in any scheduling decision; and data-management policies such as data replication can decrease data-access latency and improve the performance of the grid environment. There should thus be a coupling between the data locality achieved through replication strategies and scheduling. In this paper, a dynamic data replication policy is proposed: we developed a replication strategy for a 3-level hierarchical structure, called HRS (Hierarchical Replication Strategy).
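The dynamic creation and deletion of replicas described above can be illustrated with a minimal sketch. The following hypothetical Python (the class, the popularity threshold and the LRU eviction policy are illustrative assumptions, not the paper's algorithm) shows a per-site store that creates a replica once a file becomes popular and evicts the least-recently-used replica when storage is full:

```python
from collections import OrderedDict

class ReplicaStore:
    """Hypothetical per-site replica store: creates a replica when a file
    becomes popular and evicts the least-recently-used replica when full."""

    def __init__(self, capacity, popularity_threshold):
        self.capacity = capacity               # max number of replicas held
        self.threshold = popularity_threshold  # accesses needed to replicate
        self.access_count = {}                 # file -> observed accesses
        self.replicas = OrderedDict()          # held replicas, in LRU order

    def record_access(self, file_id):
        """Count an access; replicate once the file crosses the threshold."""
        self.access_count[file_id] = self.access_count.get(file_id, 0) + 1
        if file_id in self.replicas:
            self.replicas.move_to_end(file_id)     # refresh LRU position
        elif self.access_count[file_id] >= self.threshold:
            if len(self.replicas) >= self.capacity:
                self.replicas.popitem(last=False)  # evict LRU replica
            self.replicas[file_id] = True          # create local replica

store = ReplicaStore(capacity=2, popularity_threshold=2)
for f in ["a", "a", "b", "b", "c", "c"]:
    store.record_access(f)
print(list(store.replicas))  # "a" was evicted when "c" became popular
```

A static scheme would skip `record_access` entirely and keep a fixed replica set until a client removed it.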
In this replication algorithm, the replicated file is stored at the site with the largest number of accesses to the file and the main factor for replica selection is bandwidth. Our replica strategy was simulated in OptorSim and compared with different scheduling algorithms and replica strategies. The simulation results show that the proposed method has the minimum job execution time.

Corresponding Author: Y. Nematii, Department of Computer Engineering, Islamic Azad University, branch of biayza, Iran.

The rest of the paper is organized as follows: Section 2 gives an overview of previous work on data replication and job scheduling. Section 3 presents our scheduling algorithm and replication strategy. Section 4 describes our experiments and simulation results. Finally, Section 5 gives conclusions and proposed future work.

Related Work: In [8, 9], Ranganathan and Foster proposed six distinct replica strategies for multi-tier data grids: (1) No replication or caching: only the root node keeps replicas; (2) Best Client: a replica is created for the client with the largest number of requests for the file; (3) Cascading replication: original data files are stored at the root and, when a file's popularity exceeds the root's threshold, a replica is created at the next level on the path to the best client; (4) Plain caching: the client that requests the file keeps a local copy; (5) Caching plus Cascading Replication: a combination of Plain caching and Cascading; and (6) Fast Spread: replicas of the file are stored at each node on the path to the best client. They compared the strategies for three different data patterns, measuring the access latency and bandwidth consumption of each algorithm with a simulation tool. The simulations showed that different access patterns need different replica strategies; Cascading and Fast Spread performed best.
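Two of the six strategies lend themselves to a short sketch. The following hypothetical Python (all site names, counts and the threshold value are illustrative, not taken from [8, 9]) shows the Best Client choice and a single Cascading step toward the best client:

```python
def best_client(request_counts):
    """Best Client strategy: replicate at the client that issued the most
    requests for the file (request_counts maps client -> request count)."""
    return max(request_counts, key=request_counts.get)

def cascading_step(popularity, threshold, path_to_best_client, current_level):
    """Cascading strategy: when popularity at the current node exceeds its
    threshold, push one replica to the next tier on the path toward the
    best client; otherwise the replica stays where it is."""
    if popularity > threshold and current_level + 1 < len(path_to_best_client):
        return path_to_best_client[current_level + 1]
    return path_to_best_client[current_level]

clients = {"site_A": 12, "site_B": 30, "site_C": 7}
print(best_client(clients))                                    # site_B
print(cascading_step(50, 40, ["root", "tier1", "site_B"], 0))  # tier1
```

Fast Spread would instead place a copy at every node of the path at once, trading storage for latency.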
The authors also developed different scheduling strategies and combined them with different replication strategies; the results showed that data locality is important in job scheduling. In [10], Chakrabarti et al. proposed an Integrated Replication and Scheduling Strategy (IRS). Their goal was an iterative enhancement of performance based on the coupling between the scheduling and replication methods: after jobs are scheduled, the popularity of the required files is computed and then used to replicate data for the next set of jobs. In [11], the authors argued that data-placement jobs should be treated differently from computational jobs, because they may have different semantics and different properties. They therefore developed a scheduler that provides solutions for many of the data-placement activities in the grid. Their idea was to place data near the computational resource so that computational cycles complete efficiently; the method can automatically determine which protocol to use to transfer data from one host to another. Sashi and Thanamani in [12] proposed a dynamic replication strategy in which replicas are created based on weights. The main points of this algorithm are that newer records gain importance by receiving higher weights and that replica placement is done at the site level; the time distance of accesses from the current time and the number of accesses are the main factors in deciding file popularity. A novel replication method called the Branch Replication Schema (BRS) was presented in [13]. In this model, each replica is composed of a disjoint set of subreplicas organized in a hierarchical tree topology, and the subreplicas of a replica do not overlap. To improve the scalability and performance of the system for both read and write operations, a parallel approach was used [14-16].
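As a rough illustration of BRS's non-overlapping subreplicas, the following sketch (hypothetical Python, not the authors' implementation; fanout and depth are illustrative parameters) partitions a file's byte range into a tree whose leaves are disjoint subreplicas:

```python
def branch_split(start, end, fanout, depth):
    """Recursively partition the byte range [start, end) of a file into
    non-overlapping subreplicas arranged as a tree with the given fanout,
    mirroring the hierarchical, disjoint layout described for BRS."""
    if depth == 0 or end - start <= 1:
        return (start, end)                      # leaf subreplica
    size = (end - start) // fanout
    children = []
    for i in range(fanout):
        lo = start + i * size
        hi = end if i == fanout - 1 else lo + size  # last child takes the rest
        children.append(branch_split(lo, hi, fanout, depth - 1))
    return ((start, end), children)

tree = branch_split(0, 100, 2, 1)
print(tree)  # ((0, 100), [(0, 50), (50, 100)])
```

Because the leaf ranges never overlap, reads and writes on different subreplicas can proceed in parallel, which is the scalability benefit the parallel approach in [14-16] exploits.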
The simulation results showed that BRS improves data-access performance for files of different sizes and provides an efficient way of updating data replicas. Sang-Min Park et al. [17] presented Bandwidth Hierarchy Replication (BHR), which decreases data-access time by avoiding network congestion in a data grid network. The BHR algorithm exploits 'network-level locality', meaning that the wanted file is located at a site that has broad bandwidth to the site of job execution. In a data grid, closely linked sites may be placed within a region; a country, for example, can be regarded as such a network region. Network bandwidth between sites within a region is higher than bandwidth between sites in different regions, so a hierarchy of network bandwidth appears in the Internet. If the required file is located in the same region, the time needed to fetch it is smaller; BHR decreases data-access time by maximizing this network-level locality. In that work, replicated files are placed at all the requesting sites. In another paper [18], the authors presented Modified BHR, which tries to replicate files at the site where the file has been accessed most frequently, on the assumption that it may be required there in the future. The results showed that the proposed algorithm minimized data-access time and avoided unnecessary replication, thus improving performance; however, an efficient scheduling algorithm was not used. In our work, we address both replication and scheduling. A replication algorithm for a 3-level hierarchical structure and a scheduling algorithm were proposed in [19]. They considered a network hierarchy with three levels: nodes in the first level are connected through the Internet, which has low bandwidth, and nodes in the next level have moderately higher bandwidth than the first level.
In the last level, nodes are connected to each other by the highest-speed networks. For efficient scheduling of jobs, the algorithm selects the best region, LAN and site, respectively. In their method, when there is not enough space for a replica, the main factor in deciding what to delete is bandwidth, which leads to better performance than the LRU method. Although this 3-layer replication improves some performance metrics in previous works, it has a weakness: the replica is placed at every requesting site rather than at the best site. We therefore extended this strategy to lessen this weakness. In [6], the authors presented a data grid architecture supporting efficient data replication and job scheduling. They grouped computing sites into individual domains according to the network connections, with a replica server in each domain. They proposed two centralized dynamic replication algorithms with different replica-placement methods and a distributed dynamic replication algorithm, together with grid scheduling heuristics such as shortest turnaround time, least relative load and data present, and they used the LRU algorithm for replica replacement. The results showed that the dynamic replication algorithms can reduce job turnaround time significantly. In [20], the performance of eight dynamic replication strategies was evaluated under different data grid settings. The simulation results showed that the chosen file-replication policy and the file-access pattern had a great influence on real-time grid performance; Fast Spread-Enhanced was the best of the eight algorithms considered, and peer-to-peer communication was found to be very profitable in boosting real-time performance. In [21], the authors presented a dynamic replication algorithm based on fast spread in the multi-tier data grid, called predictive hierarchical fast spread (PHFS).
PHFS predicts the user's subsequent accesses and adapts the replication configuration to the current conditions, thereby increasing locality of access. PHFS is based on the idea that users working in the same context will, with high probability, want some of the same files. One of the main results is that PHFS is appropriate for applications in which clients work within a context for a period of time and client requests are not random. The authors compared PHFS with common fast spread from the viewpoint of access latency, and the results showed that PHFS achieves better performance and lower latency than common fast spread. Rashedur M. Rahman et al. in [22] proposed a technique for selecting the best replica using previous file-transfer logs, exploiting the k-Nearest Neighbour (KNN) rule. When a new request arrives, all prior data are used to find a subset of earlier file requests similar to it, and these are then employed to predict the best site for the replica. They also proposed a predictive framework that considers data from different sources and predicts the transfer times of the sites hosting replicas. The results showed that the KNN method yields a better performance improvement than the traditional replica-catalogue-based model.

Replication Algorithm: When a job is allocated to a local scheduler, the replica manager transfers all the needed files that are not available at the local site. The goal of replication is to transfer the required data to the local site before job execution, so data replication enhances job-scheduling performance by decreasing job execution time. Our replication strategy decides which copy will be transferred and how this new replica is handled; the replicated files are not placed at all the requesting sites. Three types of locality are identified in [9]. For each file required by a job, the Replica Manager checks whether the file exists at the local site.
If the file is not available at the local site, the method first searches the local LAN; if the file is duplicated in the same LAN, it chooses the copy with the maximum bandwidth available for the transfer. If the file does not exist in the local LAN, the method searches the local region and chooses the copy with the highest bandwidth to the requesting node. If the file does not exist in the same region either, it generates a list of replicas in the other regions and chooses the replica with the maximum bandwidth available for the transfer. In other words, the method distinguishes among intra-LAN, intra-region and inter-region communication.

Simulation

Simulation Tool: We evaluated the performance of our algorithms by implementing them in OptorSim, a data grid simulator developed by the EU DataGrid project, shown in Fig. 1 [23]. It is designed for evaluating dynamic replication strategies [24, 25]. OptorSim consists of several parts: the Computing Element (CE) represents the computational resource that processes grid jobs; the Storage Element (SE) is where grid data are stored; the Resource Broker (RB) receives job submissions from users and allocates each job to a proper site according to the scheduling policy; and the Replica Manager (RM) at each site controls the data transfers between sites and provides an interface for directly accessing the Replica Catalogue, while the Replica Optimizer (RO) within the RM contains the replication algorithm. The broker distributes jobs to grid sites; jobs take their data from the SE and are executed using the CE.

Fig. 1: OptorSim architecture

RESULTS AND DISCUSSION

Table 1: Bandwidth configuration
Parameter                        Value (Mbps)
Intra LAN bandwidth              1000
Intra Region bandwidth           100
Inter Region bandwidth           10

Table 2: General job configuration
Parameter                        Value
Number of job types              6
Number of file accesses per job  16
Size of a single file (GB)       1
Total size of files (GB)         100
Job delay (ms)                   2500
Maximum queue size               200
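The replica lookup order described in the Replication Algorithm section (local site, then local LAN, then local region, then other regions, always preferring the highest-bandwidth holder) can be sketched as follows. This is hypothetical Python, not the OptorSim implementation; the site and bandwidth records are illustrative:

```python
def select_replica(file_id, local_site, sites, bandwidth):
    """Hierarchical replica lookup sketch. `sites` maps each site name to
    {"lan": ..., "region": ..., "files": set(...)}; `bandwidth` maps
    (source, requester) pairs to Mbps. All names are assumptions."""
    holders = [s for s, info in sites.items() if file_id in info["files"]]
    if local_site in holders:
        return local_site                      # intra-site: no transfer needed
    me = sites[local_site]
    for level in ("lan", "region", None):      # LAN, then region, then anywhere
        candidates = [s for s in holders
                      if level is None or sites[s][level] == me[level]]
        if candidates:                         # widest pipe to the requester
            return max(candidates, key=lambda s: bandwidth[(s, local_site)])
    return None                                # file not replicated anywhere

sites = {
    "s1": {"lan": "L1", "region": "R1", "files": set()},
    "s2": {"lan": "L1", "region": "R1", "files": {"f"}},
    "s3": {"lan": "L2", "region": "R1", "files": {"f"}},
    "s4": {"lan": "L3", "region": "R2", "files": {"f"}},
}
bw = {("s2", "s1"): 1000, ("s3", "s1"): 100, ("s4", "s1"): 10}
print(select_replica("f", "s1", sites, bw))  # s2: same LAN, widest bandwidth
```

The early return at each level is what realizes the intra-LAN / intra-region / inter-region distinction: a slower remote holder is never considered while a closer one exists.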
The replica manager maintains the replica catalogue, and the optimizer selects the best site from which to fetch a replica.

Experimental Evaluation: The OptorSim structure is flat, so the OptorSim code was modified to implement the hierarchical structure. We assume there are 3 regions with, on average, three sites per region. The storage capacity of the master site is 250 GB and the storage capacity of every other site is 10 GB. The assumed bandwidths are shown in Table 1. The numbers of storage elements and computing elements are set to 11 and 10, respectively. Table 2 lists the simulation parameters and the values used in our study: there are 6 job types, and each job type on average requires 16 files (of 1 GB each) to execute. We studied the performance of various replica strategies and different algorithm combinations. Four scheduling strategies were evaluated: the Random scheduler, the Shortest Queue scheduler, the Access Cost scheduler and the Queue Access Cost scheduler. The Random scheduler schedules a job randomly. The Shortest Queue scheduler selects the computing element with the fewest jobs waiting in its queue. The Access Cost scheduler assigns a job to the computing element where the files have the lowest access cost (the cost of getting all files needed to execute the job). The Queue Access Cost scheduler selects the computing element with the smallest sum of the job's access cost and the access costs of all jobs already in the queue. More details about these schedulers are given in [27]. Four replication algorithms were investigated: Least Frequently Used (LFU), Least Recently Used (LRU), BHR and Modified BHR. In the LRU strategy, replication always takes place at the site where the job is executing; if there is not enough space for the new replica, the oldest file in the storage element is deleted. In LFU, replication likewise always takes place at the site where the job is executing.
If there is not enough space for a new replica, the least-accessed file in the storage element is deleted. The BHR algorithm, based on network-level locality, tries to place as many replicas as possible within a region where broad bandwidth is available between sites. The Modified BHR Region Based algorithm replicates files within the region, with the condition that each replica is placed at the site where the file has been accessed most often.

Fig. 2: Mean job execution time

Figure 2 shows the effect of the number of jobs on the runtimes of the different replica strategies, for up to 1500 jobs. As the number of jobs increases, the performance of our method improves compared with the other methods; in a grid environment, a large number of jobs must be run.

CONCLUSION

In this paper, we proposed a novel replication strategy. Our replication strategy places data in a manner that offers faster access to the files required by grid jobs and therefore increases job execution performance. If the available storage for replication is not sufficient, the proposed algorithm replicates only those files that do not exist in the local LAN. We analyzed the performance of our replication method with OptorSim. Experimental results showed that our replica algorithm outperforms the LRU method as the number of jobs increases. In future work, we plan to use other intelligent strategies to predict future needs and to obtain the best replication configuration.

REFERENCES

1. Chervenak, A., I. Foster, C. Kesselman, C. Salisbury and S. Tuecke, 2000. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23(3): 187-200.
2. Chervenak, A., I. Foster, C. Kesselman, C. Salisbury and S. Tuecke, 2001. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23: 187-200.
3. Lamehamedi, H. and B.K. Szymanski, 2007. Decentralized Data Management Framework for Data Grids. Future Generation Computer Systems, Elsevier, 23: 109-115.
4. Abawajy, J.H., 2004. Placement of File Replicas in Data Grid Environments. ICCS'04, LNCS 3038, pp: 66-73.
5. Tang, M., B.S. Lee, C.K. Yeo and X. Tang, 2005. Dynamic Replication Algorithms for the Multi-tier Data Grid. Future Generation Computer Systems, Elsevier, 21: 775-790.
6. Schintke, F. and A. Reinefeld, 2003. Modeling Replica Availability in Large Data Grids. Journal of Grid Computing, 1(2).
7. Zhao, W., X. Xu, Z. Wang, Y. Zhang and S. He, 2010. A Dynamic Optimal Replication Strategy in Data Grid Environment. IEEE.
8. Foster, I. and K. Ranganathan, 2001. Design and Evaluation of Dynamic Replication Strategies for High Performance Data Grids. In: Proceedings of the International Conference on Computing in High Energy and Nuclear Physics, Beijing, China.
9. Foster, I. and K. Ranganathan, 2002. Identifying Dynamic Replication Strategies for High Performance Data Grids. In: Proceedings of the 3rd IEEE/ACM International Workshop on Grid Computing, LNCS 2242, Denver, USA, pp: 75-86.
10. Chakrabarti, A., R.A. Dheepak and S. Sengupta, 2004. Integration of Scheduling and Replication in Data Grids. LNCS 3296, pp: 375-385.
11. Kosar, T. and M. Livny, 2004. Stork: Making Data Placement a First Class Citizen in the Grid. In: Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS 2004), Tokyo, Japan, pp: 342-349.
12. Sashi, K. and A. Selvadoss Thanamani, 2010. Dynamic Replica Management for Data Grid. IACSIT International Journal of Engineering and Technology, 2(4).
13. Pérez, J.M., F. García-Carballeira, J. Carretero, A. Calderón and J. Fernández, 2009. Branch Replication Scheme: A New Model for Data Replication in Large Scale Data Grids. Future Generation Computer Systems, Elsevier.
14. Jin, H., T. Cortes and R. Buyya (Eds.), 2002. High Performance Mass Storage and Parallel I/O: Technologies and Applications. IEEE Press and Wiley.
15. Pérez, J.M., F. García, J. Carretero, A. Calderón and J. Fernández, 2004. A Parallel I/O Middleware to Integrate Heterogeneous Storage Resources on Grids. In: Grid Computing: First European Across Grids Conference, Santiago de Compostela, Spain, LNCS 2970, pp: 124-131.
16. García-Carballeira, F., J. Carretero, A. Calderón, J.D. García and L.M. Sánchez, 2007. A Global and Parallel File System for Grids. Future Generation Computer Systems, 23(1): 116-122.
17. Park, S.-M., J.-H. Kim, Y.-B. Ko and W.-S. Yoon, 2004. Dynamic Data Replication Strategy Based on Internet Hierarchy BHR. LNCS 3033, Springer-Verlag, Heidelberg, pp: 838-846.
18. Sashi, K. and A. Selvadoss Thanamani, 2011. Dynamic Replication in a Data Grid Using a Modified BHR Region Based Algorithm. Future Generation Computer Systems, Elsevier, 27: 202-210.
19. Sepahvand, R., A. Horri and Gh. Dastghaiby fard, 2008. A New 3-Layer Replication Scheduling Strategy in Data Grid. IEEE.
20. Dogan, A., 2009. A Study on Performance of Dynamic File Replication Algorithms for Real-Time File Access in Data Grids. Future Generation Computer Systems, Elsevier, 25: 829-839.
21. Khanli, L.M., A. Isazadeh and T.N. Shishavan, 2010. PHFS: A Dynamic Replication Method to Decrease Access Latency in the Multi-Tier Data Grid. Future Generation Computer Systems, Elsevier.
22. Rahman, R.M., R. Alhajj and K. Barker, 2008. Replica Selection Strategies in Data Grid. Journal of Parallel and Distributed Computing, Elsevier.
23. Cameron, D.G., R. Carvajal-Schiaffino, A.P. Millar, C. Nicholson, K. Stockinger and F. Zini, 2004. OptorSim: A Simulation Tool for Scheduling and Replica Optimization in Data Grids. In: International Conference for Computing in High Energy and Nuclear Physics (CHEP 2004), Interlaken.
24. OptorSim - A Replica Optimiser Simulation: http://edgwp2.web.cern.ch/edgwp2/optimization/optorsim.html
25. EU DataGrid project: http://www.eu-egee.org/.
26. Hoschek, W., F.J. Jaen-Martinez, A. Samar, H. Stockinger and K. Stockinger, 2000. Data Management in an International Data Grid Project. In: Proceedings of the First IEEE/ACM International Workshop on Grid Computing (GRID '00), LNCS 1971, Bangalore, India, pp: 77-90.
27. Cameron, D.G., R. Carvajal-Schiaffino, A.P. Millar, C. Nicholson, K. Stockinger and F. Zini, 2003. Evaluating Scheduling and Replica Optimisation Strategies in OptorSim. In: 4th International Workshop on Grid Computing (Grid2003), Phoenix, Arizona, IEEE Computer Society Press.