PhD Students' Day (Ziua doctoranzilor)
Advanced Techniques for Ensuring High Availability of Resources in Parallel and Distributed Systems

Eliana-Dina Tirsa, Valentin Cristea
eliana.tirsa@cs.pub.ro, valentin.cristea@cs.pub.ro
Computer Science Department, Politehnica University of Bucharest

Main topic: high availability in parallel and distributed systems

High availability:
- ability of a system to provide services to its users at a sufficient performance level, particularly in the presence of failures
- two orthogonal issues: ensuring the system's resilience to failures, and maintaining the performance of the services above an expected threshold

Research Approach:
- address high availability problems at the computation, communication and data levels, in several types of systems
- consider both system resilience and performance aspects

Research Contributions:
- fault tolerance mechanisms
- optimization algorithms (focused on performance metrics related to high availability, e.g. response time)
- design of fault tolerant architectures (also used to validate some of the proposed mechanisms and algorithms)

Repository Replication and Synchronization in MonALISA Distributed Monitoring Framework

Problem: repository failures leave some periods without monitoring data

Solution: repository replication system
- small number of geographically distributed replicas
- active replication (all replicas subscribe to the same parameters)
- low synchronization cost: synchronization is needed only when recovering from a failure
- fault tolerant load balancer
- load balancing mechanism uses the monitored states of the replicas

Challenges:
- replicas in different time zones
- keeping track of gap intervals and of the last update
- synchronizing from many sources
- false positives (monitored service down, or existing data with value 0)

Fault tolerant P2P architecture for efficient multidimensional range search

Communication Reliability

1. Scalable P2P communication architecture
- structured peer-to-peer topology based on local interactions between peers
- enhanced reliability: routing messages over backup/alternate paths
- improved throughput: routing messages simultaneously over multiple paths
- node and object identifiers mapped in a multidimensional geometric space
- support for dynamic node arrivals and departures
- peers perform (extended) geometrical routing
- replicated objects
- load aware topology based on local decisions only (periodically, nodes change identifiers in order to balance load, and objects change node owners)

2. Algorithms for computing backup shortest paths when the network topology is known in advance
- start by computing the shortest paths
- identify the generic structure of a backup shortest path and use it during a traversal of the tree of shortest paths in order to compute backup shortest paths

3. Real-time data transfer scheduling techniques
- algorithms and data structures for improving the response time

Lock Contention Advisor Tool
- applies to concurrent C programs
- based on a kernel probing tool
- intercepts futex system calls and builds a sorted list of contended locks
- uses both runtime information and static analysis to determine the variables accessed inside a lock
- makes suggestions according to the detected access patterns: lock splitting, moving code out of the locked region, readers-writer locks, etc.

Future Research Directions

1. Background (data and resource) replication strategies inside Clouds
2. Application assisted checkpoint-restart API and associated service

Towards XtreemOS in the Clouds
- ongoing work on extending a grid operating system (XtreemOS) with capabilities for managing virtual Cloud (Nimbus) resources provided on demand
- technical difficulties:
  - different authorization, authentication, and security mechanisms
  - resource volatility in Clouds (Grid schedulers are designed for less dynamic environments)
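
Illustrative sketch for the repository synchronization challenges listed earlier (replicas in different time zones, tracking gap intervals, synchronizing from many sources). This is not the MonALISA implementation: the function names, the half-open interval representation in UTC epoch seconds, and the greedy peer selection are all assumptions made for illustration. It shows one possible way a recovering replica could normalize timestamps to UTC, keep its gaps as disjoint intervals, and decide which peer should supply each missing piece.

```python
from datetime import datetime, timezone, timedelta

# Hypothetical representation: a gap is a half-open interval [start, end)
# expressed in UTC epoch seconds, so replicas in different time zones agree.
Interval = tuple[int, int]


def to_utc_epoch(stamp: datetime) -> int:
    """Convert a timezone-aware datetime to UTC epoch seconds."""
    if stamp.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return int(stamp.astimezone(timezone.utc).timestamp())


def merge(intervals: list[Interval]) -> list[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(interval: Interval, holes: list[Interval]) -> list[Interval]:
    """Return the parts of `interval` that no interval in `holes` covers."""
    start, end = interval
    remaining: list[Interval] = []
    for h_start, h_end in merge(holes):
        if h_end <= start or h_start >= end:
            continue  # this hole does not touch what is left of the interval
        if h_start > start:
            remaining.append((start, h_start))
        start = max(start, h_end)
        if start >= end:
            break
    if start < end:
        remaining.append((start, end))
    return remaining


def plan_sync(my_gaps: list[Interval], peer_gaps: dict[str, list[Interval]]):
    """Greedily assign each missing piece to the first peer that has it.

    Returns (plan, unrecoverable): plan is a list of (peer, interval) fetches,
    unrecoverable lists the intervals that every known peer is missing too.
    """
    plan: list[tuple[str, Interval]] = []
    unrecoverable: list[Interval] = []
    for gap in merge(my_gaps):
        pending = [gap]
        for peer in sorted(peer_gaps):
            still_missing: list[Interval] = []
            for piece in pending:
                available = subtract(piece, peer_gaps[peer])
                plan.extend((peer, part) for part in available)
                still_missing.extend(subtract(piece, available))
            pending = still_missing
            if not pending:
                break
        unrecoverable.extend(pending)
    return plan, unrecoverable
```

For example, a replica with the gap `(100, 200)` and two peers whose own gaps are `[(150, 300)]` and `[(0, 120), (180, 190)]` would fetch `(100, 150)` from the first peer and `(150, 180)` and `(190, 200)` from the second, leaving `(180, 190)` unrecoverable. Distinguishing a true gap from stored values of 0 (the false-positive challenge) is deliberately left out of this sketch.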