TY - GEN
T1 - A resource manager for optimal resource selection and fault tolerance service in grids
AU - Lee, Hwa Min
AU - Chin, Sung Ho
AU - Lee, Jong Hyuk
AU - Lee, Dae Won
AU - Chung, Kwang Sik
AU - Jung, Soon Young
AU - Yu, Heon Chang
PY - 2004
Y1 - 2004
N2 - In this paper, we address the issues of resource management and fault tolerance in Grids. In Grids, the state of the resources selected for job execution is a primary factor that determines computing performance. Specifically, we propose a resource manager for optimal resource selection. The resource manager automatically selects the optimal resources among candidate resources using a genetic algorithm. Typically, the probability of failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids, and grid services are often expected to meet some minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service to satisfy QoS requirements. We extend the definition of failures, such as process failure, processor failure, and network failure, and design the fault detector and fault manager. The simulation results indicate that our approaches are promising in that (1) our resource manager finds the optimal set of resources that guarantees the optimal performance, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and improves the performance of job execution through job migration even if some failures happen.
AB - In this paper, we address the issues of resource management and fault tolerance in Grids. In Grids, the state of the resources selected for job execution is a primary factor that determines computing performance. Specifically, we propose a resource manager for optimal resource selection. The resource manager automatically selects the optimal resources among candidate resources using a genetic algorithm. Typically, the probability of failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids, and grid services are often expected to meet some minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service to satisfy QoS requirements. We extend the definition of failures, such as process failure, processor failure, and network failure, and design the fault detector and fault manager. The simulation results indicate that our approaches are promising in that (1) our resource manager finds the optimal set of resources that guarantees the optimal performance, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and improves the performance of job execution through job migration even if some failures happen.
UR - http://www.scopus.com/inward/record.url?scp=4544223732&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:4544223732
SN - 078038430X
T3 - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
SP - 572
EP - 579
BT - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
T2 - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
Y2 - 19 April 2004 through 22 April 2004
ER -