Distributed Computing Infrastructure Working Group

During the workshop this group was led by Al Geist. Participants included Michel Jaunin, Martin Frey, Guy Cormier, and Juan Meza. Notes are courtesy of Al Geist.

1. Describe where we are
    resource allocation and user validation
    scheduling
    partitioning of the system
    checkpoint/restart at system level
    disk quotas, archiving, migration
    External media
    statistics accounting
    performance monitoring & tuning
    security

    Which of these are (critical, necessary, useful)
    Which are: (included in OS, add-on to OS, 3rd party, develop on our own)
    What is Distributed Computing? Heterogeneous environment, spanning administrative domains
    Which are unique to Distributed Computing?

2. Identify problems
    All solutions need to span Unix and NT domains – heterogeneity in general
    System Admin Tools that span the whole domain
    add user – change quotas, modify resource pool, ps/kill
    Meta-scheduling – coupling existing schedulers, locally developed scheduler run at all sites
    Fault Tolerance – automatic detection, recovery/repair, notification
    Common Program Development Environment – common set of tools and libraries. (Apps Group)
    Heterogeneity between sites, e.g. C compilers, debuggers, …
    Conferencing – commercial products exist
    Notebooks – useful tool, just set it up, not a big issue

3. Strategies to eliminate problems

System Administration Tools
Web-based resource management/monitoring tools

add / change quotas for user – for system-wide access
modify resource pool – local resources only
user-accessible features: status of resources, my quotas, status of job, list processes, kill job

Meta-scheduling

existing tools – LSF, Condor, DQS, LoadLeveler (limited results from use with large MPP, NT)
coupling existing schedulers – (like Condor “flock”) home-grown and vendor schedulers.
1. Initially define an interface “file” of available-willing-to-share resources (GUSTO approach).
2. Longer term, develop a distributed broker/negotiation tool – no one site can have control of the meta-scheduler. (Harness distributed symmetric control could be useful.)
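
The interface "file" in step 1 could look something like the sketch below. This is only an illustration: the JSON layout, the field names (site, scheduler, os, cpus_offered), the site names, and the helper functions are all assumptions made for this example and are not the GUSTO format. Each site publishes what it is willing to share, and a simple lookup picks the sites that could take a request.

```python
import json

# Hypothetical interface "file": one entry per site listing what it is
# willing to share. Field names and values are assumptions, not a GUSTO format.
SHARED_RESOURCES = """
[
  {"site": "site-a", "scheduler": "LSF",         "os": "unix", "cpus_offered": 64},
  {"site": "site-b", "scheduler": "Condor",      "os": "nt",   "cpus_offered": 16},
  {"site": "site-c", "scheduler": "LoadLeveler", "os": "unix", "cpus_offered": 128}
]
"""


def load_offers(text):
    """Parse the interface file into a list of offer dictionaries."""
    return json.loads(text)


def candidate_sites(offers, cpus_needed, os_required=None):
    """Return the names of sites whose offer covers the request.

    Sites are ordered by the size of their offer, smallest sufficient
    offer first, so large pools are not consumed by small jobs.
    """
    matches = [
        offer for offer in offers
        if offer["cpus_offered"] >= cpus_needed
        and (os_required is None or offer["os"] == os_required)
    ]
    matches.sort(key=lambda offer: offer["cpus_offered"])
    return [offer["site"] for offer in matches]


if __name__ == "__main__":
    offers = load_offers(SHARED_RESOURCES)
    print(candidate_sites(offers, cpus_needed=32))
    print(candidate_sites(offers, cpus_needed=8, os_required="nt"))
```

A static file like this keeps every site in control of its own offer, which fits the requirement that no one site controls the meta-scheduler; the longer-term broker would replace the lookup with negotiation between sites.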

Fault Tolerance

detection (the Lilith tool could be used to help build this); could be integrated with scheduling software
- What to detect? Network/CPU/Disk/Process/Host
- Heartbeat (GUSTO approach, but it's coarse-grained)
- Longer term develop fault monitor daemon hierarchy – machine-site-enterprise that can survive failures and notify the hierarchy of all detections.
recovery (run in degraded mode) local policies
repair (recover to full operation) hot-swap hardware, restart application, replace monitor
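
The heartbeat detection and the machine-site-enterprise notification path could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class names (HeartbeatMonitor, MonitorNode), the timeout value, and the host names are hypothetical, timestamps are passed in explicitly rather than read from a clock, and nothing here reflects how GUSTO or Lilith actually work. It also does not cover the harder requirement that the hierarchy itself survive failures.

```python
class MonitorNode:
    """One level of the hypothetical monitor hierarchy (machine, site, enterprise)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.received = []  # detections seen at this level

    def report(self, event):
        """Record a detection here and pass it up to every ancestor."""
        node = self
        while node is not None:
            node.received.append(event)
            node = node.parent


class HeartbeatMonitor:
    """Flags hosts whose last heartbeat is older than timeout_s seconds."""

    def __init__(self, timeout_s, node):
        self.timeout_s = timeout_s
        self.node = node        # where detections are reported
        self.last_seen = {}     # host -> time of last heartbeat

    def beat(self, host, now):
        """Record a heartbeat from host at time now (seconds)."""
        self.last_seen[host] = now

    def check(self, now):
        """Return hosts that have gone silent and notify the hierarchy."""
        failed = sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
        for host in failed:
            self.node.report(("host-silent", host))
        return failed


if __name__ == "__main__":
    enterprise = MonitorNode("enterprise")
    site = MonitorNode("site-a", parent=enterprise)
    monitor = HeartbeatMonitor(timeout_s=30, node=site)

    monitor.beat("node1", now=0)
    monitor.beat("node2", now=0)
    monitor.beat("node1", now=40)   # node2 stays silent
    print(monitor.check(now=45))
    print(enterprise.received)
```

The timeout is what makes a heartbeat coarse-grained: a failure is only noticed after a full timeout has passed, and a silent host cannot be distinguished from a broken network link without the finer Network/CPU/Disk/Process checks listed above.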