Roadmap

On completion of this milestone, we should have the following:

  • a record of the types of faults that are most critical to resilience on our representative system (e.g., Red Storm),
  • proposed responses for each type of fault -- not all of these responses may require an application restart, and
  • a particular fault that we want to address as a demonstration by the end of FY'09.

Milestone: Related Work

9 years late (01/01/09 00:00:00)

Progress: 0%. Tickets: 0 closed, 1 active, 1 total.

Enumerate related projects, perform background reading, and provide detailed descriptions of those projects and of any relevant work that we should be aware of or can leverage for our own use. This documentation should be presented on the 9lives wiki page.

Milestone: Diskless State Demo

9 years late (10/01/09 00:00:00)

The goal of this milestone is to demonstrate the ability to manage application state in an overlay network with minimal access to persistent storage. This milestone requires the following sub-tasks:

  • Identify state that needs to be managed and extracted,
  • Research and design data structures and algorithms for diskless management,
  • Compression techniques?
  • Design application/system interface for reading/writing state,
  • Implement an LWFS service to manage state,
  • Run experiments.
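
As a rough illustration of the sub-tasks above, the sketch below keeps compressed copies of each node's state in the memory of neighboring nodes rather than on disk. Everything here is an assumption made for illustration: the class name DisklessStore, the ring-style placement of replicas, and the use of zlib for compression are hypothetical and are not the LWFS service or its interface.

```python
import zlib


class DisklessStore:
    """Toy model: state lives only in the memory of peer nodes (no disk)."""

    def __init__(self, num_nodes: int, replicas: int = 2):
        # Replicas go to *other* nodes, so we need fewer replicas than nodes.
        if not 1 <= replicas < num_nodes:
            raise ValueError("replicas must be in [1, num_nodes - 1]")
        self.num_nodes = num_nodes
        self.replicas = replicas
        # node id -> {(owner, key): compressed state}; a missing id means the node failed
        self.memory = {n: {} for n in range(num_nodes)}

    def _holders(self, owner: int) -> list[int]:
        # Hypothetical placement: the next `replicas` nodes on a ring.
        return [(owner + i) % self.num_nodes for i in range(1, self.replicas + 1)]

    def write(self, owner: int, key: str, state: bytes) -> None:
        blob = zlib.compress(state)
        for node in self._holders(owner):
            if node in self.memory:  # skip holders that have already failed
                self.memory[node][(owner, key)] = blob

    def fail(self, node: int) -> None:
        # A failed node loses everything held in its memory.
        self.memory.pop(node, None)

    def read(self, owner: int, key: str) -> bytes:
        for node in self._holders(owner):
            blob = self.memory.get(node, {}).get((owner, key))
            if blob is not None:
                return zlib.decompress(blob)
        raise KeyError(f"no surviving copy of {key!r} for node {owner}")


store = DisklessStore(num_nodes=4, replicas=2)
store.write(owner=0, key="step-10", state=b"application state " * 64)
store.fail(0)  # the owner fails
store.fail(1)  # one of the two replica holders fails as well
recovered = store.read(owner=0, key="step-10")
```

With two replicas the state survives the loss of the owner and one holder. How many copies to keep, and whether compression pays for itself, are the kinds of questions the experiments would need to answer.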

Milestone: Fake RAS Demo

9 years late (10/01/09 00:00:00)

The goal of this milestone is to develop a fake RAS system for testing the functionality of our response system. The fake system must support the list of faults identified in the milestone "Enumerate types of faults, likelihoods, and possible response (Stearley)".
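
One way to picture such a fake RAS system is as a small event injector that delivers synthetic fault events to registered response handlers. The sketch below is only an assumption about its shape: the names FaultEvent and FakeRAS, and the fault kinds "node_down" and "link_error", are hypothetical placeholders, since the actual fault list comes from the other milestone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FaultEvent:
    kind: str         # fault type, e.g. "node_down" (placeholder name)
    node: int         # node the fault is reported against
    detail: str = ""  # free-form description


class FakeRAS:
    """Delivers synthetic fault events to registered response handlers."""

    def __init__(self, supported_kinds):
        self.supported = set(supported_kinds)
        self.handlers: dict[str, list[Callable[[FaultEvent], str]]] = {}
        self.log: list[FaultEvent] = []  # every injected event, for later inspection

    def subscribe(self, kind: str, handler: Callable[[FaultEvent], str]) -> None:
        if kind not in self.supported:
            raise ValueError(f"unsupported fault kind: {kind}")
        self.handlers.setdefault(kind, []).append(handler)

    def inject(self, event: FaultEvent) -> list[str]:
        """Record the event and return the result of each handler's response."""
        if event.kind not in self.supported:
            raise ValueError(f"unsupported fault kind: {event.kind}")
        self.log.append(event)
        return [handler(event) for handler in self.handlers.get(event.kind, [])]


ras = FakeRAS(["node_down", "link_error"])
ras.subscribe("node_down", lambda e: f"remap node {e.node} to a spare")
responses = ras.inject(FaultEvent(kind="node_down", node=17))
```

Keeping a log of injected events makes it possible to check afterwards that every fault type in the list was exercised.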

The goal is to evaluate memory usage of a number of selected MPP applications. This milestone will require the following sub-tasks:

  • design/development of code to trace application memory usage on Kitten
    • mark pages that change between checkpoints
    • rsync-like (checksum) method for identifying changes
    • what about pages touched by the network (do we need network dirty bit)?
    • what about when multiple cores share pages?
  • selection of appropriate applications
  • instrumentation of code (if necessary)
  • performance experiments and analysis
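
The rsync-like (checksum) option in the list above can be sketched as follows: compute a digest per page at each checkpoint and compare the digests to find the pages that changed. This is a user-level illustration only, not Kitten code. The 4096-byte page size, the choice of BLAKE2b as the digest, and the function names are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_digests(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Return one digest per page of the given memory image."""
    return [
        hashlib.blake2b(memory[off:off + page_size], digest_size=16).digest()
        for off in range(0, len(memory), page_size)
    ]


def changed_pages(prev: list[bytes], curr: list[bytes]) -> list[int]:
    """Return the indices of pages whose digest differs between two checkpoints."""
    changed = [i for i, (a, b) in enumerate(zip(prev, curr)) if a != b]
    # Pages present in only one of the two images count as changed.
    changed.extend(range(min(len(prev), len(curr)), max(len(prev), len(curr))))
    return changed


image = bytearray(4 * PAGE_SIZE)       # four zero-filled pages
checkpoint_1 = page_digests(bytes(image))
image[2 * PAGE_SIZE + 7] = 0xFF        # touch one byte in page 2
checkpoint_2 = page_digests(bytes(image))
print(changed_pages(checkpoint_1, checkpoint_2))  # [2]
```

Unlike a hardware dirty bit, this approach also catches pages modified by the network or by another core, at the cost of reading and hashing every page at each checkpoint.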

The goal of this milestone is to demonstrate the ability to suspend (quiesce) an application's activity on an MPP. This milestone requires the following sub-tasks:

  • research the viable options for quiescence on an MPP using a LWK,
  • identify all services that need to be suspended and identify problem areas,
  • design modifications or new features into the OS to support quiescence,
  • develop the prototype,
  • design a simple experiment to demonstrate functionality,
  • perform experiments

Look at what Cray did for BLCR on CNL.

Milestone: Virtualization Demo

9 years late (10/01/09 00:00:00)

The goal of this milestone is to demonstrate the ability of one node to assume the identity of another node. For Portals, this requires an abstraction of the NID.
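
A minimal sketch of what an abstraction of the NID could look like: applications address peers by a virtual NID, and a table resolves it to whichever physical node currently holds that identity. The NidTable class and its methods are hypothetical and are not part of the Portals API.

```python
class NidTable:
    """Maps virtual NIDs (seen by applications) to physical NIDs (actual nodes)."""

    def __init__(self, physical_nids):
        # Initially every node answers to its own NID.
        self._virtual_to_physical = {nid: nid for nid in physical_nids}

    def resolve(self, virtual_nid: int) -> int:
        """Return the physical NID currently serving this virtual NID."""
        return self._virtual_to_physical[virtual_nid]

    def assume_identity(self, virtual_nid: int, new_physical_nid: int) -> None:
        """Let another node (e.g. a spare) take over an existing virtual NID."""
        if virtual_nid not in self._virtual_to_physical:
            raise KeyError(f"unknown virtual NID {virtual_nid}")
        self._virtual_to_physical[virtual_nid] = new_physical_nid


table = NidTable([10, 11, 12])
table.assume_identity(virtual_nid=11, new_physical_nid=99)  # spare node 99 replaces node 11
target = table.resolve(11)
```

Peers keep sending to virtual NID 11 and never need to learn that a different physical node now answers for it.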
