WORK recently began at Lawrence Livermore National Laboratory and sister Department of Energy laboratories on the Accelerated Strategic Computing Initiative (ASCI), one of the largest supercomputing projects of all time. A major component of DOE's science-based Stockpile Stewardship and Management Program, ASCI's computational modeling and simulation capabilities will be used to assess the safety, security, and reliability of our nuclear stockpile.
ASCI's systems will soon be performing a trillion (tera, or 10^12) floating-point operations per second (flops) and will require memories of tens of trillions of bytes, well beyond the range of existing supercomputers. By 2004, ASCI systems will be performing in the 100-teraflops range. These machines will require storage systems that may be called on to hold a quintillion (10^18) bytes, an exabyte, which is more than ten thousand times the capacity of today's supercomputing storage systems. In addition, the transfer rates between these massive processing and storage systems will have to be on the order of tens to hundreds of billions of bytes per second. Achieving a balance between supercomputer processing and memory capacities and storage capabilities is critical not only to the success of ASCI but also to other high-end applications in scientific modeling, data collection, and multimedia.
In recognition of this coming need and of the long-term effort required to achieve this balance, the National Storage Laboratory (NSL) was established in 1992 to develop, demonstrate, and commercialize technology for storage systems that serve even the most demanding supercomputers and high-speed networks. The NSL consisted of an advanced storage hardware testbed at Livermore and distributed software development partners. It involved more than 20 participants from industry, the Department of Energy, other federal laboratories, universities, and National Science Foundation supercomputer centers. The collaboration was based on the premise that no single organization could resolve, in a timely manner, all of the system-level issues standing in the way of significant advances in high-performance storage technology. Lawrence Livermore and its sister DOE laboratories play leadership roles in developing high-performance storage systems because of their long history of development and innovation in high-end computing, of which storage is a critical component, in support of their national defense and scientific missions.

High-Performance Storage Systems
The High-Performance Storage System (HPSS) software development project grew out of NSL work. A major requirement for HPSS was that it be "scalable" in several dimensions: it had to allow huge capacities and transfer rates and support many distributed systems and users. The system also had to be reliable, secure, portable to many computing platforms, and manageable by a small staff.
Work completed by the NSL had shown that HPSS could be successful only if it were based on a network-centered design. Large-scale data storage has typically been handled by general-purpose computers acting as storage servers that connect to storage units such as disks and tapes (see figure below). The servers act as intermediaries, passing data to client systems such as workstations or supercomputers on their network. As the data rates and capacities of storage devices and communications links increase, the storage server must also grow to provide the required capacity and total throughput. These demands tend to drive the storage server into the mainframe class, which is expensive to purchase and maintain and has limited scalability.
[Figure: a conventional storage architecture, in which a general-purpose storage server sits in the data path between the storage devices and the client systems.]
If the storage software and storage devices are instead distributed over a network, control of the storage system can be separated from the flow of data (see figure below). Removing the server from the data path eliminates the bottleneck, allowing faster data transmission and scalable performance and capacity. Workstation-class systems used as storage servers provide the high performance required and, in the bargain, reduce the cost of storage server hardware.
[Figure: a network-centered storage architecture, in which storage devices attach directly to the network and servers control transfers while data flows directly between devices and clients.]
Focus on the Network
Operating on a high-performance network, the High-Performance Storage System uses a variety of cooperating distributed servers to manage and move data stored on devices attached directly to the network. HPSS is designed to allow data to be transferred directly from one or more disk or tape controllers to a client once an HPSS server has established a transfer session. Its interfaces support parallel or sequential access to storage devices by clients executing parallel or sequential applications. HPSS can even manage transfers in which the numbers of data sources and destinations differ. Parallel data transfer is vital for fast access to very large files and for reaching the high transfer rates of present and future supercomputers.
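This transfer pattern can be sketched in a few lines of Python. The names below (Mover, ControlServer, open_session, fetch) are illustrative stand-ins, not HPSS interfaces; the sketch only shows the idea of a server brokering a transfer session while the data itself moves directly, and in parallel, from the storage devices to the client.

    # A toy sketch of a brokered, parallel transfer. The class and function
    # names are invented for illustration; they are not HPSS interfaces.
    from concurrent.futures import ThreadPoolExecutor

    class Mover:
        """Stands in for a disk or tape controller attached to the network."""
        def __init__(self, stripes):
            self.stripes = stripes                # stripe index -> bytes

        def read(self, index):
            return self.stripes[index]

    class ControlServer:
        """Knows where every stripe of a file lives; never touches the data."""
        def __init__(self, catalog):
            self.catalog = catalog                # file name -> [(mover, stripe index), ...]

        def open_session(self, name):
            # Control traffic only: hand the client the file's layout so it
            # can go straight to the movers.
            return self.catalog[name]

    def fetch(server, name):
        layout = server.open_session(name)
        # Data traffic: read the stripes in parallel, directly from the movers.
        with ThreadPoolExecutor(max_workers=len(layout)) as pool:
            parts = pool.map(lambda entry: entry[0].read(entry[1]), layout)
        return b"".join(parts)

    # Two movers each hold part of one file; the client reassembles it.
    m0 = Mover({0: b"simulation ", 2: b"striped "})
    m1 = Mover({1: b"results, ", 3: b"over the network"})
    server = ControlServer({"run42.dat": [(m0, 0), (m1, 1), (m0, 2), (m1, 3)]})
    print(fetch(server, "run42.dat"))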
All aspects of HPSS are scalable so that the storage system can grow incrementally as user needs increase. The parallel nature of HPSS is one key to its scalability. For example, if a system has a storage device that can deliver 100 megabytes (100 million bytes) per second but a gigabyte (a billion bytes) per second is needed, then 10 devices in parallel, controlled by HPSS software, can be used to "scale up" to the new requirement. With this design, HPSS will be able to handle almost unlimited storage capacity, data transfer rates of billions of bytes per second and beyond, virtually unlimited file sizes, millions of naming directories, and hundreds to thousands of simultaneous clients.
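The arithmetic behind that example can be worked directly. The short Python sketch below is a plain illustration, not HPSS code; it simply counts how many 100-megabyte-per-second devices must be striped in parallel to reach a gigabyte per second.

    # Back-of-the-envelope scaling: how many devices must be striped in
    # parallel to reach the required aggregate transfer rate?
    import math

    device_rate = 100e6     # one device delivers 100 million bytes per second
    target_rate = 1e9       # the application needs a billion bytes per second

    devices_needed = math.ceil(target_rate / device_rate)
    aggregate_rate = devices_needed * device_rate

    print(devices_needed)                       # 10 devices in parallel
    print(f"{aggregate_rate / 1e9:.1f} GB/s")   # 1.0 GB/s aggregate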
HPSS uses several mechanisms to ensure data reliability and integrity. An important one is the use of transactions, which are groups of operations that either all take place or do not take place at all. The problem with distributed servers working together on a common job is that one server may fail or be unable to do its part. Transactions ensure that either every server successfully completes its part of the job or the whole operation is aborted. Although transactional integrity is common in relational database management systems, it is new in storage systems.
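The all-or-nothing behavior can be illustrated with a small two-phase sketch (plain Python with hypothetical names, not the transaction machinery HPSS actually uses): every server first votes that it can do its part, and only if all of them agree is the work carried out.

    # Toy all-or-nothing transaction: commit only if every server can do its
    # part; otherwise nothing is done at all. Names are illustrative only.
    class Server:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def prepare(self, work):
            """Phase 1: promise to do the work, or refuse."""
            return self.healthy

        def commit(self, work):
            """Phase 2: actually carry out the work."""
            print(f"{self.name}: {work} done")

    def run_transaction(servers, work):
        # The job succeeds only if every server can complete its part.
        if all(server.prepare(work) for server in servers):
            for server in servers:
                server.commit(work)
            return True
        return False    # one server could not take part, so nothing happens

    servers = [Server("name server"), Server("disk mover"),
               Server("tape mover", healthy=False)]
    assert run_transaction(servers, "migrate file") is False   # aborted cleanly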
HPSS was designed to support a range of supercomputing client platforms, operate on many vendors' systems, and use industry-standard storage hardware. The basic infrastructure of HPSS is the Open Software Foundation's Distributed Computing Environment, chosen for its wide adoption among vendors and near-universal acceptance in the computer industry. The HPSS code is also available to vendors and users for porting HPSS to new platforms.
The principal HPSS development partners are IBM Worldwide Government Industry and four national laboratories--Lawrence Livermore, Los Alamos, Oak Ridge, and Sandia. There have been two releases of HPSS thus far, and IBM is marketing the system commercially. HPSS has already been adopted by the California Institute of Technology/Jet Propulsion Laboratory, Cornell Theory Center, Fermi National Accelerator Laboratory, Maui High-Performance Computer Center, NASA Langley Research Center, San Diego Supercomputer Center, and the University of Washington, as well as by the participating Department of Energy laboratories.
In combination with computers that can produce and manipulate huge amounts of data at ever-increasing rates, HPSS's scalable, parallel, network-based design gives users the capability to solve problems that could not be tackled before. As computing capacity and memory grow, HPSS will evolve to meet the demand.

--Katie Walter

Key Words: computer network, hierarchical storage management, large-scale computer storage, parallel computing, supercomputing.

For further information, contact Dick Watson (510) 422-9216 (dwatson@llnl.gov) or visit the HPSS Internet home page at http://www.sdsc.edu/hpss/.

