19 July 2010 Designing a high-availability cluster for the Subaru Telescope second generation observation control system
Eric Jeschke, Takeshi Inagaki
Abstract
Subaru Telescope is commissioning a second-generation Observation Control System (OCS), building upon a 10-year history of using the first-generation OCS. One of the primary lessons learned about maintaining a distributed OCS is that dedicating individual computer nodes to specific functions greatly complicates troubleshooting and failover, even with a dedicated "hot spare" for each specialized node. In contrast, the Generation 2 (Gen2) system was designed from the ground up around the principle of a High-Availability (HA) cluster, of the kind commonly used for high-traffic, mission-critical web sites. In such a cluster, nodes are not specialized, and any node can perform any function of the OCS. We describe the problems encountered in troubleshooting and managing failures on the legacy OCS, and we describe the architectural design of the HA cluster for the new system, including special characteristics designed for the high-altitude, remote environment of the summit of Mauna Kea, where the probability of such failures is greatly increased. Although the focus is primarily on the hardware, we touch upon the software architecture written to take advantage of the features of the HA cluster design. Finally, we outline the advantages of the new system and show how the design improves robustness and greatly simplifies troubleshooting and failure management. The results may be of interest to anyone designing a distributed system that uses COTS hardware and open-source software to withstand failure and improve manageability in a remote environment.
© (2010) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Eric Jeschke and Takeshi Inagaki "Designing a high-availability cluster for the Subaru Telescope second generation observation control system", Proc. SPIE 7740, Software and Cyberinfrastructure for Astronomy, 77400S (19 July 2010); https://doi.org/10.1117/12.856512
CITATIONS
Cited by 1 scholarly publication.
KEYWORDS
Control systems design

Telescopes

Optical instrument design

Data backup

Distributed computing

Failure analysis

Astronomy

