Title A Workflow Pipeline for Scientific Data Analysis
Abstract As detector resolution and speed increase, the amount of data that must be transferred and analysed also increases. This is especially the case for production and high-throughput beamlines, and for experimental stations that require prompt feedback. Such beamlines use automated acquisition software, sample changers and other experimental apparatus, all of which facilitate the creation of larger amounts of data. The ability to handle this avalanche of data easily and robustly is key to scientific discovery and insight. We present a workflow pipeline for scientific data analysis that helps address this concern. It uses an industry-standard messaging system for reliable task sequencing and triggering. Generic actors handle common tasks such as file transfers. Technique-specific analysis code is implemented in, or called from, custom actors that may be written in Java, C++ or Python. Experimental metadata and provenance information are stored along with raw and analysed data in a single HDF5 file that is manipulated by different stages of the pipeline. The system is deployed at the APS 2-BM-B and 8-ID-I beamlines. The tomography beamline located at 2-BM-B uses the pipeline to transfer data from detector computers to a cluster for GPU reconstruction. This beamline can produce over 10 TB of raw detector data per day and over 40 TB of reconstructed data per day. The X-ray photon correlation spectroscopy (XPCS) beamline at 8-ID-I uses the pipeline to move data from detector computers to a Hadoop distributed file system (HDFS) on a distributed-memory cluster for multi-tau analysis. The XPCS beamline can produce up to 2 TB of raw data per day.
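The actor-and-message pattern described above can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: an in-memory `queue.Queue` stands in for the messaging system (which the abstract does not name), the provenance record is a plain list carried in each message rather than an HDF5 file, and the names `Actor`, `TransferActor`, `AnalysisActor`, `run_pipeline`, `demo` and the topic strings are all hypothetical.

```python
import queue
import shutil
import tempfile
from pathlib import Path


class Actor:
    """Hypothetical base class: consume one message, return a follow-up message."""
    topic = None       # name of the queue this actor listens on
    next_topic = None  # name of the queue that receives its output

    def handle(self, message):
        raise NotImplementedError


class TransferActor(Actor):
    """Generic actor: copy a file to a destination directory."""
    topic = "acquired"
    next_topic = "transferred"

    def __init__(self, dest_dir):
        self.dest_dir = Path(dest_dir)

    def handle(self, message):
        src = Path(message["path"])
        dest = self.dest_dir / src.name
        shutil.copy2(src, dest)
        # Append this step to the provenance list carried by the message.
        return {"path": str(dest), "history": message["history"] + ["transfer"]}


class AnalysisActor(Actor):
    """Custom actor: placeholder for technique-specific analysis code."""
    topic = "transferred"
    next_topic = "done"

    def handle(self, message):
        size = Path(message["path"]).stat().st_size  # stand-in "analysis"
        return {
            "path": message["path"],
            "size_bytes": size,
            "history": message["history"] + ["analysis"],
        }


def run_pipeline(actors, first_message, start_topic="acquired", end_topic="done"):
    """Deliver messages between actors until no queue has pending work."""
    queues = {}

    def topic_queue(name):
        return queues.setdefault(name, queue.Queue())

    topic_queue(start_topic).put(first_message)
    progressed = True
    while progressed:
        progressed = False
        for actor in actors:
            inbox = topic_queue(actor.topic)
            while not inbox.empty():
                # Each completed task triggers the next stage via its output topic.
                topic_queue(actor.next_topic).put(actor.handle(inbox.get()))
                progressed = True
    finished = topic_queue(end_topic)
    return [finished.get() for _ in range(finished.qsize())]


def demo():
    """Run one fake detector file through transfer and analysis."""
    with tempfile.TemporaryDirectory() as tmp:
        detector = Path(tmp) / "detector"
        cluster = Path(tmp) / "cluster"
        detector.mkdir()
        cluster.mkdir()
        raw = detector / "scan_0001.dat"
        raw.write_bytes(b"\x00" * 1024)
        actors = [TransferActor(cluster), AnalysisActor()]
        return run_pipeline(actors, {"path": str(raw), "history": []})


if __name__ == "__main__":
    print(demo())
```

In this sketch each actor only knows the topic it reads from and the topic it writes to, so a new analysis stage could be added by registering another actor without changing the existing ones. A production deployment would presumably replace the in-memory queues with a persistent message broker so that tasks survive process restarts.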
