Main content



Loading wiki pages...

Wiki Version:
Research in astronomy is undergoing a major paradigm shift, transformed by the advent of large, automated, sky-surveys into a data-rich field where multi-TB to PB-sized spatio-temporal data sets are commonplace. For example the Legacy Survey of Space and Time; LSST) is about to begin delivering observations of >10^10 objects, including a database with >4 x 10^13 rows of time series data. This volume presents a challenge: how should a domain-scientist with little experience in data management or distributed computing access data and perform analyses at PB-scale? We present a possible solution to this problem built on (adapted) industry standard tools and made accessible through web gateways. We have i) developed Astronomy eXtensions for Spark, AXS, a series of astronomy-specific modifications to Apache Spark allowing astronomers to tap into its computational scalability ii) deployed datasets in AXS-queriable format in Amazon S3, leveraging its I/O scalability, iii) developed a deployment of Spark on Kubernetes with auto-scaling configurations requiring no end-user interaction, and iv) provided a Jupyter notebook, web-accessible, front-end via JupyterHub including a rich library of pre-installed common astronomical software (accessible at We use this system to enable the analysis of data from the Zwicky Transient Facility, presently the closest precursor survey to the LSST, and discuss initial results. To our knowledge, this is a first application of cloud-based scalable analytics to astronomical datasets approaching LSST-scale. The code is available at
OSF does not support the use of Internet Explorer. For optimal performance, please switch to another browser.
This website relies on cookies to help provide a better user experience. By clicking Accept or continuing to use the site, you agree. For more information, see our Privacy Policy and information on cookie use.

Start managing your projects on the OSF today.

Free and easy to use, the Open Science Framework supports the entire research lifecycle: planning, execution, reporting, archiving, and discovery.