Research in astronomy is undergoing a major paradigm shift, transformed by
the advent of large, automated sky surveys into a data-rich field where
multi-TB- to PB-sized spatio-temporal data sets are commonplace. For example,
the Legacy Survey of Space and Time (LSST) is about to begin delivering
observations of >10^10 objects, including a database with >4 x 10^13 rows
of time series data. This data volume presents a challenge: how should a
domain scientist with little experience in data management or distributed
computing access data and perform analyses at PB scale?
We present a possible solution to this problem, built on (adapted)
industry-standard tools and made accessible through web gateways. We have i)
developed Astronomy eXtensions for Spark (AXS), a series of
astronomy-specific modifications to Apache Spark that allow astronomers to
tap into its computational scalability, ii) deployed datasets in
AXS-queryable format in Amazon S3, leveraging its I/O scalability, iii)
developed a deployment of Spark on Kubernetes with auto-scaling
configurations requiring no end-user interaction, and iv) provided a
web-accessible Jupyter notebook front-end via JupyterHub, including a rich
library of pre-installed, commonly used astronomical software (accessible at
http://hub.dirac.institute).
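To illustrate how these pieces fit together from the end user's perspective, the following is a minimal sketch of a session inside a hub-hosted notebook. The S3 bucket, dataset path, and column names are hypothetical placeholders, not the deployment's actual configuration; in the deployed system the Spark session is pre-configured to run against the auto-scaling Kubernetes cluster.

```python
# Minimal sketch of the end-user workflow inside a JupyterHub-hosted notebook.
# Bucket name, dataset path, and column names below are hypothetical
# placeholders; the deployed system pre-configures access to the catalogs
# and to the Kubernetes-backed, auto-scaling Spark cluster.
from pyspark.sql import SparkSession

# The Spark session is backed by executors scheduled on Kubernetes;
# no cluster configuration is required from the end user.
spark = (SparkSession.builder
         .appName("ztf-exploration")
         .getOrCreate())

# Catalogs are stored in S3 as Parquet in AXS-queryable (zone-sorted,
# bucketed) form, so they can also be read with standard Spark readers;
# AXS adds astronomy-specific operations on top of these DataFrames.
ztf = spark.read.parquet("s3a://example-bucket/catalogs/ztf/")  # hypothetical path

# Ordinary Spark operations scale out transparently, e.g. counting
# detections in a small patch of sky (assumed 'ra'/'dec' columns, degrees).
n = (ztf.filter((ztf.ra > 290.0) & (ztf.ra < 291.0) &
                (ztf.dec > 40.0) & (ztf.dec < 41.0))
        .count())
print(n)
```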
We use this system to enable the analysis of data from the Zwicky Transient
Facility, presently the closest precursor survey to the LSST, and discuss
initial results. To our knowledge, this is a first application of
cloud-based scalable analytics to astronomical datasets approaching
LSST-scale. The code is available at https://github.com/astronomy-commons.
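As a sketch of the kind of analysis the system supports (not one of the specific analyses reported here), the snippet below cross-matches a ZTF catalog against Gaia using AXS. The table names and the exact AXS call signatures (AxsCatalog, load, crossmatch and its radius units) are assumptions based on the AXS documentation and may differ in the deployed environment.

```python
# Sketch of an AXS-style positional cross-match in the hub environment.
# Table names and the AXS API surface used here are assumptions, not a
# verbatim excerpt of the deployed notebooks.
from pyspark.sql import SparkSession
from axs import AxsCatalog

spark = SparkSession.builder.appName("ztf-gaia-xmatch").getOrCreate()
catalog = AxsCatalog(spark)

ztf = catalog.load("ztf")        # assumed table name
gaia = catalog.load("gaia_dr2")  # assumed table name

# Cross-match within ~2 arcseconds (radius assumed to be in degrees);
# AXS executes this as a zone-bucketed join distributed across the cluster.
matched = ztf.crossmatch(gaia, r=2.0 / 3600.0)
print(matched.count())
```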