TUTORIAL T002: LSDB and HiPSCat: Joint Distributed Analysis of LSST-Scale Datasets

Tutorial 1-N

When

1 to 2:50 p.m., Nov. 5, 2023

The present decade will be marked by growth of large survey catalogs, both in their number and scale. Joint analysis of such catalogs has historically shown itself to be tremendously useful (e.g. enabling multi-wavelength or time-domain studies), with its importance likely to rise even further. Yet, with the increase in scale towards PBs of data, joint analysis – even at a catalog level – becomes a complex data management problem that few astronomers are equipped to tackle with present-day technology. Here we present HiPSCat, a format for efficient and queryable storage of large datasets, and LSDB (Large Survey DataBase), a Python framework that enables distributed cross-matching and analysis of astronomical datasets at LSST scale (O(10B) sources). The HiPSCat format - framework-independent and built as an extension of the well-known IVOA HiPS standard - provides intelligent (balanced) spatial partitioning and enables scalable serving of PB-scale datasets (via HTTP) using Parquet for efficient storage. The LSDB framework enables distributed computing and cross-matching on HiPSCat-formatted datasets. Leveraging broadly adopted community libraries such as astropy, Pandas, and Dask, LSDB presents a user-friendly API approachable to astronomers. The goal of LSDB is to enable the user to focus on the science aspects of their tasks, leaving the difficult data management aspects (distribution, resiliency) to the framework.

You can find the tutorial notebooks in the nb/ folder of https://github.com/swyatt7/ADASS_LSDB_tutorial. The pres/ folder contains a short presentation that gives a rough overview of HiPSCat for the first section of the tutorial.

Primary learning objectives:

  1. How LSDB can support your large-scale science: Out of the box, LSDB is designed to perform various spatial analyses (such as cone searches and cross-matching), along with time-series analysis (e.g. large-scale light-curve analysis). Because it is built on the dask.dataframe distributed framework, users can also define their own functions and map them across the catalogs. A user who wants to apply their own cross-match algorithm can tie it into the HiPSCat framework through the dask.dataframe library; we will provide documentation on how to do so.
  2. The strengths of the HiPSCat partitioning structure: It stores astronomical datasets in a way that equalizes the number of rows per partition, yet keeps spatially adjacent objects together. Once two catalogs are partitioned in the HiPSCat manner, distributed joint spatial analysis between them becomes straightforward.
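
The partitioning idea in objective 2 can be illustrated with a minimal sketch that uses only the Python standard library. This is not HiPSCat itself: HiPS is based on HEALPix sky tessellation, whereas this toy version simply quarters a flat RA/Dec rectangle until no tile exceeds a row threshold. The names partition and map_tiles, the threshold of 100 rows, and the synthetic catalog are all hypothetical and chosen for illustration only; map_tiles is a serial stand-in for the kind of per-partition mapping that dask.dataframe provides.

```python
import random


def partition(points, bounds=(0.0, 360.0, -90.0, 90.0),
              max_rows=100, depth=0, max_depth=12):
    """Quarter a sky rectangle until each tile holds at most max_rows points.

    Toy stand-in for adaptive spatial partitioning (assumption: a flat
    RA/Dec quadtree, not the HEALPix tiling that HiPS is based on).
    Returns a list of (bounds, points) tiles.
    """
    if len(points) <= max_rows or depth >= max_depth:
        return [(bounds, points)]
    ra_min, ra_max, dec_min, dec_max = bounds
    ra_mid = (ra_min + ra_max) / 2
    dec_mid = (dec_min + dec_max) / 2
    # Bucket each (ra, dec) point into one of the four child quadrants.
    children = {}
    for ra, dec in points:
        key = (ra >= ra_mid, dec >= dec_mid)
        children.setdefault(key, []).append((ra, dec))
    tiles = []
    for (east, north), subset in children.items():
        child_bounds = (
            ra_mid if east else ra_min,
            ra_max if east else ra_mid,
            dec_mid if north else dec_min,
            dec_max if north else dec_mid,
        )
        tiles.extend(partition(subset, child_bounds, max_rows,
                               depth + 1, max_depth))
    return tiles


def map_tiles(tiles, func):
    """Apply a user function to every tile independently (serial stand-in
    for a distributed per-partition map)."""
    return [func(points) for _, points in tiles]


rng = random.Random(42)
# Synthetic catalog: a sparse all-sky background plus one dense field.
# (Sampling Dec uniformly is not area-uniform on the sphere; fine for a toy.)
background = [(rng.uniform(0.0, 360.0), rng.uniform(-90.0, 90.0))
              for _ in range(500)]
dense_field = [(rng.gauss(150.0, 1.0), rng.gauss(2.0, 1.0))
               for _ in range(2000)]
catalog = background + dense_field

tiles = partition(catalog, max_rows=100)
rows_per_tile = map_tiles(tiles, len)
print(len(tiles), "tiles; largest holds", max(rows_per_tile), "rows")
```

The dense field ends up covered by many small tiles and the sparse background by a few large ones, so row counts stay bounded while neighbouring objects remain in the same tile. If a second catalog is split along the same tile boundaries, matching tiles can be compared pairwise and independently, which is why joint spatial analysis distributes well.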

Thank you to everyone interested in HiPSCat/LSDB! If you are here, you are excited about exploring LSST-sized catalog analysis with our Python libraries. Most of the infrastructure will be set up before the tutorial, so getting started should be essentially seamless.

 

INSTRUCTIONS FOR TUTORIAL T002 PARTICIPANTS

All you will need is:

  1. A computer with an internet browser + internet access   
  2. An account on the https://lsst.dirac.dev/ JupyterHub. To get an account, please send me (PG4gdWVycz0iem52eWdiOmZxamxuZ2dAaGoucnFoIj5mcWpsbmdnQGhqLnJxaDwvbj4=) an email with your GitHub username, or the email address associated with your GitHub account, and I will grant you access.

That's it! On the JupyterHub, users will spin up their own computing instances. The Python environment will already be set up there, along with the credentials needed to access the sample datasets in the cloud. The tutorial notebooks will be git-cloned from https://github.com/swyatt7/ADASS_LSDB_tutorial, and users will follow along from there.

Contacts

Samuel Wyatt, University of Washington