TUTORIAL T004: Scientific Data Processing with the Pegasus Workflow Management System

Tutorial 2-S

When

3:10 to 5 p.m., Nov. 5, 2023

Workflows are a key technology for enabling complex scientific computations. They capture the interdependencies between processing steps in data analysis and simulation pipelines, as well as the mechanisms to execute those steps reliably and efficiently. Workflows can capture complex processes, promote sharing and reuse, and provide the provenance information necessary for verifying scientific results and for reproducibility.

Pegasus (https://pegasus.isi.edu) is used in a number of scientific domains to do production-grade science. In 2016, the LIGO gravitational wave experiment used Pegasus to analyze instrumental data and confirm the first detection of a gravitational wave. The Southern California Earthquake Center (SCEC), based at the University of Southern California (USC), uses a Pegasus-managed workflow infrastructure called CyberShake to generate hazard maps for the Southern California region. In 2019, SCEC completed the largest CyberShake study to date, producing the first physics-based probabilistic seismic hazard analysis (PSHA) maps for the Northern California region. Using Pegasus, they ran CyberShake workflows on three systems: HPC at USC, Blue Waters at the National Center for Supercomputing Applications (NCSA), and Titan at the Oak Ridge Leadership Computing Facility (OLCF), consuming about 120 million core hours of compute time. Pegasus orchestrated the execution of over 18,000 remote jobs using Globus GRAM, rvGAHP, and Condor glideins, and transferred over 150 TB of data between the three systems. Pegasus is also used in astronomy, bioinformatics, civil engineering, climate modeling, earthquake science, molecular dynamics, and other complex analyses.

In 2020, we released Pegasus 5.0, a major improvement over previous releases. Pegasus 5.0 provides a brand-new Python3 workflow API, developed from the ground up, that, in addition to generating the abstract workflow and all the catalogs, lets you plan, submit, monitor, and analyze your workflow and generate statistics for it. Since 2022, Pegasus has also been a key part of the ACCESS support strategy (https://support.access-ci.org/pegasus).
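To give a sense of what this looks like, below is a minimal sketch of a one-job workflow written with the Pegasus 5.0 Python API; the executable path, site names, and file names are placeholder values chosen for illustration, not part of the tutorial materials.

from Pegasus.api import *

# Describe the executable (transformation) the workflow will run.
# The site name and path below are placeholders.
tc = TransformationCatalog()
wc = Transformation(
    "wordcount", site="condorpool", pfn="/usr/bin/wordcount", is_stageable=False
)
tc.add_transformations(wc)

# Register the location of the input data (placeholder path).
rc = ReplicaCatalog()
rc.add_replica("local", "input.txt", "/home/user/input.txt")

# Build the abstract workflow: one job that reads input.txt and writes counts.txt.
wf = Workflow("wordcount-workflow")
in_file = File("input.txt")
out_file = File("counts.txt")
job = (
    Job(wc)
    .add_args("-i", in_file, "-o", out_file)
    .add_inputs(in_file)
    .add_outputs(out_file)
)
wf.add_jobs(job)
wf.add_transformation_catalog(tc)
wf.add_replica_catalog(rc)

# Plan and submit the workflow, wait for it to finish, then print summary statistics.
wf.plan(submit=True).wait()
wf.statistics()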

Primary learning objectives

The goal of the tutorial is to introduce application scientists to the benefits of modeling their pipelines in a portable way using scientific workflows combined with application containers. We will examine the workflow lifecycle at a high level, along with the issues and challenges associated with its various steps, such as creation, execution, monitoring, and debugging. Through hands-on exercises in a hosted Jupyter notebook environment, we will describe an application pipeline as a Pegasus workflow using the Pegasus workflow API and execute the pipeline on distributed computing infrastructures. Attendees will leave the tutorial knowing how to model their pipelines in a portable fashion as Pegasus workflows and run them on varied computing environments. The tutorial will also cover how to bundle application codes into containers and use those containers in workflows.
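As a rough illustration of the container support mentioned above, an application container can be declared in the transformation catalog and attached to a transformation; in this sketch the container name, image URL, site name, and executable path are placeholders.

from Pegasus.api import *

tc = TransformationCatalog()

# Declare a container; the image URL is a placeholder.
base = Container(
    "wordcount-container",
    Container.SINGULARITY,
    image="docker://example/wordcount:latest",
)
tc.add_containers(base)

# A transformation that runs inside that container; the pfn refers to
# the executable inside the container image.
wc = Transformation(
    "wordcount",
    site="condorpool",
    pfn="/usr/local/bin/wordcount",
    is_stageable=False,
    container=base,
)
tc.add_transformations(wc)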

Contacts

Karan Vahi, USC Information Sciences Institute
Mats Rynge, USC Information Sciences Institute