Presentation

TaskVine: Managing In-Cluster Storage for High-Throughput Data Intensive Workflows
Description
Many scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.
Event Type
Workshop
Time
Monday, 13 November 2023, 10:25am - 10:43am MST
Location
704-706
Tags
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
Registration Categories
W