BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250316T231111Z
LOCATION:B206
DTSTART;TZID=America/New_York:20241122T093000
DTEND;TZID=America/New_York:20241122T094500
UID:submissions.supercomputing.org_SC24_sess769_ws_adugai133@linklings.com
SUMMARY:Preparing Data at Scale: The Data Pipeline for AuroraGPT
DESCRIPTION:Workshop\n\nNicholas Chia and Clark Cucinell (Argonne National
  Laboratory (ANL)); Ian Foster (Argonne National Laboratory (ANL), Univers
 ity of Chicago); Tanwi Mallick and Robert R. Underwood (Argonne National L
 aboratory (ANL)); and Yadu Nand Babuji, Alexander Brace, Ozan Gokdemir, Ky
 le Hippe, Arham Khan, and Carlo Siebenschuh (University of Chicago)\n\nAur
 oraGPT seeks to test the hypothesis that a model trained on addition
 al science data and text will improve performance on scientific ta
 sks. Existing models such as PaLM, the predecessor to Google’s Gem
 ini model family, were trained on 770B tokens, of which only ∼1.9% wa
 s scientific text. To meet our goal, we seek to incorporate substan
 tially more scientific text.\n\nIn this presentation, we will share t
 he recent progr
 ess of the AuroraGPT Data Team, how we contribute to the project of bu
 ilding a science-focused LLM with AuroraGPT, how we collaborate wi
 th the other teams, and what topics we see as open questions. As th
 e data team, we are responsible for identifying, preparing, and ded
 uplicating scientific data and text. We will talk about the system
 s and data quality challenges that our team tackles to prepare ter
 abytes of scientific data and text to produce high-quality text an
 d data for training.\n\nTag: Artificial Intelligence/Machine Learnin
 g\n\nRegistration Category: Workshop Reg Pass\
 n\nSession Chairs: Charlie Catlett (Argonne National Laboratory (ANL), Uni
 versity of Chicago); Fabrizio Gagliardi (Barcelona Supercomputing Center (
 BSC), Association for Computing Machinery (ACM)); Neeraj Kumar (Pacific No
 rthwest National Laboratory (PNNL)); Satoshi Matsuoka (RIKEN, Tokyo Instit
 ute of Technology); Irina Rish (University of Montreal, Canada; Mila – Que
 bec AI Institute); Rick Stevens (Argonne National Laboratory (ANL), Univer
 sity of Chicago); and Valerie Taylor (Argonne National Laboratory (ANL), U
 niversity of Chicago)
END:VEVENT
END:VCALENDAR
