SciPy Conference 2021 Takeaways#

A couple weeks ago, I had an opportunity to (virtually) attend the Scientific Computing with Python (SciPy) Conference! The conference consisted of three primary sections:

  • Tutorials (Monday/Tuesday)

  • Talks (Wednesday - Friday)

  • Sprints (Friday - Sunday)

Tutorial topics ranged from machine learning, interactive dashboards, to distributed computing using Dask! These tutorials were four hours long, including a wide variety of speakers for each tutorial topic.

The talks included keynote addresses at the beginning of the day, followed by various tracks, typically including some “scientific-focused” session (ex. “Earth, Ocean, Geo, and Atmospheric Science”), a machine learning track, and maintainer track. The variety of tracks offer an opportunity to group into more focused disciplines/groups. Personally, my favorite session was the Earth, Ocean, Geo and Atmospheric Science session, which was held Friday morning through afternoon (Central time)

In the next few sections, I include notes from a few of the sessions I attending, primarily related to keynote presentations and the Earth, Ocean, Geo, and Atmospheric Science portion of the conference.

All the talks were recorded, with the links to invidual talks included within the headers. If you are interested in talks not covered in this post, check out the following links:

Keynote - Fernando Perez - SciPy and Open Source Toolsin Sicence: Challenges and Successes#

Recorded presentation

Background#

  • iPython started with an afernoon hack in October 2001

    • Collaborative effort from the beginning

    • Announced iPython in 2001

  • Other things in the Python world that started in 2001

    • Enthought

  • Key part of Python - the community

  • So far - mostly white men - need to move toward more diversity in the field

    • Grant to push for more diversity - led by group at CU Boulder

  • In order to build, it takes a village

    • Large variety of packages + tools

    • 2015 - black holes - first figure was plotted in Python

    • 2019 - first image of black hole - cited NumPy and SciPy

      • 23,000 people contributed to the code behind this

  • 10 most influential code bases

    • IPython/Jupyter - really a lot of the packages that go into this

  • Students at UC Berkely

    • Most students using Jupyter, organize a national data science workshop now

Comments on the Open Development Model#

  • Too much ad-hoc volunteer work - only those who volunteer can participate

  • Can there be structural funding for:

    • Maintenance

    • Documentation

    • Community dev

    • Strategic planning

    • Cross-project coordination

  • Academic careers?

    • Postdocs, students, RSEs, faculty

  • Industry (“big tech”?)

    • What is their role in all of this?

  • Software is more than papers and code

    • Services and content - impact

    • Software

    • Standards and Protocols - ecosystem

    • Community - innovation and resiliency

  • Jupyter - language agnostic!

    • Document ideas - later support other languages(R, Julia, etc.)

      • Over 100 kernels!

  • Governance - managed and govern sustainability

  • How do we do this for another 20 years?

    • Limited direct federal $$$

    • Indirectly

      • Lots of $ have supported Scientific OSS

    • Who stepped in?

      • Private philanthropy

      • Alfred P. Sloan, Gordon and Betty Moore, Chan Zuckerberg

  • Catastrophic success: an economic problem

    • Python has overtaken IDL

    • Today, Python by a wide margin, is the most popular interpreted language in astronomy

    • Mathworks - 4000 employees

    • Wolfram - 800 employees

    • IDL/Harris - 17,000 employees

    • We need to think about this as a risk of science

  • Scientific OSS is a foundational public good

    • Roads and Bridges - the Unseen labor behind our digital infrastructure - needs maintenance and support

    • Can’t just give some PI money to build it - need long term support

  • Resources

    • Federal research and dev budget - $200 million per year

      • What fraction of R&D depends on computing?

    • $2 Billion is 1% - would make a HUGE difference

  • Some features of successful, resilient projects

    • Broad community engagement

    • Actively managed pipeline for new contributions

    • Capacity for short and long term funding

  • Funded full-time backbone teams

    • Maintenance, documentation, tech deep dives

    • Strategic planning, fundraising, operations, community dev

  • Professionalization is inclusive

    • Reliance on volunteers excludes people!

  • Efforts with larger orgs

    • NumFocus

  • JOSS - Journal of Open Source Software

    • Established academic publishers - see this as a threat

    • Cheaper and better - rethinking these models

  • Career paths

    • Society of research engineering - Europe

    • US Research Software Engineer Association

      • Create sustainable paths for this

  • Mozilla

    • Leading professional dev opportunities, in leadership for open source

  • 2i2c

    • Non-profit that takes some of work with infrastructure, scale computing for science

    • Doing these in academia - lots of infrastructure to do

    • 2i2c - manage the infrastructure and support open source

      • Sustaining the ecosystem

  • Jupyter Community Workshops

    • Good opportunity to explore some of these questions

    • Conversations about sustainability - federal partnerships

  • Now have enough for success, but have a much larger need now

    • Other agencies thinking about this - get resources on foundational side

Atmospheric data Community Toolkit#

Recorded presentation Github repo

  • Framing the problem

    • Didn’t really have CS incoporated into curriculum

      • Learned java + c, not for data analysis…

    • Advisors just dump their code

      • Their way becomes your way!

      • Can be tough to build up from there…

    • Get into job market - different groups do things differently

    • Build up coding silos - makes it difficult to share across groups

      • Even in same organizations!

      • Silos are built on 20 years of doing things…

    • If you have flexibility, can build from there

      • Adjust and break out of the coding silos

    • When you can contribute across organizations, get more people on board

      • Contribute to the greater community

  • Atmospheric Science side

    • See a gap for toolkit for atmospheric observations

    • Toolkit for data exploration and analysis of atmospheric time-series datasets

    • Pulling from xarray, pandas, MetPy

  • Atmospheric data Community Toolkit

    • Data exploration, analysis of time-series data

    • IO

    • Deploy instrumentation around the world

    • Want to share the code base across all of these

    • Read in, visualize

    • Goals

      • Bridge research communities

      • Reduce duplication of effort

      • Transparency

      • Flexibility

Discovery#

  • ARM Live Data Web Service

    • PB of data back to 1993

  • ASOS

    • API from iowa mesonet

  • USDS Cropscape

    • Croptype is very important!

  • IO

    • xarray for netcdf

    • Pandas for text/csv files

    • Specific readers for NOAA global monitoring network

      • Surface met, radiation

    • Binary micropulse lidar using MPL2NC

      • Create temporary netcdf file

    • Cole data file structures from ARM program to create “empty” datasets - fill with missing values

  • Corrections

    • Correct lidar data like micropulse lidar for deadtime, afterpulse

    • Correcting ceilometer data for easier visual analysis

    • Wind-corrections for ship motion

  • Able to add contributors to zenodo - get people credit!

  • Retrievals

    • Wind profiles from Scanning Doppler Lidar

      • PyART can be used for lidar too!

    • Sky infrared temperature fro AERI irradiances

    • Solar radiation calculations

      • Net radiation, longwave radiance

    • Sea surface temperature from infrared radiometer measurments

  • Quality control

    • Apply additional tests

      • Simple limit tests

    • Using bit-packed QC to allow for multiple tests

    • Plotting tools to visualize - example - simple limit test

  • More

    • Prioritized visualization at first

      • Timeseries plots

      • Vertical profiles

      • MetPy - soundings

      • Generate combined plots as needed

      • Use pyART for other aspects

    • General utils

      • Precip accumulation

      • Weighted averaging

  • Future dev

    • Boundary layer height calc

    • Interactive plotting

    • 3D viz

  • How do you coordinate with the model diagnostics community?

    • Earth Model Column Collaboratory

    • https://github.com/columncolab/EMC2

    • Example notebook

  • Moving toward open development

    • Needed to convince program managers

    • Convince that the wider community is using, addint to it

    • Worked with ARM group to get this going

OCEtrac - Ocetrac: morphological image processing for monitoring marine heatwaves#

Recorded Presentation Github repo

  • See warming trend in the oceans…

    • Lots of variability on top of this

    • Compute trend at every point you have data for

    • Can cause marine heatwaves to form

  • Species can cope with fluctuations in temperature

    • If we continue to warm the climate, warm extremes more frequent, more extreme

  • Marine events exceed some threshold over some 30 year time period

    • Can be broader across ocean, or more locally focused

  • Motivation

    • Marine heatwaves don’t stay in one place

    • Complex spatial connectivity and temporal behavior

    • Local analyses may be completely characterize evolution

    • Can we build a tool to identify and track these?

  • Goals of OceTRAC

    • ID marine heatwaves as 2D object from SST anomalies

    • Track marine heatwave objects in space and time

    • Create new labeled dataset of heatwaves

  • NOAA Optimum Interpolation SST (OISST)

    • Mean SSTs - bias corrected

    • Compute SST anomalies

    • Extract only 90th percentile of SST anoms

    • NOAA OISST data - on s3 object storage

      • use pangeo forge to convert from netcdf to zarr

  • Running notebook

    • get your dask cluster

    • Do some preprocessing

      • Need to do this before feeding in

    • Convert into a binary map - if exceed, label 1 - if not, 0

      • Can do this for any type of phenomena

      • Can use this for any feature id

    • After you have this, use SciPy.ndimage

      • Open/close - smooth contours of image

        • Close - fill small holes in features

        • open - eliminate small features

    • Structuring element

      • Radius - define size of element (what is used to construct object)

    • Once you have object, collect the area (objects below threshold removed)

    • Last part - track in space and time

      • Connectivity element - connected in x, y, time

        • If they are connected, share the same id

      • These can merge, come together, split

        • Heat wave remembers the history

Running this#

  • Load in data array

  • Set radius to number of grid points

  • Minimum size quartile - threshold for object areas

  • Set variables such as xdim, ydim

  • Set up the Tracker object

    • Feed in data array, mask, radius, minsize

  • Each color on there is a different heatwave

    • Can see size and shape of these

    • Can zoom into single events

  • Can look at more stats using regional properties

  • Now have quantified dataset to understand the event

New Insights#

  • Seen two different patterns evolve - regional, global

    • Tropics play big role in global scale

  • Global heat waves last longer, more intense

  • Marine heatwaves split and merge, connected in space and time

  • Can track any ocean variables you want

  • Dealing with the 180 degree seam

    • Something they had to add on… but it’s there now!