Tuesday, March 10, 2015

Workflow Tools and the Coding Culture of Scientists

Sarah Poon, Nan-Chen Chen, Cecilia Aragon, Lavanya Ramakrishnan

In this project, our team consists of both computer scientists and HCI researchers who sit at the intersection of computer science and the domain sciences. We interact with the domain scientists to understand how they use computation to achieve their science goals, and we work with the computer scientists who are developing the tools necessary to run codes efficiently on supercomputers or analyze the data.

In many cases, the domain scientists are the ones developing the science codes, the algorithms used to produce and analyze scientific data, while computer scientists develop tools and technologies to help the scientists run their codes on HPC systems.

Workflow Tools
Workflow libraries and systems offer many benefits to scientists, helping them instrument their codes to run on HPC systems. Staff members at the NERSC supercomputing facility in Oakland recently formed a workflows working group, with the aim of evaluating a subset of workflow tools that can be run and supported on HPC systems to serve the needs of scientific users.

The discussion of workflows and the qualities to look for when evaluating workflow tools deserves a more detailed analysis than we can provide here. Instead, we will briefly discuss one dimension: whether a scientist is able to design and author a workflow that will run efficiently at scale without the aid of a computer scientist or workflow tool expert. In cases where workflow tools are used to support complex workflows (e.g., handling very large amounts of data), careful workflow design is needed to use HPC resources effectively. Therefore, it is usually not the scientists themselves but workflow experts who write these workflows. Many of the use cases discussed by the working group were considered production workflows, meant to be run over and over without much modification. Thus, the upfront investment of careful workflow design made sense to these groups.

Scientific work that is more iterative often has workflows that are much more experimental in nature and require constant tweaking and revision. In these cases, scientists have expressed the desire to author the workflow themselves, without the aid of a computer scientist or workflow expert, usually in ways that fit naturally into their current coding environment. We have seen examples of high-throughput workflows, where each run submits hundreds or thousands of application variants that are only run once. We have also seen examples of workflows that need to go into production but are provisional and require refinement. For these types of workflows, workflow tools that were expressly designed to be easy to self-author were a better fit. Like workflow tools, workflow types deserve a discussion of their own, which we will explore in a future blog post.
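To make this concrete, here is a rough sketch of the kind of high-throughput workflow a scientist might self-author: a short script that submits many single-use variants of an application. The scheduler interface (a Slurm-style sbatch command) and the run_variant.sh batch script are assumptions for illustration, not any specific tool's API.

    import subprocess

    # Hypothetical parameter sweep: each value yields one single-use run.
    parameter_values = [0.1 * i for i in range(1000)]

    for value in parameter_values:
        # Submit one batch job per variant; run_variant.sh is a placeholder
        # batch script that takes the parameter value as its argument.
        subprocess.run(["sbatch", "run_variant.sh", str(value)], check=True)

A loop like this is easy for a scientist to write and tweak daily, which is exactly the property these groups asked for.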

Workflow tools often expect domain scientists to encapsulate their code into black boxes with well-defined inputs and outputs. In other words, the science codes need to be written in a clean, modular way. However, this expectation can be at odds with the realities scientists face when writing their code, and it becomes problematic in situations where scientists are writing the workflows without the aid of a computer scientist.
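As an illustration (with hypothetical file names and parameters throughout), here is a minimal sketch of the kind of encapsulation workflow tools typically expect: an analysis step with explicit inputs and outputs that a tool can chain with other steps without knowing its internals.

    import argparse
    import json

    def analyze(input_path, threshold):
        """Hypothetical analysis step: count values above a threshold."""
        count = 0
        with open(input_path) as f:
            for line in f:
                if float(line.strip()) > threshold:
                    count += 1
        return {"input": input_path, "threshold": threshold, "count": count}

    if __name__ == "__main__":
        # Well-defined inputs (arguments) and outputs (a JSON file) make
        # this step a "black box" a workflow tool can compose with others.
        parser = argparse.ArgumentParser()
        parser.add_argument("input_path")
        parser.add_argument("output_path")
        parser.add_argument("--threshold", type=float, default=1.0)
        args = parser.parse_args()
        with open(args.output_path, "w") as f:
            json.dump(analyze(args.input_path, args.threshold), f)

Contrast this with the reality described below: code that grows out of a personal script, with paths and parameters hard-coded, is much harder to drop into such a mold.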

How domain scientists code

To illustrate the culture surrounding scientific code among scientists, we will explore different roles* of scientists in collaborative projects that heavily use HPC resources.

[* These descriptions of roles and coding culture are generalizations based on interviews as well as years of working with scientific collaborations. There are fuzzy boundaries and overlap in these roles when it comes to the characterizations described, and roles vary across projects.]

Principal Investigators/Senior Scientists: As the lead of a science project working with a number of scientists, students, and postdocs, the PI not only has a great deal of knowledge about the various ways computation is used in the project but also often has a hand in running and coding some pieces of this software. They often enjoy a certain amount of the day-to-day running of software and coding of analysis tasks, and they want a high level of transparency into how the libraries they use work. However, they generally do not mind offloading certain tasks, such as tuning their code to run well at scale, to a computer scientist if and when one is available. Despite a lot of experience in computation, several PIs and senior scientists have expressed concerns that they are not doing things “the right way”, especially when an HPC facility does a systems upgrade or procures a new machine and their codes no longer compile or perform as expected. This sometimes stems from the fact that, although they are aware that APIs for parallelism or multi-threading are available, they under-utilize these APIs because of the burden of learning them and refactoring their code to use them.
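For example, here is the kind of refactoring that often gets deferred: turning a serial loop into a parallel one with Python's multiprocessing module. The analysis function is a hypothetical stand-in; the point is that even a small change requires knowing the API and restructuring the code into independent tasks.

    from multiprocessing import Pool

    def process_sample(sample_id):
        # Stand-in for an expensive, independent analysis step.
        return sum(i * i for i in range(sample_id * 1000)) / 1e6

    if __name__ == "__main__":
        samples = list(range(100))

        # Serial version: easy to write, but leaves most cores idle.
        # results = [process_sample(s) for s in samples]

        # Parallel version: only a few extra lines, but the loop body must
        # be pulled out into a function and each task must be independent.
        with Pool() as pool:
            results = pool.map(process_sample, samples)
        print("processed", len(results), "samples")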

Mid-Career Scientists: Mid-career scientists in science collaborations have often gained a lot of experience using supercomputers as postdocs and graduate students. Some have adopted software engineering best practices by working with groups of computer scientists over the years. They do a large chunk of the day-to-day operations and analysis, and they often have strong opinions about the types of tools and libraries they want to use, favoring tools that are simple to use, that don’t take a large amount of effort to learn, and that are flexible enough to accommodate the sometimes daily iterations to their code. Some mid-career scientists have built computational tools, even workflow tools, even though existing tools could often have served their collaborations. In some cases, they simply didn’t realize such tools existed or had a difficult time seeing a match between a tool and their needs. Other times, due to the complexity of the software and its high learning curve, these scientists felt it would be easier to write their own. Most scientists start by writing software primarily for their personal use, to solve a specific problem. This can result in code that is poorly documented and riddled with hard-coded variables. The task of then making this code production quality can seem like a huge burden compared to writing the original code. In the end, all software written by these scientists is primarily a means to an end, since their career advancement is based on science results and publications, not git commits.

Early Career Scientists: Early career scientists often haven’t had much programming experience beyond what is taught in introductory classes or coding bootcamps. They can range from graduate students to postdocs to junior scientists, and they often spend a limited amount of time on a project (sometimes several years, sometimes only a few months). They usually need to learn computational skills quickly and on the go in order to get the science results they need to produce papers during their limited time on the project. An example task they might perform is adding a piece of analysis code to an inherited pipeline that may not be entirely functional. If their software runs inefficiently, they may not have enough background knowledge to recognize, diagnose, and fix the issues. Some of these early career scientists have expressed a feeling of being overwhelmed by the amount of knowledge needed to run seemingly simple code, and of being intimidated by senior scientists with so much computational knowledge. Like mid-career scientists, their primary goal is not to become an expert coder but to publish in their field.

Here are some early thoughts about the scientific coding culture that we can extract from these roles:

None of these scientists is likely to spend as much time planning and designing their code as a computer scientist would.

They may not know all the available techniques and tools, may find them too cumbersome to learn, or may not be able to frame their problems in a way that lets them find appropriate tools and libraries.

Adding a new tool or library will often be an afterthought: something bolted on after a piece of code is already written and running (possibly poorly) in production.

They are looking for ways to solve their immediate pain points, not necessarily ways to make their code more efficient. Science results are the goal, not efficient code.

All of these factors in the scientific coding culture mean that scientists often think about and write code differently than computer scientists do. They have different constraints and different motivations. This leads to a potential mismatch between what workflow tools expect and what domain scientists can provide. Given the strong push to increase computational competency in the sciences, one may wonder whether these differences will dissolve over time. But looking across the three roles, there will typically be a range of skills, and technology tends to advance exponentially while people learn incrementally**. Any individual scientist may increase their computational competency over time, but at a slower rate than technology advances. Although we cannot anticipate exactly how this coding culture will change over time, it is likely to remain true that tools aimed at computer scientists will not always align well with the way domain scientists code. Therefore, tools specifically targeted at scientific coding should aim to empower these scientists while respecting their coding culture.

** Law of Disruption by Larry Downes