Sandia LabNews

Data-sharing web portal is designed to change how chemical science is done


Although fire has been successfully exploited for millennia, a mastery of combustion is more critical today than it has ever been. A deep understanding of the science of combustion is vital to maximizing the efficiency of current and future energy production.

New tools, including lasers and computers, have greatly expanded scientists’ ability to unveil new knowledge about the complex phenomena occurring over space and time within a burning flame. Now, additional ways to share that information and those tools with colleagues are ushering in a new era for how combustion science is conducted.

A $10 million effort that involved researchers from nine organizations over five years has created an online prototype portal for data sharing. Its creators’ vision is to speed access to richly detailed observations and simulations, rather than requiring scientists to wait for conferences or journal publications to learn about new results. Scientists can then build more quickly upon what their peers have uncovered.

The team called the project and the online portal the Collaboratory for Multi-scale Chemical Science (CMCS, http://cmcs.org). The term “collaboratory” was coined about 15 years ago to describe a center without walls, where geographically separated collaborators might share information, analysis tools, or applications through high-performance networks.

The idea continues to grow

Nils Hansen’s (8353) research group is part of an international team that is working to identify chemical species in low-pressure flames probed with light from the Advanced Light Source at Lawrence Berkeley National Laboratory. The data are automatically captured and can be analyzed and graphically displayed in a number of ways using CMCS.

“We have so much data that it’s sometimes hard to keep up,” he says. “We thought it would be nice to have this data as part of the user facility and make it widely available to the user community.”

The CMCS work builds upon an earlier effort at the Combustion Research Facility. Software tools were developed for the Diesel Combustion Collaboratory from 1997 to 2000. Larry Rahn (8350) was involved from the start.

“Our plan was to take that to the next level,” Larry says, “so that it would be broader than a particular community and span all chemical science related to combustion. It’s a great vision driven by the broad spectrum of physical scales in combustion, and it could have an impact on the nation’s energy concerns.

“It’s a challenge for scientists to make this information flow better across these scales and their related disciplines. If they can, it channels research into challenges a neighboring discipline faces, and promotes research that is more useful toward the ultimate mission. It provides additional tools and approaches to facilitate a strong coupling among interdisciplinary scientists, allowing them to pursue research in a ‘systems science’ approach.”

The CMCS team included scientists to set requirements and explore prototype tools, as well as researchers to develop and build the infrastructure. Christine Yang (8116, formerly 8964) has been a coprincipal investigator and contributor to this and related projects, creating software frameworks the team members call knowledge grids.

She says scientists sometimes have a surfeit of data. For instance, Jackie Chen (8351) and her postdoctoral researchers perform terascale simulations of turbulent combustion to investigate fundamental turbulence-chemistry interactions in flames. Jackie has employed FDTools, CMCS prototype software developed by Wendy Doyle (8963), to track in space and time the localized regions where fluid parcels auto-ignite. Recent simulations in three dimensions have produced tens of terabytes of data. Detecting and tracking the salient regions of interest would greatly facilitate interpretation of such large sets of simulated data.

A treasure trove of data

Rob Barlow (8351) also produces what collaboratory researcher David Leahy (a former Sandian now at Stanford University) calls a “treasure trove of data” through six-laser experiments that measure many complicated flame properties simultaneously.

Christine says that researchers normally work on a cycle of two to five years in which they produce and evaluate information to be presented in peer-reviewed papers and at conferences. She believes they see value in speeding up that process through collaboratories. Doing so takes advance coordination, both in agreeing upon standards and in preparing data by adding information that facilitates its use by others.

“In the past,” adds Leahy, “scientists didn’t think about sharing the data they got in the lab. First, they didn’t have the Internet. Second, they weren’t used to the process of sharing data — they just obtained it and decided what they thought it meant and then shared that, the very tippy-top of the iceberg, the top one percent. The primary goal had been to successfully get peer-reviewed publications. That was the measuring stick for success.”

Larry suggests that another yardstick may emerge, perhaps by counting data downloads in addition to publications and citations. “There’s kind of a data revolution happening in science,” he says, one in which data are becoming recognized as an end product in their own right.

A new measure of success

Leahy agrees, saying, “If a biologist figures out a structure of a protein and it is downloaded 1,000 times, that could be a new measure of success, a feather in the cap of the biologist.”

He was coprincipal investigator with Carmen Pancerella (8964) on an ongoing project funded by the National Institutes of Health and National Science Foundation, the Data Portal Enabling New Protein Structure Collaboration (Collaboratory for MS3D, http://ms3d.org/), now in its third year.

This project has benefited from open-source software developed by the CMCS project team. The software framework, called the Knowledge Environment for Collaborative Science (KnECS), is an enabling glue that knits together software collaboration tools, scientific applications, and data management tools. The framework allows team creation, workflow management, and subscriptions. The primary interface is a web portal enhanced with such interactive features as chat and announcements. The environment integrates data management software, most notably Scientific Annotation Middleware (SAM), for encoding, storing, searching, and controlling access to data. The data are tagged with XML metadata, which can be viewed on any computer platform.
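To give a rough sense of what tagging data with XML metadata can look like, here is a minimal sketch in Python using only the standard library. It is not the SAM or KnECS schema: the element names (dataset, title, creator, instrument, keyword), the helper functions, and the sample values are all hypothetical, chosen only to illustrate the idea of attaching searchable descriptive information to a data set.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, creator, instrument, keywords):
    """Return an XML element describing one data set.

    The element names are hypothetical, not an actual SAM or KnECS schema.
    """
    root = ET.Element("dataset")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "creator").text = creator
    ET.SubElement(root, "instrument").text = instrument
    keyword_parent = ET.SubElement(root, "keywords")
    for keyword in keywords:
        ET.SubElement(keyword_parent, "keyword").text = keyword
    return root


def has_keyword(xml_text, keyword):
    """Parse a metadata record and report whether it carries the keyword."""
    root = ET.fromstring(xml_text)
    return any(el.text == keyword for el in root.iter("keyword"))


# Made-up sample values, for illustration only.
record = build_metadata(
    title="Low-pressure flame species profiles",
    creator="Example Researcher",
    instrument="Example mass spectrometer",
    keywords=["combustion", "flame"],
)

# Serialize to plain text so any platform can read or search the record.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
print(has_keyword(xml_text, "combustion"))
```

Because the record is plain text, a collaborator on another platform could parse it with any XML library and filter data sets by keyword, creator, or instrument without opening the underlying data files.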

“The idea is to enable small research communities to tackle innovative ideas by using a shared, integrating infrastructure,” says Larry. An advantage is that the data will remain accessible for years rather than becoming outmoded like the magnetic tapes or chart recordings of the past. He says the application to combustion was important because combustion is involved in 85 percent of global energy use.

There are also other related collaboratory efforts underway at Sandia. Outside Sandia, researchers who share access to rare instruments, such as telescopes, or produce vast amounts of genomic data, are already far along the path of data-sharing.

“It’s just a matter of time before the ideas are adopted broadly through all corners of the scientific community,” Leahy predicts.

Toward that end, KnECS is being released on SourceForge.net as open-source software for further development. The hope is that enhancements will be added for use in a variety of research communities, Christine says, and that it one day may be adapted for commercial release.