‘Data Revolution’ solves current issues in chemistry

As part of one of three projects chosen nationally this year by the National Science Foundation, Matthew Sigman and his team will help create a new generation of data chemists.

This release is adapted from a story by Lauren Albin at Colorado State University.

Matthew Sigman, distinguished professor and chair of the University of Utah department of chemistry, is part of a vanguard of researchers solving problems in chemistry using data science and machine learning.

Sigman is part of a National Science Foundation-supported team spanning five universities charged with creating a new generation of data chemists through their Center for Computer Assisted Synthesis (C-CAS). In addition to Sigman, the C-CAS team includes Center Director Olaf Wiest as well as Robert Paton, Nitesh Chawla, Abigail Doyle and Richmond Sarpong.

C-CAS combines data science and machine learning with chemistry to transform how the synthesis of complex organic molecules is planned and executed. As a result, a new generation of data chemists and machine learning scholars can be trained and educated to address complex challenges of modern synthetic chemistry.

“We are excited to work with this team to build workflows using machine learning to predict how chemical reactions will perform,” Sigman said. “Our goal is to convert synthetic chemistry from mainly an empirical science to using data science tools to facilitate and streamline chemistry development.”

Both graduate and undergraduate students will participate in the Center’s research. The Center will also establish networking events, online workshops, and collaborations with students at other schools.

C-CAS is supported by the Centers for Chemical Innovation Program of the NSF Division of Chemistry and will receive $1.8 million in funding. Two to three centers are created each year, with nine currently in existence. As a “Phase One Center,” C-CAS will run for three years and, pending the outcome, may be extended and considerably expanded into a “Phase Two Center.”

Data revolution

In 2017, NSF announced its “10 Big Ideas,” encompassing a long-term research agenda to benefit generations to come. Of the 10, Sigman and the C-CAS team fall under “Harnessing the Data Revolution.”

The “Data Revolution” is a term used to describe the growing demand for data from all parts of society. It has affected many fields, and chemistry is now one of them.

Currently, chemistry is recorded in laboratory notebooks, in databases inside companies or in the pages of Ph.D. theses. It can also be published in papers, posted online as PDFs or captured in patents. The information is plentiful but scattered across many places. Sigman and his team are working to build new computational tools to bring all that data together in one accessible place. To do this, they will work in three phases:

  1. Unify data from a variety of sources.
  2. Exploit unified data to represent chemistry in a way that addresses the problems with optimizing chemical reactions.
  3. Apply the data to synthesis planning and the synthesis of complex molecules.

“Access to high-quality information, containing both positive and negative results, will be key to developing new data-driven tools for chemists,” said Paton, of Colorado State University. “This wealth of knowledge available by computer will open new pathways into chemistry for those who have previously found it inaccessible due to challenges like fume hoods and laboratory spaces.”

The C-CAS team

The Center will be directed by Wiest at the University of Notre Dame. He is joined by Chawla, also from Notre Dame; Doyle from Princeton University; Paton from Colorado State University; Sarpong from the University of California, Berkeley; and Sigman from the University of Utah.

“Problems of this sort require a strong and diverse contingent of problem solvers,” Sigman said. “The NSF Center format allows us to tackle much bigger problems.”

The lead investigators bring complementary expertise to the outlined goal. Sigman develops physical-organic approaches to understand and predict selectivity in organic reactions.

Doyle uses ultra-high-throughput experimentation (HTE) technology and computational machine learning to predict the outcomes of reactions. Paton’s group uses computational algorithms to understand catalytic reaction mechanisms and to enhance performance. Chawla specializes in making fundamental advances in machine learning. Wiest uses both computational chemistry and experimental methods to elucidate reaction mechanisms and to perform high-throughput calculations on transition structures. Sarpong focuses on total synthesis, converting simpler chemical building blocks into complex, medicinally interesting natural products.

This project will be a forum for the exchange of ideas. Experts in each field will establish best practices openly and visibly, so the group can collectively work out which tools are best suited to solving problems in chemistry.

Beyond the researchers at the participating institutions, the group will work with several industrial partners, such as large pharmaceutical companies.

“Our industrial affiliates and collaborators add the ability to generate important data using modern technology and exceptional vested partners in developing workflows and tools to think about data and data science approaches to synthetic chemistry,” Sigman said.

To learn more about the research Sigman and his group are doing, or to stay up to date on it, visit ccas.nd.edu.