Boost in Computational Chemistry through Data Science
Scientists have claimed that if aircraft technology had advanced as quickly as computing technology, a Boeing 747 would be large enough to carry 12,000 people to the moon and back in about three hours, at a cost of about twelve dollars for the round trip. Over the past 20 years, computing technology has picked up pace and improved rapidly. It has made complex calculations far easier to solve and has made memory and storage devices more efficient.
You might be wondering what we are getting at, or more precisely, what exactly computational technology, or scientific computing, is. So before jumping into our main topic, let’s briefly look at what it means.
Computational science, or scientific computing, is often described as a fourth, new era of scientific research that has emerged over the past 20–30 years. It has changed how scientists work and how they think about doing science. Computational science is the application of computational and numerical techniques to solve large and complex problems. As the world has digitized since the early 2000s, more data is being generated, and algorithms and mathematical techniques are giving better results over time. Computer hardware is also steadily improving and becoming more efficient, which gives computational science room to grow. Our topic today, however, is not computational science in general but computational chemistry specifically, and the kinds of applications that become possible when it is integrated with data science.
In recent years, we have seen a massive increase in the number of people practicing theoretical chemistry. This increase has been facilitated by computer software that makes complex calculations easy. As a result, users no longer need to understand even the most basic details of how the calculations are done, because the software handles them. Computational chemistry is a broad term that is used to mean many different things. Some people take it to mean the use of computers to analyze data obtained in complicated experiments. More frequently, however, the term means the use of computers to make chemical predictions. Sometimes computational chemistry is used to predict new molecules or new reactions, which are later investigated experimentally. At other times it is used to supplement experimental studies by providing data that is hard to obtain experimentally. In simple words, computational chemistry is the application of chemical, mathematical, and computing skills to the solution of interesting chemical problems. It is a useful way to investigate materials that are too difficult to find or too expensive to purchase, so it helps researchers make the right choices and avoid unnecessary expenses. Principles such as the Schrödinger equation are used to make predictions before the actual experiments are run, which helps researchers make informed decisions.
The Tools of Computational Chemistry
There are many tools available for computational chemists to choose from. They fall into five broad classes. The terms can be hard to digest, but our aim is to explain them in layman’s terms. So let’s try to understand the five classes:
Molecular mechanics:
Molecular mechanics is based on a model of a molecule as a collection of atoms held together by bonds. We can easily calculate the energy of a given collection of atoms and bonds if we know the bond lengths and the angles between them. Changing the geometry until the lowest energy is found is called geometry optimization, and it is very helpful when we want to calculate the geometry of a molecule. Done by hand, these calculations would be very complex, but molecular mechanics is fast on a computer: a fairly large molecule like a steroid (e.g. cholesterol, C27H46O) can be optimized in seconds on a good personal computer.
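The idea of changing the geometry until the energy stops going down can be sketched in a few lines of Python. This is only a toy illustration: the water-like molecule, the harmonic energy terms, the force constants, and the function names are all assumptions made for the example, not values from a real force field or the method of any particular package.

```python
import math

# Hypothetical force-field parameters for a water-like molecule (illustrative only).
K_BOND = 450.0                 # bond force constant, kcal/(mol*Angstrom^2)
R0 = 0.96                      # equilibrium bond length, Angstrom
K_ANGLE = 55.0                 # angle force constant, kcal/(mol*rad^2)
THETA0 = math.radians(104.5)   # equilibrium bond angle, rad


def energy(r1, r2, theta):
    """Harmonic strain energy from two bond lengths and the angle between them."""
    bond_term = 0.5 * K_BOND * ((r1 - R0) ** 2 + (r2 - R0) ** 2)
    angle_term = 0.5 * K_ANGLE * (theta - THETA0) ** 2
    return bond_term + angle_term


def gradient(r1, r2, theta):
    """Analytic derivative of the energy with respect to each coordinate."""
    return (K_BOND * (r1 - R0), K_BOND * (r2 - R0), K_ANGLE * (theta - THETA0))


def optimize(r1, r2, theta, step=0.001, tol=1e-8, max_iter=10_000):
    """Steepest descent: step downhill along the gradient until it vanishes."""
    for _ in range(max_iter):
        g = gradient(r1, r2, theta)
        if max(abs(component) for component in g) < tol:
            break
        r1, r2, theta = r1 - step * g[0], r2 - step * g[1], theta - step * g[2]
    return r1, r2, theta


start = (1.10, 0.85, math.radians(120.0))   # a deliberately distorted geometry
final = optimize(*start)
print("energy before:", energy(*start))
print("energy after: ", energy(*final))
```

Real molecular mechanics programs work with Cartesian coordinates, many more energy terms, and better optimizers, but the loop has the same shape: compute the energy and its gradient, adjust the geometry, and repeat.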
Ab Initio Calculations:
Small fact: ab initio, Latin: “from the start”, i.e. from first principles
These calculations are based on the Schrödinger equation. Solving it gives the energy and the wave function of a molecule. The wave function, in simple words, is a mathematical function used to calculate the electron distribution in the given molecule. The electron distribution is very useful for determining how polar the molecule is and which parts of it are likely to be attacked by nucleophiles or by electrophiles. But there is one problem with the Schrödinger equation: it cannot be solved exactly for any molecule with more than one electron. Thus approximations are used; the less serious these are, the “higher” the level of the ab initio calculation is said to be. Ab initio calculations are relatively slow. The geometry and IR spectra of propane can be calculated at a reasonably high level in minutes on a personal computer, but a fairly large molecule, like a steroid, could take perhaps days.
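For readers who want to see it, the time-independent form of the equation is usually written as below. The symbols are standard textbook notation rather than anything defined in this article: the Hamiltonian operator (the recipe for the total energy) acts on the wave function and returns the same wave function multiplied by the energy.

```latex
\hat{H}\,\Psi = E\,\Psi
```

Here $\hat{H}$ is the Hamiltonian operator, $\Psi$ is the wave function, and $E$ is the energy of the molecule.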
Semiempirical calculations:
Like ab initio calculations, these are based on the Schrödinger equation. However, more approximations are made in solving it. The very complicated integrals are not actually evaluated in semiempirical calculations; instead, the program draws on a kind of library of integrals that was compiled by finding the best fit of some calculated quantity, such as geometry or energy, to experimental values. This fitting is called parameterization. Empirical, in layman’s terms, means experimental. It is this mixing of theory and experiment that makes the method “semiempirical”: it is based on the Schrödinger equation but parameterized with experimental values. Semiempirical calculations are slower than molecular mechanics but much faster than ab initio calculations. They take roughly 100 times as long as molecular mechanics calculations, whereas ab initio calculations take roughly 100–1,000 times as long as semiempirical ones. A semiempirical geometry optimization on a steroid might take seconds on a PC.
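Parameterization is easier to picture with a toy example. The sketch below fits a single adjustable parameter so that a calculated quantity matches reference values as closely as possible in the least-squares sense. The one-parameter model, the function names, and the numbers are placeholders invented for illustration; real semiempirical methods fit many parameters against large sets of experimental data.

```python
# Placeholder numbers, NOT real measurements: each pair is
# (quantity from the unparameterized model, "experimental" reference value).
reference_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def squared_error(p, data):
    """Sum of squared differences between p * calculated and the reference."""
    return sum((p * x - y) ** 2 for x, y in data)


def fit_parameter(data):
    """Closed-form least-squares solution for the single scale parameter p."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


p_best = fit_parameter(reference_data)
print("best-fit parameter:", p_best)
print("remaining error:   ", squared_error(p_best, reference_data))
```

Once fitted, the parameter is stored and reused for new molecules, which is why the expensive integrals do not have to be evaluated every time.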
Density functional calculations:
Like ab initio and semiempirical calculations, these are based on the Schrödinger equation. However, unlike those methods, density functional theory (DFT) does not calculate a conventional wave function but instead derives the electron distribution (the electron density function) directly. Density functional calculations are usually faster than ab initio, but slower than semiempirical.
Molecular dynamics calculations:
Molecular dynamics calculations apply the laws of motion to molecules. Thus one can simulate the motion of an enzyme as it changes shape on binding to a substrate or the motion of a swarm of water molecules around a molecule of the solute. Quantum mechanical molecular dynamics also allows actual chemical reactions to be simulated.
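To make "applying the laws of motion" concrete, here is a minimal sketch that follows a single particle on a spring, a crude stand-in for a vibrating bond. It uses the velocity Verlet scheme, a common way to step Newton's equations forward in time. The unit mass, unit force constant, time step, and function names are assumptions chosen for illustration; a real simulation tracks thousands of atoms with a full force field.

```python
import math


def force(x, k=1.0):
    """Restoring force of a harmonic spring (a crude stand-in for a bond)."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


def velocity_verlet(x, v, dt, n_steps, mass=1.0, k=1.0):
    """Integrate Newton's second law, F = m * a, with the velocity Verlet scheme."""
    trajectory = [(x, v)]
    a = force(x, k) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x, k) / mass           # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


dt, n_steps = 0.01, 1000
traj = velocity_verlet(x=1.0, v=0.0, dt=dt, n_steps=n_steps)
x_final, v_final = traj[-1]
exact_x = math.cos(dt * n_steps)   # analytic solution for x(0)=1, v(0)=0, k=m=1
print(f"numerical x = {x_final:.4f}, analytic x = {exact_x:.4f}")
```

The total energy stays nearly constant along the trajectory, which is one of the reasons this kind of integrator is popular for molecular dynamics.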
So far we have discussed what exactly computational chemistry is. Before looking at how data science is used in this field, let us brush up on the concept of data science. This will make things clearer when we bridge the two in the later part.
WHAT IS DATA SCIENCE?
As the name suggests, data science revolves around the scientific study of data. It refers to the principles of learning from data based on statistics, and the scientific treatment of data to obtain new and reproducible knowledge. It focuses on studying data sets and drawing valuable insights from them, so that we can use those insights to make predictions when we receive new, previously unseen input.
USE OF DATA SCIENCE IN THE FIELD OF COMPUTATIONAL CHEMISTRY
Now that we have a basic understanding of the tools used in computational chemistry, let us see how data science is integrated with it.
Data science can be divided into three broad categories:
- Data management
- Statistics and machine learning
- Data visualization
Within each category, we can find many tools that aid the computations carried out by chemical engineers. The tools most commonly used in the field by chemical engineers are ChemAxon, ForgeV10, Chemical Computing Group, LigandScout, StarDrop, Vortex, and FAME.
The tools used for computational processes can also be divided into categories based on their functions.
- GUI – These are the tools that provide almost all of the functionality of the applications in a single unified interface.
Example:- MOE, Sybyl, Discovery Studio, ICM Pro
- Molecule Viewers – These tools are used to view the 3-dimensional structure of a molecule; many are also capable of displaying biological macromolecules such as proteins.
Example:- PyMol, VIDA, Chem3D, Chimera, Jmol, JSmol, AstexViewer, CN3D, CylView, DINO, ICM Browser, Molegro Viewer, Qutemol, VMD, Yasara
- Chemical Sketchers – These are small-molecule drawing packages that can be used to create the input for other programs and to render structures on web pages; some also provide publication-quality output. Many of the newer lightweight tools have the advantage that they can be used on mobile devices like an iPad or smartphone.
Example:- ChemDraw, Marvin, JME, ChemDoodle, Elemental, JSDraw, PubChem Sketcher, Ketcher
- SMARTSviewer – Generates a visualization of a molecular pattern given in the form of a SMARTS string, which is very useful for checking that your search query is what you really want.
- Chemical Property Calculations – Example:- Marvin, Elemental, StarDrop, Vortex, PaDEL
- 3D structure generation, conformers – Many drawing packages can generate 1D (SMILES) or 2D (SDF, mol) representations of molecules. However, some of the virtual screening tools require 3D structures and often a selection of reasonable conformations.
Example:- OMEGA, CORINA, ROTATE, CONFLEX
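As a small taste of what a chemical property calculation involves, the sketch below computes the molar mass of cholesterol from the formula C27H46O mentioned earlier. It is a hand-rolled illustration, not the way any of the packages listed above work: the function names are made up for the example, the parser only handles simple formulas without brackets or charges, and the atomic weight table is limited to the few elements needed here.

```python
import re

# Standard atomic weights in g/mol, rounded; only a few elements are included.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C27H46O' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molar_mass(formula):
    """Sum the atomic weights of every atom in the formula."""
    return sum(ATOMIC_WEIGHTS[element] * count
               for element, count in parse_formula(formula).items())


print(parse_formula("C27H46O"))
print(f"{molar_mass('C27H46O'):.3f} g/mol")
```

Dedicated packages calculate far richer properties from the full structure, but many of them start from the same kind of bookkeeping.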
DATA MANAGEMENT
Data management refers to tools and methods to organize, sort, and process large and complex datasets and to enable the real-time processing of streams of data from sensors, instruments, and simulations. Data management is the foundation of data science. The way data are organized, stored, and processed significantly impacts the performance of various computations and the accuracy and precision we can achieve.
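One way to picture real-time processing of a stream is to keep summary statistics up to date as each reading arrives, without storing the whole dataset. The sketch below uses Welford's online algorithm for the running mean and variance. The `sensor_stream` function and its readings are hypothetical stand-ins for an instrument feed, included only so the example runs on its own.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def sensor_stream():
    """Hypothetical stand-in for readings arriving from an instrument."""
    yield from [20.1, 20.4, 19.8, 20.0, 20.2]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)   # each value is processed once, then discarded

print("readings:", stats.n, "mean:", stats.mean, "variance:", stats.variance)
```

Because only three numbers are kept in memory, the same code works whether the stream delivers five readings or five billion.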
In synthetic biology, each field (e.g., transcriptomics, proteomics, and metabolomics) generates its own large dataset. Even for molecular- and nano-scale phenomena, the data commonly comprise a large quantity of information. These data may be generated by computer simulations or collected from data-intensive experiments, such as high-resolution or high-speed microscopy. To handle such a huge amount of data, we need to choose suitable management tools.
In the late 20th century, spreadsheets were commonly used for data organization and analysis. But as science progresses, the amount of data grows exponentially, and we are already generating around a million times more data than in that period. Spreadsheets cannot store such a large amount of data, and operations such as extracting a subset of the whole dataset to study it separately are not practical in them. Relational database models were used next, but modelling the data for them can take a long time and cost a lot of money. With the progress made in machine learning and data science, powerful tools such as SAS, Tableau, and Apache Spark came into being. Chemical engineers use these tools extensively to carry out intensive computational research on the large datasets produced by experiments.
DATA VISUALIZATION
Of all the uses of data science, visualization is one of the most important developments it has brought to the field of computational chemistry. The ability to visualize a large amount of data is itself a powerful tool for chemical engineers. They can study the data in the form of line graphs or pie charts and draw insights that would earlier have taken a long time, since they would have had to dig through each piece of data and derive the results manually. This has sped up the whole process of computational research.
As technology becomes more advanced, data science is finding more uses in the field of computational chemistry. Scientists have long benefitted from and contributed to the development of quantitative methods to reveal patterns in structure-property relationships across all branches of chemistry, from materials to synthetic organic to biological. Increases in data availability from experiments and computation have led to dramatic progress in the complexity of the statistical techniques applied to chemistry. With these data we can investigate molecular geometry, the energies of molecules and transition states, chemical reactivity, IR, UV, and NMR spectra, the interaction of a substrate with an enzyme, and the physical properties of substances. This integration helps analytical chemists a great deal, since much of their work relies on educated guesses drawn from their own knowledge. Software packages do the math required for quantitative analysis. Regression plotting is a great boon because it helps make sense of the raw results from the instrument. With machine learning, an analyst can gain insight beyond the limits of their own personal knowledge. Robust AI-based software that combines the knowledge of skilled chemists with knowledge of the instrument will give more confidence and better results. Building such intelligent software can be food for thought for all aspiring data scientists with some chemical background.
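The regression idea mentioned above can be sketched as a calibration line: fit a straight line through the instrument signals recorded for known standards, then use it to estimate an unknown concentration. The concentrations and signals below are placeholder numbers, not real measurements, and the function names are assumptions made for the example.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder calibration standards, not real measurements:
# known concentrations and the instrument signal recorded for each.
concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
signals = [0.02, 0.21, 0.39, 0.62, 0.80]

slope, intercept = linear_fit(concentrations, signals)


def predict_concentration(signal):
    """Invert the calibration line to estimate an unknown concentration."""
    return (signal - intercept) / slope


print("slope:", slope, "intercept:", intercept)
print("estimated concentration for signal 0.50:", predict_concentration(0.50))
```

Machine learning models extend the same pattern to many input variables and non-linear relationships, but the workflow of fitting on known data and predicting for new input stays the same.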