Scientific software developer in the Washington, D.C. area.

Portfolio

Cheminformatics blog posts

Why Some Organic Molecules Have a Color

Absorption and emission maxima of n=1-6 oligomers with an anthracene repeat unit

It’s usually because of a long chain of conjugated bonds. I search 20K data points to find a series of molecules where extending the conjugated chain increases the absorption wavelength.

Tautomer Generation Algorithms and InChI Representations

Histogram of frequency against difference in number of tautomers from RDKit baseline algorithm minus other sources

Which cheminformatics algorithms produce the most tautomers? And how well does InChI capture all tautomers of a given structure in a single representation?

Molecular Isotopic Distributions: Permutations and Combinations

Abundance against mass for SCl2 molecular isotopes

These posts use two different methods to calculate molecular isotopic mass distributions.
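As a rough illustration of the combinations idea, the sketch below enumerates isotope combinations for each element and weights each one by its multinomial coefficient. It is a minimal sketch, not the code from the posts: the function names are hypothetical, masses are nominal mass numbers rather than exact isotopic masses, and the abundances are approximate natural abundances entered by hand.

```python
from collections import defaultdict
from itertools import combinations_with_replacement, product
from math import factorial, prod

# Approximate natural abundances keyed by mass number (assumed, rounded values).
ISOTOPES = {
    "S": {32: 0.9499, 33: 0.0075, 34: 0.0425, 36: 0.0001},
    "Cl": {35: 0.7576, 37: 0.2424},
}


def element_distribution(symbol, count):
    """Return {nominal mass: abundance} for `count` atoms of one element."""
    isotopes = ISOTOPES[symbol]
    dist = defaultdict(float)
    for combo in combinations_with_replacement(isotopes, count):
        # Multinomial coefficient: distinct orderings of this isotope combination.
        ways = factorial(count)
        for mass_number in set(combo):
            ways //= factorial(combo.count(mass_number))
        dist[sum(combo)] += ways * prod(isotopes[m] for m in combo)
    return dict(dist)


def molecule_distribution(composition):
    """Combine per-element distributions into {nominal mass: abundance}."""
    per_element = [element_distribution(s, n) for s, n in composition.items()]
    total = defaultdict(float)
    for choice in product(*(d.items() for d in per_element)):
        total[sum(m for m, _ in choice)] += prod(p for _, p in choice)
    return dict(sorted(total.items()))


# SCl2: one sulfur atom and two chlorine atoms.
scl2 = molecule_distribution({"S": 1, "Cl": 2})
for mass, abundance in scl2.items():
    print(mass, round(abundance, 4))
```

Enumerating combinations rather than every ordered permutation keeps the number of terms small, at the cost of computing a multinomial coefficient for each combination.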

RDKit Contribution MolsMatrixToGridImage()

Three reactions, each in a row. First column: Target molecule and whether it's accessible based on commercial availability of reactants. Subsequent columns: Each reactant and whether it's commercially available.

I contributed MolsMatrixToGridImage to the RDKit 2023.09.1 release to draw row-and-column grids of molecules.

Display Molecular Formulas

Uses Python, RDKit, seaborn, and matplotlib

Two series of molecules with carbon chains 3, 2, and 1 atoms long. Top: Dialdehydes, with the one-carbon molecule, CO2, not shown. Bottom: Diols.

How to display molecular formulas such as C3H4O2 in molecular grids, tables, and graphs. Also works for other HTML-, Markdown-, or LaTeX-formatted text.
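One simple way to get subscripts is to wrap each run of digits in the markup for the target format. The sketch below assumes a plain formula string with no charges or leading coefficients; `format_formula` is a hypothetical helper, not a function from the post or from RDKit.

```python
import re


def format_formula(formula, style="html"):
    """Wrap each run of digits in a formula such as 'C3H4O2' as a subscript."""
    templates = {
        "html": r"<sub>\1</sub>",  # also usable in Markdown that allows inline HTML
        "latex": r"$_{\1}$",       # math-mode subscript, e.g. for matplotlib labels
    }
    return re.sub(r"(\d+)", templates[style], formula)


print(format_formula("C3H4O2"))
print(format_formula("C3H4O2", style="latex"))
```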

Molecular Formula Generation

Uses Python and RDKit

Photosynthesis chemical equation: 6CO2 + 6H2O → C6H12O6 + 6O2

In cheminformatics, the typical way of representing a molecule is with a SMILES string such as CCO for ethanol. However, there are still cases where the molecular formula such as C2H6O is useful.
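Once the atoms of a molecule are known, building the formula is mostly a counting and ordering problem. The sketch below assumes the element symbols (including hydrogens) have already been extracted, for example from a parsed SMILES string, and orders them by the Hill system: carbon first, hydrogen second, then the rest alphabetically, or fully alphabetical when there is no carbon. `hill_formula` is a hypothetical name and this is not necessarily how the post does it.

```python
from collections import Counter


def hill_formula(atom_symbols):
    """Build a Hill-order molecular formula from a list of element symbols."""
    counts = Counter(atom_symbols)
    if "C" in counts:
        others = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + others
    else:
        order = sorted(counts)
    # A count of 1 is left implicit, as in C2H6O.
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


# Ethanol (SMILES CCO) with its six hydrogens written out by hand.
ethanol_atoms = ["C", "C", "O"] + ["H"] * 6
print(hill_formula(ethanol_atoms))
```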

Refitting Data From Wiener’s Classic Cheminformatics Paper

Uses Python, SciPy, Polars, NumPy, seaborn, matplotlib, and mol_frame

Graph of calculated against observed boiling point for alkanes

How well did cheminformatics pioneers Egloff and Wiener fit their models to boiling points of alkanes in the 1940s? This blog post revisits their fits using digital tools.

Revisiting a Classic Cheminformatics Paper: The Wiener Index

Uses Python, RDKit, Polars, matplotlib, seaborn, py2opsin, and mol_frame

Graph of calculated against observed boiling point for alkanes

This post revisits Harry Wiener’s article “Structural Determination of Paraffin Boiling Points”, extracts data for molecules from it, recalculates cheminformatics parameters and boiling points, and plots the data.
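The Wiener index is the sum of shortest-path distances, counted in bonds, between every pair of carbon atoms in the hydrogen-suppressed molecular graph. The sketch below computes it with a breadth-first search over a hand-written adjacency list; the graph encoding and the function name are assumptions for illustration, whereas the post itself works from molecular structures.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum shortest-path distances over all unordered pairs of atoms."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # each pair was counted from both ends


# Carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(n_butane), wiener_index(isobutane))
```

The branched isomer has the smaller index because its atoms are, on average, closer together.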

RDKit Utility to Check Whether Starting Materials for Synthesizing Your Target Molecules Are Commercially Available

Uses Python, RDKit, PubChem’s API, asyncio, and Semaphore

Three reactions, each in a row. First column: Target molecule and whether it's accessible based on commercial availability of reactants. Subsequent columns: Each reactant and whether it's commercially available.

Given target molecules and reactions to synthesize them, determine whether the starting materials are commercially available using PubChem’s API, and thus whether the target is synthetically accessible.
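The concurrency pattern can be sketched without touching the network: an `asyncio.Semaphore` caps how many lookups run at once, and a target counts as accessible only if every reactant is found. Everything specific here is an assumption for illustration: `FAKE_CATALOG`, `is_commercially_available`, and `target_accessible` are hypothetical, and the lookup is a stub that sleeps instead of calling PubChem.

```python
import asyncio

# Made-up catalog standing in for a real availability lookup.
FAKE_CATALOG = {"CCO", "CC(=O)O"}


async def is_commercially_available(smiles, semaphore):
    """Hypothetical stand-in for an API call; no request is actually sent."""
    async with semaphore:  # at most `max_concurrent` lookups run at once
        await asyncio.sleep(0.01)  # simulate request latency
        return smiles in FAKE_CATALOG


async def target_accessible(reactants, max_concurrent=2):
    """Check all reactants concurrently; accessible only if all are available."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = await asyncio.gather(
        *(is_commercially_available(s, semaphore) for s in reactants)
    )
    return dict(zip(reactants, results)), all(results)


availability, accessible = asyncio.run(target_accessible(["CCO", "CC(=O)O"]))
print(availability, accessible)
```

Limiting concurrency this way is a common courtesy toward public APIs that throttle or reject bursts of requests.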

RDKit Utility to Create a Mass Spectrometry Fragmentation Tree

Uses Python and RDKit

Annotated mass spectrometry fragmentation tree using the function mass_spec_frag_tree in this blog post

Given a mass spec fragmentation hierarchy, with species as SMILES strings, display the fragmentation tree in a grid, labeling each species with its name and either its mass or its mass-to-charge ratio (m/z).

RDKit Utility to Find the Maximum Common Substructure, and Groups Off It, Between a Set of Molecules

Uses Python and RDKit

Annotated grid of maximum common substructure and core; molecules and groups off maximum common substructure

Given a collection of molecules as SMILES strings, find the maximum common substructure (MCS) match between them, and the groups off that common core for each molecule, displaying the results using a grid.

Chemistry machine learning for drug discovery with DeepChem

Uses Python, DeepChem, seaborn, matplotlib, and pandas

Predicted against measured lipophilicity for test and train data

Use the DeepChem deep learning package to predict compounds’ lipophilicity, that is, how well they are absorbed into the lipids of biological membranes, which is important for oral delivery of drugs.

RDKit Utility to Visualize Retrosynthetic Analysis Hierarchically

Uses Python and RDKit

Annotated Recap retrosynthetic hierarchy tree

Given a target molecule, use the Recap algorithm to decompose it into a set of fragments that could be combined to make the parent molecule using common reactions. Display the fragmentation hierarchically.

RDKit Utility to Find and Highlight the Maximum Common Substructure Amongst Molecules

Uses Python and RDKit

Maximum substructure match, and the two molecules which are labeled by their functional groups

Given a collection of molecules as SMILES strings, find the maximum common substructure (MCS) match between them as a SMARTS string, display the match pattern as a molecule, and highlight the match pattern in each molecule using a grid.

Web apps

Materials and Cheminformatics Sampler

Uses Python, NumPy, SymPy, ChemPy, Flask, JavaScript, and Bootstrap

Find a given number of points which satisfy constraints given in a constraints file for an n-dimensional space defined on the unit hypercube, then write them to an output file.

Optionally, identify the components (dimensions) in the constraints file using chemical formulas, and Sampler will use ChemPy to calculate their molar masses, then output the component weight fraction.
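One simple way to find such points is rejection sampling: draw uniform candidates from the unit hypercube and keep those that satisfy every constraint. The sketch below is not Sampler's actual algorithm or file format. It assumes each constraint is a Python callable `g` where `g(x) >= 0` means the point is feasible, and `sample_points` is a hypothetical name.

```python
import random


def sample_points(n_points, n_dims, constraints, max_tries=100_000, seed=0):
    """Collect up to `n_points` feasible points from the unit hypercube."""
    rng = random.Random(seed)
    points = []
    for _ in range(max_tries):
        if len(points) == n_points:
            break
        candidate = [rng.random() for _ in range(n_dims)]
        # Keep the candidate only if every constraint g(x) >= 0 holds.
        if all(g(candidate) >= 0 for g in constraints):
            points.append(candidate)
    return points


# Example constraint: the three components may sum to at most 1.
constraints = [lambda x: 1.0 - sum(x)]
points = sample_points(5, 3, constraints)
for point in points:
    print([round(v, 3) for v in point])
```

Rejection sampling becomes slow when the feasible region is a small fraction of the hypercube, which is why `max_tries` bounds the loop.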

Periodic Table Navigator

Uses Ruby, Sinatra, PostgreSQL, and JavaScript

Understand how the elements are related to each other. Emphasizes electronic configuration of the elements.

Open-source contributions

The RDKit cheminformatics package

  • Conceived, proposed, and coded MolsMatrixToGridImage feature to use a two-dimensional (nested) data structure as input to create molecular grid images. Feature was merged into the main codebase by the project maintainer and released in the 2023.09.1 release. It was the subject of an article on the site Macs In Chemistry, which included:

    If you need to display molecules and associated data in a grid then Jeremy Monat’s MolsMatrixToGridImage is exactly what you need. To underline just how useful this is and to highlight how it simplifies code he has written a very nice blog post.

  • Improved documentation by illustrating drawing capability in tutorial and adding SMILES (chemical notation) for R groups

SymPy computer algebra system in pure Python

ChemPy package for chemistry in Python

Sphinx documentation generator
