Displaying Molecular Formulas in Molecular Grids, Tables, and Graphs for Elemental Analysis

Here’s how to display formatted molecular formulas in tables and graphs. In addition to formatted molecular formulas, these techniques should work for any Markdown or LaTeX.

In the last blog post, we generated molecular formulas from SMILES strings or RDKit molecules. Once we have those molecular formulas, formatted as Markdown or LaTeX, we might want to display them in tables or graphs.

One of the uses of a molecular formula is for elemental analysis because it determines the mass percentage of each element in a compound. So all we need to know for elemental analysis is the count of each element in the molecule, not the arrangement of atoms.

Open this notebook in Google Colab so you can run it without installing anything on your computer

%%capture
%!pip install rdkit
%!pip install black[jupyter]
%!pip install polars
%!pip install pyarrow

from collections import defaultdict
from IPython.display import display, Markdown

import rdkit
from rdkit import Chem
from rdkit.Chem.Draw import MolsMatrixToGridImage
from rdkit.Chem.Draw import MolsToGridImage
import polars as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Molecular formula generation code from previous blog post

Refer to the previous post for more information on these functions.

def composition(
    molecule: Chem.Mol,
    isotopes: bool = False,
) -> defaultdict:
    """Get the composition of an RDKit molecule:
    Atomic counts, including hydrogen atoms, and isotopes if requested.
    For example, ethanol (SMILES [13C](H)(H)(H)CO, formula C2H6O) returns:
      if isotopes = False (default): {'C': 2, 'O': 1, 'H': 6}.
      if isotopes = True: {'C': {13: 1, 0: 1}, 'O': {0: 1}, 'H': {0: 6}}.

    :param molecule: The RDKit molecule to analyze
    :param isotopes: Whether to include the isotope of each atom
    :returns: if isotopes = False (default): a dictionary of element:count entries;
              if isotopes = True: a nested dictionary of element:isotope:count entries.
    """
    # Check that there is a valid molecule
    if not molecule:
        return

    # Add hydrogen atoms--RDKit excludes them by default
    molecule = Chem.AddHs(molecule)
    comp = defaultdict(lambda: 0)

    # Get atom counts
    for atom in molecule.GetAtoms():
        element_symbol = atom.GetSymbol()
        # If isotopes not requested, simply count the number of atoms of each element
        if not isotopes:
            comp[element_symbol] += 1
        # If isotopes requested, count the number of each isotope of each element
        else:
            isotope = atom.GetIsotope()
            try:
                comp[element_symbol][isotope] += 1
            except:
                comp[element_symbol] = defaultdict(lambda: 0)
                comp[element_symbol][isotope] += 1
    return comp

def mol_to_formatted_formula(
    mol: Chem.Mol,
    isotopes: bool = False,
) -> dict[str, str]:
    """Convert an RDKit molecule to a formatted formula, in Markdown and LaTeX

    :param mol: RDKit molecule
    :param isotopes: Whether to consider isotopes
    :returns: a dictionary of format:string pairs, e.g. {"markdown":"markdown_string", "latex":"latex_string"}
    """
    if mol is None:
        return "Invalid molecule"
    comp = composition(mol, isotopes)

    formula = {"markdown": "", "latex": ""}

    if isotopes:
        isotopes_dict = defaultdict(lambda: defaultdict(str))
        subscripts = defaultdict(lambda: defaultdict(int))
        superscripts = defaultdict(list)

        for element, counts in comp.items():
            for isotope, count in counts.items():
                if count > 1:
                    subscripts[element][isotope] = count
                if isotope != 0:
                    superscripts[element].append(isotope)

                isotopes_dict[element][isotope] = 1
            # Sort the element's isotopes from lowest to highest
            superscripts[element].sort()

        last_item_is_subscript = False
        sorted_element_keys = sorted(isotopes_dict.keys())
        for element in sorted_element_keys:
            isotope_count_pairs = isotopes_dict[element]
            # Sort the element's isotopes from lowest to highest
            sorted_isotope_keys = sorted(isotope_count_pairs.keys())
            for isotope in sorted_isotope_keys:
                if element in superscripts:
                    if isotope in superscripts[element]:
                        formula["markdown"] += f"<sup>{isotope}</sup>"

                        # If superscript immediately follows subscript,
                        # add a small amount of horizontal space using an empty group {}
                        # to prevent them from vertically overlapping
                        if last_item_is_subscript:
                            formula["latex"] += "{}"
                        formula["latex"] += "^{ %d}" % isotope
                        last_item_is_subscript = False
                formula["markdown"] += element
                formula["latex"] += element
                last_item_is_subscript = False
                if element in subscripts:
                    if isotope in subscripts[element]:
                        isotope_count = subscripts[element][isotope]
                        formula["markdown"] += f"<sub>{isotope_count}</sub>"
                        formula["latex"] += "_{ %d}" % isotope_count
                    last_item_is_subscript = True
        # Add beginning and ending dollar signs to LaTeX formula
        formula["latex"] = "$" + formula["latex"] + "$"
    else:
        # Handling the case when isotopes is False
        sorted_element_keys = sorted(comp.keys())

        for element in sorted_element_keys:
            count = comp[element]
            formula["markdown"] += element
            formula["latex"] += element
            if count > 1:
                formula["markdown"] += f"<sub>{count}</sub>"
                formula["latex"] += "_{ %d}" % count
        formula["latex"] = "$" + formula["latex"] + "$"

    return formula

def markdown_formula(latex: str) -> str:
    """Make a LaTeX molecular formula non-italicized by removing math formatting
    :param latex: the molecular formula
    :returns: the non-italicized molecular formula
    """
    latex_markdown = r"$\mathrm{ %s}$" % (latex.strip("$"))
    return latex_markdown

def element_set(molecule: Chem.Mol) -> set[str]:
    """Get the set of elements in an RDKit molecule
    For example, ethanol (SMILES CCO, formula C2H5OH)
    returns {'C', 'O', 'H'}.

    :param molecule: An RDKit molecule
    :returns: A set of strings of element abbreviations
    """
    # Check that there is a valid molecule
    if molecule:
        # Add hydrogen atoms--RDKit excludes them by default
        molecule_with_Hs = Chem.AddHs(molecule)
        atom_symbols = set(atom.GetSymbol() for atom in molecule_with_Hs.GetAtoms())
        return atom_symbols

Creating a table with molecular formulas

To display molecular formulas in tables, we can use Markdown. The utility function dict_list_to_markdown_table formats a list of dictionaries as a Markdown table. This is convenient because a Polars dataframe can be made into a list of dictionaries using Dataframe.rows(named=True).

def dict_list_to_markdown_table(data: list, number_rows: bool = False) -> str:
    """
    Convert a list of dictionaries into a Markdown table.

    :param data: A list of dictionaries, where each dictionary corresponds to a row in the table.
    :param include_row_numbers: Whether to include a row number column.
    :returns: The Markdown-formatted table as a string.
    """
    if not data:
        return ""

    # Extract column headers from the keys of the first dictionary
    headers = data[0].keys()

    # If include_row_numbers is True, add the "Row #" column
    if number_rows:
        headers = ["#"] + list(headers)

    # Create the table headers
    table = "| " + " | ".join(headers) + " |\n"

    # Create the title divider line
    table += "|"
    for _ in headers:
        table += " --- |"
    table += "\n"

    # Create the table body
    for i, row in enumerate(data):
        if number_rows:
            row_data = [i + 1] + [row.get(header, "") for header in headers[1:]]
        else:
            row_data = [row.get(header, "") for header in headers]

        table += "| " + " | ".join(map(str, row_data)) + " |\n"

    return table

Displaying molecular formulas in molecular grids

The set of molecules we’ll consider have:

a carbon chain three, two, or one atoms long, and
two oxygen atoms, and
hydrogen atoms

The first series is dialdehydes, and the second is diols. For the purpose of this exercise, consider that it may be too difficult to do elemental analysis on gases, so we’ll only analyze condensed phases. Thus carbon dioxide will be excluded.

We can take advantage of the new feature that I contributed to the RDKit 2023.9.1 release, MolsMatrixToGridImage, to put each series in its own data structure and then plot each series in its own row without worrying about how many compounds are in each series. Taking advantage of the nested input parameters of MolsMatrixToGridImage also means we don’t have to add an empty placeholder SMILES string "" for the excluded compound carbon dioxide, which would cause empty cells in tables or graphs below.

If you do not yet have RDKit 2023.9.1 or greater, the except block below will use MolsToGridImage as a fallback. Note that a different list of SMILES has to be created to add the empty placeholder SMILES string "".

Either RDKit drawing function will correctly display Markdown in legends under the molecules, so we use mol_to_formatted_formula(mol)["markdown"]. LaTeX does not seem to be supported by those drawing functions. Because we’ll calculate the weight percent of each element, including hydrogen, let’s display the hydrogen atoms.

smlsDict = {"Dialdehydes": ["O=CCC=O", "O=CC=O"], "Diols": ["OCCCO", "OCCO", "OCO"]}
series_names = list(smlsDict.keys())
smlsMatrix = [value for key, value in smlsDict.items()]
molsMatrix = [
    [Chem.AddHs(Chem.MolFromSmiles(sml)) for sml in series] for series in smlsMatrix
]
legendsMatrix = [
    [mol_to_formatted_formula(mol)["markdown"] for mol in series]
    for series in molsMatrix
]

# Use MolsMatrixToGridImage if available; if not, fall back to MolsToGridImage, which requires some data structure changes
try:
    dwg = MolsMatrixToGridImage(molsMatrix, legendsMatrix=legendsMatrix)
except AttributeError:
    print(
        f"MolsMatrixToGridImage not available in RDKit {rdkit.__version__}; falling back to MolsToGridImage."
    )
    smls_one_dim = ["O=CCC=O", "O=CC=O", "", "OCCCO", "OCCO", "OCO"]
    mols_one_dim = [Chem.MolFromSmiles(sml) for sml in smls_one_dim]
    legends_one_dim = [
        mol_to_formatted_formula(mol)["markdown"] for mol in mols_one_dim
    ]
    dwg = MolsToGridImage(mols_one_dim, legends=legends_one_dim)
else:
    print(
        f"MolsMatrixToGridImage available in RDKit {rdkit.__version__} and used to produce this image."
    )

# Display the molecular grid image produced by the drawing function that was used
dwg

MolsMatrixToGridImage available in RDKit 2023.09.1 and used to produce this image.

Two series of molecules with carbon chains 3, 2, and 1 atoms long. Top: Dialdehydes, with the one-carbon molecule, CO2, not shown. Bottom: Diols.

Elemental analysis code

We define an elemental_analysis_mol class to hold the RDKit molecule, formulas, total mass, and the weight percent of each element in the molecule.

class elemental_analysis_mol:
    """Hold a molecule and the expected weight percent of each element therein"""

    def __init__(self, mol):
        self.mol = mol

        # Create dictionary of element:absolute mass
        element_abs_dict = defaultdict()
        for atom in mol.GetAtoms():
            try:
                element_abs_dict[atom.GetSymbol()] += atom.GetMass()
            except:
                element_abs_dict[atom.GetSymbol()] = atom.GetMass()

        formulas = mol_to_formatted_formula(self.mol)
        self.formula_markdown = formulas.get("markdown")
        self.formula_latex = formulas.get("latex")

        # Calculate total molecular mass as the sum of the mass of each atom
        self.total_mass = sum(element_abs_dict.values())

        # Create dictionary of element:relative (to the molecule) mass
        self.element_rel_dict = defaultdict()
        for element, mass in element_abs_dict.items():
            self.element_rel_dict[element] = mass / self.total_mass

    def update(self, mol):
        """Define an update method so the elemental_analysis_mol can remain as the same object if updated"""
        self.__init__(mol)

    def get_mass_frac(self, element):
        return self.element_rel_dict.get(element, 0)

We’ll build a Polars dataframe with the SMILES, elemental_analysis_mol, and formulas.

smls = [sml for series in smlsMatrix for sml in series]
mols = [mol for series in molsMatrix for mol in series]
df = pl.DataFrame({"sml": smls, "mol": mols})
df = df.with_columns(
    [
        pl.col("mol")
        .map_elements(lambda mol: elemental_analysis_mol(mol))
        .alias("elemental_analysis_mol"),
    ]
)
df = df.with_columns(
    [
        pl.col("elemental_analysis_mol")
        .map_elements(lambda x: x.formula_markdown)
        .alias("formula_markdown"),
        pl.col("elemental_analysis_mol")
        .map_elements(lambda x: markdown_formula(x.formula_latex))
        .alias("formula_latex"),
    ]
)
df

shape: (5, 5)

sml	mol	elemental_analysis_mol	formula_markdown	formula_latex
str	object	object	str	str
"O=CCC=O"	<rdkit.Chem.rdchem.Mol object at 0x1728f5540>	<__main__.elemental_analysis_mol object at 0x17235bb10>	"C<sub>3</sub>H…	"$\mathrm{ C_{ …
"O=CC=O"	<rdkit.Chem.rdchem.Mol object at 0x1728f55b0>	<__main__.elemental_analysis_mol object at 0x172927210>	"C<sub>2</sub>H…	"$\mathrm{ C_{ …
"OCCCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f6030>	<__main__.elemental_analysis_mol object at 0x172927290>	"C<sub>3</sub>H…	"$\mathrm{ C_{ …
"OCCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f60a0>	<__main__.elemental_analysis_mol object at 0x1729271d0>	"C<sub>2</sub>H…	"$\mathrm{ C_{ …
"OCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f6110>	<__main__.elemental_analysis_mol object at 0x172927310>	"CH<sub>4</sub>…	"$\mathrm{ CH_{…

Now we need to add the elemental weight percents. We need to include values for each element in any molecule, so we create a set of all the elements in any molecule.

# Create set of elements in all molecules, initializing it as an empty set
element_set_mols = set()

for mol in df["mol"]:
    element_set_mols.update(element_set(mol))
element_set_mols

# Add a column for each element to the dataframe
for element in element_set_mols:
    df = df.with_columns(
        [
            pl.col("elemental_analysis_mol")
            .map_elements(lambda x: round(x.get_mass_frac(element) * 100, 4))
            .alias(f"wt %{element}"),
        ]
    )
df

shape: (5, 8)

sml	mol	elemental_analysis_mol	formula_markdown	formula_latex	wt %C	wt %H	wt %O
str	object	object	str	str	f64	f64	f64
"O=CCC=O"	<rdkit.Chem.rdchem.Mol object at 0x1728f5540>	<__main__.elemental_analysis_mol object at 0x17235bb10>	"C<sub>3</sub>H…	"$\mathrm{ C_{ …	50.0021	5.5951	44.4028
"O=CC=O"	<rdkit.Chem.rdchem.Mol object at 0x1728f55b0>	<__main__.elemental_analysis_mol object at 0x172927210>	"C<sub>2</sub>H…	"$\mathrm{ C_{ …	41.3916	3.4737	55.1347
"OCCCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f6030>	<__main__.elemental_analysis_mol object at 0x172927290>	"C<sub>3</sub>H…	"$\mathrm{ C_{ …	47.3527	10.5973	42.0501
"OCCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f60a0>	<__main__.elemental_analysis_mol object at 0x1729271d0>	"C<sub>2</sub>H…	"$\mathrm{ C_{ …	38.7027	9.7442	51.5531
"OCO"	<rdkit.Chem.rdchem.Mol object at 0x1728f6110>	<__main__.elemental_analysis_mol object at 0x172927310>	"CH<sub>4</sub>…	"$\mathrm{ CH_{…	25.0016	8.3928	66.6056

Displaying molecular formulas in a table

To display molecular formulas in a table, we’ll use the Markdown formulas and dict_list_to_markdown_table. We select only human-readable columns to display.

df_display = df.select(
    [
        col
        for col in df.columns
        if col.startswith("wt") or col in ["sml", "formula_markdown"]
    ]
)
table = dict_list_to_markdown_table(df_display.rows(named=True), number_rows=True)
display(Markdown(table))

#	sml	formula_markdown	wt %C	wt %H	wt %O
1	O=CCC=O	C₃H₄O₂	50.0021	5.5951	44.4028
2	O=CC=O	C₂H₂O₂	41.3916	3.4737	55.1347
3	OCCCO	C₃H₈O₂	47.3527	10.5973	42.0501
4	OCCO	C₂H₆O₂	38.7027	9.7442	51.5531
5	OCO	CH₄O₂	25.0016	8.3928	66.6056

This table tells us what weight percentages to expect for an elemental analysis of these compounds.

Displaying molecular formulas in a graph

It’s easier to create a graph from a Pandas dataframe, so we extract the data from Polars as lists and read them into Pandas. We’re going to use LaTeX to display the molecular formulas in the graph, so that’s the formula column we read into Pandas.

# Make Polars columns into lists
sml_list = df["sml"].to_list()
formula_latex_list = df["formula_latex"].to_list()
wt_O_list = df["wt %O"].to_list()
wt_H_list = df["wt %H"].to_list()
wt_C_list = df["wt %C"].to_list()

# Add those lists to Pandas dataframe
df_pandas_all = pd.DataFrame(
    {
        "sml": sml_list,
        "Molecule": formula_latex_list,
        "O": wt_O_list,
        "H": wt_H_list,
        "C": wt_C_list,
    }
)
df_pandas_all

	sml	Molecule	O	H	C
0	O=CCC=O	$\mathrm{ C_{ 3}H_{ 4}O_{ 2}}$	44.4028	5.5951	50.0021
1	O=CC=O	$\mathrm{ C_{ 2}H_{ 2}O_{ 2}}$	55.1347	3.4737	41.3916
2	OCCCO	$\mathrm{ C_{ 3}H_{ 8}O_{ 2}}$	42.0501	10.5973	47.3527
3	OCCO	$\mathrm{ C_{ 2}H_{ 6}O_{ 2}}$	51.5531	9.7442	38.7027
4	OCO	$\mathrm{ CH_{ 4}O_{ 2}}$	66.6056	8.3928	25.0016

Now, to plot the data, we need to “melt” or “explode” the data by making a row out of each element’s weight percent column. The columns that are not “melted” are the identifiers sml and Molecule.

# Melt the DataFrame for Seaborn so the weight percent for each element is in its own row for each molecule,
# keeping the labels sml and Molecule for each molecule-element pair
df_melted = df_pandas_all.melt(
    id_vars=["Molecule", "sml"], var_name="Element", value_name="weight %"
)
df_melted

	Molecule	sml	Element	weight %
0	$\mathrm{ C_{ 3}H_{ 4}O_{ 2}}$	O=CCC=O	O	44.4028
1	$\mathrm{ C_{ 2}H_{ 2}O_{ 2}}$	O=CC=O	O	55.1347
2	$\mathrm{ C_{ 3}H_{ 8}O_{ 2}}$	OCCCO	O	42.0501
3	$\mathrm{ C_{ 2}H_{ 6}O_{ 2}}$	OCCO	O	51.5531
4	$\mathrm{ CH_{ 4}O_{ 2}}$	OCO	O	66.6056
5	$\mathrm{ C_{ 3}H_{ 4}O_{ 2}}$	O=CCC=O	H	5.5951
6	$\mathrm{ C_{ 2}H_{ 2}O_{ 2}}$	O=CC=O	H	3.4737
7	$\mathrm{ C_{ 3}H_{ 8}O_{ 2}}$	OCCCO	H	10.5973
8	$\mathrm{ C_{ 2}H_{ 6}O_{ 2}}$	OCCO	H	9.7442
9	$\mathrm{ CH_{ 4}O_{ 2}}$	OCO	H	8.3928
10	$\mathrm{ C_{ 3}H_{ 4}O_{ 2}}$	O=CCC=O	C	50.0021
11	$\mathrm{ C_{ 2}H_{ 2}O_{ 2}}$	O=CC=O	C	41.3916
12	$\mathrm{ C_{ 3}H_{ 8}O_{ 2}}$	OCCCO	C	47.3527
13	$\mathrm{ C_{ 2}H_{ 6}O_{ 2}}$	OCCO	C	38.7027
14	$\mathrm{ CH_{ 4}O_{ 2}}$	OCO	C	25.0016

Now we’re ready to create the plot using Seaborn and Matplotlib.

# Create the column plot
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
plt.legend(title="Element")
plt.tight_layout()

# Use hue (color) to group by elements (O, H, C)
sns.barplot(x="Molecule", y="weight %", hue="Element", data=df_melted, palette="Set3")

# Use LaTeX to include a formula in the plot title
class_of_molecules = "$\mathrm{ C_{ a}H_{ b}O_{ c}}$"
plt.title("Weight percent of elements in " + class_of_molecules + " molecules")

# xticks gets the x values from the Molecule column, which are LaTeX strings of the molecular formulas
plt.xticks()

# -----------------------------------------------------
# Divide and label the two series of molecules
# -----------------------------------------------------

# Get length (number of molecules) of first series
len_first_series = len(smlsMatrix[0])

# Calculate the midpoint between the last item of the first series and the first item of the next series
x_between_series = len_first_series - 0.5

# Add a dashed vertical line between the two series
plt.axvline(x=x_between_series, color="black", linestyle="--")

# Calculate the y-coordinate for the labels 5% below the maximum value
y_max = df_melted["weight %"].max()
label_y_pos = 0.95 * y_max

# Add series labels to the left and right of the dashed vertical line
plt.text(
    x=x_between_series - 0.7,
    y=label_y_pos,
    s=series_names[0],
)
plt.text(
    x=x_between_series + 0.2,
    y=label_y_pos,
    s=series_names[1],
)

# Display the plot
plt.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
/Users/jemonat/Projects/bertiewooster.github.io/venv/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/Users/jemonat/Projects/bertiewooster.github.io/venv/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/Users/jemonat/Projects/bertiewooster.github.io/venv/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
/Users/jemonat/Projects/bertiewooster.github.io/venv/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

Graph of weight percent against molecule for the series of dialdehydes and diols

Because we used a nested data structure (supported by MolsMatrixToGridImage) for the two series of molecules, we were able to programmatically extract the x coordinate of the dividing line between the two series x_between_series, and place a vertical line there, plus series labels on either side.

This graph displays the weight percentages we expect for an elemental analysis of these compounds. For example, for both series (dialdehydes and diols), we can tell that the weight percent of carbon decreases as carbon chain length decreases.

Conclusion

This was an example of how to display molecular formulas in molecular grids, tables, and graphs for a straightforward calculation. In the next blog posts, we’ll apply some of these techniques to a more involved cheminformatics problem.