Comparing Tautomer Generation Algorithms

Tautomers are chemical structures that readily interconvert under given conditions. For example, an amino acid has a neutral form, and a zwitterionic form with separated positive and negative charges. Cheminformatics packages have algorithms to enumerate tautomers based on rules. Which algorithms produce the most tautomers? And how successful is InChI at representing with a single representation all tautomers of a given structure?

Tautomeric forms of an amino acid: neutral (1) and zwitterionic (2)

Image credit: TimVickers vector version by YassineMrabet

The tautomer generation algorithms discussed below are based on rules from

Markus Sitzmann, Wolf-Dietrich Ihlenfeldt, and Marc C. Nicklaus, “Tautomerism in Large Databases”, JCAMD 24:521 (2010) https://doi.org/10.1007/s10822-010-9346-4
Devendra K. Dhaked, Wolf-Dietrich Ihlenfeldt, Hitesh Patel, Victorien Delannée, and Marc C. Nicklaus, “Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2”, J. Chem. Inf. Model. 60:3 (2020) https://pubs.acs.org/doi/10.1021/acs.jcim.9b01080 (preprint also available)

RDKit has two tautomer generation algorithms:

GetV1TautomerEnumerator
the new algorithm, TautomerEnumerator, which will be our baseline.

The 2022.03 release notes state:

The rules for tautomer enumeration in MolStandardize have been updated to more closely match the rules in the original publication [cited above]. These changes primarily consist of making the rules more specific; the consequence is that less tautomers will be generated with this version.

but maintainer Greg Landrum wrote

[the new TautomerEnumerator returning fewer examples than the previous rules V1 is] not something I’ve noticed. The code change adds a missed case to the enumeration rule set, so at first glance you’d expect it to always produce more tautomers, but I suppose that could still result in a smaller number of tautomers in the end because of how the transformations interact with each other.

National Institutes of Health (NIH) CADD Group Chemoinformatics Tools and User Services (cactus) also has algorithms that can be run on a Tautomerizer web page or with the package CACTVS from Xemistry GmbH. NIH’s Marc Nicklaus notes that CACTVS has “many additional transforms vs. the ones we used in 2010 [for the paper cited above]. This now includes numerous ring-chain and valence tautomerism rules, for a total of…120 rules” (private communications; rules are from Devendra Kumar Dhaked and Marc Nicklaus, “Tautomeric Conflicts in Forty Small-Molecule Databases” (2024) https://doi.org/10.26434/chemrxiv-2024-jzpw2-v2). We compare these two NIH algorithms to our baseline, the new RDKit algorithm. Because I couldn’t install CACTVS on my computer due to CPU compatibility, I used the web site which has 86 rules, and Marc kindly ran CACTVS for a few structures using “exhaustive multi-step enumeration iteratively with all rules, until no more new tautomer is found (with a limit of 1,000 attempts)”.

So one purpose of this blog post is to empirically answer the question of which algorithms produce more tautomers, and particularly whether the new RDKit algorithm produces more or fewer tautomers than the V1 algorithm.

The second purpose is to check how well InChI (International Chemical Identifier) accomplishes its goal of being “tautomer-invariant”, meaning tautomers of a structure should be assigned the same InChI, so ideally only one InChI should be able to represent all tautomers of a given structure. However, “It was recognized early on that important types of tautomerism are missing,” so we should not expect InChI to be completely successful in its current incarnation.

To empirically address these questions, we need a set of molecules which have tautomers: Many molecules will not have any tautomers. Nicklaus and team also created Tautomer Structures Extracted from Experimental Literature, with

5,977 structures extracted from experimental literature representing 2,819 cases of tautomeric tuples (mostly tautomer pairs but also a few triples and higher-order tuples). Note that the number of structurally different tuples is only 1,776 (comprising 3,884 different structures) since some tuples are differentiated from each other only by experimental conditions such as solvent, spectroscopy method, etc.

We use release 3 in Excel format, specifically Tautomer_database_release_3a.xlsx.

For reference, here’s a summary of all tautomer sources used in this post:

Source	Have data for all compounds	Nature	Algorithm source	Interface	Role
Expt	Yes	Experimental	Not applicable	Not applicable	Comparison
GetV1TautomerEnumerator	Yes	Cheminformatic	RDKit	Programmatic	Comparison
TautomerEnumerator	Yes	Cheminformatic	RDKit	Programmatic	Baseline
cactus	No	Cheminformatic	CACTVS (86 rules)	Web page	Comparison
CACTVS	No	Cheminformatic	CACTVS (120 rules)	Programmatic	Comparison

Code foundation

import sys

print(sys.version)

3.11.7 (v3.11.7:fa7a6f2303, Dec  4 2023, 15:22:56) [Clang 13.0.0 (clang-1300.0.29.30)]

# !pip install rdkit polars>=0.20.7 seaborn matplotlib

from typing import Iterable, Callable
import warnings
import statistics
from statistics import StatisticsError

from rdkit.Chem import AllChem as Chem
from rdkit import RDLogger
from rdkit.Chem import Draw, rdFMCS
from rdkit.Chem import Mol
from rdkit.Chem.MolStandardize import rdMolStandardize
import polars as pl
import polars.selectors as cs
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# Suppress RDKit warnings so it doesn't warn about SMILES that produce molecules which violate a rule,
# which would produce many warnings as we create thousands of molecules
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL)

# Ignore all FutureWarnings--there are several from internal code of imported packages
warnings.simplefilter(action="ignore", category=FutureWarning)

These functions convert SMILES to RDKit molecules to canonical SMILES.

def mol_from_sml(sml: str) -> Mol | None:
    """Create an RDKit molecule from a SMILES string.
    RDKit cannot create molecules from some SMILES strings if it tries to sanitize them,
    but sometimes can if it does not sanitize them.
    It is preferable to sanitize molecules if possible to, for example,
    delocalize aromatic bonds (instead of alternating single and double bonds),
    which makes different structures (single-double vs. double-single bond sequence) give the same molecule.

    :param sml: SMILES string
    :returns: RDKit molecule if SMILES is valid; None if invalid
    """
    # Try sanitizing molecule first
    mol = Chem.MolFromSmiles(sml, sanitize=True)
    if mol is not None:
        return mol
    # If sanitization fails, turn it off and try creating molecule again
    mol = Chem.MolFromSmiles(sml, sanitize=False)
    return mol

def canon_sml_or_none(mol: Mol | None) -> str | None:
    """Return the canonical SMILES for an RDKit molecule if it exists; otherwise, None.

    :param mol: RDKit molecule or None
    :returns: Canonical SMILES string if RDKit molecule supplied; None if None supplied
    """
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)

def canonicalize(sml: str) -> str | None:
    """
    :param sml: SMILES string
    :returns: Canonical SMILES string if RDKit molecule could be created; None if input SMILES invalid
    """
    mol = mol_from_sml(sml)
    return canon_sml_or_none(mol)

Preparing and inspecting the data from the tautomer database

In preparation for reading in the data, we specify the file and sheet names from the Tautomer Structures Extracted from Experimental Literature.

file_name = "../data/Tautomer_database_release_3a.xlsx"

# If you want to use only the first row for quicker debugging, use this file instead
# file_name = "../data/Tautomer_database_release_3a_first_row.xlsx"

# The name of the sheet to process
sheet_name = "Tautomer_database"

We start by reading the data from the sheet in the Excel file of the tautomer database using polars.read_excel. We have Polars consider all the data using "infer_schema_length":10000 (there are <3,000 rows in the sheet) before deciding on data types for columns, to ensure all cells in a column fit the data type that Polars assigns.

df = pl.read_excel(
    source=file_name,
    sheet_name=sheet_name,
    read_options={"infer_schema_length": 10000},
)
df.head()

shape: (5, 147)

Ref	Size	Solvent	Solvent_Proportion	Solvent_Mixture	Temperature	pH	Experimental_Method	Entry_ID1	Type_1	ID_Hash_1	FICTS_1	HASHISY_1	FICuS_1	TAUTOHASH_1	uuuuu_1	Std_InChIKey_1	Std_InChI_1	SMILES_1	Mol_Formula_1	Mol_Weight_1	IUPAC_Name_1	Quantitative_ratio_1	Qualitative_prevalence_1	Prevalence_Category_1	Filename_1	Publication_DOI_1	Publication_ID_1	Authors_1	Affiliation_1	Title_1	Section_1	Page_Number(s)_1	Notes_1	Cmpd_Number_1	Entry_ID2	Type_2	…	Publication_DOI_4	Publication_ID_4	Authors_4	Affiliation_4	Title_4	Section_4	Page_Number(s)_4	Notes_4	Cmpd_Number_4	Entry_ID5	Type_5	Transf_1_5	ID_Hash_5	FICTS_5	HASHISY_5	FICuS_5	TAUTOHASH_5	uuuuu_5	Std_InChIKey_5	Std_InChI_5	SMILES_5	Mol_Formula_5	Mol_Weight_5	IUPAC_Name_5	Quantitative_ratio_5	Qualitative_Prevalence_5	Prevalence_Category_5	Filename_5	Publication_DOI_5	Publication_ID_5	Authors_5	Affiliation_5	Title_5	Section_5	Page_Number(s)_5	Notes_5	Cmpd_Number_5
i64	i64	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	f64	str	str	str	i64	str	str	str	str	str	str	str	str	str	str	str	str	…	str	str	str	str	str	str	i64	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	f64	str	str	str	i64	str	str	str	str	str	str	str	i64	str	str
1	2	"Gas phase"	"nul"	"no"	"377.15-417.15"	"nul"	"1H NMR spectra…	"Prog. NMR. Spe…	"Diketo"	"8da8a44a54e4cd…	"748BBAA5E5F382…	"748BBAA5E5F382…	"748BBAA5E5F382…	"748BBAA5E5F382…	"748BBAA5E5F382…	"InChIKey=YRKCR…	"InChI=1S/C5H8O…	"O=C(C)CC(C)=O"	"C5H8O2"	100.117	"pentane-2,4-di…	"nul"	"Observed"	1	"(1)_Claramunt_…	"10.1016/j.pnmr…	"Prog. NMR. Spe…	"Claramunt, R. …	"Departamento d…	"The use of NMR…	"Scheme 1"	"171"	"nul"	"1c"	"Prog. NMR. Spe…	"Keto-enol"	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
2	2	"Gas phase"	"nul"	"no"	"nul"	"nul"	"nul"	"Prog. NMR. Spe…	"NH"	"ef8866bd4d9949…	"AF721AB4BAA47F…	"AF721AB4BAA47F…	"AF721AB4BAA47F…	"AF721AB4BAA47F…	"E0755E1A90D93F…	"InChIKey=QWENR…	"InChI=1S/C2H3N…	"N1=CC=N[NH]1"	"C2H3N3"	69.0658	"2H-triazole"	"nul"	"Only observed"	4	"(1)_Claramunt_…	"10.1016/j.pnmr…	"Prog. NMR. Spe…	"Claramunt, R. …	"Departamento d…	"The use of NMR…	"Scheme 3"	"172"	"nul"	"8b"	"Prog. NMR. Spe…	"NH"	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
3	2	"HMPT"	"nul"	"no"	"nul"	"nul"	"13C NMR spectr…	"Prog. NMR. Spe…	"NH"	"7714934ea0f29d…	"96D7C0A293F7BC…	"96D7C0A293F7BC…	"B5EA71BA1BFB8B…	"B5EA71BA1BFB8B…	"B5EA71BA1BFB8B…	"InChIKey=XKVUY…	"InChI=1S/C4H6N…	"CC1=CC=NN1"	"C4H6N2"	82.1048	"5-methyl-1H-py…	"nul"	"Predominant"	3	"(1)_Claramunt_…	"10.1016/j.pnmr…	"Prog. NMR. Spe…	"Claramunt, R. …	"Departamento d…	"The use of NMR…	"Scheme 3"	"172"	"nul"	"7b"	"Prog. NMR. Spe…	"NH"	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
4	2	"DMSO-d6"	"nul"	"no"	"nul"	"nul"	"1H NMR spectra…	"Prog. NMR. Spe…	"NH"	"8c6a7d710c6321…	"DD375DCBC6DDA9…	"DD375DCBC6DDA9…	"DD375DCBC6DDA9…	"DD375DCBC6DDA9…	"DD375DCBC6DDA9…	"InChIKey=OWLHY…	"InChI=1S/C7H6N…	"N=N=NC1=NNC2=C…	"C7H6N5"	160.1579	"3-[(imino-lamb…	"nul"	"Predominant"	3	"(1)_Claramunt_…	"10.1016/j.pnmr…	"Prog. NMR. Spe…	"Claramunt, R. …	"Departamento d…	"The use of NMR…	"Scheme 4"	"173"	"nul"	"9a"	"Prog. NMR. Spe…	"NH"	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
5	3	"Carbon tetrach…	"nul"	"no"	"nul"	"nul"	"1H NMR spectra…	"Prog. NMR. Spe…	"Keto-enol"	"8366b277b21ed8…	"7DA12DCD45D806…	"7DA12DCD45D806…	"CB73FB7D041984…	"9016C9F8C95B7C…	"9016C9F8C95B7C…	"InChIKey=NLPZO…	"InChI=1S/C11H1…	"CC1([C@H]2C(C=…	"C11H16O2"	180.2462	"(1S,4S)-3-hydr…	"nul"	"Not observed"	0	"(1)_Claramunt_…	"10.1016/j.pnmr…	"Prog. NMR. Spe…	"Claramunt, R. …	"Departamento d…	"The use of NMR…	"Scheme 8"	"175"	"nul"	"16b"	"Prog. NMR. Spe…	"Keto-enol"	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null

In case there are any empty rows, let’s filter down to rows where Ref is not null:

df = df.filter(pl.col("Ref").is_not_null())

And we get the 2,819 cases cited in the data documentation.

df.shape

(2819, 147)

The number of unique Ref values is 1776, as stated in the data documentation:

Ref_count = df["Ref"].n_unique()
Ref_count

Let’s cut down the number of columns (we don’t need 147!) to remove those relating to experimental conditions, etc. by keeping only the Ref, SMILES, and Std_InChI columns. We can use a combination of selectors within df.select():

To keep the column “Ref”, we simply use "Ref"
To keep the columns starting with “SMILES_” or “Std_InChI_”, we use cs.starts_with("SMILES_", "Std_InChI_")

By putting those two selectors in the same df.select(), we select columns matching either criterion.

df = df.select("Ref", cs.starts_with("SMILES_", "Std_InChI_"))
df.head()

shape: (5, 11)

Ref	Std_InChI_1	SMILES_1	Std_InChI_2	SMILES_2	Std_InChI_3	SMILES_3	Std_InChI_4	SMILES_4	Std_InChI_5	SMILES_5
i64	str	str	str	str	str	str	str	str	str	str
1	"InChI=1S/C5H8O…	"O=C(C)CC(C)=O"	"InChI=1S/C5H8O…	"O/C(C)=C\C(C)=…	null	null	null	null	null	null
2	"InChI=1S/C2H3N…	"N1=CC=N[NH]1"	"InChI=1S/C2H3N…	"[NH]1N=NC=C1"	null	null	null	null	null	null
3	"InChI=1S/C4H6N…	"CC1=CC=NN1"	"InChI=1S/C4H6N…	"CC1=NNC=C1"	null	null	null	null	null	null
4	"InChI=1S/C7H6N…	"N=N=NC1=NNC2=C…	"InChI=1S/C7H6N…	"N=N=NC1=C2C=CC…	null	null	null	null	null	null
5	"InChI=1S/C11H1…	"CC1([C@H]2C(C=…	"InChI=1S/C11H1…	"CC1([C@H](CC[C…	"InChI=1S/C11H1…	"CC1([C@H](CC[C…	null	null	null	null

The Excel sheet has sets of columns for each of up to five experimentally-observed structures for a tautomer, called a Ref in the sheet, in a row. We will use the experimentally-observed structures as inputs to the tautomer generation algorithms, so we melt the dataframe by making each Ref-structure pair into its own row. We can again use the selector starts_with() to select all the SMILES columns as the value variables value_vars.

# Melt dataframe: Break out each row's SMILES_n columns into its own row
df_melted = df.melt(
    id_vars=["Ref"], value_vars=cs.starts_with("SMILES_"), value_name="sml"
)

This produces 2,819 (# of cases) * 5 (SMILES columns per case) = 14,095 rows:

df_melted

shape: (14_095, 3)

Ref	variable	sml
i64	str	str
1	"SMILES_1"	"O=C(C)CC(C)=O"
2	"SMILES_1"	"N1=CC=N[NH]1"
3	"SMILES_1"	"CC1=CC=NN1"
4	"SMILES_1"	"N=N=NC1=NNC2=C…
5	"SMILES_1"	"CC1([C@H]2C(C=…
6	"SMILES_1"	"OC(C=CC=C1)=C1…
7	"SMILES_1"	"O=C(C([H])([H]…
8	"SMILES_1"	"O/C(C1=CC=CC=C…
9	"SMILES_1"	"OC1=C(C2=CC=CC…
9	"SMILES_1"	"OC1=C(C2=CC=CC…
9	"SMILES_1"	"OC1=C(C2=CC=CC…
9	"SMILES_1"	"OC1=C(C2=CC=CC…
…	…	…
1767	"SMILES_5"	null
1767	"SMILES_5"	null
1767	"SMILES_5"	null
1767	"SMILES_5"	null
1768	"SMILES_5"	null
1769	"SMILES_5"	null
1770	"SMILES_5"	null
1771	"SMILES_5"	null
1772	"SMILES_5"	null
1773	"SMILES_5"	null
1774	"SMILES_5"	null
1775	"SMILES_5"	null

Because many rows of the Excel sheet have less than five structures, there are many rows with no structure (SMILES), so let’s remove those rows. We also don’t need to know which number SMILES (1-5) each was, so we’ll drop the variable columns that Polars created when we melted the original dataframe.

# Keep rows where SMILES is supplied
df_melted = df_melted.filter(pl.col("sml").is_not_null())

# Remove SMILES_n label column (n = 1-5)
df_melted = df_melted.drop("variable")

A check of the dataframe confirms that it contains the 5,977 structures reported by the data description, and none of the rows have null SMILES.

df_melted

shape: (5_977, 2)

Ref	sml
i64	str
1	"O=C(C)CC(C)=O"
2	"N1=CC=N[NH]1"
3	"CC1=CC=NN1"
4	"N=N=NC1=NNC2=C…
5	"CC1([C@H]2C(C=…
6	"OC(C=CC=C1)=C1…
7	"O=C(C([H])([H]…
8	"O/C(C1=CC=CC=C…
9	"OC1=C(C2=CC=CC…
9	"OC1=C(C2=CC=CC…
9	"OC1=C(C2=CC=CC…
9	"OC1=C(C2=CC=CC…
…	…
1216	"O=C(OCC)/C(C)=…
926	"O=C(/C=C(O)/C/…
927	"O=C(CC(C(CC)/C…
928	"O=C(CC(C/C(O)=…
929	"O=C(OC)/C=C(C/…
929	"O=C(OC)/C=C(C/…
930	"O=C(CC(C/C(O)=…
931	"O=C(/C=C(O)/C/…
932	"O=C(OC)/C=C(C/…
1213	"O=C(OCC)C1/C(S…
1214	"O=C(OCC)C1/C(S…
1215	"O=C(OCC)C(C)/C…

Next we’ll remove duplicate rows, which are possible because a Ref can have multiple rows in the Excel sheet, and those rows may well have some of the same structures.

df_melted = df_melted.unique()
unique_sml = df_melted.shape[0]
unique_sml

This gives us 3,911 different structures. Compared to the 3,884 reported in the data description, this is slightly greater, by 27 or 0.7%. We’ll discuss this below.

Another level of redundancy is that some structures which have different SMILES are actually the same molecule. A way to check if two SMILES correspond to the same molecule is to make each into an RDKit molecule, then output their canonical SMILES, then check if those SMILES are the same.

We defined mol_from_sml() above to create RDKit molecules with two methods: 1) with sanitization if possible 2) as a fallback, without sanitization

Without falling back to sanitize=False, eight rows comprising three Ref values (4, 355, and 1130) are not made into molecules, which prevents them from being processed with the steps that follow. Including the fallback option allows all rows to be made into molecules.

So let’s add the canonical SMILES.

df_melted = df_melted.with_columns(
    [
        pl.col("sml").map_elements(canonicalize).alias("canon_sml"),
    ]
)

df_melted.head()

shape: (5, 3)

Ref	sml	canon_sml
i64	str	str
6	"OC(C=CC=C1)=C1…	"O=C(/C=C(\O)c1…
11	"O=C1C=CC2=CC=C…	"O=C1C=Cc2ccccc…
24	"NC1=CC=CC=CC1=…	"N=c1cccccc1N"
27	"C1(/C=CC=CC=C1…	"c1ccc(Nc2ccccc…
29	"NC1=CC=CC=CC1=…	"Nc1cccccc1=O"

Now let’s keep only unique rows, based on Ref and canonical SMILES.

df_melted = df_melted.unique(["Ref", "canon_sml"])

After removing those duplicates, we have 3,797 different structures:

df_melted.shape

(3797, 3)

Whereas we originally had slightly more than the reported 3,884 data structures, we now have slightly fewer (87 or 2.2%). Let’s take a moment to discuss why.

Determining which molecules are the same is difficult at scale

The discrepancy is likely due to the definition of “different”. Again, we should identify as “the same” molecules where, for example, SMILES writes an aromatic ring coded with single bonds at odd bond indexes (and double bonds at even bond indexes) and one coded with single bonds at even bond indexes (and double bonds at odd bond indexes). Modifying the example from Richard L. Apodaca, if we don’t sanitize these two alternate SMILES for 1,2-bromobenzene, we get different structures and canonical SMILES because they are represented in Kekulé form, with alternating single and double bonds:

smls_aromatic_equivalent = ["C1=C(Br)C(Br)=CC=C1", "C1C(Br)=C(Br)C=CC=1"]
mols_aromatic_equivalent_unsanitized = [
    Chem.MolFromSmiles(sml, sanitize=False) for sml in smls_aromatic_equivalent
]
canon_smls_aromatic_equivalent_unsanitized = [
    Chem.MolToSmiles(mol) for mol in mols_aromatic_equivalent_unsanitized
]
Draw.MolsToGridImage(
    mols_aromatic_equivalent_unsanitized,
    legends=canon_smls_aromatic_equivalent_unsanitized,
)

Two unsanitized structures of 1,2-bromobenzene, differing by position of single and double bonds around the ring

and the two canonical SMILES are not the same

canon_smls_aromatic_equivalent_unsanitized[
    0
] == canon_smls_aromatic_equivalent_unsanitized[1]

False

whereas if we do sanitize them, we get the same structure and canonical SMILES:

mols_aromatic_equivalent_sanitized = [
    Chem.MolFromSmiles(sml, sanitize=True) for sml in smls_aromatic_equivalent
]
canon_smls_aromatic_equivalent_sanitized = [
    Chem.MolToSmiles(mol) for mol in mols_aromatic_equivalent_sanitized
]
Draw.MolsToGridImage(
    mols_aromatic_equivalent_sanitized, legends=canon_smls_aromatic_equivalent_sanitized
)

Two sanitized structures of 1,2-bromobenzene, with the same positioning of single and double bonds around the ring

canon_smls_aromatic_equivalent_sanitized[0] == canon_smls_aromatic_equivalent_sanitized[
    1
]

True

So the slight differences in the number of different structures as determined by RDKit compared to the tool used by Nicklaus and team (presumably their CACTVS tool) are probably due to the algorithms used to determine the molecular graph (from the input SMILES, InChI, or other identifier) and then the canonical version of the identifier. Inspection by chemists of individual pairs would probably lead to better agreement of whether two similar structures are different, but the automatic adjudication required for large datasets is undoubtedly challenging to encode in an algorithm.

Summary: We’ve ingested the experimental data, confirmed the raw number of structures, and noted that differences in cheminformatics algorithms lead to small differences in the number of different structures.

InChI incorporating multiple tautomers

“InChI is in principle designed to be tautomer-invariant”, meaning tautomers of a structure should be assigned the same InChI. Let’s check how successful InChI is at that by performing the same operations we did on the SMILES columns (melting and removing duplicates), but using the InChI columns, and determining how many InChI are required to represent the structures.

df_melted_InChI = df.melt(
    id_vars=["Ref"], value_vars=cs.starts_with("Std_InChI_"), value_name="InChI"
)

# Keep rows where InChI is supplied
df_melted_InChI = df_melted_InChI.filter(pl.col("InChI").is_not_null())

# Remove InChI_n label column (n = 1-5)
df_melted_InChI = df_melted_InChI.drop("variable")

# Remove duplicate rows
df_melted_InChI = df_melted_InChI.unique()
unique_InChI = df_melted_InChI.shape[0]
df_melted_InChI

shape: (3_464, 2)

Ref	InChI
i64	str
9	"InChI=1S/C17H1…
21	"InChI=1S/C12H9…
58	"InChI=1S/CH2N4…
59	"InChI=1S/C7H6N…
72	"InChI=1S/C10H1…
79	"InChI=1S/C19H1…
80	"InChI=1S/C16H1…
93	"InChI=1S/C12H1…
103	"InChI=1S/C4H6N…
109	"InChI=1S/C4H6N…
111	"InChI=1S/C10H1…
139	"InChI=1S/C11H9…
…	…
1139	"InChI=1S/C16H1…
1146	"InChI=1S/C16H1…
1154	"InChI=1S/C5H8O…
1161	"InChI=1S/C9H16…
1167	"InChI=1S/C12H1…
1191	"InChI=1S/C7H12…
1213	"InChI=1S/C14H2…
1214	"InChI=1S/C15H2…
1215	"InChI=1S/C8H14…
930	"InChI=1S/C9H11…
930	"InChI=1S/C9H11…
1215	"InChI=1S/C8H14…

So 3464 InChI are required to represent the 1776 sets of tautomers, whereas 3911 SMILES were. Let’s check how well InChI does at representing multiple tautomers with a single representation (InChI), compared to SMILES which does not have that goal or capability:

round(unique_InChI / Ref_count, 2)

1.95

round(unique_sml / Ref_count, 2)

2.2

# Calculate percent reduction
f"{(unique_sml - unique_InChI) / unique_sml:.0%}"

'11%'

So on average 1.95 InChI can represent a set of tautomers, compared to 2.2 SMILES, or an 11% reduction. Should InChI achieve its goal of being tautomer invariant, it would require only one InChI for a set of tautomers, which would be 1776 here. So InChI is only modestly successful.

Summary: InChI is only modestly successful at representing a set of tautomers with a single representation.

Enumerating tautomers using algorithms

Let’s start enumerating tautomers using two RDKit algorithms. We’ll define a function to create tautomers for a given enumerator class.

def tauts_as_sml_list(
    enumerator: Callable,
    sml: str,
) -> list[str]:
    """Use a tautomer enumerator to find the tautomers for a given SMILES

    :param enumerator: The tautomer enumerator class, which has an Enumerate method
    :param mol: An RDKit molecule
    :returns: A list of SMILES
    """
    mol = mol_from_sml(sml)
    tauts = enumerator.Enumerate(mol)
    tauts_as_sml_list = []
    for taut in tauts:
        tauts_as_sml_list.append(Chem.MolToSmiles(taut))

    # Make smls into a set to remove duplicates, then back into a list so can be in dataframe
    tauts_as_sml_list = list(set(tauts_as_sml_list))
    return tauts_as_sml_list

Now comes the computationally-intensive step of enumerating tautomers for each experimental structure (SMILES) using the two RDKit tautomer enumerators:

TautomerEnumerator, the default class with the updated rules, which will also be our baseline
GetV1TautomerEnumerator, the previous version

This step takes about 1.5 minutes on my laptop.

# Enumerate tautomers--this may take more than one minute
enumerators = [
    rdMolStandardize.TautomerEnumerator,
    rdMolStandardize.GetV1TautomerEnumerator,
]
for i, enumerator in enumerate(enumerators):
    df_melted = df_melted.with_columns(
        [
            pl.col("canon_sml")
            .map_elements(lambda s: tauts_as_sml_list(enumerator(), s))
            .alias(f"tauts_{enumerator.__name__}"),
        ]
    )

df_melted.head(3)

shape: (3, 5)

Ref	sml	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator
i64	str	str	list[str]	list[str]
37	"C1(/C=N/N2C=CC…	"C1=C/C(=C/Nc2c…	["C1=CC(C=Nc2ccccc2)C(C=Nn2cccc2)=C1", "C1=CC(=CNc2ccccc2)C(C=Nn2cccc2)=C1", … "C1=CC(C=Nn2cccc2)C(C=Nc2ccccc2)=C1"]	["C1=CC(C=Nc2ccccc2)C(C=Nn2cccc2)=C1", "C1=CC(=CNc2ccccc2)C(C=Nn2cccc2)=C1", … "C1=CC(C=Nn2cccc2)C(C=Nc2ccccc2)=C1"]
66	"O=C(C1=CC(C(C(…	"COC(=O)C1=CC(c…	["COC(=O)C1=CC(c2cc3ccccc3[nH]2)C(=C(O)OC)N=N1", "COC(=O)C1=CC(c2cc3ccccc3[nH]2)C(C(=O)OC)=NN1", "COC(=O)C1=CC(c2cc3ccccc3[nH]2)C(C(=O)OC)N=N1"]	["COC(=O)C1=NNC(=C(O)OC)C=C1c1cc2ccccc2[nH]1", "COC(=O)C1=C(c2cc3ccccc3[nH]2)CC(C(=O)OC)N=N1", … "COC(=O)C1C=C(c2cc3ccccc3[nH]2)C(=C(O)OC)N=N1"]
87	"C1(C2=CC=CC=C2…	"c1ccc(C2=NCC(c…	["c1ccc(C2=NCC(c3ccccc3)=NC2)cc1"]	["C1=NC(c2ccccc2)=CNC1c1ccccc1", "C1=C(c2ccccc2)NCC(c2ccccc2)=N1", … "C1=C(c2ccccc2)NC=C(c2ccccc2)N1"]

Summary: We applied RDKit’s tautomer enumeration algorithms to all experimental structures.

Entering tautomers generated by external algorithms

For the NIH cactus Tautomerizer and CACTVS, I don’t have programmatic access, so tautomers for some Refs are given below, either directly or read from a file.

NIH cactus Tautomerizer

# Manually list tautomers created by NIH Tautomerizer
# https://cactus.nci.nih.gov/cgi-bin/tautomerize.tcl
# Settings:
#   steps: multi
#   Predicted tautomers by: All Rules
cactus_sml_tauts = {
    # 73a
    "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1": [
        # 4 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "CSC1=NC(=C(C(N1)C2=C(C=CC=C2)[N+](=O)[O-])C(=O)OC(C)C)C",
        "CSC2=NC(C1=C(C=CC=C1)[N+](=O)[O-])C(C(=O)OC(C)C)C(=N2)C",
        "CSC2=NC(C1=C(C=CC=C1)[N+](=O)[O-])C(C(=O)OC(C)C)C(=C)N2",
        "CSC1=NC(C(C(N1)C2=C(C=CC=C2)[N+](=O)[O-])C(=O)OC(C)C)=C",
        # 1 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "CSC2=NC(C1=C(C=CC=C1)[N+](=O)[O-])C(=C(O)OC(C)C)C(=N2)C",
        # 1 tautomer(s) generated using PT_22_00 - imine/imine
        "CSC1NC(=C(C(=N1)C2=C(C=CC=C2)[N+](=O)[O-])C(=O)OC(C)C)C",
        # 2 tautomer(s) generated using PT_29_00 - nitro/aci-nitro via aromatic ring (2): 1,5 H-shift
        "CSC2=NC(=C1C(C=CC=C1)=[N+]([O-])O)C(=C(C)N2)C(=O)OC(C)C",
        "CSC2=NC(=C1C(=CCC=C1)[N+]([O-])=O)C(=C(C)N2)C(=O)OC(C)C",
    ],
    # 73b
    "CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1": [
        # 4 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "CSC1=NC(C(=C(N1)C)C(=O)OC(C)C)C2=C(C=CC=C2)[N+](=O)[O-]",
        "CSC1=NC(C(C(=N1)C)C(=O)OC(C)C)C2=C(C=CC=C2)[N+](=O)[O-]",
        "CSC1=NC(C(C(N1)=C)C(=O)OC(C)C)C2=C(C=CC=C2)[N+](=O)[O-]",
        "CSC2=NC(=C)C(C(=O)OC(C)C)C(C1=C(C=CC=C1)[N+](=O)[O-])N2",
        # 1 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "CSC1=NC(C(C(=N1)C)=C(O)OC(C)C)C2=C(C=CC=C2)[N+](=O)[O-]",
        # 2 tautomer(s) generated using PT_29_00 - nitro/aci-nitro via aromatic ring (2): 1,5 H-shift
        "CSC2=NC(=C(C(=O)OC(C)C)C(=C1C(C=CC=C1)=[N+]([O-])O)N2)C",
        "CSC2=NC(=C(C(=O)OC(C)C)C(=C1C(=CCC=C1)[N+]([O-])=O)N2)C",
    ],
    # 457a
    "[2H]Oc1ccc(-c2oc3c([2H])c(O[2H])c(OC)c(=O)c-3c(O[2H])c2O[2H])cc1O[2H]": [
        # 2 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "O([2H])C1=C(C=C(C=C1)C2=C(C(C3=C(O2)C([2H])(C(=C(OC)C3=O)O[2H])[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C2=C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)O[2H])O[2H]",
        # 9 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "O=C1C(C=C(C=C1)C3=C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])O[2H])(O[2H])[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(=C2C(C(C(C(=C2O3)[2H])=O)(OC)[2H])=O)O[2H])O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(C2(C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C3(C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])=O)[2H])O[2H]",
        "O([2H])C1(C(C=C(C=C1)C3=C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])O[2H])=O)[2H]",
        "O=C1C(C=C(C=C1)C3=C(C(=C2C(C(C(C(=C2O3)[2H])=O)(OC)[2H])=O)O[2H])O[2H])(O[2H])[2H]",
        "O=C1C(C=C(C=C1)C3=C(C(C2(C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)[2H])=O)O[2H])(O[2H])[2H]",
        "O=C1C(C=C(C=C1)C3(C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])=O)[2H])(O[2H])[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(C2(C(C(C(C(=C2O3)[2H])=O)(OC)[2H])=O)[2H])=O)O[2H])O[2H]",
        # 2 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "O([2H])C1=C(C=C(C=C1)C3=C(C(=C2C(=C(C(C(=C2O3)[2H])=O)OC)O[2H])O[2H])O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(C2=C(C(=C(C(=C2O3)[2H])O[2H])OC)O[2H])=O)O[2H])O[2H]",
        # 2 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "O=C1C(=CC(=CC1[2H])C3=C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])O[2H])O[2H]",
        "O([2H])C3=CC=C(C2=C(C(=C1C(C(=C(C(=C1O2)[2H])O[2H])OC)=O)O[2H])O[2H])C(C3=O)[2H]",
        # 1 tautomer(s) generated using PT_11_00 - 1,11 (aromatic) heteroatom H shift
        "O=C1C(=CC(C=C1)=C3C(=C(C2=C(C(=C(C(=C2O3)[2H])O[2H])OC)O[2H])O[2H])O[2H])O[2H]",
        # 1 tautomer(s) generated using PT_11_02 - 1,15 (aromatic) heteroatom H shift
        "O=C1C(=CC(C=C1)=C2C(=C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])O[2H])O[2H])O[2H]",
    ],
    # 457b
    "[2H]Oc1ccc(-c2oc3c([2H])c(O[2H])c(OC)c(O[2H])c3c(=O)c2O[2H])cc1O[2H]": [
        # 9 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "O=C1C(C=C(C=C1)C2=C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)O[2H])(O[2H])[2H]",
        "O([2H])C1=C(C=C(C=C1)C2=C(C(C3=C(O2)C([2H])(C(=O)C(=C3O[2H])OC)[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C2=C(C(C3=C(O2)C(=C(O[2H])C(OC)(C3=O)[2H])[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C2(C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)=O)[2H])O[2H]",
        "O([2H])C1(C(C=C(C=C1)C2=C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)O[2H])=O)[2H]",
        "O=C1C(C=C(C=C1)C2=C(C(C3=C(O2)C([2H])(C(=O)C(=C3O[2H])OC)[2H])=O)O[2H])(O[2H])[2H]",
        "O=C1C(C=C(C=C1)C2=C(C(C3=C(O2)C(=C(O[2H])C(OC)(C3=O)[2H])[2H])=O)O[2H])(O[2H])[2H]",
        "O=C1C(C=C(C=C1)C2(C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)=O)[2H])(O[2H])[2H]",
        "O([2H])C1=C(C=C(C=C1)C2=C(C(C3=C(O2)C([2H])(C(=O)C(OC)(C3=O)[2H])[2H])=O)O[2H])O[2H]",
        # 7 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "O=C1C(=CC(=CC1[2H])C2=C(C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(C2(C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)[2H])=O)O[2H])O[2H]",
        "O([2H])C1=C(C=C(C=C1)C3=C(C(=C2C(=C(C(C(=C2O3)[2H])=O)OC)O[2H])O[2H])O[2H])O[2H]",
        "O([2H])C3=CC=C(C1=C(C(C2=C(O1)C(=C(O[2H])C(=C2O[2H])OC)[2H])=O)O[2H])C(C3=O)[2H]",
        "O=C1C(=CC(=CC1[2H])C3=C(C(C2(C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)[2H])=O)O[2H])O[2H]",
        "O=C1C(=CC(=CC1[2H])C3=C(C(=C2C(=C(C(C(=C2O3)[2H])=O)OC)O[2H])O[2H])O[2H])O[2H]",
        "O=C1C(=CC(C=C1[2H])=C2C(=C(C3=C(O2)C(=C(O[2H])C(=C3O[2H])OC)[2H])O)O[2H])O[2H]",
        # 1 tautomer(s) generated using PT_10_00 - 1,9 (aromatic) heteroatom H shift
        "O([2H])C1=C(C=C(C=C1)C3=C(C(=C2C(C(=C(C(=C2O3)[2H])O[2H])OC)=O)O[2H])O[2H])O[2H]",
    ],
    # 467a
    "O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12": [
        # 9 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "O=C1C3=C(OC(C1=O)C2=CC(=C(C=C2)O)O)C=C(O)C=C3O",
        "O=C1C3=C(OC(=C1O)C2=CC(C(C=C2)=O)O)C=C(O)C=C3O",
        "O=C1C3=C(OC(=C1O)C2=CC(C(C=C2)O)=O)C=C(O)C=C3O",
        "O=C1C3=C(OC(=C1O)C2=CC(=C(C=C2)O)O)CC(=O)C=C3O",
        "O=C1C3=C(OC(=C1O)C2=CC(=C(C=C2)O)O)C=C(O)CC3=O",
        "O=C1C3=C(OC(C1=O)C2=CC(C(C=C2)=O)O)C=C(O)C=C3O",
        "O=C1C3=C(OC(C1=O)C2=CC(C(C=C2)O)=O)C=C(O)C=C3O",
        "O=C1C3=C(OC(C1=O)C2=CC(=C(C=C2)O)O)CC(=O)C=C3O",
        "O=C1C3=C(OC(C1=O)C2=CC(=C(C=C2)O)O)C=C(O)CC3=O",
        # 7 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "OC2=C1C(=CC(C=C1OC(=C2O)C3=CC(=C(C=C3)O)O)=O)O",
        "O=C1C3=C(OC(=C1O)C2=CCC(=O)C(=C2)O)C=C(O)C=C3O",
        "O=C1C3=C(OC(=C1O)C2=CC=C(O)C(=O)C2)C=C(O)C=C3O",
        "O=C2C1=C(CC(C=C1OC(=C2O)C3=CC(=C(C=C3)O)O)=O)O",
        "O=C2C1C(C=C(C=C1OC(=C2O)C3=CC(=C(C=C3)O)O)O)=O",
        "OC2=C1C(=CC(C=C1OC(=C2O)C3=CCC(=O)C(=C3)O)=O)O",
        "OC2=C1C(=CC(C=C1OC(=C2O)C3=CC=C(O)C(=O)C3)=O)O",
        # 1 tautomer(s) generated using PT_10_00 - 1,9 (aromatic) heteroatom H shift
        "OC2=C1C(C=C(C=C1OC(=C2O)C3=CC(=C(C=C3)O)O)O)=O",
    ],
    # 467b
    "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2": [
        # 2 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "O=C1C2=C(CC(=C1)O)OC(=C(O)C2=O)C3=CC(=C(C=C3)O)O",
        "OC2=C1C(C(=C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)O)=O",
        # 9 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "O=C2C1=C(C(=C(OC1=CC(C2)=O)C3=CC(=C(C=C3)O)O)O)O",
        "O=C2C1=C(C(=C(OC1=CC(=C2)O)C3=CC(C(C=C3)=O)O)O)O",
        "O=C2C1=C(C(=C(OC1=CC(=C2)O)C3=CC(C(C=C3)O)=O)O)O",
        "O=C2C1=C(C(C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)=O)O",
        "O=C2C1C(C(=C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)O)=O",
        "O=C2C1=C(C(=C(OC1=CC(C2)=O)C3=CC(C(C=C3)=O)O)O)O",
        "O=C2C1=C(C(=C(OC1=CC(C2)=O)C3=CC(C(C=C3)O)=O)O)O",
        "O=C2C1=C(C(C(OC1=CC(C2)=O)C3=CC(=C(C=C3)O)O)=O)O",
        "O=C2C1C(C(=C(OC1=CC(C2)=O)C3=CC(=C(C=C3)O)O)O)=O",
        # 2 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "OC2=C1C(C(=C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)O)=O",
        "OC1=CC(=O)C=C2C1=C(C(=C(O2)C3=CC(=C(C=C3)O)O)O)O",
        # 7 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "O=C1C2=C(C=C(C1)O)OC(=C(O)C2=O)C3=CC(=C(C=C3)O)O",
        "O=C2C1=C(C(=C(OC1=CC(=C2)O)C3=CCC(=O)C(=C3)O)O)O",
        "O=C2C1=C(C(=C(OC1=CC(=C2)O)C3=CC=C(O)C(=O)C3)O)O",
        "OC2=C1C(C(=C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)O)=O",
        "OC2=C1C(C(=C(OC1=CC(C2)=O)C3=CC(=C(C=C3)O)O)O)=O",
        "O=C1C2=C(C=C(C1)O)OC(=C(O)C2=O)C3=CCC(=O)C(=C3)O",
        "O=C1C2=C(C=C(C1)O)OC(=C(O)C2=O)C3=CC=C(O)C(=O)C3",
        # 1 tautomer(s) generated using PT_11_00 - 1,11 (aromatic) heteroatom H shift
        "OC2=C1C(=C(C(OC1=CC(=C2)O)=C3C=C(C(C=C3)=O)O)O)O",
        # 1 tautomer(s) generated using PT_11_02 - 1,15 (aromatic) heteroatom H shift
        "OC1=CC(=CC2=C1C(=C(C(O2)=C3C=C(C(C=C3)=O)O)O)O)O",
    ],
    # 888a
    "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1[C@@H]([C@H](C)O)[C@@H]6[C@H](C)O)c23)c54": [
        # 5 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC4=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC4=C6C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)=O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC5=C6C3=C1C(=C(C(C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)=O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC5=C6C3=C1C(=C(C(C2=C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)O)=O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        # 2 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC5=C6C3=C1C(=C(C(C2=C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)O)=O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        # 6 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC5=C(O)C1=C(O)C=C(OC)C2=C1C6=C3C(=C(C(=C4C(C=C(C2=C34)OC)=O)O)OC)[C@H](C(=C56)[C@H](C)O)[C@H](C)O",
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C(CC(C(=C34)C5=O)=O)OC)OC)=O)O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC2=C6C3=C1C(=C(C(C5=C1C(=C4C(=CC(C(=C2O)C34)=O)OC)C(=CC5=O)OC)=O)OC)[C@H]([C@@H]6[C@H](C)O)[C@H](C)O",
        "COC5=C(O)C4=C3C6=C1C(=C(C(=C2C(C=C(C(=C12)C3=C(CC4=O)OC)OC)=O)O)OC)[C@H](C(=C56)[C@H](C)O)[C@H](C)O",
        "COC5=C(O)C1=C(O)C=C(OC)C3=C1C6=C2C(=C(C(C4=C2C3=C(OC)CC4=O)=O)OC)[C@H](C(=C56)[C@H](C)O)[C@H](C)O",
        "COC2=C(O)C1=C(O)C=C(OC)C4=C1C3=C5C(=C(C(=C23)[C@H](C)O)[C@H](C)O)C(=C(C6=C(C=C(C4=C56)OC)O)O)OC",
    ],
    # 888b
    "COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@@H]([C@H](C)O)[C@@H]2[C@H](C)O": [
        # 2 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "COC4=C1[C@H](C(=C6C2=C5C(=C3C(=C12)C(=C(C=C3OC)O)C4=O)C(=CC(=C5C(=C6OC)O)O)OC)[C@H](C)O)[C@H](C)O",
        "COC1=C(C2=C(C=C(C3=C5C(=CC(=C6C(=C(C4=C([C@H](C)O)C(=C1C(=C23)C4=C56)[C@H](C)O)OC)O)O)OC)OC)O)O",
        # 4 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=O)C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)[C@H](C)O)[C@H](C)O",
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=O)C5C4=C1C2=C3C(C(C=C(C3=C4C(=CC5=O)OC)OC)=O)C6=O)[C@H](C)O)[C@H](C)O",
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)O)[C@H](C)O)[C@H](C)O",
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=C5C4=C1C2=C3C(C(C=C(C3=C4C(=CC5=O)OC)OC)=O)C6=O)O)[C@H](C)O)[C@H](C)O",
        # 2 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)O)[C@H](C)O)[C@H](C)O",
        "COC3=C2[C@H]([C@@H](C1=C(OC)C(=C6C5=C1C2=C4C(=C3O)C(C=C(C4=C5C(=CC6=O)OC)OC)=O)O)[C@H](C)O)[C@H](C)O",
        # 4 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC2=C3[C@H]([C@@H](C4=C(OC)C(=O)C6=C5C(=C1C(C(=C(C=C1OC)O)C2=O)C3=C45)C(=CC6=O)OC)[C@H](C)O)[C@H](C)O",
        "COC6=C1[C@H]([C@@H](C5=C3C1=C2C(=C(C=C(C2=C4C(=CC(C(=C34)C(=C5OC)O)=O)OC)OC)O)C6=O)[C@H](C)O)[C@H](C)O",
        "COC6=C1[C@H]([C@@H](C5C3=C1C2=C(C(C=C(C2=C4C(=CC(C(=C34)C(=C5OC)O)=O)OC)OC)=O)C6=O)[C@H](C)O)[C@H](C)O",
        "COC6=C2[C@H]([C@@H](C1=C(OC)C(=O)C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)[C@H](C)O)[C@H](C)O",
    ],
    # 890a
    "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c(C[C@H](C)OC(=O)c6ccc(O)cc6)c(c(c1C[C@@H](C)OC(=O)c1ccccc1)c23)c54": [
        # 7 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)=O)OC)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)=O)OC)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        # 3 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC5=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2=C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)O)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC5=C(C3=C1C(=C(C(C2=C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)O)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        # 8 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC6=C(O)C1=C(O)C=C(OC)C2=C1C(=C3C(=C(C(=C4C(C=C(C2=C34)OC)=O)O)OC)C[C@H](C)OC(=O)C5=CC=C(C=C5)O)C6=C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC5=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C(CC(C(=C34)C5=O)=O)OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(C2=C1C(=C(OC)CC2=O)C5=C3C(=C4O)C(=O)C=C5OC)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(C(=C(C2=C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)O)O)OC)=C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(O)=C6C=CC(C=C6)=O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CCC(=O)C=C6)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC2=C(C3=C1C(=C(C(C5=C1C(=C4C(=CC(C(=C2O)C34)=O)OC)C(=CC5=O)OC)=O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
        "COC3=C(C1=C4C(=C(C(=C5C(C=C(C(=C2C(=CC(C(=C12)C3=O)=O)OC)C45)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7",
    ],
    # 890b
    "COc1c(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(C[C@H](C)OC(=O)c4ccc(O)cc4)c(OC)c(=O)c4c(O)cc(OC)c(c5c(OC)cc(O)c(c1=O)c52)c43": [
        # 3 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(=C5C(=C(C(C3=C45)=C[C@H](C)OC(=O)C6=CC=C(C=C6)O)OC)O)O)OC)OC)O)C7=O",
        "COC5=C(C3=C(C=C(C4=C1C(=CC(=C2C(C(=C(C(=C12)C(=C34)C5=C[C@@H](C)OC(=O)C6=CC=CC=C6)C[C@H](C)OC(=O)C7=CC=C(C=C7)O)OC)=O)O)OC)OC)O)O",
        "COC5=C(C3=C(C=C(C4=C1C(=CC(=C2C(=C(C(C(=C12)C(=C34)C5=C[C@@H](C)OC(=O)C6=CC=CC=C6)=C[C@H](C)OC(=O)C7=CC=C(C=C7)O)OC)O)O)OC)OC)O)O",
        # 7 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(=C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)OC)=O)O)OC)OC)O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)OC)=O)=O)OC)OC)O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(C(C=C(C2=C4C(=CC(=C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)OC)=O)O)OC)OC)=O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)OC)=O)=O)OC)OC)O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(C(C=C(C2=C4C(=CC(=C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)OC)=O)O)OC)OC)=O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(C(C=C(C2=C4C(=CC(C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)OC)=O)=O)OC)OC)=O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(C(C=C(C2=C4C(=CC(C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CCC(C=C6)=O)OC)=O)=O)OC)OC)=O)C7=O",
        # 3 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(C5=C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)OC)O)=O)OC)OC)O)C7=O",
        "COC2=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C4=C3C(=C2O)C(C=C(C3=C5C(=CC(=C6C(C(=C(C4=C56)C[C@H](C)OC(=O)C7=CC=C(C=C7)O)OC)=O)O)OC)OC)=O",
        "COC2=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C4=C3C(=C2O)C(C=C(C3=C5C(=CC(C6=C(C(=C(C4=C56)C[C@H](C)OC(=O)C7=CC=C(C=C7)O)OC)O)=O)OC)OC)=O",
        # 6 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(=C5C(C(=C(C3=C45)C[C@H](C)OC(O)=C6C=CC(C=C6)=O)OC)=O)O)OC)OC)O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C3=C2C(=C(C=C(C2=C4C(=CC(=C5C(C(=C(C3=C45)C[C@H](C)OC(=O)C6=CCC(=O)C=C6)OC)=O)O)OC)OC)O)C7=O",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C5=C2C4=C(C(C(=C2C[C@H](C)OC(=O)C3=CC=C(C=C3)O)OC)=O)C(=O)C=C(OC)C4=C6C5C(=C(C=C6OC)O)C7=O",
        "COC4=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C2=C6C5C(=C3C2=C(C(C=C3OC)=O)C4=O)C(=CC(=C5C(C(=C6C[C@H](C)OC(=O)C7=CC=C(C=C7)O)OC)=O)O)OC",
        "COC7=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C5=C2C4=C(C(C(=C2C[C@H](C)OC(O)=C3C=CC(C=C3)=O)OC)=O)C(=O)C=C(OC)C4=C6C5C(=C(C=C6OC)O)C7=O",
        "COC4=C(C[C@@H](C)OC(=O)C1=CC=CC=C1)C2=C6C5C(=C3C2=C(C(C=C3OC)=O)C4=O)C(=CC(=C5C(C(=C6C[C@H](C)OC(O)=C7C=CC(C=C7)=O)OC)=O)O)OC",
    ],
    # 891a
    "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54": [
        # 6 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC4=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC4=C6C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC4=C6C3=C1C(=C(C(C2C(C=C(C(=C12)C5=C3C(C4=O)C(=O)C=C5OC)OC)=O)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC4=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)C(C(C6)(C)O)=C(C)O",
        "COC4=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(=C)O",
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        # 3 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC4=C6C3=C1C(=C(C(C2=C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)O)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC5=C6C3=C1C(=C(C(C2=C(C=C(C(=C12)C4=C3C(=C(O)C=C4OC)C5=O)OC)O)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        # 9 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC5=C(O)C1=C(O)C=C(OC)C2=C1C6=C3C(=C(C(=C4C(C=C(C2=C34)OC)=O)O)OC)[C@@H](C(C=C56)(C)O)C(C)=O",
        "COC5=C6C3=C1C(=C(C(=C2C(C=C(C(=C12)C4=C(CC(C(=C34)C5=O)=O)OC)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC4=C6C3=C1C(=C(C(C2=C1C(=C(OC)CC2=O)C5=C3C(=C4O)C(=O)C=C5OC)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC5=C1C4=C2C(=C(C(C1)(C)O)C(C)=O)C(=C(C3=C(C=C(C(=C23)C6=C4C(=C5O)C(=O)C=C6OC)OC)O)O)OC",
        "COC2=C6C3=C1C(=C(C(C5=C1C(=C4C(=CC(C(=C2O)C34)=O)OC)C(=CC5=O)OC)=O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC3=C6C1=C4C(=C(C(=C5C(C=C(C(=C2C(=CC(C(=C12)C3=O)=O)OC)C45)OC)=O)O)OC)[C@@H](C(C6)(C)O)C(C)=O",
        "COC5=C(O)C4=C3C6=C1C(=C(C(=C2C(C=C(C(=C12)C3=C(CC4=O)OC)OC)=O)O)OC)[C@@H](C(C=C56)(C)O)C(C)=O",
        "COC5=C(O)C1=C(O)C=C(OC)C3=C1C6=C2C(=C(C(C4=C2C3=C(OC)CC4=O)=O)OC)[C@@H](C(C=C56)(C)O)C(C)=O",
        "COC2=C(O)C1=C(O)C=C(OC)C4=C1C3=C5C(=C(C(C=C23)(C)O)C(C)=O)C(=C(C6=C(C=C(C4=C56)OC)O)O)OC",
    ],
    # 891b
    "COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@H](C(C)=O)C(C)(O)C2": [
        # 3 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "COC4=C1CC(C(=C6C2=C5C(=C3C(=C12)C(=C(C=C3OC)O)C4=O)C(=CC(=C5C(=C6OC)O)O)OC)C(C)=O)(C)O",
        "COC5=C(C3=C(C=C(C4=C1C(=CC(=C2C(C(=C6C(=C12)C(=C34)C5=CC([C@H]6C(C)=O)(C)O)OC)=O)O)OC)OC)O)O",
        "COC1=C(C2=C(C=C(C3=C5C(=CC(=C6C(=C(C4=C(C(C)=O)C(C)(O)C=C1C(=C23)C4=C56)OC)O)O)OC)OC)O)O",
        # 6 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "COC6=C2CC([C@H](C1=C(OC)C(=O)C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)C(C)=O)(C)O",
        "COC6=C1CC([C@H](C2=C(OC)C(=O)C3=C(O)C=C(OC)C4=C5C(=C1C2=C34)C(C(C=C5OC)=O)C6=O)C(C)=O)(C)O",
        "COC6=C2CC([C@H](C1=C(OC)C(=O)C5C4=C1C2=C3C(C(C=C(C3=C4C(=CC5=O)OC)OC)=O)C6=O)C(C)=O)(C)O",
        "COC6=C1CC(C(C2=C(OC)C(=O)C3=C(O)C=C(OC)C4=C5C(=C1C2=C34)C(=C(C=C5OC)O)C6=O)=C(C)O)(C)O",
        "COC6=C1CC([C@H](C2=C(OC)C(=O)C3=C(O)C=C(OC)C4=C5C(=C1C2=C34)C(=C(C=C5OC)O)C6=O)C(=C)O)(C)O",
        "COC6=C2CC([C@H](C1=C(OC)C(=C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)O)C(C)=O)(C)O",
        # 3 tautomer(s) generated using PT_07_00 - 1,5 (aromatic) heteroatom H shift (1)
        "COC6=C2CC([C@H](C1=C(OC)C(=C5C4=C1C2=C3C(=C(C=C(C3=C4C(=CC5=O)OC)OC)O)C6=O)O)C(C)=O)(C)O",
        "COC5=C1CC([C@H](C2=C(OC)C(=O)C3=C(O)C=C(OC)C4=C6C(=C1C2=C34)C(=C5O)C(C=C6OC)=O)C(C)=O)(C)O",
        "COC3=C2CC([C@H](C1=C(OC)C(=C6C5=C1C2=C4C(=C3O)C(C=C(C4=C5C(=CC6=O)OC)OC)=O)O)C(C)=O)(C)O",
        # 3 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "COC2=C3CC([C@H](C4=C(OC)C(=O)C6=C5C(=C1C(C(=C(C=C1OC)O)C2=O)C3=C45)C(=CC6=O)OC)C(C)=O)(C)O",
        "COC2=C3CC([C@H](C4=C(OC)C(=O)C5=C(O)C=C(OC)C6=C1C(=C(C(C=C1OC)=O)C2=O)C3=C4C56)C(C)=O)(C)O",
        "COC6=C1CC([C@H](C5=C3C1=C2C(=C(C=C(C2=C4C(=CC(C(=C34)C(=C5OC)O)=O)OC)OC)O)C6=O)C(C)=O)(C)O",
    ],
    # 1512a
    "CCP1(CC)(c2ccccc2)Nc2cc(C(c3ccccc3)(c3ccccc3)c3ccccc3)cc(C(C)(C)C)c2O1": [
        # 1 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "CC[P]2(OC1C(=CC(=CC1=N2)C(C3=CC=CC=C3)(C4=CC=CC=C4)C5=CC=CC=C5)C(C)(C)C)(CC)C6=CC=CC=C6",
        # 1 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "CC[P]2(OC1=C(C=C(CC1=N2)C(C3=CC=CC=C3)(C4=CC=CC=C4)C5=CC=CC=C5)C(C)(C)C)(CC)C6=CC=CC=C6",
        # 1 tautomer(s) generated using RC_12_00 - 5_endo_tet or iminophosphorane/benzoxazaphospholine
        "CC[P](CC)(C1=CC=CC=C1)=NC2=C(C(=CC(=C2)C(C3=CC=CC=C3)(C4=CC=CC=C4)C5=CC=CC=C5)C(C)(C)C)O",
    ],
    # 1512b
    "CCP(CC)(=Nc1cc(C(c2ccccc2)(c2ccccc2)c2ccccc2)cc(C(C)(C)C)c1O)c1ccccc1": [
        # 3 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "CC[P](CC)(=NC1C(C(=CC(=C1)C(C2=CC=CC=C2)(C3=CC=CC=C3)C4=CC=CC=C4)C(C)(C)C)=O)C5=CC=CC=C5",
        "CC=[P](CC)(NC1=C(C(=CC(=C1)C(C2=CC=CC=C2)(C3=CC=CC=C3)C4=CC=CC=C4)C(C)(C)C)O)C5=CC=CC=C5",
        "CC=[P](CC)(NC1C(C(=CC(=C1)C(C2=CC=CC=C2)(C3=CC=CC=C3)C4=CC=CC=C4)C(C)(C)C)=O)C5=CC=CC=C5",
        # 1 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "CC[P](CC)(=NC1=CC(=CC(C(C)(C)C)C1=O)C(C2=CC=CC=C2)(C3=CC=CC=C3)C4=CC=CC=C4)C5=CC=CC=C5",
        # 1 tautomer(s) generated using RC_12_00 - 5_endo_tet or iminophosphorane/benzoxazaphospholine
        "CC[P]5(CC)(NC1=C(C(=CC(=C1)C(C2=CC=CC=C2)(C3=CC=CC=C3)C4=CC=CC=C4)C(C)(C)C)O5)C6=CC=CC=C6",
    ],
    # 1688a
    "C=C(/C=[N+](\[O-])C(C)(C)/C(C)=N/O)OCC": [
        # 1 tautomer(s) generated using PT_03_00 - simple (aliphatic) imine
        "C=C(\C=[N+](/[O-])C(C)(C)C(=C)NO)OCC",
        # 2 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "C=C(\C=[N+](/[O-])C(C)(C)C(C)N=O)OCC",
        "C=C(\C=[N+](/[O-])C(C)(C)C(=C)NO)OCC",
        # 1 tautomer(s) generated using PT_16_00 - nitroso/oxime
        "C=C(\C=[N+](/[O-])C(C)(C)C(C)N=O)OCC",
        # 1 tautomer(s) generated using PT_36_00 - oxime/nitrone: 1,2 H-shift
        "C=C(\C=[N+](/[O-])C(C)(C)C(C)=[NH+][O-])OCC",
        # 1 tautomer(s) generated using RC_22_00 - 5_endo_trig: 1,5 H-shift
        "C=C(C1N(O)C(C)(C)C(=[N+]1[O-])C)OCC",
    ],
    # 1688b
    "C=C(OCC)C1N(O)C(C)(C)C(C)=[N+]1[O-]": [
        # 1 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "C=C(OCC)C1[NH+](C(C(N1O)(C)C)=C)[O-]",
        # 1 tautomer(s) generated using PT_39_00 - nitrone/azoxy or Behrend rearrangement
        "C=C(OCC)C1=[N+](C(C(N1O)(C)C)C)[O-]",
        # 1 tautomer(s) generated using RC_22_00 - 5_endo_trig: 1,5 H-shift
        "C=C(OCC)C=[N+]([O-])C(C)(C)C(C)=NO",
    ],
    # 1704 a
    "Cc1cc(C=O)c(C)c(C=O)c1": [
        # 2 tautomer(s) generated using PT_29_01 - o-tolualdehyde
        "CC1=CC(C(C(=C1)C=O)=C)=CO",
        "C[CH]1C=C([C](C(=C1)C=O)=[CH2])C=O",
    ],
    # 1704 b
    "C=c1c(C=O)cc(C)cc1=CO": [
        # 2 tautomer(s) generated using PT_02_00 - 1,5 (thio)keto/(thio)enol
        "[CH2]=[C]1C(=C[CH](C=C1C=O)C)C=O",
        "CC1=C(C=C(C=C1C=O)C)C=O",
        # 1 tautomer(s) generated using PT_06_00 - 1,3 heteroatom H shift
        "C=C1C(C=C(C=C1C=O)C)C=O",
        # 1 tautomer(s) generated using PT_09_00 - 1,7 (aromatic) heteroatom H shift
        "C=C1C(=CC(=CC1C=O)C)C=O",
        # 2 tautomer(s) generated using PT_29_01 - o-tolualdehyde
        "[CH2]=[C]1C(=C[CH](C=C1C=O)C)C=O",
        "CC1=C(C=C(C=C1C=O)C)C=O",
    ],
}

Now we define a function to return a list of unique canonical SMILES from a list of SMILES.

def canonicalize_smiles(smls_list: list[str]) -> list[str]:
    """
    From a list of SMILES, return a list of unique canonical SMILES.
    Canonicalize a list of SMILES, trying first with and then without sanitization;
    convert list into a set to remove duplicates, then back to a list so can go in dataframe

    :param smls_list: List of SMILES strings
    :returns: List of canonical SMILES strings
    """
    canonical_smiles = []
    for sml in smls_list:
        mol = Chem.MolFromSmiles(sml)
        if mol:
            canonical = Chem.MolToSmiles(mol)
            canonical_smiles.append(canonical)
        else:
            mol = Chem.MolFromSmiles(sml, sanitize=False)
            if mol:
                canonical = Chem.MolToSmiles(mol)
                canonical_smiles.append(canonical)
            else:
                # canonical_smiles.append(None) #? This shouldn't be here, correct?
                print("Molecule couldn't be created")
    # Eliminate duplicates by turning list into a set, then back to a list so can go in dataframe
    canonical_smiles_unique = list(set(canonical_smiles))
    return canonical_smiles_unique

Let’s apply that to the cactus tautomers.

cactus_sml_tauts_canon = {
    key: canonicalize_smiles(value) for key, value in cactus_sml_tauts.items()
}
cactus_inputs = cactus_sml_tauts_canon.keys()
cactus_tauts = cactus_sml_tauts_canon.values()
df_cactus = pl.DataFrame({"canon_sml": cactus_inputs, "tauts_cactus": cactus_tauts})
df_cactus

shape: (18, 2)

canon_sml	tauts_cactus
str	list[str]
"CSC1=NC(c2cccc…	["CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", "C=C1N=C(SC)NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C", … "C=C1NC(SC)=NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C"]
"CSC1=NC(C)=C(C…	["C=C1N=C(SC)NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C", "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", … "CSC1=NC(C)=C(C(=O)OC(C)C)C(=C2C=CCC=C2[N+](=O)[O-])N1"]
"[2H]Oc1ccc(-c2…	["[2H]OC1=C(C2=CC([2H])(O[2H])C(=O)C=C2)OC2=C([2H])C(O[2H])=C(OC)C(=O)C2([2H])C1=O", "[2H]OC1=C(C2=CC([2H])(O[2H])C(=O)C=C2)OC2=C([2H])C(=O)C([2H])(OC)C(=O)C2=C1O[2H]", … "[2H]Oc1c(C2=CC([2H])(O[2H])C(=O)C=C2)oc2c([2H])c(O[2H])c(OC)c(=O)c-2c1O[2H]"]
"[2H]Oc1ccc(-c2…	["[2H]OC1=CC(C2=C(O[2H])C(=O)C3([2H])C(=O)C(OC)=C(O[2H])C([2H])=C3O2)=CC([2H])C1=O", "[2H]OC1=C(OC)C(=O)C([2H])([2H])c2oc(-c3ccc(O[2H])c(O[2H])c3)c(O[2H])c(=O)c21", … "[2H]Oc1c(OC)c(O[2H])c2c(=O)c(O[2H])c(C3=CC(=O)C([2H])(O[2H])C=C3)oc2c1[2H]"]
"O=c1c(O)c(-c2c…	["O=C1CC(c2oc3cc(O)cc(O)c3c(=O)c2O)=CC=C1O", "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2", … "O=C1C=C(O)c2c(oc(-c3ccc(O)c(O)c3)c(O)c2=O)C1"]
"O=c1cc(O)cc2oc…	["O=C1C=CC(c2oc3cc(O)cc(=O)c-3c(O)c2O)=CC1O", "O=C1C=CC(=C2Oc3cc(O)cc(O)c3C(O)=C2O)C=C1O", … "O=C1CC(c2oc3c(c(=O)c2O)C(=O)CC(O)=C3)=CC=C1O"]
"COc1c(O)c2c(=O…	["COC1=CC(=O)C2=C3C1=C1C(OC)=CC(=O)C4=C(O)C(OC)=C5C(=C3C(=C(OC)C2=O)[C@@H]([C@H](C)O)[C@@H]5[C@H](C)O)C41", "COC1=CC(=O)c2c(O)c(OC)c3c4c2c1c1c(OC)cc(O)c2c(O)c(OC)c(c4c21)=C([C@H](C)O)[C@@H]3[C@H](C)O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c5c6c(c1c42)C(OC)=CC(=O)C6C(=O)C(OC)=C5[C@@H]([C@H](C)O)[C@@H]3[C@H](C)O"]
"COc1c2c3c4c(c(…	["COC1=CC(=O)C2=c3c1c1c(OC)cc(=O)c4c(=O)c(OC)c5c(c3C(C(OC)=C2O)[C@@H]([C@H](C)O)[C@@H]5[C@H](C)O)c1=4", "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c(OC)cc(O)c2c1c4C(=C(OC)C2=O)[C@@H]([C@H](C)O)[C@@H]3[C@H](C)O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c5c6c(c1c42)C(OC)=CC(=O)C6C(=O)C(OC)=C5[C@@H]([C@H](C)O)[C@@H]3[C@H](C)O"]
"COc1c(O)c2c(=O…	["COC1=CC(=O)C2C(=O)C(OC)=C(C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c2c(c(O)c(OC)c(C[C@H](C)OC(=O)C4=CCC(=O)C=C4)c32)C(=O)C=C1OC", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c(C[C@@H](C)OC(=O)c6ccccc6)c(c(c1C[C@H](C)OC(=O)c1ccc(O)cc1)c23)c54", … "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c(C[C@H](C)OC(=O)C6=CCC(=O)C=C6)c(c(c1C[C@@H](C)OC(=O)c1ccccc1)c23)c54"]
"COc1c(C[C@@H](…	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c(C[C@@H](C)OC(=O)c6ccccc6)c(c(c1C[C@H](C)OC(=O)c1ccc(O)cc1)c23)c54", "COC1=CC(=O)C2C(=O)C(OC)=C(C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c(OC)cc(O)c2c1c3C(C[C@H](C)OC(=O)c1ccc(O)cc1)=C(OC)C2=O", … "COC1=C(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(c(O)cc(OC)c3c3c(OC)cc(O)c4c(O)c(OC)c(=C[C@H](C)OC(=O)c5ccc(O)cc5)c2c43)C1=O"]
"COc1c(O)c2c(=O…	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c6c(c(c1[C@H](C(C)=O)C(C)(O)C6)c23)c54", "COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@H](C(C)=O)C(C)(O)C2", … "COC1=CC(=O)C2=C3C1=C1C(OC)=CC(=O)C4=C(O)C(OC)=C5CC(C)(O)[C@@H](C(C)=O)C(=C(OC)C2=O)C3=C5C41"]
"COc1c2c3c4c(c(…	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54", "COC1=CC(=O)C2=C3C1=C1C(OC)=CC(O)=C4C(=O)C(OC)=C5CC(C)(O)[C@@H](C(C)=O)C(=C(OC)C2=O)C3=C5C41", … "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54"]
"CCP1(CC)(c2ccc…	["CCP1(CC)(c2ccccc2)N=C2C=C(C(c3ccccc3)(c3ccccc3)c3ccccc3)C=C(C(C)(C)C)C2O1", "CCP1(CC)(c2ccccc2)N=C2CC(C(c3ccccc3)(c3ccccc3)c3ccccc3)=CC(C(C)(C)C)=C2O1", "CCP(CC)(=Nc1cc(C(c2ccccc2)(c2ccccc2)c2ccccc2)cc(C(C)(C)C)c1O)c1ccccc1"]
"CCP(CC)(=Nc1cc…	["CCP(CC)(=NC1=CC(C(c2ccccc2)(c2ccccc2)c2ccccc2)=CC(C(C)(C)C)C1=O)c1ccccc1", "CC=P(CC)(NC1C=C(C(c2ccccc2)(c2ccccc2)c2ccccc2)C=C(C(C)(C)C)C1=O)c1ccccc1", … "CC=P(CC)(Nc1cc(C(c2ccccc2)(c2ccccc2)c2ccccc2)cc(C(C)(C)C)c1O)c1ccccc1"]
"C=C(/C=[N+](\[…	["C=C(/C=[N+](\[O-])C(C)(C)C(C)N=O)OCC", "C=C(OCC)C1N(O)C(C)(C)C(C)=[N+]1[O-]", … "C=C(/C=[N+](\[O-])C(C)(C)C(C)=[NH+][O-])OCC"]
"C=C(OCC)C1N(O)…	["C=C(OCC)C1N(O)C(C)(C)C(=C)[NH+]1[O-]", "C=C(C=[N+]([O-])C(C)(C)C(C)=NO)OCC", "C=C(OCC)C1=[N+]([O-])C(C)C(C)(C)N1O"]
"Cc1cc(C=O)c(C)…	["C=C1C(C=O)=CC(C)C=C1C=O", "C=c1c(C=O)cc(C)cc1=CO"]
"C=c1c(C=O)cc(C…	["Cc1cc(C=O)c(C)c(C=O)c1", "C=C1C(C=O)=CC(C)C=C1C=O", "C=C1C(C=O)=CC(C)=CC1C=O"]

Now we merge in the cactus tautomers by left-joining on canonical SMILES.

# Ensure no tauts_cactus columns already exist--can cause additional column tauts_cactus_right to be created
df_melted = df_melted.drop(cs.starts_with("tauts_cactus"))

df_melted = df_melted.join(df_cactus, on="canon_sml", how="left")

To inspect the results and check which Refs are included, let’s display one row for each Ref.

df_melted.filter(pl.col("tauts_cactus").is_not_null()).unique(subset="Ref").sort("Ref")

shape: (9, 6)

Ref	sml	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus
i64	str	str	list[str]	list[str]	list[str]
73	"O=[N+](C1=C(C2…	"CSC1=NC(C)=C(C…	["CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", "CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", … "CSC1=NC(c2ccccc2[N+](=O)[O-])C(=C(O)OC(C)C)C(C)=N1"]	["CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", "C=C1N=C(SC)NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C", … "C=C1NC(SC)=NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C"]	["C=C1N=C(SC)NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C", "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", … "CSC1=NC(C)=C(C(=O)OC(C)C)C(=C2C=CCC=C2[N+](=O)[O-])N1"]
457	"O=C2C1=C(O[2H]…	"[2H]Oc1ccc(-c2…	["[2H]OC1=C(O)C2=C(O[2H])C(OC)C(=O[2H])C([2H])=C2OC1=C1C=CC(=O[2H])C(=O[2H])C1", "[2H]O=C1CC(C2OC3=C([2H])C(=O[2H])C(OC)C(=O[2H])C3=C(O)C2=O[2H])=CC=C1[OH][2H]", … "[2H]OC1=CC(C2OC3=C([2H])C(O[2H])=C(OC)C(=O[2H])C3=C(O)C2=O[2H])C=CC1=O[2H]"]	["[2H]OC1=C(O)C2=C(O[2H])C(OC)C(=O[2H])C([2H])=C2OC1=C1C=CC(=O[2H])C(=O[2H])C1", "[2H]OC1=CC(C2OC3=C(C(=O)C2=O[2H])C(=O[2H])C(OC)C([OH][2H])=C3[2H])=CCC1=O[2H]", … "[2H]OC1=C(C2=CC(O[2H])C(=O[2H])C=C2)OC2=C([2H])C(=O[2H])C(OC)C(=O[2H])C2=C1O"]	["[2H]OC1=CC(C2=C(O[2H])C(=O)C3([2H])C(=O)C(OC)=C(O[2H])C([2H])=C3O2)=CC([2H])C1=O", "[2H]OC1=C(OC)C(=O)C([2H])([2H])c2oc(-c3ccc(O[2H])c(O[2H])c3)c(O[2H])c(=O)c21", … "[2H]Oc1c(OC)c(O[2H])c2c(=O)c(O[2H])c(C3=CC(=O)C([2H])(O[2H])C=C3)oc2c1[2H]"]
467	"O=C1C2=C(C=C(C…	"O=c1c(O)c(-c2c…	["O=C1C=CC(=C2Oc3cc(O)cc(O)c3C(=O)C2O)CC1=O", "O=C1C=C(O)C2C(=O)C(O)=C(C3=CCC(=O)C(O)=C3)OC2=C1", … "O=C1C=C(O)C2=C(C1)OC(=C1C=CC(O)C(=O)C1)C(=O)C2=O"]	["O=C1C=CC(=C2Oc3cc(O)cc(O)c3C(=O)C2O)CC1=O", "O=C1C=C(O)C2C(=O)C(O)=C(C3=CCC(=O)C(O)=C3)OC2=C1", … "O=C1C=C2OC(C3=CC(=O)C(=O)CC3)C(=O)C(=O)C2C(=O)C1"]	["O=C1CC(c2oc3cc(O)cc(O)c3c(=O)c2O)=CC=C1O", "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2", … "O=C1C=C(O)c2c(oc(-c3ccc(O)c(O)c3)c(O)c2=O)C1"]
888	"O=C1C(C(O)=C2)…	"COc1c2c3c4c(c(…	["COC1=CC(=O)C2C(=O)C(OC)c3c([C@H](C)O)c([C@H](C)O)c4c5c6c(c1c2c35)=C(OC)CC(=O)C6C(=O)C=4OC", "COC1=CC(=O)C2C(O)=C(OC)c3c([C@H](C)O)c([C@H](C)O)c4c5c6c(c1c2c35)=C(OC)CC(=O)C6C(=O)C=4OC", … "COC1=CC(=O)C2=C(O)C(OC)c3c([C@H](C)O)c([C@H](C)O)c4c(OC)c(O)c5c6c(c1c2c3c46)=C(OC)CC5=O"]	["C=C(O)c1c(C(C)O)c2c3c4c(c5c6c3c1=C(OC)C(=O)C6C(O)=CC5OC)C(OC)CC(=O)C4C(=O)C=2OC", "COc1cc(O)c2c(O)c(OC)c3c(C(C)O)c(C(C)=O)c4c5c6c(c1c2c35)C(OC)C=C(O)C6C(=O)C4OC", … "C=C(O)c1c2c3c4c(c5c6c3c(c1=C(C)O)C(OC)C(=O)C=6C(=O)CC5OC)C(OC)CC(=O)C4C(O)=C2OC"]	["COC1=CC(=O)C2=c3c1c1c(OC)cc(=O)c4c(=O)c(OC)c5c(c3C(C(OC)=C2O)[C@@H]([C@H](C)O)[C@@H]5[C@H](C)O)c1=4", "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c(OC)cc(O)c2c1c4C(=C(OC)C2=O)[C@@H]([C@H](C)O)[C@@H]3[C@H](C)O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c5c6c(c1c42)C(OC)=CC(=O)C6C(=O)C(OC)=C5[C@@H]([C@H](C)O)[C@@H]3[C@H](C)O"]
890	"OC1=CC(OC)=C(C…	"COc1c(C[C@@H](…	["COC1=CC(=O)C2C(=O)C(OC)C(=C[C@H](C)OC(=O)c3ccc(O)cc3)c3c2c1c1c2c3C(C[C@@H](C)OC(=O)c3ccccc3)=C(OC)C(=O)C=2C(=O)CC=1OC", "COc1c(O)c2c(O)cc(OC)c3c4c(OC)cc(O)c5c(O)c(OC)c(=C[C@H](C)OC(=O)c6ccc(O)cc6)c(c(c1=C[C@@H](C)OC(=O)c1ccccc1)c23)c54", … "COC1=CC(=O)C2C(=O)C(OC)C(=C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c2c(c(O)c(OC)c(=C[C@H](C)OC(=O)c4ccc(O)cc4)c32)C(=O)CC=1OC"]	["COc1c(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(c(OC)cc(=O)c=3c1=O)c1c3c(c(O)c(OC)c(=C[C@H](C)OC(=O)c4ccc(O)cc4)c32)C(=O)CC1OC", "COC1=c2c3c(c(O)c(OC)c(=C[C@H](C)OC(=O)C4C=CC(=O)C=C4)c3c3c(=C[C@@H](C)OC(=O)c4ccccc4)c(OC)c(O)c4c(O)cc(OC)c2c43)C(=O)C1", … "COC1=C(C[C@H](C)OC(O)=C2C=CC(=O)C=C2)c2c3c(c4c5c(c(O)c(OC)c(=C[C@@H](C)OC(=O)c6ccccc6)c25)C(=O)CC=4OC)=C(OC)CC(=O)C=3C1=O"]	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c(C[C@@H](C)OC(=O)c6ccccc6)c(c(c1C[C@H](C)OC(=O)c1ccc(O)cc1)c23)c54", "COC1=CC(=O)C2C(=O)C(OC)=C(C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c(OC)cc(O)c2c1c3C(C[C@H](C)OC(=O)c1ccc(O)cc1)=C(OC)C2=O", … "COC1=C(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(c(O)cc(OC)c3c3c(OC)cc(O)c4c(O)c(OC)c(=C[C@H](C)OC(=O)c5ccc(O)cc5)c2c43)C1=O"]
891	"OC1([C@@H](C(C…	"COc1c2c3c4c(c(…	["COC1=CC(=O)C2C(O)=C(OC)C3=CC(C)(O)C(C(C)=O)C4=C(OC)C(=O)c5c(O)cc(OC)c6c1c2c3c4c56", "C=C(O)C1c2c(OC)c(=O)c3c(O)cc(OC)c4c5c(OC)cc(=O)c6c(O)c(OC)c(c(c2c34)c65)CC1(C)O", … "COC1=CC(=O)C2C(=O)C(OC)C3=CC(C)(O)C(C(C)=O)C4=C(OC)C(=O)c5c(O)cc(OC)c6c1c2c3c4c56"]	["COC1=CC(=O)C2C(O)=C(OC)C3=CC(C)(O)C(C(C)=O)C4=C(OC)C(=O)c5c(O)cc(OC)c6c1c2c3c4c56", "C=C(O)C1c2c(OC)c(=O)c3c(O)cc(OC)c4c5c(OC)cc(=O)c6c(O)c(OC)c(c(c2c34)c65)CC1(C)O", … "COC1=CC(=O)C2C(=O)C(OC)C3=CC(C)(O)C(C(C)=O)C4=C(OC)C(=O)c5c(O)cc(OC)c6c1c2c3c4c56"]	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54", "COC1=CC(=O)C2=C3C1=C1C(OC)=CC(O)=C4C(=O)C(OC)=C5CC(C)(O)[C@@H](C(C)=O)C(=C(OC)C2=O)C3=C5C41", … "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54"]
1512	"CC(C)(C)C6=CC(…	"CCP1(CC)(c2ccc…	["CCP1(CC)(c2ccccc2)Nc2cc(C(c3ccccc3)(c3ccccc3)c3ccccc3)cc(C(C)(C)C)c2O1"]	["CCP1(CC)(c2ccccc2)Nc2cc(C(c3ccccc3)(c3ccccc3)c3ccccc3)cc(C(C)(C)C)c2O1"]	["CCP1(CC)(c2ccccc2)N=C2C=C(C(c3ccccc3)(c3ccccc3)c3ccccc3)C=C(C(C)(C)C)C2O1", "CCP1(CC)(c2ccccc2)N=C2CC(C(c3ccccc3)(c3ccccc3)c3ccccc3)=CC(C(C)(C)C)=C2O1", "CCP(CC)(=Nc1cc(C(c2ccccc2)(c2ccccc2)c2ccccc2)cc(C(C)(C)C)c1O)c1ccccc1"]
1688	"O/N=C(C)/C(C)(…	"C=C(/C=[N+](\[…	["C=C(/C=[N+](\[O-])C(C)(C)C(C)N=O)OCC", "C=C(/C=[N+](\[O-])C(C)(C)C(C)=NO)OCC"]	["C=C(/C=[N+](\[O-])C(C)(C)C(C)N=O)OCC", "C=C(/C=[N+](\[O-])C(C)(C)C(C)=NO)OCC", "C=C(/C=[N+](\[O-])C(C)(C)C(=C)NO)OCC"]	["C=C(/C=[N+](\[O-])C(C)(C)C(C)N=O)OCC", "C=C(OCC)C1N(O)C(C)(C)C(C)=[N+]1[O-]", … "C=C(/C=[N+](\[O-])C(C)(C)C(C)=[NH+][O-])OCC"]
1704	"C=C1C(=CC(=CC1…	"C=c1c(C=O)cc(C…	["C=C1C(C=O)=CC(C)=CC1C=O", "C=c1c(C=O)cc(C)cc1=CO"]	["C=c1c(C=O)cc(C)cc1=CO"]	["Cc1cc(C=O)c(C)c(C=O)c1", "C=C1C(C=O)=CC(C)C=C1C=O", "C=C1C(C=O)=CC(C)=CC1C=O"]

Now let’s merge the several rows for each Ref by putting the tautomers into lists. We use explode to make each Ref-tautomer pair into its own row, then aggregate into a list, which Polars uses as the default aggregation. (Instead of explode, we could use flatten, whose name may be more intuitive for some.)

df_melted_aggregated = (
    # Aggregate by combining multiple lists into a single flat list
    df_melted.group_by("Ref").agg(["canon_sml", cs.starts_with("tauts_").explode()])
    # Keep the unique elements in each list; that is, remove duplicates
).with_columns(cs.starts_with("tauts_").list.unique())
df_melted_aggregated.filter(pl.col("Ref").is_in([73, 888, 1704])).sort("Ref")

shape: (3, 5)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus
i64	list[str]	list[str]	list[str]	list[str]
73	["CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1"]	["CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", "CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", … "CSC1=NC(c2ccccc2[N+](=O)[O-])C(=C(O)OC(C)C)C(C)=N1"]	["CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)C(C)=N1", "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", … "C=C1N=C(SC)NC(c2ccccc2[N+](=O)[O-])C1C(=O)OC(C)C"]	["CSC1=NC(C)=C(C(=O)OC(C)C)C(c2ccccc2[N+](=O)[O-])N1", "CSC1=NC(c2ccccc2[N+](=O)[O-])C(C(=O)OC(C)C)=C(C)N1", … "CSC1=NC(C)=C(C(=O)OC(C)C)C(=C2C=CC=CC2=[N+]([O-])O)N1"]
888	["COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@@H]([C@H](C)O)[C@@H]2[C@H](C)O", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1[C@@H]([C@H](C)O)[C@@H]6[C@H](C)O)c23)c54"]	["COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c2c4C(=C(OC)C(=O)C=2C(=O)CC=1OC)C([C@H](C)O)C3[C@H](C)O", "COC1=c2c3c(c(=O)c(OC)c4c3c3c(c(OC)c(O)c5c(=O)cc(OC)c2c53)C([C@H](C)O)C4[C@H](C)O)C(=O)C1", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c5c6c(c1c42)C(OC)=CC(=O)C6C(=O)C(OC)C5=C([C@H](C)O)C3[C@H](C)O"]	["COC1=c2c3c4c(c(C(C)O)c(C(C)O)c5c(OC)c(O)c6c(O)cc(OC)c2c6c54)=C(OC)C(=O)C3C(O)=C1", "COC1=c2c(C(C)=O)c(C(C)O)c3c(OC)c(O)c4c5c(c6c(c2c35)C(C(=O)CC6OC)C1=O)C(OC)CC4=O", … "COC1=c2c3c4c5c(c(=C(C)O)c(C(C)O)c6c(OC)c(O)c(c2c65)C(=O)C1)=C(OC)C(=O)C4C(=O)CC3OC"]	["COC1=CC(=O)C2=C3C1=C1C(OC)=CC(=O)C4=C(O)C(OC)=C5C(=C3C(=C(OC)C2=O)[C@@H]([C@H](C)O)[C@@H]5[C@H](C)O)C41", "COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@@H]([C@H](C)O)[C@@H]2[C@H](C)O", … "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c6c(c(c1[C@@H]([C@H](C)O)[C@@H]6[C@H](C)O)c23)c54"]
1704	["C=c1c(C=O)cc(C)cc1=CO", "Cc1cc(C=O)c(C)c(C=O)c1"]	["Cc1cc(C=O)c(C)c(C=O)c1", "C=C1C(C=O)=CC(C)=CC1C=O", "C=c1c(C=O)cc(C)cc1=CO"]	["C=c1c(C=O)cc(C)cc1=CO", "Cc1cc(C=O)c(C)c(C=O)c1"]	["C=C1C(C=O)=CC(C)=CC1C=O", "C=c1c(C=O)cc(C)cc1=CO", … "C=C1C(C=O)=CC(C)C=C1C=O"]

Summary: We added the cactus tautomers to our dataset.

CACTVS

The CACTVS results provided by Marc Nicklaus are in SDF (structure-data format), so this function extracts the Ref.

def extract_ref(
    e_name: str,
):
    """Extract the reference number from an E_NAME entry in an SDF file, for example
        > <E_NAME>
        Ref_0890_01_1
    has e_name of
        Ref_0890_01_1
    which returns
        890

    :param e_name: The E_NAME property, for example "Ref_0890_01_1"
    :returns: The reference number, for example 890
    """
    parts = e_name.split("_")
    ref_str = parts[1]
    try:
        ref = int(ref_str)
    except:
        ref = None
    return ref

This code actually reads the SDF files into lists.

files = [
    "Refs",
    "Ref_0467_expanded17",
    "Ref_0890_different_expanded14c",
    "Ref_0891_expanded17",
]

refs_cactvs = []
nums_cactvs = []
tauts_cactvs = []

for file in files:
    # Iterate through the molecules in the SDF file
    cactvs = Chem.SDMolSupplier(f"../data/{file}.sdf")
    last_num = None
    for mol in cactvs:
        if mol is not None:
            ref = extract_ref(mol.GetProp("E_NAME"))
            sml = mol.GetProp("E_SMILES")
            refs_cactvs.append(ref)
            tauts_cactvs.append(sml)

We then put those lists into another Polars dataframe.

df_cactvs = pl.DataFrame({"Ref": refs_cactvs, "tauts_CACTVS": tauts_cactvs})

To inspect the results and check which Refs are included, let’s display one tautomer for each Ref.

df_cactvs.unique(subset=["Ref"]).sort("Ref")

shape: (5, 2)

Ref	tauts_CACTVS
i64	str
467	"O=c1cc(O)cc2oc…
890	"COC2C(O)C4C(=O…
891	"COc1c(O)c2c(=O…
1512	"CC[P]6(CC)(C1=…
1704	"CC1=CC(=C(C(=C…

Now we can merge in the CACTVS tautomers by joining two dataframes.

# Ensure no CACTVS columns already exist--if one does, it can cause additional column _right to be created
df_melted_aggregated = df_melted_aggregated.drop(cs.contains("CACTVS"))

# Merge in CACTVS tauts by left-joining on Ref, then aggregate tauts_CACTVS into a list
df_melted_aggregated = (
    df_melted_aggregated.join(df_cactvs, on="Ref", how="left")
    .group_by("Ref")
    .agg(pl.exclude("tauts_CACTVS").first(), "tauts_CACTVS")
)
df_melted_aggregated.filter(pl.col("Ref").is_in([890, 891]))

shape: (2, 6)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus	tauts_CACTVS
i64	list[str]	list[str]	list[str]	list[str]	list[str]
891	["COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@H](C(C)=O)C(C)(O)C2", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54"]	["COc1c2c3c4c5c(c(=O)cc(OC)c5c5c(OC)cc(O)c(c1=O)c35)C(=O)C(OC)C=4CC(C)(O)C2C(C)=O", "COC1=CC(=O)C2=c3c1c1c4c5c3C(=C(C(C)=O)C(C)(O)CC5C(OC)=C(O)C=4C(=O)C=C1OC)C(OC)C2=O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c2c4=C(CC(C)(O)C3C(C)=O)C(OC)C(=O)C=2C(=O)C=C1OC"]	["COC1=CC(=O)C2=c3c1c1c4c5c3=C(CC(C)(O)C(=C(C)O)C=5C(OC)C(=O)C=4C(=O)C=C1OC)C(OC)C2=O", "C=C(O)C1=C2c3c4c(c(OC)c(O)c5c(=O)cc(OC)c(c6c(OC)cc(=O)c(c36)=C(O)C2OC)c45)CC1(C)O", … "C=C(O)C1=C2c3c4c(c5c(OC)cc(=O)c6c(=O)c(OC)c(c3c5=6)CC1(C)O)C(OC)CC(=O)C=4C(=O)C2OC"]	["C=C(O)[C@H]1c2c(OC)c(=O)c3c(O)cc(OC)c4c5c(OC)cc(O)c6c(=O)c(OC)c(c(c2c34)c65)CC1(C)O", "COC1=CC(=O)C2C(=O)C(OC)=C3CC(C)(O)[C@@H](C(C)=O)c4c(OC)c(O)c5c6c(c1c2c3c46)C(OC)=CC5=O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c(OC)cc(O)c2c1c4C(=C(OC)C2=O)CC(C)(O)[C@H]3C(C)=O"]	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54", "COC6=C1C4=C2C(=C(C(C1)(C)O)C(C)=O)C(=C(C3=C(C=C(C(=C23)C5=C4C(=C(O)C=C5OC)C6=O)OC)O)O)OC", … "COC4=C6C2=C1C(C(OC)C(=O)C5=C1C(=C3C(=CC(=O)C(=C23)C4=O)OC)C(=CC5=O)OC)C(=C(C)O)C(C)(O)C6"]
890	["COc1c(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(C[C@H](C)OC(=O)c4ccc(O)cc4)c(OC)c(=O)c4c(O)cc(OC)c(c5c(OC)cc(O)c(c1=O)c52)c43", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c(C[C@H](C)OC(=O)c6ccc(O)cc6)c(c(c1C[C@@H](C)OC(=O)c1ccccc1)c23)c54"]	["COC1=CC(=O)C2C(O)=C(OC)C(=C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c2c3C(C[C@H](C)OC(=O)C3=CCC(=O)C=C3)=C(OC)C(=O)C=2C(=O)CC=1OC", "COC1=CC(=O)C2C(=O)C(OC)C(=C[C@H](C)OC(O)=C3C=CC(=O)C=C3)c3c2c1c1c2c3=C(C[C@@H](C)OC(=O)c3ccccc3)C(OC)C(=O)C=2C(=O)C=C1OC", … "COC1=C(O)c2c(=O)cc(OC)c3c2c(c2c(=C[C@@H](C)OC(=O)c4ccccc4)c(OC)c(O)c4c(=O)cc(OC)c3c42)C1C[C@H](C)OC(=O)C1C=CC(=O)C=C1"]	["COc1c(O)c2c(O)cc(OC)c3c4c(OC)cc(O)c5c(O)c(OC)c(=C[C@H](C)OC(O)=C6C=CC(=O)C=C6)c(c(c1=C[C@@H](C)OC(=O)c1ccccc1)c23)c54", "COC1=CC(=O)C2C(=O)C(OC)C(=C[C@@H](C)OC(=O)c3ccccc3)c3c2c1c1c(OC)cc(O)c2c(O)c(OC)c(=C[C@H](C)OC(O)=C4C=CC(=O)C=C4)c3c21", … "COC1=C(C[C@H](C)OC(O)=C2C=CC(=O)C=C2)c2c3c(c(O)cc(OC)c3c3c4c(c(O)c(OC)c(=C[C@@H](C)OC(=O)c5ccccc5)c24)C(=O)CC=3OC)C1=O"]	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(O)c5c(=O)c(OC)c(C[C@H](C)OC(=O)c6ccc(O)cc6)c(c(c1C[C@@H](C)OC(=O)c1ccccc1)c23)c54", "COc1c(C[C@@H](C)OC(=O)c2ccccc2)c2c3c(C[C@H](C)OC(O)=C4C=CC(=O)C=C4)c(OC)c(=O)c4c(O)cc(OC)c(c5c(OC)cc(O)c(c1=O)c52)c43", … "COC1=C(C[C@H](C)OC(=O)c2ccc(O)cc2)c2c3c(c(O)cc(OC)c3c3c(OC)cc(O)c4c(O)c(OC)c(=C[C@@H](C)OC(=O)c5ccccc5)c2c43)C1=O"]	["COC2C(O)C4C(=O)CC(OC)C5C6C(OC)CC(=O)C7C(O)C(OC)C(C[C@H](C)OC(=O)C1CCC(O)CC1)C(C(C2C[C@@H](C)OC(=O)C3CCCCC3)C45)C67", "COC2C(O)C4C(=O)CC(OC)C5C6C(OC)CC(=C7C(O)C(OC)C(C[C@H](C)OC(=O)C1CCC(O)CC1)C(C(C2C[C@@H](C)OC(=O)C3CCCCC3)C45)C67)O", … "COC4=C(C3=C1C(=C(C(=C2C(C=C(C(=C12)C5=C3C(=C4O)C(=O)C=C5OC)OC)=O)O)OC)C[C@H](C)OC(=O)C6=CC=C(C=C6)O)C[C@@H](C)OC(=O)C7=CC=CC=C7"]

Summary: We added the CACTVS tautomers to our dataset.

Data cleanup

Because we have tautomers for cactus and CACTVS for only certain Refs, we need to remove None entries in lists so that, for example, [None, None] won’t be counted as two tautomers; it will be replaced with [] which will be counted as zero tautomers, indicating that we didn’t obtain tautomers for that Ref.

# Remove None (null) values in tauts_cactus and tauts_CACTVS lists
for source in ["cactus", "CACTVS"]:
    tauts_source = df_melted_aggregated[f"tauts_{source}"].to_list()
    tauts_source_no_nulls = []
    for tauts_list in tauts_source:
        tauts_list_no_nulls = []
        for taut in tauts_list:
            if taut != None:
                tauts_list_no_nulls.append(taut)
        tauts_source_no_nulls.append(tauts_list_no_nulls)

    tauts_source_no_nulls_series = pl.Series(tauts_source_no_nulls)
    df_melted_aggregated = df_melted_aggregated.with_columns(
        tauts_source_no_nulls_series.alias(f"tauts_{source}")
    )

We can tell that worked because we have some [] entries in tauts_CACTVS and tauts_cactus.

df_melted_aggregated.head(3)

shape: (3, 6)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus	tauts_CACTVS
i64	list[str]	list[str]	list[str]	list[str]	list[str]
667	["O=C(/C=C(\S)c1ccccc1)c1ccccc1", "O/C(=C\C(=S)c1ccccc1)c1ccccc1"]	["OC(=CC(=S)c1ccccc1)c1ccccc1", "O=C(CC(=S)c1ccccc1)c1ccccc1", "O=C(C=C(S)c1ccccc1)c1ccccc1"]	["O=C(CC(=S)c1ccccc1)c1ccccc1", "O=C(C=C(S)c1ccccc1)c1ccccc1", "OC(=CC(=S)c1ccccc1)c1ccccc1"]	[]	[]
664	["O=C(Cc1ccccn1)c1ccc(N2CCCC2)cc1", "O/C(=C\c1ccccn1)c1ccc(N2CCCC2)cc1"]	["O=C(Cc1ccccn1)c1ccc(N2CCCC2)cc1", "OC(=Cc1ccccn1)c1ccc(N2CCCC2)cc1", "O=C(C=C1C=CC=CN1)c1ccc(N2CCCC2)cc1"]	["OC(=Cc1ccccn1)c1ccc(N2CCCC2)cc1", "O=C(C=C1C=CCC=N1)c1ccc(N2CCCC2)cc1", … "O=C(C=C1C=CC=CN1)c1ccc(N2CCCC2)cc1"]	[]	[]
429	["Nc1ncnc2[nH]cnc12", "Nc1nc[nH]c2ncnc1-2"]	["N=c1[nH]cnc2nc[nH]c12", "N=c1[nH]cnc2[nH]cnc12", … "N=c1nc[nH]c2nc[nH]c12"]	["Nc1nc[nH]c2ncnc1-2", "N=c1[nH]cnc2nc[nH]c12", … "N=c1[nH]cnc2[nH]cnc12"]	[]	[]

Convert enumerated tautomers from SMILES to InChI

Now that we’ve enumerated tautomers, let’s convert the SMILES of the RDKit TautomerEnumerator, our baseline algorithm, to InChI. This will let us check how well InChI does at its stated goal of covering multiple tautomers with a single InChI for algorithmically-enumerated tautomers.

We start by creating a function to return a list of unique InChI for an iterable of SMILES.

def InChI_smiles(smls_iterable: Iterable[str]) -> Iterable[str]:
    """
    Convert an iterable of SMILES to InChI, returning the list of unique InChI

    :param smls_iterable: Iterable of SMILES strings
    :returns: List of InChI strings
    """
    InChIs = []
    for sml in smls_iterable:
        mol = Chem.MolFromSmiles(sml)
        if mol:
            InChI = Chem.MolToInchi(mol)
            InChIs.append(InChI)
        else:
            mol = Chem.MolFromSmiles(sml, sanitize=False)
            if mol:
                InChI = Chem.MolToInchi(mol)
                InChIs.append(InChI)
            else:
                print("Molecule couldn't be created")
    # Eliminate duplicates by turning list into a set, then back to a list so can go in dataframe
    InChIs_unique = list(set(InChIs))
    return InChIs_unique

Now we determine the list of InChI for the list of tautomers from TautomerEnumerator.

df_melted_aggregated = df_melted_aggregated.with_columns(
    [
        pl.col("tauts_TautomerEnumerator")
        .map_elements(InChI_smiles)
        .alias("tauts_TautomerEnumerator_InChI"),
    ]
)

Let’s get a sense for the results for a couple Refs.

df_melted_aggregated.filter(pl.col("Ref").is_in([891, 892])).sort(pl.col("Ref")).select(
    "Ref", "tauts_TautomerEnumerator", "tauts_TautomerEnumerator_InChI"
)

shape: (2, 3)

Ref	tauts_TautomerEnumerator	tauts_TautomerEnumerator_InChI
i64	list[str]	list[str]
891	["COc1c2c3c4c5c(c(=O)cc(OC)c5c5c(OC)cc(O)c(c1=O)c35)C(=O)C(OC)C=4CC(C)(O)C2C(C)=O", "COC1=CC(=O)C2=c3c1c1c4c5c3C(=C(C(C)=O)C(C)(O)CC5C(OC)=C(O)C=4C(=O)C=C1OC)C(OC)C2=O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c2c4=C(CC(C)(O)C3C(C)=O)C(OC)C(=O)C=2C(=O)C=C1OC"]	["InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h7,31-32,36H,8-9H2,1-6H3", "InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h7-9,25,29,31-32,34,36H,1H2,2-6H3", … "InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h9,25,28,36H,7-8H2,1-6H3"]
892	["O=C1C=CC(=O)c2c(O)ccc(O)c21", "O=C1C=CC(=O)C2=C1C(=O)CCC2=O", … "O=C1C=CC(=O)C2C(=O)C=CC(O)=C12"]	["InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-2H,3-4H2", "InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-4,9,12H", … "InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-3,13H,4H2"]

Summary: We created a unique set of InChI corresponding to each set of SMILES for the baseline RDKit TautomerEnumerator algorithm.

Comparing algorithms

Now that we’ve collected all the data from the various tautomerization algorithms, let’s compare the results. To have a consistent column name format, let’s duplicate canon_sml to tauts_Expt so that each column of tautomers begins with “tauts_”. We then start the analysis by counting the number of tautomers for each Ref for each algorithm.

df_melted_aggregated = df_melted_aggregated.with_columns(
    [
        pl.col("canon_sml").alias("tauts_Expt"),
    ]
).with_columns(
    # Add columns for number of tautomers in each tauts_ column
    cs.starts_with("tauts_")
    .list.len()
    .name.map(lambda name: "n_" + name.replace("tauts_", "")),
)

df_melted_aggregated.filter(pl.col("Ref").is_in([891, 892]))

shape: (2, 14)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus	tauts_CACTVS	tauts_TautomerEnumerator_InChI	tauts_Expt	n_TautomerEnumerator	n_GetV1TautomerEnumerator	n_cactus	n_CACTVS	n_TautomerEnumerator_InChI	n_Expt
i64	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	u32	u32	u32	u32	u32	u32
891	["COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@H](C(C)=O)C(C)(O)C2", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54"]	["COc1c2c3c4c5c(c(=O)cc(OC)c5c5c(OC)cc(O)c(c1=O)c35)C(=O)C(OC)C=4CC(C)(O)C2C(C)=O", "COC1=CC(=O)C2=c3c1c1c4c5c3C(=C(C(C)=O)C(C)(O)CC5C(OC)=C(O)C=4C(=O)C=C1OC)C(OC)C2=O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c2c4=C(CC(C)(O)C3C(C)=O)C(OC)C(=O)C=2C(=O)C=C1OC"]	["COC1=CC(=O)C2=c3c1c1c4c5c3=C(CC(C)(O)C(=C(C)O)C=5C(OC)C(=O)C=4C(=O)C=C1OC)C(OC)C2=O", "C=C(O)C1=C2c3c4c(c(OC)c(O)c5c(=O)cc(OC)c(c6c(OC)cc(=O)c(c36)=C(O)C2OC)c45)CC1(C)O", … "C=C(O)C1=C2c3c4c(c5c(OC)cc(=O)c6c(=O)c(OC)c(c3c5=6)CC1(C)O)C(OC)CC(=O)C=4C(=O)C2OC"]	["C=C(O)[C@H]1c2c(OC)c(=O)c3c(O)cc(OC)c4c5c(OC)cc(O)c6c(=O)c(OC)c(c(c2c34)c65)CC1(C)O", "COC1=CC(=O)C2C(=O)C(OC)=C3CC(C)(O)[C@@H](C(C)=O)c4c(OC)c(O)c5c6c(c1c2c3c46)C(OC)=CC5=O", … "COC1=CC(=O)C2C(=O)C(OC)=C3c4c2c1c1c(OC)cc(O)c2c1c4C(=C(OC)C2=O)CC(C)(O)[C@H]3C(C)=O"]	["COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54", "COC6=C1C4=C2C(=C(C(C1)(C)O)C(C)=O)C(=C(C3=C(C=C(C(=C23)C5=C4C(=C(O)C=C5OC)C6=O)OC)O)O)OC", … "COC4=C6C2=C1C(C(OC)C(=O)C5=C1C(=C3C(=CC(=O)C(=C23)C4=O)OC)C(=CC5=O)OC)C(=C(C)O)C(C)(O)C6"]	["InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h7,31-32,36H,8-9H2,1-6H3", "InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h7-9,25,29,31-32,34,36H,1H2,2-6H3", … "InChI=1S/C30H26O10/c1-10(31)25-24-22-16-11(9-30(25,2)36)28(39-5)26(34)17-12(32)7-14(37-3)19(21(16)17)20-15(38-4)8-13(33)18(23(20)22)27(35)29(24)40-6/h9,25,28,36H,7-8H2,1-6H3"]	["COc1c2c3c4c(c(OC)c(=O)c5c(O)cc(OC)c(c6c(OC)cc(O)c(c1=O)c63)c54)[C@H](C(C)=O)C(C)(O)C2", "COc1c(O)c2c(=O)cc(OC)c3c4c(OC)cc(=O)c5c(O)c(OC)c6c(c(c1CC(C)(O)[C@H]6C(C)=O)c23)c54"]	337	340	26	324	337	2
892	["O=C1C=CC(=O)c2c(O)ccc(O)c21", "O=C1C=CC(O)=C2C(=O)C=CC(O)=C12"]	["O=C1C=CC(=O)c2c(O)ccc(O)c21", "O=C1C=CC(=O)C2=C1C(=O)CCC2=O", … "O=C1C=CC(=O)C2C(=O)C=CC(O)=C12"]	["O=C1C=CC(=O)C2=C1C(=O)CCC2=O", "O=C1C=CC(=O)C2C(=O)C=CC(O)=C12", … "O=C1C=CC(=O)C2=C1C(=O)C=CC2O"]	[]	[]	["InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-2H,3-4H2", "InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-4,9,12H", … "InChI=1S/C10H6O4/c11-5-1-2-6(12)10-8(14)4-3-7(13)9(5)10/h1-3,13H,4H2"]	["O=C1C=CC(=O)c2c(O)ccc(O)c21", "O=C1C=CC(O)=C2C(=O)C=CC(O)=C12"]	6	7	0	0	6	2

Comparing sets of tautomers

Next we’d like to check if the tautomers enumerated by two algorithms are the same. So we define a function to check if sets for two iterables are equal. The input iterables need not be sets; the function will convert them to sets, and then compare the sets.

def sets_are_equal(
    *iterables: Iterable,
) -> bool:
    """Check whether multiple iterables are equal as sets:
    For example [1, 1, 2] and {2, 1} are equal because both are {1, 2} as sets

    :param *iterables: the several iterables (can be lists, tuples, sets, etc.)
    :returns: True if all the iterables are equal; False if not
    """
    # If only one iterable is provided, it's equal to itself, so return True
    if len(iterables) < 2:
        return True

    # Put the iterables into a single list
    iterables_list = [iterable for iterable in iterables]
    # Convert iterables into sets
    sets_list = [set(iterable) for iterable in iterables_list]

    # Set the first set as the reference set
    reference_set = sets_list[0]

    # If set lengths differ, the two sets aren't the same
    for this_set in sets_list[1:]:
        if len(reference_set) != len(this_set):
            return False

    # Iterate through the sets starting from the second one
    for this_set in sets_list[1:]:
        # Check if each set matches the reference set
        if this_set != reference_set:
            # Return False if keys are not the same
            return False
    # Return True if all dictionaries have the same keys
    return True

Now we want to select all columns that start with “tauts_” except our reference column “tauts_TautomerEnumerator”. This is a two-part selector, so let’s test it out to check that it gives the desired columns.

tauts_compare_cols = df_melted_aggregated.select(
    cs.starts_with("tauts_").exclude("tauts_TautomerEnumerator")
).columns
tauts_compare_cols

['tauts_GetV1TautomerEnumerator',
 'tauts_cactus',
 'tauts_CACTVS',
 'tauts_TautomerEnumerator_InChI',
 'tauts_Expt']

Let’s check if the sets of tautomers are the same for different sources, and calculate the difference in the number of tautomers between sources.

for tauts_col in tauts_compare_cols:
    tauts_col_bare = tauts_col.replace("tauts_", "")
    df_melted_aggregated = df_melted_aggregated.with_columns(
        [
            # Check if different algos produce same set of tautomers
            pl.struct(["tauts_TautomerEnumerator", tauts_col])
            .map_elements(
                lambda x: sets_are_equal(x["tauts_TautomerEnumerator"], x[tauts_col])
            )
            .alias(f"same_{tauts_col_bare}"),
            # Calculate difference in number of tautomers between algos
            pl.when(pl.col(f"n_{tauts_col_bare}") == 0)
            # If tautomers weren't included for this source, set its n_tauts_diff to null so that difference is calculated only when source has data
            .then(None)
            .otherwise(
                pl.struct(["n_TautomerEnumerator", f"n_{tauts_col_bare}"]).map_elements(
                    lambda x: x["n_TautomerEnumerator"] - x[f"n_{tauts_col_bare}"]
                )
            )
            .alias(f"nDiff_{tauts_col_bare}"),
        ]
    )
df_melted_aggregated.head(1)

shape: (1, 24)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus	tauts_CACTVS	tauts_TautomerEnumerator_InChI	tauts_Expt	n_TautomerEnumerator	n_GetV1TautomerEnumerator	n_cactus	n_CACTVS	n_TautomerEnumerator_InChI	n_Expt	same_GetV1TautomerEnumerator	nDiff_GetV1TautomerEnumerator	same_cactus	nDiff_cactus	same_CACTVS	nDiff_CACTVS	same_TautomerEnumerator_InChI	nDiff_TautomerEnumerator_InChI	same_Expt	nDiff_Expt
i64	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	u32	u32	u32	u32	u32	u32	bool	i64	bool	i64	bool	i64	bool	i64	bool	i64
667	["O=C(/C=C(\S)c1ccccc1)c1ccccc1", "O/C(=C\C(=S)c1ccccc1)c1ccccc1"]	["OC(=CC(=S)c1ccccc1)c1ccccc1", "O=C(CC(=S)c1ccccc1)c1ccccc1", "O=C(C=C(S)c1ccccc1)c1ccccc1"]	["O=C(CC(=S)c1ccccc1)c1ccccc1", "O=C(C=C(S)c1ccccc1)c1ccccc1", "OC(=CC(=S)c1ccccc1)c1ccccc1"]	[]	[]	["InChI=1S/C15H12OS/c16-14(12-7-3-1-4-8-12)11-15(17)13-9-5-2-6-10-13/h1-11,16H", "InChI=1S/C15H12OS/c16-14(12-7-3-1-4-8-12)11-15(17)13-9-5-2-6-10-13/h1-11,17H", "InChI=1S/C15H12OS/c16-14(12-7-3-1-4-8-12)11-15(17)13-9-5-2-6-10-13/h1-10H,11H2"]	["O=C(/C=C(\S)c1ccccc1)c1ccccc1", "O/C(=C\C(=S)c1ccccc1)c1ccccc1"]	3	3	0	0	3	2	true	0	false	null	false	null	false	0	false	1

Let’s examine the distribution of the number of tautomers for the baseline source, in both table and plot form.

frequency_table_baseline = (
    df_melted_aggregated.select(pl.col("n_TautomerEnumerator").value_counts())
    .unnest("n_TautomerEnumerator")
    .sort("n_TautomerEnumerator")
    .with_columns(
        pct=((pl.col("count") / pl.col("count").sum()) * 100).round(1),
        cumulative_pct=(
            (pl.col("count").cum_sum() / pl.col("count").sum()) * 100
        ).round(1),
    )
)
frequency_table_baseline

shape: (82, 4)

n_TautomerEnumerator	count	pct	cumulative_pct
u32	u32	f64	f64
1	16	0.9	0.9
2	457	25.7	26.6
3	381	21.5	48.1
4	119	6.7	54.8
5	227	12.8	67.6
6	85	4.8	72.4
7	67	3.8	76.1
8	79	4.4	80.6
9	49	2.8	83.3
10	30	1.7	85.0
11	25	1.4	86.4
12	59	3.3	89.8
…	…	…	…
291	1	0.1	99.3
315	1	0.1	99.4
329	1	0.1	99.4
332	1	0.1	99.5
337	1	0.1	99.5
360	1	0.1	99.6
400	2	0.1	99.7
402	1	0.1	99.8
441	1	0.1	99.8
443	1	0.1	99.9
454	1	0.1	99.9
801	1	0.1	100.0

Let’s plot a histogram of the number of tautomers from the baseline source. We’ll use a vertical logarithmic scale because there are many Refs with only a few tautomers, and a long, short tail at higher numbers of tautomers.

n_tauts_baseline = df_melted_aggregated.select("n_TautomerEnumerator")
x_min_value = n_tauts_baseline.min().item()
x_max_value = n_tauts_baseline.max().item()

x_buffer = 5

x_min = x_min_value - x_buffer
x_max = x_max_value + x_buffer

# Set the histogram bins so there will be a bin (bar) for each integer value
bins = range(x_min, x_max, 1)

# Set the size of the plot
plt.figure(figsize=(15, 5))

# Create the histogram plot
sns.histplot(
    n_tauts_baseline,
    bins=bins,
    legend=False,
    log_scale=(False, True),
)

plt.ylabel("Frequency")
plt.xlabel("# tautomers from RDKit baseline algorithm")

plt.xlim(x_min, x_max)

plt.show()

Histogram of frequency against number of tautomers from RDKit baseline algorithm showing a frequency of more than 100 at x=0, then rapidly tailing off with increasing x, with the greatest x value being 801

Let’s also check the distribution of the number of experimentally-observed tautomers.

frequency_table_Expt = (
    df_melted_aggregated.select(pl.col("n_Expt").value_counts())
    .unnest("n_Expt")
    .sort("n_Expt")
    .with_columns(
        pct=((pl.col("count") / pl.col("count").sum()) * 100).round(1),
        cumulative_pct=(
            (pl.col("count").cum_sum() / pl.col("count").sum()) * 100
        ).round(1),
    )
)
frequency_table_Expt

shape: (5, 4)

n_Expt	count	pct	cumulative_pct
u32	u32	f64	f64
1	28	1.6	1.6
2	1518	85.5	87.0
3	197	11.1	98.1
4	23	1.3	99.4
5	10	0.6	100.0

So more than 85% of the Refs have two experimentally-observed tautomers, and most of the remaining Refs have three.

Now we can check how many Refs have the same set of tautomers as our baseline, RDKit’s default tautomerizer, for each other source that lists SMILES:

df_same = df_melted_aggregated.select(
    cs.starts_with("same_")
    .exclude("same_TautomerEnumerator_InChI")
    .name.prefix("count_")
).sum()
df_same

shape: (1, 4)

count_same_GetV1TautomerEnumerator	count_same_cactus	count_same_CACTVS	count_same_Expt
u32	u32	u32	u32
1207	0	0	453

Let’s convert those numbers to percentages.

df_same.select((pl.all() / Ref_count * 100).round(1).name.prefix("%")).sum()

shape: (1, 4)

%count_same_GetV1TautomerEnumerator	%count_same_cactus	%count_same_CACTVS	%count_same_Expt
f64	f64	f64	f64
68.0	0.0	0.0	25.5

Let’s compare the various sources to our baseline, RDKit’s default tautomerizer:

GetV1TautomerEnumerator: The fact that ~68% of the sets of tautomers are the same comports with Greg Landrum’s note that “the code adds a missed case to the enumeration rule set”–adding one rule seems like a minor change.
cactus and CACTVS: The fact that these sources have no Refs where the set of tautomers are the same is not surprising because we enumerated tautomers for a limited number of Refs for cactus and CACTVS.
Expt: The fact that ~25.5% of the sets of tautomers are the same seems to reflect that the baseline tautomerizer didn’t find tautomers beyond those observed experimentally. If we filter down to these matching sets, we find that there are only a few tautomers in each case:

df_same_Expt = (
    df_melted_aggregated.filter(pl.col("same_Expt") == True)
    .select(
        [
            "Ref",
            "same_Expt",
            "n_Expt",
            "n_TautomerEnumerator",
            "tauts_TautomerEnumerator",
            "tauts_Expt",
        ]
    )
    .sort("Ref")
)
df_same_Expt.head()

shape: (5, 6)

Ref	same_Expt	n_Expt	n_TautomerEnumerator	tauts_TautomerEnumerator	tauts_Expt
i64	bool	u32	u32	list[str]	list[str]
2	true	2	2	["c1cn[nH]n1", "c1c[nH]nn1"]	["c1c[nH]nn1", "c1cn[nH]n1"]
3	true	2	2	["Cc1ccn[nH]1", "Cc1cc[nH]n1"]	["Cc1ccn[nH]1", "Cc1cc[nH]n1"]
24	true	1	1	["N=c1cccccc1N"]	["N=c1cccccc1N"]
40	true	3	3	["O=c1cc(-c2ccccc2)[nH]n1-c1ccc([N+](=O)[O-])cc1[N+](=O)[O-]", "O=C1CC(c2ccccc2)=NN1c1ccc([N+](=O)[O-])cc1[N+](=O)[O-]", "O=[N+]([O-])c1ccc(-n2nc(-c3ccccc3)cc2O)c([N+](=O)[O-])c1"]	["O=C1CC(c2ccccc2)=NN1c1ccc([N+](=O)[O-])cc1[N+](=O)[O-]", "O=[N+]([O-])c1ccc(-n2nc(-c3ccccc3)cc2O)c([N+](=O)[O-])c1", "O=c1cc(-c2ccccc2)[nH]n1-c1ccc([N+](=O)[O-])cc1[N+](=O)[O-]"]
41	true	3	3	["CC1=NN(c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])C(=O)C1", "Cc1cc(=O)n(-c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])[nH]1", "Cc1cc(O)n(-c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])n1"]	["Cc1cc(=O)n(-c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])[nH]1", "Cc1cc(O)n(-c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])n1", "CC1=NN(c2ccc([N+](=O)[O-])cc2[N+](=O)[O-])C(=O)C1"]

and the maximum number of experimentally-observed tautomers is three:

df_same_Expt.select(pl.col("n_Expt")).max().item()

Summary: RDKit’s new tautomer enumerator produces sets of tautomers that are the same as its V1 about 68% of the time, and about 25% of the time the same as the experimental data. More than 85% of the Refs have two experimentally-observed tautomers.

Statistical comparison

We define a dictionary of sources and their type:

baseline (RDKit TautomerEnumerator)
all: we have data for all Refs
manual: we gathered data manually, so we do not have data for all Refs

# List sources, and the sources to compare to the first source
sources = {
    "TautomerEnumerator": "baseline",
    "GetV1TautomerEnumerator": "all",
    "cactus": "manual",
    "CACTVS": "manual",
    "Expt": "all",
}
sources_compare = {
    source: kind for source, kind in sources.items() if kind != "baseline"
}

Now we set up data structures to hold data from the dataframe’s nDiff_ columns–the difference in the number of tautomers for the baseline source minus for another source.

# Create several dictionaries where each key will be a tautomer source
n_tauts_diff = dict()
n_tauts_diff_no_zeros = dict()
n_tauts_diff_positive = dict()
n_tauts_diff_negative = dict()
for source in sources_compare.keys():
    # Extract the data from the dataframe's nDiff_ columns
    nDiff_col = f"nDiff_{source.replace('tauts_', '')}"
    this_n_tauts_diff = df_melted_aggregated.filter(pl.col(nDiff_col).is_not_null())[
        nDiff_col
    ].to_list()
    # Remove None values to prevent errors
    this_n_tauts_diff_clean = [item for item in this_n_tauts_diff if item is not None]
    n_tauts_diff.update({source: this_n_tauts_diff_clean})
    n_tauts_diff_no_zeros.update(
        {source: [diff for diff in n_tauts_diff[source] if diff != 0]}
    )
    n_tauts_diff_positive.update(
        {
            source: [
                diff
                for diff in n_tauts_diff_no_zeros[source]
                if (diff is not None) and (diff > 0)
            ]
        }
    )
    n_tauts_diff_negative.update(
        {
            source: [
                diff
                for diff in n_tauts_diff_no_zeros[source]
                if (diff is not None) and (diff < 0)
            ]
        }
    )

Next we define a function to return various descriptive statistics of interest for an iterable, which we’ll apply to those lists of the difference in the number of tautomers between sources.

def stats_for_iterable(iterable: Iterable) -> dict[str : int | float]:
    """Calculate descriptive statistics for an iterable.
    The input could be a set, but be aware that a set can only contain each value once.

    :param iterable: Iterable to be analyzed
    :returns: Descriptive statistics including counts of some values, mean, median, mode, and standard deviation
    """
    iterable_clean = [item for item in iterable if item is not None]
    stats = dict()
    for delta in range(-5, 6):
        stats.update({f"count({delta})": iterable_clean.count(delta)})

    try:
        mean = statistics.mean(iterable_clean)
    except StatisticsError:
        mean = None

    try:
        median = statistics.median(iterable_clean)
    except StatisticsError:
        median = None

    try:
        mode = statistics.mode(iterable_clean)
    except StatisticsError:
        mode = None

    stats.update(
        {
            "mean": mean,
            "median": median,
            "mode": mode,
        }
    )
    try:
        stats.update({"std dev": statistics.stdev(iterable_clean)})
    except StatisticsError:
        stats.update({"std dev": -1})
    return stats

To address the original question of which algorithms find more tautomers, we compute the mean of the difference between the baseline source (RDKit’s new algorithm) and the other sources.

print(f'{"Source":25} Mean')
for source in sources_compare.keys():
    stats = stats_for_iterable(
        df_melted_aggregated[f"nDiff_{source.replace('tauts','')}"]
    )
    print(f'{source:25} {stats["mean"]:.2f}')

Source                    Mean
GetV1TautomerEnumerator   -4.66
cactus                    212.00
CACTVS                    55.80
Expt                      9.78

These statistics are skewed by selection bias: I chose several Refs with hundreds of tautomers to gather manual data from cactus and CACTVS, so the difference in the number of tautomers for those Refs can be quite large. Whereas the experimental results are available for all Refs, and in many cases there are only a few tautomers for those Refs, so averaged over all Refs the mean difference in the number of tautomers for the experimental results is relatively small.

Let’s make a fairer comparison for the manual sources by narrowing the Refs to the narrowest set, namely for CACTVS–all the other sets (cactus, Expt, and RDKit GetV1TautomerEnumerator) are supersets of that narrow set, which contains five Refs.

df_narrow_set = (
    df_melted_aggregated.filter(pl.col("nDiff_CACTVS").is_not_null())
    .select(
        [
            "Ref",
            "n_TautomerEnumerator",
            "n_CACTVS",
            "n_cactus",
            "nDiff_CACTVS",
            "nDiff_cactus",
        ]
    )
    .sort("Ref")
)
df_narrow_set

shape: (5, 6)

Ref	n_TautomerEnumerator	n_CACTVS	n_cactus	nDiff_CACTVS	nDiff_cactus
i64	u32	u32	u32	i64	i64
467	360	275	32	85	328
890	454	258	31	196	423
891	337	324	26	13	311
1512	3	16	8	-13	-5
1704	3	5	4	-2	-1

That set contains five Refs:

n_narrow_set = df_narrow_set.select(pl.len()).item()
n_narrow_set

For the manual sources, we sum over the Refs, and then divide by the number of Refs to calculate a per-Ref difference.

df_narrow_set_sum = df_narrow_set.select(
    cs.starts_with("nDiff_").name.prefix("sum_")
).sum()
df_narrow_set_sum

shape: (1, 2)

sum_nDiff_CACTVS	sum_nDiff_cactus
i64	i64
279	1056

df_narrow_set_sum.select((pl.all() / n_narrow_set).round(1).name.prefix("per_ref_"))

shape: (1, 2)

per_ref_sum_nDiff_CACTVS	per_ref_sum_nDiff_cactus
f64	f64
55.8	211.2

So in this comparison tailored to the manually-generated sources, the manual sources’ differences in number of tautomers identified compared to RDKit’s baseline algorithm are, from least to greatest deficit:

CACTVS algorithm, which finds about 56 fewer tautomers per reference
cactus algorithm, which finds about 211 fewer tautomers per reference

Summary: Comparing the various tautomer sources, let’s go from most to fewest number of tautomers found:

*RDKit’s GetV1TautomerEnumerator produces the most, about 5 more per Ref than RDKit’s updated algorithm
*RDKit’s updated algorithm is our baseline
+CACTVS’s algorithm finds about 56 fewer tautomers per Ref
+cactus’s algorithm finds about 211 fewer tautomers per Ref
*Experimental results find about 229 fewer tautomers per Ref

*Comparison of all Refs because data source includes all refs

+Comparison of 5 Refs because data source does not include all refs because it had to be run manually for each Ref

Graphical comparison

We plot a histogram of the difference in the number of tautomers, compared to the baseline RDKit TautomerEnumerator, for the two sources which have data for all Refs: RDKit V1 enumerator, and experimental. This plot focuses on differences close to zero to depict the most common cases.

n_tauts_diff_all = {
    source: n_tauts_diff[f"{source}"]
    for source in sources.keys()
    if sources[source] == "all"
}

# Set the x bounds (axis will go from negative to positive of this value)
x_len = 17

# Calculate the percent of each data series (excluding delta=0) this plot covers

# Create dictionaries to store values in; keys will be data sources
n_tauts_diff_no_zeros_this_range = dict()
non_zero_counts_this_range = dict()
for source in n_tauts_diff_all.keys():
    n_tauts_diff_no_zeros_this_range.update(
        {source: [diff for diff in n_tauts_diff_no_zeros[source] if abs(diff) <= x_len]}
    )
    non_zero_counts_this_range.update(
        {
            source: len(n_tauts_diff_no_zeros_this_range[source])
            / len(n_tauts_diff_no_zeros[source])
        }
    )
non_zero_counts_this_range_min = min(list(non_zero_counts_this_range.values()))

# Set an accessible color palette from https://venngage.com/tools/accessible-color-palette-generator
colors = [
    "#029356",
    "#606ff3",
]
sns.set_palette(sns.color_palette(colors))

plt.ylabel("Frequency")
xlabel = "Difference in # tautomers from RDKit baseline algorithm minus other source"
xlabel += f"\n(this range covers at least {non_zero_counts_this_range_min:.0%} of non-zero counts for each source)"
plt.xlabel(xlabel)

# Limit the max y to avoid making smaller y values too hard to discern
plt.ylim(0, 600)

# Because this max y cuts off the y value for GetV1TautomerEnumerator at delta = 0, show the value as a label

# Find the frequency at n_tauts_diff = 0
frequency_at_zero = n_tauts_diff["GetV1TautomerEnumerator"].count(0)

# Annotate the plot with the frequency at n_tauts_diff = 0 for GetV1TautomerEnumerator
plt.text(
    1.5,
    max(plt.ylim()) * 0.9,
    f"0 has frequency of {frequency_at_zero}\nfor GetV1TautomerEnumerator",
    ha="left",
    va="bottom",
    color=colors[0],
)

# Set the histogram bins so there will be a bin (bar) for each integer value
x_min = (-1 * x_len) - 2
x_max = x_len + 2
bins = range(x_min, x_max, 1)

# Create the histogram plot
sns.histplot(n_tauts_diff_all, bins=bins)

plt.show()

Histogram of frequency against difference in number of tautomers from RDKit baseline algorithm minus other source, with the greatest frequency at x=0 and rapidly decreasing frequency as x decreases to -17 and increases to 17

Here’s the histogram for the RDKit V1 enumerator only with a much wider x-range.

n_tauts_diff_all = {
    source: n_tauts_diff[f"{source}"]
    for source in sources.keys()
    if source == "GetV1TautomerEnumerator"
}

# Set the x bounds (will go from negative to positive of this value)
x_axis_min = -650
x_axis_max = 300

# Set the histogram bins so there will be a bin (bar) for each integer value
bins = range(x_axis_min, x_axis_max, 1)

# Calculate the percent of each data series (excluding delta=0) this plot covers

# Create dictionaries to store values in; keys will be data sources
n_tauts_diff_no_zeros_this_range = dict()
non_zero_counts_this_range = dict()
for source in n_tauts_diff_all.keys():
    n_tauts_diff_no_zeros_this_range.update(
        {source: [diff for diff in n_tauts_diff_no_zeros[source] if abs(diff) <= x_len]}
    )
    non_zero_counts_this_range.update(
        {
            source: len(n_tauts_diff_no_zeros_this_range[source])
            / len(n_tauts_diff_no_zeros[source])
        }
    )
non_zero_counts_this_range_min = min(list(non_zero_counts_this_range.values()))

# Set an accessible color palette from https://venngage.com/tools/accessible-color-palette-generator
colors = [
    "#029356",
    "#606ff3",
]
sns.set_palette(sns.color_palette(colors))

# Set the width and height of the plot
plt.figure(figsize=(15, 6))  # Adjust the width as per your requirement

# Create the histogram plot
sns.histplot(n_tauts_diff_all, bins=bins)

plt.ylabel("Frequency")
xlabel = "Difference in # tautomers from RDKit baseline algorithm minus other source"
xlabel += f"\n(this range covers at least {non_zero_counts_this_range_min:.0%} of non-zero counts for each source)"
plt.xlabel(xlabel)

# Limit the max y to avoid making smaller y values too hard to discern
plt.ylim(0, 8)

# Because this max y cuts off the y value for GetV1TautomerEnumerator at delta = 0, show the value as a label

# Find the frequency at n_tauts_diff = 0
frequency_at_zero = n_tauts_diff["GetV1TautomerEnumerator"].count(0)

# Annotate the plot with the frequency at n_tauts_diff = 0 for GetV1TautomerEnumerator
plt.text(
    -45,
    max(plt.ylim()) * 0.9,
    f"0 has frequency of {frequency_at_zero}\nfor GetV1TautomerEnumerator",
    ha="right",
    va="bottom",
    color=colors[0],
)

plt.show()

Histogram of frequency against difference in number of tautomers from RDKit baseline algorithm minus other source (either RDKit GetV1TautomerEnumerator or experimental), with the greatest frequency at x=0 and rapidly decreasing frequency as x decreases to -17 and increases to 17

Here’s the histogram for manual sources: cactus and CACTVS.

# Plot histogram of n_tauts_diff for sources which do not have data for all refs, because each molecule is run manually
n_tauts_diff_manual = {
    source: n_tauts_diff[source]
    for source in sources.keys()
    if sources[source] == "manual"
}

# Set the histogram bins so there will be a bin (bar) for each integer value
combined_set = set(item for sublist in n_tauts_diff_manual.values() for item in sublist)
x_min, x_max = min(combined_set), max(combined_set)

bins = range(x_min, x_max + 1, 1)

colors = [
    "#0073e6",
    "#9b8bf4",
]

plt.figure(figsize=(12, 6))
sns.histplot(
    n_tauts_diff_manual,
    bins=bins,
)
sns.set_palette(sns.color_palette(colors))

# Set the x bounds (will go from negative to positive of this value)

plt.ylabel("Frequency")
xlabel = f"Difference in # tautomers from RDKit baseline algorithm minus other source"
plt.xlabel(xlabel)

# Get the current axis from the plot
ax = plt.gca()

# Set y-axis ticks to be at only integer values
ax.yaxis.set_major_locator(MultipleLocator(1))

plt.show()

Histogram of frequency against difference in number of tautomers from RDKit baseline algorithm minus RDkit GetV1TautomerEnumerator, with the greatest frequency at x=0 and considerable frequencies to x=-200, but little frequency to x=200

Summary: For sources for which we have data for all Refs, the most common difference in number of tautomers is zero. Compared to the baseline RDKit TautomerEnumerator,

Expt has almost exclusively positive differences, meaning there are fewer experimentally-observed tautomers than algorithmically enumerated.
GetV1TautomerEnumerator has more negative differences, meaning it tends to generate more tautomers than the new RDKit algorithm. While most differences are small (-5 to 5), there are dozens of cases where the difference is larger, almost all of them negative. For sources for which we have data for a small number of Refs, the difference in number of tautomers is less clustered. For cactus the difference is commonly multiple hundreds (positive), while for CACTVS the greatest difference is less than 200.

InChI coverage of multiple SMILES

Similar to what we did with the experimentally-observed structures, let’s check how well InChI represents multiple SMILES with one InChI. We start by checking how many fewer InChI are required to represent the set of SMILES for each Ref.

df_InChI = (
    df_melted_aggregated.select(pl.col("nDiff_TautomerEnumerator_InChI").value_counts())
    .unnest("nDiff_TautomerEnumerator_InChI")
    .sort("nDiff_TautomerEnumerator_InChI")
    .with_columns(
        pct=((pl.col("count") / pl.col("count").sum()) * 100).round(1),
        cumulative_pct=(
            (pl.col("count").cum_sum() / pl.col("count").sum()) * 100
        ).round(1),
    )
)
df_InChI.head(10)

shape: (10, 4)

nDiff_TautomerEnumerator_InChI	count	pct	cumulative_pct
i64	u32	f64	f64
0	1292	72.7	72.7
1	162	9.1	81.9
2	110	6.2	88.1
3	56	3.2	91.2
4	36	2.0	93.2
5	27	1.5	94.8
6	24	1.4	96.1
7	18	1.0	97.1
8	6	0.3	97.5
9	3	0.2	97.6

For almost 73% of the Refs, InChI cannot reduce the number of representations compared to SMILES; that is, one InChI is not covering multiple SMILES. For about 9% of Refs, InChI needs one fewer representation that SMILES. Here’s how many total representations each uses:

df_InChI_compare = (
    df_melted_aggregated.select(
        [
            "n_TautomerEnumerator",
            "n_TautomerEnumerator_InChI",
            "nDiff_TautomerEnumerator_InChI",
        ]
    )
    .select(pl.all().name.prefix("sum_"))
    .sum()
)
df_InChI_compare

shape: (1, 3)

sum_n_TautomerEnumerator	sum_n_TautomerEnumerator_InChI	sum_nDiff_TautomerEnumerator_InChI
u32	u32	i64
21163	18633	2530

and on a per-reference basis:

df_InChI_compare.select((pl.all() / Ref_count).round(1).name.prefix("per_ref_"))

shape: (1, 3)

per_ref_sum_n_TautomerEnumerator	per_ref_sum_n_TautomerEnumerator_InChI	per_ref_sum_nDiff_TautomerEnumerator_InChI
f64	f64	f64
11.9	10.5	1.4

and on a percent basis:

nDiff_EnumeratorInChI = df_InChI_compare["sum_nDiff_TautomerEnumerator_InChI"].item()
n_Enumerator = df_InChI_compare["sum_n_TautomerEnumerator"].item()
delta = nDiff_EnumeratorInChI / n_Enumerator
print(f"{nDiff_EnumeratorInChI} / {n_Enumerator} = {delta:.0%}")

2530 / 21163 = 12%

So InChI needs on average 10.5 representations per reference, a reduction of 1.4 (12%) from SMILES. If InChI were totally successful at representing all tautomers with one InChI, it would need only one InChI per reference, so clearly there are improvements to be made to reach that goal.

Let’s plot that the reduction of the number of representations using InChI in a histogram.

n_tauts_diff_InChI = df_melted_aggregated["nDiff_TautomerEnumerator_InChI"].to_list()

x_min, x_max = min(n_tauts_diff_InChI), max(n_tauts_diff_InChI)

plt.figure(figsize=(12, 6))
sns.histplot(
    df_melted_aggregated["nDiff_TautomerEnumerator_InChI"],
    log_scale=(False, True),
    linewidth=1,
)

plt.ylabel("Frequency")
xlabel = (
    f"Difference in # representations from RDKit baseline algorithm: SMILES minus InChI"
)
plt.xlabel(xlabel)

plt.show()

Histogram of frequency against difference in number of representations from RDKit baseline algorithm SMILES minus InChI , with the greatest frequency of >1,000 at x=0, rapidly decreasing as x increases, and the greatest x value being about 475

Summary: For almost 73% of the Refs, InChI cannot reduce the number of representations compared to SMILES. Overall, InChI reduces the number of representations by about 12%.

Visualizing tautomeric structures

Let’s make these results more concrete by visualizing molecular structures using a grid where each row represents a tautomer source. We’ll use my RDKit contribution MolsMatrixToGridImage() so we don’t have to worry about how many tautomers each source has.

The columns will also have meaning: To demonstrate which sources have the same structures, we’ll have each column represent a structure. If that source has that structure, there will be an entry in that cell; if not, it will be blank. To set that up, we define a function to align one iterable below another so that like elements will be in a column; if an element is in an upper row and not a lower row, the lower row will have an empty cell in that column.

def align_iterables(
    iterable1: list | tuple,
    iterable2: list | tuple,
    filler="",
) -> Iterable:
    """Align the second iterable under the first (in columns), using the filler for items in iterable1 not in iterable2,
    for example:

    input:
    iterable1: a,  c,  b
    iterable2: a,  f,  c,  e,  d

    output:
    iterable2: a,  c, "",  f,  e,  d

    Note that no other ordering is applied to either input iterable,
    and items in the second which are not in the first will be kept in their input order.
    If you want them to be ordered in some way, for example alphabetically, order them before calling this function.

    :param iterable1: Template iterable
    :param iterable2: Iterable to be aligned under the template iterable
    :param filler: The filler entry to use to provide an "empty" column to align iterable2 under iterable1; default is empty string but can be a number or anything else
    :returns: Aligned iterable2 such that its entries will line up under the same values in iterable1, or be placed to the right if they are not in iterable1
    """
    # If either iterable has no elements, return the initial iterable2
    if any([len(iterable1) == 0, len(iterable2) == 0]):
        return iterable2

    # Determine index for each element in iterable2:
    iterable2_indices = []
    iterable2_index_max = len(iterable1) - 1
    for iterable2_item in iterable2:
        try:
            iterable2_index = iterable1.index(iterable2_item)
        # If item from input 2 isn't in input 1, append item to the (growing) index list
        except ValueError:
            iterable2_index_max += 1
            iterable2_index = iterable2_index_max
        # If get an AttributeError, tell user that have a bad data type
        except AttributeError:
            raise AttributeError(
                f"The first iterable input must be a list or tuple, not a set or dictionary; it is {iterable1}, which is a {type(iterable1).__name__}"
            )
        iterable2_indices.append(iterable2_index)

    # Create "empty" (all filler) iterable2 aligned--may be longer than iterable2 due to blank spaces
    iterable2_aligned = [filler] * (max(iterable2_indices) + 1)

    # Slot in items from iterable2 to overwrite filler entries
    for i, iterable2_col in enumerate(iterable2_indices):
        try:
            iterable2_aligned[iterable2_col] = iterable2[i]
        except (TypeError, KeyError):
            raise TypeError(
                f"The second iterable input must be a list or tuple, not a set or dictionary; it is {iterable2}, which is a {type(iterable2).__name__}"
            )

    return iterable2_aligned

Here we choose an example Ref to plot molecules for:

df_melted_aggregated_example = df_melted_aggregated.filter(pl.col("Ref") == 467)
df_melted_aggregated_example

shape: (1, 24)

Ref	canon_sml	tauts_TautomerEnumerator	tauts_GetV1TautomerEnumerator	tauts_cactus	tauts_CACTVS	tauts_TautomerEnumerator_InChI	tauts_Expt	n_TautomerEnumerator	n_GetV1TautomerEnumerator	n_cactus	n_CACTVS	n_TautomerEnumerator_InChI	n_Expt	same_GetV1TautomerEnumerator	nDiff_GetV1TautomerEnumerator	same_cactus	nDiff_cactus	same_CACTVS	nDiff_CACTVS	same_TautomerEnumerator_InChI	nDiff_TautomerEnumerator_InChI	same_Expt	nDiff_Expt
i64	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	list[str]	u32	u32	u32	u32	u32	u32	bool	i64	bool	i64	bool	i64	bool	i64	bool	i64
467	["O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12", "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2"]	["O=C1C=C(O)c2c(O)c(=O)c(=C3CCC(=O)C(=O)C3)oc2=C1", "O=C1CC=C(C2OC3=CC(O)=CC(=O)C3C(=O)C2=O)C=C1O", … "O=C1C=C(O)C2=C(O)C(=O)C(C3C=CC(=O)C(=O)C3)OC2=C1"]	["O=C1CC(O)=C2C(=O)C(=O)C(C3=CC(=O)C(=O)C=C3)OC2C1", "O=C1C(=O)c2c(O)cc(O)cc2OC1=C1C=CC(O)C(O)=C1", … "O=C1CC(=O)C2=C(O)C(=O)C(=C3C=CC(=O)C(=O)C3)OC2C1"]	["O=C1C=C2OC(c3ccc(O)c(O)c3)=C(O)C(O)=C2C(=O)C1", "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2", … "O=c1cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c-2c(O)c1"]	["O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2", "OC2=C1C(C(=C(OC1=CC(=C2)O)C3=CC(=C(C=C3)O)O)O)=O", … "OC2=C(O)C(=C1C=CC(=O)C(=O)C1)OC3=C2C(C=C(C3)O)=O"]	["InChI=1S/C15H10O7/c16-7-4-10(19)12-11(5-7)22-15(14(21)13(12)20)6-1-2-8(17)9(18)3-6/h1,3,5,12,15,18H,2,4H2", "InChI=1S/C15H10O7/c16-7-4-10(19)12-11(5-7)22-15(14(21)13(12)20)6-1-2-8(17)9(18)3-6/h4,19H,1-3,5H2", … "InChI=1S/C15H10O7/c16-7-4-10(19)12-11(5-7)22-15(14(21)13(12)20)6-1-2-8(17)9(18)3-6/h3,15H,1-2,4-5H2"]	["O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12", "O=c1cc(O)cc2oc(-c3ccc(O)c(O)c3)c(O)c(O)c1-2"]	360	353	32	275	360	2	false	7	false	328	false	85	false	0	false	358

For the various data sources, we extract SMILES, align each source sequentially, and create RDKit molecules for drawing.

smls_expt = df_melted_aggregated_example.select("tauts_Expt").item().to_list()
smls_baseline = (
    df_melted_aggregated_example.select("tauts_TautomerEnumerator").item().to_list()
)
smls_baseline_aligned = align_iterables(smls_expt, smls_baseline)
smls_v1 = (
    df_melted_aggregated_example.select("tauts_GetV1TautomerEnumerator")
    .item()
    .to_list()
)
smls_v1_aligned = align_iterables(smls_baseline_aligned, smls_v1)
smls_cactus = df_melted_aggregated_example.select("tauts_cactus").item().to_list()
smls_cactus_aligned = align_iterables(smls_v1_aligned, smls_cactus)
smls_cactvs = df_melted_aggregated_example.select("tauts_CACTVS").item().to_list()
smls_cactvs_aligned = align_iterables(smls_cactus_aligned, smls_cactvs)

tauts_expt = [mol_from_sml(sml) for sml in smls_expt]
tauts_baseline_aligned = [
    mol_from_sml(sml) if sml else None for sml in smls_baseline_aligned
]
tauts_v1_aligned = [mol_from_sml(sml) if sml else None for sml in smls_v1_aligned]
tauts_cactus_aligned = [mol_from_sml(sml) if sml else None for sml in smls_cactus_aligned]
tauts_cactvs_aligned = [
    mol_from_sml(sml) if sml else None for sml in smls_cactvs_aligned
]

In RDKit molecular grid images, a cell’s label appears only if that cell has a molecule (is not empty). So let’s create a function to determine where to put the label for each row.

def find_first_non_none_index(
    my_iterable: Iterable,
):
    """Find the index of the first item in an iterable which is not None
    :param my_iterable: The iterable to examine
    :returns: The index of the first item which is not None; if there is no such index, return None
    """
    for i, item in enumerate(my_iterable):
        if item is not None:
            return i
    # Return None if all elements are None
    return None

Any time we show tautomeric structures, it’s helpful to align maximum common substructure (MCS) so we can focus on the differences between the tautomers in the same orientation.

# Limit number of columns so structures will be readable (not too small)
max_cols = 6

# Create nested (2D) data structures for Draw.MolsMatrixToGridImage
smls_matrix = [
    smls_expt,
    smls_baseline_aligned,
    smls_v1_aligned,
]
tauts_matrix = [
    tauts_expt[:max_cols],
    tauts_baseline_aligned[:max_cols],
    tauts_v1_aligned[:max_cols],
]

row_labels = [
    f"Expt: {len(smls_expt)} tautomers",
    f"RDKit baseline code: {len(smls_baseline)} tautomers",
    f"RDKit v1 code: {len(smls_v1)} tautomers",
]

# If cactus tautomers generated, add row for them
if not all([sml is None for sml in smls_cactus]):
    smls_matrix.append(smls_cactus_aligned)
    tauts_matrix.append(tauts_cactus_aligned[:max_cols])
    row_labels.append(f"cactus code: {len(smls_cactus)} tautomers")

# If CACTVS tautomers generated, add row for them
if not all([sml is None for sml in smls_cactvs]):
    smls_matrix.append(smls_cactvs)
    tauts_matrix.append(tauts_cactvs_aligned[:max_cols])
    row_labels.append(f"CACTVS code: {len(smls_cactvs)} tautomers")

# Initialize a legends matrix with empty strings
legends_matrix = [["" for item in row] for row in tauts_matrix]

# Label first non-empty column in each row with source and number of tautomers
# First row is the template that other rows align under, so first molecule will be non-empty
legends_matrix[0][0] = row_labels[0]
# For subsequent rows, any molecule could be empty (if it's not in the row(s) above)
for label_index, row_label in enumerate(row_labels[1:]):
    col_to_label = find_first_non_none_index(tauts_matrix[label_index + 1])

    if col_to_label != None:
        legends_matrix[label_index + 1][col_to_label] = row_label

# Align 2D structures based on maximum common substructure (mcs)
# Exclude blank entries to avoid problem with aligning empty molecules
tauts_all = [item for sublist in tauts_matrix for item in sublist if item is not None]
mcs = rdFMCS.FindMCS(
    tauts_all,
    bondCompare=rdFMCS.BondCompare.CompareAny,
)
mcs_smarts = mcs.smartsString
mcs_mol = Chem.MolFromSmarts(mcs_smarts)
Chem.Compute2DCoords(mcs_mol)
for m in tauts_all:
    Chem.GenerateDepictionMatching2DStructure(m, mcs_mol)

Draw.MolsMatrixToGridImage(
    molsMatrix=tauts_matrix, legendsMatrix=legends_matrix, useSVG=True, maxMols=1700
)

Molecular grid image with the following rows of source of tautomers: Experimental (2 tautomers), RDKit baseline (360), RDKit V1 (353), NIH (32), and CACTVS (275). The structures have a consistent alignment.

An important option to set when determining the MCS for tautomers is to allow matching of any bond type, for example a single bond in one structure should match a double bond in another, because tautomers often have different bond orders at the same position. In the RDKit, that option is bondCompare=rdFMCS.BondCompare.CompareAny. Notice how extensive the MCS is with that option:

tauts_baseline = [Chem.MolFromSmiles(sml) for sml in smls_baseline_aligned]

mcs = rdFMCS.FindMCS(
    tauts_baseline,
    bondCompare=rdFMCS.BondCompare.CompareAny,
)
mcs_smarts = mcs.smartsString
mcs_mol = Chem.MolFromSmarts(mcs_smarts)

# If there is a maximum common substructure
if mcs_mol:
    # And it has atoms
    if mcs_mol.GetNumAtoms() > 0:
        # Compute the coordinates and align the structures
        Chem.Compute2DCoords(mcs_mol)
        for m in tauts_baseline:
            Chem.GenerateDepictionMatching2DStructure(m, mcs_mol)

        Draw.MolsToGridImage(tauts_baseline, maxMols=100, molsPerRow=7)

mcs_mol

/Users/jemonat/Projects/bertiewooster.github.io/venv/lib/python3.11/site-packages/rdkit/Chem/Draw/IPythonConsole.py:261: UserWarning: Truncating the list of molecules to be displayed to 100. Change the maxMols value to display more.
  warnings.warn(

Maximum common substructure that allows matching of any bond type has three six-membered rings and seven oxygen atoms

and how much smaller the MCS is if don’t use that option:

mcs_strict = rdFMCS.FindMCS(
    tauts_baseline,
)
mcs_smarts_strict = mcs_strict.smartsString
mcs_mol_strict = Chem.MolFromSmarts(mcs_smarts_strict)
mcs_mol_strict

Maximum common substructure that does not allow matching of any bond type has only five atoms

and thus ineffective at aligning the structures between columns, not to mention odd-looking structures:

Chem.Compute2DCoords(mcs_mol_strict)
for m in tauts_all:
    Chem.GenerateDepictionMatching2DStructure(m, mcs_mol_strict)

Draw.MolsMatrixToGridImage(
    molsMatrix=tauts_matrix, legendsMatrix=legends_matrix, useSVG=True, maxMols=1700
)

Molecular grid image with the same data as the previous molecular grid, except the structure do not have a consistent alignment and do not conform to molecular depiction standards

Summary: Putting each source in a row, and aligning columns by common structures, helps visualize the tautomers from each source for a given Ref. Allowing matching of any bond type greatly aids in visually comparing tautomers from each source.

Conclusions

Tautomer enumeration algorithms

Comparing the various tautomer sources, let’s go from most to fewest number of tautomers found:

*RDKit’s GetV1TautomerEnumerator produces the most, about 5 more per Ref than RDKit’s updated algorithm
*RDKit’s updated algorithm is our baseline
+CACTVS’s algorithm finds about 56 fewer tautomers per Ref
+cactus’s algorithm finds about 211 fewer tautomers per Ref
*Experimental results find about 229 fewer tautomers per Ref

*Comparison of all Refs because data source includes all refs

+Comparison of 5 Refs because data source does not include all refs because it had to be run manually for each Ref

Of course, finding more tautomers is not necessarily better. If a tautomer is not observed experimentally, it may not exist in situ and thus not participate in whatever chemistry the system undergoes. The hundreds of tautomers generated by algorithms may be energetically unfavorable and thus unlikely to exist in situ, though a tautomer that an algorithm misses may be observed experimentally.

The RDKit 2022.03 release notes suggest that the updated algorithm deliberately does not produce certain tautomers (which do not match the more-specific rules). Considering Greg Landrum’s comment that the “code change adds a missed case to the enumeration rule set”, it seems that the number of additional tautomers found by the additional rule is outweighed by the narrowing of rules.

It is interesting that the CACTVS algorithm, which has additional transforms, produces fewer tautomers; though that’s for a narrow set of Refs, so it may not hold for a larger dataset. It makes sense that the cactus site produces fewer tautomers than CACTVS because the cactus algorithm by Nicklaus et al. is similar yet has fewer rules.

InChI’s ability to encompass multiple tautomers with one InChI

Regarding InChI’s goal of being “tautomer-invariant”, meaning tautomers of a structure should be assigned the same InChI so a single InChI should suffice to represent all tautomers of a given structure:

For the experimentally-observed tautomeric structures, InChI needed on average 1.95 representations to cover the 1776 Refs, which is much greater than the goal of 1.0 and only 11% better than SMILES (averaging 2.2 representations per Ref).
For the tautomeric structures generated by the updated RDKit algorithm, InChI needs on average 10.5 representations per Ref, a reduction of 1.4 from SMILES, or 12%. It seems encouraging for the generality of InChI that it does as well covering multiple SMILES for algorithmically-generated tautomers, which could be far afield of structures observed experimentally.

So overall the current InChI implementation has modest success in covering all tautomers of a structure with one InChI. Nicklaus and others are working “Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2” so hopefully that will improve InChI’s success.

Acknowledgments

Many thanks to Marc Nicklaus for running CACTVS tautomer enumerations, giving background on the web tool, informative discussions, and reviewing a draft of this post. Also thanks to my co-workers at Aionics for discussions about tautomers. (This post was done on my own and does not necessarily represent my employer’s views).