Molecules have a color if their electronic energy levels are close enough to absorb visible rather than ultraviolet light. For organic molecules, that’s often because of an extensive chain of conjugated bonds. Can we use cheminformatics to find evidence that increasing conjugated bond chain length decreases absorption wavelength, which makes a molecule colored?

Open this notebook in Google Colab so you can run it without installing anything on your computer

Introduction

Colored molecules have applications in television screens, sensors, and to give color to fabrics, paints, foods, and more. I did my PhD in optical spectroscopy, so I was interested by the open-access database Experimental database of optical properties of organic compounds with more than 20,000 data points. One of the first things that came to mind was that organic colored compounds often get their color from an extensive chain of conjugated bonds. Here are quotes from an online textbook:

If you extend this to compounds with really massive delocalisation, the wavelength absorbed will eventually be high enough to be in the visible region of the spectrum, and the compound will then be seen as colored. A good example of this is the orange plant pigment, beta-carotene - present in carrots, for example.

The more delocalization there is, the smaller the gap between the highest energy pi bonding orbital and the lowest energy pi anti-bonding orbital. To promote an electron therefore takes less energy in beta-carotene than in the cases we’ve looked at so far - because the gap between the levels is less.

Here are two examples of molecules, one cyclic and one acylic, with conjugated pi bonds. In each case, the p orbitals on adjacent atoms line up so that the electrons are delocalized over all the conjugated bonds in the molecule (which is all the bonds in these two molecules).

Molecules and their conjugated pi bonds

Attribution: Conjugated Pi Bond Systems from LibreTexts, remixed by Jeremy Monat

The more bonds that electrons can delocalize over, the more pi bonding and anti-bonding orbitals; and the highest occupied molecular orbital (HOMO, the top green line in each diagram below, the highest-energy pi bonding orbital) increases in energy, while the lowest unoccupied molecular orbital (LUMO, the lowest red line in each diagram below, the lowest-energy pi anti-bonding orbital) decreases in energy. The gap between the two becomes smaller, and if the conjugated chain is long enough, the HOMO-LUMO energy gap becomes small enough that it’s in the visible spectrum rather than the ultraviolet. When a molecule absorbs visible light, we perceive that it has a color.

Molecular orbital energy diagram for ethene, buta-1,3-diene, and hexa-1,3,5-triene

Attribution: What Causes Molecules to Absorb UV and Visible Light from LibreTexts, authored, remixed, and/or curated by Jim Clark.

The visible spectrum starts at about 400 nm (violet) and goes to about 740 nm (red). So a molecule that absorbs light in that range will be perceived as colored.

Visible spectrum

Attribution: Gringer, Public domain, via Wikimedia Commons.

Cheminformatics exploration

Can we use cheminformatics to find evidence that increasing conjugated bond chain length decreases absorption wavelength? To check, I used the open-access database Experimental database of optical properties of organic compounds from 2020 with >20,000 chromophore-solvent combinations. The optical data can be downloaded as a CSV file (version 3).

Packages setup

import math
from typing import Iterable, Dict, List, Tuple
from IPython.display import display, Math
from functools import cache
import warnings

from PIL import Image
from rdkit import Chem
from rdkit.Chem import AllChem, Draw, Mol
import numpy as np
import polars as pl
import polars.selectors as cs
from polars.exceptions import ColumnNotFoundError
import altair as alt
from great_tables import GT, style, loc
import latexify
# Suppress the specific RDKit IPythonConsole truncation warning for MolsToGridImage
warnings.filterwarnings(
    "ignore", message=r"Truncating the list of molecules to be displayed to \d+"
)

Find longest conjugated bond chain

To find the longest conjugated bond chain in each molecule, we’ll use a graph-traversal algorithm. We’ll define two functions–one to find connected bonds and one to get the longest conjugated bond chain in a molecule–then describe how they work using an example molecule.

def find_connected_bonds(
    graph: np.ndarray,
    start_bond: int,
    visited: Iterable,
) -> List:
    """
    Find connected bonds in an adjacency matrix. Terminology is specific to bonds in molecules, but algorithm should work for any adjacency matrix.

    :param graph: The bond adjacency matrix
    :param start_bond: The bond index to start the chain from
    :param visited: The bond indices already visited; preferably a set for efficiency
    :returns: The list of bonds connected to, and including, the start bond
    """
    # Initialize the stack with the start bond
    stack = [start_bond]
    connected_component = []

    while stack:
        # Pop off the stack the last bond added
        bond_index = stack.pop()
        # Only follow the chain from this bond if this bond hasn't already been visited
        if bond_index not in visited:
            # Add the bond index to the list of visited bonds
            visited.add(bond_index)
            # Note that the bond is connected to the start bond
            connected_component.append(bond_index)

            # Add all neighbors to the stack for traversal if they haven't already been visited
            for neighbor, is_connected in enumerate(graph[bond_index]):
                if is_connected and neighbor not in visited:
                    stack.append(neighbor)
    return connected_component
def get_longest_conjugated_bond_chain(
    mol: Mol, verbose: bool = False
) -> tuple[List[int], np.ndarray]:
    """
    Get the longest conjugated bond chain in a molecule.

    :param mol: The RDKit molecule
    :param verbose: Whether to return the bond_matrix in addition to conjugated_bonds_out
    :returns: The list of bond indices of the longest conjugated bond chain in the molecule; if verbose is true, also the bond adjacency matrix
    """
    # Create a list to store conjugated bond indices
    conjugated_bonds = [
        bond.GetIdx() for bond in mol.GetBonds() if bond.GetIsConjugated()
    ]

    if not conjugated_bonds:
        # No conjugated bonds found, return empty list
        return []

    n_conjugated_bonds = len(conjugated_bonds)

    # Build a subgraph of the conjugated bonds only;
    #   initially populate it with zeroes, indicating bonds are not connected
    bond_matrix = np.zeros((n_conjugated_bonds, n_conjugated_bonds), dtype=int)

    # Populate the bond adjacency matrix for conjugated bonds
    for i, bond_i in enumerate(conjugated_bonds):
        bond_i_obj = mol.GetBondWithIdx(bond_i)
        for j, bond_j in enumerate(conjugated_bonds):
            if i != j:
                bond_j_obj = mol.GetBondWithIdx(bond_j)
                # Check if two conjugated bonds share an atom--
                #   do the set of {beginning atom, ending atom} overlap for the two bonds
                if (
                    len(
                        set([bond_i_obj.GetBeginAtomIdx(), bond_i_obj.GetEndAtomIdx()])
                        & set(
                            [bond_j_obj.GetBeginAtomIdx(), bond_j_obj.GetEndAtomIdx()]
                        )
                    )
                    > 0
                ):
                    # Change the bond matrix value to 1, indicating the two bonds are connected
                    bond_matrix[i, j] = 1
                    bond_matrix[j, i] = 1

    # Initialize variables to store the longest conjugated bond chain
    visited = set()
    longest_bond_chain = []

    # Starting from each bond, traverse the graph and find the largest connected component
    for start_bond in range(n_conjugated_bonds):
        if start_bond not in visited:
            bond_chain = find_connected_bonds(bond_matrix, start_bond, visited)
            # Note that bonds are added to `visited` in find_connected_bonds(),
            #   so any bonds already visited from a previous starting bond
            #   won't have find_connected_bonds run on it
            # If this chain is longer than the longest one found so far, mark this chain as the longest
            if len(bond_chain) > len(longest_bond_chain):
                longest_bond_chain = bond_chain

    # Convert subgraph bond indices back to the original bond indices
    conjugated_bonds_out = [conjugated_bonds[i] for i in longest_bond_chain]
    conjugated_bonds_out.sort()

    if not verbose:
        return conjugated_bonds_out
    else:
        return conjugated_bonds_out, bond_matrix

Let’s use an example branched molecule to explain how these algorithms work.

C6H8 = Chem.MolFromSmiles("C=CC(=C)C=C")
longest_conjugated_bond_chain, bond_matrix = get_longest_conjugated_bond_chain(
    mol=C6H8, verbose=True
)
Draw.MolToImage(
    C6H8,
    highlightBonds=longest_conjugated_bond_chain,
)

Branched compound 3-methylidenepenta-1,4-diene with all bonds highlighted because they are all conjugated

Let’s label the bond indices using RDKit’s addBondIndices.

opts = Draw.MolDrawOptions()
opts.addBondIndices = True
Draw.MolToImage(C6H8,size=(350,300),options=opts)

Branched compound 3-methylidenepenta-1,4-diene with the bonds numbered: 0, 1, 3, 4 along the molecule's spine, and 2 for the branch between bonds 1 and 3

To understand the bond adjacency matrix, let’s use Polars’ Great Tables integration to make a nicely-formatted table.

# Convert the bond matrix to a Polars DataFrame
df_bond_matrix = pl.DataFrame(bond_matrix)

# Rename columns to add custom labels
df_bond_matrix = df_bond_matrix.rename(
    {f"column_{str(i)}": f"Bond {i}" for i in range(bond_matrix.shape[1])}
)

# Add row index labels (Polars doesn't have an index column, so we'll add it as a new column)
row_indices = [f"Bond {i}" for i in range(bond_matrix.shape[0])]
index_col = "Adjacent?"
df_bond_matrix = df_bond_matrix.insert_column(0, pl.Series(index_col, row_indices))

# Add a Total column to sum up how many bonds a given bond is connected to
total_col = "Total"
df_bond_matrix = df_bond_matrix.with_columns(
    pl.sum_horizontal(cs.starts_with("Bond")).alias(total_col)
)

# Use GreatTables to format the table
GT(df_bond_matrix).tab_options(
    # Bold the column headings
    column_labels_font_weight="bold",
).tab_style(
    # Bold the index and total columns
    style=style.text(weight="bold"),
    locations=loc.body(columns=[index_col, total_col]),
).tab_header(
    # Add a title to the table
    title="Bond adjacency matrix"
)
Bond adjacency matrix
Adjacent? Bond 0 Bond 1 Bond 2 Bond 3 Bond 4 Total
Bond 0 0 1 0 0 0 1
Bond 1 1 0 1 1 0 3
Bond 2 0 1 0 1 0 2
Bond 3 0 1 1 0 1 3
Bond 4 0 0 0 1 0 1

The table demonstrates which bonds are adjacent, in the sense that the two bonds share an atom. For example, bond 1 is adjacent to bonds 0, 2, and 3. That makes sense based on the molecular diagram.

The bond chain starts with a start bond, in this case 0, and follows all its adjacent bonds to make a chain. Here, the algorithm went to bond 1 (the only bond connected to bond 0), then at the branch chose to go off the molecule’s spine (longest atom chain) to go to bond 2, then followed the other branch to complete the molecule’s spine (bonds 3 and 4):

longest_conjugated_bond_chain
[0, 1, 2, 3, 4]

In this case, a single bond chain found all the conjugated bonds in the molecule. The algorithm loops over all conjugated bonds to make sure it finds the longest chain. But if a given start bond (e.g., 1) was already visited because it was in the bond chain of a previous starting bond (e.g., 0), the algorithm doesn’t re-trace the same bond chain. This is a big computational savings because it avoids unnecessary graph traversals, which are expensive. This is facilitated by the variable visited being passed from get_longest_conjugated_bond_chain() to find_connected_bonds(), where it is modified by adding the nodes visited.

Additional computational savings comes from excluding the non-conjugated bonds from the adjacency matrix. While not noticeable for the example molecule because all its bonds are conjugated, this can greatly reduce the adjacency matrix, and thus the graph traversal, for molecules where some bonds are not conjugated.

Check that longest conjugated bond chain gives expected results

Let’s make sure that our algorithm gives the expected results for a variety of cyclic and acyclic molecules.

examples = {
    "benzene": "c1ccccc1",
    "naphthalene": "c1c2ccccc2ccc1",
    "anthracene": "c1ccc2cc3ccccc3cc2c1",
    "toluene": "c1ccccc1C",
    "benzene + 5 linear conjugated bonds": "c1ccccc1CC=CC=CC=C",
    "benzene + 7 linear conjugated bonds": "c1ccccc1CC=CC=CC=CC=C",
    "1,3-butadiene": "C=CC=C",
    "branched": "C=C\C=C/C(/C=C)=C/C=C",
    "branched + distant": "C=C\C=C\C\C=C\C=C/C(/C=C)=C/C=C",
}
mols = [Chem.MolFromSmiles(sml) for sml in examples.values()]
conjugated_bonds = [get_longest_conjugated_bond_chain(mol) for mol in mols]
Draw.MolsToGridImage(
    mols=mols,
    legends=examples.keys(),
    highlightBondLists=conjugated_bonds,
    subImgSize=(300, 200),
)

Grid of molecules with each's longest conjugated chain highlighted

Those highlighted conjugated bond chains are as expected, for example

  • all the bonds are conjugated in benzene, as well as fused polycyclic molecules naphthalene and anthracene
  • the bond off the ring in toluene is not conjugated
  • if we put a side-chain on benzene with fewer than six conjugated bonds, the benzyl moiety remains the longest conjugated chain; but if we have seven conjugated bonds on the side chain, it becomes the longest conjugated chain
  • in the “branched + distant” molecule, if we break the conjugation chain by having two C-C single bonds in a row, the chain does not include the distant, disconnected conjugated bonds

Of course we should check beta-carotene, whose color comes from its extended conjugated bond chain and is “responsible for the orange color of carrots”. beta-carotene “absorbs most strongly between 400-500 nm. This is the green/blue part of the spectrum.” Because it absorbs those wavelengths, when we look at a sample of beta-carotene we see the reflected light of the complimentary colors, so we perceive an orange color. Note that beta-carotene’s absorption isn’t that much lower energy (higher wavelength) than ultraviolet (which goes up to 400 nm), so molecules with longer conjugated chains might absorb at higher wavelengths such as 565-590 nm (yellow), giving them perceived colors towards blue, magenta, and purple.

beta_carotene = Chem.MolFromSmiles(
    "CC2(C)CCCC(\C)=C2\C=C\C(\C)=C\C=C\C(\C)=C\C=C\C=C(/C)\C=C\C=C(/C)\C=C\C1=C(/C)CCCC1(C)C"
)
longest_conjugated_bond_chain = get_longest_conjugated_bond_chain(beta_carotene)
print(
    f"beta-carotene's longest conjugated bond chain is {len(longest_conjugated_bond_chain)} bonds long."
)
Draw.MolToImage(
    beta_carotene,
    highlightBonds=longest_conjugated_bond_chain,
    size=(800, 200),
)
beta-carotene's longest conjugated bond chain is 21 bonds long.

beta-carotene molecular structure with its conjugated bond chain highlighted

And beta-carotene’s molecular structure is also beautifully symmetric.

Now that we have an algorithm to find the longest conjugated bond chain in a molecule, let’s apply it to the optical dataset.

Prepare data

Let’s read into a Polars dataframe the data from the optical dataset CSV file.

df = pl.read_csv("../data/DB for chromophore_Sci_Data_rev03.csv")

# If there are any rows where all values are null, drop those rows
df = df.filter(~pl.all_horizontal(pl.all().is_null()))  

df
shape: (20_836, 14)
TagChromophoreSolventAbsorption max (nm)Emission max (nm)Lifetime (ns)Quantum yieldlog(e/mol-1 dm3 cm-1)abs FWHM (cm-1)emi FWHM (cm-1)abs FWHM (nm)emi FWHM (nm)Molecular weight (g mol-1)Reference
i64strstrf64f64f64f64f64f64f64f64f64f64str
0"N#Cc1cc2ccc(O)cc2oc1=O""O"355.0410.02.804262nullnullnullnullnullnull187.026943"10.1021/acs.jpcb.5b09905"
1"N#Cc1cc2ccc([O-])cc2oc1=O""O"408.0450.03.961965nullnullnull2128.3null43.2186.019667"10.1021/acs.jpcb.5b09905"
2"CCCCCCCCCCCC#CC#CCCCCCCCCCN1C(…"ClC(Cl)Cl"526.0535.03.602954nullnullnullnullnullnull1060.705709"10.1002/smll.201901342"
3"[O-]c1c(-c2nc3ccccc3s2)cc2ccc3…"CC#N"514.0553.73.81nullnullnull2120.5null65.2350.064509"10.1016/j.snb.2018.10.043"
4"[O-]c1c(-c2nc3ccccc3s2)cc2ccc3…"CS(C)=O"524.0555.04.7nullnull2219.71565.661.248.3350.064509"10.1016/j.snb.2018.10.043"
20831"N#Cc1c(N2CCCCC2)cc(-c2ccc3ccc4…"C1CCOC1"344.0473.0nullnullnullnull3937.1null88.9488.188863"10.1021/ol9000679"
20832"CCCCCCn1c2c(c3ccccc31)-c1c(c3c…"ClCCl"312.0352.0nullnull4.460898nullnullnullnull797.18392"DOI: 10.1021/ol501083d"
20833"CCCCCCn1c2c(c3cc(-c4ccc5ccc6cc…"ClCCl"340.0382.0nullnull4.70757nullnullnullnull1598.142"DOI: 10.1021/ol501083d"
20834"CCCCCCn1c2c(c3ccccc31)-c1c(c3c…"CCCCCCn1c2c(c3ccccc31)-c1c(c3c…321.0494.0nullnullnullnullnullnullnull797.18392"DOI: 10.1021/ol501083d"
20835"CCCCCCn1c2c(c3cc(-c4ccc5ccc6cc…"CCCCCCn1c2c(c3cc(-c4ccc5ccc6cc…365.0466.0nullnullnullnullnullnullnull1598.142"DOI: 10.1021/ol501083d"

The dataset gives absorption and emission maxima as wavelengths. That makes sense because spectroscopists wavelength describes the color of the light used in the laboratory. But to compare, for example, the absorption and emission maximum for a molecule, it’s better to use energy units to express the difference between different molecular energy levels. So let’s convert wavelengths to energies in electron volts, eV.

To determine the conversion factor, let’s use the equation for energy E as a function of wavelength λ.

@latexify.function
def E(
    h: float,
    c: float,
    λ: float,
) -> float:
    """Calculate the binomial coefficient: how many ways there are to choose k items from n items.

    :param h: Planck's constant
    :param c: speed of light
    :param λ: wavelength
    :returns: energy
    """
    return h * c / λ


E
\[\displaystyle E(h, c, λ) = \frac{h c}{λ}\]

Let’s plug in the values of the physical constants to four decimal places and use factor-label dimensional analysis:

eqn = r"$E = \frac{6.6261 \times 10^{-34} \, \text{J} \cdot \text{s} \cdot 2.9979 \times 10^8 \, \text{m/s}}{\lambda \times 10^{-9} \, \text{m}} \cdot \frac{1 \, \text{eV}}{1.6022 \times 10^{-19} \, \text{J}} = \frac{1239.8 \, \text{eV}}{\lambda}$"
display(Math(eqn))

$\displaystyle E = \frac{6.6261 \times 10^{-34} \, \text{J} \cdot \text{s} \cdot 2.9979 \times 10^8 \, \text{m/s}}{\lambda \times 10^{-9} \, \text{m}} \cdot \frac{1 \, \text{eV}}{1.6022 \times 10^{-19} \, \text{J}} = \frac{1239.8 \, \text{eV}}{\lambda}$

Doing that in Python to store the value in a variable:

h = 6.6261e-34  # J*s
c = 2.9979e8  # m/s
nm = 1e-9  # m
eV = 1.6022e-19  # J
eV_nm = h * c / (nm * eV)
eV_nm
1239.8193228061414

Now we can convert absorption and emission maxima to energy units of eV, then calculate their difference as the Stokes shift. The Stokes shift reflects how much the molecule relaxes from its initial excited state (the Franck–Condon state) to the lowest vibrational level in the excited state that it typically emits light from. In the following diagram, the blue arrow represents absorption from the ground state to the Franck-Condon state, and the green arrow represents emission from the relaxed excited state back to the ground state. The blue arrow is longer, representing the greater energy of absorption. The difference between the vertical length of the blue and green arrows is the Stokes shift.

Franck-Condon state and relaxation

Attribution: Franck Condon Diagram on Wikipedia by Samoza, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license

# To prevent duplicate-column errors when re-running this code, drop the columns we're about to add
for column in [
    "longest_bond_indices",
    "Longest conjugated bond length",
    "Absorption max (eV)",
    "Emission max (eV)",
    "Stokes shift (eV)",
]:
    try:
        df.drop(column)
    except ColumnNotFoundError:
        pass

Now we can calculate the energies in eV from the wavelengths in nm.

df = df.with_columns(
    [
        (eV_nm / pl.col("Absorption max (nm)")).alias("Absorption max (eV)"),
        (eV_nm / pl.col("Emission max (nm)")).alias("Emission max (eV)"),
    ]
).with_columns(
    (pl.col("Absorption max (eV)") - pl.col("Emission max (eV)")).alias(
        "Stokes shift (eV)"
    ),
)

Finding the longest conjugated chain for each molecule

Now we come to the computationally-intensive operation: Finding the the longest conjugated chain for each molecule using conjugated_chain() which calls get_longest_conjugated_bond_chain(). We define a function that will return the indices and length of the longest conjugated bond chain. The data set repeats some chromophores: there are a little less than three rows per chromophore. So we’ll cache the results for each chromophore to avoid recalculating for each of its rows. Python’s built-in module functools include a cache decorator that makes this simple.

@cache
def conjugated_chain(sml) -> Dict[str : List[int], str:int]:
    """
    Find the indices and length for the longest bond chain in a SMILES.

    :param sml: SMILES to be made into a molecule
    :returns: A dictionary of longest bond chain indices and longest bond chain length
    """
    return_dict = dict()
    mol = Chem.MolFromSmiles(sml)
    longest_bond_indices = get_longest_conjugated_bond_chain(mol)
    return_dict["longest_bond_indices"] = longest_bond_indices
    return_dict["Longest conjugated bond length"] = len(longest_bond_indices)
    return return_dict

Now we use Polars’ map_elements to calculate the longest bond chain for each molecule. Because conjugated_chain() returns a dictionary, Polars treats it as a struct, which we can then unnest to create a column for each dictionary key-value pair. We then sort the dataframe to put the longest bond chain lengths first so we can examine those molecules. Finally, we use Polars’ shrink_to_fit() to decrease the dataframe memory usage and prevent problems with plotting.

This entire operation takes about 13 seconds on my laptop with caching, and about 33 seconds without, which roughly reflects the ratio of unique chromophores to data points. Caching is thus demonstrated to be effective here.

# This may take 12-60 seconds: Finding the longest conjugated chain in each molecule in the dataframe
df = (
    df.with_columns(
        conjugated=pl.col("Chromophore").map_elements(
            lambda sml: conjugated_chain(sml), return_dtype=pl.Struct
        )
    )
    .unnest("conjugated")
    .sort(pl.col("Longest conjugated bond length"), descending=True)
    .shrink_to_fit()
)
df.head()
shape: (5, 19)
TagChromophoreSolventAbsorption max (nm)Emission max (nm)Lifetime (ns)Quantum yieldlog(e/mol-1 dm3 cm-1)abs FWHM (cm-1)emi FWHM (cm-1)abs FWHM (nm)emi FWHM (nm)Molecular weight (g mol-1)ReferenceAbsorption max (eV)Emission max (eV)Stokes shift (eV)longest_bond_indicesLongest conjugated bond length
i64strstrf64f64f64f64f64f64f64f64f64f64strf64f64f64list[i64]i64
16578"N#CC(C#N)=C1C=C(C=Cc2ccc(-c3cc…"N#CC(C#N)=C1C=C(C=Cc2ccc(-c3cc…null706.0null0.12nullnullnullnullnull2731.043503"10.1039/c6tc03359h"null1.756118null[0, 1, … 242]243
16648"C(=Cc1ccc(C=Cc2ccc(N(c3ccc(-c4…"C(=Cc1ccc(C=Cc2ccc(N(c3ccc(-c4…null660.0nullnullnullnullnullnullnull1994.81382"10.1039/c6tc03359h"null1.878514null[0, 1, … 177]178
16555"CC(C)c1cccc(C(C)C)c1N1C(=O)c2c…"CC(C)c1cccc(C(C)C)c1N1C(=O)c2c…null790.0nullnullnullnullnullnullnull2094.857519"10.1039/c6tc03359h"null1.569392null[3, 4, … 185]174
16554"CC(C)c1cccc(C(C)C)c1N1C(=O)c2c…"CC(C)c1cccc(C(C)C)c1N1C(=O)c2c…null860.0nullnullnullnullnullnullnull2030.87786"10.1039/c6tc03359h"null1.44165null[3, 4, … 181]170
15298"c1ccc(C(=C(c2ccccc2)c2ccc(-c3c…"Cc1ccccc1"530.0637.0null0.252nullnullnullnullnull1760.698122"10.1016/j.dyepig.2012.08.028"2.3392821.9463410.392941[0, 1, … 157]158

Checking the results, we find that the longest conjugated bond chain length in the optical dataset is 158 bonds! That’s more than seven times longer than beta-carotene’s. So it certainly makes sense that some of these molecules absorb visible light. For example, the molecule with the longest conjugated bond chain has its absorption maximum at 530 nm in the solvent of toluene.

Checking for a correlation of absorption wavelength against longest conjugated bond chain length

Let’s use Polars’ plot capability to plot absorption max against longest bond length to check for any trends. Altair can plot a maximum of 5,000 data points, and there are more than 20,000 points in the optical dataset. We could enable VegaFusion to allow for more points, but that would be an additional dependency and it seems not to work well with Google Colab. Instead, let’s filter down to the ~7k unique chromophores in the dataset and plot the first 5k. Polars will select one solvent essentially at random for each chromophore, and the solvatochromic shift is generally not huge compared to the range we’ll be plotting (from <200 to 850 nm), so that’s a reasonable sampling. And 5k points should be plenty to discern if there’s a trend.

df_unique_chromophore = df.unique(subset="Chromophore")

We also need to drop the column longest_bond_indices because large lists of integers are not supported (we’re not plotting them anyway)

Now we can make a plot of absorption maximum against longest conjugated bond length for the first 5k unique chromophores.

df_unique_chromophore.slice(0, 5000).drop("longest_bond_indices").plot.scatter(
    x="Longest conjugated bond length", y="Absorption max (nm)"
)