Universal SMILES Finally, a canonical SMILES string? Noel M. O’Boyle Analytical and Biological Chemistry Research Facility, University College Cork, Ireland (Current address: NextMove Software, Cambridge, UK) Apr 2013 245th ACS National Meeting New Orleans Open Babel
2 Introduction to Canonical SMILES
3 How to create a SMILES string (1) Pick a starting atom (2) Traverse the molecular graph in a depth-first manner (3) Encode the atoms and bonds traversed as a text string • Let’s assume that step (3) is done in a standard manner • Variation in steps (1) and (2) leads to many different possible SMILES [Figure: two traversals of ethanol] • Ethanol as CCO or OCC (among others)
4 How to create a canonical SMILES string (1) Give each atom a canonical label (“canonicalize”) (2) Pick as starting atom the one with the smallest label* (3) Traverse the molecular graph in a depth-first manner, following the atom with the smallest label at each branch point* (4) Encode the atoms and bonds traversed as a text string • The same SMILES string will always be generated – The “canonical SMILES” [Figure: ethanol in two input atom orders, each with canonical labels 1 to 3] • Ethanol always* as CCO * For example.
5 Why is a canonical SMILES useful? • Check identity – Graph isomorphism is faster, but less convenient • Find/avoid duplicates • Find overlap of two databases • Check that a structure remains unchanged – E.g. after some transformation • Canonical SMILES retains the features of regular SMILES – Although slower to calculate
6 Why are there different canonical SMILES? • There is no published canonical SMILES implementation for the general case – Neither Weininger, Weininger nor Weininger [1] described how to handle stereochemistry • Canonicalization is difficult – Not a simple algorithm, many corner cases – Trade secret • End result: Each cheminformatics toolkit generates its own canonical SMILES [1] Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97.
7 Why a “Universal” canonical SMILES? • All the benefits of a globally unique identifier (like the InChI) – Can link databases – Of benefit to the average chemist, as having different SMILES for the same molecule is confusing – Can immediately see if the Wikipedia SMILES is in agreement with the PubChem SMILES • Finally possible to compare SMILES strings from different toolkits – Identify bugs – Explore underlying chemical models (e.g. aromatic models) – Explore underlying stereochemistry perception – Lead to improvements in quality and standards
8 Why base a canonical SMILES on the InChI? • Canonicalization is complicated – Devising and describing a general canonicalization procedure that others could implement exactly may not be possible • Better to build on existing work – Take advantage of the stellar work by the InChI team – The InChI has already solved the canonicalization problem for a broad section of chemistry • It’s ubiquitous – The InChI is available in almost all cheminformatics toolkits • Finally, all toolkits will be able to create the same canonical SMILES string – The “Universal SMILES” string!
9 How to use the InChI to create a Universal SMILES string
10 How to get canonical labels from the InChI • Use the Auxiliary Information, Luke
$ obabel -:"ClCC(=O)Br" -oinchi -xa
InChI=1S/C2H2BrClO/c3-2(5)1-4/h1H2
AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;
• The /N section gives the canonical labels – Canonical labels 1 through 5 correspond to input atoms 2, 3, 5, 1 and 4, respectively – E.g. canonical label 3 is applied to input atom 5, the bromine • For Universal SMILES, I used two non-standard options – /FixedH: Enable the correct application of canonical labels in cases involving molecular symmetry broken by protonation states – /RecMet: Do not disconnect metals, as otherwise the labels for ligands will not be canonical
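The mapping described on this slide can be read straight out of the AuxInfo string. The following is a minimal sketch, not the Open Babel implementation: the function name canonical_labels is made up for illustration, and it assumes a single connected component and looks only at the /N section.

```python
def canonical_labels(auxinfo: str) -> dict:
    """Return {input atom index: canonical label} from the /N: section.

    Assumption: a single connected component (no ';' inside /N:).
    """
    for section in auxinfo.split("/"):
        if section.startswith("N:"):
            # The k-th entry is the input atom that receives canonical label k.
            input_atoms = [int(x) for x in section[2:].split(",")]
            return {atom: label for label, atom in enumerate(input_atoms, start=1)}
    raise ValueError("no /N: section found")


aux = "AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;"
labels = canonical_labels(aux)
print(labels[5])  # canonical label of input atom 5, the bromine
```

For the AuxInfo shown on the slide, input atom 5 (the bromine) maps to canonical label 3, matching the worked example.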
11 Walk this way: Rules for graph traversal • Start the graph traversal at the atom with the lowest canonical label – For disconnected structures, visit each structure in order of its lowest canonical label • Visit atoms in a depth-first manner – At each branch point, multiple bonds are favoured over single or aromatic bonds, and lower canonical labels over higher. [Figure: three stages of the traversal of an acid chloride with canonical labels C 1, C 2, Cl 3, O 4] • Universal SMILES for this acid chloride: CC(=O)Cl
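The traversal rules on this slide can be sketched in a few lines. This is an illustration under strong assumptions, not the Open Babel writer: write_smiles and its data layout are hypothetical, it handles only acyclic structures (no ring closures), and it ignores aromaticity, charges and stereochemistry. All multiple bonds are treated alike, since the slide does not rank double against triple bonds.

```python
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def write_smiles(symbols, labels, bonds):
    """symbols: {atom: element}, labels: {atom: canonical label},
    bonds: {(atom_a, atom_b): bond order}. Acyclic graphs only."""
    nbrs = {a: [] for a in symbols}
    for (a, b), order in bonds.items():
        nbrs[a].append((b, order))
        nbrs[b].append((a, order))

    visited = set()

    def visit(atom):
        visited.add(atom)
        # Multiple bonds first, then lower canonical label.
        todo = sorted(
            ((n, o) for n, o in nbrs[atom] if n not in visited),
            key=lambda p: (p[1] == 1, labels[p[0]]),
        )
        out = symbols[atom]
        for i, (n, o) in enumerate(todo):
            piece = BOND_SYMBOL[o] + visit(n)
            # Every branch except the last one goes in parentheses.
            out += piece if i == len(todo) - 1 else "(" + piece + ")"
        return out

    # Disconnected parts are written in order of their lowest label.
    parts = [visit(a) for a in sorted(symbols, key=labels.get) if a not in visited]
    return ".".join(parts)


# The acid chloride from the slide, using the canonical labels as atom ids.
symbols = {1: "C", 2: "C", 3: "Cl", 4: "O"}
labels = {1: 1, 2: 2, 3: 3, 4: 4}
bonds = {(1, 2): 1, (2, 4): 2, (2, 3): 1}
print(write_smiles(symbols, labels, bonds))  # CC(=O)Cl
```

At atom 2 the double bond to oxygen is visited before the chlorine even though the chlorine has the lower label, which is why the result is CC(=O)Cl rather than CC(Cl)=O.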
12 Corner case: Explicit hydrogens • Sometimes a SMILES string contains explicit hydrogens – Hydrogen isotopes, dihydrogen, hydrogen atoms, hydrogen ions • Sometimes the InChI labels hydrogens – Hydrogen atoms, bridging hydrogens • The problem: – What to do about explicit hydrogens unlabelled by the InChI? • A solution: – Consider these to have a low canonical label – That is, in the traversal visit these hydrogens prior to other singly-bonded branches C([2H])([3H])Cl rather than C(Cl)([3H])[2H]
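One way to picture the solution on this slide is as a sort key applied at each branch point. The sketch below is hypothetical throughout: the names branch_order and write_center, the tuple layout, and the canonical label given to the chlorine are made up for illustration. Ordering unlabelled hydrogens among themselves by isotope is an assumption; it is consistent with the example on the slide, which does not state the rule.

```python
def branch_order(neighbours):
    """Sort the neighbours of a branch point into visiting order.

    Each neighbour is (smiles_text, bond_order, canonical_label, isotope);
    canonical_label is None for a hydrogen the InChI left unlabelled.
    """
    def key(nbr):
        _, bond_order, label, isotope = nbr
        return (
            bond_order == 1,    # multiple bonds come first
            label is not None,  # then hydrogens without an InChI label
            isotope if label is None else label,  # assumed tie-break
        )
    return [nbr[0] for nbr in sorted(neighbours, key=key)]


def write_center(center, neighbours):
    """Write a central atom with singly-bonded terminal neighbours."""
    ordered = branch_order(neighbours)
    return center + "".join(f"({b})" for b in ordered[:-1]) + ordered[-1]


neighbours = [("Cl", 1, 2, 0), ("[3H]", 1, None, 3), ("[2H]", 1, None, 2)]
print(write_center("C", neighbours))  # C([2H])([3H])Cl
```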
13 A standard way to encode the SMILES • The graph traversal gives us a canonical atom order • However, despite this, many different SMILES strings may be written for the same molecule – The following SMILES strings for ethanol all have the same atom order: CCO, C-C-O, C1.C12.O2, C(C(O)), [CH3]CO • For Universal SMILES, one particular form must be adopted – The standard form described by the OpenSMILES specification (Ref: Craig James et al, The OpenSMILES specification, http://opensmiles.org) – E.g. Don’t write single bonds explicitly, only use parentheses if there is a branch
14 Encoding cis/trans stereochemistry symbols • Question: – How do I know that the following SMILES string was not generated by Open Babel? C\C=C\Cl • There are two possible ways to write the symbols for any double bond system • For Universal SMILES, the first stereochemistry bond symbol should be a forward slash – i.e. C/C=C/Cl not C\C=C\Cl – Minimises backslashes (which can cause problems at the command line) – Useful aid when reading SMILES: if you see a backslash, there must be a corresponding forward slash preceding it • Show cis/trans symbols on all substituents – i.e. Cl/C=C(\Br)/I not C/C=C(\Br)I
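Because swapping every slash in a double bond system gives an equivalent SMILES, the "forward slash first" rule can be illustrated with a simple string operation. This is a sketch with a hypothetical function name, not how Open Babel writes the symbols: it swaps all slashes in the string when the first one is a backslash, which preserves the stereochemistry but only guarantees the rule for the first double bond system. A real writer would apply the rule to each system separately.

```python
_SWAP = str.maketrans("/\\", "\\/")


def first_slash_forward(smiles: str) -> str:
    """Swap every '/' and '\\' if the first such symbol is a backslash."""
    for ch in smiles:
        if ch == "/":
            return smiles                  # already in the preferred form
        if ch == "\\":
            return smiles.translate(_SWAP)  # flip the whole string
    return smiles                           # no cis/trans symbols at all


print(first_slash_forward("C\\C=C\\Cl"))  # C/C=C/Cl
```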
15 Does it work?
16 Datasets for testing implementation • Universal SMILES was added to Open Babel v2.3.2
$ obabel -:"c1(cc(ccc1)[N+](=O)[O-])/C=C/F" -osmi -xU
c1cc(/C=C/F)cc(c1)[N+](=O)[O-]
• ChEMBL Release 13 – 1.14 million compounds as 2D MOL – Highly curated, and normalised • PubChem Substance subset – 1.04 million compounds as 2D or 3D MOL (those with SIDs from 0 to 2 million) – As deposited from a variety of sources – Duplicates exist as well as errors – 1.1% were discarded as InChIs could not be generated for them
17 Shuffle Test • Does the Universal SMILES procedure generate a canonical identifier? – A canonical identifier should be invariant to the input order of atoms – So…let’s shuffle the atoms and check whether the Universal SMILES changes • For each structure, I generated 10 “anti-canonical” SMILES strings using Open Babel – The “xC” SMILES output option • For each of these, the Universal SMILES was generated – If all identical, the test is passed
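The logic of the shuffle test is independent of the toolkit: canonicalize every reordered form and check that only one distinct string comes out. The sketch below uses hypothetical names, and toy_canonicalize is a deliberately trivial stand-in (valid only for unbranched chains of single-letter atoms) so the harness can run without Open Babel; in the actual test the variants came from Open Babel’s anti-canonical output and the canonicalizer was Universal SMILES.

```python
def passes_shuffle_test(variants, canonicalize):
    """True if every input variant canonicalizes to the same string."""
    return len({canonicalize(s) for s in variants}) == 1


def toy_canonicalize(smiles):
    # Stand-in only: for an unbranched chain, read it from whichever
    # end gives the alphabetically smaller string.
    return min(smiles, smiles[::-1])


variants = ["CCO", "OCC"]  # two atom orders for ethanol
print(passes_shuffle_test(variants, toy_canonicalize))  # True
print(passes_shuffle_test(variants, lambda s: s))       # False
```

The second call shows what a failure looks like: an identity "canonicalizer" leaves two distinct strings, so the test is not passed.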
18 Shuffle Test Results • ChEMBL dataset – 2,425 canonicalization failures (0.21%) – 2,248 excluding failures for Open Babel’s own canonical SMILES • These failures are mainly due to kekulization problems • Differences in the stereochemical model used (81%) – 722 failures due to disagreement on the number of tetrahedral stereocenters (typically the fault lies with Open Babel) – 1,105 failures for stereogenic double bonds • Handling of delocalized charges – Where molecular graph symmetry is broken only by charge states in a delocalized system, the InChI will regard as equivalent atoms which appear as different charge states in the SMILES string. – Two different Universal SMILES for the example: • C[n+]1ccn(C)c1 and Cn1cc[n+](C)c1
19 Shuffle Test Results • PubChem dataset – 2,410 canonicalization failures (0.23%) – 2,183 excluding failures for Open Babel’s own canonical SMILES • Differences in the stereochemical model used (72%) • 56 cases of non-canonicalization of isotopes – Bug in InChI auxiliary information (they are aware of this) • Interesting failure case, SID 425526 – InChI regards ring as aromatic, and then identifies two-fold graph symmetry – Open Babel does not treat ring as aromatic • Series of double and single bonds – Two different Universal SMILES generated
20 Duplicate Test • Use the Universal SMILES to find duplicates – True duplicates – False duplicates • A shortcoming of Universal SMILES or its implementation • A normalization of distinct structures • ChEMBL dataset – There should not be any duplicates – 63 sets of duplicates according to InChI • Errors in database which had already been corrected in development version • PubChem dataset – 143,157 sets of duplicates • Duplicates according to InChI removed from further consideration
21 Duplicate Test Results • ChEMBL dataset – 29 duplicates found – The majority appear to be true duplicates which the InChI considers as distinct due to the specific coordinates in the Mol file • The InChI regards the stereochemistry in (b) to be undefined
22 • Identical according to Universal SMILES but distinct InChIs – The InChIs differ in the double bond stereochemistry layer: /b31-27+,32-28? versus /b31-27-,32-28+
23 Duplicate Test Results • PubChem dataset – 47 duplicates found • In 44 cases the InChI regarded as undefined the tetrahedral stereochemistry at a chiral center – The three non-H atoms were almost in the same plane as the center SID 855468
24 Discussion and conclusions
25 Overview of results • Universal SMILES can generate canonical identifiers… – for 99.79% of the ChEMBL database – for 99.77% of a subset of the PubChem Substance database – The remaining failures arise from disagreements between the InChI and the stereochemical model used by Open Babel, and from the handling of delocalized charges • Performance could be improved further – Improvements in stereochemistry perception in Open Babel, or somehow use the stereochemistry perception from the InChI • Outstanding issues: – Failures due to delocalized charges – The Daylight aromaticity model is not well-described and so different Universal SMILES implementations will vary in what is treated as an aromatic system
26 Overview of results • The InChI is quite sensitive to the specific geometry used at stereocenters – Some structures in databases may need to be redrawn • These ideas could be applied to other chemical file formats – Canonical forms of other line notations – Canonicalization of atom order in Mol files
27 What I didn’t talk about… • Inchified SMILES – A way to include the InChI normalizations into the SMILES string, by roundtripping through the InChI – A SMILES string representation of the InChI string – Available as Open Babel SMILES output option “I” – For more info see the paper (J. Cheminf., 2012, 4, 22)
Universal SMILES: Finally, a canonical SMILES string? J. Cheminf., 2012, 4, 22 email@example.com firstname.lastname@example.org http://baoilleach.blogspot.com Acknowledgements • Craig James (eMolecules): For OpenSMILES and the SMILES writer in Open Babel Funding • Health Research Board: Career Development Fellowship