This article is the third part of a three-part series on chemical data recovery written by Kevin Theisen, President of iChemLabs:

  1. Embedded Chemical Data Recovery
  2. Chemical Image Recovery
  3. Legacy Chemical Data Recovery
Image

A chem-archaeologist has discovered an ancient library filled with long-lost molecular secrets. She transcribes some of the information using a chemist's triangle.

Introduction

Cheminformatics solutions can be incredibly challenging to implement. What this really means is cheminformatics problems are incredibly rewarding to solve. While cheminformatics work is difficult, such solutions are very important to our scientists and our society. As we discover more about the universe we exist in, the already impressive work created by those in the cheminformatics field only grows in usefulness.

Of core importance in cheminformatics is the actual storage and communication of chemical information. The protocols we implement to describe chemistry and molecules make it possible to quickly load and share data for use in the algorithms we create. Imagine if the PDB format had never existed and we needed to hard-code the data for every atom in a protein structure. Today, we have access to a multitude of chemistry file formats for handling chemistry information. The most popular include the MDL connection tables, CDX files, SMILES, ChemDoodle JSON and CML.

You have probably heard of all of these formats, but have you ever heard of Wiswesser Line Notation (WLN)? At one time, WLN was the most popular protocol for describing molecular information. Books were published about it, conferences focused on it, high schoolers were taught to transcribe it (such skills would get you a decent paycheck), and the largest databases of the time primarily stored WLN codes. But no more. How did the most popular chemistry protocol disappear from our collective knowledge?

Almost a decade ago, I was fortunate to meet and befriend William L. Todsen, a WLN enthusiast. He taught me a lot about WLN and I decided it would be interesting to implement a WLN reader, allowing us to recover legacy chemical data and bring back to life an important chemical protocol. Along the way, we learned much about the history of cheminformatics, the ways in which WLN excels and the reasons why it is no longer used. Even if it is an abandoned technology, we want to make sure WLN is not forgotten and guarantee those in the future always have the tools to observe and learn from the ingenuity of the cheminformaticists that came before us.

A (brief) History of Wiswesser Line Notation

WLN was invented by William J. Wiswesser in 1949. His goal was to devise a unique identifier for a molecule to be used by both humans and machines, in a simpler manner than an IUPAC name, adopting common chemical notations chemists were already familiar with. Keep in mind WLN originated before computers were widely accessible to chemists. Wiswesser describes in a 1952 Chemical and Engineering News (C&EN) article, "More than a decade ago, the author recognized the need for a truly universal and fully systematic chemical structure notation... Having learned the penetrating value of molecular models and logical symbolism, the author has favored a pictorially obvious symbolism." The system Wiswesser devised was very understandable and professionally presented. WLN quickly gained in popularity and application over competing solutions.

Three official editions of the WLN specification were eventually published. The first edition was written by Wiswesser and published in 1954. In the foreword to the first edition, Elbert G. Smith states "[WLN] is a new chemical notation by which even complicated chemical structures may be expressed concisely and without ambiguity in a single line of letters, numbers and punctuation marks. It has been designed to provide a straightforward way of indexing chemical compounds and so to bypass the present growing confusions and frustrations in chemical nomenclature. Chemists generally seem aware that future progress in communication and utilization of chemical knowledge will require some sort of new chemical notation." A second edition of the WLN specification was published in 1968 under Smith's leadership and the newly created Chemical Notation Association (CNA). In the foreword to the second edition, I. Moyer Hunsberger states "Since a computer can transform a Wiswesser notation to an atom connection table (which completely represents the structure of a chemical compound on magnetic tape), the notation may find favor as an economical input device for computer-based information retrieval systems." In 1976, Chemical Information Management, Inc. (CIMI) published the third and final official edition of the WLN specification, edited by Smith and Peter A. Baker, under the governance of the CNA. The following are images of the three editions of "The Wiswesser Line-Formula Chemical Notation".

Image
First Edition
Image
Second Edition
Image
Third Edition

WLN was at the height of its popularity in the 1960s and 70s. Cheminformatics was becoming an essential discipline and protocols like WLN were necessary to organize the growing chemical data available. One only needs to look at the vast amount of publication material related to WLN during this period to understand its impact. WLN was mainly used as an indexing solution. Innovators were also discovering ways to use WLN for molecular structure matching. Pathfinder, a program for elucidating canonical WLN paths in ring graphs, was developed by Carlos M. Bowman's group at Dow Chemical, and its development was continued by Tommy Ebe and Antonio Zamora at the Chemical Abstracts Service. In 1968, Charles E. Granito (who later founded CIMI) documented WLN for registration systems at the Diamond Alkali Company. Later, at the Institute for Scientific Information (ISI), he created the Chemical Substructure Index with Murray D. Rosenberg, allowing for novel substructure searching based on WLN strings. The ISI was heavily invested in the WLN protocol and provided several programs and searching solutions, including the Index Chemicus Registry System for recording chemical data in the Index Chemicus as WLN codes and the popular CROSSBOW program for handling WLN strings as computer data structures.

The International Union of Pure and Applied Chemistry (IUPAC) even considered WLN as their standard line notation, before choosing the competing Dyson notation instead. The decision was very controversial and resulted in a lot of protest. Bonnie Lawlor from the Chemical Structure Association (CSA) Trust (the CSA Trust evolved from the CNA) summarizes the importance of WLN and IUPAC's choice in a 2016 article. Neither notation is maintained by IUPAC today.

In a 1982 publication, Wiswesser postulated what might become of WLN in the future, "Soon online computer and word-processing terminals will be as commonplace as IBM's Selectric typewriters are today; by 1999 high schools and alert grade schools will have color-coded chemistry in educational entertainment that goes far beyond today's pinball games of skill: computer-weaned grade-school science students might well be able to tap an online 'Chemical Picture Book for Children' with advanced WLN descriptions." In retrospect, we now know that did not happen. The world began to change, as it always does. As Wiswesser predicted, computers became more commonplace and many new programmers were entering the industry. New solutions were necessary to solve more difficult problems, but WLN did not keep up.

In his 1952 C&EN explanation, Wiswesser was clairvoyant in reasoning a concise and easy to use chemical protocol would be necessary, "Simplicity of usage is the prime requirement of a good structure notation... Conciseness is intimately related to the [ideal usefulness], and particularly to the one specifying ease of manipulation by machine methods. It should be obvious that conciseness is desirable for the efficient use of any tabulating machinery. Even with new machines, card punching and verifying will remain the most expensive of the numerous ingenious tabulating operations, and it is almost directly proportional to the number of symbols required." While WLN did achieve the goal of being compact and the basics were relatively easy to learn, the protocol was hardly simple to program (as we will find out below), requiring individuals to manually generate and decipher WLN strings. As Wendy A. Warr states bluntly in a 1982 review of WLN applications, "The principle disadvantage of WLN is that it is not user friendly...no one has yet produced a cost-effective program to [derive a canonical WLN] for over 90% of compounds...one has to balance the cost of hardware plus software against the costs of extra WLN-skilled personnel". As computers gained adoption in scientific laboratories and academic settings, simpler solutions overcame the conciseness of the WLN protocol. Today, the most popular chemical protocols are the easiest to program, regardless of their verbosity.

It may seem logical that the introduction of the much more readily implementable Daylight SMILES line notation in 1988 led to the replacement of WLN, but in reality, WLN had already fallen out of favor as more practical methods for storing chemical information were developed for the many computer systems introduced. The MDL connection tables, circa 1979, are ASCII formats many programmers could easily use, and Mike Elder produced the DARING software to aid in the conversion from WLN to MDL connection tables. New developments in computer algorithms also contributed to the demise of WLN. J. R. Ullmann published an algorithm for subgraph isomorphism in 1976, enabling cheminformatics applications to directly and efficiently match parts of chemical structures based on the constituent atoms and bonds. Granito's Chemical Substructure Index was no longer the optimal solution and WLN was losing popularity by the time CIMI published the 3rd edition; Granito would soon change business directions.

Wiswesser would pass away in 1989, leaving one of the most impactful and impressive chemistry protocols ever created as his legacy. To date, WLN is still the most concise, lossless, string representation of chemical information. The WLN protocol is a passion project of a talented group of cheminformatics experts, and a work I hold in very high regard.

A Breakdown of Wiswesser Line Notation

WLN is a substructure-based, canonical line notation for molecular structure(s). The characters in a WLN string define the atoms and bonds in the molecular structure(s). The entirety of the periodic table of elements is supported, and single, double, triple and dative bond types are available. Any type of complex ring system is compatible, including polycyclic fused, perifused, spiro, bridged and pseudo-bridged structures. There is no explicit aromaticity model, but unsaturations in rings are fully defined. Charges, radicals and isotopes are included. Stereochemistry is supported, but not using CIP. Beyond basic molecule structures, there are special rules for handling chelate compounds, metallocenes and catenanes, polymers, inorganic formulas, uncertainties and MANTRA (Mixture, Alternative possibility, Not assigned, Tautomer, Reactant, Addition) suffixes. One should be aware the WLN specification evolved through the editions, and the changes are not all backwards compatible. For instance, WLN version 2 removes lower case locant symbols and version 3 removes methyl and ring contractions.

WLN is a line notation, which means the chemical information a WLN string contains is defined in a single line of text. The characters in the WLN string are limited to those found on a standard typewriter. Contemporary line notation protocols include SMILES (Daylight and OpenSMILES), IUPAC InChI and technically IUPAC naming.

WLN is substructure-based, differentiating it from other line notations. SMILES is an all-atom representation and InChI is based on information layers. IUPAC names are also substructure-based, but are meant for written and spoken language between chemists. The substructure types in WLN are predetermined. For instance, a W symbol defines a dioxo group and an R symbol represents a phenyl group (when not preceded by a space, after which they would be locants instead). Chains are defined by numbers and rings are defined by symbol sequences beginning with an L (carbocyclic), T (heterocyclic) or D (chelate). WLN is therefore an intrinsically compact representation of a molecular connection table. Take a look at the following table for a comparison of WLN strings to SMILES, InChI and IUPAC names, and notice how much shorter the WLN strings are.

Line Notation   Value

IUPAC Name      methane
WLN             1H
SMILES          C
InChI           InChI=1S/CH4/h1H4

IUPAC Name      chlorobenzene
WLN             GR
SMILES          c1ccccc1Cl
InChI           InChI=1S/C6H5Cl/c7-6-4-2-1-3-5-6/h1-5H

IUPAC Name      2-Amino-3-ethylvaleric acid
WLN             QVYZY2&2
SMILES          OC(=O)C(C(CC)CC)N
InChI           InChI=1S/C7H15NO2/c1-3-5(4-2)6(8)7(9)10/h5-6H,3-4,8H2,1-2H3,(H,9,10)

IUPAC Name      2-Chloro-2',3,5'-trinitrobiphenyl
WLN             WNR DNW BR BG CNW
SMILES          c1ccc(N(=O)=O)c(Cl)c1c1c(N(=O)=O)ccc(N(=O)=O)c1
InChI           InChI=1S/C12H6ClN3O6/c13-12-8(2-1-3-11(12)16(21)22)9-6-7(14(17)18)4-5-10(9)15(19)20/h1-6H

IUPAC Name      5'-[1-(Fluoromethyl)-1H-indol-3-yl]-4',7'-diazaspiro[2,4-cyclopentadiene-1,2'-indene]
WLN             T56 CX FN INJ C-& AL5XJ& G- DT56 BNJ B1F
SMILES          C1(=C2)C(=CC2(C=C2)C=C2)N=C(C=N1)C1=CN(C(=CC=C2)C1=C2)CF
InChI           InChI=1S/C20H14FN3/c21-13-24-12-15(14-5-1-2-6-19(14)24)18-11-22-16-9-20(7-3-4-8-20)10-17(16)23-18/h1-12H,13H2
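
The size difference in the table can be checked with a few lines of Python. This is only a sketch for illustration: the strings are copied from the table above, while the names ROWS and notation_lengths are my own and not part of any WLN tool.

```python
# Strings copied from the comparison table above.
# Each row: (IUPAC name, WLN, SMILES, InChI).
ROWS = [
    ("methane", "1H", "C", "InChI=1S/CH4/h1H4"),
    ("chlorobenzene", "GR", "c1ccccc1Cl",
     "InChI=1S/C6H5Cl/c7-6-4-2-1-3-5-6/h1-5H"),
    ("2-Amino-3-ethylvaleric acid", "QVYZY2&2", "OC(=O)C(C(CC)CC)N",
     "InChI=1S/C7H15NO2/c1-3-5(4-2)6(8)7(9)10/h5-6H,3-4,8H2,1-2H3,(H,9,10)"),
    ("2-Chloro-2',3,5'-trinitrobiphenyl", "WNR DNW BR BG CNW",
     "c1ccc(N(=O)=O)c(Cl)c1c1c(N(=O)=O)ccc(N(=O)=O)c1",
     "InChI=1S/C12H6ClN3O6/c13-12-8(2-1-3-11(12)16(21)22)9-6-7(14(17)18)"
     "4-5-10(9)15(19)20/h1-6H"),
    ("5'-[1-(Fluoromethyl)-1H-indol-3-yl]-4',7'-diazaspiro"
     "[2,4-cyclopentadiene-1,2'-indene]",
     "T56 CX FN INJ C-& AL5XJ& G- DT56 BNJ B1F",
     "C1(=C2)C(=CC2(C=C2)C=C2)N=C(C=N1)C1=CN(C(=CC=C2)C1=C2)CF",
     "InChI=1S/C20H14FN3/c21-13-24-12-15(14-5-1-2-6-19(14)24)18-11-22-16-9-"
     "20(7-3-4-8-20)10-17(16)23-18/h1-12H,13H2"),
]


def notation_lengths(rows):
    """Map each IUPAC name to the character count of every notation."""
    return {
        name: {
            "IUPAC Name": len(name),
            "WLN": len(wln),
            "SMILES": len(smiles),
            "InChI": len(inchi),
        }
        for name, wln, smiles, inchi in rows
    }


lengths = notation_lengths(ROWS)
for name, counts in lengths.items():
    print(counts, name)
```

One detail the counts make visible: for methane the one-character SMILES is shorter than the two-character WLN. The savings appear on the four larger structures, where the WLN string is shorter than both the SMILES and the InChI.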

To finish the description, WLN is a canonical protocol, and therefore one and only one WLN string is theoretically acceptable for any chemical entity. Canonicalization is typically used for indexing and exact matching in databases. SMILES is not a canonical protocol, but Daylight Chemical Information Systems did publish a vague and incomplete CANGEN algorithm for canonicalizing SMILES strings. Canonical SMILES algorithms today are unique to the developer and are not cross-compatible. InChI is canonical by definition, and its implementation is so complex that only one official codebase exists. Traditional IUPAC names are not canonical, as several correct names are possible given the various IUPAC rules, but the latest 2013 IUPAC naming specification defines Preferred IUPAC Names (PINs), which are meant to be canonical.
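
To make the indexing point concrete, here is a minimal sketch of exact matching by string. It assumes a hypothetical in-memory index (a plain dictionary) and a hypothetical lookup helper; the WLN strings and the first SMILES come from the table above, and the second SMILES is simply another valid way of writing chlorobenzene.

```python
# Hypothetical index keyed by canonical WLN strings (values from the table).
wln_index = {
    "1H": "methane",
    "GR": "chlorobenzene",
    "QVYZY2&2": "2-Amino-3-ethylvaleric acid",
}

# Hypothetical index keyed by whatever SMILES happened to be registered.
smiles_index = {
    "C": "methane",
    "c1ccccc1Cl": "chlorobenzene",
}


def lookup(index, key):
    """Exact string match; returns None when the key is not registered."""
    return index.get(key)


# A canonical notation gives one key per structure, so the lookup succeeds.
print(lookup(wln_index, "GR"))

# The same molecule written as a different, equally valid SMILES is missed,
# because plain string comparison knows nothing about the structure.
print(lookup(smiles_index, "Clc1ccccc1"))
```

The second lookup returns nothing even though the molecule is registered, which is why uncanonicalized SMILES cannot serve as database keys on their own.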

Implementation

Today, Todsen has been unofficially maintaining the WLN specification, incrementing the minor version number. Todsen is currently polishing version 3.2 of the WLN specification and he states, "This project has been a labor of love, consuming a lot of my off-duty time for the better part of a decade. By and large (and by design!), my refinements resulted in almost no differences in the WLN codes. Indeed even going from the 1968 version of the rules to the 1975 version, the bulk of the codes stayed the same or could be easily updated." Todsen has provided a full description of his experience with WLN, which is well worth a read.

I was intrigued after learning more about WLN, and I am always interested in implementing unique chemistry protocols. There is no commercial application for WLN, so I pursued this project as an intellectual curiosity and wrote a WLN reader. Here is a picture of the reader in action. Just type in your WLN string (v3.2) and press the Read WLN button to see the related chemical drawing.

Image
A ChemDoodle Web Components demo allows you to recover the chemical structures defined in a WLN string.

I have to admit, implementing WLN is much more difficult than I anticipated. There is a significant level of detail concerning each aspect of chemical structures. One difficult concept is the connectivity of each symbol type. Care must be taken when completing valences by WLN rules to correctly place implied hydrogens and charges. The chain terminator symbol, &, is more complicated than expected; it is not trivial to understand where the previous connection may be, and mixed with terminal structure symbols, it caused quite some confusion. But while annoying, everything, from symbols with multiple meanings to ring delocalization to dative bonds, is consistent once implemented properly. The worst procedures to implement are the complex ring rules. I had to stop at perifused systems, as the suggested algorithms required an unintuitive guess-and-check method for the correct answer. Altogether there are 9 sets of rules for handling different types of ring systems. By comparison, any ring system may be defined in a SMILES string using a simple molecular spanning tree and ring closure indexes. I now understand why there is no fully compliant program for handling the WLN protocol.
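
For contrast, the SMILES approach to rings fits in a short sketch. The function below is my own illustration, not a compliant SMILES parser and not the code in our reader: it assumes a small subset (unbracketed atoms, the -, = and # bond symbols, parentheses for branches and single-digit ring closures), and it records aromatic bonds as order 1 for simplicity. The name parse_smiles_subset is hypothetical.

```python
def parse_smiles_subset(smiles):
    """Return (atoms, bonds) for a small SMILES subset.

    atoms is a list of symbols; bonds is a list of (i, j, order) tuples.
    Bonds written between consecutive atoms form the spanning tree, and
    each pair of matching digits adds one ring-closure bond.
    """
    atoms, bonds = [], []
    prev = None        # index of the atom the next atom attaches to
    stack = []         # branch points, pushed on "(" and popped on ")"
    open_rings = {}    # ring digit -> (atom index, bond order or None)
    order = None       # pending explicit bond order, if any
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        step = 1
        if smiles[i:i + 2] in ("Cl", "Br"):
            symbol, step = smiles[i:i + 2], 2
        elif ch in "BCNOPSFIcnos":
            symbol = ch
        else:
            symbol = None
        if symbol is not None:
            atoms.append(symbol)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order or 1))   # spanning-tree bond
            prev, order = idx, None
        elif ch in "-=#":
            order = {"-": 1, "=": 2, "#": 3}[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isdigit():
            if ch in open_rings:                        # second digit: close
                start, start_order = open_rings.pop(ch)
                bonds.append((start, prev, order or start_order or 1))
            else:                                       # first digit: open
                open_rings[ch] = (prev, order)
            order = None
        else:
            raise ValueError("unsupported character: " + ch)
        i += step
    if open_rings or stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


# The spiro example from the table: fused, spiro and linked rings at once.
atoms, bonds = parse_smiles_subset(
    "C1(=C2)C(=CC2(C=C2)C=C2)N=C(C=N1)C1=CN(C(=CC=C2)C1=C2)CF")
print(len(atoms), len(bonds), len(bonds) - len(atoms) + 1)
```

For a connected structure, the number of bonds minus the number of atoms plus one gives the number of ring closures, so every ring in the spiro example is covered by the same two-digit mechanism with no special cases for fused or spiro junctions.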

As for application, I want to briefly discuss canonicalization, which is a requirement for WLN strings. Canonicalization is an attempt to simulate graph theoretical functionality without making use of graph theoretical algorithms. One using a canonical WLN string for a structure would be confident in matching another structure converted into a WLN string if the two strings are identical. However, specifications and software implementations continuously change, invalidating previously generated canonical output. So the integrity of your dataset will eventually be dependent on using old and obsolete software. The correct way to handle molecule structure comparisons is through graph isomorphism implementations, perhaps after a fingerprint pruning, which neither existed nor were practical when WLN was conceived. Canonical string comparisons are also only applicable to exact matching, and not substructure, superstructure, query or maximum common substructure. That being said, canonical WLN strings do provide limited ability for substructure matching without graph isomorphism algorithms due to the notation symbols. For instance, you will know a structure contains a benzene ring from the inclusion of an R symbol with no space before it.
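
The R symbol screen just described can be written as a one-line pattern. This is a rough sketch under a stated assumption: it treats every R that is not preceded by a space as a phenyl group and ignores any context in which the specification gives R another meaning, so it is a screen rather than a parser. The names PHENYL_SYMBOL and may_contain_phenyl are my own.

```python
import re

# Assumption: an "R" with no space before it is a phenyl group; an "R" that
# follows a space is a locant and must not count.
PHENYL_SYMBOL = re.compile(r"(?<! )R")


def may_contain_phenyl(wln):
    """Screen a WLN string for a phenyl symbol without building a graph."""
    return PHENYL_SYMBOL.search(wln) is not None


# WLN strings from the comparison table above.
for wln in ["1H", "GR", "QVYZY2&2", "WNR DNW BR BG CNW",
            "T56 CX FN INJ C-& AL5XJ& G- DT56 BNJ B1F"]:
    print(wln, may_contain_phenyl(wln))
```

Of the table entries, only chlorobenzene and the trinitrobiphenyl pass the screen. Note that the indole in the last entry also contains a benzene ring, but it is written as part of a ring system rather than with an R symbol, which is one reason this kind of string screening gives only a limited form of substructure matching.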

WLN is a very interesting format. I feel it fits right in between SMILES and IUPAC naming in terms of theory and implementation. The following image shows the print specifications for SMILES, WLN and IUPAC naming, where you can observe a direct correlation between page count and implementation difficulty. SMILES is relatively simple to implement, although Noel O'Boyle will provide you with an endless number of reasons why SMILES is not as simple as it appears. InChI, while very complex, is an open standard and the software to handle it is funded and open source. IUPAC naming is the most massive undertaking, by far, but that is a whole other complex discussion.

Image
Comparing the specification sizes for Daylight SMILES (left), WLN (center) and IUPAC naming (right). Page count correlates well with implementation difficulty.

Regardless, I enjoyed my time with WLN. It is a unique perspective on chemistry data and I hope to return to find a better algorithm for the complex WLN ring systems. If you have the time, please learn about it, try to implement it, and teach it to your students and colleagues. If you are interested in cheminformatics, writing a WLN parser or writer is an incredibly challenging project providing you with very thorough experience into the concepts of chemical structure and graph theory. It will certainly be an impressive statement in your portfolio.

Preservation

We may now understand why such an important chemistry protocol, which was widely adopted in the mid-1900s, is no longer known today. The importance of preserving the innovation of those in cheminformatics is significant. When we decide to investigate new solutions or new problems, we may find past work provides inspiration or even the answer. Because WLN lost popularity 50 years ago, those with WLN experience are now retiring or have unfortunately passed away. It is very possible we will lose all WLN expertise in my lifetime.

To preserve WLN, Todsen is continuing to maintain the latest WLN specification. As a companion, iChemLabs has developed a WLN parser based on Todsen's work. Our goal is to continue to develop this parser until we can handle any WLN code created to spec. Other groups are also helping to preserve WLN. Both the PubChem and ChemSpider databases store WLN strings for many entries, although no defining version is provided in either. The Pistoia Alliance updated their UDM format specification in 2018 to allow WLN as an acceptable chemical protocol. In 2019, Roger Sayle developed a WLN parser based on the second edition of the specification and contributed it to the open source Open Babel project.

The most difficult problem in preserving WLN is not in the complex theory or the difficult algorithm, but in the specification itself. The three editions of the specification are protected by copyright. This does not prevent us from discussing or implementing WLN, but precludes the redistribution of the WLN specification. There is no online copy and all three editions are out of print. You may occasionally find the first or second edition in a chemistry library or on Amazon or eBay, but the third edition is very rare. Without the ability to access the specification, the usefulness of the protocol is limited.

We searched for the copyright holders, but they cannot be found or are deceased. Efforts to locate Granito have been unsuccessful. Smith and Baker, editors of the third edition of the specification, are deceased, and Chemical Information Management, Inc., which holds the copyright on the third edition, no longer is in operation.

Our own legal team investigated this matter and has concluded that the copyrights have not expired and the specification is not currently in the public domain. Accordingly, we cannot redistribute the specification at this time. It is unlikely Wiswesser wanted copyright issues to cause his work to be lost to history, so please reach out if you have any information.

Altogether, WLN does not provide much benefit over the line notations widely used today, even if it is more compact. The size of databases can be reduced, but it is unlikely such an improvement would be significant given continued advances in computer systems. The difficult work in implementing the protocol would make any practical usage a chore. Rather, WLN is a part of the history of cheminformatics; a part of our culture. It needs to be preserved.

Final Word

I want to thank Todsen for working with me on WLN theory, helping me write this article, and for his continued friendship. Todsen continues to pursue his interest in WLN, and if you would like to know more, he may be reached by email.

If you enjoyed the artwork leading these articles, it was commissioned from Etienne Delalande, an artist in France whom I have known for several years now. He is incredibly talented and I highly recommend him, but his art speaks for itself. You may reach out to him here if you are interested in working with him.

I was very excited to put these articles together as they place a coda on three massive projects my team at iChemLabs and I have been working on for a decade. But now that everything is discussed, I have to admit I feel sad it is all over. The work, however, will continue, as we will be looking to make our chemical data recovery tools even better. And, of course, we have some more projects in the pipeline, which I look forward to telling you about in the future.

I hope this series of articles has provided insight into what iChemLabs does and the problems we enjoy solving. I, myself, have been programming chemistry solutions since 2006, back when I was in college, and I don't ever plan to stop. If you have read this far, I want to say thank you. Please do reach out. I want to know about you and your interests in the cheminformatics industry. Feel free to connect to or contact me on LinkedIn, or address an email to me using our contact form. Until next time!