Converting Chemical Names to Structures with Name>Struct

Name>Struct is CambridgeSoft's comprehensive algorithm for converting English chemical names into chemical structure diagrams. It is designed to be as practical as possible, interpreting chemical names as they are actually used by chemists. In addition to recognizing most of the official rules and recommendations of the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Biochemistry and Molecular Biology (IUBMB), and the Chemical Abstracts Service (CAS), Name>Struct also recognizes the shorthand, slang, and neologisms of everyday usage. It is extremely tolerant of deviations from the "official" rules in regard to spaces, parentheses, and punctuation. Both regular names ("chlorobenzene") and CAS-style inverted names ("benzene, chloro-") are supported. It also includes an extensive algorithm for identifying common typos (typing errors, such as "mehtyl"), which increases the odds of generating structures for the names it is given.
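To make the relationship between regular and CAS-style inverted names concrete, here is a minimal Python sketch that rewrites a simple inverted name in natural order. The function name uninvert_cas_name is hypothetical, and the rule it applies (move a trailing hyphen-terminated prefix in front of the parent) is an illustrative assumption covering only the simplest cases; it is not a description of how Name>Struct itself works.

```python
def uninvert_cas_name(name: str) -> str:
    """Rewrite a simple CAS-style inverted name in natural order.

    Hypothetical helper for illustration only; names that do not match
    the simple "parent, prefix-" pattern are returned unchanged.
    """
    parent, sep, rest = name.partition(", ")
    # Treat the name as inverted only when the part after the comma ends
    # in a hyphen, which marks it as a substituent prefix.
    if not sep or not rest.endswith("-"):
        return name
    # Drop the trailing hyphen and place the prefix before the parent.
    return rest[:-1] + parent


print(uninvert_cas_name("benzene, chloro-"))  # chlorobenzene
print(uninvert_cas_name("chlorobenzene"))     # chlorobenzene
```

Inverted names that carry additional modifiers after the prefix are deliberately left untouched by this sketch.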

Name>Struct will try its best to generate a reasonable structure. However, when the input is unspecific ("methyl phenol") or ambiguous ("2-chloroethylbenzene"), it will display only the single structure that it deems most likely. In such cases, adding locants ("3-methyl phenol") or parentheses ("2-chloro(ethylbenzene)") will help ensure that the generated structure matches the one you had in mind. When a name can be identified as ambiguous, the possible ambiguity may also be noted.
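The ambiguity in a name like "2-chloroethylbenzene" comes from the two ways its pieces can be grouped. The short Python sketch below simply spells out both groupings with explicit parentheses; the function name groupings and its three-part split of the name are assumptions made for illustration, not part of Name>Struct.

```python
def groupings(locant_prefix, substituent, parent):
    """Return both explicit parenthesizations of prefix + substituent + parent."""
    return [
        # Reading 1: the prefix modifies the substituent, and the combined
        # group is then attached to the parent.
        f"({locant_prefix}{substituent}){parent}",
        # Reading 2: the prefix modifies the already-substituted parent.
        f"{locant_prefix}({substituent}{parent})",
    ]


for reading in groupings("2-chloro", "ethyl", "benzene"):
    print(reading)
```

Writing the name in either of the printed forms removes the ambiguity, because the parentheses state which fragment the locant "2" belongs to.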

A description of an older version of Name>Struct was published as Brecher, J. "Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical Nomenclature." J. Chem. Inf. Comput. Sci. 1999, 39 (6), 943-950.

This introduction is divided into several sections of increasing technical detail. Because nomenclature can get very technical very quickly, we have tried to separate the discussion into two levels. Each topic is described first in general terms with limited details; that description should be accessible to most chemists. The general description is then followed by a link to more-detailed information.

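Sections:

Availability
General capabilities of Name>Struct
General limitations of Name>Struct
General issues related to nomenclature and Name>Struct
Nomenclature classes handled by Name>Struct
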
Availability

Name>Struct is available in two forms: an interactive version, which is a component of ChemDraw Ultra, and a batch version.

Both versions have the same capabilities, with one exception: the interactive version converts names one at a time, while the batch version has no such limit.

General capabilities of Name>Struct

Name>Struct is designed to be as complete, accurate, and fast as possible, so that it can be used with confidence to interpret one name or a million, whether those names follow any official published nomenclature recommendations or not. Name>Struct recognizes over 90% of organic nomenclature recommendations. While the figure is somewhat lower for inorganic nomenclature, all general procedures and all recommendations that occur with any frequency in real-life usage are recognized. Testing over many different data sources has shown that with a typical database that is a combination of well-formed names, trade names, trivial names, and incorrect or misspelled names, Name>Struct will generate structures for about 70-90% of the names actually used. When running in batch mode, it can easily process over 30,000 names/minute, with an accuracy of greater than 99%.
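As a rough worked example of what these figures imply for a large job, the Python snippet below estimates the batch run time and the expected number of generated structures for one million names. It uses only the numbers quoted above; because the stated rate is a lower bound ("over 30,000 names/minute"), the computed time is an upper bound, and the variable names are purely illustrative.

```python
NAMES = 1_000_000            # size of a hypothetical input file
NAMES_PER_MINUTE = 30_000    # quoted batch rate (a lower bound)
CONVERSION_PERCENT = (70, 90)  # quoted range of names that yield structures

# Upper bound on run time, since the real rate is "over" 30,000 per minute.
minutes = NAMES / NAMES_PER_MINUTE

# Expected number of structures generated at each end of the quoted range.
low, high = (NAMES * pct // 100 for pct in CONVERSION_PERCENT)

print(f"run time: at most about {minutes:.1f} minutes")
print(f"structures generated: roughly {low:,} to {high:,}")
```

In other words, a million-name file should finish in about half an hour or less, with somewhere between 700,000 and 900,000 names converted for a typical mixed-quality database.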

More information about the general capabilities of Name>Struct

General limitations of Name>Struct

For the most part, the only limitations of Name>Struct are ones that are mandated by common sense. 

As a component of ChemDraw Ultra, the interactive version of Name>Struct runs under the same configurations as ChemDraw Ultra, which means that it is available for most modern Windows and Macintosh computers. The batch version is available only for Windows and Linux. The Name>Struct algorithm itself is implemented in standards-compliant C++, and should be readily convertible to other systems; we simply haven't bothered to do so. Please contact your CambridgeSoft sales representative if you think you need Name>Struct provided for some other operating system or in some other configuration.

There are very few limitations to Name>Struct in a chemical sense -- there are no limitations to the length of name interpreted or the number of atoms in the resulting structure. All elements from hydrogen to lawrencium are supported, even though some of them (such as helium) will rarely appear in chemical names simply because they form so few compounds.

Name>Struct does have some limitations in the types of structures it can generate. It is extremely difficult to generate good-looking structural diagrams for several classes of substances, including biological macromolecules (proteins, etc.), highly bridged ring systems (buckyballs), and polymers. Rather than producing incomprehensible diagrams for these cases, Name>Struct refuses to generate a structure.

More significant are the limitations inherent to chemical nomenclature itself. Many of the names in common use to describe various substances have no systematic component at all. These include many pharmaceuticals ("Viagra") and pesticides, dyes ("Brilliant Green"), and others. Although Name>Struct can interpret many of these so-called trivial names, that is not its primary focus. CambridgeSoft offers several other products that are more appropriate for the interpretation of collections of asystematic names. 

More information about the general limitations of Name>Struct

General issues related to nomenclature and Name>Struct

In general, Name>Struct is designed to be as smart as a real chemist -- if a human chemist can understand what structure is intended by a given name, then Name>Struct should manage to do so as well. Chemical names come in many styles. Some names truly do conform to published nomenclature recommendations, most commonly from IUPAC, IUBMB, or CAS. Clearly, Name>Struct needs to recognize these names, but that's only the start of the problem.

First, each of those organizations has changed its recommendations over time. There is no way to know which version of the recommendations was used to generate any given name, and so Name>Struct must recognize names produced by all versions.

Second, many chemical names use trivial forms that have long been forbidden by all of those nomenclature bodies. Nonetheless, these trivial names are used frequently enough that most chemists will recognize their meaning, and so Name>Struct should as well.

Finally, even though those organizations have published nomenclature recommendations, the recommendations are extremely complex and difficult to understand. Even the best-intentioned chemist will often produce names that -- technically or egregiously -- violate the published norms. As long as the meaning of the name remains clear, Name>Struct should be able to handle it.

To achieve this goal, Name>Struct attempts to be as flexible as possible. Capitalization, font type, and font style are completely ignored. Most punctuation is ignored as well, regardless of whether it is used correctly as per the published recommendations or not. Spelling, similarly, is important only for clarity: Name>Struct will interpret many common misspellings correctly, but a correctly spelled name is much more likely to yield the intended structure. More recently, extensive typo recognition has been added, increasing the likelihood that names will be interpreted correctly even if they are not technically correct.
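The general idea of ignoring capitalization and tolerating typos can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment list KNOWN_FRAGMENTS, the function names, and the 0.8 similarity cutoff are all hypothetical, and the standard-library difflib matcher is a stand-in, not the typo recognition that Name>Struct actually uses.

```python
import difflib
from typing import Optional

# Tiny illustrative vocabulary; a real system would know far more fragments.
KNOWN_FRAGMENTS = ["methyl", "ethyl", "propyl", "phenyl", "chloro", "benzene"]


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace to single spaces."""
    return " ".join(name.lower().split())


def closest_fragment(token: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the known fragment most similar to token, or None if none is close."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_FRAGMENTS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(normalize("  3-Methyl   PHENOL "))  # 3-methyl phenol
print(closest_fragment("mehtyl"))         # methyl
print(closest_fragment("xyzzy"))          # None
```

A similarity cutoff is one simple way to trade recall against false corrections: a lower cutoff repairs more typos but also risks mapping an unfamiliar but valid fragment onto the wrong one.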

More information about these general issues related to nomenclature and Name>Struct

Nomenclature classes handled by Name>Struct

The shortest answer to the question, "What types of nomenclature can be recognized by Name>Struct?" is "Just about everything!" However, we recognize that a longer answer might be slightly more useful. Accordingly, here is a more extensive discussion of the types of nomenclature supported.

Name>Struct can recognize all types of parent structures including chains and rings, and, of course, various combinations of the two. A "parent structure" is the core unit that most chemists would recognize as the basic framework of a chemical structure, something like "ethane" or "benzene". Natural products are special kinds of parent structures that are commonly found in biological organisms. Stereochemistry is crucially important for natural products, but may be relevant in any compound containing an asymmetric double bond or tetrahedral center.

With occasional exceptions, most parent structures can be used as ligands -- that is, as fragments attached to some other parent structure ("methane" is a parent structure; "methyl" is a ligand). Most parent structures can also be converted to a variety of functional class derivatives. The most common functional class derivatives feature nitrogen and oxygen, and include amines and alcohols as well as many different types of acids, both organic (acetic acid) and inorganic (perchloric acid).

In addition to forming neutral derivatives, parent structures may become charged to form ions; those ions may, in turn, combine to form salts.

...and lots and lots of other nomenclature is also supported!

Many more examples of nomenclature supported, with the resulting chemical structures

See also some of the latest enhancements in Name>Struct 12.0.1