General issues related to nomenclature and Name>Struct

In general, Name>Struct is designed to be as smart as a real chemist -- if a human chemist can understand what structure is intended by a given name, then Name>Struct should manage to do so as well. Chemical names come in many styles. Some names truly do conform to published nomenclature recommendations, most commonly from the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Biochemistry and Molecular Biology (IUBMB), or the Chemical Abstracts Service (CAS). Clearly, Name>Struct needs to recognize these names, but that's only the start of the problem.

First, each of these organizations has changed its recommendations over time. There is no way to know which version of the recommendations was used to generate any given name, so Name>Struct recognizes names produced by all versions.

Second, many chemical names use trivial forms that have long been forbidden by all of those nomenclature bodies. Nonetheless, these trivial names are used frequently enough that most chemists will recognize their meaning, and so Name>Struct should as well.

Finally, even though these organizations have published nomenclature recommendations, the recommendations are extremely complex and difficult to understand. Even the best-intentioned chemist will often produce names that -- technically or egregiously -- violate the published norms. As long as the meaning of the name remains clear, Name>Struct should be able to handle it.

Capitalization

Capitalization is completely ignored when interpreting chemical names. "Benzene", "benzene", and "BENZENE" clearly all represent the same compound. There are a few nomenclature rules that do call for a specific capitalization. Possibly the most common of these involves the "n" prefix. When lowercase and used before an alkane, it indicates the straight-chain form ("n-butylamine"). When capitalized, it indicates that a ligand should be attached directly to a nitrogen atom ("N-chloroaniline").

If capitalization is ignored, there is a potential for ambiguity ("n-butylaniline"). In our experience, this is rarely an issue. Not only are names of this type uncommon, but the intended structure is almost always "the straight-chain form on the nitrogen" ("N-(n-butyl)aniline"). It is much more common to find names with completely non-standard capitalization than to find ones where the capitalization actually makes a difference.

Name>Struct interprets all of the following names identically:

  • n-butylaniline
  • n-Butylaniline
  • n-BUTYLANILINE
  • N-butylaniline
  • N-Butylaniline
  • N-BUTYLANILINE
 

Punctuation

For the most part, Name>Struct ignores punctuation. It can be present, absent, or incorrect without affecting the interpretation of a chemical name. There are a few exceptions:

Commas must be present in CAS-style inverted names. "Benzene, chloro-" and "chlorobenzene" are interpreted identically. "Benzene chloro" without the comma cannot be interpreted.

Parentheses, brackets, and braces should be used for ambiguous names. "(Trichloromethyl)silane" is different from "trichloro(methyl)silane". They also must be paired and nested appropriately, each open parenthesis with a matching close parenthesis and so on. Otherwise, parentheses, brackets, or braces may be used interchangeably.
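The pairing and nesting requirement can be illustrated with a short stack-based check. This is only a sketch of the rule as stated above, not Name>Struct's own code; the function name brackets_balanced is hypothetical.

```python
# Hypothetical helper (not part of Name>Struct): verifies that parentheses,
# brackets, and braces in a name are paired and properly nested.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def brackets_balanced(name):
    """Return True if every opener has a matching closer in the right order."""
    stack = []
    for ch in name:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack
```

Under this check, "(Trichloromethyl)silane" and "trichloro[methyl]silane" both pass, while "(trichloromethyl]silane" fails because the closer does not match its opener.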

Some punctuation is required to separate digits. Traditionally, this duty is performed by a comma: "1,1-" vs. "11-", but Name>Struct will recognize any reasonable separator -- "1-1-" is just fine. Other roles are traditionally performed by periods, colons, and hyphens, but Name>Struct will recognize any reasonable separator there as well.

Spaces are ignored (may be present or absent) whenever possible, but there are a few cases where the space performs a vital role. This is most commonly important for esters. The following names represent very different structures:

  • methyl ethyl malonate
  • methylethyl malonate
  • methyl ethylmalonate
  • methylethylmalonate

Baseline

Superscripts are occasionally called for by various published nomenclature recommendations. Name>Struct handles them in a way consistent with its handling of other typographical issues: If the superscript consists of a number and immediately follows some other number, they must be separated in some fashion (parentheses are most often used). Thus a "1" followed by a superscripted "3,7" becomes "1(3,7)" and should not be written as the confusing "13,7". In all other cases, the superscripted characters should follow the previous characters with no added separation: an "N" followed by a superscripted alpha becomes "Nalpha" (see below for comments on Greek characters), and should not be represented as "N-alpha" or "N(alpha)".

Fonts

Chemical names are not usually defined by the particular font used to display the name. There is one exception to this rule. Some chemical names are supposed to use Greek characters. Unfortunately, not only is this nuance lost on many people, but it is also impossible in many circumstances, such as in a text-only database. Accordingly, Name>Struct will recognize any reasonable representation of Greek characters. The following names are all interpreted identically:

  • α-methylphenethylamine
  • a-methylphenethylamine
  • alpha-methylphenethylamine
  • .alpha.-methylphenethylamine
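One way to picture this tolerance is as a normalization step that maps every representation of a Greek letter onto a single spelled-out form. The sketch below is an illustration only, not Name>Struct's implementation. The function name normalize_greek, the three-letter table, and the rule that a lone leading "a-" stands for alpha are all assumptions made for the example.

```python
import re

# Illustrative table only; a fuller version would cover the whole alphabet.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
SPELLED = "|".join(GREEK.values())


def normalize_greek(name):
    """Map several spellings of a Greek letter onto one canonical form."""
    s = name.lower()
    # Unicode Greek characters -> spelled-out names.
    for ch, word in GREEK.items():
        s = s.replace(ch, word)
    # Dot-delimited form: ".alpha." -> "alpha".
    s = re.sub(rf"\.({SPELLED})\.", r"\1", s)
    # Assumption: a lone leading "a-" is an alpha that lost its font.
    s = re.sub(r"^a-", "alpha-", s)
    return s
```

With this sketch, "α-methylphenethylamine", "a-methylphenethylamine", "alpha-methylphenethylamine", and ".alpha.-methylphenethylamine" all normalize to the same string.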
 

Italicization

All font styles, including italicization, are completely ignored when interpreting chemical names. This will never introduce any ambiguity.

Spelling

Spelling is critically important. There are many pairs of names that differ only by a single character. Compare "methylamine" and "menthylamine", which represent different compounds.

However, in many cases, a human chemist might not even notice an incorrect spelling, and would interpret the name easily. Name>Struct should do the same. Many common misspellings, including "choro", "cloro", and "flouro", are automatically recognized.

Elision refers to the elimination of one vowel that is directly followed by another, and is technically required or prohibited in many cases: the rules say to use "propanamine" rather than "propaneamine". Name>Struct does not enforce those rules; vowels may be elided or not without penalty.

Automatic typo recognition

Because it is designed to interpret real-life usage, Name>Struct also handles real-life misusage, and typos (typing errors) are incredibly pervasive. Even more to the point, errors within chemical names can be quite subtle, to the point where a trained chemist won't even notice many "obvious" errors. As a result, without error recognition Name>Struct would fail to interpret many names that seemed quite reasonable. The following names, for example, each have a typing error. Can you find them? This is easier than usual, because we guarantee that each name really does have a typo -- and a pretty obvious one in each case. Normally, you might have a thousand names and not know which of them had problems, or if any of them had problems at all.


Obviously, there are a lot of ways that a name can be mistyped. Name>Struct focuses on the four Damerau transformations: single-character additions, deletions, substitutions, and letter-pair inversions (Damerau, F. "A technique for computer detection and correction of spelling errors." Comm. of the ACM, 7(3):171-176, 1964). These errors have historically been shown to account for the vast majority of errors in written text, an observation that appears to hold true even in the specific case of chemical names.
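The four transformations are easy to enumerate for a single name fragment. The following is a minimal sketch of that idea, not a description of how Name>Struct applies it internally. The names single_edits and corrections, and the tiny vocabulary in the usage note, are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase


def single_edits(word):
    """All strings reachable from word by exactly one Damerau transformation."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    additions = {a + c + b for a, b in splits for c in ALPHABET}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    # Letter-pair inversion: swap two adjacent characters.
    inversions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (deletions | additions | substitutions | inversions) - {word}


def corrections(fragment, vocabulary):
    """Known fragments that match exactly or lie one transformation away."""
    if fragment in vocabulary:
        return [fragment]
    return sorted(single_edits(fragment) & set(vocabulary))
```

With a hypothetical vocabulary of {"chloro", "fluoro", "bromo", "iodo"}, corrections("choro", ...) and corrections("cloro", ...) each find "chloro" through a single-character addition, and corrections("flouro", ...) finds "fluoro" through a letter-pair inversion. Returning a list rather than a single answer keeps the ambiguous cases visible, since one mistyped fragment may sit one transformation away from several valid ones.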

Order of substituents

Technically, the rules state that substituents must be in alphabetical order, that is "3-bromo-2-chloro-1-iodopentane" rather than "1-iodo-2-chloro-3-bromopentane" or any other ordering. Name>Struct does not enforce this rule; substituents may be listed in any order without affecting the recognition of the chemical name.

Inverted names

Name>Struct can interpret CAS-style inverted names ("benzene, chloro-" rather than "chlorobenzene") without problems. That remains true even for more complicated cases such as "1-Propanaminium, 3-carboxy-2-hydroxy-N,N,N-trimethyl-, hydroxide, inner salt, (S)-", which would be "(S)-3-carboxy-2-hydroxy-N,N,N-trimethyl-1-propanaminium hydroxide inner salt" in its uninverted form.
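The relationship between the inverted and uninverted forms can be sketched with a simple heuristic. This is an illustration under stated assumptions, not Name>Struct's algorithm: segments are assumed to be separated by a comma plus a space, segments ending in a hyphen are treated as prefixes, and all other trailing segments are treated as separate words. The function names uninvert and _attach are hypothetical.

```python
def _attach(prefix, rest):
    """Prepend a hyphen-terminated prefix to the name built so far."""
    # "chloro-" + "benzene" -> "chlorobenzene"; the hyphen is kept before a
    # locant ("trimethyl-" + "1-...") or after a bracket ("(S)-").
    if prefix[-2:-1].isalpha() and rest[:1].isalpha():
        return prefix[:-1] + rest
    return prefix + rest


def uninvert(cas_name):
    """Heuristically rewrite a CAS-style inverted name in uninverted order."""
    if ", " not in cas_name:
        return cas_name  # nothing to uninvert
    parent, *rest = [part.strip() for part in cas_name.split(", ")]
    name = parent.lower()
    # Each later prefix segment ends up further to the left.
    for segment in rest:
        if segment.endswith("-"):
            name = _attach(segment, name)
    # Remaining segments (e.g. "hydroxide", "inner salt") follow as words.
    modifiers = [seg for seg in rest if not seg.endswith("-")]
    return " ".join([name, *modifiers])
```

Applied to the two inverted names quoted above, this sketch yields "chlorobenzene" and "(S)-3-carboxy-2-hydroxy-N,N,N-trimethyl-1-propanaminium hydroxide inner salt". Locants such as "N,N,N" survive because they contain commas without a following space; real-world names will not always be that tidy.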

Language

Name>Struct is designed to interpret English chemical names only. Because the two languages are structurally very similar, it can coincidentally also interpret many German chemical names, though nowhere near as many as English ones. The Name>Struct algorithm could be extended to other languages without much difficulty; we haven't done so simply because we haven't found the need. If you have a use for this functionality, please contact us.