Converting Chemical Names to Structures with Name>Struct

Name>Struct is a comprehensive algorithm for converting English chemical names into chemical structure diagrams. It is designed to be as practical as possible, interpreting chemical names as they are actually used by chemists, as well as names that observe the official nomenclature rules published by the International Union of Pure and Applied Chemistry (IUPAC), the International Union of Biochemistry and Molecular Biology (IUBMB), or the Chemical Abstracts Service (CAS).

This document discusses the batch version of Name>Struct. A more extensive discussion of the capabilities and limitations of Name>Struct in general is also available separately. The batch version is designed to convert thousands (or hundreds of thousands) of names to the corresponding chemical structures in a single operation without user intervention. As input, it accepts a text file containing a list of chemical names, one name per line. For output it produces an SDfile by default, or optionally a text file with SMILES strings.

Installation - Windows (2000, XP, or Vista) or Linux (Red Hat Enterprise Versions 4 and 5)

The batch version of Name>Struct consists of a command-line executable program, name-struct, and its accompanying data file, NameToStructure.dat. Both files must be placed in the same directory. There are no other installation requirements.

We recognize that there are many uses of Name>Struct where it would be more useful in another format, such as a DLL, or for another operating system. We have not created those versions, but it certainly should be possible to do so. Please contact your CambridgeSoft sales representative to discuss pricing and availability if you need something like that.

Preparing an input file

To run the batch version of Name>Struct, you need to prepare a text file containing one or more chemical names. Each name should be on a separate line in the file. Any line in the file that is totally blank will be ignored during the batch conversion, as will any line that starts with an asterisk ("*") or semicolon (";").

Sometimes it is useful to associate an ID with each name, so that the resulting output can be merged into some other database more easily. If any line within the input file contains a tab character, everything before the first tab character is assumed to be an ID and is echoed to the output file unchanged. The name is assumed to start immediately after the first tab character. If there is no tab character, the entire contents of the line is treated as a single chemical name.

This sample input file is used to produce the output files in the examples below.
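The input rules described above (skipped lines, optional ID before the first tab) can be illustrated with a short Python sketch. It is not part of Name>Struct: the function name parse_input_line is hypothetical, and treating only zero-length lines as "totally blank" is an assumption, since the handling of whitespace-only lines is not specified here.

```python
def parse_input_line(line):
    """Return (record_id, name) for a name line, or None for a skipped line.

    Hypothetical helper that mirrors the documented input rules; it is
    not the parser used by Name>Struct itself.
    """
    text = line.rstrip("\r\n")
    # Blank lines and lines starting with "*" or ";" are ignored.
    # Assumption: only a zero-length line counts as "totally blank".
    if text == "" or text[0] in "*;":
        return None
    # Everything before the FIRST tab is the ID; the rest is the name.
    if "\t" in text:
        record_id, name = text.split("\t", 1)
        return record_id, name
    # No tab: the whole line is a single chemical name.
    return None, text
```

A check like this can be useful for pre-validating a large input file, for example to confirm that every line intended to carry an ID actually contains a tab.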

Running the batch

name-struct should be run from the Windows or Linux command prompt. It accepts two required parameters specifying the names of the input and output files, with four other optional parameters:

name-struct [-s] [-t] [-w] [-smiles] infile outfile

infile specifies the name of the input text file. If the file is located in the same folder as name-struct, only the name of the file is necessary; if it is located in another folder, then a full path is necessary. If the name or full path contains a space, then it should be enclosed in quotes.

outfile specifies the name of the output file. If the file is to be created in the same folder as name-struct, only the name of the file is necessary; if it is to be created in another folder, then a full path is necessary. If the name or full path contains a space, then it should be enclosed in quotes.

The optional -s parameter, when present, will cause stopwords to generate error messages rather than being handled silently. A stopword is a bit of descriptive text like "anhydrous", "reagent grade", or "99%" that will often accompany a chemical name without changing the structure represented by that name. This parameter should usually be omitted in normal usage. Its purpose is to allow Name>Struct to be used as a tool to clean up databases where stopwords are not desired.

The optional -t parameter, when present, will enable extensive typo recognition and correction. Without this parameter Name>Struct will recognize only a few dozen of the most common typos, including "choro" and "flouro", but not many more. When this parameter is present Name>Struct will recognize many more, but at a significant cost in terms of speed. Full typo recognition and correction can easily be 20x slower than normal, and should not be used when speed is important.

The optional -w parameter, when present, will suppress the generation of a structure in any case where a warning would also be generated. Since many of the warnings produced by Name>Struct are for informational purposes only, use of this parameter will reduce the number of names that are successfully converted to structures. On the other hand, since some warnings do indicate problematic conversions, this parameter might well be useful in situations where unusually high confidence is needed in the generated structures, for example in fully automated environments where no manual review is possible.

The -w parameter is incompatible with the -t parameter; if both are present, the -t parameter will be ignored.

The optional -smiles parameter, when present, changes the output file format from an SDfile to a tab-delimited text file where the structures are encoded as SMILES strings. SMILES strings are designed as a very simple valence-based format. As discussed in the SMILES Tutorial provided by Daylight, the creators of the SMILES format, "SMILES is not useful for describing things that cannot be well-represented by valence model." Additionally, it is worth emphasizing that SMILES strings are not unique. Name>Struct may generate the SMILES string "CCO" for hydroxyethane and "C(C)O" for ethanol -- two entirely different strings, textually, that describe the same substance chemically. SMILES strings should not be compared textually to test the identity of chemical substances. In general, the SDfile format can store more types of chemical information than can be stored in SMILES strings. Overall, we strongly recommend that the default SDfile format be used, and that the SMILES output be considered only in cases where the SDfile would have been converted directly to SMILES format anyway.
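When the batch is launched from a script, the parameters described above can be assembled programmatically. The following Python sketch builds the argument list. The helper name build_command and its keyword names are hypothetical, and it assumes the executable can be found under the name "name-struct". Because the -t parameter is ignored whenever -w is present, the sketch rejects that combination instead of silently passing a flag that has no effect.

```python
def build_command(infile, outfile, stopwords_as_errors=False,
                  typo_correction=False, warnings_as_errors=False,
                  smiles=False, executable="name-struct"):
    """Assemble the argument list for one batch run (hypothetical helper)."""
    if warnings_as_errors and typo_correction:
        # Documented behavior: -t is ignored when -w is present.
        raise ValueError("-t is ignored when -w is present; choose one")
    args = [executable]  # assumption: resolvable from the working directory or PATH
    if stopwords_as_errors:
        args.append("-s")       # report stopwords as errors
    if typo_correction:
        args.append("-t")       # extensive (slow) typo correction
    if warnings_as_errors:
        args.append("-w")       # suppress structures that carry warnings
    if smiles:
        args.append("-smiles")  # tab-delimited SMILES output instead of SDfile
    # The two required parameters come last, input file first.
    args += [infile, outfile]
    return args
```

The resulting list can be handed to subprocess.run from the Python standard library. Passing the arguments as a list means that paths containing spaces do not need to be quoted by hand.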

Some examples:

name-struct in.txt out.sdf

Convert the names in the file "in.txt", writing the resulting structures to "out.sdf". Both files are in the same directory as name-struct. (result)

name-struct "c:\input data\names.txt" "d:\processed names\done.sdf"
or
name-struct -smiles "c:\input data\names.txt" "d:\processed names\done.txt"

Convert the names in the file "names.txt" (in the folder "c:\input data\"), writing the resulting structures to "done.sdf" (in the folder "d:\processed names\"). (result as SDfile or as SMILES)

name-struct -s -t "c:\input data\names.txt" "d:\processed names\done.sdf"
or
name-struct -s -t -smiles "c:\input data\names.txt" "d:\processed names\done.txt"

Convert the names in the file "names.txt" (in the folder "c:\input data\"), writing the resulting structures to "done.sdf" (in the folder "d:\processed names\"). Names like "copper sulfate, anhydrous" will generate an error message instead of a structure. Names like "benzioc acid" will generate a structure instead of an error message. (result as SDfile or as SMILES)

name-struct -s "c:\input data\names.txt" "d:\processed names\done.sdf"
or
name-struct -s -smiles "c:\input data\names.txt" "d:\processed names\done.txt"

Convert the names in the file "names.txt" (in the folder "c:\input data\"), writing the resulting structures to "done.sdf" (in the folder "d:\processed names\"). Names like "copper sulfate, anhydrous" will generate an error message instead of a structure. Names like "benzioc acid" will also generate an error message. (result as SDfile or as SMILES)

name-struct -s -w "c:\input data\names.txt" "d:\processed names\done.sdf"
or
name-struct -s -w -smiles "c:\input data\names.txt" "d:\processed names\done.txt"

Convert the names in the file "names.txt" (in the folder "c:\input data\"), writing the resulting structures to "done.sdf" (in the folder "d:\processed names\"). Names like "copper sulfate, anhydrous" will generate an error message instead of a structure. Names like "benzioc acid" will also generate an error message. Names like "dichlorobenzene" will also generate an error message. This option will generate the smallest number of structures. (result as SDfile or as SMILES)

name-struct -t "c:\input data\names.txt" "d:\processed names\done.sdf"
or
name-struct -t -smiles "c:\input data\names.txt" "d:\processed names\done.txt"

Convert the names in the file "names.txt" (in the folder "c:\input data\"), writing the resulting structures to "done.sdf" (in the folder "d:\processed names\"). Names like "copper sulfate, anhydrous" will generate a structure with no error message. Names like "benzioc acid" will generate a structure instead of an error message. This option will generate the maximum number of structures possible, but will be slow. (result as SDfile or as SMILES)

Using the output file

Name>Struct produces an SDfile as output by default. SDfiles are widely compatible with many chemical database applications, including CS ChemBioOffice. Name>Struct can also optionally produce text files as output, with the chemical structures encoded as SMILES strings. In either case, the data included within the output file is the same.

The output file will contain exactly as many records as there were names in the original input file. If a name could be successfully converted to a structure, that structure will be present in the corresponding record. Additionally, the following fields will be written to the output file:

Field name: Description
ID: The contents of an ID field as specified in the input file. Only present if the input file contained IDs.
Name: The original name as specified in the input file. Present for all records, even if no structure could be generated.
Error Message: Explanatory text stating that the name could not be converted to a structure, and occasionally describing why. If the name could be converted to a structure, this field may contain cautionary text about possible problems with the structure, such as an unusual valence. When Name>Struct is run in fully automated mode with no review of the output, it is strongly recommended that any structure with an error message be discarded (which can be done automatically by using the -w parameter on the command line).
Ambiguous: The letter "Y" if the original name was found to be ambiguous.
Name Without Typos: If Name>Struct found a typo in the original name, this field will contain the corrected name that was actually used when generating the structure. Only present if the -t flag was present on the command line.
Other Names Without Typos: If Name>Struct found a typo in the original name, and it found more than one way to correct the typo, additional corrected names are listed in this field, one per line. These names were not used to generate the structure, and are provided for informational purposes only. Only present if the -t flag was present on the command line.

In SDfiles, the fields are labeled explicitly with the names shown. In SMILES output files, the fields are listed in the order shown, separated by tabs, with the SMILES string immediately following the name and with each record starting on a new line.
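For scripted post-processing, the labeled fields can be pulled out of an SDfile with a few lines of Python. The sketch below relies only on general SDfile conventions (a data header line that starts with ">" and carries the field name in angle brackets, a value terminated by a blank line, and "$$$$" closing each record), not on anything specific to Name>Struct. The function name read_sd_fields is hypothetical, and the structure block itself is simply skipped.

```python
import re

# A data header looks like "> <Error Message>"; the field name sits in <...>.
_HEADER = re.compile(r"^>.*<([^<>]+)>")

def read_sd_fields(lines):
    """Yield one dict of data fields per SDfile record (hypothetical helper)."""
    fields, current = {}, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            # End of record: emit the fields, joining multi-line values.
            yield {key: "\n".join(value) for key, value in fields.items()}
            fields, current = {}, None
        elif current is not None:
            if line == "":
                current = None                # a blank line ends the value
            else:
                fields[current].append(line)  # values may span several lines
        else:
            match = _HEADER.match(line)
            if match:
                current = match.group(1)
                fields[current] = []
            # Any other line belongs to the structure block and is ignored.
```

Multi-line values, such as the one-per-line entries in Other Names Without Typos, are returned as a single string with embedded newlines. The function accepts any iterable of lines, including an open file object.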

Sample usage scenarios

Populating a database (minimum intervention): Run Name>Struct in default mode (-s, -t and -w parameters all omitted), then load resulting structures back into the database. This is a reasonable approach when working with a database that is known to be fairly clean, as it will produce structures that accurately represent the set of names. It is not such a good idea to take this minimal-intervention approach when working with a set of names of marginal quality -- Name>Struct will quite happily produce the structure for "pentachloromethane", which isn't going to help if the actual substance was supposed to be "pentachloroethane" (no "m").

Populating a database (maximum accuracy): Run Name>Struct in default mode (-s, -t and -w parameters all omitted). Any resulting structures without an Error Message field and not marked as ambiguous can be loaded into the database as before. Any resulting structures with an Error Message field or marked as ambiguous should be reviewed by hand. Many of the error messages will contain warnings that are simply informational; those structures can also be loaded into the database. Other warnings may indicate more serious problems, either in the original name ("pentachloromethane" would produce a valence warning, for example) or, more rarely, in Name>Struct itself. Those warnings should be dealt with on a case-by-case basis, as should any name that is identified as ambiguous.
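The sorting step in this scenario can be sketched in Python. The sketch assumes that each output record has already been read into a dictionary keyed by the field names listed under "Using the output file"; the function name triage and the labels "load" and "review" are hypothetical.

```python
def triage(record):
    """Classify one output record as 'load' or 'review' (hypothetical helper).

    `record` maps output field names to their text values; a field that
    was not written for the record is simply absent from the dict.
    """
    has_message = bool(record.get("Error Message", "").strip())
    ambiguous = record.get("Ambiguous", "").strip().upper() == "Y"
    # Anything carrying a message or an ambiguity flag goes to a human.
    return "review" if has_message or ambiguous else "load"
```

Records for which no structure could be generated carry an Error Message, so they land in the review pile along with the records that only have warnings.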

Populating a database (maximum accuracy, no intervention): Run Name>Struct with warnings treated as errors (-w parameter present but -s and -t parameters both omitted). Structures will be suppressed for all names with any sort of warning message or ambiguity. The remaining structures that were generated without warning can be loaded into the database.

Cleaning a database: Run Name>Struct without ignoring stopwords (-s parameter present, but -t and -w parameters omitted). Any resulting structures without an Error Message field and without a Name Without Typos field are probably fine and can be set aside. Examine the error messages and review any name that produced a "stopword found" error message. Those names include cases like "benzene, 99%", where the "99%" is probably not intended to be included in the database. Clean any names that you want, and repeat the process until you're sure that you have dealt with all the stopwords you care to deal with.

Now take the rest of the names -- the ones that failed to generate a structure and the ones that produced an error message as well as a structure -- and run Name>Struct again, this time with full typo correction turned on (-s and -w parameters omitted, but -t parameter present). Remember that this step will be extremely slow compared to the others, so you might want to run it overnight (figure no more than several hundred names per minute, compared to tens of thousands or so without typo recognition). Examine any record that contains a Name Without Typos field, and correct any typos that are present.
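To decide whether an overnight run is needed, the throughput figures above can be turned into a rough time estimate. The rates used in this Python sketch (300 and 20,000 names per minute) are illustrative values picked from within the ranges quoted above, not measurements, and the function name is hypothetical.

```python
def estimated_minutes(name_count, names_per_minute):
    """Rough run time in minutes for a batch of `name_count` names."""
    return name_count / names_per_minute

# Illustrative rates only: "several hundred" vs. "tens of thousands" per minute.
with_typo_correction = estimated_minutes(100_000, 300)
without_typo_correction = estimated_minutes(100_000, 20_000)
print(f"with -t:    about {with_typo_correction:.0f} minutes")
print(f"without -t: about {without_typo_correction:.0f} minutes")
```

Under these assumed rates, 100,000 names would take about five and a half hours with full typo correction, compared with about five minutes without it.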

Finally, run Name>Struct one last time in default mode (-s, -t and -w parameters all omitted), and review by hand any resulting structures with an Error Message field. Many of them will contain warnings that are simply informational; those structures can be set aside. Other warnings may indicate more serious problems, including problems in the original name ("pentachloromethane" would produce a valence warning, for example). Those warnings should be dealt with on a case-by-case basis.