What is Normalization?

Some common functional groups and substructures can legitimately be drawn in more than one way.  For a query to match a target at search time, it helps to ensure that both are present in the same form.  Rewriting such molecular features into arbitrary but standard forms is called normalization.


Normalization consists of identifying particular patterns of atoms and bonds, and transforming each into a different, standard pattern.  With a few exceptions, the order in which the normalization steps (rules) are applied does not matter.  Some of the steps are:

  1. Reaction normalization.  Ensure atom-to-atom map and reaction centers are defined.  For non-reactions, any reaction centers are stripped.
  2. Delete textual atoms.  These are cosmetic atoms of the special "text" type.  They are occasionally used in-house to annotate molecules, but are mainly jetsam received from ChemDraw structures containing text labels.
  3. Convert neutral diazo to the zwitterion, R2C=N=N --> R2C=N(+)=N(-)
  4. Convert neutral isonitrile to the zwitterion,  R-N#C  --> R-N(+)#C(-)
  5. Convert R3N=O to R3N(+)--O(-), including pentavalent nitro, RN(=O)2
  6. Convert M(+)--X(-) to M-X when M and X are both hypervalent.  (disabled)
  7. Discard neutral, non-bonded delreps.  Instead, convert the order of any involved bonds to delocalized.
  8. [1] Convert neutral tetravalent N to the cation.

[1] This step alters the meaning of the structure in an attempt to make sense out of it.  It is therefore hazardous.
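
As an illustration of what a single rule does, here is a minimal sketch of step 5 in Python.  It is not the in-house implementation: the Atom and Mol classes and the function charge_separate_n_oxide are hypothetical names invented for this example, and the pattern test (a neutral N with bond orders summing to 5, double-bonded to a neutral O) is an assumed simplification of the real matching.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    symbol: str
    charge: int = 0


@dataclass
class Mol:
    # Hypothetical toy container: bonds maps (i, j) with i < j to a bond order.
    atoms: list
    bonds: dict = field(default_factory=dict)

    def order_sum(self, i):
        """Sum of bond orders around atom i."""
        return sum(order for pair, order in self.bonds.items() if i in pair)


def charge_separate_n_oxide(mol):
    """Apply step 5 once: pentavalent N=O becomes N(+)--O(-).

    Returns True if the molecule was changed, False otherwise.
    """
    for (i, j), order in sorted(mol.bonds.items()):
        if order != 2:
            continue
        for n, o in ((i, j), (j, i)):
            a_n, a_o = mol.atoms[n], mol.atoms[o]
            if (a_n.symbol == "N" and a_o.symbol == "O"
                    and a_n.charge == 0 and a_o.charge == 0
                    and mol.order_sum(n) == 5):
                mol.bonds[(i, j)] = 1          # double bond becomes single
                a_n.charge, a_o.charge = 1, -1  # separate the charges
                return True
    return False


# Pentavalent nitro, R-N(=O)2, with carbon standing in for R.
nitro = Mol([Atom("C"), Atom("N"), Atom("O"), Atom("O")],
            {(0, 1): 1, (1, 2): 2, (1, 3): 2})
charge_separate_n_oxide(nitro)
```

Only one of the two N=O bonds in the nitro group is rewritten.  Once the nitrogen carries a charge the pattern no longer matches, so a second call leaves the structure alone.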

Technical Notes

Normalization is a cooperative effort between the ccMolecule class, which directs, and the ccPercep class, which makes the actual changes.  ccPercep is well placed to do this because it holds so much of the information by which the conversions are staged.  It returns maps describing how atom and bond numbers were resequenced, so that any data structures tied to them can be updated.  When a change is made, the ccPercep part terminates immediately, so that the perception sets can be regenerated for the new version of the structure.  The ccMolecule driver then resubmits the structure in case other, later, normalization rules need to be applied.  (This design carries the danger of infinite looping, but there is protection against it.)
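
The role of the resequencing maps can be illustrated with a small Python sketch.  It assumes a toy representation (a list of atom symbols and a list of (atom, atom, order) bonds) and uses the hypothetical helpers delete_atoms and remap.  It does not show the actual ccPercep interface, and only the atom map is built here; a bond map would follow the same idea.

```python
def delete_atoms(symbols, bonds, doomed):
    """Remove the atoms whose indices are in `doomed`.

    Returns the surviving symbols, the surviving bonds renumbered to the
    new atom indices, and a dict mapping old atom index -> new atom index.
    """
    atom_map = {}
    kept = []
    for old, sym in enumerate(symbols):
        if old in doomed:
            continue
        atom_map[old] = len(kept)
        kept.append(sym)
    # Bonds touching a deleted atom are dropped along with it.
    new_bonds = [(atom_map[a], atom_map[b], order)
                 for a, b, order in bonds
                 if a in atom_map and b in atom_map]
    return kept, new_bonds, atom_map


def remap(per_atom_data, atom_map):
    """Re-key data that was indexed by old atom numbers."""
    return {atom_map[old]: value
            for old, value in per_atom_data.items()
            if old in atom_map}


# A carbonyl and an amine nitrogen plus one cosmetic "Text" atom (index 1).
symbols = ["C", "Text", "O", "N"]
bonds = [(0, 2, 2), (0, 3, 1)]
labels = {2: "carbonyl O", 3: "amine N"}  # caller data tied to atom numbers

symbols, bonds, atom_map = delete_atoms(symbols, bonds, doomed={1})
labels = remap(labels, atom_map)
```

After the deletion the oxygen and nitrogen have moved down by one position, and the caller's labels follow them because they were re-keyed through the returned map.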

All changes are accompanied by a message to the default logger.
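
The restart-after-each-change design described in these notes can be sketched in Python as follows.  Everything here is an assumption made for illustration: structures are plain strings, rules are simple text rewrites, apply_first_change stands in for the ccPercep side and normalize for the ccMolecule driver.  The notes do not say how the real code guards against infinite looping, so the pass limit used here is only one possible protection, and Python's logging module stands in for the default logger.

```python
import logging

log = logging.getLogger("normalize")


def rewrite(old, new):
    """Build a toy rule: replace the first occurrence of `old`, else return None."""
    def rule(struct):
        return struct.replace(old, new, 1) if old in struct else None
    return rule


def apply_first_change(struct, rules):
    """Apply the first rule that fires, log it, and stop immediately."""
    for name, rule in rules:
        changed = rule(struct)
        if changed is not None:
            log.info("normalization rule %r applied", name)
            return changed, True
    return struct, False


def normalize(struct, rules, max_passes=50):
    """Resubmit the structure until no rule fires.

    max_passes is an assumed guard against rules that undo one another.
    """
    for _ in range(max_passes):
        struct, changed = apply_first_change(struct, rules)
        if not changed:
            return struct
    raise RuntimeError("normalization did not converge")


RULES = [
    ("diazo", rewrite("C=N=N", "C=N(+)=N(-)")),
    ("isonitrile", rewrite("N#C", "N(+)#C(-)")),
]

result = normalize("R2C=N=N.R-N#C", RULES)
```

Stopping after the first change costs extra passes, but it means every rule always sees a structure whose derived information is current.  A pair of rules that undo each other would cycle forever without the cap; with it, normalize raises an error instead.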