RDFiles in ChemFinder

Oct 98

Data in Reaccs

The RDFile is a rich and complicated thing, largely because it contains data from Reaccs, an even more complicated thing. To envision a Reaccs database, think of a file system. There is a root level containing data (files) and datatypes (folders); these contain more data and more datatypes, and so on. Reaccs makes this more complicated by allowing, at any level (file or folder), a numbered series of leaves, each containing its own subtree. If you’re having trouble picturing this, you’re not alone. For details, see the MDL web site.

In a default Reaccs database, the top-level item is the RXN, containing the structure of a reaction plus some general data about the abstract reaction. Each reaction has one or more "variations" containing the details of a specific instance of the reaction, perhaps one run of the reaction in a lab or one article in the literature. The rest of the data is in a tree underneath each variation.

Data in RDFiles

When Reaccs writes data to an RDFile, it does so verbosely. Here’s an excerpt:

Each full record begins with $RFMT, followed by a reaction structure, followed by an unordered series of data items, each identified by its datatype name. A datatype name is a sort of pathname; the class we use to handle one of these is CRDPath, so we call it a "crdpath." Each level in the crdpath represents a node in the data tree, the rightmost being the fieldname.
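To make the idea of a crdpath concrete, here is a minimal Python sketch of splitting one into its levels. It assumes the levels are separated by colons and that a repeating level carries its sequence number in parentheses; the sample path, the field name TEMP, and the function parse_crdpath are invented for illustration and are not the CRDPath class itself.

```python
import re
from typing import List, NamedTuple, Optional


class PathLevel(NamedTuple):
    name: str               # datatype or field name at this level
    index: Optional[int]    # sequence number, or None if the level is not numbered


# One level: a name, optionally followed by "(n)". This syntax is an assumption.
_LEVEL_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\((\d+)\))?$")


def parse_crdpath(path: str) -> List[PathLevel]:
    """Split a datatype name into its levels, leftmost (root) first."""
    levels = []
    for part in path.strip().split(":"):
        match = _LEVEL_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognized path level: {part!r}")
        name, index = match.groups()
        levels.append(PathLevel(name, int(index) if index else None))
    return levels


# Hypothetical datatype name, used only to exercise the parser.
levels = parse_crdpath("RXN:VARIATION(1):CONDITIONS(1):TEMP")
fieldname = levels[-1].name                      # rightmost level is the fieldname
node_path = [lvl.name for lvl in levels[:-1]]    # the nodes above it in the tree
```

Here node_path names the chain of tree nodes the datum belongs to, and fieldname names the field within the last node.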

Catalyst and solvent molecule structures are embedded in this collection, each under its own datatype under the variation. The datatype name is actually REGNO, but in place of a registry number is a molfile of the structure itself. When loading the file, a regno must be assigned to each structure based on where the structure goes.

What This Becomes

The RDFile loader in ChemFinder can read and manage this data like no other loader in the industry, and convert it from hierarchical to relational on the fly. What it has to do includes the following: parse and decipher crdpaths; figure out the incoming data tree, and construct a special tree-style recordset to hold incoming data; build a series of relational tables able to contain and organize the data; store data in these tables along with linking information allowing them to be related correctly; plus all the work done by the SDFile loader: pre-scan the file, build a form on the fly, process structures.
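One way to picture "figuring out the incoming data tree" is to merge all the datatype paths seen in the file into a single nested structure. The sketch below does this with plain Python dictionaries and renders the result as an indented outline. The paths and field names (LITREF, TEMP, GRADE) are made up, and the nested-dict representation is only a stand-in for the loader's tree-style recordset.

```python
from typing import Dict, Iterable, List

Tree = Dict[str, "Tree"]


def build_tree(paths: Iterable[List[str]]) -> Tree:
    """Merge datatype paths (root first) into one nested dictionary."""
    tree: Tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the child if this name was already seen at this level.
            node = node.setdefault(name, {})
    return tree


def render(tree: Tree, depth: int = 0) -> List[str]:
    """Return one indented line per node, parents before children."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines


# Hypothetical paths, as they might be collected during the preliminary scan.
paths = [
    ["RXN", "VARIATION", "LITREF"],
    ["RXN", "VARIATION", "CONDITIONS", "TEMP"],
    ["RXN", "VARIATION", "CATALYST", "REGNO"],
    ["RXN", "VARIATION", "CATALYST", "GRADE"],
    ["RXN", "VARIATION", "SOLVENT", "REGNO"],
]
tree = build_tree(paths)
diagram = "\n".join(render(tree))
```

Printing diagram gives an outline with RXN at the left margin and each deeper level indented two more spaces.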

If you choose to create a log file during the load, here is a little-known feature: the loader writes to the log file a tree-style diagram of the datatypes found in the file. For an example, see the sample logfile. The tree of datatypes is constructed during the preliminary scan of the file, and held in a tree-style recordset. Prior to the actual load, the tree is interpreted and used to build a collection of tables. To see what these look like, examine the database structure after the load is complete. The following screen shot gives an example.

Each node in the data tree becomes a table. The main RXN level forms the main structure table (called MolTable), which in this case contains no data except structure. In all tables below the root, an ID column contains a unique index for each record, and a <parent>_ID column indicates the id of the parent.

The Variation table, for example, contains an id for each variation, plus a rxn_id pointing to the parent reaction, plus some variation-specific fields. Tables beneath this (for example, Conditions) contain internal id and variation_id columns.
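The rule "each node becomes a table, with an ID column and a <parent>_ID column" can be sketched as a small generator of table definitions. This is an illustration under assumptions, not the loader's code: SQLite (via Python's sqlite3 module) stands in for the real database, the fields LITREF and TEMP are hypothetical, every field is treated as text, and the structure column of the main table is left out.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical datatype tree, flattened to (node, parent, field names).
NODES: List[Tuple[str, Optional[str], List[str]]] = [
    ("RXN", None, []),
    ("VARIATION", "RXN", ["LITREF"]),
    ("CONDITIONS", "VARIATION", ["TEMP"]),
]


def table_ddl(node: str, parent: Optional[str], fields: List[str]) -> str:
    """Return a CREATE TABLE statement for one node of the tree."""
    if parent is None:
        # The root level becomes the main structure table.
        name = "MolTable"
        columns = ["MOL_ID INTEGER PRIMARY KEY"]
    else:
        # Every other node gets its own id plus a pointer to its parent.
        name = node.capitalize()
        columns = ["ID INTEGER PRIMARY KEY", f"{parent}_ID INTEGER"]
    columns += [f"{field} TEXT" for field in fields]
    return f"CREATE TABLE {name} ({', '.join(columns)})"


statements = [table_ddl(*node) for node in NODES]

# Run the statements in a throwaway database to confirm they are valid.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
variation_columns = [row[1] for row in conn.execute("PRAGMA table_info(Variation)")]
```

With these inputs the Variation table ends up with the columns ID, RXN_ID, and LITREF, matching the description above.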

Where the Solvents and Catalysts Go

Until ChemFinder 5, the RDFile loader skipped entirely over solvent and catalyst structures. Other data was stored (id, variation_id, grade), but no structures. This is no longer the case; we now read the structures, convert them into molecule objects, and store them in the database.

The catalyst and solvent molecules are stored in the same database as the reactions. As each structure (reaction or molecule) is encountered, it is appended to the structure database, and its id is then entered in the appropriate table. This is illustrated in the following diagram. The result is a main table (MolTable) containing only the reaction id’s, and Catalyst and Solvent tables containing their corresponding molecule id’s. Note that in order to have this work easily, we change the name REGNO to MOL_ID in the process.

There are some drawbacks to this arrangement. For one thing, the user cannot search both types of structure in a single search. It is not possible to say "find me all reactions, solvents, or catalysts containing benzene." Nor is it possible to search both catalysts and solvents in the same search. For another thing, reactions do not have consecutive id’s; if you display the id and browse through the reactions, you hit 1, 3, 6, …, which is a little disconcerting, though not really harmful.
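The gaps in the reaction id’s follow directly from all structures sharing one id sequence. The short Python sketch below shows the effect. The input counts are invented, and it assumes each reaction structure is stored before the catalyst and solvent structures that belong to it.

```python
from itertools import count

# Hypothetical input: how many catalyst and solvent structures each reaction has.
records = [
    {"catalysts": 1, "solvents": 0},
    {"catalysts": 1, "solvents": 1},
    {"catalysts": 0, "solvents": 0},
]

next_id = count(1)          # one id sequence shared by every stored structure
reaction_ids, catalyst_ids, solvent_ids = [], [], []

for record in records:
    # The reaction structure is appended first and takes the next id.
    reaction_ids.append(next(next_id))
    # Each embedded molecule then takes an id from the same sequence.
    for _ in range(record["catalysts"]):
        catalyst_ids.append(next(next_id))
    for _ in range(record["solvents"]):
        solvent_ids.append(next(next_id))
```

After the loop, reaction_ids is [1, 3, 6]; the skipped numbers are the id’s that went to catalysts and solvents.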

How To Set Up A Database

To read an RDFile containing solvents and catalysts and convert it into a useful form which shows all the structural information, follow this procedure:

  1. Create a new, empty form.
  2. Use Import RDFile from the File menu to load the file. Be sure to let the scan go to completion. Accept all defaults.
  3. The result will be a form showing the reaction structure and any top-level data available. It should have plenty of room on the right; if not, rearrange it as necessary.
  4. Create a large subform for the Variation table. Right-click within it, choose Data Source, select the Variation table, click OK.
  5. Draw data boxes as desired to display variation data. This data typically includes reaction text, lit ref, keywords.
  6. Within the variation subform, create a subform for solvents and one for catalysts. Do the steps 7-9 within each.
  7. Create a data box for display of the structure. Right-click on it, for data source select the Catalyst (Solvent) table, for data type select Structure. Click OK, and a catalyst structure should appear in this box.
  8. Optionally, create data boxes for other data (e.g., Grade).
  9. Double-click within the subform to change its display to table view.
  10. Set up links between subforms as follows:

      In subform:   To set links:   Choose:
      Variation     Relate By       [MolTable] MOL_ID
                    Relate To       [Variation] RXN_ID
      Catalyst      Relate By       [Variation] ID
                    Relate To       [Catalyst] VARIATION_ID
      Solvent       Relate By       [Variation] ID
                    Relate To       [Solvent] VARIATION_ID

  11. Click in the main form. Browse to check for correct action.
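For readers who think in queries, the link table above amounts to a parent-to-child lookup for each subform. The following Python sketch expresses the same three links as data and uses them to fetch the rows a subform would show. It is an analogy only: SQLite stands in for the real database, the sample rows are invented, and child_rows is a hypothetical helper, not a ChemFinder function.

```python
import sqlite3

# Subform -> (parent table, Relate By column, child table, Relate To column),
# copied from the link table above.
LINKS = {
    "Variation": ("MolTable", "MOL_ID", "Variation", "RXN_ID"),
    "Catalyst": ("Variation", "ID", "Catalyst", "VARIATION_ID"),
    "Solvent": ("Variation", "ID", "Solvent", "VARIATION_ID"),
}


def child_rows(conn, subform, parent_key):
    """Return the rows a subform would show for one parent record."""
    _parent, _relate_by, child, relate_to = LINKS[subform]
    # Table and column names come from LINKS, never from user input.
    sql = f"SELECT * FROM {child} WHERE {relate_to} = ? ORDER BY 1"
    return conn.execute(sql, (parent_key,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MolTable  (MOL_ID INTEGER PRIMARY KEY);
    CREATE TABLE Variation (ID INTEGER PRIMARY KEY, RXN_ID INTEGER);
    CREATE TABLE Catalyst  (ID INTEGER PRIMARY KEY, VARIATION_ID INTEGER, MOL_ID INTEGER);
    CREATE TABLE Solvent   (ID INTEGER PRIMARY KEY, VARIATION_ID INTEGER, MOL_ID INTEGER);
    INSERT INTO MolTable VALUES (1), (2), (3);
    INSERT INTO Variation VALUES (1, 1);
    INSERT INTO Catalyst VALUES (1, 1, 2);
    INSERT INTO Solvent VALUES (1, 1, 3);
""")

# Browse to reaction 1, then follow the links down to its first variation.
variations = child_rows(conn, "Variation", 1)
catalysts = child_rows(conn, "Catalyst", variations[0][0])
solvents = child_rows(conn, "Solvent", variations[0][0])
```

Each call mirrors one Relate By / Relate To pair: the parent's key value is matched against the child's linking column.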

You should have a form roughly like the following picture. As you browse through the reactions, each displays its first variation, showing a little table of solvents and catalysts for that variation. In the diagram, I’ve added a second subtable of variation data, to show whether there is more than one variation for a reaction; if so, the user will need to browse manually through the variations.