RDFiles in ChemFinder
Data in Reaccs
The RDFile is a rich and complicated thing, largely because it contains data from Reaccs, an even more complicated thing. To envision a Reaccs database, think of a file system. There is a root level containing data (files) and datatypes (folders), which in turn contain more data and more datatypes, and so on. Reaccs complicates this by allowing, at any level (file or folder), a numbered series of leaves, each containing its own subtree. If you're having trouble picturing this, you're not alone. For details, see the MDL web site.
In a default Reaccs database, the top-level item is the RXN, containing the structure of a reaction, plus some general data about the abstract reaction. Each reaction has one or more "variations" containing the details about a specific instance of the reaction: perhaps one run of the reaction in a lab, or one article in the literature. The rest of the data is in a tree underneath each variation.
Data in RDFiles
When Reaccs writes data to an RDFile, it does so verbosely. Here's an excerpt:
Each full record begins with $RFMT, followed by a reaction structure, followed by an unordered series of data items, each identified by its datatype name. A datatype name is a sort of pathname; the class we use to handle one of these is CRDPath, so we call it a "crudpath." Each level in the crdpath represents a node in the data tree, the rightmost being the fieldname.
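To make the pathname idea concrete, here is a minimal sketch of splitting a datatype name into its levels. It is not the loader's CRDPath class. The function `parse_crdpath`, the `PathLevel` type, and the sample path are hypothetical, and the sketch assumes levels are separated by colons with an optional leaf number in parentheses, as in `VARIATION(1)`.

```python
import re
from typing import List, NamedTuple, Optional


class PathLevel(NamedTuple):
    name: str             # node name, e.g. "VARIATION"
    index: Optional[int]  # leaf number, or None when the level is unnumbered


# Assumed syntax for one level: NAME or NAME(n).
_LEVEL = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\((\d+)\))?$")


def parse_crdpath(path: str) -> List[PathLevel]:
    """Split a datatype name into (name, index) levels, left to right."""
    levels = []
    for part in path.split(":"):
        match = _LEVEL.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized path level: {part!r}")
        number = int(match.group(2)) if match.group(2) else None
        levels.append(PathLevel(match.group(1), number))
    return levels


# Hypothetical path: second catalyst of the first variation.
levels = parse_crdpath("RXN:VARIATION(1):CATALYST(2):REGNO")
field_name = levels[-1].name  # the rightmost level is the fieldname
```

Everything to the left of the fieldname identifies the node in the data tree that the value belongs to.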
Catalyst and solvent molecule structures are embedded in this collection, each under its own datatype under the variation. The datatype name is actually REGNO, but in place of a registry number is a molfile of the structure itself. When loading the file, a regno must be assigned to each structure based on where the structure goes.
What This Becomes
The RDFile loader in ChemFinder can read and manage this data like no other loader in the industry, and convert it from hierarchical to relational on the fly. What it has to do includes the following: parse and decipher crdpaths; figure out the incoming data tree, and construct a special tree-style recordset to hold incoming data; build a series of relational tables able to contain and organize the data; store data in these tables along with linking information allowing them to be related correctly; plus all the work done by the SDFile loader: pre-scan the file, build a form on the fly, process structures.
If you choose to create a log file during the load, here is a little-known feature: the loader writes to the log file a tree-style diagram of the datatypes found in the file. To see an example, see the sample logfile. The tree of datatypes is constructed during the preliminary scan of the file, and held in a tree-style recordset. Prior to the actual load, the tree is interpreted and used to build a collection of tables. To see what these look like, examine the database structure after the load is complete. The following screen shot gives an example.
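The preliminary scan can be pictured with a short sketch: collect every datatype name seen in the file, drop the leaf numbers, and merge the names into one tree. This is an illustration only. The functions `add_path` and `render`, the sample paths, and the TEMP field are hypothetical, and the indented output is not the actual log file format.

```python
import re


def add_path(tree: dict, path: str) -> None:
    """Merge one datatype name into the tree, ignoring leaf numbers."""
    node = tree
    for part in path.split(":"):
        name = re.sub(r"\(\d+\)$", "", part.strip())  # VARIATION(2) -> VARIATION
        node = node.setdefault(name, {})


def render(tree: dict, depth: int = 0) -> list:
    """Return the tree as indented lines, one per node."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines


# Hypothetical datatype names, as a scan of a file might collect them.
paths = [
    "RXN:VARIATION(1):CATALYST(1):REGNO",
    "RXN:VARIATION(1):CATALYST(1):GRADE",
    "RXN:VARIATION(1):SOLVENT(1):REGNO",
    "RXN:VARIATION(1):CONDITIONS(1):TEMP",
    "RXN:VARIATION(2):CATALYST(1):REGNO",
]

tree = {}
for p in paths:
    add_path(tree, p)

diagram = render(tree)
print("\n".join(diagram))
```

In this sketch, each node that has children corresponds to one table, and each childless node corresponds to a field in its parent's table.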
Each node in the data tree becomes a table. The main RXN level forms the main structure table (called MolTable), which in this case contains no data except structure. In all tables below the root, an ID column contains a unique index for each record, and a <parent>_ID column indicates the id of the parent.
The Variation table, for example, contains an id for each variation, plus a rxn_id pointing to the parent reaction, plus some variation-specific fields. Tables beneath this, for example Conditions, contain internal id and variation_id columns.
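The id and parent-id columns are what let the tables be stitched back together. The sketch below mimics the layout in an in-memory SQLite database, which is an assumption made for illustration; ChemFinder does not necessarily store its tables this way. The TEMP column and all row values are hypothetical, and the structure column of MolTable is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MolTable   (MOL_ID INTEGER PRIMARY KEY);
    CREATE TABLE Variation  (ID INTEGER PRIMARY KEY, RXN_ID INTEGER);
    CREATE TABLE Conditions (ID INTEGER PRIMARY KEY, VARIATION_ID INTEGER, TEMP TEXT);
""")

# Two reactions; the first has two variations, the second has one.
conn.executemany("INSERT INTO MolTable VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO Variation VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany(
    "INSERT INTO Conditions VALUES (?, ?, ?)",
    [(1, 1, "25 C"), (2, 2, "80 C"), (3, 3, "0 C")],
)

# Walk down the tree: reaction -> its variations -> their conditions.
rows = conn.execute("""
    SELECT v.ID, c.TEMP
    FROM MolTable m
    JOIN Variation  v ON v.RXN_ID = m.MOL_ID
    JOIN Conditions c ON c.VARIATION_ID = v.ID
    WHERE m.MOL_ID = 1
    ORDER BY v.ID
""").fetchall()
print(rows)
```

Each join follows one parent link, so a table three levels down is reached with three joins.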
Where the Solvents and Catalysts Go
Until ChemFinder 5, the RDFile loader skipped entirely over solvent and catalyst structures. Other data was stored (id, variation_id, grade) but no structures. This is no longer the case; we now read the structures, convert them into molecule objects, and store them in the database.
The catalyst and solvent molecules are stored in the same database as the reactions. As each structure ¾ reaction or molecule ¾ is encountered, it is appended to the structure database, and its id is then entered in the appropriate table. This is illustrated in the following diagram. The result is a main table (MolTable) containing only the reaction ids, and Catalyst and Solvent tables containing their corresponding molecule ids. Note that in order to have this work easily, we change the name REGNO to MOL_ID in the process.
There are some drawbacks to this arrangement. For one thing, the user cannot search both types of structure in a single search. It is not possible to say "find me all reactions, solvents, or catalysts containing benzene." Nor is it possible to search both catalysts and solvents in the same search. For another thing, reactions do not have consecutive ids; if you display the id and browse through the reactions, you hit 1, 3, 6, ..., which is a little disconcerting, though not really harmful.
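The gaps in the reaction ids follow directly from sharing one structure store. Here is a small simulation of the idea, not the loader's code. It assumes each record's reaction is encountered first, followed by that record's catalysts and then its solvents; the `load` function and the sample records are hypothetical.

```python
from itertools import count


def load(records):
    """Hand out ids from one shared counter, in order of encounter."""
    next_id = count(1)
    tables = {"MolTable": [], "Catalyst": [], "Solvent": []}
    for rec in records:
        # The reaction structure comes first in each record.
        tables["MolTable"].append(next(next_id))
        # Embedded molecules draw from the same id sequence.
        for _ in range(rec["catalysts"]):
            tables["Catalyst"].append(next(next_id))
        for _ in range(rec["solvents"]):
            tables["Solvent"].append(next(next_id))
    return tables


# Hypothetical file: three reactions with differing numbers of molecules.
tables = load([
    {"catalysts": 1, "solvents": 0},
    {"catalysts": 1, "solvents": 1},
    {"catalysts": 0, "solvents": 0},
])
print(tables["MolTable"])   # reaction ids
print(tables["Catalyst"])   # MOL_ID values stored in the Catalyst table
print(tables["Solvent"])    # MOL_ID values stored in the Solvent table
```

With these records the reactions receive ids 1, 3, and 6, because the ids in between went to the catalyst and solvent molecules.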
How To Set Up A Database
To read an RDFile containing solvents and catalysts and convert it into a useful form which shows all the structural information, follow this procedure:
|In subform:|To set links:|Choose:|
|Variation|Relate By [MolTable]|MOL_ID|
||Relate To [Variation]|RXN_ID|
|Catalyst|Relate By [Variation]|ID|
||Relate To [Catalyst]|VARIATION_ID|
|Solvent|Relate By [Variation]|ID|
||Relate To [Solvent]|VARIATION_ID|
You should have a form roughly like the following picture. As you browse through the reactions, each displays its first variation, showing a little table of solvents and catalysts for that variation. In the diagram, I've added a second subtable of variation data, to show whether there is more than one variation for a reaction; if so, the user will need to browse manually through the variations.