Rules for Text Searching

Jim Dill, 3 Apr 02


The way text searching works in ChemFinder 7.0 is inconsistent, nonstandard, and poorly documented.  This page was used to organize thoughts on the matter and design a better scheme for 7.0.3.

Update 4 Apr 02: new code committed today improves the situation considerably.  New documentation is on the 7.0.3 doc page.  Below are notes made during the improvement process.


Notes made earlier

The online help for ChemFinder 7.0.1 gives this set of rules about text searching:

In summer 2001, when the Merck Index was produced, these rules changed.  The rules as documented above still apply to text fields (length up to 254 chars), but not to memo fields (unlimited length), which work as follows:

We are making the following changes to these rules for ChemFinder 7.0.3:

Notes:

  1. When searching a memo field in an attached Orable table, floating text searches are very slow.  If you know the exact string you're looking for, be sure to prefix the query with "=" for better performance.
  2. The reason memo fields are different from normal text fields is that the former often contain rich text instead of plain text.  That means that strings stored in the database may have formatting information surrounding the data.  Thus, a search for "= benzene" may fail if the string on file is formatted, i.e., it says "benzene" on the screen, but is stored as "<format>benzene<format>."
  3. We have more work to do in this area.  ChemFinder 8 will have an improved, fully documented scheme with more consistency and more flexibility.  But we actually got a bit further in 7.0.3 than we originally thought we would.

New Rules Planning

This table summarizes how single-token text queries ought to work in 7.0.3, given two new switches:

Query Normal Auto-Float Full-Word AF and FW
         
abc like abc* like *abc* =abc or like abc[d]* =abc or like abc[d]* or like *[d]abc or like *[d]abc[d]*
abc* like abc* < same =abc or like abc[d]* < same
*abc like *abc < same =abc or like *[d]abc < same
*abc* like *abc* < same =abc or like abc[d]* or like *[d]abc or like *[d]abc[d]* < same
=abc = abc < same < same < same

Notes:

  1. Normal match on straight text query is taken as "anchored-substring" match, because our doc says so.
  2. If a wildcard is included in the query, auto-float preference is ignored.
  3. If query is prefixed with "=," both auto-float and full-word match are ignored.
  4. The notation "[d]" means "delimiter", represented in SQL as "[!a-z]" meaning any non-alphabetic character.  This notation does not allow for zero such characters.

Sample hits

Note: ALL queries hit the target "abc."

Query Normal Auto-Float Full-Word AF and FW
         
abc abcx xabcx abc x x abc x
abc* abcx < same abc x < same
*abc xabc < same x abc < same
*abc* xabcx < same x abc x < same
=abc abc < same < same < same

 


Appendix: Jet String Comparison Table

Asterisk is not the only wildcard allowed in text queries.  The following is from Microsoft Access Help (Jet SQL Reference):

Character(s) in pattern Matches in expression
? or _ (underscore) Any single character
* or % Zero or more characters
# Any single digit (0 9)
[charlist] Any single character in charlist
[!charlist] Any single character not in charlist

Kind of match Pattern Match
(returns True)
No match
(returns False)
Multiple characters a*a aa, aBa, aBBBa aBC
  *ab* abc, AABB, Xab aZb, bac
Special character a[*]a a*a aaa
Multiple characters ab* abcdefg, abc cab, aab
Single character a?a aaa, a3a, aBa aBBBa
Single digit a#a a0a, a1a, a2a aaa, a10a
Range of characters [a-z] f, p, j 2, &
Outside a range [!a-z] 9, &, % b, a
Not a digit [!0-9] A, a, &, ~ 0, 1, 9
Combined a[!b-m]# An9, az0, a99 abc, aj0

Questions:
  1. Does our parser have to understand all this?
  2. Do we need to look out for text strings containing any of the above characters ([,],?,#,!,*) and surround them with escape chars or something?
  3. Do we need to document this stuff in our manual?
  4. Does the statement above -- "If a wildcard is included in the query, auto-float preference is ignored" -- apply to patterns within the text as well as asterisks at the ends?

Planning For ChemFinder 8

For future reference, let's start a compilation of all the types of query it might occur to a user to enter when searching for text or numeric data.  This will be useful when we rip out the whole mechanism and build it over from scratch, next round.

Text query Description
abc straight text
abc* text with wildcards
ab to cd range
ab - cd range
between ab and cd range
> ab inequality
\sql raw sql
=abc exact match
abc,def,ghi list
not abc negated
abc or (>d and <g) subphrase
not (>d and <g) negated subphrase
   
Numeric query  
x value
x - y range
x to y range
between x and y range
>x and <y range
x to y range
not x negated
<> x negated
not (x - y) outside range

 


Copyright 2002 CambridgeSoft Corp. All Rights Reserved.
Proprietary and confidential to CambridgeSoft Corp.
Last updated: 06/15/2002
Pagekeeper:  Jim

[ R&D Home | ChemFinder Home ]