Regular expressions for Caché

This document describes a regular expression DLL/shared object with a class interface for Caché (version 5, non-Unicode) ObjectScript (COS), for both Windows and Linux.

Regular expressions originated in the Unix world and are a way of matching text against a pattern (the regular expression), similar to COS pattern matching. However, regular expressions have features that COS patterns lack. For example, a regular expression can define substrings of the matched text and return them separately.

Caché has no native support for regular expressions, and implementing them in COS would be non-trivial. However, many free implementations in C source code are available. Unfortunately, Caché cannot easily call arbitrary external DLLs; they have to be compiled specifically for it.

The code described on this page implements a C DLL (for Windows) or shared object (for Linux) that is callable from Caché. The actual regular expression code is taken from the Perl Compatible Regular Expressions library (version 5.0). The code is linked statically, i.e., no DLLs other than the supplied one are needed for operation. A class interface to the DLL is provided that takes care of loading and unloading the DLL, and passing/retrieving data in the proper manner.

PCRE

The Perl Compatible Regular Expressions (PCRE) library is written by Philip Hazel, and provides a rich set of features that exceed those of traditional Unix regular expressions. It is, as its name implies, feature-compatible with the Perl implementation. It is used in many large open source projects (Python, Apache, and PHP, to name a few).

The PCRE library is supplied in C source code form. The supplied makefiles are strongly Unix-based, as its author does not use Windows. The PCRE web site lists some contributions that provide instructions and/or makefiles for compiling PCRE on Windows with various compilers. Most seem somewhat outdated, though.

The PCRE manual is available from its web site only in text form. The source code, however, contains a version in HTML. For convenience, I've made it available here.

Features & limitations

Features of the provided class and DLL include:

An online copy of the interface class's documentation can be found here.

Limitations of the provided class and DLL are:

Installation

As the provided library does not depend on other libraries (other than standard operating system components), it can be placed anywhere. The interface class's parameter DLLLocation should be set to the full path to the DLL or shared object (including the filename) before compiling.

So, to start using the regular expression object:

  1. Download and unpack the proper version for your operating system.
  2. Copy the DLL or shared object to a suitable location.
  3. Import the interface class in Caché; set its DLLLocation parameter to the full path to the library (including the filename), and compile the class.
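
In step 3, the parameter is set in the class definition. A minimal sketch, assuming a Windows installation; the path and file name below are hypothetical and should be replaced with the actual location of the library:

  Parameter DLLLocation = "C:\CacheSys\Bin\regex.dll";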

You can now start using the regular expression object. Example code can be found below.

Usage

Below are a few examples of how the RegEx class can be used. The regular expressions themselves are simple and contrived; the examples are intended to demonstrate how the class is used, not the power of regular expressions. Note that, for brevity, no error checking is shown.

From COS

Simple matches

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="\w+"
  Do RegEx.Match("a sample string")
  Write RegEx.Matches.GetAt(1)_": """_RegEx.GetResult(1)_"""",!

When the above code is run, the following is output:

1,1: "a"

The regular expression \w+ matches one or more ‘word’ characters; the match stops at the first non-word character, which is the space after ‘a’.

Note that the starting index and length of the match are available; this information could be used for substitution operations.
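
As a rough sketch of such a substitution, the following replaces the first match with the word "one". It assumes, as the output above suggests, that each Matches entry is a string of the form start,length. $Piece, $Extract and $Length are standard COS functions; the variable names are illustrative only:

  Set Text="a sample string"
  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="\w+"
  Do RegEx.Match(Text)
  ; Matches.GetAt(1) is assumed to hold "start,length" for the full match
  Set Start=$Piece(RegEx.Matches.GetAt(1),",",1)
  Set Len=$Piece(RegEx.Matches.GetAt(1),",",2)
  ; Rebuild the string: text before the match, the replacement, text after the match
  Set Result=$Extract(Text,1,Start-1)_"one"_$Extract(Text,Start+Len,$Length(Text))
  Write Result,!

Under those assumptions this should write: one sample string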

To match all words in a string, keep calling NextMatch until no more matches are found:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="\w+"
  Do RegEx.Match("a sample string")
  Do {
    Write RegEx.Matches.GetAt(1)_": """_RegEx.GetResult(1)_"""",!
  } While RegEx.NextMatch()

The output of this code is:

1,1: "a"
3,6: "sample"
10,6: "string"

Substring matches

Substrings in regular expressions are defined using parentheses inside the regular expression. Each substring match is the text matched by the part of the expression enclosed in a pair of parentheses. In the Utility.RegEx class, substring matches are returned (in the order in which they appear in the regular expression) in the second and subsequent entries of the Matches array.

The following code demonstrates this:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="(\w+)\s+(\w+)"
  Do RegEx.Match("This is a longer sample string")
  For i=1:1:RegEx.Matches.Count() {
    Write RegEx.Matches.GetAt(i)_": """_RegEx.GetResult(i)_"""",!
  }

In this regular expression, the \s+ signifies one or more whitespace characters. The output is:

1,7: "This is"
1,4: "This"
6,2: "is"

The first entry in the matches array is always the full matched string, in this case a word, followed by whitespace, followed by another word (‘This is’). Subsequent entries are the submatches, in this case the first word (‘This’) and the second word (‘is’).
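
Because the submatch positions are fixed by the pattern, they can also be read directly by index rather than in a loop. A small sketch, using a hypothetical key=value input that is not part of the original examples:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="(\w+)=(\w+)"
  Do RegEx.Match("color=red")
  ; Entry 1 is the full match; entries 2 and 3 are the two submatches
  Write "Key: "_RegEx.GetResult(2)_", value: "_RegEx.GetResult(3),!

Given the indexing described above, this should output: Key: color, value: red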

Named matches

A complicated pattern with many substring definitions can be difficult to maintain and use. PCRE supports the use of named substrings, which are defined as follows: (?P<name_of_substring>substring pattern). The following example demonstrates this:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="(?P<first_word>\w+)\s+(?P<second_word>\w+)"
  Do RegEx.Match("This is a longer sample string")
  For i=1:1:RegEx.Matches.Count() {
    Write RegEx.Matches.GetAt(i)_": """_RegEx.GetResult(i)_"""",!
  }
  Set Name=""
  For  {
    Set Name=RegEx.NamedMatches.Next(Name)
    If Name="" Quit
    Set Index=RegEx.NamedMatches.GetAt(Name)
    Write "Named match """_Name_""": """_RegEx.GetResult(Index)_"""",!
  }

The output of this code is:

1,7: "This is"
1,4: "This"
6,2: "is"
Named match "first_word": "This"
Named match "second_word": "is"

Note that the named matches are merely substring matches, which are returned in the Matches list as well; the value of each named match in the NamedMatches array is simply the index in the Matches list for its name.
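
If the name is known in advance, the lookup can be done in one step without iterating over the array. A short sketch that uses only the calls already shown:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Pattern="(?P<first_word>\w+)\s+(?P<second_word>\w+)"
  Do RegEx.Match("This is a longer sample string")
  ; NamedMatches maps a name to its index in the Matches list
  Set Index=RegEx.NamedMatches.GetAt("second_word")
  Write RegEx.GetResult(Index),!

This should write "is", the same value the loop reports for second_word.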

From SQL

A simple stored procedure, Matches, can be used for ad-hoc queries (for often-used queries, a class-specific query is probably faster). For example, if the interface class is loaded in the Caché SAMPLES namespace, the following (contrived) query can be run in SQL Manager:

SELECT * FROM Sample.Employee WHERE Utility.RegEx_Matches('.*,.*y',Name)=1

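The query returns all employees whose Name contains a comma followed, at any distance, by the letter "y". If names are stored as "Last,First", that is a "y" in the part after the last name.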

The stored procedure can also access saved regular expressions. To specify the ID of a saved regular expression, pass :ID as the first argument (ID should be replaced with the object ID of the saved regular expression). Alternatively, the Name property can be specified in the same way: :Name. (Note that this means that a purely numerical name will be mistaken for an object ID, so don't use names consisting entirely of digits.)

So, to access a saved regular expression with object ID 42:

SELECT * FROM ATable WHERE Utility.RegEx_Matches(':42',AProperty)=1

And to access a regular expression saved as TheName by name:

SELECT * FROM ATable WHERE Utility.RegEx_Matches(':TheName',AProperty)=1
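
A regular expression has to be saved before it can be referenced this way. A minimal sketch, assuming the interface class is a persistent class stored with the standard %Save method and that Name and Pattern are ordinary properties; check the class documentation before relying on this:

  Set RegEx=##class(Utility.RegEx).%New()
  Set RegEx.Name="TheName"
  Set RegEx.Pattern=".*,.*y"
  ; %Save and %Id are the standard Caché persistence methods (assumed to apply here)
  Do RegEx.%Save()
  Write "Saved with ID "_RegEx.%Id(),!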

Download

The following files can be downloaded here:

regex_1.0.zip For Windows: The C source code for the library with build instructions, a compiled version (compiled with the free Microsoft C/C++ compiler Visual C++ Toolkit 2003 on Windows XP SP2), and the interface class.
regex_1.0.tgz For Linux: The C source code for the library with build instructions, a compiled version (compiled with gcc 2.95 against libc6), and the interface class.

Note that neither version includes the source or compiled code of PCRE; if you want to change anything or compile the code yourself, you'll have to get the PCRE source code and compile that first. Build instructions for the Caché interface can be found in the above downloads. The precompiled versions do include the PCRE code.

License

All code presented on this page is copyrighted © 2004 by Gertjan Klein, and currently available under a Creative Commons License. In short, this license requires you to attribute me if you decide to use the code.