This document describes a regular expression DLL/shared object with a class interface for Caché (version 5, non-unicode) ObjectScript (COS), both for Windows and Linux.
Regular expressions originated in the Unix world and are a way of matching text against a pattern (the regular expression), similar to COS pattern matching. However, regular expressions have features that COS patterns lack. For example, in regular expressions substrings of the matched text may be defined and returned.
Caché has no native support for regular expressions, and implementing them in COS would be non-trivial. However, many free implementations in C source code are available. Unfortunately, Caché cannot easily call arbitrary external DLLs; they have to be compiled specifically for it.
The code described on this page implements a C DLL (for Windows) or shared object (for Linux) that is callable from Caché. The actual regular expression code is taken from the Perl Compatible Regular Expressions library (version 5.0). The code is linked statically, i.e., no DLLs other than the supplied one are needed for operation. A class interface to the DLL is provided that takes care of loading and unloading the DLL, and passing/retrieving data in the proper manner.
The Perl Compatible Regular Expressions (PCRE) library is written by Philip Hazel, and provides a rich set of features that exceed those of traditional Unix regular expressions. It is, as its name implies, feature-compatible with the Perl implementation. It is used in many large open source projects (Python, Apache, and PHP, to name a few).
The PCRE library is supplied in C source code form. The supplied makefiles are strongly Unix-based, as its author does not use Windows. The PCRE web site hosts some contributed instructions and/or makefiles for compiling PCRE on Windows with various compilers. Most seem somewhat outdated, though.
The PCRE manual is available from its web site only in text form. The source code, however, contains a version in HTML. For convenience, I've made it available here.
Features of the provided class and DLL include:
- Support for substrings (returned in the Matches property) and named substrings (returned in the NamedMatches property).
- The regular expression is set through the Pattern property.
- The Utility.RegEx class is persistent; when saved, it will save both the pattern and its compiled version. When reopening the object, the precompiled pattern will be passed back to the DLL. For often-used, large and complicated patterns, this may increase performance. Saving patterns also allows the creation of a pattern library, allowing refinements to a pattern to be picked up automatically by all code that uses it. The Name and Description properties and the GetObjectByPattern and GetObjectByName methods facilitate pattern libraries.
- Successive matches in the same string can be retrieved using the NextMatch method.
- A simple Test classmethod, as well as more comprehensive object methods.
- The position and length of each match are returned (in the Matches property), as well as the actual matched string (using the GetResult convenience method). Named matches are stored (in the NamedMatches array) as an index in the Matches list; this index can then be used as argument to the GetResult method.
- A stored procedure, Matches, is provided to facilitate running ad-hoc regular expression-based queries from SQL.
An online copy of the interface class's documentation can be found here.
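The persistence feature described above could be used roughly as follows. This is a minimal sketch, not taken from the class documentation: %Save(), %Id() and %OpenId() are the standard methods of persistent Caché classes, and the name and description values are made up for the example.

```objectscript
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Name="Word"                        ; hypothetical library name
    Set RegEx.Description="Matches a single word"
    Set RegEx.Pattern="\w+"
    ; Save the object; a failed save returns a status that evaluates to false
    Set Status=RegEx.%Save()
    If 'Status Write "Saving the pattern failed",!
    Set Id=RegEx.%Id()
    Kill RegEx
    ; Reopen the saved object; the stored, precompiled pattern is reused
    Set RegEx=##class(Utility.RegEx).%OpenId(Id)
    Do RegEx.Match("a sample string")
    Write RegEx.GetResult(1),!
```

The object ID obtained this way is what a saved pattern is referred to by later on, for instance from the SQL stored procedure.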
Limitations of the provided class and DLL are:
As the provided library does not depend on other libraries (other than standard operating system components), it can be placed anywhere. The interface class's DLLLocation parameter should be set to the full path to the DLL or shared object (including the filename) before compiling.
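In the class definition, the parameter would then look something like the line below. The path and filename shown are hypothetical; replace them with the actual location of the library on your system (on Linux, the full path to the shared object).

```objectscript
/// Hypothetical location; must include the filename of the library
Parameter DLLLocation = "C:\CacheSys\Bin\regex.dll";
```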
So, to start using the regular expression object, set the DLLLocation parameter to the full path to the library (including the filename), and compile the class.
You can now start using the regular expression object. Example code can be found below.
Below are a few examples of how the RegEx class can be used. The actual regular expressions used are simple and contrived; the examples are intended to demonstrate the usage of the class, and by no means demonstrate the power of regular expressions. Note that, for brevity, no error checking is shown.
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="\w+"
    Do RegEx.Match("a sample string")
    Write RegEx.Matches.GetAt(1)_": """_RegEx.GetResult(1)_"""",!
When the above test is run, the following will be output:
1,1: "a"
The regular expression \w+ matches one or more ‘word’ characters; the match stops at the first non-word character, which is the space after ‘a’.
Note that the starting index and length of the match are available; this information could be used for substitution operations.
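As a minimal sketch of such a substitution, the code below replaces the first word of a string with another one. It assumes, based on the output shown above, that each entry in the Matches list is a string of the form "start,length" with a 1-based start position; the variable names and the replacement text are made up for this example.

```objectscript
    Set Subject="a sample string"
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="\w+"
    Do RegEx.Match(Subject)
    ; Assumed format of a Matches entry: "start,length" (1-based start)
    Set Info=RegEx.Matches.GetAt(1)
    Set Start=$Piece(Info,",",1)
    Set Len=$Piece(Info,",",2)
    ; Keep the text before and after the match, and replace the match itself
    Set Before=$Extract(Subject,1,Start-1)
    Set After=$Extract(Subject,Start+Len,$Length(Subject))
    Write Before_"another"_After,!
```

If the assumption about the entry format holds, this writes "another sample string".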
To match all words in a string, keep calling NextMatch until no more matches are found:
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="\w+"
    Do RegEx.Match("a sample string")
    Do {
        Write RegEx.Matches.GetAt(1)_": """_RegEx.GetResult(1)_"""",!
    } While RegEx.NextMatch()
The output of this code is:
1,1: "a"
3,6: "sample"
10,6: "string"
Substrings in regular expressions are defined using parentheses inside the regular expression. Each substring captures the text matched by the part of the expression delimited by the parentheses. In the Utility.RegEx class, substring matches are returned (in the order in which they appear in the regular expression) in the second and subsequent entries of the Matches list.
The following code demonstrates this:
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="(\w+)\s+(\w+)"
    Do RegEx.Match("This is a longer sample string")
    For i=1:1:RegEx.Matches.Count() {
        Write RegEx.Matches.GetAt(i)_": """_RegEx.GetResult(i)_"""",!
    }
In this regular expression, the \s+ signifies one or more whitespace characters. The output is:
1,7: "This is"
1,4: "This"
6,2: "is"
The first entry in the matches array is always the full matched string, in this case a word, followed by whitespace, followed by another word (‘This is’). Subsequent entries are the submatches, in this case the first word (‘This’) and the second word (‘is’).
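Because the submatches are available individually, they can be recombined. The following sketch, which uses only the calls shown in the example above, writes the two matched words in reverse order:

```objectscript
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="(\w+)\s+(\w+)"
    Do RegEx.Match("This is a longer sample string")
    ; Entry 1 is the full match; entries 2 and 3 are the two substrings
    Write RegEx.GetResult(3)_" "_RegEx.GetResult(2),!
```

Given the output listed above, this should write "is This".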
A complicated pattern with many substring definitions can be difficult to maintain and use. PCRE supports the use of named substrings, which are defined as follows: (?P<name_of_substring>substring pattern).
The following example demonstrates this:
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="(?P<first_word>\w+)\s+(?P<second_word>\w+)"
    Do RegEx.Match("This is a longer sample string")
    For i=1:1:RegEx.Matches.Count() {
        Write RegEx.Matches.GetAt(i)_": """_RegEx.GetResult(i)_"""",!
    }
    Set Name=""
    For {
        Set Name=RegEx.NamedMatches.Next(Name)
        If Name="" Quit
        Set Index=RegEx.NamedMatches.GetAt(Name)
        Write "Named match """_Name_""": """_RegEx.GetResult(Index)_"""",!
    }
The output of this code is:
1,7: "This is"
1,4: "This"
6,2: "is"
Named match "first_word": "This"
Named match "second_word": "is"
Note that the named matches are merely substring matches, which are returned in the Matches list as well; the value of each named match in the NamedMatches array is simply the index in the Matches list for its name.
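When only one named substring is of interest, there is no need to loop over the array. The sketch below looks up a single name directly; it assumes that NamedMatches behaves like a standard Caché array collection, whose GetAt method returns an empty string for a key that is not present.

```objectscript
    Set RegEx=##class(Utility.RegEx).%New()
    Set RegEx.Pattern="(?P<first_word>\w+)\s+(?P<second_word>\w+)"
    Do RegEx.Match("This is a longer sample string")
    ; Look up the index stored under the name, then fetch the matched text
    Set Index=RegEx.NamedMatches.GetAt("second_word")
    If Index'="" Write "second_word: """_RegEx.GetResult(Index)_"""",!
```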
A simple stored procedure, Matches, can be used for ad-hoc queries (for often-used queries, a class-specific query is probably faster). For example, if the interface class is loaded in the Caché SAMPLES namespace, the following (contrived) query can be run in SQL Manager:
SELECT * FROM Sample.Employee WHERE Utility.RegEx_Matches('.*,.*y',Name)=1
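The query returns all employees whose name has the letter "y" somewhere after the comma. As names in Sample.Employee are stored as "last name,first name", this means a "y" in the first name.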
The stored procedure can also access saved regular expressions. To specify the ID of a saved regular expression, pass :ID as the first argument (ID should be replaced with the object ID of the saved regular expression). Alternatively, the Name property can be specified in the same way: :Name. (Note that this means that a purely numerical name will be mistaken for an object ID, so don't use names consisting entirely of digits.)
So, to access a saved regular expression with object ID 42:
SELECT * FROM ATable WHERE Utility.RegEx_Matches(':42',AProperty)=1
And to access a regular expression saved as TheName by name:
SELECT * FROM ATable WHERE Utility.RegEx_Matches(':TheName',AProperty)=1
The following files can be downloaded here:
regex_1.0.zip | For Windows: The C source code for the library with build instructions, a compiled version (compiled with the free Microsoft C/C++ compiler Visual C++ Toolkit 2003 on Windows XP SP2), and the interface class. |
regex_1.0.tgz | For Linux: The C source code for the library with build instructions, a compiled version (compiled with gcc 2.95 against libc6), and the interface class. |
Note that neither version includes the source or compiled code of PCRE; if you want to change anything or compile the code yourself, you'll have to get the PCRE source code and compile that first. Build instructions for the Caché interface can be found in the above downloads. The precompiled versions do include the PCRE code.
All code presented on this page is copyrighted © 2004 by Gertjan Klein, and currently available under a Creative Commons License. In short, this license requires you to attribute me if you decide to use the code.