Regular Expressions in RPG

One of the things that RPG isn’t particularly good at is string scanning & manipulation.

Many other programming languages support using Regular Expressions (or regex, as they are often referred to). Java, PHP, Node.JS, Python, & Perl have support for regex’s built in.

Regular expressions are a very powerful tools for parsing, analyzing, and manipulating text. It should be noted, however, that with such power also comes the possibility for complexity. Some regular expressions can get VERY VERY complex. See the end of this post for a VERY complex expression.

A true regex master can create a functioning expression that is indistinguishable from modem line noise.

– Unknown (maybe me)

One thing that needs to be stated up front … this blog post isn’t going to try and teach regular expressions … the topic is far to involved.

This post is simply going to give an overview of how to use regular expressions in RPG.

Multiple Approaches

There are a few different approaches to using regular expressions in RPG. Some easier than others. The two most popular are using Posix C function calls and DB2 functions.

C Function Calls

One approach is to use C function calls from within RPG. I’m not a big fan of this way, because it adds complexity before you even start using the expressions.

Something like this …

string = 'a very simple simple simple string' + x'00';
expression = '\\(sim[a-z]le\\) \\1' + x'00';

regex_p = %addr(regex_t);
match_p = %addr(regmatch_t);

// Compile RE
rc=regcomp(regex_p : %addr(expression) : REG_EXTENDED); 

// Execute RE
rc = regexec(regex_p: %addr(string) : nmatch : match_p : 0); Code language: JavaScript (javascript)

As you can see, it’s kind of on the complex side … it requires pointers, x’00’ terminated strings, ‘compiling’ the expression, and then executing it.

A whole lot of work, in my opinion.

DB2 Functions

On the other hand, DB2 provides regular expression functions that eliminates a lot of the extra work that you have to do. These functions can be used anywhere a function can be used, including ‘set’ or ‘values into’ operation.

DB2 provides the following regex functions…

REGEXP_COUNT – Returns a count of the number of times that a pattern is matched.

REGEXP_INSTR – Returns the position (starting or ending) of the pattern.

REGEXP_REPLACE – Returns a modified version of the string where occurrences of the expression pattern is found with the specified replacements.

REGEXP_SUBSTR – returns a substring of the string that matches the expression.

Here’s a simple example of how I used this …

The objective was to scan a SQL statement that created a table or view, and insert ‘OR REPLACE’ in the appropriate location if it didn’t already exist.

The code looked something like this …

exec sql
  set :newStmt = regexp_replace(:orgStmt, '^CREATE\s+(TABLE|VIEW)\s+',
   'CREATE OR REPLACE $1 ',1, 1, 'i');Code language: JavaScript (javascript)

The parameters used are …

  1. Character field to be processed.
  2. Regex to be applied
  3. Replacement text
  4. Starting position to scan
  5. Occurrence number
  6. Regex flags

The statement does the following …

  1. ‘set’ will assign the results of a SQL function to a host variable. In this case, it will go in newStmt.
  2. regexp_replace is the sql function to replace text.
  3. The regex is next and does the following…
    1. “^” means start at the beginning.
    2. Scan for CREATE followed by one or more spaces.
    3. It then looks for the word TABLE or VIEW and holds that in a regex capture group.
  4. If the pattern is found, it will return a variable that has been modified using the 3rd parameter.
    1. The constant ‘CREATE OR REPLACE ‘
    2. ‘$1’ is the value of the first (in this case, only) capture group.
    3. The rest of the text
  5. If the pattern is not found, it will return the original text unmodified.
  6. The regex flag ‘i’ means the expression should be case insensitive.

Note: To use the DB2 regex functions you will need the International Components for Unicode (ICU) LPP installed (5770SS1, Option 39, on IBM i 7.4).

Externalize Expressions

One thing that I’m a big fan of is externalizing the expressions to message files.

There are two major reasons for this …

Multi-National Environments

If your program is running in a multi-national environment (i.e., running in a different CCSID than it was compiled in), and the message file you retrieve the expression has a specific CCSID, you don’t have to worry about your expression not being valid due to differences in character sets.

For example, if your program was compiled in CCSID 37, but is running in CCSID 5029, the lower case characters aren’t going to be correct due to the differences in CCSID.

However, if your message file has CCSID 37 associated with it, when you retrieve the message containing your expression it will automatically be transcoded to the jobs CCSID.

Expression Changes

Another reason to put your expressions in message files is that you can change the expression without having to recompile your program.

This will work so long as you don’t change the inputs & outputs of the expression.

Do use externalized expressions, you would need to …

  1. Create a message file to contain the expressions and give it a specific CCSID. Something like ‘CRTMSGF MSGF(YOURLIB/REGEXMSG) CCSID(37)‘.
  2. Add the expression with a specific message id.
  3. If handling multi-national character sets is important, make sure your job has the correct CCSID.
  4. In your program, use the QMHRTVM api to retrieve the message text and use that as the expression.

Complex Expression

This expression is used to validate RFC822 email addresses:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)Code language: JavaScript (javascript)

7 thoughts on “Regular Expressions in RPG

  1. Mario

    Who say that rpgle is not good manipulating strings?
    %scan %Len %size
    %replace %editc
    %lookup
    %subset
    %trim

    Reply
    1. David Post author

      You could use RPG built in functions to do the string manipulation … but it will take multiple operations to perform what a single regular expression can accomplish.

      Reply
  2. Steve Richter

    that is actually pretty nice. regexp_substr both matches and returns the match. Would be nice if the RPG compiler was smart enough to optimize built in SQL functions, where the performance was on par with RPG built in functions.

    Reply
  3. Diogo

    Great post, it helped me a lot!

    Still, I have a problem concerning the CCSID. My job CCSID is 500 and when I use the RegEx DB2 it doesn’t work correctly. However, when I change the CCSID to 37 it works fine.

    I don’t want to create a new message file to store the expressions like you suggest. It’s a good solution but my supervisors don’t like the idea of creating new message files.

    I’m thinking of creating a program that
    1 – Change the CCSID to 37
    2 – Execute regexp_replace
    3 – Change the CCSID back to 500 (the default)

    Do you see any problems with this solution? To be honest I don’t understand the implications of changing the CCSID, even if momentarily.

    Or do you have another suggestion?
    Thank you!

    Reply
    1. David Post author

      Are you putting your regular expressions in a message file?

      If so, make sure you tag the message file with a specific CCSID. That way, when you retrieve a regex from the message file, it will be transcoded to your jobs CCSID automatically.

      Reply
        1. David Post author

          TBH, I’ve always used message files when doing regular expressions in with SQL in RPG.

          The expressions should work fine if your program is compiled and run in the same CCSID.

          Reply

Leave a Reply

Your email address will not be published. Required fields are marked *