Regular Expression Tutorial
Regular expressions are extremely useful for parsing data, and IT professionals
should take advantage of this powerful tool. There are many different
ways to use regular expressions to accomplish a wide variety of tasks.
We often have to perform analysis on log or data files to determine the
source of a problem. In the following example we will analyze a log file
for errors and report the number of occurrences.
Data to Analyze
The following is based on the default W3C Extended Log File Format generated by IIS.
(C:\WINNT\system32\Logfiles\W3SVC1)
11:57:17 127.0.0.1 GET /Website/images/logo.jpg 304
11:57:17 127.0.0.1 GET /Website/images/pic.gif 304
12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500
12:01:35 127.0.0.1 GET /Website/css/main.css 304
12:01:35 127.0.0.1 GET /Website/src/common.js 304
12:01:35 127.0.0.1 GET /Website/process/page3.asp 200
12:01:45 127.0.0.1 GET /Website/process/page3.asp 500
12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500
12:01:45 127.0.0.1 GET /Website/process/page4.asp 500
12:01:45 127.0.0.1 GET /Website/process/page5.asp 200
Step 1 - Analyze the data and determine the information to extract.
Examine the data for a pattern. This is not always an easy task, but
our log file has a simple pattern:
TIME IPADDRESS METHOD URL STATUS
After you have established the pattern, determine what information you would like to extract.
For this example we will extract a distinct list of URLs with a 500 (internal server error) status.
The output we would like to see is:
URL COUNT
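The field layout above maps naturally onto a small parsing helper. The following is a minimal sketch, assuming the fields are separated by whitespace and always appear in the order shown. The function name parseLogLine and the property names are illustrative only and are not part of IIS or LargeEdit.

```javascript
// Hypothetical helper: split one log line into its five fields.
// Assumes whitespace-separated fields in the order
// TIME IPADDRESS METHOD URL STATUS.
function parseLogLine(line) {
  // Trim leading/trailing whitespace, then split on runs of whitespace.
  var parts = line.replace(/^\s+|\s+$/g, '').split(/\s+/);
  if (parts.length !== 5) {
    return null; // not a line in the expected layout
  }
  return {
    time: parts[0],
    ip: parts[1],
    method: parts[2],
    url: parts[3],
    status: parts[4]
  };
}

var entry = parseLogLine('12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500');
// entry.url is '/Website/subfolder/page1.asp' and entry.status is '500'
```

Splitting into fields like this is handy for exploring the data, while the regular expression approach used in the rest of this tutorial extracts only the URL and status in a single pass.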
Step 2 - Construct a regular expression
The syntax for a regular expression can be difficult to understand, but
there is plenty of help and many samples online. The syntax may vary depending
on the programming language you are using, so check the documentation.
The regular expression for extracting our data is as follows:
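(?:GET|POST)\s+(\S+)\s+500
| (?:GET|POST) | Find text starting with either GET or POST. A group is used here because square brackets, as in [GET|POST], would match only a single character from that set |
| \s+ | Followed by one or more whitespace characters |
| (\S+) | Match one or more (+) non-whitespace characters (\S) and group them (); this is the URL |
| \s+500 | Stop when whitespace followed by 500 (the status code) occurs |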
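To see the pattern in action, here is a small sketch that runs it against two of the sample lines. It uses the alternation form (?:GET|POST), since square brackets define a character class that matches one character rather than a whole word, and \S+ so that the captured URL carries no trailing space. The variable names are illustrative only.

```javascript
// Sketch: apply the extraction pattern to individual log lines.
var pattern = /(?:GET|POST)\s+(\S+)\s+500/;

// A request that failed with status 500.
var failed = pattern.exec('12:01:45 127.0.0.1 GET /Website/process/page4.asp 500');

// A request that succeeded with status 200.
var ok = pattern.exec('12:01:45 127.0.0.1 GET /Website/process/page5.asp 200');

// failed[1] holds the captured URL; ok is null because the status is not 500.
```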
Step 3 - Choose a language to implement your data extraction
This expression will extract the matching data, but some programming code needs to be added to
count the distinct matches. Many scripting languages can perform the task, but many of them
require downloading, installing, licensing, and so on. If you are using a Microsoft Windows based
operating system, you already have a powerful scripting engine installed and available to use. Using
a text editor (I recommend LargeEdit), you can write JScript code and execute it directly against the scripting engine (WScript.exe).
Step 4 - Create the Script to extract the data
This task can be simplified by using LargeEdit 2.0. We will create a scripting macro based on JScript and execute it within the scripting engine against an open log file.
- Open LargeEdit
- Select Tools from the main menu
- Select Script Macros and click Create
- Enter the following Script in the editor window
- Save your Script as "distinctlistcounts.js"
You can also download the file distinctlistcounts.js
// Counts the distinct URLs with a 500 status in the file open in LargeEdit.
function Run() {
    LargeEdit.ResultLog(' Regular Expression Search');

    var inpStr = LargeEdit.CurrentFile.Text;
    // Capture the URL that follows GET or POST and precedes a 500 status.
    var oRe = new RegExp("(?:GET|POST)\\s+(\\S+)\\s+500", "g");
    var arr;
    var idxarray = new Array(); // distinct URLs, in the order first seen
    var cntarray = new Array(); // number of occurrences of each distinct URL
    var idx;
    var cnt = 0;      // total matches
    var distinct = 0; // distinct matches

    while ((arr = oRe.exec(inpStr)) != null) {
        cnt++;
        idx = idxarray.indexOf(arr[1]);
        if (idx < 0) {
            // First time this URL is seen.
            idxarray.push(arr[1]);
            cntarray.push(1);
            distinct++;
        } else {
            cntarray[idx]++;
        }
    }

    LargeEdit.ResultLog('');
    LargeEdit.ResultLog('Total matches found ' + cnt);
    LargeEdit.ResultLog('Total distinct matches found ' + distinct);
    LargeEdit.ResultLog('');
    LargeEdit.ResultLog('List of distinct matches with counts');
    LargeEdit.ResultLog('');
    for (var i = 0; i < idxarray.length; i++) {
        LargeEdit.ResultLog(idxarray[i] + ' = ' + cntarray[i]);
    }
}

// Case-insensitive search of an array of strings.
// Returns the index of the first match, or -1 if there is none.
function Array_indexOf(text) {
    var res = -1;
    for (var i = 0; i < this.length; i++) {
        if (text.toUpperCase() == this[i].toUpperCase()) {
            res = i;
            break;
        }
    }
    return res;
}

// Make the helper available as indexOf on every array.
Array.prototype.indexOf = Array_indexOf;
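The counting logic itself does not depend on LargeEdit. As a rough sketch for trying it elsewhere, the version below moves the counting into a plain function and keeps the counts in an object keyed by the upper-cased URL, so comparisons stay case-insensitive like the macro's. The function name countErrorUrls, its return shape, and the script file name mentioned in the comment are assumptions made for illustration. The final block only runs under Windows Script Host and is skipped in other JavaScript environments.

```javascript
// Sketch: the same counting logic without any LargeEdit calls.
// countErrorUrls and its return shape are illustrative names, not an existing API.
function countErrorUrls(text) {
  var re = /(?:GET|POST)\s+(\S+)\s+500/g;
  var order = [];   // distinct URLs in first-seen order
  var counts = {};  // upper-cased URL -> number of occurrences
  var total = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    var key = m[1].toUpperCase(); // compare URLs case-insensitively
    if (!Object.prototype.hasOwnProperty.call(counts, key)) {
      counts[key] = 0;
      order.push(m[1]);
    }
    counts[key]++;
    total++;
  }
  var lines = [];
  for (var i = 0; i < order.length; i++) {
    lines.push(order[i] + ' = ' + counts[order[i].toUpperCase()]);
  }
  return { total: total, distinct: order.length, lines: lines };
}

// Optional: under Windows Script Host, for example
//   cscript standalonecounts.js example.txt
// (standalonecounts.js is a hypothetical file name), read the file named
// on the command line and print the report.
if (typeof WScript !== 'undefined' && WScript.Arguments.length > 0) {
  var fso = new ActiveXObject('Scripting.FileSystemObject');
  var file = fso.OpenTextFile(WScript.Arguments(0), 1); // 1 = open for reading
  var report = countErrorUrls(file.ReadAll());
  file.Close();
  WScript.Echo('Total matches found ' + report.total);
  WScript.Echo('Total distinct matches found ' + report.distinct);
  WScript.Echo(report.lines.join('\r\n'));
}
```

Keeping the counting in a function that takes text and returns a result makes it easy to check against a few sample lines before pointing it at a large log file.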
Step 5 - Execute the scripting macro
Using LargeEdit we can now execute this scripting macro against any file open in the editor.
- Open the log file using LargeEdit
You can also download the example log file example.txt
- Select Tools from the main menu
- Select Script Macros and click Play
- Browse for the scripting macro you saved in the last step, distinctlistcounts.js
- The Script will execute and report the information in the result window.

Summary
Using regular expressions and scripting languages can save you huge amounts of time, and you can quickly
reproduce your data analysis results on any number of files. If this task is something you perform
regularly, you can create a custom toolbar and add a button to execute your script (see Custom Toolbar tutorial).