Thursday, July 13. 2006
Tips for getting the most from regex.
Developing regular expressions isn't easy so there is nothing to be gained by making it harder still. Here are seven tips to make it as easy as possible.
1 - Don't re-invent the wheel
If a quick search of the internet, or a look through the source code of one of your own projects or somebody else's, can save you some time then do it. I'm not suggesting you pick up a pattern from just anywhere, but one taken from a source you trust should save you some time.
The availability of regular expressions on the internet and in the source code of open source projects is no reason not to learn the basics of regex. After all, if you can't understand a regular expression, how can you know whether it will do what you want it to?
2 - Use a reference sheet
Bookmark or print off a reference sheet which explains the meaning of the special characters so you have something to refer to.
3 - Break your expression up and use comments
Just as you wouldn't squeeze an entire script onto one line, don't put an entire regular expression on one line; break it up with whitespace. This makes it easier to create and also easier to troubleshoot or develop further, whether tomorrow or in six months' time. This is achieved by using a modifier with the regular expression. The modifier is 'x' and it is simply placed at the end of the regular expression, after the closing delimiter. With it set, whitespace in the pattern is ignored, so a literal space has to be escaped or placed inside a character class.
Comments are as important for a regular expression as they are for a script in general. With your pattern broken up over several lines, adding comments couldn't be easier: under the 'x' modifier, everything from an unescaped # to the end of the line is treated as a comment.
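As a rough illustration of the idea, here is a minimal sketch in Python. Python is an assumption on my part, since the tip itself is written in terms of delimiters and a trailing modifier; in Python's re module the counterpart of 'x' is the re.VERBOSE flag. The date pattern and the sample sentence are made up for the example.

```python
import re

# Hypothetical pattern for illustration: an ISO-style date such as 2006-07-13.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -                  # literal separator
    (?P<month>\d{2})   # two-digit month
    -                  # literal separator
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,  # Python's counterpart of the trailing 'x' modifier
)

match = DATE.search("Posted on 2006-07-13 by the author")
print(match.group("year"), match.group("month"), match.group("day"))
```

Each piece of the pattern sits on its own line with a note beside it, which is what makes it readable six months later.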
4 - Test your expression
Before you even begin writing a regular expression you should stop and consider exactly what you want it to accept and what you don't want it to accept. Deciding exactly what to match is a tradeoff between making false matches and missing valid matches. If your regular expression is too strict you may miss valid matches; if it is too loose it will make false matches. How strict you make your pattern will depend on your objectives: whether you can handle false matches and whether you can tolerate missing valid ones.
Once you have decided what you want your pattern to match and what to ignore, and once you have written a pattern, you need to test it. For this you need some sample data. Ideally this will be real data, but if that isn't available to you then you need to get creative. The sample data should include your ideal match, a string not resembling your pattern in the slightest and several options in between. These tests are best performed in a small test script that looks specifically at the regular expression.
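A small test script of the kind described might look like the sketch below. It is written in Python as an assumption, and the pattern under test (a 24-hour time) and the sample strings are invented for the example; substitute your own pattern and, where you can, real data.

```python
import re

# Hypothetical pattern under test: a 24-hour time such as 09:30 or 23:59.
TIME = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")

# Sample data: each string is paired with whether the pattern should accept it.
SAMPLES = [
    ("09:30", True),    # the ideal match
    ("23:59", True),    # upper boundary
    ("00:00", True),    # lower boundary
    ("24:00", False),   # hour out of range
    ("12:60", False),   # minute out of range
    ("9:30", False),    # missing leading zero
    ("hello", False),   # nothing like the pattern
]

def run_samples(pattern, samples):
    """Return the samples whose result differs from the expectation."""
    failures = []
    for text, expected in samples:
        matched = pattern.fullmatch(text) is not None
        if matched != expected:
            failures.append((text, expected, matched))
    return failures

failures = run_samples(TIME, SAMPLES)
for text, expected, matched in failures:
    print(f"{text!r}: expected {expected}, got {matched}")
print(f"{len(SAMPLES) - len(failures)} of {len(SAMPLES)} samples behaved as expected")
```

Keeping the expectations next to the data means that when you loosen or tighten the pattern later you can see straight away which samples changed sides.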
5 - Use lazy operators
If you have ever used a negated character set you may find you would have been better off using lazy operators. For some reason many people avoid lazy operators, but they can be very useful and are not at all difficult to understand or use.
A lazy operator is a repetition operator such as + or * with a question mark added after it. It changes the behaviour of the operator so that, instead of matching as much text as possible while still allowing the overall match to succeed, it matches as little text as possible.
Let's take an example and look at a spider being used to gather links from a webpage. The text we are looking to match would look something like this:
<a href="http://www.newsite.com/newpage.html">Text</a>
We could match this using a negated character set as in the following:
#<a\shref="[^"]+">[^<]+</a>#
If we could replace those negated character sets with the dot operator this regular expression would be far simpler. Without lazy operators, though, such a pattern would start matching at the first link on the page and then continue, including all the text in between, until the last link is encountered. To get around this problem we have lazy operators. Instead of matching all the text between the first link and the last, the lazy operator means that as little text is matched as possible while still supporting the match. In this case each link will be recognised as an individual match. Using lazy operators the pattern looks like this:
#<a\shref=".+?">.+?</a>#
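To see the difference in practice, the sketch below runs three versions of the pattern over a made-up snippet with two links. It uses Python as an assumption, so the # delimiters are dropped; the greedy version is added only for comparison, and the page text and URLs are hypothetical.

```python
import re

# Hypothetical sample page containing two links on one line.
page = (
    '<p><a href="http://www.example.com/one.html">One</a> and '
    '<a href="http://www.example.com/two.html">Two</a></p>'
)

# Greedy dot: runs from the first link through to the end of the last one.
greedy = re.findall(r'<a\shref=".+">.+</a>', page)

# Lazy dot: stops at the earliest point that still lets the match succeed.
lazy = re.findall(r'<a\shref=".+?">.+?</a>', page)

# Negated character sets: the version shown earlier in this tip.
negated = re.findall(r'<a\shref="[^"]+">[^<]+</a>', page)

print(len(greedy))      # number of matches with greedy operators
print(len(lazy))        # number of matches with lazy operators
print(lazy == negated)  # whether lazy and negated agree on this sample
```

On this sample the lazy and negated versions find the same two links, while the greedy version swallows both in a single match. The two are not interchangeable in every case, though: a lazy dot will still run past a quote if that is the only way to complete the match (for example when the tag carries a second attribute), whereas the negated set would simply fail there.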
6 - Use grouping to simplify using the alternation operator
Because of the low precedence of the alternation operator (|) it is all too easy to alternate over more than intended. The best way to avoid such problems is to add parentheses around the alternate sub-patterns, as in the example below. Using the letters a, b and c to represent sub-patterns, and assuming we want to match either ac or bc, a common error would be a regular expression like this:
a|bc
This would match a or bc rather than ac or bc as intended. A better regular expression would be:
(a|b)c
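The two patterns can be compared directly. The following is a minimal sketch in Python (an assumption, as above), using fullmatch so that each test string has to be matched in its entirety.

```python
import re

loose = re.compile(r"a|bc")       # read as: "a" OR "bc"
grouped = re.compile(r"(a|b)c")   # read as: ("a" OR "b") followed by "c"

results = {}
for text in ["ac", "bc", "a"]:
    # Record whether each pattern accepts the whole string.
    results[text] = (
        loose.fullmatch(text) is not None,
        grouped.fullmatch(text) is not None,
    )
    print(text, results[text])
```

The ungrouped pattern rejects ac and accepts a lone a, which is the reverse of what was intended. If you don't need to capture which alternative matched, a non-capturing group, (?:a|b)c, gives the same grouping without creating a capture.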
7 - Minimise the need for escaping when choosing delimiters
Every regular expression begins and ends with a delimiting character. Although a forward slash is a popular choice it isn't the only choice, and it can sometimes be a real inconvenience, as every forward slash that appears in the pattern needs to be escaped with a leading backslash. If you are attempting to match HTML or XML, where every closing tag includes a forward slash, this could rapidly become annoying. The best solution is simply to use a different delimiter. Try #, ! or | for example.
