Regular expressions are patterns used to search or search and replace in a text.
The purpose of this post is to motivate the use of RegEx in our real life, showing examples of common uses.
Jump to the Cheap Sheet section before continuing if you are not familiarized with the RegEx syntax.
Contents:
- Search: Searching for a class in C# with prefix and suffix
- Replace: Refactoring some HTML code
- Validation: Web Api validation in .NET using models
- Web scraping
Searching for a class in C# with prefix and suffix
Problem
In this example, we can consider that we have several classes that contains a prefix and suffix in the name of the class. For example, it’s very common that developers write sometimes the name of the project as the prefix and the base class as suffix:
Solution
Tips
- Use \s+ in order to skip as many spaces between "class" and the name of the class in case we have more than one space.
- [\d\w]+ means that we can have a sequence (1..infinity) marked by + which contains digits or words used for the name of the adapter.
- Use parentheses to organize our search and be more readable. But they are optional.
Refactoring some HTML code
Problem
We would like to refactor some HTML code. For this example, we have some old code that uses paragraphs and classes to define titles and subtitles in certain text.
Solution
Result
Tips
- Use (.?) if you don't know there is something before the class you are looking for — same rule for after. Doing (.?)WHATEVER(.?), looks, in normal search, like WHATEVER*.
- Instead, use (.?) at the close of the HTML tag, use > to define that until you get to the next >.
- Group what you want to reuse in the replacement. In our case, we group the content between the paragraphs in the third group (\$3).
Web Api validation in .NET using models
Problem
We would like to include validation in the attributes of our models when the client makes a call to our API. Adding validation on the individual attributes of our models, we skip adding more logic inside our controllers and make them focus on the business logic.
Solution
We would like to have at least one digit, one uppercase letter and one lowercase letters. The minimal amount of digits is 6, but this can be validated with another attribute on .NET. RegEx: {'^(?=.[a-z])(?=.[A-Z])(?=.*\d)[a-zA-Z\d]{6,}\$'}
Tips
- Using Lookahead we can verify some conditions before match the text
- With (?=.*[a-z]) we will look ahead that the text contains at least one lowercase character
- With (?=.*[A-Z]) we will look ahead that the text contains at least one lowercase character
- With (?=.*\d) we will look ahead that the text contains at least one lowercase character
- [a-zA-Z\d]* is the "real" match, but it does not contains the logic to understand if there is at least one minimal appearance of a lowercase letter, uppercase letter and a digit
In this example, we would like to parse information from a website (HTML) and recollect it inside some objects in our C# application.
Without focusing on the example to scrap, let's say we created a regular expression that recognizes items in a list, where:
- the first group correspond to the name of a product;
- the second one correspond to its composition;
- and the third one to its price.
Web scraping
Problem
Using .NET C#, we created a class to recreate the information that we want to obtain:
Using the Regex class and iterating over the Match class, we can obtain the information of the search for each found on the input text (HTML):
Common Search Regex cheat sheet
Regex | Description | Example |
---|---|---|
\d | Match a number from 0 to 9 (\D match the opposite) | \d\d\d |
\w | Match a word character from A-Z and a-z (\W match the opposite) | \w\w\w |
\s | Match any space character but next line | \s\s\s |
\n | Match next line | \n\n\n |
. | Match any character but next line | ... |
* | Repetition of character (or group) from 0 to infinity | \d* |
+ | Repetition of character (or group) from 1 to infinity | \d+ |
? | Repetition of character (or group) from 0 to 1 | \d? |
{N1}, {N1,N2} | Repetition from N1 to N2 | d{1,5} |
[...] | Explicit set of characters that can match | [abc] means (a |
[^...] | Explicit set of characters that cannot match | [^abc] means !(a |
Common Replace Regex cheat sheet
Regex | Description | Example |
---|---|---|
(...) | Grouping sequence | |
$N | Used on replace section means replace for the group number N | Search: hello (\w+) Replace: goodbye $1 |
\N | Used on search means to match the same Regex than group number N | Search: <(\w+)>.* |
Lookahead and Lookbehind cheat sheet
Regex | Description | Example |
---|---|---|
(?=X) | Lookahead: Asserts that what immediately follows the current position in the string is X | (?=foo) |
(?<=X) | Lookbehind: Asserts that what immediately precedes the current position in the string is X | (?<=foo) |
(?!X) | Negative Lookahead: Asserts that what immediately follows the current position in the string is not X | (?!foo) |
(?<!X) | Negative Lookbehind: Asserts that what immediately precedes the current position in the string is not X | (?<!foo) |
References
- http://www.rexegg.com/: These cheat sheets and more information can be found here.