Sunday, 18 August 2019

Regular expression tips and tricks

Regex:

Literal characters: 

  • Alphanumeric character: itself
  • \0: null character
  • \t: tab
  • \n: newline
  • \v: vertical tab
  • \f: form feed
  • \r: carriage return
  • \xnn: Latin character specified by the two-digit hexadecimal number nn
  • \uxxxx: Unicode character specified by the four-digit hexadecimal number xxxx
  • \cX: Control character ^X (for example, \cJ is the same as \n)

Character classes: 

  • /[abc]/: Any of a, b or c.
  • /[^abc]/: Any character except a, b or c.
  • /[a-z]/: Any char from a to z.
  • /[a-zA-Z0-9]/: Any char from a-z, A-Z or 0-9.

  • Note: ^ and $ have different meanings inside character classes. ^ at the start of a class means negation, and $ is just a literal dollar sign.
    Eg: /a[b$]/ matches "a$" or "ab"


    Classes:
  • \w: /[a-zA-Z0-9_]/ Any ASCII word character (a-z, A-Z, 0-9 or underscore).
  • \W: /[^a-zA-Z0-9_]/ Any character that is not an ASCII word character.
  • \s: Any Unicode whitespace character.
  • \S: Any character that is not Unicode whitespace.
  • \d: /[0-9]/ Any ASCII digit.
  • \D: /[^0-9]/ Any character other than an ASCII digit.
  • [\b]: A literal backspace (special case).
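
A quick sketch of a few of these classes in action. The sample strings are made up for illustration.

```javascript
// Hypothetical sample strings, chosen only to illustrate the classes above.
const hasDigit = /\d/.test("room 42"); // true: "4" is an ASCII digit

// \w includes "_" but not "-", so "kebab-case" is split into two matches.
const words = "snake_case and kebab-case".match(/\w+/g);

// A plain character class: remove every lowercase vowel.
const noVowels = "regex".replace(/[aeiou]/g, "");

// Inside [...] the "$" is a literal dollar sign, not an end-of-string anchor.
const dollarInClass = /a[b$]/.test("a$");

console.log(hasDigit, words, noVowels, dollarInClass);
// true [ 'snake_case', 'and', 'kebab', 'case' ] rgx true
```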

Repetition:

  • {n,m}: Match the previous item at least n times but no more than m times.
  • {n,}: Match the previous item n or  more times.
  • {n}: Match the previous item exactly n times.
  • ?: Previous item zero or one time.
  • +: Previous item one or more times.
  • *: Previous item zero or more times.
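
Each repetition operator can be tried out with a tiny pattern. This is only a sketch: the patterns and variable names are invented for the example, and ^ and $ are added so the whole string has to match.

```javascript
// Illustrative patterns only; the names are hypothetical.
const zip = /^\d{5}$/;         // {n}: exactly five digits
const username = /^\w{3,8}$/;  // {n,m}: three to eight word characters
const atLeastTwo = /^a{2,}$/;  // {n,}: two or more "a"
const optionalU = /^colou?r$/; // ?: "u" zero or one time
const oneOrMore = /^go+gle$/;  // +: "o" one or more times
const zeroOrMore = /^ab*c$/;   // *: "b" zero or more times

console.log(zip.test("12345"), zip.test("1234"));               // true false
console.log(username.test("abc"), username.test("ab"));         // true false
console.log(atLeastTwo.test("aaa"), atLeastTwo.test("a"));      // true false
console.log(optionalU.test("color"), optionalU.test("colour")); // true true
console.log(oneOrMore.test("google"), oneOrMore.test("ggle"));  // true false
console.log(zeroOrMore.test("ac"), zeroOrMore.test("abbbc"));   // true true
```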

Non-greedy repetition:

Turn any repetition into a non-greedy one by adding a "?" after the repetition character.
Eg: ??, +?, *?, {1,5}?
  • Example1:
  • /a+/ --> "aaa" = aaa
  • /a+?/ --> "aaa" = a

    But always keep in mind that a non-greedy pattern still starts at the first possible position in the string and must still produce a complete match. For example, the following regexes give the same result for both greedy and non-greedy repetition.
  • Example2:
  • /a+b/ --> "aaab" = aaab
  • /a+?b/ --> "aaab" = aaab, since the non-greedy match starts at the first 'a' and has to keep consuming characters until it reaches a 'b'
  • Examples of greedy repetition:
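
To see greedy and non-greedy repetition side by side on a longer string, here is a small sketch. The sample text is made up.

```javascript
// Hypothetical sample text containing two quoted words.
const quote = 'say "hi" and "bye"';

// Greedy: .+ grabs as much as it can, so the match runs to the LAST quote.
const greedy = quote.match(/".+"/)[0];

// Non-greedy: .+? grabs as little as it can, so the match stops at the NEXT quote.
const lazy = quote.match(/".+?"/)[0];

// Both variants still start at the first 'a' and must reach the 'b'.
const sameEitherWay = "aaab".match(/a+?b/)[0];

console.log(greedy);        // "hi" and "bye"
console.log(lazy);          // "hi"
console.log(sameEitherWay); // aaab
```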

Alternation:

Use '|' for alternation.
Example: /ab|cd/ will match a string containing either "ab" or "cd"
Note: Alternatives are tried from left to right and the first one that matches wins, eg:
/a|ab/ --> 'ab' = This will return 'a', not 'ab'
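
The left-to-right rule is easy to check in a console. A minimal sketch with invented variable names:

```javascript
// The first alternative that matches wins, even if a later one is longer.
const shortWins = "ab".match(/a|ab/)[0]; // "a"

// Putting the longer alternative first changes the result.
const longFirst = "ab".match(/ab|a/)[0]; // "ab"

// Plain alternation: either "ab" or "cd" anywhere in the string.
const either = /ab|cd/.test("xxcdxx");   // true

console.log(shortWins, longFirst, either); // a ab true
```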

Grouping:

Parentheses "()" are used to group subexpressions and to define subpatterns within the complete pattern.

When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. For example: suppose you are looking for one or more lowercase letters followed by one or more digits, but you only really care about the digits.
/[a-z]+\d+/: Will match lowercase characters followed by digits.
/[a-z]+(\d+)/: Will match the same pattern, but lets you extract just the digit part.
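
Here is roughly how the digit part can be pulled out with match(). The input string is a made-up example.

```javascript
// Hypothetical input; only "abc123" has letters directly followed by digits.
const m = "order abc123 shipped".match(/[a-z]+(\d+)/);

// m[0] is the whole match, m[1] is what the first parenthesized group matched.
console.log(m[0]);    // abc123
console.log(m[1]);    // 123
console.log(m.index); // 6
```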

References:

References can be used to enforce a constraint that separate portions of a string contain exactly the same characters. Let's try to find zero or more characters between single or double quotes.
/['"][^'"]*['"]/ : This will match "My world" and 'Hello World', but also 'Invalid expression".
It should not match a string that starts with a single quote and ends with a double quote.

This is where references come to the rescue.
Eg: /(['"])[^'"]*\1/: The '\1' matches whatever the first parenthesized subexpression matched.
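
A small check of the two patterns against the strings mentioned above (the variable names are invented):

```javascript
const loose = /['"][^'"]*['"]/;   // any opening quote, any closing quote
const strict = /(['"])[^'"]*\1/;  // closing quote must equal the opening one

const doubleQuoted = '"My world"';
const singleQuoted = "'Hello World'";
const mismatched = "'Invalid expression\"";

console.log(loose.test(mismatched));    // true, the unwanted match
console.log(strict.test(doubleQuoted)); // true
console.log(strict.test(singleQuoted)); // true
console.log(strict.test(mismatched));   // false
```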

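Grouping only, without a reference:

Use "(?:...)" instead of "(...)". The group still works for repetition and alternation, but it is not captured and does not get a reference number.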
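
The effect on numbering shows up in the array returned by match(). A minimal sketch with a made-up input:

```javascript
// With a capturing group, "(ab)" takes slot 1 and the digits land in slot 2.
const capturing = "abab123".match(/(ab)+(\d+)/);

// With a non-capturing group, the digits become the first captured group.
const nonCapturing = "abab123".match(/(?:ab)+(\d+)/);

console.log(capturing.length, capturing[1], capturing[2]); // 3 ab 123
console.log(nonCapturing.length, nonCapturing[1]);         // 2 123
```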


Specifying match position:

  • ^: Beginning of the string; in multiline mode, also the beginning of a line
  • $: End of the string; in multiline mode, also the end of a line
  • \b: Word boundary
    Eg:
    /\bJava\b/ : Will match these cases:
    1. "Java is ..": Start of string and followed by a space
    2. "This is Java language": In the middle, with spaces on both sides.
    3. "I code in Java": Preceded by a space and followed by the end of the string.
    Will not match: "ScriptJava": There is no word boundary before "Java"
  • \B: Not a word boundary
  • (?=p): Positive lookahead. The following characters must match p, but they are not included in the match.
    /javascript(?=:)/: This will match the "javascript" in "javascript:" but will not include the ":" in the result
  • (?!p): Negative lookahead assertion. The following characters must not match p.
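
The anchors and lookaheads can be sketched like this. The /Java(?!Script)/ pattern and the test strings are illustrative additions, not part of the list above.

```javascript
const wordJava = /\bJava\b/;
console.log(wordJava.test("This is Java language")); // true
console.log(wordJava.test("I code in Java"));        // true
console.log(wordJava.test("ScriptJava"));            // false

// Positive lookahead: ":" must follow, but is not part of the match.
const scheme = "javascript: void".match(/javascript(?=:)/)[0];
console.log(scheme); // javascript

// Negative lookahead: "Java" only when it is NOT followed by "Script".
const notScript = /Java(?!Script)/;
console.log(notScript.test("JavaBeans"));  // true
console.log(notScript.test("JavaScript")); // false

// ^ and $ together force the whole string to match.
console.log(/^\d+$/.test("123"), /^\d+$/.test("12a")); // true false
```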

Flags:

  • i: Case-insensitive matching.
  • g: Global match, find all matches rather than just the first.
  • m: Multiline mode: ^ matches the beginning of a line or string, $ matches the end of a line or string.
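
A short sketch of what each flag changes, using a made-up three-line string:

```javascript
// Hypothetical three-line input.
const text = "one\nTwo\none";

const caseInsensitive = /two/i.test(text); // true: "Two" matches despite the capital T
const allMatches = text.match(/one/g);     // g: every match, not just the first
const lineStarts = text.match(/^\w+/gm);   // m: ^ also matches after each newline
const stringStart = text.match(/^\w+/g);   // without m: ^ matches only at index 0

console.log(caseInsensitive); // true
console.log(allMatches);      // [ 'one', 'one' ]
console.log(lineStarts);      // [ 'one', 'Two', 'one' ]
console.log(stringStart);     // [ 'one' ]
```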

String Methods for Pattern Matching

  1. search
  2. replace
  3. match
  4. split

1. search:  str.search(regexp)

Return: The index of the first match between the regular expression and the given string; if not found, -1. If a string is passed as the argument, it is converted to a regular expression before the search takes place.

When you want to know whether a pattern is found and also its index in a string, use search(). If you only want to know whether it exists, use the similar test() method on the RegExp prototype, which returns a boolean. For more information (but slower execution), use match(), which is similar to the regular expression exec() method.
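
A minimal sketch of search(), including the string-to-regex conversion. The sample strings are made up.

```javascript
const sentence = "The quick brown fox";

const foundAt = sentence.search(/quick/); // index of the first match
const missing = sentence.search(/cat/);   // -1 when nothing matches

// A string argument is turned into a regex, so "." means "any character"
// and matches at index 0. indexOf() treats "." literally instead.
const asRegex = "a.b".search(".");
const asLiteral = "a.b".indexOf(".");

console.log(foundAt, missing, asRegex, asLiteral); // 4 -1 0 1
```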

2. replace: search + replace

The replace() method returns a new string with some or all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match. If pattern is a string, only the first occurrence will be replaced.

The replacement string can include the following special replacement patterns:
  • $$: Inserts a "$".
  • $&: Inserts the matched substring.
  • $`: Inserts the portion of the string that precedes the matched substring.
  • $': Inserts the portion of the string that follows the matched substring.
  • $n: Where n is a positive integer less than 100, inserts the nth parenthesized submatch string, provided the first argument was a RegExp object. Note that this is 1-indexed.
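
A few replace() calls that exercise these rules. The inputs are invented for the example.

```javascript
// $1 and $2 refer to the first and second parenthesized groups.
const swapped = "John Smith".replace(/(\w+)\s(\w+)/, "$2, $1");

// A string pattern replaces only the first occurrence.
const firstOnly = "a-b-c".replace("-", "+");

// A regex with the g flag replaces every occurrence.
const allDashes = "a-b-c".replace(/-/g, "+");

// $& inserts the matched substring itself.
const wrapped = "cat".replace(/a/, "[$&]");

// The replacement can also be a function called once per match.
const doubled = "1 2 3".replace(/\d/g, (d) => String(Number(d) * 2));

console.log(swapped);   // Smith, John
console.log(firstOnly); // a+b-c
console.log(allDashes); // a+b+c
console.log(wrapped);   // c[a]t
console.log(doubled);   // 2 4 6
```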


3. match: 
var paragraph = 'The quick brown fox jumps over the lazy dog. It barked.';
var regex = /[A-Z]/g; // match for all the capital case letters
var found = paragraph.match(regex);

console.log(found);
// expected output: Array ["T", "I"]
With the global flag:
  • If no match is found, returns null.
  • Otherwise, returns an array of all the matches found in the string.
Without the global flag:
  • If no match is found, returns null.
  • Otherwise, returns an array whose first element is the matched string and whose remaining elements are the substrings matched by the parenthesized subexpressions of the regular expression. Thus, if match() returns an array a, a[0] contains the complete match, a[1] contains the substring that matched the first parenthesized expression, and so on.
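
For the non-global case, a sketch along these lines shows the array layout. The URL is a made-up example.

```javascript
// Hypothetical input containing a URL.
const visit = "visit https://example.com/docs today";

// No g flag: element 0 is the full match, then one element per group.
const parts = visit.match(/(\w+):\/\/([\w.]+)/);
console.log(parts[0]);    // https://example.com
console.log(parts[1]);    // https
console.log(parts[2]);    // example.com
console.log(parts.index); // 6

// With the g flag only the full matches are returned, no groups.
const numbers = "a1b22".match(/\d+/g);
console.log(numbers); // [ '1', '22' ]

// No match at all gives null, with or without the g flag.
const nothing = "none".match(/\d/);
console.log(nothing); // null
```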

4. split: 
str.split([separator[, limit]])
separator: string or regex that marks where each split should occur
limit: Integer specifying a limit on the number of pieces returned in the array
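
A short sketch of split() with a regex separator and with a limit, on made-up inputs:

```javascript
// Split on a comma or semicolon, swallowing any whitespace around it.
const fields = "a, b;c ,d".split(/\s*[,;]\s*/);

// The limit caps how many pieces end up in the returned array.
const limited = "a,b,c,d".split(",", 2);

console.log(fields);  // [ 'a', 'b', 'c', 'd' ]
console.log(limited); // [ 'a', 'b' ]
```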



The RegExp Object: Use it wherever a dynamic regex is needed, that is, where the pattern depends on conditions, user input or an API response.
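
One way this might look for user input is sketched below. Note that escapeRegExp is a hypothetical helper written for this example, not a built-in function, and the input values are made up. Escaping matters because characters such as "+" would otherwise be read as regex syntax.

```javascript
// Hypothetical helper (not a built-in): backslash-escape regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1+1"; // pretend this came from a form field
const haystack = "1+1=2, 11, 1+1";

// Escaped: "+" is treated literally, so only the two "1+1" substrings match.
const pattern = new RegExp(escapeRegExp(userInput), "g");
const count = (haystack.match(pattern) || []).length;

// Unescaped: "1+1" means "one or more 1s, then a 1", which matches only "11".
const unescapedCount = (haystack.match(new RegExp(userInput, "g")) || []).length;

console.log(pattern.source, count, unescapedCount); // 1\+1 2 1
```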


a) RegExp Properties: 

1. source: read-only string that contains the text of the regular expression.

2. global: read only boolean, that contains the value of the "g" flag.

3. ignoreCase: read only boolean, that contains the value of the "i" flag.

4. multiline: read only boolean, that contains the value of the "m" flag.

5. lastIndex: read/write integer. For patterns with the 'g' flag, this property stores the position in the string at which the next search is to begin. It is used by the exec() and test() methods.
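
These properties can be inspected directly on any regex. A minimal sketch, with an invented pattern and input:

```javascript
const re = /hello/gi;

// Snapshot of the properties before any matching has happened.
const props = {
  source: re.source,         // "hello"
  global: re.global,         // true  (g flag)
  ignoreCase: re.ignoreCase, // true  (i flag)
  multiline: re.multiline,   // false (no m flag)
  lastIndex: re.lastIndex,   // 0
};
console.log(props);

// After one exec() on a global regex, lastIndex points just past the match.
re.exec("Hello hello");
const afterFirstExec = re.lastIndex;
console.log(afterFirstExec); // 5
```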

b) RegExp Methods:

1. exec:  regexp.exec(str)
Similar to the "match" method of string. Returns null if no match is found; if it finds one, it returns an array just like the match method does for non-global searches. Element 0 contains the string that matched the regex, and any subsequent array elements contain the substrings that matched any parenthesized subexpressions.
When exec is invoked a second time on the same regex with the 'g' flag, it begins its search at the character position indicated by the lastIndex property. If exec does not find a match, it resets lastIndex to 0.
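
The usual way to take advantage of this is a loop that calls exec() until it returns null. A sketch with a made-up input string:

```javascript
const digits = /\d+/g; // the g flag is what makes exec() advance lastIndex
const input = "a1 b22 c333";
const results = [];

let match;
while ((match = digits.exec(input)) !== null) {
  // Record the matched text, where it started, and where the next search begins.
  results.push({ text: match[0], index: match.index, next: digits.lastIndex });
}

console.log(results);
// [ { text: '1', index: 1, next: 2 },
//   { text: '22', index: 4, next: 6 },
//   { text: '333', index: 8, next: 11 } ]
console.log(digits.lastIndex); // 0, reset after the final failed exec()
```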

2. test:
regex.test(str);// true or false
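
One thing worth knowing is that test() also uses lastIndex when the regex has the 'g' flag, so repeated calls on the same string can alternate. A small sketch with invented names:

```javascript
const plain = /foo/;
const globalRe = /foo/g;

// Without g, every call starts from index 0.
const plainResults = [plain.test("foo"), plain.test("foo")];

// With g, the second call starts at lastIndex 3, fails, and resets lastIndex to 0.
const globalResults = [globalRe.test("foo"), globalRe.test("foo"), globalRe.test("foo")];

console.log(plainResults);  // [ true, true ]
console.log(globalResults); // [ true, false, true ]
```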

3. toString:

var patt = new RegExp("Hello World", "g");
var res = patt.toString();
console.log(res);
// expected output: /Hello World/g
