Must know regular expressions - regular expression solutions to common problems

Table of contents

1. North American phone number

2. U.S. ZIP code

3. Canadian postal code

4. British postal code

5. U.S. Social Security Number

6. IP address

7. URL

8. Complete URL

9. Email address

10. HTML comments

11. JavaScript comments

12. Credit card number


        Questions about regular expressions rarely have a single final answer. More often, the answer depends on how much uncertainty you can tolerate. There are usually multiple solutions, and there is always a trade-off between a pattern's performance and the range of scenarios it can handle. Remember that you need not only to match input that meets the criteria but also to exclude input that does not, which is why some of the regular expressions below look complicated.

1. North American phone number

        The North American Numbering Plan defines the format of telephone numbers in North America. Under this plan, telephone numbers in North America (the United States, Canada, most of the Caribbean, and several other regions) consist of a 3-digit area code and a 7-digit number. The 7-digit number is divided into a 3-digit office code and a 4-digit line number, separated by a hyphen. Each digit can be any number, except that the first digit of the area code and of the office code cannot be 0 or 1. The area code is often placed in parentheses, or separated from the rest of the number by a hyphen. Assume that only the following 4 formats are legal:
(555) 555-5555
(555)555-5555
555-555-5555
555.555.5555

mysql> set @s:='J. Doe: 248-555-1234
    '> B. Smith: (313) 555-1234
    '> A. Lee: (810)555-1234
    '> M. Jones: 734.555.9999';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='(\\([2-9]\\d{2}\\)\\s?\\d{3}-\\d{4})|([2-9]\\d{2}(?<z>[.-]))\\d{3}\\k<z>\\d{4}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------------------------------------------+------------+
| c    | s                                                      | i          |
+------+--------------------------------------------------------+------------+
|    4 | 248-555-1234,(313) 555-1234,(810)555-1234,734.555.9999 | 9,32,55,79 |
+------+--------------------------------------------------------+------------+
1 row in set (0.01 sec)
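
        For readers who want to try the same pattern outside MySQL, here is a minimal sketch in Python. It is an illustration, not part of the original session: Python's re module writes the named group as (?P<z>...) and the backreference as (?P=z), where the pattern above uses (?<z>...) and \k<z>. The named group captures the first separator (. or -) so that the second separator must be the same character. Like the pattern above, it applies the 2~9 restriction only to the area code. The names PHONE, text, and matches are chosen here for illustration.

```python
import re

# Same structure as the MySQL pattern, in Python's named-group syntax.
PHONE = re.compile(
    r"(\([2-9]\d{2}\)\s?\d{3}-\d{4})"            # (555) 555-5555 or (555)555-5555
    r"|([2-9]\d{2}(?P<z>[.-]))\d{3}(?P=z)\d{4}"  # 555-555-5555 or 555.555.5555
)

text = """J. Doe: 248-555-1234
B. Smith: (313) 555-1234
A. Lee: (810)555-1234
M. Jones: 734.555.9999"""

# Collect the full text of every match.
matches = [m.group(0) for m in PHONE.finditer(text)]
print(matches)

# Mixed separators are rejected because (?P=z) must repeat the first one.
print(PHONE.search("555-555.5555"))
```

        The second print shows None, because a number that mixes - and . does not fit any of the 4 legal formats.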

2. U.S. ZIP code

        The United States introduced ZIP codes (ZIP is an acronym for Zone Improvement Plan) in 1963. There are currently more than 40,000 ZIP codes in the United States, all of them numeric (the first digit represents a region running from east to west, with 0 for the East Coast and 9 for the West Coast). In 1983, the United States Postal Service began using an extended format known as ZIP+4. The 4 added digits identify the delivery area more precisely (down to a specific city block or building), which greatly improves the efficiency and accuracy of mail delivery. Because ZIP+4 is optional, a ZIP code check usually has to accept both 5-digit ZIP codes and 9-digit ZIP+4 codes, in which the last 4 digits are separated from the first 5 by a hyphen.

mysql> set @s:='999 1st Avenue, Bigtown, NY, 11222
    '> 123 High Street, Any City, MI 48034-1234';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\d{5}(-\\d{4})?';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+------------------+-------+
| c    | s                | i     |
+------+------------------+-------+
|    2 | 11222,48034-1234 | 30,66 |
+------+------------------+-------+
1 row in set (0.00 sec)

        \d{5} matches any 5 digits, and -\d{4} matches a hyphen followed by the last 4 digits. Because the last 4 digits are optional, -\d{4} is enclosed in parentheses to make it a subexpression, and the ? after it indicates that the subexpression may appear at most once.
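
        The optional subexpression can be tried out with a short Python sketch. This is only an illustration under the assumption that Python's re module is available; the helper name is_zip is made up here. It uses fullmatch so that the whole string must be a ZIP or ZIP+4 code, which is stricter than the search performed in the session above.

```python
import re

ZIP = re.compile(r"\d{5}(-\d{4})?")

def is_zip(s):
    # fullmatch requires the entire string to be a ZIP or ZIP+4 code.
    return ZIP.fullmatch(s) is not None

text = "999 1st Avenue, Bigtown, NY, 11222\n123 High Street, Any City, MI 48034-1234"

# Searching inside free text, as the MySQL session does.
found = [m.group(0) for m in ZIP.finditer(text)]
print(found)

print(is_zip("11222"), is_zip("48034-1234"), is_zip("48034-12"))
```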

3. Canadian postal code

        Canadian postal codes consist of six characters that alternate between letters and digits. Each code has two parts: the first 3 characters give the FSA (forward sortation area) code, and the last 3 characters give the LDU (local delivery unit) code. The first character of the FSA code identifies the province, city, or region. There are 18 valid choices for this character, such as A for Newfoundland, B for Nova Scotia, K, L, N, and P for Ontario, M for Toronto, and so on. The pattern should check that this character is valid. When a Canadian postal code is written out, the FSA code and the LDU code are usually separated by a space.

mysql> set @s:='123 4th Street, Toronto, Ontario, M1A 1A1
    '> 567 8th Avenue, Montreal, Quebec, H9Z 9Z9';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='[ABCEGHJKLMNPRSTVXY]\\d[A-Z] \\d[A-Z]\\d';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+-----------------+-------+
| c    | s               | i     |
+------+-----------------+-------+
|    2 | M1A 1A1,H9Z 9Z9 | 35,77 |
+------+-----------------+-------+
1 row in set (0.00 sec)

        [ABCEGHJKLMNPRSTVXY] matches any one of the 18 valid characters, and \d[A-Z] matches a single digit followed by any letter; together they match the FSA code. \d[A-Z]\d matches the LDU code, which is a letter between two digits. This regular expression matches Canadian postal codes without case sensitivity.
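
        As a rough illustration of the case-insensitive behavior, the same pattern can be compiled in Python with the re.IGNORECASE flag. The flag is this sketch's way of getting case insensitivity and is an assumption about how one might port the pattern; the name CA_POSTAL is invented for the example.

```python
import re

# re.IGNORECASE lets the uppercase classes match lowercase input too.
CA_POSTAL = re.compile(r"[ABCEGHJKLMNPRSTVXY]\d[A-Z] \d[A-Z]\d", re.IGNORECASE)

for code in ["M1A 1A1", "h9z 9z9", "D1A 1A1", "M1A1A1"]:
    # D is not one of the 18 valid first letters, and the last code lacks the space.
    print(code, CA_POSTAL.fullmatch(code) is not None)
```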

4. British postal code

        British postal codes consist of 5 to 7 letters and digits and are defined by the Royal Mail. Each code has two parts: the outer code (outcode) and the inner code (incode). The outer code is one or two letters followed by one or two digits, or one or two letters followed by a digit and a letter. The inner code is always a single digit followed by two letters (any letters except C, I, K, M, O, and V, which never appear in the inner code). The outer code and the inner code should be separated by a space.

mysql> set @s:='171 Kyverdale Road, London N16 6PS
    '> 33 Main Street, Portsmouth, P01 3AX
    '> 18 High Street, London NW11 8AB';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='[A-Z]{1,2}\\d[A-Z\\d]? \\d[ABD-HJLNP-UW-Z]{2}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------------+----------+
| c    | s                        | i        |
+------+--------------------------+----------+
|    3 | N16 6PS,P01 3AX,NW11 8AB | 28,64,95 |
+------+--------------------------+----------+
1 row in set (0.00 sec)

        In this pattern, [A-Z]{1,2}\d matches one or two letters followed by a digit, and the following [A-Z\d]? matches an optional letter or digit. Therefore, [A-Z]{1,2}\d[A-Z\d]? matches any valid outer code. The inner code is matched by \d[ABD-HJLNP-UW-Z]{2}, which matches any digit followed by two letters that are allowed in the inner code (A, B, D-H, J, L, N, P-U, W-Z). This regular expression matches UK postal codes in a case-insensitive manner.
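
        The two halves of the pattern can be checked with a small Python sketch. It is only an illustration: re.IGNORECASE is assumed here as the way to get case-insensitive matching, and the name UK_POSTAL is made up for the example.

```python
import re

UK_POSTAL = re.compile(
    r"[A-Z]{1,2}\d[A-Z\d]?"      # outer code
    r" "                         # separating space
    r"\d[ABD-HJLNP-UW-Z]{2}",    # inner code, without C, I, K, M, O, V
    re.IGNORECASE,
)

for code in ["N16 6PS", "P01 3AX", "nw11 8ab", "N16 6PC"]:
    # The last code ends in C, which is not allowed in the inner code.
    print(code, UK_POSTAL.fullmatch(code) is not None)
```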

5. U.S. Social Security Number

        A U.S. Social Security number (SSN) consists of 3 groups of digits separated by hyphens: the first group contains 3 digits, the second 2 digits, and the third 4 digits. Since 1972, the U.S. government has assigned the first three digits based on the address provided by the applicant.

mysql> set @s:='John Smith: 123-45-6789';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\d{3}-\\d{2}-\\d{4}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+-------------+------+
| c    | s           | i    |
+------+-------------+------+
|    1 | 123-45-6789 | 13   |
+------+-------------+------+
1 row in set (0.00 sec)

        \d{3}-\d{2}-\d{4} matches any 3 digits, a hyphen, any 2 digits, a hyphen, and any 4 digits. Most combinations of digits are valid SSNs, but in practice a few further requirements apply. First, no field in a valid SSN can be all zeros. Second, the first group must not (so far) be greater than 728, because no numbers that high have been allocated yet, although they may be in the future. A pattern that enforces all of this would be very complex, so the simpler \d{3}-\d{2}-\d{4} is usually used.
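
        To give a feel for what a stricter check might look like, here is a hedged Python sketch that adds only the all-zero rule. It is not the author's pattern: STRICT_SSN is a hypothetical variant that uses negative lookaheads, and it deliberately leaves out the upper limit on the first group.

```python
import re

SIMPLE_SSN = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical stricter variant: each (?!...) lookahead rejects an all-zero field.
STRICT_SSN = re.compile(r"(?!000)\d{3}-(?!00)\d{2}-(?!0000)\d{4}")

for ssn in ["123-45-6789", "000-45-6789", "123-00-6789", "123-45-0000"]:
    print(ssn, SIMPLE_SSN.fullmatch(ssn) is not None, STRICT_SSN.fullmatch(ssn) is not None)
```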

6. IP address

        An IP address consists of 4 bytes, each with a value in the range 0~255. IP addresses are usually written as 4 groups of integers separated by . characters, where each integer has 1 to 3 digits.

mysql> set @s:='localhost is 127.0.0.1.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='(((\\d{1,2})|(1\\d{2})|(2[0-4]\\d)|(25[0-5]))\\.){3}((\\d{1,2})|(1\\d{2})|(2[0-4]\\d)|(25[0-5]))';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+-----------+------+
| c    | s         | i    |
+------+-----------+------+
|    1 | 127.0.0.1 | 14   |
+------+-----------+------+
1 row in set (0.00 sec)

        This pattern uses a series of nested subexpressions. (((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.) is built from 4 nested subexpressions: (\d{1,2}) matches any 1-digit or 2-digit number (0~99), (1\d{2}) matches any 3-digit number starting with 1 (100~199), (2[0-4]\d) matches the integers 200~249, and (25[0-5]) matches the integers 250~255. The | operator joins these 4 parts into one subexpression, of which only one part needs to match. The following \. matches the . character and forms a larger subexpression with what precedes it, and the {3} after that repeats this larger subexpression 3 times. Finally, the value range appears once more (this time without the trailing \.) to match the last group of digits. By limiting all four groups to values between 0 and 255, this pattern is designed to match valid IP addresses and exclude invalid ones. Note that the pattern is not anchored, so in a search it can still match part of a longer run of digits; for example, because (\d{1,2}) is tried first, the last group can stop after two digits of a three-digit number.
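
        The range logic can be exercised with a short Python sketch. This is an illustration rather than the original setup: the names OCTET, IP, and is_ip are invented here, and fullmatch is used so that the whole string has to be an address.

```python
import re

# One group of digits limited to 0~255, written exactly as in the pattern above.
OCTET = r"((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))"
IP = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

def is_ip(s):
    # fullmatch forces every alternative to be tried until the whole string fits.
    return IP.fullmatch(s) is not None

for candidate in ["127.0.0.1", "192.168.1.200", "255.255.255.255", "256.1.1.1", "1.2.3"]:
    print(candidate, is_ip(candidate))

# An unanchored search stops early on the last group.
print(IP.search("192.168.1.200").group(0))
```

        The last line prints 192.168.1.20, which is why anchoring the pattern (or using a whole-string match) matters when the goal is validation rather than extraction.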

7. URL

        Matching URLs is a difficult task, and its complexity depends on how precise a match is desired. The URL matching pattern should match at least the protocol (http or https), hostname, optional port number, and path.

mysql> set @s:='http://www.forta.com/blog
    '> https://www.forta.com:80/blog/index.cfm
    '> http://www.forta.com
    '> http://ben:password@www.forta.com/
    '> http://localhost/index.php?ab=1&c=2
    '> http://localhost:8500/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='https?:\\/\\/[-\\w.]+(:\\d+)?(\\/([\\w\\/_.]*)?)?';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| c    | s                                                                                                                                                   | i                  |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    6 | http://www.forta.com/blog,https://www.forta.com:80/blog/index.cfm,http://www.forta.com,http://ben,http://localhost/index.php,http://localhost:8500/ | 1,27,67,88,123,159 |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
1 row in set (0.01 sec)

        https?:\/\/ matches http:// or https:// (the ? makes the character s optional). [-\w.]+ matches the hostname. (:\d+)? matches an optional port number. (\/([\w\/_.]*)?)? matches the path: the outer subexpression matches / (if present), and the inner subexpression matches the path itself. As you can see, this pattern cannot handle query strings, nor can it correctly interpret a username:password pair embedded in the URL. However, it is sufficient for most URLs (matching the hostname, port number, and path). This regular expression for matching URLs is not case-sensitive.

        If you also want to match URLs that use the ftp protocol, replace https? with (http|https|ftp). URLs using other protocols can also be matched according to similar ideas.
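
        The ftp variant might look like the following in Python. This is a minimal sketch: the ftp URL in the sample text is a made-up placeholder, and the names URL, text, and found are chosen for illustration. Python does not need the / characters escaped, so \/ is written as /.

```python
import re

# https? replaced with (http|https|ftp), as suggested above.
URL = re.compile(r"(http|https|ftp)://[-\w.]+(:\d+)?(/([\w/_.]*)?)?")

# The ftp address below is a hypothetical sample, not taken from the session.
text = "ftp://ftp.forta.com/pub/file.txt https://www.forta.com:80/blog/index.cfm"

found = [m.group(0) for m in URL.finditer(text)]
print(found)

# As with the original pattern, the query string is left out of the match.
print(URL.search("http://localhost/index.php?ab=1&c=2").group(0))
```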

8. Complete URL

        Here is a more complete (and slower) URL pattern that also matches URL query strings (variable information embedded in the URL, separated from the address by a ?) and optional user login information.

mysql> set @s:='http://www.forta.com/blog
    '> https://www.forta.com:80/blog/index.cfm
    '> http://www.forta.com
    '> http://ben:password@www.forta.com/
    '> http://localhost/index.php?ab=1&c=2
    '> http://localhost:8500/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='https?:\\/\\/(\\w*:\\w*@)?[-\\w.]+(:\\d+)?(\\/([\\w\\/_.]*(\\?\\S+)?)?)?';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| c    | s                                                                                                                                                                                    | i                  |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    6 | http://www.forta.com/blog,https://www.forta.com:80/blog/index.cfm,http://www.forta.com,http://ben:password@www.forta.com/,http://localhost/index.php?ab=1&c=2,http://localhost:8500/ | 1,27,67,88,123,159 |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
1 row in set (0.02 sec)

        This pattern improves on the previous example. This time https?:\/\/ is followed by (\w*:\w*@)?, which matches a username and password embedded in the URL (the username and password are separated by : and followed by an @ character); see the fourth URL in this example. In addition, (\?\S+)? after the path matches the query string: \? matches a literal ?, \S+ matches the text after it, and the trailing ? makes the whole query string optional. This regular expression for matching URLs is not case-sensitive. Why not always use this pattern instead of the previous one? In terms of performance, the more complex the pattern, the slower it executes. If you don't need the extra functionality, it's better not to use it.
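
        To see which part of the pattern picks up which part of a URL, the subexpressions can be given names. The sketch below is an assumption-laden illustration in Python: the group names userinfo, host, port, and query are invented here, the credentials in the sample URL are placeholders, and the structure of the pattern is otherwise unchanged.

```python
import re

URL = re.compile(
    r"https?://"
    r"(?P<userinfo>\w*:\w*@)?"            # optional username:password@
    r"(?P<host>[-\w.]+)"                  # hostname
    r"(?P<port>:\d+)?"                    # optional port
    r"(/([\w/_.]*(?P<query>\?\S+)?)?)?"   # optional path and query string
)

m = URL.fullmatch("http://ben:password@www.forta.com/")
print(m.group("userinfo"), m.group("host"), m.group("port"))

q = URL.fullmatch("http://localhost/index.php?ab=1&c=2")
print(q.group("host"), q.group("query"))

p = URL.fullmatch("http://localhost:8500/")
print(p.group("host"), p.group("port"))
```

        A group that did not take part in the match, such as port in the first URL, comes back as None.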

9. Email address

        Regular expressions are often used to validate email addresses, but even a simple email address can be challenging to verify.

mysql> set @s:='My name is Ben Forta, and my
    '> email address is ben@forta.com.';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='(\\w+\\.)*\\w+@(\\w+\\.)+[A-Za-z]+';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+---------------+------+
| c    | s             | i    |
+------+---------------+------+
|    1 | ben@forta.com | 47   |
+------+---------------+------+
1 row in set (0.00 sec)

        (\w+\.)*\w+ matches the username part of the email address (everything before the @): (\w+\.)* matches zero or more runs of text each followed by a ., and \w+ matches the required text (so this combination matches both ben and ben.forta). Next, @ matches the @ character itself. (\w+\.)+ matches at least one string ending with a ., and [A-Za-z]+ matches the top-level domain (com, edu, us, uk, etc.). The rules that determine whether an email address is well formed are extremely complex, and this pattern cannot verify every possible address. For example, it would consider [email protected] to be valid (it obviously is not), and it would not allow an IP address as the hostname (which is permitted). Still, it is good enough to verify most email addresses, so you can use it. This regular expression for matching email addresses is case-insensitive.
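
        A quick way to explore what the pattern accepts and rejects is a small Python sketch such as the one below. The addresses are placeholders chosen for the example, and the name EMAIL is made up; the pattern itself is copied unchanged.

```python
import re

EMAIL = re.compile(r"(\w+\.)*\w+@(\w+\.)+[A-Za-z]+")

text = "My name is Ben Forta, and my\nemail address is ben@forta.com."

# The trailing period of the sentence is not part of the match.
print(EMAIL.search(text).group(0))

# A dotted username is accepted; an IP address as hostname is not.
print(EMAIL.fullmatch("ben.forta@forta.com") is not None)
print(EMAIL.fullmatch("ben@127.0.0.1") is not None)
```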

10. HTML comments

        Comments in HTML pages must be placed between <!-- and --> tags. Both tags must contain at least two hyphens, but more than two are fine. When browsing (or debugging) a Web page, it is useful to find all comments.

mysql> set @s:='<!-- Start of page -->
    '> <html>
    '> <!-- Start of head -->
    '> <head>
    '> <title>My Title</title> <!-- Page title -->
    '> </head>
    '> <!-- Body -->
    '> <body>';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='<!(-{2,}).*?[^-]\\1>';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+---------------------------------------------------------------------------------+-------------+
| c    | s                                                                               | i           |
+------+---------------------------------------------------------------------------------+-------------+
|    4 | <!-- Start of page -->,<!-- Start of head -->,<!-- Page title -->,<!-- Body --> | 1,31,85,113 |
+------+---------------------------------------------------------------------------------+-------------+
1 row in set (0.00 sec)

        <!(-{2,}) matches the opening tag of an HTML comment, that is, <! followed by two or more hyphens, and captures the run of hyphens as a subexpression. .*? matches the text of the comment, using a lazy quantifier, and [^-] requires the character just before the closing hyphens not to be a hyphen. \1> matches the closing tag: the backreference \1 repeats the hyphens captured at the opening, followed by >. Because the pattern accepts two or more hyphens, it can also be used to find CFML comments, which use 3 hyphens in their start and end tags. The backreference is also how the pattern checks that the hyphens in the closing tag correspond to those in the opening tag (which can help detect malformed HTML comments).
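
        The effect of the backreference is easiest to see by running the pattern on a few inputs. The following Python sketch is only an illustration; the names COMMENT, html, and comments are invented for it, and the two extra test strings at the end are made-up samples.

```python
import re

COMMENT = re.compile(r"<!(-{2,}).*?[^-]\1>")

html = """<!-- Start of page -->
<html>
<!-- Start of head -->
<head>
<title>My Title</title> <!-- Page title -->
</head>
<!-- Body -->
<body>"""

comments = [m.group(0) for m in COMMENT.finditer(html)]
print(comments)

# A CFML-style comment with 3 hyphens on both sides matches as a whole.
print(COMMENT.search("<!--- note --->").group(0))

# Two hyphens at the start but three at the end: no match.
print(COMMENT.search("<!-- a --->"))
```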

11. JavaScript comments

        In JavaScript, and in other scripting languages including ActionScript and other ECMAScript variants, comments in the code begin with //. As in the previous example, it is quite useful to be able to find all the comments on a given page.

mysql> set @s:='<script language="JavaScript">
    '> // Turn off fields used only by replace
    '> function hideReplaceFields() {
    '>   document.getElementById(\'RegExReplace\').disabled=true;
    '>   document.getElementById(\'replaceheader\').disabled=true;
    '> }
    '> // Turn on fields used only by replace
    '> function showReplaceFields() {
    '>   document.getElementById(\'RegExReplace\').disabled=false;
    '>   document.getElementById(\'replaceheader\').disabled=false;
    '> }';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='\\/\\/.*';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------------------------------------------------------------------+--------+
| c    | s                                                                              | i      |
+------+--------------------------------------------------------------------------------+--------+
|    2 | // Turn off fields used only by replace,// Turn on fields used only by replace | 32,220 |
+------+--------------------------------------------------------------------------------+--------+
1 row in set (0.00 sec)

        The pattern is simple: \/\/.* matches // followed by the rest of the line, which is the comment content.
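
        Because the pattern is so simple, it is worth seeing where it can go wrong. The Python sketch below is an illustration with invented names (LINE_COMMENT, source, comments) and a made-up second input: it finds the comment in a shortened version of the sample, and it also shows that // inside a string literal is treated as the start of a comment.

```python
import re

LINE_COMMENT = re.compile(r"//.*")

source = """// Turn off fields used only by replace
function hideReplaceFields() {
  document.getElementById('RegExReplace').disabled=true;
}"""

# With no groups in the pattern, findall returns the full matches.
comments = LINE_COMMENT.findall(source)
print(comments)

# A false positive: the // of a URL inside a string is matched as well.
print(LINE_COMMENT.search('var u = "http://x";').group(0))
```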

12. Credit card number

        Regular expressions cannot verify whether the credit card number is actually valid. The final conclusion must be made by the credit card issuer. However, regular expressions can be used to exclude credit card numbers that have been entered incorrectly, such as one more digit or one less digit, before further processing.

        The patterns used here assume that spaces and hyphens have been removed from credit card numbers in advance. Generally speaking, it is a good idea to remove non-numeric characters from credit card numbers before using regular expressions to match them. All credit cards follow the same basic numbering scheme: they start with a specific sequence of numbers and have a fixed total number of digits. Let’s take a look at MasterCard first.

mysql> set @s:='MasterCard: 5212345678901234
    '> Visa 1: 4123456789012
    '> Visa 2: 4123456789012345
    '> Amex: 371234567890123
    '> Discover: 601112345678901234
    '> Diners Club: 38812345678901';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='5[1-5]\\d{14}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+------------------+------+
| c    | s                | i    |
+------+------------------+------+
|    1 | 5212345678901234 | 13   |
+------+------------------+------+
1 row in set (0.00 sec)

        A MasterCard number is 16 digits long; the first digit is always 5, and the second digit is 1~5. 5[1-5] matches the first 2 digits, and \d{14} matches the remaining 14 digits. Visa cards are a little more complicated.

mysql> set @r:='4\\d{12}(\\d{3})?';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------------------+-------+
| c    | s                              | i     |
+------+--------------------------------+-------+
|    2 | 4123456789012,4123456789012345 | 38,60 |
+------+--------------------------------+-------+
1 row in set (0.00 sec)

        The first digit of a Visa card number is always 4, and the total length is 13 or 16 digits (never 14 or 15, so a numeric range quantifier cannot be used here). 4 matches the digit 4 itself, \d{12} matches the next 12 digits, and (\d{3})? matches the optional last 3 digits. The pattern for American Express card numbers is much simpler.

mysql> set @r:='3[47]\\d{13}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+-----------------+------+
| c    | s               | i    |
+------+-----------------+------+
|    1 | 371234567890123 | 83   |
+------+-----------------+------+
1 row in set (0.01 sec)

        An American Express card number is 15 digits long, and the first 2 digits must be 34 or 37. 3[47] matches the first 2 digits, and \d{13} matches the remaining 13 digits. The pattern for Discover card numbers is not difficult either.

mysql> set @r:='6011\\d{14}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+--------------------+------+
| c    | s                  | i    |
+------+--------------------+------+
|    1 | 601112345678901234 | 109  |
+------+--------------------+------+
1 row in set (0.00 sec)

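        A Discover card number is 16 digits long, and the first 4 digits must be 6011. Note, however, that the pattern used here, 6011\d{14}, matches 18 digits in total (as in the sample number above); the pattern that corresponds to 16 digits is 6011\d{12}. Diners Club cards are a little more complicated.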

mysql> set @r:='(30[0-5]|36\\d|38\\d)\\d{11}';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+----------------+------+
| c    | s              | i    |
+------+----------------+------+
|    1 | 38812345678901 | 141  |
+------+----------------+------+
1 row in set (0.00 sec)

        A Diners Club card number is 14 digits long and must start with 300~305, 36, or 38. If the first 3 digits are 300~305, 11 more digits must follow; if the first 2 digits are 36 or 38, 12 more digits must follow. A relatively simple approach is used here: whatever the prefix, match the first 3 digits first. (30[0-5]|36\d|38\d) contains 3 alternatives, only one of which needs to match: 30[0-5] matches 300~305, 36\d matches any 3-digit number starting with 36, and 38\d matches any 3-digit number starting with 38. Finally, \d{11} matches the remaining 11 digits. Now all that remains is to combine the 5 credit card patterns above.

mysql> set @r:='(5[1-5]\\d{14})|(4\\d{12}(\\d{3})?)|(3[47]\\d{13})|(6011\\d{14})|((30[0-5]|36\\d|38\\d)\\d{11})';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s, regexp_extract_index(@s, @r, 0, '') i;
+------+---------------------------------------------------------------------------------------------------+---------------------+
| c    | s                                                                                                 | i                   |
+------+---------------------------------------------------------------------------------------------------+---------------------+
|    6 | 5212345678901234,4123456789012,4123456789012345,371234567890123,601112345678901234,38812345678901 | 13,38,60,83,109,141 |
+------+---------------------------------------------------------------------------------------------------+---------------------+
1 row in set (0.00 sec)

        This pattern combines the five earlier patterns with the | operator (which provides alternative branches). With it, you can check the numbers of 5 common credit card types at once. The pattern only checks that the leading digits and the total length of the number are correct. However, not every 13-digit number starting with 4 is a valid Visa card number. Credit card numbers (for all the card types mentioned above) must also satisfy a mathematical formula called Mod 10, which determines whether the number is actually valid. The Mod 10 algorithm is an essential part of processing credit cards, but it is not a job for regular expressions because it involves arithmetic.
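
        For completeness, here is a hedged sketch of how the regular expression and a Mod 10 check (commonly known as the Luhn algorithm) might be combined in Python. It is not part of the original example: the function names luhn_ok and plausible_card are invented, the combined pattern is copied as-is from above, and 4111111111111111 is used only because it is a widely used test number.

```python
import re

# The combined pattern from above, unchanged.
CARD = re.compile(
    r"(5[1-5]\d{14})|(4\d{12}(\d{3})?)|(3[47]\d{13})|(6011\d{14})"
    r"|((30[0-5]|36\d|38\d)\d{11})"
)

def luhn_ok(number):
    """Return True if the digit string passes the Mod 10 check."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

def plausible_card(number):
    # First the format check by regular expression, then the arithmetic check.
    return CARD.fullmatch(number) is not None and luhn_ok(number)

print(plausible_card("4111111111111111"))  # format and checksum both pass
print(plausible_card("4111111111111112"))  # format passes, checksum fails
```

        Running the format check first keeps non-digit input away from the arithmetic, and passing both checks still says nothing about whether the issuer actually recognizes the number.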

Origin blog.csdn.net/wzy0623/article/details/130986791