Regular Expression

Intro to Regular Expression

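Regular Expression (aka regex): a sequence of characters that follows a special pattern (e.g. a phone number: 0X_XXXXXXX), much like a template for strings. It helps us search strings faster, or replace strings throughout a file.
Regex can also be used to check whether a format is correct. For example, an e-mail address has a fixed format, so we can look for the shape "whatever" + @ + "whatever" + ".com".
To use regex in Python: import re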

import re

text = "My number is 0911111113. I am available from 8am to 5pm."

patt = "from"

match = re.search(patt, text)  # a re.Match object can be stored in a variable
print(match)
# <re.Match object; span=(40, 44), match='from'>

print(match.group()) # from - the text that was matched
print(match.span()) # (40, 44) - where the match is
print(match.start()) # 40
print(match.end()) # 44

-----------------------------------------------
text = "from, My number is 0911111113. I am available from 8am to 5pm, fromit"

patt = "from"

match = re.search(patt, text)
print(match.group()) # from
print(match.span()) # (0, 4)
print(match.start()) # 0
print(match.end()) # 4
# re.search only returns the first match,
# so use re.findall to get all of them

match = re.findall(patt, text)
print(match)
# ['from', 'from', 'from'] every occurrence of "from" is found, including the one inside "fromit"
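
The e-mail shape mentioned in the introduction ("whatever" + @ + "whatever" + ".com") can be sketched with re.search. This is only a rough illustration, not a real e-mail validator: the pattern, the helper name looks_like_email, and the sample addresses are made up for this example, and the pattern uses syntax explained in the next section.

```python
import re

# Rough shape check: "whatever" + @ + "whatever" + ".com"
# \S+ means "one or more non-whitespace characters"; \. is a literal dot.
EMAIL_SHAPE = r"\S+@\S+\.com"

def looks_like_email(text):
    """Return True if text contains something shaped like xxx@xxx.com."""
    return re.search(EMAIL_SHAPE, text) is not None

print(looks_like_email("write to alice@example.com today"))  # True
print(looks_like_email("no address here"))                   # False
```

re.search returns None when nothing matches, which is why the helper compares the result against None.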

Regular Expression Syntax

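Regular expressions are best written as raw strings, so that backslash sequences that have a special meaning in ordinary Python strings (such as \n) are not interpreted before they reach the regex engine.

# raw string: inside the quotes after r, a backslash has no special meaning
print(r"\n") # \n
print("\n") # here \n means a newline, so a blank line is printed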
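
A small check of what the r prefix actually changes, using only the standard library. The sample text is made up for this sketch; \b (the word boundary) is described in the syntax table below.

```python
import re

# In a normal string "\n" is one character (a newline);
# in a raw string r"\n" is two characters: a backslash and the letter n.
print(len("\n"))   # 1
print(len(r"\n"))  # 2

# Without r, the backslash has to be doubled to reach the regex engine.
print(r"\d" == "\\d")  # True

# Why it matters: "\b" in a normal string is a backspace character,
# while r"\b" is the regex word boundary.
text = "this is it"
print(re.findall("\bis", text))   # []
print(re.findall(r"\bis", text))  # ['is']
```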

syntax

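| Syntax | Description | Regular Expression | Matches |
|---|---|---|---|
| . | Dot. Matches any single character (except a newline) | r"he..o" | "hello", "hek5o" |
| [] | Specifies a set of characters; can be thought of as a customized dot. A hyphen inside [] means "from ... to ...". E.g. any number from 00 to 59 can be written as r"[0-5][0-9]"; checking for an English letter can be written as r"[a-zA-Z]"; ^x at the start of the set (r"[^x]") means anything except x | r"he[abcl][a-d]o" | "heaao", "hecco" |
| \d | Matches one decimal digit (0-9) | r"File\d\d\d" | "File100", "File543" |
| \D | Matches one character that is not a decimal digit | r"\D\D" | "He", "We", "?+" |
| \w | Matches any alphanumeric character (digit or letter) or an underscore | r"\w-\w" | "4-3", "4-t", "a-b" |
| \W | Matches any non-alphanumeric character, e.g. +, ?, or ! | r"\W" | "+" |
| \s | Matches a whitespace character (space, tab, newline) | r"a\sb\sc" | "a b c" |
| \S | Matches a non-whitespace character | r"\S\S\S\S\S" | "AKB48" |
| * | Kleene star. Matches zero or more repetitions of the item before it, which can be a character, a \ escape, a dot, or [] | r"ab*" | "a", "ab", "abbbbbbb" |
| + | Matches one or more repetitions of the item before it, which can be a character, a \ escape, a dot, or [] | r"ab+" | "ab", "abbbbbb" |
| {m} | Matches exactly m copies of the item before it, e.g. r"\d{3}" and r"\d\d\d" give the same result. Can follow a character, a \ escape, a dot, or [] | r"Hello\d{3}" | "Hello123", "Hello456" |
| {m,n} | Matches m to n copies. There must be no space before n | r"Hello\d{1,3}" | "Hello3", "Hello58" |
| {m,} | Matches m or more copies. There must be no space after the comma | r"Hello\d{1,}" | "Hello3", "Hello588873" |
| \. | Since the dot already has a meaning in RE, use \. to look for a literal dot | r"\.*" | "", ".", ".........." |
| \b | Matches the empty string at the beginning or end of a word (a word boundary) | r"\bis" | the "is" at the start of "island", "is" |
| \| | A\|B matches A or B (A and B are both REs) | r"[a-zA-Z]+\|[0-9]+" | "Hello", "56778" |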

When writing a regular expression, do not put a space after a comma (for example inside {m,n}).
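
A short sketch of why the space matters, assuming Python's built-in re module (the sample text is made up). With a space inside the braces, Python no longer reads them as a quantifier and treats them as literal characters instead.

```python
import re

text = "Hello3, Hello58, Hello123456"

# {1,3} is a quantifier: one to three digits after "Hello".
print(re.findall(r"Hello\d{1,3}", text))
# ['Hello3', 'Hello58', 'Hello123']

# With a space, "{1, 3}" is not a quantifier; it is matched literally,
# so nothing in this text matches.
print(re.findall(r"Hello\d{1, 3}", text))
# []
```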

### [] ###
text = "hello, heabo, hecdo, he56o"

print(re.findall(r"he[abcl][a-d]o", text))
# ['heabo', 'hecdo']

text = "How are you doing today?"
print(re.findall(r"[a-d]", text))
# ['a', 'd', 'd', 'a']

### . ###
text = "hello, heabo, hecdo, he56o, hellooooo"
print(re.findall(r"he..o", text))
# ['hello', 'heabo', 'hecdo', 'he56o', 'hello']

### * + ###
text = "I have hello, heo, helllllo, and hekko"

print(re.findall(r"he.*o", text))
# ['hello, heo, helllllo, and hekko']
# only one match is found: .* is greedy and runs from the first "he" to the last "o"

print(re.findall(r"hel*o", text))
# ['hello', 'heo', 'helllllo']

print(re.findall(r"he[a-z]*o", text))
# ['hello', 'heo', 'helllllo', 'hekko']

print(re.findall(r"he[A-Z]+o", text))
# [] (no match)

### {m} ###
text = "I say hello526, hello747, hello696"
print(re.findall(r"hello\d{3}", text))
# ['hello526', 'hello747', 'hello696']

text = "I say hello6, hello747 and, hello696"
print(re.findall(r"hello\d{1,}", text))
# ['hello6', 'hello747', 'hello696']

### \d ###
text = "my phone number is 08-7555555 and 07-4147414"
print(re.findall(r"0\d-\d{7}", text))
# ['08-7555555', '07-4147414']

### \b ###
text = "This island is beautiful isn't it"
print(re.findall(r"\bis", text))
# ['is', 'is', 'is']

text = "This island is beautiful isn't it"
print(re.findall(r"is\b", text))
# ['is', 'is']

text = "This island is beautiful isn't it"
print(re.findall(r"\bis\b", text))
# ['is']

### | ###
text = "Are You a dog person or a cat person?"
print(re.findall(r"dog|cat", text))
# ['dog', 'cat']
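
The table also lists \w, \S, \D, \. and the negated set [^...], which the examples above do not exercise. A minimal sketch with a made-up sample string:

```python
import re

text = "File_7 costs $3.50 now"

### \w ### letters, digits and underscore
print(re.findall(r"\w+", text))
# ['File_7', 'costs', '3', '50', 'now']

### \S ### anything that is not whitespace
print(re.findall(r"\S+", text))
# ['File_7', 'costs', '$3.50', 'now']

### \D ### anything that is not a digit
print(re.findall(r"\D+", text))
# ['File_', ' costs $', '.', ' now']

### \. ### a literal dot between digits
print(re.findall(r"\d\.\d\d", text))
# ['3.50']

### [^...] ### anything except lowercase letters and spaces
print(re.findall(r"[^a-z ]+", text))
# ['F', '_7', '$3.50']
```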

*. If you cannot figure out how to write a regular expression for a very common purpose (such as a phone number or an e-mail address), just google it. People have already written really good ones.