Regex Pal

DOI without hyphen-like hyphens

The DOI Handbook (https://www.doi.org/doi_handbook/2_Numbering.html#2.9) recommends NOT using hyphen-like characters that visually resemble the hyphen-minus \u002D but do not match it. These characters sometimes end up in DOIs by accident. Among them:

\u058A \u2010 \u2011 \u2012 \u2013 \u2014 \u2015 \u301C \uFE58 \uFE63 \uFF0D

A regular expression that matches only DOIs without them:

\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?!["&\'\u058A\u2010\u2011\u2012\u2013\u2014\u2015\u301C\uFE58\uFE63\uFF0D])\S)+)\b

A generic regular expression that matches DOIs regardless of any undesired hyphen-like characters:

\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?!["&\'])\S)+)\b

A regular expression that matches any string containing one or more of the undesired hyphens:

[\u058A\u2010\u2011\u2012\u2013\u2014\u2015\u301C\uFE58\uFE63\uFF0D]+
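
As a rough illustration of how the three expressions might be used together, here is a minimal Python sketch. The helper names (find_dois, has_hyphen_like, is_clean_doi) and the sample text are made up for this example and are not part of the original pattern. It assumes Python's re module, which accepts \uXXXX escapes in string patterns.

```python
import re

# Hyphen-like code points from the list above (the ASCII hyphen-minus U+002D is allowed).
HYPHEN_LIKE = r"\u058A\u2010\u2011\u2012\u2013\u2014\u2015\u301C\uFE58\uFE63\uFF0D"

# Shared head and tail of the DOI patterns; only the excluded character set differs.
_HEAD = r"\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?![\"&'"
_TAIL = r"])\S)+)\b"

STRICT_DOI = re.compile(_HEAD + HYPHEN_LIKE + _TAIL)   # rejects hyphen-like characters
GENERIC_DOI = re.compile(_HEAD + _TAIL)                # tolerates them
HYPHEN_RUN = re.compile("[" + HYPHEN_LIKE + "]+")      # detects them


def find_dois(text):
    """Return every DOI-looking substring, including ones with bad hyphens."""
    return [m.group(1) for m in GENERIC_DOI.finditer(text)]


def has_hyphen_like(doi):
    """True if the string contains at least one undesired hyphen-like character."""
    return HYPHEN_RUN.search(doi) is not None


def is_clean_doi(candidate):
    """True only if the whole candidate matches the strict pattern."""
    return STRICT_DOI.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Hypothetical sample: the second DOI contains an en dash (U+2013).
    sample = "See 10.1000/abc-123 and 10.1000/abc\u2013123."
    for doi in find_dois(sample):
        print(repr(doi), "clean" if is_clean_doi(doi) else "contains hyphen-like character")
```

One caveat when searching inside longer text: the strict expression can still match the part of a DOI that precedes the first hyphen-like character, so this sketch validates a whole candidate with fullmatch rather than relying on an unanchored search.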

Cheat Sheet

Character classes
.               any character except newline
\w \d \s        word, digit, whitespace
\W \D \S        not word, digit, whitespace
[abc]           any of a, b, or c
[^abc]          not a, b, or c
[a-g]           character between a & g

Anchors
^abc$           start / end of the string
\b              word boundary

Escaped characters
\. \* \\        escaped special characters
\t \n \r        tab, linefeed, carriage return
\u00A9          unicode escaped ©

Groups & Lookaround
(abc)           capture group
\1              backreference to group #1
(?:abc)         non-capturing group
(?=abc)         positive lookahead
(?!abc)         negative lookahead

Quantifiers & Alternation
a* a+ a?        0 or more, 1 or more, 0 or 1
a{5} a{2,}      exactly five, two or more
a{1,3}          between one & three
a+? a{2,}?      match as few as possible
ab|cd           match ab or cd