Regular expressions: a practical primer you'll actually remember

Character classes, quantifiers, anchors, groups, and alternation — explained with worked patterns for emails, phone numbers, whitespace cleanup, and capture-group reformatting. Plus greedy vs lazy and the pitfalls that bite everyone.

By Muhammad Tahir | 6 min read | Tags: dev, regex, explainer

Most people learn regular expressions by copy-pasting a cryptic pattern from a search result, hoping it works, and moving on. That works right up until it doesn't, and then you're stuck staring at ^[\w.-]+@[\w.-]+\.\w+$ with no idea which part is broken. This primer builds the pieces from the ground up so the patterns stop being incantations and start being readable. By the end you'll have four genuinely useful patterns you understand well enough to modify.

A regular expression (regex) is a small pattern language for describing text. You write a pattern, and the engine checks whether — and where — that pattern appears in a string. That's it. Everything below is just vocabulary for describing text precisely.

The literal baseline

The simplest regex is just literal text. The pattern cat matches the letters c-a-t in that order, anywhere in the input. It matches inside category and scatter. Regex only gets interesting once you stop spelling things out literally and start describing classes of characters and repetition.
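As a quick illustration of the literal case, here is a minimal sketch using Python's standard re module. The helper name contains_cat is made up for this example.

```python
import re

# The literal pattern from the text: the letters c, a, t in that order.
CAT = re.compile(r"cat")

def contains_cat(text: str) -> bool:
    """Return True if "cat" appears anywhere in text (hypothetical helper)."""
    return CAT.search(text) is not None

for word in ["cat", "category", "scatter", "dog"]:
    print(f"{word!r}: {contains_cat(word)}")
```

Only dog fails to match, because search looks for the pattern anywhere in the input rather than comparing whole strings.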

Character classes: matching a set of characters

A character class describes "any one character from this set." You write it with square brackets:

[aeiou]

That matches exactly one vowel. Ranges work too:

[a-z]        one lowercase letter
[0-9]        one digit
[A-Za-z0-9]  one letter or digit

A caret at the start negates the class — [^0-9] means "any one character that is not a digit."

Because some classes are so common, regex has shorthand:

  • \d is [0-9] — a digit.
  • \w is [A-Za-z0-9_] — a "word" character (note: includes underscore).
  • \s is whitespace — space, tab, newline.
  • . matches any character except a newline.

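The uppercase versions invert them: \D is a non-digit, \W a non-word character, \S a non-whitespace character. A frequent surprise: \w includes the underscore but not the hyphen, which is why patterns for things like slugs need [\w-] spelled out. A second one: in some engines (Python 3, for example) \d and \w are Unicode-aware by default and also match digits and letters outside ASCII, so the bracket equivalents above hold exactly only in ASCII mode.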
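The classes and shorthands above can be tried out with re.findall, which returns every match as a list. This is a small sketch in Python. The sample strings are invented, and the re.ASCII flag is used here on the assumption that you want \d and \w to mean exactly the ASCII sets listed above.

```python
import re

# re.ASCII makes \d, \w and \s follow the ASCII definitions given above.
vowels = re.findall(r"[aeiou]", "regex")
digits = re.findall(r"\d", "a1b22", flags=re.ASCII)
non_digits = re.findall(r"[^0-9]", "a1b22")

# \w includes the underscore but not the hyphen, so a slug gets split...
split_slug = re.findall(r"\w+", "my-slug_name", flags=re.ASCII)
# ...unless the hyphen is added to the class explicitly.
whole_slug = re.findall(r"[\w-]+", "my-slug_name", flags=re.ASCII)

print(vowels, digits, non_digits, split_slug, whole_slug)
```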

Quantifiers: how many times

Quantifiers say how many of the preceding thing to match:

  • * — zero or more
  • + — one or more
  • ? — zero or one (optional)
  • {3} — exactly 3
  • {2,4} — between 2 and 4
  • {2,} — 2 or more

So \d{3} matches exactly three digits. \d+ matches one or more digits. colou?r matches both color and colour because the u is optional.
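To see the quantifiers in action, the sketch below checks a few strings with re.fullmatch, which succeeds only when the whole string fits the pattern. The helper name whole is an assumption made for brevity.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire string (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"\d{3}", "123"))      # exactly three digits
print(whole(r"\d{3}", "1234"))     # four digits is one too many
print(whole(r"\d{2,4}", "12345"))  # five digits exceeds the upper bound
print(whole(r"colou?r", "color"), whole(r"colou?r", "colour"))

# \d+ grabs each run of one or more digits.
numbers = re.findall(r"\d+", "10 apples, 200 pears")
print(numbers)
```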

Anchors: position, not characters

Anchors don't match characters — they match positions.

  • ^ anchors to the start of the string (or line, in multiline mode).
  • $ anchors to the end.
  • \b is a word boundary — the edge between a \w character and a non-\w character.

The difference between anchored and unanchored is enormous. The pattern cat matches inside scatter. The pattern ^cat$ matches only the exact string cat and nothing else. And \bcat\b matches the whole word cat but not the cat inside category. Forgetting anchors is the single most common reason a "validation" regex lets garbage through — it found your pattern somewhere in the string instead of checking the whole string.
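The anchored and unanchored cases can be compared side by side. This is a sketch in Python, and one detail is specific to that flavor: $ also matches just before a trailing newline, so re.fullmatch is shown as a stricter alternative.

```python
import re

unanchored = re.search(r"cat", "scatter")      # finds "cat" inside the word
anchored = re.search(r"^cat$", "scatter")      # None: the string is not exactly "cat"
exact = re.search(r"^cat$", "cat")
word = re.search(r"\bcat\b", "the cat sat")
not_word = re.search(r"\bcat\b", "category")   # None: "cat" is followed by a word character

# Python-specific detail: $ also matches just before a trailing newline,
# so re.fullmatch is the stricter choice for whole-string validation.
newline_ok = re.search(r"^cat$", "cat\n") is not None
newline_strict = re.fullmatch(r"cat", "cat\n") is not None

print(bool(unanchored), bool(anchored), bool(exact), bool(word), bool(not_word))
print(newline_ok, newline_strict)
```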

Groups and alternation

Parentheses ( ) group part of a pattern so a quantifier or alternation applies to the whole group. The pipe | means "or."

(cat|dog)s?

That matches cat, cats, dog, or dogs. The parentheses also capture what they matched, which we'll use heavily below. If you want grouping without capturing, use (?: ) — a non-capturing group, which is slightly faster and keeps your capture numbering clean.
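A short Python sketch of the same pattern shows both the alternation and the difference between capturing and non-capturing groups. The name PETS is only for illustration.

```python
import re

PETS = re.compile(r"(cat|dog)s?")

for text in ["cat", "cats", "dog", "dogs", "cow"]:
    m = PETS.fullmatch(text)
    # group(1) is whatever the parentheses captured, without the optional s.
    print(text, "->", m.group(1) if m else None)

# The capturing version records the animal; the (?: ) version records nothing.
captured = PETS.fullmatch("dogs").groups()
uncaptured = re.fullmatch(r"(?:cat|dog)s?", "dogs").groups()
print(captured, uncaptured)
```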

Worked pattern 1: an email-ish check

A full, RFC-correct email regex is monstrous and nobody should write one. For real validation you send a confirmation link. But a practical "does this look like an email" check is useful for catching typos before submission:

^[\w.+-]+@[\w-]+\.[\w.-]+$

Reading it left to right: anchor to start, one or more word/dot/plus/hyphen characters (the local part), a literal @, one or more word/hyphen characters (the first label of the domain), a literal dot (escaped as \.), then one or more word/dot/hyphen characters (the rest of the domain, ending in the TLD), anchored to the end. This accepts jane.doe+news@mail.example.co and rejects "not an email". It will accept some technically invalid addresses, such as a@b..c, and that's fine, because its job is catching obvious mistakes, not being a mail server.
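Wrapped in a function, the check might look like the sketch below. The names EMAIL_ISH and looks_like_email are invented for this example, and the test addresses are placeholders.

```python
import re

# The email-ish pattern from the text, unchanged.
EMAIL_ISH = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def looks_like_email(text: str) -> bool:
    """Typo check only; it does not prove the address exists."""
    return EMAIL_ISH.search(text) is not None

for candidate in [
    "jane.doe+news@mail.example.co",
    "not an email",
    "jane@example",   # no dot after the domain label
    "a@b..c",         # technically invalid, but accepted by this loose check
]:
    print(f"{candidate!r}: {looks_like_email(candidate)}")
```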

Worked pattern 2: a flexible phone number

Phone numbers come in many shapes: 555-123-4567, (555) 123 4567, 5551234567. Here's a pattern that tolerates the common separators:

^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$

Piece by piece: an optional opening paren \(?, three digits, an optional closing paren, an optional single separator from the class [\s.-] (whitespace, dot, or hyphen), three more digits, another optional separator, then four digits. The ? after each separator and paren is what makes the format flexible. Note we escape the parens as \( and \) because bare parens mean "group." One trade-off: because each paren is optional on its own, the pattern also accepts unbalanced input such as (555-123-4567.
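Here is a sketch of the phone pattern run against the three shapes mentioned above, plus a few inputs that probe its limits. PHONE and looks_like_phone are hypothetical names.

```python
import re

# The phone pattern from the text, unchanged.
PHONE = re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

def looks_like_phone(text: str) -> bool:
    return PHONE.search(text) is not None

for candidate in [
    "555-123-4567",
    "(555) 123 4567",
    "5551234567",
    "555-1234",        # too few digits
    "555--123-4567",   # two separators in a row
    "(555-123-4567",   # unbalanced paren: accepted, since each paren is optional
]:
    print(f"{candidate!r}: {looks_like_phone(candidate)}")
```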

Worked pattern 3: trailing whitespace

Trailing spaces at the ends of lines are invisible noise that pollutes diffs. To find them:

[ \t]+$

That's "one or more spaces or tabs, immediately before the end of a line." In multiline mode, $ matches the end of each line, so this finds trailing whitespace throughout a file. Replace the matches with nothing and the clutter is gone. This is a perfect job for a find-and-replace pass — the Find & Replace tool supports regex mode, so you can paste this pattern in and clear every trailing space at once.
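The same cleanup can be scripted. In Python, multiline mode is the re.MULTILINE flag, and replacing with an empty string removes the matches. The function name strip_trailing is an assumption for this sketch.

```python
import re

# re.MULTILINE makes $ match at the end of every line, not just the string.
TRAILING_WS = re.compile(r"[ \t]+$", re.MULTILINE)

def strip_trailing(text: str) -> str:
    """Remove spaces and tabs at the end of each line."""
    return TRAILING_WS.sub("", text)

messy = "first line  \nsecond\tline\t\nthird"
clean = strip_trailing(messy)
print(repr(clean))
```

Spaces and tabs inside a line are left alone, since only runs that touch the end of a line satisfy the $ anchor.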

Worked pattern 4: reformatting with capture groups

This is where regex stops being a filter and becomes a transformer. Say you have dates in YYYY-MM-DD format and want DD/MM/YYYY. You capture each part:

(\d{4})-(\d{2})-(\d{2})

Three capture groups: year, month, day. Now in the replacement field, you refer back to them. Most tools use $1, $2, $3 (some use \1, \2, \3):

$3/$2/$1

Run that and 2026-06-01 becomes 01/06/2026. The same idea reformats names from Last, First to First Last with the pattern (\w+),\s*(\w+) and replacement $2 $1. Capture-and-rearrange is one of the highest-value regex skills, and it's exactly the kind of bulk transformation Find & Replace is built for.
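In code the idea is the same. Python's re.sub is one of the flavors that writes backreferences as \1, \2, \3 in the replacement string rather than $1, $2, $3. The function names and sample inputs below are illustrative.

```python
import re

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
NAME = re.compile(r"(\w+),\s*(\w+)")

def to_day_first(text: str) -> str:
    """Rewrite YYYY-MM-DD dates as DD/MM/YYYY."""
    return DATE.sub(r"\3/\2/\1", text)

def to_first_last(text: str) -> str:
    """Rewrite 'Last, First' as 'First Last'."""
    return NAME.sub(r"\2 \1", text)

print(to_day_first("Due 2026-06-01"))
print(to_first_last("Lovelace, Ada"))
```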

Greedy vs lazy: the quantifier gotcha

By default, quantifiers are greedy: they match as much as they possibly can. Suppose you want to extract the name of the first HTML tag from <b>bold</b>:

<(.+)>

You'd expect b. Instead the capture grabs b>bold</b — because .+ is greedy and gobbles everything up to the last > it can find while still letting the pattern match. The fix is the lazy quantifier, made by adding a ?:

<(.+?)>

Now .+? matches as little as possible, stopping at the first >, and you correctly capture b. The ? after + (or *) is the difference between "as much as possible" and "as little as possible." Greedy-versus-lazy confusion is behind a huge share of "my regex matches too much" bugs.
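The two behaviors are easy to compare directly. The sketch below also includes a third variant that is not in the text above: a negated class, [^>]+, which is a common alternative that avoids the question of greediness by never allowing a > inside the capture.

```python
import re

html = "<b>bold</b>"

greedy = re.search(r"<(.+)>", html).group(1)
lazy = re.search(r"<(.+?)>", html).group(1)
# Alternative (an addition, not from the text): forbid > inside the capture.
negated = re.search(r"<([^>]+)>", html).group(1)

print(repr(greedy), repr(lazy), repr(negated))
```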

Pitfalls that bite everyone

  • Forgetting to escape special characters. A literal dot is \., a literal ? is \?, a literal ( is \(. Inside a character class, most of these lose their special meaning, so [.?] matches a literal dot or question mark.
  • Unanchored validation. As covered, \d{5} finds five digits somewhere; ^\d{5}$ requires the whole string to be exactly five digits. For validation, almost always anchor both ends.
  • . doesn't match newlines by default. If you need it to span lines, enable the "dotall" / single-line flag.
  • Catastrophic backtracking. Nested quantifiers like (a+)+ against certain inputs can make the engine explore exponentially many possibilities and hang. Keep patterns simple and avoid nesting quantifiers on overlapping character sets.
  • Assuming regex flavors are identical. JavaScript, Python, PCRE, and others differ in lookbehind support, named groups, and escapes. A pattern that works in one may need tweaks in another.
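Three of the pitfalls above (escaping, unanchored validation, and dot versus newline) can be reproduced in a few lines. This is a Python sketch. re.escape and the re.DOTALL flag are that flavor's spellings, and other engines name them differently.

```python
import re

# Unescaped dot matches any character; escaped dot matches only a dot.
loose = re.search(r"3.50", "3x50") is not None
strict = re.search(r"3\.50", "3x50") is not None
# re.escape builds a literal-matching pattern from arbitrary text.
literal = re.fullmatch(re.escape("3.50?"), "3.50?") is not None
# Inside a character class, dot and question mark are already literal.
in_class = re.findall(r"[.?]", "a.b?c")

# Unanchored search finds five digits somewhere; fullmatch checks everything.
found_somewhere = re.search(r"\d{5}", "abc123456") is not None
whole_string = re.fullmatch(r"\d{5}", "abc123456") is not None

# Dot skips newlines unless the dotall flag is set.
default_dot = re.search(r"a.b", "a\nb") is not None
dotall_dot = re.search(r"a.b", "a\nb", flags=re.DOTALL) is not None

print(loose, strict, literal, in_class)
print(found_somewhere, whole_string, default_dot, dotall_dot)
```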

Build, then test

The reliable way to write a working regex is to build it incrementally and watch what each addition does to the matches. Start with the literal core, add character classes, add quantifiers, then anchor it — testing against both strings that should match and strings that shouldn't at every step. The Regex Tester highlights matches live as you type, which turns the whole process from guesswork into a quick feedback loop. Once you can see your groups light up in real time, the pieces in this primer click together fast.
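If you prefer to keep that feedback loop in code, a tiny harness can play the same role. The sketch below is one possible shape, not a standard tool: check_pattern is a hypothetical helper that reports which should-match strings were missed and which should-reject strings slipped through.

```python
import re

def check_pattern(pattern, should_match, should_reject):
    """Return a list of (problem, text) pairs; an empty list means all good."""
    compiled = re.compile(pattern)
    failures = []
    for text in should_match:
        if compiled.search(text) is None:
            failures.append(("missed", text))
    for text in should_reject:
        if compiled.search(text) is not None:
            failures.append(("false match", text))
    return failures

good = ["12345"]
bad = ["123456", "abc"]

# Step 1: the unanchored core lets a six-digit string through.
print(check_pattern(r"\d{5}", good, bad))
# Step 2: anchoring both ends fixes it.
print(check_pattern(r"^\d{5}$", good, bad))
```

Rerunning the harness after each change to the pattern gives the same incremental view: every addition either clears a failure or introduces one.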