Regex for GitHub Secret Scanning
Regular Expressions (regex) are the butt of many a joke - “now you have two problems”, but they’re a powerful tool for searching and matching text.
As someone who’s been jokingly called “The King of Regex” 👑 before, I’ve got a little bit to say about them.
They’re used in many places, including GitHub’s Secret Scanning, where as part of Advanced Security they give you the ability to match your own patterns to search for secrets or personal data (or anything you like!) in your code files.
What Secret Scanning is and why I care
A great part of GitHub is our secret scanning. That looks for keys 🔑, passwords 💬 and tokens 🔐 that should be kept secret 🤫 and not published into GitHub repos.
Customers paying for GitHub Advanced Security on GitHub Enterprise can write their own custom patterns to use with secret scanning, to cover vendors or contexts for secrets that aren’t yet included in the vendor patterns baked into the product.
I work as a security specialist for GitHub, on Advanced Security, so I’ve dived into how I can use my regex knowledge to write custom patterns.
How to write custom patterns
You can follow the instructions in the GitHub Docs to write your own custom patterns.
If you need an introduction to regex, I suggest RegexOne. Regexr is a great tool for testing your regex, but be aware that it doesn’t work exactly like the custom pattern syntax. More on that below.
I don’t plan to teach you how to use regex here, so if you’re not familiar with them, bookmark this, take a pause and read an intro, then come back when you know the syntax and you’ve tried some of the basics out.
And that’s all folks?
Do we just need to apply standard regex knowledge to write custom patterns?
Not quite. There are a few things to be aware of when writing your own custom patterns.
The main one is that the regex engine used by Secret Scanning is a version of Hyperscan, Intel’s high-performance engine for regular expression matching. This is a very fast regex engine, but it has some differences from other regex engines, such as the one used by Python’s re module, or what you are used to in JavaScript.
First, it doesn’t allow backreferences.
Those are what look like \1 or \2 in your regex, and they refer back to previous capture groups. A regex like (['"]).*?\1 lets me look for either a single or double quote, match anything in between, and then stop when I see the next matching quote. Hyperscan doesn’t support that, so you have to be creative, or have two separate custom patterns.
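Here’s a rough sketch of the “be creative” route, using Python’s re module. That is not Hyperscan, so it only illustrates the rewrite, not the engine. The names SINGLE_QUOTED and DOUBLE_QUOTED and the sample text are made up for this example.

```python
import re

# Backreference version: fine in Python's re, but not available in Hyperscan.
with_backref = re.compile(r"""(['"]).*?\1""")

# Backreference-free rewrite: one sub-pattern per quote style.
# In custom patterns these could be two separate patterns instead of an alternation.
SINGLE_QUOTED = r"'[^'\n]*'"
DOUBLE_QUOTED = r'"[^"\n]*"'
without_backref = re.compile(SINGLE_QUOTED + "|" + DOUBLE_QUOTED)

sample = """name = "alice" and key = 's3cret'"""

backref_hits = [m.group(0) for m in with_backref.finditer(sample)]
plain_hits = [m.group(0) for m in without_backref.finditer(sample)]
print(backref_hits)
print(plain_hits)
```

Both versions find the same two quoted strings in this sample; the second one just spells out each quote style explicitly.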
Second, it doesn’t support lookaround expressions.
These are the expressions that look like (?=...) or (?!...) to look ahead, and (?<=...) or (?<!...) to look behind, in your regex. Here, you can use the feature of Secret Scanning Custom Patterns that defines a “context” for your pattern, to look for a pattern before or after your custom pattern. This mostly works, but there are still some cases where I’d really love to have lookaround! 😔
The tradeoff for this loss of expressiveness is that Hyperscan is very fast, and can match many patterns in parallel.
Common regex errors
I also wanted to touch on a couple of errors that I’ve seen or heard of in regex in general, and some in custom patterns, and how to avoid them.
ReDoS
ReDoS is “Regular Expression Denial of Service” 🙅♂️. It happens when a regex can be made to take a very long time to match with a short amount of input, which can cause a denial of service by consuming lots of RAM or taking a long time to match.
It’s a form of “algorithmic complexity attack”. It mostly happens when you have a repetition operator 🔁 that repeats another repetition operator, especially if one is unbounded ∞, or can match zero-length strings (e.g. .*).
I’m not going to give a full tutorial on it here: try OWASP’s ReDoS page for that.
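If you want to see the effect for yourself, here’s a small sketch in Python, whose re module is a backtracking engine (unlike Hyperscan), so it shows the general problem rather than anything specific to secret scanning. The helper time_search and both patterns are made up for this example, and the timings will vary by machine.

```python
import re
import time

def time_search(pattern: str, text: str) -> float:
    """Return how many seconds re.search takes for pattern on text."""
    start = time.perf_counter()
    re.search(pattern, text)
    return time.perf_counter() - start

# Nested repetition: a repeated group that itself contains an unbounded repeat.
nested = r"^(a+)+$"
# Accepts the same strings, with no nesting.
flat = r"^a+$"

for n in (10, 15, 20):
    text = "a" * n + "!"  # the trailing "!" forces the match to fail
    print(n, f"nested={time_search(nested, text):.4f}s", f"flat={time_search(flat, text):.6f}s")
```

The nested version should slow down sharply as n grows, while the flat one stays quick, even though the input only gets a few characters longer.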
www.example.com
I’ve seen a few regex that look for a domain like so: www.example.com.
They’re not quite right. Close, but no cigar.
The trouble is that this . will match any character! You need to “escape” the dot, like: www\.example\.com.
Something similar can be said for the other regex special characters, such as braces, repetition operators, and so on.
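A quick way to convince yourself, sketched with Python’s re module (the lookalike string is invented for the example):

```python
import re

unescaped = re.compile(r"www.example.com")
escaped = re.compile(r"www\.example\.com")
# re.escape builds the escaped form from a literal string for you.
generated = re.compile(re.escape("www.example.com"))

lookalike = "wwwXexample-com"
print(bool(unescaped.search(lookalike)))
print(bool(escaped.search(lookalike)))
print(generated.pattern)
```

The unescaped pattern accepts the lookalike, the escaped one doesn’t, and re.escape is handy when you are generating patterns from a list of literal strings.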
www.example.co
I’ve also seen a few regex that look for a domain like this: example\.co.
That’s not quite right either. This overmatches on substrings of longer domains, such as: www.an-example.com.
We need to exclude characters that can be in the domain on either side of the domain you want to match.
In custom patterns, the “before” and “after” patterns should be used, so the whole thing resembles:
before: \A|[^a-zA-Z0-9.-]
pattern: example\.co
after: \z|[^a-zA-Z0-9.-]
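To try that before/pattern/after combination outside of GitHub, here’s a sketch that glues the three parts into one Python regex. The helper with_context is hypothetical and only approximates how a context-aware scanner might behave; it is not how secret scanning is implemented. Python’s re has traditionally spelled the end-of-input anchor \Z, so I’ve swapped that in for \z.

```python
import re

def with_context(before: str, pattern: str, after: str) -> re.Pattern:
    """Hypothetical helper: glue before/pattern/after into one Python regex.

    The secret itself is captured in group 1; the context is matched but
    not captured.
    """
    return re.compile(f"(?:{before})({pattern})(?:{after})")

# \z from the custom pattern is written as \Z for Python's re.
domain = with_context(r"\A|[^a-zA-Z0-9.-]", r"example\.co", r"\Z|[^a-zA-Z0-9.-]")

for text in ("example.co", "see example.co now", "www.an-example.com", "example.com"):
    match = domain.search(text)
    print(text, "->", match.group(1) if match else None)
```

The first two inputs match and the last two don’t, because a domain character sits directly before or after the candidate.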
Matching a single character
I’ve also heard of people making regex that look for a single character: a.
How?! Scripting!
They meant to loop over each regex in a file, for example, but accidentally looped over every character in each line too.
Consider this Python:
for line in file.readlines():
    # Bug: iterating over a string yields its characters, one at a time
    for item in line:
        print(item)
That’s going to individually print every character in each line, not just each line.
Make sure to test your regex, and if you’re using a scripting language, make sure you’re looping over the right thing.
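For comparison, here’s a sketch of the loop that was probably intended, treating each whole line as one pattern. The in-memory patterns_file and its contents are stand-ins I’ve invented for a real file of regexes.

```python
import io
import re

# Stand-in for a real patterns file: one regex per line.
patterns_file = io.StringIO("example\\.co\n[a-zA-Z0-9]{8,}\n")

patterns = []
for line in patterns_file:
    regex = line.strip()   # the whole line is one pattern
    if not regex:
        continue           # skip blank lines
    re.compile(regex)      # raises re.error early if the pattern is invalid
    patterns.append(regex)

print(patterns)
```

Compiling each pattern as you read it is a cheap sanity check, and a one-character “pattern” in the output is a strong hint that a loop has gone wrong somewhere.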
Belt and braces (1)
This one is not wrong so much as untidy.
If we go back to our quotes example, I see folk do ("|'|) to match an optional single or double quote. It works.
I much prefer ['"]? though, since it’s more concise, easier to read, and doesn’t need escaping for special characters such as . inside that character class.
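A tiny check that the two spellings accept the same things, sketched in Python (the literal token after the quote is just a placeholder for the example):

```python
import re

verbose = re.compile(r"""("|'|)token""")
concise = re.compile(r"""['"]?token""")

samples = ['"token', "'token", "token"]
print([bool(verbose.fullmatch(s)) for s in samples])
print([bool(concise.fullmatch(s)) for s in samples])
```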
Belt and braces (2)
Again, this isn’t wrong, just verbose.
If you want to repeat a character, you can use curly braces {n,m} to specify a range of repetitions, optionally leaving off the second number to mean “n or more”, like so: [a-zA-Z0-9]{8,}.
I’ve seen “0 or more” and “1 or more” specified as {0,} and {1,}; yet these can be more succinctly expressed as * and + respectively.
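If you’d like reassurance before tidying up someone’s pattern, a small Python comparison like this can help. The helper same_matches and the sample strings are invented for the example, and it only compares behaviour on those samples.

```python
import re

samples = ["", "a", "abc123", "abc 123", "!"]

def same_matches(left: str, right: str) -> bool:
    """True if both regexes fully match exactly the same samples."""
    return all(
        bool(re.fullmatch(left, s)) == bool(re.fullmatch(right, s)) for s in samples
    )

print(same_matches(r"[a-z0-9]{0,}", r"[a-z0-9]*"))
print(same_matches(r"[a-z0-9]{1,}", r"[a-z0-9]+"))
```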
Tricky character classes
If you want to look for the - character in a character class, you need to escape it, like so: [=+\-_] or you can place it last, like [=+_-].
If you forget and use [=+-_] then it will match a spurious range of characters, from + to _: not what you wanted!
There is also a whole load of trouble you can get into by specifying a range that is overly broad. That’s more about the logic of the situation than the regex itself, but it’s worth being aware of.
Let’s say we’re trying to spot alphabetic characters, so we do: [A-z]. That does what we want, but it also spots the characters between Z and a, such as [ and ^ - not what we intended! Watch those ranges to make sure they’re what you want.
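Both range pitfalls are easy to check in Python, whose character classes treat these ranges the same way. The probe characters are just ones I picked to show the leak.

```python
import re

accidental_range = re.compile(r"[=+-_]")  # "+-_" is read as a range from + to _
escaped_hyphen = re.compile(r"[=+\-_]")
hyphen_last = re.compile(r"[=+_-]")

# Characters that slip through the accidental range but not the fixed classes.
print([c for c in "A9:[" if accidental_range.fullmatch(c)])
print([c for c in "A9:[" if escaped_hyphen.fullmatch(c) or hyphen_last.fullmatch(c)])

too_wide = re.compile(r"[A-z]")
letters_only = re.compile(r"[A-Za-z]")
# The six ASCII characters that sit between Z and a.
print([c for c in "[\\]^_`" if too_wide.fullmatch(c)])
print([c for c in "[\\]^_`" if letters_only.fullmatch(c)])
```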
CodeQL has some queries that are designed to spot overly broad ranges in regex, such as this one for JavaScript. It’s important in code, since it can lead to bypasses of security checks, but it’s also important in custom patterns, since it can lead to false positives.
Being over-specific
If we’re looking for a password inside quotes, we might do:
before: password=\"
pattern: [a-zA-Z0-9]{8,}
after: \"
Quick note: those escapes before each " are for the sake of this YAML format I am using as an example, but they’re not needed in the custom pattern itself.
That’s fine, but why restrict the character set like that? What if someone uses a different language that doesn’t just use the Roman alphabet? “㊙”, anyone? That’s the Japanese Kanji character for “secret”.
We can change our pattern to generalise it a bit:
before: password\s*=\s*\"
pattern: [^"\x00-\x08]+
after: \"
Now we grab anything between the quotes that isn’t a quote, or a control character (just the first few), which is to avoid matching on binary data.
We also allow for spaces around the = character in this imaginary example.
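Here’s a sketch comparing the narrow and generalised versions in Python, with before, pattern and after glued into a single regex and the secret captured in group 1. That gluing is my approximation for testing, not how secret scanning works internally, and the sample lines are invented.

```python
import re

narrow = re.compile(r'password="([a-zA-Z0-9]{8,})"')
general = re.compile(r'password\s*=\s*"([^"\x00-\x08]+)"')

lines = ['password="hunter2hunter2"', 'password = "㊙㊙-pässwörd!"']
for line in lines:
    n, g = narrow.search(line), general.search(line)
    print(n.group(1) if n else None, "|", g.group(1) if g else None)
```

The narrow version misses the second line twice over: once for the spaces around the =, and once for the characters outside a-z, A-Z and 0-9.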
Being under-specific (1)
If we’re looking for a password that is specified on a single line, we might do:
before: (\A|\n)password\s*=
pattern: [a-zA-Z0-9\s]+
after: (\z|\n)
Looks fine, right? It’s not going to match on a password that is split across multiple lines, right? Right?
In secret scanning, that \s will happily match newlines too, so the pattern greedily matches across lines, maybe up until the end of the file.
We can replace that \s with [ \t] instead:
before: (\A|\n)password[ \t]*=
pattern: [a-zA-Z0-9 \t]+
after: (\z|\n)
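You can see the same effect in Python, where \s also includes newlines. This sketch glues the three parts into one regex, uses \Z as Python’s spelling of the end-of-input anchor, and runs against some invented content.

```python
import re

content = "password = abc123\nsome other words\nand more\n"

greedy = re.compile(r"(?:\A|\n)password\s*=([a-zA-Z0-9\s]+)(?:\Z|\n)")
single_line = re.compile(r"(?:\A|\n)password[ \t]*=([a-zA-Z0-9 \t]+)(?:\Z|\n)")

print(repr(greedy.search(content).group(1)))
print(repr(single_line.search(content).group(1)))
```

The first capture runs on through the following lines, while the second stops at the end of the password line.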
Being under-specific (2)
We want to match our password now, and we know it must be 8 characters or longer, and we know the allowed range.
We might do:
before: (\A|\n)password[ \t]*=
pattern: [a-zA-Z0-9 \t]{8,}
after: .
Our thinking is that we want to match on the password, and then have anything after it.
That will work, but we will end up matching on the password several times - we’ll match 8 characters, then end because the ninth character matches the . in the after pattern, then match 9 characters, then end because the tenth character matches it too, and so on.
It’s not quite ReDoS, but it’s not what we’re after.
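Python’s re only reports one match per starting point, so to picture the “several matches” behaviour I’ve sketched a rough simulation that tries every end offset by hand. The helper all_match_ends is hypothetical, the input is invented, and this is only my approximation of an engine that reports every place a match can end, not secret scanning’s actual implementation. \Z stands in for \z.

```python
import re

def all_match_ends(before: str, pattern: str, after: str, text: str) -> list:
    """Hypothetical simulation: list every candidate the pattern could end on."""
    found = []
    for start_match in re.finditer(before, text):
        start = start_match.end()
        for end in range(start + 1, len(text) + 1):
            body_ok = re.fullmatch(pattern, text[start:end])
            after_ok = re.match(after, text[end:])
            if body_ok and after_ok:
                found.append(text[start:end])
    return found

text = "password=abcdefghijk\n"
before = r"(?:\A|\n)password[ \t]*="

loose = all_match_ends(before, r"[a-zA-Z0-9 \t]{8,}", r".", text)
tight = all_match_ends(before, r"[a-zA-Z0-9 \t]{8,}", r"\Z|[^a-zA-Z0-9 \t]", text)
print(loose)
print(tight)
```

With the loose after pattern the simulation reports several truncated candidates; with an after pattern that excludes the password’s own characters it reports just the one full value.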
We can either tighten up the after pattern, like so:
before: (\A|\n)password[ \t]*=
pattern: [a-zA-Z0-9 \t]{8,}
after: \z|[^a-zA-Z0-9 \t]
Or we can forget about the password character constraints, and just match on the whole line, ignoring some control characters (much as we did before):
before: (\A|\n)password[ \t]*=
pattern: [^\n\x00-\x08]*
after: \z|\n
Loose anchors
If we’re looking for a password with a defined format, say, 8 characters, we might do:
before: .*
pattern: [a-zA-Z0-9]{8}
after: .*
That’s going to work on small test cases, but in practice it’ll grind to a halt. Why?
The regex is going to match literally anything before the pattern, and literally anything after the pattern, and select those as part of the match done by Hyperscan. That could be GBs of data, and Hyperscan will happily match on it.
Not only that, it’ll match all substrings of that - 1 character less than the whole match, 2 characters less, and so on. That’s a form of ReDoS, and it’s not good.
We can fix this by tightening up the anchors, like so:
before: \A|[^a-zA-Z0-9]
pattern: [a-zA-Z0-9]{8}
after: \z|[^a-zA-Z0-9]
In case you’re wondering why the before and after matches are not just [^a-zA-Z0-9], it’s because we want to be able to match on this pattern right at the start or end of the content.
\A and \z are the anchors for the start and end of the file, respectively. They work like ^ and $, which you are probably more used to as anchors, but are more specific to what we want with custom patterns.
\z is particularly useful, since it spots either the end of a file, or a newline at the end of a file. That saves us hassle in specifying that ourselves. Nice!
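To check the tightened anchors on a few edge cases, here’s a sketch that glues the three parts into one Python regex, with the candidate in group 1. It uses \Z, which is how Python’s re has traditionally spelled the end-of-input anchor, and the test strings are invented.

```python
import re

# \Z is Python's spelling of the end-of-input anchor (\z in the custom pattern).
token = re.compile(r"(?:\A|[^a-zA-Z0-9])([a-zA-Z0-9]{8})(?:\Z|[^a-zA-Z0-9])")

for text in ("abcd1234", "key: abcd1234;", "abcd12345", "xabcd1234"):
    match = token.search(text)
    print(repr(text), "->", match.group(1) if match else None)
```

The 8-character value matches on its own and when surrounded by punctuation, but not when it is part of a longer alphanumeric run.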
That really is all now
That’s all I wanted to cover for now. There are some more custom pattern topics that I might cover in future, such as how to test them, and walking over some of the custom patterns that we’ve written in Field for secret scanning, but that’s enough for now.
There are plenty of other ways to trip up with regex, some of which are covered in this page on packt.
Hope this helped you avoid some of these regex pitfalls; happy regexing!