This page was created programmatically; to read the article in its original location you can go to the link below:
https://blog.yossarian.net/2025/08/14/Fun-with-finite-state-transducers
and if you want to remove this article from our site please contact us
Aug 14, 2025
Tags: devblog, programming, rust, zizmor
I recently solved an interesting problem inside zizmor
with a type of state machine/automaton I hadn’t used before: a
finite state transducer (FST).
This is just a quick write-up of the problem and how I solved it. It doesn’t
go particularly deep into the data structures themselves. For more information
on FSTs themselves, I strongly recommend burntsushi’s article on transducers
(which is what actually led me to his fst crate).
TL;DR: I used the fst crate to build a finite state transducer that
maps GitHub Actions context patterns to their logical “capability” in
the context of a potential template injection vulnerability. This ended up
being an order of magnitude smaller in terms of representation (~14.5 KB instead
of ~240 KB) and faster and more memory efficient than my naïve
initial approaches (tables and prefix trie walks). It also enabled me
to fully precompute the FST at compile time, eliminating the startup cost
of priming a trie- or table-based map.
zizmor is a static analysis tool for GitHub Actions.
One of the categories of weaknesses it can find is template injections,
wherein the CI author uses a GitHub Actions expression in a shell or
similar context without realizing that the expression can escape any
shell-level quoting intended to “defuse” it.
Here’s an example, derived from a pattern that gets
exploited over and over again:

    - name: "Print the current ref"
      run: |
        echo "The current ref is: ${{ github.ref }}"

If this step is part of a workflow that grants elevated privileges to third
parties (like pull_request_target), an attacker can contrive a git
ref that escapes the shell quoting and runs arbitrary code.
For example, the following ref:

    innocent";cat${IFS}/etc/passwd;true${IFS}"

…would expand as:

    echo "The current ref is: innocent";cat /etc/passwd;true ""

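To make the mechanism concrete, here is a minimal sketch of why the quoting doesn't help: template expansion is plain text substitution that happens before the shell ever sees the script. The `expand_template` function is a hypothetical stand-in for that step, not how the Actions runner is actually implemented, and it only handles the exact `${{ <context> }}` spelling.

```rust
/// Hypothetical stand-in for the template expansion step: splice `value`
/// into `script` wherever `${{ <context> }}` appears. Nothing here is
/// shell-aware, which is the whole problem.
fn expand_template(script: &str, context: &str, value: &str) -> String {
    // `{{{{` and `}}}}` are escaped braces, so this builds e.g. "${{ github.ref }}".
    let fence = format!("${{{{ {context} }}}}");
    script.replace(&fence, value)
}
```

Expanding the `run:` script above with the malicious ref yields a script whose first double quote is closed by the attacker's input; the shell then expands `${IFS}` to whitespace when it runs the result.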
Fortunately, zizmor detects these:

    zizmor hackme.yml
    error[template-injection]: code injection via template expansion
      --> hackme.yml:15:41
       |
    14 |         run: |
       |         ^^^ this run block
    15 |           echo "The current ref is: ${{ github.ref }}"
       |                                         ^^^^^^^^^^ may expand into attacker-controllable code
       |
       = note: audit confidence → High
       = note: this finding has an auto-fix

There’s a very simple way to detect these vulnerabilities: we could walk every
code “sink” in a given workflow (e.g. run: blocks, action inputs that are
known to contain code, &c.) and look for the fences of an expression
(${{ ... }}). If we see these fences, we know that the contents
are a potential injection risk.
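As a rough sketch of that naïve approach (not zizmor's actual implementation), scanning a sink for fences can be as small as the following. `find_expressions` is a hypothetical helper, and it assumes that `}}` never appears inside the expression itself (e.g. within a string literal).

```rust
/// Return the trimmed body of every `${{ ... }}` expression found in `sink`.
/// Deliberately naive: every hit is treated as a potential injection.
fn find_expressions(sink: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = sink;
    while let Some(start) = rest.find("${{") {
        // Skip past the 3-byte opening fence.
        let after_open = &rest[start + 3..];
        match after_open.find("}}") {
            Some(end) => {
                found.push(after_open[..end].trim());
                // Continue scanning after the 2-byte closing fence.
                rest = &after_open[end + 2..];
            }
            // Unterminated fence: nothing more to extract.
            None => break,
        }
    }
    found
}
```

Run against the step above, this reports `github.ref`, but it would report `true` or `'lol'` just as eagerly.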
This is appealing for reasons of simplicity, but is unacceptably noisy:
1. There are many actions expressions that are trivially safe, or
   non-trivial but deductively safe:
   - Literal expressions like ${{ true }} or ${{ 'lol' }};
   - Any expression that can only expand to a literal:

         # only ever expands to 'main' or 'not-main'
         # despite using the github.ref context
         ${{ github.ref == 'main' && 'main' || 'not-main' }}

   - Any expression that can’t expand to meaningful code, e.g.
     due to the expression’s type:

         # only ever expands to a number
         ${{ github.run_number }}
         # only ever expands to `true` or `false`
         ${{ github.ref == 'main' }}

2. There are many expressions that can appear unsafe by virtue of dataflow
   or context expansion, but are actually safe because of the context’s
   underlying type or constraints:
   - ${{ github.event.pull_request.merged }} is populated by … true or false, but requires …
   - ${{ github.actor }} is an arbitrary string, but is limited … $, &c.).
zizmor generally aims to present low-noise findings, so filtering these
out by default is paramount.
The first group is pretty easy: we can do a small amount of dataflow analysis
to determine whether an expression’s evaluation is “tainted” by arbitrary
controllable inputs.
The second group is harder, because it requires us to know additional facts about
arbitrary-looking contexts. The two main facts we care about are type
(whether a context expands to a string, a number, or something else)
and capability (whether the expansion is fully arbitrary, or constrained
in some manner that might make it safe or at least less risky). In practice
these both collapse down to capability, since we can categorize
certain types (e.g. booleans and numbers) as inherently safe.
So, what we want is a way to collect facts about every valid GitHub Actions
context.
The trick to this lies in remembering that, under the hood, GitHub Actions
is driven by GitHub’s webhooks API: most of the context state loaded
into a GitHub Actions workflow run is derived from the webhook payload
corresponding to the event that triggered the workflow.
So, how do we get a list of all valid contexts along with information
about their expansion? GitHub doesn’t provide this directly, but we can
derive it from their OpenAPI specification for the webhooks API.
This comes in the form of a ~4.5MB OpenAPI schema, which is a pain in the ass
to work with directly: it’s both heavily self-referential (by necessity,
since an “unrolled” version with inline schemas for each property would
be infeasibly large), is heavily telescoped (also by necessity, since
GitHub’s API responses themselves are not particularly flat), and makes ample
use of OpenAPI constructions like oneOf, anyOf, and allOf that require
careful additional handling.
At the bottom of all of this, however, is our reward: detailed information
about every property provided by each webhook event, including the property’s
type and valuable information about how the property is constrained.
For example, here’s the schema for a pull_request.state property:

    "state": {
      "description": "State of this Pull Request. Either `open` or `closed`.",
      "enum": ["open", "closed"],
      "type": "string",
      "examples": ["open"]
    }

This tells us that pull_request.state is a string, but that its value
is constrained to either open or closed. We categorize this as having
a “fixed” capability, since we know that the attacker can’t control the
structure of the value itself in a meaningful way.
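A minimal sketch of that kind of categorization might look like the following. To be clear about what's assumed: the `Property` struct is a hypothetical, heavily simplified view of a schema entry, and the specific rules (in particular, treating `format: uri` strings as “structured”) are my illustration of the idea, not a transcription of what zizmor's helper script does.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Capability {
    Arbitrary,
    Structured,
    Fixed,
}

/// Hypothetical, simplified view of one OpenAPI property schema.
struct Property<'a> {
    ty: &'a str,
    enum_values: &'a [&'a str],
    format: Option<&'a str>,
}

/// Assumed rules, for illustration only.
fn classify(prop: &Property<'_>) -> Capability {
    match prop.ty {
        // Non-string scalars can't carry attacker-chosen text.
        "boolean" | "integer" | "number" | "null" => Capability::Fixed,
        // A string limited to an enumerated set is just as constrained.
        "string" if !prop.enum_values.is_empty() => Capability::Fixed,
        // Assumption: URI-formatted strings have a constrained shape.
        "string" if prop.format == Some("uri") => Capability::Structured,
        // Everything else is treated pessimistically.
        _ => Capability::Arbitrary,
    }
}
```

Under these rules, the `state` property above (a string with a two-element `enum`) comes out as `Capability::Fixed`, while an unconstrained string falls through to `Capability::Arbitrary`.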
Long story short: this is implemented as a helper script inside zizmor
called webhooks-to-contexts.py. This script is run periodically in GitHub
Actions and walks the OpenAPI schema to produce a CSV,
context-capabilities.csv, that looks like this:
| context | capability |
|---|---|
| github.event.pull_request.active_lock_reason | arbitrary |
| github.event.pull_request.additions | fixed |
| github.event.pull_request.allow_auto_merge | fixed |
| github.event.pull_request.allow_update_branch | fixed |
| github.event.pull_request.assignee | fixed |
| github.event.pull_request.assignee.avatar_url | structured |
| github.event.pull_request.assignee.deleted | fixed |
| github.event.pull_request.assignee.email | arbitrary |
| github.event.pull_request.assignee.events_url | arbitrary |
| github.event.pull_request.assignee.followers_url | structured |
| github.event.pull_request.assignee.following_url | arbitrary |
| github.event.pull_request.assignee.gists_url | arbitrary |
| github.event.pull_request.assignee.gravatar_id | arbitrary |
| github.event.pull_request.assignee.html_url | structured |
…to the tune of about 4000 unique contexts.
So, we have a few thousand contexts, each with a capability that
tells us how much of a risk that context poses in terms of template injection.
We can just shove these into a map and call it a day, right?
Wrong. We’ve glossed over a significant wrinkle, which is that context
accesses in GitHub Actions are not themselves always literal. Instead, they
can be patterns that can expand to multiple values.
A good example of this is github.event.pull_request.labels: labels is
an array of objects, each of which has a name property corresponding
to the label’s actual name. To access these, we can use syntaxes that select
individual labels or all labels:

    # access the first label's name
    github.event.pull_request.labels[0].name
    # access all labels' names
    github.event.pull_request.labels.*.name

In both cases, we want to apply the same capability to the context’s expansion.
To make things even more exciting, GitHub’s own context
access syntax is surprisingly malleable: each of the following is a valid
and equivalent way to access the first label’s name:

    github.event.pull_request.labels[0].name
    github.EVENT.PULL_REQUEST.LABELS[0].NAME
    github['event']['pull_request']['labels'][0]['name']
    github['event'].pull_request['labels'][0].name
    github.event.pULl_ReQUEst['LaBEls'][0].nAmE

…and so on.
In sum, we have two properties that blow a hole in our “just shove it in a map”
approach:
- … github.event.pull_request.labels.*.name context into N contexts …
- … jq-ish syntax, but we still need to normalize any contexts as they appear …
To recap:
- … fixed | structured | arbitrary, indicating …
To me, this initially smacked of a prefix/radix trie problem: there are a
large number of common prefixes/segments in the pattern set, meaning that
the trie could be made relatively compact. However, tries are optimized for
operations that the problem doesn’t require:
Finally, on a more practical level: I couldn’t find a good trie/radix trie
crate to use. Some of this might have been a discovery failure on my part,
but I couldn’t find one that was already widely used and still actively
maintained. radix_tree came the closest, but hasn’t been updated in nearly
5 years.
While reading about other efficient prefix representation structures,
I came across DAFSAs (also sometimes called DAWGs). These offer a
significantly more compact representation of prefixes than a trie, but at a
cost: unlike a trie, a DAFSA cannot contain auxiliary data. This makes them
great for inclusion checking (including of prefixes), but not so great for my
purpose of storing each pattern’s associated capability.
That brought me to transducers as a class of finite state machines: unlike
acceptors (DAFSAs, but also normal DFAs and NFAs) that map from
an input sequence to a boolean accept/reject state, transducers map
an input sequence to an output sequence. That output sequence can then be
composed (e.g. via summation) into an output value. In effect, the “path”
an input takes through the transducer yields an output value.
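Here is a toy illustration of the "path yields a value" idea, using a hand-built transducer rather than the fst crate. The `Transducer` type, the `toy` constructor, and the keys are all made up for this sketch, and a real FST uses a far more compact encoding than a `Vec` of hash maps. The machine maps "ab" to 3 and "ac" to 5: both keys share the `a` edge (which emits 3) and the same accepting state, and only the `c` edge adds the remaining 2.

```rust
use std::collections::HashMap;

struct Transducer {
    /// transitions[state][input byte] = (next state, output emitted on this edge)
    transitions: Vec<HashMap<u8, (usize, u64)>>,
    /// accepting[state] is true if a key may end in that state
    accepting: Vec<bool>,
}

impl Transducer {
    /// Walk `key` from state 0, summing the outputs along the path taken.
    fn get(&self, key: &[u8]) -> Option<u64> {
        let mut state = 0;
        let mut total = 0;
        for byte in key {
            // A missing edge means the key isn't in the set at all.
            let &(next, out) = self.transitions[state].get(byte)?;
            state = next;
            total += out;
        }
        // Only paths ending in an accepting state are keys.
        self.accepting[state].then_some(total)
    }
}

/// Hand-built machine for {"ab" => 3, "ac" => 5}.
fn toy() -> Transducer {
    Transducer {
        transitions: vec![
            HashMap::from([(b'a', (1, 3))]),
            HashMap::from([(b'b', (2, 0)), (b'c', (2, 2))]),
            HashMap::new(),
        ],
        accepting: vec![false, false, true],
    }
}
```

Looking up `b"ac"` walks two edges and sums 3 + 2, while `b"a"` stops in a non-accepting state and `b"ad"` hits a missing edge, so both return `None`.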
In this way, FSTs can behave a lot like a map (whether backed by a
prefix trie or a hash table), but with some appealing additional properties:
These desirable properties come with downsides too: optimal FST construction
requires memory proportional to the total input size, and requires ordered
insertion of each input. Modifications to an FST are also limited: optimal
insertions must be ordered, while deletions or changes to an associated
value require a full rebuild of the FST. In practice, this means that FST
construction is a static affair over a preprocessed input set. But that’s
perfectly fine for my use case!
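The "preprocessed input set" part mostly boils down to sorting. As a small sketch, assuming rows of `(context, capability)` pairs held in memory, a hypothetical `prepare_rows` helper could put them into byte-wise lexicographic order and reject duplicates before they're handed to an ordered builder (the fst crate's builder returns an error for a key that is less than or equal to the previous one).

```rust
/// Sort `(key, value)` rows into byte-wise lexicographic key order and
/// reject duplicate keys, so that they can be inserted in strictly
/// increasing order.
fn prepare_rows(mut rows: Vec<(String, u64)>) -> Result<Vec<(String, u64)>, String> {
    rows.sort_by(|a, b| a.0.as_bytes().cmp(b.0.as_bytes()));
    // After sorting, any duplicates are adjacent.
    for pair in rows.windows(2) {
        if pair[0].0 == pair[1].0 {
            return Err(format!("duplicate key: {}", pair[0].0));
        }
    }
    Ok(rows)
}
```

If the input file is already kept sorted, this step reduces to a sanity check.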
As it turns out, using the fst crate to construct an FST at build time
is pretty simple. Here’s the totality of the code that I put in
build.rs to transform context-capabilities.csv:

    use std::{env, fs::File, io, path::Path};

    fn do_context_capabilities() {
        let manifest_dir = env::var("CARGO_MANIFEST_DIR").unwrap();
        let source = Path::new(&manifest_dir).join("data/context-capabilities.csv");

        // Only rebuild the FST when the CSV itself changes.
        println!(
            "cargo::rerun-if-changed={source}",
            source = source.display()
        );

        let out_dir = env::var("OUT_DIR").unwrap();
        let out_path = Path::new(&out_dir).join("context-capabilities.fst");
        let out = io::BufWriter::new(File::create(out_path).unwrap());

        let mut build = fst::MapBuilder::new(out).unwrap();

        let mut rdr = csv::ReaderBuilder::new()
            .has_headers(false)
            .from_path(source)
            .unwrap();

        for record in rdr.records() {
            let record = record.unwrap();
            let context = record.get(0).unwrap();
            // FST map values are integers, so encode each capability as one.
            let capability = match record.get(1).unwrap() {
                "arbitrary" => 0,
                "structured" => 1,
                "fixed" => 2,
                _ => panic!("Unknown capability"),
            };

            // Keys must arrive in lexicographic order; `insert` errors otherwise.
            build.insert(context, capability).unwrap();
        }

        build.finish().unwrap();
    }

…and then, loading and querying it:

    use std::sync::LazyLock;

    use fst::Map;

    // Embed the FST serialized by build.rs directly into the binary.
    static CONTEXT_CAPABILITIES_FST: LazyLock<Map<&[u8]>> = LazyLock::new(|| {
        fst::Map::new(include_bytes!(concat!(env!("OUT_DIR"), "/context-capabilities.fst")).as_slice())
            .expect("couldn't initialize context capabilities FST")
    });

    impl Capability {
        fn from_context(context: &str) -> Option<Self> {
            // The integers mirror the encoding chosen in build.rs.
            match CONTEXT_CAPABILITIES_FST.get(context) {
                Some(0) => Some(Capability::Arbitrary),
                Some(1) => Some(Capability::Structured),
                Some(2) => Some(Capability::Fixed),
                Some(_) => unreachable!("unexpected context capability"),
                _ => None,
            }
        }
    }

(where the input context has been normalized, e.g. from
github["event"]["pull_request"]["title"] to
github.event.pull_request.title).
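For a sense of what that normalization involves, here is a rough, string-based sketch. It is not zizmor's implementation: `normalize_context` is a hypothetical helper, and mapping numeric indices like `[0]` onto `*` is my assumption about how index and star accesses end up sharing one pattern.

```rust
/// Normalize a context access into lowercase, dot-separated form:
/// `['key']` / `["key"]` become `.key`, and numeric indices become `*`.
fn normalize_context(raw: &str) -> String {
    let mut segments: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut chars = raw.chars();

    while let Some(c) = chars.next() {
        match c {
            '.' | '[' => {
                // Either delimiter ends the segment collected so far.
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
                if c == '[' {
                    // Consume everything up to (and including) the closing bracket.
                    let inner: String = chars.by_ref().take_while(|&ch| ch != ']').collect();
                    let key = inner.trim().trim_matches(|ch: char| ch == '\'' || ch == '"');
                    if !key.is_empty() && key.chars().all(|ch| ch.is_ascii_digit()) {
                        // Assumption: any concrete index is treated like `*`.
                        segments.push("*".to_string());
                    } else {
                        segments.push(key.to_string());
                    }
                }
            }
            _ => current.push(c),
        }
    }
    if !current.is_empty() {
        segments.push(current);
    }

    // Context names are matched case-insensitively, so fold everything to lowercase.
    segments.join(".").to_ascii_lowercase()
}
```

With this, each of the equivalent spellings of the first label's name shown earlier collapses to `github.event.pull_request.labels.*.name`, which can then be looked up as a single key.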
Everything above was included in zizmor 1.9.0, as part of a large-scale
refactor of the template-injection audit.
Was this overkill? Probably: I only have about ~4000 valid context patterns,
which would have comfortably fit into a hash table.
However, using an FST for this makes the footprint ludicrously and satisfyingly
small: each pattern takes less than 4 bytes to represent in the serialized FST
(~14.5 KB spread over ~4000 patterns), well below the roughly linear memory
footprint of loading the equivalent data from context-capabilities.csv.
Using an FST also unlocked future optimization ideas that I haven’t
bothered to experiment with yet:
fst crate helps