Fun with finite state transducers

This page was created programmatically. To read the article in its original location, go to the link below:
https://blog.yossarian.net/2025/08/14/Fun-with-finite-state-transducers
If you want this article removed from our site, please contact us.

Programming, philosophy, pedaling.

Aug 14, 2025

Tags: devblog, programming, rust, zizmor

I recently solved an interesting problem inside zizmor
with a type of state machine/automaton I hadn’t used before: a
finite state transducer (FST).

This is just a quick write-up of the problem and how I solved it. It doesn’t
go particularly deep into the data structures themselves. For more information
on FSTs themselves, I strongly recommend burntsushi’s article on transducers
(which is what actually led me to his fst crate).

TL;DR: I used the fst crate to build a finite state transducer that
maps GitHub Actions context patterns to their logical “capability” in
the context of a potential template injection vulnerability. This ended up
being an order of magnitude smaller in terms of representation (~14.5 KB instead
of ~240 KB) and faster and more memory efficient than my naïve
initial approaches (tables and prefix trie walks). It also enabled me
to fully precompute the FST at compile time, eliminating the startup cost
of priming a trie- or table-based map.

Background

zizmor is a static analysis tool for GitHub Actions.

One of the categories of weaknesses it can find is template injections,
wherein the CI author uses a GitHub Actions expression in a shell or
similar context without realizing that the expression can escape any
shell-level quoting intended to “defuse” it.

Here’s an example, derived from a pattern that gets
exploited over and over again:

    - name: "Print the current ref"
      run: |
        echo "The current ref is: ${{ github.ref }}"

If this step is part of a workflow that grants elevated privileges to third
parties (like pull_request_target), an attacker can contrive a git
ref that escapes the shell quoting and runs arbitrary code.

For example, the following ref:

    innocent";cat${IFS}/etc/passwd;true${IFS}"

…would expand as:

    echo "The current ref is: innocent";cat /etc/passwd;true ""
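The key point is that the expression is substituted textually before the shell ever parses the script. A minimal sketch of that substitution step, assuming a hypothetical `expand` helper (this is an illustration only, not how the Actions runner is actually implemented):

```rust
/// Textually substitute `value` for every `${{ <context> }}` in `template`.
/// Hypothetical helper: it only mimics the "paste the value in before the
/// shell runs" behaviour described above, with one fixed spacing style.
fn expand(template: &str, context: &str, value: &str) -> String {
    // `{{{{` and `}}}}` are escaped braces, so this builds "${{ github.ref }}".
    let fence = format!("${{{{ {context} }}}}");
    template.replace(&fence, value)
}
```

Calling `expand` with the step's `run:` line, `github.ref`, and the malicious ref yields a script in which the first double quote is closed early and `cat${IFS}/etc/passwd` becomes its own command; the shell then expands `${IFS}` to whitespace when it runs the script.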

Fortunately, zizmor detects these:

    zizmor hackme.yml

    error[template-injection]: code injection via template expansion
      --> hackme.yml:15:41
       |
    14 |         run: |
       |         ^^^ this run block
    15 |           echo "The current ref is: ${{ github.ref }}"
       |                                         ^^^^^^^^^^ may expand into attacker-controllable code
       |
       = note: audit confidence → High
       = note: this finding has an auto-fix

The problem

There’s a very simple way to detect these vulnerabilities: we could walk every
code “sink” in a given workflow (e.g. run: blocks, action inputs that are
known to contain code, &c.) and look for the fences of an expression
(${{ ... }}). If we see these fences, we know that the contents
are a potential injection risk.
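To make the naive approach concrete, here is a rough sketch of a fence scanner. The function name `find_expressions` is hypothetical, and the scan is deliberately simplistic: it assumes that `}}` never appears inside a string literal within an expression.

```rust
/// Return the trimmed inner text of every `${{ ... }}` fence in `sink`.
/// Naive by design: the first `}}` after an opening fence closes it.
fn find_expressions(sink: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = sink;
    while let Some(start) = rest.find("${{") {
        // Skip past the three-byte opening fence.
        let after_open = &rest[start + 3..];
        match after_open.find("}}") {
            Some(end) => {
                found.push(after_open[..end].trim());
                // Continue scanning after the two-byte closing fence.
                rest = &after_open[end + 2..];
            }
            // Unterminated fence: nothing more to report.
            None => break,
        }
    }
    found
}
```

Every string this returns would be flagged under the naive scheme, regardless of what the expression actually evaluates to.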

This is appealing for reasons of simplicity, but is unacceptably noisy:

There are many Actions expressions that are trivially safe, or
non-trivial but deductively safe:
- Literals, e.g. ${{ true }} or ${{ 'lol' }};
- Any expression that can only expand to a literal:

      # only ever expands to 'main' or 'not-main'
      # despite using the github.ref context
      ${{ github.ref == 'main' && 'main' || 'not-main' }}

- Any expression that can’t expand to meaningful code, e.g.
  due to the expression’s type:

      # only ever expands to a number
      ${{ github.run_number }}

      # only ever expands to `true` or `false`
      ${{ github.ref == 'main' }}

There are many expressions that can appear unsafe by virtue of dataflow
or context expansion, but are actually safe because of the context’s
underlying type or constraints:
- ${{ github.event.pull_request.merged }} is populated by
  GitHub’s backend and can only expand to true or false, but requires
  us to know a priori that it’s a “safe” context;
- ${{ github.actor }} is an arbitrary string, but is limited
  in structure to characters that make it infeasible to perform
  a useful injection with (no semicolons, $, &c.).

zizmor generally aims to present low-noise findings, so filtering these
out by default is paramount.

The first group is pretty easy: we can do a small amount of dataflow analysis
to determine whether an expression’s evaluation is “tainted” by arbitrary
controllable inputs.

The second group is harder, because it requires zizmor to know additional facts about
arbitrary-looking contexts. The two main facts we care about are type
(whether a context expands to a string, a number, or something else)
and capability (whether the expansion is fully arbitrary, or constrained
in some manner that might make it safe or at least less risky). In practice
these both collapse down to capability, since we can categorize
certain types (e.g. booleans and numbers) as inherently safe.

Fact finding

So, what we want is a way to collect facts about every valid GitHub Actions
context.

The trick to this lies in remembering that, under the hood, GitHub Actions
is driven by GitHub’s webhooks API: most of the context state loaded
into a GitHub Actions workflow run is derived from the webhook payload
corresponding to the event that triggered the workflow.

So, how do we get a list of all valid contexts along with information
about their expansion? GitHub doesn’t provide this directly, but we can
derive it from their OpenAPI specification for the webhooks API.

This comes in the form of a ~4.5MB OpenAPI schema, which is a pain in the ass
to work with directly: it’s both heavily self-referential (by necessity,
since an “unrolled” version with inline schemas for each property would
be infeasibly large), is heavily telescoped (also by necessity, since
GitHub’s API responses themselves are not particularly flat), and makes ample
use of OpenAPI constructions like oneOf, anyOf, and allOf that require
careful additional handling.

At the bottom of all of this, however, is our reward: detailed information
about every property provided by each webhook event, including the property’s
type and valuable information about how the property is constrained.

For example, here’s the schema for a pull_request.state property:

    "state": {
        "description": "State of this Pull Request. Either `open` or `closed`.",
        "enum": ["open", "closed"],
        "type": "string",
        "examples": ["open"]
    }

This tells us that pull_request.state is a string, but that its value
is constrained to either open or closed. We categorize this as having
a “fixed” capability, since we know that the attacker can’t control the
structure of the value itself in a meaningful manner.
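As an illustration of how a schema property might collapse to a capability, here is a small sketch. The `Property` struct and `classify` function are hypothetical and much simpler than a real OpenAPI walk; in particular, treating `format: uri` strings as "structured" is purely an assumption made for the example, not a rule taken from zizmor.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Capability {
    Arbitrary,
    Structured,
    Fixed,
}

/// A drastically simplified view of one schema property (hypothetical).
struct Property<'a> {
    kind: &'a str,              // the schema "type", e.g. "string"
    enum_values: &'a [&'a str], // empty when the schema has no "enum"
    format: Option<&'a str>,    // e.g. Some("uri")
}

fn classify(p: &Property) -> Capability {
    match p.kind {
        // Booleans and numbers cannot carry meaningful code.
        "boolean" | "integer" | "number" => Capability::Fixed,
        // A string restricted to an enum is fixed as well.
        "string" if !p.enum_values.is_empty() => Capability::Fixed,
        // ASSUMPTION: URI-formatted strings count as structured.
        "string" if p.format == Some("uri") => Capability::Structured,
        // Anything else is treated as attacker-controllable.
        _ => Capability::Arbitrary,
    }
}
```

With this sketch, the `pull_request.state` schema above (a string with an `enum` of `open` and `closed`) classifies as `Capability::Fixed`, while an unconstrained string falls through to `Capability::Arbitrary`.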

Long story short: this is implemented as a helper script within zizmor
called webhooks-to-contexts.py. This script is run periodically in GitHub
Actions and walks the OpenAPI schema to produce a CSV,
context-capabilities.csv, that looks like this:

context	capability
github.event.pull_request.active_lock_reason	arbitrary
github.event.pull_request.additions	fixed
github.event.pull_request.allow_auto_merge	fixed
github.event.pull_request.allow_update_branch	fixed
github.event.pull_request.assignee	fixed
github.event.pull_request.assignee.avatar_url	structured
github.event.pull_request.assignee.deleted	fixed
github.event.pull_request.assignee.email	arbitrary
github.event.pull_request.assignee.events_url	arbitrary
github.event.pull_request.assignee.followers_url	structured
github.event.pull_request.assignee.following_url	arbitrary
github.event.pull_request.assignee.gists_url	arbitrary
github.event.pull_request.assignee.gravatar_id	arbitrary
github.event.pull_request.assignee.html_url	structured

…to the tune of about 4000 unique contexts.

The bigger problem

So, we have a few thousand contexts, each with a capability that
tells us how much of a risk that context poses in terms of template injection.
We can just shove these into a map and call it a day, right?

Wrong. We’ve glossed over a significant wrinkle, which is that context
accesses in GitHub Actions are not themselves always literal. Instead, they
can be patterns that can expand to multiple values.

A good example of this is github.event.pull_request.labels: labels is
an array of objects, each of which has a name property corresponding
to the label’s actual name. To access these, we can use syntaxes that select
individual labels or all labels:

    # access the first label's name
    github.event.pull_request.labels[0].name

    # access all labels' names
    github.event.pull_request.labels.*.name

In both cases, we want to apply the same capability to the context’s expansion.
To make things even more ~~complicated~~ exciting, GitHub’s own context
access syntax is surprisingly malleable: each of the following is a valid
and equivalent way to access the first label’s name:

    github.event.pull_request.labels[0].name
    github.EVENT.PULL_REQUEST.LABELS[0].NAME
    github['event']['pull_request']['labels'][0]['name']
    github['event'].pull_request['labels'][0].name
    github.event.pULl_ReQUEst['LaBEls'][0].nAmE

…and so on.

In sum, we have two properties that blow a hole in our “just shove it in a map”
approach:

- Contexts are patterned, and can’t be expanded into a static finite
  enumeration of simplified contexts. For example, we can’t know how many
  labels a repository has, so we can’t statically unfold
  the github.event.pull_request.labels.*.name context into N contexts
  that match everything the user might write in a workflow.
- Contexts can be expressed via multiple syntaxes. We can keep things
  simple on the capability extraction side by only using the
  jq-ish syntax, but we still need to normalize any contexts as they appear
  in workflows. This is not very difficult, but it makes a single map lookup
  even less appealing.
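A rough sketch of what such a normalization could look like is below. The names `normalize` and `push_segment` are hypothetical, and the rules are assumptions made for illustration: segments are lowercased, bracketed string keys become dotted segments, and any numeric index is rewritten to `*` so that it lines up with the `labels.*.name` pattern form. zizmor's real normalization may well differ.

```rust
/// Append a lowercased segment, skipping empty ones (e.g. between `]` and `.`).
fn push_segment(segments: &mut Vec<String>, segment: &str) {
    if !segment.is_empty() {
        segments.push(segment.to_ascii_lowercase());
    }
}

/// Normalize a context access into a lowercase, dot-separated form.
fn normalize(context: &str) -> String {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut chars = context.chars();
    while let Some(c) = chars.next() {
        match c {
            '.' => {
                push_segment(&mut segments, &current);
                current.clear();
            }
            '[' => {
                push_segment(&mut segments, &current);
                current.clear();
                // Collect up to the closing bracket; `take_while` also
                // consumes the `]` itself.
                let inner: String = chars.by_ref().take_while(|&ch| ch != ']').collect();
                let inner = inner.trim();
                if !inner.is_empty() && inner.chars().all(|ch| ch.is_ascii_digit()) {
                    // ASSUMPTION: any concrete index maps to the `*` pattern.
                    segments.push("*".to_string());
                } else {
                    // Strip the quotes around a string key such as ['labels'].
                    push_segment(
                        &mut segments,
                        inner.trim_matches(|ch: char| ch == '\'' || ch == '"'),
                    );
                }
            }
            _ => current.push(c),
        }
    }
    push_segment(&mut segments, &current);
    segments.join(".")
}
```

Under these assumptions, all five spellings listed earlier normalize to the same string, `github.event.pull_request.labels.*.name`, which is the kind of single canonical key a lookup structure needs.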

The solution

To recap:

- We have a few thousand contexts, each of which is really a
  pattern that can match multiple “concrete” context usages as they appear
  in a user’s workflow.
- Each of these context patterns has an associated
  capability, as one of fixed | structured | arbitrary, indicating
  how much of a risk the context poses in terms of template injection.
- Our goal is to efficiently match these patterns against the contexts as
  they appear in a user’s workflow, and return the associated capability.

To me, this initially smacked of a prefix/radix trie problem: there are a
large number of common prefixes/segments in the pattern set, meaning that
the trie could be made relatively compact. However, tries are optimized for
operations that the problem doesn’t require:

- Tries are optimized to grow and shrink at runtime, but we don’t need that:
  the set of context patterns is static, and we’d ideally trade off some
  compile-time cost for runtime size and speed.
- Tries can perform efficient prefix and exact matches, but (typically)
  at the cost of a larger runtime memory footprint.

Finally, on a more practical level: I couldn’t find a good trie/radix trie
crate to use. Some of this might have been a discovery failure on my part,
but I couldn’t find one that was already widely used and still actively
maintained. radix_tree came the closest, but hasn’t been updated in nearly
5 years.

While reading about other efficient prefix representation structures,
I came across DAFSAs (also sometimes called DAWGs). These offer a
significantly more compact representation of prefixes than a trie, but at a
cost: unlike a trie, a DAFSA cannot contain auxiliary data. This makes them
great for inclusion checking (including of prefixes), but not so great for my
purpose of storing each pattern’s associated capability.

That brought me to transducers as a class of finite state machines: unlike
acceptors (DAFSAs, but also normal DFAs and NFAs) that map from
an input sequence to a boolean accept/reject state, transducers map
an input sequence to an output sequence. That output sequence can then be
composed (e.g. via summation) into an output value. In effect, the “path”
an input takes through the transducer yields an output value.
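To illustrate the "path yields a value" idea, here is a toy, hand-built transducer in plain Rust. It is only a sketch: the `Transducer` type and `toy_transducer` function are hypothetical, the representation is a simple hash map rather than anything resembling the fst crate's compact encoding, and the keys `ab` and `ac` are made up.

```rust
use std::collections::HashMap;

/// A toy deterministic transducer: each edge consumes one byte and emits
/// a number; the value of a key is the sum of the outputs along its path.
struct Transducer {
    // (state, input byte) -> (next state, output emitted on this edge)
    edges: HashMap<(usize, u8), (usize, u64)>,
    accepting: Vec<usize>,
}

impl Transducer {
    fn get(&self, key: &str) -> Option<u64> {
        let mut state = 0usize; // state 0 is the start state
        let mut total = 0u64;
        for &byte in key.as_bytes() {
            // A missing edge means the key is not in the map.
            let &(next, output) = self.edges.get(&(state, byte))?;
            state = next;
            total += output;
        }
        // Only keys that end in an accepting state have a value.
        self.accepting.contains(&state).then_some(total)
    }
}

/// Encodes "ab" -> 1 and "ac" -> 2, sharing both the `a` prefix edge
/// and the single accepting state.
fn toy_transducer() -> Transducer {
    Transducer {
        edges: HashMap::from([
            ((0, b'a'), (1, 0)),
            ((1, b'b'), (2, 1)),
            ((1, b'c'), (2, 2)),
        ]),
        accepting: vec![2],
    }
}
```

Looking up `ab` walks two edges and sums `0 + 1`, while `ac` sums `0 + 2`; the prefix `a` on its own stops in a non-accepting state and so has no value.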

In this way, FSTs can behave a lot like a map (whether backed by a
prefix trie or a hash table), but with some appealing additional properties:

- A finite state transducer can compress its entire input, unlike a prefix
  trie (which only compresses the prefixes). That means a more compact
  representation.
- A finite state transducer can share duplicated output values across
  inputs, unlike a prefix trie or hash table (which would store the same
  output value multiple times). This also means a more compact representation,
  and is particularly appealing in our case as we only have a small set
  of possible capabilities.

These desirable properties come with downsides too: optimal FST construction
requires memory proportional to the total input size, and requires ordered
insertion of each input. Modifications to an FST are also limited: optimal
insertions must be ordered, while deletions or changes to an associated
value require a full rebuild of the FST. In practice, this means that FST construction
is a static affair over a preprocessed input set. But that’s perfectly
fine for my use case!
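Because of the ordering requirement, the input set needs to be sorted (and free of duplicate keys) before it reaches the builder; to the best of my knowledge, the fst crate's map builder rejects a key that is not strictly greater, in byte order, than the previous one. A small preprocessing sketch using only the standard library, with hypothetical helper names `first_out_of_order` and `prepare`:

```rust
/// Index of the first key that is not strictly greater (bytewise) than its
/// predecessor, or `None` if the keys are ready for ordered insertion.
fn first_out_of_order(keys: &[&str]) -> Option<usize> {
    keys.windows(2)
        .position(|pair| pair[0].as_bytes() >= pair[1].as_bytes())
        .map(|i| i + 1)
}

/// Sort rows by key in byte order and drop rows with duplicate keys,
/// keeping the first row seen for each key (the sort is stable).
fn prepare(mut rows: Vec<(String, u64)>) -> Vec<(String, u64)> {
    rows.sort_by(|a, b| a.0.as_bytes().cmp(b.0.as_bytes()));
    rows.dedup_by(|a, b| a.0 == b.0);
    rows
}
```

A check like `first_out_of_order` could run in a generator script or a build step to fail early with a clear message, instead of surfacing as a builder error later on.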

Putting it together

As it turns out, using the fst crate to construct an FST at build time
is pretty simple. Here’s the totality of the code that I put in
build.rs to transform context-capabilities.csv:

    use std::{env, fs::File, io, path::Path};

    fn do_context_capabilities() {
        let manifest_dir = env::var("CARGO_MANIFEST_DIR").unwrap();
        let source = Path::new(&manifest_dir).join("data/context-capabilities.csv");

        // Rebuild the FST whenever the CSV changes.
        println!(
            "cargo::rerun-if-changed={source}",
            source = source.display()
        );

        let out_dir = env::var("OUT_DIR").unwrap();
        let out_path = Path::new(&out_dir).join("context-capabilities.fst");

        // The builder streams the serialized FST straight to the output file.
        let out = io::BufWriter::new(File::create(out_path).unwrap());
        let mut build = fst::MapBuilder::new(out).unwrap();

        let mut rdr = csv::ReaderBuilder::new()
            .has_headers(false)
            .from_path(source)
            .unwrap();

        for record in rdr.records() {
            let record = record.unwrap();
            let context = record.get(0).unwrap();
            // Encode each capability as a small integer output value.
            let capability = match record.get(1).unwrap() {
                "arbitrary" => 0,
                "structured" => 1,
                "fixed" => 2,
                _ => panic!("Unknown capability"),
            };

            // Keys must be inserted in sorted order.
            build.insert(context, capability).unwrap();
        }

        build.finish().unwrap();
    }

…and then, loading and querying it:

    use std::sync::LazyLock;

    use fst::Map;

    // The serialized FST is embedded in the binary at compile time.
    static CONTEXT_CAPABILITIES_FST: LazyLock<Map<&[u8]>> = LazyLock::new(|| {
        fst::Map::new(include_bytes!(concat!(env!("OUT_DIR"), "/context-capabilities.fst")).as_slice())
            .expect("couldn't initialize context capabilities FST")
    });

    impl Capability {
        fn from_context(context: &str) -> Option<Self> {
            // The values mirror the integer encoding used in build.rs.
            match CONTEXT_CAPABILITIES_FST.get(context) {
                Some(0) => Some(Capability::Arbitrary),
                Some(1) => Some(Capability::Structured),
                Some(2) => Some(Capability::Fixed),
                Some(_) => unreachable!("unexpected context capability"),
                _ => None,
            }
        }
    }

(where the input context has been normalized, e.g. from
github["event"]["pull_request"]["title"] to
github.event.pull_request.title).

Concluding thoughts

Everything above was included in zizmor 1.9.0, as part of a large-scale
refactor of the template-injection audit.

Was this overkill? Probably: I only have about ~4000 valid context patterns,
which would have comfortably fit into a hash table.

However, using an FST for this makes the footprint ludicrously and satisfyingly
small: each pattern takes less than 4 bytes to represent in the serialized FST,
well below the roughly linear memory footprint of loading the equivalent
data from context-capabilities.csv.
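The per-pattern figure follows directly from the approximate numbers quoted in the TL;DR. As a quick back-of-the-envelope check (assuming 1 KB means 1024 bytes and using the rounded figures of ~14.5 KB, ~240 KB and ~4000 patterns; `bytes_per_pattern` is a hypothetical helper):

```rust
/// Average bytes per pattern, assuming 1 KB = 1024 bytes.
fn bytes_per_pattern(total_kb: f64, patterns: u32) -> f64 {
    total_kb * 1024.0 / f64::from(patterns)
}
```

With these inputs, `bytes_per_pattern(14.5, 4000)` comes out at roughly 3.7 bytes per pattern, while the ~240 KB figure works out to roughly 61 bytes per pattern.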

Using an FST also unlocked future optimization ideas that I haven’t
bothered to experiment with yet:

- The FST is currently searched using a normalization of each context
  as it appears in a workflow. However, FSTs can be searched by any DFA,
  meaning that I could in theory convert each context into a regular
  expression instead. I’m unclear on whether this would bring a performance
  advantage, since the context-to-regex-to-DFA conversion itself is
  not necessarily cheap.
- In principle, the FST’s size could be squeezed down even further by
  splitting the context patterns into segments, rather than decomposing
  into sequences of bytes. This comes from the observation that many vertices
  in the FST are shared and singular, meaning that they only have one
  incoming edge and one outgoing edge. I don’t think the fst crate supports
  this natively, but it would yield the prefix deduplication benefits of
  a prefix trie while still preserving the compression benefits of an FST.
