Basics of representing syntax with VSAs

Converting a simple language into VSAs

Arbitrary syntax

In this section we will talk about the basics of specifying an abstract syntax, beginning first with our example language from the previous tutorial.

The language \(\mathcal{L}_{\text{fruit}}\)

Previously, we discussed the small language \(\mathcal{L}\) in VSAs. We were being a bit greedy with that symbol, so let us refer to the original language in the first example as \(\mathcal{L}_{\text{fruit}}\).

We also provided an example of defining a language inductively. It’s worthwhile to inexpect what we did and why we did it. Recall, the language \(\mathcal{L}_{\text{fruit}}\) is the minimal set of formulas which have the following properties:

Atomic formulae: For atomic symbols \(\mathcal{A} = \{\texttt{apple}, \texttt{banana}, \texttt{strawberry}\}\), if \(x \in \mathcal{A}\), then \(x \in \mathcal{L}_\text{fruit}\).

Atomic symbols like these are the most basic units of our language. E.g., if we are talking about the language of propositional logic, then the atomic values are the propositional variables \(p, q, r, \dots\). But, when talking about VSAs, we think it to be best practice to talk about delicious treats.

Compound formulae: If \(\phi \in \mathcal{L}_\text{fruit}\), and \(\psi \in \mathcal{L}_{\text{fruit}}\), then the tuple \((\phi, \psi)\) and the disjunction \((\phi \lor \psi)\) are both in \(\mathcal{L}_\text{fruit}\).

Compound statements are defined inductively over all the possible elements of \(\mathcal{L}_{\text{fruit}}\). This allows for arbitrary nesting tements. This is known as a compositional structure, roughly meaning that the things we’re talking about are made up of a small collection of basic units.

Sometimes definitions include a final case:

Anything else that isn’t mentioned in the above conditions is not in \(\mathcal{L}_{\text{fruit}}\).

But, we have already covered this by mentioning that the set was the minimal set. Both of these prevent us from shoving in whatever we want into \(\mathcal{L}_\text{fruit}\).

Backus-Naur Form

Another way of defining an set like \(\mathcal{L}_\text{fruit}\) is to use something called Backus-Naur Form (BNF).

Using BNF for \(\mathcal{L}_\text{fruit}\):

<expr> ::= <atomic>
         | (<expr>, <expr>)
         | (<expr> v <expr>)

<atomic> ::= strawberry | apple | banana

On the left-hand side of the rule we have the name of the rule, in our case we have one for composite formulae called <expr>, and for atomic formulae we have <atomic>. On the right hand side of the rule we have either: a reference to another rule or a collection of symbols with mentions to other rules or raw symbols. For example, (<expr>, <expr>) says that the rule <expr> contains formulae that have the symbols (, ,, and ) as well as two <expr>’s inside of them.

In general, we can read BNF descriptions as inductive definitions like the section above:

The set called <left-hand side name> is the minimal set defined by the condition (1), (2), etc.

For more information about how to talk about sets of things defined inductively, Hopcroft and Ullman (1979) and Friemdan and Mitchell (2008).

Role-Filler Pairs

The key data structure that we will be using to represent abstract syntax is the role-filler pair Kleyko, et al. (2023). Role-filler pairs are structured VSA representations which allow for representing practically any information that has a similar structure accross instances. For example, suppose we wish to represent a person’s likes and dislikes. Each person can only like or dislike one item. Say that Jane likes apples, but dislikes bananas. Let us have a codebook of \(D\)-dimensional vectors for symbols \([\texttt{like},~\texttt{dislike},~\texttt{apple},~\texttt{banana}]\). The preference can then be represented as:

\[ p = (\texttt{like} \otimes \texttt{apple}) \oplus (\texttt{dislike} \otimes \texttt{banana}). \tag{1} \]

If we store this representation, and we desire to decide on what kind of food to order from the grocery store, we can query the preference in order to determine what kinds of things to buy:

\[ p \otimes^{-1} \texttt{like} \approx \texttt{apple}. \tag{2} \]

It is important to note how general this scheme is. Any structure that has consistent elements can be represented as a role-filler pair. For example, we can conceivably represent the preferences not only of Jane, but also Bob and Kent.

Code exampleclass Codebook(UserDict):
    """Thin dictionary wrapper for codebooks."""

    def __init__(self, symbols: list[str], dim: int) -> None:
        super(Codebook, self).__init__()
        self.dim = dim
        for symbol in symbols:
            self.data[symbol] = random(1, dim).squeeze()


dim = 400
codebook = Codebook(symbols=["like", "dislike", "apple", "banana"], dim=dim)


def role_filler_pair(
    struct: typing.Union[
        dict[str, str], dict[str, HRR], dict[HRR, str], dict[HRR, HRR]
    ],
    codebook=codebook,
) -> HRR:
    """Create a role-filler pair from a dictionary."""
    v = np.zeros(codebook.dim).view(HRR)
    for role, filler in struct.items():
        if isinstance(role, str):
            role = codebook[role]
        if isinstance(filler, str):
            filler = codebook[filler]
        v += role.bind(filler)
    v /= np.linalg.norm(v)
    return v


# Jane's Preference: likes: apple, dislikes: banana
jane_preference = role_filler_pair(
    {
        "like": "apple",
        "dislike": "banana",
    }
)

# Test whether or not Jane likes apples or bananas?
jane_like = jane_preference.bind(codebook["like"].inverse())
print(f"Jane likes apples: {jane_like.cosine_similarity(codebook['apple'])}")
print(f"Jane likes bananas: {jane_like.cosine_similarity(codebook['banana'])}")

Jane likes apples: 0.5731136799350733
Jane likes bananas: 0.010662048594337383

Representing Syntax with Role-Filler Pairs

Let us consider again our language \(\mathcal{L}_{\text{fruit}}\). Recall that there were two broad classes of things within the language: atomic formulae and composite formulae, with the former being items in the language which have no other components, and composite representations which are made up of other items in the language.

While this is an important generalization, we have to be a bit more fine-grained as well. Recall that we have atomics, tuples, and disjunctions, with the latter two being the composite ones. In order to represent these, we will use a strategy with role-filler pairs called a tagged union. A tagged union is a structure which contains an role called a tag which has as its filler a small set of possible items. This lets the user know the data layout of the representation.

For each item in the language, we will create a tag. Let us create a codebook of \(D\)-dimensional vectors: \[ \mathcal{T} = [\texttt{atomic},~\texttt{tuple},~\texttt{disjunction}] \tag{3} \]

We will also create a codebook of what we’ll of roles. Let \(\mathcal{R}\) be the codebook of \(D\)-dimensional vectors: \[ \mathcal{R} = [\texttt{tag},~\texttt{name},~\texttt{lhs},~\texttt{rhs}]. \tag{4} \]

Astute readers will note that these correspond with the attributes of the dataclasses defined previously. For the sake of learning, we’ll reproduce them here below.

from dataclasses import dataclass
from abc import ABCMeta


@dataclass
class L(metaclass=ABCMeta):
    """Abstract base class of our language L."""

    pass


@dataclass
class Atomic(L):
    """Abstract base class of atomic elements in the language."""

    name: str


@dataclass
class Tuple(L):
    """Tuples in L."""

    lhs: L
    rhs: L


@dataclass
class Disjunction(L):
    """Disjunctions in L."""

    lhs: L
    rhs: L

Finally, we need to define a codebook for the atomic vectors. Let \[ \mathcal{A} = [\texttt{strawberry},~\texttt{banana},~\texttt{apple}], \tag{5} \] be the codebook for the atomic symbols.

Encoding

We can now define the encoding function using this tagged-union approach:

dim = 1_000
T = ["atomic", "tuple", "disjunction"]
R = ["tag", "name", "lhs", "rhs"]
A = ["strawberry", "banana", "apple"]
codebook = Codebook(T + R + A, dim=dim)


def encode(expr: L, codebook: Codebook = codebook) -> HRR:
    """Encode a formula in the language $\mathcal{L}_{\text{fruit}}$."""
    if not isinstance(expr, L):
        raise TypeError("Expected a subclass of L", expr)

    if isinstance(expr, Atomic):
        name = expr.name
        return role_filler_pair(
            {
                "tag": "atomic",
                "name": codebook[name],
            },
            codebook=codebook,
        )
    elif isinstance(expr, Tuple):
        lhs = expr.lhs
        rhs = expr.rhs
        return role_filler_pair(
            {
                "tag": "tuple",
                "lhs": encode(lhs, codebook=codebook),
                "rhs": encode(rhs, codebook=codebook),
            },
            codebook=codebook,
        )
    elif isinstance(expr, Disjunction):
        lhs = expr.lhs
        rhs = expr.rhs
        return role_filler_pair(
            {
                "tag": "disjunction",
                "lhs": encode(lhs, codebook=codebook),
                "rhs": encode(rhs, codebook=codebook),
            },
            codebook=codebook,
        )

We can see below that the encoded representation for the atomic formula \(\texttt{apple}\) is now structured. It loses the intuitive mapping from the previous encoding, with the tradeoff now that we can directly inspect the elements of the encoded representation without knowing already what is in the body.

apple = Atomic("apple")
enc_apple = encode(apple)
is_atomic = enc_apple.bind(codebook["tag"].inverse()).cosine_similarity(
    codebook["atomic"]
)
print(f"Is `enc_apple` atomic: {is_atomic:.2f}")
is_apple = enc_apple.bind(codebook["name"].inverse()).cosine_similarity(
    codebook["apple"]
)
print(f"Is the name of `enc_apple` `apple: {is_apple:.2f}")

Is `enc_apple` atomic: 0.58
Is the name of `enc_apple` `apple: 0.57

In fact, using this representation, we can actually decode the encoded formulae of the language.

Decoding

Given that our representations are structured, and that we have the codebook of all potential values hanging around, we can decode the representations back into a human readable form (with some loss, but that’s the name of the game).

def decode(enc: HRR, codebook=codebook, theta: float = 0.2) -> L:
    """Decode a representation back to ``L``."""
    tag = enc.bind(codebook["tag"].inverse())
    t_atom = codebook["atomic"]
    t_tuple = codebook["tuple"]
    t_disj = codebook["disjunction"]

    if tag.cosine_similarity(t_atom) > theta:
        name = enc.bind(codebook["name"].inverse())

        # Recall the name from the codebook
        keys, values = zip(*codebook.items())
        V = np.array(values)
        sims = V @ name
        argmax = np.argmax(sims)

        return Atomic(keys[argmax])
    else:
        # Trick here is that both of the other representations have
        # an `lhs` and a `rhs`.
        lhs = enc.bind(codebook["lhs"].inverse())
        rhs = enc.bind(codebook["rhs"].inverse())
        dec_lhs = decode(lhs, codebook=codebook, theta=theta)
        dec_rhs = decode(rhs, codebook=codebook, theta=theta)
        if tag.cosine_similarity(t_tuple) > theta:
            return Tuple(dec_lhs, dec_rhs)
        else:
            return Disjunction(dec_lhs, dec_rhs)


# Testing atomic:
straw = Atomic("strawberry")
print(f"Original form: {straw}")
enc_straw = encode(straw)
dec_straw = decode(enc_straw)
print(f"Decoded form: {dec_straw}")

print()

# Testing tuples:
apple = Atomic("apple")
enc_apple = encode(apple)
tupl = Tuple(straw, apple)
print(f"Original form: {tupl}")
enc_tupl = encode(tupl)
dec_tupl = decode(enc_tupl)
print(f"Decoded form: {dec_tupl}")

print()

# Testing disjunctions
disj = Disjunction(straw, apple)
print(f"Original form: {disj}")
enc_disj = encode(disj)
dec_disj = decode(enc_disj)
print(f"Decoded form: {dec_disj}")

Original form: Atomic(name='strawberry')
Decoded form: Atomic(name='strawberry')

Original form: Tuple(lhs=Atomic(name='strawberry'), rhs=Atomic(name='apple'))
Decoded form: Tuple(lhs=Atomic(name='strawberry'), rhs=Atomic(name='apple'))

Original form: Disjunction(lhs=Atomic(name='strawberry'), rhs=Atomic(name='apple'))
Decoded form: Disjunction(lhs=Atomic(name='strawberry'), rhs=Atomic(name='apple'))

Unfortunately, decoding is brittle. We will talk about ways to remedy this in the next section.

Conclusion

It is clear to see how we can generalize this approach to any arbitrary language \(\mathcal{L}\) which is defined inductively. What we shall soon see are: (1) the limits of this approach, and how we can remedy them with associative memories; (2) alternative approaches to representing syntax. After this, we will talk about interpretation.