From text to circuit

Thus far, our exploration of lambeq has been confined to sentence-level analysis. However, many compelling NLP tasks, such as discourse analysis, summarization, and coreference resolution, inherently operate at the discourse level, requiring models to understand and process relationships and structures that span multiple sentences. To make these kinds of tasks possible on a quantum computer, lambeq supports DisCoCirc [Coe21], a framework of compositional models (still at the experimental stage) that can encode entire paragraphs or documents into a quantum circuit. The generated quantum circuits capture the core semantic information of the provided text, and can be trained using lambeq’s machine learning features [KKOK25].

Basics

DisCoCirc represents the entities found in the text as wires flowing from top to bottom in a string diagram. These entities are modified by boxes or frames, corresponding to higher-order linguistic constructions in the text. Let’s see a simple example.

from lambeq.experimental.discocirc import DisCoCircReader

reader = DisCoCircReader()

text = "Bob likes Alice. Alice hates Charlie."
diagram = reader.text2circuit(text)
diagram.draw()

../_images/1c90c2467f53ebc7d7dde3bf90f122e6dc6e914ae21770f73b8c3faccfc512ce.png

Our text has 3 entities, “Bob”, “Alice”, and “Charlie”, and 2 boxes representing actions, “likes” and “hates”, which act on and modify these entities in the order in which they occur in the text. Note how the resulting diagram encapsulates the essential semantic content of the paragraph, abstracting away from its syntactic structure.
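
To make this picture concrete, the following is a toy model of the diagram in plain Python. It is only a sketch and not lambeq's internal representation: each entity is treated as a wire, and each box records which wires it acts on, in text order. The names `Box` and `wires_in_order` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for an action box acting on entity wires."""
    name: str
    wires: Tuple[str, ...]


def wires_in_order(boxes):
    """Return the entity wires in order of first appearance in the text."""
    seen = []
    for box in boxes:
        for wire in box.wires:
            if wire not in seen:
                seen.append(wire)
    return seen


# Hand-written timeline for "Bob likes Alice. Alice hates Charlie."
timeline = [
    Box("likes", ("Bob", "Alice")),
    Box("hates", ("Alice", "Charlie")),
]

print(wires_in_order(timeline))  # ['Bob', 'Alice', 'Charlie']
```

The list order plays the role of the top-to-bottom direction of the diagram: “Alice” is first modified by “likes” and only then by “hates”.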

Let’s move to a more interesting example.

text = "Bob likes Alice. She likes Charlie more."
reader.text2circuit(text).draw()

../_images/2a0a34b773ccc181fcf8883dd38fb5088ce0be3265d3df996bb12825c03cec24.png

Note that in this case, the DisCoCirc reader uses coreference resolution to match the pronoun “she” in the second sentence with “Alice”. Also, the generated circuit now contains a box nested into a higher-order frame, representing the fact that the adverb “more” modifies the verb “likes” in the context of the interaction between “Alice” and “Charlie”.

Note

When frames are present in a string diagram, lambeq uses a colour encoding for wires and frames to make the diagram more readable, as in the above image. By default, the colour of each frame indicates its “type”, i.e. the number of nested boxes.

Sometimes, an entity can interact with more than one action, as in the following example:

text = "Bob gave Alice a book. He loves it, but she hated it."
reader.text2circuit(text).draw(figsize=(5,7))

../_images/f8c8f5238db8939ff5bcac6b0930d8563600675ccf1ad7d438c9d49e79e41a76.png

In the above, the book is loved by Bob, but at the same time hated by Alice. To allow this, the blue wire branches into two paths to interact with the separate actions and then recombines into a single wire to deliver the result. This branching and merging is achieved through spider operations.

Parsing longer texts

The DisCoCircReader is efficient and robust enough to parse really long texts, for example entire book chapters, into a single diagram. However, keep in mind that such a diagram can quickly become very dense, to the point that it is difficult to read. In fact, as you can see below, even relatively short and simple texts can generate fairly complicated diagrams.

text = "Anna found a small box in the garden. "\
       "She opened it and saw a key inside. "\
       "The key was old and rusty. "\
       "Anna wondered what it unlocks. "\
       "She asked John to help her, "\
       "but he was busy doing homework."
reader.text2circuit(text).draw(figsize=(10,15))

../_images/093bbd1acf3856f17ddf13b7cdeaec68a44290d7f5118a59cedb926abf001a41.png

Note

When plotting dense diagrams, increasing the size of the figure with the figsize parameter of the draw() method can significantly improve the readability of the result.
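
When assembling longer texts in code, keep in mind that Python concatenates adjacent string literals with no separator, so every literal except the last must end with a space or the sentence boundaries are silently lost. One possible alternative, sketched below with illustrative variable names, is to keep the sentences in a list and join them explicitly.

```python
sentences = [
    "Anna found a small box in the garden.",
    "She opened it and saw a key inside.",
    "The key was old and rusty.",
]

# Joining with a single space keeps the sentence boundaries intact.
text = " ".join(sentences)

# Adjacent literals, by contrast, are glued together with no separator.
glued = "The key was old." "It was rusty."

print(text)
print(glued)  # The key was old.It was rusty.
```

The joined `text` can then be passed to text2circuit() in the same way as the strings in the examples above.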

Simplifying the diagram

The discocirc package provides a few ways to abstract away some of the details of the text diagrams, which we introduce in the following sections.

Rewrite rules

One way to avoid making your text diagrams excessively complex is to reduce the amount of frame nesting in expressions that involve large chains of modifiers. Check the following example:

text =  "As Red Riding Hood walked in the forest, "\
        "a tall shadowy figure appeared among the trees. "\
        "It was the Big Bad Wolf."
reader.text2circuit(text).draw(figsize=(8,10))

../_images/f8fb199db45fffae73ac2fb6d179a2f27e4fc6e207c938968aab004f0ab0b46c.png

Note that the entity “Red Riding Hood” has been represented simply as “hood”, modified by the adjectives “red” and “riding” later in the timeline of the interactions in the text. Similarly, “big bad wolf” was analysed into a “wolf”, modified by the higher-order boxes “big” and “bad”. While this analysis makes linguistic sense, it has very limited use for cases like these.

DisCoCircReader provides the means to avoid using that level of detail through rewrite rules that collapse long modification chains into a simpler entity. For example, in order to collapse all the noun-modification chains in the above diagram, you can use the noun_modification rule as below:

reader.text2circuit(text, rewrite_rules=['noun_modification']).draw(figsize=(7,7))

../_images/4322d3cefa2d28069d84cfe01f04288189adc6d2a484acef5100b7f08e433a30.png

The resulting diagram is significantly simpler, and the entities at the top of the diagram make much more sense for the given story.

The table below includes all the pre-defined rewrite rules with short descriptions.

Rewrite rule	Description
`determiner`	Joins determiners (“the”, “an”, “a”) with the nouns they modify
`auxiliary`	Attempts to collapse auxiliaries (“do”, “is” etc) with entities or actions they modify
`noun_modification`	Collapses all noun modification chains into a single entity
`verb_modification`	Attempts to collapse verb modification chains into a single action
`sentence_modification`	Attempts to collapse sentence modification chains

Note

Users can create their own rewrite rules programmatically by using the class TreeRewriteRule in the lambeq.experimental.discocirc package.

Tip

To collapse all possible modification chains in a diagram, create an empty TreeRewriteRule and pass it to the rewrite_rules argument of the text2circuit() method.
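
The pre-defined rule names from the table can be checked before they reach the reader, which catches typos early. The sketch below assumes that rewrite_rules accepts a list containing several rule names at once (the example above only passes a single name), and the helper `text2circuit_with_rules` is a hypothetical wrapper, not part of lambeq. It handles string names only, so custom TreeRewriteRule objects would need to be passed to text2circuit() directly.

```python
# Rule names copied from the table of pre-defined rewrite rules.
PREDEFINED_RULES = (
    "determiner",
    "auxiliary",
    "noun_modification",
    "verb_modification",
    "sentence_modification",
)


def text2circuit_with_rules(reader, text, rules):
    """Validate rule names, then delegate to reader.text2circuit().

    `reader` is expected to be a DisCoCircReader; for this sketch, any
    object exposing text2circuit(text, rewrite_rules=...) will do.
    """
    unknown = [rule for rule in rules if rule not in PREDEFINED_RULES]
    if unknown:
        raise ValueError(f"Unknown rewrite rule(s): {', '.join(unknown)}")
    return reader.text2circuit(text, rewrite_rules=list(rules))


# Hypothetical usage with the reader created earlier:
# diagram = text2circuit_with_rules(
#     reader, text, ["determiner", "noun_modification"])
```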

Pruning infrequent nouns

Another way to simplify a DisCoCirc diagram is to ignore any entities that do not appear very frequently in the text. In the following example, note that “beagle” appears only once. We can ask the reader to ignore any entities that do not occur often enough, using a threshold set with the parameter min_noun_freq.

text = "Tom adopted a dog. The animal was a beagle. "\
       "He named it Max. The dog loves playing outside. "
diagram = reader.text2circuit(text, min_noun_freq=2)
diagram.draw(figsize=(5,7))

../_images/f58c970154f00619130325d9157d60d77ce54409b841854f0fd5a62181c2e181.png
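
The idea behind frequency-based pruning can be illustrated with a few lines of plain Python. This is a sketch of the general technique only, not lambeq's implementation: it assumes the threshold is inclusive, as the name min_noun_freq suggests, and it counts hand-written surface mentions of the common nouns in the example text. How the reader itself counts coreferent mentions or proper nouns is not covered here. The function `frequent_nouns` is hypothetical.

```python
from collections import Counter


def frequent_nouns(mentions, min_noun_freq):
    """Keep nouns mentioned at least `min_noun_freq` times, first-seen order."""
    counts = Counter(mentions)
    kept = []
    for noun in mentions:
        if counts[noun] >= min_noun_freq and noun not in kept:
            kept.append(noun)
    return kept


# Hand-written common-noun mentions from the example text (not reader output):
# "a dog", "The animal", "a beagle", "The dog".
mentions = ["dog", "animal", "beagle", "dog"]

print(frequent_nouns(mentions, min_noun_freq=2))  # ['dog']
```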

The “sandwich” functor

In a DisCoCirc diagram, frames can be seen as “quantum supermaps” that modify boxes. However, these supermaps do not correspond to unitary boxes and thus are not directly executable on a quantum computer. One way to convert a frame into a conventional structure of unitary boxes is to insert trainable unitaries at the beginning and the end of the frame, as well as between consecutive arguments in it. This construction, known as the sandwich functor [LMC24], is natively supported by the DisCoCircReader class via the sandwich argument of the text2circuit() method. Consider the following example:

text = "Bob likes Alice, but she likes Charlie more."
reader.text2circuit(text).draw(figsize=(6,6))

../_images/4f867bb0692e2895204158d5860c4dfe8770e1560a750f7d178be5aa4fdea647.png

Note that in this diagram, we get one frame with a single box in it (“more”) and one frame with two boxes (“but”). Let’s see how these frames will be converted to unitary box sequences by the sandwich functor.

diagram = reader.text2circuit(text, sandwich=True)
diagram.draw(figsize=(6,6))

../_images/5995c4c294715b39b9f5b5acb32bcade045c518912d5d474f6526dfe7de34cdf.png

We’ll focus first on the frame “more”, as it’s the simpler case. You can see that it is now replaced by two unitary boxes that wrap the “likes” box, one before and one after it; this is the essence of the sandwich construction. In the slightly more complicated case of the frame “but”, which contains two boxes (actually, one box and one frame), note that these two boxes are separated by another special “separator” box (but_1). This applies to any number of boxes in the frame, i.e. for \(n\) argument-boxes there will be \(n-1\) separator-boxes in the final diagram.
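
The counting argument can be checked with a small sketch that lists, in order, the boxes that replace a frame. This is an illustration of the construction as described here, not lambeq's code. The separator naming `<frame>_<i>` mirrors the but_1 box mentioned above, while the `<frame>_start` and `<frame>_end` labels and the function `sandwich_sequence` are made up for this example.

```python
def sandwich_sequence(frame, arguments):
    """List the boxes that replace a frame, in top-to-bottom order.

    One unitary opens the frame, one closes it, and a separator sits
    between each pair of consecutive arguments.
    """
    if not arguments:
        raise ValueError("a frame needs at least one argument")
    sequence = [f"{frame}_start"]  # hypothetical label
    for index, argument in enumerate(arguments):
        if index > 0:
            sequence.append(f"{frame}_{index}")  # separator box
        sequence.append(argument)
    sequence.append(f"{frame}_end")  # hypothetical label
    return sequence


print(sandwich_sequence("more", ["likes"]))
# ['more_start', 'likes', 'more_end']
print(sandwich_sequence("but", ["likes", "more"]))
# ['but_start', 'likes', 'but_1', 'more', 'but_end']
```

For \(n\) arguments the sequence contains \(n-1\) separators plus the two outer boxes, i.e. \(n+1\) inserted unitaries in total.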

Applying an ansatz

A DisCoCirc diagram in the above form (i.e. with the sandwich construction applied) is a standard string diagram that can be converted into a quantum circuit by applying any lambeq ansatz.

from lambeq import Sim4Ansatz, AtomicType

text = "Alice loves hiking. She often visits mountains. They provide peace and beauty."
diagram = reader.text2circuit(text, sandwich=True)
diagram.draw()

../_images/54d044f4e3c4c40d1ca07153bdfdc29ba8bf3a5866a83e9e6a0ab4381a8445e0.png

We’ll use the Sim4Ansatz to convert the above diagram into a circuit.

ansatz = Sim4Ansatz({AtomicType.NOUN: 1, AtomicType.SENTENCE: 1}, n_layers=1)
circuit = ansatz(diagram)
circuit.draw(figsize=(10,10))

../_images/2dbdab65f5ff1fa56b1271f47be80e877c7472141ae3e7ffe13b333ea3eb824f.png

You are now ready to train your text circuits with any lambeq model/trainer. See the DisCoCirc training tutorial for more details.