Context Free Grammar to Regular Expression?

[Image of the grammar in question: S → NV; N → AN | NN | NA; A → a | b | c; N → x | y | z; V → NV; V → p | q]

I want to learn whether I can create regular expressions from a given Context Free Grammar. I found some examples that can be translated to regular expressions; however, all of them looked like "S -> aB | bC", whereas my example looks like "S -> AB | BC". Here is my example: can I create REs from this? What is the reason? I want to fully understand why I can and why I cannot. Thank you.

asked Dec 19, 2020 at 22:46 by Dilara Daria

2 Answers 2


The set of all regular languages is a subset of the context-free languages. So if you have a context-free grammar (CFG) that generates a regular language, you most certainly can convert it to a regular expression (RE), a regular grammar (RG), or a finite automaton (FA).

Before I go further with your example, I will simplify it so that we only deal with 3 terminals (instead of 8). In the example below, the terminals a, b, c will be represented by the single terminal a; the terminals x, y, z will be represented by x; and the terminals p, q will be represented by p.

S -> N V
N -> a N | N N | N a | x
V -> N V | p

There is no ready-made procedure for converting it, so we will take it step by step. First, we need to see what kinds of strings the variable N can produce.

We can see that every N introduced ultimately terminates with an x. The a's can appear on either side of the x and repeat as many times as needed. This entire pattern can be repeated as many times as needed, but must be produced at least once. So N generates the same strings as the RE $(a^*x^+a^*)^+$.
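As a quick sanity check of this reading (a sketch; the length bound `MAX_LEN` and the helper name `derive_N` are my own choices, not part of the answer), we can enumerate every terminal string derivable from N up to a fixed length and compare against Python's `re` for the pattern $(a^*x^+a^*)^+$:

```python
import re
from itertools import product

MAX_LEN = 6  # arbitrary search bound

def derive_N(max_len):
    """All terminal strings of length <= max_len derivable from N.
    Rules: N -> aN | NN | Na | x.  Pruning by length is sound because
    no rule shrinks a sentential form; leftmost expansion suffices."""
    seen, result = set(), set()
    frontier = ['N']
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        if 'N' not in form:
            result.add(form)
            continue
        i = form.index('N')
        for rhs in ('aN', 'NN', 'Na', 'x'):
            frontier.append(form[:i] + rhs + form[i+1:])
    return result

# The claimed regular expression for N: (a* x+ a*)+
pattern = re.compile(r'(a*x+a*)+')
regex_lang = {
    ''.join(w)
    for n in range(1, MAX_LEN + 1)
    for w in product('ax', repeat=n)
    if pattern.fullmatch(''.join(w))
}

print(derive_N(MAX_LEN) == regex_lang)  # the two sets agree up to MAX_LEN
```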

Next we look at V. It can produce any number of N's but must terminate with a single p. So we have $((a^*x^+a^*)^+)^*p = (a^*x^+a^*)^*p$.

Finally, S produces NV which is similar to our RE for V . Note that there is at least one N produced. So that CFG is converted to $(a^*x^+a^*)^+p$ .
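The same bounded check can be run on the whole simplified grammar against $(a^*x^+a^*)^+p$ (again a sketch; the rule table, the `language` helper, and the length bound are illustrative choices of mine):

```python
import re
from itertools import product

# Simplified grammar from the answer.
RULES = {
    'S': ['NV'],
    'N': ['aN', 'NN', 'Na', 'x'],
    'V': ['NV', 'p'],
}
MAX_LEN = 6  # arbitrary search bound

def language(start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.
    Safe to prune by length: no rule shrinks a sentential form."""
    seen, words = set(), set()
    frontier = [start]
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        nts = [i for i, s in enumerate(form) if s in RULES]
        if not nts:
            words.add(form)
            continue
        i = nts[0]  # leftmost expansion suffices for generation
        for rhs in RULES[form[i]]:
            frontier.append(form[:i] + rhs + form[i+1:])
    return words

pattern = re.compile(r'(a*x+a*)+p')
regex_lang = {
    ''.join(w)
    for n in range(1, MAX_LEN + 1)
    for w in product('axp', repeat=n)
    if pattern.fullmatch(''.join(w))
}

print(language('S', MAX_LEN) == regex_lang)
```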

To put it in terms of your original grammar, it would look something like this: $((a|b|c)^*(x|y|z)^+(a|b|c)^*)^+(p|q)$. Again, because of the number of terminals in the original grammar, it is easier to simplify things to help see the pattern the grammar is actually showing.

As a disclaimer, this is not a proof, but it shows the reasoning as to how I created an RE from the CFG you provided. To know if this is correct, you would need to know the exact language the CFG produces. Unfortunately, the grammar is not easily recognizable.

I used the CFG Developer tool to help see a pattern from your grammar.

answered Dec 22, 2020 at 22:34

To deal with the issue cogently requires first some background. These issues can be dealt with, entirely, within an algebraic framework. Algebraically, a grammar is a system of fixed-point inequations over a certain type of partially ordered algebra, with the item represented by the grammar given in terms of the least fixed point solution of the system.

The algebra underlying regular expressions is known as the Kleene Algebra. A weaker first-order formulation of the algebra is as a partially ordered monoid which possesses least upper bounds for finite subsets (particularly, with the least upper bounds $⋁ ∅ = 0$ and $⋁ \{a, b\} = a + b$, the latter also written as $a | b$), with the least upper bound distributive over the product; such that all linear systems possess least fixed point solutions. Axioms may be presented as follows:

Monoid: $x(yz) = (xy)z, x1 = x = 1x$

Least Upper Bound: $x + (y + z) = (x + y) + z, x + 0 = x = 0 + x, x + y = y + x, x + x = x$

Distributivity: $x0y = 0, w(x + y)z = wxz + wyz$

with the partial ordering defined by $x ≥ y ⇔ ∃z: x = z + y ⇔ x = x + y$. (This reverses the order in which $+$ and $≥$ are given, by making $+$ the more fundamental of the two.)

The least solution of the fixed point inequality $x ≥ 1 + ax$ is $x = a^*$ , which defines the star operator. The last axiom may be given in any one of a large number of forms; such as the requirement that the least fixed point solution to $x ≥ a + bx + xc$ be $x = b^* a c^*$ . I think you also need $0^* = 1$ paired off with this axiom in order to recover $1 + a a^* = a^* = 1 + a^* a$ . There are several other axiom sets that do the same job as these two.
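As a concrete illustration of that defining property in the Kleene algebra of languages (my own sketch, not from the answer): iterating the map behind $x ≥ 1 + ax$ from the bottom element $0 = ∅$ converges to $a^*$, up to any fixed length bound.

```python
# Least fixed point of  x >= 1 + ax  by Kleene iteration on languages:
# x_0 = 0 (empty set), x_{k+1} = {""} ∪ a·x_k.  The bound is arbitrary.
MAX_LEN = 5

def step(X):
    return {''} | {'a' + w for w in X}

X = set()                     # x_0 = 0
for _ in range(MAX_LEN + 1):  # enough iterations to reach all lengths <= MAX_LEN
    X = step(X)

a_star = {'a' * n for n in range(MAX_LEN + 1)}
print({w for w in X if len(w) <= MAX_LEN} == a_star)  # True
```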

A stronger axiomatization (which, technically, is not first order since it is infinitary) replaces the least fixed point solvability requirement by the following:

*-continuity: $⋁_{n≥0} \left(a b^n c\right) = a b^* c$

and defines $a^*$ as $⋁_{n≥0} a^n$.

The free Kleene algebra over a set $X$ is one and the same as the algebra of regular expressions taken over $X$ as the alphabet. The free extension of a monoid $M$ to a *-continuous Kleene algebra is one and the same as the family $ℜM$ of "rational subsets" of $M$. The Kleene operations, for it, are defined as $$0 = ∅, \quad A + B = A ∪ B, \quad AB = \{ ab ∈ M : a ∈ A, b ∈ B \},$$ for subsets $A, B ⊆ M$, where the individual elements of $M$ itself are included within the algebra by identifying each $m ∈ M$ with its corresponding singleton set $\{m\}$. The partial ordering is the subset order $A ≤ B ⇔ A ⊆ B$, for $A, B ⊆ M$, and the Kleene star (by the *-continuity property) is given as $$A^* = ⋃_{n≥0} A^n$$ (where $A^0 = \{1\}, A^1 = A, A^2 = A A$, etc.), for subsets $A ⊆ M$.

The rational subsets $ℜM$ of a monoid $M$ comprise the family of subsets obtained from the finite subsets of $M$ by the Kleene operations. For a finite alphabet $X$, and its corresponding free monoid $X^*$, the Kleene algebra $ℜX^*$ is the algebra of regular expressions over $X$. But also: for two alphabets $X$ and $Y$ (which we may think of respectively as inputs and outputs), $ℜ\left(X^*×Y^*\right)$ is the Kleene algebra of "rational transductions" between alphabets $X$ and $Y$. This can be generalized to $n = 3, 4, 5, \dots$ alphabets, defining rational transductions that have 3 or more channels or tiers.

A stronger axiom, still, asserts the existence of not just linear least fixed point solutions, but also least fixed point solutions to non-linear inequalities, $x ≥ f(x)$ , where $f(x)$ is a Kleene-algebraic function formed from $x$ , with the corresponding least fixed point being defined as $μx f(x)$ . In place of the *-continuity axiom, one may pose

μ-continuity: $⋁_{n≥0} \left(a f^n(0) c\right) = a \left(μx f(x)\right) c,$

where $f^0(x) = x$ , $f^1(x) = f(x)$ , $f^2(x) = f(f(x))$ , etc.

These are the μ-continuous Chomsky Algebras (or I'll just call them Chomsky algebras for short). (Edit: the left context $a$ and right context $c$, omitted by accident, were added back in. The μ-continuity property is supposed to be a generalization of *-continuity.)

They define the algebra of context-freeness. The free algebra generated by a set $X$ is the family $ℭX^*$ of context-free subsets of $X^*$; i.e. context-free languages over the alphabet $X$. Similarly, $ℭ\left(X^*×Y^*\right)$ comprises the push-down transductions from alphabet $X$ to $Y$ and, generally, the family $ℭM$ of context-free subsets of a monoid $M$ comprises the Chomsky algebra freely generated from $M$. A yacc grammar, for instance, is actually a push-down transduction since it has both input symbols and, effectively, output symbols (each item in a yacc grammar enclosed in <...> may be regarded as a single output symbol).

So, with that background out of the way, we get to your problem. Your grammar, written as a system of inequalities over a Kleene algebra is $$S ≥ N V$$ $$N ≥ A N + N N + N A$$ $$A ≥ a + b + c$$ $$N ≥ x + y + z$$ $$V ≥ N V$$ $$V ≥ p + q$$ or equivalently $$S ≥ N V$$ $$N ≥ A N + N N + N A + Z$$ $$A ≥ a + b + c$$ $$V ≥ N V + Q$$ where, for convenience, we define $Z = x + y + z$ and $Q = p + q$ .

In any Kleene algebra, the least fixed point solution to $x ≥ a + bx + xc + xdx$ is $x = b^* a c^* \left(d b^* a c^*\right)^*$ - that's a theorem of Kleene algebras. This can be applied to eliminate and reduce a system containing an inequality of this form. Thus we have (with $(Z,A,A,1)$ in place of $(a,b,c,d)$ and $N$ in place of $x$ ):
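The theorem can be spot-checked in the Kleene algebra of languages by treating the inequality as the grammar $X → a \mid bX \mid Xc \mid XdX$ over four letters and comparing its bounded language with the claimed regular expression (the length bound and helper name are my own illustrative choices):

```python
import re
from itertools import product

# Least solution of  x >= a + bx + xc + xdx  should be  b* a c* (d b* a c*)*.
RULES = {'X': ['a', 'bX', 'Xc', 'XdX']}
MAX_LEN = 6  # arbitrary search bound

def language(max_len):
    """All terminal strings of length <= max_len derivable from X.
    Length pruning is sound: no rule shrinks a sentential form."""
    seen, words = set(), set()
    frontier = ['X']
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        if 'X' not in form:
            words.add(form)
            continue
        i = form.index('X')
        for rhs in RULES['X']:
            frontier.append(form[:i] + rhs + form[i+1:])
    return words

pattern = re.compile(r'b*ac*(db*ac*)*')
regex_lang = {
    ''.join(w)
    for n in range(1, MAX_LEN + 1)
    for w in product('abcd', repeat=n)
    if pattern.fullmatch(''.join(w))
}

print(language(MAX_LEN) == regex_lang)
```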

$$N ≥ A^* Z A^* \left(1 A^* Z A^*\right)^* = A^* Z A^* \left(A^* Z A^*\right)^*.$$

Using basic properties of Kleene algebras, such as $x(yx)^* = (xy)^*x$ , $x^* x^* = x^*$ , we can write

$$N ≥ A^* Z A^* \left(Z A^*\right)^*.$$

For $V$ , we have a reduction to

Now, back-substitute into $S$

$$S ≥ N N^* Q ≥ A^* B B^* \left(A^* B B^*\right)^* Q,$$

where $B = Z A^*$ . Apply basic Kleene identities, such as $1 + y^* = y^*$ , $c^* \left(d c^*\right)^* = (c + d)^*$ and $x + y^*x = \left(1 + y^*\right) x = y^* x$ to get

$$A^* B B^* \left(A^* B B^*\right)^* = A^* B \left(B + A^* B\right)^* = A^* B \left(A^* B\right)^* = A^* \left(B A^*\right)^* B.$$

Substituting back in $B = Z A^*$ , we get

$$S ≥ A^* \left(Z A^* A^*\right)^* Z A^* Q = A^* Z \left(A^* Z\right)^* A^* Q = A^* Z \left(A + Z\right)^* Q.$$

To get the final result for $S$ , substitute in $A = a + b + c$ , $Q = p + q$ and $Z = x + y + z$ .
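In character-class notation the final result reads $[abc]^*[xyz][abcxyz]^*[pq]$, and it can be checked against the original grammar on bounded strings (a sketch; the rule table, helper, and length bound are my own choices):

```python
import re
from itertools import product

# Original 8-terminal grammar versus the derived RE  A* Z (A+Z)* Q.
RULES = {
    'S': ['NV'],
    'N': ['AN', 'NN', 'NA', 'x', 'y', 'z'],
    'A': ['a', 'b', 'c'],
    'V': ['NV', 'p', 'q'],
}
TERMINALS = 'abcxyzpq'
MAX_LEN = 5  # arbitrary search bound

def language(start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.
    Length pruning is sound: no rule shrinks a sentential form."""
    seen, words = set(), set()
    frontier = [start]
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        nts = [i for i, s in enumerate(form) if s in RULES]
        if not nts:
            words.add(form)
            continue
        i = nts[0]  # leftmost expansion suffices for generation
        for rhs in RULES[form[i]]:
            frontier.append(form[:i] + rhs + form[i+1:])
    return words

pattern = re.compile(r'[abc]*[xyz][abcxyz]*[pq]')
regex_lang = {
    ''.join(w)
    for n in range(1, MAX_LEN + 1)
    for w in product(TERMINALS, repeat=n)
    if pattern.fullmatch(''.join(w))
}

print(language('S', MAX_LEN) == regex_lang)
```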

So, each $S$ consists of a bunch of $A$ and $Z$ elements, containing at least one $Z$ element, suffixed by a $Q$ element.

For a system such as $S ≥ u S v + w$, no least fixed point solution exists, in general, for Kleene algebras, so the solution does not exist as any regular expression in $ℜ\{u,v,w\}^*$. Instead, you need to extend the Kleene algebra by adding elements $𝐛, 𝐝, 𝐩, 𝐪$ satisfying the properties $$𝐩𝐪 = 1 = 𝐛𝐝, \quad 𝐩𝐝 = 0 = 𝐛𝐪.$$

This algebra is known as $C_2'$ and is a polycyclic Kleene algebra. The strong version of it, known as $C_2$, also contains the axiom $$𝐪𝐩 + 𝐝𝐛 = 1.$$

It has the additional property that it contains every single one of its finite-dimensional vector and matrix algebras, but it is more than what we need here.

We also postulate that elements of $C_2'$ commute with the expressions we use the additional elements with - here, the set $X = \{u, v, w\}$. The resulting algebra is a tensor product $C_2' ⊗ ℜX^*$ of $C_2'$ and $ℜX^*$.

For an arbitrary monoid $M$, the algebraic form of the Chomsky-Schützenberger Theorem asserts that every context-free subset of $M$, i.e. all the members of $ℭM$, exist as elements of $C_2' ⊗ ℜM$ and that, in fact, it comprises all the members of this tensor product that commute with $C_2'$. That is: every non-linear fixed point system over $ℜM$ has a least fixed point solution in $C_2' ⊗ ℜM$ and this, in fact, contains the fixed point closure of $ℜM$, i.e., $ℭM$, itself.

All of this takes place within the framework of *-continuous Kleene algebras. The weak first-order axiomatization for Kleene algebras is not strong enough to get the job done ...

... unless the Kleene algebras are commutative. All commutative Kleene algebras are fixed-point closed and, in fact, for a univariate system, $μx f(x) = f'(f(0))^* f(0)$, for any non-linear Kleene-algebraic function $f(x)$. Derivatives are defined by $d(x^*)/dx = x^*$, and for commutative Kleene algebras, the Kleene star satisfies the laws of exponents $(x+y)^* = x^* y^*$ and $0^* = 1$. It is conjectured that multivariate systems have least fixed point solutions $μx f(x) = f'\left(f'(f(0))^* f(0)\right)^* f(0)$; bear in mind that this is a vector-matrix equation, since $f'(x)$ is a Jacobian matrix for multivariate functions $f(x)$.

That's the algebraic form of the Parikh Theorem, which may be regarded as the commutative version of the algebraic version of the Chomsky-Schützenberger Theorem. It works with the weaker first-order Kleene algebra axioms and does not need *-continuity.

For the non-commutative case, when the algebraic Chomsky-Schützenberger Theorem is applied here, the least fixed point solution to $S ≥ u S v + w$ is $μs.(usv + w)$, which is equivalently given by the Chomsky-Schützenberger context-free expression $S = μs.(usv + w) = 𝐛 (u𝐩)^* w (𝐪v)^* 𝐝$.

If you took out the bracketing $𝐛,𝐝$ , you'd get the following expression $(u𝐩)^* w (𝐪v)^*$ , which reduces to the following normal form $$(u𝐩)^* w (𝐪v)^* = (u𝐩)^* u𝐩 S + S + S 𝐪v (𝐪v)^*.$$ Since $S$ commutes with $𝐛$ and $𝐝$ , we could also write $$𝐛(u𝐩)^* w (𝐪v)^* = 𝐛 (u𝐩)^* u𝐩 S + 𝐛S + S 𝐛𝐪v (𝐪v)^* = 𝐛((u𝐩)^* u𝐩 + 1) S = 𝐛 (u𝐩)^* S,$$ using the Kleene algebraic identity $x^*x + 1 = x^*$ . The last term cancelled since $𝐛𝐪 = 0$ . Similarly, $$(u𝐩)^* w (𝐪v)^*𝐝 = (u𝐩)^* u𝐩𝐝 S + S𝐝 + S 𝐪v (𝐪v)^*𝐝 = S(1 + 𝐪v (𝐪v)^*)𝐝 = S(𝐪v)^*𝐝.$$
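The bracket bookkeeping can be simulated directly (a sketch under my reading of the relations used in the cancellations above: $𝐩𝐪 = 1 = 𝐛𝐝$ and $𝐩𝐝 = 0 = 𝐛𝐪$, with $u, v, w$ commuting with the bracket elements). Expanding $𝐛 (u𝐩)^* w (𝐪v)^* 𝐝$ term by term and cancelling brackets should leave exactly the terms $u^n w v^n$:

```python
# Terms of b (u p)^n w (q v)^m d split into a bracket word  b p^n q^m d
# and a commuting word  u^n w v^m.  Keep the terms whose bracket word
# reduces to the identity.  The bound N is an arbitrary choice.
N = 5

def reduce_brackets(word):
    """Rewrite with pq -> 1, bd -> 1, pd -> 0, bq -> 0.
    Returns '' for the identity, None when the term is annihilated to 0."""
    word = list(word)
    changed = True
    while changed:
        changed = False
        for i in range(len(word) - 1):
            pair = word[i] + word[i + 1]
            if pair in ('pq', 'bd'):
                del word[i:i + 2]
                changed = True
                break
            if pair in ('pd', 'bq'):
                return None  # whole term is 0
    return ''.join(word)

solution = set()
for n in range(N + 1):
    for m in range(N + 1):
        if reduce_brackets('b' + 'p' * n + 'q' * m + 'd') == '':
            solution.add('u' * n + 'w' + 'v' * m)

expected = {'u' * n + 'w' + 'v' * n for n in range(N + 1)}
print(solution == expected)  # only the balanced terms u^n w v^n survive
```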

Compare the non-commutative solution with the commutative case, by the way, for $μs.f(s)$ with $f(s) = usv + w$ , $f'(s) = uv$ and $f(0) = w$ , where we get $μs.f(s) = f'(f(0))^* f(0) = (uv)^* w$ .
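On Parikh images (letter-count vectors) the two solutions agree, as a short check confirms (the bound is an arbitrary choice of mine):

```python
from collections import Counter

# Parikh image of the context-free solution {u^n w v^n} versus the
# commutative solution (uv)^* w: both give the count vectors (n, n, 1).
N = 6

cf_image = {tuple(sorted(Counter('u' * n + 'w' + 'v' * n).items())) for n in range(N)}
comm_image = {tuple(sorted(Counter('uv' * n + 'w').items())) for n in range(N)}
print(cf_image == comm_image)  # True
```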

This is all recent (except the algebraic Parikh Theorem, which is from 1999), though its precursors exist on USENET, going back to the early 1990's. Chomsky, himself, was brought up to date on the new developments a few years ago, though he expressed regret that he wasn't able to keep active with this work. The Chomsky-Schützenberger Theorem is almost 60 years old, with its algebraic versions going back over 30 years now.

The evolution of recent developments in formal language theory from the 1960's

In the near future, these and other related materials will be linked to from within The Federation Archive on GitHub.