## Graduation Year

2005

## Document Type

Dissertation

## Degree

Ph.D.

## Degree Granting Department

Mathematics and Statistics

## Major Professor

Nata˘sa Jonoska, Ph.D.

## Committee Member

Gregory McColm, Ph.D.

## Committee Member

Arunava Mukherjea, Ph.D.

## Committee Member

Masahiko Saito, Ph.D.

## Committee Member

Stephen Suen, Ph.D.

## Keywords

Entropy, Language norm, Language distance, Language space, Pseudo-metric, Symbolic dynamics

## Abstract

This paper outlines and implements a systematic approach to the establishment, investigation, and testing of distances and topologies on language spaces. The collection of all languages over a given number of symbols forms a semiring, appropriately termed a language space. Families of languages are defined by interrelations among words. The traditional classification begins with the syntax rules or grammar of the language, that is, the word-transformations by which the entire language can be produced from a single axiom, or starting word. The study of distances between languages as objects and of the topologies induced by language distances upon spaces of languages has been of a limited character. Known language distances introduce topologically awkward features into a language space, such as total disconnectedness. This dissertation examines the topologies induced by three language distances, the effect that each one has upon the notion of a random language, and discusses continuity and word-distribution of structure-preserving language transformations, i.e., morphisms.

This approach starts from metric-like requirements, but adduces an additional condition intuitively appropriate to gauging language distance. At the same time, strict, i.e. non-metric pseudometrics are admitted as possible language distance functions, and these are investigated by the use of metric quotient spaces. The study of the notion of randomness implied by the topology induced by such a pseudo-metric on a language space offers insight into the structure of language spaces and verifies the viability of the pseudo-metric.

Three language pseudo-metrics are studied in this dissertation: a version of the most commonlyused (Cantor) word metric; an upper-density (Besicovitch) pseudo-metric borrowed from the study of cellular automata; and an adaptation and normalization of topological entropy, each evaluated on the symmetric set-difference between languages. It is shown that each of these distances induces a distinct topology on the space of languages. The topology induced by Cantor distance is compact and totally disconnected, the topologies induced by the other two are non-compact, with entropic distance resulting in a topology that is the strict refinement of the Besicovitch topology, enhancing the picture of the smaller languages in the Besicovitch topology. It is also shown that none of the three topologies gives quantitative expression to the distinction between regular and linear languages, although, using Martin-Lof randomness tests, it is shown that each pseudo-metric is associated with a new notion of a random language.

A classification of language mappings is introduced, with the aim of identifying those which best preserve the structure of languages under specific topologies. There are results regarding continuity of mappings, the matrix representation of the pre-image of certain morphisms, and the formal expressions of the probability distribution of the image of certain morphism. The continuity of an injective morphism on its image is demonstrated under limited conditions.

Finally, the questions which this approach leaves open are detailed. While basic facts about a permutation-invariant version of symmetric set difference are shown, this has yet to be fully elaborated. The outline is presented for a metric which distinguishes between regular and linear languages by brute force. Syntactic and as algebraic topological continuations of this approach await investigation. A variation of the Cantor distance is introduced, and this induces a non-Cantor topology on a language space.

In summary, this dissertation demonstrates that it is possible to systematically topologize the formal language space, and, having done so, to determine the major effects this has upon the notion of random languages and upon language morphisms.

## Scholar Commons Citation

Kephart, David E., "Topology, Morphisms, and Randomness in the Space of Formal Languages" (2005). *USF Tampa Graduate Theses and Dissertations.*

https://digitalcommons.usf.edu/etd/721