created 2025-06-09, & modified, =this.modified
tags:y2025
rel:
Computational Aesthetics Survey of Vandal, Fake and Replica
Thought
Encountered this while looking at some anonymous works (of a pornographic nature) that were later attributed thanks to stylometry.
I’ve always entertained the idea that at some future date, many things we think of as anonymous will turn out to be leaks of information, interpretable and addressable by those who decide to look.
I am reminded of hackers who harvest currently useless, encrypted data because they presume that at some point times will change, or someone will slip, and with the key they will be granted the treasure.
Stylometry is the study of linguistic style, usually applied to written language, but it has also been applied to music, painting and chess.
It has legal and literary applications and has been used successfully to attribute authorship of anonymous or disputed documents.
Expansion of em:
In 1901, one researcher attempted to use John Fletcher’s preference for “’em”, the contracted form of “them”, as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger’s works in which the editor had expanded all instances of “’em” to “them”.
Computers enhanced this type of effort, but also with mixed results. In the 1960s, a stylistic analysis of the 14 Epistles of the New Testament attributed to St. Paul indicated six authors.
Adversarial Stylometry
An author may use adversarial stylometry to resist identification by eliminating their own stylistic characteristics.
All adversarial stylometry shares the idea of paraphrasing a source text so the meaning is unchanged, but the stylistic signals are obscured.
- imitation – substituting another author’s style for the author’s own
- translation – applying machine translation in the hope that this eliminates the characteristics of the author
- obfuscation – modifying a text’s style to make it not resemble the author’s own.
It is uncertain if the practice of adversarial stylometry is detectable in itself.
There are software packages that perform analysis.
Thought
This is how I ended up here: a work written by the creator of Bambi.
In 2022, the Italian scholars Simone Rebora and Massimo Salgaro showed, using John F. Burrows’ “Delta distance” method, that Felix Salten is the most probable author of the anonymous novel Josefine Mutzenbacher from 1906, the final pages excluded.
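For my own reference, a minimal sketch of how a Burrows-style Delta comparison works, as I understand the general method: take the most frequent words across the candidate corpus, convert each text's relative frequencies into z-scores, and average the absolute z-score differences between the disputed text and each candidate. The function names, the crude tokenizer, the input shape and the `n_words` default are my own assumptions, not the setup Rebora and Salgaro used.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase and split into word tokens (deliberately crude)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]


def burrows_delta(candidates, disputed, n_words=30):
    """Return {author: delta}; a lower delta means stylistically closer.

    candidates: dict of author name -> known text (hypothetical input shape).
    """
    tokenized = {a: tokenize(t) for a, t in candidates.items()}
    # Most frequent words across the whole candidate corpus.
    corpus_counts = Counter(tok for toks in tokenized.values() for tok in toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    profiles = {a: relative_freqs(toks, vocab) for a, toks in tokenized.items()}
    test_profile = relative_freqs(tokenize(disputed), vocab)

    # Per-word mean and standard deviation across the candidate authors.
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]

    def z_scores(profile):
        # Words with zero spread carry no signal, so they score 0.
        return [(f - m) / s if s > 0 else 0.0
                for f, m, s in zip(profile, mus, sigmas)]

    z_test = z_scores(test_profile)
    return {
        author: mean(abs(zt - za) for zt, za in zip(z_test, z_scores(profile)))
        for author, profile in profiles.items()
    }
```

The candidate with the lowest delta is the closest stylistic match among those supplied. That is all it says: the true author may not be in the candidate set at all.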
Data and Methods
In the past, the rarest or most striking elements of a text were used. Contemporary techniques can isolate patterns even in common parts of speech. Most systems characterize the author by the frequencies of words and terms in the text.
The primary stylometric method is the writer invariant: a property held by all texts by a given author that are long enough to admit an analysis.
In handwritten text recognition, writer invariant also refers to an author’s pattern of writing a letter.
Thought
What this kind of written identity implies is fascinating. Invariant – never changing. Some essence that can be extracted from the text.
One method
- the text is analyzed to find the 50 most common words.
- the text is divided into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in that chunk.
- this generates a 50-number identifier for each chunk.
- each chunk is placed as a point in 50-dimensional space.
- this is flattened into a plane, and that results in a display of points that corresponds to the author’s style.
- If two literary works are placed on the same plane, the resulting pattern may show that both works were by the same or different authors.
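The steps above can be sketched roughly as follows. My notes do not record how the flattening is done, so I am assuming a basic principal component analysis here, implemented with power iteration in plain Python to avoid dependencies. All function names and defaults are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (deliberately crude)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n_words=50):
    """The n most common words in the text."""
    return [w for w, _ in Counter(tokenize(text)).most_common(n_words)]


def chunk_vectors(text, vocab, chunk_size=5000):
    """One relative-frequency vector per full chunk; a short tail is dropped."""
    tokens = tokenize(text)
    vectors = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        vectors.append([counts[w] / chunk_size for w in vocab])
    return vectors


def _main_axis(rows, iterations=200):
    """Direction of greatest variance, found by power iteration."""
    dim = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dim)]  # fixed start keeps runs repeatable
    for _ in range(iterations):
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in v]
        scores = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        v = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else v


def project_2d(vectors):
    """Flatten the vectors onto their two main axes of variation.

    Expects at least one vector.
    """
    dim = len(vectors[0])
    means = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    rows = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    points = [[] for _ in vectors]
    for _ in range(2):
        axis = _main_axis(rows)
        scores = [sum(r[j] * axis[j] for j in range(dim)) for r in rows]
        for point, s in zip(points, scores):
            point.append(s)
        # Deflate: remove the found axis so the next pass finds the second one.
        rows = [[r[j] - s * axis[j] for j in range(dim)]
                for r, s in zip(rows, scores)]
    return [tuple(p) for p in points]
```

To compare two works on the same plane, I would build the vocabulary from both texts together, e.g. `vocab = top_words(work_a + " " + work_b)`, concatenate `chunk_vectors(work_a, vocab)` and `chunk_vectors(work_b, vocab)`, and pass the combined list to `project_2d` so every chunk shares the same axes.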
There is no agreement on which properties of a text should be used. Candidates include:
- word lengths
- average sentence length
- average word length
- noun, verb and adjective usage frequency
- vocabulary richness
- frequency of function words
- specific function words
Analysis of function words shows promise because they are used by authors unconsciously.
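A few of the simpler properties listed above are easy to measure. This is only a sketch under my own assumptions: the sentence splitter is naive, the function and key names are hypothetical, and I use the type-token ratio as a stand-in for vocabulary richness, though it is sensitive to text length.

```python
import re
from collections import Counter


def style_features(text):
    """Compute a few simple stylometric measurements for one text."""
    # Crude sentence split on ., ! or ?; real texts need a proper segmenter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    length_counts = Counter(len(w) for w in words)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words having each length, keyed by length in characters.
        "word_length_distribution": {
            n: c / len(words) for n, c in sorted(length_counts.items())
        },
    }
```

Comparing these numbers across texts only makes sense when the texts are of similar length and genre. That may be part of why there is no agreement on which properties to trust.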
Function Words and Content Words
Function words (functors) are words that have little or ambiguous lexical meaning, and that express grammatical relationships among other words within a sentence or specify the attitude or mood of the speaker.
Lexical or content words are words that possess content and contribute meaning to the sentence in which they occur.
Thought
Imagine if it were just little nuances like this that are the reason why LLMs are effective. We end up thinking they are special/thinking things, but it’s just some language quirk we overlook because we are so richly immersed.
With only around 150 function words, 99.9% of words in the English language are content words. Although small in number, function words are used at a disproportionately higher rate than content words and make up about 50% of any English text. Conventional patterns of usage bind function words to content words almost every time they are used, which creates an interdependence between the two word groups.
Content words are usually open class words, and new words are easily added to the language. “Typical closed classes are prepositions (or postpositions), determiners, conjunctions, and pronouns.”
List of English function words:
- articles — the and a. In some inflected languages, the articles may take on the case of the declension of the following noun.
- pronouns — he/him, she/her, etc. — inflected in English
- adpositions — in, under, towards, before, of, for, etc.
- conjunctions — and and but
- subordinating conjunctions — if, then, well, however, thus, etc.
- auxiliary verbs — would, could, should, etc. — inflected in English
- particles — up, on, down
- interjections — oh, ah, eh, sometimes called “filled pauses”
- expletives — take the place of sentences, among other functions.
- pro-sentences — yes, no, okay, etc.
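To get a feel for how much of a text is function words, here is a small sketch that counts them. The word set contains only the examples from the list above, so it is a partial sample and will understate the real share. The function name and return shape are my own.

```python
import re
from collections import Counter

# Only the examples listed in these notes; a full list has around 150 entries.
FUNCTION_WORDS = {
    "the", "a", "he", "him", "she", "her",
    "in", "under", "towards", "before", "of", "for",
    "and", "but", "if", "then", "well", "however", "thus",
    "would", "could", "should", "up", "on", "down",
    "oh", "ah", "eh", "yes", "no", "okay",
}


def function_word_profile(text):
    """Return (share, counts): the fraction of tokens that are function
    words, plus a Counter of which ones appeared."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter(t for t in tokens if t in FUNCTION_WORDS)
    share = sum(hits.values()) / len(tokens) if tokens else 0.0
    return share, hits
```

The per-word counts are the part that matters for stylometry: two authors may use function words at a similar overall rate while differing in which ones they reach for.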