Earlier this year I reached out to the Forensic Linguistics Facebook group hoping for some assistance in finally answering the question “Who Authored the John Titor Legend?” A linguistic profile had been done by the Hoax Hunter in the past, but his efforts relied on a generic computer program rather than the use of professional techniques.
Several forensic linguists reached out to me, but the most accomplished and also the most engaged was Andrea Nini, a professor of forensic linguistics at the University of Manchester, who has published widely on questions of authorship ranging from the Jack the Ripper letters to the Bixby letter.
Nini was interested in using the John Titor story as a case study for his class of postgraduate forensic linguistics students. I assembled writings from the four people I consider the most likely candidates (Joseph Matheny, Morey Haber, Oliver Williams, and Temporal Recon) and sent them to Nini along with the complete John Titor posts. Each suspect was assigned to a pair of students, and each pair examined the similarities and differences between that suspect and John Titor, with the goal of either clearing their suspect or determining that the suspect could not be cleared.
The students (including but not limited to: Fatma Hamaid, Jiaqi Zhang, Christoper Powell, Sarah Mahmood, Lisa Donlan, and Guadalupe Pulido Casas) have come to at least one provisional conclusion at this point in their work in progress: that John Titor was not a real time traveler.
The conclusion rests on a small change in the language over time, one that hardcore Titor-ites may be inclined to dismiss but that carries real weight: the spelling of the word “web site.”
“In his posts he writes ‘web site’ as two words,” said Nini. “One thing we know about English is that new word compounds of this type usually take the same path to become lexicalized into one word. Usually there’s a cycle. Students looked at a collection of words called a ‘corpus’ of English that includes historical data, and you can see clearly that ‘website’ behaves as it should. You can see it behaves like all the other compounds. Back at the time John Titor was writing, it was spelled as two words and that variant basically died, and now it’s just one word. It would definitely disappear in the future.”
In a typical assignment, due to a relative lack of data for each subject, the students would apply qualitative analyses. In this case, however, the high volume of data for each subject called for computational analysis, and this was where Nini stepped in, writing a custom program to compare the Titor posts with the writings of the four subjects.
From the extensive John Titor posts, Nini had to exclude every text under 100 words, as anything shorter is considered unreliable. “When you remove everything under 100 words from the John Titor corpus, not a lot is available. Having a short disputed sample [the John Titor posts] is not a problem. Harder is a short known sample [the writings of the subjects]. That’s not an issue for each subject when there’s a lot to compare,” said Nini.
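The length cutoff Nini describes can be sketched in a few lines of Python. This is only an illustration, not his program: the function name `filter_short_texts` is made up here, and counting words by splitting on whitespace is an assumption about how "words" are counted.

```python
def filter_short_texts(texts, min_words=100):
    """Keep only texts with at least `min_words` whitespace-separated words.

    Assumption: a "word" is any run of non-space characters.
    """
    return [text for text in texts if len(text.split()) >= min_words]


# Toy input: one two-word post and one 150-word post.
posts = ["short post", "word " * 150]
kept = filter_short_texts(posts)
print(len(kept), "of", len(posts), "posts kept")
```

Anything under the threshold is dropped, which is why a large pile of forum posts can shrink to a fairly small usable corpus.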
“I ran features that involved taking the top 100 most frequent word sequences of 2-5 words, then taking punctuation, and taking average word length for each text, sentence and paragraph length. All of the markers were found to work in stylometry, the discipline that studies the quantification of style. So you get all these measurements for each text, and what you want to find out is if there are any differences in the way all the suspects behave, say they all have the same average sentence length. So I ran tests of significance for variables. You take all the ones that are significantly different. You test statistically whether Morey Haber has a higher average sentence length than another suspect. If he does, then you take that marker and put it in a basket; if not, you exclude it.”
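For readers who want to see what such measurements look like in practice, here is a rough Python sketch of the kinds of features Nini lists, followed by one possible significance test. None of this is his actual code. The tokenizer, the choice of punctuation marks, and every function name are assumptions made for illustration, and the permutation test is simply one common test swapped in because the article does not say which test he used.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(tokens, n_min=2, n_max=5):
    """Yield every sequence of n_min..n_max consecutive words as a tuple."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def top_ngrams(texts, k=100):
    """The k most frequent 2-5 word sequences across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(tokenize(text)))
    return [gram for gram, _ in counts.most_common(k)]


def style_features(text, vocabulary):
    """Per-text measurements: word/sentence/paragraph length, punctuation, n-grams."""
    tokens = tokenize(text)
    n_tokens = max(len(tokens), 1)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    gram_counts = Counter(word_ngrams(tokens))
    features = {
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "avg_paragraph_length": len(tokens) / max(len(paragraphs), 1),
    }
    for mark in ",;:!?":  # which marks to count is an assumption
        features["punct_" + mark] = text.count(mark) / n_tokens
    for gram in vocabulary:
        features["ngram_" + " ".join(gram)] = gram_counts[gram] / n_tokens
    return features


def permutation_p_value(a, b, n_rounds=2000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    return (hits + 1) / (n_rounds + 1)


# Toy texts and made-up sentence lengths, for illustration only.
texts = ["The web site is down. The web site is slow!", "I think, therefore I am."]
vocab = top_ngrams(texts, k=5)
feats = style_features(texts[0], vocab)
p = permutation_p_value([20, 22, 21, 23, 24, 22], [10, 11, 12, 9, 10, 11])
```

A feature whose p-value falls below a chosen threshold (0.05 is a common convention, not something stated in the interview) would go "in the basket"; the rest would be discarded.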
“[Now] you have a basket of features you know are important. At that point you still have a problem because you probably have 100 or so of these features. It’s very difficult to make sense of all of that. So you reduce the dimensionality of that, and one way this is done is by using principal component analysis, a classic statistical technique used in several fields. It reduces many variables to a few. It finds common patterns and puts them together and gives you only a few variables which are like super variables that account for as much as possible. In this case, we found that the first components for the first super variable accounts for registers. The authors are different because they’re doing things differently.”
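To make the idea of a "super variable" a little more concrete, the sketch below finds the first principal component of a small table of numbers using plain Python. It is a teaching example under stated assumptions, not Nini's program: the function name is invented, the power-iteration approach is just one way to compute the component, and real work would normally standardize the features first and use a library such as NumPy or scikit-learn.

```python
import math
import random


def first_principal_component(rows, n_iter=500, seed=0):
    """Return (direction, scores) for the first principal component.

    rows: equal-length numeric feature vectors, one per text.
    Assumption: features are already on comparable scales.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiplying by the covariance matrix pulls
    # a random starting vector toward the direction of greatest variance.
    rng = random.Random(seed)
    vec = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # no variance at all; nothing to find
            break
        vec = [x / norm for x in nxt]
    # Each text's score is its position along that direction.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores


# Toy data: the second feature is always double the first, so a single
# "super variable" captures all of the variation.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
direction, scores = first_principal_component(rows)
```

Each text ends up with one score per component, and those scores are what a plot of the "super variable" would display for each suspect.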
In other words, a person writes differently when they are posting in an online forum as Titor did, versus when they are writing a cyber security blog in the case of Morey Haber, or writing a non-fiction book in the case of Temporal Recon.
The following diagram is a visual representation of the distribution of values for this statistically created ‘super-variable.’ The ‘super-variable’ distinguishes the suspects: their boxes do not overlap, and each suspect occupies their own region of the plot. TI (John Titor), however, does appear to overlap with MH (Morey Haber). “This technique only tells you which suspect is the most similar, or who [has] the most similar style to the disputed text. But that starts on the assumption that your suspect is in that sample,” said Nini.
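What it means for two boxes to "overlap" can be stated precisely. In a standard box plot, the box spans the first to the third quartile of the values, so two boxes overlap when those ranges share any values. The sketch below checks this in Python; the author labels and numbers are invented purely for illustration and are not the study's data, and the function names are hypothetical.

```python
from statistics import quantiles


def box(values):
    """Return (Q1, Q3), the edges of the box in a standard box plot."""
    q1, _median, q3 = quantiles(values, n=4)
    return q1, q3


def boxes_overlap(a, b):
    """True when the interquartile ranges of two samples share any values."""
    a_low, a_high = box(a)
    b_low, b_high = box(b)
    return a_low <= b_high and b_low <= a_high


# Invented component scores for three hypothetical authors A, B and C.
scores_by_author = {
    "A": [0.9, 1.1, 1.3, 1.6, 1.8],
    "B": [1.2, 1.4, 1.7, 1.9, 2.2],
    "C": [4.0, 4.3, 4.5, 4.8, 5.1],
}
print(boxes_overlap(scores_by_author["A"], scores_by_author["B"]))
print(boxes_overlap(scores_by_author["A"], scores_by_author["C"]))
```

As Nini cautions, an overlap like this only says which candidate in the sample is closest in style; it cannot rule out an author who was never included.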
Nini stresses that this inquiry remains a work in progress and is far from conclusive. New forensic linguistic ideas that may shed further light on “Who Authored the John Titor Legend?” will be published here as they develop.