Principal component analysis for authorship attribution

Background: To recognize the authors of texts with statistical tools, one first needs to decide which features to use as author characteristics and then extract these features from the texts. The features extracted are mostly counts of so-called function words. Objectives: The extracted data are processed further and compressed into data with fewer features, in such a way that the compressed data retain their power as effective discriminators. The resulting feature space has lower dimensionality than the text itself. Methods/Approach: In this paper, data collected by counting words and characters in around a thousand paragraphs of each sample book underwent a principal component analysis performed using neural networks. Once the analysis was complete, the first principal component was used to distinguish the books written by a given author. Results: The results show that every author leaves a unique signature in written text that can be discovered by analyzing counts of short words per paragraph. Conclusions: In this article we have demonstrated that authorship can be traced with principal component analysis based on counts of short words per paragraph. The methodology could be used for other purposes, such as fraud detection in auditing.


INTRODUCTION
Of all text categorization problems, that of authorship attribution is probably the oldest; however, it is also possibly the least well organized, and its history is marred by the mishandling of statistical techniques. And yet, it still promises to provide useful applications in spheres as diverse as law, security, and education.
The origin of non-traditional authorship attribution, or stylometry, is often said to be Augustus de Morgan's suggestion in 1851 that certain authors of the Bible might be distinguishable from one another if one of them used longer words (Holmes 1998). In 1887, Mendenhall began investigating this hypothesis, searching for a characteristic difference in the distribution of word lengths in writings of different languages and presentation styles. In 1901, he turned his methods to Shakespeare, Bacon and Marlowe, and found that while Shakespeare and Marlowe were nearly indistinguishable, both were significantly and consistently different from Bacon (Williams 1975). The difference was mainly observed in the relative frequency of three- and four-letter words: Shakespeare used more four-letter words and Bacon more three-letter words.
Authorship studies also began independently around the same time in Russia, it seems, with Morozov proposing a model for measuring style that garnered the interest of A. Markov (Kukurushkina et al. 2002). In the West, it took 30 years or so for Mendenhall's studies to be resumed by other linguists. George Zipf examined word frequencies and determined not a stylometric but a universal law of language, Zipf's Law: the frequency of a word is inversely proportional to its rank. G. Udny Yule devised a feature known as "Yule's characteristic K," which estimated vocabulary richness by comparing word frequencies to those expected under a Poisson distribution, but like Mendenhall's word lengths, this too was later found to be an unreliable marker of style (Holmes 1998). In fact, most of the measurements proposed in this period proved unhelpful: among others, researchers tried average sentence length, number of syllables per word, and other estimates of vocabulary richness such as Simpson's D index and a simple type/token ratio (the ratio of the number of unique words, or types, to the number of total words, or tokens) (Juola et al. 2006).
A breakthrough was needed, and it came in 1963 with Mosteller and Wallace's study on the Federalist Papers. In 1787 and 1788, John Jay, Alexander Hamilton and James Madison collectively wrote 85 newspaper essays supporting the ratification of the Constitution. The essays were published under the pseudonym "Publius," and the authors later revealed which of the Federalist Papers they had written; however, while the authorship of 67 was undisputed, 12 were claimed by both Hamilton and Madison. Mosteller and Wallace hoped to characterize each author's style through their choice of function words, such as "to," "by," and so forth. Function words are regarded as good markers of style because they are (assumed to be) unconsciously generated and independent of semantics (meaning, or what the author is trying to convey). That is, an author may have a preference for modes of expression (for instance, the active vs. the passive voice) that emphasize certain function words, and the same broad set of function words will be used regardless of the topic at hand. Although Hamilton and Madison otherwise have very similar styles (nearly identical sentence length distributions, as noted by Juola 2006), Mosteller and Wallace found sharp differences in their preference for different function words: for instance, the word "upon" appears 3.24 times per 1000 words in Hamilton, and just 0.23 times in Madison (quoted in Holmes 1998). Adjusting these frequencies with a Bayesian model, they showed that Madison had most likely written all 12 disputed papers. Traditional scholarship had already long come to the same conclusion, but Mosteller and Wallace's conclusion was independent, and thus a great achievement of the then quite exploratory field of stylometry. The Federalist Papers problem is still regarded as a very difficult test case, and as an unofficial benchmark it has been used to test most methods of authorship attribution developed since then (see, for instance, Kjell 1994, Holmes & Forsyth 1995, Bosch & Smith 1998, and Fung 2003).
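The per-1000-word rates quoted above for "upon" can be illustrated with a short sketch. The function name, the tokenization rule (runs of letters, case-insensitive) and the sample sentence are assumptions made for illustration; they are not taken from Mosteller and Wallace.

```python
import re


def rate_per_thousand(text: str, word: str) -> float:
    """Occurrences of `word` per 1000 word tokens of `text` (case-insensitive)."""
    # Assumed tokenization: maximal runs of letters (Unicode-aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)


# Hypothetical sample sentence, used only to exercise the function.
sample = "Upon reflection, the matter rests upon the people and upon the law."
rate = rate_per_thousand(sample, "upon")
print(rate)
```

On a real corpus the rate would be computed over thousands of tokens per author, so that values such as 3.24 and 0.23 become stable estimates.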

PROBLEM DEFINITION
In this paper an application of principal component analysis is presented. Authorship attribution is considered as a classification task (Chaski 2001, 2005). The texts studied are literary works of three Bosnian writers: Ivo Andrić (1892-1975), M. Meša Selimović (1910-1982), and Derviš Sušić (1925-1990). The features selected to describe the texts are lexical and syntactic components, which show promising results when used as writer invariants because they are used rather subconsciously and reflect an individual writing style that is difficult to copy. Principal components of the data elicited from the texts possess generalization properties that allow for the required high accuracy of classification (Hayes 2008).

Texts Used
In this research, texts of three famous Bosnian writers, Ivo Andrić, M. Meša Selimović, and Derviš Sušić, are used. Their novels provide corpora that are wide enough to ensure that characteristic features found in the training data can be treated as representative of other parts of the texts, and that this generalized knowledge can be used to classify the test data according to their respective authors.
Obviously, literary texts can vary greatly in length; moreover, all stylistic features can be influenced not only by the period in which the text was written but also by its genre. The first of these issues is easily dealt with by dividing long texts, such as novels, into a number of smaller parts of approximately the same size.

The described approach gives an additional advantage in classification tasks: even if some of these parts are classified incorrectly, the whole text can still be properly attributed to an author by basing the final decision on the majority of outcomes rather than requiring every individual decision to agree.
Whether the genre of a novel is reflected in its lexical and syntactic characteristics is a question yet to be answered. If the influence is significant, then lexical and syntactic features are unreliable and cannot be used as writer invariants.
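The majority-based decision described here can be sketched as follows. This is a minimal illustration: the function name and the example predictions are hypothetical, and ties are simply resolved in favor of the label seen first, which is an assumption rather than a rule stated in the paper.

```python
from collections import Counter


def attribute_by_majority(chunk_labels):
    """Return the author predicted for most chunks and that author's vote share."""
    if not chunk_labels:
        raise ValueError("need at least one chunk-level prediction")
    counts = Counter(chunk_labels)
    # most_common keeps first-seen order among equal counts (assumed tie rule).
    author, votes = counts.most_common(1)[0]
    return author, votes / len(chunk_labels)


# Hypothetical per-chunk predictions for one novel split into five parts.
predictions = ["Andrić", "Andrić", "Selimović", "Andrić", "Sušić"]
author, share = attribute_by_majority(predictions)
print(author, share)
```

Two of the five parts are misclassified in this example, yet the whole text is still attributed to the majority author.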

Feature Selection
Establishing features that work as effective discriminators of the texts under study is one of the critical issues in research on authorship analysis based on lexical features. In this research five textual descriptors are used, namely the numbers of characters, words, sentences, commas, and occurrences of the conjunction "and" (in Bosnian, "i"), together with other characteristics of paragraphs. Means and variances of the textual descriptors for the texts Ivo Andrić: Na Drini Ćuprija and M. Meša Selimović: Derviš i Smrt are shown in Table 1. As can be seen, the two authors differ statistically in their use of the textual descriptors; for instance, Ivo Andrić prefers longer paragraphs. On average, Ivo Andrić's paragraphs contain 79 words with variance 5861.7, while Meša Selimović's average is 62 words with variance 4756.4.
In the following sections the pattern captured by the principal components is displayed.
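A minimal sketch of how such per-paragraph descriptors and their means and variances might be computed is given below. Everything in it is an assumption made for illustration: paragraphs are taken to be separated by blank lines, sentences are counted by runs of terminal punctuation, the population variance is used, and the short sample text is invented. The paper does not specify these details.

```python
import re
from statistics import mean, pvariance


def paragraph_descriptors(paragraph: str) -> dict:
    """Count the five textual descriptors in one paragraph."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    return {
        "characters": len(paragraph),
        "words": len(words),
        # Assumed sentence rule: each run of . ! or ? ends one sentence.
        "sentences": len(re.findall(r"[.!?]+", paragraph)),
        "commas": paragraph.count(","),
        "and_i": words.count("i"),  # Bosnian conjunction "i" ("and")
    }


def summarize(text: str) -> dict:
    """Mean and (population) variance of each descriptor over all paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    rows = [paragraph_descriptors(p) for p in paragraphs]
    return {
        name: (mean([r[name] for r in rows]), pvariance([r[name] for r in rows]))
        for name in rows[0]
    }


# Invented two-paragraph sample, not taken from the studied novels.
sample_text = "Dan i noć su prošli. Most je stajao, bijel i tvrd.\n\nVoda teče, i teče."
stats = summarize(sample_text)
print(stats["words"], stats["and_i"])
```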

PRINCIPAL COMPONENT ANALYSIS
The methods of Mosteller and Wallace have proved as enduring as the problem they investigated: they were only modestly altered when Burrows described his method of stylometric analysis in a series of papers published in the late 1980s and early 1990s (Holmes 1998; see, for instance, Burrows 1992). The Burrows method essentially involves computing the frequency of each of a list of function words (a larger list than that of Mosteller and Wallace), and performing principal component analysis (PCA) to find the linear combinations of variables that best account for the variation in the data. Rather than analyzing this result statistically, the transformed data are simply plotted (a two-dimensional plot of the first two principal components) and inspected visually for trends, which occur as clusters of points (Holmes 1998). (Later, cluster analysis would accomplish this step.) This simple but effective method continues to be used today, partly because of the ease with which the results are communicated and interpreted. For example, Binongo used this method to study the authorship of L. Frank Baum's last book, which historians had long suspected of being mostly the work of Baum's successor, Ruth P. Thompson (Binongo 2003). He confirmed this suspicion independently, demonstrating that Thompson was much more prone than Baum to use position words such as "up," "down," "over," and "back." This was not demonstrated using complex statistical techniques; rather, function word frequencies were tallied, the authors' tallies compared, PCA used to reduce the dimensionality of the data, and the resulting plots inspected: the two authors' works form obvious clusters. Similar procedures can be found in (Holmes & Forsyth 1995, Holmes et al. 2001, and Peng & Hengartner 2002).
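The projection step of a Burrows-style analysis can be sketched as follows, assuming the function word rates have already been tabulated with one row per text sample. The matrix values are invented, the function name is hypothetical, and the PCA is computed here through a singular value decomposition of the centered data, which is one standard way to obtain principal components and not necessarily the procedure used in the cited studies.

```python
import numpy as np


def first_two_components(freqs: np.ndarray) -> np.ndarray:
    """Project the rows (texts) of a frequency matrix onto the first two PCs."""
    centered = freqs - freqs.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Invented rates per 1000 words for four function words in six text samples;
# rows 0-2 stand for one hypothetical author, rows 3-5 for another.
freqs = np.array([
    [3.2, 1.1, 9.8, 4.0],
    [3.0, 1.3, 10.1, 4.2],
    [3.4, 0.9, 9.5, 3.9],
    [0.3, 2.8, 7.0, 6.1],
    [0.2, 3.1, 7.3, 5.8],
    [0.4, 2.9, 6.8, 6.3],
])
scores = first_two_components(freqs)
print(scores.round(2))
```

Plotting the two columns of `scores` against each other gives the kind of scatter plot that is inspected visually for clusters.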

PRINCIPAL COMPONENTS OF SAMPLE TEXTS
Next, random samples of 400 data points are drawn from the data sets of textual descriptors for the texts Ivo Andrić: Na Drini Ćuprija and M. Meša Selimović: Derviš i Smrt, and for four other books. Each sample is a 400 × 14 matrix, and its covariance matrix is a 14 × 14 matrix. The information in the covariance matrices is used to define a set of new variables as linear combinations of the original variables in the data matrices. The new variables are derived in decreasing order of importance. The first column of the matrix of new variables is called the first principal component and accounts for as much as possible of the variation in the original data. The second column is called the second principal component and accounts for another, smaller portion of the variation, and so on.
If there are p variables, one needs p components to cover all of the variation in the original data, but often much of the variation is covered by a smaller number of components. Thus PCA has as its goals the interpretation of the variation and data reduction.
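The construction described above can be sketched with a direct eigendecomposition of the covariance matrix. This is a stated substitution: the abstract mentions that the analysis was performed using neural networks, whereas the sketch below uses the classical linear-algebra route. The data are synthetic random numbers standing in for a 400 × 14 sample, and the function name is hypothetical.

```python
import numpy as np


def pca_from_covariance(data: np.ndarray):
    """Return PC scores, component variances, and percentage of variance covered."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to decreasing importance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                 # columns are the principal components
    percent = 100.0 * eigvals / eigvals.sum()
    return scores, eigvals, percent


rng = np.random.default_rng(0)
# Synthetic stand-in for one 400 x 14 sample of textual descriptors.
data = rng.normal(size=(400, 14)) * np.linspace(10.0, 1.0, 14)
scores, variances, percent = pca_from_covariance(data)
print(percent.round(1))
```

The `variances` and `percent` arrays correspond to the kind of quantities reported per component in a table of variances and percentage variances.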

Variances and percentage variances covered by the fourteen principal components of the textual descriptors for the sample texts Ivo Andrić: Na Drini Ćuprija and M. Meša Selimović: Derviš i Smrt are shown in Table 2. Table 2 reveals that the first two principal components cover more than 99% of the total variance.
The first principal components of the samples from the two books are displayed in Figure 1. A common range for the contents of these two vectors is the interval [-500, 0]. We divide this interval into 25 bins of equal length 20, namely [-500, -480], [-480, -460], … , [-40, -20], [-20, 0], and count the numbers of entries of the first principal component vectors that fall in these bins (Figure 2). It is seen that the writeprints of the two authors are distinguishable. To see whether the captured features remain similar across random samples from the data sets, the frequencies of ten different samples are plotted together in Figure 3. The first principal components of another book by Meša Selimović, Tvrđava, are displayed in Figure 4a, and those of a third author's text, Pobune (Sušić 1966), in Figure 4b. The writeprint of Meša Selimović shows peaks twice as high as the corresponding peaks of Ivo Andrić, and differs significantly from the pattern for Derviš Sušić.
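The binning step can be sketched with a histogram over a fixed range. The interval bounds are parameters; the defaults assume the range runs from -500 to 0, which is a reading of the text and should be adjusted if the actual range differs. The input vector is synthetic, and the function name `writeprint` is hypothetical.

```python
import numpy as np


def writeprint(first_pc, low=-500.0, high=0.0, bins=25):
    """Count how many entries of a first-principal-component vector fall in each bin."""
    counts, edges = np.histogram(first_pc, bins=bins, range=(low, high))
    return counts, edges


rng = np.random.default_rng(1)
# Synthetic stand-in for the first principal component of a 400-paragraph sample.
first_pc = rng.uniform(-500.0, 0.0, size=400)
counts, edges = writeprint(first_pc)
print(counts)
```

Repeating this for several random samples and overlaying the resulting count vectors gives a way to check whether the pattern is stable from sample to sample.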

CONCLUSIONS
The research described in this paper on author identification shows that principal component analysis (PCA) is an efficient tool. Yet conclusions about the choice of textual descriptors used as features in the recognition process, based only on the results presented in the previous sections and leading to an arbitrary statement that syntactic attributes are more effective in authorship attribution, would be hasty and premature. Although true in the studied example, such a statement would have to be verified against much wider corpora, since for other writers other features could give better results.

Figure 1. First principal components of samples from Na Drini Ćuprija (a) and Derviš i Smrt (b) data. These plots are similar and do not seem usable as writeprints of the authors. The same holds for the second principal components. To search for a writeprint, this information is transformed into the frequency domain.

Figure 2. Frequencies of elements of first principal component vectors of random samples from Na Drini Ćuprija (a) and Derviš i Smrt (b) data in 25 bins, that is, the numbers of entries of the first component vectors in these bins. Figure 2 displays the data of Figure 1 in the frequency domain.

Figure 3. Frequencies of elements of first principal component vectors of ten random samples from Na Drini Ćuprija (a) and Derviš i Smrt (b) data in 25 bins.

Figure 4. Frequencies of data in the first principal components of another book by Meša Selimović, Tvrđava (a), and of a third author's text, Pobune (b).
Future work includes applying the artificial neural network-based methodology to a wider range of authors, defining new sets of textual descriptors, testing other types and structures of neural networks, and investigating whether a writeprint is inherited through translation into other languages.

Table 1. Paragraph averages and variances of the textual descriptors used in this research.

Table 2. Variances and percentage variances covered by fourteen principal components of the textual descriptors used in this research.