Volume 39 (2015): Issue 1 (March 2015)
Journal Details
License
Format: Journal
eISSN: 1502-5462
First Published: 28 Apr 2014
Publication timeframe: 1 time per year
Languages: English
Access type: Open Access

Modest XPath and XQuery for corpora: Exploiting deep XML annotation

Published Online: 01 Apr 2015
Volume & Issue: Volume 39 (2015) - Issue 1 (March 2015)
Page range: 47 - 84
Abstract

This paper outlines a modest approach to XPath and XQuery, tools that allow the navigation and exploitation of XML-encoded texts. The paper picks up where Andrew Hardie’s paper “Modest XML for corpora: Not a standard, but a suggestion” (Hardie 2014) left the reader, namely wondering how one’s corpus can be usefully analyzed once its XML encoding is finished, a question that paper did not address. Hardie argued persuasively that “there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding” (Hardie 2014: 73). In a similar vein, this paper argues that even a basic understanding of XPath and XQuery can bring great benefits to corpus linguists. To make this point, we not only present a modest introduction to the basic structures underlying XPath and XQuery syntax but also demonstrate their analytical potential using Obama’s 2009 Inaugural Address as a test bed. The speech was encoded in XML, automatically PoS-tagged, and manually annotated on additional layers that target two rhetorical figures, anaphora and isocola. We refer to this resource as the Inaugural Rhetorical Corpus (IRC). Further, we created a companion website hosting not only the Inaugural Rhetorical Corpus but also the Inaugural Training Corpus (an abbreviated version of the IRC that allows manual checks of query results), as well as an extensive list of tried and tested queries for use with either corpus. All of the queries presented in this paper are at a beginner to lower-intermediate level of XPath/XQuery expertise. Nonetheless, they yield fruitful results: they show how Obama uses the inclusive pronouns we and our as a discursive strategy to advance his political strategy to re-focus American politics on economic and domestic matters. Further, they demonstrate how sentence length contributes to the build-up of climactic tension. Finally, they suggest that Obama’s signature rhetorical figure is the isocolon and that the overwhelming majority of isocola in the speech instantiate the crescens type, where the cola gradually increase in length over the sequence.
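
To give a rough sense of the kind of beginner-level query the abstract describes, the following is a minimal sketch in Python using the XPath subset supported by the standard library's xml.etree.ElementTree. It is not one of the paper's own queries: the sample text is invented, and the element and attribute names (`<s>`, `<w>`, `n`, `pos`) are assumptions made for illustration, not the actual IRC markup.

```python
import xml.etree.ElementTree as ET

# Invented miniature corpus. The tag and attribute names (<s>, <w>, n, pos)
# are illustrative assumptions, not the IRC's real annotation scheme.
SAMPLE = """
<speech>
  <s n="1"><w pos="PRP">We</w><w pos="VBP">remain</w><w pos="DT">a</w><w pos="JJ">young</w><w pos="NN">nation</w></s>
  <s n="2"><w pos="PRP$">Our</w><w pos="NN">journey</w><w pos="VBZ">continues</w></s>
</speech>
"""

root = ET.fromstring(SAMPLE)

# Path expression: every <w> element anywhere below the root.
words = root.findall(".//w")

# Attribute predicate: only the <w> elements tagged as personal pronouns.
personal_pronouns = root.findall(".//w[@pos='PRP']")

# Count the inclusive pronouns "we" and "our", ignoring case.
inclusive = [w.text for w in words if w.text.lower() in {"we", "our"}]

# Sentence length in words, keyed by each sentence's n attribute.
lengths = {s.get("n"): len(s.findall("w")) for s in root.findall(".//s")}

print(len(inclusive), lengths)
```

A full XPath or XQuery processor would express the counting and filtering inside the query itself; here Python does that part because ElementTree implements only a limited XPath subset.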

Biria, Reza and Azadeh Mohammadi. 2012. The socio pragmatic functions of inaugural speech: A critical discourse analysis approach. Journal of Pragmatics 44: 1290-1302. DOI: 10.1016/j.pragma.2012.05.013

Clark, James and Steve DeRose. 1999. XML Path Language (XPath) Version 1.0, available at www.w3.org/TR/xpath (last accessed December 2014).

Gleim, Rüdiger, Ulli Waltinger, Alexander Mehler, and Peter Menke. 2009. eHumanities Desktop - An extensible online system for corpus management and analysis. In M. Mahlberg, V. González-Díaz and C. Smith (eds.). Proceedings of the Corpus Linguistics 2009 Conference, available at http://ucrel.lancs.ac.uk/publications/cl2009/124_FullPaper.doc (last accessed December 2014).

Gries, Stefan Th. 2009. Quantitative corpus linguistics with R. A practical introduction. New York and London: Routledge.

Gries, Stefan Th. 2010. Methodological skills in corpus linguistics: A polemic and some pointers towards quantitative methods. In T. Harris and M. Moreno Jaén (eds.). Corpus linguistics in language teaching, 121-146. Frankfurt am Main: Peter Lang.

Gries, Stefan Th. 2013. Statistics for linguistics with R. A practical introduction. 2nd rev. and ext. ed. Berlin and New York: De Gruyter Mouton.

Hardie, Andrew. 2014. Modest XML for corpora: Not a standard, but a suggestion. ICAME Journal 38: 73-103. DOI: 10.2478/icame-2014-0004

Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee and Ylva Berglund Prytz. 2008. Corpus linguistics with BNCweb - A practical guide. Frankfurt/Main: Peter Lang.

Leech, Geoffrey. 2007. New resources, or just better ones? The holy grail of representativeness. In M. Hundt, N. Nesselhauf and C. Biewer (eds.). Corpus linguistics and the web, 133-150. Amsterdam/New York, NY: Rodopi. DOI: 10.1163/9789401203791_009

Leith, Sam. 2011. You talkin’ to me? Rhetoric from Aristotle to Obama. London: Profile Books.

Levinson, Stephen C. 1983. Pragmatics. Cambridge: Cambridge University Press.

Longacre, Robert E. 1983. The grammar of discourse. New York: Plenum Press.

Mahlow, Cerstin, Christian Grün, Alexander Holupirek and Marc H. Scholl. 2012. A framework for retrieval and annotation in digital humanities using XQuery full text and update in BaseX. Proceedings of the 2012 ACM Symposium on Document Engineering; September 4-7, 2012, Paris, France, 195-204. New York, NY: ACM. Available at http://kops.uni-konstanz.de/bitstream/handle/123456789/21363/mahlow_213637.pdf?sequence=2&isAllowed=y (last accessed December 2014).

O’Donnell, Matthew B., Mike Scott, Michaela Mahlberg and Michael Hoey. 2012. Exploring text-initial words, clusters and concgrams in a newspaper corpus. Special issue of Corpus Linguistics and Linguistic Theory 8(1): 73-101. DOI: 10.1515/cllt-2012-0004

O’Donnell, Matthew B. and Ute Römer. 2012. From student hard drive to web corpus (Part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 7(1): 1-18. DOI: 10.3366/cor.2012.0015

R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

Rehm, Georg, Richard Eckart, Christian Chiarcos and Johannes Dellert. 2008. Ontology-based XQuery’ing of XML-encoded language resources on multiple annotation layers. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis and D. Tapias (eds.). Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Paris: ELRA, available at http://www.lrec-conf.org/proceedings/lrec2008/pdf/139_paper.pdf (last accessed December 2014).

Rühlemann, Christoph. 2013. Narrative in English conversation. A corpus analysis. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9781139026987

Rühlemann, Christoph and Matthew B. O’Donnell. 2012. Towards a corpus of conversational narrative. Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory 8(2): 313-350. DOI: 10.1515/cllt-2012-0015

Rühlemann, Christoph, Matthew B. O’Donnell and Andrej Bagoutdinov. 2013. Windows on the mind: Pauses in conversational narrative. In G. Gilquin and S. De Cock (eds.). Errors and disfluencies in spoken corpora (Benjamins Current Topics 52), 59-91. Amsterdam/Philadelphia: John Benjamins. DOI: 10.1075/bct.52.03ruh

Rühlemann, Christoph and Matthew B. O’Donnell. 2015. Deixis. In K. Aijmer and C. Rühlemann (eds.). Corpus pragmatics. A handbook. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9781139057493.018

Scott, Mike. 2010. WordSmith tools version 5.0. Lexical Analysis Software, Liverpool.

Scott, Mike and Christopher Tribble. 2006. Textual patterns. Key words and corpus analysis in language education. Amsterdam/New York: John Benjamins. DOI: 10.1075/scl.22

Siegel, Erik and Adam Retter. 2014. eXist: A NoSQL Document Database and Application Platform. Sebastopol/CA: O’Reilly.

Walmsley, Priscilla. 2007. XQuery. Sebastopol/CA: O’Reilly.

Watt, Andrew. 2002. XPath essentials. New York: Wiley and Sons.
