Volume 38 (2014): Issue 1 (April 2014)
Journal Details
Format: Journal
eISSN: 1502-5462
First Published: 28 Apr 2014
Publication timeframe: 1 time per year
Languages: English
Access type: Open Access

Modest XML for Corpora: Not a standard, but a suggestion

Published Online: 28 Apr 2014
Volume & Issue: Volume 38 (2014) - Issue 1 (April 2014)
Page range: 73 - 103
Abstract

This paper argues for, and presents, a modest approach to XML encoding for use by the majority of contemporary linguists who need to engage in corpus construction. While extensive standards for corpus encoding exist - most notably, the Text Encoding Initiative’s Guidelines and the Corpus Encoding Standard based on them - these are rather heavyweight approaches, implicitly intended for major corpus-building projects, which are rather different from the increasingly common efforts in corpus construction undertaken by individual researchers in support of their personal research goals. Therefore, there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding. This paper presents such a set of suggestions, dubbed Modest XML for Corpora, and posits that such a set of pointers to a limited level of XML knowledge could work as part of the normal, general training of corpus linguists.

The Modest XML recommendations cover the following set of things, which, according to the foregoing argument, are sufficient knowledge about XML for most corpus linguists’ day-to-day needs: use of tags; adding attribute value pairs; recommended use of attributes; nesting of tags; encoding of special characters; XML well-formedness; a collection of de facto standard tags and attributes; going beyond the basic de facto standard tags; and text headers.
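
As a rough illustration of several items in this list (tags, attribute-value pairs, nesting, special-character encoding, and well-formedness), the following minimal sketch parses a tiny corpus fragment with Python's standard `xml.etree.ElementTree` module. The tag and attribute names (`text`, `s`, `w`, `id`, `n`, `pos`) and the helper `is_well_formed` are hypothetical choices made for this example; they are not taken from the paper's recommended tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment. The tag and attribute names are illustrative
# assumptions, not the de facto standard tags the paper recommends.
SAMPLE = """<text id="sample01">
  <s n="1"><w pos="NN1">Fish</w> <w pos="CJC">&amp;</w> <w pos="NN2">chips</w></s>
</text>"""


def is_well_formed(xml_string):
    """Return True if the string parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# Tags nest properly, attributes carry annotation, and the ampersand is
# written as the entity reference &amp; so that the text stays well-formed.
root = ET.fromstring(SAMPLE)
words = [(w.text, w.get("pos")) for w in root.iter("w")]

# Two common mistakes: overlapping tags and an unescaped ampersand.
BAD_NESTING = "<s><w>Fish</s></w>"
BAD_AMPERSAND = "<s>Fish & chips</s>"

if __name__ == "__main__":
    print(words)
    print(is_well_formed(BAD_NESTING), is_well_formed(BAD_AMPERSAND))
```

A check of this kind can be run over every file in a small corpus before analysis, since most XML-aware corpus tools refuse input that is not well-formed.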

Aston, Guy and Lou Burnard. 1998. The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.

Bray, Tim, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler and François Yergeau (eds.). 2008. Extensible Markup Language (XML) 1.0. Fifth edition. World Wide Web Consortium: available online at http://www.w3.org/TR/REC-xml (last accessed 30th October 2013).

Burnard, Lou. 2005. Metadata for corpus work. In Wynne (2005).

Burnard, Lou and Syd Bauman (eds.). 2013. TEI P5: Guidelines for electronic text encoding and interchange. Version 2.5.0. Last updated on 26th July 2013. TEI Consortium: available online at http://www.tei-c.org/Guidelines/P5/ (last accessed 30th October 2013).

Burnard, Lou and C.M. Sperberg-McQueen. 2012. TEI Lite: Encoding for interchange: An introduction to the TEI. Final revised edition for TEI P5. Available online at http://www.tei-c.org/Guidelines/Customization/Lite/ (last accessed 30th October 2013).

Ide, Nancy. 1996a. The ACH-ACL-ALLC Text Encoding Initiative: A brief overview. Available online at http://www.tei-c.org/Vault/SC/teij17.txt (last accessed 30th October 2013).

Ide, Nancy. 1996b. Corpus Encoding Standard. Version 1.5. Expert Advisory Group on Language Engineering Standards (EAGLES): available online at http://www.cs.vassar.edu/CES/ (last accessed 30th October 2013).

McEnery, Tony and Andrew Wilson. 2001. Corpus linguistics. Second edition. Edinburgh: Edinburgh University Press.

Sperberg-McQueen, C.M. and Lou Burnard (eds.). 1990. TEI P1: Guidelines for the encoding and interchange of machine-readable texts. Chicago, Oxford: Text Encoding Initiative.

Wynne, Martin (ed.). 2005. Developing linguistic corpora: A guide to good practice. Oxford: Oxbow Books. Available online at http://www.ahds.ac.uk/creating/guides/linguistic-corpora/ (last accessed 30th October 2013).
