
Subject: RE: POS-tagging for spoken English and learner English - msg#00052

List: science.linguistics.corpora

Adam,

Folks in UCREL at Lancaster and elsewhere have got some experience of running
CLAWS over corpora such as the spoken part of the BNC, MICASE, ICLE, and
historical corpora (Nameless Shakespeare). My impression in general is that the
statistical HMM component of the tagger provides the robustness you need for
these kinds of tasks, but you need to accompany that with tweaks to the other
components such as the 'idiom' lists and tokenisation.

Here's some more detail:

1. In the BNC project, the CLAWS transition probabilities were retrained on
spoken data. Also there were lexicon additions, special treatment of
contractions, truncated words and repetition, all closely tied to the
transcription and encoding formats in the BNC spoken corpus. For more detail,
see:

Garside, R. (1995) Grammatical tagging of the spoken part of the British
National Corpus: a progress report. In Leech, G., Myers, G. and Thomas, J.
(eds) (1995), Spoken English on Computer: Transcription, Mark-up and
Application. pp.161-7.

Garside, R., and Smith, N. (1997) A hybrid grammatical tagger: CLAWS4, in
Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic
Information from Computer Text Corpora. Longman, London, pp. 102-121.

Also see Nicholas Smith and Geoff Leech's manual for BNC version 2, which has
error analysis comparing written and spoken:

http://www.comp.lancs.ac.uk/ucrel/bnc2/bnc2error.htm
http://www.comp.lancs.ac.uk/ucrel/claws/
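
As a rough illustration of what retraining transition probabilities on spoken
data involves, here is a minimal Python sketch. It is not CLAWS code: the
function name, the add-one smoothing and the two toy utterances are all
assumptions made for the example, and only the tag labels (UH, PPIS1, VV0) are
borrowed from the C7 tagset.

```python
from collections import Counter, defaultdict

def estimate_transitions(tagged_sentences, smoothing=1.0):
    """Estimate P(next_tag | prev_tag) from sentences of (word, tag) pairs.

    Hypothetical helper for illustration; uses additive smoothing so that
    unseen tag bigrams still receive a small probability.
    """
    bigram_counts = defaultdict(Counter)
    tagset = set()
    for sentence in tagged_sentences:
        # Pad each sentence with start and end markers.
        tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
        tagset.update(tags)
        for prev, nxt in zip(tags, tags[1:]):
            bigram_counts[prev][nxt] += 1

    targets = sorted(tagset - {"<s>"})  # nothing transitions into <s>
    probs = {}
    for prev in sorted(tagset - {"</s>"}):  # nothing follows </s>
        total = sum(bigram_counts[prev].values()) + smoothing * len(targets)
        probs[prev] = {
            t: (bigram_counts[prev][t] + smoothing) / total for t in targets
        }
    return probs

# Toy "spoken" data with a filler and a repetition (invented for the example).
spoken_sample = [
    [("er", "UH"), ("I", "PPIS1"), ("I", "PPIS1"), ("think", "VV0")],
    [("yeah", "UH"), ("I", "PPIS1"), ("know", "VV0")],
]
transitions = estimate_transitions(spoken_sample)
```

On real data the repeated-word pattern (PPIS1 followed by PPIS1) would be far
more frequent in speech than in writing, which is the kind of difference
retraining is meant to capture.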

2. I don't have figures for MICASE or for an ICLE sub-corpus, both of which we
tagged with CLAWS, but I came away with the general impression, as above, that
the probability matrix provides robustness in these types of text, which you
might expect to cause problems for automatic POS annotation. For learner data, of
course, POS tagging accuracy depends on how advanced the learners are. You
could have a look at

Bertus van Rooy and Lande Schäfer: An evaluation of three POS taggers for the
tagging of the Tswana Learner English Corpus

comparing TOSCA-ICLE, Brill tagger, and CLAWS on their data. This was presented
at the learner corpus workshop at Corpus Linguistics 2003. The abstract is at
http://tonolab.meikai.ac.jp/~tono/cl2003/lcdda/abstracts/rooy.html
and the full paper is in the CL2003 proceedings.
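
For anyone wanting to run a similar comparison on their own data, per-token
accuracy against a hand-corrected sample is the usual starting point. The
sketch below is only an illustration: the function names, the tagger labels
and the four-token sample are invented, and it says nothing about how the
paper above computed its figures.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Proportion of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

def compare_taggers(gold_tags, outputs_by_tagger):
    """Return {tagger_name: accuracy} for several taggers on one gold sample."""
    return {
        name: tagging_accuracy(gold_tags, tags)
        for name, tags in outputs_by_tagger.items()
    }

# Invented sample: one gold sequence and the output of two hypothetical taggers.
gold = ["PPIS1", "VV0", "AT", "NN1"]
outputs = {
    "tagger_a": ["PPIS1", "VV0", "AT", "NN1"],
    "tagger_b": ["PPIS1", "NN1", "AT", "NN1"],
}
scores = compare_taggers(gold, outputs)
```

This assumes all taggers share one tokenisation and one tagset; in practice
the tagsets differ and have to be mapped to a common scheme first.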

3. In collaboration with Martin Mueller at Northwestern, we've recently been
applying CLAWS to the Nameless Shakespeare corpus and looking at error rates
and problems. There are other things which upset CLAWS (and would most likely
do the same for other POS taggers) such as different capitalisation and variant
spellings. Our approach has been to pre-process these as much as possible,
retaining original variants, but fooling CLAWS, if you like, into tagging a
version with modern equivalents. See:

Rayson, P., Archer, D. and Smith, N. (2005) VARD versus Word: A comparison of
the UCREL variant detector and modern spell checkers on English historical
corpora. In proceedings of Corpus Linguistics 2005.

Our experience with Nameless Shakespeare was that CLAWS' current statistical
language model copes pretty well with data from that time, but we expect that the
probability matrix will need to be retrained if we attempt tagging data much
earlier than 1550/1600.
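
The pre-processing idea in point 3 can be sketched roughly as follows. This is
not VARD or CLAWS code: the variant table, the function names and the stand-in
tagger are all hypothetical, and a real variant detector does much more than a
dictionary lookup. The point is only that the tagger sees the modern form
while the original spelling is kept alongside the tag.

```python
def modernise(token, variant_map):
    """Return the modern equivalent of a token, or the token itself.

    Lookup is case-insensitive; an initial capital on the original is kept.
    """
    modern = variant_map.get(token.lower())
    if modern is None:
        return token
    if token[:1].isupper():
        modern = modern[:1].upper() + modern[1:]
    return modern

def tag_with_modern_equivalents(tokens, variant_map, tagger):
    """Tag the modernised text but return (original, modern, tag) triples.

    `tagger` is any callable taking a list of tokens and returning a list
    of tags of the same length (an assumption made for this sketch).
    """
    modern_tokens = [modernise(tok, variant_map) for tok in tokens]
    tags = tagger(modern_tokens)
    if len(tags) != len(tokens):
        raise ValueError("tagger must return one tag per token")
    return list(zip(tokens, modern_tokens, tags))

# Hypothetical variant table and a dummy tagger that labels everything "X".
variants = {"haue": "have", "loue": "love", "vs": "us"}
dummy_tagger = lambda toks: ["X"] * len(toks)
result = tag_with_modern_equivalents(["I", "haue", "Loue"], variants, dummy_tagger)
```

Keeping the triples means the original spelling is never lost, and the
modernised column can be dropped once tagging is done.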

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL (University Centre for Computer Corpus Research on Language)
Computing Department, Infolab21, South Drive, Lancaster University, Lancaster,
LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
New telephone number: +44 1524 510357 Fax: +44 1524 510492






Previous Message by Thread:

Re: POS-tagging for spoken English and learner English

Adam Kilgarriff wrote:

> Do you have recent experiences of using available taggers on either of these kinds of data? Reports including accuracy figures would be particularly useful.

We have recently tagged a 300,000 word corpus of spoken French. Strategy and evaluation are reported here:

Campione, E., Véronis, J., and Deulofeu, J. (2005) 3. The French corpus. In Cresti, E. and Moneglia, M. (eds), C-ORAL-ROM, Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins, pp. 111-133. [Draft on-line: http://www.up.univ-mrs.fr/veronis/pdf/2005-Coralrom-book.pdf]

The pleasant surprise is that we achieved results as good as those we get on written corpora (ca. 98% precision). This is probably because, on the one hand, spoken corpora are more difficult owing to disfluencies (repetitions, repairs, etc.), but on the other hand, their lexicon is much smaller and sentence complexity much lower.

Best wishes
--j
http://aixtal.blogspot.com

Next Message by Thread:

Re: POS-tagging for spoken English and learner English

Hi, Adam and colleagues,

I agree with Paul that "For learner data … POS tagging accuracy depends on how advanced the learners are". I have tried to have a native speaker corpus, LOCNESS, and a learner corpus, COLEC as I call it, POS tagged. It works perfectly well with LOCNESS. Unfortunately, I was let down by the inaccuracy of the tagging of COLEC, which is due to the special features of the learners' errors.

I am not a computer person, but I speculate that when a tagging system is devised, it is based on the syntax rules most native speakers abide by. However, non-native speakers, especially those at an intermediate level or below, do not produce language in the way native speakers do. You can hardly imagine how messy learner English can be. That would cause a huge problem for the POS tagging of a learner corpus and could very likely disable the whole tagging system. Granger discussed this point in her article in Granger, S., Hung, J. and Petch-Tyson, S. (eds) (2002) Computer Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins Publishing Company.

Of course, that does not mean there are no solutions. If people try hard enough, they may achieve a better accuracy rate. As far as I can see (pardon me if I am talking nonsense), the tagging system should at least not be based on native speaker syntax rules. Perhaps it should be trained on adequate learner English data? The problem is that it is hard to find a set of syntax rules for learner English. Anyway, I will keep all my fingers crossed for those who are dealing with this part of tagging system design.

All the best
Xiaotian Guo
PhD Candidate
The Department of English
The University of Birmingham