Adam,
Folks in UCREL at Lancaster and elsewhere have got some experience of running
CLAWS over corpora such as the spoken part of the BNC, MICASE, ICLE, and
historical corpora (Nameless Shakespeare). My impression in general is that the
statistical HMM component of the tagger provides the robustness you need for
these kinds of tasks, but you need to accompany that with tweaks to the other
components such as the 'idiom' lists and tokenisation.
Here's some more detail:
1. In the BNC project, the CLAWS transition probabilities were retrained on
spoken data. Also there were lexicon additions, special treatment of
contractions, truncated words and repetition, all closely tied to the
transcription and encoding formats in the BNC spoken corpus. For more detail,
see:
Garside, R. (1995) Grammatical tagging of the spoken part of the British
National Corpus: a progress report. In Leech, G., Myers, G. and Thomas, J.
(eds) (1995), Spoken English on Computer: Transcription, Mark-up and
Application. pp.161-7.
Garside, R., and Smith, N. (1997) A hybrid grammatical tagger: CLAWS4, in
Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic
Information from Computer Text Corpora. Longman, London, pp. 102-121.
Also see Nicholas Smith and Geoff Leech's manual for BNC version 2, which has
error analysis comparing written and spoken:
http://www.comp.lancs.ac.uk/ucrel/bnc2/bnc2error.htm
http://www.comp.lancs.ac.uk/ucrel/claws/
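To make the retraining in point 1 a bit more concrete, here is a minimal
sketch of re-estimating tag transition probabilities from a hand-corrected
spoken sample. This is not CLAWS code: the function name, the "<s>" start
marker, the add-one smoothing and the toy tags are all assumptions made for
illustration only.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag (an assumption)


def estimate_transitions(tagged_sentences, smoothing=1.0):
    """Estimate P(tag | previous tag) from sentences of (word, tag) pairs.

    Additive smoothing keeps unseen tag bigrams from getting zero
    probability; the smoothing scheme is an illustrative choice.
    """
    bigram_counts = defaultdict(Counter)
    tagset = set()
    for sentence in tagged_sentences:
        prev = START
        for _word, tag in sentence:
            bigram_counts[prev][tag] += 1
            tagset.add(tag)
            prev = tag

    probs = {}
    for prev, counts in bigram_counts.items():
        total = sum(counts.values()) + smoothing * len(tagset)
        probs[prev] = {t: (counts[t] + smoothing) / total for t in tagset}
    return probs


# Toy spoken sample with a repetition ("I I"); tags are illustrative labels.
spoken_sample = [
    [("er", "UNC"), ("I", "PNP"), ("I", "PNP"), ("think", "VVB"), ("so", "AV0")],
    [("yeah", "ITJ"), ("I", "PNP"), ("know", "VVB")],
]
transitions = estimate_transitions(spoken_sample)
print(transitions["PNP"]["PNP"])  # pronoun-to-pronoun mass picked up from the repetition
```

The point of the toy data is that repetitions and fillers, which are rare in
written text, shift probability mass onto transitions that a matrix trained on
written data would treat as unlikely.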
2. I don't have figures for MICASE or for an ICLE sub-corpus, both of which we
tagged with CLAWS, but I came away with the same general impression as above:
the probability matrix provides robustness on types of text that you might
expect to cause problems for automatic POS annotation. For learner data, of
course, POS tagging accuracy depends on how advanced the learners are. You
could have a look at
Bertus van Rooy and Lande Schäfer: An evaluation of three POS taggers for the
tagging of the Tswana Learner English Corpus
comparing TOSCA-ICLE, Brill tagger, and CLAWS on their data. This was presented
at the learner corpus workshop at Corpus Linguistics 2003. The abstract is at
http://tonolab.meikai.ac.jp/~tono/cl2003/lcdda/abstracts/rooy.html
and the full paper is in the CL2003 proceedings.
3. In collaboration with Martin Mueller at Northwestern, we've recently been
applying CLAWS to the Nameless Shakespeare corpus and looking at error rates
and problems. There are other things which upset CLAWS (and would most likely
do the same for other POS taggers) such as different capitalisation and variant
spellings. Our approach has been to pre-process these as much as possible,
retaining original variants, but fooling CLAWS, if you like, into tagging a
version with modern equivalents. See:
Rayson, P., Archer, D. and Smith, N. (2005) VARD versus Word: A comparison of
the UCREL variant detector and modern spell checkers on English historical
corpora. In proceedings of Corpus Linguistics 2005.
Our experience with Nameless Shakespeare was that CLAWS' current statistical
language model copes pretty well with data from that period, but we expect that the
probability matrix will need to be retrained if we attempt tagging data much
earlier than 1550/1600.
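The pre-processing idea in point 3 can be sketched roughly as follows. This is
not VARD or CLAWS code: the variant list, the function names and the stand-in
tagger are all hypothetical, and a real variant list would be far larger and
need context to resolve ambiguous forms.

```python
# Hypothetical variant -> modern spelling list (an assumption for illustration).
VARIANTS = {"loue": "love", "haue": "have", "doe": "do", "vp": "up"}


def normalise(tokens, variants=VARIANTS):
    """Pair each original token with a modernised form for the tagger."""
    pairs = []
    for tok in tokens:
        modern = variants.get(tok.lower(), tok)
        # Carry an initial capital over to the modern equivalent.
        if modern != tok and tok[:1].isupper():
            modern = modern.capitalize()
        pairs.append((tok, modern))
    return pairs


def tag_with_originals(tokens, tagger, variants=VARIANTS):
    """Tag the modernised text but keep the original spellings alongside."""
    pairs = normalise(tokens, variants)
    tags = tagger([modern for _orig, modern in pairs])
    return [(orig, modern, tag) for (orig, modern), tag in zip(pairs, tags)]


def toy_tagger(words):
    """Stand-in for a real tagger: a tiny lookup with illustrative tags."""
    lexicon = {"i": "PNP", "have": "VHB", "do": "VDB", "love": "VVB"}
    return [lexicon.get(w.lower(), "UNC") for w in words]


for row in tag_with_originals(["I", "doe", "loue", "thee"], toy_tagger):
    print(row)
```

Keeping the (original, modern, tag) triple is what lets you fool the tagger
with modern forms while still retaining the original variants in the output.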
Regards,
Paul.
Dr. Paul Rayson
Director of UCREL (University Centre for Computer Corpus Research on Language)
Computing Department, Infolab21, South Drive, Lancaster University, Lancaster,
LA1 4WA, UK.
Web:
http://www.comp.lancs.ac.uk/computing/users/paul/
New telephone number: +44 1524 510357 Fax: +44 1524 510492
Previous Message by Date:
slides from cl 2005 web-as-corpus workshop
Dear All,
The slides from the Corpus Linguistics 2005 Web-as-Corpus
workshop/tutorial are now available for download from the workshop site:
http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html
Regards,
Marco
--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni
Next Message by Date:
Digital documents and interpretation Conference - First announcement and call for papers
INTERNATIONAL CONFERENCE AND SUMMER SCHOOL
Albi, July 10-14th 2006
Organized within the framework of the Albi Languages and Signification
Conferences (CALS)
DIGITAL DOCUMENTS AND INTERPRETATION
— HUMANITIES/SOCIAL SCIENCES CORPORA
First announcement
-----------------------------------------------
Corpus analysis and building are redefining the practices, and even the
theories, of the humanities. As these disciplines deal more and more
with digital documents, they have to reconsider their relation
to the empirical. The digitization of scientific texts also involves
a reflexive return to their very development. Do these new ways of
accessing documents generate new forms of knowledge construction?
The new national and international initiatives (e.g. the creation of
the Centre for digital scientific edition of the French CNRS, and the TGE
Adonis, a very large facility for access to digital data and
documents in the humanities and social sciences) may offer the
opportunity to build a unifying project for the humanities and social
sciences. Numerous communities have long been involved in
thinking about digitization and computer-assisted analysis: information
sciences, but also history, sociology, linguistics, archaeology and
literary studies, a non-exhaustive list of course.
Therefore, the aim of the conference is to reinforce links and to
encourage connections between teachers and researchers belonging to
these disciplines and the communities of corpus linguistics and
digital documents. Without much consideration towards ordinary
objectivism, the conference will deal with the philological and
hermeneutical problems corpus-based works have to handle, according to
the tasks and the disciplines: for instance, genre and discourse
typologies, description of semantic forms and contents, theme
identification, concept characterization and evolution, form and
content correlations.
On the practical level, the conference will tackle the questions raised
by corpus collection, building, coding, tagging and processing, and by
digital editing. Software demonstrations are scheduled, as well as
introductions to issues specific to the disciplines concerned.
Important dates
-----------------------------------------------
Paper submissions: authors are invited to submit a one-page abstract
containing references and keywords. Abstracts should be sent as
attached files to LPE2@xxxxxxxxxxxxxx
September 1, 2005: Abstract submission begins
December 31, 2005: Paper submission deadline
February 1, 2006: Notification of acceptance
June 1, 2006: Camera-ready copies of accepted papers
Camera-ready copy should not exceed 10 pages (plus abstract). All the
accepted papers will be put online before the conference.
Papers should be submitted in PDF and conform to the guidelines
available from
http://www.revue-texto.net/Redaction/Normes/Consignes.html
Organization
-----------------------------------------------
Initiative committee: François Rastier, Michel Ballabriga, Pierre
Marillaud.
Scientific committee: Etienne Brunet (University of Nice), Michel
Ballabriga (University of Toulouse le Mirail), Kjersti Floettum
(University of Bergen), Andrea Iacovella (CNRS, Cens et TGE Adonis),
Ioannis Kanellos (ENST, Brest), Bénédicte Pincemin (LLI-CNRS,
Villetanneuse), François Rastier (CNRS, Paris), André Salem
(University of Paris III), Monique Slodzian (INALCO), Mathieu Valette
(Atilf-CNRS, Nancy), Geoffrey Williams (University of Bretagne Sud –
Lorient).
Organization committee: Carine Duteil (University of Toulouse le
Mirail), Baptiste Foulquié (University of Toulouse le Mirail), Céline
Poudat (University of Orléans).
Further information
-----------------------------------------------
Information and pre-registration: CALS:
beatrixmarillaud.cals@xxxxxxxxxx
Conference registration fee: 40 euros; students: 20 euros
Conference location: Centre Saint Amarand, 16 rue de la République,
81000, Albi.
Accommodation (limited number of places): Single room: 20 euros,
double: 29 euros (including breakfasts). Lunch: 11 euros (including
drinks).
Grants may be offered, provided there are justifiable reasons for
such a request.
Conference dates: July 10 – July 14 (five full days).
With the support of UPS TGE Adonis, CPST (University of Toulouse le
Mirail) and Institut Ferdinand de Saussure (France).
Previous Message by Thread:
Re: POS-tagging for spoken English and learner English
Adam Kilgarriff wrote:
Do you have recent experiences of using available taggers on either of
these kinds of data?
Reports including accuracy figures would be particularly useful.
We have recently tagged a 300,000 word corpus of spoken French. The strategy
and evaluation are reported here:
Campione, E., Véronis, J., & Deulofeu, J (2005). 3. The French corpus.
In Cresti, E. & Moneglia, M. (Eds.), /C-ORAL-ROM, Integrated Reference
Corpora for Spoken Romance Languages,/ (pp. 111-133). Amsterdam: John
Benjamins.
[Draft on-line:
http://www.up.univ-mrs.fr/veronis/pdf/2005-Coralrom-book.pdf]
The pleasant surprise is that we achieved results as good as those we get on
written corpora (ca. 98% precision). This is probably because, although
spoken corpora are more difficult on account of
disfluencies (repetitions, repairs, etc.), their
lexicon is much smaller and their sentence complexity much lower.
Best wishes
--j
http://aixtal.blogspot.com
Next Message by Thread:
Re: POS-tagging for spoken English and learner English
Hi, Adam and colleagues
I agree with Paul that "For learner data … POS tagging accuracy
depends on how advanced the learners are".
I have tried to have a native speaker corpus, LOCNESS, and a learner
corpus, COLEC as I call it, POS tagged. It works perfectly well with
LOCNESS. Unfortunately, I was let down by the inaccuracy of the
tagging of COLEC, due to the special features of the learners' errors. I
am not a computer person, but I speculate that when a tagging system
is devised, it is based on the syntax rules most native speakers
abide by. However, non-native speakers, especially those at an
intermediate level or below, do not produce language in the way
native speakers do. You can hardly imagine how messy learner
English can be. That causes a huge problem for the POS tagging
of a learner corpus and may very well disable the whole
tagging system. Granger discussed this point in her article in
Granger S., Hung J. and Petch-Tyson S. (eds) 2002. Computer Corpora,
Second Language Acquisition and Foreign Language Teaching. Amsterdam:
John Benjamins Publishing Company.
Of course, this does not mean there are no solutions. If
people try hard enough, they may come up with a better accuracy rate.
As far as I can see (pardon me if I am talking nonsense), at least the
tagging system should not be based on native speaker syntax rules.
Perhaps the tagging system should be trained on adequate learner
English data? But the problem is that it is hard to find a set of
syntax rules for learner English. Anyway, I will keep all my fingers
crossed for those who are dealing with this part of tagging system
design.
All the best
Xiaotian Guo
PhD Candidate
The Department of English
The University of Birmingham