Re: Cost of part of speech tagging (msg#00158, science.linguistics.corpora)
Hi,

On Sat, Dec 23, 2006 at 08:39:06PM -0700, Marc Carmen wrote:
| Hello all! Does anyone have any thoughts on what the cost of annotating a
| corpus with part of speech tags is? For example, would you pay someone per
| word or per sentence and how much? Any other thoughts or information on
| corpus preparation and financial cost would be very helpful. Thanks in
| advance for any thoughts.
|
| --
| Thanks,
| Marc Carmen
| marc.carmen@xxxxxxxxx

This depends on a number of dimensions, including:

* The type of data being tagged (e.g. news or poetry);
* The quality of the tokenization provided;
* How narrowly the annotation tool is tailored to the task;
* The size of the tagset (i.e. the number of distinct tags);
* The education/training of the annotator;
* The availability of native speakers of the language.

The pay rate can vary widely, but $8-15/hour is typical. For a news dataset of 100K words and a tagset of about 15 distinct tags, this works out to about 250 words/hour with the right tool, or 400 hours of native-speaker effort. In other words: $3200-6000 for labor ... plus overhead, data formatting, training, supervision and quality control.

You may be able to pay by the word, but I don't have any experience with this approach. Assuming $0.20/word (slightly less than the standard going rate for a typical translation task), the same job would cost $20,000, and would probably still generate additional overhead, data formatting, training and quality control costs.

LDC has some (open source, web-based) tools for POS annotation, but is still in the process of making them publicly available. Please let me know if you're interested and I'll try to put you in touch with the right people.

-Christopher.

--
---------------------------------------
Christopher R. Walker, Project Manager
Automatic Content Extraction (ACE) &
Less-Commonly Taught Languages (LCTL)
LDC Annotation Lab
chwalker@xxxxxxxxxxxxx
215.898.0946
---------------------------------------
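
For anyone who wants to rerun the arithmetic in the reply with their own numbers, here is a minimal Python sketch. The function names (`hourly_labor_cost`, `per_word_labor_cost`) are made up for illustration, and the inputs are simply the figures quoted in the message (100K words, about 250 words/hour, $8-15/hour, $0.20/word). It covers labor only; overhead, data formatting, training, supervision and quality control are not modeled.

```python
def hourly_labor_cost(words, words_per_hour, rate_low, rate_high):
    """Return (hours, low_cost, high_cost) for annotators paid by the hour."""
    hours = words / words_per_hour
    return hours, hours * rate_low, hours * rate_high


def per_word_labor_cost(words, rate_per_word):
    """Return the labor cost for annotators paid by the word."""
    return words * rate_per_word


# Figures taken from the message; swap in your own estimates.
corpus_words = 100_000
hours, low, high = hourly_labor_cost(corpus_words, 250, 8, 15)
by_word = per_word_labor_cost(corpus_words, 0.20)

print(f"Hourly: {hours:.0f} hours, ${low:,.0f} to ${high:,.0f}")
print(f"Per word: ${by_word:,.0f}")
```

The throughput figure dominates the hourly estimate, so it is probably worth timing a small pilot batch with your own tool and tagset before trusting the 250 words/hour assumption.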