New Data from the LDC: msg#00108

science.linguistics.corpora

Subject: New Data from the LDC

LDC2006S26
CSLU: Speaker Recognition Version 1.1

LDC2006T10
English-Arabic Treebank V1.0

LDC2006S33
Middle East Technical University Turkish Microphone Speech V 1.0


In this month's newsletter, the Linguistic Data Consortium (LDC) would like to announce the availability of three new publications.


New Publications

(1)  CSLU: Speaker Recognition Version 1.1 consists of telephone speech from 91 participants. Each participant recorded speech in twelve sessions over a two-year period, answering questions like "what is your eye color" or responding to prompts like "describe a typical day in your life." Most of the utterances in the corpus have corresponding non-time-aligned word-level transcriptions.

The goal of the Speaker Recognition data collection was to gather speech from each participant over a two-year period. Each participant called the data collection system twelve times during that period and said the same utterances each time.

*

(2)  English-Arabic Parallel Treebank V1.0 consists of 52,238 words in 224 files of individual Agence France Presse (AFP) news stories (corresponding to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02). The English translation was provided by LDC, and was part-of-speech tagged and treebanked for this project.

The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
  1. POS: tokenization of hyphenated items ("New York-based" has been replaced by "New York - based" for example), and the addition of HYPH and AFX tags necessitated by this change in tokenization
  2. TreeBank: the addition of the node label NML for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
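
The tokenization change in item 1 can be illustrated with a minimal Python sketch. It is not the annotation project's actual tokenizer: the function name split_hyphens is hypothetical, and the rule used here (split at every hyphen that sits between two word characters) is an assumption made for illustration. The real guidelines may include exceptions that this sketch ignores.

```python
import re

def split_hyphens(tokens):
    """Split word-internal hyphens into separate tokens (illustrative rule only)."""
    out = []
    for tok in tokens:
        # Split only where a hyphen has a word character on both sides,
        # keeping the hyphen itself as its own token. Tokens such as "--"
        # are left untouched because no such position exists in them.
        out.extend(re.split(r"(?<=\w)(-)(?=\w)", tok))
    return out

print(" ".join(split_hyphens("New York-based".split())))
```

With this rule, each standalone hyphen becomes a token of its own, which is the kind of token the added HYPH tag would then have to label.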

*

(3)  The Middle East Technical University Turkish Microphone Speech V 1.0 corpus was collected at the Middle East Technical University (METU) as part of a collaboration between the Department of Electrical and Electronics Engineering of the Middle East Technical University in Turkey and the Center for Spoken Language Research (CSLR) of the University of Colorado at Boulder, USA.  The corpus was used to port SONIC, the speech recognition system of CSLR, to Turkish.

The corpus contains text, speech, and alignment files.  120 speakers (60 male and 60 female) spoke 40 sentences each for a total of approximately 500 minutes of speech. The 40 sentences were selected randomly for each speaker from a triphone-balanced set of 2462 Turkish sentences. All participants were native speakers of Turkish.
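
As a rough check on the figures above, the following Python sketch mimics the per-speaker prompt selection and derives the implied corpus size. It is an illustration only: the announcement says the sentences were "selected randomly" but does not say how, so sampling without replacement within a speaker, the fixed seed, and all names in the code are assumptions.

```python
import random

NUM_SPEAKERS = 120          # 60 male and 60 female, per the announcement
SENTENCES_PER_SPEAKER = 40
POOL_SIZE = 2462            # triphone-balanced sentence set
TOTAL_MINUTES = 500         # approximate total, per the announcement

def assign_prompts(seed=0):
    """Draw 40 distinct sentence indices per speaker (assumed sampling scheme)."""
    rng = random.Random(seed)
    return {
        speaker: rng.sample(range(POOL_SIZE), SENTENCES_PER_SPEAKER)
        for speaker in range(NUM_SPEAKERS)
    }

prompts = assign_prompts()
total_utterances = NUM_SPEAKERS * SENTENCES_PER_SPEAKER
avg_seconds = TOTAL_MINUTES * 60 / total_utterances
print(total_utterances, avg_seconds)
```

Under these assumptions the corpus holds 4,800 utterances, so the stated total of roughly 500 minutes works out to about 6 seconds of speech per sentence.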



If you need further information, or would like to inquire about membership in the LDC, please email ldc@xxxxxxxxxxxxx or call +1 215 573 1275.



--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc@xxxxxxxxxxxxx
Philadelphia, PA 19104                         http://www.ldc.upenn.edu