Re: opening unknown fasta file: msg#00032

java.bio.general

Subject: Re: opening unknown fasta file

One way to do this would be to create a Unicode alphabet (or ASCII
alphabet), read the file into a Sequence of that Alphabet, create a
Distribution from it, compare that to the DNA/RNA/Protein distributions
using DistributionTools, and then convert it to the correct Alphabet.
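
As a rough illustration of the compare-the-distributions idea, here is a
minimal sketch in plain Java. It does not use the BioJava Alphabet,
Distribution or DistributionTools classes; the class name
CompositionGuesser, the L1 distance and the uniform reference profiles are
all assumptions made for the example. Real reference profiles would have to
be estimated from curated sequence data.

```java
import java.util.HashMap;
import java.util.Map;

final class CompositionGuesser {

    // Placeholder reference profiles (an assumption): uniform over the core
    // symbols of each alphabet, not frequencies measured from real data.
    static final Map<Character, Double> DNA = uniform("ACGT");
    static final Map<Character, Double> RNA = uniform("ACGU");
    static final Map<Character, Double> PROTEIN = uniform("ACDEFGHIKLMNPQRSTVWY");

    static Map<Character, Double> uniform(String symbols) {
        Map<Character, Double> p = new HashMap<>();
        for (char c : symbols.toCharArray()) {
            p.put(c, 1.0 / symbols.length());
        }
        return p;
    }

    /** Relative frequency of each letter in the residues, ignoring case. */
    static Map<Character, Double> distribution(CharSequence residues) {
        Map<Character, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (Character.isLetter(c)) {
                counts.merge(c, 1.0, Double::sum);
                total++;
            }
        }
        final int n = total;
        counts.replaceAll((symbol, count) -> count / n);
        return counts;
    }

    /** L1 distance between two distributions over the union of their symbols. */
    static double distance(Map<Character, Double> p, Map<Character, Double> q) {
        double d = 0.0;
        for (Map.Entry<Character, Double> e : p.entrySet()) {
            d += Math.abs(e.getValue() - q.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<Character, Double> e : q.entrySet()) {
            if (!p.containsKey(e.getKey())) {
                d += e.getValue();
            }
        }
        return d;
    }

    /** Returns the name of the reference profile closest to the observed one. */
    static String guess(CharSequence residues) {
        Map<Character, Double> observed = distribution(residues);
        String best = "DNA";
        double bestDistance = distance(observed, DNA);
        double rna = distance(observed, RNA);
        if (rna < bestDistance) {
            best = "RNA";
            bestDistance = rna;
        }
        if (distance(observed, PROTEIN) < bestDistance) {
            best = "PROTEIN";
        }
        return best;
    }
}
```

Calling CompositionGuesser.guess(...) on the residues gives a label that
could then be used to pick the Alphabet for the real parse. With such crude
profiles it is only a heuristic, and short sequences can easily be
misclassified.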

Even more ambitious would be to read the whole file to a text buffer and
guess the format and alphabet based on the usage of characters.

Does anyone feel inspired to do something like this? We are always getting
emails from students looking for short projects. How about that one? My
basic minimal requirement would be that the file should not be read twice.
I/O is expensive; memory is cheap.
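
The read-once requirement could be sketched roughly as follows: pull the
whole input into memory in a single pass, then do any guessing and the real
parse against the in-memory text. The class and method names
(SinglePassFasta, slurp, residues) are hypothetical, and the sketch assumes
the file is small enough to hold in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class SinglePassFasta {

    /** Reads every character from the reader exactly once into memory. */
    static String slurp(Reader in) {
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    /** Concatenates the residue lines of buffered FASTA text, skipping '>' headers. */
    static String residues(String fastaText) {
        StringBuilder residues = new StringBuilder();
        for (String line : fastaText.split("\\R")) {
            if (!line.startsWith(">")) {
                residues.append(line.trim());
            }
        }
        return residues.toString();
    }
}
```

After the alphabet has been chosen from residues(text), the same text can be
handed to the real parser through a java.io.StringReader, so the file itself
is only touched once.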

- Mark





Thomas Down <thomas@xxxxxxxxxxxx>
Sent by: biojava-l-bounces@xxxxxxxxxxxxxxxxxxx
11/13/2004 12:26 AM


To: Mark Schreiber/GP/Novartis@PH
cc: biojava-list <biojava-l@xxxxxxxxxxx>
Subject: Re: [Biojava-l] opening unknown fasta file


On Fri, Nov 12, 2004 at 10:01:13AM +0800,
mark.schreiber@xxxxxxxxxxxxxxxxxx wrote:
>
> Basically there is absolutely no failsafe way to know if a fasta file is
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> which contains only a, c, g and t, although it becomes very unlikely with
> longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
that appear in DNA sequences. Ns are everywhere, but many of the other
ambiguities appear from time to time, too.
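
To make the overlap concrete, here is a small sketch (the names IupacCheck,
onlyUses and sharedSymbols are hypothetical) that checks residues against
the 15 IUPAC nucleotide symbols and against the 20 standard amino acid
letters. It only tests symbol membership, so it cannot settle the question
by itself.

```java
final class IupacCheck {

    // A, C, G, T plus the 11 IUPAC ambiguity symbols for DNA.
    static final String DNA_SYMBOLS = "ACGT" + "RYSWKMBDHVN";
    // One-letter codes of the 20 standard amino acids.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    /** True if every residue (ignoring case) is a symbol of the given alphabet. */
    static boolean onlyUses(CharSequence residues, String alphabet) {
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (alphabet.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    /** DNA symbols that are also valid amino acid letters, in DNA_SYMBOLS order. */
    static String sharedSymbols() {
        StringBuilder shared = new StringBuilder();
        for (char c : DNA_SYMBOLS.toCharArray()) {
            if (AMINO_ACIDS.indexOf(c) >= 0) {
                shared.append(c);
            }
        }
        return shared.toString();
    }
}
```

Every DNA symbol except B is also a standard amino acid letter, so a DNA
sequence with ambiguity codes and no B passes both membership checks. That
is why a simple "does it contain only these characters" test is not enough.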

If we were *really* serious about alphabet-guessing (which scares me, to be
honest), one option would be to calculate histograms of character
frequencies in EMBL and Swissprot, and look for the closest match. I believe
that Internet Explorer takes this approach when it hits a web page without
an explicitly-specified character encoding -- it apparently works pretty
well...

Does anyone feel this serious?

Thomas.
_______________________________________________
Biojava-l mailing list - Biojava-l@xxxxxxxxxxx
http://biojava.org/mailman/listinfo/biojava-l


