osdir.com
mailing list archive

Subject: Re: Reading sequences: FormatIO, SeqIO, etc - msg#00010

List: python.bio.devel

Date: Prev Next Index Thread: Prev Next Index
> Question One
> ============
>
> Is reading sequence files an important
> function to you, and if so which file formats in particular (e.g.
> Fasta, GenBank, ...)
>
I use Fasta, GenBank, and occasionally clustalw.
>
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with
> a title, and the sequence as a string) (b) Bio.Fasta with the
> FeatureParser (giving SeqRecord objects) (c) Bio.Fasta with your own
> parser (Could you tell us more?) (d) Bio.SeqIO.FASTA.FastaReader
> (giving SeqRecord objects) (e) Bio.FormatIO (giving SeqRecord
> objects) (f) Other (Could you tell us more?)
I use Bio.Fasta with the RecordParser, but just because it's easy to
find in the documentation. As a user, I think Bio.Fasta requires too
many steps to be typed in; I would prefer something more
straightforward. For the output format, I don't care so much, but for
the sake of consistency a SeqRecord may be preferable.

>
> Question Three - index_file based dictionaries
> ============================================== Do you use any of the
> following: (a) Bio.Fasta.Dictionary (b) Bio.Genbank.Dictionary (c)
> Any other "Martel/Mindy" based dictionary which first requires
> creation of an index using the index_file function
>

No. I never really understood index files.

>
> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
>
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
>
> (c) A list giving random access by index number (e.g. load the
> records using an iterator but saving them in a list).
I use (a). It's easy to create (b) or (c), if needed, if (a) is available.
>
> Question Four - Fasta files: FastaRecord or SeqRecord
> ===================================================== If you use
> Fasta files, do you want get records returned as FastaRecords or as
> SeqRecords? If SeqRecords, do you use your own title2ids mapping?
>
> For example,
>
>> name text text text
> ACGTACACGT
>
> As a FastaRecord this would have:
>
> FastaRecord.title = "name text text text" (string)
> FastaRecord.sequence= "ACGTACACGT" (string)
>
> As a SeqRecord (with the default title2ids mapping):
>
> SeqRecord.id = (default string) SeqRecord.name = (default string)
> SeqRecord.description = "name text text text" (string) SeqRecord.seq
> = Seq("ACGTACACGT", alphabet)
I use the FastaRecord, but again for no particular reason. I have not
experienced an advantage of Seq objects over simple strings, so for me
the fact that FastaRecord contains a simple string is more convenient.
But it doesn't matter much.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ========================================================== If you use
> GenBank files, do you use: (a) Bio.Genbank.FeatureParser which
> returns SeqRecord objects (b) Bio.Genbank.RecordParser which returns
> Bio.GenBank.Record objects
>
I don't care so much, but I think that having two record types is
confusing, so it would be better if we could decide on one. A SeqRecord
is more general than a Bio.GenBank.Record, so I have a slight preference
for a SeqRecord.

>
> Question Six - Martel, Scanners and Consumers
> ============================================== Some of BioPython's
> existing parsers (e.g. those using Martel) use an event/callback
> model, where the scanner component generates parsing events which are
> dealt with by the consumer component.
>
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
>
> (a) I don't know, or don't care. I just the the parsers provided.
> (b) I use this framework to modify a parser in order to do ...
> (please provide details).
>
(a). Often, I'm just at the Python prompt typing away. What I like about
Python and Numerical Python is that the commands are often obvious and
easy to remember. With the parser framework, on the other hand, I always
need to look up in the documentation how to use them.

--Michiel


Was this page helpful?
Yes No
Thread at a glance:

Previous Message by Date: click to view message preview

[Fwd: Re: Reading sequences: FormatIO, SeqIO, etc]

(this time without the signature) -- Dr Leighton Pritchard AMRSC D131, Plant-Pathogen Interactions, Scottish Crop Research Institute Invergowrie, Dundee, Scotland, DD2 5DA, UK T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578 E: lpritc@xxxxxxxxxxxxxxx W: http://bioinf.scri.sari.ac.uk/lp GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net (If the signature does not verify, please remove the SCRI disclaimer) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster@xxxxxxxxxxxxxxx quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). --- Begin Message --- On Tue, 2006-08-01 at 21:53 +0100, Peter (BioPython Dev) wrote: > Peter then wrote: > > Good point. > > I take that back - I was right the first time ;) > > You are right to worry about the line sep changing from platform to > platform, but you shouldn't use "%s>" % os.linesep Yep - that'll learn me ;) I was thinking about one problem (code portability) and not the other, more obvious one (data sharing) across platforms... D'oh! > To get any platform to read newlines from any other platform what I > suggest is using "\n>" as the split string, but open the file in > universal text mode - this seems to work fine on Python 2.3, but I'm not > sure when universal newline reading was introduced. It looks like it came in in 2.3: http://www.python.org/dev/peps/pep-0278/ That PEP also states that there's no universal text support for writing to file, so at least os.linesep is salvaged for that (when you're writing a platform-local file) ;) > For example, I created a simple file using the three newline conventions > (using the TextPad on Windows). > > >>> import sys > >>> sys.platform > 'win32' > >>> os.linesep > '\r\n' > > >>> open("c:/temp/windows.txt","r").read() > 'line\nline\n' > >>> open("c:/temp/mac.txt","r").read() > 'line\rline\r' > >>> open("c:/temp/unix.txt","r").read() > 'line\nline\n' > > (Notice that using "\n>" wouldn't work when reading a Mac style file on > Windows) > > >>> open("c:/temp/windows.txt","rU").read() > 'line\nline\n' > >>> open("c:/temp/mac.txt","rU").read() > 'line\nline\n' > >>> open("c:/temp/unix.txt","rU").read() > 'line\nline\n' That's nice. I'd never considered that. The overhead seems quite small, too: 1.94s SeqUtils/quick_FASTA_reader 4.49s SeqUtils/quick_FASTA_reader (import re) 2.07s SeqUtils/quick_FASTA_reader (universal newlines) 1.95s SeqUtils/quick_FASTA_reader 4.51s SeqUtils/quick_FASTA_reader (import re) 2.07s SeqUtils/quick_FASTA_reader (universal newlines) 1.87s SeqUtils/quick_FASTA_reader 4.53s SeqUtils/quick_FASTA_reader (import re) 2.07s SeqUtils/quick_FASTA_reader (universal newlines) 72904 records L. -- Dr Leighton Pritchard AMRSC D131, Plant-Pathogen Interactions, Scottish Crop Research Institute Invergowrie, Dundee, Scotland, DD2 5DA, UK T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578 E: lpritc@xxxxxxxxxxxxxxx W: http://bioinf.scri.sari.ac.uk/lp GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net (If the signature does not verify, please remove the SCRI disclaimer) signature.asc Description: This is a digitally signed message part --- End Message --- _______________________________________________ Biopython-dev mailing list Biopython-dev@xxxxxxxxxxxxxxxxxx http://lists.open-bio.org/mailman/listinfo/biopython-dev

Next Message by Date: click to view message preview

Fwd: contributing comparative genomics tools

Begin forwarded message: > From: Christopher Lee <leec@xxxxxxxxxxxxx> > Date: August 3, 2006 9:11:42 PM EDT > To: biopython-dev-owner@xxxxxxxxxxxxxxxxxx > Subject: Fwd: contributing comparative genomics tools > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > there appears to be an error in your code submission instructions > on the biopython.org/wiki, or in the configuration of the biopython- > dev list server. The code submission instructions tell me to > submit my proposal by email to biopython-dev@xxxxxxxxxxxxx, but the > list server responds by saying that all mail will automatically be > rejected! Please forward this proposal to the appropriate people > (presumably biopython-dev?), and let me know that you have done > so. Otherwise I won't have any way of knowing whether anyone even > reads this email address... > > Yours with thanks, > > Chris Lee, Dept. of Chemistry & Biochemistry, UCLA > > Begin forwarded message: > >> You are not allowed to post to this mailing list, and your message >> has >> been automatically rejected. If you think that your messages are >> being rejected in error, contact the mailing list owner at >> biopython-dev-owner@xxxxxxxxxxxxxxxxxxx >> >> >> From: Christopher Lee <leec@xxxxxxxxxxxxx> >> Date: August 3, 2006 3:55:52 PM PDT >> To: biopython-dev@xxxxxxxxxxxxx >> Cc: Namshin Kim <deepreds@xxxxxxxxx> >> Subject: contributing comparative genomics tools >> >> >> Hi Biopython developers, >> I'd like to contribute some Python tools that my lab has been >> developing for large-scale comparative genomics database query. >> These tools make it easy to work with huge multigenome alignment >> databases (e.g. the UCSC Genome Browser multigenome alignments) >> using a new disk-based interval indexing algorithm that gives very >> high performance with minimal memory usage. e.g. whereas queries >> of the UCSC 17genome alignment typically take about 30 sec. per >> query using MySQL, the same query takes about 200 microsec. per >> query, making it possible to run huge numbers of queries for >> genome-wide studies. >> >> Here's an example usage (click the URL or just look at the code >> below) >> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq- >> align.html#SECTION000125000000000000000 >> >> We've tested this code very extensively in our own research, and >> it has had four open source releases so far. At this point the >> code is in production use. All the code is compatible back to >> Python version 2.2, but not 2.1 or before (we use generators). >> There is C code (accessed as Python classes) for the high- >> performance interval database index. For details of history see >> the website >> http://www.bioinformatics.ucla.edu/pygr >> >> There is also extensive tutorial and reference documentation: >> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/ >> >> Let me know what questions you have, and what process we would >> need to follow to contribute this code. >> >> Yours with best wishes, >> >> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA >> >> >> ####### EXAMPLE USAGE >> from pygr import cnestedlist >> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE >> ALIGNMENT DB >> >> def printResults >> (prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None): >> 'get alignment of each genome to site, print %identity and % >> aligned' >> for src,dest,edge in msa[site].edges(mergeMost=True): # >> ALIGNMENT QUERY! >> print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \ >> %(altID,cluster_id,prefix,seqNames[dest], >> 100.*edge.pIdentity(),100.*edge.pAligned(),src[: >> 2],dest[:2]) >> >> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs): >> 'gene must be a slice of a sequence in our genome alignment msa' >> ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES >> ss2=gene[start2-2:start2] >> ss3=gene[stop:stop+2] >> e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES >> e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND >> stop >> zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd >> REQUESTS >> cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS >> TO MINIMIZE TRAFFIC >> for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3), >> ('e1',e1),('e2',e2)]: >> printResults(prefix,msa,site,seqNames=~ >> (msa.seqDict),**kwargs) >> >> # RUN A QUERY LIKE THIS... >> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop) >> >> ############ EXPLANATION & NOTES >> David Haussler's group has constructed alignments of multiple >> genomes. These alignments are extremely useful and interesting, >> but so large that it is cumbersome to work with the dataset using >> conventional methods. For example, for the 8-genome alignment you >> have to work simultaneously with the individual genome datasets >> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as >> well as the huge alignment itself. Pygr makes this quite easy. >> Here we illustrate an example of mapping an alternative 3' exon, >> which has two alternative splice sites (start1 and start2) and a >> single terminal splice site (stop). We use the alignment database >> to map each of these splice sites onto all the aligned genomes, >> and to print the percent-identity and percent-aligned for each >> genome, as well as the two nucleotides consituting the splice site >> itself. To examine the conservation of the two exonic regions >> (between start1 and start2, and the adjacent region terminated by >> stop, we print the same information for each genome's alignment to >> these two regions as well. The code first opens the alignment >> database. The function (getAlt3Conservation) obtains sequence >> slice objects representing the various ``sites'' to be queried. >> The actual alignment database query is performed in printResults: >> >> * The alignment database query is in the first line of >> printResults(). msa is the database; site is the interval query; >> and the edges methods iterates over the results, returning a tuple >> for each, consisting of a source sequence interval (i.e. an >> interval of site), a destination sequence interval (i.e. an >> interval in an aligned genome), and an edge object describing that >> alignment. We are taking advantage of Pygr's group-by operator >> mergeMost, which will cause multiple intervals in a given sequence >> to be merged into a single interval that constitutes their >> ``union''. Thus, for each aligned genome, the edges iterator will >> return a single aligned interval. The alignment edge object >> provides some useful conveniences, such as calculating the percent- >> identity between src and dest automatically for you. pIdentity() >> computes the fraction of identical residues; pAligned computes the >> fraction of aligned residues (allowing you to see if there are big >> gaps or insertions in the alignment of this interval). If we had >> wanted to inspect the detailed alignment letter by letter, we >> would just iterate over the letters attribute instead of the edges >> method. (See the NLMSASlice documentation for further information). >> >> * src[:2] and dest[:2] print the first two nucleotides of the >> site in gene and in the aligned genome. >> >> * it's worth noting that the actual sequence string >> comparisons are being done using a completely different database >> mechanism (formerly NCBI's fastacmd, now our own (much faster) >> pureseq text format), not the cnestedlist database. Basically, >> each genome is being queried as a separate BLAST formatted >> database, represented in Pygr by the BlastDB class. Pygr makes >> this complex set of multi-database operations more or less >> transparent to the user. For further information, see the BlastDB >> documentation. >> >> * The other operations here are entirely vanilla: mainly >> slicing a gene sequence to obtain the specific sites that we want >> to query. Note: gene must itself be a slice of a sequence in our >> alignment, or the alignment query msa[site] will raise an >> IndexError informing the user that the sequence site is not in the >> alignment. >> >> * The only slightly interesting operation here is the use of >> interval addition to obtain the ``union'' of two intervals, e.g. >> e1=ss1+ss2. This obtains a single interval that contains both of >> the input intervals. >> >> * When the print statement requests str() representations of >> these sequence objects, Pygr uses fastacmd -L to extract just the >> right piece of the corresponding chromosomes from the eight BLAST >> databases. >> >> (Actually, because of Pygr's caching / optimizations, considerably >> more is going on than indicated in this simplified sketch. But you >> get the idea: Pygr makes it relatively effortless to work with a >> variety of disparate (and large) resources in an integrated way.) >> >> Here is some example output: >> >> 1 Mm.99996 ss1 hg17 50.0 100.0 AG GG >> 1 Mm.99996 ss1 canFam1 50.0 100.0 AG GG >> 1 Mm.99996 ss1 panTro1 50.0 100.0 AG GG >> 1 Mm.99996 ss1 rn3 100.0 100.0 AG AG >> 1 Mm.99996 ss2 hg17 100.0 100.0 AG AG >> 1 Mm.99996 ss2 canFam1 100.0 100.0 AG AG >> 1 Mm.99996 ss2 panTro1 100.0 100.0 AG AG >> 1 Mm.99996 ss2 rn3 100.0 100.0 AG AG >> 1 Mm.99996 ss3 hg17 100.0 100.0 GT GT >> 1 Mm.99996 ss3 canFam1 100.0 100.0 GT GT >> 1 Mm.99996 ss3 panTro1 100.0 100.0 GT GT >> 1 Mm.99996 ss3 rn3 100.0 100.0 GT GT >> 1 Mm.99996 e1 hg17 78.9 100.0 AG GG >> 1 Mm.99996 e1 canFam1 84.2 100.0 AG GG >> 1 Mm.99996 e1 panTro1 77.6 100.0 AG GG >> 1 Mm.99996 e1 rn3 97.4 98.7 AG AG >> 1 Mm.99996 e2 hg17 91.6 99.1 CC CC >> 1 Mm.99996 e2 canFam1 88.8 99.1 CC CC >> 1 Mm.99996 e2 panTro1 91.6 99.1 CC CC >> 1 Mm.99996 e2 rn3 97.2 100.0 CC CC >> >> >> >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2.2 (Darwin) >> >> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i >> xhxE2zrffLaiCjSiEp4Eo6k= >> =BeOe >> -----END PGP SIGNATURE----- >> >> > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (Darwin) > > iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm > Wx/4AIstFjb+rbqY2QBppLg= > =fghY > -----END PGP SIGNATURE-----

Previous Message by Thread: click to view message preview

Re: Reading sequences: FormatIO, SeqIO, etc

> Question One > ============ > Is reading sequence files an important function to you, and if so which > file formats in particular (e.g. Fasta, GenBank, ...) Yes. FASTA. > Question Two - Reading Fasta Files > ================================== > Which of the following do you currently use (and why)?: > > (f) Other (Could you tell us more?) I have written my own short iterator so that my code is portable without requiring Biopython to be installed. > Question Three - index_file based dictionaries > ============================================== > Do you use any of the following: No. > Question Four - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. Yes. > Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== > If you use Fasta files, do you want get records returned as FastaRecords > or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? SeqRecords. I hate it when an interface tries to parse the definition line for me. Perhaps a set of standard definition line parsers should be provided so that one can choose, but usually I would rather have plain text and parse it myself. > Question Six - Martel, Scanners and Consumers > ============================================== > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? No. -- Michael Hoffman <hoffman@xxxxxxxxx> European Bioinformatics Institute

Next Message by Thread: click to view message preview

Reading sequences: FormatIO, SeqIO, etc

I've having a few issues with my email setup which is why I haven't replied recently. A week ago I filed bug 2059 for this discussion, and attached some code: http://bugzilla.open-bio.org/show_bug.cgi?id=2059 I'm interested in your feedback - from the framework down to if you don't like the class names for example. Peter
Sign up for updates to this mailing list. email:
Loading Comments...
Home | News | Patents | Sitemap | FAQ | advertise

Advertising by