plastid.readers.bed module¶
This module contains BED_Reader, an iterator that reads each line of a BED or extended BED file into a SegmentChain, Transcript, or similar object.
Module contents¶
- BED_Reader : Reads BED and extended BED files line-by-line into SegmentChains or Transcripts.
- bed_x_formats : Column names and types for various extended BED formats used by the ENCODE project.
Examples¶
Read entries in a BED file as Transcripts. thickEnd and thickStart columns will be interpreted as the endpoints of coding regions:
>>> bed_reader = BED_Reader("some_file.bed",return_type=Transcript)
>>> for transcript in bed_reader:
...     pass # do something fun with each Transcript/SegmentChain
If return_type is unspecified, BED lines are read as SegmentChains:
>>> my_chains = list(BED_Reader("some_file.bed"))
>>> my_chains[:5]
[list of segment chains as output...]
Open an extended BED file, which contains additional columns for gene_id and favorite_color. Values for these attributes will be stored in the attr dict of each Transcript:
>>> bed_reader = BED_Reader("some_file.bed",return_type=Transcript,extra_columns=["gene_id","favorite_color"])
Open several Tabix-compressed BED files, and iterate over them as if they were one stream:
>>> bed_reader = BED_Reader("file1.bed.gz","file2.bed.gz",tabix=True)
>>> for chain in bed_reader:
...     pass # do something interesting with each chain
See Also¶
- UCSC file format FAQ
- BED format specification at UCSC
- class plastid.readers.bed.BED_Reader(*streams, return_type=SegmentChain, add_three_for_stop=False, extra_columns=0, printer=None, tabix=False)[source]¶
Bases: plastid.readers.common.AssembledFeatureReader
Reads BED and extended BED files line-by-line into SegmentChains or Transcripts. Metadata, if present in a track declaration, is saved in self.metadata. Malformed lines are stored in self.rejected, while parsing continues.
- Parameters
- *streams : file-like
One or more open filehandles of input data.
- return_type : SegmentChain or subclass, optional
Type of feature to return from assembled subfeatures (Default: SegmentChain)
- add_three_for_stop : bool, optional
Some annotation files exclude the stop codon from CDS annotations. If set to True, three nucleotides will be added to the three-prime end of each CDS annotation, unless the annotated transcript contains an explicit stop_codon feature. (Default: False)
- extra_columns : int or list, optional
Extra, non-BED columns in an extended BED format file, corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
- an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, …, customN.
- a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.
- a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
(Default: 0)
- printer : file-like, optional
Logger implementing a write() method. (Default: NullWriter)
- tabix : boolean, optional
streams point to Tabix-compressed files, or are open tabix_file_iterator objects. (Default: False)
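The three accepted forms of extra_columns described above can be illustrated with a plain-Python sketch. This is only an illustration of the documented behavior, using a hypothetical helper; it is not plastid's actual implementation:

```python
# Sketch of how the three accepted forms of `extra_columns` map trailing
# columns of a BED line to an `attr` dict (hypothetical helper; plastid's
# internal parsing may differ).

def parse_extra_columns(columns, extra_columns):
    attr = {}
    if isinstance(extra_columns, int):
        # int: N unnamed columns become custom0 ... customN-1
        for i, value in enumerate(columns[:extra_columns]):
            attr["custom%d" % i] = value
    elif extra_columns and isinstance(extra_columns[0], tuple):
        # list of (name, formatter_func): values are converted on the fly
        for (name, func), value in zip(extra_columns, columns):
            attr[name] = func(value)
    else:
        # list of str: values stored under the given names, left to right
        for name, value in zip(extra_columns, columns):
            attr[name] = value
    return attr

cols = ["my_gene", "0.05"]
parse_extra_columns(cols, 2)
# -> {'custom0': 'my_gene', 'custom1': '0.05'}
parse_extra_columns(cols, ["gene_id", "pValue"])
# -> {'gene_id': 'my_gene', 'pValue': '0.05'}
parse_extra_columns(cols, [("gene_id", str), ("pValue", float)])
# -> {'gene_id': 'my_gene', 'pValue': 0.05}
```

Note that only the (name, formatter_func) form converts values out of their raw string representation.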
Examples
Read entries in a BED file as Transcripts. thickEnd and thickStart columns will be interpreted as the endpoints of coding regions:
>>> bed_reader = BED_Reader(open("some_file.bed"),return_type=Transcript)
>>> for transcript in bed_reader:
...     pass # do something fun
Open an extended BED file that contains additional columns for gene_id and favorite_color. Values for these attributes will be stored in the attr dict of each Transcript:
>>> bed_reader = BED_Reader(open("some_file.bed"),return_type=Transcript,extra_columns=["gene_id","favorite_color"])
Open several Tabix-compressed BED files, and iterate over them as if they were one uncompressed stream:
>>> bed_reader = BED_Reader("file1.bed.gz","file2.bed.gz",tabix=True)
>>> for chain in bed_reader:
...     pass # do something more interesting
- Attributes
- streams : file-like
One or more open streams (usually filehandles) of input data.
- return_type : class
The type of object assembled by the reader. Typically a SegmentChain or a subclass thereof. Must implement a method called from_bed().
- counter : int
Cumulative line number counter over all streams
- rejected : list
List of BED lines that could not be parsed
- metadata : dict
Attributes declared in track line, if any
- extra_columns : int or list, optional
Extra, non-BED columns in an extended BED format file, corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
- an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, …, customN.
- a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.
- a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
If unspecified, BED_Reader reads the track declaration line (if present), and:
- if a known track type is specified by the type field, it attempts to format the extra columns as specified by that type. Known track types presently include: bedDetail, narrowPeak, broadPeak, gappedPeak, tagAlign, pairedTagAlign, peptideMapping
- if not, it assumes 0 non-BED fields are present, and that all columns are BED formatted.
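The track-type detection described above can be sketched in plain Python. The helper below is hypothetical and only illustrates the idea of pulling a known type out of a track declaration line; plastid's own parsing may differ:

```python
import shlex

# Known extended-BED track types, per the list above
KNOWN_TYPES = {"bedDetail", "narrowPeak", "broadPeak", "gappedPeak",
               "tagAlign", "pairedTagAlign", "peptideMapping"}

def track_type(line):
    """Return the declared track type if known, else None (hypothetical helper)."""
    if not line.startswith("track"):
        return None
    # track lines are space-separated key=value pairs; values may be quoted,
    # so shlex.split() is used to respect the quoting
    fields = dict(tok.split("=", 1) for tok in shlex.split(line)[1:] if "=" in tok)
    declared = fields.get("type")
    return declared if declared in KNOWN_TYPES else None

track_type('track type=narrowPeak name="my peaks"')  # -> 'narrowPeak'
track_type('track name="plain bed"')                 # -> None
```

A feature line (one that does not begin with "track") yields None, falling back to the all-columns-are-BED assumption.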
Methods
- close() : Close stream
- fileno() : Returns underlying file descriptor if one exists.
- filter(data) : Return next assembled feature from self.stream
- flush(/) : Flush write buffers, if applicable.
- isatty() : Return whether this is an 'interactive' stream.
- read() : Similar to file.read().
- readable() : Return whether object was opened for reading.
- readline() : Process a single line of data, assuming it is string-like. next(self) is more likely to behave as expected.
- readlines() : Similar to file.readlines().
- seek() : Change stream position.
- seekable() : Return whether object supports random access.
- tell(/) : Return current stream position.
- truncate() : Truncate file to size bytes.
- writable() : Return whether object was opened for writing.
- writelines(lines, /) : Write a list of lines to stream.
- next()
- close()¶
Close stream
- fileno()¶
Returns underlying file descriptor if one exists.
OSError is raised if the IO object does not use a file descriptor.
- filter(data)¶
Return next assembled feature from self.stream
- Returns
- SegmentChain or subclass
Next feature assembled from self.streams, type specified by self.return_type
- flush(/)¶
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- isatty()¶
Return whether this is an ‘interactive’ stream.
Return False if it can’t be determined.
- next()¶
- read()¶
Similar to file.read(). Process all units of data, assuming it is string-like.
- Returns
- str
- readable()¶
Return whether object was opened for reading.
If False, read() will raise OSError.
- readline()¶
Process a single line of data, assuming it is string-like. next(self) is more likely to behave as expected.
- Returns
- object
a unit of processed data
- readlines()¶
Similar to file.readlines().
- Returns
- list
processed data
- seek()¶
Change stream position.
Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:
0 – start of stream (the default); offset should be zero or positive
1 – current stream position; offset may be negative
2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable()¶
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- tell(/)¶
Return current stream position.
- truncate()¶
Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.
- writable()¶
Return whether object was opened for writing.
If False, write() will raise OSError.
- writelines(lines, /)¶
Write a list of lines to stream.
Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
- closed¶
- plastid.readers.bed.bed_x_formats = {'bedDetail': [('ID', <class 'str'>), ('description', <class 'str'>)], 'broadPeak': [('signalValue', <class 'float'>), ('pValue', <class 'float'>), ('qValue', <class 'float'>)], 'gappedPeak': [('signalValue', <class 'float'>), ('pValue', <class 'float'>), ('qValue', <class 'float'>)], 'narrowPeak': [('signalValue', <class 'float'>), ('pValue', <class 'float'>), ('qValue', <class 'float'>), ('peak', <class 'int'>)], 'pairedTagAlign': [('seq1', <class 'str'>), ('seq2', <class 'str'>)], 'peptideMapping': [('rawScore', <class 'float'>), ('spectrumId', <class 'str'>), ('peptideRank', <class 'int'>), ('peptideRepeatCount', <class 'int'>)], 'tagAlign': [('sequence', <class 'str'>), ('score', <class 'float'>), ('strand', <class 'str'>)]}¶
Column names and types for various extended BED formats used by the ENCODE project. These can be passed to the extra_columns keyword of BED_Reader.
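In practice one would pass an entry of bed_x_formats (e.g. bed_x_formats["narrowPeak"]) as the extra_columns argument. The conversion that its (name, formatter) pairs imply can be sketched in plain Python; the column values below are made-up example data, not real ENCODE output:

```python
# The narrowPeak entry of bed_x_formats, as listed above
narrowPeak_formats = [("signalValue", float), ("pValue", float),
                      ("qValue", float), ("peak", int)]

# Hypothetical trailing columns of one narrowPeak record (after the BED fields)
extra_cols = ["6.8", "12.1", "9.3", "75"]

# Each (name, formatter) pair converts its raw string column to a typed value
attr = {name: func(value)
        for (name, func), value in zip(narrowPeak_formats, extra_cols)}
# attr == {'signalValue': 6.8, 'pValue': 12.1, 'qValue': 9.3, 'peak': 75}
```

Each value thus lands in the attr dict already typed (floats for scores and p-values, an int for the peak offset) rather than as a raw string.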