Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing.
Installation
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
name: bio-read-sequences
description: Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) using Biopython Bio.SeqIO. Use when parsing sequence files, iterating multi-sequence files, random access to large files, or high-performance parsing.
tool_type: python
primary_tool: Bio.SeqIO
Version Compatibility
Reference examples tested with: BioPython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show biopython` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
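The version check and introspection step can be scripted. The following is a minimal sketch using only the standard library; the helper names `installed_biopython_version` and `describe_callable` are hypothetical and not part of Biopython.

```python
import inspect
from importlib import metadata


def installed_biopython_version():
    """Return the installed Biopython version string, or None if it is missing."""
    try:
        return metadata.version("biopython")
    except metadata.PackageNotFoundError:
        return None


def describe_callable(func):
    """Return 'name(signature)' so an example can be compared with the real API."""
    return f"{func.__name__}{inspect.signature(func)}"


version = installed_biopython_version()
if version is None:
    print("Biopython is not installed; run: pip install biopython")
else:
    from Bio import SeqIO  # imported only once we know the package exists

    print("Biopython", version)
    print(describe_callable(SeqIO.parse))
```

Comparing the printed signature with the call in an example is usually quicker than retrying a failing snippet.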
Read Sequences
Read biological sequence data from files using Biopython's Bio.SeqIO module.
"Read sequences from a file" → Parse the file into a collection of SeqRecord objects with accessible IDs, sequences, and annotations.
- Python: `SeqIO.parse()` or `SeqIO.read()` (BioPython)
- R: `readDNAStringSet()` or `readAAStringSet()` (Biostrings)
Required Import
Core import
from Bio import SeqIO
Core Functions
SeqIO.parse() - Multiple Records
Use for files with one or more sequences. Returns an iterator of SeqRecord objects.
for record in SeqIO.parse('sequences.fasta', 'fasta'):
    print(record.id, len(record.seq))
Important: Always specify the format explicitly as the second argument.
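Because `SeqIO.parse()` yields records lazily, it composes well with generators. The sketch below filters records by length without holding the whole file in memory; `iter_long_records` is a hypothetical helper, and the FASTA text is made up and passed through `StringIO` so the example runs without a file on disk.

```python
from io import StringIO

from Bio import SeqIO


def iter_long_records(handle, fmt, min_length):
    """Yield records with at least min_length letters, one at a time."""
    for record in SeqIO.parse(handle, fmt):
        if len(record.seq) >= min_length:
            yield record


# Made-up FASTA text standing in for a file on disk.
sample = StringIO(">seq1 long example\nACGTACGTAC\n>seq2 short example\nACG\n")
long_ids = [record.id for record in iter_long_records(sample, "fasta", 5)]
print(long_ids)
```

The same function accepts a filename in place of the handle, since `SeqIO.parse()` takes either.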
SeqIO.read() - Single Record
Use when the file contains exactly one record. Raises ValueError if the file holds zero records or more than one.
record = SeqIO.read('single.fasta', 'fasta')
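When it is not certain that a file holds exactly one record, the `ValueError` can be caught instead of letting it propagate. This is a sketch under that assumption; `read_single_or_none` is a hypothetical wrapper and the inputs are made-up in-memory FASTA strings.

```python
from io import StringIO

from Bio import SeqIO


def read_single_or_none(handle, fmt):
    """Return the only record in handle, or None when the count is not exactly one."""
    try:
        return SeqIO.read(handle, fmt)
    except ValueError as err:
        # SeqIO.read() raises ValueError for zero records and for more than one.
        print(f"Expected exactly one record: {err}")
        return None


one = read_single_or_none(StringIO(">only\nACGT\n"), "fasta")
two = read_single_or_none(StringIO(">a\nAC\n>b\nGT\n"), "fasta")
empty = read_single_or_none(StringIO(""), "fasta")
```

Note that an unknown format string also raises `ValueError`, so the wrapper cannot tell the two situations apart without inspecting the message.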
SeqIO.to_dict() - Load All Into Memory
Use for random access by record ID. Loads entire file into memory.
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))
seq = records['sequence_id'].seq
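Two details of `SeqIO.to_dict()` are worth seeing in action: the optional `key_function` argument, which receives each SeqRecord, and the `ValueError` raised when two records map to the same key. The sketch below uses made-up identifiers and in-memory FASTA text.

```python
from io import StringIO

from Bio import SeqIO

# Made-up records with versioned identifiers.
fasta_text = ">NM_1.1 first\nACGT\n>NM_2.1 second\nGGCC\n"

# Default behaviour: keys are record.id.
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

# key_function receives the SeqRecord and returns the dictionary key.
by_accession = SeqIO.to_dict(
    SeqIO.parse(StringIO(fasta_text), "fasta"),
    key_function=lambda record: record.id.split(".")[0],
)

# Duplicate keys are rejected rather than silently overwritten.
duplicated = ">dup\nAC\n>dup\nGT\n"
try:
    SeqIO.to_dict(SeqIO.parse(StringIO(duplicated), "fasta"))
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)

print(sorted(by_id), sorted(by_accession), duplicate_error)
```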
SeqIO.index() - Large File Random Access
Use for large files when random access is needed without loading everything into memory.
records = SeqIO.index('large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()
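Forgetting `close()` leaves the underlying file handle open. One way to guarantee cleanup is `contextlib.closing`, as in the sketch below. It writes a tiny made-up FASTA file to a temporary directory because `SeqIO.index()` needs a filename rather than a handle, and it also shows `get_raw()`, which returns the unparsed bytes of a record.

```python
import os
import tempfile
from contextlib import closing

from Bio import SeqIO

fasta_text = ">seq1\nACGT\n>seq2\nGGCCGG\n"  # made-up sample data

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "large.fasta")
    with open(path, "w") as out:
        out.write(fasta_text)

    # closing() calls records.close() even if an exception is raised inside.
    with closing(SeqIO.index(path, "fasta")) as records:
        indexed_ids = sorted(records)            # iterating yields the keys
        seq2_length = len(records["seq2"].seq)   # record is parsed on demand
        raw_seq1 = records.get_raw("seq1")       # raw bytes from the file

print(indexed_ids, seq2_length, raw_seq1)
```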
SeqIO.index_db() - SQLite-Backed Indexing
Use for very large files or multiple files. Creates persistent SQLite index.
# Create index (first time - parses file)
records = SeqIO.index_db('index.sqlite', 'large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()
# Reuse existing index (instant load)
records = SeqIO.index_db('index.sqlite')
# Index multiple files together
records = SeqIO.index_db('combined.sqlite', ['file1.fasta', 'file2.fasta'], 'fasta')
Advantages over index():
- Persistent index survives program restarts
- Can index multiple files as one database
- Lower memory for extremely large files
- SQLite file can be shared across processes
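The build-once, reload-later pattern can be tried end to end on throwaway data. The following sketch indexes two made-up FASTA files in a temporary directory, closes the index, and reopens it by naming only the SQLite file; the file names mirror those used above and are otherwise arbitrary.

```python
import os
import tempfile

from Bio import SeqIO

samples = [("file1.fasta", ">a\nACGT\n"), ("file2.fasta", ">b\nGGCCGG\n")]

with tempfile.TemporaryDirectory() as tmpdir:
    fasta_paths = []
    for name, text in samples:
        path = os.path.join(tmpdir, name)
        with open(path, "w") as out:
            out.write(text)
        fasta_paths.append(path)

    index_path = os.path.join(tmpdir, "combined.sqlite")

    # First call: scans both files and writes the SQLite index.
    records = SeqIO.index_db(index_path, fasta_paths, "fasta")
    built_count = len(records)
    records.close()

    # Second call: only the index file is named; the FASTA files are not re-scanned.
    records = SeqIO.index_db(index_path)
    reloaded_ids = sorted(records)
    b_length = len(records["b"].seq)
    records.close()

print(built_count, reloaded_ids, b_length)
```

The index stores where each record lives, not the sequences themselves, so the original sequence files must remain available alongside it.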
High-Performance Parsing
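For maximum throughput on large files, use the low-level parsers, which skip SeqRecord construction (cited here as 3-6x faster than SeqIO.parse; the actual speedup depends on the file and workload):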
SimpleFastaParser
Goal: Parse large FASTA files at maximum speed without SeqRecord overhead.
Approach: Use low-level tuple-based parser returning (title, sequence) strings.
Reference (BioPython 1.83+):
from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('large.fasta') as handle:
    for title, sequence in SimpleFastaParser(handle):
        if len(sequence) > 1000:
            print(title.split()[0])  # First word is usually ID
Returns (title, sequence) tuples as strings (no SeqRecord overhead).
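Since the parser hands back plain strings, ordinary string methods are enough for many per-sequence statistics. As an illustration, the sketch below computes GC fraction per record; `gc_fraction_by_id` is a hypothetical helper and the input is made-up FASTA text in a `StringIO` handle.

```python
from io import StringIO

from Bio.SeqIO.FastaIO import SimpleFastaParser


def gc_fraction_by_id(handle):
    """Map each record's first title word to its GC fraction (0.0 for empty sequences)."""
    result = {}
    for title, sequence in SimpleFastaParser(handle):
        seq_id = title.split()[0]  # first word of the title line
        upper = sequence.upper()
        gc = upper.count("G") + upper.count("C")
        result[seq_id] = gc / len(sequence) if sequence else 0.0
    return result


# The first record is wrapped over two lines; the parser joins them.
sample = StringIO(">seq1 wrapped example\nACGT\nGGCC\n>seq2\nATAT\n")
gc_by_id = gc_fraction_by_id(sample)
print(gc_by_id)
```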
FastqGeneralIterator
Goal: Parse large FASTQ files at maximum speed.
Approach: Use low-level tuple-based parser returning (title, sequence, quality_string) strings.
Reference (BioPython 1.83+):
from Bio.SeqIO.QualityIO import FastqGeneralIterator
with open('reads.fastq') as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        # Mean Phred score, assuming Phred+33 (Sanger / Illumina 1.8+) encoding
        avg_qual = sum(ord(c) - 33 for c in quality) / len(quality)
Returns (title, sequence, quality_string) tuples.
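The quality string is raw ASCII, so decoding it is the caller's job. The arithmetic `ord(c) - 33` assumes Phred+33 encoding; older Illumina 1.3-1.7 files use an offset of 64. The sketch below wraps that arithmetic in a hypothetical helper, `mean_phred`, that also guards against empty reads. It uses no Biopython calls, so it can be applied directly to the third element of each tuple.

```python
def mean_phred(quality, offset=33):
    """Return the mean Phred score of an ASCII quality string.

    offset=33 is the Sanger / Illumina 1.8+ convention; pass offset=64
    for Illumina 1.3-1.7 files. Empty strings give 0.0 instead of
    raising ZeroDivisionError.
    """
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


# 'I' is ASCII 73 and '5' is ASCII 53, so Phred+33 scores are 40 and 20.
print(mean_phred("IIII"))
print(mean_phred("I5"))
```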
Common Formats
| Format | String | Typical Extension | Notes |
|---|---|---|---|
| FASTA | 'fasta' | .fasta, .fa, .fna, .faa | Most common |
| FASTA 2-line | 'fasta-2line' | .fasta | One line per sequence (no wrapping) |
| FASTQ | 'fastq' | .fastq, .fq | With quality scores |
| FASTQ Solexa | 'fastq-solexa' | .fastq | Old Solexa/Illumina (pre-1.3) |
| FASTQ Illumina | 'fastq-illumina' | .fastq | Illumina 1.3-1.7 |
| GenBank | 'genbank' or 'gb' | .gb, .gbk | With features/annotations |
| EMBL | 'embl' | .embl | European format with features |
| Swiss-Prot | 'swiss' | .dat | UniProt format |
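Since the format must always be passed explicitly, scripts that accept arbitrary input paths often need a lookup from extension to format string. The sketch below is one hypothetical way to do that with the extensions from the table; `guess_format` and `EXTENSION_TO_FORMAT` are illustrative names, not Biopython features. Ambiguous cases are left out on purpose: `.dat` is used by many tools, and the FASTQ quality variants cannot be told apart by extension.

```python
from pathlib import Path

# Extension-to-format lookup built from the table above (illustrative only).
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fna": "fasta",
    ".faa": "fasta",
    ".fastq": "fastq",
    ".fq": "fastq",
    ".gb": "genbank",
    ".gbk": "genbank",
    ".embl": "embl",
}


def guess_format(filename):
    """Return a SeqIO format string for filename, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer format from extension {suffix!r}") from None


print(guess_format("reads/sample_01.FQ"))
print(guess_format("plasmid.gbk"))
```

The result would then be passed as the second argument, for example `SeqIO.parse(path, guess_format(path))`.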
Specialized Formats
| Format | String | Use Case |
|---|---|---|
| ABI | 'abi' | Sanger sequencing trace files (.ab1) |
| ABI Trimmed | 'abi-trim' | ABI with low-quality ends trimmed |
| SFF | 'sff' | 454/Ion Torrent flowgram data |
| SFF Trimmed | 'sff-trim' | SFF with adapter/quality trimming |
| QUAL | 'qual' | Quality scores file (pairs with FASTA) |
| PHD | 'phd' | Phred/Phrap/Consed output |
| ACE | 'ace' | Assembly format (Consed) |
| PDB SEQRES | 'pdb-seqres' | Protein sequences from PDB files |
| PDB ATOM | 'pdb-atom' | Sequences from ATOM records in PDB |
| SnapGene | 'snapgene' | SnapGene .dna files |
| GCK | 'gck' | Gene Construction Kit files |
| XDNA | 'xdna' | DNA Strider / SerialCloner files |
Reading ABI Trace Files
# Read Sanger sequencing trace with quality
record = SeqIO.read('sample.ab1', 'abi')
print(f'Sequence: {record.seq}')
qualities = record.letter_annotations['phred_quality']
# Auto-trim low quality ends
record_trimmed = SeqIO.read('sample.ab1', 'abi-trim')
Reading 454/Ion Torrent SFF
for record in SeqIO.parse('reads.sff', 'sff'):
    print(record.id, len(record.seq))
# With trimming applied
for record in SeqIO.parse('reads.sff', 'sff-trim'):
    print(record.id, len(record.seq))
Reading PDB Sequences
# Get sequences from SEQRES records
for record in SeqIO.parse('structure.pdb', 'pdb-seqres'):
    print(f'Chain {record.id}: {record.seq}')
# Get sequences from ATOM coordinates
for record in SeqIO.parse('structure.pdb', 'pdb-atom'):
    print(f'Chain {record.id}: {record.seq}')
Alignment Formats (Read-Only)
| Format | String | Notes |
|---|---|---|
| PHYLIP | 'phylip' | Interleaved phylip |
| PHYLIP Sequential | 'phylip-sequential' | Sequential phylip |
| PHYLIP Relaxed | 'phylip-relaxed' | Longer names allowed |
| Clustal | 'clustal' | ClustalW output |
| Stockholm | 'stockholm' | Rfam/Pfam alignments |
| NEXUS | 'nexus' | PAUP/MrBayes format |
| MAF | 'maf' | Multiple Alignment Format |
SeqRecord Object Attributes
After parsing, each record has these key attributes:
record.id # Sequence identifier (string)
record.name # Sequence name (string)
record.description # Full description line (string)
record.seq # Sequence data (Seq object)
record.features # List of SeqFeature objects (GenBank/EMBL)
record.annotations # Dictionary of annotations
record.letter_annotations # Per-letter annotations (quality scores)
record.dbxrefs # Database cross-references
Code Patterns
Collect All Sequences Into a List
records = list(SeqIO.parse('sequences.fasta', 'fasta'))
Count Records Without Loading All
count = sum(1 for _ in SeqIO.parse('sequences.fasta', 'fasta'))
Fast Count (FASTA only)
from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as f:
    count = sum(1 for _ in SimpleFastaParser(f))
Get Sequence IDs Only
ids = [record.id for record in SeqIO.parse('sequences.fasta', 'fasta')]
Read GenBank with Features
for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.qualifiers.get('product', ['Unknown'])[0])
            cds_seq = feature.extract(record.seq)  # Get feature sequence
Access FASTQ Quality Scores
for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    avg_quality = sum(qualities) / len(qualities)
Read From File Handle
with open('sequences.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
Custom ID Function for Indexing
def get_accession(identifier):
    return identifier.split('.')[0]  # Remove version

records = SeqIO.index('sequences.fasta', 'fasta', key_function=get_accession)
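A key function that strips information can map two different identifiers to the same key, and `SeqIO.index()` rejects duplicate keys with a `ValueError`. The sketch below checks a list of identifiers for such collisions before indexing; `colliding_keys` is a hypothetical helper and the accessions are made up.

```python
from collections import Counter


def get_accession(identifier):
    """Drop the version suffix, e.g. 'NM_000001.2' -> 'NM_000001'."""
    return identifier.split(".")[0]


def colliding_keys(identifiers, key_function):
    """Return the sorted keys produced by more than one identifier."""
    counts = Counter(key_function(identifier) for identifier in identifiers)
    return sorted(key for key, count in counts.items() if count > 1)


# Two versions of the same made-up accession collapse to one key.
ids = ["NM_000001.1", "NM_000001.2", "NR_000002.1"]
print(colliding_keys(ids, get_accession))
```

The identifiers themselves can be collected cheaply with the "Get Sequence IDs Only" pattern above.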
Common Errors
| Error | Cause | Solution |
|---|---|---|
| ValueError: More than one record | Used read() on multi-record file | Use parse() instead |
| ValueError: No records found | Used read() on empty file | Check file exists and has content |
| ValueError: unknown format | Typo in format string | Check format string spelling |
| UnicodeDecodeError | Binary file or wrong encoding | Open with encoding='latin-1' or check file |
| sqlite3.OperationalError | index_db file locked | Close other connections first |
Decision Tree
Need to read sequences?
├── Single record in file?
│   └── Use SeqIO.read()
├── Multiple records?
│   ├── Need all in memory at once?
│   │   └── Use list(SeqIO.parse()) or SeqIO.to_dict()
│   ├── Process one at a time (memory efficient)?
│   │   └── Use SeqIO.parse() iterator
│   ├── Large file, need random access by ID?
│   │   ├── Single session? → Use SeqIO.index()
│   │   └── Persistent/multi-file? → Use SeqIO.index_db()
│   └── Maximum throughput needed?
│       └── Use SimpleFastaParser or FastqGeneralIterator
├── Sanger sequencing trace?
│   └── Use 'abi' or 'abi-trim' format
├── 454/Ion Torrent data?
│   └── Use 'sff' or 'sff-trim' format
└── Protein from structure?
    └── Use 'pdb-seqres' or 'pdb-atom' format
Related Skills
- write-sequences - Write parsed sequences to new files
- filter-sequences - Filter sequences by criteria after reading
- format-conversion - Convert between formats
- compressed-files - Read gzip/bzip2/BGZF compressed sequence files
- sequence-manipulation/seq-objects - Work with parsed SeqRecord objects
- database-access - Fetch sequences from NCBI instead of local files
- alignment-files - For SAM/BAM/CRAM alignment files, use samtools/pysam
