Fast COI Barcode Classification with Vector Search
Using FAISS and embedding-based search for rapid species identification from DNA sequences.
Abstract
DNA barcoding with the cytochrome c oxidase subunit I (COI) gene has become a standard method for species identification. However, alignment-based approaches such as BLAST can be computationally expensive when querying large reference databases. This paper presents an approach that uses FAISS vector search for rapid COI barcode classification.
By encoding DNA sequences as fixed-length vectors and leveraging approximate nearest neighbor search, we achieve sub-second classification times while maintaining high accuracy. Our implementation includes hierarchical confidence scoring that provides taxonomic predictions at multiple levels.
Benchmarks against BOLD Systems show comparable accuracy at substantially lower query times, which makes the approach suitable for real-time species identification in field applications and for high-throughput sequencing pipelines.
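To make the pipeline summarized above concrete, the following is a minimal sketch of the three steps: fixed-length encoding, nearest neighbor search, and per-rank confidence. It is an illustration under stated assumptions, not the implementation evaluated here. The abstract does not specify the encoding, so normalized 3-mer frequencies are assumed; the search is an exact brute-force inner product that stands in for a FAISS index (for approximate search, something like `faiss.IndexHNSWFlat` would take its place); and confidence is assumed to be the fraction of the top-k neighbors that agree at each rank. The names `encode`, `search`, `classify`, and `REFERENCES`, as well as the toy sequences and taxon labels, are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 3  # k-mer length; an assumption, the encoding is not specified in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}
RANKS = ("family", "genus", "species")


def encode(seq):
    """Map a DNA sequence to an L2-normalized k-mer count vector of length 4**K."""
    seq = seq.upper()
    vec = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])  # k-mers with N or gap characters are skipped
        if idx is not None:
            vec[idx] += 1.0
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query_vec, ref_vecs, top_k):
    """Exact inner-product search; a stand-in for an approximate FAISS index."""
    scores = [
        (sum(q * r for q, r in zip(query_vec, ref)), i)
        for i, ref in enumerate(ref_vecs)
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return scores[:top_k]


def classify(query_seq, references, top_k=3):
    """Return {rank: (name, confidence)} from votes among the top_k neighbors.

    references is a list of (sequence, {rank: name}) pairs.
    """
    ref_vecs = [encode(seq) for seq, _ in references]
    hits = search(encode(query_seq), ref_vecs, top_k)
    result = {}
    for rank in RANKS:
        votes = Counter(references[i][1][rank] for _, i in hits)
        name, count = votes.most_common(1)[0]
        result[rank] = (name, count / len(hits))
    return result


# Synthetic toy references (not real barcodes), for illustration only.
REFERENCES = [
    ("ACGTACGTACGTACGTACGT", {"family": "FamilyA", "genus": "GenusA", "species": "sp1"}),
    ("ACGTACGTACGTACGTACGA", {"family": "FamilyA", "genus": "GenusA", "species": "sp1"}),
    ("ACGTACGTTCGTACGTACGT", {"family": "FamilyA", "genus": "GenusA", "species": "sp2"}),
    ("GGGGCCCCGGGGCCCCGGGG", {"family": "FamilyB", "genus": "GenusB", "species": "sp3"}),
    ("TTTTAAAATTTTAAAATTTT", {"family": "FamilyC", "genus": "GenusC", "species": "sp4"}),
]

result = classify("ACGTACGTACGTACGTACGT", REFERENCES, top_k=3)
for rank in RANKS:
    print(rank, result[rank])
```

With these toy references, all three nearest neighbors share a family and genus while only two of three share a species, so the reported confidence is lower at the species level than at the coarser ranks. This is the behavior a hierarchical score is meant to expose: a query can be assigned to a genus with high confidence even when the species call is uncertain.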