Sujai Kumar on
February 20th 2017
Version 4 of Lepbase, the hub for Lepidopteran genomes, not only includes several new genome assemblies and annotations, but also showcases a re-engineered infrastructure that will enable other groups to easily set up their own Ensembl-based genome hubs in the future.
New genomes with gene models:
Plutella xylostella pacbiov1 – Alistair Darby’s Lab
Heliconius erato lativitta v1 – Robert Reed’s Lab
Additional assemblies or annotations for existing species:
Lepbase v3 hosted NCBI RefSeq annotations for Papilio xuthus and Papilio polytes. RefSeq annotations are independent gene predictions generated by the NCBI. Because they draw on additional evidence from independent transcripts, they may include sequence data that are not present in the genome assemblies. In Lepbase v4 we have also included the original gene predictions from the Fujiwara lab, which sequenced these genomes:
Papilio xuthus Pxut 1.0
Papilio xuthus Pxut 1.0 RefSeq
Papilio polytes Ppol 1.0
Papilio polytes Ppol 1.0 RefSeq
The Heliconius melpomene genome assembly version remains the same (Hmel2), but the annotation has been updated, in collaboration with Chris Jiggins’ lab, to include Braker 1.0 predictions based on several independently sequenced RNA-seq libraries. Manual annotations and old gene names have been retained where the overlap is unambiguous.
Heliconius melpomene melpomene Hmel2
Several new assembly-only species were also added to Lepbase v4. Although no gene prediction sets are available for these species, you can download the genome FASTA files at download.lepbase.org/v4/sequence and run BLAST searches against them at blast.lepbase.org.
Low-coverage draft genome assemblies from Ferguson et al. (2014):
Glyphotaelius pellucidus (caddisfly outgroup)
The Heliconius genome consortium has released 24 new Heliconiine genome assemblies built with the w2-wrap contigger and a-scaff tools, courtesy of Jim Mallet and Bernardo Clavijo’s group at Earlham Institute. Some of the samples are species crosses or were contaminated, but we decided to host them as separate genome assemblies to make them easier to search against:
Heliconius erato mother
Heliconius erato x himera f1
Heliconius hecale old
Heliconius himera father
Heliconius telesiphe contaminated
Neruda aoede contaminated
We are also planning to make these assemblies more useful by running a comparative gene prediction across all the heliconiines. The first step towards this is to compute a whole-genome alignment using Progressive Cactus, which we have made available for download at download.lepbase.org/v4/progressivecactus
New and updated analyses
We generated a comprehensive gene orthology analysis spanning 23 genomes (21 species, but with two sets of genomes/gene-predictions for Heliconius erato and Plutella xylostella). Rather than using the Ensembl Compara pipeline as is, we used current best practices in gene-tree reconstruction such as:
- OrthoFinder instead of OrthoMCL for homologue clustering
- MAFFT for protein sequence alignment
- Noisy for automated filtering of multiple sequence alignments
- RAxML for phylogenetic inference
- NOTUNG for gene-tree/species-tree reconciliation.
Over 12,000 OrthoFinder clusters were processed into gene trees, which can be accessed from their constituent gene pages in the Lepbase Ensembl browser.
The goal of this analysis was to provide a context for each gene. However, as with all automated analyses, it may contain artifacts due to differences in gene prediction methods or due to generic parameters used in the pipeline. If you are interested in a highly accurate reconstruction of a specific gene family, you can download all the sequence and alignment data for a given gene and its homologues, and redo the tree using your preferred phylogenetic pipeline. The specific steps and evaluations for each part of our gene-tree pipeline will be described in forthcoming technical notes.
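For readers who want to redo a tree for one gene family, the following is a minimal sketch of how the alignment and tree-building steps could be scripted. It is not the Lepbase pipeline itself: the file name, run name, substitution model (`PROTGAMMAAUTO`) and random seed are assumptions chosen for illustration, and the code only builds the command lines rather than running MAFFT or RAxML.

```python
import shlex
from pathlib import Path


def mafft_command(fasta: Path) -> list[str]:
    """Build a MAFFT call; MAFFT writes the alignment to stdout."""
    # --auto lets MAFFT choose a strategy based on the input size.
    return ["mafft", "--auto", str(fasta)]


def raxml_command(alignment: Path, run_name: str, seed: int = 12345) -> list[str]:
    """Build a RAxML call for one ML search on a protein alignment."""
    return [
        "raxmlHPC",
        "-s", str(alignment),    # input alignment
        "-n", run_name,          # suffix for RAxML output files
        "-m", "PROTGAMMAAUTO",   # assumed model, not taken from the Lepbase pipeline
        "-p", str(seed),         # random seed for the parsimony starting tree
    ]


# Hypothetical file holding a gene and its homologues downloaded from Lepbase.
family = Path("my_gene_family.fasta")
aligned = family.with_suffix(".aln.fasta")

print(shlex.join(mafft_command(family)), ">", aligned)
print(shlex.join(raxml_command(aligned, "my_gene_family")))
```

The printed commands can be inspected and adjusted before being run in a shell. Alignment filtering and gene-tree/species-tree reconciliation are left out of this sketch.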
One of the key features of Lepbase is that we provide consistent analyses across all genomes using the same software and database versions and parameters. All genome sequences were masked using RepeatMasker 4.0.6 with the built-in arthropod repeat database. Previous Lepbase releases used RepeatModeler to generate species-specific repeat libraries. However, we found that RepeatModeler risks masking recently expanded gene families, so we took a more conservative approach and did not use RepeatModeler for this v4 release.
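One way to see the effect of a masking strategy on a genome is to measure how much of the sequence ends up masked. The sketch below assumes the downloaded FASTA file is soft-masked, meaning repeats are written in lowercase; that is an assumption about the file format, not something stated in this release. Runs of N are deliberately not counted because they can also represent assembly gaps.

```python
from typing import Iterable


def masked_fraction(fasta_lines: Iterable[str]) -> float:
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Made-up sequence for illustration only.
example = [">scaffold_1", "ACGTacgtNN", "ACGTACGTAC"]
print(f"{masked_fraction(example):.2f}")  # 4 lowercase bases out of 20 -> 0.20
```

For a real genome, pass an open file handle instead of the example list. Comparing this fraction between two masking runs shows how much more aggressively one library masks than the other.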
Functional annotation using BLASTP and InterProScan
We also provide functional annotations for protein-coding genes in all species with gene models using BLASTP against the SwissProt database, and using all the tools provided by InterProScan.
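As an illustration of how BLASTP results can be turned into one annotation per gene, the sketch below picks the highest-scoring hit for each query from standard tabular BLAST output (`-outfmt 6`, where column 11 is the e-value and column 12 the bit score). The e-value cutoff of 1e-5 and the identifiers in the example are assumptions for illustration; the thresholds used by Lepbase are not stated here.

```python
import csv
from typing import Dict, Iterable, Tuple


def best_hits(
    tabular_lines: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, Tuple[str, float]]:
    """Map each query to (subject, bit score) of its best hit under the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is too weak under the assumed cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up rows in -outfmt 6 column order; the identifiers are not real entries.
rows = [
    "geneA\tsp|X00001|EX1\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA\tsp|X00002|EX2\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-20\t90.5",
    "geneB\tsp|X00003|EX3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
print(best_hits(rows))
```

In this example geneA keeps its 200-bit hit and geneB is left unannotated because its only hit fails the e-value cutoff.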
New infrastructure and features
Lepbase.org previously ran on Linux virtual machines because of the complex dependencies of the various software programs involved. For Lepbase release v4, we have successfully migrated all the services, including the import and annotation pipelines for new genomes, to a Docker-based container infrastructure.
None of this affects how the service looks to the outside world, but it makes it much easier for us to upgrade the software as new versions of Ensembl and other software are released. We will also be able to easily scale up individual services to meet the growing number of users.
Although all our code is public already (see github.org/lepbase and easy-import.readme.io), we plan to document it extensively in the coming weeks, so that other groups can rapidly and easily set up their own taxon-specific Ensembl-based genome hubs using our Docker infrastructure.