Introducing GenomeHubs

Added by Richard Challis on March 6th 2017

GenomeHubs have been developed to provide a simple and scalable way to deploy an Ensembl genome browser and associated tools for clades or communities of non-model organisms.

Through GenomeHubs we have made the technology and analysis pipelines underpinning Lepbase available as an easy-to-install, easy-to-use collection of Docker containers, designed to run together to provide an Ensembl browser, an h5ai downloads server and a SequenceServer BLAST server.

Visit genomehubs.org or read our paper in Database to find out more.

BioGenomics 2017

Added by Richard Challis on February 24th 2017

Following our presentation at PAG XXV we have been developing the containerised pipelines that we use to deploy Lepbase into more generic GenomeHubs that we hope will be useful to groups working on a wide range of clades and communities. Mark Blaxter presented our GenomeHubs approach to dealing with the “black hole” of genomics projects that don’t make it into the archival databases at BioGenomics 2017.

The GenomeHubs code and containers are all available on the Lepbase github and Docker hub repositories with some draft documentation at readme.io. Having just released Lepbase v4, we’re actively working on reorganising all of this to pull the code and documentation together and test for/fix any bugs. If you would like to set up your own GenomeHub then feel free to try out the code as is, but we should have a much clearer set of instructions and examples ready in the next few weeks.


picture credit: discovermagazine.com/2014/sept/10-to-the-edge-and-back Roen Kelly

Lepbase v4

Added by Sujai Kumar on February 20th 2017

Version 4 of Lepbase, the hub for Lepidopteran genomes, not only includes several new genome assemblies and annotations, but also showcases a re-engineered infrastructure that will enable other groups to easily set up their own Ensembl-based genome hubs in the future.

New datasets

New genomes with gene models:

Plutella xylostella pacbiov1 – Alistair Darby’s Lab
Heliconius erato lativitta v1 – Robert Reed’s Lab

Additional assemblies or annotations for existing species:

Lepbase v3 hosted NCBI RefSeq annotations for Papilio xuthus and Papilio polytes. RefSeq annotations are independent gene predictions generated by the NCBI, but they may include sequence data that are not present in the genome assemblies because they use additional information from independent transcripts. In Lepbase v4 we have also included the original gene predictions from the Fujiwara lab, which sequenced these genomes:

Papilio xuthus Pxut 1.0
Papilio xuthus Pxut 1.0 Refseq
Papilio polytes Ppol 1.0
Papilio polytes Ppol 1.0 Refseq

The Heliconius melpomene genome assembly version remains the same (Hmel2), but the annotation has now been updated to include Braker 1.0 predictions based on several independently sequenced RNAseq libraries, in collaboration with Chris Jiggins’ lab. Manual annotations and old gene names have been retained where the overlap is unambiguous.

Heliconius melpomene melpomene Hmel2

Assembly-only species:

Several new assembly-only species were also added to Lepbase v4. Although no gene prediction sets are available for these species, you can download the genome fasta files at download.lepbase.org/v4/sequence and do BLAST searches against them at blast.lepbase.org.

Low-coverage draft genome assemblies from Ferguson et al. (2014):

Callimorpha dominula
Cameraria ohridella
Hepialus sylvina
Pararge aegeria
Polygonia c-album
Glyphotaelius pellucidus (caddisfly outgroup)

The Heliconius genome consortium has released 24 new Heliconiine genome assemblies using the w2-wrap contigger and a-scaff tools, courtesy of Jim Mallet and Bernardo Clavijo’s group at the Earlham Institute. Some of the samples are species crosses or were contaminated, but we decided to host them as separate genome assemblies to make them easier to search against:

Agraulis vanillae
Dryas iulia
Eueides tales
Heliconius besckei
Heliconius burneyi
Heliconius cydno
Heliconius demeter
Heliconius elevatus
Heliconius erato mother
Heliconius erato x himera f1
Heliconius hecale
Heliconius hecale old
Heliconius hecalesia
Heliconius himera
Heliconius himera father
Heliconius melpomene
Heliconius numata
Heliconius pardalinus
Heliconius sara
Heliconius telesiphe
Heliconius telesiphe contaminated
Heliconius timareta
Laparus doris
Neruda aoede contaminated

We are also planning to make these assemblies more useful by running a comparative gene prediction across all the heliconiines. The first step towards this is to compute a whole-genome alignment using Progressive Cactus, which we have made available for download at download.lepbase.org/v4/progressivecactus

New and updated analyses

Gene trees

We generated a comprehensive gene orthology analysis spanning 23 genomes (21 species, but with two sets of genomes/gene predictions for Heliconius erato and Plutella xylostella). Rather than using the Ensembl Compara pipeline as is, we used current best practices in gene-tree reconstruction such as:

  • OrthoFinder instead of OrthoMCL for homologue clustering
  • MAFFT for protein sequence alignment
  • Noisy for automated filtering of multiple sequence alignments
  • RAxML for phylogenetic inference
  • NOTUNG for gene-tree/species-tree reconciliation.

Over 12,000 OrthoFinder clusters were processed into gene trees, which can be accessed from their constituent gene pages in the Lepbase Ensembl browser.


The goal of this analysis was to provide a context for each gene. However, as with all automated analyses, it may contain artifacts due to differences in gene prediction methods or due to generic parameters used in the pipeline. If you are interested in a highly accurate reconstruction of a specific gene family, you can download all the sequence and alignment data for a given gene and its homologues, and redo the tree using your preferred phylogenetic pipeline. The specific steps and evaluations for each part of our gene-tree pipeline will be described in forthcoming technical notes.

RepeatMasking

One of the key features of Lepbase is that we provide consistent analyses across all genomes using the same software and database versions and parameters. All genome sequences were masked using RepeatMasker 4.0.6 with the built-in arthropod repeat database. Previous Lepbase releases used RepeatModeler to generate species-specific repeat libraries. However, we found that RepeatModeler risks masking recently expanded gene families, so we took a more conservative approach and did not use it for the v4 release.

download.lepbase.org/v4/repeatmasker
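
As a rough way to check how much of an assembly has been masked, the sketch below counts lowercase bases in a FASTA file. It assumes the downloaded sequences are soft-masked (masked bases written in lowercase), which is not stated above; the function name and the example record are hypothetical.

```python
import io


def masked_fraction(fasta_handle):
    """Return the fraction of bases written in lowercase in a FASTA stream.

    Assumption: the file is soft-masked, i.e. masked bases are lowercase.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Hypothetical two-line record: 4 lowercase bases out of 20.
example = io.StringIO(">scaffold_1\nACGTacgtNN\nACGTACGTAC\n")
print(masked_fraction(example))
```

For a real file, pass an open file handle in place of the in-memory example.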

Functional annotation using BLASTP and InterProScan

We also provide functional annotations for protein-coding genes in all species with gene models using BLASTP against the SwissProt database, and using all the tools provided by InterProScan.

download.lepbase.org/v4/blastp
download.lepbase.org/v4/interproscan
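
One common use of such BLASTP results is to keep only the best SwissProt hit for each protein. The sketch below does this for BLAST's standard 12-column tabular output. That the bulk download files use this format is an assumption, and the gene and SwissProt identifiers in the example are made up.

```python
import csv
import io

# Column names of BLAST's standard 12-column tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return a dict mapping each query ID to its highest-bitscore hit."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Hypothetical hits: two for geneA, one for geneB.
example = io.StringIO(
    "geneA\tsp|P1\t45.0\t200\t100\t5\t1\t200\t1\t200\t1e-40\t150\n"
    "geneA\tsp|P2\t80.0\t210\t40\t1\t1\t210\t1\t210\t1e-90\t320\n"
    "geneB\tsp|P3\t60.0\t150\t55\t2\t1\t150\t1\t150\t1e-50\t180\n"
)
for query, hit in best_hits(example).items():
    print(query, hit["sseqid"], hit["bitscore"])
```

Ranking by bitscore rather than e-value is a design choice here, since very small e-values are often rounded to zero and then tie.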

New infrastructure and features

Lepbase.org previously ran on Linux virtual machines because of the complex dependencies of the various software programs involved. For Lepbase release v4, we have successfully migrated all the services, including the import and annotation pipelines for new genomes, to a Docker-based container infrastructure.

None of this affects how the service looks to the outside world, but it makes it much easier for us to upgrade the software as new versions of Ensembl and other software are released. We will also be able to easily scale up individual services to meet the growing number of users.

Although all our code is public already (see github.org/lepbase and easy-import.readme.io), we plan to document it extensively in the coming weeks, so that other groups can rapidly and easily set up their own taxon-specific Ensembl-based genome hubs using our Docker infrastructure.

 

PAG XXV

Added by Richard Challis on January 14th 2017

Richard Challis presented our newly containerised easy Ensembl setup with EasyMirror and EasyImport at PAG XXV, the Plant & Animal Genomes conference, in San Diego on 14th January 2017.

A full version of this presentation with notes is also available.

Our code and containers are all available on the Lepbase github and Docker hub repositories with some draft documentation at readme.io. Having just released Lepbase v4, we’re actively working on reorganising all of this to pull the code and documentation together and test for/fix any bugs. If you would like to set up your own GenomeHub then feel free to try out the code as is, but we should have a much clearer set of instructions and examples ready in the next few weeks.

Lepbase release 3

Added by Sujai Kumar on August 12th 2016

We are pleased to announce the third release of our Lepidopteran genome browser including new datasets, updated analyses, a new look, and some code changes that will help us to add new data between major releases in the future.

New datasets

We’ve added assemblies for 4 species, bringing our tally up to 43 assemblies in 36 species, all available on our Ensembl genome browser (ensembl.lepbase.org), BLAST interface (blast.lepbase.org) and downloads server (download.lepbase.org). New datasets include:

Updated analyses

We run a standard (and consistent) set of analyses across all assemblies/gene sets, including InterProScan, BLASTP against SwissProt and RepeatModeler/RepeatMasker. The BLASTP and InterProScan results allow text searches of gene models and proteins based on domain names or known genes in other species. These analyses are also available as bulk downloads: Blastp and InterProScan.

We have also updated our comparative pipeline to produce more accurate orthology predictions across the gene sets that were available in our previous release. These are all available at v2.ensembl.lepbase.org (e.g., gene tree for Bombyxin showing all orthologs and paralogs). It has taken a while to work out the import process for adding these to the Ensembl Compara database, so we decided to release the genomes that we have for now. We will update the gene trees to include the recently added v3 assemblies soon.

Whole genome alignments are also on the way – for now the multi-species alignments we have run are available at download.lepbase.org/current/wga/

New look

We’ve tried to make our main portals a little more consistent.

We’ve also added more interactive data visualisations to the species pages on ensembl.lepbase.org. We now base these on a standardised file structure so we can offer alternative views, such as table and cumulative frequency distribution views for the assembly statistics. The code for these new visualisations is also available (see the link below each image).
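
To illustrate the kind of numbers behind a cumulative view of assembly statistics, here is a minimal sketch that computes N50 and a cumulative length curve from a list of scaffold lengths. It is not the Lepbase visualisation code, and the lengths used are invented for the example.

```python
def cumulative_lengths(lengths):
    """Return running totals of scaffold lengths, longest scaffold first."""
    totals = []
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        totals.append(running)
    return totals


def n50(lengths):
    """Return the N50: the length of the scaffold at which the running
    total (longest first) reaches at least half of the assembly span."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Invented scaffold lengths with a total span of 300.
scaffolds = [100, 80, 50, 30, 20, 20]
print(n50(scaffolds))                 # 80
print(cumulative_lengths(scaffolds))  # [100, 180, 230, 260, 280, 300]
```

Plotting the cumulative totals against scaffold rank gives the familiar cumulative curve, and the same sorted list can feed a table view.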

Code changes

We don’t expect the average Lepbase user to be particularly interested in the code we use, but we’re quite excited about the changes we’ve made to our Ensembl import scripts. These now form an easy-import pipeline that makes it very simple to set up and add species, not just to Lepbase but to any custom Ensembl instance. We expect this to make it simple for us to add new data as they become available, so if you are working on a project that you would like to see included then please get in touch. We’re already starting to use Ensembl more widely in the Blaxter Lab, and we hope our code will be useful to anyone who has been thinking about setting up an Ensembl-based website. Check out the documentation at easy-import.readme.io to see just how simple setting up your own Lepbase could be.

Lepbase manuscript on bioRxiv

Added by Richard Challis on June 9th 2016

Our first Lepbase manuscript, describing our resource for studying moth and butterfly genomes, is now on bioRxiv.

Abstract

As the generation and use of genomic datasets is becoming increasingly common in all areas of biology, the need for resources to collate, analyse and present data from independent (Tier 1) species-level genome projects into well supported clade-oriented (Tier 2) databases and provide a mechanism for these data to be propagated to pan-taxonomic (Tier 3) databases is becoming more pressing. Lepbase is a Tier 2 genomic resource for the Lepidoptera, supporting a research community using genomic approaches to understand evolution, speciation, olfaction, behaviour and pesticide resistance in a wide range of target species. Lepbase offers a core set of tools to make genomic data widely accessible including an Ensembl genome browser, text and sequence homology searches and bulk downloads of consistently presented and formatted datasets. As a part of the taxonomic community that we serve, we are working directly with Lepidoptera researchers to prioritise analyses and add tools that will be of most value to current research questions.

Lepbase release 2

Added by Richard Challis on February 13th 2016

Lepbase release 2 went live on 13th February 2016 with 21 annotated assemblies across 17 species.

This release adds six additional species:

Plus updated assemblies/annotations for three of the species in release 1:

All of these species (together with unannotated Heliconiine DISCOVAR assemblies) are available on our Ensembl and BLAST servers. The files used (assemblies, annotations, protein/CDS sequences) are available for download. However, this includes some unpublished data, so please respect the owners of each dataset and contact them to discuss any plans before publishing results based on these unpublished data.

Since release 1 we have also made a few changes to our BLAST server, adding a searchable hierarchical listing of BLAST databases to make it easier to use as the number of available assemblies keeps increasing.

We’ll be focussing on adding more tools and comparative analyses over the next few weeks and months so we’ve also redesigned this website to make it easier to see which tools, analyses and downloads are available.

Lepbase problem being fixed

Added by Sujai Kumar on January 5th 2016

Dear Lepbase users

We have had a problem with the Lepbase servers over the winter break but are fixing it right now. The virtual machines that run the main Ensembl production server and BLAST server did not restart automatically after a scheduled power outage (although the test and development servers restarted without any problem).

We think we have identified the problem (the machines that did not restart are the ones that auto-updated to the latest Linux kernel, which is not booting for some reason) and should have a fix very soon.

Sincere apologies for any inconvenience caused. If the booting problem is not solved by the end of today, we will restore the machines manually which will take a couple of days at most.

Sujai and Richard

Update: The problem has now been fixed. It turns out that Linux likes to auto-update kernels, but they don’t always boot correctly when the machine is restarted (missing initramfs files). We deleted the problem files and have strategies in place for preventing this from happening in future. The good news is that ensembl.lepbase.org and the BLAST server are live now.
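
For anyone wanting to check for the same condition before a reboot, the sketch below lists installed kernels that have no matching initramfs image. It is not the fix we applied, just an illustration. It assumes Debian/Ubuntu-style file names in /boot (vmlinuz-VERSION paired with initrd.img-VERSION); other distributions name these files differently, and the version strings in the example are hypothetical.

```python
def kernels_missing_initramfs(boot_files):
    """Return kernel versions that have a vmlinuz file but no initrd image.

    Assumption: Debian/Ubuntu naming, i.e. 'vmlinuz-<version>' paired
    with 'initrd.img-<version>'.
    """
    names = set(boot_files)
    prefix = "vmlinuz-"
    missing = []
    for name in sorted(names):
        if name.startswith(prefix):
            version = name[len(prefix):]
            if "initrd.img-" + version not in names:
                missing.append(version)
    return missing


# Hypothetical /boot listing: the second kernel has no initramfs image.
listing = [
    "vmlinuz-4.2.0-19-generic",
    "initrd.img-4.2.0-19-generic",
    "vmlinuz-4.2.0-22-generic",
    "config-4.2.0-22-generic",
]
print(kernels_missing_initramfs(listing))
```

On a real machine the listing could come from os.listdir("/boot").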