Diploids: A2 and D5

The Gossypium genus contains about 50 species (Wendel J F, Albert V A. Phylogenetics of the cotton genus (Gossypium): character-state weighted parsimony analysis of chloroplast-DNA restriction site data and its systematic and biogeographic implications [J]. Syst Bot, 1992: 115-143). Most of them are diploids and can be classified into 8 diploid genome groups (designated A to G and K). They have different geographical origins and variable estimated genome sizes (Fig 1) (Hawkins J S, Kim H, Nason J D, et al. Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium [J]. Genome Res, 2006, 16(10): 1252-1261), mostly due to different amounts of repetitive DNA elements. Despite these differences, they share a common chromosome number (n = 13) and high levels of gene synteny (Rong J, Abbey C, Bowers J E, et al. A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (Gossypium) [J]. Genetics, 2004, 166(1): 389-417), indicating that they descend from a common ancestor (Fig 1).

Tetraploids: AD1 and AD2

The tetraploid cotton species are thought to have formed through an allopolyploidization event about 1 to 2 million years ago, which involved a D-genome species as the pollen-providing parent and an A-genome species as the maternal parent (Chen Z J, Scheffler B E, Dennis E, et al. Toward sequencing cotton (Gossypium) genomes [J]. Plant Physiol, 2007, 145(4): 1303-1310) (Fig 2). Their descendants evolved into the 5 current tetraploid cotton species (designated AD1 to AD5). Among them, upland cotton G. hirsutum (AD1) and sea-island cotton G. barbadense (AD2) have become the main cultivated cotton species.

Please refer to the CottonGen database to view all cotton species.

Fig 1. The 8 diploid cotton genome groups. Image from Hawkins et al. 2006. Note that the "New World" origin of the D genome refers to the Americas.
Fig 2. Allopolyploidization led to the 5 current tetraploid cotton species. The genome size of each tetraploid ADn is roughly equal to the sum of the genome sizes of An and Dn.

List of cotton genome assemblies.

Cotton Genome Assemblies List
Chromosome Set | Species | Strain | Symbol in CottonFGD | Sequencing Technology | Available Date | Reference
A2 | Gossypium arboreum | SXY1 | | Illumina HiSeq 2000 | 2014-05 | Li et al., Genome sequence of the cultivated cotton Gossypium arboreum. Nature Genetics. 46, 567–572. 2014
A2 | Gossypium arboreum | SXY1 | CRI | PacBio and Hi-C | 2018-05 | Du et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nature genetics. 2018 May 07.
A2 | Gossypium arboreum | SXY1 | | PacBio and Hi-C | 2020-04 | Huang, G. et al., Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nature Genetics. 2020.
A2 | Gossypium arboreum | SXY1 | | Oxford Nanopore | 2021-08 | Wang, M. et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021 May 11:msab128.
D5 | Gossypium raimondii | D5-3 | | Illumina | 2012-01 | Wang et al., The draft genome of a diploid cotton Gossypium raimondii. Nature Genetics. 44, 1098–1103. 2012
D5 | Gossypium raimondii | Ulbr. | JGI | Illumina, Sanger, Roche 454 | 2012-12 | Paterson AH et al., "Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.", Nature, 2012 Dec 20;492(7429):423-7
D5 | Gossypium raimondii | D5-4 | | PacBio | 2019-09 | Udall J A, Long E, Hanson C, et al. De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri[J]. G3: Genes, Genomes, Genetics, 2019, 9(10): 3079-3085.
D5 | Gossypium raimondii | D5-8 | | Illumina HiSeq2500 | 2021-02 | Grover, et al. Insights into the evolution of the New World diploid cottons (Gossypium, subgenus Houzingenia) based on genome sequencing. Genome Biol Evol. 2019 Jan 1;11(1):53-71.
D5 | Gossypium raimondii | D502 | | Oxford Nanopore | 2021-08 | Wang, M. et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021 May 11:msab128.
A1 | Gossypium herbaceum | Mutema | | PacBio and Hi-C | 2020-04 | Huang, G. et al., Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nature Genetics. 2020.
D1 | Gossypium thurberi | D1-35 | | Illumina | 2018-11 | Grover, et al. Insights into the evolution of the New World diploid cottons (Gossypium, subgenus Houzingenia) based on genome sequencing. Genome Biol Evol. 2019 Jan 1;11(1):53-71.
D10 | Gossypium turneri | D10-3 | | PacBio | 2019-09 | Udall J A, Long E, Hanson C, et al. De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri[J]. G3: Genes, Genomes, Genetics, 2019, 9(10): 3079-3085.
G2 | Gossypium australe | PA1801 | | PacBio | 2019-09 | Cai Y, Cai X, Wang Q, et al. Genome sequencing of the Australian wild diploid species Gossypium australe highlights disease resistance and delayed gland morphogenesis[J]. Plant biotechnology journal, 2019.
AD2 | Gossypium barbadense | 3-79 | | PacBio | 2019-08 | NA
AD2 | Gossypium barbadense | 3-79 | HAU | Illumina, PacBio, BioNano, Hi-C | 2018-12 | Wang M, Tu L, Yuan D, et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense[J]. Nature genetics, 2019, 51(2): 224-229.
AD2 | Gossypium barbadense | 3-79 | | Illumina | 2015-12 | Yuan D, Tang Z, Wang M, et al. The genome sequence of Sea-Island cotton (Gossypium barbadense) provides insights into the allopolyploidization and development of superior spinnable fibres[J]. Scientific reports, 2015, 5: 17662.
AD2 | Gossypium barbadense | Hai7124 | ZJU | Illumina, BioNano, Hi-C | 2019-03 | Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748.
AD2 | Gossypium barbadense | Pima90 | P90HEBAU | Illumina, BioNano, Hi-C | 2021-08 |
AD5 | Gossypium darwinii | 1808015.09 | | | 2019-08 | NA
AD1 | Gossypium hirsutum | Tm-1 | | Illumina | 2015-04 | Li et al., Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nature Biotechnology. 33, 524–530. 2015
AD1 | Gossypium hirsutum | Tm-1 | HAU | Illumina, PacBio, BioNano, Hi-C | 2018-12 | Wang M, Tu L, Yuan D, et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense[J]. Nature genetics, 2019, 51(2): 224-229.
AD1 | Gossypium hirsutum | Tm-1 | NAU | Illumina | 2015-04 | Zhang et al., Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nature Biotechnology. 33, 531–537. 2015
AD1 | Gossypium hirsutum | Tm-1 | JGI | Illumina, PacBio | 2017-01 | NA
AD1 | Gossypium hirsutum | Tm-1 | ZJU | Illumina, BioNano, Hi-C | 2019-03 | Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748.
AD1 | Gossypium hirsutum | Tm-1 | CRI | Illumina, PacBio, Hi-C | 2019-07 | Yang Z, Ge X, Yang Z, et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars[J]. Nature communications, 2019, 10(1): 1-13.
AD1 | Gossypium hirsutum | ZM24 | | Illumina, PacBio, Hi-C | 2019-07 | Yang Z, Ge X, Yang Z, et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars[J]. Nature communications, 2019, 10(1): 1-13.
AD1 | Gossypium hirsutum | NDM8 | NDM8HEBAU | Illumina, PacBio, Hi-C | 2021-08 |
AD4 | Gossypium mustelinum | 1408120.09 | | | 2019-09 | NA
AD3 | Gossypium tomentosum | 7179.01 | | | 2020-01 | NA

Gene ID

Because of limitations in the annotations supplied by data providers, CottonFGD currently includes only protein-coding genes.

All gene IDs are directly imported from data provider annotations. Therefore, you can directly use these IDs to search in other cotton databases (e.g., CottonGen). Their formats are listed as follows:

Gene ID Formats:
Species and Assembly | Gene ID Format | Gene ID Example
G. hirsutum, CRI assembly | Gh_[%3c]G[%04d]00 | Gh_A01G001100
G. hirsutum, HAU assembly | Ghir_[%3c]G[%05d]0 | Ghir_A01G000120
G. hirsutum, JGI assembly | Gohir.[%03c]G[%06d] | Gohir.A01G001300
G. hirsutum, NAU assembly | Gh_[%3c]G[%04d] | Gh_A01G0001
G. hirsutum, ZJU assembly | GH_[%3c]G[%04d] | GH_D07G0123
G. hirsutum, NDM8HEBAU assembly | GhM_[%3c]G[%04d] | GhM_D12G0324
G. barbadense, HAU assembly | Gbar_[%3c]G[%05d]0 | Gbar_A01G014970
G. barbadense, ZJU assembly | GB_[%3c]G[%04d] | GB_A01G0011
G. barbadense, P90HEBAU assembly | GbM_[%3c]G[%04d] | GbM_D13G0404
G. arboreum, CRI assembly | Ga[%2c]G[%04d] | Ga01G0012
G. raimondii, JGI assembly | Gorai.[%03c]G[%06d] | Gorai.001G000100
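
The formats above can be turned into patterns for checking which assembly an ID comes from. The following is a minimal Python sketch, not part of CottonFGD itself. It assumes that "[%3c]" and "[%2c]" stand for three and two alphanumeric characters (such as A01 or D07); the names GENE_ID_PATTERNS and classify_gene_id are made up for this illustration.

```python
import re

# Patterns transcribed from the Gene ID Formats table.
# Assumption: "[%3c]" / "[%2c]" mean three / two alphanumeric characters.
GENE_ID_PATTERNS = {
    "G. hirsutum, CRI": r"Gh_[A-Za-z0-9]{3}G\d{4}00",
    "G. hirsutum, HAU": r"Ghir_[A-Za-z0-9]{3}G\d{5}0",
    "G. hirsutum, JGI": r"Gohir\.[A-Za-z0-9]{3}G\d{6}",
    "G. hirsutum, NAU": r"Gh_[A-Za-z0-9]{3}G\d{4}",
    "G. hirsutum, ZJU": r"GH_[A-Za-z0-9]{3}G\d{4}",
    "G. hirsutum, HEBAU (GhM_ prefix)": r"GhM_[A-Za-z0-9]{3}G\d{4}",
    "G. barbadense, HAU": r"Gbar_[A-Za-z0-9]{3}G\d{5}0",
    "G. barbadense, ZJU": r"GB_[A-Za-z0-9]{3}G\d{4}",
    "G. barbadense, HEBAU (GbM_ prefix)": r"GbM_[A-Za-z0-9]{3}G\d{4}",
    "G. arboreum, CRI": r"Ga[A-Za-z0-9]{2}G\d{4}",
    "G. raimondii, JGI": r"Gorai\.[A-Za-z0-9]{3}G\d{6}",
}


def classify_gene_id(gene_id):
    """Return the labels of every table format that the whole ID matches."""
    return [
        label
        for label, pattern in GENE_ID_PATTERNS.items()
        if re.fullmatch(pattern, gene_id)  # fullmatch: no partial matches
    ]


if __name__ == "__main__":
    for example in ("Gh_A01G001100", "Gh_A01G0001", "Gorai.001G000100"):
        print(example, classify_gene_id(example))
```

Matching is case-sensitive on purpose, because the Gh_ (CRI, NAU) and GH_ (ZJU) prefixes differ only in case. The CRI and NAU formats share the Gh_ prefix and are told apart by the number of digits.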

Transcript ID

Transcript IDs are formed by appending .1, .2, .3, ... to the gene ID. The transcript with the .1 suffix is usually the longest of all isoforms (i.e., the principal transcript).

To maintain consistency among the four cotton species, only principal transcripts are analyzed in CottonFGD.
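
As an illustration of this naming rule, the short Python sketch below splits a transcript ID into its gene ID and isoform number. The function names are hypothetical, and the example transcript IDs are built from the rule above rather than taken from the database. Treating .1 as the principal transcript follows the "usually" in the text, so it is a convention, not a guarantee.

```python
import re


def split_transcript_id(transcript_id):
    """Split 'GENE.N' into (gene_id, isoform_number).

    Only a trailing all-digit suffix counts as the isoform number, so the
    dot inside IDs such as 'Gorai.001G000100' is left untouched.
    """
    match = re.fullmatch(r"(.+)\.(\d+)", transcript_id)
    if match is None:
        raise ValueError(f"no isoform suffix in {transcript_id!r}")
    return match.group(1), int(match.group(2))


def is_principal(transcript_id):
    """Assume the .1 isoform is the principal transcript."""
    return split_transcript_id(transcript_id)[1] == 1


if __name__ == "__main__":
    print(split_transcript_id("Gh_A01G0001.1"))
    print(split_transcript_id("Gorai.001G000100.2"))
```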

Gene Name and Description

Each gene's name and description are based on its best homolog among Swiss-Prot proteins, because Swiss-Prot is non-redundant and manually reviewed. Homology is identified with NCBI BLAST+ (Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications [J]. BMC Bioinformatics, 2009, 10: 421):

$ blastp -query [pep.fa] -db swissprotdb.fa -evalue 1e-5 -max_target_seqs 1 -max_hsps 1 

For each Swiss-Prot entry, the "Gene names" value serves as the cotton gene's name (symbol), while the "Protein names" value serves as the cotton gene's description. Every gene with a Swiss-Prot hit has a description, but some have no gene name. For example, the best Swiss-Prot homolog of gene Gh_A01G0010 in G. hirsutum is P48504. This entry has the description (Protein names) Cytochrome b-c1 complex subunit 6, but no gene name.
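
The sketch below shows one possible way to pull a description and a gene name out of the best hit. It assumes the hit title follows the UniProt FASTA header layout ("sp|accession|entry name", then the protein name, then OS=, OX=, an optional GN=, PE= and SV= fields). This is an assumption about the input, not a description of the CottonFGD pipeline. The function name, entry name, organism and second header are placeholders invented for this example.

```python
import re


def parse_swissprot_header(header):
    """Extract accession, description and gene name from a UniProt-style header.

    The gene name is None when the header carries no GN= field.
    """
    ids, _, rest = header.lstrip(">").partition(" ")
    accession = ids.split("|")[1]           # sp|ACCESSION|ENTRY_NAME
    description = rest.split(" OS=", 1)[0]  # protein name precedes OS=
    gene = re.search(r" GN=(.+?)(?= PE=| SV=|$)", rest)
    return {
        "accession": accession,
        "description": description,
        "name": gene.group(1) if gene else None,
    }


if __name__ == "__main__":
    # Placeholder headers: only the accession and protein name of the first
    # one come from the text above; everything else is made up.
    no_name = ("sp|P48504|PLACEHOLDER Cytochrome b-c1 complex subunit 6 "
               "OS=Example organism OX=0 PE=1 SV=1")
    with_name = ("sp|X00000|PLACEHOLDER Demo protein "
                 "OS=Example organism OX=0 GN=DEMO1 PE=1 SV=1")
    print(parse_swissprot_header(no_name))
    print(parse_swissprot_header(with_name))
```

A hit without a GN= field yields a description but no name, which mirrors the Gh_A01G0010 case described above.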

Transcript Structure

All transcript structures (exons, introns, coding regions and untranslated regions (UTRs)) are extracted from the data providers' annotations. Note that, because of limitations in those annotations, UTR annotations may be incomplete.

As mentioned above, only the principal transcript of each gene is analyzed. Thus, all gene function annotations, such as homology, GO (Gene Ontology), InterPro and pathway, are based on the protein sequence of the principal transcript.

Protein Property Statistics

CottonFGD includes the following protein property statistics for each gene:

The values of residue composition, molecular weight, charge and isoelectric point are calculated using pepstats from the EMBOSS package (v6.5.7.0):

$ pepstats -sequence [pep.fa] -outfile [pep.out]

You can see an example of the output in the pepstats manual.

The Grand Average of Hydropathy (GRAVY) value is calculated using BioPerl (v1.6.924):

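use strict;
use warnings;
use Bio::SeqIO;
use Bio::Tools::SeqStats;   # Package to calculate hydropathicity

# Read protein sequences from the FASTA file given as the first argument
my $seqio_obj = Bio::SeqIO->new(-file => shift, -format => "fasta");
while (my $seq_obj = $seqio_obj->next_seq) {
    my $id = $seq_obj->display_id;
    my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seq_obj);
    my $gravy;
    # hydropathicity() may throw on sequences it cannot score, so trap errors
    eval { $gravy = $seq_stats->hydropathicity(); };
    # Print one "ID<TAB>GRAVY" line per protein; NA when no value was obtained
    print join("\t", $id, defined $gravy ? $gravy : "NA"), "\n";
}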
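
For readers who want to see what the GRAVY score is, the following Python sketch computes it directly: the Kyte-Doolittle hydropathy values of all residues are summed and divided by the sequence length. It is an independent illustration, not the BioPerl implementation. Stripping a trailing "*" and rejecting non-standard residues are choices made for this sketch.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    residues = sequence.upper().rstrip("*")  # drop a trailing stop symbol
    if not residues:
        raise ValueError("empty sequence")
    unknown = set(residues) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"cannot score residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


if __name__ == "__main__":
    print(round(gravy("MKV*"), 4))
```

Positive values indicate a more hydrophobic protein and negative values a more hydrophilic one.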

Protein Domain, Gene Ontology & InterPro Items

The possible domain regions for each protein and the associated GO (Gene Ontology) / InterPro items are predicted using a locally installed copy of InterProScan (v5.16-55.0):

./interproscan.sh -dp -f tsv -goterms -i [pep.fa]

Currently InterProScan includes 15 types of domain databases:

Domain Databases and Their ID Formats
Database | Description | Accession ID Format | Accession ID Example
Coils (2.2.1) | Prediction of Coiled Coil Regions in Proteins | Coil |
Gene3D (3.5.0) | Structural assignment for whole genes and genomes using the CATH domain structure database | G3DSA:[%d.%d....] | G3DSA:3.10.330.20
Hamap (201511.02) | High-quality Automated and Manual Annotation of Microbial Proteomes | MF_[%05d] | MF_01007
PANTHER (10.0) | The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. | PTHR[%5d]; PTHR[%5d]:SF[%d] | PTHR12133; PTHR24279:SF100
Pfam (28.0) | A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs) | PF[%05d] | PF03061
Phobius (1.01) | A combined transmembrane topology and signal peptide predictor | CYTOPLASMIC_DOMAIN; NON_CYTOPLASMIC_DOMAIN; SIGNAL_PEPTIDE_C_REGION; SIGNAL_PEPTIDE_H_REGION; SIGNAL_PEPTIDE_N_REGION; SIGNAL_PEPTIDE; TRANSMEMBRANE |
PIRSF (3.01) | The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships. | PIRSF[%06d] | PIRSF023803
PRINTS (42.0) | A fingerprint is a group of conserved motifs used to characterise a protein family | PR[%05d] | PR00109
ProDom (2006.1) | ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database. | PD[%06d] | PD005155
ProSite (20.113) | PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them | PS[%05d] | PS00036
SignalP (4.1) | SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes, gram-positive prokaryotes or gram-negative prokaryotes | SignalP-noTM; SignalP-TM |
SMART (6.2) | SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs | SM[%05d] | SM00338
SUPERFAMILY (1.75) | SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes | SSF[%d] | SSF53335
TIGRFAM (15.0) | TIGRFAMs are protein families based on Hidden Markov Models or HMMs | TIGR[%05d] | TIGR00006
TMHMM (2.0c) | Prediction of transmembrane helices in proteins | Tmhelix |

The accession ID formats of GO (Gene Ontology) and InterPro items are listed as follows:

Database | Accession ID Format | Accession ID Example
Gene Ontology | GO:[%07d] | GO:0006629
InterPro | IPR[%06d] | IPR004299
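
To show how these IDs appear in practice, the Python sketch below collects the InterPro and GO accessions per protein from InterProScan TSV output. It assumes the usual TSV column order, with the protein accession in column 1, the InterPro accession in column 12 and "|"-separated GO terms in column 14; rows without an InterPro match are assumed to have fewer columns. The function name and the sample rows are made up, and the sample rows reuse example IDs from the tables above without implying that they belong together.

```python
import re


def collect_annotations(tsv_text):
    """Map each protein to its sets of InterPro and GO accessions."""
    interpro, go_terms = {}, {}
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        if len(cols) < 12:        # row carries no InterPro columns
            continue
        protein = cols[0]
        if re.fullmatch(r"IPR\d{6}", cols[11]):
            interpro.setdefault(protein, set()).add(cols[11])
        if len(cols) > 13:        # GO column present
            for term in cols[13].split("|"):
                if re.fullmatch(r"GO:\d{7}", term):
                    go_terms.setdefault(protein, set()).add(term)
    return interpro, go_terms


if __name__ == "__main__":
    # Made-up rows for illustration only
    rows = [
        ["p1", "md5", "100", "Pfam", "PF03061", "desc", "1", "50",
         "1e-5", "T", "01-01-2016", "IPR004299", "desc", "GO:0006629"],
        ["p1", "md5", "100", "Coils", "Coil", "", "5", "20",
         "-", "T", "01-01-2016"],
    ]
    text = "\n".join("\t".join(r) for r in rows)
    print(collect_annotations(text))
```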

Homology

Currently CottonFGD includes homology information for 22 other representative plant species from all the main categories: Eudicots, Monocots, Acrogymnospermae, Lycopodiidae, Bryophyta and Chlorophyta. Homology information is available only for genes in the G. hirsutum CRI assembly.

The best homolog of each cotton gene is searched using NCBI BLAST+, in the same way as for assigning gene names.

KEGG Pathway

The KEGG pathways associated with each gene are determined in two steps. First, all cotton genes are assigned to KEGG Orthology using the KEGG Automatic Annotation Server, with all the available plant species selected as the "GENES data set". Only one KEGG Orthology item (ID Format: K[%05d]) is assigned to each gene. Then, the KEGG Orthology item is mapped to its associated KEGG Pathways (ID Format: map[%05d] or ko[%05d]) and KEGG Modules (ID Format: M[%05d]).
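
The two steps can be sketched in Python as a lookup chain from gene to KEGG Orthology item to pathways. The sketch assumes the annotation result is a text file with one "gene ID, tab, K number" line per gene, where unassigned genes have only the gene ID, and that a separate table maps K numbers to pathway IDs. Both input layouts, the function names and all IDs in the example (K99999, map99998, map99999) are placeholders, not real assignments.

```python
import re

KO_ID = re.compile(r"K\d{5}")  # ID format K[%05d]


def parse_ko_assignments(text):
    """Step 1: read 'gene<TAB>K number' lines; skip genes without a K number."""
    assignments = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and KO_ID.fullmatch(fields[1]):
            assignments[fields[0]] = fields[1]  # one KO item per gene
    return assignments


def genes_to_pathways(assignments, ko_to_pathways):
    """Step 2: map each gene's KO item to its pathways (empty list if none)."""
    return {
        gene: sorted(ko_to_pathways.get(ko, []))
        for gene, ko in assignments.items()
    }


if __name__ == "__main__":
    # Placeholder data for illustration only
    ko_text = "Gh_A01G0001\tK99999\nGh_A01G0002"
    ko_map = {"K99999": ["map99999", "map99998"]}
    print(genes_to_pathways(parse_ko_assignments(ko_text), ko_map))
```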

Gene Expression

The expression pattern of each gene (i.e., its principal transcript) is measured using RNA-seq datasets from SRP166405 (Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748). This dataset contains 49 samples, each with 1-3 biological replicates. Expression data are available only for genes in the G. hirsutum CRI assembly.

RNA-seq analysis
Type | Sample | Num. of Biological Replicates | SRA IDs
Tissue and Organ | anther | 3 | SRR8089834; SRR8089835; SRR8089836
Tissue and Organ | bract | 3 | SRR8089862; SRR8089861; SRR8089833
Tissue and Organ | filament | 3 | SRR8089837; SRR8089838; SRR8089839
Tissue and Organ | leaf | 3 | SRR8089855; SRR8090014; SRR8089903
Tissue and Organ | petal | 3 | SRR8089902; SRR8089868; SRR8089867
Tissue and Organ | pistil | 1 | SRR8089840
Tissue and Organ | root | 3 | SRR8089897; SRR8089896; SRR8089895
Tissue and Organ | sepal | 3 | SRR8089863; SRR8089866; SRR8089865
Tissue and Organ | stem | 3 | SRR8089977; SRR8089975; SRR8089969
Tissue and Organ | torus | 3 | SRR8089870; SRR8089869; SRR8089864
Fiber development | fiber at 10DPA | 3 | SRR8090044; SRR8090041; SRR8090042
Fiber development | fiber at 15DPA | 3 | SRR8090046; SRR8090049; SRR8090050
Fiber development | fiber at 20DPA | 3 | SRR8090004; SRR8090007; SRR8090006
Fiber development | fiber at 25DPA | 1 | SRR8090010
Drought stress | leaves under drought stress for 1h | 3 | SRR8089985; SRR8089984; SRR8089987
Drought stress | leaves under drought stress for 3h | 3 | SRR8089986; SRR8089989; SRR8089988
Drought stress | leaves under drought stress for 6h | 3 | SRR8089991; SRR8089990; SRR8089983
Drought stress | leaves under drought stress for 12h | 3 | SRR8089982; SRR8090019; SRR8090020
Drought stress | leaves under drought stress for 24h | 3 | SRR8090021; SRR8090022; SRR8090015
Cold stress | leaves cold-treated for 1h | 3 | SRR8089823; SRR8089824; SRR8089825
Cold stress | leaves cold-treated for 3h | 3 | SRR8089826; SRR8089827; SRR8089828
Cold stress | leaves cold-treated for 6h | 3 | SRR8089829; SRR8089830; SRR8089831
Cold stress | leaves cold-treated for 12h | 3 | SRR8089832; SRR8089924; SRR8089923
Cold stress | leaves cold-treated for 24h | 3 | SRR8089922; SRR8089921; SRR8089920
Control for stress | leaves control 0h | 3 | SRR8090035; SRR8090032; SRR8090033
Control for stress | leaves control 1h | 3 | SRR8090030; SRR8090031; SRR8090039
Control for stress | leaves control 3h | 3 | SRR8090040; SRR8090074; SRR8090073
Control for stress | leaves control 6h | 3 | SRR8090076; SRR8090075; SRR8090070
Control for stress | leaves control 12h | 3 | SRR8090069; SRR8090072; SRR8090071
Control for stress | leaves control 24h | 2 | SRR8090078; SRR8090077
Heat stress | leaves heat-treated for 1h | 3 | SRR8089919; SRR8089918; SRR8089917
Heat stress | leaves heat-treated for 3h | 3 | SRR8089916; SRR8089915; SRR8089953
Heat stress | leaves heat-treated for 6h | 3 | SRR8089954; SRR8089951; SRR8089952
Heat stress | leaves heat-treated for 12h | 3 | SRR8089957; SRR8089958; SRR8089955
Heat stress | leaves heat-treated for 24h | 3 | SRR8089956; SRR8089949; SRR8089950
Salt stress | leaves salt-treated for 1h | 3 | SRR8090016; SRR8090017; SRR8090018
Salt stress | leaves salt-treated for 3h | 3 | SRR8090026; SRR8090027; SRR8090056
Salt stress | leaves salt-treated for 6h | 3 | SRR8090055; SRR8090054; SRR8090053
Salt stress | leaves salt-treated for 12h | 2 | SRR8090060; SRR8090059
Salt stress | leaves salt-treated for 24h | 3 | SRR8090058; SRR8090057; SRR8090064
Ovule development | ovule at -3DPA | 2 | SRR8089841; SRR8089842
Ovule development | ovule at 0DPA | 3 | SRR8090087; SRR8090086; SRR8090085
Ovule development | ovule at 1DPA | 3 | SRR8090084; SRR8090083; SRR8090082
Ovule development | ovule at 3DPA | 2 | SRR8090081; SRR8090080
Ovule development | ovule at 5DPA | 3 | SRR8090089; SRR8090088; SRR8090043
Ovule development | ovule at 10DPA | 3 | SRR8090047; SRR8090048; SRR8090045
Ovule development | ovule at 15DPA | 3 | SRR8090003; SRR8090002; SRR8090005
Ovule development | ovule at 20DPA | 3 | SRR8090009; SRR8090008; SRR8090011
Ovule development | ovule at 25DPA | 1 | SRR8089973

TPM values are calculated using salmon (v1.1.0) (Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.).

salmon quant -i cDNA.Ghir.CRI -l A -1 reads.1.fq.gz -2 reads.2.fq.gz --seqBias --gcBias --validateMappings  -o output

The generated quant.sf files are available on the Download page.
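
If you download the quant.sf files, the Python sketch below shows one way to read the TPM column and summarize biological replicates. It relies on the tab-separated quant.sf header written by salmon (Name, Length, EffectiveLength, TPM, NumReads). Averaging replicates is only a common convention; this page does not state how CottonFGD combines them. The function names and the numbers in the example are made up.

```python
def read_tpm(quant_sf_text):
    """Return {transcript name: TPM} from the text of a quant.sf file."""
    lines = quant_sf_text.strip().splitlines()
    header = lines[0].split("\t")
    name_col, tpm_col = header.index("Name"), header.index("TPM")
    tpm = {}
    for line in lines[1:]:
        fields = line.split("\t")
        tpm[fields[name_col]] = float(fields[tpm_col])
    return tpm


def mean_tpm(replicates):
    """Average TPM per transcript over a list of {name: TPM} dictionaries.

    Assumes every replicate reports the same set of transcripts.
    """
    return {
        name: sum(rep[name] for rep in replicates) / len(replicates)
        for name in replicates[0]
    }


if __name__ == "__main__":
    # Made-up values for illustration only
    header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    rep1 = header + "Gh_A01G000100.1\t900\t750.0\t10.0\t120\n"
    rep2 = header + "Gh_A01G000100.1\t900\t748.5\t20.0\t250\n"
    print(mean_tpm([read_tpm(rep1), read_tpm(rep2)]))
```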

CottonFGD is designed to be compatible with all modern web browsers (such as Mozilla Firefox, Google Chrome, Safari and Microsoft Edge) on a variety of devices (such as PCs, tablets and mobile phones). Apart from some subtle differences in front-end appearance, recent versions of Microsoft Internet Explorer (later than version 9.0) are also acceptable. Using old browser versions (such as Internet Explorer lower than 8.0) is strongly discouraged, as they are likely to cause many avoidable bugs.

Tips

  • "JavaScript" must be turned on; otherwise the site will not work. It is turned on by default in almost all browsers.
  • Turning on "Cookies" is recommended. They are used to "remember" your previous settings in several tools. They are turned on by default in almost all browsers.