Diploids: A₂ and D₅

The Gossypium genus contains about 50 species (Wendel J F, Albert V A. Phylogenetics of the cotton genus (Gossypium): character-state weighted parsimony analysis of chloroplast-DNA restriction site data and its systematic and biogeographic implications [J]. Syst Bot, 1992: 115-143). Most of them are diploids, which fall into 8 diploid genome groups (designated A to G and K). The groups differ in geographical origin and in estimated genome size (Fig 1.) (Hawkins J S, Kim H, Nason J D, et al. Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium [J]. Genome Res, 2006, 16(10): 1252-1261), mostly because of differing amounts of repetitive DNA elements. Despite these differences, they share a common chromosome number (n = 13) and high levels of gene synteny (Rong J, Abbey C, Bowers J E, et al. A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (Gossypium) [J]. Genetics, 2004, 166(1): 389-417), indicating that they descend from a common ancestor (Fig 1.).

Tetraploids: AD₁ and AD₂

The tetraploid cotton species are thought to have formed through an allopolyploidization event about 1 to 2 million years ago, which involved a D-genome species as the pollen-providing parent and an A-genome species as the maternal parent (Chen Z J, Scheffler B E, Dennis E, et al. Toward sequencing cotton (Gossypium) genomes [J]. Plant Physiol, 2007, 145(4): 1303-1310) (Fig 2.). Their descendants evolved into the current 5 tetraploid cotton species (designated AD₁ to AD₅). Among them, upland cotton G. hirsutum (AD₁) and sea-island cotton G. barbadense (AD₂) have become the main cultivated cotton species.

Please refer to the CottonGen database to view all cotton species.

Fig 1. The 8 diploid cotton genome groups. Image from Hawkins et al. 2006. Note that the "New World" origin of the D genome refers to the Americas.

Fig 2. Allopolyploidization led to the current 5 tetraploid cotton species. The genome size of each tetraploid AD_n is roughly equal to the sum of the genome sizes of A_n and D_n.

Genome Assembly

List of cotton genome assemblies.

Cotton Genome Assemblies List
Chromosome Set	Species	Strain	Symbol in CottonFGD	Sequencing Technology	Available Date	Reference
A2	Gossypium arboreum	SXY1		Illumina HiSeq 2000	2014-05	Li et al., Genome sequence of the cultivated cotton Gossypium arboreum. Nature Genetics. 46, 567–572. 2014
A2	Gossypium arboreum	SXY1	CRI	PacBio and Hi-C	2018-05	Du et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nature genetics. 2018 May 07.
A2	Gossypium arboreum	SXY1		PacBio and Hi-C	2020-04	Huang, G. et al., Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nature Genetics. 2020.
A2	Gossypium arboreum	SXY1		Oxford Nanopore	2021-08	Wang, M. et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021 May 11:msab128.
D5	Gossypium raimondii	D5-3		Illumina	2012-01	Wang et al., The draft genome of a diploid cotton Gossypium raimondii. Nature Genetics. 44, 1098–1103. 2012
D5	Gossypium raimondii	Ulbr.	JGI	Illumina, Sanger, Roche 454	2012-12	Paterson AH et al., "Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.", Nature, 2012 Dec 20;492(7429):423-7
D5	Gossypium raimondii	D5-4		PacBio	2019-09	Udall J A, Long E, Hanson C, et al. De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri[J]. G3: Genes, Genomes, Genetics, 2019, 9(10): 3079-3085.
D5	Gossypium raimondii	D5-8		Illumina HiSeq2500	2021-02	Grover, et al. Insights into the evolution of the New World diploid cottons (Gossypium, subgenus Houzingenia) based on genome sequencing. Genome Biol Evol. 2019 Jan 1;11(1):53-71.
D5	Gossypium raimondii	D502		Oxford Nanopore	2021-08	Wang, M. et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021 May 11:msab128.
A1	Gossypium herbaceum	Mutema		PacBio and Hi-C	2020-04	Huang, G. et al., Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nature Genetics. 2020.
D1	Gossypium thurberi	D1-35		Illumina	2018-11	Grover, et al. Insights into the evolution of the New World diploid cottons (Gossypium, subgenus Houzingenia) based on genome sequencing. Genome Biol Evol. 2019 Jan 1;11(1):53-71.
D10	Gossypium turneri	D10-3		PacBio	2019-09	Udall J A, Long E, Hanson C, et al. De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri[J]. G3: Genes, Genomes, Genetics, 2019, 9(10): 3079-3085.
G2	Gossypium australe	PA1801		PacBio	2019-09	Cai Y, Cai X, Wang Q, et al. Genome sequencing of the Australian wild diploid species Gossypium australe highlights disease resistance and delayed gland morphogenesis[J]. Plant biotechnology journal, 2019.
AD2	Gossypium barbadense	3-79		PacBio	2019-08	NA
AD2	Gossypium barbadense	3-79	HAU	Illumina, PacBio, BioNano, Hi-C	2018-12	Wang M, Tu L, Yuan D, et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense[J]. Nature genetics, 2019, 51(2): 224-229.
AD2	Gossypium barbadense	3-79		Illumina	2015-12	Yuan D, Tang Z, Wang M, et al. The genome sequence of Sea-Island cotton (Gossypium barbadense) provides insights into the allopolyploidization and development of superior spinnable fibres[J]. Scientific reports, 2015, 5: 17662.
AD2	Gossypium barbadense	Hai7124	ZJU	Illumina, BioNano, Hi-C	2019-03	Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748.
AD2	Gossypium barbadense	Pima90	P90HEBAU	Illumina, BioNano, Hi-C	2021-08
AD5	Gossypium darwinii	1808015.09			2019-08	NA
AD1	Gossypium hirsutum	Tm-1		Illumina	2015-04	Li et al., Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nature Biotechnology. 33, 524–530. 2015
AD1	Gossypium hirsutum	Tm-1	HAU	Illumina, PacBio, BioNano, Hi-C	2018-12	Wang M, Tu L, Yuan D, et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense[J]. Nature genetics, 2019, 51(2): 224-229.
AD1	Gossypium hirsutum	Tm-1	NAU	Illumina	2015-04	Zhang et al., Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nature Biotechnology. 33, 531–537. 2015
AD1	Gossypium hirsutum	Tm-1	JGI	Illumina, PacBio	2017-01	NA
AD1	Gossypium hirsutum	Tm-1	ZJU	Illumina, BioNano, Hi-C	2019-03	Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748.
AD1	Gossypium hirsutum	Tm-1	CRI	Illumina, PacBio, Hi-C	2019-07	Yang Z, Ge X, Yang Z, et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars[J]. Nature communications, 2019, 10(1): 1-13.
AD1	Gossypium hirsutum	ZM24		Illumina, PacBio, Hi-C	2019-07	Yang Z, Ge X, Yang Z, et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars[J]. Nature communications, 2019, 10(1): 1-13.
AD1	Gossypium hirsutum	NDM8	NDM8HEBAU	Illumina, PacBio, Hi-C	2021-08
AD4	Gossypium mustelinum	1408120.09			2019-09	NA
AD3	Gossypium tomentosum	7179.01			2020-01	NA

Gene ID

Because of limitations in the data providers' annotations, CottonFGD currently includes only protein-coding genes.

All gene IDs are imported directly from the data providers' annotations, so you can use them as-is to search other cotton databases (e.g., CottonGen). Their formats are listed below:

Gene ID Formats:
Species	Gene ID Format	Gene ID Example
G. hirsutum, CRI assembly	`Gh_[%3c]G[%04d]00`	Gh_A01G001100
G. hirsutum, HAU assembly	`Ghir_[%3c]G[%05d]0`	Ghir_A01G000120
G. hirsutum, JGI assembly	`Gohir.[%03c]G[%06d]`	Gohir.A01G001300
G. hirsutum, NAU assembly	`Gh_[%3c]G[%04d]`	Gh_A01G0001
G. hirsutum, ZJU assembly	`GH_[%3c]G[%04d]`	GH_D07G0123
G. hirsutum, NDM8HEBAU assembly	`GhM_[%3c]G[%04d]`	GhM_D12G0324
G. barbadense, HAU assembly	`Gbar_[%3c]G[%05d]0`	Gbar_A01G014970
G. barbadense, ZJU assembly	`GB_[%3c]G[%04d]`	GB_A01G0011
G. barbadense, P90HEBAU assembly	`GbM_[%3c]G[%04d]`	GbM_D13G0404
G. arboreum, CRI assembly	`Ga[%2c]G[%04d]`	Ga01G0012
G. raimondii, JGI assembly	`Gorai.[%03c]G[%06d]`	Gorai.001G000100
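
The formats in the table can be turned into regular expressions to guess which assembly a gene ID comes from. The following is a minimal sketch, not part of CottonFGD: it assumes that the 3-character field is a chromosome label such as A01 or D13 (or three digits for G. raimondii), so IDs on unplaced scaffolds may not match. The dictionary labels and the function name `identify_assembly` are chosen here for illustration only.

```python
import re

# Patterns transcribed from the Gene ID Formats table (assumption: the
# chromosome field is [AD] plus two digits, or digits only for diploids).
GENE_ID_PATTERNS = {
    "G. hirsutum, CRI": re.compile(r"Gh_[AD]\d{2}G\d{4}00"),
    "G. hirsutum, HAU": re.compile(r"Ghir_[AD]\d{2}G\d{5}0"),
    "G. hirsutum, JGI": re.compile(r"Gohir\.[AD]\d{2}G\d{6}"),
    "G. hirsutum, NAU": re.compile(r"Gh_[AD]\d{2}G\d{4}"),
    "G. hirsutum, ZJU": re.compile(r"GH_[AD]\d{2}G\d{4}"),
    "G. hirsutum, NDM8 (HEBAU)": re.compile(r"GhM_[AD]\d{2}G\d{4}"),
    "G. barbadense, HAU": re.compile(r"Gbar_[AD]\d{2}G\d{5}0"),
    "G. barbadense, ZJU": re.compile(r"GB_[AD]\d{2}G\d{4}"),
    "G. barbadense, Pima90 (HEBAU)": re.compile(r"GbM_[AD]\d{2}G\d{4}"),
    "G. arboreum, CRI": re.compile(r"Ga\d{2}G\d{4}"),
    "G. raimondii, JGI": re.compile(r"Gorai\.\d{3}G\d{6}"),
}


def identify_assembly(gene_id):
    """Return the labels of every format that the whole gene ID matches."""
    return [label for label, pattern in GENE_ID_PATTERNS.items()
            if pattern.fullmatch(gene_id)]


# The CRI and NAU formats share the "Gh_" prefix and differ only in the
# number of digits, so a full match is needed to tell them apart.
print(identify_assembly("Gh_A01G001100"))
print(identify_assembly("Gh_A01G0001"))
```

An empty list means the ID fits none of the listed formats.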

Transcript ID

Transcript IDs are formed by appending .1, .2, .3, ... to the gene ID. The transcript with suffix .1 is usually the longest of the isoforms (i.e., the principal transcript).

To maintain consistency across the four cotton species, only principal transcripts are analyzed in CottonFGD.
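
As an illustration of the naming rule, the sketch below splits a transcript ID into its gene ID and isoform number. It is not CottonFGD code; the function names are hypothetical, and it assumes that the isoform suffix is always the part after the last dot, which matters because some gene IDs (e.g., Gohir.A01G001300) contain a dot themselves.

```python
def split_transcript_id(transcript_id):
    """Split 'GENE_ID.N' into (gene_id, isoform_number)."""
    # rpartition splits on the LAST dot, so dots inside the gene ID survive
    gene_id, _, suffix = transcript_id.rpartition(".")
    if not gene_id or not suffix.isdigit():
        raise ValueError(f"not a transcript ID: {transcript_id!r}")
    return gene_id, int(suffix)


def is_principal(transcript_id):
    """True for the .1 isoform, treated here as the principal transcript."""
    return split_transcript_id(transcript_id)[1] == 1


print(split_transcript_id("Gohir.A01G001300.1"))
print(is_principal("Gh_A01G0001.2"))
```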

Gene Name and Description

Each gene's name and description are taken from its best homolog among Swiss-Prot proteins, because Swiss-Prot is non-redundant and manually reviewed. Homologs are identified with NCBI BLAST+ (Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications [J]. BMC Bioinformatics, 2009, 10: 421):

$ blastp -query [pep.fa] -db swissprotdb.fa -evalue 1e-5 -max_target_seqs 1 -max_hsps 1

For each Swiss-Prot entry, the "Gene names" value serves as the cotton gene's name (symbol), while the "Protein names" value serves as its description. Every gene with a Swiss-Prot hit has a description, but some have no gene name. For example, the best Swiss-Prot homolog of gene Gh_A01G0010 in G. hirsutum is P48504, which has the description (Protein names) Cytochrome b-c1 complex subunit 6 but no gene names.
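
The assignment rule can be written down as a small function. This is only a sketch under the assumption that the best Swiss-Prot hit is available as a dictionary keyed by the UniProt field names "Gene names" and "Protein names"; the function name `annotate_gene` is hypothetical.

```python
def annotate_gene(swissprot_entry):
    """Return (name, description) for a cotton gene from its best Swiss-Prot hit.

    swissprot_entry is None when BLAST reported no hit; in that case the
    gene gets neither a name nor a description.
    """
    if swissprot_entry is None:
        return None, None
    # "Gene names" may be empty even when the entry exists
    name = swissprot_entry.get("Gene names", "").strip() or None
    description = swissprot_entry["Protein names"]
    return name, description


# The P48504 case described above: a description but no gene name
p48504 = {"Gene names": "", "Protein names": "Cytochrome b-c1 complex subunit 6"}
print(annotate_gene(p48504))
```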

Transcript Structure

All transcript structures (exons, introns, coding regions and untranslated regions (UTRs)) are extracted from the data providers' annotations. Note that UTR annotations may be incomplete because of limitations in the source annotations.

As mentioned above, only the principal transcript of each gene is analyzed. Thus, all gene function annotations, such as homology, GO (Gene Ontology), InterPro and pathway, are based on the protein sequence of the principal transcript.

Protein Property Statistics

CottonFGD includes the following protein property statistics for each gene:

Residue Composition:
- The percentage of basic residues: His(H), Lys(K), Arg(R)
- The percentage of acidic residues: Asp(D), Glu(E)
All remaining residues are neutral. Comparing the percentages of basic and acidic residues gives a rough estimate of a protein's alkalinity or acidity.
Molecular Weight (kDa)
Charge
Isoelectric Point: the pH value at which this molecule carries no net electrical charge
Grand Average of Hydropathy: the sum of the hydropathy values of all amino acids divided by the protein length. A positive value indicates a hydrophobic protein.
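
To make the residue composition definition concrete, here is a minimal sketch that computes the three percentages directly from a protein sequence. It only illustrates the definitions in the list above and is not the program used by CottonFGD; the function name and the handling of a trailing stop symbol (*) are assumptions.

```python
BASIC = set("HKR")   # His, Lys, Arg
ACIDIC = set("DE")   # Asp, Glu


def residue_composition(protein):
    """Return the percentages of basic, acidic and neutral residues."""
    seq = protein.upper().rstrip("*")   # drop a trailing stop symbol, if any
    if not seq:
        raise ValueError("empty protein sequence")
    basic = 100.0 * sum(aa in BASIC for aa in seq) / len(seq)
    acidic = 100.0 * sum(aa in ACIDIC for aa in seq) / len(seq)
    return {"basic": basic, "acidic": acidic, "neutral": 100.0 - basic - acidic}


# Toy sequence: 3 basic, 2 acidic and 5 neutral residues out of 10
print(residue_composition("HKRDEAAAAA"))
```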

The values of residue composition, molecular weight, charge and isoelectric point are calculated using pepstats from the EMBOSS package (v6.5.7.0):

$ pepstats -sequence [pep.fa] -outfile [pep.out]

You can see an example of the output in the pepstats manual.

The Grand Average of Hydropathy is calculated using BioPerl (v1.6.924):

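use strict;
use warnings;
use Bio::SeqIO;
use Bio::Tools::SeqStats;   # Provides hydropathicity() (GRAVY) for protein sequences

# Read the protein FASTA file given as the first command-line argument
my $seqio_obj = Bio::SeqIO->new(-file => shift, -format => "fasta");
while (my $seq_obj = $seqio_obj->next_seq) {
    my $id = $seq_obj->display_id;
    my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seq_obj);
    # hydropathicity() throws on invalid input, so trap errors per sequence
    my $gravy = eval { $seq_stats->hydropathicity() };
    # Print one tab-separated line per protein; NA marks a failed calculation
    print join("\t", $id, defined $gravy ? $gravy : "NA"), "\n";
}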
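
For readers without BioPerl, the same quantity can be sketched in a few lines of Python. This assumes the Kyte-Doolittle hydropathy scale, which is the scale documented for BioPerl's hydropathicity(); the function name `gravy` is chosen here, and sequences containing ambiguous residues (such as X) are rejected with a KeyError rather than handled.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Grand Average of Hydropathy: summed hydropathy divided by length."""
    seq = protein.upper().rstrip("*")   # drop a trailing stop symbol, if any
    if not seq:
        raise ValueError("empty protein sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# A positive score suggests a hydrophobic protein, a negative one a hydrophilic protein
print(gravy("AILV") > 0, gravy("RKDE") < 0)
```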

Protein Domain, Gene Ontology & InterPro Items

The possible domain regions for each protein and the associated GO (Gene Ontology) / InterPro items are predicted using a locally installed copy of InterProScan (v5.16-55.0):

./interproscan.sh -dp -f tsv -goterms -i [pep.fa]
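
The TSV output can then be summarized per protein. The sketch below assumes the standard InterProScan TSV column order (protein accession first, InterPro accession in column 12, and "|"-separated GO terms in column 14, the latter two being optional); the function name and the example row are made up for illustration and do not come from CottonFGD.

```python
from collections import defaultdict


def collect_annotations(lines):
    """Map protein ID -> {"interpro": set of IPR IDs, "go": set of GO IDs}."""
    result = defaultdict(lambda: {"interpro": set(), "go": set()})
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:               # skip blank or malformed lines
            continue
        entry = result[cols[0]]
        # Columns 12 and 14 (1-based) are optional; "" or "-" means no value
        if len(cols) > 11 and cols[11] not in ("", "-"):
            entry["interpro"].add(cols[11])
        if len(cols) > 13 and cols[13] not in ("", "-"):
            entry["go"].update(cols[13].split("|"))
    return dict(result)


# Made-up row that only reuses the ID formats shown in this document
row = "\t".join(["Gh_A01G0001.1", "md5", "250", "Pfam", "PF03061", "desc",
                 "10", "90", "1.0E-10", "T", "01-01-2016",
                 "IPR004299", "ipr desc", "GO:0006629"])
print(collect_annotations([row]))
```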

Currently InterProScan includes 15 domain databases:

Domain Databases and Their ID Formats
Database	Description	Accession ID Format	Accession ID Example
Coils (2.2.1)	Prediction of Coiled Coil Regions in Proteins	Coil
Gene3D (3.5.0)	Structural assignment for whole genes and genomes using the CATH domain structure database	`G3DSA:[%d.%d....]`	G3DSA:3.10.330.20
Hamap (201511.02)	High-quality Automated and Manual Annotation of Microbial Proteomes	`MF_[%05d]`	MF_01007
PANTHER (10.0)	The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.	`PTHR[%5d]` `PTHR[%5d]:SF[%d]`	PTHR12133 PTHR24279:SF100
Pfam (28.0)	A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)	`PF[%05d]`	PF03061
Phobius (1.01)	A combined transmembrane topology and signal peptide predictor	CYTOPLASMIC_DOMAIN NON_CYTOPLASMIC_DOMAIN SIGNAL_PEPTIDE_C_REGION SIGNAL_PEPTIDE_H_REGION SIGNAL_PEPTIDE_N_REGION SIGNAL_PEPTIDE TRANSMEMBRANE
PIRSF (3.01)	The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.	`PIRSF[%06d]`	PIRSF023803
PRINTS (42.0)	A fingerprint is a group of conserved motifs used to characterise a protein family	`PR[%05d]`	PR00109
ProDom (2006.1)	ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.	`PD[%06d]`	PD005155
ProSite (20.113)	PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them	`PS[%05d]`	PS00036
SignalP (4.1)	SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes, gram-positive prokaryotes or gram-negative prokaryotes	SignalP-noTM SignalP-TM
SMART (6.2)	SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs	`SM[%05d]`	SM00338
SUPERFAMILY (1.75)	SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes	`SSF[%d]`	SSF53335
TIGRFAM (15.0)	TIGRFAMs are protein families based on Hidden Markov Models or HMMs	`TIGR[%05d]`	TIGR00006
TMHMM (2.0c)	Prediction of transmembrane helices in proteins	Tmhelix

The accession ID formats of GO (Gene Ontology) and InterPro items are listed as follows:

Database	Accession ID Format	Accession ID Example
Gene Ontology	`GO:[%07d]`	GO:0006629
InterPro	`IPR[%06d]`	IPR004299

Homology

Currently CottonFGD includes homology information for 22 other representative plant species from all the main categories: Eudicots, Monocots, Acrogymnospermae, Lycopodiidae, Bryophyta and Chlorophyta. Homology information is available only for genes in the G. hirsutum, CRI assembly.

The best homolog of each cotton gene is identified using NCBI BLAST+, in the same way as for gene names.

KEGG Pathway

The KEGG pathways associated with each gene are determined in two steps. First, all cotton genes are assigned to KEGG Orthology using the KEGG Automatic Annotation Server, with all available plant species selected as the "GENES data set". Each gene is assigned only one KEGG Orthology item (ID format: K[%05d]). Then, the KEGG Orthology item is mapped to its associated KEGG Pathways (ID format: map[%05d] or ko[%05d]) and KEGG Modules (ID format: M[%05d]).
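
The two-step mapping can be sketched as two dictionary lookups. This is an illustration only: the lookup tables, their contents and the function name are placeholders, and in practice the gene-to-KO table would come from the annotation server's output.

```python
import re

# ID formats as described above
KO_ID = re.compile(r"K\d{5}")
PATHWAY_ID = re.compile(r"(?:map|ko)\d{5}")
MODULE_ID = re.compile(r"M\d{5}")


def pathways_for_gene(gene_id, gene_to_ko, ko_to_pathways):
    """Step 1: find the gene's single KO item. Step 2: expand it to pathways."""
    ko = gene_to_ko.get(gene_id)
    if ko is None:                       # gene has no KEGG Orthology assignment
        return []
    if not KO_ID.fullmatch(ko):
        raise ValueError(f"unexpected KEGG Orthology ID: {ko!r}")
    return sorted(p for p in ko_to_pathways.get(ko, [])
                  if PATHWAY_ID.fullmatch(p))


# Placeholder tables, not real CottonFGD annotations
gene_to_ko = {"Gh_A01G0001": "K00001"}
ko_to_pathways = {"K00001": ["map00010", "ko00010"]}
print(pathways_for_gene("Gh_A01G0001", gene_to_ko, ko_to_pathways))
```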

Gene Expression

The expression pattern of each gene (i.e., its principal transcript) is measured using RNA-seq datasets from SRP166405 (Hu Y, Chen J, Fang L, et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton[J]. Nature genetics, 2019, 51(4): 739-748). This dataset contains 49 samples, each with 1 to 3 biological replicates. Expression data are available only for genes in the G. hirsutum, CRI assembly.

RNA-seq analysis
Type	Sample	Num. of biological Replicates	SRA IDs
Tissue and Organ	anther	3	SRR8089834; SRR8089835; SRR8089836
Tissue and Organ	bract	3	SRR8089862; SRR8089861; SRR8089833
Tissue and Organ	filament	3	SRR8089837; SRR8089838; SRR8089839
Tissue and Organ	leaf	3	SRR8089855; SRR8090014; SRR8089903
Tissue and Organ	petal	3	SRR8089902; SRR8089868; SRR8089867
Tissue and Organ	pistil	1	SRR8089840
Tissue and Organ	root	3	SRR8089897; SRR8089896; SRR8089895
Tissue and Organ	sepal	3	SRR8089863; SRR8089866; SRR8089865
Tissue and Organ	stem	3	SRR8089977; SRR8089975; SRR8089969
Tissue and Organ	torus	3	SRR8089870; SRR8089869; SRR8089864
Fiber development	fiber at 10DPA	3	SRR8090044; SRR8090041; SRR8090042
Fiber development	fiber at 15DPA	3	SRR8090046; SRR8090049; SRR8090050
Fiber development	fiber at 20DPA	3	SRR8090004; SRR8090007; SRR8090006
Fiber development	fiber at 25DPA	1	SRR8090010
Drought stress	leaves under drought stress for 1h	3	SRR8089985; SRR8089984; SRR8089987
Drought stress	leaves under drought stress for 3h	3	SRR8089986; SRR8089989; SRR8089988
Drought stress	leaves under drought stress for 6h	3	SRR8089991; SRR8089990; SRR8089983
Drought stress	leaves under drought stress for 12h	3	SRR8089982; SRR8090019; SRR8090020
Drought stress	leaves under drought stress for 24h	3	SRR8090021; SRR8090022; SRR8090015
Cold stress	leaves cold-treated for 1h	3	SRR8089823; SRR8089824; SRR8089825
Cold stress	leaves cold-treated for 3h	3	SRR8089826; SRR8089827; SRR8089828
Cold stress	leaves cold-treated for 6h	3	SRR8089829; SRR8089830; SRR8089831
Cold stress	leaves cold-treated for 12h	3	SRR8089832; SRR8089924; SRR8089923
Cold stress	leaves cold-treated for 24h	3	SRR8089922; SRR8089921; SRR8089920
control for stress	leaves control 0h	3	SRR8090035; SRR8090032; SRR8090033
control for stress	leaves control 1h	3	SRR8090030; SRR8090031; SRR8090039
control for stress	leaves control 3h	3	SRR8090040; SRR8090074; SRR8090073
control for stress	leaves control 6h	3	SRR8090076; SRR8090075; SRR8090070
control for stress	leaves control 12h	3	SRR8090069; SRR8090072; SRR8090071
control for stress	leaves control 24h	2	SRR8090078; SRR8090077
Heat stress	leaves heat-treated for 1h	3	SRR8089919; SRR8089918; SRR8089917
Heat stress	leaves heat-treated for 3h	3	SRR8089916; SRR8089915; SRR8089953
Heat stress	leaves heat-treated for 6h	3	SRR8089954; SRR8089951; SRR8089952
Heat stress	leaves heat-treated for 12h	3	SRR8089957; SRR8089958; SRR8089955
Heat stress	leaves heat-treated for 24h	3	SRR8089956; SRR8089949; SRR8089950
salt stress	leaves salt-treated for 1h	3	SRR8090016; SRR8090017; SRR8090018
salt stress	leaves salt-treated for 3h	3	SRR8090026; SRR8090027; SRR8090056
salt stress	leaves salt-treated for 6h	3	SRR8090055; SRR8090054; SRR8090053
salt stress	leaves salt-treated for 12h	2	SRR8090060; SRR8090059
salt stress	leaves salt-treated for 24h	3	SRR8090058; SRR8090057; SRR8090064
ovule development	ovule at -3DPA	2	SRR8089841; SRR8089842
ovule development	ovule at 0DPA	3	SRR8090087; SRR8090086; SRR8090085
ovule development	ovule at 1DPA	3	SRR8090084; SRR8090083; SRR8090082
ovule development	ovule at 3DPA	2	SRR8090081; SRR8090080
ovule development	ovule at 5DPA	3	SRR8090089; SRR8090088; SRR8090043
ovule development	ovule at 10DPA	3	SRR8090047; SRR8090048; SRR8090045
ovule development	ovule at 15DPA	3	SRR8090003; SRR8090002; SRR8090005
ovule development	ovule at 20DPA	3	SRR8090009; SRR8090008; SRR8090011
ovule development	ovule at 25DPA	1	SRR8089973

TPM values are calculated using salmon (v1.1.0) (Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.).

salmon quant -i cDNA.Ghir.CRI -l A -1 reads.1.fq.gz -2 reads.2.fq.gz --seqBias --gcBias --validateMappings  -o output

The generated quant.sf files are available on the Download page.
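
A downloaded quant.sf file is a tab-separated table with the columns Name, Length, EffectiveLength, TPM and NumReads. The sketch below reads the TPM column and averages it across the replicates of one sample. Averaging replicates is an assumption made for this example, not a description of how CottonFGD summarizes the data, and the function names and file paths are hypothetical.

```python
import csv
from statistics import mean


def read_tpm(path):
    """Return {transcript name: TPM} from one salmon quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row["TPM"]) for row in reader}


def mean_tpm(replicate_paths):
    """Average TPM per transcript over the quant.sf files of one sample."""
    tables = [read_tpm(path) for path in replicate_paths]
    # All replicates are quantified against the same index, so every
    # table is expected to contain the same transcript names.
    return {name: mean(table[name] for table in tables) for name in tables[0]}


# Usage (hypothetical file names for the three anther replicates):
# anther = mean_tpm(["SRR8089834/quant.sf", "SRR8089835/quant.sf", "SRR8089836/quant.sf"])
```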

Web Browser Compatibility

CottonFGD is designed to be compatible with all modern web browsers (such as Mozilla Firefox, Google Chrome, Safari and Microsoft Edge) on a variety of devices (PC, tablet and mobile). Apart from some subtle differences in front-end appearance, recent versions of Microsoft Internet Explorer (later than version 9.0) are also acceptable. Old browser versions (such as Internet Explorer lower than 8.0) are strongly discouraged, as they are likely to cause many avoidable bugs.

Tips

"JavaScript" must be turned on; otherwise the site will not work. It is enabled by default in almost all browsers.
"Cookies" should preferably be turned on, as they are used to "remember" your previous settings in several tools. They are enabled by default in almost all browsers.

Overview

Cotton Species

Diploids: A₂ and D₅

Tetraploids: AD₁ and AD₂

Genome Assembly

Gene Models

Gene ID

Transcript ID

Gene Name and Description

Transcript Structure

Gene Function Annotation

Protein Property Statistics

Protein Domain, Gene Ontology & InterPro Items

Homology

KEGG Pathway

Gene Expression

Web Browser Compatibility

Tips

References
