General schema for fitting any experimental data into XGAP
This tutorial explains the general schema for converting any experimental data into the XGAP model. It shows that all standard annotations fit into annotation types such as 'Sample', 'Strain' and 'Spot', and that all experiment-specific data fit into data matrices, optionally referring to 'Factor' or 'Phenotype'. Note that when using a new biotechnology you may want to add a new core annotation type (as has been done before with MassPeak, NMR, etc.). The MIBBI recommendations are helpful when deciding which new standard annotations are needed. See AddingDataTypes for the technical procedure.
The general schema is demonstrated by the example of a file (shown below) describing a rather complex multifactorial experiment.
Original data file:
lineID | type_plante | BLOCK | BLOCK_line | env | FLOdate | DIAM1 | DIAM2 |
90 | epiRIL | 2 | 1 | c | 05/01/2007 | NA | NA |
col | mutant | 1 | 1 | s | 05/04/2007 | NA | NA |
381 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
497 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
432 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
In this example tabular (Excel) data is shown, but the strategy applies to other formats such as XML as well. Because no standard column names are used, the help of the original data provider is needed to understand the data. Reformatting to XGAP can overcome this problem.
Step 1: Identify XGAP entities and fields
A practical way to map each data element to XGAP is to add two extra rows above the existing column headers, then use them to record the XGAP entity and field that each column maps to, as shown in the first two rows of the example below.
Example 1: re-annotated as XGAP files
Strain | Strain | Factor | Factor | Factor | Phenotype | Phenotype | Phenotype |
Strain.name | Strain.type | Factor.name | Factor.name | Factor.name | Phenotype.name | Phenotype.name | Phenotype.name |
lineID | type_plante | BLOCK | BLOCK_line | env | FLOdate | DIAM1 | DIAM2 |
90 | epiRIL | 2 | 1 | c | 05/01/2007 | NA | NA |
col | mutant | 1 | 1 | s | 05/04/2007 | NA | NA |
381 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
497 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
432 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
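The annotation step above can be sketched in a few lines of Python. This is only an illustration: the name COLUMN_MAP and the function annotation_rows are hypothetical helpers, not part of XGAP, and the mapping itself still has to be agreed with the data provider.

```python
# Hypothetical mapping from the original column headers to XGAP "Entity.field" names.
COLUMN_MAP = {
    "lineID": "Strain.name",
    "type_plante": "Strain.type",
    "BLOCK": "Factor.name",
    "BLOCK_line": "Factor.name",
    "env": "Factor.name",
    "FLOdate": "Phenotype.name",
    "DIAM1": "Phenotype.name",
    "DIAM2": "Phenotype.name",
}


def annotation_rows(headers, column_map):
    """Return the two extra header rows: XGAP entities and XGAP fields."""
    fields = [column_map[h] for h in headers]
    entities = [f.split(".")[0] for f in fields]
    return entities, fields


headers = ["lineID", "type_plante", "BLOCK", "BLOCK_line",
           "env", "FLOdate", "DIAM1", "DIAM2"]
entities, fields = annotation_rows(headers, COLUMN_MAP)
print(" | ".join(entities))
print(" | ".join(fields))
```

Keeping the mapping in one dictionary makes it easy to review with the data provider before any data is moved.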
Step 2: Identify data matrices
Not all columns will map to XGAP annotation fields like those of Probe or Sample. Typically, if an XGAP field is repeated across several columns, this suggests a data matrix. In the example this holds for Factor.name and Phenotype.name. For each repeated XGAP field a data matrix can be defined as shown below.
Example 2: annotated data matrices
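Strain | Strain | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] |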
Strain.name | Strain.type | Factor.name | Factor.name | Factor.name | Phenotype.name | Phenotype.name | Phenotype.name |
lineID | type_plante | BLOCK | BLOCK_line | env | FLOdate | DIAM1 | DIAM2 |
90 | epiRIL | 2 | 1 | c | 05/01/2007 | NA | NA |
col | mutant | 1 | 1 | s | 05/04/2007 | NA | NA |
381 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
497 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
432 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
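The rule of thumb from Step 2 (a repeated XGAP field suggests a data matrix) can be checked mechanically. The sketch below assumes the field row from Step 1 is available as a list; find_matrix_fields is a hypothetical helper name.

```python
from collections import Counter


def find_matrix_fields(fields):
    """Return the XGAP fields that occur more than once, in first-seen order."""
    counts = Counter(fields)
    return [f for f in dict.fromkeys(fields) if counts[f] > 1]


fields = ["Strain.name", "Strain.type",
          "Factor.name", "Factor.name", "Factor.name",
          "Phenotype.name", "Phenotype.name", "Phenotype.name"]
matrix_fields = find_matrix_fields(fields)
print(matrix_fields)
```

Fields that occur only once (here Strain.name and Strain.type) stay ordinary annotation columns.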
Step 3: Add missing columns
First identify what data entities are described in each row. In this example each row describes 'Samples', although no sample identifier was provided. Then add missing but required columns for entities used. In this example the entities used are Sample, Strain, Factor and Phenotype. The required column 'Sample.name' was missing and is added.
Example 3: added missing column Sample.name
Sample | Strain | Strain | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] |
Sample.name | Strain.name | Strain.type | Factor.name | Factor.name | Factor.name | Phenotype.name | Phenotype.name | Phenotype.name |
| lineID | type_plante | BLOCK | BLOCK_line | env | FLOdate | DIAM1 | DIAM2 |
sample1 | 90 | epiRIL | 2 | 1 | c | 05/01/2007 | NA | NA |
sample2 | col | mutant | 1 | 1 | s | 05/04/2007 | NA | NA |
sample3 | 381 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
sample4 | 497 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
sample5 | 432 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
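Generating the missing Sample.name column is straightforward once the rows are loaded. The following is a minimal sketch, assuming each row is a dictionary keyed by the original headers and that a running number with the prefix "sample" is an acceptable identifier; add_sample_names is a hypothetical helper.

```python
def add_sample_names(rows, prefix="sample"):
    """Prepend a generated Sample.name (sample1, sample2, ...) to every row."""
    return [
        {"Sample.name": f"{prefix}{i}", **row}
        for i, row in enumerate(rows, start=1)
    ]


rows = [
    {"lineID": "90", "type_plante": "epiRIL"},
    {"lineID": "col", "type_plante": "mutant"},
]
named = add_sample_names(rows)
for row in named:
    print(row)
```

If the data provider has real sample identifiers, those are preferable to generated ones.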
Step 4: Add cross-reference columns
If fields from multiple XGAP entities are annotated within one file, there is usually an implicit cross-reference (xref) between them. In the example there is a reference between Sample and Strain. In the example below a column is added that defines this xref explicitly, pointing from Sample.strain_name to Strain.name.
Example 4: added xref columns
Sample | Sample | Strain | Strain | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Factors”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] | Datamatrix[“Phenotypes”] |
Sample.name | Sample.strain_name | Strain.name | Strain.type | Factor.name | Factor.name | Factor.name | Phenotype.name | Phenotype.name | Phenotype.name |
| | lineID | type_plante | BLOCK | BLOCK_line | env | FLOdate | DIAM1 | DIAM2 |
sample1 | 90 | 90 | epiRIL | 2 | 1 | c | 05/01/2007 | NA | NA |
sample2 | col | col | mutant | 1 | 1 | s | 05/04/2007 | NA | NA |
sample3 | 381 | 381 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
sample4 | 497 | 497 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
sample5 | 432 | 432 | epiRIL | 2 | 1 | c | 04/25/2007 | NA | NA |
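Making the xref explicit, and checking that every reference resolves, can be sketched as follows. Both add_strain_xref and dangling_xrefs are hypothetical helpers written for this tutorial; they are not XGAP functions.

```python
def add_strain_xref(rows):
    """Copy Strain.name into a new Sample.strain_name column on every row."""
    return [{**row, "Sample.strain_name": row["Strain.name"]} for row in rows]


def dangling_xrefs(sample_rows, strain_rows):
    """Return the names of samples whose strain_name matches no Strain.name."""
    known = {s["Strain.name"] for s in strain_rows}
    return [r["Sample.name"] for r in sample_rows
            if r["Sample.strain_name"] not in known]


rows = [
    {"Sample.name": "sample1", "Strain.name": "90", "Strain.type": "epiRIL"},
    {"Sample.name": "sample2", "Strain.name": "col", "Strain.type": "mutant"},
]
linked = add_strain_xref(rows)
missing = dangling_xrefs(linked, rows)
print(missing)
```

The check is trivially satisfied here because the xref is copied from the same row, but it becomes useful once the sample and strain annotations live in separate files.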
Step 5: Split the data in separate XGAP files
Finally, the provided data file can be split into its respective XGAP *.txt files. Note that the annotation files use the XGAP headers (e.g. Sample.name is an XGAP field) while the matrix files use the original headers because these are instances of phenotype/factor names (e.g. FLOdate is a value in the Phenotype.name column).
sample.txt
Sample.name | Sample.strain_name |
sample1 | 90 |
sample2 | col |
sample3 | 381 |
sample4 | 497 |
sample5 | 432 |
strain.txt
Strain.name | Strain.type |
90 | epiRIL |
col | mutant |
381 | epiRIL |
497 | epiRIL |
432 | epiRIL |
phenotype.txt (N.B. descriptions of each phenotype were added with the help of the data provider.)
name | description |
FLOdate | nb of days between sowing and flowering |
DIAM1 | longest rosette diameter |
DIAM2 | rosette diameter perpendicular to DIAM1 |
factor.txt (N.B. descriptions of each factor were added with the help of the data provider.)
name | description |
BLOCK | in our experiment, we had 6 blocks (each block corresponds to a different date of sowing…so 6 blocks = 6 dates of sowing) |
BLOCK_line | line position within a block (11 lines) |
env | 2 levels of competition: with and without competition |
data/factordata.txt
| BLOCK | BLOCK_line | env |
sample1 | 2 | 1 | c |
sample2 | 1 | 1 | s |
sample3 | 2 | 1 | c |
sample4 | 2 | 1 | c |
sample5 | 2 | 1 | c |
data/phenotypedata.txt
| FLOdate | DIAM1 | DIAM2 |
sample1 | 05/01/2007 | 23 | 29 |
sample2 | 05/04/2007 | 21 | 31 |
sample3 | 04/25/2007 | 25 | 33 |
sample4 | 04/25/2007 | NA | 35 |
sample5 | 04/25/2007 | NA | NA |
N.B. It has been proposed to make a wizard that automates this splitting procedure. See #22.
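Until such a wizard exists, the splitting in Step 5 can be approximated with a short script. The sketch below is not the XGAP importer. It assumes tab-delimited output, an empty first header cell above the sample names in the matrix files, and rows that already carry the columns added in Steps 3 and 4. The names split_tables and to_text are hypothetical. phenotype.txt and factor.txt are not generated, because their descriptions have to come from the data provider.

```python
import csv
import io

FACTORS = ["BLOCK", "BLOCK_line", "env"]
PHENOTYPES = ["FLOdate", "DIAM1", "DIAM2"]


def split_tables(rows):
    """Split annotated rows into one table (list of rows) per XGAP file."""
    tables = {
        "sample.txt": [["Sample.name", "Sample.strain_name"]],
        "strain.txt": [["Strain.name", "Strain.type"]],
        # Matrix files: empty corner cell, then the original headers (assumption).
        "data/factordata.txt": [[""] + FACTORS],
        "data/phenotypedata.txt": [[""] + PHENOTYPES],
    }
    seen_strains = set()
    for row in rows:
        name = row["Sample.name"]
        tables["sample.txt"].append([name, row["Sample.strain_name"]])
        if row["Strain.name"] not in seen_strains:  # list each strain once
            seen_strains.add(row["Strain.name"])
            tables["strain.txt"].append([row["Strain.name"], row["Strain.type"]])
        tables["data/factordata.txt"].append([name] + [row[f] for f in FACTORS])
        tables["data/phenotypedata.txt"].append([name] + [row[p] for p in PHENOTYPES])
    return tables


def to_text(table):
    """Serialise one table as tab-delimited text (delimiter is an assumption)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(table)
    return buf.getvalue()


rows = [
    {"Sample.name": "sample1", "Sample.strain_name": "90", "Strain.name": "90",
     "Strain.type": "epiRIL", "BLOCK": "2", "BLOCK_line": "1", "env": "c",
     "FLOdate": "05/01/2007", "DIAM1": "NA", "DIAM2": "NA"},
    {"Sample.name": "sample2", "Sample.strain_name": "col", "Strain.name": "col",
     "Strain.type": "mutant", "BLOCK": "1", "BLOCK_line": "1", "env": "s",
     "FLOdate": "05/04/2007", "DIAM1": "NA", "DIAM2": "NA"},
]
tables = split_tables(rows)
for filename, table in tables.items():
    print(filename)
    print(to_text(table))
```

Building the tables in memory first keeps the splitting logic separate from file handling, so the output can be inspected before anything is written to disk.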