Version 1 (modified by trac, 14 years ago)

--

General schema for fitting any experimental data into XGAP

This tutorial explains the general schema for converting any experimental data into the XGAP model. It shows that all standard annotations can go into annotation types like 'Sample', 'Strain' and 'Spot', and that all experiment-specific data can go into data matrices, optionally referring to 'Factor' or 'Phenotype'. Note that when using a new biotechnology you may want to add a new core annotation type (as has been done before with MassPeak, NMR, etc). The MIBBI recommendations are helpful in deciding which new standard annotations are needed. See AddingDataTypes for the technical procedure.

The general schema is demonstrated by the example of a file (shown below) describing a rather complex multifactorial experiment.

Original data file:

lineID       type_plante  BLOCK        BLOCK_line   env          FLOdate         DIAM1           DIAM2
90           epiRIL       2            1            c            05/01/2007      NA              NA
col          mutant       1            1            s            05/04/2007      NA              NA
381          epiRIL       2            1            c            04/25/2007      NA              NA
497          epiRIL       2            1            c            04/25/2007      NA              NA
432          epiRIL       2            1            c            04/25/2007      NA              NA

In this example tabular (Excel) data is shown, but the strategy applies to other formats like XML as well. As no standard column names are used, the help of the original data provider is needed to understand the data. Reformatting to XGAP can overcome this problem.

Step 1: Identify XGAP entities and fields

A practical procedure to map each data element to XGAP is to add two additional rows on top of the existing column names. Then use those rows to define the XGAP entity and field that each column maps to, as shown in the first two rows below.

Example 1: re-annotated as XGAP files

Strain                    Factor                                 Phenotype
Strain.name  Strain.type  Factor.name  Factor.name  Factor.name  Phenotype.name  Phenotype.name  Phenotype.name
lineID       type_plante  BLOCK        BLOCK_line   env          FLOdate         DIAM1           DIAM2
90           epiRIL       2            1            c            05/01/2007      NA              NA
col          mutant       1            1            s            05/04/2007      NA              NA
381          epiRIL       2            1            c            04/25/2007      NA              NA
497          epiRIL       2            1            c            04/25/2007      NA              NA
432          epiRIL       2            1            c            04/25/2007      NA              NA
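
The mapping of Step 1 can also be written down as a small lookup table. The sketch below is only an illustration and not part of XGAP itself: the names COLUMN_MAP and annotation_rows are hypothetical, and the mapping is copied from the two annotation rows of Example 1.

```python
# Hypothetical lookup table: original column name -> (XGAP entity, XGAP field).
# The pairs are copied from the two annotation rows of Example 1.
COLUMN_MAP = {
    "lineID": ("Strain", "name"),
    "type_plante": ("Strain", "type"),
    "BLOCK": ("Factor", "name"),
    "BLOCK_line": ("Factor", "name"),
    "env": ("Factor", "name"),
    "FLOdate": ("Phenotype", "name"),
    "DIAM1": ("Phenotype", "name"),
    "DIAM2": ("Phenotype", "name"),
}


def annotation_rows(header, column_map):
    """Return the two extra header rows (entities, entity.field) for a header."""
    entities = [column_map[col][0] for col in header]
    fields = [f"{column_map[col][0]}.{column_map[col][1]}" for col in header]
    return entities, fields


header = ["lineID", "type_plante", "BLOCK", "BLOCK_line", "env",
          "FLOdate", "DIAM1", "DIAM2"]
entities, fields = annotation_rows(header, COLUMN_MAP)

# Print the re-annotated header block: entity row, field row, original names.
print("\t".join(entities))
print("\t".join(fields))
print("\t".join(header))
```

Keeping the mapping in one place makes it easy to review together with the original data provider before any data is moved.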

Step 2: identify data matrices

Not all columns will map to fields of XGAP annotation types like Probe or Sample. Typically, if an XGAP field is repeated across several columns, then this suggests a data matrix. In the example this holds for Factor.name and Phenotype.name. For each repeated XGAP field a data matrix can be defined as shown below.

Example 2: annotated data matrices

Strain                    Datamatrix[“Factors”]                  Datamatrix[“Phenotypes”]
Strain.name  Strain.type  Factor.name  Factor.name  Factor.name  Phenotype.name  Phenotype.name  Phenotype.name
lineID       type_plante  BLOCK        BLOCK_line   env          FLOdate         DIAM1           DIAM2
90           epiRIL       2            1            c            05/01/2007      NA              NA
col          mutant       1            1            s            05/04/2007      NA              NA
381          epiRIL       2            1            c            04/25/2007      NA              NA
497          epiRIL       2            1            c            04/25/2007      NA              NA
432          epiRIL       2            1            c            04/25/2007      NA              NA
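
The rule of thumb from Step 2 (a repeated XGAP field suggests a data matrix) can be checked mechanically. This is a minimal sketch under that assumption; the function name find_data_matrices is hypothetical and the field row is the one from Example 2.

```python
from collections import defaultdict


def find_data_matrices(header, fields):
    """Group original column names by XGAP field.

    A field that annotates more than one column is reported as a
    candidate data matrix; fields used once are left out.
    """
    columns_by_field = defaultdict(list)
    for col, field in zip(header, fields):
        columns_by_field[field].append(col)
    return {f: cols for f, cols in columns_by_field.items() if len(cols) > 1}


header = ["lineID", "type_plante", "BLOCK", "BLOCK_line", "env",
          "FLOdate", "DIAM1", "DIAM2"]
fields = ["Strain.name", "Strain.type",
          "Factor.name", "Factor.name", "Factor.name",
          "Phenotype.name", "Phenotype.name", "Phenotype.name"]

matrices = find_data_matrices(header, fields)
for field, cols in matrices.items():
    print(field, "->", ", ".join(cols))
```

The result is only a suggestion: whether a group of columns really belongs in one matrix still has to be decided with knowledge of the experiment.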

Step 3: Add missing columns

First identify what data entities are described in each row. In this example each row describes 'Samples', although no sample identifier was provided. Then add missing but required columns for entities used. In this example the entities used are Sample, Strain, Factor and Phenotype. The required column 'Sample.name' was missing and is added.

Example 3: added missing column Sample.name

Sample       Strain                    Datamatrix[“Factors”]                  Datamatrix[“Phenotypes”]
Sample.name  Strain.name  Strain.type  Factor.name  Factor.name  Factor.name  Phenotype.name  Phenotype.name  Phenotype.name
             lineID       type_plante  BLOCK        BLOCK_line   env          FLOdate         DIAM1           DIAM2
sample1      90           epiRIL       2            1            c            05/01/2007      NA              NA
sample2      col          mutant       1            1            s            05/04/2007      NA              NA
sample3      381          epiRIL       2            1            c            04/25/2007      NA              NA
sample4      497          epiRIL       2            1            c            04/25/2007      NA              NA
sample5      432          epiRIL       2            1            c            04/25/2007      NA              NA
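
Adding the missing Sample.name column amounts to generating one identifier per row. The sketch below assumes the sample1, sample2, ... naming used in Example 3; the function name add_sample_names is hypothetical, and any other unique naming scheme would do as well.

```python
def add_sample_names(rows, prefix="sample"):
    """Prepend a generated Sample.name (sample1, sample2, ...) to every row."""
    return [[f"{prefix}{i}", *row] for i, row in enumerate(rows, start=1)]


# First two data rows of the original file.
rows = [
    ["90", "epiRIL", "2", "1", "c", "05/01/2007", "NA", "NA"],
    ["col", "mutant", "1", "1", "s", "05/04/2007", "NA", "NA"],
]

named_rows = add_sample_names(rows)
for row in named_rows:
    print("\t".join(row))
```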

Step 4: Add cross-reference columns

If fields from multiple XGAP entities are annotated within one file, then there usually is an implicit cross-reference (xref) between them. In the example there is a reference between Sample and Strain. In the example below a column is added that defines this xref explicitly, using the xref from Sample.strain_name to Strain.name.

Example 4: added xref columns

Sample                           Strain                    Datamatrix[“Factors”]                  Datamatrix[“Phenotypes”]
Sample.name  Sample.strain_name  Strain.name  Strain.type  Factor.name  Factor.name  Factor.name  Phenotype.name  Phenotype.name  Phenotype.name
                                 lineID       type_plante  BLOCK        BLOCK_line   env          FLOdate         DIAM1           DIAM2
sample1      90                  90           epiRIL       2            1            c            05/01/2007      NA              NA
sample2      col                 col          mutant       1            1            s            05/04/2007      NA              NA
sample3      381                 381          epiRIL       2            1            c            04/25/2007      NA              NA
sample4      497                 497          epiRIL       2            1            c            04/25/2007      NA              NA
sample5      432                 432          epiRIL       2            1            c            04/25/2007      NA              NA

Step 5: Split the data in separate XGAP files

Finally, the provided data file can be split into the respective XGAP *.txt files. Note that the annotation files use the XGAP headers (e.g. Sample.name is an XGAP field), while the matrix files use the original headers because these are instances of phenotype/factor names (e.g. FLOdate is a row in the Phenotype.name column).

sample.txt

Sample.name Sample.strain_name
sample1 90
sample2 col
sample3 381
sample4 497
sample5 432

strain.txt

Strain.name Strain.type
90 epiRIL
col mutant
381 epiRIL
497 epiRIL
432 epiRIL

phenotype.txt

N.B. With the help of the data provider we have added descriptions of each phenotype.

name description
FLOdate number of days between sowing and flowering
DIAM1 longest rosette diameter
DIAM2 rosette diameter perpendicular to DIAM1

factor.txt

N.B. With the help of the data provider we have added descriptions of each factor.

name description
BLOCK in our experiment, we had 6 blocks (each block corresponds to a different date of sowing, so 6 blocks = 6 dates of sowing)
BLOCK_line line position within a block (11 lines)
env 2 levels of competition: with and without competition

data/factordata.txt

BLOCK BLOCK_line env
sample1 2 1 c
sample2 1 1 s
sample3 2 1 c
sample4 2 1 c
sample5 2 1 c

data/phenotypedata.txt

FLOdate DIAM1 DIAM2
sample1 05/01/2007 23 29
sample2 05/04/2007 21 31
sample3 04/25/2007 25 33
sample4 04/25/2007 NA 35
sample5 04/25/2007 NA NA
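
The splitting of Step 5 can be sketched in a few lines. This is an illustration only, not the proposed wizard: the function names are hypothetical, tab-separated output is an assumption, and the matrix header row deliberately has no cell above the sample names, following the listings above. The input rows are taken from Example 4.

```python
import csv
import io

# Original column names that go into each data matrix (from Step 2).
FACTORS = ["BLOCK", "BLOCK_line", "env"]
PHENOTYPES = ["FLOdate", "DIAM1", "DIAM2"]


def to_text(header, rows):
    """Render a header and rows as tab-separated text (assumed file format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def split_into_files(records):
    """Return a dict mapping each file name to its text content."""
    samples = [[r["Sample.name"], r["Sample.strain_name"]] for r in records]
    # Keep every strain once, in order of first appearance.
    strains = list(dict.fromkeys(
        (r["Strain.name"], r["Strain.type"]) for r in records))
    factor_rows = [[r["Sample.name"], *(r[c] for c in FACTORS)] for r in records]
    phenotype_rows = [[r["Sample.name"], *(r[c] for c in PHENOTYPES)]
                      for r in records]
    return {
        "sample.txt": to_text(["Sample.name", "Sample.strain_name"], samples),
        "strain.txt": to_text(["Strain.name", "Strain.type"], strains),
        "data/factordata.txt": to_text(FACTORS, factor_rows),
        "data/phenotypedata.txt": to_text(PHENOTYPES, phenotype_rows),
    }


# First two rows of Example 4, one dictionary per row.
records = [
    {"Sample.name": "sample1", "Sample.strain_name": "90",
     "Strain.name": "90", "Strain.type": "epiRIL",
     "BLOCK": "2", "BLOCK_line": "1", "env": "c",
     "FLOdate": "05/01/2007", "DIAM1": "NA", "DIAM2": "NA"},
    {"Sample.name": "sample2", "Sample.strain_name": "col",
     "Strain.name": "col", "Strain.type": "mutant",
     "BLOCK": "1", "BLOCK_line": "1", "env": "s",
     "FLOdate": "05/04/2007", "DIAM1": "NA", "DIAM2": "NA"},
]

files = split_into_files(records)
for name, text in files.items():
    print(name)
    print(text)
```

The phenotype.txt and factor.txt files are not generated here, because their descriptions have to come from the data provider.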

N.B. It has been proposed to make a wizard that automates this splitting procedure. See #22.