Changes between Initial Version and Version 1 of XgapFormatTutorial


Timestamp:
2010-10-01T23:38:13+02:00 (14 years ago)
Author:
trac
Comment:

--

  • XgapFormatTutorial

    v1 v1  
     1[[TOC()]]
     2= How to describe an investigation in XGAP format =
     3
     4In a typical genotype-to-phenotype study, there is information about:
     5 * genotyping (markers measured on individuals),
     6 * phenotyping (traits measured on individuals),
     7 * derived data such as QTL profiles, and
     8 * procedural metadata, for example explaining the protocols used.
     9The XGAP tab-delimited text file format can capture all of this information.
     10
     11Below we will use the MetaNetwork investigation as an example to explain the use of the XGAP format. In MetaNetwork, the individuals belong to a certain Strain, which is of a certain inbreeding type. The Traits are in this case Metabolites having certain mass/charge annotations. Next to the genotype and metabolite data matrices, each of the Markers and Metabolite traits has additional annotations attached. Below, this data will be recorded as follows:
     12
     13An XGAP data fileset is typically created in five steps:
     14
     15 1. constant.properties (optional): This file allows the central definition of values that are static within the whole data set. Here we use it to define 'investigation_name' and 'species_name' centrally.
     16 1. data.txt: this file lists the data matrix files in this set.
     17 1. Data matrix files: these files contain the observed/calculated data values on Subjects and/or Traits.
     18 1. Subject and Trait Annotation files: these files list information about what was measured (Traits: Marker, Metabolite) and on whom it was measured (Subjects: Individual, Strain).
     19 1. Metadata files: these files contain general investigation information, in this case on Investigation, Species (OntologyTerm) and Bibliographicalreferences.
     20
     21Notes:
     22 * All files are normally in a tabular format requiring particular column headers. One exception is the data matrices, which are two-dimensional and have both column headers and row headers. Another exception is constant.properties, which has a 'key=value' format for each row.
     23 * In practice an XGAP file set contains only one investigation, which makes constant.properties practical. However, the format allows multiple investigations in one file set.
     24
     25Below each of these files is created for the MetaNetwork example.
     26
     27== 1. Creation of the 'constant.properties' file ==
     28This is an optional step. The constant properties file allows central definition of constant values, so that one doesn't need to provide them in each file. For example: in each annotation file (but not in matrix files!) one normally needs to define a column 'investigation_name' to denote that a particular piece of information was defined in a particular investigation. However, this would be the same value over the whole data set. Therefore a mechanism has been implemented to define such values centrally.
     29
     30=== 'constant.properties' file ===
     31In the example of MetaNetwork this file looks as follows:
     32
     33{{{
     34#values that are constant in this file set
     35#for all entities holds that
     36investigation_name = MetaNetwork
     37species_name = Arabidopsis thaliana
     38}}}
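The 'key = value' layout shown above can be read with a few lines of code. The following is only a minimal sketch, not the actual XGAP importer: the function name `parse_properties` is hypothetical, and it assumes that lines starting with '#' are comments and that everything after the first '=' is the value.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


EXAMPLE = """\
#values that are constant in this file set
#for all entities holds that
investigation_name = MetaNetwork
species_name = Arabidopsis thaliana
"""

constants = parse_properties(EXAMPLE)

# A row from an annotation file that omits 'investigation_name' can then be
# completed with the central values; values in the row itself take precedence.
row = {"name": "X1", "strain_name": "Ler x Cvi"}
completed = {**constants, **row}
```

Merging the constants first and the row second is a design choice of this sketch: a file that does provide its own 'investigation_name' would keep it.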
     39
     40== 2. Creation of the 'data.txt' file that defines data sets ==
     41
     42All XGAP data sets have a data.txt that lists the data matrices in the set. To ensure suitable annotations, the column and row headers of each matrix are always coupled to specific annotations while the matrix cells contain the observed values (see examples below). The file data.txt describes these relationships, as well as the matrix dimensions and the type of data in the cells (decimal or textual).
     43
     44=== 'data.txt' file ===
     45To describe data matrices, the data.txt has the following columns:
     46||'''column name'''||'''description'''||
     47||name||name of the data set. In this case 'data_genotypes' and 'data_metaboliteexpression'||
     48||investigation_name||name of the investigation this data set is part of. Here omitted because it is provided in the constant.properties file||
     49||rowType||type of Subject or Trait that the row headers refer to (here Marker or Metabolite)||
     50||colType||type of Subject or Trait that the column headers refer to (here Individual)||
     51||valueType||specification of what type of data is in this matrix, either Decimal for numeric data or Textual for non-numeric data||
     52||totalRows||total number of rows of this matrix||
     53||totalCols||total number of columns of this matrix||
     54
     55For the MetaNetwork study the data.txt looks as follows:
     56{{{
     57name    rowType colType valueType       totalRows       totalCols
     58data_genotypes  Marker  Individual      Decimal 117     162
     59data_metaboliteexpression       Metabolite      Individual      Decimal 24      162
     60}}}
     61
     62As you can see, the genotypes have rows with Markers, and columns with Individuals.
     63
     64== 3. Creation of matrix files in the 'data' folder ==
     65Each of the data sets described in the data.txt file should be available in a subfolder called 'data'. For the creation of these files the following rules hold:
     66
     67 * The names of these files should match the names in data.txt with the suffix of '.txt'. In the MetaNetwork example there should be 'data_genotypes.txt' and 'data_metaboliteexpression.txt'.
     68 * The column and row headers should match appropriate names in the referenced annotation files. For example, 'data_genotypes' is a matrix of Marker (rows) x Individual (columns), and its headers should therefore refer to values in 'marker.txt' and 'individual.txt'.
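The link between data.txt and the matrix files can be sketched as follows. This is an illustration under assumptions, not the XGAP importer itself: the function name `read_data_index` and the added 'matrix_file' key are hypothetical, and the example text simply repeats the MetaNetwork data.txt with explicit tab separators.

```python
import csv
import io

DATA_TXT = (
    "name\trowType\tcolType\tvalueType\ttotalRows\ttotalCols\n"
    "data_genotypes\tMarker\tIndividual\tDecimal\t117\t162\n"
    "data_metaboliteexpression\tMetabolite\tIndividual\tDecimal\t24\t162\n"
)


def read_data_index(text):
    """Read data.txt and derive the expected matrix file for each data set."""
    entries = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rec["totalRows"] = int(rec["totalRows"])
        rec["totalCols"] = int(rec["totalCols"])
        # Naming rule from the tutorial: data set name plus '.txt' in 'data/'.
        rec["matrix_file"] = "data/" + rec["name"] + ".txt"
        entries.append(rec)
    return entries


index = read_data_index(DATA_TXT)
expected_files = [entry["matrix_file"] for entry in index]
```

The 'totalRows' and 'totalCols' values can then be compared with the dimensions actually found in each matrix file.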
     69
     70=== 'data/data_genotypes.txt' file ===
     71
     72The genotypes data reports genotypic observations on markers (rows) and individuals (columns); the two alleles are denoted by either '1' or '2'. A snapshot of this data matrix:
     73
     74{{{
     75"X1"    "X3"    "X4"    "X5"    "X6"
     76"PVV4"  1       1       2       1       2
     77"AXR-1" 1       1       2       1       2
     78"HH.335C-Col"   1       1       1       1       2
     79"DF.162L/164C-Col"      1       1       1       1       2
     80"EC.480C"       1       1       1       1       2
     81}}}
     82
     83Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (PVV4, ...) should refer to 'name' values in 'marker.txt'. See below.
     84
     85=== 'data/data_metaboliteexpression.txt' file ===
     86The matrix with traits has information about one or more traits, in this case metabolites (rows), measured on the same individuals (columns) that were also genotyped. A snapshot of this data matrix:
     87
     88{{{
     89"X1"    "X3"    "X4"    "X5"    "X6"
     90"3-Hydroxypropyl"       NA      942     2402    602     213
     91"4-Hydroxybutyl"        NA      4       10      183     198
     92"4-Methylsulfinylbutyl" NA      55      62      13386   1671
     93"3-Butenyl"     NA      84      32      18      4339
     94"3-Methylthiopropyl"    NA      3108    569     4       7
     95}}}
     96
     97Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (3-Hydroxypropyl, ...) should refer to 'name' values in 'metabolite.txt'. See below.
     98
     99=== Notes about the matrix file format ===
     100
     101The double quotes are not necessary, but can prevent confusion during parsing. The importing process will determine the value separator (tab in this case), and names with many whitespace characters can (in rare cases) cause the parser to think that whitespace is the separator.
     102
     103Notice that the column header row is not aligned with the data columns but shifted one position to the left. This is because the row headers also form a column but contain no data; therefore the 'first' column header is omitted. Starting the header row with only a separator character as a first value is allowed as well.
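The shifted header row can be handled as in the sketch below. It is a hedged illustration, not the XGAP parser: the function name `read_matrix` is hypothetical, the separator is fixed to a tab instead of being detected, and treating 'NA' as a missing value is an assumption based on the snapshot above.

```python
import csv
import io


def read_matrix(text, delimiter="\t"):
    """Return (row_names, col_names, values) for a matrix with a shifted header."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    header, body = rows[0], rows[1:]
    # If the header is as long as a data row, its first field is the optional
    # placeholder (a lone separator) above the row-header column: drop it.
    if body and len(header) == len(body[0]):
        header = header[1:]
    row_names = [r[0] for r in body]
    # Assumption: 'NA' marks a missing observation; other cells are decimal.
    values = [[None if v == "NA" else float(v) for v in r[1:]] for r in body]
    return row_names, header, values


MATRIX = (
    '"X1"\t"X3"\t"X4"\n'
    '"3-Hydroxypropyl"\tNA\t942\t2402\n'
    '"4-Hydroxybutyl"\tNA\t4\t10\n'
)

row_names, col_names, values = read_matrix(MATRIX)
# The variant with a leading separator in the header row gives the same result.
same_result = read_matrix("\t" + MATRIX)
```

The standard `csv` module strips the optional double quotes, so quoted and unquoted headers yield the same names.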
     104
     105== 4. Creation of Subject and Trait annotation files ==
     106
     107From the data sets we referred to annotations on Individuals, Markers and Metabolite traits. Below it is shown how to add annotations for each of these. Again, the annotations go into a file with the same name and a '.txt' suffix. So the annotations of Individuals go into 'individual.txt', Strains go into 'strain.txt', Markers go into 'marker.txt', and Metabolites go into 'metabolite.txt'.
     108
     109=== 'individual.txt' file ===
     110In this case there is little information on the individuals: only their name and their strain of origin. The data model also allows for optional pedigree information. A snapshot of the individual.txt annotation file:
     111
     112{{{
     113name    strain_name
     114X1      Ler x Cvi
     115X3      Ler x Cvi
     116X4      Ler x Cvi
     117X5      Ler x Cvi
     118X6      Ler x Cvi
     119}}}
     120
     121The 'strain_name' column is a reference to a different type of Subject in the database, Strain. Notice that we do not refer to this Strain by a numeric database id (which will be assigned by the database, so we cannot know it at this point) but by using a special syntax: the "_name" suffix. This means the parser will automatically make the reference to the correct Strain by identifying it by its 'name' attribute. However, no such Strain is present yet. We add it by creating 'strain.txt', below.
     122
     123=== 'strain.txt' annotation file ===
     124
     125In this case only the straintype is known, which is recombinant inbred by selfing (riself).
     126
     127{{{
     128name    straintype
     129Ler x Cvi       riself
     130}}}
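Whether every 'strain_name' in 'individual.txt' resolves to a 'name' in 'strain.txt' can be checked before importing. The sketch below only illustrates that idea and is not how the XGAP parser is implemented: the helper names `read_table` and `unresolved_references` are hypothetical, and the example data repeats the snapshots above with explicit tab separators.

```python
import csv
import io


def read_table(text, delimiter="\t"):
    """Read a tabular annotation file into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def unresolved_references(rows, column, target_rows):
    """Return values of `column` that match no 'name' in `target_rows`."""
    known = {target["name"] for target in target_rows}
    return sorted({row[column] for row in rows if row[column] not in known})


INDIVIDUAL_TXT = "name\tstrain_name\nX1\tLer x Cvi\nX3\tLer x Cvi\n"
STRAIN_TXT = "name\tstraintype\nLer x Cvi\triself\n"

individuals = read_table(INDIVIDUAL_TXT)
strains = read_table(STRAIN_TXT)

# Empty when every individual refers to a strain that is defined.
dangling = unresolved_references(individuals, "strain_name", strains)
# Without strain.txt, the reference 'Ler x Cvi' cannot be resolved.
dangling_without_strains = unresolved_references(individuals, "strain_name", [])
```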
     131
     132=== 'marker.txt' annotation file ===
     133The marker annotations go in 'marker.txt'. Here we add vital information for further analysis: the chromosome on which this marker is located, and its centiMorgan position on that chromosome. It may look like this:
     134
     135{{{
     136"name","chr","cm"
     137"PVV4",1,0
     138"AXR-1",1,6.398
     139"HH.335C-Col",1,10.786
     140"DF.162L/164C-Col",1,12.913
     141"EC.480C",1,15.059
     142}}}
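The marker annotations can be read and compared with the row headers of the genotype matrix, as in the following sketch. It is an illustration under assumptions: the function names `read_markers` and `missing_annotations` are hypothetical, the chromosome is kept as text because other data sets may use non-numeric chromosome names, and the comma separator is taken from the snapshot above.

```python
import csv
import io

MARKER_TXT = (
    '"name","chr","cm"\n'
    '"PVV4",1,0\n'
    '"AXR-1",1,6.398\n'
    '"HH.335C-Col",1,10.786\n'
)


def read_markers(text):
    """Map each marker name to (chromosome, centiMorgan position)."""
    markers = {}
    for rec in csv.DictReader(io.StringIO(text)):
        markers[rec["name"]] = (rec["chr"], float(rec["cm"]))
    return markers


def missing_annotations(headers, annotated_names):
    """Return matrix headers that have no entry in the annotation file."""
    return [h for h in headers if h not in annotated_names]


markers = read_markers(MARKER_TXT)
# 'EC.480C' is a row header in the matrix but is absent from this shortened example.
missing = missing_annotations(["PVV4", "AXR-1", "EC.480C"], markers)
```

The same `missing_annotations` check applies to the column headers against 'individual.txt' and to the metabolite row headers against 'metabolite.txt'.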
     143
     144=== 'metabolite.txt' annotation file ===
     145We also add annotations for the metabolites, though with no additional information at this point. It is still valuable to add them to the database, as more annotations may become available later. This also ensures consistency if multiple observations on the same metabolites are included, such as QTL profiles or correlation data.
     146
     147{{{
     148"name"
     149"3-Hydroxypropyl"
     150"4-Hydroxybutyl"
     151"4-Methylsulfinylbutyl"
     152"3-Butenyl"
     153"3-Methylthiopropyl"
     154}}}
     155
     156== 5. Creation of other meta data files ==
     157
     158XGAP allows for many more annotations; see XgapDataModel for a listing. In this case we only describe the investigation under which all information is stored, in 'investigation.txt', together with the species and the related publication.
     159
     160=== 'investigation.txt' file ===
     161This file can hold the name of the investigation, and optionally its start date and end date. In this case we only provide a name:
     162
     163{{{
     164name
     165MetaNetwork
     166}}}
     167
     168=== 'species.txt' file ===
     169Minimal information on the species studied has also been added, namely the name to be used in this study.
     170
     171{{{
     172name
     173Arabidopsis thaliana
     174}}}
     175
     176=== 'bibliographicalreference.txt' file ===
     177
     178We also add information concerning the publication for this investigation in 'bibliographicalreference.txt'.
     179
     180{{{
     181name    authors publication     publisher       editor  year    volume  issue   pages   title
     182PMID: 17406631  Fu J, Swertz MA, Keurentjes JJ, Jansen RC.      Nat Protoc.     -       -       2007    -       -       685-94  MetaNetwork: a computational protocol for the genetic study of metabolic networks.
     183}}}
     184
     185This example set can be downloaded from:
     186 * DataSets page, see 'Fu et al'.
     187 * [http://gbic.target.rug.nl/xgap/zip/Fu.zip Direct download link]