How to describe an investigation in XGAP format

In a typical genotype-to-phenotype study, there is information about:

  • genotyping (markers measured on individuals),
  • phenotyping (traits measured on individuals),
  • derived data such as QTL profiles, and
  • procedural metadata, for example describing the protocols used.

The XGAP tab-delimited text file format can capture all of this information.

Below we use the MetaNetwork investigation as an example to explain the use of the XGAP format. In MetaNetwork, the individuals belong to a certain Strain, which is of a certain inbreeding type. The Traits are in this case Metabolites having certain mass/charge annotations. Next to the genotype and metabolite data matrices, each of the Markers and Metabolite traits has additional annotations attached. Below, this data is recorded step by step.

An XGAP data fileset is typically created in five steps:

  1. constant.properties (optional): this file allows the central definition of values that are constant within the whole data set. Here we use it to define 'investigation_name' and 'species_name' centrally.
  2. data.txt: this file lists the data matrix files in this set.
  3. Data matrix files: these files contain the observed/calculated data values on Subjects and/or Traits.
  4. Subject and Trait annotation files: these files list information about what was measured (Traits: Marker, Metabolite) and on whom it was measured (Subjects: Individual, Strain).
  5. Metadata files: these files contain general investigation information, in this case on Investigation, Species (OntologyTerm) and bibliographical references.

Notes:

  • All files are normally in a tabular format that requires particular column headers. One exception is the data matrices, which are two-dimensional and have both column headers and row headers. Another exception is constant.properties, which has a 'key=value' format on each row.
  • In practice an XGAP file set usually contains only one investigation, which makes constant.properties practical. However, the format allows multiple investigations in one file set.

Below each of these files is created for the MetaNetwork example.

1. Creation of the 'constant.properties' file

This is an optional step. The constant properties file allows central definition of constant values, so that one doesn't need to provide them in each file. For example, each annotation file (but not the matrix files) normally needs a column 'investigation_name' to denote the investigation in which a particular piece of information was defined. However, this value would be the same over the whole data set. Therefore a mechanism has been implemented to define such values centrally.

'constant.properties' file

In the example of MetaNetwork this file looks as follows:

#values that are constant in this file set
#for all entities holds that
investigation_name = MetaNetwork
species_name = Arabidopsis thaliana
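
The 'key = value' layout above can be read with a few lines of code. The following is a minimal sketch in Python, not the XGAP importer itself; the function name `parse_constants` is hypothetical, and treating '#' lines as comments is an assumption based on the example file.

```python
def parse_constants(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    constants = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: lines starting with '#' are comments, as in the example.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        constants[key.strip()] = value.strip()
    return constants


EXAMPLE = """\
#values that are constant in this file set
#for all entities holds that
investigation_name = MetaNetwork
species_name = Arabidopsis thaliana
"""

constants = parse_constants(EXAMPLE)
print(constants)
```

A reader of the annotation files could then use these values as defaults for columns that a file leaves out, which is the behaviour the text above describes.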

2. Creation of the 'data.txt' file that defines the data sets

All XGAP data sets have a data.txt that lists the data matrices in the set. To ensure suitable annotations, the column and row headers of each matrix are always coupled to specific annotations while the matrix cells contain the observed values (see examples below). The file data.txt describes these relationships, as well as the matrix dimensions and the type of data in the cells (decimal or textual).

'data.txt' file

To describe data matrices, the data.txt has the following columns:

column name         description
name                name of the data set; in this case 'data_genotypes' and 'data_metaboliteexpression'
investigation_name  name of the investigation this data set is part of; omitted here because it is provided in the constant.properties file
rowType             reference to the type of Subject or Trait in the row headers (here Marker or Metabolite)
colType             reference to the type of Subject or Trait in the column headers (here Individual)
valueType           type of data in this matrix: either Decimal for numeric data or Textual for non-numeric data
totalRows           total number of rows of this matrix
totalCols           total number of columns of this matrix

For the MetaNetwork study the data.txt looks as follows:

name	rowType	colType	valueType	totalRows	totalCols
data_genotypes	Marker	Individual	Decimal	117	162
data_metaboliteexpression	Metabolite	Individual	Decimal	24	162

As you can see, the genotypes have rows with Markers, and columns with Individuals.
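
As an illustration, the data.txt above can be loaded into one record per matrix. This is a sketch using Python's standard csv module; the function name `read_data_txt` is hypothetical and the code is not part of XGAP.

```python
import csv
import io

# Contents of data.txt from the MetaNetwork example (tab-delimited).
DATA_TXT = (
    "name\trowType\tcolType\tvalueType\ttotalRows\ttotalCols\n"
    "data_genotypes\tMarker\tIndividual\tDecimal\t117\t162\n"
    "data_metaboliteexpression\tMetabolite\tIndividual\tDecimal\t24\t162\n"
)


def read_data_txt(text):
    """Return one dict per data matrix, with the dimensions as integers."""
    matrices = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["totalRows"] = int(row["totalRows"])
        row["totalCols"] = int(row["totalCols"])
        matrices.append(row)
    return matrices


matrices = read_data_txt(DATA_TXT)
for m in matrices:
    print(m["name"], ":", m["rowType"], "x", m["colType"],
          m["totalRows"], "rows,", m["totalCols"], "columns")
```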

3. Creation of matrix files in the 'data' folder

Each of the data sets described in the data.txt file should be available in a subfolder called 'data'. For the creation of these files the following rules hold:

  • The names of these files should match the names in data.txt with the suffix of '.txt'. In the MetaNetwork example there should be 'data_genotypes.txt' and 'data_metaboliteexpression.txt'.
  • The column and row headers should match appropriate names in the referred annotation files. For example, 'data_genotypes' is a matrix of Marker x Individual (rows x columns) and its headers should therefore refer to values in 'marker.txt' and 'individual.txt'.
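
The two naming rules above can be sketched as a small helper that lists the files one data.txt entry depends on. The function names are hypothetical, and deriving the annotation file name by lower-casing the type name is an assumption based on the 'individual.txt' and 'marker.txt' examples.

```python
from pathlib import PurePosixPath


def matrix_file(name):
    """Matrix files live in the 'data' subfolder and carry a '.txt' suffix."""
    return PurePosixPath("data") / (name + ".txt")


def annotation_file(entity_type):
    # Assumption: the annotation file is the lower-cased type name plus '.txt',
    # e.g. Marker -> marker.txt, Individual -> individual.txt.
    return PurePosixPath(entity_type.lower() + ".txt")


def required_files(name, row_type, col_type):
    """List the matrix file and the annotation files its headers refer to."""
    return [
        str(matrix_file(name)),
        str(annotation_file(row_type)),
        str(annotation_file(col_type)),
    ]


print(required_files("data_genotypes", "Marker", "Individual"))
```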

'data/data_genotypes.txt' file

The genotype data reports genotypic observations on markers (rows) and individuals (columns); the two alleles are denoted by '1' and '2'. A snapshot of this data matrix:

"X1"	"X3"	"X4"	"X5"	"X6"
"PVV4"	1	1	2	1	2
"AXR-1"	1	1	2	1	2
"HH.335C-Col"	1	1	1	1	2
"DF.162L/164C-Col"	1	1	1	1	2
"EC.480C"	1	1	1	1	2

Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (PVV4, ...) should refer to 'name' values in 'marker.txt'. See below.

'data/data_metaboliteexpression.txt' file

The matrix with traits has information about one or more traits, in this case metabolites (rows), measured on the same individuals (columns) that were also genotyped. A snapshot of this data matrix:

"X1"	"X3"	"X4"	"X5"	"X6"
"3-Hydroxypropyl"	NA	942	2402	602	213
"4-Hydroxybutyl"	NA	4	10	183	198
"4-Methylsulfinylbutyl"	NA	55	62	13386	1671
"3-Butenyl"	NA	84	32	18	4339
"3-Methylthiopropyl"	NA	3108	569	4	7

Note that the column headers (X1, ...) should refer to 'name' values in 'individual.txt' and that the row headers (3-Hydroxypropyl, ...) should refer to 'name' values in 'metabolite.txt'. See below.

Notes about the matrix file format

The double quotes are not necessary, but can prevent confusion during parsing. The importing process determines the value separator (tab in this case), and names with many whitespace characters can (in rare cases) cause the parser to conclude that whitespace is the separator.

Notice that the column header is not exactly on top of the data columns but shifted one position to the left. This is because the row headers also form a column but contain no data; therefore the 'first' column header is omitted. Starting the header line with only a separator character, as an empty first value, is allowed as well.
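
To make the shifted header concrete, here is a minimal sketch of a matrix reader in Python. It is not the XGAP importer: the function name `read_matrix` is hypothetical, the separator is passed in rather than detected, and reading 'NA' as a missing value is an assumption based on the metabolite example.

```python
import csv
import io

# First rows of the metabolite matrix shown above (tab-delimited).
MATRIX = (
    '"X1"\t"X3"\t"X4"\t"X5"\t"X6"\n'
    '"3-Hydroxypropyl"\tNA\t942\t2402\t602\t213\n'
    '"4-Hydroxybutyl"\tNA\t4\t10\t183\t198\n'
)


def read_matrix(text, delimiter="\t"):
    """Return (column names, row names, values) for a matrix file."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    header, body = rows[0], rows[1:]
    # Variant with a leading separator: drop the empty first header cell.
    if body and header[0] == "" and len(header) == len(body[0]):
        header = header[1:]
    row_names, values = [], []
    for r in body:
        # Each data row has one more cell than the header: the row name.
        if len(r) != len(header) + 1:
            raise ValueError("row %r does not match the header width" % r[0])
        row_names.append(r[0])
        # Assumption: 'NA' marks a missing value.
        values.append([None if v == "NA" else float(v) for v in r[1:]])
    return header, row_names, values


cols, names, values = read_matrix(MATRIX)
print(cols)
print(names)
print(values[0])
```

The csv module strips the optional double quotes, so quoted and unquoted headers are read the same way.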

4. Creation of Subject and Trait annotation files

The data sets refer to annotations on Individuals, Markers and Metabolite traits. Below it is shown how to add annotations for each of these. Again, the annotations go into a file with the same name and a '.txt' suffix. So the annotations of Individuals go into 'individual.txt', Strains go into 'strain.txt', Markers go into 'marker.txt', and Metabolites go into 'metabolite.txt'.

'individual.txt' file

For the individuals there is not much information in this case: only their name and their strain of origin. The data model also allows for optional pedigree information. A snapshot of the individual.txt annotation file:

name	strain_name
X1	Ler x Cvi
X3	Ler x Cvi
X4	Ler x Cvi
X5	Ler x Cvi
X6	Ler x Cvi

The 'strain_name' column is a reference to a different type of Subject in the database: Strain. Notice that we do not refer to this Strain by a numeric database id (which is assigned by the database and cannot be known at this point) but by using a special syntax: the "_name" suffix. This means the parser automatically links each individual to the correct strain by identifying the strain by its 'name' attribute. However, no such strain is present yet. We add it by creating 'strain.txt', below.

'strain.txt' annotation file

In this case only the straintype is known: recombinant inbred by selfing (riself).

name	straintype
Ler x Cvi	riself
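
The name-based reference between 'individual.txt' and 'strain.txt' can be sketched as a simple lookup. This is an illustration in Python, not the XGAP parser; `read_table` and `resolve_by_name` are hypothetical names.

```python
import csv
import io

INDIVIDUAL_TXT = "name\tstrain_name\nX1\tLer x Cvi\nX3\tLer x Cvi\n"
STRAIN_TXT = "name\tstraintype\nLer x Cvi\triself\n"


def read_table(text):
    """Read a tab-delimited annotation file into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def resolve_by_name(rows, ref_column, targets):
    """Pair each row with the target whose 'name' equals row[ref_column].

    Returns (resolved pairs, names of rows whose reference was not found).
    """
    by_name = {t["name"]: t for t in targets}
    resolved, missing = [], []
    for row in rows:
        target = by_name.get(row[ref_column])
        if target is None:
            missing.append(row["name"])
        else:
            resolved.append((row, target))
    return resolved, missing


individuals = read_table(INDIVIDUAL_TXT)
strains = read_table(STRAIN_TXT)
resolved, missing = resolve_by_name(individuals, "strain_name", strains)
for individual, strain in resolved:
    print(individual["name"], "->", strain["name"], "(" + strain["straintype"] + ")")
print("unresolved:", missing)
```

Running such a check before import shows which individuals point to a strain that has not been defined yet.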

'marker.txt' annotation file

The marker annotations go in 'marker.txt'. Here we add vital information for further analysis: the chromosome on which each marker is located, and its centiMorgan position on that chromosome. It may look like this:

"name","chr","cm"
"PVV4",1,0
"AXR-1",1,6.398
"HH.335C-Col",1,10.786
"DF.162L/164C-Col",1,12.913
"EC.480C",1,15.059

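Unlike the other example files, this marker.txt snapshot is comma-separated, so a reader has to work out the separator first. The sketch below uses a simple heuristic (the candidate character that occurs most often in the header line); this is an assumption for illustration and not necessarily how the XGAP importer detects separators. The function names are hypothetical, and numeric chromosome labels are assumed, as in this example.

```python
import csv
import io

MARKER_TXT = (
    '"name","chr","cm"\n'
    '"PVV4",1,0\n'
    '"AXR-1",1,6.398\n'
    '"HH.335C-Col",1,10.786\n'
    '"DF.162L/164C-Col",1,12.913\n'
    '"EC.480C",1,15.059\n'
)


def guess_delimiter(header_line, candidates=("\t", ",", ";")):
    """Heuristic: pick the candidate that occurs most often in the header."""
    return max(candidates, key=header_line.count)


def read_markers(text):
    """Read marker annotations, converting chr to int and cm to float."""
    delimiter = guess_delimiter(text.splitlines()[0])
    markers = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        markers.append(
            {"name": row["name"], "chr": int(row["chr"]), "cm": float(row["cm"])}
        )
    return markers


markers = read_markers(MARKER_TXT)
for m in markers:
    print(m["name"], m["chr"], m["cm"])
```
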
'metabolite.txt' annotation file

We also add annotations for the metabolites, though with no additional information at this point. Still, it is valuable to add them to the database, as more annotations may become available later. This also ensures consistency if multiple observations on the same metabolites are included, such as QTL profiles or correlation data.

"name"
"3-Hydroxypropyl"
"4-Hydroxybutyl"
"4-Methylsulfinylbutyl"
"3-Butenyl"
"3-Methylthiopropyl"

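Because matrix headers must match 'name' values in the annotation files, a quick consistency check can catch typos before import. The following is a small sketch; the function name `unknown_headers` is hypothetical.

```python
def unknown_headers(matrix_headers, annotation_names):
    """Return the matrix headers with no matching 'name' in the annotation file."""
    known = set(annotation_names)
    return [h for h in matrix_headers if h not in known]


# 'name' values from metabolite.txt in the example.
metabolite_names = [
    "3-Hydroxypropyl",
    "4-Hydroxybutyl",
    "4-Methylsulfinylbutyl",
    "3-Butenyl",
    "3-Methylthiopropyl",
]

# Row headers of data/data_metaboliteexpression.txt in the example.
row_headers = list(metabolite_names)

print(unknown_headers(row_headers, metabolite_names))
print(unknown_headers(row_headers + ["3-Butenil"], metabolite_names))
```
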
5. Creation of other meta data files

XGAP allows for many more annotations; see XgapDataModel for a listing. In this case we only describe the investigation under which all information is stored (in 'investigation.txt'), the species, and the related publication.

'investigation.txt' file

This file can hold a name and, optionally, a start date and an end date. In this case we only provide a name:

name
MetaNetwork

'species.txt' file

Minimal information on the species studied has also been added, as well as a short name to be used in this study.

name
Arabidopsis thaliana

'bibliographicalreference.txt' file

We also add information concerning the publication for this investigation in 'bibliographicalreference.txt'.

name	authors	publication	publisher	editor	year	volume	issue	pages	title
PMID: 17406631	Fu J, Swertz MA, Keurentjes JJ, Jansen RC.	Nat Protoc.	-	-	2007	-	-	685-94	MetaNetwork: a computational protocol for the genetic study of metabolic networks.

This example set can be downloaded from: