
Genotype Information Management System (GIMS)

developers: AndreDeVries, JorisLops, MorrisSwertz
state:design

At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary. Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre).

The project is divided into three tasks:

  1. Convert the current Excel and text file 'system' to a database with a user interface.
  2. Implement a web service (hosted by TCC) that delivers a daily list of new, genetically distinct samples to be included in the WGA. This should be a dynamic process: TCC uses up-to-date information about which samples have already been included in the WGA.
  3. Monitor genotyping quality: couple with external software to assess genotyping quality and present the genotyping quality in various ways in the UI.

Tasks 1 and 2 are planned to be finished by mid-March 2010. Task 3 follows and is planned to be finished by about the end of April 2010.

Afterwards, this software could be used and extended elsewhere, such as at the genotyping facility of the genetics department.

REQUIREMENTS:

1.0 Introduction

At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary. Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre). Selected samples should be genetically distinct and of good quality. This selection is currently performed on a static dataset and should instead operate on a dynamic database. In addition, better monitoring of the genotyping quality is desired.

1.1 Goals and objectives

The LL-GIMS (LifeLines genetics information management system) aims to assist the lab staff of the LifeLines project in monitoring sample handling and genotyping more effectively.

1.2 Statement of scope

The software system will have the following three key features, each subdivided into more detailed functions:

1) Structured storage of sample information (database instead of Excel sheets)

  • A new database is needed, which will initially be filled with the current Excel data.
  • New samples and sample information from robots/machines are automatically put into the database.
  • Users can view and edit samples, sample sheets and batches.
  • Users can set properties for samples: whether they are in duplo, are normalized, are used for GWA, are irresolvable or are packed cells (uitgangsmateriaal, i.e. starting material).

2) Algorithm to determine whether a particular LifeLines sample should be selected for WGA analysis (good quality and genetically distinct).

  • A daily query should be performed (either at the push of a button or at a set time) on the TCC LifeLines database to find all samples that have had their 2nd visit that day and that are genetically distinct, so that they can be selected for the WGA study. NB: this means that the algorithm needs to know for each sample whether it is already in the WGA.
  • This query returns not only Lifelines identifiers, but also at least gender and freezer storage position, if that information is available.
  • A web service connects the TCC algorithm and the GIMS application.
  • A list of samples to be included in the WGA is presented to the user, and these samples will be flagged as “InWGA”.

3) Monitoring of genotype quality of a sample, chip and sheet.

  • Checks of the quality of the sample (concentration, genotyping call rate, heterozygosity, Hardy-Weinberg). Information from Illumina BeadStudio is used. The QC pipeline (or software by Lude) might also be used, adapted for single-sample testing.
  • The user can inspect, in a single view, the quality of the set of 12 samples from a single chip, and also the set of 92 samples from a sheet, perhaps as a colour-coded overview in which red = bad quality and yellow = borderline. Beware of the mapping between position on the plate and position on the chip. How exactly is this done?
  • Quality monitoring through time, per sample, per chip, per sheet, per session and over the total project.
  • Users can inspect nanodrop graphs, which are constructed on the fly from the nanodrop file. How is the file located?

As input, the following is needed:

  • Sample information (database, initially filled with data from Excel sheets).
  • Nanodrop files.
  • Raw intensity files (for the QC-software).

Output will be:

  • Quality information from the genotyping/QC-pipeline (stored in database).
  • A yes/no value saying whether a sample should be included in the WGA.

1.3 Software context

This software is related to the LifeLines genetics analysis architecture that will be built. The same (or parts of the same) QC-pipeline may be used in both. The LL-GIMS will connect to the LifeLines genetic analysis portal (in whatever form) through the LifeLines identifier value.

1.4 Major constraints

The connection to the TCC database is unclear. The sample selection algorithm needs information from both the TCC database (family relationships, gender) and the GIMS dataset.

2.0 Usage scenario

2.1 User profiles

Lab people will be the users. There will be one administrator, who can perform the following special actions:

  • Create/remove user accounts, reset passwords.
  • Delete samples.

2.2 Use-cases

  1. Enter/update a sample

New samples are automatically inserted daily into the GIMS dataset by retrieving them from the TCC Lifelines database as soon as they become known. New records are therefore initially placed in the SAMPLES table without any sample information (such as 2D-barcode and RackID), but including Lifelines number, freezer storage position (if available) and gender. How are RackID and 2D-barcode obtained automatically? Also from TCC? Through automatically generated files? Nanodrop information is imported into the database through an operation performed by the user (or automatically?). The user can inspect the nanodrop plot.

  2. Check whether a sample should be selected for WGA

Upon the push of a button by a user, information about which samples are in the WGA is pushed to the TCC Lifelines database. Subsequently, the sample selection algorithm runs at TCC and returns a list of samples that should be used in the WGA. This list is then checked automatically: is the concentration OK, and is the 260/280 ratio OK? (When there are two records with the same Lifelines number, take the one with the best values.) Is this information already known on that day? Good samples are presented to the user, who must confirm that the samples will be used in the WGA.
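The automatic check described above could look roughly like the following minimal sketch. The 260/280 limit of 1.8 is taken from the sample screen description; the concentration limit MIN_CONC, the Record class and the rule that "best values" means the highest concentration among records that pass both checks are assumptions made only for illustration.

```python
from dataclasses import dataclass

MIN_RATIO_260_280 = 1.8  # from the sample screen: ratio 260/280 should be >= 1.8
MIN_CONC = 50.0          # hypothetical limit; the specification does not state one


@dataclass(frozen=True)
class Record:
    ll_number: int        # Lifelines number
    conc: float           # ND8000 concentration
    ratio_260_280: float  # ratio 260/280


def passes_qc(rec: Record) -> bool:
    """True when both concentration and 260/280 ratio are within limits."""
    return rec.conc >= MIN_CONC and rec.ratio_260_280 >= MIN_RATIO_260_280


def screen(records):
    """Return one passing record per Lifelines number.

    Assumption: when several passing records share a Lifelines number,
    the one with the highest concentration counts as the best.
    """
    best = {}
    for rec in records:
        if not passes_qc(rec):
            continue
        current = best.get(rec.ll_number)
        if current is None or rec.conc > current.conc:
            best[rec.ll_number] = rec
    return best
```

Participants without any passing record simply do not appear in the result, so the user is only asked to confirm good samples.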

  3. Monitor genotyping quality

Genotyping quality measures will be obtained from the Illumina BeadStudio software. Genotyping quality will also be assessed using special software (written by Lude Franke) and fed into the GIMS database. The user can inspect the quality of a particular sample (through the LL#) and obtain the following statistics: call rate, HW-chi2, is-gender-correct, CNV-statistics (if applicable). A flag will be set for samples that are not within specifications. The user can also inspect the quality of all samples of a particular chip (SampleSheet.SentrixBarcode_A), which holds 12 samples; average, min and max are shown. Likewise, the user can inspect the quality of all samples of a plate (AMP_Plate); average, min and max are shown.
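As an illustration of the per-chip view, the sketch below groups call rates by chip barcode and reports average, min and max, and lists samples outside specification. The tuple layout of the input rows and the limit MIN_CALL_RATE are assumptions; the specification does not define the actual limits.

```python
from collections import defaultdict
from statistics import mean

MIN_CALL_RATE = 0.95  # hypothetical specification limit


def summarize_by_chip(rows):
    """rows: iterable of (ll_number, chip_barcode, call_rate) tuples."""
    per_chip = defaultdict(list)
    for _ll_number, chip, call_rate in rows:
        per_chip[chip].append(call_rate)
    return {
        chip: {"avg": mean(values), "min": min(values), "max": max(values)}
        for chip, values in per_chip.items()
    }


def flagged(rows):
    """Lifelines numbers of samples whose call rate is below the limit."""
    return [ll for ll, _chip, call_rate in rows if call_rate < MIN_CALL_RATE]
```

The same grouping would apply to plates by using AMP_Plate instead of the chip barcode as the key.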

2.3 Special usage considerations

Connections with the TCC Lifelines database and with the QC software are needed.

2.4 Screenshots and detailed description

The main menu shows various options, which lead to the next screen. User administration is available (and visible) for Admin only.

The user enters a Lifelines number of interest and hits the FIND button. Perhaps the user could also search by sampleID, 2D-barcode, RackID or isolation date (not shown). If a record is found, there may be more than one type (stock, duplo, normalized, inWGA, Irresolvable), from which the user can choose. Then the details are shown. The user may then hit the EDIT button, after which a number of fields become editable (all except LifeLines number, 2D-barcode, ND8000 and ratios). Values in red indicate that there are quality issues with the sample (in this case: ratio 260/280 should be ≥1.8). The sample may be deleted, but only admin can do this. The user can inspect the nanodrop plot by the push of a button. Information is read from the respective nanodrop file (how is the file located?) to construct the nanodrop plot. Perhaps the nanodrop plot could be shown directly on the above screen?

Using the above screen, the user imports information from a nanodrop file into the database. The following variables are imported: Date, Conc, A260/A280, A260/A230. They are stored in the table SAMPLES. (Alternatively, this screen is obsolete if nanodrop information is imported automatically).

In the above screen only the button START SEARCHING is visible at first. At the push of this button, a number of things happen behind the scenes in the following order:

  • The existence of new Lifelines participants is checked in the TCC lifelines database. The corresponding LLnumbers, freezer storage position (if available) and gender of new participants are inserted into the SAMPLES database as new records.
  • A list of samples that have already been selected for WGA (field SAMPLES.INWGA = TRUE) is constructed.
  • The sample selection algorithm at TCC is called and the list just constructed is sent along to the algorithm.
  • At TCC the sample selection algorithm (it must be modified from its current form!) runs and delivers a list of participants that have come this day for the 2nd visit and that are genetically distinct. These samples should be included in the WGA.
  • The list of samples appears on screen.

The user can print or save the resulting list to file. All shown samples are automatically registered as being included in WGA (this is already administered at TCC as soon as the list is composed). The field INWGA is set to TRUE for all samples of the presented list.
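The sequence behind the START SEARCHING button can be sketched as follows. The TccClient interface, its method names and the dictionary layout of a sample record are hypothetical, since the connection to the TCC database has not yet been specified; only the order of the steps follows the description above.

```python
from typing import Protocol


class TccClient(Protocol):
    """Hypothetical interface to the TCC web service (not yet specified)."""

    def new_participants(self) -> list[dict]: ...

    def select_for_wga(self, already_in_wga: list[int]) -> list[int]: ...


def start_searching(samples: dict[int, dict], tcc: TccClient) -> list[int]:
    """Run the START SEARCHING steps on an in-memory stand-in for SAMPLES."""
    # 1. Insert new participants as new SAMPLES records.
    for p in tcc.new_participants():
        samples.setdefault(
            p["ll_number"],
            {
                "gender": p.get("gender"),
                "freezer_position": p.get("freezer_position"),
                "in_wga": False,
            },
        )
    # 2. Build the list of samples already selected for WGA (INWGA = TRUE).
    already = sorted(ll for ll, s in samples.items() if s["in_wga"])
    # 3 and 4. Call the selection algorithm at TCC and send the list along.
    selected = tcc.select_for_wga(already)
    # 5. Set INWGA to TRUE for every returned sample and return the list for display.
    for ll in selected:
        samples.setdefault(ll, {"in_wga": False})["in_wga"] = True
    return selected
```

Passing the client in as a parameter keeps the workflow testable with a fake client until the real web service exists.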

In the above screen the user is asked to select a particular sample. If the user is not interested in a specific sample, but in a particular sheet or session, the user can instead enter the genotyping date, after which all samples, chip IDs and plates of that date are shown, and the user can select one of them. The user can then click any of the lower buttons to view the genotyping quality of a particular sample (a sample must be selected), of a particular chip (a sample or chipID must be selected), of a particular plate or session (a sample, chipID or plateID must be selected) or of the total project (nothing needs to be selected). The button TOTAL PROJECT is therefore clickable right away, but the other four buttons only after something has been selected.

Various statistics are used to assess the genotyping quality; they will be calculated using software by Lude Franke (called TriTyper). The current TriTyper program analyzes a dataset of multiple (hundreds of) samples and is therefore not suitable for single-sample testing. Lude Franke will create a version that can check a single sample. An important difference between single-sample checking and QC of a large dataset is that some statistics cannot be calculated for a single sample, as shown in the table below.

Statistic            Dataset of hundreds of samples  Dataset of a single sample
Call rate            YES                             YES
Check gender         YES                             YES
Heterozygosity       YES                             YES
Hardy Weinberg test  YES                             NO
CNV testing          YES                             NO

The above screen shows the genotyping quality of a plate. Apparently, in this case row 7 is not good (pipetting error?) and column F is not good (chip xxxx). If the user clicks on (or hovers over) a particular spot, the Lifelines number (and sampleID?) is shown.

Current file formats (Excel):

The following tables are currently in use (mainly Excel). Fields through which tables can be linked are indicated in yellow.

DNA-STORAGE

Lade  Racknr  RackID   Position  2D-bar   LL#      QiaTube  SampID   Isoldate  NDconc  260/280  260/230  IsoMeth  Charac
1     43      1052765  A1        8309446  1597677  3627     26794-5  2-9-2009  412,3   1,9      2,2      blabla   blabla

WGA-STUDIE

Lade  RackNr  Position  2D-bar   LL#      Isoldate  NDconc  260/280  260/230  IsoMeth
1     43      A1        8309446  1597677  2-9-2009  412,3   1,9      2,2      blabla

DUPLOBOX

Lade  RackNr  Position  2D-bar   LL#      Isoldate  NDconc
1     43      A1        8309446  1597677  2-9-2009  412,3

NORMALIZED 100 ug/ml

Lade  RackNr  BoxID    Position  2D-bar   LL#      Isoldate
1     43      1052765  A1        8309446  1597677  2-9-2009

BoxID appears to match RackID.

UITVAL

LL#      Isoldate  NDconc  IsoMeth  Analist
1597677  2-9-2009  412,3   blabla   XX

IRRESOLVABLE SAMPLES

RackID  Position  TubeID   LL#      Isoldate
43      A1        8309446  1597677  2-9-2009

PACKED CELLS

LL#
1597677

SAMPLE SHEET (automatically generated)

SampleID  S.Plate  S.Name   Project           AMP_Plate       SampleWell  SentrixBar_A  SentrixPos_A
1799306   0        1799306  LifelinesPlate27  wg0002915-msa3  A01         4799112148    R01C01

Additional columns: Scanner, DateScan, Replicate, Parent1, Parent2, Gender

SampleID = DNA-STORAGE.LL#

NANODROP FILE (automatically generated)

PlateID  Well  SampID  UserID   Date      Time   Conc   Units  A260    A280   260/280  260/230  ConcFac  etc…
1024897  A1            default  1-1-2010  10:31  642,0  ng/ul  12,840  6,913  1,86     2,02     50,00    ……

PlateID = DNA-STORAGE.RackID
Well = DNA-STORAGE.Position
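Importing a nanodrop row into SAMPLES requires converting the decimal-comma values and mapping PlateID and Well onto RackID and Position. The sketch below shows one way to do this for the variables that are imported (Date, Conc, 260/280, 260/230). It assumes that each row has already been read into a dictionary keyed by column name and that dates are written day-month-year; the function names and output keys are illustrative only.

```python
from datetime import datetime


def parse_decimal(text: str) -> float:
    """Convert a decimal-comma number such as '642,0' to a float."""
    return float(text.replace(",", "."))


def parse_nanodrop_row(row: dict) -> dict:
    """Map one nanodrop row onto the fields used in the SAMPLES table."""
    return {
        "rack_id": int(row["PlateID"]),  # PlateID = DNA-STORAGE.RackID
        "position": row["Well"],         # Well = DNA-STORAGE.Position
        # Assumption: dates are day-month-year, e.g. 1-1-2010.
        "date": datetime.strptime(row["Date"], "%d-%m-%Y").date(),
        "conc": parse_decimal(row["Conc"]),
        "ratio_260_280": parse_decimal(row["260/280"]),
        "ratio_260_230": parse_decimal(row["260/230"]),
    }
```

The pair (rack_id, position) can then be used to look up the matching DNA-STORAGE record.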

FINAL REPORT

SNPname      SampleID  Allele1  Allele2  GC-score
cnvi0111185  1073859   -        -        0.0000…

SampleID = DNA-STORAGE.LL#

QUAGENPOS

SampleId    QiagenId  QiagenId_1  TubeId      TubeId_1    RackPos  RackId     Error
QiagenPos1  6837      6837        1030342040  1030342040  A01      100200676

QiagenId = DNA-STORAGE.QiaTube

3.0 Data Model and Description

This section describes the information domain for the software.

3.1 Data Description

Data objects that will be managed/manipulated by the software are described in this section.

3.1.1 Data objects

Data objects and their major attributes are described.

3.1.2 Relationships

Relationships among data objects are described using an ERD-like form. No attempt is made to provide detail at this stage.

3.1.3 Complete data model

An ERD for the software is developed.

table SAMPLES
fields:

  • Lifelines number: 7-digit integer, unique, not null
  • Gender: integer (1=male, 2=female)
  • SampleID: string (length 12)
  • Lade: integer
  • RackNumber: integer (values <100)
  • RackID: integer (values >1000)
  • Position: string (length 2)
  • 2D-barcode: integer
  • Isolation Date: date
  • ND8000 conc: float
  • Ratio 260/280: float
  • Ratio 260/230: float
  • Isolation method: string (length 1000)
  • Characteristics: string (length 1000)
  • Duplo: Boolean
  • Normalized: Boolean
  • InWGA: Boolean
  • Irresolvable: Boolean
  • Uitgangsmateriaal: string

table VRIEZER (probably not needed)
fields:

  • Lifelines number: string (formatted as “LL-1234”), unique, not null
  • RackID: integer
  • LadeNr: integer
  • Positie: integer

table SELECTIELIJST
fields:

  • Lifelines number: 7-digit integer, unique, not null
  • FamNr: integer
  • RelGen: integer
  • High: Boolean
  • Datum: date
  • Comment: string (length 100)
  • Positie: string (length 1000)
  • Status: string (length 10)

table QUALITY
fields:

  • Lifelines number: 7-digit integer, unique, not null
  • Call rate: float
  • Heterozygosity: float
  • HW: float
  • Gendercorrect: Boolean
  • CNVstat: float (??)
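To make the data model concrete, the sketch below expresses the SAMPLES and QUALITY tables as an SQLite schema. SQLite is used purely for illustration and is not the chosen platform; the snake_case column names, the storage of Booleans as 0/1 integers and the foreign key from QUALITY to SAMPLES are assumptions derived from the field lists above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE samples (
    lifelines_number  INTEGER PRIMARY KEY,               -- 7-digit, unique, not null
    gender            INTEGER CHECK (gender IN (1, 2)),  -- 1=male, 2=female
    sample_id         TEXT,
    lade              INTEGER,
    rack_number       INTEGER,
    rack_id           INTEGER,
    position          TEXT,
    barcode_2d        INTEGER,
    isolation_date    TEXT,
    nd8000_conc       REAL,
    ratio_260_280     REAL,
    ratio_260_230     REAL,
    isolation_method  TEXT,
    characteristics   TEXT,
    duplo             INTEGER NOT NULL DEFAULT 0,
    normalized        INTEGER NOT NULL DEFAULT 0,
    in_wga            INTEGER NOT NULL DEFAULT 0,
    irresolvable      INTEGER NOT NULL DEFAULT 0,
    uitgangsmateriaal TEXT
);
CREATE TABLE quality (
    lifelines_number INTEGER PRIMARY KEY REFERENCES samples (lifelines_number),
    call_rate        REAL,
    heterozygosity   REAL,
    hw               REAL,
    gender_correct   INTEGER,
    cnv_stat         REAL
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a database with the sketched SAMPLES and QUALITY tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

A new participant can then be inserted with only Lifelines number and gender, leaving the remaining sample information empty until it becomes available.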

3.1.4 Data dictionary

A reference to the data dictionary is provided. The dictionary is maintained in electronic form.

4.0 Description of the Sample selection algorithm

This algorithm has been developed by TCC. In its current form it selects genetically unique samples from a static dataset. The algorithm needs to be changed so that it operates on a dynamic database, meaning that it uses the most recent information on whether each sample has already been used for a WGA. Furthermore, the algorithm should only return Lifelines numbers of participants who had their 2nd visit on the day the algorithm is run.
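The actual TCC algorithm is not described in this document. Purely to illustrate the dynamic behaviour required above, the sketch below assumes that "genetically distinct" can be approximated by allowing at most one participant per family number (FamNr), counting families that are already represented in the WGA. The function name, the dictionary keys and that family rule are all assumptions, not the TCC method.

```python
from datetime import date


def select_for_wga(participants, in_wga, today):
    """Return Lifelines numbers to add to the WGA on the given day.

    participants: dicts with 'll_number', 'fam_nr' and 'second_visit' (a date).
    in_wga: set of Lifelines numbers already included in the WGA.
    """
    # Families that already have a member in the WGA (most recent information).
    used_families = {p["fam_nr"] for p in participants if p["ll_number"] in in_wga}
    selected = []
    for p in sorted(participants, key=lambda item: item["ll_number"]):
        if p["ll_number"] in in_wga or p["second_visit"] != today:
            continue  # already included, or no 2nd visit on the day of the run
        if p["fam_nr"] in used_families:
            continue  # assumed rule: a relative is already in the WGA
        selected.append(p["ll_number"])
        used_families.add(p["fam_nr"])
    return selected
```

Because in_wga is passed in on every run, the selection always reflects the current state of the INWGA flags rather than a static dataset.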

Last modified on 2010-10-01T23:19:13+02:00
