Changes between Initial Version and Version 1 of GenotypeMolgenis


Timestamp:
2010-10-01T23:19:13+02:00 (14 years ago)
Author:
trac
Comment:

--

  • GenotypeMolgenis

    v1 v1  
     1= Genotype Information Management System (GIMS) =
     2
     3||developers:||AndreDeVries, JorisLops, MorrisSwertz||
     4||state:||design||
     5
     6At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary. Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre). [[BR]]
     7
     8The project is divided into three tasks:[[BR]]
     91. Convert the current Excel and text file 'system' to a database with a user interface.[[BR]]
     102. Implement webservice (hosted by TCC) which daily delivers a list of new, genetically distinct, samples to be included in the WGA. This should be a dynamic process: TCC uses updated information about whether samples have already been included in WGA.[[BR]]
     113. Monitoring of genotyping quality. Coupling with external software to assess genotyping quality. Various presentations in UI of the genotyping quality.[[BR]]
     12
     13Tasks 1 & 2 are planned to be finished by mid March 2010.
     14Task 3 is next and is planned to be finished by about end of April 2010.
     15
     16Next, this piece of software could be used and extended at other places, such as the genotyping facility at the genetics department.
     17
     18
     19
     20
     21REQUIREMENTS:
     22
     231.0     Introduction
     24At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary.
     25Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre). Selected samples should be genetically distinct and of good quality. This selection is currently performed on a static dataset and should instead operate on a dynamic database.
     26In addition, better monitoring of the genotyping quality is desired.
     27 1.1 Goals and objectives
     28The LL-GIMS (LifeLines Genetics Information Management System) aims to help the lab staff of the LifeLines project monitor sample handling and genotyping more effectively.
     29 1.2 Statement of scope
     30The software system will have the following three key features, subdivided in more detailed functions:
     311)      Structured storage of sample information (database instead of Excel sheets)
     32o       A new database is needed, which will initially be filled with the current Excel data.
     33o       New samples and sample information from robots/machines are automatically put into the database.
     34o       Users can view, edit samples, samplesheets, batches.
     35o       Users can set properties for samples: whether they are duplicates (duplo), are normalized, are used for WGA, are irresolvable or are packed cells (starting material).
     36
     372)      Algorithm to determine whether a particular lifelines sample should be selected for WGA-analysis (good quality and genetically distinct).
     38o       A daily query should be performed (either by the push of a button or at a particular time) on the TCC LifeLines database to find all samples that have had their 2nd visit that day and that are genetically distinct, so that they can be selected for the WGA study. NB: this means that the algorithm needs to know, for each sample, whether it is already in the WGA.
     39o       This query not only gives Lifelines identifiers, but also at least their gender and freezer storage position if that information is available.
     40o       A web service is used to connect between the TCC algorithm and GIMS application.
     41o       A list of samples to be included in the WGA is presented to the user and they will be flagged as “InWGA”.
     42
     433)      Monitoring of genotype quality of a sample, chip and sheet.
     44o       Checks of the quality of the sample (concentration, genotyping call rate, heterozygosity, Hardy-Weinberg). Information from Illumina BeadStudio is used. The QC-pipeline (or software by Lude) might also be used, adapted for single sample testing.
     45o       The user can inspect, in a single view, the quality of a set of 12 samples from a single chip, and also the set of 92 samples from a sheet. Perhaps in such a way:
     46  where red=bad quality, yellow=borderline.
     47Beware the mapping between position on plate and position on the chip. How is this exactly done?
     48o       Quality monitoring through time, per sample, per chip, per sheet, per session and over the total project.
     49o       Users can inspect nanodrop graphs, which are constructed on the fly from the nanodrop file. How is the file located?
     50As input, the following is needed:
     51•       Sample information (database, initially filled with data from excel sheets).
     52•       Nanodrop files.
     53•       Raw intensity files (for the QC-software).
     54Output will be:
     55•       Quality information from the genotyping/QC-pipeline (stored in database).
     56•       A yes/no value saying whether a sample should be included in the WGA.
     57 1.3 Software context
     58This software is related to the LifeLines genetics analysis architecture that will be built. The same (or parts of the same) QC-pipeline may be used in both. The LL-GIMS will connect to the LifeLines genetic analysis portal (in whatever form) through the LifeLines identifier value.
     59 1.4 Major constraints
     60The connection to the TCC database is unclear. The sample selection algorithm needs information from both the TCC database (family relationships, gender) and the GIMS dataset.
     612.0 Usage scenario
     62 2.1 User profiles
     63Lab people will be the users. There will be one administrator, who can perform the following special actions:
     64•       Create/remove user accounts, reset passwords.
     65•       Delete samples.
     66 2.2 Use-cases
     671.      Enter/update a sample
     68New samples are automatically inserted daily into the GIMS dataset by retrieving them from the TCC lifelines database as soon as they become known. So, new records are initially placed in the SAMPLES table, without any sample information (such as 2D-barcode and RackID), but including Lifelines number, freezer storage position (if available) and gender.
     69How are RackID and 2D-barcode automatically obtained? Also from TCC? Through automatically generated files?
     70Nanodrop information is imported in the database through an operation done by the user (or automatically?). The user can inspect the nanodrop plot.
     71 
     722.      Check whether a sample should be selected for WGA
     73Upon the push of a button by a user, information about whether samples are in WGA is pushed to the TCC Lifelines database. Subsequently, the sample selection algorithm runs at TCC and returns a list of samples that should be used in WGA. This list is then automatically checked: is the concentration OK, and is the 260/280 ratio OK (when there are 2 records with the same Lifelines number, take the one with the best values)? Is this information already known on that day? Good samples are presented to the user, who must confirm that the samples will be used in WGA.
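
The automatic check described above could look roughly like the following sketch. The 260/280 limit of 1.8 comes from the sample screen description; the concentration limit, the meaning of "best values" and all names (`NanodropRecord`, `filter_candidates`, ...) are assumptions made for illustration only.

```python
from dataclasses import dataclass

MIN_RATIO_260_280 = 1.8  # limit quoted in the sample screen description
MIN_CONC = 50.0          # ASSUMPTION: placeholder, no concentration limit is specified


@dataclass(frozen=True)
class NanodropRecord:
    ll_number: int        # Lifelines number
    conc: float           # ND8000 concentration
    ratio_260_280: float


def best_record_per_sample(records):
    """Keep one record per Lifelines number, preferring the better values."""
    best = {}
    for rec in records:
        current = best.get(rec.ll_number)
        # ASSUMPTION: "best" means highest 260/280 ratio, ties broken by concentration.
        if current is None or (rec.ratio_260_280, rec.conc) > (
            current.ratio_260_280,
            current.conc,
        ):
            best[rec.ll_number] = rec
    return best


def passes_qc(rec):
    """True when both the concentration and the 260/280 ratio are acceptable."""
    return rec.conc >= MIN_CONC and rec.ratio_260_280 >= MIN_RATIO_260_280


def filter_candidates(candidate_ll_numbers, records):
    """Return the candidates whose best record passes both checks.

    Candidates without any nanodrop record are dropped, which is one possible
    answer to the open question of whether the data is known on that day.
    """
    best = best_record_per_sample(records)
    return [ll for ll in candidate_ll_numbers if ll in best and passes_qc(best[ll])]
```

The list returned by `filter_candidates` would then be shown to the user for confirmation.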
     74 
     753.      Monitor genotyping quality
     76Genotyping quality measures will be obtained from the Illumina BeadStudio software. Genotyping quality will also be assessed using special software (written by Lude Franke) and fed into the GIMS database. The user can inspect the quality of a particular sample (through the LL#) and obtain the following statistics: call rate, HW-chi2, is-gender-correct, CNV-statistics (if applicable). A flag will be set for samples that are not within specifications.
     77Also, the user can inspect the quality of all samples of a particular chip (SampleSheet.SentrixBarcode_A), which presents 12 samples. Average, min, max are shown. Also, the user can inspect the quality of all samples of a plate (AMP_Plate). Average, min, max are shown.
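
As a rough illustration of the per-chip and per-plate summaries (average, min, max), the sketch below groups samples by a key such as `SentrixBarcode_A` or `AMP_Plate`. The dictionary layout and the `call_rate` key are assumptions; only the two grouping field names are taken from the sample sheet.

```python
from collections import defaultdict
from statistics import mean


def summarize_quality(samples, group_key, metric="call_rate"):
    """Group samples by `group_key` and report n, average, min and max of `metric`.

    `samples` is an iterable of dicts; the keys used here are hypothetical
    apart from the sample sheet column names passed as `group_key`.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample[metric])
    return {
        key: {"n": len(values), "avg": mean(values), "min": min(values), "max": max(values)}
        for key, values in groups.items()
    }
```

Calling `summarize_quality(samples, "SentrixBarcode_A")` would give one entry per chip, and `summarize_quality(samples, "AMP_Plate")` one entry per plate.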
     78 
     79
     80 2.3 Special usage considerations
     81Connections with the TCC Lifelines database and with the QC software are needed.
     82 2.4 Screenshots and detailed description
     83 
     84The main menu shows various options, which lead to the next screen. User administration is available (and visible) for Admin only.
     85
     86 
     87The user enters a Lifelines number of interest and hits the FIND button.  Perhaps, the user could also search through sampleID, 2D-barcode, RackID or isolation date (not shown). If a record is found, then there may be more types (stock, duplo, normalized, inWGA, Irresolvable), from which the user can choose. Then, the details are shown. The user may then hit the EDIT button, after which the screen becomes editable for a number of fields (all except LifeLines number, 2D-barcode, ND8000 and ratios). Values in red indicate that there are quality issues with the sample (in this case: ratio 260/280 should be ≥1.8).
     88The sample may be deleted, but only admin can do this.
     89The user can inspect the nanodrop plot by the push of a button. From the respective nanodrop file (how is the file located?), information is read to construct the nanodrop plot. Perhaps, the nanodrop plot could be shown directly on the above screen?
     90
     91 
     92Using the above screen, the user imports information from a nanodrop file into the database. The following variables are imported: Date, Conc, A260/A280, A260/A230. They are stored in the table SAMPLES. (Alternatively, this screen is obsolete if nanodrop information is imported automatically).
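
A minimal sketch of such an import is given below. It assumes the nanodrop file is tab-separated, uses a decimal comma and day-month-year dates (as the example values in the file format listing suggest); the column names are taken from the NANODROP FILE header, while the function names and output keys are hypothetical.

```python
import csv
import io
from datetime import datetime


def parse_decimal_comma(text):
    """Convert a number written with a decimal comma, e.g. '642,0', to a float."""
    return float(text.replace(",", "."))


def read_nanodrop(stream):
    """Read the variables to import (Date, Conc, 260/280, 260/230) from a nanodrop file.

    ASSUMPTIONS: tab-separated columns, header row present, dates as D-M-YYYY.
    """
    rows = []
    for row in csv.DictReader(stream, delimiter="\t"):
        rows.append(
            {
                "plate_id": row["PlateID"],  # matches DNA-STORAGE RackID
                "well": row["Well"],         # matches DNA-STORAGE Position
                "date": datetime.strptime(row["Date"], "%d-%m-%Y").date(),
                "conc": parse_decimal_comma(row["Conc"]),
                "ratio_260_280": parse_decimal_comma(row["260/280"]),
                "ratio_260_230": parse_decimal_comma(row["260/230"]),
            }
        )
    return rows


# Usage with an in-memory file built from the example row in the format listing.
example = (
    "PlateID\tWell\tSampID\tUserID\tDate\tTime\tConc\tUnits\tA260\tA280\t260/280\t260/230\n"
    "1024897\tA1\t\tdefault\t1-1-2010\t10:31\t642,0\tng/ul\t12,840\t6,913\t1,86\t2,02\n"
)
imported = read_nanodrop(io.StringIO(example))
```

Each resulting dictionary could then be written to the matching record in the table SAMPLES, located through PlateID and Well.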
     93
     94 
     95In the above screen only the button START SEARCHING is visible at first. At the push of this button, a number of things happen behind the scenes in the following order:
     96•       The existence of new Lifelines participants is checked in the TCC lifelines database. The corresponding LLnumbers, freezer storage position (if available) and gender of new participants are inserted into the SAMPLES database as new records.
     97•       A list of samples that have already been selected for WGA (field SAMPLES.INWGA = TRUE) is constructed.
     98•       The sample selection algorithm at TCC is called and the just constructed sample list is sent along to the algorithm.
     99•       At TCC the sample selection algorithm (it must be modified from its current form!) runs and it delivers a list of participants that have come this day for the 2nd visit and that are genetically distinct. These samples should be included in the WGA.
     100•       The list of samples appears on screen.
     101The user can print or save the resulting list to file. All shown samples are automatically registered as being included in WGA (this is already administered at TCC as soon as the list is composed). The field INWGA is set to TRUE for all samples of the presented list.
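
The sequence of steps behind the START SEARCHING button can be sketched as follows. The TCC web service interface is not specified in this document, so `FakeTccClient` and its two methods are purely hypothetical stand-ins, and the in-memory dictionary stands in for the SAMPLES table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    ll_number: int
    gender: Optional[int] = None            # 1=male, 2=female
    freezer_position: Optional[str] = None  # only if available
    in_wga: bool = False


class FakeTccClient:
    """HYPOTHETICAL stand-in for the TCC web service."""

    def __init__(self, participants, selection):
        self._participants = participants
        self._selection = selection

    def new_participants(self):
        return list(self._participants)

    def select_for_wga(self, already_in_wga):
        # The real algorithm runs at TCC; here a fixed list is simply filtered.
        return [ll for ll in self._selection if ll not in already_in_wga]


def run_daily_search(samples, tcc):
    """`samples` maps Lifelines number -> Sample (stand-in for table SAMPLES)."""
    # 1. Insert new participants as new records.
    for participant in tcc.new_participants():
        samples.setdefault(participant.ll_number, participant)
    # 2. Build the list of samples already selected for WGA.
    already_in_wga = sorted(ll for ll, s in samples.items() if s.in_wga)
    # 3. Call the selection algorithm at TCC, sending the list along.
    selected = [ll for ll in tcc.select_for_wga(already_in_wga) if ll in samples]
    # 4. Register every returned sample as included in WGA.
    for ll in selected:
        samples[ll].in_wga = True
    # 5. The caller shows this list on screen.
    return selected
```

Keeping the TCC access behind one small client object is a design choice made here so that the workflow can be tested before the real connection exists.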
     102
     103 
     104In the above screen the user is asked to select a particular sample. If the user is not interested in a specific sample, but in a particular sheet or session, the user can instead enter the genotyping date, after which all samples, all chip IDs and all plates of that date are shown. The user can select one of them.
     105Then, the user can click on any of the lower buttons, in order to view the genotyping quality of a particular sample (must be selected), of a particular chip (sample or chipID must be selected), of a particular plate or session (sample, chipID or plateID must be selected) or of the total project (nothing needs to be selected). The button TOTAL PROJECT is therefore clickable right away, but the other four buttons only after selecting something.
     106There are various statistics to assess the genotyping quality, which will be calculated using software by Lude Franke (called TriTyper). The current TriTyper program analyzes a dataset of multiple (hundreds of) samples and is therefore not suitable for single sample testing. Lude Franke will create a version that will be able to check a single sample.
     107An important difference between single sample checking and QC of a large dataset is that some statistics cannot be calculated for a single sample, as shown in the table below.
     108||Statistic||Dataset of hundreds of samples||Dataset of a single sample||
     109||Call rate||YES||YES||
     110||Check gender||YES||YES||
     111||Heterozygosity||YES||YES||
     112||Hardy Weinberg test||YES||NO||
     113||CNV testing||YES||NO||
     114
     115 
     116
     117 
     118
     119 
     120The above screen shows the genotyping quality of a plate. Apparently, in this case row 7 is not good (pipetting error?) and column F is not good (chip xxxx). If the user clicks on (or hovers over) a particular spot, the Lifelines number (and sampleID?) is shown.
     121 
     122Current file formats (Excel):
     123The following tables are currently in use (mainly Excel). Fields through which a connection can be made are indicated in yellow.
     124DNA-STORAGE
     125Lade    Racknr  RackID  Position        2D-bar  LL#     QiaTube SampID  Isoldate        NDconc  260/280 260/230 IsoMeth Charac
     1261       43      1052765 A1      8309446 1597677 3627    26794-5 2-9-2009        412,3   1,9     2,2     blabla  blabla
     127
     128WGA-STUDIE
     129Lade RackNr             Position        2D-bar  LL#                     Isoldate        NDconc 260/280  260/230 IsoMeth
     1301       43              A1      8309446 1597677                 2-9-2009        412,3   1,9     2,2     blabla 
     131
     132DUPLOBOX
     133Lade RackNr             Position        2D-bar  LL#                     Isoldate        NDconc
     1341       43              A1      8309446 1597677                 2-9-2009        412,3   
     135
     136NORMALIZED 100 ug/ml
     137Lade RackNr     BoxID   Position        2D-bar  LL#                     Isoldate
     1381       43      1052765 A1      8309446 1597677                 2-9-2009       
     139BoxID appears to match RackID
     140
     141UITVAL
     142                                        LL#                     Isoldate        NDconc                  IsoMeth Analist
     143                                        1597677                 2-9-2009        412,3                   blabla  XX     
     144
     145IRRESOLVABLE SAMPLES
     146        RackID          Position        TubeID  LL#                     Isoldate                                       
     147        43              A1      8309446 1597677                 2-9-2009       
     148
     149PACKED CELLS
     150                                        LL#
     151                                        1597677
     152
     153SAMPLE SHEET  (automatically generated)
     154SampleID        S.Plate S.Name  Project         AMP_Plate       SampleWell      SentrixBar_A    SentrixPos_A
     1551799306 0       1799306 LifelinesPlate27        wg0002915-msa3  A01                     4799112148      R01C01
     156Additional columns:  Scanner, DateScan, Replicate, Parent1, Parent2, Gender
     157SampleID = DNA-STORAGE .LL#
     158
     159NANODROP FILE (automatically generated)
     160PlateID   Well  SampID  UserID  Date    Time    Conc    Units   A260    A280    260/280 260/230 ConcFac     etc…
     1611024897  A1             default 1-1-2010        10:31   642,0   ng/ul   12,840  6,913   1,86    2,02    50,00       ……
     162PlateID = DNA-STORAGE .RackID#
     163Well = DNA-STORAGE .Position
     164
     165FINAL REPORT
     166SNPname SampleID        Allele1 Allele2 GC-score
     167cnvi0111185     1073859         -               -               0.0000…
     168SampleID = DNA-STORAGE .LL#
     169
     170QUAGENPOS
     171SampleId        QiagenId             QiagenId_1 TubeId          TubeId_1                RackPos         RackId          Error
     172QiagenPos1      6837         6837               1030342040      1030342040      A01             100200676               
     173QiagenId = DNA-STORAGE . QiaTube
     174
     175
     1763.0 Data Model and Description
     177This section describes the information domain for the software.
     178 3.1 Data Description
     179Data objects that will be managed/manipulated by the software are described in this section.
     180 3.1.1 Data objects
     181Data objects and their major attributes are described.
     182 3.1.2 Relationships
     183Relationships among data objects are described using an ERD- like form. No attempt is made to provide detail at this stage.
     184 3.1.3 Complete data model
     185An ERD for the software is developed.
     186table SAMPLES
     187fields: Lifelines number        7-digit integer         unique, not null
     188        Gender          integer                 (1=male, 2=female)
     189        SampleID                string (length 12)
     190        Lade                    integer
     191        RackNumber              integer  (values <100)
     192        RackID          integer (values >1000)
     193        Position                string (length 2)
     194        2D-barcode              integer
     195        Isolation Date          date
     196        ND8000 conc             float
     197        Ratio 260/280           float
     198        Ratio 260/230           float
     199        Isolation method        string (length 1000)
     200        Characteristics string (length 1000)
     201        Duplo                   Boolean
     202        Normalized              Boolean
     203        InWGA           Boolean
     204        Irresolvable            Boolean
     205        Uitgangsmateriaal       string
     206table VRIEZER   probably not needed
     207fields: Lifelines number        string (formatted as “LL-1234”)         unique, not null
     208        RackID          integer
     209        LadeNr          integer
     210        Positie                 integer
     211table SELECTIELIJST
     212fields: Lifelines number        7-digit integer         unique, not null
     213        FamNr                   integer
     214        RelGen          integer
     215        High                    Boolean
     216        Datum                   date
     217        Comment         string (length 100)
     218        Positie                 string (length 1000)
     219        Status                  string (length 10)
     220table QUALITY
     221fields: Lifelines number        7-digit integer         unique, not null
     222        Call rate               float
     223        Heterozygosity  float
     224        HW                      float
     225        Gendercorrect           Boolean
     226        CNVstat         float           ??
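
To make the data model concrete, the tables SAMPLES and QUALITY could be expressed as below. This is only a sketch using SQLite; the snake_case column names, the storage of dates as ISO strings and the range check on the Lifelines number are assumptions, and the tables VRIEZER and SELECTIELIJST are left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE samples (
    -- ASSUMPTION: "7-digit integer" is enforced as a numeric range
    ll_number         INTEGER PRIMARY KEY
                      CHECK (ll_number BETWEEN 1000000 AND 9999999),
    gender            INTEGER CHECK (gender IN (1, 2)),  -- 1=male, 2=female
    sample_id         TEXT CHECK (length(sample_id) <= 12),
    lade              INTEGER,
    rack_number       INTEGER CHECK (rack_number < 100),
    rack_id           INTEGER CHECK (rack_id > 1000),
    position          TEXT,
    barcode_2d        INTEGER,
    isolation_date    TEXT,                               -- ISO date string
    nd8000_conc       REAL,
    ratio_260_280     REAL,
    ratio_260_230     REAL,
    isolation_method  TEXT,
    characteristics   TEXT,
    duplo             INTEGER NOT NULL DEFAULT 0 CHECK (duplo IN (0, 1)),
    normalized        INTEGER NOT NULL DEFAULT 0 CHECK (normalized IN (0, 1)),
    in_wga            INTEGER NOT NULL DEFAULT 0 CHECK (in_wga IN (0, 1)),
    irresolvable      INTEGER NOT NULL DEFAULT 0 CHECK (irresolvable IN (0, 1)),
    uitgangsmateriaal TEXT
);

CREATE TABLE quality (
    ll_number      INTEGER PRIMARY KEY REFERENCES samples (ll_number),
    call_rate      REAL,
    heterozygosity REAL,
    hw             REAL,
    gender_correct INTEGER CHECK (gender_correct IN (0, 1)),
    cnv_stat       REAL  -- type still undecided in the model above
);
"""


def create_database(path=":memory:"):
    """Create the two tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn
```

The CHECK constraints mirror the value ranges noted in the field list, so an out-of-range value is rejected at insert time rather than discovered later.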
     227               
     228 3.1.4 Data dictionary
     229A reference to the data dictionary is provided. The dictionary is maintained in electronic form.
     230 
     2314.0 Description of the Sample selection algorithm
     232This algorithm was developed by TCC. In its current form it selects genetically unique samples from a static dataset. It needs to be changed so that it operates on a dynamic database, meaning that it uses the most recent information on whether each sample has already been used for WGA. Furthermore, the algorithm should only return Lifelines numbers of participants who had their 2nd visit on the day the algorithm is run.
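
The dynamic behaviour described above might be sketched as follows. The actual TCC algorithm is not documented here, so the rule used for "genetically distinct" (at most one sample per family number, FamNr) is an assumption for illustration only, as are all names in the code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Participant:
    ll_number: int
    fam_nr: int                    # family number, as in table SELECTIELIJST
    second_visit: Optional[date]   # None if the 2nd visit has not happened yet


def select_for_wga(participants, in_wga, today):
    """Return Lifelines numbers to add to the WGA on `today`.

    `in_wga` is the current set of Lifelines numbers already in the WGA,
    supplied at call time rather than read from a static dataset.
    ASSUMPTION: two samples are genetically distinct when their FamNr differs.
    """
    used_families = {p.fam_nr for p in participants if p.ll_number in in_wga}
    selected = []
    for p in sorted(participants, key=lambda p: p.ll_number):
        if p.second_visit != today or p.ll_number in in_wga:
            continue  # only participants whose 2nd visit is today, not yet in WGA
        if p.fam_nr in used_families:
            continue  # a relative is already in the WGA or was just selected
        selected.append(p.ll_number)
        used_families.add(p.fam_nr)
    return selected
```

Because `in_wga` is passed in on every call, the result reflects the latest WGA status instead of a snapshot.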
     233