Changes between Initial Version and Version 1 of GenotypeMolgenis

Timestamp:: 2010-10-01T23:19:13+02:00 (14 years ago)
Author:: trac
Comment:: --

GenotypeMolgenis

                       v1
+= Genotype Information Management System (GIMS) =
+||developers:||AndreDeVries, JorisLops, MorrisSwertz||
+||state:||design||
+At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary. Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre). [[BR]]
+The project is divided into three tasks:[[BR]]
+1. Convert the current Excel and text file 'system' to a database with a user interface.[[BR]]
+2. Implement a web service (hosted by TCC) that delivers a daily list of new, genetically distinct samples to be included in the WGA. This should be a dynamic process: TCC uses updated information about whether samples have already been included in the WGA.[[BR]]
+3. Monitor genotyping quality: couple with external software to assess genotyping quality and present the genotyping quality in various ways in the UI.[[BR]]
+Tasks 1 & 2 are planned to be finished by mid March 2010.
+Task 3 is next and is planned to be finished by about end of April 2010.
+Next, this piece of software could be used and extended at other places, such as the genotyping facility at the genetics department.
+REQUIREMENTS:
+1.0 Introduction
+At the LifeLines facility at the Bloemsingel in Groningen, samples of thousands of inhabitants of the north of the Netherlands are collected for study. They are stored in refrigerators and retrieved from there when necessary.
+Samples are selected for inclusion in the first stage of genome-wide studies using a software algorithm which is currently hosted by TCC (Trial Coordination Centre). Selected samples should be genetically distinct and of good quality. This selection is currently performed on a static dataset and should instead be performed on a dynamic database.
+In addition, better monitoring of the genotyping quality is desired.
+1.1 Goals and objectives
+The LL-GIMS (LifeLines genetics information management system) aims to assist the lab people of the LifeLines project in monitoring sample handling and genotyping.
+1.2 Statement of scope
+The software system will have the following three key features, subdivided into more detailed functions:
+1)      Structured storage of sample information (database instead of Excel sheets)
+o       A new database is needed, which will initially be filled with the current Excel data.
+o       New samples and sample information from robots/machines are automatically put into the database.
+o       Users can view and edit samples, sample sheets and batches.
+o       Users can set properties for samples: whether they are in duplo, are normalized, are used for WGA, are irresolvable or are packed cells (starting material, "uitgangsmateriaal").
+2)      Algorithm to determine whether a particular Lifelines sample should be selected for WGA analysis (good quality and genetically distinct).
+o       A daily query should be performed (either at the push of a button or at a particular time) on the TCC LifeLines database to find all samples that have had their 2nd visit that day and that are genetically distinct, so that they can be selected for the WGA study. NB: this means that the algorithm needs to know for each sample whether it is already in the WGA.
+o       This query not only gives Lifelines identifiers, but also at least their gender and freezer storage position if that information is available.
+o       A web service is used to connect the TCC algorithm and the GIMS application.
+o       A list of samples to be included in the WGA is presented to the user and these samples are flagged as “InWGA”.
+3)      Monitoring of genotype quality of a sample, chip and sheet.
+o       Checks of the quality of the sample (concentration, genotyping call rate, heterozygosity, Hardy-Weinberg). Information from Illumina BeadStudio is used. The QC-pipeline (or software by Lude) might also be used, adapted for single sample testing.
+o       In a single view, the user can inspect the quality of a set of 12 samples from a single chip, and also of the set of 92 samples from a sheet. Perhaps in such a way:
+  where red=bad quality, yellow=borderline.
+Beware of the mapping between position on the plate and position on the chip. How exactly is this done?
+o       Quality monitoring through time, per sample, per chip, per sheet, per session and over the total project.
+o       Users can inspect nanodrop graphs, which are constructed on the fly from the nanodrop file. How is the file located?
+As input, the following is needed:
+•       Sample information (database, initially filled with data from excel sheets).
+•       Nanodrop files.
+•       Raw intensity files (for the QC-software).
+Output will be:
+•       Quality information from the genotyping/QC-pipeline (stored in database).
+•       A yes/no value saying whether a sample should be included in the WGA.
+1.3 Software context
+This software is related to the LifeLines genetics analysis architecture that will be built. The same (or parts of the same) QC-pipeline may be used in both. The LL-GIMS will connect to the LifeLines genetic analysis portal (in whatever form) through the LifeLines identifier value.
+1.4 Major constraints
+The connection to the TCC database is unclear. The sample selection algorithm needs information from both the TCC database (family relationships, gender) and the GIMS dataset.
+2.0 Usage scenario
+2.1 User profiles
+Lab people will be the users. There will be one administrator, who can perform the following special actions:
+•       Create/remove user accounts, reset passwords.
+•       Delete samples.
+2.2 Use-cases
+1.      Enter/update a sample
+New samples are automatically inserted daily into the GIMS dataset by retrieving them from the TCC lifelines database as soon as they become known. So, new records are initially placed in the SAMPLES table, without any sample information (such as 2D-barcode and RackID), but including Lifelines number, freezer storage position (if available) and gender.
+How are RackID and 2D-barcode obtained automatically? Also from TCC? Through automatically generated files?
+Nanodrop information is imported in the database through an operation done by the user (or automatically?). The user can inspect the nanodrop plot.
+2.      Check whether a sample should be selected for WGA
+Upon the push of a button by a user, information about whether samples are in WGA is pushed to the TCC Lifelines database. Subsequently, the sample selection algorithm runs at TCC and returns a list of samples that should be used in WGA. This list is then checked automatically: is the concentration OK, and is the 260/280 ratio OK? (When there are 2 records with the same Lifelines number, take the one with the best values.) Is this information already known on that day? Good samples are presented to the user, who must confirm that the samples will be used in WGA.
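+The automatic check on the returned list could be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` record, the function names and the minimum concentration of 50.0 are hypothetical, and "best values" is interpreted here as the highest concentration. Only the 260/280 threshold of 1.8 is taken from this page.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

MIN_RATIO_260_280 = 1.8   # threshold mentioned on this page
MIN_CONCENTRATION = 50.0  # hypothetical value, not specified on this page


@dataclass(frozen=True)
class Candidate:
    """One measured record for a participant (hypothetical structure)."""
    lifelines_number: int
    concentration: float
    ratio_260_280: float


def passes_qc(c: Candidate) -> bool:
    """True when both the concentration and the 260/280 ratio are acceptable."""
    return (c.concentration >= MIN_CONCENTRATION
            and c.ratio_260_280 >= MIN_RATIO_260_280)


def pick_best_per_participant(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep one passing record per Lifelines number."""
    best: Dict[int, Candidate] = {}
    for c in candidates:
        if not passes_qc(c):
            continue
        current = best.get(c.lifelines_number)
        # Assumption: "best values" means the highest concentration.
        if current is None or c.concentration > current.concentration:
            best[c.lifelines_number] = c
    return sorted(best.values(), key=lambda c: c.lifelines_number)
```

+The resulting list would still have to be confirmed by the user before the samples are flagged as used in WGA.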
+3.      Monitor genotyping quality
+Genotyping quality measures will be obtained from the Illumina BeadStudio software. Genotyping quality will also be assessed using special software (written by Lude Franke) and fed into the GIMS database. The user can inspect the quality of a particular sample (through the LL#) and obtain the following statistics: call rate, HW-chi2, is-gender-correct, CNV-statistics (if applicable). A flag will be set for samples that are not within specifications.
+Also, the user can inspect the quality of all samples of a particular chip (SampleSheet.SentrixBarcode_A), which presents 12 samples. Average, min, max are shown. Also, the user can inspect the quality of all samples of a plate (AMP_Plate). Average, min, max are shown.
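+The per-chip and per-plate summaries (average, min, max) could be computed along these lines. This is a sketch only: the function name and the input shape, pairs of a group identifier (for example a SentrixBarcode_A or AMP_Plate value) and a call rate, are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def summarize_call_rates(
    records: Iterable[Tuple[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Group (group_id, call_rate) pairs and return average, min and max per group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for group_id, call_rate in records:
        grouped[group_id].append(call_rate)
    return {
        group_id: {"average": mean(values), "min": min(values), "max": max(values)}
        for group_id, values in grouped.items()
    }
```

+The same function serves chips and plates, because only the grouping key differs.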
+2.3 Special usage considerations
+Connections with the TCC Lifelines database and with the QC software are needed.
+2.4 Screenshots and detailed description
+The main menu shows various options, which lead to the next screen. User administration is available (and visible) for Admin only.
+The user enters a Lifelines number of interest and hits the FIND button. Perhaps the user could also search by sampleID, 2D-barcode, RackID or isolation date (not shown). If a record is found, there may be more than one type (stock, duplo, normalized, inWGA, Irresolvable), from which the user can choose. Then the details are shown. The user may then hit the EDIT button, after which a number of fields become editable (all except LifeLines number, 2D-barcode, ND8000 and ratios). Values in red indicate that there are quality issues with the sample (in this case: ratio 260/280 should be ≥1.8).
+The sample may be deleted, but only admin can do this.
+The user can inspect the nanodrop plot at the push of a button. Information is read from the respective nanodrop file (how is the file located?) to construct the nanodrop plot. Perhaps the nanodrop plot could be shown directly on the above screen?
+Using the above screen, the user imports information from a nanodrop file into the database. The following variables are imported: Date, Conc, A260/A280, A260/A230. They are stored in the table SAMPLES. (Alternatively, this screen is obsolete if nanodrop information is imported automatically).
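+Importing the four variables from a nanodrop file could look roughly like this. The sketch assumes a tab-separated file with a header row like the NANODROP FILE example on this page (columns `PlateID`, `Well`, `Date`, `Conc`, `260/280`, `260/230`) and numbers written with a decimal comma. The function names and the output keys are hypothetical, and the date is kept as text because its format is not specified here.

```python
import csv
import io
from typing import Dict, List, TextIO


def parse_decimal(text: str) -> float:
    """Convert a number written with a decimal comma (e.g. '642,0') to a float."""
    return float(text.strip().replace(",", "."))


def read_nanodrop(handle: TextIO) -> List[Dict[str, object]]:
    """Read the columns needed for the SAMPLES table from a nanodrop export."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-separated
    rows: List[Dict[str, object]] = []
    for row in reader:
        rows.append({
            "plate_id": row["PlateID"].strip(),
            "well": row["Well"].strip(),
            "date": row["Date"].strip(),
            "conc": parse_decimal(row["Conc"]),
            "ratio_260_280": parse_decimal(row["260/280"]),
            "ratio_260_230": parse_decimal(row["260/230"]),
        })
    return rows


# Example with an in-memory file instead of a real export.
example = (
    "PlateID\tWell\tDate\tConc\t260/280\t260/230\n"
    "1024897\tA1\t1-1-2010\t642,0\t1,86\t2,02\n"
)
parsed = read_nanodrop(io.StringIO(example))
```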
+In the above screen only the button START SEARCHING is visible at first. At the push of this button, a number of things happen behind the scenes in the following order:
+•       The existence of new Lifelines participants is checked in the TCC lifelines database. The corresponding LLnumbers, freezer storage position (if available) and gender of new participants are inserted into the SAMPLES database as new records.
+•       A list of samples that have already been selected for WGA (field SAMPLES.INWGA = TRUE) is constructed.
+•       The sample selection algorithm at TCC is called and the newly constructed sample list is sent along to the algorithm.
+•       At TCC the sample selection algorithm (it must be modified from its current form!) runs and delivers a list of participants who have come for their 2nd visit that day and who are genetically distinct. These samples should be included in the WGA.
+•       The list of samples appears on screen.
+The user can print or save the resulting list to file. All shown samples are automatically registered as being included in WGA (this is already administered at TCC as soon as the list is composed). The field INWGA is set to TRUE for all samples of the presented list.
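+The sequence triggered by START SEARCHING could be sketched as below. The `TccClient` interface and its two methods are hypothetical placeholders for the web service, which is not yet defined, and the in-memory `samples` dictionary stands in for the SAMPLES table.

```python
from typing import Dict, Iterable, List, Protocol


class TccClient(Protocol):
    """Hypothetical interface to the TCC web service; not an existing API."""

    def fetch_new_participants(self) -> List[dict]: ...

    def select_for_wga(self, already_in_wga: Iterable[int]) -> List[int]: ...


def run_daily_search(samples: Dict[int, dict], tcc: TccClient) -> List[int]:
    """Run the steps behind START SEARCHING and return the selected numbers."""
    # Insert newly known participants as new records.
    for p in tcc.fetch_new_participants():
        samples.setdefault(p["lifelines_number"], {
            "gender": p.get("gender"),
            "freezer_position": p.get("freezer_position"),
            "in_wga": False,
        })
    # Build the list of samples already selected for WGA.
    already = [ll for ll, rec in samples.items() if rec["in_wga"]]
    # The selection algorithm at TCC receives that list and returns new selections.
    selected = tcc.select_for_wga(already)
    # Flag the returned samples; the list itself is handed back for display.
    for ll in selected:
        if ll in samples:
            samples[ll]["in_wga"] = True
    return selected
```

+Keeping the TCC calls behind one small interface would allow the user interface to be tested with a stand-in before the real connection exists.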
+In the above screen the user is asked to select a particular sample. If the user is not interested in a specific sample, but in a particular sheet or session, the user can instead enter the genotyping date, after which all samples, all chip IDs and all plates of that date are shown. The user can select one of them.
+Then, the user can click on any of the lower buttons, in order to view the genotyping quality of a particular sample (must be selected), of a particular chip (sample or chipID must be selected), of a particular plate or session (sample, chipID or plateID must be selected) or of the total project (nothing needs to be selected). The button TOTAL PROJECT is therefore clickable right away, but the other four buttons only after selecting something.
+There are various statistics to assess the genotyping quality, which will be calculated using software by Lude Franke (called TriTyper). The current TriTyper program analyzes a dataset of multiple (hundreds of) samples and is therefore not suitable for single sample testing. Lude Franke will create a version that will be able to check a single sample.
+An important difference between single sample checking and QC of a large dataset is that some statistics cannot be calculated for a single sample, as shown in the table below.
+||Statistic||Dataset of hundreds of samples||Dataset of a single sample||
+||Call rate||YES||YES||
+||Check gender||YES||YES||
+||Heterozygosity||YES||YES||
+||Hardy Weinberg test||YES||NO||
+||CNV testing||YES||NO||
+The above screen shows the genotyping quality of a plate. Apparently, in this case row 7 is not good (pipetting error?) and column F is not good (chip xxxx). If the user clicks on (or hovers over) a particular spot, the Lifelines number (and sampleID?) is shown.
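+A plate view of this kind could be driven by a small classification step such as the following sketch. The two call-rate thresholds, the label "ok" and the function names are assumptions made for the example. Well labels are assumed to look like the SampleWell values on this page (a letter followed by a number, e.g. `A01`).

```python
from collections import Counter
from typing import Dict, List

BAD_BELOW = 0.95         # hypothetical threshold
BORDERLINE_BELOW = 0.98  # hypothetical threshold


def colour(call_rate: float) -> str:
    """Map a call rate to a display colour: red=bad, yellow=borderline."""
    if call_rate < BAD_BELOW:
        return "red"
    if call_rate < BORDERLINE_BELOW:
        return "yellow"
    return "ok"


def suspicious_lines(call_rates: Dict[str, float], min_bad: int = 3) -> List[str]:
    """Return the letter or number parts of well labels shared by at least min_bad red wells."""
    counts: Counter = Counter()
    for well, rate in call_rates.items():
        if colour(rate) == "red":
            counts[well[0]] += 1   # letter part, e.g. 'A'
            counts[well[1:]] += 1  # numeric part, e.g. '01'
    return sorted(label for label, n in counts.items() if n >= min_bad)
```

+Reporting a whole letter or number line as suspicious is one way to point the user at a possible pipetting or chip problem instead of at individual wells.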
+Current file formats (Excel):
+The following tables are currently in use (mainly Excel). Fields through which a connection can be made are indicated in yellow.
+DNA-STORAGE
+||Lade||Racknr||RackID||Position||2D-bar||LL#||QiaTube||SampID||Isoldate||NDconc||260/280||260/230||IsoMeth||Charac||
+|| ||43||1052765||A1||8309446||1597677||3627||26794-5||2-9-2009||412,3||1,9||2,2||blabla||blabla||
+WGA-STUDIE
+||Lade||RackNr||Position||2D-bar||LL#||Isoldate||NDconc||260/280||260/230||IsoMeth||
+|| ||43||A1||8309446||1597677||2-9-2009||412,3||1,9||2,2||blabla||
+DUPLOBOX
+||Lade||RackNr||Position||2D-bar||LL#||Isoldate||NDconc||
+|| ||43||A1||8309446||1597677||2-9-2009||412,3||
+NORMALIZED 100 ug/ml
+||Lade||RackNr||BoxID||Position||2D-bar||LL#||Isoldate||
+|| ||43||1052765||A1||8309446||1597677||2-9-2009||
+BoxID appears to match RackID.
+UITVAL
+||LL#||Isoldate||NDconc||IsoMeth||Analist||
+||1597677||2-9-2009||412,3||blabla||XX||
+IRRESOLVABLE SAMPLES
+||RackID||Position||TubeID||LL#||Isoldate||
+|| ||A1||8309446||1597677||2-9-2009||
+PACKED CELLS
+||LL#||
+||1597677||
+SAMPLE SHEET  (automatically generated)
+||SampleID||S.Plate||S.Name||Project||AMP_Plate||SampleWell||SentrixBar_A||SentrixPos_A||
+||1799306||0||1799306||LifelinesPlate27||wg0002915-msa3||A01||4799112148||R01C01||
+Additional columns:  Scanner, DateScan, Replicate, Parent1, Parent2, Gender
+SampleID = DNA-STORAGE .LL#
+NANODROP FILE (automatically generated)
+||PlateID||Well||SampID||UserID||Date||Time||Conc||Units||A260||A280||260/280||260/230||ConcFac||etc…||
+||1024897||A1|| ||default||1-1-2010||10:31||642,0||ng/ul||12,840||6,913||1,86||2,02||50,00||……||
+PlateID = DNA-STORAGE .RackID#
+Well = DNA-STORAGE .Position
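+The link between a nanodrop record and a stored sample, as described by the two key mappings above, could be sketched as a simple lookup. The function names and the dictionary-based rows are assumptions for the example; the column names are those used in the tables on this page.

```python
from typing import Dict, List, Optional, Tuple


def index_storage(storage_rows: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Index DNA-STORAGE rows by (RackID, Position)."""
    return {(str(r["RackID"]), r["Position"]): r for r in storage_rows}


def match_nanodrop(nanodrop_row: dict,
                   storage_index: Dict[Tuple[str, str], dict]) -> Optional[dict]:
    """Find the storage row for a nanodrop row via PlateID=RackID and Well=Position."""
    key = (str(nanodrop_row["PlateID"]), nanodrop_row["Well"])
    return storage_index.get(key)
```

+Identifiers are compared as text here so that a number read from Excel and the same number read from a text file still match.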
+FINAL REPORT
+||SNPname||SampleID||Allele1||Allele2||GC-score||
+||cnvi0111185||1073859||-||-||0.0000…||
+SampleID = DNA-STORAGE .LL#
+QUAGENPOS
+||SampleId||QiagenId||QiagenId_1||TubeId||TubeId_1||RackPos||RackId||Error||
+||QiagenPos1||6837||6837||1030342040||1030342040||A01||100200676|| ||
+QiagenId = DNA-STORAGE . QiaTube
+3.0 Data Model and Description
+This section describes the information domain for the software.
+3.1 Data Description
+Data objects that will be managed/manipulated by the software are described in this section.
+3.1.1 Data objects
+Data objects and their major attributes are described.
+3.1.2 Relationships
+Relationships among data objects are described using an ERD-like form. No attempt is made to provide detail at this stage.
+3.1.3 Complete data model
+An ERD for the software is developed.
+table SAMPLES
+fields: Lifelines number        7-digit integer         unique, not null
+        Gender          integer                 (1=male, 2=female)
+        SampleID                string (length 12)
+        Lade                    integer
+        RackNumber              integer  (values <100)
+        RackID          integer (values >1000)
+        Position                string (length 2)
        2D-barcode              integer
+        Isolation Date          date
+        ND8000 conc             float
+        Ratio 260/280           float
+        Ratio 260/230           float
+        Isolation method        string (length 1000)
+        Characteristics string (length 1000)
+        Duplo                   Boolean
+        Normalized              Boolean
+        InWGA           Boolean
+        Irresolvable            Boolean
+        Uitgangsmateriaal       string
+table VRIEZER (probably not needed)
+fields: Lifelines number        string (formatted as “LL-1234”)         unique, not null
+        RackID          integer
+        LadeNr          integer
+        Positie                 integer
+table SELECTIELIJST
+fields: Lifelines number        7-digit integer         unique, not null
+        FamNr                   integer
+        RelGen          integer
+        High                    Boolean
+        Datum                   date
+        Comment         string (length 100)
+        Positie                 string (length 1000)
+        Status                  string (length 10)
+table QUALITY
+fields: Lifelines number        7-digit integer         unique, not null
+        Call rate               float
        Heterozygosity  float
+        HW                      float
+        Gendercorrect           Boolean
+        CNVstat         float           ??
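+As an illustration of how part of this data model might be realized, the sketch below creates the SAMPLES and QUALITY tables in SQLite. This is not the project's schema: the snake_case column names, the choice of SQLite, the subset of fields and the foreign key from QUALITY to SAMPLES are assumptions made for the example.

```python
import sqlite3

# Hypothetical column names derived from the field lists on this page.
SCHEMA = """
CREATE TABLE samples (
    lifelines_number INTEGER PRIMARY KEY,               -- unique, not null
    gender           INTEGER CHECK (gender IN (1, 2)),  -- 1=male, 2=female
    sample_id        TEXT,
    rack_id          INTEGER,
    position         TEXT,
    barcode_2d       INTEGER,
    isolation_date   TEXT,
    nd8000_conc      REAL,
    ratio_260_280    REAL,
    ratio_260_230    REAL,
    in_wga           INTEGER NOT NULL DEFAULT 0          -- Boolean stored as 0/1
);
CREATE TABLE quality (
    lifelines_number INTEGER PRIMARY KEY REFERENCES samples(lifelines_number),
    call_rate        REAL,
    heterozygosity   REAL,
    hw               REAL,
    gender_correct   INTEGER                             -- Boolean stored as 0/1
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two example tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

+Using the Lifelines number as the key in both tables mirrors the way the Excel sheets are linked through LL#.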
+3.1.4 Data dictionary
+A reference to the data dictionary is provided. The dictionary is maintained in electronic form.
+4.0 Description of the Sample selection algorithm
+This algorithm has been developed by TCC and in its current form it selects samples that are genetically unique from a static dataset. The algorithm needs to be changed such that it operates on a dynamic database. This means that it uses the most recent information on whether each sample has already been used for a WGA. Further, the algorithm should only return Lifelines numbers of participants who had their 2nd visit on the day the algorithm is run.
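+The required dynamic behaviour can be illustrated with the sketch below. It is not the TCC algorithm, whose details are not given on this page. As a stand-in, "genetically distinct" is approximated by allowing at most one participant per family number, and the `Participant` record and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Participant:
    """Hypothetical participant record as it might be known at TCC."""
    lifelines_number: int
    family_number: int
    second_visit: Optional[date]


def select_for_wga(participants: Iterable[Participant],
                   already_in_wga: Set[int],
                   today: date) -> List[int]:
    """Return Lifelines numbers to add to the WGA on the given day."""
    people = list(participants)
    # Families that already contribute a sample, based on the most recent WGA status.
    used_families = {p.family_number for p in people
                     if p.lifelines_number in already_in_wga}
    selected: List[int] = []
    for p in sorted(people, key=lambda p: p.lifelines_number):
        # Only participants whose 2nd visit is today and who are not yet in the WGA.
        if p.second_visit != today or p.lifelines_number in already_in_wga:
            continue
        # Assumption: relatives count as not genetically distinct.
        if p.family_number in used_families:
            continue
        used_families.add(p.family_number)
        selected.append(p.lifelines_number)
    return selected
```

+Because the set of samples already in the WGA is passed in on every call, the result changes as soon as GIMS reports new inclusions, which is the dynamic behaviour asked for above.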