Matrix Interface enhancements

With ever-increasing data sizes, the functionality and performance of the XGAP matrix component are of key importance. Here we discuss enhancements to make the current methods even more powerful. This component should be of general use. Prototyping is happening at

What the Matrix Interface is not about

Let's start with use cases that need to be addressed outside the Matrix Interface, because they are not generic but depend on how the matrix is used inside the application (XGAP):

  • Selections of what should be opened or closed (selected or not, on or off, whatever we end up calling it) should be saved separately per user, as each user might have a different window of interest (concurrently).
  • Aggregate data is actually a new data matrix (e.g. with columns as individuals and rows being mean, stdev, est dist), which we may optionally want to join to each individual data set.
  • QC, imputation, normalization, etc. procedures run on top of the Matrix Interface. However, we may want to add a mechanism to run some methods 'inside' the implementation, e.g. SUM works faster inside SQL ;-)

Functional requirements

Based on the attached requirements doc we distill the following functional requirements for the MatrixInterface:

  • Data querying (selections, views, joins)
    • The interface already provides the ability to quickly select relevant subsections of a matrix by columns and rows. This also supports switching sections on/off or open/closed.
    • This should be enhanced to allow defining a selection on part of the matrix. Such a selection can be thought of as a matrix partition (a view) on top of a physical matrix.
    • We need the ability to convert data on the fly, e.g. to rename 'A' to 'B' only for the current view of the matrix (similar to a decorator).
    • We need the ability to virtually merge two matrices into one (e.g. export all genotype and phenotype data joined together as one matrix of individuals).
  • Data manipulation (updates, removes, rename all 'A' to 'B')
    • The interface already provides data manipulation by row and col (update value=B where row=r1 and col=c1); this function also exists in batch form to apply a list of values.
    • We need data manipulation to update by value (update value=B where value=A).
    • We need to add a batch update that sets the same value at multiple indices (update value=B where row in (r1,r5,r9) and col in (c15,c7)).
    • We need the ability to remove whole rows/cols.
  • Heterogeneous matrices that mix numeric and textual data in one matrix
    • An example is summary-level data having mean (dec), stdev (dec), est dist (text/ontology term). For now we would treat all values as text.
    • And, I hardly dare to say it, but do we also need 3D, 4D, nD matrices (i.e., nested references to other matrices)?
  • Aggregate functions/'embedded' methods
    • Example: if you want to calculate an aggregate such as a mean, it is typically much more efficient to bring the algorithm to the data instead of bringing the data to the algorithm.
    • Problem: this depends on the implementation. For example, a database can calculate MEAN quickly compared to loading the data into R first. Oracle, for example, has quite a set of such methods.
    • Discussion: we can make a collection of functions that we can then also make optimized versions of (e.g. 'CalculateMean', 'CalculateMeanHadoop', 'CalculateMeanMysql'). However, before you know it we are reimplementing all of R in Java.
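
To make the querying and manipulation requirements above more concrete, here is a minimal Java sketch. All names (Matrix, MemoryMatrix, SubMatrixView, RenamingView, MatrixOps) are hypothetical and chosen for illustration; this is not the actual XGAP MatrixInterface. It assumes all values are stored as text, shows a partition view and a renaming decorator layered on a physical matrix, and shows update-by-value and batch update written against the interface only.

```java
import java.util.Arrays;
import java.util.List;

class MatrixViewDemo {
    static Matrix sample() {
        return new MemoryMatrix(Arrays.asList("r1", "r2", "r3"), Arrays.asList("c1", "c2"),
                new String[][] {{"A", "B"}, {"B", "A"}, {"A", "A"}});
    }
    public static void main(String[] args) {
        Matrix m = sample();
        Matrix view = new SubMatrixView(m, Arrays.asList("r1", "r2"), Arrays.asList("c1"));
        Matrix renamed = new RenamingView(view, "A", "B");
        System.out.println(renamed.get("r1", "c1") + " (stored: " + m.get("r1", "c1") + ")");
        System.out.println("changed in view: " + MatrixOps.updateByValue(view, "A", "B"));
        MatrixOps.updateBatch(m, Arrays.asList("r3"), Arrays.asList("c1", "c2"), "C");
        System.out.println(m.get("r3", "c2"));
    }
}

// Hypothetical interface; names are illustrative, not the actual XGAP API.
interface Matrix {
    List<String> rowNames();
    List<String> colNames();
    String get(String row, String col);
    void set(String row, String col, String value);
}

// Physical matrix held in memory; all values are treated as text.
class MemoryMatrix implements Matrix {
    private final List<String> rows, cols;
    private final String[][] data;
    MemoryMatrix(List<String> rows, List<String> cols, String[][] data) {
        this.rows = rows; this.cols = cols; this.data = data;
    }
    public List<String> rowNames() { return rows; }
    public List<String> colNames() { return cols; }
    public String get(String r, String c) { return data[rows.indexOf(r)][cols.indexOf(c)]; }
    public void set(String r, String c, String v) { data[rows.indexOf(r)][cols.indexOf(c)] = v; }
}

// Partition view: exposes only the selected rows/cols of a source matrix.
// Reads and writes pass through to the source; nothing is copied.
class SubMatrixView implements Matrix {
    private final Matrix source;
    private final List<String> rows, cols;
    SubMatrixView(Matrix source, List<String> rows, List<String> cols) {
        this.source = source; this.rows = rows; this.cols = cols;
    }
    public List<String> rowNames() { return rows; }
    public List<String> colNames() { return cols; }
    public String get(String r, String c) { check(r, c); return source.get(r, c); }
    public void set(String r, String c, String v) { check(r, c); source.set(r, c, v); }
    private void check(String r, String c) {
        if (!rows.contains(r) || !cols.contains(c))
            throw new IllegalArgumentException("Outside view: " + r + "," + c);
    }
}

// Decorator view: shows value 'from' as 'to' on read, leaving stored data unchanged.
class RenamingView implements Matrix {
    private final Matrix source;
    private final String from, to;
    RenamingView(Matrix source, String from, String to) {
        this.source = source; this.from = from; this.to = to;
    }
    public List<String> rowNames() { return source.rowNames(); }
    public List<String> colNames() { return source.colNames(); }
    public String get(String r, String c) {
        String v = source.get(r, c);
        return from.equals(v) ? to : v;
    }
    public void set(String r, String c, String v) { source.set(r, c, v); }
}

final class MatrixOps {
    // "update value=newValue where value=oldValue"; returns the number of cells changed.
    static int updateByValue(Matrix m, String oldValue, String newValue) {
        int changed = 0;
        for (String r : m.rowNames())
            for (String c : m.colNames())
                if (oldValue.equals(m.get(r, c))) { m.set(r, c, newValue); changed++; }
        return changed;
    }
    // "update value=v where row in (rows) and col in (cols)".
    static void updateBatch(Matrix m, List<String> rows, List<String> cols, String v) {
        for (String r : rows)
            for (String c : cols) m.set(r, c, v);
    }
}
```

Because the views implement the same interface as the physical matrix, the update helpers work on a view as well: applying updateByValue to a SubMatrixView only touches cells inside that partition. A virtual merge of two matrices could follow the same pattern (a view that delegates to two sources), but is left out of this sketch.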

Non-functional requirements

In the current phase we should ignore scalability issues when fleshing out the interface; just assume it is fast enough. For small data sets the current implementation is very suitable: it is simple to deploy, has constraint checking, and works, for example, to represent human phenotype data. For huge marker/expression sets we clearly would like a more performant solution.

In any case, we will probably never reach the perfect implementation, as each implementation will be good at some methods and bad at others; secondary features may also differ (loading time, retrieval time, selection time, constraint checks like a database, complexity of deployment, etc.). Therefore we want to separate the functionality and scalability concerns using the MatrixInterface, so we can plug in more or alternative implementations (current with Infobright, Hadoop, roll-our-own binary files, BioHDF, JBoss Cache, Target, ...) without having to recode everything we build on top :-)
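
One way this separation could look in Java is sketched below. It is only an illustration under assumptions: NumericMatrix, MemoryBackend and PrecomputedBackend are hypothetical names, and the "optimized" backend merely precomputes means on write as a stand-in for a store that can aggregate internally (such as a SQL database). The interface offers a generic mean() that pulls the data to the algorithm, and a backend may override it when it can do better.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BackendDemo {
    // Code built on top depends only on the interface, not on a backend.
    static double report(NumericMatrix m, String row) { return m.mean(row); }

    public static void main(String[] args) {
        MemoryBackend a = new MemoryBackend();
        a.putRow("geneX", 1.0, 2.0, 3.0);
        PrecomputedBackend b = new PrecomputedBackend();
        b.putRow("geneX", 1.0, 2.0, 3.0);
        System.out.println(report(a, "geneX"));
        System.out.println(report(b, "geneX") + " (row loads: " + b.rowLoads + ")");
    }
}

// Hypothetical interface; not the real XGAP MatrixInterface.
interface NumericMatrix {
    double[] getRow(String rowName);

    // Generic fallback: bring the data to the algorithm.
    default double mean(String rowName) {
        double[] values = getRow(rowName);
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }
}

// Backend 1: plain in-memory storage that relies on the generic mean().
class MemoryBackend implements NumericMatrix {
    private final Map<String, double[]> rows = new LinkedHashMap<>();
    void putRow(String name, double... values) { rows.put(name, values); }
    public double[] getRow(String name) { return rows.get(name); }
}

// Backend 2: stand-in for a store that aggregates "inside".
// It keeps a mean per row at write time, so mean() never loads the row.
class PrecomputedBackend implements NumericMatrix {
    private final Map<String, double[]> rows = new LinkedHashMap<>();
    private final Map<String, Double> means = new LinkedHashMap<>();
    int rowLoads = 0; // counts how often row data is fetched

    void putRow(String name, double... values) {
        rows.put(name, values);
        double sum = 0;
        for (double v : values) sum += v;
        means.put(name, sum / values.length);
    }
    public double[] getRow(String name) { rowLoads++; return rows.get(name); }
    @Override public double mean(String name) { return means.get(name); }
}
```

The report method is unchanged whichever backend is plugged in, which is the property we want from the MatrixInterface. The trade-off noted in the discussion above still applies: every overridden aggregate is one more function to maintain per backend.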

Last modified on 2010-10-01T23:38:14+02:00
