MolgenisFile

Managing files is becoming increasingly important in some of our projects that deal with large, preformatted datasets. Also, many results are files of a non-relational nature, such as images or documents.

I would like to present some thoughts and principles here on how to better deal with such files in a database (Molgenis) context. I do not pretend to have the best solutions, nor do I think it is clever to introduce features already present in Molgenis.

Instead I would like to open a constructive discussion on how to use this work and/or its principles to improve Molgenis and the way we design software dealing with the challenges it addresses. I hope you find it at least informative or inspirational :)

  • Joeri

Overview

Here we explain the differences between two ways to treat files, and how they can be harmonized.

Field is file

Molgenis has a field type 'file' which allows you to store and retrieve files. This is a solid mechanism that works just fine for most applications. See the molgenis guide.

For advanced users however, there are some limitations to this. For instance:

The storage directory is hardcoded in the properties file. This is not ideal, because:

  • You often cannot redeploy elsewhere without editing this file (i.e. the application is not portable)
  • There is no way to check whether the path is correct, nor whether tomcat/java has the rights to use it for read/write actions

MS: not a valid argument. You can solve this by making the 'properties' live inside the database. Action: we therefore could (and should) make a 'MolgenisProperties' entity to store these system settings.

The file is a field and not an entity of its own. This means:

  • There is no straightforward way to attach a plugin for e.g. viewing the file
  • Adding decorators, extensions, etc. is not possible in a suitable way

MS: this is a very valid argument. However, this should be solved as part of the data model, not as a core method. Action plan: we keep and improve field type="file". Only then do we add, inside the GCC model, a way to use File as an entity: type="file_entity"

(see note B )

Entity is file

To make the mechanism more open and flexible, the MolgenisFile entity was added to the core datamodel. The model without descriptions:

<entity name="MolgenisFile" abstract="false" implements="Nameable" decorator="decorators.MolgenisFileDecorator">
	<field name="Extension" nillable="false" length="8" />
</entity>

This basic entity represents a file. It has two attributes: file name and extension. The extension is important, because it is used to map the MIME type at runtime. For example, 'png' will be served out as 'image/png'.
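As a minimal sketch of the extension-to-MIME mapping described above: the class name MimeTypeSketch, the small lookup table and the 'application/octet-stream' fallback are assumptions made for illustration, not the actual Molgenis code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Molgenis): maps a stored extension to a MIME type.
class MimeTypeSketch
{
	private static final Map<String, String> TYPES = new HashMap<String, String>();
	static
	{
		TYPES.put("png", "image/png");
		TYPES.put("jpg", "image/jpeg");
		TYPES.put("txt", "text/plain");
		TYPES.put("pdf", "application/pdf");
	}

	// Returns the MIME type for an extension, or a generic binary type if it is unknown.
	static String forExtension(String extension)
	{
		if (extension == null) return "application/octet-stream";
		String mime = TYPES.get(extension.toLowerCase(Locale.ROOT));
		return mime != null ? mime : "application/octet-stream";
	}

	public static void main(String[] args)
	{
		System.out.println(forExtension("png")); // prints image/png
	}
}
```

Falling back to a generic binary type means an unknown extension is offered as a plain download instead of failing.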

More about the attributes and subclassing later on.

Merging

If the entity way of handling files is a good idea, it would be very feasible to combine the two and use the best of both worlds. The model and classes that deal with file handling could be put into the Molgenis source so they are always available and centrally updated.

When using file as a field, it would secretly simply be an XREF to the MolgenisFile table, so the user does not notice a difference at all. However, it would allow freedom for developers because the files can also be treated as entities.

Developers can extend the MolgenisFile definition and handlers to tailor projects to their specific needs, while keeping the field + XREF construction for the end users. There does not have to be a conflict with the current implementation :)

MS: I want to have a more backward compatible and flexible method. Proposal: make the 'complex' file build on top of the 'field' file:

<entity name="MolgenisFile" abstract="false" implements="Nameable" decorator="decorators.MolgenisFileDecorator">
	<field name="Extension" nillable="false" length="8" />
	<field name="File" type="file"/>
</entity>

Technical

Here we delve into the cool stuff on how to exploit new possibilities.

MS: A completely different and much more direct approach to influence this would be:

<field type="file" file_decorator="some interface that allows you to validate or handle files easily, building on standard file stuff"/>

And then a java interface, for example:

public interface FileInputDecorator
{
  //decorator that influences how this thing is rendered
  public String render();

  //validate the file in how many ways you want
  public MolgenisMessage validate();

  //have a decorator to do things before insert, update, delete 
  public void preUpdate(Entity e, Operation operation);

  //have a decorator to do things after insert, update, delete
  public void postUpdate(Entity e, Operation operation);
}

Decorating

When file is an entity, we can use a decorator to influence its behaviour. The decorator is automatically applied to all the subclasses of the entity as well.

Basically, the decorator takes care of the mapping of the entity (any MolgenisFile) to the file on the filesystem. It does things like:

  • Names are 'escaped' to filesafe versions (eg. strange characters removed)
  • Names must be unique when escaped (handy for finding/downloading)
  • Files need to be renamed when the name is changed
  • Files need to be deleted properly when the record (entity) is removed
  • Extensions must be correct
  • And so on. Informative errors are thrown when something isn't going right.
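The first two rules in the list above could be sketched as follows. This is only an illustration: the class FileNameSketch, the set of allowed characters and the error messages are assumptions, not what MolgenisFileDecorator actually does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the name handling a file decorator could perform.
class FileNameSketch
{
	// Removes every character outside [A-Za-z0-9_] (assumed definition of 'filesafe').
	static String escape(String name)
	{
		if (name == null) throw new IllegalArgumentException("Name is null");
		String escaped = name.replaceAll("[^A-Za-z0-9_]", "");
		if (escaped.isEmpty())
		{
			throw new IllegalArgumentException("Name '" + name + "' has no filesafe characters");
		}
		return escaped;
	}

	// Escapes the name and registers it; throws if the escaped form is already in use.
	static String escapeUnique(String name, Set<String> usedEscapedNames)
	{
		String escaped = escape(name);
		if (!usedEscapedNames.add(escaped))
		{
			throw new IllegalArgumentException("Escaped name '" + escaped + "' is not unique");
		}
		return escaped;
	}

	public static void main(String[] args)
	{
		Set<String> used = new HashSet<String>();
		System.out.println(escapeUnique("my file (1)!", used)); // prints myfile1
	}
}
```

Checking uniqueness on the escaped name rather than the original matters, because two different names (such as "a b" and "ab") can map to the same file on disk.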

The code can be found here

Setting storage location

MS: This is very useful but I want this to be generalized to a 'MolgenisProperties' screen that would validate the molgenis settings. It should also include database name, password, etc. Also I would like to have a standard 'installation' script that checks if the database is consistent with the code and optionally automatically loads system data (such as system settings). Like an install wizard that asks users for some parameters on first start.

Before you can start storing files, you need a validated storage location. There is a plugin that helps you do this here.

The idea is as follows:

  • In a running application (deployed anywhere) you browse to the plugin. Preferably an administrator - we should hide this plugin from others.
  • You type in the preferred storage path, and click 'Set path' to save it.
  • Now, you must run two tests which both need to succeed before this path is marked as validated.
  • When the tests are successful, your path is marked VALIDATED and you can store MolgenisFiles.

If the tests fail, refer to the error message and fix what is wrong. Maybe the database is not accessible, tomcat/java lacks rights on this directory, the directory is not a valid path, etc. Some information about the location is also displayed: Does it already exist? Are there files in it?
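The text does not spell out what the two tests do. As a sketch only, assume one test writes a small probe file and the other reads it back. The class StoragePathCheckSketch and the probe file name are made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical sketch of storage path checks; not the actual plugin code.
class StoragePathCheckSketch
{
	// Returns null when the directory is usable, otherwise an informative error message.
	static String validate(File dir)
	{
		if (!dir.exists()) return "Directory '" + dir + "' does not exist";
		if (!dir.isDirectory()) return "'" + dir + "' is not a directory";
		File probe = new File(dir, "molgenis_storage_probe.tmp"); // assumed probe name
		try
		{
			byte[] expected = "test".getBytes(StandardCharsets.UTF_8);
			Files.write(probe.toPath(), expected); // assumed test 1: can we write?
			byte[] actual = Files.readAllBytes(probe.toPath()); // assumed test 2: can we read?
			if (!Arrays.equals(expected, actual)) return "Read back different content from '" + dir + "'";
			return null;
		}
		catch (IOException e)
		{
			return "Cannot read/write in '" + dir + "': " + e.getMessage();
		}
		finally
		{
			probe.delete(); // clean up the probe file, whether or not the tests passed
		}
	}
}
```

Returning a message instead of throwing makes it easy for a plugin screen to show the user what to fix.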

The path is stored in a special table which is located inside your selected database, but outside the range of tables accessible by your application.

For testing purposes, the path can be set and receive validated status manually. (see note A )

Java API

MS: everything you describe here already holds for field type="file", so in my book this is duplicated work. The only thing we can differ on is where files should go. In MOLGENIS that is /path/entity/entity_label.ext. In older versions of MOLGENIS that could be customized.

The API has two layers: BasicFileHandler and MolgenisFileHandler, which extends BasicFileHandler.

BasicFileHandler tells you the most basic information, such as the common file storage directory for your application as a Java 'File' object. For example:

BasicFileHandler bfh = new BasicFileHandler(db);
File fileStorage = bfh.getFileStorage();

MolgenisFileHandler is a direct extension of BasicFileHandler and is constructed in the same way.

It is mostly focused on 'MolgenisFile' objects: you can get information about files or manipulate them using functions such as getFile(), deleteFile(), findFile(), getStorageDirFor().

MolgenisFileHandler mfh = new MolgenisFileHandler(db);
File myRealFile = mfh.getFile(myMolgenisFile);
File storageForFileType = mfh.getStorageDirFor(myMolgenisFile);

Note that each 'type' of MolgenisFile has its own subdirectory, and your application name is used as part of the storage location. For example: You have set your path to "/data/xgap" and deploy the application as "ngspipeline". You created an entity 'Video extends MolgenisFile'. A video file "result.mpg" would be saved as "/data/xgap/ngspipeline/video/result.mpg". This makes manual tasks such as browsing or backing up files on your filesystem easier.
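The layout rule in this example can be sketched in a few lines. The class StorageLayoutSketch and its method names are hypothetical, and lowercasing the type name is inferred from the 'Video' to 'video' example rather than taken from the handler code.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical sketch of the directory layout: <storage path>/<app name>/<type>/<name>.<ext>
class StorageLayoutSketch
{
	// Directory for one MolgenisFile type (assumption: the type name is lowercased).
	static File storageDirFor(File storageRoot, String appName, String fileType)
	{
		return new File(new File(storageRoot, appName), fileType.toLowerCase(Locale.ROOT));
	}

	// Full location of a single file of the given type.
	static File fileFor(File storageRoot, String appName, String fileType, String name, String extension)
	{
		return new File(storageDirFor(storageRoot, appName, fileType), name + "." + extension);
	}

	public static void main(String[] args)
	{
		File f = fileFor(new File("/data/xgap"), "ngspipeline", "Video", "result", "mpg");
		System.out.println(f.getPath());
	}
}
```

On a Unix-like system the main method prints the path from the example above. On Windows the separators differ.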

Services: uploading and downloading

MS: I feel that this is already covered by the HTML <input type="file"> functionality already in place in MOLGENIS, so I don't see the added value of this one. If you had just implemented MolgenisFile with a <field type="file"/>, you would have this already.

Uploading means creating a new MolgenisFile record, plus putting the file in the correct place. There is a simple upload servlet to do this.

For a basic MolgenisFile, the servlet expects to receive:

  • name = The name under which the file should be stored
  • type = The type (subclass) of MolgenisFile, in this case: 'MolgenisFile'
  • file = A filestream with the content of your file you wish to store

The servlet can be called in many ways, for example with RCurl or regular commandline cURL.

Cool thing nr.1:

The servlet is detached from the actual procedure that handles creating the database records and storing the file. This is another Java API. See code here.

This means you can store files from anywhere in Java source code by using the static doUpload() function. There are two flavours:

doUpload(Database db, MolgenisFile mf, File content)

Which needs a database object, a MolgenisFile definition, and a File pointer to the content. Example usage:

File content = request.getFile("upload");
PerformUpload.doUpload((JDBCDatabase) db, this.model.getMolgenisFile(), content);

And the second:

doUpload(Database db, boolean useTx, String name, String type, File content, HashMap<String, String> extraFields)

Which requires some low-level specifications instead of a 'MolgenisFile' object. Example usage:

//upload as a MolgenisFile, type 'BinaryDataMatrix'
HashMap<String, String> extraFields = new HashMap<String, String>();
extraFields.put("data_name", data.getName());
PerformUpload.doUpload(db, true, data.getName()+".bin", "BinaryDataMatrix", binFile, extraFields);

Cool thing nr.2:

The upload services will ask for the additional fields of a subclass if you forget them! For example, if you have an 'Image extends MolgenisFile', and add a field to this subclass:

<field name="Investigation" type="xref" xref_entity="Investigation" />

Then the upload API will want you to provide an 'investigation_name', or report back an error if you don't. (e.g. "Missing needed field 'investigation_name' for MolgenisFile type 'Image'")
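A check like this could look roughly as follows. It is a sketch under assumptions: the class ExtraFieldCheckSketch and the way required field names are passed in are invented here. The real upload API presumably derives them from the datamodel.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of validating the extra fields of a MolgenisFile subclass.
class ExtraFieldCheckSketch
{
	// Throws with an informative message as soon as a required field is absent or empty.
	static void requireFields(String type, List<String> requiredFields, Map<String, String> extraFields)
	{
		for (String field : requiredFields)
		{
			String value = extraFields.get(field);
			if (value == null || value.isEmpty())
			{
				throw new IllegalArgumentException(
						"Missing needed field '" + field + "' for MolgenisFile type '" + type + "'");
			}
		}
	}

	public static void main(String[] args)
	{
		Map<String, String> extraFields = new HashMap<String, String>();
		extraFields.put("investigation_name", "MyInvestigation");
		// Passes silently because the required field is present.
		requireFields("Image", Collections.singletonList("investigation_name"), extraFields);
	}
}
```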

Downloading is as simple as can be. All you need to do is provide the name of the MolgenisFile to the download service, and it will return a download (outputstream) with the file content.

The cool thing here is that MIME types are automatically mapped to the file extension, so your browser will know what to do with this type of file.

response.setContentType(sc.getMimeType(mf.getExtension()));

Use the service by calling:

http://255.255.255.255:8080/xgap_1_4_distro/download.do?name=SomeFile

Just like the Upload servlet, it wraps a Java API (MolgenisFileHandler) that you can use elsewhere. (see sourcecode)

Practical example

MS: this doesn't allow you to easily reuse viewers in more complicated entities and it greatly pollutes the model. I now have to subclass MolgenisFile for all types instead of just saying

<field name="myPngImage" type="file" file_decorator="org.molgenis.file.PngFileDecorator"/>

Let's walk through a practical example on how to use all this stuff, step-by-step.

Say we want to store images in a Molgenis database. These images are coupled to an 'Investigation'. Start by adding the entity in the datamodel, extending MolgenisFile:

<entity name="Image" extends="MolgenisFile">
	<field name="Investigation" type="xref" xref_entity="Investigation" />
	<unique fields="name,Investigation" description="Name is unique within an investigation" />
</entity>

Now add a GUI component. We nest a small plugin in the form that will allow us to upload and view the images that belong to the records.

<form name="Images" entity="Image">
	<plugin name="Viewer" type="plugins.molgenisfile.MolgenisFileManager" />
</form>

After generating, browse to the image section of the GUI. Create a new record as normal.

The plugin appears, telling you there is no source file. Select a picture and press upload.

Done! If you take a look at your filesystem, you'll find the file under your storage path + app name + MolgenisFile type.

The plugin that we use here is very simple, and only wraps the upload and download services.

Here's the upload code:

File content = request.getFile("upload");
PerformUpload.doUpload((JDBCDatabase) db, this.model.getMolgenisFile(), content);
this.setMessages(new ScreenMessage("File uploaded", true));

And the viewer simply puts an IFRAME around a download:

<iframe width="750px" height="600px" src="download.do?name=mypicture"></iframe>

The plugin is extensible to use different viewers for different MIME types. For example, we have *.fig files (which are in essence text files), representing a figure. Instead of looking at the text, we want to use an applet to display a graph. Inside the viewer part of the plugin, we add:

<#if model.molgenisFile.extension == 'fig'>
	<applet code="jfig.gui.JFigViewerApplet"></applet>
</#if>

And now the applet appears when we view *.fig files.

Java API extension example

To be able to store and manage datamatrices (a special datatype) in file backend sources, while reusing the MolgenisFile handlers to do so, we extend them.

We define a matrix backend file as an entity. MolgenisFile is extended, and furthermore we have an XREF to 'Data' to link the matrix metadata to the files.

<entity name="BinaryDataMatrix" extends="MolgenisFile">
	<field name="Data" type="xref" xref_entity="Data" description="Reference to the datamatrix this binary file belongs to." />
</entity>

By doing so, we get all the services, handlers and decorators for free. We don't have to worry anymore about placing the file in the correct location, renaming, deleting, serving it out, interaction with other records in the database, etc.

But since this is a special datatype, we need more. For example:

  • We would like to use the 'Data' definition to find, verify or delete backend files
  • We would like to create instances of 'Matrix' using the 'Data' definition, regardless of the location of the backend file

For this purpose, we created DataMatrixHandler, which extends MolgenisFileHandler.

(see note C )

A few usage examples.

Check if the data elements for this data matrix are stored in the database:

DataMatrixHandler dmh = new DataMatrixHandler(db);
if (data.getStorage().equals("Database"))
{
	if (dmh.isDataMatrixStoredInDatabase(data))
	{
		throw new DatabaseException("Database source already exists for source type '" + data.getStorage() + "'");
	}
}

Iterate through all 'Data' definitions in the database and create a list of BinaryMatrix instances. (This only succeeds if all of them are binary!)

List<BinaryDataMatrixInstance> bmList = new ArrayList<BinaryDataMatrixInstance>();
for (Data data : db.find(Data.class)) {
	BinaryDataMatrixInstance bm = (BinaryDataMatrixInstance) new DataMatrixHandler(db).createInstance(data);
	bmList.add(bm);
}

Find the 'Data' definition that belongs to this MolgenisFile in a constructor wrapper:

public CSVDataMatrixInstance(Database db, MolgenisFile mf) throws Exception
{
	// delegate to the (Data, File) constructor; this(...) must be the first statement
	this(new DataMatrixHandler(db).findData(mf), new DataMatrixHandler(db).getFile(mf));
}

Check if this 'Data' is stored as a binary file:

DataMatrixHandler dmh = new DataMatrixHandler(db);
Data dm = db.find(Data.class).get(0);
dmh.isDataStoredIn(dm, "Binary");

Create an instance of ANY matrix, regardless of storage mechanism:

db = new JDBCDatabase("xgap.properties");
Data data = db.find(Data.class).get(0);
DataMatrixHandler dmh = new DataMatrixHandler(db);
AbstractDataMatrixInstance<Object> myMatrix = dmh.createInstance(data);

Notes

  A. Manual setting of path, not recommended. If path is 'C:\data', do an SQL insert:
      • create table systemsettings_090527PBDB00QCGEXP4G (filedirpath VARCHAR(255), verified BOOL DEFAULT 0);
      • insert into systemsettings_090527PBDB00QCGEXP4G (filedirpath, verified) values ('C:\data', 1);
  B. Are my statements here even correct? :)
  C. This part is terribly nerdy and incomprehensible I think..

Last modified on 2011-01-20T18:41:10+01:00
