wiki:MolgenisProcessing

Use Cases

This is a short note on the use cases we want to support in the MOLGENIS processing extension.

  • Share my pipeline
  • Add new module
  • List my data items
  • Incorporate Galaxy or GenePattern modules in my pipeline
  • How did I produce this result file?
  • Auto-generate an R Sweave document that is executable documentation?
  • Export R data annotation packages?

PBS best practices

Overview:

  • We use Freemarker to define job templates (a sketch follows this list)
  • We generate one <job>.sh for each job
  • We generate one submit.sh for the whole workflow
  • The whole workflow behaves like 'make': it can recover from failure where it left off
  • The workflow shares one working directory with conventions to ease inter-step variable passing
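
A minimal sketch of what such a Freemarker job template might look like; the parameter names (jobname, walltime, cores, workdir, command) are illustrative, not necessarily the actual MOLGENIS template variables. Freemarker fills in the ${...} placeholders when each <job>.sh is generated:

    #!/bin/bash
    #PBS -N ${jobname}
    #PBS -l walltime=${walltime},nodes=1:ppn=${cores}
    #PBS -o ${workdir}/${jobname}.out
    #PBS -e ${workdir}/${jobname}.err

    cd ${workdir}
    <#-- the actual step command is templated in here -->
    ${command}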

Main ingredients:

  • The workflow works on a data blackboard
    • The whole workflow uses the same working directory (= blackboard architecture pattern)
    • We use standard file names to reduce inter-step parameter passing (= convention over configuration)
    • Naming convention: <unit of analysis>_<name of step>.<ext>
    • For example, in NGS with lane as the unit and paired alignment as the step: <flowcell_lane>_pairedalign.bam
  • Make-style submit.sh (see the example after this list)
    • Each line submits one job to the queue via qsub
    • We solve dependency ordering using the -W depend=afterok:<jobid>:<jobid> option
    • Proper exit codes ensure that dependent jobs are cancelled when a step fails
  • Recoverable job scripts <job>.sh (see the skeleton after this list)
    • We generate a .sh file for each job, including standard logging
    • Each script first checks whether its output already exists (if so, the step can be skipped)
    • Each script checks afterwards that it actually produced its output (otherwise it exits with an error)
    • N.B. check file existence using if ! test -e "$FILE"; then exit 1; fi (test -h only detects symbolic links, and shell exit codes must be 0-255, so exit 1 rather than return -1)
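
To make this concrete, here is a minimal sketch of a generated submit.sh for two steps, where alignment depends on conversion (the script names s01_convert.sh and s02_align.sh are illustrative). qsub prints the id of the submitted job; we capture it and pass it to -W depend=afterok: so that the dependent job only starts after a clean exit, and is cancelled otherwise:

    #!/bin/bash
    # submit the first job and remember its PBS job id
    convert_id=$(qsub s01_convert.sh)
    # alignment may only start after conversion exits with code 0
    align_id=$(qsub -W depend=afterok:$convert_id s02_align.sh)

And a matching sketch of a recoverable job script, assuming a hypothetical working directory and output file; it also shows the <unit of analysis>_<name of step>.<ext> blackboard convention and the skip/verify rules above:

    #!/bin/bash
    # all steps share one working directory (the blackboard)
    WORKDIR=/target/myrun
    OUT=$WORKDIR/flowcell3_lane5_pairedalign.bam

    # skip if a previous (partial) run already produced this output
    if test -e "$OUT"; then
        echo "$(date) align: skipped, $OUT exists" >> "$WORKDIR/workflow.log"
        exit 0
    fi

    echo "$(date) align: started" >> "$WORKDIR/workflow.log"
    # ... the actual alignment command goes here ...

    # exit non-zero if the output was not produced, so that
    # afterok cancels all jobs that depend on this one
    if ! test -e "$OUT"; then
        echo "$(date) align: failed, $OUT missing" >> "$WORKDIR/workflow.log"
        exit 1
    fi
    echo "$(date) align: finished" >> "$WORKDIR/workflow.log"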