class Workflow

class rnaseqflow.workflow.Workflow

Execute a simple series of steps used to preprocess RNAseq files

append(item)

Add a WorkflowStage to the workflow

Parameters:item (WorkflowStage) -- the WorkflowStage to append
insert(idx, item)

Insert a WorkflowStage into the workflow

Parameters:
  • idx (int) -- list index for insertion
  • item (WorkflowStage) -- the WorkflowStage to insert
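
A minimal sketch of how append and insert could behave, assuming the workflow keeps its stages in a plain list. MiniWorkflow and AddSuffix are hypothetical stand-ins, not part of rnaseqflow, and the idea that each stage's output feeds the next stage is an assumption made for illustration:

```python
class MiniWorkflow:
    """Hypothetical stand-in for Workflow: an ordered list of stages."""

    def __init__(self):
        self.stages = []

    def append(self, item):
        # Add a stage at the end of the workflow.
        self.stages.append(item)

    def insert(self, idx, item):
        # Insert a stage at a list index, as list.insert does.
        self.stages.insert(idx, item)

    def run(self, stage_input=None):
        # Assumption: each stage's output becomes the next stage's input.
        for stage in self.stages:
            stage_input = stage.run(stage_input)
        return stage_input


class AddSuffix:
    """Hypothetical stage that appends a suffix to every name it is given."""

    def __init__(self, suffix):
        self.suffix = suffix

    def run(self, stage_input):
        return [name + self.suffix for name in (stage_input or [])]


flow = MiniWorkflow()
flow.append(AddSuffix(".trimmed"))
flow.insert(0, AddSuffix(".merged"))  # runs before the trimming stage
result = flow.run(["sample"])
```

Here insert(0, ...) places the merging stage ahead of the stage that was appended first, so the order of execution follows the list order rather than the order of the calls.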
logger = <logging.Logger object>

log4j-style class logger

run()

Allows the user to select a directory and processes all files within that directory

This is the primary method of the Workflow class. All other methods currently exist to support it.

class WorkflowStage

class rnaseqflow.workflow.WorkflowStage

Interface for a stage of a Workflow

Subclasses must override the run method, which takes and verifies arbitrary input, processes it, and returns some output

They must also provide a .spec property which is a short string to be used to select the specific WorkflowStage from many options. These should not overlap, but at the moment no checking is done to see if they do.
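
The interface described above could be sketched as follows. The base class here is a simplified stand-in for rnaseqflow.workflow.WorkflowStage, and UppercaseNames, its 'x1' specifier, and find_duplicate_specs are hypothetical names introduced only to illustrate the contract and the overlap check that the library does not currently perform:

```python
import logging


class WorkflowStage:
    """Simplified stand-in for rnaseqflow.workflow.WorkflowStage."""

    logger = logging.getLogger("WorkflowStage")
    spec = None  # subclasses override this with a short, unique string

    def run(self, stage_input):
        raise NotImplementedError("subclasses must override run()")


class UppercaseNames(WorkflowStage):
    """Hypothetical stage: upper-case every file name it receives."""

    spec = "x1"  # hypothetical specifier

    def run(self, stage_input):
        # The subclass is responsible for typechecking its own input.
        if isinstance(stage_input, str) or not hasattr(stage_input, "__iter__"):
            raise TypeError("expected an iterable of file names")
        return {str(name).upper() for name in stage_input}


def find_duplicate_specs(stages):
    """Return the spec strings claimed by more than one stage class."""
    seen, duplicates = set(), set()
    for stage in stages:
        if stage.spec in seen:
            duplicates.add(stage.spec)
        seen.add(stage.spec)
    return duplicates
```

A check like find_duplicate_specs could be run once at startup to catch two stages that share a specifier.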

logger = <logging.Logger object>

log4j-style class logger

classmethod longhelp()

Create a long help text with full docstrings for each subclass of WorkflowStage

Subclasses are found using cliutils.all_subclasses

run(stage_input)

Attempt to process the provided input according to the rules of the subclass

Parameters:stage_input (object) -- an arbitrary input to be processed, usually a list of file names or file-like objects. The subclass must typecheck the input as necessary, and define what input it takes
Returns:the results of the subclass's processing
classmethod shorthelp()

Create a short help text with one line for each subclass of WorkflowStage

Subclasses are found using cliutils.all_subclasses
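
As an illustration of how the help texts might be assembled, the sketch below walks the subclass tree recursively and prints one line per stage. The all_subclasses function here is a local stand-in written for this example, not the cliutils implementation, and Stage, Find, and Merge are hypothetical classes:

```python
def all_subclasses(cls):
    """Recursively collect every direct and indirect subclass of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= all_subclasses(sub)
    return found


class Stage:
    """Hypothetical base class standing in for WorkflowStage."""
    spec = None


class Find(Stage):
    """Find files recursively in a folder

    Longer description that only a long help text would show.
    """
    spec = "1"


class Merge(Stage):
    """Merge files by the identifying sequence and direction"""
    spec = "2"


def shorthelp(base):
    """Build one line per subclass: its spec and first docstring line."""
    lines = []
    for sub in sorted(all_subclasses(base), key=lambda c: c.spec):
        doc_lines = (sub.__doc__ or "").strip().splitlines()
        summary = doc_lines[0] if doc_lines else ""
        lines.append("{}: {}".format(sub.spec, summary))
    return "\n".join(lines)
```

A long help text could follow the same pattern and emit the full docstring instead of only its first line.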

spec

Abstract class property, override with @classmethod

Used by the help methods to identify the available WorkflowStages

class FindFiles

class rnaseqflow.workflow.FindFiles(args)

Bases: rnaseqflow.workflow.WorkflowStage

Find files recursively in a folder

Input:
No input is required for this WorkflowStage
Output:
A flat set of file path strings
Args used:
  • --root: the folder in which to start the search
  • --ext: the file extension to search for
logger = <logging.Logger object>

log4j-style class-logger

run(stage_input)

Run the recursive file finding stage

Parameters:stage_input (object, None) -- not used, only for the interface
Returns:A flat set of files found with the correct extension
Return type:set(str)
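
A recursive search of this kind can be sketched with os.walk. The function name find_files and its parameters are assumptions for illustration, mirroring the --root and --ext arguments; this is not the library's own code:

```python
import os


def find_files(root, ext):
    """Return a flat set of paths under root whose names end with ext."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ext):
                found.add(os.path.join(dirpath, name))
    return found
```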
spec = '1'

FindFiles uses '1' as its specifier

class MergeSplitFiles

class rnaseqflow.workflow.MergeSplitFiles(args)

Bases: rnaseqflow.workflow.WorkflowStage

Merge files by the identifying sequence and direction

Input:
An iterable of file names to be grouped and merged
Output:
A flat set of merged filenames
Args used:
  • --root: the folder where merged files will be placed
  • --ext: the file extension to be used for the output files
  • --blocksize: number of kilobytes to use as a copy block size
static _get_direction_id(filename)

Gets the direction identifier from an RNAseq filename

A direction identifier is either R1 or R2, indicating a forward or a reverse read, respectively.

Parameters:filename (str) -- the base filename to be processed
Returns:the file's direction ID, R1 or R2
Return type:string
static _get_part_num(filename)

Returns an integer indicating the file part number of the selected RNAseq file

RNAseq files, due to their size, are split into many smaller files, each of which is given a three digit file part number (e.g. 001, 010). This method returns that part number as an integer.

This requires that the filename contain exactly one sequence of three digits.

Parameters:filename (str) -- the base filename to be processed
Returns:the file's part number
Return type:int
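
One way to extract the part number is a regular expression that matches a run of exactly three digits. The sketch below is an assumption about how this could be done, not the library's code. The filename layout is hypothetical, and raising ValueError when the single-match requirement is violated is a choice made for this example:

```python
import re

# Exactly three digits, not preceded or followed by another digit.
PART_RE = re.compile(r"(?<!\d)\d{3}(?!\d)")


def get_part_num(filename):
    """Return the three-digit part number in filename as an int."""
    matches = PART_RE.findall(filename)
    if len(matches) != 1:
        # Assumption: treat zero or several candidates as an error.
        raise ValueError(
            "expected exactly one three-digit part number in %r" % filename
        )
    return int(matches[0])
```

With a hypothetical name such as sample_AACTAG_R1_010.fastq, the function returns the integer 10. A name with two three-digit runs is rejected rather than guessed at.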
static _get_sequence_id(filename)

Gets the six-letter RNA sequence that identifies the RNAseq file

Returns a six character string that is the ID, or an empty string if no identifying sequence is found.

Parameters:filename (str) -- the base filename to be processed
Returns:the file's sequence ID, six characters of ACTG
Return type:string
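
The lookup described here could be sketched with a regular expression over the four base letters. The pattern and the example filenames are assumptions for illustration. The lookarounds, which keep a longer run of bases from matching, are a design choice of this sketch and may differ from the library's behavior:

```python
import re

# Six base letters that are not part of a longer run of base letters.
SEQ_RE = re.compile(r"(?<![ACGT])[ACGT]{6}(?![ACGT])")


def get_sequence_id(filename):
    """Return the six-letter sequence ID, or '' if none is found."""
    match = SEQ_RE.search(filename)
    return match.group() if match else ""
```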
_organize_files(files)

Organizes a list of paths by sequence_id, part number, and direction

Uses regular expressions to find the six-character sequence ID, the three-digit part number, and the direction (R1 or R2)

Parameters:files (iterable(str)) -- filenames to be organized
Returns:organized files in a dictionary mapping the sequence ID and direction to the files that have that ID, sorted in ascending part number
Return type:dict(tuple:list)
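
The grouping and ordering described above might look like the following sketch. The three patterns, the function name organize_files, and the decision to skip names that lack any of the three fields are assumptions made for illustration:

```python
import os
import re
from collections import defaultdict

SEQ_RE = re.compile(r"[ACGT]{6}")   # six-letter sequence ID
DIR_RE = re.compile(r"R[12]")       # direction, R1 or R2
PART_RE = re.compile(r"\d{3}")      # three-digit part number


def organize_files(files):
    """Map (sequence ID, direction) to paths sorted by part number."""
    groups = defaultdict(list)
    for path in files:
        base = os.path.basename(path)
        seq = SEQ_RE.search(base)
        direction = DIR_RE.search(base)
        part = PART_RE.search(base)
        if not (seq and direction and part):
            continue  # assumption: ignore names missing any field
        key = (seq.group(), direction.group())
        groups[key].append((int(part.group()), path))
    # Sort each group by part number and drop the number again.
    return {key: [p for _, p in sorted(items)] for key, items in groups.items()}
```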
logger = <logging.Logger object>

log4j-style class-logger

run(stage_input)

Run the merge files operation

Creates a directory merged under the root directory and fills it with files concatenated from individual parts of large RNAseq data files

Files are grouped and ordered by searching the file basename for a sequence identifier like AACTAG, a direction like R1, and a part number formatted 001

Parameters:stage_input (iterable(str)) -- file names to be organized and merged
Returns:a set of merged file names
Return type:set(str)
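
Concatenating the ordered parts with a fixed copy block size could be sketched with shutil.copyfileobj. The function name merge_parts and its parameters are hypothetical; blocksize_kb mirrors the --blocksize argument, which is given in kilobytes:

```python
import os
import shutil


def merge_parts(parts, destination, blocksize_kb=1024):
    """Concatenate the files in parts, in order, into destination."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so large files never load fully into memory.
                shutil.copyfileobj(src, out, blocksize_kb * 1024)
    return destination
```

A larger block size means fewer read and write calls at the cost of a larger buffer, which matters mostly for very large input files.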
spec = '2'

MergeSplitFiles uses '2' as its specifier

class FastQMCFTrimSolo

class rnaseqflow.workflow.FastQMCFTrimSolo(args)

Bases: rnaseqflow.workflow.WorkflowStage

Trim adapter sequences from files using fastq-mcf one file at a time

Input:
A flat set of files to be passed into fastq-mcf file-by-file
Output:
A flat set of trimmed file names
Args used:
  • --root: the folder where trimmed files will be placed
  • --adapters: the filepath of the fasta adapters file
  • --fastq: the location of the fastq-mcf executable
  • --fastq_args: a string of arguments to pass directly to fastq-mcf
  • --quiet: silence fastq-mcf's output if given
logger = <logging.Logger object>

log4j-style class-logger

run(stage_input)

Trim files one at a time using fastq-mcf

Parameters:stage_input (iterable(str)) -- filenames to be processed
Returns:a set of filenames holding the processed files
Return type:set(str)
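
As a rough sketch of how the arguments above could be turned into a command line, the function below builds (but does not run) an argument list for one file. The function name, the out_dir parameter, and the output naming are assumptions for illustration. The positional order (adapters file, then reads file) and the -o output option follow fastq-mcf's usage text as I understand it and should be checked against the installed version:

```python
import os
import shlex


def build_trim_command(fastq_exe, adapters, reads, out_dir, fastq_args=""):
    """Return (argument list, output path) for trimming one file."""
    out_path = os.path.join(out_dir, os.path.basename(reads))
    command = [fastq_exe]
    command += shlex.split(fastq_args)   # extra arguments passed through as given
    command += ["-o", out_path]          # assumed output-file option
    command += [adapters, reads]         # adapters first, then the reads file
    return command, out_path
```

The resulting list can be handed to subprocess.call. Passing stdout=subprocess.DEVNULL is one way to honor a --quiet setting.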
spec = '3.0'

FastQMCFTrimSolo uses '3.0' as its specifier

class FastQMCFTrimPairs

class rnaseqflow.workflow.FastQMCFTrimPairs(args)

Bases: rnaseqflow.workflow.WorkflowStage

Trim adapter sequences from files using fastq-mcf in paired-end mode

Input:
A flat set of files to be passed into fastq-mcf in pairs
Output:
A flat set of trimmed file names
Args used:
  • --root: the folder where trimmed files will be placed
  • --adapters: the filepath of the fasta adapters file
  • --fastq: the location of the fastq-mcf executable
  • --fastq_args: a string of arguments to pass directly to fastq-mcf
  • --quiet: silence fastq-mcf's output if given
_find_file_pairs(files)

Finds pairs of forward and backward read files

Parameters:files (iterable(str)) -- filenames to be paired and trimmed
Returns:pairs (f1, f2) of paired files, forward and backward. If a file f1 does not have a mate, f2 will be None, and the file will be trimmed without a mate
Return type:set(tuple(str, str))
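
The pairing rule could be sketched as below, assuming mates share a sequence ID and differ only in their R1 or R2 marker. The function name, the patterns, and the handling of a lone R2 file (returned as the first element with no mate) are assumptions made for illustration:

```python
import os
import re

SEQ_RE = re.compile(r"[ACGT]{6}")  # six-letter sequence ID
DIR_RE = re.compile(r"R[12]")      # direction, R1 or R2


def find_file_pairs(files):
    """Return a set of (f1, f2) tuples; f2 is None when f1 has no mate."""
    by_sequence = {}
    for path in files:
        base = os.path.basename(path)
        seq, direction = SEQ_RE.search(base), DIR_RE.search(base)
        if not (seq and direction):
            continue  # assumption: ignore names that cannot be classified
        by_sequence.setdefault(seq.group(), {})[direction.group()] = path

    pairs = set()
    for found in by_sequence.values():
        forward, backward = found.get("R1"), found.get("R2")
        if forward is None:
            pairs.add((backward, None))  # lone R2 file, trimmed without a mate
        else:
            pairs.add((forward, backward))
    return pairs
```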
static _get_sequence_id(filename)

Gets the six-letter RNA sequence that identifies the RNAseq file

Returns a six character string that is the ID, or an empty string if no identifying sequence is found.

Parameters:filename (str) -- the base filename to be processed
Returns:the file's sequence ID, six characters of ACTG
Return type:string
logger = <logging.Logger object>

log4j-style class-logger

run(stage_input)

Trim files in pairs using fastq-mcf in paired-end mode

Parameters:stage_input (iterable(str)) -- filenames to be processed
Returns:a set of filenames holding the processed files
Return type:set(str)
spec = '3.1'

FastQMCFTrimPairs uses '3.1' as its specifier