class Workflow
-
class rnaseqflow.workflow.Workflow
Execute a simple series of steps used to preprocess RNAseq files
-
append(item)
Add a WorkflowStage to the end of the workflow
Parameters: item (WorkflowStage) -- the WorkflowStage to append
-
insert(idx, item)
Insert a WorkflowStage into the workflow at a given position
Parameters:
- idx (int) -- list index for insertion
- item (WorkflowStage) -- the WorkflowStage to insert
-
logger = <logging.Logger object>
log4j-style class logger
-
run()
Lets the user select a directory and processes all files within that directory
This is the primary function of the Workflow class. For now, all other functions exist to support it
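The append/insert/run interface described above can be pictured with a small stand-alone sketch. It does not import rnaseqflow: `SketchWorkflow`, `SketchStage`, `Produce`, and `Upper` are hypothetical names, and the assumption that each stage's output feeds the next stage's input is an illustration, not something the reference confirms. Directory selection is left out.

```python
class SketchStage:
    """Hypothetical stand-in for a WorkflowStage subclass."""

    def run(self, stage_input):
        raise NotImplementedError


class Produce(SketchStage):
    """Ignores its input and emits a set of file names."""

    def run(self, stage_input):
        return {"b.fastq", "a.fastq"}


class Upper(SketchStage):
    """Upper-cases every name it receives."""

    def run(self, stage_input):
        return {name.upper() for name in stage_input}


class SketchWorkflow:
    """Hypothetical workflow: an ordered list of stages run in sequence."""

    def __init__(self):
        self.stages = []

    def append(self, item):
        # Add a stage at the end of the pipeline.
        self.stages.append(item)

    def insert(self, idx, item):
        # Add a stage at a given list index.
        self.stages.insert(idx, item)

    def run(self):
        # Assumption: the first stage receives None and each later
        # stage receives the previous stage's output.
        data = None
        for stage in self.stages:
            data = stage.run(data)
        return data


flow = SketchWorkflow()
flow.append(Upper())
flow.insert(0, Produce())  # Produce now runs before Upper
result = flow.run()
```

Here `insert(0, ...)` places the producing stage ahead of the stage that was appended first, which is the main reason to offer both methods.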
-
class WorkflowStage
-
class rnaseqflow.workflow.WorkflowStage
Interface for a stage of a Workflow
Subclasses must override the run method, which takes and verifies arbitrary input, processes it, and returns some output
They must also provide a .spec property, a short string used to select the specific WorkflowStage from many options. Specifiers should be unique across subclasses, but at the moment no check enforces this.
-
logger = <logging.Logger object>
log4j-style class logger
-
classmethod longhelp()
Create a long help text with full docstrings for each subclass of WorkflowStage
Subclasses are found using cliutils.all_subclasses
-
run(stage_input)
Attempt to process the provided input according to the rules of the subclass
Parameters: stage_input (object) -- an arbitrary input to be processed, usually a list of file names or file-like objects. The subclass must typecheck the input as necessary and define what input it takes
Returns: the results of the subclass's processing
-
classmethod shorthelp()
Create a short help text with one line for each subclass of WorkflowStage
Subclasses are found using cliutils.all_subclasses
-
spec
Abstract class property, override with @classmethod
Used by the help methods to list the available WorkflowStages
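A rough sketch of how subclass discovery and a one-line-per-stage help text might fit together. The `all_subclasses` function below is a hypothetical stand-in written for this example, not the actual `cliutils.all_subclasses`, and the `spec: summary` output format is an assumption.

```python
def all_subclasses(cls):
    """Recursively collect every direct and indirect subclass of cls."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


class SketchStage:
    """Hypothetical base class standing in for WorkflowStage."""

    spec = None

    @classmethod
    def shorthelp(cls):
        # One line per subclass: the specifier plus the first docstring line.
        lines = []
        for sub in sorted(all_subclasses(cls), key=lambda s: s.spec):
            doc = (sub.__doc__ or "").strip().splitlines()
            summary = doc[0] if doc else ""
            lines.append("{}: {}".format(sub.spec, summary))
        return "\n".join(lines)


class Find(SketchStage):
    """Find files recursively in a folder"""

    spec = "1"


class TrimSolo(SketchStage):
    """Trim one file at a time"""

    spec = "3.0"


class TrimPairs(TrimSolo):
    """Trim files in pairs"""

    spec = "3.1"


help_text = SketchStage.shorthelp()
```

Because discovery is recursive, `TrimPairs` is listed even though it derives from `TrimSolo` rather than directly from the base class.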
-
class FindFiles
-
class rnaseqflow.workflow.FindFiles(args)
Bases: rnaseqflow.workflow.WorkflowStage
Find files recursively in a folder
- Input:
- No input is required for this WorkflowStage
- Output:
- A flat set of file path strings
- Args used:
- --root: the folder in which to start the search
- --ext: the file extension to search for
-
logger = <logging.Logger object>
log4j-style class logger
-
run(stage_input)
Run the recursive file finding stage
Parameters: stage_input (object, None) -- not used, only for the interface
Returns: a flat set of files found with the correct extension
Return type: set(str)
-
spec = '1'
FindFiles uses '1' as its specifier
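As a minimal sketch of the behaviour described for FindFiles, a recursive search that returns a flat set of paths could look like the following. The function name `find_files` and its parameters are hypothetical (they mirror `--root` and `--ext`), and using `os.walk` is an assumption about the approach, not a statement about the library's implementation.

```python
import os
import tempfile


def find_files(root, ext):
    """Return a flat set of paths under root whose names end with ext."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ext):
                found.add(os.path.join(dirpath, name))
    return found


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.fastq", os.path.join("sub", "b.fastq"), "notes.txt"):
        open(os.path.join(root, rel), "w").close()
    found = find_files(root, ".fastq")
    relative = sorted(os.path.relpath(p, root) for p in found)
```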
class MergeSplitFiles
-
class rnaseqflow.workflow.MergeSplitFiles(args)
Bases: rnaseqflow.workflow.WorkflowStage
Merge files by the identifying sequence and direction
- Input:
- An iterable of file names to be grouped and merged
- Output:
- A flat set of merged filenames
- Args used:
- --root: the folder where merged files will be placed
- --ext: the file extension to be used for the output files
- --blocksize: number of kilobytes to use as a copy block size
-
static _get_direction_id(filename)
Gets the direction identifier from an RNAseq filename
A direction identifier is either R1 or R2, indicating a forward or a reverse read, respectively.
Parameters: filename (str) -- the base filename to be processed
Returns: the file's direction ID, R1 or R2
Return type: string
-
static _get_part_num(filename)
Returns an integer indicating the file part number of the selected RNAseq file
RNAseq files, due to their size, are split into many smaller files, each of which is given a three-digit file part number (e.g. 001, 010). This method returns that part number as an integer.
This requires that the filename contain only one sequence of three digits
Parameters: filename (str) -- the base filename to be processed
Returns: the file's part number
Return type: int
-
static _get_sequence_id(filename)
Gets the six-letter RNA sequence that identifies the RNAseq file
Returns a six-character string that is the ID, or an empty string if no identifying sequence is found.
Parameters: filename (str) -- the base filename to be processed
Returns: the file's sequence ID, six characters of ACTG
Return type: string
-
_organize_files(files)
Organizes a list of paths by sequence ID, part number, and direction
Uses regular expressions to find the six-character sequence ID, the three-digit part number, and the direction (R1 or R2)
Parameters: files (iterable(str)) -- filenames to be organized
Returns: organized files in a dictionary mapping each (sequence ID, direction) pair to the files that have that ID and direction, sorted by ascending part number
Return type: dict(tuple:list)
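The grouping described for `_organize_files` can be sketched with three regular expressions. The patterns, the function names without leading underscores, and the example filenames are all assumptions made for illustration; the library's actual expressions are not shown in this reference.

```python
import os
import re
from collections import defaultdict

# Assumed patterns, chosen to match the descriptions above.
SEQ_RE = re.compile(r"[ACTG]{6}")            # six-letter sequence ID
DIR_RE = re.compile(r"R[12]")                # direction, R1 or R2
PART_RE = re.compile(r"(?<!\d)\d{3}(?!\d)")  # exactly three digits


def get_sequence_id(filename):
    match = SEQ_RE.search(filename)
    return match.group(0) if match else ""


def get_direction_id(filename):
    match = DIR_RE.search(filename)
    return match.group(0) if match else ""


def get_part_num(filename):
    match = PART_RE.search(filename)
    if match is None:
        raise ValueError("no three-digit part number in " + filename)
    return int(match.group(0))


def organize_files(files):
    """Map (sequence ID, direction) to paths sorted by part number."""
    groups = defaultdict(list)
    for path in files:
        base = os.path.basename(path)
        groups[(get_sequence_id(base), get_direction_id(base))].append(path)
    for paths in groups.values():
        paths.sort(key=lambda p: get_part_num(os.path.basename(p)))
    return dict(groups)


organized = organize_files([
    "x/s_AACTAG_R1_002.fastq",
    "x/s_AACTAG_R1_001.fastq",
    "x/s_AACTAG_R2_001.fastq",
    "x/s_GGTTCC_R1_001.fastq",
])
```

Only the basename is searched, so digits or letters in parent directory names do not interfere with the match.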
-
logger = <logging.Logger object>
log4j-style class logger
-
run(stage_input)
Run the merge files operation
Creates a directory named merged under the root directory and fills it with files concatenated from the individual parts of large RNAseq data files
Files are grouped and ordered by searching the file basename for a sequence identifier like AACTAG, a direction like R1, and a part number formatted like 001
Parameters: stage_input (iterable(str)) -- file names to be organized and merged
Returns: a set of merged file names
Return type: set(str)
-
spec = '2'
MergeSplitFiles uses '2' as its specifier
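The concatenation step, with a copy block size in kilobytes as `--blocksize` suggests, might be sketched as below. `merge_parts`, its parameters, and the output file name `AACTAG_R1.fastq` are hypothetical; the reference does not say how merged files are named.

```python
import os
import tempfile


def merge_parts(parts, dest, blocksize_kb=1024):
    """Concatenate the files in parts, in order, into dest."""
    block = blocksize_kb * 1024  # kilobytes to bytes
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in fixed-size blocks so large files never
                # have to fit in memory at once.
                while True:
                    chunk = src.read(block)
                    if not chunk:
                        break
                    out.write(chunk)
    return dest


with tempfile.TemporaryDirectory() as root:
    parts = []
    for i, text in enumerate((b"AAAA\n", b"CCCC\n"), start=1):
        path = os.path.join(root, "s_AACTAG_R1_{:03d}.fastq".format(i))
        with open(path, "wb") as handle:
            handle.write(text)
        parts.append(path)

    merged_dir = os.path.join(root, "merged")
    os.makedirs(merged_dir)
    dest = merge_parts(
        parts, os.path.join(merged_dir, "AACTAG_R1.fastq"), blocksize_kb=1
    )
    with open(dest, "rb") as handle:
        merged = handle.read()
```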
class FastQMCFTrimSolo
-
class rnaseqflow.workflow.FastQMCFTrimSolo(args)
Bases: rnaseqflow.workflow.WorkflowStage
Trim adapter sequences from files using fastq-mcf one file at a time
- Input:
- A flat set of files to be passed into fastq-mcf file-by-file
- Output:
- A flat set of trimmed file names
- Args used:
- --root: the folder where trimmed files will be placed
- --adapters: the filepath of the fasta adapters file
- --fastq: the location of the fastq-mcf executable
- --fastq_args: a string of arguments to pass directly to fastq-mcf
- --quiet: silence fastq-mcf's output if given
-
logger = <logging.Logger object>
log4j-style class logger
-
run(stage_input)
Trim files one at a time using fastq-mcf
Parameters: stage_input (iterable(str)) -- filenames to be processed
Returns: a set of filenames holding the processed files
Return type: set(str)
-
spec = '3.0'
FastQMCFTrimSolo uses '3.0' as its specifier
class FastQMCFTrimPairs
-
class rnaseqflow.workflow.FastQMCFTrimPairs(args)
Bases: rnaseqflow.workflow.WorkflowStage
Trim adapter sequences from files using fastq-mcf in paired-end mode
- Input:
- A flat set of files to be passed into fastq-mcf in pairs
- Output:
- A flat set of trimmed file names
- Args used:
- --root: the folder where trimmed files will be placed
- --adapters: the filepath of the fasta adapters file
- --fastq: the location of the fastq-mcf executable
- --fastq_args: a string of arguments to pass directly to fastq-mcf
- --quiet: silence fastq-mcf's output if given
-
_find_file_pairs(files)
Finds pairs of forward and reverse read files
Parameters: files (iterable(str)) -- filenames to be paired and trimmed
Returns: pairs (f1, f2) of forward and reverse read files. If a file f1 does not have a mate, f2 will be None and the file will be trimmed without a mate
Return type: set(tuple(str, str))
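One way to picture the pairing is to key each file on its sequence ID and match the R1 file with the R2 file that shares the key. This is a sketch under that assumption; `find_file_pairs` and the regular expressions are hypothetical, and the reference does not state which key the library actually pairs on.

```python
import os
import re

SEQ_RE = re.compile(r"[ACTG]{6}")  # assumed six-letter sequence ID pattern
DIR_RE = re.compile(r"R[12]")      # assumed direction pattern


def find_file_pairs(files):
    """Return a set of (forward, reverse) tuples; reverse may be None."""
    forward, reverse = {}, {}
    for path in files:
        base = os.path.basename(path)
        seq = SEQ_RE.search(base)
        key = seq.group(0) if seq else ""
        direction = DIR_RE.search(base)
        if direction and direction.group(0) == "R2":
            reverse[key] = path
        else:
            forward[key] = path

    pairs = set()
    for key, f1 in forward.items():
        # pop() returns None when no mate with the same sequence ID exists.
        pairs.add((f1, reverse.pop(key, None)))
    for leftover in reverse.values():
        # An R2 file with no forward mate is still trimmed on its own.
        pairs.add((leftover, None))
    return pairs


pairs = find_file_pairs([
    "AACTAG_R1.fastq",
    "AACTAG_R2.fastq",
    "GGTTCC_R1.fastq",
])
```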
-
static _get_sequence_id(filename)
Gets the six-letter RNA sequence that identifies the RNAseq file
Returns a six-character string that is the ID, or an empty string if no identifying sequence is found.
Parameters: filename (str) -- the base filename to be processed
Returns: the file's sequence ID, six characters of ACTG
Return type: string
-
logger = <logging.Logger object>
log4j-style class logger
-
run(stage_input)
Trim files in pairs using fastq-mcf
Parameters: stage_input (iterable(str)) -- filenames to be processed
Returns: a set of filenames holding the processed files
Return type: set(str)
-
spec = '3.1'
FastQMCFTrimPairs uses '3.1' as its specifier