
This tool aims to extract a DNA sequence from reads. It applies a consensus strategy to several MSAs (multiple sequence alignments) built from the reads, builds a meta-MSA from the resulting consensus sequences, and applies the consensus strategy again to the meta-MSA. We call the final result a meta-consensus.
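The two-level consensus idea can be illustrated with a minimal sketch. This is not the pipeline's implementation: the function names, the use of `N` for undecided columns, and the rule of dropping majority-gap columns are all assumptions made for illustration. It also assumes the per-MSA consensus sequences are already aligned and of equal length, whereas the real pipeline builds a meta-MSA from them first. The default thresholds (70 and 60) are the ones listed in the usage section.

```python
from collections import Counter


def column_consensus(aligned, threshold=70.0, ambiguous="N", gap="-"):
    """Return a threshold-based consensus of equal-length aligned sequences.

    Hypothetical rule: keep the most frequent symbol of a column when its
    frequency (in percent) reaches the threshold, otherwise emit `ambiguous`.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        if 100.0 * count / len(column) < threshold:
            consensus.append(ambiguous)  # no symbol is frequent enough
        elif symbol != gap:
            consensus.append(symbol)     # columns dominated by gaps are dropped
    return "".join(consensus)


def meta_consensus(msas, consensus_threshold=70.0, metaconsensus_threshold=60.0):
    """Consensus of each MSA, then consensus of those consensus sequences.

    Assumes the per-MSA consensus sequences end up with equal length,
    so no re-alignment step is needed in this toy version.
    """
    per_msa = [column_consensus(msa, consensus_threshold) for msa in msas]
    return column_consensus(per_msa, metaconsensus_threshold)
```

In this toy setting, a column left undecided in one MSA can still be resolved at the meta-consensus level when the other MSAs agree on it.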

## Requirements

This tool needs the Conda environment manager, which is distributed with Anaconda, in order to run.

To install Conda, see: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

## Installation

You only need to download this repository to launch the pipeline. The first run sets up every tool through Conda.

## Usage
> usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]  
>
> Creates the config file then runs the Meta-consensus pipeline  
>
> Required arguments:  
> -input INPUT, -i INPUT  
> Reads file  
> -output OUTPUT, -o OUTPUT  
> Target directory for the pipeline results  
>
> Standard arguments:  
> -h, --help  
> show this help message and exit  
>
> -reference REFERENCE, -r REFERENCE  
> Reference for alignment and statistics  
>
> -tools TOOLS, -t TOOLS  
> The list of tools to use in the meta-consensus  
> (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])  
>
> -cores CORES, -c CORES  
> The amount of cores to use in the pipeline run (default 1)  
>
> Advanced arguments:  
> -list LIST  
> A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)  
>
> -size SIZE  
> The desired region size (default: maximum)  
>
> -consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD  
> Threshold(s) used for the MSA consensus step (default: [70])  
>
> -metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD  
> Threshold(s) used for the Meta-consensus result (default: [60])  
>
> -depth DEPTH, -d DEPTH  
> The depth used in the process (default: max)  


### Example

To run the tool on the provided dataset, run:

> $ ./mc_msa.py -i test_data/stup_virus_seq.fa -o output/stup_virus -r test_data/Stup_virus.fa

This sets up the pipeline and starts a run on the reads in the file `stup_virus_seq.fa`, using `Stup_virus.fa` as the reference. Every file produced by the pipeline is written to the folder `output/stup_virus`. The meta-consensus sequence can be found at `output/stup_virus/meta_consensus/meta_consensus_d0_t1_50_t2_50.fasta`.
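If you prefer to launch the run from a Python script, the command above can be assembled and executed with `subprocess`. The helper names `build_command` and `run_pipeline` are hypothetical and not part of this repository. The sketch only uses the `-i`, `-o`, `-r` and `-c` options listed in the usage section, and assumes it is run from the repository root.

```python
import subprocess


def build_command(reads, out_dir, reference=None, cores=None, script="./mc_msa.py"):
    """Assemble the mc_msa command line as a list of arguments."""
    command = [script, "-i", reads, "-o", out_dir]
    if reference is not None:
        command += ["-r", reference]   # optional reference file
    if cores is not None:
        command += ["-c", str(cores)]  # optional number of cores
    return command


def run_pipeline(reads, out_dir, reference=None, cores=None):
    """Run the pipeline and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(reads, out_dir, reference, cores), check=True)
```

Calling `run_pipeline("test_data/stup_virus_seq.fa", "output/stup_virus", reference="test_data/Stup_virus.fa")` would then be equivalent to the command shown above.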


### Input

The input is a reads file in FASTA format.

### Output

At the end of a pipeline run, the output folder contains 6 folders if a reference was provided, and 4 folders otherwise:
- meta-consensus: the resulting meta-consensus, for each region and for each specified combination of thresholds and depths.
- consensus: the intermediate consensus for every MSA, stored in a folder tree of the form *region*/*depth*/*consensus_threshold*/*metaconsensus_threshold*, together with the consensus alignment.
- data: the preliminary alignment, the cut reads, the computed MSAs, and the cut reference when a reference is provided.
  You can run the pipeline on pre-processed MSAs by adding them to `output/data/msa` and naming them MSA_*TOOL*_r*START_END*_d*DEPTH*.fasta, where *TOOL* is the tool used, *START* and *END* are the limits of the region, and *DEPTH* is the read depth of the MSA.
- logs: all the logs for the pipeline **will** be here in the final version (for now, some logs end up in the consensus folder).
- plot [requires reference]:
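The naming convention for pre-processed MSAs can be sketched as a pair of small helpers. The function names are hypothetical, and the sketch assumes that *START*, *END* and *DEPTH* are non-negative integers and that tool names are alphanumeric (as in the default tool list).

```python
import re

# Pattern for MSA_TOOL_rSTART_END_dDEPTH.fasta (integer fields are an assumption).
MSA_NAME = re.compile(
    r"^MSA_(?P<tool>[A-Za-z0-9]+)_r(?P<start>\d+)_(?P<end>\d+)_d(?P<depth>\d+)\.fasta$"
)


def msa_filename(tool, start, end, depth):
    """Build the file name expected in output/data/msa for a pre-processed MSA."""
    return f"MSA_{tool}_r{start}_{end}_d{depth}.fasta"


def parse_msa_filename(name):
    """Split an MSA file name back into (tool, start, end, depth)."""
    match = MSA_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid MSA file name: {name!r}")
    return (
        match["tool"],
        int(match["start"]),
        int(match["end"]),
        int(match["depth"]),
    )
```

For example, `msa_filename("mafft", 2, 2002, 0)` gives `MSA_mafft_r2_2002_d0.fasta`.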
### Region selection

There are two main ways to define the regions.

You can specify the regions manually with `-list`, which accepts two formats:
- `-list "[rStart1, rStart2, rStart3, ...]"`: the regions run from Start1 to Start2, then from Start2 to Start3, and so on.
- `-list "[rStart1_End1, rStart2_End2, ...]"`: the regions run from Start1 to End1, then from Start2 to End2, and so on.
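To make the two `-list` formats concrete, here is a minimal parsing sketch. It is not the pipeline's own parser: `parse_region_list` is a hypothetical name, and the sketch assumes that the `r` prefix is literal and that positions are integers.

```python
def parse_region_list(text):
    """Turn a -list string into (start, end) pairs.

    "[r0, r2000, r4000]"     -> consecutive starts: (0, 2000), (2000, 4000)
    "[r0_2000, r1950_3950]"  -> explicit limits:    (0, 2000), (1950, 3950)
    """
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("the list must be enclosed in square brackets")
    items = [item.strip() for item in body[1:-1].split(",") if item.strip()]
    if any(not item.startswith("r") for item in items):
        raise ValueError("every entry must start with 'r'")

    if items and all("_" in item for item in items):
        # Second format: each entry carries its own start and end.
        regions = []
        for item in items:
            start, end = item[1:].split("_")
            regions.append((int(start), int(end)))
        return regions
    if any("_" in item for item in items):
        raise ValueError("the two formats cannot be mixed")

    # First format: each start is also the end of the previous region.
    starts = [int(item[1:]) for item in items]
    return list(zip(starts, starts[1:]))
```

Note that with the first format, N positions describe N-1 regions.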

Alternatively, you can set a region size and an overlap, and the regions are generated automatically.  
`-size 2000 -overlap 50` creates regions from position 2 to position 2002, then from position 1952 to position 3952, and so on. Consecutive regions therefore share OVERLAP bases, which can be used to join them.
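The arithmetic behind the size and overlap options can be sketched as follows. This is an illustration, not the pipeline's code: `tile_regions` and its parameters are hypothetical, the first start position is left as a parameter (the example above starts at position 2), and clipping the last region to the sequence length is an assumption.

```python
def tile_regions(seq_length, size, overlap, first_start=0):
    """List (start, end) regions of `size` bases sharing `overlap` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    regions = []
    start = first_start
    while start < seq_length:
        end = min(start + size, seq_length)  # clip the last region
        regions.append((start, end))
        if end == seq_length:
            break
        start = end - overlap  # step back so consecutive regions overlap
    return regions
```

With `tile_regions(6000, 2000, 50, first_start=2)`, the first two regions are `(2, 2002)` and `(1952, 3952)`, matching the example above.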

Setting the region size to 0 makes the pipeline try to process the whole sequence at once. This is very slow and causes some tools to struggle or to produce no result at all.  
The limitation comes from the MSA tools themselves: for example, abPOA and SPOA require a lot of available RAM, and Muscle slows down considerably on larger regions.

### Depth

## Authors and acknowledgment
Flavien Lihouck

Special thanks to Coralie Rohmer for her work on the tool MSA-limit, which inspired this project and is used in many parts of it.

## License

Conda is released under the 3-clause BSD license (https://docs.conda.io/en/latest/license.html).

Snakemake is licensed under the MIT License (https://snakemake.readthedocs.io/en/stable/project_info/license.html).


The license for this project is not decided yet (probably CC_SA).