# Meta Consensus MSA

This tool aims to extract a DNA sequence from reads. It applies a consensus strategy to several MSAs (multiple sequence alignments) built from the reads, builds a meta-MSA from the resulting consensus sequences, and applies the consensus strategy again to the meta-MSA, producing what we call a meta-consensus.
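
To make the idea of a thresholded consensus concrete, here is a minimal Python sketch. It is not the pipeline's actual code: the function name `column_consensus`, the use of `N` for columns below the threshold, and the rule of dropping majority-gap columns are all assumptions made for illustration. The same function could in principle be applied a second time to the aligned consensus sequences (the meta-MSA) to obtain a meta-consensus.

```python
from collections import Counter


def column_consensus(aligned, threshold=70.0, gap="-", ambiguous="N"):
    """Column-wise consensus of equal-length aligned sequences.

    Assumed rule (hypothetical, not taken from the pipeline): the most
    frequent symbol of a column is kept when its share reaches
    `threshold` percent; otherwise `ambiguous` is emitted. Columns whose
    majority symbol is a gap contribute nothing.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    out = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        share = 100.0 * count / len(column)
        if share < threshold:
            out.append(ambiguous)
        elif symbol != gap:
            out.append(symbol)
    return "".join(out)


# Toy alignment of four reads (made-up data).
msa = ["ACGT-A", "ACGTTA", "ACCT-A", "ACGT-A"]
print(column_consensus(msa, threshold=70))  # ACGTA
print(column_consensus(msa, threshold=80))  # ACNTNA
```

Raising the threshold makes the consensus more conservative: columns where the majority symbol is supported by only 75% of the reads become ambiguous at a threshold of 80.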

## Requirements

This tool needs the Conda environment manager, which ships with the Anaconda distribution, to run.

To install Conda, see: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

## Installation

You only need to download this repository to launch the pipeline. The first run sets up every tool using Conda.

## Usage

> usage: mc_msa -i INPUT -o OUTPUT [-r REFERENCE -t TOOLS ...]  
>
> Creates the config file then runs the Meta-consensus pipeline  
>
> Required arguments:  
> -input INPUT, -i INPUT  
> Reads file  
>
> -output OUTPUT, -o OUTPUT  
> Target directory for the pipeline results  
>
> Standard arguments:  
> -h, --help  
> show this help message and exit  
>
> -reference REFERENCE, -r REFERENCE  
> Reference for alignment and statistics  
>
> -tools TOOLS, -t TOOLS  
> The list of tools to use in the meta-consensus  
> (default: ['abpoa', 'spoa', 'kalign2', 'kalign3', 'mafft', 'muscle'])  
>
> -cores CORES, -c CORES  
> The amount of cores to use in the pipeline run (default 1)  
>
> Advanced arguments:  
> -list LIST  
> A list of regions to work on (format: [r1, r2, ...] or [rStart_End, ...]) (default: no region)  
>
> -size SIZE, -s SIZE  
> The desired region size (default: maximum)  
>
> -consensus_threshold CONSENSUS_THRESHOLD, -ct CONSENSUS_THRESHOLD  
> Threshold(s) used for the MSA consensus step (default: [70])  
>
> -metaconsensus_threshold METACONSENSUS_THRESHOLD, -mt METACONSENSUS_THRESHOLD  
> Threshold(s) used for the Meta-consensus result (default: [60])  
>
> -depth DEPTH, -d DEPTH  
> The depth used in the process (default: max)  


### Example

To run the tool with our provided dataset, run the command:

> $ ./mc_msa.py -i test_data/stup_virus_seq.fa -o output/stup_virus -r test_data/Stup_virus.fa

This sets up the pipeline and starts a run on the reads in the file `stup_virus_seq.fa`, using `Stup_virus.fa` as the reference file.
Everything produced by the pipeline is written to the folder `output/stup_virus`. The meta-consensus sequence can be found at `output/stup_virus/meta_consensus/meta_consensus_d0_t1_50_t2_50.fasta`.


### Input

The input is a reads file in FASTA format.
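
If you want to sanity-check a reads file before launching a run, a small FASTA reader is enough. The sketch below is only an illustration and is not part of the pipeline; the helper name `read_fasta` and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example; with a real file, pass an open file handle instead.
example = [">read1", "ACG", "TAC", "", ">read2", "GGCC"]
records = read_fasta(example)
print(len(records))  # 2
print(records[0])    # ('read1', 'ACGTAC')
```

With a real file, `read_fasta(open(path))` works the same way, since an open text file iterates over its lines.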

### Output

At the end of a pipeline run, the output folder contains 6 folders if a reference was provided, and 4 folders otherwise:
- meta-consensus: the resulting meta-consensus for each region and for each specified combination of thresholds and depths.
- consensus: the intermediate consensus for every MSA, stored in a folder tree of the form *region*/*depth*/*consensus_threshold*/*metaconsensus_threshold*, and the consensus alignment.
- data: the preliminary alignment, the cut reads, the computed MSAs, and possibly the cut reference.
  You can run the pipeline on pre-computed MSAs by placing them in `output/data/msa` and naming them MSA_*TOOL*_r*START_END*_d*DEPTH*.fasta, where *TOOL* is the tool used, *START* and *END* are the limits of the region, and *DEPTH* is the read depth of the MSA.
- logs: all the pipeline logs **will** be stored here in the final version (for now, some logs end up in the consensus folder).

- plot [requires reference] : 
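
The naming scheme for pre-computed MSAs described under `data` can be sketched in Python as follows. The helper `msa_path` is hypothetical, and it assumes that `output` in `output/data/msa` stands for the directory passed with `-o`; the exact form expected for *DEPTH* is not specified here, so the value below is only a placeholder.

```python
from pathlib import Path


def msa_path(output_dir, tool, start, end, depth):
    """Build the expected location of a pre-computed MSA (assumed layout)."""
    name = f"MSA_{tool}_r{start}_{end}_d{depth}.fasta"
    return Path(output_dir) / "data" / "msa" / name


# Hypothetical values, for illustration only.
path = msa_path("output", "mafft", 2, 2002, 50)
print(path.as_posix())  # output/data/msa/MSA_mafft_r2_2002_d50.fasta
```
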

### Region selection

There are two main ways of setting up how the regions are selected.

You can specify the regions manually using `-list`, which accepts 2 formats.  
- `-list "[rStart1, rStart2, rStart3, ...]"` : the corresponding regions will be from Start1 to Start2, then from Start2 to Start3 and so on.  
- `-list "[rStart1_End1, rStart2_End2, ...]"` : the corresponding regions will be from Start1 to End1, then from Start2 to End2 and so on.  
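
The two `-list` formats above can be illustrated with a short Python sketch. This is not the pipeline's parser: the function `parse_region_list` is hypothetical, and it assumes that the leading `r` of each entry is a literal prefix and that the two formats cannot be mixed in one list.

```python
def parse_region_list(spec):
    """Turn a -list style string into a list of (start, end) tuples."""
    tokens = [t.strip() for t in spec.strip().strip("[]").split(",") if t.strip()]
    # Assumption: each entry may carry a literal leading "r".
    bare = [t[1:] if t.startswith("r") else t for t in tokens]

    if all("_" in t for t in bare):
        # Format [rStart1_End1, rStart2_End2, ...]
        regions = []
        for token in bare:
            start, end = token.split("_")
            regions.append((int(start), int(end)))
        return regions
    if any("_" in t for t in bare):
        raise ValueError("the two -list formats cannot be mixed")

    # Format [rStart1, rStart2, ...]: consecutive starts delimit regions.
    starts = [int(t) for t in bare]
    return list(zip(starts, starts[1:]))


print(parse_region_list("[r100, r2100, r4100]"))   # [(100, 2100), (2100, 4100)]
print(parse_region_list("[r100_600, r900_1500]"))  # [(100, 600), (900, 1500)]
```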

Alternatively, you can select a region size and an 'overlap', from which the regions are generated.  
`-size 2000 -overlap 50` : will create regions from the 2nd position to the 2002nd, then from the 1952nd to the 3952nd and so on. This way, consecutive regions share OVERLAP bases, which can be used to join them.
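
The size and overlap arithmetic from the example above can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the pipeline's implementation: the function `sliding_regions`, the `first_start` parameter (set to 2 to match the example), and the clipping of the last region to the sequence length are all hypothetical.

```python
def sliding_regions(seq_len, size, overlap, first_start=2):
    """List (start, end) regions of `size` bases sharing `overlap` bases."""
    if overlap < 0 or size <= overlap:
        raise ValueError("size must be larger than a non-negative overlap")

    regions = []
    start = first_start
    while start < seq_len:
        # Assumption: the last region is clipped to the sequence length.
        end = min(start + size, seq_len)
        regions.append((start, end))
        if end == seq_len:
            break
        # The next region starts `overlap` bases before the current end.
        start = end - overlap
    return regions


regions = sliding_regions(seq_len=6000, size=2000, overlap=50)
print(regions)  # [(2, 2002), (1952, 3952), (3902, 5902), (5852, 6000)]
```

Each region starts `size - overlap` bases after the previous one, which is why the second region begins at position 1952.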

Setting the region size to 0 makes the pipeline try to process the whole sequence as a single region. This will be very slow, and will cause some tools to either struggle or fail to produce a result.  
This is due to limitations of the MSA tools themselves: for example, abPOA and SPOA require a lot of available RAM, and Muscle slows down considerably on larger regions.

### Depth


## Authors and acknowledgment
Flavien Lihouck

Special thanks to Coralie Rohmer's work on the tool MSA-limit, which inspired and was used in many parts of this project.

## License

- Conda is released under the 3-clause BSD license (https://docs.conda.io/en/latest/license.html)
- Snakemake is licensed under the MIT License (https://snakemake.readthedocs.io/en/stable/project_info/license.html)


The license for this project itself has not been decided yet (probably a Creative Commons ShareAlike license).