# SssICalc - A model of MBD pulldown alignments extracted from SssI-treated DNA The following programs execute the model described in the publication entitled "A model of pulldown alignments from SssI-treated DNA improves DNA methylation prediction". The instructions below describe the inputs and outputs of the included programs. This tool was developed at The Ohio State University by Blythe Moreland and Ralf Bundschuh. ## Introduction SssICalc calculates expected read coverage from an MBD pulldown experiment performed on DNA treated with M.SssI to methylate all C's in the CpG context. These calculated SssI Control values may be used in place of experimentally-determined SssI Control values for the purposes of estimating methylation level from MBD pulldown performed on DNA of unknown methylation status. We use these modeled SssI Control data to run the BayMeth algorithm (Riebler *et al.*, 2014) on MBD pulldown data without running a separate Control experiment on an SssI-treated sample. ### Prerequisites SssICalc was implemented in the C++ language and uses some functions from the Boost library (v. 1.56). We have also included a Python program to sum up $\Lambda_x$ values in non-overlapping genomic windows. ### Installation Download the source files to a directory on a Linux based machine and add execution permissions to the C++ script. ``` chmod +x SssICalc ``` To compile the SssICalc program directly, run the makefile. ``` make -f make_SssICalc ``` ## Usage ### Overview * [SssICalc] - This script calculates $Lambda_x$ terms for every site, $x$, given a list of CpG locations. * [mycpgs.cpp/.hpp] - This file contains the MYCPGS class that stores and modifies information on CpG locations. * [params.cpp/.hpp] - This file contains the Params class that stores and modifies information on model parameters. * [sum_up_lambdas.py] - This script sums up $Lambda_x$ terms on non-overlapping genomic windows. Example inputs (`input_options.txt`) and parameters (stored in `params/`) and outputs (stored in `outputs/`) are provided for analysis of chr22 of hg18. ### SssICalc To run SssICalc: ``` ./SssICalc [file describing input files, parameters] > [lambda output file] ./SssICalc input_options_chr22_hg18.txt > outputs/chr22_hg18_lambdas.txt ``` #### Inputs The file input_options.txt should be structured in the following format to feed in program parameters (comments in [] aside): ``` params/cpgs_chr22_hg18.txt [file that contains CpG locations, formatted 'chromosome C_location'] params/length_weights.txt [file that contains fragment length weights, formatted 'length weight'] params/number_weights.txt [file that contains enrichment factors for CpG number, formatted 'cpg_number weight'] 5 [int, minimum number of bps from C of a CpG to the C of the next CpG] 3 [int, minimum fragment length] 200 [int, maximum fragment length] chr22 [optional, label of chromosome being analyzed] 49691432 [int, optional, required if previous field is set, length of chromosome being analyzed] ``` If the last two lines (chromosome label and chromosome length) are empty then by default the program assumes that the CpG locations file provides data for all chromosomes in the genome. The chromosome labels and lengths for the hg18 genome are hard-coded into the SssICalc.cpp program, so that program can be modified in order to be applied to different genomes. The fragment length weights correspond to $P(l)$ as described in the manuscript. The CpG number enrichment factors correspond to $C(N)$ as described in the manuscript. ### sum_up_lambdas.py The genomic windows generated start with the first site in the genome (index 1) and cover the whole chromosome in a non-overlapping way. The specified window width should be an integer > 0. To run sum_up_lambdas.py: ``` python sum_up_lambdas.py [lambda output file] [window width] [opt, chromosome label] [opt, chromosome length] > [lambda window output file] python sum_up_lambdas.py outputs/chr22_hg18_lambdas.txt 100 chr22 49691432 > outputs/chr22_hg18_lambdas_100bp.txt ``` Again, the command line options can specify a particular chromosome, and if not set then all chromosomes are analyzed and assumed to be from hg18. The genome-wide hg18 info is hard-coded into the sum_up_lambdas.py program, so that program can be modified in order to be applied to different genomes.