$Id: README,v 0.4 2004/09/05 20:37:56 mccorkle Exp mccorkle $ The SACO software README Table of Contents I) Preliminary comments II) Get and install SACO software tarball III) Copy organism "database". IV) Create a SACO project directory V) Gather the chromatograms VI) Processing with saco-matic VII) Miscellaneous I) Preliminary comments For the sake of simplicity, unix (linux,macosx,Irix etc.) is assumed, and tcsh or csh is also assumed (any guru worth their salary should be able to translate this to "sh" or "bash" if needed). A perl interpreter, csh, and a working C compiler are required. make is also needed. (Perl & csh are supplied with the base MacOSX operating system, but the C compiler and make are included in the Development Tools disk, so be sure to install that). Make sure perl (or a link to it) is in /usr/bin/perl. You'll need phred (http://www.phrap.org). Data is in ASCII (text) format. Always. Sequences are FASTA format, most other data is tabular or quasi-tabular. II) Get and install SACO software tarball (% indicates the tcsh prompt ) The SACO software is in a small directory tree "SACO", with executables in "SACO/bin". Decide where you want to unpack & install the directory. (i.e. /usr/local/SACO - you'll need write privilege in /usr/local, or wherever you decide to put it) get ftp://ftp.genome.bnl.gov/Vollum/SACO.tar.gz or http://genome.bnl.gov/SACO/SACO.tar.gz % mv SACO.tar.gz /usr/local % cd /usr/local % tar zxvf SACO.tar.gz (on linux,macosx, or newer systems) (on older unix systems, use this: % gunzip -c SACO.tar.gz | tar xvf - ) Now the SACO directory has been created. Compile the C programs. % cd SACO % make Now, put /usr/local/SACO/bin into your execution path. This command, % set path=($path /usr/local/SACO/bin) or an equivalent, can go into your .cshrc file as well, so that the programs are always available to you, from now on. you need to install phred (http://www.phrap.org) and you also need to put the path to the phred binary into your path, something like % set path=($path /usr/local/phred) phred will require that the enviroment variable PHRED_PARAMETER_FILE to be set, but that is handled by the SACO config file in the project directory. (its created by initproj, described beloow in section IV) III) Copy organism "database". For the sake of brevity (hah! thats funny!), only rat is described here. Mouse and Human people: don't despair - help is on the way! Rat genome data has been placed at ftp://ftp.genome.bnl.gov/Vollum/Rat_2004.07.24 Chrom - genome sequences in FASTA format Tags - CATG site tag tables Reg - CRE sequnce/position tables Annot - Annotation tables (Sorry that I didn't make this all one big tarball, but some of the files are so enormous I thought it was better to compress the files individually to make the ftp transfer more manageble. There's a lot of manual work here, but it only needs to be done once) Set up a variable, RATDB for example, to point to a directory where you want to keep the genome & derived tables, for example (% indicates tcsh prompt) % setenv RATDB /usr/local/ratdb % mkdir $RATDB (or somewhere where you would like all these files to reside. You'll need write privilege in /usr/local, or wherever you put it)) Create subdirectories "Chrom", "Tags", "Reg", "Annot" % mkdir $RATDB/Chrom % mkdir $RATDB/Tags % mkdir $RATDB/Reg % mkdir $RATDB/Annot Copy files accordingly from ftp://ftp.genome.bnl.gov/Vollum/Rat_2004.07.24 making sure the subdirectory structure is maintained. Don't forget to uncompress them after downloading, i.e.: % cd $RATDB/Chrom % foreach x ( *.gz ) gunzip $x echo $x end Do likewise for the "Tags", "Reg", and "Annot" subdirectories. There should be no *.gz files left when you're done. Lastly copy "index" from ftp site to $RATDB. This file relates contigs to chromosomes. When finished, the top level of $RATDB should contain % ls $RATDB Annot Chrom Reg Tags index The Chrom subdirectory should contain a lot of chr*.fa file, the Reg subdirectory should contain a lot of .cre files,the Tags subdirectory First time users can skip to the next section (IV) Regenerating the database ------------------------- This isn't necessary, unless a new Rat genome version shows up, or a new anchor enzyme is used (other than CATG), or if tag extenstions change (not 16) - then it will be necessary to regenerate the files. Its assumed that you've installed all the software in (II) above. Its also assumed that you've already created the $RATDB directory structure above, and that the genome fasta sequences are installed in $RATDB/Chrom. A) Index The file index relates contig to chromosome (very important for NCBI data). The script "genindex" creates this file. Important: the fasta files must end in .fa in order for the script to see them. % cd $RATDB % genindex >index (this can take a while) B) Tags These tables contain forward & reverse sequences % cd $RATDB/Chrom % genatags C) REGs, for anotation, other analysis [not yet finalized: highly volatile at this point] D) Annot [not yet finalized; probably never will be. Probably this will be quickly obsolete, and will be replaced by a relational database.] IV) Create a SACO project directory Plan for space of a few hundred megs, mostly for the chromatograms. Pick a name - like ProjX, for example. The script "initproj" will create a project subdirectory of that name in the current working directory. % cd (to wherever you want to put the project subdirectory) % initproj ProjX The program will prompt you for config parameter settings appropriate for the project. Default values come from the lib/config.default file in the saco-software tree (you should adjust these for your institute). (Defaults can also be overridden by unix shell environment variables) initproj creates a directory ./ProjX with a bunch of subdirectories. ./ProjX/Chromats - will hold chromatograms ./ProjX/Seqs - will hold fasta sequences called by phred ./ProjX/Quals - will hold phred quality scores ./ProjX/config - this is a csh-like file with environment variables that are parameters for the saco-matic script (described below). If you want to change any of these, simply edit the file, and rerun saco-matic. V) Gather the chromatograms Because the number of files can get unwieldy, the chromatograms need to be organized in batchs, each batch going into its own subdirectory in Chromats. For the first batch, % cd ./ProjX/Chromats % mkdir Batch1 <- or "mkdir B1" or something like that % cp files.... Batch1 (or untar them there, or something) for the next batch % cd ./ProjX/Chromats % mkdir Batch2 <- or "mkdir B2" % cp new files.... Batch2 (etc etc) (The batch naming scheme is up to you. BatchN is used here for the purposes of illustration) For each Chromats/BatchN subdirectory, corresponding BatchN subdirectories will be created in Seqs and Quals. Because duplicate concatemers will lead to systematic errors in tag counts, duplicates need to be removed. Every new sequence needs to be comparied with all previously collected sequences. While software/database processes could be cooked up to handle new sequence incrementally, its far easier to simply to re-analyze all the sequences from the beginning, each time new data is odded. The computational load is not that heavy. Important: sequence names must be unique for the process of duplicate concatemer detection. To help ensure this, the batch subdirectory names will be prefixed to each sequence in the saco-matic process, described below. VI) Processing with saco-matic When new chromatograms are added, the whole shebang gets reprocessed by a master program: a perl script called "saco-matic". saco-matic invokes several other unix programs (mostly perl scripts and a C program or two), all included the saco software package. The subprograms are mostly unix filters, accepting input on stdin, and sending output to stdout. These can be run individually, but they tend to be terse. saco-matic (and also initproj) are more verbose, and are intended to be more user-friendly. To run, cd into the project directory and simply type % saco-matic After a few environment checks, it starts running, and printing a lot of messages. The whole thing could take a few hours, so sit back and relax and enjoy the show. saco-matic implements this pipeline: (names in [] are subprograms) (corresponding files) Chromatograms (Chromats/BatchN/*) | v [phred] / \ | | v v sequences qualities (Seqs/BatchN/*, Quals/Batch/*) \ / \ / v v [qdisplay] | v combined sequence/qualities (*.qc.fasta) | v [ditags and ditag_duplicates] | \ | v v rejects (doppelgangers) [saco-matic] | v filtered sequences (total.fasta) | v [ditags] | \ | v | ditags (total.ditags) v tags & counts (total.tags) (*.atags) | genome tags | \ | v v [trp_compare] | v tags & locations (trp.out) | v [tag_clusters] | v loci clusters (clust.out) | \ | v genome | [summ_clusters] regulator sites, | \ annotation etc. | v \ | table of loci (clust.summ) v v [annot_clusters] | v annotated loci, (clust.annot) annotated multiple loci (clust2.annot) | [saco-matic] | v summary (summary) When saco-matic completes, all the files listed on the right will be found in the project directory Some notes on what is going on as saco-matic runs: Seqs and Quals: -------------- A subprogram called "seq_update" (perl) performs this processing. phred is only invoked on new batch directories in Chromats. (No need to re-phred on the same old sequences again and again, eh?). Because phred sequences and qualities are so potentially numerous, they are stored in the Seqs and Quals subdirectories (fasta format). The -trim_alt "" trimming option is used. These files are folded back into single fasta sequences (*.qc.fasta), by a program "qdisplay" (perl), which puts q<20 scores in lowercase, and q>=20 scores in uppercase. The next few programs will make doppelganger rejection decisions based on quality. Dopplegangers: -------------- As stated before, duplicate sequences must be removed from the statistics. Rather than performing slow sequence comparisons, we look for repetition of ditags to detect duplicates. Because the probability of ditag pairs being duplicated is already low, the probability of a few ditag pairs being duplicated between two or more sequences is far, far lower. Ditags are first collected from the *.qc.fasta files via the perl script "ditags". At this stage, any ditags with lower case (q<20) calls are excluded from counts, FOR THE SOLE PURPOSE OF SELECTING WHICH DUPLICATE(S) TO REJECT. The perl program "ditag_duplicates" then makes a weighted connection graph of concatemer sequences. Two sequences are "connected" if they share at least one common ditag, and the edge weight indicates the number of shared ditags. Nodes of connected subgraphs of edge weights >= 3 are taken to be duplicates. From each duplicate subgraph, a node (sequence) with a maximum number of q>=20 ditags is selected to keep, and the rest are removed. The sequences to be removed are listed in column 1 of the quasi-tabular file "doppelgangers". Note that two duplicate sequences can have the same number of q>=20 ditags. In such cases, the computer picks one more or less at random, so results may differ from run to run. Filtering: --------- The saco-matic script then removes the designated dopplegangers from *.qc.fasta, puts them in removed.fasta, and places the remainder (good sequence) into "total.fasta". Ditags: ------ The script "ditags" is rerun on "total.fasta" (this time, all sequence is included, even lowercase (q<20) calls; all base calls are converted to uppercase at this point. Ditags are listed & counted in the file "total.ditags", and single tags are listed and counted in "total.tags" Genome location: --------------- The program "trp_compare" builds an index tree of all the tag sequences in total.tags AND also all possible single-substitution permutations of these permutations. Then, it reads all the *.atags files (the files are identified by config parameters SACO_TAG_SEPARATOR and SACO_TAG_EXTENT) in the genome "database", and counts and identifies matches to any of the sequences in the tree. The number of exact matches and inexact matches to each sequence are reported, and the case of unique (exact =1, inexact = 0 or exact=0,inexact=1) or quasi-unique (exact=1, 1 <= inexact <= 4) matches, the unique or exact location and direction is listed. All output goes to the file "trp.out", whose format looks like number of number genome hits unique and quasi-unique: tag seq collected exact inexact matched seq position dir chr TAAATAAAACAGTATG 2 1 1 TAAATAAAACAGTATG 25178176 r chr5 TAAGATGTGAGGGCAG 2 0 0 TAATTAAAAGGAACGG 2 1 0 TAATTAAAAGGAACGG 84896832 r chr7 TACCTAACACGCGAAG 2 1 0 TACCTAACACGCGAAG 250853748 f chr2 TAGAAGAATGCAGATC 2 25543 9349 TAGACACCAAGCAGGT 2 1 1 TAGACACCAAGCAGGT 3627755 f chr20 TAGAGACCATAGCCAG 2 1 31 Cluster identification: ---------------------- (This program is still in a state of flux- we tried something that wasn't very useful, and have since abandoned it). The perl program "cluster_tags" sorts the unique & quasi-unique tags by chromosome and position, and looks for groups of tag positions which are within 2kb (config parameter SACO_CLUST_REACH). Note that this process collects forward and reverse tag pairs (from the same CATG) At this point, the program goes back to the atags files and grabs all the tag sites within a region around each cluster, to see where (if any) non-unique tags land - this wasn't that interesting and has been abandoned, but some of the code is still in place). Then, all this data is output in "clust.out" example cluster of 2 tags, 2 collected, 155242534 - 155244532 span 0 TAAGTCTCACGAACTT 1 1 0 155243533 r <- the two tags AAAAAGGTAAATTTTG 1 1 1 155243533 f <- in this cluster 155243061 ACCTACGAGTCAAATC r 155243061 TTATAGTGTCTTATAA f 155243236 CAAGCTAGCACTCCAC r 155243236 CACAAATCCTGGGTTC f 155243382 TAGCCGGAGCTAACCT r 155243382 GATGTCTCATCATCAA f 155243533 TAAGTCTCACGAACTT r 1 1 0 0 0 <- their location 155243533 AAAAAGGTAAATTTTG f 1 1 1 0 0 155244078 AGTCAGAAAATACTCT r 155244078 GGGATTGTCCCTCTGA f <- other tags in the region, 155244314 TCCTTTATATAGGCTG r as seen in the genome 155244314 GAAGAAGAAATCTTGA f 155244430 AGCCTGAATGTCTTAG r 155244430 GACTGAACAAACTACT f This file is then distilled into a more tabular form by the perl script "summ_cluster": number of tag collected chrom mean pos start stop span sites tags legacy chr2 112615041 112617039 112616040 0 1 1 0 chr2 113253820 113255944 113254882 126 2 2 0 chr2 113274939 113276937 113275938 0 1 1 0 start & stop include 1/2 reach past the entire span of the cluster. Annotation: ---------- clust.out is collated with promoter files and genes and other features from the organism database. This probably should be converted to a relational database system, so I won't bother describing it here. Summary: ------- Lastly, saco-matic synthesizes a summary file "summary" from several of the files generated during the run. VII) Miscellaneous ranloc ------ (C program) generates random chromosomal (contig) locations for Monte Carlo analysis. Usage: ranloc where is the number of positions to be generated, and is a text file containing lines with this data: where and are character strings and is an integer length of the corresponding contig. (whitespace separated, one contig per line.) This is exactly the format of the SACO index file, created by the SACO software script genindex from UCSC, Ensemble and NCBI genomes. Sample positions are generated in a two-step process. First, a contig is randomly selected from the list of contigs in , with the likelood of selection being proportional to the contig sizes. Then for position a random integer is selected in the range 0 to (contig_length-1) Example: % ranloc $RATDB/index 10 chr7 14923000 chr12 24851403 chr20 48513828 chr1 118256832 chr10 44536274 chrX 134820532 chr8 83111103 chr2 196222662 chr11 58291702 chr3 32211642 ransite ------- (C program) selects random sites from input stream for Monte Carlo analysis. Usage: ransite n where n is the number of positions to be generated. Stdin is assumed to be contig-name/position pairs, one per line: contig-name position contig-name position . . . . Its a simple matter to reformat SACO atags files into input suitable for ransite, by using awk: awk '{if ( $3 == "f" ) print $2, $4}' *.CATG.16.atags | ranenz 10000 ransite outputs contig-name/position pairs, one per line. ransite is fairly simple: all contig/position pairs are loaded into an array of size n_pos, and positions are selected by repeated pseudo-random generation of integer indices in the range 0 to (n_pos-1). Example: % awk '{if ( $3 == "f" ) print $2, $4}' $RATDB/Tags/*.CATG.16.atags \ | ransite 10 chr6 138521980 chr10 21985657 chr12 35137848 chr20 45077831 chr20 19565521 chr8 122125378 chr1 106162502 chr2 41664256 chr10 86841126 chr18 82119998 End: SACO README