Man page - gsnap(1)


NAME

gsnap - Genomic Short-read Nucleotide Alignment Program

SYNOPSIS

gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnap [OPTIONS...]

OPTIONS

Input options (must include -d)

-D , --dir = directory

Genome directory. Default (as specified by --with-gmapdb to the configure program) is /var/cache/gmap

-d , --db = STRING

Genome database
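
As a rough illustration of how the synopsis and the mandatory -d flag fit together, the following Python sketch assembles a gsnap argument list using only flags documented on this page (-d, -D, -t, -A, -o). The helper name build_gsnap_command, the genome name "mygenome", the directory and the file names are hypothetical placeholders, not part of GSNAP.

```python
import shlex


def build_gsnap_command(genome_db, reads, genome_dir=None, threads=None,
                        sam_output=False, output_file=None):
    """Assemble a gsnap argument list (hypothetical helper, not part of GSNAP)."""
    cmd = ["gsnap", "-d", genome_db]      # -d is required by the input options
    if genome_dir is not None:
        cmd += ["-D", genome_dir]         # genome directory
    if threads is not None:
        cmd += ["-t", str(threads)]       # number of worker threads
    if sam_output:
        cmd += ["-A", "sam"]              # SAM output format
    if output_file is not None:
        cmd += ["-o", output_file]        # single stream of output results
    return cmd + list(reads)


cmd = build_gsnap_command(
    "mygenome", ["reads_1.fastq", "reads_2.fastq"],
    genome_dir="/data/gmapdb", threads=4,
    sam_output=True, output_file="out.sam",
)
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list could be passed to subprocess.run; building a list rather than a single string avoids shell quoting problems with unusual file names.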

--two-pass

Two-pass mode, in which the sequences are processed first to identify splice sites and introns, and then aligned using this splicing information

--use-localdb = INT

Whether to use the local suffix arrays, which help with finding extensions to the ends of alignments in the presence of splicing or indels (0=no, 1=yes if available (default))

Transcriptome-guided options (optional)

-C , --transcriptdir = directory

Transcriptome directory. Default is the value for --dir above

-c , --transcriptdb = STRING

Transcriptome database

--transcriptome-mode = STRING

Options: assist, only, annotate (default). The option assist means to try transcriptome alignment first, but then use genomic alignment if nothing is found. The option only means to try transcriptome alignment only. The option annotate means to try only genomic alignment and to use the transcriptome only for annotation; this is the fastest option. In the other two options, annotation is also performed

Computation options

-k , --kmer = INT

k-mer size to use in genome database (allowed values: 16 or less). If not specified, the program will find the highest available k-mer size in the genome database

--sampling = INT

Sampling to use in genome database. If not specified, the program will find the smallest available sampling value in the genome database within selected k-mer size

--align-fraction = FLOAT

Process only the given fraction of reads, selected at random. If --align-fraction and --part are given, --align-fraction takes precedence

-q , --part = INT/INT

Process only the i-th out of every n sequences, e.g., 0/100 or 99/100 (useful for distributing jobs to a computer farm).
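
The i/n notation can be pictured with a short Python sketch. It assumes that sequences are numbered from 0 in input order and that part i receives every sequence whose index is congruent to i modulo n; this page does not state the exact rule GSNAP uses, and select_part is a hypothetical name.

```python
def select_part(sequences, i, n):
    """Yield the i-th out of every n sequences (assumed rule: index % n == i)."""
    if not 0 <= i < n:
        raise ValueError("part index must satisfy 0 <= i < n")
    for index, seq in enumerate(sequences):
        if index % n == i:
            yield seq


reads = [f"read{k}" for k in range(10)]
part0 = list(select_part(reads, 0, 3))
part2 = list(select_part(reads, 2, 3))
```

Under this assumption, running the n parts 0/n through (n-1)/n as separate jobs covers every input sequence exactly once.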

--input-buffer-size = INT

Size of input buffer (program reads this many sequences at a time for efficiency) (default 10000)

--barcode-length = INT

Amount of barcode to remove from start of every read before alignment (default 0)

--endtrim-length = INT

Amount of trim to remove from the end of every read before alignment (default 0)
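
The combined effect of --barcode-length and --endtrim-length on a read can be sketched in Python as follows. The function trim_read is a hypothetical illustration of the two descriptions above, not GSNAP code; how GSNAP treats reads shorter than the combined trim is not stated here, so the sketch simply returns an empty string in that case.

```python
def trim_read(sequence, barcode_length=0, endtrim_length=0):
    """Remove barcode_length bases from the start and endtrim_length from the end."""
    end = len(sequence) - endtrim_length
    # Assumption: an over-trimmed read becomes empty rather than raising an error.
    return sequence[barcode_length:max(end, barcode_length)]


trimmed = trim_read("AAACGTACGTTT", barcode_length=3, endtrim_length=3)
```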

--orientation = STRING

Orientation of paired-end reads. Allowed values: FR (fwd-rev, or typical Illumina; default), RF (rev-fwd, for circularized inserts), FF (fwd-fwd, same strand), or 10X (single-cell, where read 1 has barcode information and read 2 is rev)

--10x-whitelist = FILE

Whitelist of 10X Genomics GEM bead barcodes, needed to perform correction of cellular barcodes. This file can be obtained at

cellranger-x.y.z/lib/python/cellranger/barcodes (for Cell Ranger version >= 4)
cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes (<= 3)

--10x-well-position = INT

Position of well information in the accession, when separated by colons. If set to 0, then no well information will be printed in the CB field (default: 4)

--fastq-id-start = INT

Starting position of identifier in FASTQ header, space-delimited (>= 1)

--fastq-id-end = INT

Ending position of identifier in FASTQ header, space-delimited (>= 1)

Examples:
@HWUSI-EAS100R:6:73:941:1973#0/1

start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

start=1, end=1 => identifier is SRR001666.1
start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
start=1, end=2 => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
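
The examples above can be reproduced with a small Python sketch. It assumes the header is split on whitespace, that fields start through end (1-based, inclusive) are rejoined with single spaces, and that a trailing /1 or /2 mate suffix is dropped, as the first example suggests. The function name fastq_identifier is hypothetical.

```python
def fastq_identifier(header, start=1, end=1):
    """Return space-delimited fields start..end (1-based, inclusive) of a FASTQ header."""
    fields = header.lstrip("@").split()
    ident = " ".join(fields[start - 1:end])
    # Assumption: a trailing /1 or /2 mate suffix is removed, as in the first example.
    if ident.endswith(("/1", "/2")):
        ident = ident[:-2]
    return ident


illumina = "@HWUSI-EAS100R:6:73:941:1973#0/1"
sra = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
```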

--force-single-end

When multiple FASTQ files are provided on the command line, GSNAP assumes they are matching paired-end files. This flag treats each file as single-end.

--filter-chastity = STRING

Skips reads marked by the Illumina chastity program. Expecting a string after the accession having a ’Y’ after the first colon, like this:

@accession 1:Y:0:CTTGTA

where the ’Y’ signifies filtering by chastity. Values: off (default), either, both. For ’either’, a ’Y’ on either end of a paired-end read will be filtered. For ’both’, a ’Y’ is required on both ends of a paired-end read (or on the only end of a single-end read).
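
The either/both distinction can be sketched in Python. The sketch assumes the chastity flag is the second colon-separated field of the string that follows the accession, as in the header shown above; is_flagged and skip_read are hypothetical names, not GSNAP functions.

```python
def is_flagged(header):
    """True if the comment after the accession has 'Y' after the first colon."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def skip_read(mode, header1, header2=None):
    """Decide whether a read (or read pair) is skipped under --filter-chastity."""
    if mode == "off":
        return False
    flags = [is_flagged(header1)]
    if header2 is not None:
        flags.append(is_flagged(header2))
    # 'either': one flagged end is enough; 'both': every end must be flagged.
    return any(flags) if mode == "either" else all(flags)


bad = "@accession 1:Y:0:CTTGTA"
good = "@accession 2:N:0:CTTGTA"
```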

--allow-pe-name-mismatch

Allows accession names of reads to mismatch in paired-end files

--interleaved

Input is in interleaved format (one read per line, tab-delimited)

--gunzip

Uncompress gzipped input files

--bunzip2

Uncompress bzip2-compressed input files

Computation options

-B , --batch = INT

Batch mode (default = 5)

Mode  Hash offsets  Hash positions  Genome          Local hash offsets  Local hash positions  Localdb
0     allocate      mmap            mmap            allocate            mmap                  mmap
1     allocate      mmap & preload  mmap            allocate            mmap & preload        mmap
2     allocate      mmap & preload  mmap & preload  allocate            mmap & preload        mmap
3     allocate      allocate        mmap & preload  allocate            allocate              mmap
4     allocate      allocate        allocate        allocate            allocate              mmap
5     allocate      allocate        allocate        allocate            allocate              allocate  (default)

Note: For a single sequence, all data structures use mmap

A batch level of 5 means the same as 4, and is kept only for backward compatibility

--use-shared-memory = INT

If 1, then allocated memory is shared among all processes on this node. If 0 (default), then each process has private allocated memory

--preload-shared-memory

Load files indicated by --batch mode into shared memory for use by other GMAP/GSNAP processes on this node, and then exit. Ignore any input files.

--unload-shared-memory

Unload files indicated by --batch mode from shared memory, or allow them to be unloaded when existing GMAP/GSNAP processes on this node are finished with them. Ignore any input files.

-m , --max-mismatches = FLOAT

Maximum number of mismatches allowed (if not specified, then GSNAP tries to find the best possible match in the genome). If specified between 0.0 and 1.0, then treated as a fraction of each read length. Otherwise, treated as an integral number of mismatches (including indel and splicing penalties). Default is 0.3
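
The two readings of the -m value can be sketched in Python. Two details are assumptions, since this page does not specify them: a value of exactly 1.0 is treated as one mismatch rather than as a fraction, and the fractional limit is truncated toward zero. The name mismatch_limit is hypothetical.

```python
def mismatch_limit(value, read_length):
    """Interpret a --max-mismatches value for a read of the given length."""
    if 0.0 <= value < 1.0:
        # Fraction of the read length (assumed to be truncated toward zero).
        return int(value * read_length)
    # Otherwise an integral number of mismatches.
    return int(value)


limit_fraction = mismatch_limit(0.25, 100)
limit_count = mismatch_limit(3, 100)
```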

--query-unk-mismatch = INT

Whether to count unknown (N) characters in the query as a mismatch (0=no (default), 1=yes)

--genome-unk-mismatch = INT

Whether to count unknown (N) characters in the genome as a mismatch (0=no, 1=yes). If --use-mask is specified, default is no, otherwise yes.

--maxsearch = INT

Maximum number of alignments to find (default 1000). Should be larger than --npaths , which is the number to report. Keeping this number large will allow for random selection among multiple alignments. Reducing this number can speed up the program.

--indel-endlength = INT

Minimum length at end required for indel alignments (default 4)

--max-insertions = INT

Maximum number of insertions allowed (default 9)

--max-deletions = INT

Maximum number of deletions allowed (default 15)

-M , --suboptimal-levels = INT

Report suboptimal hits beyond best hit (default 0). All hits with best score plus suboptimal-levels are reported. (Note: Not currently implemented)

-a , --adapter-strip = STRING

Method for removing adapters from reads. Currently allowed values: off, paired. Default is "off". To turn on, specify "paired", which removes adapters from paired-end reads if they appear to be present.

-e , --use-mask = STRING

Use genome containing masks (e.g. for non-exons) for scoring preference

-V , --snpsdir = STRING

Directory for SNPs index files (created using snpindex) (default is location of genome index files specified using -D and -d )

-v , --use-snps = STRING

Use database containing known SNPs (in <STRING>.iit, built previously using snpindex) for tolerance to SNPs

--cmetdir = STRING

Directory for methylcytosine index files (created using cmetindex) (default is location of genome index files specified using -D , -V , and -d )

--atoidir = STRING

Directory for A-to-I RNA editing index files (created using atoiindex) (default is location of genome index files specified using -D , -V , and -d )

--mode = STRING

Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded, atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded. Non-standard modes require you to have previously run the cmetindex or atoiindex programs (which also cover the ttoc modes) on the genome

-t , --nthreads = INT

Number of worker threads

Splicing options for DNA-Seq

--find-dna-chimeras = INT

Look for distant splicing involving poor splice sites (0=no, 1=yes). If not specified, then default is to be on unless only known splicing is desired (--use-splicing is specified and --novelsplicing is off)

Splicing options for RNA-Seq

-N , --novelsplicing = INT

Look for novel splicing (0=no (default), 1=yes)

--splicingdir = STRING

Directory for splicing involving known sites or known introns, as specified by the -s or --use-splicing flag (default is directory computed from -D and -d flags). Note: can just give full pathname to the -s flag instead.

-s , --use-splicing = STRING

Look for splicing involving known sites or known introns (in <STRING>.iit), at short or long distances. See README instructions for the distinction between known sites and known introns

--splices-noeval

Do not evaluate splices for probability or intron length, but depend only on sequence alignment

--splices-dump = FILE

Write splice junction information to FILE, in the same format as for STAR plus MaxEnt probabilities for the two intron positions. Note that in this dump file, the annotation column is reserved strictly for known introns, and not novel introns that passed some criterion from a first pass.

--splices-include-knownp

In the file for --splices-dump , include all known introns

--splices-read = FILE

Read allowable splices from FILE, in the same format as for STAR. This is useful if some external program can evaluate and filter the results from --splices-dump in a first alignment pass, and then GSNAP can use the filtered splices in a second alignment pass

-w , --localsplicedist = INT

Definition of local novel splicing event (default 200000)

--merge-distant-samechr

Report distant splices on the same chromosome as a single splice, if possible. Will produce a single SAM line instead of two SAM lines, which is also done for translocations, inversions, and scramble events

Options for paired-end reads

--pairmax-dna = INT

Max total genomic length for DNA-Seq paired reads, or other reads without splicing (default 2000). Used if -N or -s is not specified. This value is also used for circular chromosomes when splicing in linear chromosomes is allowed

--pairmax-rna = INT

Max total genomic length for RNA-Seq paired reads, or other reads that could have a splice (default 200000). Used if -N or -s is specified. Should probably match the value for -w , --localsplicedist .

--resolve-inner = INT

Whether to resolve soft-clipping on the insides of paired-end reads (default 1)

--pairexpect = INT

Expected paired-end length, used for resolving soft-clipping on the insides of paired-end reads, and for pairing DNA-seq reads (default 200)

--pairdev = INT

Allowable deviation from expected paired-end length, used for resolving soft-clipping on the insides of paired-end reads (default 100).

--pass1-min-support = INT

Threshold read support for learning an intron during pass 1 of --two-pass mode (default 20)

Options for quality scores

--quality-protocol = STRING

Protocol for input quality scores. Allowed values:

illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
sanger (ASCII 33-126) (equivalent to -J 33 -j 0)

Default is sanger (no quality print shift)

SAM output files should have quality scores in sanger protocol

Or you can customize this behavior with these flags:

-J , --quality-zero-score = INT

FASTQ quality scores are zero at this ASCII value (default is 33 for sanger protocol; for Illumina, select 64)

-j , --quality-print-shift = INT

Shift FASTQ quality scores by this amount in output (default is 0 for sanger protocol; to change Illumina input to Sanger output, select -31 )
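
The arithmetic behind -J and -j can be checked with a short Python sketch: the zero score is the ASCII value that encodes quality 0, and the print shift is added to each character code on output. The function names are hypothetical, and the sketch only mirrors the descriptions above rather than GSNAP's implementation.

```python
def phred_scores(quality, zero_score=33):
    """Decode a FASTQ quality string, given the ASCII value that means score 0 (-J)."""
    return [ord(ch) - zero_score for ch in quality]


def shift_quality(quality, print_shift=0):
    """Shift every quality character by print_shift ASCII positions (-j)."""
    return "".join(chr(ord(ch) + print_shift) for ch in quality)


illumina_quality = "@h"                        # ASCII 64 and 104
sanger_quality = shift_quality(illumina_quality, -31)
```

With the Illumina settings (-J 64 -j -31), the characters '@' and 'h' decode to scores 0 and 40 and are printed as '!' and 'I', which decode to the same scores under the sanger zero score of 33.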

Output options

-n , --npaths = INT

Maximum number of paths to print (default 100).

-Q , --quiet-if-excessive

If more than maximum number of paths are found, then nothing is printed.

-O , --ordered

Print output in same order as input (relevant only if there is more than one worker thread)

--show-refdiff

For GSNAP output in SNP-tolerant alignment, shows all differences relative to the reference genome as lower case (otherwise, it shows all differences relative to both the reference and alternate genome)

--clip-overlap

For paired-end reads whose alignments overlap, clip the overlapping region.

--merge-overlap

For paired-end reads whose alignments overlap, merge the two ends into a single end (beta implementation)

--print-snps

Print detailed information about SNPs in reads (works only if -v also selected) (not fully implemented yet)

--failsonly

Print only failed alignments, those with no results

--nofails

Exclude printing of failed alignments

--only-concordant

Print only concordant alignments (concordant_uniq, concordant_mult, concordant_circular)

--omit-concordant-uniq

Do not print any concordant_uniq alignments

--omit-concordant-mult

Do not print any concordant_mult alignments

--omit-softclipped

Do not allow any alignments with soft clips

--only-tr-consistent

Print only alignments with consistent transcripts (XX field present, identical if paired-end)

-A , --format = STRING

Another format type, other than default. Currently implemented: sam, m8 (BLAST tabular format)

--split-output = STRING

Basename for multiple-file output, separately for nomapping, halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult, paired_uniq, paired_mult, concordant_uniq, and concordant_mult results
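
As a hedged illustration, the Python sketch below lists the file names that --split-output might produce for a given basename. It assumes each file is named basename plus a dot plus the category, which is inferred only from the ".nomapping" file mentioned under --failed-input; the category list is copied from the description above, and split_output_files is a hypothetical helper.

```python
CATEGORIES = [
    "nomapping",
    "halfmapping_uniq", "halfmapping_mult",
    "unpaired_uniq", "unpaired_mult",
    "paired_uniq", "paired_mult",
    "concordant_uniq", "concordant_mult",
]


def split_output_files(basename):
    """Return the assumed output file names, one per result category."""
    return [f"{basename}.{category}" for category in CATEGORIES]


files = split_output_files("run1")
```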

-o , --output-file = STRING

File name for a single stream of output results.

--failed-input = STRING

Print completely failed alignments as input FASTA or FASTQ format, to the given file, appending .1 or .2, for paired-end data. If the --split-output flag is also given, this file is generated in addition to the output in the .nomapping file.

--append-output

When --split-output or --failed-input is given, this flag will append output to the existing files. Otherwise, the default is to create new files.

--order-among-best = STRING

Among alignments tied with the best score, order those alignments in this order. Allowed values: genomic, random (default)

--output-buffer-size = INT

Buffer size, in queries, for output thread (default 1000). When the number of results to be printed exceeds this size, worker threads wait until the backlog is cleared

Options for SAM output

--no-sam-headers

Do not print headers beginning with ’@’

--add-paired-nomappers

Add nomapper lines as needed to make all paired-end results alternate between first end and second end

--paired-flag-means-concordant = INT

Whether the paired bit in the SAM flags means concordant only (1) or paired plus concordant (0, default)

--sam-headers-batch = INT

Print headers only for this batch, as specified by -q

--sam-hardclip-use-S

Use S instead of H for hardclips

--sam-use-0M = INT

If 1 (default), then insert 0M in CIGAR between adjacent indels and introns. If 0, do not allow 0M. Picard disallows 0M, but other tools may require it

--sam-extended-cigar

Use extended CIGAR format (using = and X symbols instead of M, to indicate matches and mismatches, respectively)
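
The difference between the two CIGAR styles can be shown with a Python sketch for an ungapped alignment: '=' marks a sequence match, 'X' marks a mismatch, and both collapse to 'M' in the standard form. The helper names are hypothetical, and the sketch ignores soft clips and the 0M handling described under --sam-use-0M.

```python
import re
from itertools import groupby


def extended_cigar(read, reference):
    """Build an =/X CIGAR for two equal-length, ungapped sequences."""
    if len(read) != len(reference):
        raise ValueError("sequences must have equal length")
    ops = ("=" if r == g else "X" for r, g in zip(read, reference))
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def standard_cigar(cigar):
    """Collapse = and X into M, merging adjacent runs of the same operation."""
    merged = []
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        op = "M" if op in "=X" else op
        if merged and merged[-1][1] == op:
            merged[-1] = (merged[-1][0] + int(count), op)
        else:
            merged.append((int(count), op))
    return "".join(f"{count}{op}" for count, op in merged)


extended = extended_cigar("ACGTA", "ACCTA")
standard = standard_cigar(extended)
```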

--sam-multiple-primaries

Allows multiple alignments to be marked as primary if they have equally good mapping scores

--sam-sparse-secondaries

For secondary alignments (in multiple mappings), uses ’*’ for SEQ and QUAL fields, to give smaller file sizes. However, the output will give warnings in Picard and may not work with downstream tools

--force-xs-dir

For RNA-Seq alignments, disallows XS:A:? when the sense direction is unclear, and replaces this value arbitrarily with XS:A:+. May be useful for some programs, such as Cufflinks, that cannot handle XS:A:?. However, if you use this flag, the reported value of XS:A:+ in these cases will not be meaningful.

--md-report-snps

In MD string, when known SNPs are given by the -v flag, prints difference nucleotides when they differ from reference but match a known alternate allele

--no-soft-clips

Does not allow soft clips at ends. Mismatches will be counted over the entire query

--extend-soft-clips

Extends alignments through soft clipped regions. CIGAR string and coordinates will be revised, but mismatches and the MD string will reflect the clipped CIGAR

--action-if-cigar-error

Action to take if there is a disagreement between CIGAR length and sequence length. Allowed values: ignore, warning (default), noprint, abort. Note that the noprint option does not print the CIGAR string at all if there is an error, so it may break a SAM parser

--read-group-id = STRING

Value to put into read-group id (RG-ID) field

--read-group-name = STRING

Value to put into read-group name (RG-SM) field

--read-group-library = STRING

Value to put into read-group library (RG-LB) field

--read-group-platform = STRING

Value to put into read-group platform (RG-PL) field

Help options

--check

Check compiler assumptions

--version

Show version

--help

Show this help message

Other tools of the GMAP suite are located in /usr/lib/gmap