Man page - xtract(1)

XTRACT

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
Processing Flags
Data Source
Exploration Argument Hierarchy
Path Navigation
Exploration Constructs
Conditional Execution
String Constraints
Object Constraints
Numeric Constraints
Format Customization
XML Generation
Tag and Attribute Construction
FASTA Parsable Fields
Element Selection
-element Constructs
Special -element Operations
Numeric (Integer) Processing
Leading Zero Padding
Character Processing
String Processing
Text Processing
Citation Functions
Value Transformation
Regular Expression
Sequence Processing
Nucleotide Processing
Protein Processing
Sequence Coordinates
Command Generator
Frequency Table
Entrez Indexing
Output Organization
Record Selection
Record Rearrangement
Reformatting
Validation
Summary
Full Exploration Command Precedence
Documentation
NOTES
SEE ALSO

NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [ -help ] [ -strict ] [ -mixed ] [ -self ] [ -accent ] [ -ascii ] [ -compress ] [ -stops ] [ -input filename ] [ -transform filename ] [ -aliases filename ] [ -pattern expr ] [ -group expr ] [ -block expr ] [ -subset expr ] [ -path path ] [ -if expr [ constraint ]] [ -unless expr [ constraint ]] [ -and condition ] [ -or condition ] [ -else ] [ -position pos ] [ -equals str ] [ -contains str ] [ -mimics str ] [ -excludes str ] [ -includes str ] [ -is-within str ] [ -starts-with str ] [ -ends-with str ] [ -is-not str ] [ -is-before str ] [ -is-after str ] [ -consists-of str ] [ -matches str ] [ -resembles str ] [ -is-equal-to expr ] [ -differs-from expr ] [ -gt N ] [ -ge N ] [ -lt N ] [ -le N ] [ -eq N ] [ -ne N ] [ -ret str ] [ -tab str ] [ -sep str ] [ -pfx str ] [ -sfx str ] [ -rst ] [ -clr ] [ -pfc str ] [ -deq str ] [ -def str ] [ -lbl str ] [ -set tag ] [ -rec tag ] [ -wrp tag ] [ -enc tag ] [ -plg str ] [ -elg str ] [ -pkg tag ] [ -fwd str ] [ -awd str ] [ -tag tag ] [ -att key str ] [ -atr key element ] [ -cls ] [ -slf ] [ -end tag ] [ -bkt ] [ -element element ] [ -first element ] [ -last element ] [ -even element ] [ -odd element ] [ -backward element ] [ -NAME ] [ --STATS ] [ -num element ] [ -len element ] [ -sum element ] [ -acc element ] [ -min element ] [ -max element ] [ -inc element ] [ -dec element ] [ -sub element ] [ -avg element ] [ -dev element ] [ -med element ] [ -mul element ] [ -div element ] [ -mod element ] [ -geo element ] [ -hrm element ] [ -rms element ] [ -sqt element ] [ -lge element ] [ -lg2 element ] [ -log element ] [ -bin element ] [ -oct element ] [ -hex element ] [ -bit element ] [ -pad element ] [ -encode element ] [ -decode element ] [ -upper element ] [ -lower element ] [ -chain element ] [ -title element ] [ -mirror element ] [ -alpha element ] [ -alnum element ] [ -basic element ] [ -plain element ] [ -simple element ] [ -author element ] [ -journal element ] [ -prose element ] [ -terms element ] [ -words element ] [ -pairs element ] [ -split element -with str ] [ -order element ] [ -reverse element ] [ -letters element ] [ -clauses element ] [ -pentamers element ] [ -year element ] [ -month element ] [ -date element ] [ -page element ] [ -auth element ] [ -initials element ] [ -trim element ] [ -wct element ] [ -doi element ] [ -accession element ] [ -numeric element ] [ -translate element ] [ -classify element ] [ -replace -reg target -exp replacement ] [ -fasta ] [ -revcomp ] [ -nucleic ] [ -ncbi2na ] [ -ncbi4na ] [ -cds2prot [ -gcode N ] [ -frame N ]] [ -molwt ] [ -molwt-m ] [ -molwt-f ] [ -pept ] [ -0-based element ] [ -1-based element ] [ -ucsc-based element ] [ -insd arg ...] [ -insdx ] [ -histogram ] [ -indexer element ] [ -head str ] [ -tail str ] [ -hd str ] [ -tl str ] [ -select condition ] [ -in filename ] [ -sort [ -fwd ] element ] [ -sort-rev element ] [ -format fmt [ -unicode style ]] [ -verify ] [ -test ] [ -outline ] [ -synopsis ] [ -contour [ delimiter ]] [ -examples ] [ -unix ] [ -version ]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.
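
As a minimal sketch of this conversion, the commands below write a small, made-up PubMed-style record set to a file and, if xtract is installed, print one tab-delimited row per record. The file name xtract_demo.xml and the record contents are illustrative assumptions, not part of xtract.

```shell
# Write a small, made-up PubMed-style record set (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <ArticleTitle>A hypothetical title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222</PMID>
      <Article>
        <ArticleTitle>Another hypothetical title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

# One output row per PubmedArticle: the PMID, then the article title.
if command -v xtract >/dev/null 2>&1; then
  xtract -pattern PubmedArticle \
    -element MedlineCitation/PMID ArticleTitle < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```

Here -pattern names the record to loop over and -element lists the values to print for each record.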

OPTIONS

Processing Flags

-strict

Remove HTML and MathML tags.

-mixed

Allow mixed content XML.

-self

Allow detection of empty self-closing tags.

-accent

Delete Unicode accents and diacritical marks.

-ascii

Convert Unicode to numeric HTML character entities.

-compress

Compress runs of spaces.

-stops

Retain stop words in selected phrases.

Data Source

-input filename

Read XML from file instead of standard input.

-transform filename

File of substitutions for -translate.

-aliases filename

Mappings file for -classify operation.

Exploration Argument Hierarchy

-pattern expr
-group expr
-block expr
-subset expr

Name of record within set. Use of different argument names allows command-line control of nested looping.
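
The nested looping described above might be sketched as follows: -pattern visits each record, and -block then visits each Author inside the current record. The sample file, its name, and the author names are made up for illustration.

```shell
# Made-up record with two authors (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
          <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

# Outer loop: one row per PubmedArticle, starting with the PMID.
# Inner loop: one column per Author, with initials and last name
# joined by a space (-sep) instead of the default separator.
if command -v xtract >/dev/null 2>&1; then
  xtract -pattern PubmedArticle -element MedlineCitation/PMID \
    -block Author -sep " " -element Initials,LastName < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```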

Path Navigation

-path path

Explore by list of adjacent object names.

Exploration Constructs

Object

DateRevised

Parent/Child

Book/AuthorList

Path

MedlineCitation/Article/Journal/JournalIssue/PubDate

Heterogeneous

"PubmedArticleSet/*"

Exhaustive

"History/**"

Nested

"*/Taxon"

Conditional Execution

-if expr [ constraint ]

Element (or @attribute) must exist and satisfy any specified constraint.

-unless expr [ constraint ]

Skip if element matches.

-and condition

Preceding and following tests must both pass.

-or condition

Any passing test suffices.

-else

Execute if conditional test failed.

-position pos

first / last / outer / inner / even / odd / all.

String Constraints

-equals str

String must match exactly.

-contains str

Substring must be present.

-mimics str

Containment test after converting punctuation to space.

-excludes str

Substring must be absent.

-includes str

Substring must match at word boundaries.

-is-within str

String must be present.

-starts-with str

Substring must be at beginning.

-ends-with str

Substring must be at end.

-is-not str

String must not match.

-is-before str

First string < second string.

-is-after str

First string > second string.

-consists-of str

String must only contain specified characters.

-matches str

Matches without commas or semicolons.

-resembles str

Requires all words, but in any order.

Object Constraints

-is-equal-to expr

Object values must match.

-differs-from expr

Object values must differ.

Numeric Constraints

-gt N

Greater than.

-ge N

Greater than or equal to.

-lt N

Less than.

-le N

Less than or equal to.

-eq N

Equal to.

-ne N

Not equal to.
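
A conditional test combines -if with one of the string or numeric constraints above. The sketch below assumes a made-up two-record sample file. The first command keeps only records whose Language is eng, and the second keeps only records with more than one Author, using the Object Count construct "#Author".

```shell
# Made-up sample: one English record with two authors,
# one French record with a single author (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <Language>eng</Language>
        <AuthorList>
          <Author><LastName>Doe</LastName></Author>
          <Author><LastName>Roe</LastName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222</PMID>
      <Article>
        <Language>fre</Language>
        <AuthorList>
          <Author><LastName>Martin</LastName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

if command -v xtract >/dev/null 2>&1; then
  # String constraint: print the PMID only where Language equals "eng".
  xtract -pattern PubmedArticle -if Language -equals eng \
    -element MedlineCitation/PMID < xtract_demo.xml

  # Numeric constraint: print the PMID only where the Author count exceeds 1.
  xtract -pattern PubmedArticle -if "#Author" -gt 1 \
    -element MedlineCitation/PMID < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```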

Format Customization

-ret str

Override line break between patterns.

-tab str

Replace tab character between fields.

-sep str

Separator between group members.

-pfx str

Prefix to print before group.

-sfx str

Suffix to print after group.

-rst

Reset -sep through -elg.

-clr

Clear queued tab separator.

-pfc str

Preface combines -clr and -pfx.

-deq str

Delete and replace queued tab separator.

-def str

Default placeholder for missing fields.

-lbl str

Insert arbitrary text.
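
As a hedged illustration of the format options, the sketch below replaces the tab between fields with a vertical bar, joins repeated values with "; ", and prints "-" for a missing field. The sample file and its contents are made up, and the record deliberately has no ISSN element so that the -def placeholder has something to fill.

```shell
# Made-up record with two authors and no ISSN (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Doe</LastName></Author>
          <Author><LastName>Roe</LastName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

# -def: placeholder for missing fields (ISSN is absent here)
# -tab: text placed between fields
# -sep: text placed between multiple values of one field (the two LastName items)
if command -v xtract >/dev/null 2>&1; then
  xtract -pattern PubmedArticle -def "-" -tab "|" -sep "; " \
    -element MedlineCitation/PMID LastName ISSN < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```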

XML Generation

-set tag

XML tag for entire set.

-rec tag

XML tag for each record.

-wrp tag

Wrap elements in XML object.

-enc tag

Encase instance in XML object.

-plg str

Prologue to print before instance.

-elg str

Epilogue to print after instance.

-pkg tag

Package subset in XML object.

-fwd str

Foreword to print before subset.

-awd str

Afterword to print after subset.
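
The XML generation options can turn extracted values back into a simpler XML document. The following is a sketch under stated assumptions: the output tag names Set, Rec, Id, and Title are arbitrary choices, the sample file is made up, and placing -set and -rec before -pattern is assumed to be the intended argument order.

```shell
# Made-up single-record sample (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <ArticleTitle>A hypothetical title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

# -set wraps the whole output, -rec wraps each record,
# and each -wrp wraps the values of the -element that follows it.
if command -v xtract >/dev/null 2>&1; then
  xtract -set Set -rec Rec -pattern PubmedArticle \
    -wrp Id -element MedlineCitation/PMID \
    -wrp Title -element ArticleTitle < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```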

Tag and Attribute Construction

-tag tag

Start with <tag.

-att key str

Attribute key and literal string.

-atr key element

Attribute key and element name.

-cls

Close with >.

-slf

Self-close with />.

-end tag

End contents with </tag>.

FASTA Parsable Fields

-bkt

Wrap elements in bracketed fields.

Element Selection

-element element

Print all items that match tag name.

-first element

Only print value of first item.

-last element

Only print value of last item.

-even element

Only print value of even items.

-odd element

Only print value of odd items.

-backward element

Print values in reverse order.

-NAME

Record value in named variable.

--STATS

Accumulate values into variable.
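
A variable lets a value found at the record level be reused inside a nested loop. In the sketch below the variable name PMID is an arbitrary uppercase choice, and the sample file is made up. The value is stored once per record and recalled with the "&PMID" construct on each author row.

```shell
# Made-up record with two authors (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Doe</LastName></Author>
          <Author><LastName>Roe</LastName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

# -PMID stores the record's PMID in a variable named PMID.
# Inside each Author block, "&PMID" prints the stored value
# next to that author's LastName.
if command -v xtract >/dev/null 2>&1; then
  xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block Author -element "&PMID" LastName < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```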

-element Constructs

Tag

Caption

Group

Initials,LastName

Parent/Child

MedlineCitation/PMID

Recursive

"**/Gene-commentary_accession"

Unrestricted

PubDate/*

Attribute

DescriptorName@MajorTopicYN

Range

MedlineDate[1:4]

Substring

"Title[phospholipase | rattlesnake]"

Alternative

"[can contain ^ vertical bar]"

Object Count

"#Author"

Item Length

"%Title"

Element Depth

"^PMID"

Variable

"&NAME"

Special -element Operations

Parent Index

"+"

Object Name

"?"

Object Value

"~"

XML Subtree

"*"

Children

"$"

Attributes

"@"

ASN.1 Record

"."

JSON Record

"%"

Numeric (Integer) Processing

-num element

Count.

-len element

Length.

-sum element

Sum.

-acc element

Accumulator.

-min element

Minimum.

-max element

Maximum.

-inc element

Increment.

-dec element

Decrement.

-sub element

Difference.

-avg element

Arithmetic mean.

-dev element

Deviation.

-med element

Median.

-mul element

Product.

-div element

Quotient.

-mod element

Remainder.

-geo element

Geometric mean.

-hrm element

Harmonic mean.

-rms element

Root mean square.

-sqt element

Square root.

-lge element

Natural logarithm.

-lg2 element

Logarithm base two.

-log element

Logarithm base ten.

-bin element

Binary.

-oct element

Octal.

-hex element

Hexadecimal.

-bit element

Number of bits set.

Leading Zero Padding

-pad element

Zero-pad to eight digits.

Character Processing

-encode element

XML-encode <, >, &, ", and ' characters.

-decode element

Base64-decode object embedded in XML.

-upper element

Convert text to uppercase.

-lower element

Convert text to lowercase.

-chain element

Change spaces to underscores.

-title element

Capitalize initial letters of words.

-mirror element

Reverse order of letters.

-alpha element

Non-alphabetic characters to space.

-alnum element

Non-alphanumeric characters to space.

String Processing

-basic element

Convert superscripts and subscripts.

-plain element

Remove embedded mixed-content markup tags.

-simple element

Normalize accented letters; spell Greek letters.

-author element

Multi-step author cleanup.

-journal element

Journal capitalization and punctuation.

-prose element

Text conversion to ASCII.

Text Processing

-terms element

Partition text at spaces.

-words element

Split at punctuation marks.

-pairs element

Adjacent informative words.

-split element

Split using -with for delimiter.

-order element

Rearrange words in sorted order.

-reverse element

Reverse words in string.

-letters element

Separate individual letters.

-clauses element

Break at phrase separators.

-pentamers element

Sliding window of pentamers.

Citation Functions

-year element

Extract first 4-digit year from string.

-month element

Match first month name and return a corresponding integer.

-date element

YYYY/MM/DD from -unit "PubDate" -date "*"

-page element

Get digits (and letters) of first page number.

-auth element

Change GenBank authors to Medline form.

-initials element

Parse initials from forename or given name.

-trim element

Remove extra spaces and leading zeros.

-wct element

Count number of -words in a string.

-doi element

Add https://doi.org/ prefix, URL encode.

-accession element

Allow indexing of full accession.version.

-numeric element

Only accept items that are entirely digits.

Value Transformation

-translate element

Substitute values with -transform table.

-classify element

Substring word or phrase matches to -aliases table.

Regular Expression

-replace

Substitute text using regular expressions.

-reg target

Target expression.

-exp replacement

Replacement pattern.

Sequence Processing

-fasta

Split sequence into blocks of 70 uppercase letters.

Nucleotide Processing

-revcomp

Reverse complement nucleotide sequence.

-nucleic

Subrange determines forward or revcomp.

-ncbi2na

Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)

-ncbi4na

Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)

-cds2prot [ -gcode N ] [ -frame N ]

Translate coding region using -gcode and (1-based) -frame (both 1 by default).

Protein Processing

-molwt

Calculate molecular weight of peptide.

-molwt-m

Molecular weight retaining initial methionine.

-molwt-f

Keep initial M residue as formyl-methionine.

-pept

Split amino acid runs at *, -, x, or X.

Sequence Coordinates

-0-based element

Zero-based.

-1-based element

One-based.

-ucsc-based element

Half-open.

Command Generator

-insd arg ...

Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:

Descriptor(s)

INSDSeq_sequence / INSDSeq_definition / INSDSeq_division /... [...]

Completeness

complete / partial

Feature(s)

CDS / mRNA /...[ , ...]

Qualifier(s)

INSDFeature_key / "#INSDInterval" / gene / product / feat_location / sub_sequence /... [...]

-insdx

Process -insd output table into XML.

Frequency Table

-histogram

Collects data for sort-uniq-count (1) on entire set of records.

Entrez Indexing

-indexer element

Positional index using -wrp for field name.

Output Organization

-head str

Print before everything else.

-tail str

Print after everything else.

-hd str

Print before each record.

-tl str

Print after each record.

Record Selection

-select condition

Select record subset by conditions.

-in filename

File of identifiers to use for selection.

Record Rearrangement

-sort [ -fwd ] element

Element to use as sort key.

-sort-rev element

Sort records in reverse order.

Reformatting

-format fmt

copy

Fast block copy (still applies processing flags).

compact

Compress runs of spaces.

flush

Suppress line indentation.

indent

Indent according to nesting depth.

expand

Place each attribute on a separate line.

Validation

-verify

Report XML data integrity problems.

-test

Check field for visible combining accents and invisible Unicode.

Summary

-outline

Display outline of XML structure.

-synopsis

Display individual XML paths.

-contour [ delimiter ]

Display XML paths to leaf nodes (delimited by / by default).
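
Before writing an extraction command for unfamiliar data, the summary options can show which element names and paths are available. A minimal sketch, assuming a made-up sample file:

```shell
# Made-up single-record sample (illustrative data only).
cat > xtract_demo.xml <<'EOF'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article>
        <ArticleTitle>A hypothetical title</ArticleTitle>
        <Language>eng</Language>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
EOF

if command -v xtract >/dev/null 2>&1; then
  # Outline of the XML structure.
  xtract -outline < xtract_demo.xml

  # Individual XML paths, usable as a starting point for -element arguments.
  xtract -synopsis < xtract_demo.xml
else
  echo "xtract not found; skipping" >&2
fi
```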

Full Exploration Command Precedence

-pattern

-path

-division

-group

-branch

-block

-section

-subset

-unit

Documentation

-help

Print usage information and some example argument combinations.

-examples

Complete usage examples, involving additional Entrez Direct tools.

-unix

Illustrate common Unix command arguments.

-version

Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lower case.

SEE ALSO

align-columns (1), archive-nihocc (1), archive-nlmnlp (1), archive-nmcds (1), archive-pids (1), archive-pmc (1), archive-pubmed (1), archive-taxonomy (1), asn2ref (1), between-two-genes (1), bsmp2info (1), csv2xml (1), custom-index (1), disambiguate-nucleotides (1), download-flatfile (1), download-ncbi-data (1), ds2pme (1), efetch (1), esample (1), filter-columns (1), find-in-gene (1), fuse-ranges (1), fuse-segments (1), gbf2facds (1), gbf2fsa (1), gbf2info (1), gbf2tbl (1), gene2range (1), gff2xml (1), gff-sort (1), gm2segs (1), hgvs2spdi (1), nquire (1), pm-collect (1), pm-refresh (1), pma2apa (1), pma2pme (1), pmc2bioc (1), pmc2info (1), print-columns (1), rchive (1), refseq-nm-cds (1), reorder-columns (1), snp2hgvs (1), snp2tbl (1), sort-table (1), sort-uniq-count (1), spdi2tbl (1), tbl2prod (1), transmute (1), uniq-table (1), xfetch (1), xfilter (1), xinfo (1), xlink (1), xml2fsa (1), xml2tbl (1), xsearch (1), xy-plot (1).