Man page - xtract(1)
apt-get install ncbi-entrez-direct
Manual
XTRACT
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
Processing Flags
Data Source
Exploration Argument Hierarchy
Path Navigation
Exploration Constructs
Conditional Execution
String Constraints
Object Constraints
Numeric Constraints
Format Customization
XML Generation
Tag and Attribute Construction
FASTA Parsable Fields
Element Selection
-element Constructs
Special -element Operations
Numeric (Integer) Processing
Leading Zero Padding
Character Processing
String Processing
Text Processing
Citation Functions
Value Transformation
Regular Expression
Sequence Processing
Nucleotide Processing
Protein Processing
Sequence Coordinates
Command Generator
Frequency Table
Entrez Indexing
Output Organization
Record Selection
Record Rearrangement
Reformatting
Validation
Summary
Full Exploration Command Precedence
Documentation
NOTES
SEE ALSO
NAME
xtract - NCBI Entrez Direct XML conversion and transformation tool
SYNOPSIS
xtract [ -help ] [ -strict ] [ -mixed ] [ -self ] [ -accent ] [ -ascii ] [ -compress ] [ -stops ] [ -input filename ] [ -transform filename ] [ -aliases filename ] [ -pattern expr ] [ -group expr ] [ -block expr ] [ -subset expr ] [ -path path ] [ -if expr [ constraint ]] [ -unless expr [ constraint ]] [ -and condition ] [ -or condition ] [ -else ] [ -position pos ] [ -equals str ] [ -contains str ] [ -mimics str ] [ -excludes str ] [ -includes str ] [ -is-within str ] [ -starts-with str ] [ -ends-with str ] [ -is-not str ] [ -is-before str ] [ -is-after str ] [ -consists-of str ] [ -matches str ] [ -resembles str ] [ -is-equal-to expr ] [ -differs-from expr ] [ -gt N ] [ -ge N ] [ -lt N ] [ -le N ] [ -eq N ] [ -ne N ] [ -ret str ] [ -tab str ] [ -sep str ] [ -pfx str ] [ -sfx str ] [ -rst ] [ -clr ] [ -pfc str ] [ -deq str ] [ -def str ] [ -lbl str ] [ -set tag ] [ -rec tag ] [ -wrp tag ] [ -enc tag ] [ -plg str ] [ -elg str ] [ -pkg tag ] [ -fwd str ] [ -awd str ] [ -tag tag ] [ -att key str ] [ -atr key element ] [ -cls ] [ -slf ] [ -end tag ] [ -bkt ] [ -element element ] [ -first element ] [ -last element ] [ -even element ] [ -odd element ] [ -backward element ] [ -NAME ] [ --STATS ] [ -num element ] [ -len element ] [ -sum element ] [ -acc element ] [ -min element ] [ -max element ] [ -inc element ] [ -dec element ] [ -sub element ] [ -avg element ] [ -dev element ] [ -med element ] [ -mul element ] [ -div element ] [ -mod element ] [ -geo element ] [ -hrm element ] [ -rms element ] [ -sqt element ] [ -lge element ] [ -lg2 element ] [ -log element ] [ -bin element ] [ -oct element ] [ -hex element ] [ -bit element ] [ -pad element ] [ -encode element ] [ -decode element ] [ -upper element ] [ -lower element ] [ -chain element ] [ -title element ] [ -mirror element ] [ -alpha element ] [ -alnum element ] [ -basic element ] [ -plain element ] [ -simple element ] [ -author element ] [ -journal element ] [ -prose element ] [ -terms element ] [ -words element ] [ -pairs element ] [ -split element -with str ] [ -order element ] [ -reverse element ] [ -letters element ] [ -clauses element ] [ -pentamers element ] [ -year element ] [ -month element ] [ -date element ] [ -page element ] [ -auth element ] [ -initials element ] [ -trim element ] [ -wct element ] [ -doi element ] [ -accession element ] [ -numeric element ] [ -translate element ] [ -classify element ] [ -replace -reg target -exp replacement ] [ -fasta ] [ -revcomp ] [ -nucleic ] [ -ncbi2na ] [ -ncbi4na ] [ -cds2prot [ -gcode N ] [ -frame N ]] [ -molwt ] [ -molwt-m ] [ -molwt-f ] [ -pept ] [ -0-based element ] [ -1-based element ] [ -ucsc-based element ] [ -insd arg ...] [ -insdx ] [ -histogram ] [ -indexer element ] [ -head str ] [ -tail str ] [ -hd str ] [ -tl str ] [ -select condition ] [ -in filename ] [ -sort [ -fwd ] element ] [ -sort-rev element ] [ -format fmt [ -unicode style ]] [ -verify ] [ -test ] [ -outline ] [ -synopsis ] [ -contour [ delimiter ]] [ -examples ] [ -unix ] [ -version ]
DESCRIPTION
xtract converts an XML document into a table of data values according to user-specified rules.
OPTIONS
Processing Flags
-strict      Remove HTML and MathML tags.
-mixed       Allow mixed content XML.
-self        Allow detection of empty self-closing tags.
-accent      Delete Unicode accents and diacritical marks.
-ascii       Convert Unicode to numeric HTML character entities.
-compress    Compress runs of spaces.
-stops       Retain stop words in selected phrases.
Data Source
-input filename        Read XML from file instead of standard input.
-transform filename    File of substitutions for -translate.
-aliases filename      Mappings file for -classify operation.
Exploration Argument Hierarchy
-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
Path Navigation
-path path    Explore by list of adjacent object names.
Exploration Constructs
Object           DateRevised
Parent/Child     Book/AuthorList
Path             MedlineCitation/Article/Journal/JournalIssue/PubDate
Heterogeneous    "PubmedArticleSet/*"
Exhaustive       "History/**"
Nested           "*/Taxon"
Conditional Execution
-if expr [ constraint ]        Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [ constraint ]    Skip if element matches.
-and condition                 Preceding and following tests must both pass.
-or condition                  Any passing test suffices.
-else                          Execute if conditional test failed.
-position pos                  One of first, last, outer, inner, even, odd, or all.
String Constraints
-equals str         String must match exactly.
-contains str       Substring must be present.
-mimics str         Containment test after converting punctuation to space.
-excludes str       Substring must be absent.
-includes str       Substring must match at word boundaries.
-is-within str      String must be present.
-starts-with str    Substring must be at beginning.
-ends-with str      Substring must be at end.
-is-not str         String must not match.
-is-before str      First string < second string.
-is-after str       First string > second string.
-consists-of str    String must only contain specified characters.
-matches str        Matches without commas or semicolons.
-resembles str      Requires all words, but in any order.
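The following Python sketch approximates a few of the string constraints above, for readers who want to reason about a filter before running it. It is not xtract's implementation. The function names are hypothetical, only the constraints whose descriptions are unambiguous are covered, and case folding is applied because the NOTES section states that string constraints compare case-insensitively.

```python
# Hypothetical helpers approximating a few xtract string constraints.
# Case is folded first because the NOTES section says string
# constraints use case-insensitive comparisons.

def equals(value: str, s: str) -> bool:        # -equals
    return value.casefold() == s.casefold()

def contains(value: str, s: str) -> bool:      # -contains
    return s.casefold() in value.casefold()

def excludes(value: str, s: str) -> bool:      # -excludes
    return not contains(value, s)

def starts_with(value: str, s: str) -> bool:   # -starts-with
    return value.casefold().startswith(s.casefold())

def ends_with(value: str, s: str) -> bool:     # -ends-with
    return value.casefold().endswith(s.casefold())

def consists_of(value: str, s: str) -> bool:   # -consists-of
    # Every character of the value must come from the allowed set.
    allowed = set(s.casefold())
    return all(ch in allowed for ch in value.casefold())

print(contains("Phospholipase A2", "lipase"))
print(consists_of("GATTACA", "acgt"))
```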
Object Constraints
-is-equal-to expr     Object values must match.
-differs-from expr    Object values must differ.
Numeric Constraints
-gt N    Greater than.
-ge N    Greater than or equal to.
-lt N    Less than.
-le N    Less than or equal to.
-eq N    Equal to.
-ne N    Not equal to.
Format Customization
-ret str    Override line break between patterns.
-tab str    Replace tab character between fields.
-sep str    Separator between group members.
-pfx str    Prefix to print before group.
-sfx str    Suffix to print after group.
-rst        Reset -sep through -elg.
-clr        Clear queued tab separator.
-pfc str    Preface combines -clr and -pfx.
-deq str    Delete and replace queued tab separator.
-def str    Default placeholder for missing fields.
-lbl str    Insert arbitrary text.
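As a rough mental model of how -ret, -tab, -sep, and -def fit together, the Python sketch below builds a table from a small XML document. The sample XML, its tag names, and the tabulate function are made up for illustration, and the defaults (tab between fields and group members, newline between patterns) are assumptions drawn from the descriptions above. xtract itself handles many cases this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented sample data; not taken from any NCBI record type.
XML = """<Set>
  <Rec><Id>1</Id><Name>alpha</Name><Name>beta</Name></Rec>
  <Rec><Id>2</Id></Rec>
</Set>"""

def tabulate(xml_text, pattern, elements,
             tab="\t", sep="\t", ret="\n", default=None):
    """Approximate: xtract -pattern PATTERN -element E1 E2 ..."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(pattern):            # one output row per pattern
        fields = []
        for name in elements:
            values = [e.text or "" for e in rec.iter(name)]
            if values:
                fields.append(sep.join(values))   # -sep joins group members
            elif default is not None:
                fields.append(default)            # -def fills missing fields
        out.append(tab.join(fields) + ret)        # -tab between fields, -ret after row
    return "".join(out)

print(tabulate(XML, "Rec", ["Id", "Name"], sep=",", default="-"), end="")
```

With the arguments shown, the first row joins both Name values with a comma, and the second row uses the placeholder because the record has no Name.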
XML Generation
-set tag    XML tag for entire set.
-rec tag    XML tag for each record.
-wrp tag    Wrap elements in XML object.
-enc tag    Encase instance in XML object.
-plg str    Prologue to print before instance.
-elg str    Epilogue to print after instance.
-pkg tag    Package subset in XML object.
-fwd str    Foreword to print before subset.
-awd str    Afterword to print after subset.
Tag and Attribute Construction
-tag tag            Start with <tag.
-att key str        Attribute key and literal string.
-atr key element    Attribute key and element name.
-cls                Close with >.
-slf                Self-close with />.
-end tag            End contents with </tag>.
FASTA Parsable Fields
-bkt    Wrap elements in bracketed fields.
Element Selection
-element element     Print all items that match tag name.
-first element       Only print value of first item.
-last element        Only print value of last item.
-even element        Only print value of even items.
-odd element         Only print value of odd items.
-backward element    Print values in reverse order.
-NAME                Record value in named variable.
--STATS              Accumulate values into variable.
-element Constructs
Tag              Caption
Group            Initials,LastName
Parent/Child     MedlineCitation/PMID
Recursive        "**/Gene-commentary_accession"
Unrestricted     PubDate/*
Attribute        DescriptorName@MajorTopicYN
Range            MedlineDate[1:4]
Substring        "Title[phospholipase | rattlesnake]"
Alternative      "[can contain ^ vertical bar]"
Object Count     "#Author"
Item Length      "%Title"
Element Depth    "^PMID"
Variable         "&NAME"
Special -element Operations
Parent Index    "+"
Object Name     "?"
Object Value    "~"
XML Subtree     "*"
Children        "$"
Attributes      "@"
ASN.1 Record    "."
JSON Record     "%"
Numeric (Integer) Processing
-num element    Count.
-len element    Length.
-sum element    Sum.
-acc element    Accumulator.
-min element    Minimum.
-max element    Maximum.
-inc element    Increment.
-dec element    Decrement.
-sub element    Difference.
-avg element    Arithmetic mean.
-dev element    Deviation.
-med element    Median.
-mul element    Product.
-div element    Quotient.
-mod element    Remainder.
-geo element    Geometric mean.
-hrm element    Harmonic mean.
-rms element    Root mean square.
-sqt element    Square root.
-lge element    Natural logarithm.
-lg2 element    Logarithm base two.
-log element    Logarithm base ten.
-bin element    Binary.
-oct element    Octal.
-hex element    Hexadecimal.
-bit element    Number of bits set.
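For reference, the Python sketch below computes several of the summaries named above for a small list of integers, along with the radix forms of a single integer. It shows the underlying arithmetic only. How xtract rounds or truncates non-integer results, and whether it prints hexadecimal digits in upper or lower case, is not stated in this page, so the sketch leaves floats unrounded and uses Python's defaults. The dictionary keys and the radix_forms name are illustrative.

```python
import math
import statistics

values = [2, 4, 4, 8]

# Keys name the xtract operation each expression is meant to illustrate.
summary = {
    "-num": len(values),
    "-sum": sum(values),
    "-min": min(values),
    "-max": max(values),
    "-avg": statistics.mean(values),
    "-med": statistics.median(values),
    "-geo": statistics.geometric_mean(values),
    "-hrm": statistics.harmonic_mean(values),
    "-rms": math.sqrt(sum(v * v for v in values) / len(values)),
}

def radix_forms(n: int) -> dict:
    """Binary, octal, and hexadecimal text, plus the count of set bits."""
    return {
        "-bin": format(n, "b"),
        "-oct": format(n, "o"),
        "-hex": format(n, "x"),
        "-bit": bin(n).count("1"),
    }

for name, result in summary.items():
    print(name, result)
print(radix_forms(10))
```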
Leading Zero Padding
-pad element    Zero-pad to eight digits.
Character Processing
-encode element    XML-encode <, >, &, ", and ' characters.
-decode element    Base64-decode object embedded in XML.
-upper element     Convert text to uppercase.
-lower element     Convert text to lowercase.
-chain element     Change spaces to underscores.
-title element     Capitalize initial letters of words.
-mirror element    Reverse order of letters.
-alpha element     Non-alphabetic characters to space.
-alnum element     Non-alphanumeric characters to space.
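The Python sketch below mirrors the descriptions of -encode, -chain, -mirror, -alpha, and -alnum. The function names are illustrative, and treating "alphabetic" as ASCII letters only is an assumption; xtract's handling of Unicode letters is not described here.

```python
import re
from xml.sax.saxutils import escape

def encode(s: str) -> str:    # -encode
    # escape() handles &, <, and >; the mapping adds both quote characters.
    return escape(s, {'"': "&quot;", "'": "&apos;"})

def chain(s: str) -> str:     # -chain
    return s.replace(" ", "_")

def mirror(s: str) -> str:    # -mirror
    return s[::-1]

def alpha(s: str) -> str:     # -alpha (assumes ASCII letters only)
    return re.sub(r"[^A-Za-z]", " ", s)

def alnum(s: str) -> str:     # -alnum (assumes ASCII letters and digits only)
    return re.sub(r"[^A-Za-z0-9]", " ", s)

print(encode('a<b & "c"'))
print(chain("heat shock protein"))
print(alnum("IL-2 receptor"))
```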
String Processing
-basic element     Convert superscripts and subscripts.
-plain element     Remove embedded mixed-content markup tags.
-simple element    Normalize accented letters; spell Greek letters.
-author element    Multi-step author cleanup.
-jour element      Journal capitalization and punctuation.
-prose element     Text conversion to ASCII.
Text Processing
-terms element        Partition text at spaces.
-words element        Split at punctuation marks.
-pairs element        Adjacent informative words.
-split element        Split using -with for delimiter.
-order element        Rearrange words in sorted order.
-reverse element      Reverse words in string.
-letters element      Separate individual letters.
-clauses element      Break at phrase separators.
-pentamers element    Sliding window of pentamers.
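Several of these operations are simple enough to sketch in a few lines of Python. The functions below follow the one-line descriptions above and are not xtract's code. The names are illustrative, the sort used for -order is plain lexical order, and dropping spaces in -letters is an assumption.

```python
def terms(s: str) -> list:            # -terms: partition at spaces
    return s.split()

def order_words(s: str) -> str:       # -order: words in sorted order
    return " ".join(sorted(s.split()))

def reverse_words(s: str) -> str:     # -reverse: reverse word order
    return " ".join(reversed(s.split()))

def letters(s: str) -> list:          # -letters: individual letters
    return [ch for ch in s if not ch.isspace()]

def pentamers(s: str) -> list:        # -pentamers: sliding window of length 5
    return [s[i:i + 5] for i in range(len(s) - 4)]

print(terms("zinc finger protein"))
print(order_words("zinc finger protein"))
print(pentamers("GATTACA"))
```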
Citation Functions
-year element         Extract first 4-digit year from string.
-month element        Match first month name and return a corresponding integer.
-date element         YYYY/MM/DD from -unit "PubDate" -date "*"
-page element         Get digits (and letters) of first page number.
-auth element         Change GenBank authors to Medline form.
-initials element     Parse initials from forename or given name.
-trim element         Remove extra spaces and leading zeros.
-wct element          Count number of -words in a string.
-doi element          Add https://doi.org/ prefix, URL encode.
-accession element    Allow indexing of full accession.version.
-numeric element      Only accept items that are entirely digits.
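The -year and -month descriptions translate directly into a short Python sketch. The function names are illustrative, and two details are assumptions rather than documented behavior: a year is taken to be a run of exactly four digits, and month names are matched in English either in full or as three-letter abbreviations.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def year(s: str):
    """First run of exactly four digits, or None (sketch of -year)."""
    m = re.search(r"(?<!\d)\d{4}(?!\d)", s)
    return m.group() if m else None

def month(s: str):
    """Number (1-12) of the first month name found, or None (sketch of -month)."""
    for word in re.findall(r"[A-Za-z]+", s):
        w = word.lower()
        for number, name in enumerate(MONTHS, start=1):
            if w == name or w == name[:3]:
                return number
    return None

print(year("2019 Mar-Apr"), month("2019 Mar-Apr"))
```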
Value Transformation
-translate element    Substitute values with -transform table.
-classify element     Substring word or phrase matches to -aliases table.
Regular Expression
-replace        Substitute text using regular expressions.
-reg target     Target expression.
-exp pattern    Replacement pattern.
Sequence Processing
-fasta    Split sequence into blocks of 70 uppercase letters.
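The effect of -fasta can be pictured with the Python sketch below, which uppercases a sequence and wraps it at 70 letters per line. The fasta_lines name is illustrative, and stripping embedded whitespace before wrapping is an assumption.

```python
def fasta_lines(seq: str, width: int = 70) -> list:
    """Uppercase the sequence and cut it into blocks of `width` letters."""
    seq = "".join(seq.split()).upper()   # assumption: drop embedded whitespace
    return [seq[i:i + width] for i in range(0, len(seq), width)]

for line in fasta_lines("acgt" * 40):    # 160 letters: two full lines and a remainder
    print(line)
```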
Nucleotide Processing
-revcomp    Reverse complement nucleotide sequence.
-nucleic    Subrange determines forward or revcomp.
-ncbi2na    Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na    Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-cds2prot [ -gcode N ] [ -frame N ]
            Translate coding region using -gcode and (1-based) -frame (both 1 by default).
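The Python sketch below illustrates reverse complementation and translation with the standard genetic code (table 1) and a 1-based reading frame. It is a simplified model, not xtract's implementation: only A, C, G, and T are complemented, only genetic code 1 is included, stop codons are kept as "*", and codons containing other letters become "X". All of these choices, and the function names, are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

# Standard genetic code (table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def cds2prot(seq: str, frame: int = 1) -> str:
    """Translate complete codons starting at the given 1-based frame."""
    seq = seq.upper()[frame - 1:]
    return "".join(CODON.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

print(revcomp("ATGC"))
print(cds2prot("ATGGCCTAA"))
print(cds2prot("ATGGCCTAA", frame=2))
```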
Protein Processing
-molwt      Calculate molecular weight of peptide.
-molwt-m    Molecular weight retaining initial methionine.
-molwt-f    Keep initial M residue as formyl-methionine.
-pept       Split amino acid runs at *, -, x, or X.
Sequence Coordinates
-0-based element       Zero-based.
-1-based element       One-based.
-ucsc-based element    Half-open.
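The three conventions differ only in where counting starts and whether the stop position is included. The Python sketch below expresses the same interval in each convention, starting from one-based inclusive coordinates. It illustrates the conventions themselves; this page does not say which input convention each xtract option expects, and the function names are hypothetical.

```python
def one_based_to_zero_based(start: int, stop: int) -> tuple:
    """Inclusive one-based interval to inclusive zero-based interval."""
    return start - 1, stop - 1

def one_based_to_half_open(start: int, stop: int) -> tuple:
    """Inclusive one-based interval to zero-based, half-open interval."""
    return start - 1, stop

start, stop = 10, 20                       # bases 10 through 20, 11 bases long
print(one_based_to_zero_based(start, stop))
print(one_based_to_half_open(start, stop))
```

In the half-open form, the length of the interval is simply stop minus start.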
Command Generator
-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
Descriptor(s)    INSDSeq_sequence / INSDSeq_definition / INSDSeq_division /... [...]
Completeness     complete / partial
Feature(s)       CDS / mRNA /...[ , ...]
Qualifier(s)     INSDFeature_key / "#INSDInterval" / gene / product / feat_location / sub_sequence /... [...]

-insdx    Process -insd output table into XML.
Frequency Table
-histogram    Collects data for sort-uniq-count(1) on entire set of records.
Entrez Indexing
-indexer element    Positional index using -wrp for field name.
Output Organization
-head str    Print before everything else.
-tail str    Print after everything else.
-hd str      Print before each record.
-tl str      Print after each record.
Record Selection
-select condition    Select record subset by conditions.
-in filename         File of identifiers to use for selection.
Record Rearrangement
-sort [ -fwd ] element    Element to use as sort key.
-sort-rev element         Sort records in reverse order.
Reformatting
-format fmt
    copy       Fast block copy (still applies processing flags).
    compact    Compress runs of spaces.
    flush      Suppress line indentation.
    indent     Indent according to nesting depth.
    expand     Place each attribute on a separate line.
Validation
-verify    Report XML data integrity problems.
-test      Check field for visible combining accents and invisible Unicode.
Summary
-outline                  Display outline of XML structure.
-synopsis                 Display individual XML paths.
-contour [ delimiter ]    Display XML paths to leaf nodes (delimited by / by default).
Full Exploration Command Precedence
-pattern
-path
-division
-group
-branch
-block
-section
-subset
-unit
Documentation
-help        Print usage information and some example argument combinations.
-examples    Complete usage examples, involving additional Entrez Direct tools.
-unix        Illustrate common Unix command arguments.
-version     Print version number.
NOTES
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
SEE ALSO
align-columns(1), archive-nihocc(1), archive-nlmnlp(1), archive-nmcds(1), archive-pids(1), archive-pmc(1), archive-pubmed(1), archive-taxonomy(1), asn2ref(1), between-two-genes(1), bsmp2info(1), csv2xml(1), custom-index(1), disambiguate-nucleotides(1), download-flatfile(1), download-ncbi-data(1), ds2pme(1), efetch(1), esample(1), filter-columns(1), find-in-gene(1), fuse-ranges(1), fuse-segments(1), gbf2facds(1), gbf2fsa(1), gbf2info(1), gbf2tbl(1), gene2range(1), gff2xml(1), gff-sort(1), gm2segs(1), hgvs2spdi(1), nquire(1), pm-collect(1), pm-refresh(1), pma2apa(1), pma2pme(1), pmc2bioc(1), pmc2info(1), print-columns(1), rchive(1), refseq-nm-cds(1), reorder-columns(1), snp2hgvs(1), snp2tbl(1), sort-table(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xfetch(1), xfilter(1), xinfo(1), xlink(1), xml2fsa(1), xml2tbl(1), xsearch(1), xy-plot(1).