Unit 3 Text Processing and System Configuration tools
Week 7 - Text Processing Tools
Tools for Extracting Text
  File Contents: less and cat
  File Excerpts: head and tail
  Extract by Column or Field: cut
  Extract by Keyword: grep

Viewing File Contents
  cat: dump one or more files to STDOUT
    Multiple files are concatenated together
  less: view a file or STDIN one page at a time
    Useful commands while viewing:
      /text searches for text
      n/N jumps to the next/previous match
      v opens the file in a text editor
    less is the pager used by man

Some useful options to use with cat
  -A: Show all characters, including control characters and non-printing characters
  -s: Squeeze multiple adjacent blank lines into a single blank line
  -b: Number each non-blank line of output

Viewing File Excerpts
  head: Display the first 10 lines of a file
  tail: Display the last 10 lines of a file
  Use -n with head or tail to change the number of lines displayed
  Use -f with tail to follow subsequent additions to the file
    Very useful for monitoring log files

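
A minimal sketch of the cat, head and tail options above, assuming a POSIX-like shell with GNU or BSD tools. The file names and contents are made up for illustration, and tail -f is left out because it runs until interrupted.

```shell
# Build small sample files in a temporary directory (hypothetical data).
work=$(mktemp -d)
printf 'alpha\n\n\n\nbeta\ngamma\n' > "$work/sample.txt"
seq 1 20 > "$work/numbers.txt"

# -s squeezes the run of blank lines into one blank line.
squeezed=$(cat -s "$work/sample.txt")

# -b numbers only the non-blank lines; it can be combined with -s.
numbered=$(cat -sb "$work/sample.txt")

# head and tail show 10 lines by default; -n picks another count.
first3=$(head -n 3 "$work/numbers.txt")
last2=$(tail -n 2 "$work/numbers.txt")

printf '%s\n' "$numbered" "$first3" "$last2"
rm -r "$work"
```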
Extracting text by keyword - grep
  Print lines of files or STDIN where a pattern is matched
    $ grep john /etc/passwd
    $ date --help | grep year
  Use -i to search case-insensitively
  Use -n to print line numbers of matches
  Use -v to print lines not containing the pattern
  Use -Ax to include x lines after each match
  Use -Bx to include x lines before each match
  Use -r to recursively search a directory
  Use --color=auto to highlight the match in color

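
A short sketch of the grep options above, run against a made-up passwd-style file instead of the real /etc/passwd so the results do not depend on the machine. The user names and paths are hypothetical.

```shell
# Sample data in passwd format (hypothetical users).
work=$(mktemp -d)
cat > "$work/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:1000:1000:John Smith:/home/john:/bin/bash
mary:x:1001:1001:Mary Jones:/home/mary:/bin/zsh
JOHNNY:x:1002:1002:Test Account:/home/test:/sbin/nologin
EOF

# -c counts matching lines; the plain search is case-sensitive.
exact=$(grep -c john "$work/users.txt")

# -i ignores case, so the JOHNNY line matches as well.
nocase=$(grep -ci john "$work/users.txt")

# -n prefixes each match with its line number.
numbered=$(grep -n mary "$work/users.txt")

# -v inverts the match: count the lines that do not mention bash.
inverted=$(grep -vc bash "$work/users.txt")

# -A1 prints each match plus one line of trailing context.
context=$(grep -A1 root "$work/users.txt")

printf '%s\n' "$exact" "$nocase" "$numbered" "$inverted" "$context"
rm -r "$work"
```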
Extracting text by column or field - cut
  Display specific columns of file or STDIN data
    $ cut -d: -f1 /etc/passwd
    $ grep root /etc/passwd | cut -d: -f7
  Use -d to specify the column delimiter (the default is TAB)
  Use -f to specify the column to print
  Use -c to cut by characters
    $ cut -c2-5 /usr/share/dict/words

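
The same cut commands can be tried on a small made-up file, which avoids depending on the contents of /etc/passwd or on a dictionary file being installed. The sample users below are hypothetical.

```shell
# Colon-delimited sample data (hypothetical users).
work=$(mktemp -d)
cat > "$work/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:1000:1000:John Smith:/home/john:/bin/bash
mary:x:1001:1001:Mary Jones:/home/mary:/bin/zsh
EOF

# -d: sets the delimiter, -f1 prints the first field (the login name).
names=$(cut -d: -f1 "$work/users.txt")

# Combine with grep to pull field 7 (the shell) from one line.
root_shell=$(grep root "$work/users.txt" | cut -d: -f7)

# -c2-5 prints characters 2 through 5 of each line.
chars=$(printf 'abcdefgh\n' | cut -c2-5)

printf '%s\n' "$names" "$root_shell" "$chars"
rm -r "$work"
```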
Tools for Analyzing Text
  Text Stats: wc
  Sorting Text: sort
  Comparing files: diff and patch
  Spell check: aspell

Gathering Text Statistics - wc
  Counts words, lines, bytes and characters
  Can act upon a file or STDIN
    $ wc a.txt
  Use -l for only line count
  Use -w for only word count
  Use -c for only byte count
  Use -m for character count

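
A small sketch of the wc options above on a made-up two-line file. Reading from STDIN keeps the file name out of the output, and the tr call is an assumption-driven safeguard because some wc implementations pad the count with spaces. The sample is plain ASCII, so -m would report the same number as -c here.

```shell
# Two lines, five words, 24 bytes (hypothetical sample text).
work=$(mktemp -d)
printf 'one two three\nfour five\n' > "$work/a.txt"

# Each option restricts the output to a single count.
lines=$(wc -l < "$work/a.txt" | tr -d ' ')
words=$(wc -w < "$work/a.txt" | tr -d ' ')
bytes=$(wc -c < "$work/a.txt" | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
rm -r "$work"
```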
Sorting Text
  Sorts text to STDOUT - original file unchanged
    $ sort [options] file(s)
  Common Options
    -r performs a reverse (descending) sort
    -n performs a numeric sort
    -f ignores (folds) case of characters in strings
    -u (unique) removes duplicate lines in output
    -t c uses c as a field separator
    -k x sorts by c-delimited field x
      Can be used multiple times

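
A brief sketch of how -t, -k, -n, -r and -u combine, using a made-up name:size file. The names and numbers are hypothetical.

```shell
# Colon-delimited sample data: name and a numeric size.
work=$(mktemp -d)
printf 'carol:30\nalice:200\nbob:4\n' > "$work/sizes.txt"

# Sort numerically on field 2, with ':' as the field separator.
by_size=$(sort -t: -k2 -n "$work/sizes.txt")

# Add -r to reverse the order (largest first).
largest_first=$(sort -t: -k2 -n -r "$work/sizes.txt")

# -u drops duplicate lines from the sorted output.
unique=$(printf 'b\na\nb\n' | sort -u)

printf '%s\n' "$by_size" "$largest_first" "$unique"
rm -r "$work"
```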
Eliminating duplicate lines
  sort -u: removes duplicate lines from input
  uniq: removes duplicate adjacent lines from input
    Use -c to count the number of occurrences
    Use with sort for best effect:
      $ sort userlist.txt | uniq -c

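
A short sketch of why uniq is usually paired with sort: uniq only collapses duplicates that sit next to each other. The user list below is made up.

```shell
# Duplicates exist, but none of them are adjacent (hypothetical data).
work=$(mktemp -d)
printf 'bob\nalice\nbob\nalice\nbob\n' > "$work/userlist.txt"

# Without sorting, uniq finds no adjacent duplicates and keeps every line.
adjacent_only=$(uniq "$work/userlist.txt")

# Sorting first groups equal lines together, so -c can count them.
counts=$(sort "$work/userlist.txt" | uniq -c)

printf '%s\n' "$adjacent_only" "$counts"
rm -r "$work"
```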
Comparing files - diff
  Compares two files for differences
    $ diff foo.conf-broken foo.conf-works
    5c5
    < use_widgets = no
    ---
    > use_widgets = yes
  Denotes a difference (change) on line 5
  Use gvimdiff for graphical diff
    Provided by vim-x11 package

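
The 5c5 output above can be reproduced with two made-up configuration files that differ only on line 5. The settings other than use_widgets are hypothetical. diff exits with status 1 when the files differ, which the sketch captures instead of treating it as a failure.

```shell
# Two five-line files that differ only on the last line.
work=$(mktemp -d)
printf 'name = foo\nport = 80\nuser = web\nlog = yes\nuse_widgets = no\n' \
  > "$work/foo.conf-broken"
printf 'name = foo\nport = 80\nuser = web\nlog = yes\nuse_widgets = yes\n' \
  > "$work/foo.conf-works"

# Capture the output; status stays 0 only if the files are identical.
status=0
changes=$(diff "$work/foo.conf-broken" "$work/foo.conf-works") || status=$?

# The first line names the change: line 5 changed (c) into line 5.
printf '%s\n' "$changes"
printf 'diff exit status: %s\n' "$status"
rm -r "$work"
```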
Duplicating file changes - patch
  diff output stored in a file is called a patchfile
    Use -u for unified diff, best in patchfiles
  patch duplicates changes in other files (use with care)
    Use -b to automatically back up changed files
    $ diff -u foo.conf-broken foo.conf-works > foo.patch
    $ patch -b foo.conf-broken foo.patch

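
A sketch of the full round trip: create a unified diff, apply it with patch -b, and check the result. It assumes the patch command is installed and that -b uses the default .orig backup suffix, as GNU patch does. The file contents are made up.

```shell
# Two small files that differ on one line (hypothetical settings).
work=$(mktemp -d)
printf 'port = 80\nuse_widgets = no\n'  > "$work/foo.conf-broken"
printf 'port = 80\nuse_widgets = yes\n' > "$work/foo.conf-works"

# diff exits with status 1 when the files differ, so ignore that status.
diff -u "$work/foo.conf-broken" "$work/foo.conf-works" > "$work/foo.patch" || true

# Apply the patch to the broken file; -b keeps the original as a backup.
patch -b "$work/foo.conf-broken" "$work/foo.patch"

# After patching, the broken file should match the working one.
patched_ok=no
if cmp -s "$work/foo.conf-broken" "$work/foo.conf-works"; then
  patched_ok=yes
fi

# The backup (assumed to be named with the .orig suffix) keeps the old value.
backup_setting=$(grep use_widgets "$work/foo.conf-broken.orig")

printf 'patched=%s backup has: %s\n' "$patched_ok" "$backup_setting"
rm -r "$work"
```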
Spell checking with aspell
  Interactively spell-check files:
    $ aspell check letter.txt
  Non-interactively list misspelled words in STDIN
    $ aspell list < letter.txt
    $ aspell list < letter.txt | wc -l