Update 26-04-2023: This post is still valid, but grawk is discontinued because I now use Python for another application I’ve written, called pygrep.
What is the reason for this?
After forgetting sed commands countless times, I decided to write something a little easier to remember, using the programming language awk/gawk.
Alternatives
Well, sed and grep are the two contenders, but their flexibility comes with complexity that is hard to remember, and that is what I’m trying to conquer here with this awk programme. I will be writing this in Python as well, but for now this seems to work.
Grep has this command
grep -o 'search[^)]*)' file
This searches from a keyword up to the first closing bracket and displays only the matched text. If the pattern occurs more than once on the same line, every match is displayed (not necessarily a bad thing, though). Also, I’m not 100% sure how grep alone could extend this out to the second bracket or a second keyword without the help of a grep/sed combo script.
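To make the grep behaviour above concrete, here is a small Python sketch of what I understand `grep -o 'search[^)]*)'` to be doing. It is an analogue, not grep itself: the function name `grep_o` and the sample lines are made up for illustration, and Python needs the closing bracket escaped as `\)` where grep's basic regex does not.

```python
import re

# Python analogue of: grep -o 'search[^)]*)' file
# Pattern: the literal word "search", then any run of characters that are
# not ")", then one closing bracket.
PATTERN = re.compile(r"search[^)]*\)")


def grep_o(lines):
    """Return every non-overlapping match, one entry per match, like grep -o."""
    hits = []
    for line in lines:
        for match in PATTERN.finditer(line):
            hits.append(match.group(0))
    return hits


# Hypothetical sample input, not taken from a real file.
sample = [
    "x = search(a, b) + search(c)",      # two matches on the same line
    "no keyword here (just brackets)",   # no "search", so nothing printed
    "search without a closing bracket",  # keyword but no ")", so nothing printed
]

for hit in grep_o(sample):
    print(hit)
```

The first sample line produces two separate results, which is the "all instances are displayed" behaviour described above.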
I know sed can do something like this, probably with loops and the hold space, and I’ve probably read about sed doing this a few dozen times. But sed syntax becomes unreadable to me after not using it for a while (especially complex sed), and I forget it, so I’m not going to go there in this blog post. Don’t get me wrong, I love sed, and if I used it often enough, I’d be all over it (obviously).
grawk
So I’ve tried to give this some flexibility. On the command line it reads a little like this (grawk being the awk programme I’ve written, stored in my $PATH; otherwise call it like a regular script, ./grawk…):
grawk buzzword 1 ")" 2 file
This will search for the 1st buzzword found on a line and print up to the 2nd closing bracket after it, reading from file (or piped input).
Adding a $ to the command line…
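The idea of "n-th start keyword up to the m-th end string" can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the grawk code: `nth_index` and `extract` are names I made up, the sample log line is invented, and both strings are treated as plain text rather than regular expressions.

```python
def nth_index(text, needle, n, start=0):
    """Index of the n-th occurrence of needle at or after start, or -1."""
    pos = start - 1
    for _ in range(n):
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


def extract(line, start_kw, start_n, end_kw, end_n):
    """Text from the start_n-th start_kw up to and including the end_n-th end_kw."""
    begin = nth_index(line, start_kw, start_n)
    if begin == -1:
        return None  # start keyword not found often enough
    finish = nth_index(line, end_kw, end_n, begin)
    if finish == -1:
        return None  # end string not found often enough after the start
    return line[begin:finish + len(end_kw)]


# Hypothetical log line, roughly what: grawk CRON 1 ")" 2 file  would be given.
line = "Aug 17 host CRON[1]: (root) CMD (run-parts) trailing text"
print(extract(line, "CRON", 1, ")", 2))  # CRON[1]: (root) CMD (run-parts)
```

The end string is counted from where the start keyword begins, so brackets earlier in the line are ignored.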
grawk buzzword 1 bash $ file
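So it becomes a two-keyword search: the line must contain both buzzword and bash, and the output starts at the buzzword and runs to the end of the line. There’s also a weird hacky bonus (which I did not add, so it must break something?) of adding a period to the buzzword. As far as I can tell it works because the line filter treats the keyword as a regular expression, where the period matches any character, while the extraction step looks for the keyword literally with index(), finds nothing, and so falls back to the start of the line…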
grawk .buzzword 1 bash $ file
This prints from the beginning of the line for a two-keyword search. In this example the $ prints to the end of the line, but if the $ were a 2, the output would stop at the second instance of bash (or whatever character/word was there). For example…
grawk .cron 1 ")" 2 /var/log/syslog | head -n1
Aug 17 21:30:01 jp-vivo CRON[16281]: (root) CMD ([ -x /etc/init.d/anacron ] && if [ ! -d /run/systemd/system ]; then /usr/sbin/invoke-rc.d anacron start >/dev/null; fi)
The above prints from the start of the line (using the period) to the second bracket. I used head -n1 to limit the output to the first line, as there are a lot of cron jobs in syslog.
You can also start from the 2nd, 3rd, etc. occurrence of the buzzword…
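My best guess at why the leading period prints from the start of the line can be reproduced in Python. This is a sketch under assumptions: `awk_index` and `awk_substr` are my own stand-ins that mimic how I understand awk's index() and substr() to behave (index() returns 0 when nothing is found, and substr() treats a start below 1 as 1), and the sample line is shortened from the syslog output above.

```python
import re


def awk_index(text, needle):
    """Mimic awk index(): 1-based position of a literal substring, 0 if absent."""
    return text.find(needle) + 1


def awk_substr(text, start):
    """Mimic awk substr(s, start) with no length: a start below 1 acts like 1."""
    return text[max(start, 1) - 1:]


def from_keyword(line, keyword):
    """Filter with the keyword as a regex, then cut with the keyword as a literal."""
    if not re.search(keyword, line):
        return None  # the line is filtered out
    return awk_substr(line, awk_index(line, keyword))


line = "Aug 17 21:30:01 jp-vivo CRON[16281]: (root) CMD (true)"

# Plain keyword: regex and literal agree, so output starts at the keyword.
print(from_keyword(line, "CRON"))

# Leading period: the regex ".CRON" matches " CRON", but the literal text
# ".CRON" is not in the line, so the index is 0 and the whole line is kept.
print(from_keyword(line, ".CRON"))
```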
grawk buzzword 3 $ file
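As a rough Python picture of "start at the n-th buzzword and print to the end of the line" (the `$` case), something like the following would do. The function name `from_nth_to_eol` and the sample text are hypothetical, and the keyword is treated as plain text.

```python
def from_nth_to_eol(line, keyword, n):
    """Return the text from the n-th occurrence of keyword to the end of the line."""
    pos = -1
    for _ in range(n):
        pos = line.find(keyword, pos + 1)
        if pos == -1:
            return None  # fewer than n occurrences, so the line is skipped
    return line[pos:]


line = "buzzword one, buzzword two, buzzword three"
print(from_nth_to_eol(line, "buzzword", 3))  # buzzword three
print(from_nth_to_eol(line, "buzzword", 4))  # None
```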
Sometimes you might not want the last character, so I added an option (exc) to exclude it…
grawk jonny 1 : 6 /etc/passwd
output:
jonny:x:1000:1000:jonny,,,:/home/jonny:
grawk jonny 1 : 6 exc /etc/passwd
output:
jonny:x:1000:1000:jonny,,,:/home/jonny
I also added inc, the opposite of exc, which includes one extra character after the end string.
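The exc/inc options boil down to moving the end of the slice by one character. Here is a minimal Python sketch of that idea, assuming a passwd-style line; the `/bin/bash` shell field in the sample is my own assumption, and `extract_adjusted` with its `adjust` parameter is a hypothetical name, not something from grawk.

```python
def extract_adjusted(line, start_kw, end_kw, end_n, adjust=0):
    """Slice from the first start_kw to the end_n-th end_kw.

    adjust=-1 behaves like exc (drop the last character),
    adjust=+1 behaves like inc (keep one extra character).
    """
    begin = line.find(start_kw)
    if begin == -1:
        return None
    pos = begin - 1
    for _ in range(end_n):
        pos = line.find(end_kw, pos + 1)
        if pos == -1:
            return None
    return line[begin:pos + len(end_kw) + adjust]


# Hypothetical /etc/passwd entry; the shell field is assumed.
entry = "jonny:x:1000:1000:jonny,,,:/home/jonny:/bin/bash"

print(extract_adjusted(entry, "jonny", ":", 6))      # ends with the 6th ":"
print(extract_adjusted(entry, "jonny", ":", 6, -1))  # exc: trailing ":" dropped
print(extract_adjusted(entry, "jonny", ":", 6, +1))  # inc: one more character
```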
The Code
Can be found here..
Do as you wish with the comments; I’ve added them to be helpful, but it won’t take long to understand this programme, and it’s quite straightforward to use.
#!/usr/bin/gawk -f
#
# Usage:
# on the commandline
#
# Put grawk into your /usr/local/bin path, and call without ./
# grawk [start keyword search] [start position integer] [end of line character/word] [end position integer (or $ for end of line)] ["inc"/"exc" to include 1 or exclude 1 character] file
#
# ./grawk CRON 1 ")" 2 /var/log/syslog
# OUTPUT: This will search for the first instance of CRON as a starting string up to the second instance of ")" from syslog
#
# ./grawk CRON 1 $ /var/log/syslog
# OUTPUT: Search for first instance of CRON to end of the line from syslog
#
# ./grawk root 1 "/" 1 exc /etc/passwd
# OUTPUT:
# root:x:0:0:root:
# root:x:524288:524288::
#
# ./grawk root 1 "/" 2 exc /etc/passwd
# root:x:0:0:root:/root:
# root:x:524288:524288::/nonexistent:
#
# ./grawk dns 1 "," 1 exc /etc/passwd
# OUTPUT:
# dnsmasq:x:112:65534:dnsmasq
# dnsmasq:x:132:141:Libvirt Dnsmasq
#
# cat /etc/passwd | ./grawk root $
# OUTPUT:
# root:x:0:0:root:/root:/bin/bash
# root:/usr/sbin/nologin
# root:x:524288:524288::/nonexistent:/usr/bin/false
BEGIN {
    # Start string to search for
    start = ARGV[1]
    delete ARGV[1]
    # num1 counts the optional arguments consumed so far, so later ARGV indexes can be shifted
    num1 = 0
    # Optional occurrence number for start (a positive integer). The match is anchored so that
    # a keyword or file name that merely contains a digit is not mistaken for it.
    if (ARGV[2] ~ /^[1-9][0-9]*$/) {
        startappear = ARGV[2]
        num1 += 1
        delete ARGV[2]
    }
    # Last string pattern to extract up to, from the start string
    last = ARGV[2 + num1]
    # Length of the last string, used at the end to calculate the size of the output
    len = length(last)
    delete ARGV[2 + num1]
    # Optional occurrence number for the end character/word, or $ for end of line (also anchored)
    if (ARGV[3 + num1] ~ /^([$]|[1-9][0-9]*)$/) {
        lastappear = ARGV[3 + num1]
        delete ARGV[3 + num1]
        num1 += 1
    }
    # Including an exc argument at the end of the command line will exclude the last character, unless last == "$"
    # If last == "$" the line printed will begin at start and finish at the end of the line.
    # If the last character is a /, and $ is not included with exc, then a search of /home/user/ will turn into /home/user
    # inc is the opposite of exc
    if (ARGV[3 + num1] == "exc") {
        len -= 1
        delete ARGV[3 + num1]
    }
    if (ARGV[3 + num1] == "inc") {
        len += 1
        delete ARGV[3 + num1]
    }
}
{
    # This section looks for the start flag value, i.e. grawk root 2 $ /etc/passwd will use the second instance of root found in a line,
    # and with the $ flag, it'll print to the end of the line.
    # grawk root 2 "/" 1 will look for the second instance of root in a line, and print up to the first "/"
    # grawk root 2 "/" 1 exc will look for the second instance of root in a line, and print up to the first "/", excluding the last character "/"
    # There is another hacky thing you can do if you just want the whole line: place a period before the start search and use the $
    # Note: == 1 is not included because the first instance is the default, but the 1 flag is still required
    for (m = 2; m <= startappear; m++) {
        # Remove the earlier instances of start so the wanted one becomes the first
        $0 = gensub(start, "", 1)
    }
    # Keep only the lines that contain both the start and the last pattern
    if ($0 ~ start && $0 ~ last) {
        b[lines++] = $0
    }
    # Pick an infrequent character as a temporary delimiter, to help separate and reintroduce into the final output.
    # Note: the END block uses whichever delimiter was chosen for the last line read.
    if (! /¬/) {
        delim = "¬"
    } else if (! /¶/) {
        delim = "¶"
    } else if (! /¥/) {
        delim = "¥"
    }
}
# The END block uses the data above to index and format the desired (hopefully) string output.
END {
    # Walk the stored lines in input order (for-in does not guarantee any order)
    for (i = 0; i < lines; i++) {
        if (last == "$" || lastappear == "$") {
            # Print from the start string to the end of the line
            n = index(b[i], start)
            z = substr(b[i], n)
            if (z != "") {
                print "\033[33m" z "\033[0m"
            }
        } else {
            n = index(b[i], start)
            t = substr(b[i], n)
            # This section needs to occur once the start of the string has been established, i.e. indexed and substr.
            if (lastappear == 1) {
                c = index(t, last)
                z = substr(t, 1, c + len - 1)
                if (z != "") {
                    print "\033[33m" z "\033[0m"
                }
                continue
            }
            # Hide the earlier instances of last behind the delimiter so index() finds the wanted one
            g = gensub(last, delim, 1, t)
            for (m = 3; m <= lastappear; m++) {
                g = gensub(last, delim, 1, g)
            }
            c = index(g, last)
            z = substr(g, 1, c + len - 1)
            # Put the hidden instances of last back into the output
            gsub(delim, last, z)
            if (z != "") {
                print "\033[33m" z "\033[0m"
            }
        }
    }
}
Improvements
Well, I would like to make this a bit more dynamic, but the more dynamic I make it, the more likely I am to move it to Python. Some improvements I’d like to add are…
- Case insensitive
- Potentially some regex
- If an instance occurs more than once in a line, have an option to print all instances
- More colours, with colour options
- And I’m sure I’ll think of more…