Top 10 Unix Command Line Utilities 2013

Again, a selection of some useful unix commands.

Posted on December 31, 2013 · unix, bash.

For a long time I have used bash on a daily basis. It has become such a trustworthy companion that I virtually cannot do any work on a Windows box anymore. But even using bash every day, I regularly come across ways of solving particular tasks that I did not really know about.
To keep track of such solutions, I try to keep my cheat sheets up to date with what I found and learned during the year. In the tradition of 2011 and 2012, I again took a look at the git log and compiled 10 of the things that I learned or used a lot in the shell.
This year is different though, since I decided to go into a little more detail about some of the items where I felt it was instructive. I also included a small table of contents, mainly ‘cause I tend to come back to these tips myself a lot and wanted a quicker way to find the item in question.

1> awk

awk is probably the most powerful unix tool I encountered so far. It’s actually a whole programming language, not just a tool. I tend to stay clear of the more complicated uses but for some small output processing it’s just invaluable.
The basic structure of awk statements is always the same: first you list a pattern to match, second the action you want awk to perform:

awk [condition] [{action}]

In the action it is possible to refer to different parts of the current input record. If awk is fed some lines of text, it will process each line that satisfies the condition and dissect it into parts. That is often helpful if only some parts of the output are wanted.
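
As a minimal illustration (using an inline echo as input), the whitespace-separated fields of each record are available in the action as $1, $2, … with $NF referring to the last one:

echo "one two three" | awk '{print $2, $NF}'   # prints "two three"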

Dependent include files

One of the situations where this was just the right tool for me was trying to recursively find all dependent header files of a C file. The compiler (clang in my case) can greatly help here since it has to know about the headers it needs. And it comes with an option to list this information:

-H                      Show header includes and nesting depth

I found this option to be less picky than -M. Using it, the information about included header files is displayed along with lots of other stuff that is unwanted in this case. awk can be used to filter the results down to the relevant lines and further extract the file names from those lines.
In the example below

  • all error messages are filtered out (2>/dev/null, see redirects section)
  • clang is used to list the recursive include tree (cpp -H [include-paths] [C/C++-file])
  • awk first filters all lines with header information (/^#.*\.h\"$/)
  • and extracts the header name ({print $3})
  • tr is used to get rid of the "" quotation-marks
  • finally all duplicates are removed by using sort and uniq
kernel(master) > 2>/dev/null cpp -H -Iinclude crypto/fcrypt.c \
  | awk '/^#.*\.h\"$/ {print $3}' | tr -d '""' | sort | uniq
include/asm-generic/atomic-long.h
include/linux/atomic.h
include/linux/bitmap.h
include/linux/compiler.h
include/linux/completion.h
include/linux/cpumask.h
include/linux/crypto.h
...

2> ls

Plain old ls is probably my most used command. But ls can do some nice tricks, too. Using the -S flag, ls will sort the files by size. Combined with -h it produces a very readable result:

kernel(master) > ls -lSh | head -5
total 6352
-rw-r--r--  1 muellero  staff   236K Jun  6  2013 sched.c
-rw-r--r--  1 muellero  staff   139K Jun  6  2013 cgroup.c
-rw-r--r--  1 muellero  staff   130K Jun  6  2013 sched_fair.c
-rw-r--r--  1 muellero  staff   104K Jun  6  2013 workqueue.c

It is also often very helpful to see only the most recently changed files. -t will sort files by the time they were last modified.

ls -l -t
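
Combined with head (as with -S above), this gives a quick look at just the newest files:

ls -lt | head -5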

In a directory with lots of files it’s sometimes practical to list only the directories. The shell glob */ matches only directories, and the -d option makes ls show the matched directories themselves instead of their contents.

ls -d */

Of course the opposite is also achievable: this one lists only the files. For this you can use -p, which will write a slash (“/”) after each filename if that file is a directory.

ls -p | ack -v /

This one can nicely be combined with some awk magic to count the file types by extension in a directory:

kernel(master) > ls -p | ack -v / | awk -F . '{print $NF}' | sort | uniq -c
   1 Makefile
 112 c
   1 freezer
  17 h
  ...

Here awk was used to split at the “.” and print the last part (which is the extension).

3> find

Again a classic. After ls, for sure an all-time favorite. It can also be combined with an action.
Here, find seeks out all symbolic links in the current directory (and below) and deletes them.

find . -type l -delete

This only works for reasonably recent versions of find. For older versions the following syntax should work as well:

find . -type l -exec rm {} \;

4> ack

Ever since I found ack-grep I use it most of the time as a replacement for grep. The developers claim that it is ”…designed as a replacement for 99% of the uses of grep…” and to be honest, I have never found a case where it failed me. Out of the box it just works and produces beautifully highlighted results. But more importantly, it adds some nice features that I really grew to like.
And it’s quite fast. For most of my use cases it is faster than grep since it only searches whitelisted files by default. This guy here has put together a pretty good comparison with some performance tests.

Limit to certain File-Types

One very handy feature is the ability to narrow the search down to certain file types. By default ack already knows about countless file types (can be checked with ack --help=types). Limiting the searched files dramatically reduces the search time.
Here is an example of searching all C and C++ source files for a pattern, including 2 lines of context before (-B) and after (-A) the found match.

kernel(master) > ack --type=cpp probability -B 2 -A 2
drivers/net/ethernet/sun/cassini.h
875-
876-/* probabilities for random early drop (RED) thresholds on a FIFO threshold
877: * basis. probability should increase when the FIFO level increases. control
878: * packets are never dropped and not counted in stats. probability programmed
879- * on a 12.5% granularity. e.g., 0x1 = 1/8 packets dropped.
880- * DEFAULT: 0x00000000

Highlighter

A less common use case for ack is the passthru mode. In this mode ack does not limit the output to the matching lines but prints the whole input, highlighting the search matches in the process. This practically makes ack a formidable highlighting tool.

kernel(master) > ls | ack -i mutex --passthru
...
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
module.c
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c
nsproxy.c
...

5> rename

Mass renaming in the shell is usually done with some form of loop, e.g.

for i in *.zip ; do mv "$i" "${i%.zip}`date +%Y`.zip"; done

Here a substitution operation is used to cut off the extension, which is then replaced with the year plus the extension. The same thing using rename would look like this:

rename -X -a `date +%Y` *.zip

This is using the -a transform that appends some string to each filename.

rename part of filename

Using for loops can become complicated pretty quickly, and for those cases rename is a good alternative. Say you have a bunch of files that contain the string “Aug” in their name and you want to replace it with “08”. The by far easiest way I have found to accomplish things like this is the rename command.

rename 's/Aug/08/' *.*

rename takes modification rules and applies them to the files that match a pattern (if given; otherwise it expects a list of filenames on stdin). The beauty of this utility is that it comes with support for lots of common cases out of the box. For example, if files contain spaces or other unwanted characters, there is the option to sanitize them using -z.

tmp > ls
a b  c_.tif
tmp > rename -n -z *
'a b  c_.tif' would be renamed to 'a_b_c_.tif'

Here the option -n is applied as well, resulting in a dry run without any actual modifications. The -z option will replace all sequences of whitespace or control characters with a single “_”, replace every shell meta-character with “_” and remove spaces and underscores from the left and right ends.
The result is almost as desired… just the trailing “_” is still annoying. The sanitize step did not remove it because the file extension is the last part of the filename, but I’d really like to sanitize the name without the extension. It turns out rename has a very handy feature that saves and removes the last extension before any modifications and slaps it back on again afterwards.

tmp > rename -n -X -z *
'a b  c_.tif' would be renamed to 'a_b_c.tif'

put files in folders according to their endings

This is something I use more and more often.

rename -p -X -e '$_ = "$EXT/$_" if @EXT' *

This will move all files into folders that have the same name as their extension. The -p is needed to make sure directories are created if needed. The -X chops off the extension and saves it in the $EXT variable. -e evaluates the following expression, in this case building a path consisting of the stored extension together with the filename if an extension exists.

  • p → creates directories if needed
  • X → chop off the extension and append it again after the operation
  • e → evaluate perl expression
  • $EXT → A string containing the accumulated extensions saved by “-X” switches, without a leading dot
  • @EXT → An array containing the accumulated extensions saved by “-X” switches, from right to left, without any dots
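
As with the sanitize example, the -n flag can be combined with this to get a dry run that only prints what would be moved where, without touching anything (assuming the same rename implementation as above):

rename -n -p -X -e '$_ = "$EXT/$_" if @EXT' *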

Note that rename is not available on all systems by default and you might need to install it, e.g. brew install rename on OSX.

6> Redirects

For some shell commands I always have to consult my notes or google. Mostly that is ‘cause I haven’t taken the time to really understand the underlying concept. Redirects are a perfect example: they are simple enough to just use, so most people don’t worry about their mechanics.

Everything in UNIX is a file

Data streams and peripherals are treated just like ordinary files. Each gets a file-descriptor assigned that can then be used to access the stream. A file-descriptor is an integer associated with a network connection, a pipe or a real file, amongst other things. When executing a command, it will mainly work with 3 different file-descriptors/files:

  • 0 ≘ stdin
  • 1 ≘ stdout
  • 2 ≘ stderr
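
These numbers can be used directly in redirections; for example, a quick way to send a message to stderr instead of stdout:

echo "something went wrong" >&2   # write to file-descriptor 2 (stderr)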

Redirecting Output

For redirecting output you use “>” (the output redirection operator). Probably my most widely used redirect is to write to a file instead of stdout. The example will write to a .gitignore file, possibly creating it in the process if it doesn’t already exist.

echo temp > .gitignore

Often it is more useful to use the same redirect but append to the file rather than overwrite it:

echo temp >> .gitignore

Such basic redirects are valid for the whole line and are a short form for specifying the target file descriptor explicitly (>> is short for 1>>). They can also appear before the command:

1>> .gitignore echo temp # redirect stdout and append to .gitignore

Of course redirects are not limited to redirecting to a file. Have you ever tried to redirect the output of a command to a file but there were still some messages displayed on the terminal that did not get redirected? This is quite a common case where stderr needs to be redirected to stdout so that all error messages are sent to stdout as well:

myCommand 2>&1

The & is similar to the address operator in C/C++ and is used to name a file-descriptor. So 2>&1 tells the shell that we want file-descriptor (2) (stderr) to point to the same file as file-descriptor (1) (stdout). This works with any file-descriptors, not just (2) and (1): x>&y will point file-descriptor x to wherever y is pointing.
Using this knowledge, we can swap where 2 file-descriptors point with a technique similar to pointer swapping in C:

myCommand 3>&1 1>&2 2>&3 3>&-

A third file-descriptor (3) is used as a temporary. It first captures the file stdout points to; stdout is then redirected to where stderr points. As a last step stderr is pointed to where stdout pointed originally, and (3) is closed as it is no longer needed.
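
A quick way to see the swap in action: a pipe normally only receives stdout, but after the swap it receives what was written to stderr (the subshell below is just a stand-in that writes one line to each stream):

( echo to-stdout; echo to-stderr >&2 ) 3>&1 1>&2 2>&3 3>&- | rev
# only "to-stderr" travels through the pipe and comes out reversed,
# while "to-stdout" now ends up on the terminal via stderr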

Order of Redirects

A fact that often causes considerable confusion is that the order of the redirects matters. It is actually not too hard to remember once you picture the file-descriptors as pointers to files. If we, for example, try to capture stdout and stderr in a file, we might be tempted to use something like this:

myCommand 2>&1 >> my_log_file.txt

But this does not work as expected. Here, stderr is first pointed at the same file as stdout (i.e. the terminal). Then we redirect stdout away from the terminal to a file.
If both stdout and stderr should be captured in a file, we need to reverse the order of the redirects:

myCommand >>my_log_file.txt 2>&1 # capture everything in a file
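
The difference is easy to see in a terminal with a command that writes one line to each stream:

( echo out; echo err >&2 ) 2>&1 > /dev/null   # "err" still reaches the terminal
( echo out; echo err >&2 ) > /dev/null 2>&1   # nothing is printed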

Armed with an understanding of how redirects work, it’s now quite simple to understand the following example:

Only use stderr output

Redirect stderr to where stdout points and then redirect stdout to /dev/null (dump it). The output can then be processed further; here we pipe it to grep to search for something.

myCommand 2>&1 >/dev/null | grep 'foo'

Redirecting Input

Similar to how we can modify stdout and stderr, stdin can also be replaced as an input to a program using the input redirection operator “<”.

myCommand < inputFile # same as cat inputFile | myCommand

Combining input and output redirection is also possible for one command:

myCommand < inputFile > outFile

7> Subshells

Quite often I want to download some file using wget and put it into my download folder. At the same time I don’t want to lose the context (current working directory).
One way to deal with this is to cd into the download directory, issue the wget command, and use a quick cd - to return to where you started. Lately my preferred way is to use a subshell.

tmp > (cd ~/downloads; wget http://www.example.com/abc.tar.gz)

A subshell is a child process of the process the shell is running in, with access to the environment of its parent process. But any changes to the environment made in this child process do not propagate to the parent. Thus we can change the directory, set or unset environment variables, and so on.
Another nice example is to remove http_proxy from the environment when it is temporarily not needed:

tmp > (unset http_proxy; wget http://www.example.com/abc.tar.gz)
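
A quick way to convince yourself that the parent shell is untouched (the directory in the comments is just a placeholder):

pwd              # e.g. /home/user/project
(cd /tmp; pwd)   # /tmp, but only inside the subshell
pwd              # still /home/user/project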

Bonus Example

I just found another very nice usage of subshells on the discussion board of vimcasts. Barton Chittenden showed how to avoid temporary files when using vimdiff, by means of process substitution in bash:

tmp > vimdiff <(sort file1) <(sort file2)
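
Process substitution is not tied to vimdiff; any command that expects file arguments works. For example, comparing two directory listings (dir1 and dir2 are placeholders):

diff <(ls dir1) <(ls dir2)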

8> Tee

Sometimes there is more than one consumer for some command output. This is what tee can be used for. It reads the output of some other command on stdin and duplicates it, writing one copy to a file (or another file-descriptor) and one to stdout.
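
In its simplest form tee just saves a copy of the stream to a file while passing it on unchanged (listing.txt is just a placeholder name):

ls -l | tee listing.txt | head -3   # full listing goes to listing.txt, only the first lines are shown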

After taking a look at redirects and subshells, here is a neat commandlinefu example of how tee can be used to split a pipe into multiple streams for one or more subshells to work on.

tmp > echo "tee can split a pipe in two" | tee >(rev) >(tr ' ' '_')
tee can split a pipe in two
owt ni epip a tilps nac eet
tee_can_split_a_pipe_in_two

Copy a directory multiple times

Copying a directory with everything in it, without temporary files and preserving ownership, permissions, and timestamps, is often done with tar.

tar cf - . | (cd targetDir && tar xfp -)

tar will create an archive containing everything in the current directory. The “-“ is used to write to stdout instead of a file. The output is then piped into a subshell that changes into the target directory and extracts everything.
Copying everything twice can be accomplished using tee:

tar cf - . | tee >(cd targetDir && tar xfp -) | (cd targetDir2 && tar xfp -)

Here tee will duplicate stdin and feed it to a process substitution and to stdout, which in turn is then piped to the subshell.

9> Zips & Archives

Compressed archives are a brilliant way to exchange files… everything is bundled together, in general takes up way less space and can even be password protected. Inspecting or unpacking them usually involves creating temporary directories or files. But there are some handy ways to avoid such intermediate products.

view zipfiles

Often you just need to peek inside without actually extracting anything. zipinfo is a nice little utility that does just that.

tmp > zipinfo tmp.zip
Archive:  tmp.zip   20400 bytes   2 files
-rw-r--r--  3.0 unx    19960 bx defN 14-Nov-12 11:05 colordiff-1.0.13.tar.gz
-rw-r--r--  3.0 unx       72 bx stor 14-Nov-12 11:06 colordiff-1.0.13.tar.gz.sig
2 files, 20032 bytes uncompressed, 20022 bytes compressed:  0.0%

But of course unzip can also perform the task. The -t option tests an archive file, listing its content in the process:

tmp > unzip -t tmp.zip
Archive:  tmp.zip
    testing: colordiff-1.0.13.tar.gz   OK
    testing: colordiff-1.0.13.tar.gz.sig   OK
No errors detected in compressed data of tmp.zip.

A third, and my preferred, option is to use unzip with its -l option (which lists the content of the archive file).

tmp > unzip -l tmp.zip
Archive:  tmp.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
    19960  11-14-12 11:05   colordiff-1.0.13.tar.gz
       72  11-14-12 11:06   colordiff-1.0.13.tar.gz.sig
 --------                   -------
    20032                   2 files

Unpacking

Even unpacking does not necessarily involve spilling the zipped files out to disk. The content of a zip archive can be extracted and fed to a pipe so that it becomes usable by another process.

tmp > echo "text me" > test.txt
tmp > zip storage test.txt; rm test.txt
  adding: test.txt (stored 0%)
tmp > ls
storage.zip
tmp > unzip -p storage.zip | rev
em txet
tmp > ls
storage.zip

Here I created a sample zip archive and extracted it to feed the content to rev without creating any files in the process.

Inspecting tar Archives

Most of the archives dealt with under Linux are compressed tar files, so here is how you list their content.
First, for a gzipped file (ending in *.tar.gz or *.tgz):

tar -ztvf file.tar.gz

And finally for the bzip2 format:

tar -jtvf file.tar.bz2
  • t → list contents
  • v → verbose, display detailed information
  • z → filter through gzip (for *.gz files)
  • j → filter through bzip2 (for *.bz2 files)
  • f → filename

10> Base Conversion

Even though printf in C/C++ offers a great many pitfalls, I still like to use it often despite having access to C++ iostreams. Bash also supports a form of printf that can easily be used for some basic number conversions.

convert decimal to hex/octal

Just as in C, the bash printf can easily print numerical values in different formats:

tmp > printf "%#x\n" 100
0x64
tmp > printf "%#o\n" 100
0144

The #-character is the alternative format modifier and is responsible for prepending the “0x” for hexadecimal values and a leading zero for octal values.

convert hex to decimal

tmp > printf "%d\n" 0x64
100
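
As a small aside (plain bash arithmetic rather than printf), arbitrary bases can also be converted to decimal using the base#number syntax:

tmp > echo $((2#101010))
42
tmp > echo $((16#ff))
255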

A nice example I found in the Bash Hackers Wiki is to print out a conversion table:

tmp > for ((x=0; x <= 127; x++)); do printf '%3d | 0x%02x\n' "$x" "$x"; done
  0 | 0x00
  1 | 0x01
  2 | 0x02
  3 | 0x03
  ...

Done for 2013

Ok, this was my bash feature list of the year. By no means am I an expert on the features described. If you see something that is incorrect or could be done more elegantly, I’d be glad to hear about it!

Photo: Todd Quackenbush