- Purpose of course
- Learn advanced text manipulation from the command line using regular expressions
- Focused on use in bioinformatics and processing large/many files.
- What we expect you already know
- Some familiarity with regex (e.g. you took Part 1 of this course or reviewed that material on your own)
- Basic Unix command line
- What we're covering
- sed: for automated editing of files
- grep: for selecting lines from a file
- There are two common distributions of sed and grep: GNU (default on Linux systems) and BSD (default on Macs).
- There are minor, but important, differences between them. We've decided to focus on the GNU versions because Linux systems are more common for working with bioinformatic datasets and we find the GNU versions are more powerful.
- There are several ways to get access to GNU versions of
sed
andgrep
- If you're using a Linux system GNU versions will be available by default.
- If your workstation is not Linux based, you can log in to a remote Linux system such as the Hydra computing cluster or using a virtual machine like this mybinder.org VM that we've setup for other trainings and can be used here:
- Conda: If you have conda installed, you can use this command to install the gnu versions of
sed
andgrep
:
conda install -c conda-forge -c bioconda sed grep
- Homebrew: We prefer using Conda for managing packages on a Mac, but if you use Homebrew, use these commands:
brew install gnu-sed
brew install grep
- WSL
- cygwin
- Git for Windows?
sed
(short for Stream Editor) is a program used for editing text files. It reaches back to the very early days of Unix systems. It excels at working with large text files and for re-running the same types of edits on many files. The "Stream" portion of its name is because it doesn't the entire file into memory, it reads from disk as the file is being processed. Because of this, it can work with very large text files (like we sometimes see in bioinformatcs). It can also be used in conjuction with other commands via pipes where the output of one commands becomes the input of another.
sed
uses commands from its scripting language to edit the files. We'll only be using the s
script step (short for substitute) to change text in a file which is the most common use of sed, but there are other commands like d
to delete specific lines or p
to print certain lines that are sometimes used.
Before we look at using regular expressions with sed
, let's get familiar with using the text replace function in sed
.
As mentioned above, we'll be using sed's s
command (short for substitute)
Format of the s
command:
sed 's/<find>/<replace>/<options>'
<find>
: text (or regex) to find in the input file.
<replace>
: text to replace the found text with.
<options>
: options for the substitution (g
for global, i
for case insensitive). More about this below.
We our going to use a sample FASTA file, locus_1.fasta
, to try sed
$ cat locus_1.fasta
>locus_1-seq_1
ACCGTTATC
>locus_1-seq_2
ACCGTTCTC
>locus_1-seq_3
ACCGTTGTC
>locus_1-seq_4
ACCGTTTTC
>locus_1-seq_5
ACCGTT-TC
>locus_1-seq_6
ACCGTTNNN
We will change "seq" in the sequence name lines with "sample".
Hint: a good way to get started with the sed line is by entering the outline of the s
command for sed 's///'
and then adding the specifics.
$ sed 's/seq/sample/' locus_1.fasta
>locus_1-sample_1
ACCGTTATC
>locus_1-sample_2
ACCGTTCTC
>locus_1-sample_3
ACCGTTGTC
>locus_1-sample_4
ACCGTTTTC
>locus_1-sample_5
ACCGTT-TC
>locus_1-sample_6
ACCGTTNNN
The text of the file is displayed to the screen with the changes made.
Another common way to use sed
is to remove the found text. This can be done by leaving the replace section blank.
For example, referring back to the original locus_1.fasta
, if you want to remove sample_1
from each header, you can use:
$ sed 's/locus_1-seq_//' locus_1.fasta
>1
ACCGTTATC
>2
ACCGTTCTC
>3
ACCGTTGTC
>4
ACCGTTTTC
>5
ACCGTT-TC
>6
ACCGTTNNN
You may be wonder about re-using our original file if we already changed seq
to sample
.
Run the cat command on locus_1.fasta
to find out.
sed
will not change the original file unless you tell it do.
In Unix systems you can redirect the output from the screen to a file using the >
character.
This command will create a new file, locus1_mod.fasta
, that has the modified text.
sed 's/seq/sample/' locus_1.fasta >locus1_mod.fasta
Automatically create a backup:
sed -i.bak 's/seq/sample/' locus_1.fasta
locus_1.fasta
: modified files
locus_1.fasta.bak
: original file
Change the original with no backup (not this format doesn't work on BSD sed on Macs):
sed -i 's/seq/sample/' locus_1.fasta
Warning: this changes your original file without making a backup, be careful with this and have your own backup available |
---|
In the above example, the input text is from the file locus_1.fasta
. You can also use sed
with a pipe |
to take output from one command as input into sed. Here's a non-biolgical exmaple, but we'll see it again after we learn grep
.
$ echo "What are you trying to find?" | sed 's/find/replace/'
What are you trying to replace?
The s
command in this form will only find/replace the first occurrence of the search string in each line. You can have it replace every occurrence by adding g
(for global) to the end of the command.
Compare the difference on seq_6
with this command:
$ sed 's/N/-/' locus_1.fasta
>sample_1
ACCGTTATC
>sample_2
ACCGTTCTC
>sample_3
ACCGTTGTC
>sample_4
ACCGTTTTC
>sample_5
ACCGTT-TC
>sample_6
ACCGTT-NN
and this one that has the g
global option:
sed 's/N/-/g' locus_1.fasta
>sample_1
ACCGTTATC
>sample_2
ACCGTTCTC
>sample_3
ACCGTTGTC
>sample_4
ACCGTTTTC
>sample_5
ACCGTT-TC
>sample_6
ACCGTT---
You can make the find portion of the s
command case insensitive by adding i
to the end of the s
command.
echo "I need to go to the Smithsonian" | sed 's/smithsonian/museum/i'
I need to go to the museum
You can also combine i
and g
with /ig
echo "The Times has the best crosswords at times." | sed 's/times/post/ig'
The post has the best crosswords at post.
sed
can use regular expressions in its substitute command.
Use the option -r to enable "Extended Regular Expressions, sed -r ." |
---|
[[:digit:]]
instead of \d
[[:alnum:]]
instead of \w
(except that \w
includes _
)
...
\1
- escape it with
\
:\/
- Use a different special character instead of
/
in thes
step:sed 's|||'
grep
is another commonly used program for working with text files. The most common use is to return every line of a text file that matches a certain string or regular expression. Other related uses can be to print every line that doesn't match a certain pattern and printing the file names which contain or are missing a match.
Like we did with sed
, it's useful to first learn how to use grep
without regular expressions.
The format of a grep
command is:
grep <optional arguments> "<text to find>" <inputfile>
<optional arguments>
: command line arguments including -E
(use extended regular expressions, recommended), -i
(case insensitive), -c
(count of matches), -v
(inverse search: lines without a match), --files-with-matches
, --files-without-match
<text to find>
: string or regular expression to find
<inputfile>
: file to search, or grep can be used with |
to pipe output from another command