Text files contain a continuous stream of characters with no predefined format. Some file formats are built on top of plain text (e.g. JSON, YAML) and expect the data to follow a particular structure, but ordinary '.txt' files follow no such convention. Hence, retrieving a specific line, phrase, or string from a text file is done using generic Linux tools.
The grep command in Linux is used to find a substring or a text pattern in a string or a file. It prints every line in which the pattern is found.
The syntax for using the grep command is as follows:
$ grep <substring> <filename/standard input>
For example, to search for the substring "Name" in the file 'test.txt' (contents of which are shown in the screenshot), run the following.
$ grep "Name" test.txt
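To try this without the original file, here is a minimal sketch. The file contents below are a hypothetical stand-in (the real 'test.txt' is only shown in the screenshot), and the temporary directory is used only so that nothing in your working directory is overwritten.

```shell
#!/bin/sh
# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample contents; not the file from the article.
cat > "$workdir/test.txt" <<'EOF'
Name: Alice
Age: 30
City: Pune
Name: Bob
EOF

# Print every line that contains the substring "Name".
name_lines=$(grep "Name" "$workdir/test.txt")
printf '%s\n' "$name_lines"
```

Only the two lines containing "Name" are printed; the "Age" and "City" lines are skipped.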
Today, we will see how to extract Email addresses out of text files using the grep command.
As we know, an Email address has the format:
user_id@domain.subdomain
Here, user_id is a unique identifier string chosen by the user, and domain and subdomain represent the Email service provider (Eg. gmail.com).
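For simplicity, this article assumes that domain and subdomain names contain only letters, whereas user_id can contain letters, digits, and other common characters such as a period (.) and an underscore (_). Real domain names may also contain digits and hyphens, which the simple pattern used here does not handle.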
As this is a definite pattern that is to be searched, we can describe it with a regular expression. grep already interprets its pattern as a basic regular expression; the '-e' flag simply marks the argument that follows it as the pattern, which keeps the command explicit.
Thus, the syntax of grep with '-e' is:
$ grep -e <regular_expression> <filename/standard input>
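The following sketch shows what '-e' does in practice. It assumes GNU grep (for the '\+' operator), and the file 'notes.txt' with its contents is made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample file.
cat > "$workdir/notes.txt" <<'EOF'
alpha
-beta
gamma 42
EOF

# The pattern is treated as a regular expression with or without -e.
plain=$(grep "[0-9]\+" "$workdir/notes.txt")
with_e=$(grep -e "[0-9]\+" "$workdir/notes.txt")

# -e lets a pattern start with a hyphen without being read as an option.
hyphen=$(grep -e "-beta" "$workdir/notes.txt")

# Repeating -e prints lines that match any one of the patterns.
either=$(grep -e "alpha" -e "gamma" "$workdir/notes.txt")

printf '%s\n' "$plain" "$with_e" "$hyphen" "$either"
```

The first two commands print the same line, which shows that '-e' matters mostly when a pattern begins with a hyphen or when several patterns are given at once.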
Based on the pattern of an Email address discussed before, we can form the following regular expression:
[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+
Here, 'a-zA-Z' represents any letter, '0-9' represents digits, and '._' represents a period or an underscore. Note that the characters '\+' (a GNU grep extension to basic regular expressions) mean that the character set in the brackets must appear one or more times.
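The individual pieces of the expression can be tried on their own. This is a small sketch assuming GNU grep; the input strings are invented, and '-o' is used here only to print the part of the input that each bracket expression matches.

```shell
#!/bin/sh
# The user_id character set: letters, digits, period, underscore.
# The trailing '@' is not in the set, so the match stops before it.
user_part=$(printf '%s\n' 'john_doe.99@' | grep -o "[a-zA-Z0-9._]\+")

# The letters-only set used for the domain: the match stops at the first digit.
letters_only=$(printf '%s\n' 'mail123' | grep -o "[a-zA-Z]\+")

# '\+' needs at least one character from the set, so '@' alone does not match.
# grep -c prints the number of matching lines.
no_match_count=$(printf '%s\n' '@' | grep -c "[a-zA-Z]\+")

printf '%s\n' "$user_part" "$letters_only" "$no_match_count"
```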
We will run this regular expression to extract Email addresses from the file 'test2.txt'.
First, view the contents of the file test2.txt:
$ cat test2.txt
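Next, run the following command to extract Email addresses from the file. The dot between the domain and the subdomain is escaped as '\.' so that it matches only a literal period; an unescaped '.' would match any character.
$ grep -e "[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+" test2.txt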
As we can see, the Email addresses were identified successfully by grep. However, each one is displayed along with the complete line in which it appears.
To display just the matched Email addresses, use the '-o' flag along with '-e' as shown.
$ grep -oe "[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+" test2.txt
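The difference between the two commands can be reproduced with the sketch below. It assumes GNU grep, and the contents of 'test2.txt' are a hypothetical stand-in for the file in the screenshot. The pattern here escapes the dot as '\.' so that it matches only a literal period.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample contents; not the file from the article.
cat > "$workdir/test2.txt" <<'EOF'
Contact Alice at alice.smith@example.com for details.
This line has no address.
Bob (bob_99@mail.org) and Carol (carol@test.net) are on leave.
EOF

# Single quotes keep the backslashes intact for grep.
pattern='[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+'

# Without -o: whole lines that contain at least one address.
full_lines=$(grep -e "$pattern" "$workdir/test2.txt")

# With -o: only the matched addresses, one per line.
only_emails=$(grep -oe "$pattern" "$workdir/test2.txt")

printf '%s\n' "$only_emails"
```

With '-o', a line that holds two addresses produces two output lines, one for each match.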
In this article, we have seen how to extract Email addresses from a text file in Linux, using the handy command-line tool grep. These Email addresses can then also be written to a file using output redirection ('>').
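As a rough sketch of the redirection step, the extracted addresses can be saved as follows. The file name 'emails.txt' and the sample input are assumptions made for this example, GNU grep is assumed, and piping through 'sort -u' is an optional extra that drops duplicate addresses before they are written.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample contents with one repeated address.
cat > "$workdir/test2.txt" <<'EOF'
Write to bob@mail.org or alice@example.com.
Reminder: bob@mail.org is the primary contact.
EOF

pattern='[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+'

# '>' creates emails.txt (or overwrites it); use '>>' to append instead.
grep -oe "$pattern" "$workdir/test2.txt" | sort -u > "$workdir/emails.txt"

saved=$(cat "$workdir/emails.txt")
printf '%s\n' "$saved"
```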
If you have any questions or feedback, let us know in the comments below.