How to Extract Email Addresses from a Text File in Linux

Text files contain a continuous stream of characters with no predefined format. Some formats built on top of plain text (e.g. JSON, YAML) expect the data to follow a particular structure, but ordinary '.txt' files have no such conventions. Hence, retrieving a specific line, phrase, or string from a text file has to be done with generic Linux tools.

The grep command in Linux is used to find a substring or a text pattern in a string or a file. It prints each line in which the substring or pattern is found.

The syntax for using the grep command is as follows:

$ grep <substring> <filename/standard input>

For example, to search for the substring “Name” in the file ‘test.txt’ (the contents of which are shown in the screenshot), run the following.

$ grep "Name" test.txt
Find a String in File
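
To try the command without the original file, a small stand-in can be created first. The sketch below is only an illustration: the file contents and the temporary directory are assumptions, not the data from the screenshot. It also shows grep reading from standard input instead of a file.

```shell
# Hypothetical sample data; the file from the screenshot is not reproduced here.
workdir=$(mktemp -d)
printf '%s\n' 'Name: Alice' 'Age: 30' 'Name: Bob' > "$workdir/test.txt"

# Search a file: print every line that contains the substring "Name".
from_file=$(grep "Name" "$workdir/test.txt")
printf '%s\n' "$from_file"

# Search standard input: the same lines are piped into grep instead.
from_stdin=$(cat "$workdir/test.txt" | grep "Name")
printf '%s\n' "$from_stdin"
```

Both forms print the two lines that contain "Name" and skip the "Age" line.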

Today, we will see how to extract Email addresses out of text files using the grep command.

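As we know, an Email address has the format:

<user_id>@<domain>.<top_level_domain>

Here, user_id is a unique identifier string chosen by the user, while domain and top_level_domain together identify the Email service provider (e.g. gmail.com, where gmail is the domain and com is the top-level domain).

For this article, we assume that the domain and top-level domain contain only letters, whereas user_id can contain letters, digits, and other common characters such as the period (.) and underscore (_). Real addresses can be more varied: domain names may also contain digits and hyphens, and user IDs may contain characters such as the plus sign (+) or hyphen (-).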
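
The two sides of the '@' sign can be seen by splitting an address in the shell. This is a minimal sketch using standard parameter expansion; the address and the variable names are made up for illustration.

```shell
addr="john.doe@example.com"   # hypothetical address

# Remove the longest suffix starting at "@": what remains is the user ID.
user_id="${addr%%@*}"

# Remove the shortest prefix ending at "@": what remains is the provider part.
provider="${addr#*@}"

printf 'user_id:  %s\n' "$user_id"
printf 'provider: %s\n' "$provider"
```

Here user_id holds john.doe and provider holds example.com.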

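Since this is a definite pattern, we can describe it with a regular expression rather than a fixed substring. grep interprets its pattern as a basic regular expression by default; the '-e' flag explicitly marks the argument that follows it as the pattern, which is useful when several patterns are given or when a pattern begins with a hyphen.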

Thus, the syntax of grep with '-e' is:

$ grep -e <regular_expression> <filename/standard input>
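
As a quick check of how '-e' behaves, the sketch below runs a simple regular expression with and without the flag, and then repeats '-e' to search for two patterns at once. The input lines are hypothetical.

```shell
# Hypothetical input, one item per line.
sample=$(printf '%s\n' 'apple 123' 'banana' 'cherry 7')

# The pattern [0-9] is treated as a regular expression in both cases.
without_e=$(printf '%s\n' "$sample" | grep '[0-9]')
with_e=$(printf '%s\n' "$sample" | grep -e '[0-9]')
printf '%s\n' "$with_e"

# -e may be given more than once; a line matching any of the patterns is printed.
either=$(printf '%s\n' "$sample" | grep -e 'banana' -e '[0-9]')
printf '%s\n' "$either"
```

The first two commands select the same lines (those containing a digit), while the last one selects all three lines.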

Based on the pattern of an Email address discussed before, we can form the following regular expression:

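[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+

Here, 'a-zA-Z' represents any letter, '0-9' represents any digit, and '._' represents a period or an underscore. The characters '\+' mean that the preceding bracketed character set must appear one or more times. The period between the two domain parts is escaped as '\.' so that it matches a literal period; an unescaped period would match any single character.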
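
To see how the pieces of the pattern behave, it can be tried on a few hypothetical strings. The sketch also compares an escaped period ('\.') with an unescaped one, because outside of brackets an unescaped period matches any single character. It assumes GNU grep, where '\+' is supported in basic regular expressions.

```shell
# Hypothetical test strings, one per line.
samples=$(printf '%s\n' 'john.doe_99@example.com' 'no address here' 'bob@example-com')

# Escaped period: only a literal "." may separate the two domain parts.
strict=$(printf '%s\n' "$samples" | grep -o '[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+')
printf 'strict:\n%s\n' "$strict"

# Unescaped period: "." matches any character, so "bob@example-com" matches too.
loose=$(printf '%s\n' "$samples" | grep -o '[a-zA-Z0-9._]\+@[a-zA-Z]\+.[a-zA-Z]\+')
printf 'loose:\n%s\n' "$loose"
```

The strict form reports only john.doe_99@example.com, whereas the loose form also reports bob@example-com.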

We will use this regular expression to extract Email addresses from the file ‘test2.txt’.

First, view the contents of the file test2.txt:

$ cat test2.txt
View Contents of File
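
To follow along without the original file, a stand-in test2.txt can be created with a here-document. The contents and the temporary directory below are assumptions made for illustration; they are not the data from the screenshot.

```shell
workdir=$(mktemp -d)

# Hypothetical file contents: some lines contain addresses, one does not.
cat > "$workdir/test2.txt" <<'EOF'
Contact John at john.doe@example.com for details.
This line has no address.
Support: support_team@example.org, sales: sales99@example.net
EOF

cat "$workdir/test2.txt"
```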

Next, run the following command to extract Email addresses from the file.

$ grep -e "[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+" test2.txt
Extract Email Addresses from File

As we can see, grep identified the Email addresses successfully. However, each one is displayed along with the complete line in which it appears.

To display only the matched Email addresses, use the '-o' flag along with '-e' as shown.

$ grep -oe "[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+" test2.txt
Find Email Addresses in File
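
One detail of '-o' worth noting is that each match is printed on its own line, even when several addresses share a single input line. A small sketch with a hypothetical input line, assuming GNU grep:

```shell
# Hypothetical line containing two addresses.
line='Support: support_team@example.org, sales: sales99@example.net'

# -o prints only the matched text, one match per output line.
found=$(printf '%s\n' "$line" | grep -oe '[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+')
printf '%s\n' "$found"
```

The output consists of two lines, one for each address, without the surrounding text.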
Conclusion

In this article, we have seen how to extract Email addresses from a text file in Linux using the handy command-line tool grep. These Email addresses can then also be written to a file using redirection.
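
As a rough sketch of the redirection step, the matches can be sent to a file with the '>' operator. The input contents and the output file name emails.txt are assumptions chosen for this example, and GNU grep is assumed.

```shell
workdir=$(mktemp -d)

# Hypothetical input file.
printf '%s\n' 'Mail alice@example.com today' 'cc: bob@example.org' > "$workdir/test2.txt"

# ">" creates emails.txt or overwrites it; ">>" would append to it instead.
grep -oe '[a-zA-Z0-9._]\+@[a-zA-Z]\+\.[a-zA-Z]\+' "$workdir/test2.txt" > "$workdir/emails.txt"

cat "$workdir/emails.txt"
```

After this runs, emails.txt contains one extracted address per line.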

If you have any questions or feedback, let us know in the comments below.
