Feature/improve email extract #373

Merged
merged 6 commits into from
Oct 12, 2018

Conversation

klaxon1
Contributor

@klaxon1 klaxon1 commented Oct 2, 2018

To resolve #365
Used email regex from https://www.regular-expressions.info/email.html
Added some basic tests.
The following email addresses are valid:

email@example.com
firstname.lastname@example.com
email@subdomain.example.com
firstname+lastname@example.com
1234567890@example.com
email@example-one.com
_______@example.com
email@example.name
email@example.museum
email@example.co.jp
firstname-lastname@example.com
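For readers who want to try the addresses listed above, here is a minimal Python sketch of ASCII-only extraction. The pattern is written in the style of the one discussed on regular-expressions.info; whether the PR used this exact variant is an assumption, and `ASCII_EMAIL` and `extract_emails` are hypothetical names chosen for illustration, not the project's code.

```python
import re
from typing import List

# ASCII-only pattern in the style of regular-expressions.info.
# Assumption: the pattern actually merged in the PR may differ from this one.
ASCII_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",
    re.IGNORECASE,
)


def extract_emails(text: str) -> List[str]:
    """Return every substring of `text` matched by the ASCII pattern, in order."""
    return ASCII_EMAIL.findall(text)


sample = (
    "email@example.com firstname.lastname@example.com "
    "email@subdomain.example.com firstname+lastname@example.com "
    "1234567890@example.com email@example-one.com _______@example.com "
    "email@example.name email@example.museum email@example.co.jp "
    "firstname-lastname@example.com"
)

found = extract_emails(sample)
print(len(found), "addresses extracted")
```

Because `+` is part of the local-part character class, `firstname+lastname@example.com` is extracted whole, which is the case raised in #365.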

@edent

edent commented Oct 3, 2018

Does this work with I18n domain names?

For example test@莎士比亚.org

(from https://shkspr.mobi/blog/2016/09/why-cant-you-send-email-to-a-chinese-address/)

@GCHQ77703
Member

GCHQ77703 commented Oct 4, 2018

It does not appear to, nor does it appear to work with non-Latin characters anywhere in the email address. For example:

gâteau@cake.com

This is actually not allowed, though, according to RFC 5322, which specifically stipulates only Latin characters in email addresses. Non-Latin characters are usually converted, behind the scenes of whatever client you are using, into something that looks like:

test@xn--bcher-kva.com

Although some MTAs do not allow such characters, it might still be reasonable to support them ourselves, as they are ubiquitous.
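To illustrate the conversion described above, here is a small Python sketch using the standard library's built-in `idna` codec, which implements the older IDNA 2003 rules. It only handles the domain part of an address; the helper names `to_ascii_domain` and `to_unicode_domain` are hypothetical, and real mail clients may apply different or newer rules.

```python
def to_ascii_domain(domain: str) -> str:
    """Encode an internationalised domain name into its ASCII ("xn--") form."""
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain: str) -> str:
    """Decode an ASCII ("xn--") domain name back into its Unicode form."""
    return domain.encode("ascii").decode("idna")


# The domain from the example address above.
print(to_ascii_domain("bücher.com"))
print(to_unicode_domain("xn--bcher-kva.com"))
```

An extractor that only accepts ASCII will therefore find the converted form of an address but miss the same address as the user typed it.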

@klaxon1
Contributor Author

klaxon1 commented Oct 5, 2018

Good points all, I didn't consider those cases. Let's leave this pull request open and I'll make some additional commits.

@klaxon1
Contributor Author

klaxon1 commented Oct 11, 2018

So, I've learnt this week that email addresses are more complicated than I originally thought.
Hopefully this updated regex (from https://www.regextester.com/98066) matches 99% of valid email addresses.
The following email addresses are all extracted successfully:

伊昭傑@郵件.商務
म@मोहन.ईन्फो
юзер@екзампл.ком
θσερ@εχαμπλε.ψομ 
JosễSilvễ@googlễ.com
JosễSilvễ@google.com 
JosễSilva@google.com
FoO@BaR.CoM
john@192.168.10.100
gómez@junk.br
Abc.123@example.com.
user+mailbox/department=shipping@example.com
用户@例子.广告
उपयोगकर्ता@उदाहरण.कॉम
юзер@екзампл.ком
θσερ@εχαμπλε.ψομ
Dörte@Sörensen.example.com
аджай@экзампл.рус
test@xn--bcher-kva.com
gâteau@cake.com
test@莎士比亚.org
email@example.com
firstname.lastname@example.com
email@subdomain.example.com
firstname+lastname@example.com
1234567890@example.com
email@example-one.com
_______@example.com
email@example.name
email@example.museum
email@example.co.jp
firstname-lastname@example.com
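As a rough illustration of Unicode-aware extraction, here is a deliberately permissive Python sketch. It is not the regextester.com pattern referenced above; `PERMISSIVE_EMAIL` and `extract_emails` are hypothetical names. It uses a negated character class rather than `\w`, because in Python's `re` module `\w` does not cover the combining marks used in scripts such as Devanagari.

```python
import re
from typing import List

# Permissive sketch: anything that is not whitespace, "@" or a common
# delimiter, then "@", then a domain containing at least one dot.
# Assumption: this is NOT the pattern merged in the PR, only an illustration.
PERMISSIVE_EMAIL = re.compile(r"[^\s@<>,;]+@[^\s@<>,;]+\.[^\s@<>,;]+")


def extract_emails(text: str) -> List[str]:
    """Return every substring of `text` that looks like an email address."""
    return PERMISSIVE_EMAIL.findall(text)


addresses = [
    "用户@例子.广告",
    "उपयोगकर्ता@उदाहरण.कॉम",
    "юзер@екзампл.ком",
    "gâteau@cake.com",
    "test@xn--bcher-kva.com",
    "john@192.168.10.100",
    "user+mailbox/department=shipping@example.com",
]

found = extract_emails("\n".join(addresses))
print(len(found), "of", len(addresses), "addresses extracted")
```

The trade-off is precision: a pattern this loose also accepts many strings that are not valid addresses, so it suits extraction from free text better than validation.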

@n1474335 n1474335 merged commit e638fb6 into gchq:master Oct 12, 2018
@n1474335
Member

Excellent, thanks very much for this. I haven't updated the built-in email regex in the 'Regular Expression' operation, as I feel that version should be a bit more quick and dirty. It's useful to have a fairly accurate one for 'Extract email addresses', though.

@klaxon1 klaxon1 deleted the feature/improve-email-extract branch October 13, 2018 04:53
Development

Successfully merging this pull request may close these issues.

Email search regex does not include +
4 participants