Detecting Homograph IDNs Using OCR

Yuta Sawabe, Daiki Chiba, Mitsuaki Akiyama, Shigeki Goto


Legitimate domain names are currently being targeted by homograph attacks. In such an attack, an adversary creates a new domain name that looks like an existing legitimate one by replacing some of its characters with visually similar characters, thereby leading users to visit a different (fake) site. Internationalized domain names (IDNs), which can contain non-ASCII characters, are particularly exposed: many similar-looking IDNs (homograph IDNs) can be generated and registered for use as phishing sites. The conventional method of detecting homograph IDNs relies on a predefined mapping between ASCII characters and similar non-ASCII characters. This approach has two major limitations: it cannot detect homograph IDNs that contain characters not defined in the mapping, and the mapping must be updated manually. In this paper, we propose a new method that detects homograph IDNs by using optical character recognition (OCR). Because homograph IDNs are, by definition, visually similar to legitimate domain names, OCR can recognize such similarities automatically. We compare our approach against the conventional method in evaluations that use over 1.92 million real (registered) IDNs and over 10,000 malicious IDNs. The results show that our method can automatically detect homograph IDNs that the conventional approach cannot.
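The conventional mapping-based approach, and the first of its limitations, can be illustrated with a minimal Python sketch. The mapping table, the function names (`skeleton`, `find_target`), and the domain names below are assumptions chosen for illustration; they are not the table or the code used in the paper, and the sketch does not implement the proposed OCR-based method.

```python
# Hypothetical, hand-picked mapping from non-ASCII characters to the ASCII
# characters they resemble. Real mapping tables are far larger.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
    "\u0441": "c",  # Cyrillic small letter es
}


def skeleton(name):
    """Replace every mapped character with its ASCII look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)


def find_target(idn, legitimate):
    """Return the legitimate name that `idn` imitates, or None."""
    if idn.isascii():
        # A pure-ASCII name is not an IDN, so it is out of scope here.
        return None
    candidate = skeleton(idn)
    return candidate if candidate in legitimate else None


# Illustrative inputs (assumptions, not data from the paper).
legitimate = {"example.com", "apple.com"}
mapped = "\u0430pple.com"    # first letter is Cyrillic a (in the table)
unmapped = "\u0251pple.com"  # first letter is U+0251, absent from the table

print(find_target(mapped, legitimate))    # the imitated name is found
print(find_target(unmapped, legitimate))  # missed: character is not mapped
```

The second lookup fails only because U+0251 has no entry in the table, which is the gap a recognition step based on rendered glyphs is meant to close.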





