Detecting Homograph IDNs Using OCR

Yuta Sawabe, Daiki Chiba, Mitsuaki Akiyama, Shigeki Goto


Legitimate domain names are increasingly targeted by homograph attacks. In such an attack, an adversary creates a new domain name that looks like an existing legitimate one by replacing some of its characters with visually similar ones, thereby leading users to visit a different (fake) site. Internationalized domain names (IDNs), which can contain non-ASCII characters, are particularly exposed: many look-alike IDNs (homograph IDNs) can be generated and registered for use as phishing sites. The conventional method of detecting homograph IDNs relies on a predefined mapping between ASCII characters and visually similar non-ASCII characters. This approach has two major limitations: it cannot detect homograph IDNs containing characters that are not defined in the mapping, and the mapping must be updated manually. In this paper, we propose a new method that detects homograph IDNs by using optical character recognition (OCR). Because homograph IDNs are, by construction, visually similar to legitimate domain names, we leverage OCR techniques to recognize such similarities automatically. We compare our approach against the conventional method in evaluations that use over 1.92 million real (registered) IDNs and over 10,000 malicious IDNs. The results show that our method can automatically detect homograph IDNs that the conventional approach cannot detect.
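To make the limitation of the conventional method concrete, the following is a minimal Python sketch of mapping-based detection. It is an illustration only, not the implementation evaluated in the paper: the table `CONFUSABLES` and the function names `to_ascii_skeleton` and `is_homograph` are hypothetical, and the table is deliberately tiny.

```python
# Hypothetical, deliberately tiny mapping from non-ASCII characters to the
# ASCII characters they resemble. A real table would be far larger and, as
# the abstract notes, would have to be maintained by hand.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
}


def to_ascii_skeleton(name: str) -> str:
    """Replace every mapped character with its ASCII look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)


def is_homograph(idn: str, legitimate: str) -> bool:
    """Flag `idn` if it differs from `legitimate` but maps onto it."""
    idn = idn.lower()
    legitimate = legitimate.lower()
    return idn != legitimate and to_ascii_skeleton(idn) == legitimate


# Cyrillic "а" (U+0430) is in the table, so this look-alike is caught.
print(is_homograph("ex\u0430mple.com", "example.com"))  # True

# Latin alpha "ɑ" (U+0251) is not in the table, so this one is missed.
print(is_homograph("ex\u0251mple.com", "example.com"))  # False
```

The second call shows the gap that motivates an OCR-based approach: a character that looks like "a" but is absent from the table passes undetected until someone adds it manually.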
