Breaking CAPTCHA By Willer Travassos
What is CAPTCHA? CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a challenge-response test used to ensure that a response comes from a human being. It usually requires users to type the letters and/or digits (the response) shown in an image that appears on screen (the challenge).
Why use it & Who uses it CAPTCHAs are used to block automated actions that might degrade quality of service (QoS), whether through resource exhaustion or resource abuse. They are commonly used to stop spamming, prevent automated postings, and limit excessive automated probing of a resource. Services like Gmail, Yahoo! Mail, forums, wikis, and many others use CAPTCHAs to prevent at least one of the three.
Reason to break it There may be several reasons to break a CAPTCHA, usually depending on the service that a person is trying to use. Here we concentrate on the CAPTCHAs in mail services, especially Yahoo! Mail, Gmail, and Windows Live Hotmail. The four main reasons to break mail CAPTCHAs are as follows:
Reason to break it
1. Signing up for a mail service gives access to a wide array of other services.
2. Those three companies are unlikely to be blacklisted.
3. Those three services are free to sign up for.
4. It is hard to track who is using the accounts and their services for spamming, since millions of other users rely on these services.
Overview of how each of the 3 CAPTCHAs was broken Hackers seem to take a similar approach to breaking CAPTCHAs. It consists of a client-server architecture in which the server program sits and waits for CAPTCHA information from the client program (both programs might be located on infected machines). The client program is responsible for reading the image files containing the CAPTCHAs and sending them to the server. Once the CAPTCHA image is received, the server tries to break it.
A more concrete way of attacking CAPTCHA As you can imagine, I was not clear on how the server program breaks a CAPTCHA. Reports on the breaking of the three mail services were never clear about the process. The sites that contained information on these programs did not make clear whether breaking the CAPTCHA was fully automated. And the hackers' sites were in Russian.
A more concrete way of attacking CAPTCHA According to “Computers beat humans at single character recognition in reading-based Human Interaction Proofs”, by K. Chellapilla, K. Larson, P. Simard, M. Czerwinski, computers are really good at single character recognition. So breaking a CAPTCHA becomes a matter of segmenting/separating the characters in the CAPTCHA text.
MSN CAPTCHA One thing to note is that there are several flavors of text-based CAPTCHAs. Thus, a method for breaking a CAPTCHA is usually specific to its particular flavor. The method shown here is the one designed for breaking the MSN CAPTCHA (before the changes made to it). The structure of the MSN CAPTCHA is as follows:
MSN CAPTCHA
1. It consists of eight characters.
2. Only upper-case letters and digits are used.
3. The challenge text is dark blue, distinct from the background color.
4. Warping is used to distort both the individual characters and the CAPTCHA as a whole.
MSN CAPTCHA
5. Random dashes/arcs of different thicknesses and sizes are added as an anti-segmentation measure. They can be divided into 3 categories:
–Thick Arcs: they have the same color as the text, and do not intersect any characters.
–Thin Arcs: they have the same color as the text, and intersect with characters and other arcs.
–Thin Background Arcs: they have the same color as the background, and intersect characters, removing pixels from them.
Segmenting MSN CAPTCHA An attack on the MSN CAPTCHA has to take the following tasks into consideration:
–Identification and removal of arcs.
–Identification of character locations, and division of characters.
To break an MSN CAPTCHA, an attack follows these 7 steps:
Segmenting MSN CAPTCHA
1. Binarization
2. Broken Character Correction
3. Vertical Segmentation
4. Color filling segmentation
5. Thick Arc Removal
6. Locating Connected Characters
7. Segmentation of Connected Characters
Binarization The MSN CAPTCHA contains different shades of blue in the same image. Thus, we convert the image to black and white so that background and foreground are easy to separate.
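A minimal sketch of the binarization idea, assuming a fixed brightness threshold on a grid of (r, g, b) tuples. The function name `binarize`, the threshold of 128, and the luma weights are illustrative assumptions; the slides do not say how the cut-off is chosen.

```python
def binarize(pixels, threshold=128):
    """Map a grid of (r, g, b) tuples to 1 (foreground) or 0 (background).

    Assumption: text is darker than the background, so any pixel whose
    brightness falls below `threshold` is treated as foreground.
    """
    binary = []
    for row in pixels:
        out = []
        for r, g, b in row:
            # Common luma approximation of perceived brightness.
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            out.append(1 if luma < threshold else 0)
        binary.append(out)
    return binary


# Tiny hypothetical image: dark and medium blues on a pale background.
image = [
    [(230, 235, 250), (20, 30, 120), (225, 230, 245)],
    [(10, 20, 90), (40, 60, 160), (235, 240, 255)],
]
print(binarize(image))  # [[0, 1, 0], [1, 1, 0]]
```

Both the dark and the medium blue end up as foreground, which is the point of the step: several shades collapse into a single text color.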
Broken Character Correction Here we take care of thin background arcs that remove parts of characters. Background arcs are usually 1-2 pixels wide, and they become more pronounced after binarization. This step is necessary because we want to keep each character whole and avoid having pieces of a character treated as arcs.
Broken Character Correction Thankfully, the method to restore characters is a simple one. It consists of checking the immediate vertical and horizontal neighbors of each background-colored pixel. If 2 of the pixels surrounding it are of foreground color, then turn that pixel into foreground.
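The neighbor check can be sketched as follows. This reads the rule as "at least two of the four vertical/horizontal neighbors are foreground", which is one interpretation of the slide; the function name `repair_broken_characters` and the `min_neighbors` parameter are hypothetical.

```python
def repair_broken_characters(binary, min_neighbors=2):
    """Fill background pixels that have enough foreground 4-neighbors.

    `binary` is a grid of 1 (foreground) and 0 (background). Decisions are
    made against the original grid, so a repaired pixel never triggers
    further repairs within the same pass.
    """
    height, width = len(binary), len(binary[0])
    repaired = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 1:
                continue
            count = 0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    count += binary[ny][nx]
            if count >= min_neighbors:
                repaired[y][x] = 1
    return repaired


# A horizontal stroke cut by a one-pixel-wide background arc.
stroke = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
print(repair_broken_characters(stroke)[1])  # [1, 1, 1, 1, 1]
```

Only the gap in the middle row is filled; the pixels above and below the stroke each have a single foreground neighbor and stay as background.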
Broken Character Correction Results
Vertical Segmentation The first attempt to segment a CAPTCHA is made here, by cutting it vertically into chunks containing one or more letters. This is done by mapping the CAPTCHA into a histogram that represents the number of foreground pixels per column in the image. Columns with no foreground pixels mark where the image can be cut.
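A small sketch of the histogram idea, assuming that a chunk is any run of adjacent columns containing at least one foreground pixel. The helper names `column_histogram` and `vertical_chunks` are made up for illustration.

```python
def column_histogram(binary):
    """Number of foreground pixels in each column of a 0/1 grid."""
    return [sum(column) for column in zip(*binary)]


def vertical_chunks(binary):
    """Return (first_column, last_column) for each run of non-empty columns."""
    hist = column_histogram(binary)
    chunks, start = [], None
    for x, count in enumerate(hist):
        if count > 0 and start is None:
            start = x                      # a new chunk begins
        elif count == 0 and start is not None:
            chunks.append((start, x - 1))  # an empty column closes the chunk
            start = None
    if start is not None:                  # chunk that runs to the right edge
        chunks.append((start, len(hist) - 1))
    return chunks


image = [
    [1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1],
]
print(column_histogram(image))  # [3, 2, 0, 0, 2, 1, 3]
print(vertical_chunks(image))   # [(0, 1), (4, 6)]
```

A chunk found this way may still hold several objects, for example characters that overlap horizontally without touching, which is why later steps are needed.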
Color filling segmentation Here a color filling segmentation algorithm is used to color each connected component/object (arc or character) with a different color. It detects a foreground pixel and then traces all of its foreground neighbors until all pixels in this object have been traversed. Then it looks for another foreground pixel outside of the current object, and repeats the previous process until every object is located.
Color filling segmentation While traversing each pixel of an object, the algorithm colors the pixels that it traverses a certain color. This helps further segment letters in a CAPTCHA, since colors will give away objects that could not be segmented in the Vertical Segmentation step.
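Color filling segmentation is essentially connected-component labeling by flood fill. The sketch below uses integer labels in place of colors and assumes 8-connectivity (diagonal neighbors count as connected); the slides do not state which connectivity is used, and the name `color_fill_segments` is hypothetical.

```python
from collections import deque


def color_fill_segments(binary):
    """Label each connected foreground object with its own integer "color".

    Returns (labels, object_count); background pixels keep the label 0.
    """
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or labels[y][x] != 0:
                continue  # background, or already part of a labeled object
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of this object
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label


# Two objects that share columns, so a column histogram sees one chunk.
image = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
]
labels, count = color_fill_segments(image)
print(count)   # 2
print(labels)  # [[1, 1, 1, 0, 0], [0, 0, 0, 0, 0], [0, 2, 2, 2, 2]]
```

Every column in this example contains foreground pixels, so Vertical Segmentation would report a single chunk, while the fill separates the two objects.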
Thick Arc Removal Once CFS (color filling segmentation) is done, we look into the characteristics of thick arcs, and how we can recognize them. General characteristics of such arcs are as follows:
–Usually made up of a small number of pixels.
–Do not contain circles (closed loops), unlike chars such as A and B.
–Usually located near the border of the image.
–Shape x Location relation, ex: arcs at the beginning of the image are usually tall and narrow. Arcs at the end tend to be wider and short.
Thick Arc Removal One thing to note is that thick arcs never cross a character, unless a thin arc (which can cross a character) crosses the thick arc, or the Broken Character Correction joins it with a character.
Thick Arc Removal To remove a thick arc, the following procedures are used:
Circle Detection:
–Draw a bounding box around an object.
–Use color filling to color all the background not enclosed by the character.
–Scan the box for pixels that still have the original background color; if any remain, the char has a circle.
If there is a circle, we skip the remaining steps. If not, we go on to arc detection and removal.
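The circle detection steps above can be sketched as a flood fill of the outside background. The sketch assumes the object has already been cropped to its bounding box as a 0/1 mask, and it pads the box with one ring of background so the fill can travel around the object; the padding and the name `has_closed_loop` are assumptions, not details from the slides.

```python
def has_closed_loop(mask):
    """True if the object encloses background pixels (a "circle").

    `mask` is the object's bounding box as a grid of 1 (object) and
    0 (background).
    """
    height, width = len(mask), len(mask[0])
    # Pad with a ring of background so the outside region is one connected area.
    padded = [[0] * (width + 2)]
    padded += [[0] + row[:] + [0] for row in mask]
    padded += [[0] * (width + 2)]

    outside = 2  # the "color" painted onto background reachable from outside
    padded[0][0] = outside
    stack = [(0, 0)]
    while stack:
        y, x = stack.pop()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < height + 2 and 0 <= nx < width + 2
                    and padded[ny][nx] == 0):
                padded[ny][nx] = outside
                stack.append((ny, nx))

    # Any pixel still holding the original background value is enclosed.
    return any(0 in row for row in padded)


letter_o = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
arc = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(has_closed_loop(letter_o))  # True
print(has_closed_loop(arc))       # False
```

The fill moves only through vertical and horizontal neighbors, so it cannot leak diagonally through a stroke that is connected corner to corner.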
Thick Arc Removal Scan objects that passed the first step:
–We count the number of pixels in each object in order to differentiate chars from arcs, and remove the arcs.
Relative position checking:
–We look at the chunks of objects that we got from CFS and Vertical Segmentation.
–The positioning of the objects in these chunks can then tell us whether they are arcs or not, which is a removal criterion. Ex: chars are usually close to the horizontal midline (equator) of the image, and arcs are at the extremities.
Thick Arc Removal
Detection of the remaining arcs:
–We count the number of objects left in the image. If the count is greater than eight, then there are still arcs left. Usually, the arc is either the last or the first object.
–We then check which of these objects has a circle in it. If neither contains a circle, we remove the object with the smallest pixel count. We repeat until there are 8 objects left.
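A rough sketch of this pruning loop, assuming each object is summarized as a dict with a pixel count and a circle flag, ordered left to right. The slide only covers the case where neither end object has a circle; here, an object with a circle is kept in preference to one without, and ties fall back to pixel count. That tie-breaking, the dict layout, and the name `prune_remaining_arcs` are all assumptions.

```python
def prune_remaining_arcs(objects, expected_chars=8):
    """Drop end objects until only `expected_chars` remain.

    `objects` is a left-to-right list of dicts with the hypothetical keys
    "pixels" (pixel count) and "has_loop" (result of circle detection).
    """
    remaining = list(objects)
    while len(remaining) > expected_chars:
        first, last = remaining[0], remaining[-1]
        if first["has_loop"] and not last["has_loop"]:
            remaining.pop()        # only the first has a circle: drop the last
        elif last["has_loop"] and not first["has_loop"]:
            remaining.pop(0)       # only the last has a circle: drop the first
        elif first["pixels"] <= last["pixels"]:
            remaining.pop(0)       # otherwise drop the smaller end object
        else:
            remaining.pop()
    return remaining


# Eight characters flanked by two leftover arcs (made-up pixel counts).
chars = [{"name": f"c{i}", "pixels": 120, "has_loop": i % 2 == 1}
         for i in range(8)]
objects = ([{"name": "arc1", "pixels": 30, "has_loop": False}]
           + chars
           + [{"name": "arc2", "pixels": 45, "has_loop": False}])

kept = prune_remaining_arcs(objects)
print([obj["name"] for obj in kept])
# ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7']
```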
Locating Connected Characters This step tries to take care of the chars not detected in CFS and Vertical Segmentation, by estimating how many characters are connected. We exploit the design of the MSN CAPTCHA to figure out connected characters:
–Objects containing two or more chars are always wide, never tall, i.e., chars are not on top of each other, always side-by-side.
–A single character, as a rule, does not surpass 35 pixels in width after being normalized.
–MSN CAPTCHA always uses 8 chars.
Locating Connected Characters Using this information, we can guess which chunks contain connected chars, and how many of them.
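One way such a guess could work, sketched under stated assumptions: each object gets at least one character, an object wider than 35 pixels gets one character per 35 pixels of width, and any characters still missing from the total of 8 go to the object with the most width per character. This greedy rule and the name `estimate_char_counts` are illustrative, not the published method.

```python
import math


def estimate_char_counts(widths, total_chars=8, max_char_width=35):
    """Guess how many characters each object holds from its pixel width."""
    # Width rule: no single character is wider than `max_char_width`.
    counts = [max(1, math.ceil(w / max_char_width)) for w in widths]
    # If characters are still unaccounted for, assign them one at a time
    # to the object that currently has the most width per character.
    while sum(counts) < total_chars:
        widest = max(range(len(widths)), key=lambda i: widths[i] / counts[i])
        counts[widest] += 1
    return counts


# Seven objects found so far; the third is too wide to be one character.
widths = [30, 33, 68, 29, 31, 34, 30]
print(estimate_char_counts(widths))  # [1, 1, 2, 1, 1, 1, 1]
```

Here the 68-pixel object is flagged as holding two connected characters, which brings the total to the 8 that the MSN CAPTCHA always uses.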
Segmentation of Connected Characters With the previous step having located the characters and determined the n non-connected objects in the image, we can segment the leftover c connected characters (where n + c = 8) by:
–Finding the width of the connected object.
–Dividing the object into c parts of equal size, thus getting 8 final chars with 90% accuracy.
Using Segmentation and Recognition, MSN CAPTCHA was broken with a success rate of 61%.
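The even split can be sketched with integer arithmetic on column indices. The function name `split_evenly` and the inclusive (left, right) column convention are assumptions made for the example.

```python
def split_evenly(left, right, parts):
    """Cut the column range [left, right] into `parts` near-equal slices.

    Returns a list of inclusive (start, end) column pairs. When the width
    is not divisible by `parts`, slice widths differ by at most one column.
    """
    width = right - left + 1
    bounds = []
    for i in range(parts):
        start = left + (i * width) // parts
        end = left + ((i + 1) * width) // parts - 1
        bounds.append((start, end))
    return bounds


# A 60-column object estimated to hold two connected characters.
print(split_evenly(10, 69, 2))  # [(10, 39), (40, 69)]
# A 10-column object cut into three slices.
print(split_evenly(0, 9, 3))    # [(0, 2), (3, 5), (6, 9)]
```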
Microsoft’s Response In response to the news of the MSN CAPTCHA being broken, Microsoft introduced a new scheme that tries to mitigate the published attacks.
Example of what they can do with spam For example, when a hacker gains access to a Gmail account, he/she has access to social networks, Google applications, and free web hosting.
Ex1: A hacker may use Google Pages to redirect people to blacklisted sites, since this gets around numerous well-established spam filters, which do not block the GooglePages subdomain because Google is widely whitelisted.
Ex2: A hacker may use Orkut (Google’s social network, comparable to Facebook) to write scrap/wall messages that redirect users to other web pages, or execute code, infecting a machine with malware.
New CAPTCHAS Due to the weaknesses of text-based CAPTCHAs, new CAPTCHA flavors are being developed in the hope of replacing the text-based version. Most of them rely on the human ability to understand the meaning of pictures, geometric shapes, and points within shapes and pictures.
New CAPTCHAS One of the new CAPTCHAs is called Kitten Auth, and it uses pictures of animals to determine whether a user is human. Images (all different) are laid out on a grid, and the challenge asks the user to click on all animals of a certain type. If the user gets them all, they pass.
New CAPTCHAS Imagination is a two-step CAPTCHA that first asks the user to CLICK on the geometric center of any of the pictures displayed. Once the user passes the first test, he/she is asked to ANNOTATE (recognize), through radio buttons, the object being displayed in an image.
Questions?
References
Jeff Yan, Ahmad Salah El Ahmad. “A Low-cost Attack on a Microsoft CAPTCHA”. School of Computing Science, Newcastle University, UK.
K. Chellapilla, K. Larson, P. Simard, M. Czerwinski. “Computers beat humans at single character recognition in reading-based Human Interaction Proofs”. 2nd Conference on Email and Anti-Spam (CEAS).
“Network Security Research and AI”. research.blogspot.com/2008/01/yahoo-captcha-is-broken.html. [Feb. 27, 2008]
WebSense. “Sumeet Prasad”. [Feb. 27, 2008]
Ryan Naraine, Dancho Danchev, Adam O'Donnell. “Zero Day Blog”. [Feb. 27, 2008]
Spam Trackers. [Feb. 28, 2008]
“13BIT IT news Blog”. [March 2, 2008]
“Spybot Search & Destroy Forums”. [March 2, 2008]
Sam Hocevar. “PWNTCHA”. [March 2, 2008]
“Kitten Auth”. [March 3, 2008]
“Three Lights Bright”. busted.html. [March 3, 2008]
“IMAGINATION”. [March 2, 2008]