What Are CAPTCHA and reCAPTCHA? How Do They Work?


Hi, everyone. I'm sure most of you have seen those squiggly characters that you have to type all over the Internet, for example, when you're getting a Gmail account or trying to buy tickets online. That thing is called a CAPTCHA. And the reason it's there is to make sure that you, the entity filling out the form, are actually a human and not some sort of computer program that was written to submit the form millions and millions of times. The reason it works is that humans, at least non-visually impaired humans, have no trouble reading these distorted squiggly characters, whereas computer programs simply can't do it as well yet. So for example, the reason you have to type a CAPTCHA when buying tickets online for a show is to prevent scalpers from writing programs that try to buy millions of tickets, two at a time.


Now, CAPTCHAs are a very effective tool against spammy behavior. Chances are that if you allow any type of user interaction on your site, you are susceptible to automated attacks by so-called bots. And CAPTCHAs can help you mitigate this problem. For example, in comment forms: if you have a blog that allows comments, you are susceptible to so-called comment spam. Spammers actually write programs that search the whole Web for unprotected blogs to which they can post comments that advertise their products. One possible solution against this is to use a CAPTCHA to make sure that only humans can comment on your blog.

There are many other cases in which CAPTCHAs are necessary to prevent automated abuse, such as account sign-ups, forms through which people can contact you, etcetera. Now, because we here at Google hate spam as much as you do, we provide a free CAPTCHA service called reCAPTCHA. Here, I'm going to explain a few things about why reCAPTCHA is so cool and why it is so successful at stopping automated abuse. First, I should say that reCAPTCHA is fundamentally different from every other CAPTCHA out there. And to explain why, it's [INDISTINCT] how we originally got the idea for reCAPTCHA. Okay, so basically, about three years ago, we did this calculation: we wondered how many CAPTCHAs are typed every day by people around the world. And after doing a simple calculation, it turns out that approximately 200 million CAPTCHAs are typed every day by people around the world.
           
Okay, I type a CAPTCHA roughly once every two or three days because I use the Internet very heavily, and it turns out there are a lot of people like me. And so, if you add it all up, approximately 200 million CAPTCHAs are typed every day by people around the world. Now, each time you type a CAPTCHA, you essentially waste 10 seconds of your time; it takes about 10 seconds to type a CAPTCHA. And if you multiply that by 200 million, you get that humanity as a whole is wasting approximately 500,000 hours every day typing CAPTCHAs.
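As a quick back-of-the-envelope check of that figure, here is a minimal Python sketch. The two inputs are the numbers quoted in the talk; the function and variable names are just illustrative choices, not anything from reCAPTCHA itself.

```python
# Inputs quoted in the talk; the names below are illustrative only.
CAPTCHAS_PER_DAY = 200_000_000   # CAPTCHAs typed worldwide each day
SECONDS_PER_CAPTCHA = 10         # rough time to type one CAPTCHA


def hours_spent_per_day(captchas_per_day: int, seconds_each: int) -> float:
    """Convert a daily CAPTCHA count into total human-hours spent per day."""
    return captchas_per_day * seconds_each / 3600  # 3600 seconds in an hour


total_hours = hours_spent_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

The exact product is a little over 555,000 hours, which is consistent with the "approximately 500,000 hours" quoted above.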

Now, of course, we can't just get rid of CAPTCHAs, because the security of the Web depends on them. So, the idea of reCAPTCHA was born by wondering: is there any way in which we can use these 500,000 hours for something that's good for humanity? Another way of putting it is, is there some way in which you can use those 10 seconds while you're typing a CAPTCHA for something that's useful? So think about it: while you're typing a CAPTCHA, during those 10 seconds, your brain is doing something amazing. Your brain is doing something that computers cannot yet do. So, is there any way in which we can use this effort for something good? And it turns out, the answer is yes.

And this is what we're doing with reCAPTCHA. With reCAPTCHA, not only are you authenticating yourself as human, but in addition, you're helping to digitize books and newspapers. And let me explain how. There are many projects out there trying to digitize old books and newspapers so that anybody on the Web can access them. And the way book digitization works is that first, you start with a book in physical form. Then, you scan this book. Now, scanning a book is like taking a digital photograph of every page of the book. You get an image for every page of the book, okay? The next step in the process is that the computer needs to be able to decipher all of the characters that are in this image. And the reason for that is so that people can search through the book. This is done using a technology called OCR, for Optical Character Recognition. So, OCR takes an image with text in it and tries to decipher what the text in it is. Now, the problem with OCR is that it doesn't always work very well. For things that were written a long time ago, because the ink has faded and the pages have turned yellow, OCR cannot recognize many of the characters.

In this example on the screen, if you run OCR on this, if you try to get the computer to decipher all of these characters, here's what you get, okay? So basically, you can see all the mistakes that it makes. So, what we're doing now is we're taking all the words that the OCR cannot recognize, all the words the computer cannot recognize in the book digitization process, and we're using them as CAPTCHAs, so that the answers people enter to CAPTCHAs are being used to help us correct these mistakes. Okay, so the fundamental idea of reCAPTCHA is that the answers that people enter are used to correct the mistakes in the book digitization process. So, let me explain the exact process of how this works.

So, first, we start with a scan of an old book. Then we take all the words that the computer cannot recognize. It turns out the computer actually tells us when it cannot recognize a word. So, we're going to take all the words the computer cannot recognize, and we're going to use these as the basis for reCAPTCHA. Notice these are perfect for a CAPTCHA because these are, by definition, words that the computer cannot recognize. And so, we take this word, we distort it even further to really make sure that the computer cannot recognize it, and then we use this as a CAPTCHA.

Now, you may be wondering, how can we use this word as a CAPTCHA if the system doesn't know the answer for it? This is a word that the system just took out of a book, for which it didn't know the answer. It should be the case that if we use it as a CAPTCHA somewhere, when you're buying tickets online for example, we should only let you through if you type the word correctly. But how can the system do this when it doesn't know the answer? The solution to this problem is that whenever we give a user a word for which the system doesn't know the answer, we actually give it to them along with another word, one for which the system does know the answer. So, we actually give the users two words, one for which the system knows the answer and one for which it doesn't. And we ask them to type both words.

Well, of course, we don't tell the users which word is which. And if the user enters the correct answer for the word for which the system already knows the answer, we assume that the user is a human. And we also get some confidence that they entered the correct answer for the other word. And if we repeat this process multiple times, giving the new word to different people, and they all type the word "between," then we know with overwhelmingly high probability that this word really is the word "between." So, that's the basic idea of how reCAPTCHA works.
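The two-word scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual implementation: the class and function names, the word identifier "scan-17", the control word "morning", and the agreement thresholds (`min_votes`, `min_agreement`) are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


def normalize(text: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


@dataclass
class Challenge:
    control_answer: str    # word for which the system already knows the answer
    unknown_word_id: str   # hypothetical ID of a word the OCR could not read


@dataclass
class VoteStore:
    min_votes: int = 3           # assumed threshold, not from the talk
    min_agreement: float = 0.75  # assumed share of votes that must agree
    votes: dict = field(default_factory=dict)

    def record(self, word_id: str, guess: str) -> None:
        """Count one human guess for an unknown word."""
        self.votes.setdefault(word_id, Counter())[normalize(guess)] += 1

    def accepted_reading(self, word_id: str) -> Optional[str]:
        """Return the agreed reading once enough people have agreed on it."""
        counts = self.votes.get(word_id)
        if not counts:
            return None
        guess, n = counts.most_common(1)[0]
        total = sum(counts.values())
        if n >= self.min_votes and n / total >= self.min_agreement:
            return guess
        return None


def check(challenge: Challenge, control_guess: str,
          unknown_guess: str, store: VoteStore) -> bool:
    """Pass the user only if the known word is right; then keep their other guess."""
    if normalize(control_guess) != normalize(challenge.control_answer):
        return False  # failed the known word: treat as a bot, discard the guess
    store.record(challenge.unknown_word_id, unknown_guess)
    return True


store = VoteStore()
challenge = Challenge(control_answer="morning", unknown_word_id="scan-17")
for guess in ["between", "Between", "between"]:
    check(challenge, "morning", guess, store)   # three humans agree
check(challenge, "mourning", "betwixt", store)  # wrong control word: ignored
print(store.accepted_reading("scan-17"))        # prints "between"
```

Note that a guess for the unknown word is only counted when the same user got the known word right, which is what gives the system some confidence in that guess.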

Now, we put a lot of effort into making reCAPTCHA the best CAPTCHA out there. For example, reCAPTCHA is also accessible to visually impaired users because it has an audio alternative. It is very important that if you use a CAPTCHA on your site, you provide an audio alternative, because blind people cannot get past image-based CAPTCHAs. You see, blind people surf the Web using screen readers, programs that read the entire screen to them out loud. Now, whenever a screen reader gets to an image of a CAPTCHA, it can't read it because, by definition, the screen reader is itself a program. So, if your CAPTCHA doesn't provide an audio alternative, blind people are locked out of your site. And with reCAPTCHA, we provide an audio alternative.

Hundreds of thousands of sites on the Web use reCAPTCHA, also because we put a lot of effort into balancing security and usability. We want reCAPTCHA to be impossible to read for computers but also easy to read for humans. One of the things we do to achieve this balance between usability and security is that we have multiple levels of difficulty for the distorted images of reCAPTCHA. For most users, the first CAPTCHA that they see is relatively easy to read. But if they attempt to solve too many CAPTCHAs in a short period of time, with reCAPTCHA, we start giving them harder and harder CAPTCHAs, in case it's somebody trying to attack your site.

Also, reCAPTCHA is a Web service, meaning that the reCAPTCHA images that are served to your users actually come directly from our servers. This is good for many reasons. First, reCAPTCHA doesn't require you to download a program and use your server's resources to generate the distorted images. We do that for you. Second, and most importantly, if we ever notice that automated bots are able to read reCAPTCHA images, because we watch our logs very carefully, we can immediately change the distortions of the images without you having to download an update. Okay, so this ensures that reCAPTCHA is effective at stopping spam at all times. Well, hopefully I've conveyed our excitement for our product. reCAPTCHA is very easy to install. So, if you need a CAPTCHA on your site, you should use reCAPTCHA, especially because it's free.