Wednesday, October 6, 2010

Spamming of HTML forms - one case

Recently I found that a newspaper's online edition had switched from an image-based CAPTCHA to a mathematical puzzle in order to keep computer programs from spamming its comment section. A screen capture of the puzzle can be found below.



The problem with such a system is that the puzzle can easily be solved by a computer, which defeats its purpose of telling humans and computers apart. To test my own skill, I wanted to write a program that could download the page, read it, and solve the puzzle. Using the information obtained, I could then post comments without human intervention.

To accomplish this task, I used the usual suspects: Python and the HTML parser BeautifulSoup. BeautifulSoup reads a string of HTML or XML and converts it to a tree. Using the tree, it is easy to navigate through the tags or search for a particular one based on id or name. It can also select tags by their CSS class.


1. import urllib
2. from BeautifulSoup import BeautifulSoup
3. import string,re

4. doc = urllib.urlopen('http://www.somesight.com/comment/reply/1854565').read()
5. soup = BeautifulSoup(''.join(doc))

6. a = soup.findAll("span",{"class":"field-prefix"})
7. b = a[0].contents[0].split("=")[0].split("+")
8. c = [int(bs) for bs in b]
9. captcha_response = sum(c)
10. print a,captcha_response

11. token1 = soup.findAll("input",id="edit-captcha-token")
12. token1_val = token1[0]['value']
13. print token1,token1_val


The two important pieces of information that I need are the captcha_response, which is the solution to the mathematical problem, and the captcha_token, a hidden HTML field in the web page. Line #6 searches for span tags with the class field-prefix. This tag contains the string for the mathematical puzzle that needs to be solved. Line #7 takes the contents of this tag, keeps the part before the "=" sign, and splits it on "+" to obtain the individual numbers as a list of strings. Finally, I convert those strings to integers in line #8 and sum them in line #9.
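The parsing step can be tried on its own, without downloading anything. The snippet below is a minimal sketch in Python 3 (the listing above is Python 2). It assumes the puzzle text looks like "3 + 4 = ", which is only inferred from the split logic in the listing, and the function name solve_sum_puzzle is made up for illustration.

```python
def solve_sum_puzzle(puzzle_text):
    """Return the sum of the addends in a puzzle such as '3 + 4 = '.

    The puzzle format is an assumption: numbers joined by '+',
    followed by an '=' sign.
    """
    # Keep only the part to the left of '='.
    left_side = puzzle_text.split("=")[0]
    # Split on '+' and convert each piece to an integer;
    # int() tolerates surrounding whitespace.
    addends = [int(part) for part in left_side.split("+")]
    return sum(addends)


# Hypothetical sample puzzle, not taken from the real site.
print(solve_sum_puzzle("3 + 4 = "))
```

A puzzle that uses subtraction or spelled-out numbers would make int() raise a ValueError here, so this only covers the simple addition case described above.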

Line #11 searches for the hidden captcha token, stored in the input tag with id="edit-captcha-token", and line #12 reads its value attribute.
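The same lookup can be illustrated without BeautifulSoup. The sketch below uses Python 3's standard html.parser module instead, so it is a swapped-in technique and not the method used in the listing. The class name HiddenTokenParser, the sample HTML, and the token value "abc123" are all invented for the example.

```python
from html.parser import HTMLParser


class HiddenTokenParser(HTMLParser):
    """Collect the value attribute of the first <input> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Ignore other tags, and stop looking once a match was found.
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("id") == self.target_id:
            self.value = attributes.get("value")


# Hypothetical page fragment; the real page is not reproduced here.
SAMPLE_HTML = (
    '<form>'
    '<input type="hidden" id="edit-captcha-token" value="abc123" />'
    '</form>'
)

parser = HiddenTokenParser("edit-captcha-token")
parser.feed(SAMPLE_HTML)
print(parser.value)
```

If no matching input tag is present, parser.value stays None, which is easier to check for than the IndexError that token1[0] would raise in the listing.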

Armed with these two pieces of information, we can post any name and comment to the form. The comments were moderated, but it would still require a lot of human effort to clear the spam.
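To show why nothing more is needed, here is a rough sketch of how the two values would be combined into a form body. It only builds the URL-encoded string and sends nothing. The field names "name" and "comment" are hypothetical, the real form's field names are not given in this post, and build_form_body is an illustrative helper.

```python
from urllib.parse import urlencode, parse_qs


def build_form_body(name, comment, captcha_response, captcha_token):
    """Return a URL-encoded form body; no request is made."""
    fields = {
        "name": name,                               # hypothetical field name
        "comment": comment,                         # hypothetical field name
        "captcha_response": str(captcha_response),  # solved puzzle
        "captcha_token": captcha_token,             # hidden field value
    }
    return urlencode(fields)


# Sample values only; 7 and "abc123" are made up.
body = build_form_body("Test User", "Hello", 7, "abc123")
print(body)
```

Because both CAPTCHA fields can be derived from the page itself, the puzzle adds no real barrier for a script.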

I informed the webmaster of this issue. They have since moved to an image-based system. I removed all references to the site from this blog post and the program in order to preserve its anonymity.
