Sunday 13 October 2013

Scraping websites to identify if one's logo exists

This post is a demo for a project that scrapes websites to check whether they accept cards from a particular credit card company.

Typically, a website that accepts a card shows the logo of the credit card company on its payments page.

To do this task manually, a person would open each link, go to the payments page, and check whether the relevant logo is there.

However, if a machine were to do this task, as per the code I have written it would go through the following steps:

1. Open the webpage to be tested

For example:

import urllib  # Python 2 standard library
link='http://www.listphile.com/Fortune_500_Logos/list'
webpage=urllib.urlopen(link).read()
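
In Python 3 the same fetch goes through `urllib.request`. Below is a minimal sketch, assuming Python 3. The helper name `fetch_html`, the timeout, and the UTF-8 fallback are my own choices for illustration, not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(link, timeout=10):
    """Download a page and return its content as text."""
    with urlopen(link, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8
        # (the fallback is an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('http://www.listphile.com/Fortune_500_Logos/list')` would then play the role of `webpage` in the snippet above.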

2. Look for all the href links that lead to other pages. For now, let's assume that we land directly on the payments page
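
This step has no code in the post, so here is a rough sketch of how the links could be collected with the Python 3 standard library. The names `LinkCollector`, `extract_links` and `likely_payment_links`, as well as the keyword list, are hypothetical and only meant to illustrate the idea.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def likely_payment_links(links, keywords=("payment", "checkout", "billing")):
    """Keep links whose URL contains one of the (assumed) keywords."""
    return [url for url in links if any(k in url.lower() for k in keywords)]
```

Matching on keywords in the URL is only a guess at how a payments page might be recognised; real sites will need their own rules.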

3. Find all logos existing in that page

import re
# Each match is a tuple: (logo page name, image path, title)
findlogo=re.compile('<a href="/Fortune_500_Logos/(.*)"><img alt="Thumb" height="62" src="(.*)" title="(.*)')
findnewlogo=re.findall(findlogo,webpage)
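
To see what `findall` returns for a pattern of this shape, here is a small sketch in Python 3 run against a made-up line of HTML. The sample markup is hypothetical and the real page may differ. I have also swapped in non-greedy groups `(.*?)` and a closing quote after the title, so that each group stops at the first matching quote instead of running to the end of the line.

```python
import re

# Same pattern shape as above, but with non-greedy groups and a closing
# quote after the title (a change from the original pattern).
find_logo = re.compile(
    r'<a href="/Fortune_500_Logos/(.*?)"><img alt="Thumb" height="62" '
    r'src="(.*?)" title="(.*?)"'
)

# Hypothetical sample markup, only for illustration.
sample = (
    '<a href="/Fortune_500_Logos/acme"><img alt="Thumb" height="62" '
    'src="/images/acme_thumb.png" title="Acme" /></a>'
)

# Each match is a tuple: (logo page name, image path, title)
logos = find_logo.findall(sample)
```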

4. Extract logos from the website

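# "local drive/" is a placeholder for the local folder where the logos are saved
for i in xrange(min(100, len(findnewlogo))):
    urllib.urlretrieve("http://www.listphile.com"+findnewlogo[i][1], "local drive/"+findnewlogo[i][0]+".png")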


5. Resize all the extracted logos so that they can be compared

from PIL import Image
im1=Image.open("E:/amex/logo comparison/"+findnewlogo[j][0]+".png")
im1=im1.resize((30,30))  # resize returns a new image, so keep the result

6. Compute the difference between extracted logo and the logo of the company we wanted to check for match

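import math, operator
from PIL import ImageChops
h = ImageChops.difference(im11, im22).histogram()  # im11, im22: the two resized logos, single-band (256-bin histogram)
difference=math.sqrt(reduce(operator.add,map(lambda h, i: h*(i**2), h, range(256))))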
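
The one-liner above relies on Python 2 built-ins. Here is a sketch of the same idea in Python 3, written as a plain function over the histogram so that it can be checked without any image files. It assumes a single-band (grayscale) histogram with 256 bins. The name `rms_from_histogram` is hypothetical, and the division by the number of pixels is a normalization I have added, which the original line does not do.

```python
import math


def rms_from_histogram(hist, num_pixels):
    """Root-mean-square pixel difference from a 256-bin difference histogram.

    hist[i] is the number of pixels whose absolute difference equals i.
    Dividing by num_pixels is a normalization added here (an assumption);
    the one-liner above takes the square root of the raw weighted sum.
    """
    if len(hist) != 256:
        raise ValueError("expected a single-band, 256-bin histogram")
    weighted = sum(count * (value ** 2) for value, count in enumerate(hist))
    return math.sqrt(weighted / num_pixels)
```

With Pillow, the histogram could come from `ImageChops.difference(a.convert("L"), b.convert("L")).histogram()`, and `num_pixels` would be `a.size[0] * a.size[1]` (900 for the 30x30 logos above).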

7. Flag the result based on the difference (a difference of zero means the two images are identical)
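
The flagging step could look like the sketch below (Python 3). The function name `flag_matches` and the default threshold are hypothetical. A suitable threshold would have to be tuned on real logos.

```python
def flag_matches(differences, threshold=50.0):
    """Return the names whose difference score is at or below the threshold.

    differences maps a logo name to its difference score against the
    reference logo. The default threshold is an arbitrary placeholder.
    """
    return sorted(name for name, score in differences.items() if score <= threshold)
```

A low score means the extracted logo is close to the reference logo, so the names returned here are the candidates worth reporting as matches.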


In this way, one could reduce a lot of manual intervention.

However, the caveat here is that the program might not land on the payments page at all.
Such cases can be caught by raising a flag when a webpage has too many links to follow: if the program did not find the payments page, the website needs a manual check.
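
A minimal sketch of such a flag in Python 3 is below. The function name `needs_manual_check` and the limit of 200 links are assumptions made for illustration.

```python
def needs_manual_check(num_links, found_payments_page, max_links=200):
    """Flag a site for manual review.

    A site is flagged when the program did not find the payments page
    and the page has more links than max_links (an arbitrary limit).
    """
    return (not found_payments_page) and num_links > max_links
```

For example, `needs_manual_check(350, False)` flags the site, while a site where the payments page was found is never flagged.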
