Saturday, July 21. 2007
Posted by Jonathan Street
in AJAX, PHP Programming, Programming, Web Tools at 22:58
Comments (0)
Trackbacks (0)
Xenu: Stats aggregation for any site
I've previously mentioned popuri.us as one of the better examples of a website stats aggregator but I think my loyalty is switching to Xenu.
It was initially flagged by techcrunch about a week ago. At the time it was struggling under the onslaught of being highlighted both on techcrunch and elsewhere. A few days later seomoz brought to my attention that the load had got so bad that the creator felt unable to cope and had released the source code to the community. There are now 13 mirrors you can use including Italian, German, Bulgarian, French and Dutch versions.
Source Code
The backend is all PHP while the frontend relies heavily on javascript. The source code is an interesting, though far from easy, read. Some of the functionality is, in my opinion, needlessly excessive. I really don't see the point of being able to drag the stats boxes around the screen, for instance. The source code would also benefit from descriptive filenames, principally in the results folder, where stats are returned by 46 files cunningly named 1-46.
Despite this, there are some real insights to be had for anyone interested in scraping these sorts of stats. For instance, a stunningly simple method is used to grab the alexa rank, one I hadn't come across before. It doesn't involve paying to access the API and you don't need to wrestle a css file into submission to extract the rank from the alexa site.
The Spoiler
The alexa data is returned in file number 7. Just in case you're not overly thrilled by the notion of opening all 46 files to find the one that accesses the service you're particularly interested in, there is a useful key in the following file: js/general_without_encryption.js
Sunday, July 8. 2007
Posted by Jonathan Street
in PHP Programming, Programming, Web Tools at 13:44
Comment (1)
Trackbacks (0)
Web Scraping at Seomoz
Scraping content from the web is a real pain. In fact, at the moment, I can't think of any programming task I like less. I'm sure they exist; I just can't think of them. Feel free to remind me in the comments.
I'm setting up a tool at the moment that is going to use a fair amount of web scraping. Sadly not all sites are kind enough to provide an API like Compete and some actively work to make your life more difficult. The three key points that I feel make web scraping a real pain are:
- Undocumented interface
- We're dealing with essentially unstructured HTML here. Web scraping also highlights just how little many companies care about validation and writing readable code. I'm assuming their code is a pain to read because they compact their files to save on bandwidth. If not, then maintaining these sites could be an even bigger pain than scraping them.
- No warning of changes
- You're essentially at the mercy of the designer. If they change their layout your code breaks.
- Timeouts and blank pages
- Even with all the code in place you may still get no result. The site may be down or, if you're being too aggressive, your requests may be throttled. Even if you do get to a page it may not be the one you want. Does a blank page mean the site has no info for the query you made, or does it mean there was some sort of error? If the site is using 'soft' error messages it may be difficult to know.
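To make the last point concrete, here is a minimal sketch (in Python, for brevity) of how a scraper might sort a response into buckets before trying to parse it. The function name, the category labels and the list of 'soft' error phrases are all my own assumptions for illustration; a real site will need its own list, and a blank page can still be ambiguous.

```python
from typing import Optional

# Hypothetical phrases a site might show on a 'soft' error page, i.e. an
# error message served with a normal 200 status. These are assumptions.
SOFT_ERROR_MARKERS = (
    "temporarily unavailable",
    "too many requests",
    "please try again",
)


def classify_response(status: Optional[int], body: Optional[str]) -> str:
    """Return 'timeout', 'http_error', 'blank', 'soft_error' or 'ok'."""
    # No status or body at all: the request never completed.
    if status is None or body is None:
        return "timeout"
    # The server reported an error honestly.
    if status >= 400:
        return "http_error"
    text = body.strip()
    # A 200 with nothing in it: no data, or a silent failure. We can't tell.
    if not text:
        return "blank"
    lowered = text.lower()
    # A 200 whose body looks like an error message.
    if any(marker in lowered for marker in SOFT_ERROR_MARKERS):
        return "soft_error"
    return "ok"
```

Only responses classified as 'ok' would be passed on to the parsing code; the other categories can be logged or retried separately.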
With my recent work on web scraping I found the latest 'Whiteboard Friday' video from seomoz to be exceptionally timely. I generally read the seomoz blog for the discussion on marketing issues, but they also have a couple of tools which are heavily reliant on web scraping, and on Friday the issues surrounding this were discussed. Apparently they've been having trouble providing data for all their visitors due to the ever increasing demand. As well as the video, Matt, who is their CTO and web developer, goes into a fair amount of detail on their process, and some of it really depressed and/or surprised me:
"One of the most unreliable APIs I've had to deal with is the Alexa / Amazon API, which is funny because it's the only one that costs money."
Sadly this isn't the first time I've heard this. The costs aren't unreasonable, but if you're paying for something I think it is fair to expect a higher standard of service than if it was free.
"This entire [data fetching] process is repeated between 2-7 times with varying timeout lengths and user-agents until some kind of data is fetched."
7 attempts strikes me as amazingly high. It would be interesting to know the delay between each request.
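I don't know how seomoz actually implements this, but the general idea in the quote (repeat the fetch with different timeouts and user-agents until something comes back) could be sketched roughly as follows, in Python for brevity. Everything here is an assumption for illustration: the function name, the default timeouts, the user-agent strings and the fixed delay between attempts. The `fetch` argument is any callable you supply that takes a URL, a timeout and a user-agent and returns the page body or None.

```python
import time
from typing import Callable, Optional, Sequence


def fetch_with_retries(
    fetch: Callable[[str, float, str], Optional[str]],
    url: str,
    timeouts: Sequence[float] = (2.0, 5.0, 10.0),  # assumed values
    user_agents: Sequence[str] = (  # hypothetical user-agent strings
        "ExampleBot/1.0",
        "Mozilla/5.0 (compatible; ExampleBot/1.0)",
    ),
    max_attempts: int = 7,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try up to max_attempts times; return the first non-blank body or None."""
    for attempt in range(max_attempts):
        # Lengthen the timeout on later attempts, staying on the last value.
        timeout = timeouts[min(attempt, len(timeouts) - 1)]
        # Cycle through the available user-agents.
        agent = user_agents[attempt % len(user_agents)]
        try:
            body = fetch(url, timeout, agent)
        except OSError:
            # Network-level failures (including timeouts) count as a miss.
            body = None
        if body and body.strip():
            return body
        # Pause between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(delay)
    return None
```

With seven attempts and a one second pause the worst case is well over a minute per lookup with these defaults, which is why the delay between requests matters so much.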
Experience so far
I've got two scripts/tools running at the moment that require data from external sources.
The first is the msn contact grabber script which can fail after half a dozen or fewer requests on a single IP address. To make matters worse the interface is complex and poorly documented. Thankfully it is stable.
The second is the dnsbl checker which I haven't had fail on me yet. It utilises the dns system and is designed for high use. Even if the tool did become insanely popular the demand placed on external services could be limited by zone transfers. The interface is also so simple that documentation really isn't needed.
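For anyone who hasn't met a DNSBL before, the interface really is that simple: reverse the octets of the IPv4 address, append the blacklist's zone and do an ordinary DNS lookup. If the name resolves, the address is listed. The following is a minimal sketch in Python rather than the code behind my own tool; the function names are made up for illustration and `dnsbl.example.org` in the usage note is a placeholder zone, not a real blacklist.

```python
import socket
from typing import Callable


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(
        o.isdigit() and 0 <= int(o) <= 255 for o in octets
    ):
        raise ValueError(f"not an IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone.strip(".")


def is_listed(
    ipv4: str,
    zone: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if the query name resolves, i.e. the address is listed."""
    try:
        resolve(dnsbl_query_name(ipv4, zone))
    except socket.gaierror:
        # Simplification: any resolution failure is treated as 'not listed',
        # which also hides genuine DNS outages.
        return False
    return True
```

Calling `is_listed("192.0.2.99", "dnsbl.example.org")` would perform a single DNS query for `99.2.0.192.dnsbl.example.org`. The `resolve` parameter exists only so the lookup can be swapped out when testing.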
My experience seems to be at the two extremes of scraping. I'm hoping my current work will be more dnsbl checker than msn contact grabber. Maybe I should just drop Alexa data?
Monday, July 2. 2007
PHP5 Compete API Wrapper
For the latest and best version of the Compete API wrapper please view the Compete page in the scripts area.
My post from last night discussing the Compete API received two comments. The first highlighted that there was already a proposal in PEAR. Great to hear but as I did much of the work for a wrapper last night you'll get another version now. Actually there is some potential for a superior hybrid to emerge - my wrapper provides better methods for getting at the data while the PEAR proposal could be more robust.
Given that I wrote the post on a Sunday evening I was more than a little surprised, albeit delighted, to receive a response from Jay Meattle at Compete before I woke on Monday morning. You would expect a company which gives a named contact to be fairly responsive, but I still found the time frame here impressive.
With the exception of a link from the homepage, which is apparently in the works, Compete has made good on the points I raised so it's time for me to fulfil my part of the deal and release a PHP5 wrapper for the compete API.
Hopefully it should save anyone else interested in using the Compete API within a PHP project from starting from scratch.
Sunday, July 1. 2007
First impressions of the Compete API
The next tool I want to launch on the site is going to pool data from a number of sources. One of the data types I want to use is traffic data for a site. The leader in this field is Alexa but with their bias towards ranking webmaster sites disproportionately high I wanted something else. There are a few competitors in this field but with the recent launch of an API by compete they seemed to be the best place to start.
Accordingly I went to compete.com and started looking around for a link to their API documentation. After a couple of minutes without any success I gave up and just typed 'compete API' in google. Generally I like clean interfaces, and here compete doesn't disappoint, but when google provides better navigation of your content than your own site does it's possible you've taken things a little too far.
Having found the developer section, sign-up was a breeze, with the only potential point of confusion coming when you receive the confirmation email from mashery and not compete. The compete API is being handled by mashery but this isn't mentioned anywhere on the site. It's a minor issue but could lead to confusion.
Having confirmed your email you are directed back to the Mashery site which, while very slick, contains zero info on the API. I'm sure in time user generated content is going to accumulate but for the moment it's a desert.
It's a Good Service . . .
I just like to complain. Despite the points raised above sign-up went smoothly and once you get back to the compete site you have all the info you need to interact with the API. The only thing really lacking is a selection of wrappers.
So Compete, what would you say about an exchange? Put a link in the footer of your main site to the developer area, mirror the documentation on the mashery site, or at least hyperlink the url I entered in the wiki (I couldn't figure out how - html and wiki syntax don't seem to work), and I'll put together a PHP wrapper for the API. What do you say?
