Sunday, November 4. 2007
Comments
I think the real moral of the story is: stop scraping other people's content without prior consent. If they're blocking you based on your user agent string, they're obviously not happy with what you're doing.
No scraping?!? Paul, you have no sense of adventure, efficiency, or advancement of knowledge. You suggest gathering data by hand? I can think of better things to do with time. I thought computers were for automating tasks, including research. If you don't want your content accessed, don't put it in a system designed to automatically serve it up in response to (HTTP) requests. Don't like the load on your server? Expose a cached API.
Hi Paul
Thanks for your comment. I realise that scraping content is a sensitive issue, but I think in this instance I acted appropriately.
I think it is important to point out that I wasn't reposting the content anywhere else; I was hoping to analyse it in aggregate, purely out of personal interest. As such, I don't think copyright comes into it, which leaves (mis)use of server resources as the only issue.
The file I was scraping was an XML file without formatting, which I interpreted as a fairly safe sign that it was meant for automated retrieval. It is possible that they have had problems in the past with people scraping content with a user agent of 'PHP', but that's so generic that I can't consider it a vote specifically against my efforts.
If I changed my user agent string to 'STREET_CRAWLER' and then noticed a few hours later that I was being blocked, it really would be time to reconsider.
I guess it's a war out there, and everything is fair in war. I agree it's not a fair practice.
But I am sick and tired of popups, and I have decided that no user of my site will get a popup, even when going to an external site FROM my site. So I use THIS method to fetch the necessary content myself. Thanks for this method.
Now I am looking for a method to change the referrer.
Hello NOTRICK
That's not exactly the usage scenario I had in mind. You're going to use a lot of bandwidth and processor resources with that approach, and unless you are thoroughly cleaning the content, popups may still be an issue.
I would recommend education about pop-up blockers over this approach, though you could use this technique to check whether a site is likely to show your visitors pop-ups.
I suspect that if you read the documentation on cURL, you will find that changing the referer is a simple process.
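As a rough illustration of the idea (not the code used in the post), both the referer and the user agent are just request headers that the client is free to set. In PHP's cURL binding the corresponding options are `CURLOPT_REFERER` and `CURLOPT_USERAGENT`. The sketch below shows the same thing with Python's standard library; the URL, referer value, and the `build_request` helper are assumptions made up for the example, and no request is actually sent.

```python
from urllib.request import Request


def build_request(url, referer, user_agent):
    """Return a Request object carrying explicit Referer and User-Agent headers.

    Hypothetical helper for illustration; it only builds the request and
    does not contact the server.
    """
    return Request(url, headers={"Referer": referer, "User-Agent": user_agent})


# Placeholder values, not taken from the original post.
req = build_request(
    "http://example.com/data.xml",
    referer="http://example.com/",
    user_agent="STREET_CRAWLER",
)

# urllib normalises header names internally, so "User-Agent" is stored
# under the key "User-agent".
print(req.get_header("Referer"))
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send the request with these headers. Whether doing so is appropriate is the same question discussed in the comments above.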
