Sunday, November 4. 2007
Comments
I think the real moral of the story is to stop scraping other people's content without prior consent. If they're blocking you based on your user agent string, they're obviously not happy with what you're doing.
No scraping?!? Paul, you have no sense of adventure, efficiency, or advancement of knowledge. You suggest gathering data by hand? I can think of better things to do with time. I thought computers were for automating tasks, including research. If you don't want your content accessed, don't put it in a system designed to automatically serve it up in response to (HTTP) requests. Don't like the load on your server? Expose a cached API.
#2 on 2007-11-07 23:54
Hi Paul
Thanks for your comment. I realise that scraping content is a sensitive issue, but I think in this instance I acted appropriately.
I think it is important to point out that I wasn't reposting the content anywhere else; I was hoping to analyse it in aggregate purely out of personal interest. As such, I don't think copyright issues come into it, which leaves the (mis)use of server resources as the only issue.
The file I was scraping was an XML file without formatting, which I interpreted as a fairly safe sign that the file was meant for automated retrieval. It is possible that they had problems in the past with people scraping content with a user agent of 'PHP', but that's so generic that I can't consider it a vote specifically against my efforts.
If I changed my user agent string to 'STREET_CRAWLER' and then a few hours later noticed I was being blocked, then it really would be time to reconsider.
I guess it's a war out there and everything is fair in war. I agree it's not a fair practice.
But I am sick and tired of popups, and I have decided that no user from my site will get a popup, even when going to an external site FROM my site. So I use THIS method to get the necessary content effectively. Thanks for this method.
Now I am looking for a method to change the referrer.
Hello NOTRICK
That's not exactly the usage scenario I had in mind. You're going to use a lot of bandwidth and processor resources with that approach, and unless you are thoroughly cleaning the content, popups may still be an issue.
I would recommend education about pop-up blockers over this approach, though you could use this technique to check whether a site is likely to show your visitors pop-ups.
I suspect that if you read the documentation on cURL, changing the referer will be a simple process.
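For what it's worth, here is a minimal sketch of what the cURL documentation describes: the Referer header is set with the `CURLOPT_REFERER` option. The helper name `build_fetch_options` and the example.com URLs are placeholders for illustration, not anything from the original post.

```php
<?php
// Hypothetical helper: collects the cURL options for one request.
// CURLOPT_REFERER and CURLOPT_USERAGENT are real cURL options; the
// function name and the URLs below are placeholders for illustration.
function build_fetch_options(string $url, string $referer, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_REFERER        => $referer,   // sent as the Referer header
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent header
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
    ];
}

$options = build_fetch_options(
    'http://example.com/feed.xml',
    'http://example.com/',
    'STREET_CRAWLER'
);

// To actually send the request:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $body = curl_exec($ch);
// curl_close($ch);
```

Building the options as an array keeps them easy to inspect before anything goes over the wire, and `curl_setopt_array` applies them in one call.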
cURL is not just a function to steal content...lol. I am an affiliate of another website, and I have no way of knowing whether a particular product I am showing on my website is in stock. The only way for me to find out is to check each product page on the website I am affiliated with. With cURL I can update my website without having to manually check each product on their website to see whether it is in stock.
I appreciate the information on this web site. Thanks for sharing.
#6 on 2008-12-18 22:34
I would also suggest this simple trick:
curl_setopt($ch[$i], CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
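One caveat with this trick: `$_SERVER['HTTP_USER_AGENT']` is only populated when the script is handling a web request that included the header, so it will be missing under cron or on the command line. A minimal sketch with a fallback follows; `pick_user_agent` and the default string are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: prefer the visitor's User-Agent and fall back to a
// default when the header is missing (e.g. CLI or cron runs) or empty.
function pick_user_agent(array $server, string $default = 'STREET_CRAWLER'): string
{
    $ua = $server['HTTP_USER_AGENT'] ?? '';
    return $ua !== '' ? $ua : $default;
}

$userAgent = pick_user_agent($_SERVER);

// Usage with a cURL handle, as in the line above:
// curl_setopt($ch[$i], CURLOPT_USERAGENT, $userAgent);
```

Passing the server array in as a parameter, rather than reading `$_SERVER` inside the function, makes the fallback easy to check without a web request.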
