Wednesday, November 28. 2007
Dear Santa, please bring me a pony and a plastic rocket and one of those . . .
I'm privileged in that I'm able to 'guide' the choice of some of the gifts I'm likely to receive over Christmas. Given the equally dire state of both my understanding of OOP and the programming sections of my local libraries I feel this is somewhere a good book could help.
From various places on the web I've found three books which may be helpful and would love to get some feedback from the PHP community.
For background, programming is not a full time activity for me and although I can easily knock up a custom cms I wouldn't describe myself as a seasoned pro. Also, although it has not always been the case, I almost work exclusively in PHP and expect this to be the case for the foreseeable future.
The Books
p.s. A cookie for anyone who can name the film I've paraphrased in the post title.
From various places on the web I've found three books which may be helpful and would love to get some feedback from the PHP community.
For background, programming is not a full time activity for me and although I can easily knock up a custom cms I wouldn't describe myself as a seasoned pro. Also, although it has not always been the case, I almost work exclusively in PHP and expect this to be the case for the foreseeable future.
The Books
PHP|architect's Guide to PHP Design Patterns by Jason E. Sweat.
From the reviews it looks like this one may go somewhat off-topic.
PHP 5 Objects, Patterns, and Practice by Matt Zandstra
The reviews seem positive. Probably the most likely candidate at this point.
The Object Oriented Thought Process (Developer's Library) by Matt Weisfeld
From the reviews it may be a too brief introduction.
Have you bought any of the three books above? How useful was it to you? Any other suggestions?
p.s. A cookie for anyone who can name the film I've paraphrased in the post title.
Sunday, November 4. 2007
When scraping content from the web don't make it obvious
A couple of hours ago I was playing around scraping some content from a website. All was going well until suddenly I couldn't get my script to fetch meaningful content. I could view the content perfectly through my browser, on the same IP address, but through PHP is was a no go.
The first thing I did was stop visiting the site for 15 minutes or so and then increase the time between requests. It briefly worked again but quickly stopped. Next, I opened up php.ini and checked what useragent PHP was using. It turned out to be 'PHP'. I changed that and for the past 3 hours (almost) the script has been working perfectly.
Moral of the story: When scraping content from the web don't make it obvious
It's worth noting that I can't say for sure that it was changing the user agent which fixed the problem, it could have just been coincidence, but it's an easy fix and why make it obvious that you're scraping content?
Options
In this instance I was just working from my development server so I had access to php.ini but I had several options.
I could have added a line to my .htaccess file
or used ini_set.
Curl also allows you to specify the useragent.
If you want to take cloaking the useragent further with curl this comment in the PHP manual may be useful.
The first thing I did was stop visiting the site for 15 minutes or so and then increase the time between requests. It briefly worked again but quickly stopped. Next, I opened up php.ini and checked what useragent PHP was using. It turned out to be 'PHP'. I changed that and for the past 3 hours (almost) the script has been working perfectly.
Moral of the story: When scraping content from the web don't make it obvious
It's worth noting that I can't say for sure that it was changing the user agent which fixed the problem, it could have just been coincidence, but it's an easy fix and why make it obvious that you're scraping content?
Options
In this instance I was just working from my development server so I had access to php.ini but I had several options.
I could have added a line to my .htaccess file
php_value user_agent Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9
or used ini_set.
<?php
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
?>
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
?>
Curl also allows you to specify the useragent.
<?php
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
?>
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
?>
If you want to take cloaking the useragent further with curl this comment in the PHP manual may be useful.
