Parsing with Zend HTTP Client

I need to recap today’s adventures in parsing land. I’m currently writing a lot of different parsers that retrieve information from various pages. Sometimes I’m lucky and the info I want is available as CSV or XML, but most often it isn’t.

Anyway, sometimes I need to log in before I can fetch the information, and some sites use cookies and whatnot that need to be handled.

As it happens, I’m very satisfied with how Zend_Http_Client handles the fetching and cookie parts. Let’s take a look at an example:

$base = 'http://somesite.com/';

$client = new Zend_Http_Client($base.'login.php', array(
  'maxredirects' => 0,
  'timeout'      => 30,
  'keepalive'    => true
));

$client->setCookieJar();

$client->setParameterPost(array(
  'username' => $user,
  'password' => $pass,
  'submit'   => 'Login'
));

$response = $client->request('POST');

$client->setUri($base.'adv_stats.php');

$period = date('d') == '01' ? 'last' : 'current';

$client->setParameterPost(array(
  'period'        => $period,
  'casinocode'    => '1071',
  'group_field_1' => 'x_p.username',
  'submit'        => 'Show stats'
));

$response = $client->request('POST');

return $this->getTable($response->getBody(), '|<table width="100%" border="0" cellspacing=1 cellpadding=0>(.*?)</table>|s')
    ->getRows('|<tr>|')->shift()->pop()->getColumns()->trim()->convertChar('$', '')
    ->getStats(array(0, 6));

Note the use of $client->setCookieJar(); that is all that is needed to keep the logged-in state between requests, awesome. Without it, the second POST to adv_stats.php would’ve failed due to unauthorized access. Note also the ‘keepalive’ => true in the setup array, which I also needed to make things work.
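To give a feel for what a cookie jar does between two requests, here is a rough sketch in plain PHP with no Zend involved. This is not how Zend's cookie jar is implemented; the TinyCookieJar class and the header values are made up for illustration, and domain, path and expiry handling are left out on purpose.

```php
<?php
// Hypothetical, simplified stand-in for a cookie jar; NOT Zend's implementation.
// It ignores domain, path, expiry and the Secure flag on purpose.
class TinyCookieJar
{
    private $cookies = array();

    // Remember the name=value pair from each Set-Cookie response header.
    public function addFromResponse(array $headers)
    {
        foreach ($headers as $header) {
            if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $header, $m)) {
                $this->cookies[$m[1]] = $m[2];
            }
        }
        return $this;
    }

    // Build the Cookie header to send along with the next request.
    public function toRequestHeader()
    {
        $pairs = array();
        foreach ($this->cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        return 'Cookie: ' . implode('; ', $pairs);
    }
}

// Made-up headers, roughly what a login response might send back.
$jar = new TinyCookieJar();
$jar->addFromResponse(array(
    'Content-Type: text/html',
    'Set-Cookie: PHPSESSID=abc123; path=/',
    'Set-Cookie: lang=en; path=/',
));

echo $jar->toRequestHeader(), "\n"; // Cookie: PHPSESSID=abc123; lang=en
```

The session cookie from the login response is replayed on the next request, which is what keeps us logged in on the second POST.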

The fluent interface at the bottom is simply a wrapper around preg_match, preg_match_all and preg_split. We get the contents of the page with the statistics through $response->getBody().
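The wrapper class itself is not shown in this post, so the following is only a guess at how such a fluent interface could be put together. The TinyParser name, the get() method and the method bodies are assumptions; the only things taken from the real code are the $content property and the rArr() helper name.

```php
<?php
// Hypothetical sketch of a fluent regex wrapper. The real class is not
// shown in this post, so every method body here is an assumption.
class TinyParser
{
    protected $content;

    // Store the new state and return $this so that calls can be chained.
    protected function rArr($content)
    {
        $this->content = $content;
        return $this;
    }

    // Wrapper around preg_match: keep the first captured group.
    public function getTable($html, $reg)
    {
        preg_match($reg, $html, $match);
        return $this->rArr(isset($match[1]) ? $match[1] : '');
    }

    // Wrapper around preg_split: one array element per chunk.
    public function getRows($reg)
    {
        return $this->rArr(preg_split($reg, $this->content));
    }

    // Assumed accessor for reading the current state.
    public function get()
    {
        return $this->content;
    }
}

// Made-up markup for the demo.
$html = '<table><tr><td>a</td></tr><tr><td>b</td></tr></table>';

$parser = new TinyParser();
$rows = $parser->getTable($html, '|<table>(.*?)</table>|s')
    ->getRows('|<tr>|')
    ->get();

// The first element is an empty string: the text before the first <tr>.
print_r($rows);
```

The whole trick is that every method returns $this, so each call can pick up where the previous one left off.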

We then proceed with fetching the contents of the table we are interested in. Usually it’s unique by its setup attributes, so we know we’re getting the right one. Note the use of the s modifier, which lets the dot match newlines too, so the match can span several lines. Note also the (.*?): we get all characters through .* and we’re not greedy due to the ?. In other words, we stop at the first table terminator we find, not the last one.
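A quick way to convince yourself of both points is to run the lazy and greedy variants, with and without the s modifier, against some made-up markup. The two tables below are invented for the demo.

```php
<?php
// Made-up markup: two small tables, each spanning several lines.
$html = "<table>\n<tr><td>first</td></tr>\n</table>\n"
      . "<table>\n<tr><td>second</td></tr>\n</table>";

// Lazy (.*?) with the s modifier: stops at the first </table>.
preg_match('|<table>(.*?)</table>|s', $html, $lazy);

// Greedy (.*) with the s modifier: runs on to the last </table>.
preg_match('|<table>(.*)</table>|s', $html, $greedy);

// Without s the dot cannot cross a newline, so nothing matches here.
$found = preg_match('|<table>(.*?)</table>|', $html, $noDotAll);

var_dump(strpos($lazy[1], 'second') === false);   // true: lazy match excludes the second table
var_dump(strpos($greedy[1], 'second') !== false); // true: greedy match swallows it
var_dump($found);                                  // int(0): no match without the s modifier
```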

Of course, the above approach will break on nested tables, but statistics tables tend to be pretty simple things, as is the case here.
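Here is a small sketch of that failure mode, using invented markup with a table nested inside a cell. The lazy match ends at the inner closing tag, so everything after it in the outer table is lost.

```php
<?php
// Made-up markup with a table nested inside a cell.
$html = '<table><tr><td><table><tr><td>inner</td></tr></table></td></tr>'
      . '<tr><td>outer total</td></tr></table>';

preg_match('|<table>(.*?)</table>|s', $html, $m);

// The lazy match ends at the inner </table>, so the outer row is missing.
echo $m[1], "\n";
var_dump(strpos($m[1], 'outer total') !== false); // bool(false)
```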

Next we split by table rows and get rid of some superfluous material through pop and shift. Then we get the columns. The getColumns function has a default regex, which is why we can call it without passing anything to it:
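In plain PHP terms, the split and the shift/pop steps could look something like this. The rows are made up; what exactly needs to be thrown away on a real page depends on its markup. In this sample the first chunk is the empty string before the first <tr> and the last chunk is a totals row.

```php
<?php
// Made-up table contents: two data rows and a totals row.
$table = '<tr><td>alice</td></tr>'
       . '<tr><td>bob</td></tr>'
       . '<tr><td>Total</td></tr>';

$rows = preg_split('|<tr>|', $table);
// $rows[0] is an empty string: the text before the first <tr>.

array_shift($rows); // drop the first chunk
array_pop($rows);   // drop the last chunk, here the totals row

print_r($rows);
```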

public function getColumns($reg = "|<td[^>]*>([^>]*)</td>|s"){
  $rarr = array();
  foreach($this->content as $player){
    preg_match_all($reg, $player, $matches);
    $rarr[] = $matches[1];
  }
  return $this->rArr($rarr);
}

So we get the contents of each table column. The s modifier is there again, although it has no effect on this pattern: there is no dot in it, and [^>]* already matches newlines. trim() simply trims all strings in our internal array, convertChar() is a wrapper around str_replace(), and getStats() is not relevant to this article.
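To tie the last steps together, here is a standalone sketch of the column extraction followed by the trim and character replacement steps, written with plain PHP functions instead of the fluent class. The sample rows are invented, and the trim/replace loop is only my assumption of what trim() and convertChar('$', '') boil down to.

```php
<?php
// Made-up rows, as they might look after splitting on <tr>.
$rows = array(
    '<td> alice </td><td>' . "\n" . ' $12.50</td></tr>',
    '<td>bob</td><td>$3.00 </td></tr>',
);

// Same default pattern as in getColumns(). The s modifier is a no-op here
// because the pattern contains no dot.
$reg = "|<td[^>]*>([^>]*)</td>|s";

$columns = array();
foreach ($rows as $row) {
    preg_match_all($reg, $row, $matches);
    $columns[] = $matches[1]; // captured cell contents for this row
}

// trim() step: strip surrounding whitespace from every cell.
// convertChar('$', '') step: remove the dollar sign via str_replace().
foreach ($columns as $i => $cells) {
    $cells = array_map('trim', $cells);
    $columns[$i] = str_replace('$', '', $cells);
}

print_r($columns);
```

After this, each row is a clean array of cell values, ready to be turned into numbers.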

Tags: fluent interfaces, parsing, PHP, regex, regular expressions, spiders, zend framework, zend http, zf

This entry was posted on Monday, March 9th, 2009 at 6:50 pm and is filed under PHP, Zend Framework.
