How to parse and tokenize a complete PHP project
Yesterday I had to loop through every directory in a PHP project and read every PHP file in it. The goal was to extract information from each file and build an array like this for each class method:
Array
(
[class] => ClassName
[function] => funcName
[args] => $arg1,$arg2
)
Your goals will probably differ from mine, but how I achieved the above might give you some hints on how to get at the specific parts of the PHP code you are interested in.
An explanation of how this was accomplished follows the code listing:
include '/opt/lib/FileDirExt.php';

$files = array();
FileDirExt::listFilesInTree('/var/www/project/lib', $files, array('.php'));

$to_xml = array();
foreach($files as $f){
    $res = token_get_all(file_get_contents($f));

    // Group the text of every array token by the line it starts on.
    $tokens = array();
    foreach($res as $t){
        if(is_array($t))
            $tokens[$t[2]][] = $t[1];
    }
    // The keys are line numbers, which start at 1 and can have gaps,
    // so reindex from 0 before walking the rows with a counter.
    $tokens = array_values($tokens);

    for($i = 0; $i < count($tokens); $i++){
        $line = $tokens[$i];
        if(in_array('class', $line)){
            // +2 skips the whitespace token between "class" and the name.
            $cur_class = $line[ array_search('class', $line) + 2 ];
            for($j = $i + 1; $j < count($tokens); $j++){
                $line2 = $tokens[$j];
                if(in_array('function', $line2)){
                    $line2 = filterArray($line2, true, true, array(',', 'private', 'protected', 'public', 'function', 'static', 'final'));
                    $cur_func = array_shift($line2);
                    // Keep only the variables, which drops default values.
                    $new_line = array();
                    foreach($line2 as $el){
                        if(strpos($el, '$') === 0)
                            $new_line[] = $el;
                    }
                    $to_xml[] = array('class' => $cur_class, 'function' => $cur_func, 'args' => str_replace('$', '${dollar}', implode(',', $new_line)));
                }
                // The next class starts here: let the outer loop pick it up.
                if(in_array('class', $line2)){
                    $i = $j - 1;
                    break;
                }
            }
        }
    }
}
In order to get a list with absolute paths to all the PHP files I use listFilesInTree of my extended file class. This list will exist in the $files variable.
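The FileDirExt class is not shown in this post. As a rough idea of what such a helper might do, here is a minimal sketch built on PHP's standard recursive directory iterators. The function name listPhpFiles and its signature are assumptions made up for illustration; this is not the actual listFilesInTree implementation:

```php
<?php
// Hypothetical stand-in for FileDirExt::listFilesInTree(); the name and
// signature are assumptions, not the author's class.
// Extensions are given without the leading dot here.
function listPhpFiles(string $root, array $extensions = array('php')): array
{
    $files = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // $file is an SplFileInfo; getExtension() returns the suffix without the dot.
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}
```

The returned paths are only absolute if $root itself is absolute, so something like listPhpFiles('/var/www/project/lib') would be the matching call.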
Next we initialize the $to_xml variable, which will contain the final result, and start looping through the files.
I use token_get_all to get at only the parts of each PHP file that I care about.
I suppose I could have used regular expressions, but token_get_all is a much less error-prone choice: it returns an array containing info about all the code elements in each file.
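Every token that token_get_all returns as an array holds its line number in the third slot (index 2), and we use that to group the tokens by row. Single-character tokens such as parentheses and commas come back as plain strings without a line number, so they are left out of the grouping.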
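To make the grouping concrete, here is a small sketch that tokenizes a made-up three-line class (the source string and the $rows name are just for illustration) and groups the token text by line:

```php
<?php
// A tiny sample source, invented for this example.
$source = <<<'SRC'
<?php
class Foo {
    public function bar($a, $b = 'x') {}
}
SRC;

$rows = array();
foreach (token_get_all($source) as $token) {
    // Array tokens are array(id, text, line). Punctuation such as "{" or ","
    // is returned as a plain string with no line number, so it is skipped.
    if (is_array($token)) {
        $rows[$token[2]][] = $token[1];
    }
}

// The class declaration sits on line 2 of the sample.
print_r($rows[2]);
```

In the printed row, the "class" keyword is at index 0, the whitespace after it at index 1 and the class name Foo at index 2. Note that the keys of $rows are source line numbers, so they start at 1 rather than 0.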
After we’ve done that we start looping through the resultant $tokens array and do the following:
1.) We check if the current row contains a token called “class”. If it does, we get the index of the class token and add 2 to it to get the class name (the whitespace between class and the class name occupies one slot in the token array, which is why we add 2 and not 1).
2.) Now we start a sub loop to go through all the methods contained in the current class. We check for the “function” keyword in the same way we looked for the “class” keyword earlier.
3.) If we are dealing with a function, we filter out all tokens that are empty, contain only whitespace, or equal anything in this array: array(',', 'private', 'protected', 'public', 'function', 'static', 'final'). This is done with the function below:
function filterArray($arr, $trim = false, $empty = false, $remove = array()){
    $rarr = array();
    foreach($arr as $el){
        if($trim)
            $el = trim($el);
        if($empty == true){
            if(!empty($el) && !in_array($el, $remove))
                $rarr[] = $el;
        }else if(!in_array($el, $remove))
            $rarr[] = $el;
    }
    return $rarr;
}
Not much to add here. The above could probably have been done much more elegantly; I created the function piecemeal in an ad hoc way, which is kind of the worst way to go about it.
4.) We return to the function loop where we now have an array containing the name of the function in the first spot, so we get it with array_shift.
5.) Function arguments in PHP can have default values, for example: function func1($arg = 'hello'). These default values were “contaminating” the tokens in my function row, since I only wanted $arg in this case. Hence we loop through the line ($line2) and keep only the strings that start with a dollar sign. The cleaned result is saved in $new_line.
6.) We are now finally ready to add to our $to_xml array which will hold the result.
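Steps 3 to 5 can be traced on a single method signature. The sketch below uses a made-up sample class and swaps my filterArray for array_map and array_filter, so treat it as an illustration of the idea rather than the code from the listing:

```php
<?php
// Sample source, invented for this example.
$source = <<<'SRC'
<?php
class Foo {
    public static function save($id, $name = 'hello') {}
}
SRC;

// Collect the text of the array tokens that start on line 3 (the method line).
$row = array();
foreach (token_get_all($source) as $token) {
    if (is_array($token) && $token[2] === 3) {
        $row[] = $token[1];
    }
}

$skip = array('private', 'protected', 'public', 'function', 'static', 'final');

// Step 3: trim every element, then drop empty strings and modifier keywords.
$clean = array_values(array_filter(array_map('trim', $row), function ($el) use ($skip) {
    return $el !== '' && !in_array($el, $skip, true);
}));

// Step 4: the first remaining element is the method name.
$name = array_shift($clean);

// Step 5: keep only variables, which discards default values such as 'hello'.
$args = array_values(array_filter($clean, function ($el) {
    return strpos($el, '$') === 0;
}));

echo $name . '(' . implode(',', $args) . ")\n"; // save($id,$name)
```

One difference from filterArray: the check against '' keeps a token whose text is "0", which empty() would throw away.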
And that’s it. The result will later be used to create a templates file for Eclipse, but that is another story.