| AbyssUnderground |
22-10-2010 12:23 |
PHP RegEx Help
Hi all,
I have a small issue I can't seem to solve with this code, which counts the number of words on an HTML page:
Code:
$PageDataStripped = $PageData; // $PageData is the source code of any HTML Page
$PageDataStripped = preg_replace("/<(.*)>/iU"," ",$PageDataStripped); //Strip out anything between < and > tags (strip_tags() not used because it seems to remove some normal text too)
preg_match_all("/([a-zA-Z0-9]*) /iU", $PageDataStripped, $wordCount); // Match each word on the page
// Debugging
echo "<pre>";
print_r($wordCount);
echo "</pre>";
// Loop over the matches, skip empty or whitespace-only values, and increase the count variable for the rest
$wordCountfor = $wordCount[1];
$wordsOnPage = 0;
foreach($wordCountfor as $word){
    if(!($word == "" || $word == " " || $word == " ")){
        $wordsOnPage++;
        echo $word." ";
    }
}
The preg_replace line (marked in red in the original post) seems to remove all of the code and replace it with nothing, rather than only replacing what sits between each pair of angle brackets. In a RegEx helper it works fine, but PHP just doesn't seem to parse it the same way.
Am I missing something?
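One thing that may be worth checking here: preg_replace() returns NULL when the regex engine itself fails (for example on hitting the backtrack limit), and NULL prints as nothing, which would look like "everything got removed". Also note that without the s modifier the dot does not match newlines, so tags spanning several lines are left untouched. A minimal sketch for telling these cases apart; the helper name stripTagsDebug is made up for illustration and is not part of the code above:

```php
<?php
// Hypothetical helper: runs the same tag-stripping pattern as above and
// reports a PCRE failure instead of silently returning nothing.
function stripTagsDebug(string $pageData): string
{
    $result = preg_replace("/<(.*)>/iU", " ", $pageData);

    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants
        throw new RuntimeException("preg_replace failed, PCRE error code: " . preg_last_error());
    }

    return $result;
}

// Each tag is replaced by a single space
echo stripTagsDebug("<p>Hello <b>world</b></p>");
```

If the exception fires on a real page, the pattern is fine syntactically and the problem is a PCRE limit rather than PHP "parsing it differently".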
Thanks in advance.
Andy
---------- Post added at 11:23 ---------- Previous post was at 11:11 ----------
Looks like I may have solved it (typical, eh?):
Code:
$PageDataStripped = $PageData;
$PageDataStripped = preg_replace("/<script(.*)<\/script>/iU"," ",$PageDataStripped);
//echo $PageDataStripped;
$PageDataStripped = strip_tags($PageDataStripped);
//$PageDataStripped = preg_replace("/<(.*)>/iU"," ",$PageDataStripped);
preg_match_all("/([a-zA-Z0-9:;,\.\'\"\?@£$%&\!]*) /iU", $PageDataStripped, $wordCount);
//echo "<pre>";
//print_r($wordCount);
//echo "</pre>";
$wordCountfor = $wordCount[1];
$wordsOnPage = 0;
foreach($wordCountfor as $word){
    if(!($word == "" || $word == " " || $word == " ")){
        $wordsOnPage++;
        echo $word." ";
    }
}
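The same steps (drop script blocks, strip tags, count what is left) could probably be written more compactly by splitting on whitespace instead of matching each word followed by a space. This is only a sketch: the function name countWordsOnPage is made up, the s modifier is an addition so that script blocks spanning several lines are matched too, and it counts any run of non-whitespace characters as a word, which is looser than the character class used above.

```php
<?php
// Hypothetical helper: counts whitespace-separated words in an HTML page.
function countWordsOnPage(string $pageData): int
{
    // Remove <script>...</script> blocks; s lets . match newlines, U makes .* lazy.
    // Fall back to the raw input if the regex engine fails and returns null.
    $text = preg_replace("/<script(.*)<\/script>/isU", " ", $pageData) ?? $pageData;

    // Strip the remaining tags
    $text = strip_tags($text);

    // Split on runs of whitespace and drop empty pieces
    $words = preg_split("/\s+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    return count($words);
}

$html = "<p>Hello world</p><script>var a = 1;\nvar b = 2;</script> <b>again</b>";
echo countWordsOnPage($html);
```

As with the version above, strip_tags() does not insert spaces where tags were, so text in adjacent tags with no whitespace between them runs together and is counted as one word.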