Thread: PHP RegEx Help
22-10-2010, 12:23   #1
AbyssUnderground

Hi all,

I have a small issue I can't seem to solve with this code, which counts the number of words on an HTML page:

Code:
$PageDataStripped = $PageData; // $PageData is the source code of any HTML Page

$PageDataStripped = preg_replace("/<(.*)>/iU"," ",$PageDataStripped); // Strip out anything between < and > (strip_tags() not used because it seems to remove some normal text too)
    preg_match_all("/([a-zA-Z0-9]*) /iU", $PageDataStripped, $wordCount); // Match each word on the page
// Debugging
    echo "<pre>";
    print_r($wordCount);
    echo "</pre>";
    
// Loop over the matches, skip empty or whitespace-only values, and increase the count for each real word
    $wordCountfor = $wordCount[1];
    $wordsOnPage = 0;
    foreach($wordCountfor as $word){
        if(!($word == "" || $word == " " || $word == "  ")){
            $wordsOnPage++;
            echo $word." ";
        }
    }
The preg_replace line (highlighted in red in the original post) seems to remove all of the code and replace it with nothing, rather than replacing only the contents of each pair of angle brackets. In a RegEx helper it works fine, but PHP just doesn't seem to parse it the same way.
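
For reference, here is a minimal sketch of how that pattern behaves on its own. The sample strings are made up for illustration and are not the page from this post. It only shows two things: the U modifier makes .* lazy, so each tag is replaced separately, and without the s modifier the dot does not match a newline, so a tag split across lines is left alone. It does not claim to reproduce the problem described above.

```php
<?php
// Hypothetical sample input, not the real page.
$sample = "<p>Hello <b>world</b></p>";

// Same pattern as in the post: U makes .* lazy, so every tag is matched on its own
// and replaced by a single space.
$stripped = preg_replace("/<(.*)>/iU", " ", $sample);
var_dump($stripped); // string(15) " Hello  world  "

// Without the s modifier "." does not match "\n", so the opening tag that
// spans two lines is not matched; only the closing </a> is replaced.
$multiline = "<a\nhref='#'>link</a>";
$strippedMultiline = preg_replace("/<(.*)>/iU", " ", $multiline);
var_dump($strippedMultiline);
```

If the real page has tags or inline scripts that span several lines, that difference in how the dot is handled might be worth checking.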

Am I missing something?

Thanks in advance.

Andy

---------- Post added at 11:23 ---------- Previous post was at 11:11 ----------

Looks like I may have solved it (typical that eh?):

Code:
 

$PageDataStripped = $PageData;
    $PageDataStripped = preg_replace("/<script(.*)<\/script>/isU"," ",$PageDataStripped); // s lets . match newlines, so scripts spanning several lines are removed too
    //echo $PageDataStripped;
    $PageDataStripped = strip_tags($PageDataStripped);
    //$PageDataStripped = preg_replace("/<(.*)>/iU"," ",$PageDataStripped);

    preg_match_all("/([a-zA-Z0-9:;,\.\'\"\?@£$%&\!]*) /iU", $PageDataStripped, $wordCount);
    //echo "<pre>";
    //print_r($wordCount);
    //echo "</pre>";
    
    $wordCountfor = $wordCount[1];
    $wordsOnPage = 0;
    foreach($wordCountfor as $word){
        if(!($word == "" || $word == " " || $word == "  ")){
            $wordsOnPage++;
            echo $word." ";
        }
    }
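
A shorter variant of the same idea, as a rough sketch only. The function name countWordsInHtml and the sample page are made up for illustration. It assumes a "word" is any run of non-whitespace characters, which is looser than the character class used above, and it splits on whitespace instead of matching words followed by a space.

```php
<?php
// Hypothetical helper, not part of the code above.
function countWordsInHtml(string $html): int
{
    // Remove <script> and <style> blocks. "s" lets "." span newlines,
    // "U" keeps the match lazy so each block is removed separately.
    $text = preg_replace('/<(script|style)\b.*<\/\1>/isU', ' ', $html) ?? $html;

    // Put a space in front of every tag so that words in adjacent
    // elements do not run together once the tags are stripped.
    $text = str_replace('<', ' <', $text);
    $text = strip_tags($text);

    // Split on any whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return count($words);
}

// Made-up sample page.
$sample = "<html><body><script>\nvar a = 1;\n</script><p>Hello world</p><p>again</p></body></html>";
echo countWordsInHtml($sample); // prints 3
```

Splitting on whitespace means the last word on the page is counted even when no space follows it, and punctuation does not need to be listed in the pattern.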