Cable Forum


AbyssUnderground 22-10-2010 12:23

PHP RegEx Help
 
Hi all,

I have a small issue I can't seem to solve with this code, which counts the number of words on an HTML page:

Code:

$PageDataStripped = $PageData; // $PageData is the source code of any HTML page

// Strip out anything between < and > (strip_tags() not used because it seems to remove some normal text too)
$PageDataStripped = preg_replace("/<(.*)>/iU", " ", $PageDataStripped);

// Match each word on the page
preg_match_all("/([a-zA-Z0-9]*) /iU", $PageDataStripped, $wordCount);

// Debugging
echo "<pre>";
print_r($wordCount);
echo "</pre>";

// Cycle through the array, skip empty or whitespace-only values, and increase the count for everything else
$wordCountfor = $wordCount[1];
$wordsOnPage = 0;
foreach ($wordCountfor as $word) {
    if (!($word == "" || $word == " " || $word == "  ")) {
        $wordsOnPage++;
        echo $word . " ";
    }
}

The preg_replace line that strips the tags (shown in red in the original post) seems to remove all of the code and leave nothing behind, rather than replacing only what sits between each pair of angle brackets. It works fine in a RegEx helper, but PHP just doesn't seem to parse it the same way.

Am I missing something?

Thanks in advance.

Andy
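
For anyone comparing results between a RegEx helper and PHP, here is a minimal sketch of how the tag-stripping pattern behaves with and without the U (ungreedy) modifier. The sample string and variable names are made up for illustration, and this only shows one way a whole page can disappear; it is not a confirmed diagnosis of the problem above.

```php
<?php
// Hypothetical sample input, not taken from the original post.
$html = '<p class="a">Hello <b>big</b> world</p>';

// With U, .* is lazy, so each tag is matched and replaced on its own.
$lazy = preg_replace("/<(.*)>/iU", " ", $html);

// Without U, .* is greedy and runs from the first < to the last >,
// which here swallows the entire string.
$greedy = preg_replace("/<(.*)>/i", " ", $html);

// A negated character class gives per-tag matching without relying on
// the modifier, and [^>] also matches newlines inside a tag.
$negated = preg_replace("/<[^>]*>/", " ", $html);

var_dump($lazy);    // " Hello  big  world "
var_dump($greedy);  // " "
var_dump($negated); // same as $lazy
```

If a helper tool applies different default modifiers than the ones written in the PHP pattern, the two can disagree in this way. Note also that without the s modifier, . does not match a newline, so a tag split across lines is left untouched by the first two patterns.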

---------- Post added at 11:23 ---------- Previous post was at 11:11 ----------

Looks like I may have solved it (typical that eh?):

Code:



$PageDataStripped = $PageData;

// Remove script blocks, contents included; the s modifier lets . match newlines so multi-line scripts are caught too
$PageDataStripped = preg_replace("/<script(.*)<\/script>/isU", " ", $PageDataStripped);
//echo $PageDataStripped;
$PageDataStripped = strip_tags($PageDataStripped);
//$PageDataStripped = preg_replace("/<(.*)>/iU"," ",$PageDataStripped);

// Match each word, now allowing common punctuation as part of it
preg_match_all("/([a-zA-Z0-9:;,\.\'\"\?@£$%&\!]*) /iU", $PageDataStripped, $wordCount);
//echo "<pre>";
//print_r($wordCount);
//echo "</pre>";

// Skip empty or whitespace-only matches and count the rest
$wordCountfor = $wordCount[1];
$wordsOnPage = 0;
foreach ($wordCountfor as $word) {
    if (!($word == "" || $word == " " || $word == "  ")) {
        $wordsOnPage++;
        echo $word . " ";
    }
}
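
One thing to watch with the matching pattern above: it requires a space after every word, so a word followed by a newline or sitting at the very end of the text is not counted. Below is a rough sketch of an alternative that splits on whitespace instead of matching words. The function name countWordsInHtml and the sample page are hypothetical, and this is only one possible approach under those assumptions, not the method used in the post.

```php
<?php
// Hypothetical helper; the name and the sample input are illustrative only.
function countWordsInHtml(string $pageData): int
{
    // Remove script blocks together with their contents.
    // i = case-insensitive, s = let . match newlines, .*? = lazy match.
    $text = preg_replace('/<script\b.*?<\/script>/is', ' ', $pageData);

    // Replace every remaining tag with a space so neighbouring words
    // stay separated.
    $text = preg_replace('/<[^>]*>/', ' ', $text);

    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return count($words);
}

$sample = "<html><body><h1>Hi there</h1>\n"
        . "<script>\nvar x = 1;\n</script>\n"
        . "<p>Three more words</p></body></html>";

echo countWordsInHtml($sample); // prints 5
```

This counts any run of non-whitespace characters as a word, so a stray dash or a number on its own is counted too, and HTML entities such as &amp; are not decoded. Whether that is acceptable depends on what the count is used for.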



All times are GMT +1.