Hello,
I ran into a problem while extracting links from HTML. I usually use the following call to do this:
Code:
$count = preg_match_all("/<a[^>]+href\s*=\s*([\"']?)([^\\s\"'>]+)\\1/is", $data, $matches, PREG_SET_ORDER);
But today I got a page that contains a link like this:
Code:
<a href="/sitemap/Women's-Interests.html">
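The single quote in "Women's" is what breaks it - it is allowed by the HTML spec inside a double-quoted attribute value, but the regexp does not match it: the character class [^\s"'>]+ stops at the apostrophe, and the backreference \1 then expects the closing double quote and fails. Does anyone have a working method to extract all possible forms of <a href> links from a page? I mean links whose href is enclosed in single or double quotes (or not quoted at all), including ones that contain a single quote inside a double-quoted value.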
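For illustration, here is one possible direction, as a minimal sketch rather than a definitive solution: give each quoting style its own alternative, so that a double-quoted value may contain single quotes and vice versa. The function name extract_hrefs is hypothetical (it is not from my existing code), and the pattern assumes no attribute before href contains a ">" character. It uses a PCRE branch-reset group, (?|...), so the URL always ends up in capture group 1.

```php
<?php
// Hypothetical helper (name made up for this sketch): returns the href values
// of all <a> tags found in $data. Regex-based, so it is a heuristic, not a parser.
function extract_hrefs(string $data): array
{
    // (?|...) is a branch-reset group: whichever alternative matches,
    // the captured URL is always group 1.
    //   "([^"]*)"       double-quoted value, may contain single quotes
    //   '([^']*)'       single-quoted value, may contain double quotes
    //   ([^\s"'>]+)     unquoted value
    $pattern = '~<a\s+(?:[^>]*?\s)?href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))~is';

    preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Fall back to '' in case the group is missing from the match array.
        $links[] = $m[1] ?? '';
    }
    return $links;
}

// Usage example with the three quoting styles.
$html = <<<'HTML'
<a href="/sitemap/Women's-Interests.html">A</a>
<a class="x" href='/say-"hi".html'>B</a>
<A HREF=/plain.html>C</A>
HTML;

print_r(extract_hrefs($html));
```

Unlike the original pattern, this one requires whitespace after "<a", so it should not pick up tags such as <abbr>. The returned values are raw attribute text, so entities like &amp; are not decoded. If regular expressions turn out to be too fragile for real-world pages, PHP's DOMDocument class is an alternative worth considering.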