This parser combines the ufXtract microformats parser with a spider that follows rel="me" links. It returns two main collections of data: all the rel="me" links and any hCard-XFN patterns it finds. Each item in a collection carries an additional sourceurl attribute recording the page it was found on. You can restrict the spider to a single domain or let it follow links across the web. There are currently limits on the number of pages that will be parsed.
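As a rough illustration of the link-following step described above, here is a minimal Python sketch. It is not the ufXtract implementation; the names RelMeCollector, find_rel_me and same_domain_only are hypothetical and exist only for this example. It collects the text and absolute URL of every rel="me" anchor on a page and can optionally keep only links on the same host as the source page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class RelMeCollector(HTMLParser):
    """Collects (text, absolute link) pairs for <a rel="me"> anchors.

    Hypothetical helper for illustration; not part of ufXtract.
    """

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.links = []      # finished (text, link) pairs
        self._href = None    # href of the rel="me" anchor being read
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel is a space-separated list, e.g. "me" or "friend met"
        rels = (attrs.get("rel") or "").lower().split()
        if "me" in rels and attrs.get("href"):
            self._href = urljoin(self.source_url, attrs["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_rel_me(html, source_url, same_domain_only=False):
    """Return rel="me" links found in html, optionally limited to one host."""
    collector = RelMeCollector(source_url)
    collector.feed(html)
    collector.close()
    if not same_domain_only:
        return collector.links
    host = urlparse(source_url).netloc.lower()
    return [(text, link) for text, link in collector.links
            if urlparse(link).netloc.lower() == host]
```

A spider would then fetch each returned link and repeat the process until it reaches its page limit.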
This is a piece of research work that is still under development. If you have any comments or want to point out issues, please email info.backnetwork.com
Updated
29-Nov-07
Added support for the representative-hcard concept. I have extended the idea to cover the parsing of multiple pages, so you will often find several representative hCards in the output, but there will only ever be one per URL. Also added support for pages encoded as ISO-8859-1.
Example XML output
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<ufxtract>
<me sourceurl="http://www.glennjones.net/">
<text>Glenn Jones</text>
<link>http://www.glennjones.net/about</link>
</me>
<me sourceurl="http://www.glennjones.net/about/">
<text>Twitter</text>
<link>http://twitter.com/glennjones</link>
</me>
<vcard sourceurl="http://www.glennjones.net/about/" representativehcard="true">
<fn>Glenn Jones</fn>
<n>
<given-name>Glenn</given-name>
<family-name>Jones</family-name>
</n>
<url>http://www.glennjones.net/</url>
<org>Madgex</org>
<role>Creative Director</role>
<xfn>
<text>Glenn Jones</text>
<link>http://www.glennjones.net/about/</link>
<rel>me</rel>
</xfn>
</vcard>
<vcard sourceurl="http://www.glennjones.net/about/">
<fn>Jeremy Keith</fn>
<n>
<given-name>Jeremy</given-name>
<family-name>Keith</family-name>
</n>
<xfn>
<text>Jeremy Keith</text>
<link>http://adactio.com/journal/</link>
<rel>friend met</rel>
</xfn>
</vcard>
<report>
<url status="200" millisec="109">http://www.glennjones.net/</url>
<url status="200" millisec="179">http://www.glennjones.net/about/</url>
<found>4</found>
</report>
</ufxtract>
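One way a client might read this output is sketched below in Python, using the standard library's ElementTree. The function name summarise and the trimmed SAMPLE document are assumptions made for this example; the element and attribute names are taken from the output shown above. The sketch separates the rel="me" links, the representative hCard for each source URL, and the remaining hCards with their XFN relationships.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example output, for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<ufxtract>
<me sourceurl="http://www.glennjones.net/about/">
<text>Twitter</text>
<link>http://twitter.com/glennjones</link>
</me>
<vcard sourceurl="http://www.glennjones.net/about/" representativehcard="true">
<fn>Glenn Jones</fn>
<url>http://www.glennjones.net/</url>
</vcard>
<vcard sourceurl="http://www.glennjones.net/about/">
<fn>Jeremy Keith</fn>
<xfn>
<text>Jeremy Keith</text>
<link>http://adactio.com/journal/</link>
<rel>friend met</rel>
</xfn>
</vcard>
</ufxtract>"""


def summarise(xml_bytes):
    """Split parser output into me links, representative hCards and contacts."""
    root = ET.fromstring(xml_bytes)

    # (source page, linked profile) for every rel="me" link
    me_links = [(me.get("sourceurl"), me.findtext("link"))
                for me in root.findall("me")]

    representative = {}  # source URL -> name on its representative hCard
    contacts = []        # (name, list of XFN rel values) for other hCards
    for vcard in root.findall("vcard"):
        if vcard.get("representativehcard") == "true":
            representative[vcard.get("sourceurl")] = vcard.findtext("fn")
        else:
            rels = (vcard.findtext("xfn/rel") or "").split()
            contacts.append((vcard.findtext("fn"), rels))
    return me_links, representative, contacts
```

Keying the representative hCards by source URL relies on there being at most one per URL in the output.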
Example XML error
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<ufxtract>
<errors>
<error>
<msg>The remote name could not be resolved: 'htp'</msg>
<url>http://htp://www.glennjones.net/</url>
</error>
</errors>
</ufxtract>
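A client would probably want to check for this error block before using the results. The Python sketch below shows one possible check; collect_errors is a hypothetical name, and whether errors can appear alongside normal results in the same response is not stated here, so the function only reports what it finds.

```python
import xml.etree.ElementTree as ET

# Copy of the example error response, for illustration only.
ERROR_SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<ufxtract>
<errors>
<error>
<msg>The remote name could not be resolved: 'htp'</msg>
<url>http://htp://www.glennjones.net/</url>
</error>
</errors>
</ufxtract>"""


def collect_errors(xml_bytes):
    """Return a list of (url, message) pairs, empty if no errors are reported."""
    root = ET.fromstring(xml_bytes)
    return [(err.findtext("url"), err.findtext("msg"))
            for err in root.findall("errors/error")]
```

An empty list means the response contained no errors element.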