Posts Tagged HTML
A little while ago I started experimenting with some of the new HTML 5 features. Some seem pretty impressive, although some are rather unnecessary in my opinion. But one thing really got me hooked – the HTML canvas.
A lot of people out there use services like delicious, where you tag your favorite sites and make them available to other users. I started to grab that data and began to build a massive tag cloud. After some time the site had collected hundreds of thousands of links with their corresponding tags. So now you can go to the site and search for tags that interest you. These search tags are then correlated against the cloud database and you get the most active links for your tags. Here is an example.
Let’s say you are interested in a tomcat tutorial.
Of course those results link to the concrete tutorial (not just the entry page).
So much for the official part. For me this is more of a fun project. I prefer to start with some random tag and then wander around. It’s more like browsing, because you start at points you don’t already know. You get the chance to break out of your existing network of most-used sites and see something new.
So have fun with it.
PS: As for the implementation – if you have any questions, just ask. I’m planning to explain some details about how it works in later posts.
Today most websites feature some kind of feed, so every user who wants to stay in touch can follow new publications very easily. Some sites support RSS feeds or mail notification. Although this is pretty common, there are still sites out there that don’t. For those I tried to find an easy solution.
The first problem is how to get the information into a usable format. HTML is not meant for complex data mining operations. The first idea would be to ignore the HTML part and do string analysis of the content. This can be really difficult, because you lose the structure of the site completely.
Another approach would be to somehow utilize the DOM tree which the browser uses to obtain the data. One side effect would be that data mining could easily be done via DOM operations. But even for that solution I found no engine providing DOM support for HTML pages that could be built into an application.
Keeping the DOM approach in mind, I started to look around for XML solutions which can also parse HTML data (since the two are not so different from one another). That’s how I came to the libXML project. They implemented an open-source XML parser which also has the ability to parse HTML. Although the HTML part is still a bit shaky, it looked quite promising. There was still the problem of how to retrieve the information, though: libXML provides no DOM interface at all. One thing it does provide is XSLT support. Being an EAI developer, this was a good compromise for me.
So here is a little tutorial on how to make this work in the shell.
First choose your site.
I chose this one (from a German stock magazine I subscribe to) just out of curiosity, and I have to mention that this site already offers e-mail notification (so there is no real need to use this there).
Second, get the XPath of the element you want to extract. If you know XPath this should be easy; if not, use something like Firebug to get there.
After that you can start creating your XSL script for the transformation. For every entry the complete XSLT should emit one line of output, in a format something like this:
date: … ,action: … ,wkn: … ,name: … ,amount: … ,value: …
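Boiled down, such a stylesheet is just a couple of templates. A minimal sketch, assuming the entries live in the rows of a table with the id depot – that XPath and the td positions are made up and have to be replaced with what you extracted from the actual page:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- suppress all text that no other template asks for -->
  <xsl:template match="text()"/>

  <!-- hypothetical match: one row of the depot table per entry -->
  <xsl:template match="//table[@id='depot']//tr">
    <xsl:text>date: </xsl:text><xsl:value-of select="td[1]"/>
    <xsl:text> ,action: </xsl:text><xsl:value-of select="td[2]"/>
    <xsl:text> ,wkn: </xsl:text><xsl:value-of select="td[3]"/>
    <xsl:text> ,name: </xsl:text><xsl:value-of select="td[4]"/>
    <xsl:text> ,amount: </xsl:text><xsl:value-of select="td[5]"/>
    <xsl:text> ,value: </xsl:text><xsl:value-of select="td[6]"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The empty template matching text() is important with text output: without it, the built-in templates would copy every text node of the page into your result.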
If you need more detail on XSLT you should check out w3schools. They have some good tutorials for starters.
The important part of this XSLT is the last template section. This is the part that actually gets you to your information. First comes the template match; here you insert the XPath you obtained before. Then you select (again via XPath) which information you want and formulate how it should be written out.
Now you just have to put two and two together and you have your data mining solution.
I inserted the following command into my crontab and now have a subscription to this site.
curl -s http://www.deraktionaer.de/xist4c/web/Online---Musterdepot_id_1261_.htm | sed -e 's/&/\&amp;/g' | xsltproc --html online.xslt -
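The cron job alone only prints the current state on every run; to make it feel like a subscription you still have to notice what changed between runs. A minimal sketch – the two printf lines stand in for the curl | sed | xsltproc pipeline above, and the file names are just examples:

```shell
# Previous run is kept in a cache file; the fresh extraction goes
# into depot.new (in the real cron job the pipeline writes it there).
printf 'date: 01.02. ,action: buy ,wkn: 123456\n' > depot.cache
printf 'date: 01.02. ,action: buy ,wkn: 123456\ndate: 02.02. ,action: sell ,wkn: 654321\n' > depot.new

# Lines in the new output that are not in the cache are the news.
grep -Fxv -f depot.cache depot.new

# Remember the current state for the next run.
mv depot.new depot.cache
```

Pipe the grep output into mail(1), or append it to a database, and the notification part is done.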
Now that you have the raw information, it should be no problem to get it into some mailing list or database for future use.
PS: In case you wonder about the sed in the statement: as I already mentioned, libXML is not really the most flexible solution for parsing HTML. Especially when it comes to HTML entities like &copy;, libXML bails out with an error. To avoid this I transform all ampersands into their escaped form before the page reaches the parser.
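The escaping step in isolation looks like this (the entity here is just an example):

```shell
# Every raw '&' becomes '&amp;', so unknown entities such as '&copy;'
# arrive at libXML as harmless literal text instead of causing an error.
echo 'Price &copy; 2010 - Profit & Loss' | sed -e 's/&/\&amp;/g'
# -> Price &amp;copy; 2010 - Profit &amp; Loss
```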