Archive for June, 2009

why sub-selects can be faster than inner joins

So here is my situation. I have 2 tables with the following DDL.

  id bigint NOT NULL,
  "value" character varying(150),
  CONSTRAINT tags_pkey PRIMARY KEY (id),
  CONSTRAINT tags_value_key UNIQUE (value)


CREATE TABLE sites_tags
  sites_id bigint NOT NULL,
  pages_id bigint NOT NULL,
  tags_id bigint NOT NULL,
  count integer,
  updated timestamp without time zone,
  CONSTRAINT sites_tags_pkey PRIMARY KEY (sites_id, pages_id, tags_id)

As you can see, the tags table is a simple value-id-table. The second table represents a join table between pages and tags.

The goal of my Query should be to get the most used tag from the join table. Only the first x-Rows are of interest to me. To get there I used a simple limit command. So just for comparison here a simple query of the join table without the actual values.

select sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN                                                                
 Limit  (cost=13185.22..13185.35 rows=50 width=12) (actual time=1974.893..1975.033 rows=50 loops=1)
   ->  Sort  (cost=13185.22..13192.27 rows=2819 width=12) (actual time=1974.888..1974.941 rows=50 loops=1)
         Sort Key: (sum(count))
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  HashAggregate  (cost=13056.34..13091.58 rows=2819 width=12) (actual time=1766.681..1876.092 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.120..690.719 rows=570756 loops=1)
 Total runtime: 1975.669 ms
(7 rows)

This is just a statement to get you the picture of cost for a simple query (without fetching any actual values).

To make this query useful I needed to add the values. All the values will be joined through the tags table.

Here the first implementation I came up with.

select t.value,sum(st.count) as anzahl from sites_tags st inner join tags t on group by st.tags_id,t.value order by anzahl desc limit 50;
                                                                      QUERY PLAN                                                                      
 Limit  (cost=123002.69..123002.82 rows=50 width=22) (actual time=12640.100..12640.239 rows=50 loops=1)
   ->  Sort  (cost=123002.69..124429.58 rows=570756 width=22) (actual time=12640.095..12640.153 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  GroupAggregate  (cost=91200.58..104042.59 rows=570756 width=22) (actual time=10165.002..12537.121 rows=66136 loops=1)
               ->  Sort  (cost=91200.58..92627.47 rows=570756 width=22) (actual time=10162.562..11564.604 rows=570756 loops=1)
                     Sort Key: st.tags_id, t.value
                     Sort Method:  external merge  Disk: 18808kB
                     ->  Hash Join  (cost=1877.06..24921.63 rows=570756 width=22) (actual time=259.674..3080.093 rows=570756 loops=1)
                           Hash Cond: (st.tags_id =
                           ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.070..781.449 rows=570756 loops=1)
                           ->  Hash  (cost=1050.36..1050.36 rows=66136 width=18) (actual time=259.518..259.518 rows=66136 loops=1)
                                 ->  Seq Scan on tags t  (cost=0.00..1050.36 rows=66136 width=18) (actual time=0.027..115.197 rows=66136 loops=1)
 Total runtime: 12647.403 ms
(14 rows)

As you can see, simply joining the table makes this query quite complex. The part which consumes most of the cost is the more complicated group by clause. Now the execution engine has to join these tables and then sort all values by id and string (mostly the value is the important part).

To avoid this there only could be one solution – remove the join. With removing the join there comes the question how to get the values from the second table. One way to do this would be to use the program (in my case a php web application) to query again for every line of the result set.
Another way to approach this would be to do a sub-select in the select section. This way you don’t have the additional round trip of doing it in the application. Another advantage would be that the database would only do these sub-selects for the actually returning rows (with respect of the limit).

So here the query I came up with (with the query execution plan)

select (select value from tags t where,sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN                                                                 
 Limit  (cost=36511.94..36512.07 rows=50 width=12) (actual time=2682.650..2682.790 rows=50 loops=1)
   ->  Sort  (cost=36511.94..36518.99 rows=2819 width=12) (actual time=2682.645..2682.705 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  HashAggregate  (cost=13056.34..36418.30 rows=2819 width=12) (actual time=1752.934..2570.690 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.109..713.541 rows=570756 loops=1)
                 ->  Index Scan using tags_pkey on tags t  (cost=0.00..8.27 rows=1 width=10) (actual time=0.006..0.007 rows=1 loops=66136)
                       Index Cond: (id = $0)
 Total runtime: 2683.478 ms
(10 rows)

As you can see i still costs a lot. It is still 3 times more expensive then doing it without the values. On the other hand the cost is only a fourth of the cost of the join. This is mostly owed to the limit clause. The join has no way of knowing that it would be enough to run the limit without the join and later join the values. So far I found no way to tell postgres to do this more efficient.
So the simplest solution for that would be to do sub-queries. With that, the limit clause will be honored.

So as this example shows, it is always a good idea to try different approaches to one and the same query. Often you can see lots of differences in the execution plan which can have a major impact on performance.

, , ,

1 Comment

get hostname from url as stored procedure in plpgsql

I just needed a simple stored procedure to extract the hostname from any given URL. So here is what I came up with.

CREATE OR REPLACE FUNCTION getHostFromUrl(p_url character varying)
  RETURNS character varying AS
  return substring(p_url from  'http.?://(.*?)/(.*)');
  COST 100;



extracting information from websites through xslt

Today, most websites feature some kind of feed, so every user who wants to stay in touch, can follow new publications very easily. Some sites support RSS-feeds or mail notification. Although this is pretty common, there are still sites out there who doesn’t. For that purpose I tried to find some easy solution.
First problem here is, how to get the information into some format usable. HTML is not meant for complex data mining operations. So the first thing to look at would be to ignore the HTML part and do string analysis of the content. This can be really difficult, because you lose the structure of the site completely.
Another approach would be, to somehow utilize the DOM tree which the browser uses to obtain the data. One side Effect would be, that data mining could easily be done via DOM operations. But even for that solution I found no engine which provides DOM support for HTML pages which can be build into an application.
Keeping the DOM approach in mind I started to look around for XML solutions which can also parse HTML data (cause they are not so different from one another). So I came to the libXML project. They implemented a open-source XML-parser which has also the ability to parse HTML. Although the HTML part is still a bit shaky, it looked quite promising. Now there was still the problem of how to retrieve the information. LibXML provides no DOM interface at all. One thing it does provide is XSLT support. Being an EAI developer this was a good compromise.

So here a little tutorial how to make this work in the shell.

First choose your site.
I choosed this one (from a german stocks magazine I subscribed) just out of curiousity and I also have to mention, this site already has e-mail notification (so there is no need to actually use this on that site).

Second, get the XPath you want to extract. If you have knowledge of XPath this should be easy, if not use something like firebug to get there.
After that you can start on creating your XSL script for the transformation. The complete xslt should look something like this:





If you need more detail one XSLT you should check out w3schools. They have some good tutorials for starters.
The important part of this XSLT is the last template section. This is actually the part you have to use to get to your information. First comes the template match. Here you have to insert the XPath you have obtained before. After that you have to select (via XPath) what information you want and formulate how you want this to be written out.

Now you just have to put one and one together and you have your data mining solution.
I inserted the following bash script into my cron table and now have a subscription to this site.

curl -s | sed -e 's/&/&/g' - | xsltproc -html online.xslt -

Now that you have the raw information it should be no problem to get this into some mailinglist oder database for future use.

PS: In case you wonder about the sed in the statement. As I already mentionen the libXML is not really the most flexible solution for parsing HTML. Especially when it comes to HTML codes like ©, libXML resigns with an error. To avoid this I transformed all ampersands to escaped ampersands.

, ,

1 Comment