Howard's Blog: 2008

Thursday, October 30, 2008

Treatise handed in...

Picking up the cloth-bound treatise from World of Print and looking at it gave me a great sense of accomplishment.

I handed in both copies (one black and white, one colour) this afternoon.

All I need to do is to record that video for my presentation now.

Tuesday, October 28, 2008

Thesis Cloth Binding

After calling the university copy centre, Officeworks, Kwik Kopy, and Kinko's with no luck, I finally found a place called World of Print. The fastest they can do cloth binding is Thursday.

I called Dr. Calvo, and he said I won't lose any marks if I hand it in on Thursday. I was relieved :)

Went to World of Print in Ultimo this afternoon, and now it's just a waiting game.

Monday, October 27, 2008

Treatise finished...

Finally, I've finished writing my treatise and had a few of my friends proofread it.

All that is left is to print and bind it, as well as burn the CD.

Hope binding doesn't take a long time.

Monday, October 20, 2008

'by URL' Idea

After looking into possible ways of using MATLAB, I realised there's no real advantage in using it for visualisation: it slows down a user's search process and doesn't add much value to the web application.

Therefore, I've decided to populate the 'url' field with the program ID of each news story. That way, when you use the 'by URL' filtering algorithm, it chronologically sorts the program IDs for you, so you get a rough idea of how a particular news story progresses.
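
One thing to watch with this idea: if program IDs are compared as plain strings, "10" sorts before "9". Here is a minimal sketch of ordering them numerically instead, assuming the IDs are whole numbers that grow over time. The class and method names are made up for illustration, and this is not the Carrot2 'by URL' algorithm itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Carrot2 or the original webapp.
class ProgramOrder {
    /** Returns a copy of the program IDs sorted by numeric value, smallest first. */
    static List<String> sortByProgramId(List<String> programIds) {
        List<String> sorted = new ArrayList<>(programIds);
        // Parse each ID as a number so "9" comes before "10".
        sorted.sort(Comparator.comparingLong((String id) -> Long.parseLong(id)));
        return sorted;
    }
}
```

If the IDs have to stay as strings all the way through, zero-padding them to a fixed width would give the same order with an ordinary string sort.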

Monday, October 13, 2008

Due date approaching fast!

Still need to tie up a few loose ends with handling MySQL exceptions.

Haven't had time to look at Matlab yet.

Spoke with a couple of friends about binding the treatise; it costs about $35 per book. That should be OK.

Thursday, October 9, 2008

IDEA: Matlab for Visualisation

I was just reading some of the notes I took in my meetings with Dr. Calvo, where he suggested looking into the possibility of using Matlab to display clustered information.

I'd totally forgotten about this cool idea, so if I have time later next week I'll look into it.

I hope I'll have enough time, as this would further enhance my project.

Sunday, October 5, 2008

MySQL connection pool?

I spoke to the guys from the Carrot2 project (Dawid Weiss and Stanislaw Osinski), and they advised that since this is production-level code, it is probably best to implement some kind of SQL connection pool.

That way database connections can be handled properly in a threaded/enterprise environment.

I'm using the MySQL Connector/J driver to connect to the database, so I had a look through its documentation and found:

http://dev.mysql.com/doc/refman/5.0/en/connector-j-usagenotes-j2ee.html

It has some sample code for creating a connection pool. I'm not sure if I have enough time to implement it, but we'll see :)
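
To make the idea concrete, here is a rough sketch of what a pool does: create a fixed number of resources up front and lend them out one at a time. It is not the sample code from the Connector/J documentation. `SimplePool` and its methods are hypothetical names, and the pool is generic so it can be tried without a database; in the webapp the pooled type would be `java.sql.Connection`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical fixed-size pool; T would be java.sql.Connection in the real app.
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    /** Creates all pooled resources up front using the given factory. */
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Waits up to timeoutMillis for a free resource; returns null on timeout. */
    T borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            return null;
        }
    }

    /** Hands a borrowed resource back; extras beyond the pool size are dropped. */
    void giveBack(T resource) {
        idle.offer(resource);
    }

    /** Number of resources currently free to borrow. */
    int available() {
        return idle.size();
    }
}
```

A real deployment would more likely rely on a pooled DataSource provided by the application server or an existing pooling library, which also deals with stale connections and validation.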

Thursday, October 2, 2008

Time to update draft treatise

Most of the development is done; I just need to update my draft with the advice from Dr. Calvo:

-Need to clearly state the work that I've done
-The abstract can be a statement of achievements

Due date for the final treatise is creeping up :(

Thursday, September 25, 2008

Visualisation Complete

I have used the visualisation tool that is included within the carrot2 project.

It is basically a Flash visualisation that displays the main clusters as a pie graph and shows the results in a cluster in the frame next to the Flash object.

Here's a snapshot of it in action; the search query was 'john howard', run through etools.ch.

Monday, September 22, 2008

Integration of the VisionBytes database search complete

I have integrated my app into the carrot2 webapp. My app simply creates a connection to a MySQL database and uses

String sqlQuery = "SELECT * FROM segment WHERE summary LIKE '%" + query + "%'";

as the main search query.

It will then create a list of Documents with the appropriate fields populated from the MySQL query, and that list will be passed on to the next component of the application (the clustering engine).
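
Because the query string above is built by concatenating user input, a search containing a quote character can break the statement or inject SQL. Here is a hedged sketch of the same search using a parameterised query. `SegmentSearch`, `likePattern` and `findSummaries` are hypothetical names, only the `summary` column is selected to keep it short, and the escaping assumes MySQL's default backslash escape character for LIKE.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code used in the original webapp.
class SegmentSearch {
    /** Wraps the user's text in % wildcards, escaping LIKE metacharacters first. */
    static String likePattern(String query) {
        String escaped = query
                .replace("\\", "\\\\")  // escape the escape character itself first
                .replace("%", "\\%")
                .replace("_", "\\_");
        return "%" + escaped + "%";
    }

    /** Returns the summaries of all segments whose summary contains the query text. */
    static List<String> findSummaries(Connection conn, String query) throws SQLException {
        String sql = "SELECT summary FROM segment WHERE summary LIKE ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            // The driver sends the value separately, so quotes in it cannot alter the SQL.
            stmt.setString(1, likePattern(query));
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> summaries = new ArrayList<>();
                while (rs.next()) {
                    summaries.add(rs.getString("summary"));
                }
                return summaries;
            }
        }
    }
}
```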

Thursday, September 11, 2008

MySQL troubles

After what felt like ages converting the VisionBytes data (in the form of SQL statements) into MySQL statements while keeping all the relationships intact, I wrote a simple app that takes a query (a string entered by the user) and retrieves the matching data from the database.

Basically, the table of interest is the 'segment' table, because it contains:

- SEGMENT_ID
- PROGRAM_ID

Both of these are used to determine the location of a story/topic.

- TITLE
- SUMMARY

Both of these are necessary to create a 'Document' in the webapp of the carrot2 project.

The breakdown of a 'Document' is as follows:

Document(String title, String summary, String contentUrl)

In our case, we will replace url data with the location of the topic/story.
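
As a rough illustration of that mapping, the sketch below builds a document from one 'segment' row and puts the story location where the URL would normally go. `StoryDocument` is a hypothetical stand-in that only mirrors the constructor shown above, and the "program/.../segment/..." location format is an assumption made up for the example.

```java
// Hypothetical stand-in for the Carrot2 Document described above.
class StoryDocument {
    final String title;
    final String summary;
    final String contentUrl;

    StoryDocument(String title, String summary, String contentUrl) {
        this.title = title;
        this.summary = summary;
        this.contentUrl = contentUrl;
    }

    /** Builds a document whose "URL" is really the story's location in the data. */
    static StoryDocument fromSegment(long programId, long segmentId,
                                     String title, String summary) {
        // Assumed format: program first, so stories from the same program group together.
        String location = "program/" + programId + "/segment/" + segmentId;
        return new StoryDocument(title, summary, location);
    }
}
```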

Friday, September 5, 2008

Draft treatise handed in

I was almost late handing the draft in today; luckily I'd allowed enough time for travel.

The trip to uni normally takes about 20 minutes, but today it took about 40 due to traffic.

You know what they say, it pays to plan ahead :)

Thursday, September 4, 2008

Draft treatise almost completed

Basically, the chapters that I've worked on for this draft are:

Introduction
-Problem Statement
-Aim and Scope
-Thesis Overview

Background
-History of Topic Detection and Tracking
-Previous Thesis
-Approaches to Integrating the Search Query Component
-Approaches to Integrating the Clustering Algorithms/Engine
-Approaches to Integrating the Visualisation Component

Development
-Search Query Component

Synthesis
-Conclusion and Possible Future Work

But there's still a lot more work to be done...

Just another small update:

-Google search is no longer supported by the Carrot2 project; however, Yahoo web search, MSN Live and etools.ch are supported.
-The Yahoo web search is already integrated in the carrot2 project

Thursday, August 28, 2008

Draft Treatise Breakdown

After some thought, I've broken down the treatise into the following sections:

Introduction
Problem Statement
Aim and Scope
Thesis Overview

Background
History of Topic Detection and Tracking
Previous Thesis
Approaches to Integrating the Search Query Component
Approaches to Integrating the Clustering Algorithms/Engine
Approaches to Integrating the Visualisation Component

Development
Search Query Component
Clustering Algorithm/Engine
Visualisation Component
Additional Components

Results and Analysis

Synthesis
Discussion
Conclusion and Possible Future Work

I know I won't be able to write a lot for the later sections, but going by what Dr. Calvo has asked for, that won't be required.

Sunday, August 24, 2008

Draft Preparations

The draft deadline is approaching fast, so I'll need to start writing up bits and pieces.

Dr. Calvo has sent us a link to some of the material the EIE faculty is using to work on a new report writing tool. The link is here.

I've had a look at it and the sections for 'Introduction', 'Discussion' and 'References' are pretty useful.

I'll be using the Harvard system for referencing and citing. Some examples of this system are as follows:

- For a journal article:

Oliveras, J & Montagne-Clavel, J 1996, 'Picrotoxin produces a "central" pain-like syndrome when microinjected into the somato-motor cortex of the rat', Physiology & Behavior, vol. 60, no. 6, pp. 1425-1434.

- For a book:

Katz, B 1996, Nerve, muscle, and synapse, McGraw-Hill, New York.

Tuesday, August 19, 2008

Update: VisionBytes Data

Irena has finally sent me the VisionBytes data. She had Sergey Mainich (her student) preprocess it for me, as the raw data would not have been very useful.

Basically these files contain data for 19 news programs extracted from closed captions provided by VisionBytes.

The data files are in two formats:

1) Oracle SQL Loader (.ctl)
2) SQL insert statements (.sql)

The DDL script for four tables (program, sentence, token, segment) is in ddl_script.sql.

The closed captions have been pre-processed as follows:

1. Sentence boundaries have been calculated.
2. Caption text was tokenized into separate words.
3. The words were converted to lower case and stemmed using Porter's stemmer.
4. Stopwords were identified and marked.
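
To get a feel for what the data looks like after steps 2 to 4, here is a small sketch of tokenizing, lower-casing and stopword marking. It is not the code Sergey used: the class name and the tiny stopword list are made up, and sentence boundary detection and Porter stemming are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the preprocessing steps; stemming is omitted.
class CaptionPreprocessor {
    // Tiny illustrative stopword list; the list used for the real data is not given.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "and", "to", "is"));

    /** One token of caption text plus a flag saying whether it is a stopword. */
    static class Token {
        final String text;
        final boolean stopword;

        Token(String text, boolean stopword) {
            this.text = text;
            this.stopword = stopword;
        }
    }

    /** Splits on anything that is not a letter, digit or apostrophe, then lower-cases. */
    static List<Token> tokenize(String caption) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : caption.split("[^\\p{L}\\p{N}']+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first item if the text starts with punctuation
            }
            String word = raw.toLowerCase(Locale.ROOT);
            // Stopwords are marked rather than removed, as in the data description.
            tokens.add(new Token(word, STOPWORDS.contains(word)));
        }
        return tokens;
    }
}
```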

Timestamps are specified in the UTC format.

The time values in columns program.start_time and program.end_time have been truncated by SQLDeveloper.
Program time can be calculated from the timestamps.

The table "segment" contains boundaries of the news stories.

I will have to install an SQL server and import all the data soon.

Thursday, August 14, 2008

Carrot2 Open Source Framework

Carrot2 is an open source framework for building search clustering engines. This is the framework used by grokker.com, and Dr. Calvo suggested I look into it.

My current understanding of the framework is that it is a package of components:

1. Input components (interfacing with search engines, etc.)

2. Filtering components (using clustering algorithms to obtain clusters of results)

3. Visualisation components (using DHTML, CSS, JavaScript to display the clustered results)

A picture taken from the Carrot2 Project website sums up the overall structure of the framework:

Tuesday, August 12, 2008

Visionbytes Data

I've asked Dr. Irena Koprinska for the Visionbytes data today, so hopefully she'll get back to me as soon as possible.

I need to know what kind of database they'll be running so that I'll know how to design the web app to connect and run queries.

My guess is it'll probably be SQL or even Oracle.

Saturday, August 9, 2008

Google AJAX Search API

To understand how the Google API works, I've created a simple servlet that basically does a bit of input and output testing.

The most crucial snippet of code would be:

URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=" + searchQuery + "&rsz=large");

searchQuery was just a String variable containing the search string. The documentation suggests using the JSON object, but I couldn't get that to work, so I used a StringBuilder and a PrintWriter to output the result.
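
One caveat with concatenating searchQuery straight into the URL: a query containing a space, '&' or '=' would produce a malformed or misread request. Here is a minimal sketch of building the same URL with the query percent-encoded first. `SearchUrlBuilder` and `build` are hypothetical names; `URLEncoder` is the standard java.net class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the endpoint and parameters are the ones from the snippet above.
class SearchUrlBuilder {
    private static final String ENDPOINT =
            "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=";

    /** Percent-encodes the query so spaces, '&' and '=' cannot break the URL. */
    static String build(String searchQuery) {
        try {
            return ENDPOINT + URLEncoder.encode(searchQuery, "UTF-8") + "&rsz=large";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 must be supported by every Java runtime, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The returned string can then be passed to new URL(...) exactly as before.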

This is what I got when the searchQuery was "howard":

Text file of output.

It is basically in the structure of:

processResults({
  "responseData": {
    "results": [
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://en.wikipedia.org/wiki/Paris_Hilton",
        "url": "http://en.wikipedia.org/wiki/Paris_Hilton",
        "visibleUrl": "en.wikipedia.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:TwrPfhd22hYJ:en.wikipedia.org",
        "title": "\u003cb\u003eParis Hilton\u003c/b\u003e - Wikipedia, the free encyclopedia",
        "titleNoFormatting": "Paris Hilton - Wikipedia, the free encyclopedia",
        "content": "\[1\] In 2006, she released her debut album..."
      },
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://www.imdb.com/name/nm0385296/",
        "url": "http://www.imdb.com/name/nm0385296/",
        "visibleUrl": "www.imdb.com",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:1i34KkqnsooJ:www.imdb.com",
        "title": "\u003cb\u003eParis Hilton\u003c/b\u003e",
        "titleNoFormatting": "Paris Hilton",
        "content": "Self: Zoolander. Socialite \u003cb\u003eParis Hilton\u003c/b\u003e..."
      },
      ...
    ],
    "cursor": {
      "pages": [
        { "start": "0", "label": 1 },
        { "start": "4", "label": 2 },
        { "start": "8", "label": 3 },
        { "start": "12", "label": 4 }
      ],
      "estimatedResultCount": "59600000",
      "currentPageIndex": 0,
      "moreResultsUrl": "http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8..."
    }
  },
  "responseDetails": null,
  "responseStatus": 200
})

Wednesday, August 6, 2008

Java Web Applications

I've never created a proper Java web application before, only a simple Java app embedded in an HTML page.

A simple breakdown of a java web application that I came across whilst reading the Java EE Tutorial was:

A web application consists of web components (either Java servlets, JSP pages or web service endpoints), static resource files such as images, and helper classes and libraries. The web container provides many supporting services that enhance the capabilities of web components and make them easier to develop. However, because a web application must take these services into account, the process for creating and running a web application is different from that of traditional stand-alone Java classes.

The process for creating, deploying, and executing a web application can be summarized as follows:

1. Develop the web component code.
2. Develop the web application deployment descriptor.
3. Compile the web application components and helper classes referenced by the components.
4. Optionally package the application into a deployable unit.
5. Deploy the application into a web container.
6. Access a URL that references the web application.

The start of my weekly updates

I'll be consistently updating this blog every Thursday. However, if there are any issues or if I've read something interesting, then I'll be posting those up as well.

So far:

- 25/07/08 Meeting Minutes have been updated and uploaded
- Project Plan completed and submitted

- Java EE SDK 5.0 and Netbeans IDE 6.1 have been installed on my laptop so implementation is good to go

- I have decided to start by integrating the Google AJAX Search API into a basic web application, as Google seems to be the most popular search engine atm (with good reason as well, due to search algorithms and search tools)

Something to note is that Java can be used in conjunction with Google. However, the Google AJAX Search API returns JSON (JavaScript Object Notation) encoded results, so a JSON library is needed to parse the responses.

Resources can be found from:
http://code.google.com/apis/ajaxsearch/
http://code.google.com/apis/ajaxsearch/documentation/#fonje
http://json-lib.sourceforge.net/
http://www.json.org/
http://www.json.org/java/

Tuesday, May 27, 2008

First official meeting

The meeting was basically to explain to us the structure of the thesis project, how it is going to be assessed and the due dates for each component.

Dr. Calvo also gave us a handout on how to write a better thesis. Basically, the structure we should follow is:

Introduction
- Problem statement
- Aim and scope
- Thesis overview
Background
- List, analyse and explain past research
- Want: new solution to an old problem
- Current theory and practice
Own Work
- Design of your own work and results
- Your own contribution
Synthesis
- Discussion and conclusions

Link to the minutes for the meeting on 22/05/2008.

Wednesday, May 21, 2008

Misreading emails

Well, it turns out that I misread my email and got the date wrong: the meeting wasn't last Thursday but tomorrow.

Anyway, it's all good. I've written up the minutes of the last meeting so that's the first documentation for my thesis project! :)

Link to the minutes for the meeting on 02/05/2008.

Thursday, May 15, 2008

First official thesis discussion

Alright, today at 4pm will be my first official thesis discussion with my thesis supervisor, Dr. Rafael Calvo. My co-supervisor is Daniel Lloyd-Jones of VisionBytes.

Wonder how it'll go...