Howard's Blog: August 2008

Thursday, August 28, 2008

Draft Treatise Breakdown

After some thought, I've broken down the treatise into the following sections:

Introduction
Problem Statement
Aim and Scope
Thesis Overview

Background
History of Topic Detection and Tracking
Previous Thesis
Approaches to Integrating the Search Query Component
Approaches to Integrating the Clustering Algorithms/Engine
Approaches to Integrating the Visualisation Component

Development
Search Query Component
Clustering Algorithm/Engine
Visualisation Component
Additional Components

Results and Analysis

Synthesis
Discussion
Conclusion and Possible Future Work

I know I won't be able to write much for the latter sections; however, from what Dr. Calvo has requested, that won't be required.

Sunday, August 24, 2008

Draft Preparations

The draft deadline is approaching fast, so I'll need to start writing up bits and pieces.

Dr. Calvo has sent us a link to some of the material the EIE faculty is using to work on a new report writing tool. The link is here.

I've had a look at it and the sections for 'Introduction', 'Discussion' and 'References' are pretty useful.

I'll be using the Harvard system for referencing and citing. Some examples of this system are as follows:

- For a journal article:

Oliveras, J & Montagne-Clavel, J 1996, 'Picrotoxin produces a "central" pain-like syndrome when microinjected into the somato-motor cortex of the rat', Physiology & Behavior, vol. 60, no. 6, pp. 1425-1434.

- For a book:

Katz, B 1996, Nerve, muscle, and synapse, McGraw-Hill, New York.

Tuesday, August 19, 2008

Update: VisionBytes Data

Irena has finally sent me the VisionBytes data. She had Sergey Mainich (her student) preprocess it for me, as the data in its earlier form would not have been very useful to me.

Basically these files contain data for 19 news programs extracted from closed captions provided by VisionBytes.

The data files are in two formats:

1) Oracle SQL*Loader (.ctl)
2) SQL insert statements (.sql)

The DDL script for four tables (program, sentence, token, segment) is in ddl_script.sql.

The closed captions have been pre-processed as follows:

1. Sentence boundaries have been calculated.
2. Caption text was tokenized into separate words.
3. The words were converted to lower case and stemmed using Porter's stemmer.
4. Stopwords were identified and marked.
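
As a rough illustration of steps 2 to 4, here is a minimal Java sketch that tokenizes one sentence, lower-cases the words and marks stopwords. It is not the code that was used on the VisionBytes data: the class name, the `Token` type and the tiny stopword list are all made up for the example, and stemming is left out because Porter's stemmer is too long to reproduce here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of caption preprocessing: tokenize, lower-case, mark stopwords.
class CaptionPreprocessor {

    // Tiny illustrative stopword list (an assumption, not the list used for the real data).
    static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "of", "in", "and", "is", "to"));

    // One token plus a flag saying whether it is a stopword.
    static class Token {
        final String text;
        final boolean stopword;

        Token(String text, boolean stopword) {
            this.text = text;
            this.stopword = stopword;
        }
    }

    // Splits a sentence on anything that is not a letter, digit or apostrophe,
    // lower-cases each word and marks stopwords. Stemming is intentionally omitted.
    static List<Token> preprocess(String sentence) {
        List<Token> tokens = new ArrayList<Token>();
        for (String raw : sentence.split("[^\\p{L}\\p{N}']+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            String word = raw.toLowerCase(Locale.ROOT);
            tokens.add(new Token(word, STOPWORDS.contains(word)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : preprocess("The Prime Minister is in Sydney.")) {
            System.out.println(t.text + (t.stopword ? " [stopword]" : ""));
        }
    }
}
```

Marking stopwords with a flag rather than deleting them mirrors the description above, where stopwords were "identified and marked" and so remain available in the token table.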

Timestamps are specified in the UTC format.

The time values in columns program.start_time and program.end_time have been truncated by SQLDeveloper.
Program time can be calculated from the timestamps.
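
One possible way to recover a program's running time is to take the earliest and latest caption timestamps that belong to it. The sketch below assumes the timestamps are ISO-8601 UTC strings such as 2008-08-19T09:00:00Z; the real column format may differ, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: derives a program's duration from its caption timestamps.
class ProgramTimes {

    // Returns the span between the earliest and the latest timestamp.
    // Assumes ISO-8601 UTC strings; adjust the parsing for the actual data format.
    static Duration programDuration(List<String> utcTimestamps) {
        if (utcTimestamps.isEmpty()) {
            throw new IllegalArgumentException("no timestamps");
        }
        Instant first = Instant.MAX;
        Instant last = Instant.MIN;
        for (String s : utcTimestamps) {
            Instant t = Instant.parse(s);
            if (t.isBefore(first)) {
                first = t;
            }
            if (t.isAfter(last)) {
                last = t;
            }
        }
        return Duration.between(first, last);
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
                "2008-08-19T09:00:00Z",
                "2008-08-19T09:30:15Z",
                "2008-08-19T09:10:00Z");
        System.out.println(programDuration(stamps).getSeconds() + " seconds");
    }
}
```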

The table "segment" contains boundaries of the news stories.

I will have to install an SQL server and import all the data soon.

Thursday, August 14, 2008

Carrot2 Open Source Framework

Carrot2 is an open source framework for building search clustering engines. Dr. Calvo suggested I look into it, as it is the framework used by grokker.com.

My current understanding of the framework is that it is a package of components:

1. Input components (interfacing between search engines, etc.)

2. Filtering components (using clustering algorithms to obtain clusters of results)

3. Visualisation components (using DHTML, CSS, JavaScript to display the clustered results)
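
To make the three-stage idea concrete, here is a small Java sketch of an input, filter and visualisation pipeline. None of these interfaces or method names come from Carrot2; they are invented for illustration, and the "clustering" is a toy that simply groups titles by their first word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical three-stage pipeline; these are NOT Carrot2 classes.
class ClusteringPipelineSketch {

    interface InputComponent {
        List<String> fetch(String query);
    }

    interface FilterComponent {
        Map<String, List<String>> cluster(List<String> titles);
    }

    interface VisualisationComponent {
        String render(Map<String, List<String>> clusters);
    }

    // Toy filter: groups titles by their lower-cased first word.
    static Map<String, List<String>> clusterByFirstWord(List<String> titles) {
        Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
        for (String title : titles) {
            String label = title.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
            List<String> members = clusters.get(label);
            if (members == null) {
                members = new ArrayList<String>();
                clusters.put(label, members);
            }
            members.add(title);
        }
        return clusters;
    }

    // Toy visualisation: one line per cluster with its size.
    static String renderAsText(Map<String, List<String>> clusters) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : clusters.entrySet()) {
            sb.append(e.getKey()).append(" (").append(e.getValue().size()).append(")\n");
        }
        return sb.toString();
    }

    // Chains the three stages: fetch results, cluster them, render the clusters.
    static String run(InputComponent in, FilterComponent filter,
                      VisualisationComponent out, String query) {
        return out.render(filter.cluster(in.fetch(query)));
    }

    public static void main(String[] args) {
        // The input stage is stubbed with fixed titles instead of a real search engine.
        InputComponent stub = q -> Arrays.asList("Java servlets", "Java EE tutorial", "Carrot2 demo");
        System.out.print(run(stub,
                ClusteringPipelineSketch::clusterByFirstWord,
                ClusteringPipelineSketch::renderAsText,
                "java"));
    }
}
```

Keeping each stage behind its own interface means a search engine, a clustering algorithm or a front end can be swapped without touching the other two, which is how I currently understand the appeal of the component approach.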

A picture taken from the Carrot2 Project website sums up the overall structure of the framework:

Tuesday, August 12, 2008

VisionBytes Data

I've asked Dr. Irena Koprinska for the VisionBytes data today, so hopefully she'll get back to me as soon as possible.

I need to know what kind of database they'll be running so that I'll know how to design the web app to connect and run queries.

My guess is it'll probably be some SQL database, possibly even Oracle.

Saturday, August 9, 2008

Google AJAX Search API

To understand how the Google API works, I've created a simple servlet that does a bit of input and output testing.

The most crucial snippet of code would be:

URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=" + URLEncoder.encode(searchQuery, "UTF-8") + "&rsz=large");

searchQuery was just a String variable containing the search string. The documentation suggests using the JSON object, but I couldn't get that to work, so I used a StringBuilder and a PrintWriter to output the result.
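
A minimal sketch of that approach, outside of a servlet, might look like the following. The class and method names are hypothetical. The query is URL-encoded so that spaces and characters such as & do not break the request, and the response is read line by line into a StringBuilder. To keep the example runnable without a network connection, main reads from a StringReader; in the servlet the Reader would instead wrap the stream opened from the URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building the search URL and reading the raw response text.
class SearchRequestSketch {

    static final String ENDPOINT = "http://ajax.googleapis.com/ajax/services/search/web";

    // Builds the request URL with the query URL-encoded as UTF-8.
    static String buildSearchUrl(String searchQuery) {
        try {
            return ENDPOINT + "?v=1.0&q=" + URLEncoder.encode(searchQuery, "UTF-8") + "&rsz=large";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Reads everything from the given Reader into a single String.
    // In a servlet this could be:
    //   new InputStreamReader(new URL(buildSearchUrl(q)).openStream(), "UTF-8")
    static String readAll(Reader source) throws IOException {
        StringBuilder body = new StringBuilder();
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(buildSearchUrl("topic detection & tracking"));
        // Stand-in for the network response so the sketch runs offline.
        System.out.print(readAll(new StringReader("{\"responseStatus\": 200}")));
    }
}
```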

This is what I got when the searchQuery was "howard":

Text file of output.

It is basically in the structure of:

processResults({
  "responseData": {
    "results": [
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://en.wikipedia.org/wiki/Paris_Hilton",
        "url": "http://en.wikipedia.org/wiki/Paris_Hilton",
        "visibleUrl": "en.wikipedia.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:TwrPfhd22hYJ:en.wikipedia.org",
        "title": "\u003cb\u003eParis Hilton\u003c/b\u003e - Wikipedia, the free encyclopedia",
        "titleNoFormatting": "Paris Hilton - Wikipedia, the free encyclopedia",
        "content": "[1] In 2006, she released her debut album..."
      },
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://www.imdb.com/name/nm0385296/",
        "url": "http://www.imdb.com/name/nm0385296/",
        "visibleUrl": "www.imdb.com",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:1i34KkqnsooJ:www.imdb.com",
        "title": "\u003cb\u003eParis Hilton\u003c/b\u003e",
        "titleNoFormatting": "Paris Hilton",
        "content": "Self: Zoolander. Socialite \u003cb\u003eParis Hilton\u003c/b\u003e..."
      },
      ...
    ],
    "cursor": {
      "pages": [
        { "start": "0", "label": 1 },
        { "start": "4", "label": 2 },
        { "start": "8", "label": 3 },
        { "start": "12", "label": 4 }
      ],
      "estimatedResultCount": "59600000",
      "currentPageIndex": 0,
      "moreResultsUrl": "http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8..."
    }
  },
  "responseDetails": null,
  "responseStatus": 200
})
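
The sample is wrapped in a processResults(...) call, which looks like a JSONP-style callback wrapper. Before such text can be handed to a JSON parser, the wrapper has to be removed. The sketch below shows one simple way to do that with plain string handling; the class and method names are hypothetical, and it assumes the response is exactly the callback name, an opening parenthesis, the JSON and a closing parenthesis.

```java
// Hypothetical helper that strips a JSONP-style callback wrapper from a response.
class JsonpUnwrapper {

    // Returns the text inside callbackName( ... ), or the trimmed input
    // unchanged when it is not wrapped in that callback.
    static String unwrap(String response, String callbackName) {
        String trimmed = response.trim();
        String prefix = callbackName + "(";
        if (trimmed.startsWith(prefix) && trimmed.endsWith(")")) {
            return trimmed.substring(prefix.length(), trimmed.length() - 1).trim();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String wrapped = "processResults({\"responseDetails\": null, \"responseStatus\": 200})";
        System.out.println(unwrap(wrapped, "processResults"));
    }
}
```

The unwrapped string can then be passed to whichever JSON library ends up being used to read fields such as responseStatus or the results array.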

Wednesday, August 6, 2008

Java Web Applications

I've never created a proper Java web application before, only a simple Java app embedded in an HTML page.

A simple breakdown of a Java web application that I came across whilst reading the Java EE Tutorial was:

A web application consists of web components (either Java servlets, JSP pages or web service endpoints), static resource files such as images, and helper classes and libraries. The web container provides many supporting services that enhance the capabilities of web components and make them easier to develop. However, because a web application must take these services into account, the process for creating and running a web application is different from that of traditional stand-alone Java classes.

The process for creating, deploying, and executing a web application can be summarized as follows:

1. Develop the web component code.
2. Develop the web application deployment descriptor.
3. Compile the web application components and helper classes referenced by the components.
4. Optionally package the application into a deployable unit.
5. Deploy the application into a web container.
6. Access a URL that references the web application.

The start of my weekly updates

I'll be updating this blog every Thursday. However, if there are any issues, or if I've read something interesting, then I'll be posting those up as well.

So far:

- 25/07/08 Meeting Minutes have been updated and uploaded
- Project Plan completed and submitted

- Java EE SDK 5.0 and Netbeans IDE 6.1 have been installed on my laptop so implementation is good to go

- I have decided to integrate the Google AJAX Search API into a basic web application first, and have started implementing it, as Google seems to be the most popular search engine at the moment (with good reason, given its search algorithms and search tools)

Something to note is that Java can be used with the Google AJAX Search API. However, because the API returns JSON-encoded results, a JSON (JavaScript Object Notation) library is needed to parse the response.

Resources can be found at:
http://code.google.com/apis/ajaxsearch/
http://code.google.com/apis/ajaxsearch/documentation/#fonje
http://json-lib.sourceforge.net/
http://www.json.org/
http://www.json.org/java/
