Irena has finally sent me the VisionBytes data. She had her student, Sergey Mainich, preprocess it for me, as the data in its earlier form would not have been very useful to me.
Basically these files contain data for 19 news programs extracted from closed captions provided by VisionBytes.
The data files are in two formats:
1) Oracle SQL*Loader (.ctl)
2) SQL insert statements (.sql)
The DDL script for four tables (program, sentence, token, segment) is in ddl_script.sql.
The closed captions have been pre-processed as follows:
1. Sentence boundaries have been calculated.
2. Caption text was tokenized into separate words.
3. The words were converted to lower case and stemmed using Porter's stemmer.
4. Stopwords were identified and marked.
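To make the four steps concrete for myself, here is a minimal sketch of such a pipeline in Python. It is only an illustration and not the code Sergey used: the sentence splitter is a naive regex, the stopword list is a tiny made-up sample, and toy_stem is a crude suffix stripper standing in for Porter's stemmer (a real Porter implementation can be passed in through the stem parameter). All function and field names are my own assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption; the list actually used is unknown).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def toy_stem(word):
    """Crude suffix stripper standing in for Porter's stemmer (NOT the real algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_sentences(text):
    """Naive sentence boundary detection: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(text, stem=toy_stem):
    """Return one row per token: sentence number, original token, stem, stopword flag."""
    rows = []
    for sent_no, sentence in enumerate(split_sentences(text), start=1):
        for word in re.findall(r"[A-Za-z0-9']+", sentence):
            lower = word.lower()
            rows.append(
                {
                    "sentence": sent_no,
                    "token": word,
                    "stem": stem(lower),
                    "is_stopword": lower in STOPWORDS,
                }
            )
    return rows


if __name__ == "__main__":
    for row in preprocess("The president is visiting Ottawa. Markets dropped."):
        print(row)
```

Stopwords are flagged rather than dropped here, which matches the description above ("identified and marked") and keeps token positions intact.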
Timestamps are specified in UTC.
The time values in the columns program.start_time and program.end_time were truncated by SQL Developer.
The program times can still be calculated from the timestamps.
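A rough sketch of how I might recover the program times, assuming each caption row of a program carries a full UTC timestamp (I have not checked the actual column names or types yet, so program_span and the sample values below are purely hypothetical): take the earliest and latest timestamp belonging to a program as its start and end.

```python
from datetime import datetime, timezone


def program_span(timestamps):
    """Return (start, end, duration) from an iterable of UTC datetimes for one program."""
    stamps = sorted(timestamps)
    if not stamps:
        raise ValueError("no timestamps for this program")
    start, end = stamps[0], stamps[-1]
    return start, end, end - start


# Made-up sample timestamps for a single program (not taken from the real data).
sample = [
    datetime(2005, 3, 1, 18, 0, 5, tzinfo=timezone.utc),
    datetime(2005, 3, 1, 18, 29, 35, tzinfo=timezone.utc),
    datetime(2005, 3, 1, 18, 12, 40, tzinfo=timezone.utc),
]

if __name__ == "__main__":
    start, end, duration = program_span(sample)
    print(start.isoformat(), end.isoformat(), duration)
```

This only approximates the broadcast times, since the first and last captions need not coincide exactly with the start and end of the program.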
The table "segment" contains boundaries of the news stories.
I will have to install an SQL server and import all the data soon.