In this homework, your classifier's job is to categorize web pages. This document describes in general terms how web pages were transformed into the feature vectors that you will use. There are three datasets: the News Dataset, the Shopping Dataset, and the Personal Web Page Dataset. You will use these datasets for three different learning tasks: News versus non-News, Shopping versus non-Shopping, and Personal versus non-Personal. For the News versus non-News learning task, the pages in the News Dataset are labeled "YES" and the pages in the Shopping and Personal Datasets are labeled "NO". Analogous labelings are used for the other two learning tasks.
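The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the names `DATASETS` and `label_for_task` are hypothetical, and the actual file layout of the homework data is not described here.

```python
# Hypothetical dataset names, taken from the three datasets described above.
DATASETS = ("News", "Shopping", "Personal")


def label_for_task(task, source_dataset):
    """Return "YES" if a page comes from the task's own dataset, else "NO"."""
    if task not in DATASETS or source_dataset not in DATASETS:
        raise ValueError("unknown dataset name")
    return "YES" if source_dataset == task else "NO"


# Labels for the News versus non-News task.
for source in DATASETS:
    print(source, label_for_task("News", source))
```

The same function covers the other two tasks by passing "Shopping" or "Personal" as `task`.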
For each learning task, each page is represented by 64 boolean features. Each feature records whether a certain word is present in the page (0=not present, 1=present). The choice of which words to use is very important, and it is different for each learning task. In these learning tasks, the 64 words were chosen by selecting the words with the greatest mutual information. (You do not need to know how mutual information works to complete the homework, but if you are curious, you may look at the paper that introduced the concept.)
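For the curious, here is a minimal Python sketch of how such features could be produced. It is not the procedure actually used to build the homework data. It assumes whitespace tokenization on lowercased text and scores each word by the mutual information between its presence and a 0/1 class label; the function names (`tokenize`, `to_feature_vector`, `mutual_information`, `select_words`) are hypothetical.

```python
import math


def tokenize(text):
    # Assumption: lowercase and split on whitespace. The tokenizer used to
    # build the homework data is not described.
    return set(text.lower().split())


def to_feature_vector(text, vocabulary):
    """Map a page to booleans: 1 if the word is present, 0 otherwise."""
    words = tokenize(text)
    return [1 if w in words else 0 for w in vocabulary]


def mutual_information(presence, labels):
    """Mutual information (in bits) between two 0/1 sequences."""
    n = len(labels)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            n_xy = sum(1 for p, l in zip(presence, labels) if p == x and l == y)
            if n_xy == 0:
                continue  # an empty cell contributes nothing to the sum
            n_x = sum(1 for p in presence if p == x)
            n_y = sum(1 for l in labels if l == y)
            mi += (n_xy / n) * math.log2(n_xy * n / (n_x * n_y))
    return mi


def select_words(pages, labels, k):
    """Return the k words whose presence has the highest MI with the labels."""
    candidates = sorted(set().union(*(tokenize(p) for p in pages)))
    scored = []
    for w in candidates:
        presence = [1 if w in tokenize(p) else 0 for p in pages]
        scored.append((mutual_information(presence, labels), w))
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [w for _, w in scored[:k]]


# Toy data: label 1 = "YES" (News), label 0 = "NO".
pages = ["breaking news headline", "buy cheap shoes", "news today", "buy now"]
labels = [1, 0, 1, 0]
vocabulary = select_words(pages, labels, 2)
print(vocabulary)
print(to_feature_vector("Buy the news", vocabulary))
```

Note that a word can score highly because its presence signals "NO" as well as "YES". In the toy data, "buy" is as informative for the News task as "news" is.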
The training set contains approximately 2000 examples. Pages in the training set that are categorized as belonging to News include pages from CNN, USAToday, Lincoln Journal Star, Omaha World Herald, and CBSNews. Pages that are categorized as belonging to Shopping include those from Amazon, Bluefly, Drugstore.com, OfficeMax, and WalMart.com. The Personal Web Page Dataset consists of academic and non-academic personal web pages of all types. The test set is considerably broader: it contains fewer examples from each site, but draws on a wider variety of sites.
+------------------------+
| japanese               |
| korean                 |
| spanish                |
| warner                 |
| italian                |
| german                 |
| those                  |
| public                 |
| war                    |
| languages              |
| arabic                 |
| reserved               |
| because                |
| transcripts            |
| trial                  |
| three                  |
| guidelines             |
| says                   |
| president              |
| news                   |
| tuesday                |
| against                |
| iraq                   |
| los                    |
| issues                 |
| lllp                   |
| lp                     |
| where                  |
| 926                    |
| headline               |
| preferences            |
| fires                  |
| killed                 |
| cars                   |
| features               |
| angeles                |
| many                   |
| city                   |
| sunday                 |
| archives               |
| life                   |
| denotes                |
| browser                |
| premium                |
| law                    |
| endorse                |
| estate                 |
| go                     |
| before                 |
| quoted                 |
| workers                |
| washington             |
| alerts                 |
| states                 |
| such                   |
| court                  |
| school                 |
| commercial             |
| cia                    |
| high                   |
| rates                  |
| afghanistan            |
| record                 |
| election               |
+------------------------+
+------------------------+
| story                  |
| available              |
| list                   |
| informational          |
| 100%                   |
| sales                  |
| depot                  |
| Non ASCII Char         |
| more..                 |
| iraq                   |
| stores                 |
| orders                 |
| problem                |
| people                 |
| offer                  |
| product                |
| deal                   |
| household              |
| ea                     |
| beauty                 |
| spa                    |
| bush                   |
| directory              |
| return                 |
| please                 |
| pm                     |
| weil                   |
| two                    |
| journal                |
| had                    |
| buy                    |
| supplements            |
| cabinet                |
| well-being             |
| brand                  |
| wellness               |
| external               |
| personal               |
| oz                     |
| system.                |
| what                   |
| services               |
| sports                 |
| purposes               |
| supplies               |
| medicine               |
| nutrition              |
| dr                     |
| remedies               |
| d.                     |
| education              |
| recommends             |
| dollars                |
| notice                 |
| would                  |
| entertainment          |
| what's                 |
| sexual                 |
| oct                    |
| packaging              |
| save                   |
| professional           |
| 1999-2003              |
| pharmacy               |
+------------------------+
+------------------------+
| favorite               |
| hotlist                |
| csce                   |
| u.s                    |
| they                   |
| student                |
| postscript             |
| free                   |
| compressed             |
| machine                |
| links                  |
| corner                 |
| scott                  |
| dive                   |
| privacy                |
| abstract               |
| ferg                   |
| me                     |
| us                     |
| i                      |
| d                      |
| modified               |
| web                    |
| any                    |
| welcome                |
| rights                 |
| robotics               |
| thanks                 |
| no                     |
| proceedings            |
| were                   |
| drugstore.com          |
| can                    |
| policy                 |
| thesis                 |
| care                   |
| help                   |
| computing              |
| sites                  |
| classes                |
| which                  |
| engineering            |
| provided               |
| reserved               |
| misc                   |
| contact                |
| out                    |
| scuba                  |
| store                  |
| one                    |
| 305                    |
| theory                 |
| under                  |
| story                  |
| search                 |
| who                    |
| construction           |
| systems                |
| two                    |
| when                   |
| site                   |
| programming            |
| Non ASCII Char for (C) |
| times                  |
+------------------------+