Creating a data set from a web data repository

Cancelled Posted Jun 20, 2006 Paid on delivery

The goal of this project is to create a data set for examining how web pages change over time. We have a repository of 52 weekly snapshots of 152 web sites. Each weekly snapshot is stored in a separate file compressed with zlib, and we have tools that let users access individual pages within each snapshot.

The first step is to extract the URLs from each snapshot and compute a number of characteristics for each web page (e.g., URL length, content length, number of words, HTML tags/word ratio, and so on; we will agree on the exact fields to include).

The second step is to examine the evolution of pages that correspond to the same URL. We need to compare the URLs for each week pair (e.g., weeks 1-2, 1-3, ..., 1-51, 2-3, 2-4, ..., 50-51) and examine how the pages changed (e.g., in terms of content length, "shingles", HTML structure, and so on).

To retrieve a small subset of the data and one of the programs used to access it, the developer can go to [url removed, login to view]~webarchive/access/, where the "webcat" program can be used to see the contents of the repository. The data can be downloaded in 512M chunks. The developer will have access to our platform for running the code and will not need to download an extensive amount of data.
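As a rough illustration of the second step, the sketch below enumerates week pairs and compares two versions of a page by word shingles. It is only one possible reading of the requirement: the class and method names are hypothetical, weeks are assumed to be numbered from 1, the shingle width is a free parameter, and Jaccard resemblance is assumed as the similarity measure. The week count is left as an argument because the example pairs above stop at week 51.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the repository tools mentioned above.
final class WeekPairShingles {

    /** Returns every unordered week pair (i, j) with 1 <= i < j <= weeks. */
    static List<int[]> weekPairs(int weeks) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 1; i < weeks; i++) {
            for (int j = i + 1; j <= weeks; j++) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }

    /** Builds the set of w-word shingles (contiguous word windows) of a text. */
    static Set<String> shingles(String text, int w) {
        Set<String> result = new HashSet<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        // Texts shorter than w words produce no shingles.
        for (int i = 0; i + w <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return result;
    }

    /** Jaccard resemblance: intersection size over union size; two empty sets count as identical. */
    static double resemblance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    public static void main(String[] args) {
        System.out.println("week pairs for 52 weeks: " + weekPairs(52).size());
        Set<String> week1 = shingles("the quick brown fox", 2);
        Set<String> week2 = shingles("the quick brown dog", 2);
        System.out.println("resemblance: " + resemblance(week1, week2));
    }
}
```

In practice the page text would come from the snapshot access tools rather than string literals, and the shingle sets for one week could be cached so that each page is tokenized once rather than once per pair.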

## Deliverables

The output of the project will be a data set consisting of two files: one with the URL characteristics per week, and one with the URL comparisons across weeks. The files should be plain text in a well-structured, tab-separated format, ready to be imported into the database. The coder should also deliver the source code, well commented and with good documentation on how to run it.
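To make the first file concrete, here is a minimal sketch of how one row of per-week URL characteristics might be produced. The column set and order are assumptions, since the exact fields are still to be agreed. The class name is hypothetical, content length is measured in characters rather than bytes, and tags are counted with a simple regular expression rather than a real HTML parser.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating one possible row layout.
final class PageFeatureRow {

    // Crude tag matcher; a real HTML parser would handle comments and scripts better.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Counts substrings that look like HTML tags. */
    static int countTags(String html) {
        Matcher m = TAG.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Counts whitespace-separated words after replacing tags with spaces. */
    static int countWords(String html) {
        String text = TAG.matcher(html).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** Replaces tabs and line breaks so a field cannot break the row layout. */
    static String clean(String field) {
        return field.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    /** Assumed columns: week, URL, URL length, content length, word count, tags/word ratio. */
    static String toRow(int week, String url, String html) {
        int tags = countTags(html);
        int words = countWords(html);
        double ratio = words == 0 ? 0.0 : (double) tags / words;
        return String.join("\t",
                Integer.toString(week),
                clean(url),
                Integer.toString(url.length()),
                Integer.toString(html.length()),
                Integer.toString(words),
                String.format(Locale.ROOT, "%.4f", ratio));
    }

    public static void main(String[] args) {
        System.out.println("week\turl\turl_length\tcontent_length\twords\ttags_per_word");
        System.out.println(toRow(1, "http://example.com/",
                "<html><body><p>Hello web archive</p></body></html>"));
    }
}
```

Formatting the ratio with `Locale.ROOT` keeps the decimal separator a dot regardless of the machine's locale, which matters when the file is later bulk-loaded into a database.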

## Platform

The code will run on Linux. Preferred language: Java or C++.

Skills: Engineering, Linux, MySQL, PHP, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

Project ID: #3589673

About the project

5 proposals Remote project Active Jul 2, 2006

5 freelancers are bidding on average $184 for this job

| Freelancer | Bid | Delivery | Reviews | Rating | Proposal |
|---|---|---|---|---|---|
| werac | $127.5 USD | 30 days | 35 | 5.2 | See private message. |
| afzaal820 | $255 USD | 30 days | 29 | 5.1 | See private message. |
| sting01 | $170 USD | 30 days | 35 | 4.4 | See private message. |
| jazzmusicman | $199.75 USD | 30 days | 31 | 4.3 | See private message. |
| dinamique | $170 USD | 30 days | 0 | 0.0 | See private message. |