Creating a data set from a web data repository
$30-500 USD
Paid on delivery
The goal of this project is to create a data set that will be used to examine how web pages change over time. We have a repository of 52 weekly snapshots of 152 web sites. Each weekly snapshot is stored in a separate, compressed file. We use zlib for compression, and we have tools that allow users to access individual pages within each snapshot.

The first step is to extract the URLs from each snapshot and compute a number of characteristics for each web page (e.g., URL length, content length, number of words, HTML tags/word ratio, and so on; we will agree on the exact fields to include).

The second step is to examine the evolution of pages that correspond to the same URL. We need to compare the URLs for each week pair (e.g., weeks 1-2, 1-3, ..., 1-51, 2-3, 2-4, ..., 50-51) and examine how the pages changed (e.g., in terms of content length, in terms of "shingles", in terms of HTML structure, and so on). A sketch of both steps is given below.

To retrieve a small subset of the data and one of the programs that allow you to access it, the developer can go to [url removed, login to view]~webarchive/access/, where the "webcat" program can be used to see the contents of the repository. It is possible to download the data in 512 MB chunks. The developer will have access to our platform for running the code and will not need to download an extensive amount of data.
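To make the two steps concrete, here is a minimal, self-contained Java sketch. It assumes pages are available as plain HTML strings (e.g., retrieved via the "webcat" tool); the column set, the tag-stripping regex, and the shingle size `k` are illustrative placeholders, not the agreed-upon specification:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PageAnalysis {

    /** Step 1: per-page characteristics for one weekly snapshot,
     *  emitted as a tab-separated row ready for database import. */
    static String characteristics(String url, String html) {
        int contentLength = html.length();
        // Strip tags to count words; a production version would use a real HTML parser.
        String text = html.replaceAll("<[^>]*>", " ").trim();
        int wordCount = text.isEmpty() ? 0 : text.split("\\s+").length;
        // Rough tag count: occurrences of "<" followed by a letter, '!' or '/'.
        int tagCount = html.split("<[a-zA-Z!/]").length - 1;
        double tagsPerWord = wordCount == 0 ? 0.0 : (double) tagCount / wordCount;
        return String.join("\t", url, String.valueOf(url.length()),
                String.valueOf(contentLength), String.valueOf(wordCount),
                String.format("%.4f", tagsPerWord));
    }

    /** Step 2: word-level k-shingles of a page's visible text. */
    static Set<String> shingles(String html, int k) {
        String[] words = html.replaceAll("<[^>]*>", " ").trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity between two shingle sets: |A intersect B| / |A union B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String week1 = "<html><body><p>Hello brave new world</p></body></html>";
        String week2 = "<html><body><p>Hello brave old world</p></body></html>";
        System.out.println(characteristics("http://example.com/", week1));
        System.out.println("Jaccard(k=2): "
                + jaccard(shingles(week1, 2), shingles(week2, 2)));
    }
}
```

Word-level k-shingles compared with Jaccard similarity are one common way to quantify content overlap between two versions of a page; the exact comparison metrics (content-length deltas, HTML-structure differences, etc.) would be agreed on as noted above.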
## Deliverables
The output of the project will be a data set consisting of two files: one with the URL characteristics per week, and one with the URL comparisons across weeks. The files should be plain text in a well-structured, tab-separated format, ready to be imported into a database (an illustrative layout is sketched below). The coder should also deliver the source code, well commented and with good documentation on how to run it.
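As an illustration only, the two tab-separated files might carry header rows along these lines (column names are hypothetical; the actual fields will be agreed on, as noted above):

```
# File 1: per-week URL characteristics (one row per URL per week)
week	url	url_length	content_length	word_count	tags_per_word

# File 2: pairwise comparison across weeks (one row per URL per week pair)
url	week_a	week_b	content_length_delta	shingle_jaccard	html_structure_delta
```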
## Platform
It will run on Linux. Preferred language: Java or C++
Project ID: #3589673