We need a set of functions to explore a small website from a given starting URL and return the following information:

a) an array of all the external links (i.e. links outside the starting directory, or to a different host), in the form $externalLinks[$URL]=$numberOfOccurrences
b) a count of all the images linked to from the site
c) the size of the smallest image linked from the site
d) an array of all the page URLs which comprise the site, along with some feature that can be used to detect whether a page has changed. I suggest using an MD5 hash of the HTML of a page, and returning an array like $sitePages[$pageURL]=$md5hash
e) a flag indicating whether the site uses Javascript anywhere in it
f) a flag indicating whether the site has any (Javascript) popups
g) an array containing any of a list of 'banned words' that are found within the site (list will be provided)

The functions will eventually be integrated as methods of a larger class.

Features & notes:

The code should deal appropriately with pages containing frames.

The code should take as parameters:
a) the maximum total number of pages to crawl (likely to be ~10-15)
b) the maximum 'depth' to explore (likely to be ~3-5)

A 'website' can be defined as any pages linked to from the starting page, within the initial directory structure. So for a starting address of [login to view URL], [login to view URL] would be considered part of the site, but [login to view URL] would be considered an external link, as would a link to [login to view URL]

You will have access to a MySQL database, if required. I would prefer you not to create temporary files on the filesystem, but am happy to listen to any pressing need for them.

You may use a third-party crawler program to assist, if necessary. For instance, you may want to use the FreeBSD port of crawl.

Please let me know if you have comments or questions.
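To make the 'website' definition and the two limits concrete, here is a minimal illustrative sketch. It is written in Python purely for illustration; the deliverable itself targets PHP, and none of the names below (`crawl`, `is_internal`, `LinkCollector`, the injected `fetch` callable) come from the brief. It assumes that "depth" counts link hops from the starting page, that frame sources are followed like ordinary links, and that a start URL whose last path segment has no trailing slash names a file rather than a directory. Image, Javascript and banned-word checks are left out.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects targets of <a href>, <frame src> and <iframe src>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])


def is_internal(url, start_url):
    """True if url is on the start host and inside the start directory."""
    start = urlsplit(start_url)
    # Directory of the start URL, always ending in "/".
    directory = start.path.rsplit("/", 1)[0] + "/"
    parts = urlsplit(url)
    path = parts.path or "/"
    return (parts.netloc.lower() == start.netloc.lower()
            and path.startswith(directory))


def crawl(start_url, fetch, max_pages=15, max_depth=5):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    site_pages = {}      # page URL -> MD5 hex digest of its HTML
    external_links = {}  # external URL -> number of occurrences
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(site_pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        site_pages[url] = hashlib.md5(html.encode("utf-8")).hexdigest()
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            if urlsplit(target).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if not is_internal(target, start_url):
                external_links[target] = external_links.get(target, 0) + 1
            elif depth < max_depth and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return site_pages, external_links
```

Passing the page fetcher in as a parameter keeps the sketch free of network and filesystem access, so it can be exercised against an in-memory dictionary of pages (for example `crawl(start, pages.get)`), in line with the preference above for avoiding temporary files.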
## Deliverables
1) Complete and fully functional working program(s), as well as complete source code of all work done.
2) Complete ownership and distribution copyrights to all work purchased.

Completion: I will provide several sample sites for you to work with; the project will be deemed complete if it successfully reports on several other (similar) test sites. Other criteria include clean, well-structured code, and some commenting of it :-)
## Platform
FreeBSD 4.6, Apache 1.3, PHP Version 4.2, MySQL 4