
Special-use spider

$100-500 USD

Closed
Posted over 21 years ago


Paid on delivery
We need a set of functions to explore a small website from a given starting URL and return the following information:

a) an array of all the external links (i.e. links outside the starting directory, or to a different host): $externalLinks[$URL]=$numberOfOccurrences
b) a count of all the images linked to from the site
c) the size of the smallest image linked from the site
d) an array of all the page URLs which comprise the site, along with some identifying feature that can be used to detect whether they change. I suggest using an MD5 hash of the HTML of a page, and returning an array like $sitePages[$pageURL]=$md5hash
e) a flag indicating whether the site uses JavaScript anywhere in it
f) a flag indicating whether the site has any (JavaScript) popups
g) an array containing any of a list of 'banned words' that are found within the site (list will be provided).

The functions will eventually be integrated as methods of a larger class.

Features & notes:

The code should deal appropriately with pages containing frames.

The code should take as parameters:
a) the maximum total number of pages to crawl (likely to be ~10-15)
b) the maximum 'depth' to explore (likely to be ~3-5)

A 'website' can be defined as any pages linked to from the starting page, within the initial directory structure. So for a starting address of [login to view URL], [login to view URL] would be considered part of the site, but [login to view URL] would be considered an external link, as would a link to [login to view URL]

You will have access to a MySQL database, if required. I would prefer you not to create temporary files on the filesystem, but am happy to listen to any pressing need for them.

You may use a third-party crawler program to assist, if necessary. For instance, you may want to use the FreeBSD port of crawl.

Please let me know if you have comments or questions.

## Deliverables

1) Complete and fully-functional working program(s) as well as complete source code of all work done.
2) Complete ownership and distribution copyrights to all work purchased.

Completion: I will provide several sample sites for you to work with; the project will be deemed complete if it successfully reports on several other (similar) test sites. Other criteria include clean, well-structured code, and some commenting of it :-)

## Platform

FreeBSD 4.6
Apache 1.3
PHP Version 4.2
MySQL 4
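
To make the brief above more concrete, here is a minimal sketch of the bounded crawl it describes. It is written in Python for illustration only (the target platform is PHP 4.2), and every name in it (`LinkParser`, `site_root`, `is_internal`, `crawl`, and the caller-supplied `fetch` function) is hypothetical. It assumes that "the site" means the same host plus the starting URL's directory prefix, that a starting URL without a trailing slash names a file, that images are counted as distinct URLs, and that popups and banned words can be detected by crude substring checks. The smallest-image size (item c) is left out because it would require fetching each image.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects link/frame targets and image sources, and notes <script> tags."""

    def __init__(self):
        super().__init__()
        self.links, self.images, self.has_script = [], [], False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])  # framed pages are followed like links
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "script":
            self.has_script = True


def site_root(start_url):
    """Return (host, directory prefix) for the starting URL."""
    parts = urlsplit(start_url)
    return parts.netloc.lower(), parts.path.rsplit("/", 1)[0] + "/"


def is_internal(url, root):
    """True if url is on the same host and under the starting directory."""
    host, directory = root
    parts = urlsplit(url)
    return parts.netloc.lower() == host and (parts.path or "/").startswith(directory)


def crawl(start_url, fetch, max_pages=15, max_depth=3, banned_words=()):
    """Breadth-first crawl; fetch(url) returns the page HTML as str, or None."""
    root = site_root(start_url)
    external_links, site_pages, images = {}, {}, set()
    uses_javascript = has_popups = False
    found_banned = set()
    queue, seen = deque([(start_url, 0)]), {start_url}
    while queue and len(site_pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        # The MD5 of the raw HTML is the change-detection fingerprint.
        site_pages[url] = hashlib.md5(html.encode("utf-8")).hexdigest()
        parser = LinkParser()
        parser.feed(html)
        lowered = html.lower()
        uses_javascript = uses_javascript or parser.has_script
        has_popups = has_popups or "window.open" in lowered  # crude heuristic
        found_banned.update(w for w in banned_words if w.lower() in lowered)
        images.update(urljoin(url, src) for src in parser.images)
        for href in parser.links:
            if href.strip().lower().startswith("javascript:"):
                uses_javascript = True
                continue
            target = urldefrag(urljoin(url, href)).url
            if not target.startswith(("http://", "https://")):
                continue  # skip mailto:, ftp: and similar schemes
            if not is_internal(target, root):
                external_links[target] = external_links.get(target, 0) + 1
            elif target not in seen and depth < max_depth:
                seen.add(target)
                queue.append((target, depth + 1))
    return {
        "external_links": external_links,
        "site_pages": site_pages,
        "image_count": len(images),
        "uses_javascript": uses_javascript,
        "has_popups": has_popups,
        "banned_words": sorted(found_banned),
    }
```

Passing `fetch` in as a parameter keeps the sketch free of network and filesystem access, so it can be exercised against an in-memory dictionary of pages (for example `crawl(start, pages.get)`); a real implementation would substitute an HTTP client and would still need to measure image sizes.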
Project ID: 2889436

About the project

9 proposals
Remote project
Active 21 yrs ago

9 freelancers are bidding on average $242 USD for this job
$153 USD in 14 days, 5.0 (620 reviews), 7.7
$115.60 USD in 14 days, 5.0 (64 reviews), 5.5
$335.75 USD in 14 days, 5.0 (9 reviews), 5.2
$340 USD in 14 days, 4.5 (24 reviews), 4.0
$85 USD in 14 days, 5.0 (10 reviews), 3.1
$340 USD in 14 days, 5.0 (1 review), 0.3
$552.50 USD in 14 days, 0.0 (0 reviews), 0.0
$127.50 USD in 14 days, 0.0 (0 reviews), 0.0
$127.50 USD in 14 days, 0.0 (0 reviews), 0.0

About the client

Anguilla
0.0
0
Member since Dec 4, 2002

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)