Large Scale Crawler
Looking for a developer (or company) to build a robust web crawler system. We need to crawl and extract data from approximately 20,000+ websites, and we want the extraction completed within 3-6 months.
1. Design the architecture of the crawler, or use an existing open-source crawler as a template. Because we're dealing with a large volume of data, the architecture needs to be:
• Robust and scalable
• Efficient and fast
• Support proxies (to bypass anti-scraping systems)
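The proxy requirement could be sketched as a small rotator that cycles through a pool of endpoints so successive requests leave from different IPs. This is only a sketch: the addresses are placeholders, and the returned dict happens to match the `proxies=` argument of Python's `requests` library.

```python
import itertools

class ProxyRotator:
    """Round-robin over a pool of proxy endpoints so successive
    requests can leave from different IPs (addresses are placeholders)."""

    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)

    def next_proxy(self):
        # Shape accepted by requests' `proxies=` keyword argument.
        addr = next(self._pool)
        return {"http": addr, "https": addr}

rotator = ProxyRotator([
    "http://10.0.0.1:8080",   # placeholder proxy addresses
    "http://10.0.0.2:8080",
])
```

A production rotator would also drop proxies that fail health checks and throttle per-proxy request rates.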
2. Create Admin dashboard where Admin can:
a. Add, edit, view, delete, stop, and search crawlers
b. Input the URL to crawl
c. Specify the data that needs to be extracted (i.e. Title, Title URL, etc.)
d. View, Edit, and Delete extracted data
e. Option to download the data in JSON, XML, CSV
f. An API for the data (secured via authorization tokens or other means) for upload and integration
g. User management with ACL (Access Control List): create, edit, view, and delete users
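The download option in 2e could be sketched as a small export helper. This is only an illustration of the expected output shapes, not a prescribed implementation; the field names are hypothetical and the XML branch is omitted, though it would follow the same pattern.

```python
import csv
import io
import json

def export_records(records, fmt):
    """Serialize extracted records for the dashboard's download
    option. `records` is a list of dicts with identical keys
    (hypothetical field names such as 'title' and 'url')."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```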
3. Data normalization and cleanup. The incoming data is unformatted and unstructured; for example, some sites list a location or city as "Houston, TX", while others list "Houston, Texas" or "USA-TX-Houston". The location or city data therefore needs to be normalized to a single format; we use Google Location for this.
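The normalization step could be sketched as below with a lookup table covering the three example spellings. This is only a toy: the state table is a two-entry sample, and a production system would defer to a geocoding service (such as the Google Geocoding API the brief alludes to) rather than hand-rolled parsing.

```python
# Abbreviated sample table; a real system would cover all states
# or, better, call a geocoding service.
US_STATES = {"TX": "Texas", "CA": "California"}

def normalize_location(raw):
    """Map raw strings such as 'Houston, TX', 'Houston, Texas' or
    'USA-TX-Houston' to one canonical 'City, State' form."""
    parts = [p.strip() for p in raw.replace("-", ",").split(",")]
    parts = [p for p in parts if p.upper() != "USA"]
    city = next(p for p in parts
                if p.upper() not in US_STATES and p not in US_STATES.values())
    state = next((US_STATES.get(p.upper(), p) for p in parts
                  if p.upper() in US_STATES or p in US_STATES.values()), "")
    return f"{city}, {state}"
```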
4. Because the data on these 20,000+ websites changes daily, notifications need to be put in place so the system knows what has changed (i.e. what has been added and what has been removed) and updates the data automatically.
5. Once the data is verified and cleansed, it will be made available for search via Solr, Elasticsearch, or another recommended engine.
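If Elasticsearch were chosen, cleansed records would typically be loaded through its `_bulk` REST endpoint, whose body is newline-delimited JSON (an action line followed by a document line, with a trailing newline). A sketch that only builds that payload; the index name is hypothetical and the host would be deployment-specific:

```python
import json

def bulk_payload(index, docs):
    """Build an Elasticsearch _bulk request body (NDJSON) for a
    batch of cleansed records. The result would be POSTed to the
    cluster's /_bulk endpoint with Content-Type
    application/x-ndjson."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk API requires final newline
```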
Some of the technical challenges that need to be addressed from the beginning:
• Make sure the crawler compresses the data it fetches (or requests compressed transfers, e.g. gzip); otherwise it will use a huge amount of storage
• Avoid fully re-crawling a website every 1-2 days, because that would waste resources; however, we do want fresh data every 1-2 days, which suggests incremental or conditional fetching
• Ways to prevent the crawler from causing a DoS (Denial of Service) on target sites
• Ways to prevent the system from crashing or overloading when many crawlers are running at once
• The system should scale to crawling 100,000–200,000 websites
• Queuing: do crawlers start right away, or do they run in batches at scheduled times? How does this scale as we keep adding more sites to crawl?
Example
Day 1: Admin adds 100 sites to crawl
Day 2: Admin adds 200 sites to crawl
Day 3: Admin adds 500 sites to crawl
Day 4: etc.
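The queuing question above could be modeled with a simple bounded-batch scheduler: sites are enqueued as the admin adds them, and workers drain the queue in fixed-size batches, so the number of concurrent crawlers stays capped no matter how many sites arrive per day. A toy sketch (batch size and URLs are illustrative):

```python
from collections import deque

class CrawlScheduler:
    """Sites are enqueued as the admin adds them and handed out in
    fixed-size batches, keeping concurrent crawler count bounded."""

    def __init__(self, batch_size=50):
        self.batch_size = batch_size
        self.queue = deque()

    def add_sites(self, urls):
        """Called whenever the admin adds sites (Day 1, Day 2, ...)."""
        self.queue.extend(urls)

    def next_batch(self):
        """Hand the next batch to a worker pool; empty when drained."""
        batch = []
        while self.queue and len(batch) < self.batch_size:
            batch.append(self.queue.popleft())
        return batch
```

A real scheduler would add priorities, per-site politeness delays, and retry handling, but the bounded-batch shape is what keeps Day 3's 500 new sites from overloading the system.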
Hi
I am a Senior Mobile App Developer.
I have already developed many Android and iPhone apps.
You might have read this in my profile.
Please send me your detailed requirements. We can discuss it.
Of course, I will do my best.
Thank you in advance.
Hello
A web scraping and data mining expert here. I have done many scraping projects similar to the websites you mentioned in the description. I can complete your project at a lower rate, and with better-quality results, than other freelancers, within a short time.
Let me send you sample data in an Excel or CSV file.
Best Regards
Hello, I have read exactly what you need, and I would like to ask you a few questions. Please feel free to contact me anytime so we can have a detailed discussion and finalize our budget and timeline. I will deliver in the best possible way.
I have gone through the project details and understand what you are trying to achieve. We have completed a number of projects similar to yours. I'd like to speak with you to confirm the project details.
A little about us,
We are a team of 19 operators providing data entry, research, and scraping services worldwide with the best quality output. We are experienced in collecting data from several sources through deep investigation.
We would like to discuss the details and give you the full structure of how we will do this job, along with a sample. Let's talk here to discuss the job.
Thanks
Dg
I want to discuss this project with you further; let me know the best time for you to schedule a meeting. Feel free to message me at any time. I am usually online 14 hours a day on this website, so you will probably get a quick response from me.
Greetings,
We have developed many crawlers, but websites have different structures and restrictions, and sometimes they block the IP as well. I have gone through the complete job description; we can certainly develop a universal crawler, but it will need updating from time to time, and you will need our ongoing support so we can reliably crawl large-scale websites and extract the information.
If you agree with my point, then I believe we are a perfect match for your project. I am available to discuss the project in detail and ready to get started.
Many Thanks
DH
I am an IITK graduate and a software professional with 9 years of experience, and I have top-notch developers on my team with experience across a span of technologies. The members of my team have worked at top tech organizations such as Amazon, Cisco, and Oracle. We have been involved in similar projects in the past, and our track record has been excellent.
I have 7+ years of work experience in data collection, bulk email campaigns, Excel VBA, and internet research at IT companies. I can create crawlers and scrape data from large-scale sites using C++, Python, and Perl, with multiple IP rotations, delivering the results in Excel as per your requirements. I have successfully worked with presidents, directors, and managers of US, UK, and Australian companies on web design and development projects, and I have good communication and writing skills. I am well versed in the internet, MS Office applications, and phone etiquette, along with the latest technologies. I can accept your payment terms.
My name is Mike and I’m from the UK. I work with individual clients and also provide outsourcing services for a number of UK- and USA-based agencies. Your project description sounds interesting to me, and I have the skills and experience required to complete this project. I can show you some examples of my work. Please contact me to discuss your project.