I need a program similar to [url removed, login to view], preferably written in Python, although C++ is also a possibility. It must be able to run on a Linux server.
This may need a font library, I am not sure. If that is a problem, you can just add a few fonts and I will add more once the program is complete. I doubt this will even be an issue.
- It needs to take a partial screenshot of the URL (800 x 600 px instead of a full-page screenshot, to save memory).
- It should only extract text from the image that is in a larger font than the rest of the page. This text is likely to be bold and in a darker color (such as black, dark blue, grey or dark red). Would we need a separate configuration for the color of the text we're trying to pull? Example: look at the titles of news articles online and how they are always in a larger font size, bold and a different color.
- It can't use x and y coordinates to locate the text to extract, as it will be searching many different sites that have vastly different layouts. So we must use optical character recognition or something similar.
- It must be able to find letters, numbers and basic symbols (!, @, #, $, %, etc.).
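To make the "larger font, darker color" rule above more concrete, here is a rough sketch using only the Python standard library. It assumes the recognition step has already produced a list of text groups with a glyph height and an average color; the `TextBox` structure, the `size_ratio` of 1.5 and the `max_luminance` of 110 are all hypothetical values for illustration, not part of the requirements. Making the darkness threshold a parameter is one possible answer to the color-configuration question.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    """One group of text found on the screenshot (hypothetical OCR-stage output)."""
    text: str
    height_px: int  # glyph height in pixels, used as a proxy for font size
    rgb: tuple      # average (r, g, b) color of the glyph pixels


def is_dark(rgb, max_luminance=110):
    """Return True for dark colors such as black, dark blue, dark grey, dark red."""
    r, g, b = rgb
    # Rec. 601 luma weights: 0 is black, 255 is white.
    return 0.299 * r + 0.587 * g + 0.114 * b <= max_luminance


def pick_headlines(boxes, size_ratio=1.5, max_luminance=110):
    """Keep boxes that are clearly taller than the page's typical text and dark."""
    if not boxes:
        return []
    # The median height stands in for the "rest of the page" font size.
    typical = median(box.height_px for box in boxes)
    return [
        box for box in boxes
        if box.height_px >= size_ratio * typical
        and is_dark(box.rgb, max_luminance)
    ]
```

Comparing against the median height rather than a fixed pixel size means the same setting could work across sites with different base font sizes, which matters because fixed coordinates and layouts cannot be relied on.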
Configuration for the scan:
URL: [url removed, login to view] # URL to take the 800 x 600 px screenshot of
Character Length: 10 # Extract a group of text that is 10 characters or fewer
TextReg: ('W') # The text must start with a W to be extracted
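As an illustration only, the configuration above could be read with a few lines of standard-library Python. This sketch assumes one `Key: value # comment` entry per line, that a comment starts at a `#` preceded by a space (so a `#` inside a URL is kept), and that TextReg holds a plain prefix wrapped in parentheses and quotes. The `ScanConfig` and `parse_scan_config` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScanConfig:
    url: str         # page to take the screenshot of
    max_length: int  # longest text group to extract
    prefix: str      # text the extracted value must start with


def parse_scan_config(text):
    """Parse 'Key: value # comment' lines into a ScanConfig."""
    values = {}
    for line in text.splitlines():
        # Drop a trailing comment; " #" keeps URL fragments like page#top intact.
        line = line.split(" #", 1)[0].strip()
        if not line:
            continue
        # Split at the first colon only, so "http://..." stays in the value.
        key, _, value = line.partition(":")
        values[key.strip().lower()] = value.strip()
    return ScanConfig(
        url=values["url"],
        max_length=int(values["character length"]),
        # Assumption: ('W') means the literal prefix W; strip the wrapping.
        prefix=values["textreg"].strip("()'\" "),
    )
```

If TextReg is meant to be a full regular expression rather than a simple prefix, the last field would need different handling.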
The program would then scan the screenshot and try to find a group of text that meets these requirements.
It would then output the matching text result(s) for that URL's screenshot.
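The matching step could look roughly like the sketch below, again using only the standard library. It assumes the recognized text groups arrive as plain strings, that the length limit applies after trimming surrounding whitespace, and that the prefix check is case-sensitive. The function names are hypothetical.

```python
import string

# Letters, numbers and basic symbols (!, @, #, $, %, etc.), plus the space.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")


def matches(text, max_length, prefix):
    """Check one text group against the scan requirements."""
    text = text.strip()
    return (
        0 < len(text) <= max_length
        and text.startswith(prefix)  # case-sensitive by assumption
        and all(ch in ALLOWED for ch in text)
    )


def scan_results(candidates, max_length, prefix):
    """Return the trimmed text groups that meet the requirements, in order."""
    return [c.strip() for c in candidates if matches(c, max_length, prefix)]


if __name__ == "__main__":
    found = ["World News", "Weather", "Sports", "Washington Post today"]
    for result in scan_results(found, max_length=10, prefix="W"):
        print(result)
```

With the example configuration (10 characters, starting with W), "World News" and "Weather" would be kept, while "Sports" fails the prefix check and "Washington Post today" is too long.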
We hope to use the creator of this program for future updates.
If you use Python, I would want you to create a custom module, as I don't want to use other people's open-source code and have to abide by their terms of use.