C code to index large text library and find similar

Closed Posted 5 years ago Paid on delivery

I need a mini-app (Compiled C on Linux) that groups similar sentences together.

I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast, by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then re-creating a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar sentence length.
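A minimal sketch of the word-to-integer indexing, under some loud assumptions: words are split on ASCII whitespace only, no stemming to root words is attempted, and at most 65,535 distinct words occur. The names `Vocab`, `vocab_new`, `vocab_id` and `encode_sentence` are hypothetical, not part of any existing library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VOCAB_SLOTS 131072u /* power of two, larger than 65536 so the table never fills */
#define NO_ID 0xFFFFu       /* reserved value: returned when no id can be assigned */

typedef struct {
    char    *keys[VOCAB_SLOTS]; /* NULL marks an empty slot */
    uint16_t ids[VOCAB_SLOTS];
    uint32_t count;             /* number of distinct words seen so far */
} Vocab;

/* The table is about 1.3 MB, so it lives on the heap, zero-initialised. */
Vocab *vocab_new(void) { return calloc(1, sizeof(Vocab)); }

/* FNV-1a string hash (an arbitrary choice for this sketch). */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Return the id of `word`, assigning the next free id on first sight. */
uint16_t vocab_id(Vocab *v, const char *word) {
    assert(v != NULL && word != NULL);
    uint32_t i = fnv1a(word) & (VOCAB_SLOTS - 1);
    while (v->keys[i] != NULL) {                 /* linear probing */
        if (strcmp(v->keys[i], word) == 0) return v->ids[i];
        i = (i + 1) & (VOCAB_SLOTS - 1);
    }
    if (v->count >= NO_ID) return NO_ID;         /* ids 0..65534 are all taken */
    size_t len = strlen(word) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return NO_ID;
    memcpy(copy, word, len);
    v->keys[i] = copy;
    v->ids[i] = (uint16_t)v->count++;
    return v->ids[i];
}

/* Split `text` on whitespace and write one id per word into `out`.
   Returns the word count, which can double as the length-bucket key. */
size_t encode_sentence(Vocab *v, const char *text, uint16_t *out, size_t max_words) {
    size_t n = 0;
    char word[64];                               /* longer words are truncated */
    while (n < max_words) {
        text += strspn(text, " \t\n");
        if (*text == '\0') break;
        size_t len = strcspn(text, " \t\n");
        size_t keep = len < sizeof word - 1 ? len : sizeof word - 1;
        memcpy(word, text, keep);
        word[keep] = '\0';
        out[n++] = vocab_id(v, word);
        text += len;
    }
    return n;
}
```

Each encoded sentence is then a short array of 16-bit ids plus its length, so sentences can be bucketed by the value `encode_sentence` returns before any comparison happens.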

Then iterate through doing word-by-word comparisons (16-bit comparisons).

Two algos are acceptable:-

1. Simple - Take a source sentence and iterate through XORing word by word (irrespective of word order or word frequency). If there are more than x words outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.

We leave such a large gap so that we don't need to worry about word roots.
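One possible reading of this first pass, sketched on sentences already encoded as 16-bit word ids. It is an interpretation, not the only one: "outstanding" words are counted as distinct words present in one sentence but not the other, and "total words" is assumed to mean the length of the longer sentence. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if `id` occurs among the first n entries of s. */
static bool contains(const uint16_t *s, size_t n, uint16_t id) {
    for (size_t i = 0; i < n; i++)
        if (s[i] == id) return true;
    return false;
}

/* Count distinct words of a that are absent from b.
   Stops early once the count exceeds `limit`. */
static size_t missing(const uint16_t *a, size_t na,
                      const uint16_t *b, size_t nb, size_t limit) {
    size_t miss = 0;
    for (size_t i = 0; i < na; i++) {
        if (contains(a, i, a[i])) continue;      /* repeated word: ignore frequency */
        if (!contains(b, nb, a[i]) && ++miss > limit) break;
    }
    return miss;
}

/* Order-insensitive check: similar if at most 25% of the words are outstanding. */
bool roughly_similar(const uint16_t *a, size_t na,
                     const uint16_t *b, size_t nb) {
    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
    size_t longest = na > nb ? na : nb;
    size_t limit = longest / 4;                  /* x = 25%, rounded down (assumption) */
    size_t out = missing(a, na, b, nb, limit);
    if (out > limit) return false;
    out += missing(b, nb, a, na, limit - out);
    return out <= limit;
}
```

Note that under this reading a single substituted word counts twice, once for each sentence. The inner scans are linear, which is reasonable for sentence-sized arrays; a 65,536-bit set per sentence would be an alternative if sentences are long.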

From the smaller data set, we then proceed to do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
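A sketch of a Levenshtein distance that gives up early, assuming the comparison works on bytes rather than Unicode code points (so a multi-byte UTF-8 character counts as several edits). The name `bounded_levenshtein` and its convention of returning `max_dist + 1` for "too different" are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b if it is <= max_dist,
   otherwise max_dist + 1 (meaning "over the bound"). */
size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > max_dist) return max_dist + 1;     /* length gap alone exceeds the bound */

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *cur  = malloc((lb + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {           /* out of memory: treat as not similar */
        free(prev); free(cur);
        return max_dist + 1;
    }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        cur[0] = i;
        size_t row_min = cur[0];
        for (size_t j = 1; j <= lb; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del = prev[j] + 1;
            size_t ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
            if (cur[j] < row_min) row_min = cur[j];
        }
        /* Row minima never decrease, so once a whole row is over the
           bound the final distance must be over it too. */
        if (row_min > max_dist) {
            free(prev); free(cur);
            return max_dist + 1;
        }
        size_t *tmp = prev; prev = cur; cur = tmp;
    }
    size_t d = prev[lb];
    free(prev); free(cur);
    return d <= max_dist ? d : max_dist + 1;
}
```

For the 10% figure mentioned above, a caller might pass something like `strlen(a) / 10` as `max_dist` and treat any result above it as a rejection.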

The app should communicate with a folder of .gz files that contain the text and it could use a text boundary to distinguish each sentence.

The output would need to be a new text file that sorts every sentence into groups of similarity - separated by a text boundary.
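A rough sketch of how the grouped output could be produced, assuming pairs judged similar are merged with a union-find structure and that similarity is treated as transitive (if A matches B and B matches C, all three land in one group). The boundary string and the names `merge_groups` and `write_groups` are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t root, idx; } Member;

/* Find the representative of i's group, shortening the path as we go. */
static size_t find_root(size_t *parent, size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Record that sentences a and b belong to the same group. */
void merge_groups(size_t *parent, size_t a, size_t b) {
    size_t ra = find_root(parent, a), rb = find_root(parent, b);
    if (ra != rb) parent[rb] = ra;
}

/* Order members by group root, then by original position. */
static int by_root(const void *x, const void *y) {
    const Member *a = x, *b = y;
    if (a->root != b->root) return a->root < b->root ? -1 : 1;
    return (a->idx > b->idx) - (a->idx < b->idx);
}

/* Write every sentence, group by group, with `boundary` on its own line
   between groups. `parent` must start as parent[i] == i before merging.
   Returns the number of groups written. */
size_t write_groups(FILE *out, const char *const *sentences,
                    size_t *parent, size_t n, const char *boundary) {
    assert(out != NULL && boundary != NULL);
    Member *m = malloc(n * sizeof *m);
    if (m == NULL) return 0;
    for (size_t i = 0; i < n; i++) {
        m[i].root = find_root(parent, i);
        m[i].idx = i;
    }
    qsort(m, n, sizeof *m, by_root);

    size_t groups = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || m[i].root != m[i - 1].root)
            if (groups++ > 0) fprintf(out, "%s\n", boundary);
        fprintf(out, "%s\n", sentences[m[i].idx]);
    }
    free(m);
    return groups;
}
```

Sorting by root keeps the output step at O(n log n) for the 100,000 sentences, regardless of how many groups there turn out to be.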

I need something in 36 hours. A mediocre algorithm is fine.

C Programming C# Programming C++ Programming Linux Python

Project ID: #17551738

About the project

9 proposals Remote project Active 5 years ago

9 freelancers are bidding on average $372 for this job

hbxfnzwpf

You can trust my expertise, I can finish in time, thanks a lot! I am very proficient in c and c++. I have 16 years c++ developing experience now, and have worked for more than 7 years. My work is online game developin More

$300 USD in 1 day
(202 Reviews)
7.3
jjmutumi

Hello, I have more than 6 years experience writing software with Python. I can make a very fast, maintainable script for this in Cython if you are interested? Consider that: 1 - The main slowdown is from cache More

$250 USD in 0 days
(8 Reviews)
4.7
dstepanenko

Hello, I'm c developer with 6+ years of experience and mathematician with a number of publications. Also I'm participant and problem writer of many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks More

$300 USD in 1 day
(33 Reviews)
6.9
MzHashmi

Hi im free so i can do this type of jobs in quick manner as you have 36 hours for the job lets dont waste the time and get it started

$555 USD in 10 days
(8 Reviews)
4.1
codingedward

Hello, I am an experienced algorithm designer and would really like to work on your project. I appreciate how detailed your project description is and have understood every aspect of it. Award me the project and I w More

$250 USD in 3 days
(11 Reviews)
3.2
Anpera

I have expertise in C/C++ My plan to solve this thing: 1) You give me example of dataset 2) I do rapid prototyping in python and show you approximate result of algorithm execution and timing. 3) If you like i More

$333 USD in 2 days
(1 Review)
2.6
ansarias21

Hi, I have 4 years of experience in C/C++ development in Linux environment. Looking forward for your response to discuss further. Regards, Akram

$250 USD in 1 day
(5 Reviews)
0.8
humrobo

Hi, Hope you doing well sir i read your message in given below i make sure you that i can help you to build mini-app (Compiled C on Linux) that groups similar sentences together. as better as per your given requir More

$555 USD in 10 days
(1 Review)
1.6
itsparx

Dear Prospect Hiring Manager. Thank you for giving me a chance to bid on your project. i am a serious bidder here and i have already worked on a similar project before and can deliver as u have mentioned I have More

$555 USD in 10 days
(0 Reviews)
0.6