I recently opened an account on OkCupid, inspired by Chris McKinlay, to see if I could game the matching system myself and draw the attention of some women in Singapore. To do this, I created a profile on OkCupid and began writing a Python script to log in to my account and retrieve the profiles that fit my search query.
Here were the results I got from a quick search to identify what kind of numbers I would be dealing with.
- Total no of guys from age 18-30 in Singapore who were active within a month: >1000
- Total no of girls from age 18-30 in Singapore who were active within a month: 886
- Total no of guys from age 18-25 in Singapore who were active within a month: >1000
- Total no of girls from age 18-25 in Singapore who were active within a month: 592
Well, I knew Singapore is a small island, but these numbers are way too small to be interesting for a data miner. Still, given the amount of competition from guys, maybe it would be worth landing a few dates if I optimised the questions correctly? Anyway, I proceeded...
One may now ask: why only 500 questions, why not all? Wasn't it only a matter of a few hours and increasing the loop counter variable to 3000, maybe?
Now here is where the numbers come into play. Quite frankly, I couldn't wait a few hours just to answer those irrelevant questions. And all that for about 600 target women? Many of them wouldn't have answered even a single question, and only a handful would have answered more than 50.
A more important reason is hidden in "How OkCupid determines match percentage". It can be observed that answering about 25 questions can get you a match percentage of about 96%, and answering 50 questions can get you about 98%, which I would say is good enough to compete in Singapore given the population and the sex ratio.
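The two figures above are consistent with a simple ceiling of 1 - 1/N on the match percentage, where N is the number of questions two people have both answered. That cap is my assumption about how the margin of error works, not something I have confirmed from OkCupid; the function name below is made up for illustration. A quick sketch of the arithmetic:

```python
def max_match_percentage(common_questions: int) -> float:
    """Upper bound on the match %, ASSUMING a margin of error of 1/N
    for N questions answered in common (an assumption, not confirmed)."""
    if common_questions <= 0:
        return 0.0
    return 100.0 * (1.0 - 1.0 / common_questions)


if __name__ == "__main__":
    # Compare the ceiling for a few question counts.
    for n in (10, 25, 50, 500):
        print(n, round(max_match_percentage(n), 1))
```

Under this assumption, going from 50 to 500 answered questions raises the ceiling by less than two percentage points, which is why I stopped early.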
OK, so now my questions were answered and I could write a script to visit each profile and retrieve their answers to my questions.
I did so: I wrote a script and kept a log of all the profiles whose answers I had already retrieved. This was done so that my script wouldn't keep retrieving the answers of the same profile again and again in case it crashed due to a network error or some bug. It went well after a few tries and I collected the data of 299 profiles, all logged in a CSV file for analysis.
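The resume-after-crash idea can be sketched roughly as follows. This is not my actual script: `fetch_answers` is a hypothetical callable standing in for whatever retrieves one profile's answers, and the long CSV layout (one row per profile, question and answer) and the file names are assumptions.

```python
import csv
import os


def load_done(log_path):
    """Return the set of profile IDs already recorded in the log file."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def collect(profile_ids, fetch_answers, csv_path, log_path):
    """Fetch each profile's answers once, skipping IDs already in the log.

    fetch_answers is a hypothetical callable: profile ID -> {question_id: answer}.
    """
    done = load_done(log_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as out, \
            open(log_path, "a", encoding="utf-8") as log:
        writer = csv.writer(out)
        for pid in profile_ids:
            if pid in done:
                continue  # already retrieved in an earlier run
            try:
                answers = fetch_answers(pid)
            except OSError:
                continue  # network error: leave unlogged so the next run retries
            for qid, answer in answers.items():
                writer.writerow([pid, qid, answer])
            out.flush()
            # Log the ID only after its rows are written.
            log.write(pid + "\n")
            log.flush()
            done.add(pid)
    return done
```

Running `collect` a second time with the same file paths skips every profile already in the log and only retries the ones that failed.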
Here is the age distribution of women in Singapore who are on OkCupid and were online at least once in the last month:
When I looked at the data I was appalled to find that the majority of answers were missing. Nope, not an error in my scripts! Those were questions only a few women had answered. The natural next step was to identify the most-answered questions, since these would also be the most important ones in determining the match. Hence, I wrote a shell script to count the number of missing answers per question and sort the questions in ascending order of that count. This gave me a sorted list beginning with the most important question and ending with the least important one. Of course, the last few questions were completely unanswered.
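I did this step in shell, but the same counting can be sketched in Python. The in-memory layout here (one dict of question ID to answer per profile, with a missing or empty value meaning unanswered) and the name `rank_by_coverage` are assumptions for illustration, not the format of my CSV.

```python
from collections import Counter


def rank_by_coverage(profiles, question_ids):
    """Return (question_id, missing_count) pairs, fewest missing answers first."""
    missing = Counter()
    for answers in profiles:
        for qid in question_ids:
            if not answers.get(qid):  # absent or empty counts as missing
                missing[qid] += 1
    # sorted() is stable, so ties keep their original order.
    return sorted(((qid, missing[qid]) for qid in question_ids),
                  key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy data, made up for the example.
    profiles = [
        {"q1": "yes", "q2": "no"},
        {"q1": "no"},
        {"q1": "yes", "q3": ""},
    ]
    ranked = rank_by_coverage(profiles, ["q3", "q2", "q1"])
    print(ranked)        # most-answered question comes first
    print(ranked[:50])   # keep only the top 50 questions
```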
Based on the premise that I only needed about 25 to 50 questions to reach a 96% to 98% match, I kept the top 50 questions and filtered out the rest.
Here are the top 3 most important questions as I identified them in Singapore:
Now the rest is simple. I just had to find the optimal answer to each of the most important questions. I could either delve into clustering similar profiles using K-Means or K-Modes style clustering algorithms, or I could just follow my simplified approach. Since Singapore is a small place, there is no point spending a lot of time just to classify almost similar women into N groups. Shouldn't I classify each woman into a separate category of her own? Everyone is unique, after all.
Well, no! That's not what I wanted to do. Optimising the profile for each individual woman would be a classic example of overfitting the curve. I wanted to create a single profile which would be generally appealing to most of the women. Aha! There is my answer! Select the answer based on what the majority says. But wait, there is a catch: wouldn't answering questions on which opinion is split hurt my match more than benefit it? It would. The key is to answer only the questions where an overwhelming majority answers the same way. Therefore I selected the questions with the least variance.
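A minimal sketch of this selection rule follows. It uses the share of the most common answer as a stand-in for "least variance"; the 0.8 cutoff, the dict-per-profile layout and the name `consensus_answers` are all assumptions made for the example.

```python
from collections import Counter


def consensus_answers(profiles, question_ids, min_share=0.8):
    """Return {question_id: majority answer} for questions where the most
    common answer covers at least min_share of the answers actually given."""
    chosen = {}
    for qid in question_ids:
        given = [p[qid] for p in profiles if p.get(qid)]
        if not given:
            continue  # nobody answered this question
        answer, count = Counter(given).most_common(1)[0]
        if count / len(given) >= min_share:
            chosen[qid] = answer
    return chosen


if __name__ == "__main__":
    # Toy data, made up for the example.
    profiles = [
        {"q1": "yes", "q2": "a"},
        {"q1": "yes", "q2": "b"},
        {"q1": "yes", "q2": "a"},
        {"q1": "yes"},
        {"q1": "no", "q2": "b"},
    ]
    # q1 has a 4-out-of-5 majority; q2 is split evenly and is skipped.
    print(consensus_answers(profiles, ["q1", "q2"]))
```

Questions where opinion is split are simply left unanswered, so they cannot pull the match percentage down.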
One such important question is this:
Now I knew which questions to answer, how to answer them and what importance to put on each. I created a new profile to answer these questions. After I finished answering about 25 of the most important questions, I found that I was getting more than an 80% match on most of the profiles returned by the search query for Singapore. It was high time. I wrote interesting stuff about myself on the profile and put up a glossy photo to look attractive.
Now I needed to get noticed by these ladies. So I wrote another script to visit just those women in Singapore who are frequently active. I chose to search for women from 18 to 30 years of age who had visited their profile within the last day. To keep visiting the profiles, I used my AWS account to run the script in an endless loop. This would keep me popping up in their visitors list again and again until they became curious and started to message me.
It worked like a charm. The next day I got more than 100 visitors on my profile. More than 10 messages flowed in; some found my profile interesting, some said I was handsome and they'd like to be friends. Just friends? I was still wondering. But this is Singapore.
For every new profile, OkCupid sends an email the next day showing a map of love over the world. Which country would be the best for me? Here is the map:
A lot can be inferred from these patterns. Do women of Singapore, Philippines, Malaysia, Indonesia and Hungary, including India, China and Mexico, think alike? How different are they from those of Belgium, Spain, UK, France, Italy and US? So India has more influence over Singaporean women than China? I will leave this interpretation for a later time. For now I am happy to see my research validated by OkCupid! :D