It’s Valentines Day – a day when people think about love and relationships. How people meet and form a relationship works a lot quicker than in our parent’s or grandparent’s generation. I’m sure many of you are told how it used to be – you met someone, dated them for a while, proposed, got married. People who grew up in small towns maybe had one shot at finding love, so they made sure they didn’t mess it up.
Today, finding a date is not a challenge – finding a match is probably the issue. In the last 20 years we’ve gone from traditional dating to online dating to speed dating to online speed dating. Now you just swipe left or swipe right, if that’s your thing.
In 2002-2004, Columbia University ran a speed-dating experiment where they tracked 21 speed dating sessions for mostly young adults meeting people of the opposite sex. I found the dataset and the key to the data here: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/.
I was interested in finding out what it was about someone during that short interaction that determined whether or not someone viewed them as a match. This is a great opportunity to practice simple logistic regression if you’ve never done it before.
The speed dating dataset
The dataset at the link above is quite substantial – over 8,000 observations with almost 200 datapoints for each. However, I was only interested in the speed dates themselves, and so I simplified the data and uploaded a smaller version of the dataset to my Github account here. I’m going to pull this dataset down and do some simple regression analysis on it to determine what it is about someone that influences whether someone sees them as a match.
Let’s pull the data and take a quick look at the first few lines:
library(tidyverse) library(corrplot) download.file("https://raw.githubusercontent.com/keithmcnulty/speed_dating/master/speed_data_data.RDS", "speed_dating_data.RDS") data <- readRDS("speed_dating_data.RDS") head(data, 3) %>% knitr::kable()
We can work out from the key that:
- The first five columns are demographic – we may want to use them to look at subgroups later.
- The next seven columns are important.
decis the raters decision on whether this individual was a match. Then we have scores out of ten on six characteristics: attractiveness, sincerity, intelligence, fun, ambitiousness and shared interests.
likecolumn is an overall rating. The
probcolumn is a rating on whether the rater believed that the other person would like them, and the final column is a binary on whether the two had met prior to the speed date, with the lower value indicating that they had met before.
We can leave the first four columns out of any analysis we do. Our outcome variable here is
dec. I’m interested in the rest as potential explanatory variables. Before I start to do any analysis, I want to check if any of these variables are highly collinear – ie, have very high correlations. If two variables are measuring pretty much the same thing, I should probably remove one of them.
corr_matrix <- data %>% dplyr::select(attr, sinc, intel, fun, amb, shar, like, prob, met) %>% as.matrix() M <- cor(corr_matrix, use = "complete.obs") corrplot::corrplot(M)
OK, clearly there’s mini-halo effects running wild when you speed date. But none of these get up really high (eg past 0.75), so I’m going to leave them all in because this is just for fun. I might want to spend a bit more time on this issue if my analysis had serious consequences here.
Running a logistic regression on the data
The outcome of this process is binary. The respondent decides yes or no. That’s harsh, I give you. But for a statistician it’s good because it points straight to a binomial logistic regression as our primary analytic tool. Let’s run a logistic regression model on the outcome and potential explanatory variables I’ve identified above, and take a look at the results.
model <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, data = data, family = "binomial") summary(model)
## ## Call: ## glm(formula = dec ~ attr + sinc + intel + fun + amb + shar + ## like + prob + met, family = "binomial", data = data) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.8210 -0.7602 -0.2311 0.7921 3.6432 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -6.0769718 0.2101454 -28.918 < 2e-16 *** ## attr 0.4296537 0.0235843 18.218 < 2e-16 *** ## sinc -0.2088995 0.0273358 -7.642 2.14e-14 *** ## intel -0.0005595 0.0327799 -0.017 0.986 ## fun 0.1357729 0.0258310 5.256 1.47e-07 *** ## amb -0.1904715 0.0251996 -7.559 4.08e-14 *** ## shar 0.1073134 0.0208044 5.158 2.49e-07 *** ## like 0.5607669 0.0325454 17.230 < 2e-16 *** ## prob 0.1503168 0.0178518 8.420 < 2e-16 *** ## met 0.0021065 0.0310848 0.068 0.946 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 9322.9 on 6818 degrees of freedom ## Residual deviance: 6460.5 on 6809 degrees of freedom ## (1559 observations deleted due to missingness) ## AIC: 6480.5 ## ## Number of Fisher Scoring iterations: 5
So, perceived intelligence doesn’t really matter. (This could be a factor of the population being studied, whom I believe were all undergraduates at Columbia and so would all have a high average SAT I suspect – so intelligence might be less of a differentiator). Neither does whether or not you’d met someone before. Everything else seems to play a significant role.
More interesting is how much of a role each factor plays. The Coefficients Estimates in the model output above tell us the effect of each variable, assuming other variables are held still. But in the form above they are expressed in log odds, and we need to convert them to regular odds ratios so we can understand them better, so let’s adjust our results to do that.
ctable <- coef(summary(model)) odds_ratio <- exp(coef(summary(model))[ , c("Estimate")]) (coef_summary <- cbind(ctable, as.data.frame(odds_ratio, nrow = nrow(ctable), ncol = 1))) %>% knitr::kable()
|Estimate||Std. Error||z value||Pr(>|z|)||odds_ratio|
So we have some interesting observations:
- Unsurprisingly, the respondents overall rating on someone is the biggest indicator of whether they decide to match with them.
- Attractiveness seems substantially the primary positive indicator of a match.
- Interestingly, sincerity and ambitiousness decreased the likelihood of a match – they were seemingly turn-offs for potential dates.
- Other factors played a minor positive role, including whether or not the respondent believed the interest to be reciprocated.
Comparing the genders
It’s of course natural to ask whether there are gender differences in these dynamics. So I’m going to rerun the analysis on the two gender subsets and then create a chart that illustrates any differences.
# females only model_f <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, data = data %>% dplyr::filter(gender == 0), family = "binomial") ctable_f <- coef(summary(model_f)) odds_ratio_f <- exp(coef(summary(model_f))[ , c("Estimate")]) coef_summary_f <- cbind(ctable_f, as.data.frame(odds_ratio_f, nrow = nrow(ctable_f), ncol = 1)) # males only model_m <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, data = data %>% dplyr::filter(gender == 1), family = "binomial") ctable_m <- coef(summary(model_m)) odds_ratio_m <- exp(coef(summary(model_m))[ , c("Estimate")]) coef_summary_m <- cbind(ctable_m, as.data.frame(odds_ratio_m, nrow = nrow(ctable_m), ncol = 1)) # chart chart_data <- coef_summary_f %>% dplyr::add_rownames() %>% dplyr::left_join(coef_summary_m %>% dplyr::add_rownames(), by = "rowname") %>% dplyr::select(rowname, odds_ratio_f, odds_ratio_m) %>% tidyr::pivot_longer(cols = c("odds_ratio_f", "odds_ratio_m"), names_to = "odds_ratio") %>% dplyr::mutate(Effect = value - 1, Gender = ifelse(odds_ratio == "odds_ratio_f", "Female", "Male"), Factor = dplyr::recode(rowname, amb = "Ambitious", attr = "Attractive", fun = "Fun", intel = "Intelligent", like = "Liked", met = "Never met\nbefore", prob = "Believe\nthey like\nme", shar = "Shared\nInterests", sinc = "Sincere")) ggplot(data=chart_data %>% dplyr::filter(rowname != "(Intercept)"), aes(x=Factor, y=Effect, fill=Gender)) + geom_bar(stat="identity", color="black", position=position_dodge()) + theme_minimal() + labs(x = "", title = "What matters in speed dating?") + scale_fill_manual(values=c('#FFC0CB', '#0000FF'))
We find a couple of interesting differences. True to stereotype, physical attractiveness seems to matter a lot more to men. And as per long-held beliefs, intelligence does matter more to women. It has a significant positive effect versus men where it doesn’t seem to play a meaningful role. The other interesting difference is that whether you have met someone before does have a significant effect on both groups, but we didn’t see it before because it has the opposite effect for men and women and so was averaging out as insignificant. Men seemingly prefer new interactions, versus women who like to see a familiar face.
As I mentioned above, the entire dataset is quite large, so there is a lot of exploration you can do here – this is just a small part of what can be gleaned. If you end up playing around with it, I’m interested in what you find.