Developing Age and Gender Predictive Lexica over Social Media

By
Maarten Sap, Michal Kosinski, Johannes C. Eichstaedt, Gregory Park
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (Conference). January 2014

Demographic lexica have potential for widespread use in social science, economic, and business applications. We derive predictive lexica (words and weights) for age and gender using regression and classification models from word usage in Facebook, blog, and Twitter data with associated demographic labels. The lexica, made publicly available, achieved state-of-the-art accuracy in language-based age and gender prediction over Facebook and Twitter, and were evaluated for generalization across social media genres as well as in settings where only a limited number of messages is available.
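
To make the idea of a lexicon as "words and weights" concrete, the following minimal sketch shows one plausible way to apply such a lexicon to a piece of text: each word's weight is multiplied by its relative frequency in the text, and the products are summed together with an intercept. The lexicon entries, the intercept, the whitespace tokenizer, and the function names are all hypothetical and chosen for illustration only; they are not the published lexica or the authors' code.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> weight). The values are invented for
# illustration and are NOT taken from the published lexica.
TOY_AGE_LEXICON = {"homework": -2.0, "mortgage": 3.0, "lol": -1.0}
TOY_AGE_INTERCEPT = 25.0  # assumed intercept of a linear model


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def lexicon_score(text, lexicon, intercept=0.0):
    """Return intercept + sum of weight * relative frequency per lexicon word."""
    tokens = tokenize(text)
    if not tokens:
        # With no words to score, fall back to the intercept alone.
        return intercept
    counts = Counter(tokens)
    total = len(tokens)
    score = intercept
    for word, weight in lexicon.items():
        # Counter returns 0 for words that do not occur in the text.
        score += weight * counts[word] / total
    return score


if __name__ == "__main__":
    message = "lol homework lol mortgage"
    print(lexicon_score(message, TOY_AGE_LEXICON, TOY_AGE_INTERCEPT))
```

Under the same assumptions, a gender lexicon could be applied with the same function by reading the sign of the score as the predicted class.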