Automated text analysis in psychology: methods, applications, and future developments*

Published online by Cambridge University Press: 31 July 2014

RUMEN ILIEV*
Affiliation:
University of Michigan
MORTEZA DEHGHANI
Affiliation:
University of Southern California
EYAL SAGI
Affiliation:
Northwestern University
*
Address for correspondence: e-mail: riliev@umich.edu

Abstract

Recent years have seen rapid developments in automated text analysis methods focused on measuring psychological and demographic properties. While this development has mainly been driven by computer scientists and computational linguists, such methods can be of great value for social scientists in general, and for psychologists in particular. In this paper, we review some of the most popular approaches to automated text analysis from the perspective of social scientists, and give examples of their applications in different theoretical domains. After describing some of the pros and cons of these methods, we speculate about future methodological developments, and how they might change the social sciences. We conclude that, although current methods have many disadvantages and pitfalls compared to more traditional methods of data collection, the constant increase of computational power and the wide availability of textual data will inevitably make automated text analysis a common tool for psychologists.

Type
Research Article
Copyright
Copyright © UK Cognitive Linguistics Association 2014 

Footnotes

*

This research has been supported in part by an AFOSR Young Investigator award to MD, and an ARTIS research grant to RI. We are thankful to Jeremy Ginges, Sid Horton, Antonio Damasio, Jonas Kaplan, Sarah Gimbel, Kate Johnson, Lisa Aziz-Zadeh, Jesse Graham, Peter Khooshabeh, Peter Carnevale, and Derek Harmon for their helpful comments and suggestions.
