Lexica
Text Features
Lexicon | Description | Score interpretation | Citation |
---|---|---|---|
Flesch-Kincaid Grade Level | The Flesch-Kincaid grade level of a piece of text based on the words per sentence and syllables per word. Higher scores mean the text requires more years of education to understand. | A grade level score is calculated based on sentence lengths and word syllables. Scores can be interpreted as the number of years of education required to understand a passage. There is no upper bound; the lowest possible grade level that can be calculated is -3.4. | Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch. |
Concreteness & Familiarity | Concreteness is the degree to which a word denotes an actual, tangible, or “real” entity: something that arises from or appeals to immediate experience. Familiarity is how often a word is typically seen or heard. | The concreteness score of a word indicates its level of concreteness as judged on a 7-point scale. In theory, these scores should range from 100 (abstract) to 700 (concrete); however, the scores assigned to many words were estimated using a regression model, so scores outside this range occur. Familiarity scores are computed identically, with 100 denoting unfamiliar and 700 denoting very familiar in the original judgment. | Paetzold, G., Specia, L. Inferring Psycholinguistic Properties of Words. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 435-440 (2016). http://dx.doi.org/10.18653/v1/N16-1050. |
Emotionality | Quantifies the degree to which an individual’s attitude or reaction is based on emotion. The words "amazing" and "excellent" are similarly positive, but the former indicates that the attitude rests on a more emotional, feelings-based reaction. The lexicon also gives emotional valence and extremity (how extreme the valence is). | The emotionality score of a word is a manual judgment on a 0-9 scale, with 0 indicating no emotionality and 9 indicating high emotionality. Emotional valence is scored similarly, from 0 (highly negative) to 9 (highly positive). Extremity is the distance of emotional valence from the scale midpoint, i.e., the absolute value of (emotional valence minus 4.5). | Rocklage, M.D., Rucker, D.D. & Nordgren, L.F. The Evaluative Lexicon 2.0: The measurement of emotionality, extremity, and valence in language. Behavior Research Methods, 50, 1327–1344 (2018). https://doi.org/10.3758/s13428-017-0975-6. |
NRC Hashtag Emotion and Sentiment | Captures how much of eight specific emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and how much positive and negative sentiment are expressed in a text. | Emotion word scores give the association between the word and the emotion relative to texts without this emotion, calculated using pointwise mutual information. A score of zero indicates no association with an emotion, while higher scores indicate greater association. Sentiment word scores are computed identically, but using sets of positive and negative hashtags instead of emotion hashtags. | Using Hashtags to Capture Fine Emotion Categories from Tweets. Saif M. Mohammad, Svetlana Kiritchenko, Computational Intelligence, in press. #Emotional Tweets, Saif Mohammad, In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*Sem), June 2012, Montreal, Canada. Sentiment Analysis of Short Informal Texts. Svetlana Kiritchenko, Xiaodan Zhu and Saif Mohammad. Journal of Artificial Intelligence Research, volume 50, pages 723-762, August 2014. |
NRC VAD (Valence, Arousal, Dominance) | Produces scores for valence, arousal, and dominance. Valence indicates the positivity or negativity expressed in a passage of text; arousal indicates the emotional intensity of the text; dominance indicates the degree of control exerted. | For a given word, the scores for each VAD category were determined based on how often they were manually selected as exemplars of that category. A score of 0 indicates that a word was the least often chosen as an exemplar of that category, while a score of 1 indicates that a word was most often chosen as an exemplar of that category. | Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. Saif M. Mohammad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, July 2018. Retrieved from https://saifmohammad.com/WebDocs/acl2018-VAD.pdf. |
Pronouns | The relative frequencies of different pronouns: I, we, you, s/he, and they (including contractions and common variations). | Each value is the relative frequency of that pronoun type, or one of its common variants, in the text (relative to all terms in the text). | --- |
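
As a rough illustration of three of the score interpretations in the table above, the following Python sketch computes the Flesch-Kincaid grade level from the standard published formula, extremity as the distance of valence from the 4.5 midpoint, and the relative frequency of one pronoun type. The function names, the vowel-group syllable heuristic, the regex-based sentence and word splitting, and the first-person variant list are all assumptions made for illustration; they are not taken from the tool itself and may differ from how it actually tokenizes text.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (a crude heuristic, assumed here)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Standard Flesch-Kincaid grade level formula (Kincaid et al., 1975)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def grade_for_text(text):
    """Grade level for raw text, using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words) / len(sentences),
                                syllables / len(words))


def extremity(valence, midpoint=4.5):
    """Distance of a 0-9 emotional valence score from the scale midpoint."""
    return abs(valence - midpoint)


# Hypothetical variant list for the "I" pronoun type; the real list is not
# given in this document.
FIRST_PERSON_SINGULAR = {"i", "i'm", "i've", "i'll", "i'd", "me", "my", "mine", "myself"}


def relative_frequency(tokens, variants):
    """Share of all tokens that belong to the given pronoun variant set."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in variants)
    return hits / len(tokens)
```

With one word per sentence and one syllable per word, `flesch_kincaid_grade(1, 1)` gives the lower bound of -3.4 mentioned in the table, and `extremity` returns 4.5 at either end of the valence scale.
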
Demographics
Lexicon | Description | Score interpretation | Citation |
---|---|---|---|
Age & Gender | The values produced by this lexicon can be interpreted as predictions about the age and gender of a text’s author. Age values can be interpreted as literal age predictions; positive gender values indicate female, while negative indicate male. | Age values for a text are straightforwardly interpretable as the predicted age of the author. The magnitude of gender values can be interpreted as the strength of the prediction. | Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing Age and Gender Predictive Lexica over Social Media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). |