Lörres, Möppes, and the Swiss. (Re)Discovering regional patterns in anonymous social media data

Christoph Purschke; Dirk Hovy

doi:10.1017/jlg.2019.10

Published online by Cambridge University Press: 12 December 2019

Christoph Purschke*: University of Luxembourg, Esch-sur-Alzette, Luxembourg
Dirk Hovy: Bocconi University, Milan, Italy
*: Author for correspondence: Christoph Purschke, Email: [email protected]

Abstract

We study regional similarities and differences in language use on an anonymous mobile chat application in the German-speaking area. We use a neural network on 2.3 million online conversations to automatically learn representations of words and cities. These linguistic-use-based representations capture regional distinctions in a high-dimensional vector space that can be clustered and visualized to discover patterns in the data. We find that the resulting regional patterns are closely linked to the traditional division of German dialects, even though most of the conversations are written in standard German. The resulting maps correspond to traditional dialect divisions and language-external spatial structures, with a few notable exceptions that can be explained through external factors.
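The clustering step described above can be illustrated with a minimal sketch. It assumes city representations are already available as vectors. The three-dimensional vectors in `CITY_VECTORS` are invented toy values, not data from the study, and average-linkage clustering on cosine distance is one plausible choice rather than the procedure the authors necessarily used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering on cosine distance.

    `vectors` maps a name to its vector; returns a list of sets of names.
    """
    clusters = [{name} for name in sorted(vectors)]

    def dist(c1, c2):
        # Mean cosine distance over all cross-cluster pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy vectors (assumption): real city representations would be
# learned from the conversations and have many more dimensions.
CITY_VECTORS = {
    "Hamburg": [0.9, 0.1, 0.0],
    "Bremen": [0.8, 0.2, 0.1],
    "Munich": [0.1, 0.9, 0.1],
    "Vienna": [0.2, 0.8, 0.2],
    "Zurich": [0.0, 0.2, 0.9],
    "Bern": [0.1, 0.1, 0.8],
}

clusters = agglomerative(CITY_VECTORS, n_clusters=3)
for cluster in clusters:
    print(sorted(cluster))
```

With these toy values the cities fall into three groups, which could then be drawn on a map and compared with a dialect classification.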

Our method also facilitates two qualitative analyses: it allows us to discover geographically pertinent words at various regional levels and to create style profiles specific to regional groups, based on various linguistic resources. The results of our study strongly suggest the existence of region-specific patterns of language use (“digital regiolects”) representing distinctive strategies of linguistic stylization in relation to linguistic resources and topics. As a methodological contribution, we show how linguistic theory can drive the application and direction of neural network-based representation learning, and how the judicious application of such methods provides the basis for qualitative analysis of large-scale data collections.
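One way to picture the search for geographically pertinent words is as a nearest-neighbour lookup. This sketch assumes that word and region representations live in the same vector space, which is an assumption made here for illustration. The vectors in `WORD_VECTORS` and `REGION_VECTORS`, the region labels, and the helper `nearest_words` are all hypothetical and are not taken from the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_words(region_vector, word_vectors, k=1):
    """Return the k words whose vectors are most similar to the region vector."""
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine(region_vector, word_vectors[w]),
        reverse=True,
    )
    return ranked[:k]

# Invented toy vectors (assumption): a shared space for words and regions.
WORD_VECTORS = {
    "moin": [0.9, 0.1, 0.1],
    "servus": [0.1, 0.9, 0.1],
    "grüezi": [0.1, 0.1, 0.9],
    "und": [0.5, 0.5, 0.5],  # a function word with no regional skew
}
REGION_VECTORS = {
    "north": [0.85, 0.15, 0.05],
    "southeast": [0.15, 0.85, 0.15],
    "swiss": [0.05, 0.15, 0.85],
}

top = {
    region: nearest_words(vec, WORD_VECTORS, k=1)[0]
    for region, vec in REGION_VECTORS.items()
}
print(top)
```

A region vector could be a single city's representation or the centroid of a cluster of cities, which is how the same lookup would extend to different regional levels.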

Keywords

computational sociolinguistics; regional variation; German; language use; social style; social media; neural networks; representation learning; word embeddings; distributed representations; distributional semantics

Type: Articles
Information: Journal of Linguistic Geography, Volume 7, Issue 2, October 2019, pp. 113–134

DOI: https://doi.org/10.1017/jlg.2019.10
Copyright: © Cambridge University Press 2019
