About the Nordic Tweet Stream

1) The Nordic Tweet Stream – what it is, and what it isn’t?

The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region. These born-digital data have been collected for basic research purposes. While all the data accessible here are from an open source, using them as primary material in the humanities can be difficult, since researchers may face technical challenges related to data access and use. The objective of this digital interface is to enable easy access to, and distribution of, born-digital data for basic research. We operate according to the FAIR Data Principles (see FAIR). The guiding principles of FAIR aim at making data findable, accessible, interoperable, and reusable (see Wilkinson et al. 2016).

The NTS offers access to data from November 2016 onwards. The current version includes material up to February 2019, but future versions will contain updates in real time.

The NTS does not give access to all material; it contains only material for which the geolocation properties have been activated. Previous studies have estimated that the share of such material in general is low (see Graham et al. 2013).

As with all corpora used in corpus-linguistic research, the NTS has its limitations. Here, users are referred to a classic article by Matti Rissanen (ICAME Journal 13, 16–19, 1989), in which he proposes three universal problems associated with the use of (diachronic) corpora. While all three are relevant here, perhaps the most important is the “God’s truth fallacy”. Rissanen writes that an “authoritative corpus may easily create the erroneous impression that it gives an accurate reflection of the entire reality of the language it is intended to represent” (1989: 17). Likewise, it is important to keep in mind that the material here is a snapshot of languages in use in the Nordic region.

2) NTS – for whom?

The NTS material is multilingual. We envisage that researchers from various fields, such as sociolinguistics, dialectology, social sciences, and cultural studies, could make use of this material, either as the sole primary data or as additional material accompanying structured corpus data.

3) NTS – what can you do with it?

The interface lets you search for character strings and phrases in all the different languages in the material. You can restrict your searches using a few geolocation parameters on the search page. The interface also includes a visualization tool that displays your search results on a map. After you run a query, the results page shows the output in a KWIC (keyword in context) window. You can also download the raw results as a .txt file by selecting the parameters that you want to include. This function could be especially useful for users who want to use the NTS as the first point of entry to data.
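
For readers unfamiliar with the KWIC format, the following is a minimal Python sketch of how such a display can be produced. It is an illustration only and not the NTS implementation: the function name kwic, the width parameter, and the sample sentence are all invented for this example.

```python
def kwic(text: str, keyword: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) rows for each hit.

    Assumption: tokens are split on whitespace and matched case-insensitively
    after stripping common trailing/leading punctuation. This is a
    simplification, not the tokenization used by the NTS.
    """
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?;:") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    # Invented sample sentence, not taken from the corpus.
    for left, key, right in kwic("The cat sat on the mat.", "the", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>10} | {key} | {right}")
```

Each row keeps the hit in the middle with a fixed number of tokens on either side, which is what makes it easy to scan many hits at once.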

The NTS interface uses the following regular expressions (NB: these can be modified, so please use the feedback form to let us know what kind of functions you'd like to have):
-The star sign * stands for zero or more characters and can be used anywhere in the character string (e.g. *ing searches for any item that ends in -ing).
-The plus sign + stands for one or more characters (e.g. +able finds cable and table, but not able).
-The vertical bar | means either or, so that she|her|hers searches for all of these personal pronouns.
-Curly brackets indicate a lemma, so that {be} searches for all the forms of be (is, 's, was, etc.).
-POS tags are indicated with an underscore _, and more information on the tag set can be found in the POS Guide on the search page.
These regular expressions can be combined with each other. Note, however, that only English tweets have been POS-tagged so far, so combining regular expressions with POS tags works only when searching the English material.
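
To make the operators listed above concrete, here is a rough Python sketch that translates a simplified NTS-style query into a standard regular expression. It is not the code behind the interface. The function nts_to_regex and the tiny LEMMA_FORMS table are hypothetical, the sketch assumes that wildcards stay within a single whitespace-delimited token, and POS tags (the underscore syntax) are left out because the tag set is not described here.

```python
import re

# Hypothetical mini lemma table for illustration; the lemma list actually
# used by the NTS is not documented on this page.
LEMMA_FORMS = {
    "be": ["be", "am", "is", "are", "was", "were", "been", "being", "'s"],
}


def nts_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified NTS-style query into a compiled Python regex.

    Assumptions: * and + match non-space characters only, and matching is
    case-insensitive. An unclosed { raises ValueError.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append(r"\S*")  # zero or more characters
        elif ch == "+":
            parts.append(r"\S+")  # one or more characters
        elif ch == "|":
            parts.append("|")     # either/or
        elif ch == "{":
            end = pattern.index("}", i)
            lemma = pattern[i + 1:end]
            forms = LEMMA_FORMS.get(lemma, [lemma])
            parts.append("(?:" + "|".join(re.escape(f) for f in forms) + ")")
            i = end               # skip past the closing bracket
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("(?:" + "".join(parts) + ")", re.IGNORECASE)


if __name__ == "__main__":
    for query, token in [("*ing", "running"), ("+able", "table"),
                         ("+able", "able"), ("she|her|hers", "hers"),
                         ("{be}", "was")]:
        hit = nts_to_regex(query).fullmatch(token) is not None
        print(f"{query!r} matches {token!r}: {hit}")
```

Using fullmatch on individual tokens mirrors the examples in the list: +able accepts cable and table but rejects able, because the plus sign requires at least one character before the fixed string.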

4) NTS – basic info and how to cite the material:

This corpus and the interface are the result of interdisciplinary research between sociolinguists and computer scientists, funded by the Center for Data Intensive Sciences and Applications (DISA) at Linnaeus University in Sweden and by the University of Eastern Finland.

If you use the NTS interface and use the findings in your publications, please cite our recent paper, which is available online:

[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

5) NTS – future updates:

Note that the current version is a beta version, and we value users’ feedback on the functions included and on the user experience.

Please send your feedback through the comment form on this page.




6) Contact:

Please contact Prof. Mikko Laitinen (general comments) and Prof. Jonas Lundberg (technicalities) for any comments or questions on the corpus and the interface.

Contact details:
First name [dot] last name [at] lnu.se | uef.fi

7) References:

[1] Leetaru, Kalev, Shaowen Wang, Guofeng Cao, Anand Padmanabhan & Eric Shook. 2013. Mapping the global Twitter heartbeat: The geography of Twitter. First Monday 18.

[2] Wilkinson, M. D. et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18