epitweetr: user documentation

European Centre for Disease Prevention and Control (ECDC)

Description

The epitweetr package allows you to automatically monitor trends of tweets by time, place and topic. This automated monitoring aims at the early detection of public health threats through the detection of signals (e.g. an unusual increase in the number of tweets for a specific time, place and topic). The epitweetr package was designed to focus on infectious diseases, but it can be extended to all hazards or other fields of study by modifying the topics and keywords.

The general principle behind epitweetr is that it collects tweets and related metadata from the Twitter Standard Search API version 1.1 (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview) according to specified topics and stores these tweets in a compressed form on your computer. epitweetr geolocalises the tweets and collects information on key words within a tweet. Tweets are aggregated according to topic and geographical location. Next, a signal detection algorithm identifies the number of tweets (by topic and geographical location) that exceeds what is expected for a given day. Then, epitweetr sends out email alerts to notify those who need to further investigate these signals following the epidemic intelligence processes (filtering, validation, analysis and preliminary assessment).

The package includes an interactive web application (Shiny app) with five pages: the dashboard, where a user can visualise and explore tweets (Fig 1), the alerts page, where you can view the current alerts and associated information (Fig 2), the geotag evaluation page, where you can evaluate the geolocation algorithm in different tweet fields to manually choose the geolocation threshold (Fig 3), the configuration page, where you can change settings and check the status of the underlying processes (Fig 4), and the troubleshoot page, with automatic checks and hints for using epitweetr with all its functionalities (Fig 5). On the dashboard, users can view the aggregated number of tweets over time, the location of these tweets on a map and the words most frequently found in these tweets. These visualisations can be filtered by the topic, location and time period you are interested in. Other filters are available and include the possibility to adjust the time unit of the timeline, whether retweets/quotes should be included, what kind of geolocation types you are interested in, the sensitivity of the prediction interval for the signal detection, and the number of days used to calculate the threshold for signals. This information is also downloadable directly from this interface in the form of data, pictures, or reports.

More information on the methodology used is available here

Shiny app dashboard:

Fig 1: Shiny app dashboard figure

Shiny app alerts page:

Fig 2: Shiny app alerts page

Shiny app geotag evaluation page:

Fig 3: Shiny app geotag evaluation page

Shiny app configuration page:

Fig 4: Shiny app configuration page

Shiny app troubleshoot page:

Fig 5: Shiny app troubleshoot page

Background

Epidemic Intelligence at ECDC

Article 3 of the European Centre for Disease Prevention and Control (ECDC) founding regulation and Decision No 1082/2013/EU on serious cross-border threats to health have established the detection of public health threats as a core activity of ECDC.

ECDC performs Epidemic Intelligence (EI) activities aimed at rapidly detecting and assessing public health threats, focusing on infectious diseases, to ensure the EU’s health security. ECDC uses social media as one of its sources for the early detection of signals of public health threats. Until 2020, the monitoring of social media was mainly performed through the screening and analysis of posts from pre-selected experts or organisations, mainly on Twitter and Facebook.

More information and an online tutorial are available:

EI sources

EI tutorial

Objectives of epitweetr

The primary objective of epitweetr is to use the Twitter Standard Search API version 1.1 in order to detect early signals of potential threats by topic and by geographical unit.

Its secondary objective is to enable the user through an interactive Shiny interface to explore the trend of tweets by time, geographical location and topic, including information on top words and numbers of tweets from trusted users, using charts and tables.

Hardware requirements

The minimum and suggested hardware requirements for the computer are in the table below:

Hardware requirements                  Minimum    Suggested
RAM needed                             8 GB       16 GB
CPU needed                             4 cores    12 cores
Space needed for 3 years of storage    3 TB       5 TB

The CPU and RAM usage can be configured on the Shiny app configuration page (see section The interactive user application (Shiny app)>The configuration page). The RAM, CPU and space needed may depend on the amount and size of the topics you request in the collection process.

Installation

epitweetr is designed to be platform independent, working on Windows, Linux and Mac. We recommend that you use epitweetr on a computer that can be run continuously. You can switch the computer off, but you may miss some tweets if the downtime is long enough, which will have implications for the alert detection. Before using epitweetr, the following items need to be installed:

Prerequisites for running epitweetr

Prerequisites for some of the functionalities in epitweetr

Extra prerequisites for R developers

If you would like to develop epitweetr further, then the following development tools are needed:

External dependencies

epitweetr will need to download some dependencies in order to work. The tool will do this automatically the first time the alert detection process is launched. The Shiny app configuration page will allow you to change the target URLs of these dependencies, which are the following:

Installing epitweetr from CRAN

After installing all required dependencies listed in the section “Prerequisites for running epitweetr”, you can install epitweetr:

install.packages("epitweetr")

Environment variables

Additionally, the R environment needs to know where the Java installation home is. To check this, type in the R console:

Sys.getenv("JAVA_HOME")

If the command returns an empty string, you will need to set the JAVA_HOME environment variable; please see the instructions for your specific OS. In some cases, epitweetr can work without the JAVA_HOME environment variable being set.
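The check above can be sketched in R as follows. This is a minimal illustration: `java_home_is_set` is a helper introduced here (it is not part of epitweetr), and the path is a placeholder that you would replace with the Java home of your own installation. Note that `Sys.setenv` only affects the current R session, so a permanent setting still has to be made at OS level.

```r
# Illustrative helper (not part of epitweetr): TRUE when JAVA_HOME is non-empty.
java_home_is_set <- function() {
  nzchar(Sys.getenv("JAVA_HOME"))
}

if (!java_home_is_set()) {
  # Placeholder path: replace it with the Java home of your own installation.
  Sys.setenv(JAVA_HOME = "/path/to/your/java/home")
}

java_home_is_set()
```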

The first time you run the application, if the tool cannot identify a secure password store provided by the operating system, you will see a pop-up window requesting a keyring password (Linux and Mac). This is a password necessary for storing encrypted Twitter credentials. Please choose a strong password and remember it. You will be asked for this password each time you run the tool. You can avoid this by setting a system environment variable named ecdc_twitter_tool_kr_password containing the chosen password.

Launching the epitweetr Shiny app

You can launch the epitweetr Shiny app from the R session by typing the following in the R console. Replace “data_dir” with the designated data directory, which is a local folder you choose to store tweets, time series and configuration files in:

library(epitweetr)
epitweetr_app("data_dir")

Please note that the data directory entered in R should use “/” instead of “\” (an example of a correct path would be ‘C:/user/name/Documents’). This applies especially on Windows if you copy the path from the File Explorer.
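If you paste a Windows path, a small helper can convert the separators for you. The sketch below uses only base R; `to_forward_slashes` is an illustrative name, not an epitweetr function. Inside R source code a literal backslash has to be written as "\\" (or, from R 4.0.0, inside a raw string such as r"(C:\user\name\Documents)").

```r
# Illustrative helper (not part of epitweetr): replace backslashes with forward slashes.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# "\\" in R source code is a single backslash in the resulting string.
data_dir <- to_forward_slashes("C:\\user\\name\\Documents")
data_dir
```

The resulting `data_dir` can then be passed to `epitweetr_app(data_dir)`.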

Alternatively, you can use a launcher: in an executable .bat or shell file, type the following (replacing “data_dir” with the designated data directory):

R --vanilla -e "epitweetr::epitweetr_app('data_dir')"

You can check that all requirements are properly installed in the troubleshoot page. More information is available in section The interactive user application (Shiny app)>Dashboard:The interactive user interface for visualisation>The troubleshoot page

Setting up tweet collection and the alert detection loop

In order to use epitweetr, you will need to collect tweets and run the alert detection loop (geonames, languages, geotag, aggregate and alerts). Further details are also available in subsequent sections of the user documentation. A summary of the steps needed is as follows:

# Launch the Shiny app
library(epitweetr)
epitweetr_app("data_dir")

# Start the tweet collection
library(epitweetr)
search_loop("data_dir")

# Start the alert detection loop
library(epitweetr)
detect_loop("data_dir")

For more details you can go through the section How does it work? General architecture behind epitweetr, which describes the underlying processes behind the tweet collection and the signal detection. Also, the section “The interactive Shiny application (Shiny app)>The configuration page” describes the different settings on the configuration page.

How does it work? General architecture behind epitweetr

The following sections describe in detail the above general principles. The settings of many of these elements can be configured in the Shiny app configuration page, which is explained in the section The interactive Shiny application (Shiny app)>The configuration page.

Collection of tweets

Use of the Twitter Standard Search API version 1.1

epitweetr uses the Twitter Standard Search API version 1.1. The advantage of this API is that it is a free service provided by Twitter enabling users of epitweetr to access tweets free of charge. The search API is not meant to be an exhaustive source of tweets. It searches against a sample of recent tweets published in the past 7 days and it focuses on relevance and not completeness. This means that some tweets and users may be missing from search results.

While this may be a limitation in other fields of public health or research, the epitweetr development team believe that for the objective of signal detection a sample of tweets is sufficient to detect potential threats of importance in combination with other types of sources.

Other attributes of the Twitter Standard Search API version 1.1 include:

  • Only tweets from the last 5–8 days are indexed by Twitter

  • A maximum of 180 requests every 15 minutes are supported by the Twitter Standard Search API (450 requests every 15 minutes if you are using the Twitter developer app credentials; see next section)

  • Each request returns a maximum of 100 tweets and/or retweets
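To get a feeling for what these limits mean in practice, the arithmetic can be written out in R. The figures below are upper bounds derived only from the limits listed above (requests per 15 minutes and tweets per request); the variable names are illustrative.

```r
# Limits taken from the list above
requests_per_window <- c(user_account = 180, developer_app = 450)
tweets_per_request  <- 100
windows_per_day     <- 24 * 60 / 15   # number of 15-minute windows in a day

# Upper bounds on the number of tweets that can be collected
max_tweets_per_window <- requests_per_window * tweets_per_request
max_tweets_per_day    <- max_tweets_per_window * windows_per_day

max_tweets_per_window
max_tweets_per_day
```

This gives at most 18,000 tweets per 15 minutes with a Twitter account and 45,000 with a developer app, which is why very popular topics may not be collected completely.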

Twitter authentication

You can authenticate the collection of tweets by using a Twitter account (this approach utilises the rtweet package app) or by using a Twitter application. For the latter, you will need a Twitter developer account, which can take some time to obtain, due to verification procedures. We recommend using a Twitter account via the rtweet package for testing purposes and short-term use, and the Twitter developer application for long-term use.

  • Using a Twitter account: delegated via rtweet (user authentication)

    • You will need a Twitter account (username and password)

    • The rtweet package will send a request to Twitter, so it can access your Twitter account on your behalf

    • A pop-up window will appear where you can enter your Twitter user name and password to confirm that the application can access Twitter on your behalf. This confirmation yields an access token, which is then sent with each request for tweets.

  • Using a Twitter developer app: via epitweetr (app authentication)

    • If you have not done so already, you will need to create a Twitter developer account: https://developer.twitter.com/en/apply-for-access

    • Create an app

    • For the access type, ensure you have read and write access

    • Make a note of your OAuth settings

      • Add them to the configuration page in the Shiny app (see image below)

      • With this information, epitweetr can request a token directly from Twitter at any time. The advantage of this method is that the token is not connected to any user information and tweets are returned independently of any user context.

      • With this app you can perform 450 requests every 15 minutes instead of the 180 requests every 15 minutes that a Twitter account allows.

Topics and tweet collection queries

After the Twitter authentication, you need to specify a list of topics in epitweetr to indicate which tweets to collect. For each topic, you have one or more queries that epitweetr uses to collect the relevant tweets (e.g. several queries for a topic using different terminology and/or languages).

A query consists of keywords and operators that are used to match tweet attributes. Keywords separated by a space indicate an AND clause. You can also use an OR operator. A minus sign before the keyword (with no space between the sign and the keyword) indicates the keyword should not be in the tweet attributes. While queries can be up to 500 characters long, best practice is to limit your query to 10 keywords and operators and to limit the complexity of the query, meaning that sometimes you need more than one query per topic.
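As an illustration, a query can be checked against these constraints before it is added to a topic. The sketch below is not part of epitweetr: `check_query` is a hypothetical helper, the example query is made up, and the default limit of 500 characters is taken from the description of the Length_charact field of the topics file.

```r
# Hypothetical helper (not part of epitweetr): counts characters and
# whitespace-separated terms (keywords and operators) in a query.
check_query <- function(query, max_chars = 500, max_terms = 10) {
  terms <- strsplit(trimws(query), "\\s+")[[1]]
  list(
    n_chars = nchar(query),
    n_terms = length(terms),
    ok      = nchar(query) <= max_chars && length(terms) <= max_terms
  )
}

# Space = AND, OR = alternative, leading minus = keyword must be absent
query  <- "measles outbreak OR epidemic -vaccine"
result <- check_query(query)
result$ok
```

If `ok` is FALSE, splitting the query into several shorter queries for the same topic is the approach described above.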

epitweetr comes with a default list of topics as used by the ECDC Epidemic Intelligence team at the date of package generation (1st of September, 2020). You can view details of the list of topics in the Shiny app configuration page (see screenshot below).

On the configuration page, you can also download the list of topics, modify it and upload it to epitweetr. The new list of topics will then be used for tweet collection and visible in the Shiny app. The list of topics is an Excel file (*.xlsx) as it handles user-specific regional settings (e.g. delimiters) and special characters well. You can create your own list of topics and upload it too, noting that the structure should include at least:

  • The name of the topic, with the header “Topic” in the Excel spreadsheet. This name should include alphanumeric characters, spaces, dashes and underscores only. Note that it should start with a letter.

  • The query, with the header “Query” in the Excel spreadsheet. This is the query epitweetr uses in its requests to obtain tweets from the Twitter Standard Search API. See above for syntax and constraints of queries.

The topics.xlsx file additionally includes the following fields:

  • An ID, with the header “#” in the Excel spreadsheet, noting a running integer identifier for the topic.

  • A label, with the header “Label” in the Excel spreadsheet, which is what is displayed in the drop-down topic menu of the Shiny app tabs.

  • An alpha parameter, with the header “Signal alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. Increasing the alpha will decrease the threshold for signal detection, resulting in an increased sensitivity and possibly obtaining more signals. Setting this alpha can be done empirically and according to the importance and nature of the topic.

  • “Length_charact” is an automatically generated field that counts the characters used in the query. This field is helpful as a request should not exceed 500 characters.

  • “Length_word” indicates the number of words used in a request, including operators. Best practice is to limit your number of keywords to 10.

  • An alpha parameter, with the header “Outlier alpha (FPR)” in the Excel spreadsheet. FPR stands for “false positive rate”. This alpha sets the false positive rate for determining what an outlier is when downweighting previous outliers/signals. The lower the value, the fewer previous outliers will potentially be included. A higher value will potentially include more previous outliers.

  • “Rank” is the number of queries per topic

When uploading your own file, please modify the topic and query fields, but do not modify the column titles.
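Before uploading a modified list, it can help to check the two mandatory columns in R. The sketch below builds a small made-up table instead of reading the Excel file (reading .xlsx would require an additional package), and `valid_topic_name` is a hypothetical helper that implements the naming rule above under the assumption that only ASCII letters and digits are allowed.

```r
# Made-up topics table with the two mandatory columns
topics <- data.frame(
  Topic = c("measles", "west-nile_virus", "2019-nCoV"),
  Query = c("measles OR sarampion", "west nile virus", "coronavirus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: letters, digits, spaces, dashes and underscores only,
# starting with a letter (ASCII only, which is an assumption of this sketch)
valid_topic_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9 _-]*$", x)
}

topics$valid_name <- valid_topic_name(topics$Topic)
topics[, c("Topic", "valid_name")]
```

Here the third topic is flagged because its name starts with a digit.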

Scheduled plans to collect tweets

As a reminder, epitweetr can make up to 180 requests (queries) to Twitter every 15 minutes (or 450 requests every 15 minutes if you are using Twitter developer app credentials). Each request can return up to 100 tweets, including retweets. These are returned in JSON format, which is a lightweight data format.

In order to collect the maximum number of tweets, given the Standard Search API limitations, and in order for popular topics not to prevent other topics from being adequately collected, epitweetr uses “search plans” for each query.

The first “search plan” for a query will collect tweets from the current date-time backwards until 7 days (7 days because of the Standard Search API limitation) before the current “search plan” was implemented. The first “search plan” is the biggest, as no tweets have been collected so far.

All subsequent “search plans” are scheduled intervals that are set up in the configuration page of the epitweetr Shiny app (see section The interactive Shiny app > the configuration page > General). For illustration purposes, let us consider the search plans are scheduled at four-hour intervals. The plans collect tweets for a specific query from the current date-time back until four hours before the date-time when the current “search plan” is implemented (see image below). epitweetr will make as many requests (each returning up to 100 tweets) during the four-hour interval as needed to obtain all tweets created within that four-hour interval.

For example, if the “search plan” begins at 4 am on the 10th of September 2020, epitweetr will launch requests for tweets corresponding to its queries for the four-hour period between midnight and 4 am on the 10th of September 2020. epitweetr starts by collecting the most recent tweets (the ones from 4 am) and continues backwards. Once the API returns no more results for this four-hour period, the “search plan” for this query is considered completed.
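The time window of this example can be reproduced with base R date-time arithmetic. This is only an illustration of the example above: the variable names and the figure of 1,250 tweets are made up, and UTC is used to avoid daylight-saving effects.

```r
# Start of the search plan in the example (newest tweets are collected first)
newest <- as.POSIXct("2020-09-10 04:00:00", tz = "UTC")
window_hours <- 4

# Arithmetic on POSIXct values is in seconds
oldest <- newest - window_hours * 3600
oldest_label <- format(oldest, "%Y-%m-%d %H:%M", tz = "UTC")
oldest_label

# Requests needed for a hypothetical 1,250 tweets at up to 100 tweets per request
requests_needed <- ceiling(1250 / 100)
requests_needed
```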

However, if topics are very popular (e.g. COVID-19 in 2020), then the “search plan” for a query in a given four-hour window may not be completed. If this happens, epitweetr will move on to the “search plans” for the subsequent four-hour window, and put any previous incomplete “search plans” in a queue to execute when “search plans” for this new four-hour window are completed.

Each “search plan” stores the following information:

Field          Type       Description
expected_end   Timestamp  End DateTime of the current search window
scheduled_for  Timestamp  The scheduled DateTime for the next request. On plan creation this will be the current DateTime and after each request this value will be set to a future DateTime. To establish the future DateTime, the application will estimate the number of requests necessary to finish. If it estimates that N requests are necessary, the next request will be scheduled after 1/N of the remaining time.
start_on       Timestamp  The DateTime when the first request of the plan was finished
end_on         Timestamp  The DateTime when the last request of the plan was finished, if that request reached 100% plan progress
max_id         Long       The max Twitter id targeted by this plan, which will be defined after the first request
since_id       Long       The last tweet id returned by the last request of this plan. The next request will start collecting tweets before this value. This value is updated after each request and allows the Twitter API to return tweets before min_time(p_i)
since_target   Long       If a previous plan exists, this value stores the first tweet id that was downloaded for that plan. The current plan will not collect tweets before that id. This value allows the Twitter API to return tweets after p_i - time_back
requests       Int        Number of requests performed as part of the plan
progress       Double     Progress of the current plan as a percentage. It is calculated as (current$max_id - current$since_id)/(current$max_id - current$since_target). If the Twitter API returns no tweets, the progress is set to 100%. This only applies to non-error responses containing an empty list of tweets.
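The progress formula and the 1/N scheduling rule from the table can be sketched in R as follows. This is an illustration under stated assumptions, not epitweetr's internal code: the plan is a plain list, the ids are small made-up numbers (real tweet ids are 64-bit integers that R doubles cannot represent exactly), N is taken as given because its estimation is not described here, and `expected_end` is assumed to lie in the future.

```r
# Illustrative plan with made-up ids (not epitweetr's internal object)
current <- list(max_id = 1000, since_id = 400, since_target = 200)

# Progress as defined in the table, as a fraction between 0 and 1
plan_progress <- function(plan) {
  (plan$max_id - plan$since_id) / (plan$max_id - plan$since_target)
}
progress <- plan_progress(current)   # (1000 - 400) / (1000 - 200)

# Next request after 1/N of the remaining time, N being the estimated
# number of requests still necessary to finish the plan
next_schedule <- function(now, expected_end, n_requests_left) {
  remaining_secs <- as.numeric(difftime(expected_end, now, units = "secs"))
  now + remaining_secs / n_requests_left
}

now          <- as.POSIXct("2020-09-10 04:00:00", tz = "UTC")
expected_end <- as.POSIXct("2020-09-10 08:00:00", tz = "UTC")
scheduled_for <- next_schedule(now, expected_end, n_requests_left = 8)
format(scheduled_for, "%Y-%m-%d %H:%M", tz = "UTC")
```

With four hours remaining and eight requests left, the next request is scheduled 30 minutes later.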

epitweetr will execute plans according to these rules:

  • epitweetr will detect the newest unfinished plan for each search query with the scheduled_for variable located in the past.

  • epitweetr will execute the plans with the minimum number of requests already performed. This ensures that all scheduled plans perform the same number of requests.

  • As a result of the two previous rules, requests for topics under the 180 limit of the Twitter Standard Search API (or 450 if you are using Twitter developer app authentication) will be executed first and will produce higher progress than topics over the limit.
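One possible reading of these rules is sketched below on a made-up table of plans. The data frame layout, the `created` and `finished` columns and the `next_plans` helper are assumptions made for this illustration; they do not reflect how epitweetr stores its plans internally.

```r
ts <- function(x) as.POSIXct(x, tz = "UTC")

# Made-up plans: two search windows for "measles" and "covid", one for "ebola"
plans <- data.frame(
  query         = c("measles", "measles", "covid", "covid", "ebola"),
  created       = ts(c("2020-09-10 00:00:00", "2020-09-10 04:00:00",
                       "2020-09-10 00:00:00", "2020-09-10 04:00:00",
                       "2020-09-10 04:00:00")),
  scheduled_for = ts(c("2020-09-10 00:05:00", "2020-09-10 04:05:00",
                       "2020-09-10 04:20:00", "2020-09-10 04:10:00",
                       "2020-09-10 05:00:00")),
  requests      = c(3, 1, 40, 12, 0),
  finished      = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

next_plans <- function(plans, now) {
  # Unfinished plans whose scheduled time is in the past
  due <- plans[!plans$finished & plans$scheduled_for <= now, ]
  if (nrow(due) == 0) return(due)
  # Rule 1: keep the newest due plan of each query
  newest <- do.call(rbind, lapply(split(due, due$query), function(p) {
    p[which.max(as.numeric(p$created)), ]
  }))
  # Rule 2: among those, keep the plans with the fewest requests so far
  newest[newest$requests == min(newest$requests), ]
}

selected <- next_plans(plans, now = ts("2020-09-10 04:30:00"))
selected[, c("query", "requests")]
```

In this example the newest "measles" plan is selected: the "ebola" plan is not yet due, and the newest "covid" plan has already used more requests.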

The rationale behind this is that topics with such a large number of tweets that the 4-hour search window is not sufficient to collect them are likely to already be known topics of interest. Therefore, priority should be given to smaller and possibly less well-known topics.

An example is the COVID-19 pandemic in 2020. In early 2020, there was limited information available regarding COVID-19, which allowed detecting signals with meaningful information or updates (e.g. new countries reporting cases or confirming that it was caused by a coronavirus). However, throughout the pandemic, this topic became more popular and the broad topic of COVID-19 was not effective for signal detection and was taking up a lot of time and requests for epitweetr. In such a case it is more relevant to prioritise the collection of smaller topics such as sub-topics related to COVID-19 (e.g. vaccine AND COVID-19), or to make sure you do not miss other events with less social media attention.

If search plans cannot be finished, several search plans per query may be in a queue: