3.2 Machine Learning Analysis

To identify the key characteristics of posts that generate attention, we follow our previous approach [1], which characterises posts by analysing how they are written and when they are published. Our goal is to identify, using a set of features, the main characteristics of posts that generate higher levels of engagement. The features considered in this analysis are listed below:

Post length: Number of terms in the post.

Complexity: Cumulative entropy of the terms within the post, which gauges how concentrated the language is or how widely it is dispersed across different terms. Let n be the number of unique terms within the post p and f_i the frequency of the i-th term t_i within p. Complexity is the cumulative entropy computed from these frequencies.

Readability: This feature gauges how hard the post is for humans to parse. To measure readability we use the Gunning Fog Index [1], which combines the average sentence length (ASL) and the percentage of complex words (PCW):

0.4 * (ASL + PCW)

Referral Count: Number of hyperlinks (URLs) present in the post.

Informativeness: The novelty of the post's terms with respect to the other posts. We derive this measure using Term Frequency-Inverse Document Frequency (TF-IDF).

Polarity: Average polarity (sentiment) of the post. We compute sentiment using SentiStrength [2], a state-of-the-art method for analysing sentiment in social media data.

Mentions: Number of mentions (references to other users) within the tweet.

Time of day: The time of day at which the tweet was posted.
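As a minimal sketch, the text-based features listed above could be computed along the following lines. Everything here is an illustrative assumption rather than the implementation used in [1]: the function names, the tokenisation (URLs and mentions are stripped before counting terms), the vowel-run syllable heuristic for spotting complex words, and the `https?://` and `@name` patterns are all hypothetical choices. The complexity function assumes one common form of cumulative entropy, (1/n) * sum over i of f_i * (log n - log f_i), and informativeness is taken as the sum of tf * idf over the post's terms with idf = log(N / df). Polarity is omitted because it relies on the external SentiStrength tool, and time of day is read directly from the post metadata.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: real tweets may need a more careful tokeniser.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def clean(text):
    """Replace URLs and @mentions with spaces (an assumption of this sketch)."""
    return MENTION_RE.sub(" ", URL_RE.sub(" ", text))


def terms(text):
    """Lowercased word terms of a post, after removing URLs and mentions."""
    return WORD_RE.findall(clean(text).lower())


def post_length(text):
    """Number of terms in the post."""
    return len(terms(text))


def complexity(text):
    """Assumed form: (1/n) * sum_i f_i * (log n - log f_i), n = unique terms."""
    freqs = Counter(terms(text))
    n = len(freqs)
    if n == 0:
        return 0.0
    return sum(f * (math.log(n) - math.log(f)) for f in freqs.values()) / n


def syllables(word):
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def gunning_fog(text):
    """0.4 * (ASL + PCW); a word is 'complex' if it has 3+ syllables."""
    sentences = [terms(s) for s in re.split(r"[.!?]+", clean(text))]
    sentences = [s for s in sentences if s]
    words = [w for s in sentences for w in s]
    if not words:
        return 0.0
    asl = len(words) / len(sentences)
    pcw = 100.0 * sum(syllables(w) >= 3 for w in words) / len(words)
    return 0.4 * (asl + pcw)


def referral_count(text):
    """Number of URLs in the post."""
    return len(URL_RE.findall(text))


def mention_count(text):
    """Number of @mentions in the post."""
    return len(MENTION_RE.findall(text))


def informativeness(post, corpus):
    """Sum of tf * idf over the post's terms, with idf = log(N / df)."""
    n_docs = len(corpus)
    doc_terms = [set(terms(doc)) for doc in corpus]
    score = 0.0
    for term, tf in Counter(terms(post)).items():
        df = sum(term in d for d in doc_terms)
        if df:  # terms that never occur in the corpus are skipped
            score += tf * math.log(n_docs / df)
    return score
```

Under these assumptions, a post in which every term occurs exactly once has complexity log n, so complexity grows with the number of distinct, unrepeated terms. This is consistent with the reading of complexity given later in the section.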

To extract the key characteristics of posts that generate attention, we first identify the characteristics of tweets that are followed by an engagement action (seed posts). We then identify the characteristics of those seed posts that are followed by a high level of engagement (a high number of retweets).

To perform the first task we train several ML classifiers and select the one that best classifies seed posts, in this case the J48 decision tree classifier. Once the best classifier has been selected, features are removed from it one at a time and the resulting drop in performance is measured. The features whose removal causes the largest drop are considered the most discriminative, i.e., those that best distinguish seed posts (those generating engagement) from non-seed posts. For more details of the complete analysis process see [1].
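The feature-removal step can be illustrated with a small leave-one-feature-out sketch. This is not the authors' pipeline: a simple nearest-centroid classifier stands in for J48 so that the example needs only the standard library, accuracy is measured on the training data for brevity (a real analysis would use held-out data or cross-validation), and the names `nearest_centroid_accuracy` and `ablation_drops` as well as the toy data are hypothetical.

```python
from statistics import mean


def nearest_centroid_accuracy(rows, labels, cols):
    """Training accuracy of a nearest-centroid classifier using only `cols`."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(r[c] for r in members) for c in cols]

    def predict(row):
        # Pick the class whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda lab: sum(
                (row[c] - m) ** 2 for c, m in zip(cols, centroids[lab])
            ),
        )

    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def ablation_drops(rows, labels, feature_names, score=nearest_centroid_accuracy):
    """Drop in score when each feature is removed, largest drop first."""
    all_cols = list(range(len(feature_names)))
    baseline = score(rows, labels, all_cols)
    drops = {}
    for i, name in enumerate(feature_names):
        kept = [c for c in all_cols if c != i]
        drops[name] = baseline - score(rows, labels, kept)
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))


# Toy data (hypothetical): "length" separates the classes, "noise" does not.
rows = [[10, 1], [12, 0], [11, 1], [2, 0], [3, 1], [1, 0]]
labels = [1, 1, 1, 0, 0, 0]
drops = ablation_drops(rows, labels, ["length", "noise"])
```

In this toy example, removing `length` lowers accuracy while removing `noise` does not, so `length` is ranked as the more discriminative feature. Any scoring function with the same signature can be passed as `score`.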

Figure 1 shows the result of this analysis. In particular, the top four discriminative features that help distinguish seed from non-seed posts are post length, complexity, polarity and mentions. Posts that generate some level of engagement are generally longer, have a higher level of complexity (i.e., the post contains many terms that are not repeated often), carry slightly more positive than negative sentiment, and mention at least one user.

Fig. 1. Features with higher influence on engagement levels

Once we have identified the key characteristics of seed posts, our goal is to determine the characteristics of those seed posts that generate higher levels of attention. To obtain this information we build a linear regression model in which the features listed above are used to approximate the number of engagement interactions that a tweet receives. Significant coefficients (p < 0.05) are associated with complexity, mentions and time of day. More specifically, the tweets that generate higher levels of attention contain many terms that are not repeated often, mention several users, and are posted between 8:00 and 16:00. As Dorset Police indicated, there are currently no dedicated resources for actively tweeting, monitoring or responding to comments outside that time range.
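To make the regression step concrete, the following is a minimal ordinary least squares sketch that fits an intercept and one coefficient per feature by solving the normal equations. It is an illustration under stated assumptions, not the model reported above: the function name `fit_linear_regression` and the toy data are hypothetical, and the sketch does not compute the p-values needed to decide which coefficients are significant, which in practice would come from a statistics package.

```python
def fit_linear_regression(X, y):
    """Least-squares coefficients [intercept, b1, ..., bk] for y ~ X."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept
    k = len(rows[0])
    # Augmented normal equations: [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Toy data (hypothetical): engagement = 1 + 2 * complexity + 3 * mentions.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coefficients = fit_linear_regression(X, y)
```

Because the toy target is an exact linear function of the two features, the fitted coefficients recover the intercept and both slopes up to floating-point error. With real engagement counts, the size and sign of each coefficient would be read alongside its significance test.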

  • [1] en.wikipedia.org/wiki/Gunning_fog_index
  • [2] sentistrength.wlv.ac.uk/
 