

By Robert F. Chestnutt, Dublin City University
1.0 Introduction
Vladimir Putin has, arguably, been the most scrutinised global leader since his unexpected ascension to the Russian presidency in 1999. Interestingly, since the election of Donald Trump in November 2016, analysis and discussion of Putin has grown even further.
Putin succeeded Boris Yeltsin as Russian president almost two decades ago, and only Stalin has ruled longer. Few would defend him against accusations that he has flagrantly feathered his nest from Russia’s vast resource riches, terrorised his own people to protect his position, and been a sporadic threat to global security, among a range of other incriminations. We can be certain about a couple of extremes. Putin has been known to be openly mean-spirited and spiteful, and a small window into his penchant for intimidation and domination was offered by how he exploited German Chancellor Angela Merkel’s well-known dog phobia at his Sochi residence in 2007. Photos of the event suggest he was quite entertained by her clear discomfort. Conversely, there has been much less frequent commentary about how Putin has effectively contained the numerous undesirable domestic groups in the Russian ‘wild west’, and we may only know the significance of this in the post-Putin era. As such, he is understandably one of the most polarising political figures in recent history. He has been called a tragic figure, and the ideal ruler for the current period of Russia1. Long-term foe Boris Berezovsky called him a traditionalist who believes that the only way to sustain order and protect the state is through authoritarianism.
Yet, whether one loves or hates him, whether one feels threatened or more secure at the thought of him (given his substantial domestic popularity), he will still go down as one of the most interesting political figures in history. Vladimir Putin will be remembered and analysed long after his departure from Russian and global politics. Given this fascination, what else can we learn about him from the vast number of books, journal articles, news sources and documentaries about him? How will history remember him? Given these questions, what have been the most common topics and themes commentators have chosen to discuss over time? What has been the dominant sentiment expressed towards Putin by commentators, and how has it fluctuated over time? How do the negative (and positive?) individual themes rank, and how has this fluctuated over time? And can we identify camps of authors whose commentaries are significantly different or similar?
2.0 Data, tools and method
2.1 Data
This piece dissects a corpus of books about Vladimir Putin (or in which he is the principal focus). They are written by a range of academics, authors, former political figures, journalists and experts from international organisations. Thus, there is variation in the themes surveyed and emphasised, and in the levels of sentiment. The corpus comprises 31 books published over a 14-year span, from 2004, a few years after his appointment as Russian President, up to the present day, 2018. A chronology of the books used is listed in Appendix 1.
2.2 Method
The piece uses a range of data science methods, including topic modelling and sentiment analysis, as well as a range of graphics to illustrate significant findings.
2.2.1 The Descriptives
Section 3 outlines some basic descriptive statistics, surveying the most common terms: overall, by book and by year. In addition, significant terms by book and by year will be analysed using the ‘TF-IDF’ statistic. Unlike simple term frequency, the ‘TF-IDF’ value increases with a term’s frequency within a book, but is offset by the number of books in the corpus that contain the term. This removes commonly occurring terms that may not yield much information. From a common sense perspective, if a term appears often, it must be important, and this is reflected in its frequency. However, if it appears in every document, it is unlikely to be insightful or informative. This post analyses Vladimir Putin, and as such ‘Vladimir’, ‘Putin’ and ‘Russia’ are mentioned very often across all books. Thus, in reality their value to the analysis is comparatively low, and as such they are automatically omitted from the output of the respective machine learning models.
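To make the weighting concrete, here is a minimal sketch in Python. It assumes the common definition in which term frequency is a term's share of a book's tokens and inverse document frequency is the natural log of the number of books divided by the number of books containing the term; the post does not state which variant it uses. The three toy 'books' and the function name `tf_idf` are hypothetical stand-ins, not drawn from the corpus.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_name: {term: tf-idf score}} for a dict of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # tf = share of the document's tokens; idf = ln(N / document frequency)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Toy stand-ins for three books (hypothetical text, not from the corpus).
books = {
    "book_a": "putin russia kgb kgb security".split(),
    "book_b": "putin russia yukos oil oil".split(),
    "book_c": "putin russia chechnya war war".split(),
}
scores = tf_idf(books)
top_terms = {name: max(s, key=s.get) for name, s in scores.items()}

print(scores["book_a"]["putin"])  # 0.0, because 'putin' appears in every book
print(top_terms)  # {'book_a': 'kgb', 'book_b': 'oil', 'book_c': 'war'}
```

Under this definition a term found in every book scores exactly zero, which is why names such as 'Putin' drop out while book-specific terms rise to the top.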
2.2.2 Topic Modelling
Section 4 employs Topic modelling to estimate the themes that are most frequently addressed in analysis of Vladimir Putin.
Topic modelling is a machine learning method used for unsupervised classification of text documents such as books, web text, journal articles and news articles, similar to clustering on numeric data2. While clustering seeks to establish groups of documents within a corpus, topic modelling aims to isolate core themes from a set of texts3. Clustering is deductive, while topic modelling is inductive. As such, it is exploratory in nature, as it discovers natural groups of items even when the investigator is not totally sure what they are looking for.
Latent Dirichlet allocation (LDA) is a method for fitting a topic model. In this case it will treat each book as a mixture of topics, and each topic as a mixture of words. This allows the books to overlap each other in terms of content, rather than being separated into discrete groups. Theoretically, it seeks to mirror the typical use of language. This post clusters the books in which Vladimir Putin is the predominant theme. The process generates a model that ‘learns’ to tell the difference between the books based on their text content. This section also demonstrates a number of options for fine-tuning a topic model in order to secure the most coherent and usable output.
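The 'mixture of topics, mixture of words' idea can be illustrated with a short Python sketch of the generative story that LDA assumes. This is not the fitting procedure used later in the post: fitting works in the opposite direction, inferring the mixtures from the observed words. The topic names, word probabilities and the `alpha` values below are hypothetical, chosen only for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_book(topics, alpha, n_words, rng):
    """Generate one toy 'book': pick a topic mixture, then a topic and a word per token."""
    topic_names = list(topics)
    theta = sample_dirichlet(alpha, rng)  # this book's mixture over topics
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position according to the book's mixture.
        topic = rng.choices(topic_names, weights=theta)[0]
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        # Choose a word according to that topic's mixture over words.
        words.append(rng.choices(vocab, weights=weights)[0])
    return dict(zip(topic_names, theta)), words

# Hypothetical topics, each a probability distribution over words.
topics = {
    "security": {"kgb": 0.5, "fsb": 0.3, "control": 0.2},
    "economy": {"yukos": 0.4, "oil": 0.4, "power": 0.2},
}
rng = random.Random(42)  # fixed seed so the sketch is repeatable
mixture, words = generate_book(topics, alpha=[0.5, 0.5], n_words=12, rng=rng)

print(mixture)  # topic proportions for this book, summing to 1
print(words)    # twelve words drawn from the two topics
```

Because every book draws on all topics in different proportions, two books can share a theme without belonging to the same discrete group.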
2.2.3 Sentiment Analysis
Sentiment analysis is a method of opinion mining, which analyses the emotional content of text programmatically. Essentially, it distils an author’s emotional intent into distinct categories. It borrows from a number of disciplines, including linguistics, psychology and natural language processing (NLP)4. Section 5 uses Sentiment analysis to gauge feeling towards President Putin, surveying variation by publication and over time.
There are two approaches to sentiment analysis: the machine learning classification approach and the lexicon-based approach. The machine learning classification method involves training a model on existing pre-labelled data, and then using the model to determine polarity on unlabelled data. This method is very useful and efficient when dissecting mammoth datasets, but offers minimal texture beyond binary polarity. As such, it may be more useful in scenarios where there is a large volume of data to interpret but the output requires nothing beyond a binary positive or negative, such as gauging the sentiment trajectory of a share price or summarising many millions of reviews.
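As an illustration of the classification approach, the Python sketch below trains a small multinomial Naive Bayes model with add-one smoothing on labelled sentences and then predicts the polarity of unlabelled ones. Naive Bayes is one common choice, used here as an assumption; the post does not say which classifier it has in mind, and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labelled):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    n_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / n_examples)  # log prior
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical pre-labelled training data.
training = [
    ("a brave and popular leader", "positive"),
    ("stable and strong economy", "positive"),
    ("corrupt and brutal regime", "negative"),
    ("a threat to global security", "negative"),
]
model = train(training)

print(predict(model, "brutal threat"))          # negative
print(predict(model, "popular strong leader"))  # positive
```

The output is a single label per text, which shows both the efficiency of the approach and its lack of texture beyond binary polarity.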
The lexicon-based approach offers more texture in two ways, and perhaps offers more fruitful output options for political analysis. As with machine learning binary classification, polarity can be estimated using the ‘Bing’ lexicon5. In addition, the depth of polarity can be estimated using the ‘AFINN’ lexicon6, which scores the magnitude of polarity on a scale from -5 to +5. Lastly, the ‘NRC’ lexicon7 offers further texture by grouping emotive words into eight distinct categories (as well as negative-positive polarity): sadness, joy, trust, fear, anger, surprise, disgust and anticipation.
As such, this section will seek to estimate depths of polarity measured by book and also by year of publication. What can changes in the magnitude of polarity tell us about the attitudes of political commentators towards Putin over time?
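The lexicon-based scoring described above can be sketched in a few lines of Python. The tiny lexicon below is hypothetical: it only imitates the shape of AFINN (a word mapped to an integer between -5 and +5) and its values are not taken from the real lexicon. The function names and the three sample 'books' are likewise invented for illustration.

```python
from collections import defaultdict

# Hypothetical AFINN-style lexicon: word -> polarity score between -5 and +5.
TOY_LEXICON = {
    "corrupt": -3, "threat": -2, "war": -2, "fear": -2,
    "stable": 2, "strong": 2, "popular": 3,
}

def polarity(text, lexicon):
    """Mean score of the lexicon words found in the text (0.0 if none match)."""
    hits = [lexicon[token] for token in text.lower().split() if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def mean_polarity_by_year(books, lexicon):
    """Average the per-book polarity within each publication year."""
    by_year = defaultdict(list)
    for year, text in books:
        by_year[year].append(polarity(text, lexicon))
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}

# Hypothetical (year, text) pairs standing in for books in the corpus.
books = [
    (2004, "a popular and strong president"),
    (2004, "a stable economy under threat"),
    (2018, "a corrupt state at war"),
]
by_year = mean_polarity_by_year(books, TOY_LEXICON)
print(by_year)  # {2004: 1.25, 2018: -2.5}
```

Averaging over matched words rather than summing keeps long and short books comparable, though this is a design choice for the sketch and other normalisations are equally defensible.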
2.3 Tools
Machine learning methods offer a whole new world of analytical options, but putting them into operation requires specialised tools. This post takes a mixed programming language approach to answering the array of questions outlined above. Although it will predominantly use the ‘R’ statistical programming language, it will also embed Python code chunks where the task calls for it, using the ‘reticulate’ package8 in R. Although it is unusual to mix programming languages in a single project, each language has its strengths and its much-debated drawbacks. At other times, as in the case of this piece, it just comes down to personal preference. ‘R’ can be slow and a memory glutton. Python can also be sluggish when dealing with exceptionally large amounts of data, but does not rely on RAM to the same extent as ‘R’. Ultimately, some tasks are simply more straightforward in one language than in the other.
The descriptives and lexicon-based sentiment analysis will be executed through ‘R’, and the Topic Modelling and Cosine similarity tasks will be performed using the GenSim9 library in Python. Everything will be coded together in Markdown through the RStudio integrated development environment (IDE).
3.0 The Descriptives
3.1 Most common words by frequency (overall)
Section 3 presents some basic descriptives using the ‘R’ statistical programming language. It graphs an overall term count by frequency, with common ‘stopwords’ such as ‘the’, ‘is’, ‘a’ and a host of other non-contributing, non-emotive words already removed from the corpus.
Even from a basic word count of the most frequent terms in the corpus, the tone is quite clear. Both ‘power’ and ‘war’ are in the top 5 terms and, fascinatingly, ‘Khodorkovsky’ and ‘Yukos’ make it into the top 10. Other interesting terms such as ‘Chechnya’, ‘KGB’, ‘FSB’, ‘control’ and ‘security’ are among the top 50, all of them provocative terms within the context of Putin’s reign.
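The word count described in this subsection amounts to tokenising the text, dropping stopwords and tallying what remains. The post carries this out in ‘R’; the Python sketch below shows the same idea under simplifying assumptions, with a deliberately tiny stopword list and an invented sample sentence, both hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a much fuller one.
STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}

def most_frequent_terms(text, n, stopwords=STOPWORDS):
    """Return the n most common non-stopword terms as (term, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic tokens
    counts = Counter(token for token in tokens if token not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text, not drawn from the corpus.
sample = "The power of the state is the power to wage war, and war is a test of power."
print(most_frequent_terms(sample, 2))  # [('power', 3), ('war', 2)]
```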
3.2 Most significant terms by Book (tf-idf)