What does honesty look like (statistically)?

Certain linguistic features (e.g. reference, modality) facilitate deception because they are malleable to context and flexible to interpretation. My first blog post showed that deceptive communication contains ‘outliars’, portions of texts with an unusually high concentration of these linguistic features; in the second post we saw that the linguistic hotspots where these features cluster can be taken as ‘points of interest’ worthy of further investigation. Of course, liars do not have a monopoly on the use of modals! Furthermore, truth-tellers can sometimes be mistaken for liars due to nervousness, fear of disbelief, or perceptions of powerlessness (known as the ‘Othello error’). So what does honest (non-deceptive) communication look like?


In my Stanford Decepticon 2017 conference paper I tested the ‘Outliar’ investigative linguistic methodology on honest admissions of doping – true confessions – by the following five sports persons and professionals:

[Image: true doping confessions by Maria Sharapova, Dwain Chambers, Victor Conte, Floyd Landis and Levi Leipheimer]

The Maria Sharapova case took the tennis world by surprise (she was the first high-profile female tennis player to fail a drug test). In 2016, Sharapova was banned from competition after testing positive for meldonium during the Australian Open in January of that year. Meldonium is a heart medication that was found by the World Anti-Doping Agency (WADA) to be particularly popular amongst sports persons from Russia and Eastern Europe, perhaps due to its ability to block the body’s conversion of testosterone to oestrogen. Having placed meldonium on a watch list in 2015, WADA had fully prohibited the substance from January 1 2016, two weeks before the Australian Open. Following the failed drug test, Sharapova admitted she had been taking meldonium as medication since 2006 and stated that she had negligently and inexcusably missed the communications from WADA prohibiting its use.

Linguistic analysis of the explanation Sharapova gave to fans via her Facebook page shows two ‘outliars’ at the beginning and end of the post (see Figure 3 below).


[1] I want to reach out to you to share some information, discuss the latest news, and let you know that there have been things that have been reported wrong in the media, and I am determined to fight back. You have shown me a tremendous outpouring of support, and I’m so grateful for it.

[13] I have been honest and upfront. I won’t pretend to be injured so I can hide the truth about my testing. I look forward to the ITF hearing at which time they will receive my detailed medical records. I hope I will be allowed to play again. But no matter what, I want you, my fans, to know the truth and have the facts.

Figure 3: Outliar analysis of Maria Sharapova’s 2016 Facebook post and outlier extracts.

Sharapova begins her post by suggesting she has been a victim of unjust media coverage. It had been widely reported that she had received five ‘warnings’ about the upcoming change to the WADA regulations. Sharapova agreed that she had received newsletters with links to the WADA rule changes but argued that these were ‘communications’ rather than warnings, through which one had to “hunt, click, hunt, click, hunt, click, scroll and read” in order to find information about the prohibition. Sharapova ends her post by strongly maintaining that she is being honest about her genuine mistake (of using meldonium as medication after the ban).

These anomalous extracts are particularly emotional when compared to the main body of this post, in which Sharapova gives specific details about all the communications she did receive (see yellow highlighted text in Figure 4 below). There is a lot of literature that suggests specific details are a strong indicator of veracity in legal genres such as witness statements. (Professor Aldert Vrij’s research on Criteria-Based Content Analysis is a good place to start.) Flagging these extracts could therefore simply be an ‘Othello error’, in which emotional intensity is mistaken for deception.


Figure 4: Maria Sharapova Facebook post, March 2016. Last accessed 21/7/2018

Accounting for the ‘Othello error’ is one reason a full ‘Outliar’ analysis uses an additional measure of language change within a text – intratextual language variation – when assessing text veracity. Texts can range from having a uniform style with consistent use of features throughout – a stable text – to displaying marked changes in language style at several points – a variable or ‘spiky’ text. ‘Outliar’ captures this by summing the amount of change shown in a text.

Figure 5 is an example of this. It compares the ‘Outliar’ analysis of Sharapova’s Facebook post (left) with one of a Lance Armstrong TV interview in which he falsely denied doping (see previous blog for more discussion of Armstrong’s deception). Visually, you can see that Lance Armstrong’s language use displays high variability, in comparison to which Sharapova’s language is relatively stable.

Figure 5: Comparison of the ‘Outliar’ analysis of Maria Sharapova’s Facebook post (left) and Lance Armstrong’s ESPN interview (right).

Figure 6 below shows a statistical measure of intratextual language variation for five false doping denials vs. five true doping confessions (see p7 here for the formula). It can be seen that the deceptive communications show more language change than the honest ones. So, combining outlier text detection with an overall measure of language variability can be helpful in distinguishing honesty from dishonesty. Frequent and marked language style change is a signal of potential deception.


Figure 6: Analysis of intratextual variation. Y-axis = total intratextual variation measured as aggregate z-score for each text; X-axis represents ten texts in total – five deceptive texts (false denials by: 1) Barry Bonds; 2) Linford Christie; 3) Lance Armstrong; 4) Alex Rodriguez; 5) Marion Jones) and five honest texts (true confessions by: 1) Maria Sharapova; 2) Dwain Chambers; 3) Victor Conte; 4) Floyd Landis; 5) Levi Leipheimer).
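To make the idea of an ‘aggregate z-score’ concrete, here is a minimal Python sketch. It is not the formula from the paper: it assumes each text has already been reduced to one feature frequency per segment, and it takes one possible reading of ‘aggregate’, namely the sum of the absolute modified z-scores of those segments. The function names and both sets of numbers are invented for illustration.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # No spread around the median, so treat every segment as unremarkable.
        return [0.0 for _ in values]
    return [0.6745 * (x - med) / mad for x in values]

def total_variation(values):
    """Assumed aggregate: sum of absolute modified z-scores over all segments."""
    return sum(abs(score) for score in modified_z_scores(values))

# Invented per-segment feature frequencies for two hypothetical texts.
stable_text = [0.10, 0.11, 0.10, 0.12, 0.11, 0.10, 0.11, 0.12]
spiky_text = [0.10, 0.25, 0.08, 0.30, 0.11, 0.05, 0.27, 0.10]

print(f"stable text: {total_variation(stable_text):.1f}")
print(f"spiky text:  {total_variation(spiky_text):.1f}")
```

On these made-up numbers the ‘spiky’ text scores higher than the stable one, which is the pattern Figure 6 reports for deceptive versus honest texts. Because the scores are scaled by each text’s own spread, this particular aggregate rewards sharp departures from a text’s usual level rather than a high level of features overall.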

In Sharapova’s case, the tribunal were satisfied she had not intended to cheat (although she was found to have also taken the drug to enhance her performance) and her relatively light ban (reduced from two years to 15 months on appeal) reflected the fact that she had been negligent but not deceptive. I would argue that the (relatively) stable language of both Sharapova’s Facebook post and the initial press conference where she announced her drug test failure supports the tribunal finding. The press conference video is below – judge for yourself.



The case of the asthmatic cyclists: deception detection as investigative linguistics

Did you know that Sir Bradley Wiggins, Sir Mo Farah and I-thought-he-was-a-sir Chris Froome all suffer from asthma? As do a number of other successful sports persons accused of using performance-enhancing drugs? What I call ‘investigative linguistics’ led me down this particular rabbit hole. My investigative linguistic approach examines texts for ‘points of interest’ (POIs). It uses deception detection tools on communications with unknown veracity in order to automatically identify POIs. One benefit is that you can approach a topic without any prior knowledge or biases and quickly find avenues that are objectively worth exploring.

After analysing the known deceptions of Lance Armstrong (see previous blog), I collected a bunch of statements made by sports people admitting or denying their use of performance-enhancing drugs. Based on currently available evidence, these statements were divided into three categories: a) false denials, b) true confessions and c) presumed-to-be-true denials (see my Stanford Decepticon 2017 presentation for full details). For the ‘true’ denials category, I picked five recent high-profile cases: two relating to the cycling controversy around Sir Bradley, Sir David Brailsford and Team Sky (video explainer), and three connected to the controversy around the infamous athletics coach Alberto Salazar and his Nike Oregon Project, which engulfed Sir Mo Farah, the Canadian Cameron Levins and, indirectly, Paula Radcliffe MBE.

[Image: true denials dataset]

It was a surprise to me that each of the interviews (graphed below) mentions asthma and related issues. If I had done some prior research I would have realised that the provision and use of asthma medication during and around major sport events was a key issue in sports doping. Still, it shows that deception detection techniques can be used for exploration, i.e. to find the ‘points of interest’ worthy of further investigation. Furthermore, a ‘naive’ approach helps to avoid unconscious bias affecting the analysis.


(‘Outliar’ analysis of four ‘true denials’. Interview responses are represented as a time series on the x-axis (c.30-60 second chunks). Green shading indicates ‘outliar’ text. Asthma mentions marked with ∇)

In Bradley Wiggins’ Guardian interview (given to counter suggestions of illegal doping during the 2012 Tour de France), the analysis highlights inconsistencies in Wiggins’ stance towards his asthma and allergies:

[14] I was paranoid about making excuses: “Ah, my allergies have kicked in.” I’d learned to live with this thing. It wasn’t something I was going to shout from the rooftops and use as an excuse and say, “my allergies have started off again”. That’s convenient isn’t it Brad, your allergies started when you got dropped.

[17] I didn’t mention it in the book. I’d come off a season of … I’d won everything that year. When I was writing the book I wasn’t sat there thinking, “I’d better bring my allergies up”. I was flying on cloud nine after dominating the sport all year. It wasn’t something that I brought to mind.

In these two extracts, asthma is simultaneously a big deal and a non-issue for Wiggins. While this does not in any way confirm deception or guilt, it does indicate a defensive stance that is worth investigating. This can be contrasted with Mo Farah’s discussion of his own asthma:

[4] “This picture has been painted of me. It’s not right. I am 100% clean. I love what I do. I want to continue winning medals. But I want people to know that I am 100%, I am not on any drugs, I am not on thyroids, I am not on any other medication. The only medication that I am on, I am on asthma and I have had that since I was a child. That’s just a normal use. I am on TUE [therapeutic use exemption] where you have … it’s just the normal stuff. And that’s it.” – Sky Sports interview, 2015

In contrast to Wiggins, Mo Farah volunteers information in a non-defensive fashion about his asthma and use of Therapeutic Use Exemptions (TUEs – a doctor’s note and prescription). Canadian runner Cameron Levins’ response to questions about his use of prescription drugs registers as more ‘interesting’ than Farah’s, although not as high as Wiggins’:

Interviewer: No prescription drugs?

Levins: I have some medication I take for my asthma, but that is something that is wrong with me. I’m asthmatic.

Interviewer: Was that before you came on with the (Nike Oregon) Project?

Levins: Yeah, I was dealing with it before I joined the project actually. A little bit after the London Olympics I started having quite a bit of difficulty with it. So it was before I joined the project.

Levins later goes on to explain that “adult onset asthma is pretty common”. Obviously, having asthma since childhood is easier to defend, which may explain the higher ‘interestingness’ score of Levins’ response compared to Farah’s.

In a 2016 interview with Sky Sports News, David Brailsford, director of Team Sky (whose riders included Bradley Wiggins and Chris Froome), offered the following highly ‘interesting’ reply when asked about his team’s covert use of Therapeutic Use Exemptions to obtain otherwise prohibited drugs:

[5] We’ve reviewed this over the years as we’ve moved forward. We have changed our policy, we’ve changed the way we do it, and in the future going forward, I think we’re going to take the next step, which has been debated on a wider basis across the whole of the TUE process, and look at having the consent of the riders to make all TUEs transparent

In this segment, Brailsford tries to draw a line under anything that may have occurred in the past, using many words related to looking to the future. Without presuming anything illegal on Brailsford’s or Team Sky’s part, previous policy would clearly be a ‘point of interest’.

So, this analysis suggests that cyclists’ use of asthma medication is more ‘interesting’ than that of athletes. (In Farah’s interview, the analysis flags his comments about missed drug tests rather than specific doping allegations.) Understanding the reasons for this can then provide a focus for further investigation. As Chris Froome’s recent successful appeal shows, the asthma issue may be due to faulty regulations based on models with a tendency to generate ‘false positives’ – itself a form of deviance if not deception.

froome inhaler

(picture © BBC/Getty Images, 2018)

For this naive analyst, investigative linguistics revealed an important connection between asthma and sports doping that is clearly ‘interesting’. Application of the same techniques to the domains of business, politics and finance will definitely be interesting…



Linguistic Pointing and Deception Detection

So here’s the thing. You can tell somebody is lying – or more correctly, deceiving – by the words they use. I’m not talking about gesture or disguise or other types of non-verbal deception. I mean when there are words and text – either written or spoken – those words will reveal deception if you know what to look for.

“Listen, nobody believes in doping controls more than me.” — Lance Armstrong

Does that mean it’s possible to detect deception by reading a text or transcript or listening to someone speak? Not exactly. Factors such as human truth bias and our reliance on heuristics to process information mean that judgement derived from our senses is not entirely reliable (although it can be improved by education and training).

Deception detection is possible by processing a text. Now, unless you are some kind of artificial intelligence, you will rely on an automated tool for computational and statistical analysis. Non-verbal deceptions such as credit card or other financial fraud are already detected using statistical algorithms and other data mining techniques. Advances in linguistics mean that texts can also be processed as data and then classified and grouped together – all without being read or heard.

There are different features of language that can be analysed. Words, word sequences, types of words, grammar, syntax and so on. These features can be analysed individually or as groups that represent underlying concepts (e.g. ‘certainty’ or ‘complexity’). You can analyse known true and deceptive texts for these language features, compare the frequencies and distributions, and find linguistic tendencies that correlate with deception and truth. But what are these linguistic features?

There are many sets of linguistic features that have been used for deception detection (see Hauch et al.’s 2015 meta-analysis for a comprehensive list of experiments and linguistic features used). The features that are most effective are the ones that enable the linguistic act of deception.

Take the following:

– “Car.”

By itself, this single common noun cannot be a lie. If I said ‘car’ and pointed to a bicycle or a phone then that could be a lie. There are different linguistic resources for ‘pointing’:

– “That is a car.”
– “I have a car”
– “My car”

In fancy linguistic terminology, ‘pointing’ is known as referential indexicality. Other types of ‘pointing’ can be to a particular place, period of time, event, assumption, thought and so on – even to the text itself. Drawing on the linguistic theory and influence of the Prague School, I use a set of these linguistic features to analyse texts for ‘textual hotspots’ – a linguistic equivalent of the non-verbal hotspots such as micro-expression, gesture and voice identified by the psychologist Dr Paul Ekman (on whose work the TV show ‘Lie to Me’ is based).
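To see what counting ‘pointing’ words might look like in practice, here is a minimal Python sketch applied to the car examples above. The word list is a small, hypothetical stand-in (a handful of pronouns and demonstratives), not the feature set used in my analysis, and the function names are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny list of 'pointing' words (pronouns and
# demonstratives). It is an assumption made for this example only.
POINTING_WORDS = {
    "i", "me", "my", "you", "your", "he", "him", "his", "she", "her",
    "it", "its", "we", "us", "our", "they", "them", "their",
    "this", "that", "these", "those", "here", "there",
}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pointing_rate(text):
    """Return the share of tokens that are 'pointing' words (0.0 if no tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in POINTING_WORDS)
    return hits / len(tokens)

for example in ["Car.", "That is a car.", "I have a car", "My car"]:
    print(f"{example!r}: {pointing_rate(example):.2f}")
```

The bare noun scores zero, while each of the three ‘pointing’ versions scores above zero. A real analysis would need a much richer feature set and proper tokenisation, but the principle of turning text into a relative frequency is the same.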

I have developed a tool for identifying these linguistic hotspots using anomaly detection techniques adapted from banking fraud. Research has shown that these linguistic deception features cluster together when deceptive language is being used. Ergo, anomalous clusters of linguistic features that point to deception are areas of potential deception. Not to steal Dr Ekman’s thunder, I am calling these anomalous textual hotspots ‘outliars’.

I’ve tested this hypothesis on a number of known deceptions and the results, which are promising, were presented at the Decepticon 2017 conference held at Stanford University.  I chose statements made by sportspersons about the use of performance-enhancing drugs and doping because these are high-stakes deceptions and so more likely to leave linguistic traces.


One of the most famous examples of doping deception is Lance Armstrong. How did Lance Armstrong successfully deceive so many millions for so long? Bullying, good lawyers and a fairytale narrative of cancer recovery and global charity certainly played their part. But the key to Armstrong maintaining this deception – through various testimony, interviews, biographies – was his sustained verbal performance.

One classic example is Armstrong’s 2005 interview on the ESPN show ‘Outside the Lines’, an investigative TV series that takes a critical look at American sports issues. The interview was conducted by the show’s usual anchorman, Bob Ley. Armstrong was a year into his first retirement, after winning his 7th Tour de France in 2005, and had just been cleared of doping allegations after a lengthy trial. The show is renowned for its tough questioning and investigative slant, and Bob Ley did not hold back. Below is a transcript of the interview.

Figure 1 shows my ‘Outliar’ analysis of the responses. The  analysis picks out two linguistic hotspots of potential deception – sections 5 and 12. These are highlighted in the transcript but I’m going to lay them out here for analysis.


Figure 1: Outliar analysis of Lance Armstrong’s interview responses. Armstrong’s interview responses are represented as a time series on the x-axis (c.30 second chunks). The y-axis measures the relative frequency of linguistic deception features; text segments scoring over 3.5 are recorded as anomalous (the Iglewicz-Hoaglin method).
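The Iglewicz-Hoaglin method named in the caption scores each value by its distance from the median, scaled by the median absolute deviation (MAD), and treats scores beyond 3.5 as potential outliers. The Python sketch below shows that rule on its own. It assumes the per-segment feature frequencies have already been computed; the numbers are invented for illustration and are not the values behind Figure 1, and the function names are hypothetical.

```python
from statistics import median

def modified_z_scores(values):
    """Iglewicz-Hoaglin modified z-scores: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # The score is undefined when MAD is zero; this sketch simply
        # reports no outliers in that case.
        return [0.0 for _ in values]
    return [0.6745 * (x - med) / mad for x in values]

def flag_outliers(values, threshold=3.5):
    """Return 1-based segment numbers whose absolute score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, score in enumerate(scores, start=1) if abs(score) > threshold]

# Invented per-segment feature frequencies, for illustration only.
segment_frequencies = [0.10, 0.12, 0.11, 0.09, 0.30, 0.10,
                       0.11, 0.12, 0.10, 0.11, 0.09, 0.28]

print(flag_outliers(segment_frequencies))
```

The sketch flags scores in either direction by taking the absolute value, which is the usual form of the test. If only unusually high concentrations of features are of interest, the comparison could be made one-sided instead.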

Segment 5 contains the following extract. Bob Ley had asked whether it was true Armstrong had made a phone call to Dr Prentice Steffen threatening “to spend a lot of money to make your life miserable” if Steffen did not retract comments accusing Armstrong of doping [transcript lines 46-50].

ARMSTRONG: Not true. Steven, er Prentice Steffen I think was his name, was not part of the team when I was there, I hardly know him. The only interaction I ever had with him I think was when he was a team doctor with the Mercury cycling team and I helped one of their young riders I think get care for testicular cancer. That’s the first and only interaction I ever had with him.

In this outlier extract, Armstrong denies the accusation by distancing himself from Dr Prentice Steffen (Steven, Stefan, what was his name again?). Instead he foregrounds his charity work for an anonymous sick rider. The sentence beginning “The only interaction I ever had with him” introduces three new referents – a young rider, Mercury and cancer. Such a topic shift and introduction of third-party issues is a pragmatic technique for diverting attention. The cluster of pronouns picked out by the analysis – ‘he’, ‘their’, ‘I’ – facilitates the diversion and leaves the final ‘him’ ambiguous (technically this ‘him’ should refer to the nearest qualifying noun i.e. ‘one of their young riders’).

In contrast, the following extract from segment 8 is representative of Armstrong’s ‘baseline reading’. Bob Ley had asked whether it was true Armstrong had made a phone call to Greg LeMond, threatening to smear him: “I can produce 10 people that say you took EPO”:

With regards to Greg Lemond I have to say as a young guy, I did idolize him in 1989, I think we all remember that incredible story coming back after getting shot and winning the tour by 8 seconds, the smallest margin ever. I mean he was a guy that quite literally put all of us into cycling, because he was appealing to us at a young age. But, er, for a past champion and a great champion, one of the greatest athletes of all time, to be so involved in a case, I mean I ask you Bob, I ask the viewers, why would you be so involved?

In answering this, Armstrong appears to show rare humility; he acknowledges his own inspiration and even someone else’s achievements. However, a closer reading reveals a mocking tone in which Armstrong draws attention to the narrowness of LeMond’s victory – “winning the tour by 8 seconds, the smallest margin ever”. The transcript shows that taunting accusers is Armstrong’s baseline linguistic behaviour in this interview, which is why the reticent language used when discussing Dr Prentice Steffen in segment 5 stands out as deceptive.

The analysis also flags the following segment 12 as a linguistic hotspot of potential  deception. Here Ley has asked Armstrong about his attempts to shut down the World Anti-Doping Agency (WADA) investigation that was shining a light on doping in cycling and Armstrong at that time.

Now there’s two people involved in this process. There’s the athletes and there’s the people who police the athletes. And both of them have to be ethical. Listen, nobody believes in doping controls more than me. I’ve submitted to all of them, whether in competition or out of competition. Now listen, I’m not saying my best defence is I’ve never tested positive. All I’m saying is that the last few years when you were supposed to tell the investigators and the drug testers everywhere you were everyday of the year, I did it.

Here, rather than taunting accusers, Armstrong again points the linguistic finger. He insinuates that the drug testing process and its ubiquitous participants (‘people’, ‘investigators’, ‘testers’) may not be ethical and he portrays himself as a willing (and perhaps slightly persecuted) subject to the testing regime. However, with the assertion “nobody believes in doping controls more than me”, Armstrong leaks the fact that he has been expert at manipulating the drug testing system. He immediately realises this ‘slip’ and moves to deny its implicature that his “best defence is I’ve never tested positive”. The final ‘it’ is ambiguous and difficult to resolve – a deception strategy we also saw in the above segment 5.

There are more examples like this in my Decepticon 2017 Stanford presentation (including an interesting connection between doping and asthma!) so take a look at that if you are interested in more detail on the method (or write to me). But the real value of this method, I think, is as an investigative linguistic tool which can identify ‘points of interest’ and thus aid forensic and journalistic investigations. So future blog posts will probe the public statements and testimony related to the key events, scandals and crimes of this post-truth era.