# Exp 1 data exploration
## Codify spam
✓ Consider a sentence spam if it meets *any* of the following:
* it has an ellipsis ("...") or other special characters (e.g. ">", "<")
* it is (even partly) addressed to the experimenter (e.g. "I can't remember")
* over half of the words are misspelled
* it doesn't stand as an autonomous sentence, e.g.:
* an unfinished sentence: a telltale sign is a sentence cut off at exactly 10 words (a consequence of the UI not making the lower word limit clear)
* a sentence where so many words are garbled it becomes nonsense
* it clearly has no relationship whatsoever to its parent sentence (i.e. it really is a new sentence)
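The rules above could be codified as a helper for further coding passes. This is a minimal sketch, assuming the boolean inputs come from manual coding (only the special-character check is automated); all function and parameter names are hypothetical:

```python
import re

# Ellipsis ("...") or other special characters (ASCII only here).
SPECIAL_CHARS = re.compile(r"\.\.\.|[<>]")

def is_spam(sentence, misspelled_ratio, addressed_to_experimenter,
            unfinished, nonsense, unrelated_to_parent):
    """Flag a sentence as spam if it meets any of the coding rules."""
    return (bool(SPECIAL_CHARS.search(sentence))
            or addressed_to_experimenter
            or misspelled_ratio > 0.5
            or unfinished
            or nonsense
            or unrelated_to_parent)

def effective_spam(tree_parents, spam_flags):
    """Propagate spam down the tree: children of spam count as spam.

    tree_parents maps sentence id -> parent id (None for roots);
    spam_flags maps sentence id -> bool from the coding pass.
    """
    def spammy(sid):
        if spam_flags.get(sid):
            return True
        parent = tree_parents.get(sid)
        return parent is not None and spammy(parent)
    return {sid: spammy(sid) for sid in tree_parents}
```

The propagation step is what makes the 22.4% figure count descendants of spam as spam.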
With one pass of coding (done by Sébastien), **the final spam rate is 22.4% of the sentences** (counting children of spam as spam).
## Descriptive stats
✓ Plot raw transformation rate and other distance measures vs.:
* ✓ word span
* ✓ age
* ✓ gender
* ✓ job type
### With the spam still present
Looks like this with gender as color:
![Profile variable interactions with gender as color](http://i.imgur.com/JkAOGxt.png)
and like this with job type as color:
![Profile variable interactions with job type as color](http://i.imgur.com/CJD92s8.png)
✓ Check distribution of reading times:
* see if the data is trustworthy at all
* use this to calibrate fixed reading time
Looks like this:
![Distribution of reading time proportions, flat and with pre-averaging per profile](http://i.imgur.com/vVUs9sX.png)
So, going by the flat distribution, over 35% of transformations were made by letting reading time run to the end, but over 40% also used less than half the allotted time. Not a good sign.
### With the spam removed
Looks like this with gender as color:
![Profile variable interactions with gender as color (despammed)](http://i.imgur.com/tfdSq9O.png)
and like this with job type as color:
![Profile variable interactions with job type as color (despammed)](http://i.imgur.com/4wPbAEO.png)
Distribution of reading times looks like this:
![Distribution of reading time proportions, flat and with pre-averaging per profile (despammed)](http://i.imgur.com/KLJcNOw.png)
**Conclusions**:
* despamming doesn't change much here
* no obvious effects yet, but we should really be seeing something at least in the word span ~ age graph (there is a problem with the word span test: too many scores are close to 0), and maybe also in the TR ~ age graph.
## Shifting positioning
Across the transformations, positioning shifts: who is talking, and from where, seems to change often. We could codify the dominant grammatical person (1st, 2nd, 3rd) and plot it vs. depth for each branch of each tree.
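A crude first pass at codifying the dominant person could just count person-marking pronouns and take the majority. A minimal sketch; the pronoun lists and the tie-breaking are assumptions (a POS tagger would be more robust):

```python
# Pronoun lists are an assumption for illustration; verb person marking
# and imperatives are ignored.
PRONOUNS = {
    1: {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    2: {"you", "your", "yours"},
    3: {"he", "him", "his", "she", "her", "hers", "it", "its",
        "they", "them", "their", "theirs"},
}

def dominant_person(sentence):
    """Return the dominant grammatical person (1, 2, or 3), or None."""
    tokens = [w.strip(".,!?;:'\"") for w in sentence.lower().split()]
    counts = {p: sum(t in prons for t in tokens)
              for p, prons in PRONOUNS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Applied along a branch, this gives the person-vs-depth series to plot.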
## 2D tree representation
(maybe we can directly see the transformation types from below)
### Procedure
Pipeline:
* ✓ stopword-filter and lemmatize _all_ sentences
* ✓ in a given tree, compute all pairwise sentence distances
* ✓ downscale the set of distances into a 2D space (✓ get a representativeness score on the way)
* ✓ plot tree paths in that 2D space
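The downscaling step can be sketched with classical (Torgerson) scaling on a precomputed distance matrix; this is a stand-in assumption for whichever metric MDS implementation is actually used, and the "stress" returned here is the share of squared distance the 2D embedding fails to capture, serving as the representativeness score:

```python
import numpy as np

def classical_mds(distances, dims=2):
    """Classical (Torgerson) metric MDS on a precomputed distance matrix.

    Returns (coords, stress): coords is an (n, dims) embedding, stress is
    the fraction of positive-eigenvalue mass not captured by the
    embedding (0 means a perfect fit in `dims` dimensions).
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)             # eigenvalues, ascending
    order = np.argsort(vals)[::-1][:dims]      # top `dims` eigenvalues
    pos = np.clip(vals[order], 0, None)
    coords = vecs[:, order] * np.sqrt(pos)
    explained = pos.sum() / np.clip(vals, 0, None).sum()
    return coords, 1.0 - explained
```

Running this per tree gives the path plots; running it once on all sentences gives the global chart discussed below.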
Potential distances:
* ✓ Levenshtein
* ✓ intersection of bags of words. The idea that sentences are stored as bags of words is supported by:
* Mary C Potter & Linda Lombardi (1990), Regeneration in the short-term recall of sentences
* Linda Lombardi & Mary C Potter (1992), The regeneration of syntax in short term memory
* Mary C. Potter & Linda Lombardi (1997), Syntactic Priming in Immediate Recall of Sentences
* KL divergence. Supported by studies of information in Darwin's readings.
(These distances make minor changes appear as distance 0, which is good.)
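The first two candidates can be sketched on the stopword-filtered, lemmatized token lists (so minor changes do come out as distance 0). The Jaccard normalization of the bag-of-words intersection is an assumption:

```python
def levenshtein(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def bag_distance(a, b):
    """1 - Jaccard overlap of the bags (sets) of words: ignores order."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)
```

Either can feed the pairwise distance matrix given to the downscaling step.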
In that 2D plot, look at the spread of final sentences, and the variability of paths to them.
The downscaling can also be done globally on all the sentences, making one big chart of the whole experiment. It's costly and might be hard to navigate, but it has the advantage of quantitative comparability between trees (i.e. comparing distances and orientations, instead of only comparing shapes).
### Results
With Levenshtein distance (which takes word order into account) and a metric multidimensional scaling, we get something like this (example on 6 trees; the starting point is always in red, and each branch has a different color) **with spam**:
![Example downscaled trees](http://i.imgur.com/uLQAA4S.png)
and **without spam**:
![Example downscaled trees (despammed)](http://i.imgur.com/NvXBTk7.png)
The distribution of stresses (i.e. the error created by the downscaling) looks like this **with and without spam**:
![Downscaling stress distribution](http://i.imgur.com/FyIdpZH.png)
**So, now that despamming is done, the work is to find a good set of features on sentences** which:
* discriminates types of trees based on the dominant type of transformation they get
* for a given tree with interesting transformations, shows the evolution of sentences
## Typify transformations
Use Aurelien Lauf, Mathieu Valette, Leila Khouas (2013), Analyzing Variation Patterns In Quotes Over Time.
Candidate categories: spam, strict copy, minor modification (caps, punctuation), content modification.
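These categories could be assigned with a simple rule cascade over parent/child pairs. A sketch under assumptions of my own (lowercasing and punctuation stripping as the definition of "minor"), not the method from the Lauf et al. paper:

```python
import string

def transformation_type(parent, child, is_spam=False):
    """Classify a child sentence relative to its parent.

    Categories: spam, strict copy, minor modification (caps,
    punctuation), content modification. Normalization choices here
    are illustrative assumptions.
    """
    if is_spam:
        return "spam"
    if child == parent:
        return "strict copy"
    strip = str.maketrans("", "", string.punctuation)
    if child.lower().translate(strip) == parent.lower().translate(strip):
        return "minor modification"
    return "content modification"
```

Counting these per tree would give the dominant transformation type used to discriminate trees in the 2D representation.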