This thesis presents SAESNEG, a System for the Automated Extraction of Social Network Event Groups: a pipeline for aggregating the personal social media footprint and partitioning it into events (the ``event clustering'' problem). SAESNEG facilitates a reminiscence-friendly user experience, in which users are able to navigate their social media footprint. A range of socio-technical issues are explored: the challenges to reminiscence, lifelogging, ownership, and digital death. Whilst previous systems have focused on organising a single type of data, such as photos or Tweets, SAESNEG handles the variety of social network documents found in a typical footprint (e.g. photos, Tweets, check-ins), with a variety of image, text and other metadata (differently heterogeneous data), and is adapted to the sparse, private events typical of the personal social media footprint.
Phase A extracts information, focusing on natural language processing. New techniques are developed, including a novel distributed approach to handling temporal expressions and a parser for social events (such as birthdays). Information is also extracted from images and metadata, and the resultant annotations feed the subsequent event clustering. Phase B performs event clustering by applying a number of pairwise similarity strategies, a mixture of new and existing algorithms. Clustering itself is achieved by combining machine learning with correlation clustering.
The main contributions of this thesis are the identification of the technical research task (and the associated social need), the development of novel algorithms and approaches, and the integration of these with existing algorithms to form the pipeline. Results demonstrate SAESNEG's capability to perform event clustering on a differently heterogeneous dataset, enabling users to achieve lifelogging in the context of their existing social media networks.
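As a rough illustration of the Phase B idea (pairwise similarity scores feeding a correlation-style clustering), the sketch below uses a simple greedy pivot heuristic. This is an assumption made for illustration and not necessarily the solver used in the thesis. The names `same_event_prob` and `pivot_correlation_clustering`, the time-gap scoring function, and the toy documents are all hypothetical; in the pipeline described above, the pairwise score would come from a trained model combining several similarity strategies.

```python
import math

def same_event_prob(doc_a, doc_b):
    """Hypothetical stand-in for a learned pairwise classifier.

    Scores two documents by their time gap in hours only; a real
    pipeline would combine many similarity features.
    """
    gap_hours = abs(doc_a[1] - doc_b[1])
    return math.exp(-gap_hours / 6.0)

def pivot_correlation_clustering(items, prob, threshold=0.5):
    """Greedy pivot heuristic for correlation clustering.

    Takes the first unassigned item as the pivot (a random pivot is the
    usual choice; a fixed order keeps this example deterministic) and
    groups with it every unassigned item whose pairwise score with the
    pivot reaches the threshold.
    """
    unassigned = list(items)
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot]
        rest = []
        for other in unassigned[1:]:
            if prob(pivot, other) >= threshold:
                cluster.append(other)
            else:
                rest.append(other)
        clusters.append(cluster)
        unassigned = rest
    return clusters

# Toy footprint: (document id, hours since an arbitrary start time).
docs = [
    ("photo1", 10.0),
    ("tweet1", 11.0),
    ("checkin1", 12.5),
    ("photo2", 40.0),
    ("tweet2", 41.0),
]
events = pivot_correlation_clustering(docs, same_event_prob)
event_ids = [[doc_id for doc_id, _ in cluster] for cluster in events]
```

With these toy timestamps, the three documents posted within a few hours of each other fall into one event and the two later documents into another. The threshold of 0.5 is an arbitrary choice for the example.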
Detecting and understanding temporal expressions are key tasks in natural language processing (NLP), and are important for event detection and information retrieval. In existing approaches, temporal semantics are typically represented as discrete ranges or specific dates, and the task is restricted to text that conforms to this representation. We propose an alternative paradigm, that of distributed temporal semantics, in which a probability density function models the relative probabilities of the various interpretations. We extend SUTime, a state-of-the-art NLP system, to incorporate our approach, and build definitions of new and existing temporal expressions.
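To make the idea of distributed temporal semantics concrete, here is a minimal sketch in which an expression such as "Christmas" maps to a probability distribution over dates rather than to a single date. The Gaussian shape, the spread of two days, the discretisation to whole days, and the name `gaussian_date_distribution` are assumptions for illustration only; they are not the definitions built into the SUTime extension.

```python
import math
from datetime import date, timedelta

def gaussian_date_distribution(peak, sigma_days, window_days):
    """Return {date: probability} for the days within window_days of peak.

    Weights follow a Gaussian centred on `peak` and are normalised so
    that the probabilities over the window sum to 1.
    """
    weights = {}
    for offset in range(-window_days, window_days + 1):
        day = peak + timedelta(days=offset)
        weights[day] = math.exp(-0.5 * (offset / sigma_days) ** 2)
    total = sum(weights.values())
    return {day: w / total for day, w in weights.items()}

# Hypothetical definition: "Christmas" in 2012 most probably refers to
# 25 December, with neighbouring days remaining plausible readings.
christmas_2012 = gaussian_date_distribution(
    date(2012, 12, 25), sigma_days=2.0, window_days=7
)
most_likely = max(christmas_2012, key=christmas_2012.get)
```

Under this representation, a document mentioning "Christmas" can still be matched to a photo taken on 24 or 26 December, only with a lower probability than one taken on the 25th.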
It would be great to hear any comments or suggestions. I would be keen to collaborate with anyone wishing to incorporate the approach into other software or further academic work. I was thinking of introducing P.D.F. "dual" definitions for other time periods, such as "morning", "weekend" and the like. I also did not solve the issue of reconciling the "next" operator with the probabilistic model.
Awarded Prize for Best Poster at BCS SGAI 2012.
If you use the data, please cite my paper! ☺