SCOOP is an international collaborative project supported by EUREKA's Eurostars Programme.



Scoop Project overview

The SCOOP project aims to develop an innovative software application that covers the complete process of searching, acquiring, analysing and representing online opinions, based on new real-time processing algorithms for semantic analysis and new cartography technologies.

The main innovation of the project is its real-time, integrated approach to the complete process of search, acquisition, analysis and visualization of online opinions:

Search and Acquisition

To prepare coherent, clean data for the subsequent processing stages (linguistic and sentiment analysis, clustering, ...), new algorithms and tools will be developed for identifying high-quality sources, and our existing content "cleaning" tools will be improved.

To measure source quality, we are working on an approach based on Kleinberg's HITS algorithm, which identifies the authoritative sources for a given topic.
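As an illustration of the idea behind HITS, here is a minimal sketch in Python. It is not the Scoop implementation: the function name `hits`, the link-graph format and the source names in the usage note are assumptions made for this example. Each source gets a hub score (it points to good authorities) and an authority score (good hubs point to it), and the two scores are refined in alternation.

```python
from math import sqrt

def hits(links, iterations=50):
    """Hypothetical helper: links maps each source to the sources it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the sources linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the sources it links to.
        hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

With a toy graph such as `{"blog_a": ["press_x", "press_y"], "blog_b": ["press_x"], "blog_c": ["press_x", "press_y"]}`, the source cited by all three blogs ends up with the highest authority score. In practice the graph would be restricted to pages relevant to the topic under study before scoring.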

The content "cleaning" process is needed in the very early stages of data acquisition (before filtering on query keywords), so the chosen solutions must scale to huge document volumes. Its aim is to extract the relevant part of an HTML page and discard the remaining noise, such as advertisements, links, menus, ...
Our first approach was based on 'site descriptors': sets of rules describing where the relevant data begins and where it ends. This approach produces very good results, but the descriptors are laborious to write and hard to maintain.
Another solution has been developed for the Scoop project, based on an automatic analysis of the topological HTML structure of pages and their layout. This approach gives very satisfactory results for institutional and general-purpose sites, online press, ..., but problems can arise when processing blogs whose comments are longer than the primary post. To handle this problem, AMI will develop a set of specific topological parsers for the most popular user-generated content platforms.
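To give a feel for structure-based cleaning, the sketch below uses a deliberately simple heuristic that stands in for the Scoop parser, whose details are not described here: it credits each container element with the non-link text of its direct paragraphs and keeps the container with the most text. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class MainContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []      # tag name of every element opened so far
        self.stack = []     # indices (into self.tags) of currently open elements
        self.texts = {}     # container index -> paragraph text chunks
        self.in_link = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.tags.append(tag)
        self.stack.append(len(self.tags) - 1)
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.tags[self.stack[pos]] == tag:
                closed = self.stack[pos:]
                del self.stack[pos:]
                self.in_link -= sum(1 for i in closed if self.tags[i] == "a")
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or self.in_link:
            return  # ignore whitespace and link text (menus, "read more", ...)
        # Credit the parent of the nearest enclosing <p> with this text.
        for pos in range(len(self.stack) - 1, 0, -1):
            if self.tags[self.stack[pos]] == "p":
                container = self.stack[pos - 1]
                self.texts.setdefault(container, []).append(text)
                return

def extract_main_text(html):
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    if not finder.texts:
        return ""
    best = max(finder.texts.values(), key=lambda chunks: sum(map(len, chunks)))
    return " ".join(best)
```

A heuristic of this kind shows the same weakness noted above: on a blog page where the comment block holds more text than the post, the comment container wins, which is why platform-specific parsers are needed.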

Linguistic Analysis of Web 2.0 content

The innovation concerns extending the coverage of the Italian, French and English dependency grammars to deal with the linguistic phenomena typically involved in sentiment expressions. This means going beyond parsing and named entity recognition to develop coreference resolution systems for these languages.
The new approach developed by XEROX focuses on a fine-grained description of linguistic phenomena, captured through a rich lexical and syntactico-semantic representation encoded in the Xerox Incremental Parser (XIP) formalism. XIP is based on a specific methodology called incremental parsing, which combines very high computational performance (speed of text analysis) with expressive power, so that these phenomena can be described in detail and processed efficiently. A detailed description of the approach and its achievements can be found here.

Sentiment Analysis

Sentiment analysis is one of the key components addressed by the Scoop project. It aims to perform opinion calculus over the dependency output of the grammars developed in the linguistic analysis stage. This computation is meant to capture the main aspects of a "conversation" environment, as follows:

Representation of the online opinions

The purpose of this part of the Scoop project is to develop powerful tools that help analysts extract meaning and produce useful syntheses from huge quantities of collected information. These tools must highlight the main trends and major events of the studied environment, but also detect weak signals that could announce important future changes.
Our approach first extracts key terms (descriptors) describing the main content of the collected data. These descriptors can be selected by:

A clustering algorithm based on the 'Co-word Analysis' approach (Neal Coulter et al., Software Engineering as Seen Through Its Research Literature: A Study in Co-Word Analysis) has been implemented. Descriptors are linked according to a co-occurrence measure (two descriptors that often occur in the same documents are considered very close) to form clusters. The clusters can be considered distinct sub-themes of the general topic under study.
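The linking step can be sketched as follows. This is an illustrative simplification, not the implemented algorithm: it assumes the equivalence index commonly used in co-word analysis, E(a, b) = c_ab² / (c_a · c_b), and forms clusters by joining every pair whose index reaches a threshold. The function name, the threshold value and the input format are assumptions.

```python
from collections import Counter
from itertools import combinations

def coword_clusters(documents, threshold=0.5):
    """Hypothetical helper: documents is an iterable of descriptor sets."""
    occ = Counter()    # number of documents containing each descriptor
    cooc = Counter()   # number of documents containing both descriptors of a pair
    for doc in documents:
        terms = sorted(set(doc))
        occ.update(terms)
        cooc.update(combinations(terms, 2))

    # Union-find structure used to merge strongly linked descriptors.
    parent = {t: t for t in occ}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), c_ab in cooc.items():
        strength = c_ab * c_ab / (occ[a] * occ[b])  # equivalence index in [0, 1]
        if strength >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in occ:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())
```

Joining every pair above the threshold can chain unrelated descriptors into one large cluster; limiting the number of links or the cluster size is a common way to keep sub-themes distinct.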

The next step is to develop cartographic tools for visually representing and navigating the data collected, analyzed and clustered in the earlier stages of the process.