Only by imposing some structure on the binary code is it converted to textual characters as we know them. Yet there is no similarly widespread system for converting the characters into higher levels of structure that correlate with our understanding of meaning. While a search can be made for the string "plaintiff", there are no widely available searches for a string that represents an individual who bears the role of plaintiff.
To make language on the Web more meaningful and structured, additional content must be added to the source material, which is where the Semantic Web and Natural Language Processing come into play. The Semantic Web is a complex of design principles and technologies which are intended to make information on the Web more meaningful and usable to people.
We focus on only a small portion of this structure, namely the syntactic XML (eXtensible Markup Language) level, where elements are annotated so as to indicate linguistically relevant information and structure. In a document related to this case, we would see text such as the following portions:
While it is relatively straightforward to structure the binary string into characters, adding further information is more difficult. Consider what we know about this small fragment: "Harris" and "Jane" are very likely first names, "Hill" and "Smith" are last names, "Harris Hill" and "Jane Smith" are full names of people, "plaintiff" and "attorney" are roles in a legal case, Harris Hill has the role of plaintiff, "attorney for" is a relationship between two entities, and Jane Smith is in the "attorney for" relationship to Harris Hill.
It would be useful to encode this information in a standardised, machine-readable and processable form. XML helps to encode the information by specifying requirements for tags that can be used to annotate the text. One requirement is that each tag has a beginning and an ending; the material in between is the data being tagged. For example, suppose tags such as the following, where "…" indicates the data. Another requirement is that the tags have a tree structure, where each pair of tags in the document is included in another pair of tags and there is no "crossing over".
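Such tags might look as follows. The tag names here (case, party, role, attorneyFor) are hypothetical illustrations rather than a standard legal-XML schema; the point is that every tag opens and closes, and every pair nests fully inside another pair, giving the required tree structure:

```xml
<!-- Hypothetical tags; each has a beginning and an ending,
     and the pairs nest without crossing over. -->
<case>
  <party role="plaintiff">
    <name>Harris Hill</name>
  </party>
  <party role="attorney" attorneyFor="Harris Hill">
    <name>Jane Smith</name>
  </party>
</case>
```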
Finally, XML tags can be organised into schemas that structure the tags. We have added structured information (the tags) to the original text. While this is more difficult for us to read, it is very easy for a machine to read and process. In addition, the tagged text contains the content of the information, which can be presented in a range of alternative ways and formats using a transformation language such as XSLT, so that we have an easier-to-read format.
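As a sketch of what such a transformation looks like, the following XSLT stylesheet renders an annotated case as readable HTML. The tag names (case, attorney, name) are hypothetical, not a standard schema:

```xml
<?xml version="1.0"?>
<!-- A minimal XSLT sketch over hypothetical tag names:
     it extracts the attorney's name and wraps it in HTML. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/case">
    <p>Attorney: <xsl:value-of select="attorney/name"/></p>
  </xsl:template>
</xsl:stylesheet>
```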
Why bother to include all this additional information in a legal text? Because these additions allow us to query the source text and submit the information to further processing such as inference. Given a query language, we could submit to the machine the query "Who is the attorney in the case?" Though this may seem like too much technology for such a small and obvious task, it is essential when we scale up our queries and inferences to large corpora of legal texts (hundreds of thousands, if not millions, of documents) which comprise vast storehouses of unstructured yet meaningful data.
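A minimal sketch of such a query, assuming hypothetical tag names (case, plaintiff, attorney) rather than any standard schema; once the text is annotated, answering "Who is the attorney?" becomes a simple tree lookup:

```python
import xml.etree.ElementTree as ET

# A toy annotated legal text; the tag names are invented for illustration.
doc = """
<case>
  <plaintiff><name>Harris Hill</name></plaintiff>
  <attorney for="Harris Hill"><name>Jane Smith</name></attorney>
</case>
"""

root = ET.fromstring(doc)
# The machine can now answer the query directly from the structure.
attorney = root.find("attorney/name").text
plaintiff = root.find("plaintiff/name").text
print(f"{attorney} is the attorney for {plaintiff}")
```

The same lookup, run uniformly over a large annotated corpus, is what turns tagged text into a queryable database.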
Were all legal cases uniformly annotated, we could, in principle, find out every attorney for every plaintiff for every legal case. Where our tagging structure is very rich, our queries and inferences could also be very rich and detailed. Perhaps a more familiar way to view documents annotated with XML is as a database to which further processes can be applied over the Web.
As we have presented it, we have an input, the corpus of texts, and an output, texts annotated with XML tags. The objective is to support a range of processes such as querying and inference. However, getting from a corpus of textual information to annotated output is a demanding task, generically referred to as the knowledge acquisition bottleneck. Not only is the task demanding on resources (time, money, manpower); it is also highly knowledge-intensive, since whoever is doing the annotation must know what to look for, and it is important that all of the annotators annotate the text in the same way (inter-annotator agreement) to support the processes.
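Inter-annotator agreement is typically quantified with a chance-corrected statistic such as Cohen's kappa. A minimal sketch, with two invented annotators' tag sequences (the labels are hypothetical, not from a real annotation scheme):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

ann1 = ["plaintiff", "attorney", "other", "plaintiff", "other", "other"]
ann2 = ["plaintiff", "attorney", "other", "other", "other", "other"]
print(round(cohen_kappa(ann1, ann2), 2))  # → 0.71
```

Low kappa signals that the annotation guidelines are unclear, which undermines any downstream querying or inference over the tags.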
Thus, automation is central. Yet processing language to support such richly annotated documents confronts a spectrum of difficult issues. Among them, natural language supports (1) implicit or presupposed information, (2) multiple forms with the same meaning, (3) the same form with different contextually dependent meanings, and (4) dispersed meanings.
Similar points can be made for sentences or other linguistic elements. Here are examples of these four issues: She works for Dewey, Cheetum, and Howe. To contact her, write to j. When we search for information, a range of linguistic structures or relationships may be relevant to our query, such as: People grasp relationships between words and phrases, such that "Bill exercises daily" contrasts with the meaning of "Bill is a couch potato", or that if it is true that "Bill used a knife to kill Phil", then "Bill killed Phil".
Finally, meaning tends to be sparse; that is, a few words and patterns occur very regularly, while most words or patterns occur relatively rarely in the corpus.
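This sparsity is easy to see even in a tiny invented fragment: a handful of word types account for most of the tokens, while the majority of types occur exactly once.

```python
from collections import Counter

# A toy corpus, invented for illustration of the sparsity point.
corpus = ("the court held that the plaintiff may recover damages "
          "and the court denied the motion").split()

counts = Counter(corpus)
singletons = [w for w, c in counts.items() if c == 1]

print(counts.most_common(2))   # a few types dominate
print(f"{len(singletons)} of {len(counts)} types occur exactly once")
```

At corpus scale this skew means that statistical methods see abundant evidence for a few patterns and almost none for the long tail.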
Natural language processing (NLP) takes on this highly complex and daunting problem as an engineering problem, decomposing large problems into smaller problems and subdomains until it reaches those it can begin to address. Having found solutions to the smaller problems, NLP can then address other or larger-scope problems. Some of the subtopics in NLP are:
There is a range of techniques that one can apply to analyse the linguistic data obtained from legal texts; each of these techniques has strengths and weaknesses with respect to different problems.
Rather, algorithms are applied that compare and contrast large bodies of textual data and identify regularities and similarities. Such algorithms encounter problems with sparse data or patterns that are widely dispersed across the text. See Turney and Pantel for an overview of this area. At the same time, the authors cover some topics, for example sentiment analysis, that depend heavily on natural language techniques and use little or no semantics at all.
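The corpus-comparison idea mentioned above, as surveyed by Turney and Pantel, can be sketched in miniature: texts are mapped to word-count vectors and compared by cosine similarity. The sentences below are invented examples, and real systems use far richer vector spaces.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

t1 = Counter("the attorney filed a motion for the plaintiff".split())
t2 = Counter("the plaintiff retained an attorney".split())
t3 = Counter("rainfall was heavy in april".split())

# Texts on related topics share vocabulary and so score higher.
print(cosine(t1, t2) > cosine(t1, t3))  # → True
```

Sparse or dispersed patterns are exactly where this breaks down: two related texts that happen to share no surface vocabulary get a similarity of zero.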
Since the topic of this paper is the use of the Semantic Web in addressing social media challenges, the authors could have limited their scope to topics which use semantics to some extent to solve the problems. Alternatively, the authors could discuss how the use of dictionaries (e.g. Urban Dictionary), machine learning, or background knowledge could be applied to this topic (e.g. ICWSM has a paper on topic-specific sentiment, where identification of the topics with which sentiment is associated utilises lightweight semantics).
The global topics can be improved in terms of clarity.
One of the major sections in the paper is on semantic annotation. The authors have summarized the what and how of semantic annotation but haven't discussed much about why semantic annotations are useful. The paper has some grammatical mistakes, like "These graph-based approaches to extracting keywords from Twitter …".

The paper presents an excellent survey of the state-of-the-art approaches to various technologies for mining and intelligent analysis of user-generated content in social media. To the knowledge of this reviewer, this work is unique of its kind and presents the most comprehensive, up-to-date analysis of the literature in this emerging field of research.
The paper is fun to read. It is well-written, easy to follow and summarizes the state-of-the-art very nicely. There are just a couple of comments regarding how to improve the paper.
The authors point out the lack of lexical knowledge for processing user-generated content. Besides Wikipedia, which has by now become an established resource in text analysis, Wiktionary has been found to be very valuable for these purposes [1,2]. Its particular advantage over standard lexical-semantic resources is the inclusion of terms specific to user-generated content on the Web.
A particular example of a linked lexical-semantic resource is UBY, from the same group as the Wiktionary resource mentioned above [3,4].
The authors mention crowdsourcing as a possible way to improve the performance of automatic systems. This topic has received a lot of attention from different communities in recent years. Human computation, collective intelligence, and games with a purpose could thus be discussed in greater detail. There are numerous references for this, which could make a very nice separate section in the context of this article, for example …
Multilinguality is mentioned as one of the major challenges, with most of the methods being developed for English content only. This reviewer would like to see more discussion of what has been done for other languages and how the problem can be tackled. Which technologies should be researched intensively to address this issue?
References: Christian M. Meyer and Iryna Gurevych. Wiktionary: a new rival for expert-built lexicons? Exploring the possibilities of collaborative lexicography. In: M.
Stellato Eds. Meyer and Christian Wirth.