Book review: Natural Language Annotation for Machine Learning

The book's title is misleading. Its subtitle - A Guide to Corpus Building for Applications - is more descriptive. I believe that not only machine learners, but linguists (esp. corpus and computational linguists), practitioners of the digital humanities and others who are using and/or collecting linguistic data can deepen their knowledge with the help of this terrific book.

Although O'Reilly will publish the book in Sept. 2012, it is already available as an Early Release in electronic format. Keep in mind that this is a "work in progress" version when you come across sentences starting with lower case letters and references to Chapter??? and Appendix???. You will also find references to chapters that are not included in the book yet. These are the marks of an early release, and there are few enough of them that they don't distract from the reading experience.

 

It is hard to define the target group of this title. According to the preface, you can read it without any previous knowledge of linguistics and/or natural language processing, but when you read such things in a book from a publisher of technical books, you can assume that the authors' hands were guided by someone in the marketing department. You don't need to be a linguist or an NLP guru to understand the content, but you do need some background in the field. Previous exposure to NLTK (and the NLTK book) and some basic knowledge of corpus linguistics (e.g. Corpus Linguistics by McEnery and Wilson, Corpus Linguistics by McEnery and Hardie, or Gries's brilliant Quantitative Corpus Linguistics with R) are essential to understand the role of corpora in applied and academic research.

 

The first chapter ("The Basics") gives a detailed review of what corpus linguistics is, what a corpus is, and how corpora relate to machine learning tasks. But if you want a broader overview of the theory and history of corpus linguistics, I recommend the first chapter of McEnery and Wilson. Although Leech's name is mentioned in this chapter, I missed a mention of his seven maxims of annotation (again, McEnery and Wilson can help you out here). We also get a brief summary of the MATTER methodology, which is the main topic of the book. MATTER stands for Model, Annotate, Train, Test, Evaluate, Revise - the steps of the corpus development cycle. This high-level intro puts the method into context, which helps in understanding the following chapters - and I think it can serve as an "executive summary" too. I loved the brief section on relevance testing (precision, recall, F-measure), as these measures are vitally important in real-world applications.
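For readers who haven't met these measures yet, here is a minimal sketch of how precision, recall and the balanced F-measure (F1) can be computed when you compare a system's annotations against a gold standard. It is not taken from the book: the function name, the (start, end, label) tuple format and the sample spans are all made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of annotations and return (precision, recall, F1).

    Annotations are assumed (hypothetically) to be hashable items such as
    (start_offset, end_offset, label) tuples; only exact matches count.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Precision: what share of the system's annotations are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the gold annotations did the system find?
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up named entity spans: four in the gold standard, three predicted,
# two of which match the gold standard exactly.
gold_spans = [(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (30, 34, "PER")]
system_spans = [(0, 5, "PER"), (10, 16, "LOC"), (40, 45, "ORG")]

p, r, f = precision_recall_f1(gold_spans, system_spans)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Exact matching is the strictest option; a real evaluation might also give credit for partial overlaps, which is a decision that belongs in the annotation and evaluation guidelines.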

 

The second chapter (Defining Your Goal and Dataset) is about the 'M' in the MATTER cycle. It gives practical advice on defining the statement of purpose and expanding it to see how you can reach your goals. I like the pragmatic tone of the chapter. Sure, you have a great idea, but you have to consider the task and the available resources, and you have to collect some data - so think it over and define why you are collecting data, what kind of data you want to collect and how you will process it. This process involves a lot of thinking and weighing of possibilities, and the book helps you go through these steps.

 

Chapter three (Building your Model and Specification) stays at the 'M', but it gets more concrete. It is about the formal definition of models and how to implement them (in XML). The topic - XML and various standards - may sound boring, but the authors did a great job, and it is very refreshing to see the fragmented pieces of information compiled into a compact yet enjoyable chapter (ok, maybe only linguists think this is not boring).

 

The fourth chapter (Applying and Adopting Annotation Standards to Your Model) gives hints about bending standards and resources to your needs. It weighs technological considerations along with human factors (aka annotators), and shows best practices that serve both sides well.

 

I do hope more chapters will be available soon. The practical focus and the vivid real-world examples (e.g. named entity recognition, semantic role labeling, etc.) make the book very accessible to a wider audience. It contains valuable information that used to be almost inaccessible; until now it took a long time to collect the knowledge necessary to build corpora. I think this title will be a great success, just like Semantic Web for the Working Ontologist was in the semantic web and enterprise ontologist community.

Breaking the Page: Transforming Books and the Reading Experience (Preview Edition)

Breaking the Page is a promising title. The preview edition is a great idea: similar to O'Reilly Rough Cuts or Manning's MEAP programme, it is used to gain feedback from the readership during the writing process. There are three chapters available in the ebook as of writing this review. The author has 20+ years of experience in technology and writing, and he runs a very interesting blog on digital books. His expertise can be felt throughout the pages of the title.

The three chapters are about searching and navigating in digital books. Ebook readers are clearly trying to imitate the dead-tree book experience. Although they have done a great job, there are features that should be redesigned for the new platform. The table of contents and the index are there to give an impression of the book and to help the reader navigate through the text. The first two chapters examine how we can adapt these features to the new medium. The approach will be familiar to everyone who reads about interaction design and/or information architecture, since Meyers is talking about design patterns of a sort and not about the underlying technology. The third chapter is a stub. After a good diagnosis, the treatment should come, but this chapter fails to deliver a detailed explanation of how to make navigation better. But this is a preview edition, so there is more to come later.

As a teaser, I'd definitely recommend the book to everyone interested in digital books. I'm eagerly waiting for the final version.

The Information Diet: A Case for Conscious Consumption by Clay A. Johnson [Review]

According to Johnson, there is no such thing as information overload. Rather, we consume junk information produced by content farms. He proposes conscious consumption of information, which is not about consuming less but about developing a balanced and healthy habit, just like when you go on a diet. Although I don't agree with every word of it, I really enjoyed reading the book, as it is full of stories and clear descriptions of various scientific studies.

  • Clay A. Johnson: The Information Diet: A Case for Conscious Consumption
  • O'Reilly, 2011
  • informationdiet.com

 

Read the rest of this post »

So, you want a real job?

A guest post by Christopher Phipps, The Lousy Linguist
Let’s play a little thought game. Let’s pretend you have an academic background in linguistics with at least some graduate level study and now you want to find a job outside of academics, in industry. What should you expect to find? In the USA (and this is probably largely true of Europe as well), you should expect to find that most people misunderstand your skill set (you will first and foremost be wrongly categorized as a translator).

Read the rest of this post »

Why Clojure lx?

This post is cross-posted on clojurelx, a new project blog

 

The NLTK is a natural choice for students of linguistics and computer science. It has matured into a stable project, its users are very active, and it is now used outside of academia. Those who are into functional programming can use the Scheme Natural Language Toolkit, or learn from Natural Language Processing for the Working Programmer, and those who need the JVM can turn to ScalaNLP. So why bother with Clojure?

Read the rest of this post »

Applicative functors

It's been a while since I posted on this blog, but I had too many things to do - so forgive me, dear reader. During my Haskell adventures I realized that I have a very bad relationship with applicative functors, so I have been working on this romance. I have to admit that I spent only a little time on applicative functors when we first met - I was very keen on Haskell because I was mesmerized by The Haskell Road book as an aspiring young logician (well, I still love Haskell, but I am neither young nor a real logician anymore) and I didn't pay much attention to the impure and/or uninteresting parts of the language. Although applicative functors are a recent development, they make those ugly impure tasks easier - that's why you should master them. Let me recommend a few sources that help you develop a basic understanding of applicative functors - though I can't promise that you will be a "pro" after reading these papers. However, if you devote some time and attention to the details, and you don't just read the technical papers but really try to work through them, then you'll be in a good position.

Read the rest of this post »

New books in my Haskell collection

I have two new books in my collection of Haskell related books. The first one is Introduction to Functional Programming by Wadler and Bird. The book is a bit dated, and it is not for newbies, but I do recommend it as a second book on Haskell (if you are a hard core programmer guy, I think you can use it as your first book). [At first I provided a link to the first edition, which uses Miranda, but thanks to a reader called mikyoo the link now points to the second edition]
The other book is the print version of Learn You a Haskell for Great Good. I love the online version and I recommend it to everyone who wants to step into the world of Haskell. I haven't been on the LYaH site for months but I think the book is more detailed and I think I'm old because I find the print version cooler...

Knight - Mozilla Learning Lab

So our entry for the Knight-Mozilla Innovation Challenge was successful and we are in the Learning Lab :D You can learn more about the Knight-Mozilla cooperation here and you can find the learning lab schedule here. This is an amazing opportunity to learn about computational journalism and web technologies, and we are eagerly waiting to start hacking and learning. The Lab is an online course and it uses the fantastic Peer to Peer University for delivering the lectures, etc. - if you don't know the P2P Uni, you should have a look at its site, as it is full of interesting open courses.

Our Entry for the Mozilla-Knight Innovation Challenge

The Mozilla Foundation has signed a 3-year agreement with the Knight Foundation, and as part of their partnership they are organizing three innovation challenges. We entered the "Beyond Comment Threads" challenge, and we are planning to enter the next one too. You can read about our ideas at http://computationaljournalism.posterous.com/ Please let us know what you think!