Japanese Learning Support Systems: Hinoki Project Report

In this report, we introduce the Hinoki project, which set out to develop web-based Computer-Assisted Language Learning (CALL) systems for Japanese language learners more than a decade ago. Utilizing Natural Language Processing technologies and other linguistic resources, the project has come to encompass three systems, two corpora and many other resources. Beginning with the reading assistance system Asunaro, we describe the construction of Asunaro’s multilingual dictionary and its dependency grammar-based approach to reading assistance. The second system, Natsume, is a writing assistance system that uses large-scale corpora to provide an easy-to-use collocation search feature, notable for incorporating the concept of genre. The final system, Nutmeg, is an extension of Natsume and the Natane learner corpus. It provides automatic correction of learners' errors in compositions, drawing on Natsume for its large corpus and genre-aware collocation data and on Natane for its data on learner errors.


Preface
According to a 2009 report from the Japan Foundation, there are over three and a half million people learning Japanese outside Japan. 1 Fortunately, access to good general educational materials has become easier with the advent of the Internet. However, the situation for learners with specialized language needs, such as those pursuing a degree at a Japanese institution of higher education, has not improved as much.
The Japan Student Services Organization (JASSO) reports that there are 138,075 international students in Japan; another report by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) Agency for Cultural Affairs reports that there are 40,799 international students studying Japanese in Japan. 2 For these students, who are pursuing specialized study at institutions of higher education in Japan, the following are just some of the skills they will have to master:
- reading textbooks
- writing reports and papers
- listening to lectures and taking notes
- presenting at conferences or seminars
Because it is hard to tailor a Japanese language class to the specialized language needs of each learner's field, an alternative is needed. One way of approaching this problem is to provide Computer-Assisted Language Learning (CALL) systems for use online. CALL systems can supplement the language instruction provided to learners and assist them in studying material from their field of specialization. The construction of such self-learning, individualized learning systems has been the goal of the Hinoki project. The following report describes three systems and several linguistic resources, the results of pursuing this goal for over a decade.

Report Overview
After providing an overview of the Hinoki project, Chapter 2 describes the linguistic resources in use by the project. Chapter 3 introduces the Asunaro reading support system, which features courseware designed for science and engineering students, as well as a multilingual dictionary that includes several commonly underrepresented Asian languages. Chapter 4 describes Natsume, a writing assistance system that is backed by large-scale corpora and provides an easy-to-use search interface for collocations. Chapter 5 introduces the search system for our Natane learner corpus, which has applications to second language acquisition research and to machine learning for automatic learner error detection and correction. Chapter 6 introduces Nutmeg, an automatic error correction system for learners' writing. Finally, Chapter 7 concludes this report by offering a summary and our perspectives for future work.

Linguistic Resources
The Hinoki project relies heavily on linguistic resources, though it is also a producer of them. The linguistic resources used in the project are native and learner corpora, as well as dictionaries. To meet the goals of the project, some linguistic resources had to be developed: multilingual dictionaries, purpose-specific corpora, and learner corpora.
In the earlier systems, emphasis was put on native resources, as they enable a Data-Driven Learning approach to learning Japanese. However, to really know where and why learners make mistakes, a learner corpus is also essential, and this is where more recent efforts have been focused.

Native Resources
As part of the Nihongo kōpasu ("Japanese Corpus") project, led by the National Institute for Japanese Language and Linguistics (NINJAL) for 4 years, the main goal of our group was to explore the ways in which the project's new Balanced Corpus of Contemporary Written Japanese (BCCWJ) could be applied to Japanese language education. As we are focused on finding ways to assist Japanese language learners in writing academic reports and papers, it was necessary to compile another corpus containing this genre, in addition to using the BCCWJ. For several other reasons explained below, the Japanese version of Wikipedia was also used.

BCCWJ
The National Institute for Japanese Language and Linguistics (NINJAL) created the Balanced Corpus of Contemporary Written Japanese (BCCWJ) in the span of five years between 2006 and the end of 2011 (Maekawa, 2007a, 2007b, 2012). The objective of the project was to compile a tagged corpus of contemporary written Japanese that had sufficient scale and coverage of sub-varieties of written language to offer a representative sample of Japanese written language. Such a language resource had not previously existed for Japanese, and its creation was seen as important for the future development of any research with a need for representative Japanese language data, including official government language policy.
The BCCWJ consists of the Publication, Library and Special-Purposes sub-corpora, each of them accounting for roughly one-third of the size of the BCCWJ. The Publication sub-corpus includes books, magazines, and newspapers and is sampled from all published material in Japan between 2001 and 2005. The Library sub-corpus includes books sampled from several library holdings within the Greater Tokyo Metropolitan area. The Special-Purposes sub-corpus differs from the other two in that it should not be considered a representative sample of written Japanese, but rather serves as useful comparison material for the others.

Scientific and Technical Japanese Corpus (STJC)
Unfortunately, the data needs of providing writing and reading assistance in an academic context are not fully satisfied by the BCCWJ. While some sub-corpora are close in subject matter (topic) and writing style (register), the lack of genuine research papers from academic journals precludes their ability to serve as a representative sample of written science and engineering discourse. It was thus necessary to build a new corpus that contained a representative and authentic sample of academic writing. The new corpus was named the Scientific and Technical Japanese Corpus (STJC), and consists of papers from several scientific and technical journals written in Japanese. The following criteria were used when choosing which journals to collect papers from:

Wikipedia
The decision to include the Japanese version of Wikipedia 3 was made for several reasons. For many tasks, the quantity of data provided by the BCCWJ is sufficient (Maekawa, 2011). However, due to the nature of the data used in Natsume, which includes triplet combinations of nouns, case particles and verbs, the amount of extractable data for any but the most common expressions quickly becomes insufficient. Additionally, other NLP technologies deployed in the Natsume system, such as getassoc 4, are more precise at scales of data in the range of Wikipedia. Another requirement of the project is that text data from corpora should be legally available to be displayed online. The permissive license of Wikipedia allows us to show all of its sentences as example sentences in Natsume.
One unfortunate side effect of including Wikipedia is that for many less frequent collocations, Wikipedia is the only corpus in which they are attested, making any genre comparisons impossible. Another drawback is the inclusion of grammatical mistakes and comparatively long sentences, which average around 51 characters compared with an average of 35 for newspapers (see Table 1).

Introduction
Natane is a Japanese language learner corpus annotated with learner errors. The main benefit of learner corpora in the context of writing assistance, when compared to native corpora, is that they enable insights into the kinds of errors learners make. For example, in the case of Natane, comparing learner error tendencies based on their first language might guide customizations to lesson plans based on the learner's first language.
Compared to native corpora containing the writings of native speakers, learner corpora are often smaller in size and variety. This is due to the difficulty of obtaining learner writing, which in most cases is elicited for the construction of the corpus and not collected from readily available sources as in the construction of the BCCWJ. Another common differentiator is the inclusion of error annotations and background information on the learners who produced the material used.
The end goal of constructing this corpus is to provide well-formed and sufficient machine-learnable data for automatic writing error correction. It should be noted that while relatively simple tools such as a spellchecker, co-occurrence checker, or writing style checker are feasible, features that hinge on an understanding of semantics and discourse are hard to make practical even in state-of-the-art NLP systems.
As an ongoing joint project with several Japanese language teachers, the collection and annotation of the corpus initially proceeded along the following stages:
1. Collection of learner essays and their transcription.
2. Pilot annotation of learner errors using Excel (Cao & Nishina, 2010; Cao, Kuroda, Yagi, & Nishina, 2010).
3. Analysis of pilot annotation and definition of final error classification framework (Cao, Kuroda, Yagi, & Nishina, 2011; Cao, Yagi, & Nishina, 2012).
4. Use of the multipurpose annotation tool Slate for error tagging.

Collection
The essays were collected from undergraduate and graduate students as well as students attending Japanese language schools. All essays were written on a specific topic, though not all topics are the same. Each learner's age, nationality, university level, first language, major and Japanese language learning experience, as well as other background information, were recorded with the essay. Additionally, learners signed a waiver authorizing the anonymized usage of their essay in our project.
Although more than 5000 sentences have been collected, currently only around 3500 have been annotated 5. In its present state, Natane consists of 285 essays obtained from 192 learners, totaling 205,520 characters. Of a total of 9,041 annotations, 6,789 are learner errors. The distribution of learners by their first language is biased towards Mandarin Chinese speakers, who account for more than half of learners and essays. The remaining languages are predominantly from Asia.

Pilot Annotation
While error classification frameworks for languages such as English and French already exist (Díaz-Negrillo & Fernández-Domínguez, 2006; Granger, 2003; L'Haire & Faltin, 2003), there was no preexisting comprehensive error annotation scheme or descriptive framework for Japanese language learner errors. Because of the lack of such a framework, the project decided to construct one itself, drawing from previous research as well as the annotator's teaching experience (Cao & Nishina, 2010). During the pilot annotation process, it became clear that there were two kinds of error annotations. The first were ordinary, unambiguous errors and the second were errors where the annotator felt the particular language usage was unnatural. Ordinary errors include deviations from standard orthography, syntactic function (voice, tense, aspect, modality), conjugation, and subject-predicate incongruity. They are typically easy to annotate and occur frequently. Unnatural errors include word choice and the addition or omission of text units (phrase, paragraph, etc.); they are typically less frequent and harder to annotate, leading to lower agreement between annotators.

Error Classification Framework
The feedback gained from the pilot annotation process was crucial for refinements in the error classification framework (Cao, Kuroda, Yagi, & Nishina, 2011). The resulting error annotation framework is hierarchical, able to take into account different viewpoints regarding learner errors, and enables the systematic annotation of such errors (Yagi & Suzuki, 2012).
The hierarchy consists of at most four levels, with higher levels corresponding to coarser, more abstract categories, and branches out in three principal dimensions:
1. Error level: the linguistic level of the error (i.e. phoneme, word, phrase, ..., discourse; the word tag is further classified into word classes like noun, verb, etc.)
2. Error category: type and form of error
   - type: addition, omission, word order, deviation from standard orthography, etc.
   - form: conjunction, conjugation, collocation, (Japanese letter) script
3. Error source: reason or background for error (i.e. annotator's subjective opinion on source of error: register and style mismatch, coherence, first language interference, etc.)
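The three dimensions above can be pictured as fields of a single annotation record. The following is a minimal sketch only: the class name `ErrorTag`, its field names, and the example values are hypothetical and do not reproduce Natane's actual tag set or storage format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ErrorTag:
    """One learner-error annotation (hypothetical layout, not Natane's schema)."""
    level: Tuple[str, ...]                # error level, coarse to fine, e.g. ("word", "verb")
    category_type: str                    # e.g. "omission", "addition", "word order"
    category_form: Optional[str] = None   # e.g. "conjugation", "collocation"
    source: Optional[str] = None          # annotator's judgment of the error source

    def matches_level(self, prefix):
        """True if this tag falls under the given coarser level path."""
        return self.level[:len(prefix)] == tuple(prefix)


# Illustrative tag: a verb conjugation error attributed to first language interference.
tag = ErrorTag(
    level=("word", "verb"),
    category_type="deviation from standard orthography",
    category_form="conjugation",
    source="first language interference",
)
is_word_level = tag.matches_level(("word",))
```

Storing the error level as a path makes it possible to query at any depth of the hierarchy, for example all word-level errors regardless of word class.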

Error Annotation Process with Slate
After the error classification framework was decided on, a choice had to be made between continuing to use Excel to annotate the corpus and finding another solution. Though Excel's free-form nature served the formative stage of the annotation process, significant drawbacks related to its ad hoc usage became clear. The choice was then made to use the web browser-based Slate corpus annotation and management system 6, as it offers the following advantages over Excel: higher data integrity and greater data diversity (Kaplan, Iida, Nishina, & Tokunaga, 2012). Slate decreases the chance of inconsistent annotation by eliminating errors due to formatting differences between annotators and misplacement of annotations into the wrong table cell, among other problems. Using Slate also increases the diversity of possible annotations, by enabling more than one annotation per segment (sentence) as well as annotations that overlap or span multiple sentences. Previously, the format of the Excel table limited the number of possible error annotations to one per sentence. Slate also provides an overhead view of the hierarchical error classification framework that, coupled with an interface that allows the user to see all annotations at a glance, enables efficient and speedy annotation.
As there was considerable data included in the existing Excel tables, it was not reannotated but rather converted for inclusion into Slate. All new annotations are being recorded using Slate. Three teachers specializing in Japanese language education at different universities separately annotated all essays using the Slate corpus annotation and management system.
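The gain in annotation diversity described in this section, several annotations per sentence that may overlap or cross sentence boundaries, can be illustrated with character-offset spans. This is a sketch under stated assumptions: `SpanAnnotation` and the helper functions are hypothetical names, and the offsets and labels are invented; it is not Slate's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanAnnotation:
    start: int   # inclusive character offset within the essay
    end: int     # exclusive character offset
    label: str   # error label (illustrative)


def overlaps(a, b):
    """Two half-open spans overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def annotations_for_sentence(annotations, sent_start, sent_end):
    """All annotations touching the sentence occupying [sent_start, sent_end)."""
    return [a for a in annotations if a.start < sent_end and sent_start < a.end]


# Invented example: two sentences at offsets [0, 20) and [20, 45).
essay_annotations = [
    SpanAnnotation(0, 5, "word choice"),
    SpanAnnotation(3, 12, "collocation"),   # overlaps the first annotation
    SpanAnnotation(10, 40, "coherence"),    # spans both sentences
]
first_sentence = annotations_for_sentence(essay_annotations, 0, 20)
second_sentence = annotations_for_sentence(essay_annotations, 20, 45)
```

A one-row-per-sentence table cannot represent the second or third annotation without duplicating or truncating it, which is the limitation span-based annotation removes.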

Conclusion
The Hinoki project depends on the existence of many large-scale corpora, most of which are already available to the research community. For more specialized needs, such as the inclusion of a representative sample of scientific and technical Japanese, no corpora existed, so one had to be constructed. The available Japanese language learner corpora are still few, although recent developments have increased the number available: Learner's Language Corpus of Japanese 7, Teramura corpus 8, NINJAL's learners corpus 9, and JC Corpus 10 are just some of the corpora available now. The major difference between Natane and these learner corpora is that they are more focused on the annotation of grammatical errors and thus have a less comprehensive error classification framework than the one used in Natane.
Though not mentioned in this chapter, without the availability of high-quality Natural Language Processing tools for Japanese, it would be hard or even impossible to make use of many of these linguistic resources. The specific tools used in each system are detailed separately in the explanation of each system.

Introduction
The first system developed under the Hinoki project was the Asunaro multilingual reading assistance system (Nishina, Okumura, Yagi, et al., 2002; Nishina, Okumura, Abekawa, et al., 2004). Development of Asunaro began in 1999 and the system was first released online in 2002.
At its inception, Asunaro was unique in that it integrated a multilingual reading and learning environment into one online system accessible to anyone with an Internet connection. At the time, most systems targeted English language learners, while Asunaro incorporates several Asian languages. This was important because the number of international students from neighboring Asian countries studying at universities in Japan is greater than that of students from English-speaking countries.
The main goal of the system was to help Japanese learners read and understand academic material in Japanese. The main target of the system is Japanese language learners enrolled in Japanese universities majoring in the fields of science and engineering. Many of them are expected to be able to read academic papers and textbooks in their field, but it is often difficult to provide for their specialized learning needs in university Japanese language classes. The use of Asunaro was seen as a way to enable personalized learning for those learners.
7 http://cblle.tufs.ac.jp/llc/ja/
8 http://teramuradb.ninjal.ac.jp/
9 http://jpforlife.jp/taiyakudb.html
10 http://www34.atwiki.jp/jccorpus/pages/21.html

Main Features
Users accessing the Asunaro system are presented with the main screen containing a text box into which they can paste or directly enter Japanese language text for analysis. The main screen is split into three areas consisting of the user input area in the top left, the translation and example sentence area in the top right, and a detailed word and phrase view of single sentences at the bottom. Users click words or phrases 11 in the bottom area to update translations in the top right area. Translations appear in order of importance, based on the application of meaning disambiguation using the surrounding word context.
Finally, clicking on the arrow at the beginning of each sentence takes the user to the secondary screen where they can see the dependency structure of the sentence.

Courseware
However, the usage outlined above is in many ways too difficult for non-advanced learners. For beginning to intermediate learners who are studying in the fields of science and engineering, the provided courseware is more appropriate. Learning a language through reading works best when the material is appropriate to the learner's level. Asunaro makes use of a textbook (Nishina, 2001) which is written specifically for intermediate-level undergraduate science and engineering students. The main goal of the courseware is to help science and engineering students achieve the proficiency in technical communication needed to read papers and discuss research in seminars. All courseware in Asunaro was checked for parsing mistakes and manually corrected.
11 Idioms and phrases like te wo tunagu and kao ga hiroi are automatically recognized as such and also marked as phrases.
Additionally, the courseware contains an audio playback feature so users can listen to the courseware material while learning to read it.

Multilingual Dictionary
Although electronic Japanese-English dictionaries have been available since the beginning of the 1990s, for many Asian languages such as Malay, Thai or even Chinese, no electronic dictionary was available at the time of Asunaro's inception. As more than half of all international students in Japan come from other Asian countries, support of languages other than English was seen as a high priority. The EDR Electronic Dictionary is used for its translations between English and Japanese, as well as its concept ID, which links every Japanese word to a concept. 12 Enabling translations of Japanese words into languages other than English required the construction of a new multilingual dictionary that would map words from the target language to EDR's concept ID. This differed from many similar systems at the time that used English as the intermediary language. Excluding the Japanese and English entries from the EDR dictionary, the multilingual dictionary contains around 25,000 entries for Chinese, and around 5,000 each for Thai, Indonesian and Malay.
Another unique feature of the system was that it provided a common language-independent framework for handling compound expressions. This is important as compounds and phrasal units are language-dependent and must be handled on a per-language basis. For Japanese, phrasal units and compounds are detected using CaboCha and the EDR electronic dictionary.
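The concept-ID pivot described above can be sketched as two lookup tables joined on a shared identifier. This is an illustration only: the concept IDs (`c0001`, `c0002`), the table names, and the tiny vocabulary are invented for the example and are not taken from the EDR Electronic Dictionary or from Asunaro's dictionary.

```python
# Hypothetical concept IDs; real EDR concept identifiers are not reproduced here.
JA_TO_CONCEPTS = {
    "実験": ["c0001"],   # "experiment"
    "水": ["c0002"],     # "water"
}

# Target-language entries keyed by the same (hypothetical) concept IDs.
CONCEPT_TO_TARGET = {
    "zh": {"c0001": ["实验"], "c0002": ["水"]},
    "ms": {"c0001": ["eksperimen"], "c0002": ["air"]},
}


def translate(ja_word, target_lang):
    """Return glosses for a Japanese word by pivoting through concept IDs."""
    glosses = []
    table = CONCEPT_TO_TARGET.get(target_lang, {})
    for concept in JA_TO_CONCEPTS.get(ja_word, []):
        for gloss in table.get(concept, []):
            if gloss not in glosses:   # keep order, drop duplicates
                glosses.append(gloss)
    return glosses


chinese_gloss = translate("実験", "zh")
malay_gloss = translate("水", "ms")
```

Because every language maps to concepts rather than to English words, adding a language means adding one table, and an ambiguous English gloss cannot leak its other senses into the target language.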

Related Work
Reading Tutor, which is a reading assistance system widely used in Japan and abroad, also contains a multilingual dictionary (Kawamura, Kitamura, & Hobara, 2000, 2012). Additionally, Reading Tutor's "Kyozai Banku" (Kawamura & Kitamura, 2001) is an effort to provide leveled reading material, similar to the courseware feature in Asunaro.
Rikai.com is a popular website that provides hiragana readings and English translations of online text. 13

Conclusion
Asunaro was constructed to assist students from the fields of science and engineering in reading and understanding technical Japanese. For beginning- to intermediate-level learners, it includes courseware aimed at helping them eventually read authentic texts from their field. For advanced learners, the copy-and-paste nature of the system allows them to focus on learning just the sentences they do not yet fully understand. It uses the EDR Electronic Dictionary as a basis for constructing a larger multilingual dictionary, presenting learners with glosses in their native language: English, Chinese, Malay, Thai and Indonesian.

Introduction
Natsume is an online writing assistance system that began operating in 2009 (Hodošček, in press, 2012; Abekawa, Hodošček, & Nishina, 2011). The initial focus during the development of Natsume was to enable users not just to search for collocations, but also to convince themselves of the correct usage of a collocation in several ways. Thus, Natsume was designed not just to provide raw collocation information, but also to enable users to look for similar collocations and compare collocational tendencies between different genres.
While Asunaro assists international students in reading, Natsume focuses on assisting them in writing technical Japanese. For example, writing reports or papers at universities can be hard if students cannot differentiate between words or expressions that belong to spoken Japanese and those that belong to written Japanese. As a study and writing aid, the use of conventional (non-corpus-based) electronic dictionaries is prevalent among international students. However, these dictionaries seldom contain information on a word's usage with respect to written and spoken language. Natsume, by virtue of having access to corpora from various genres, contains information that can be used to determine if a word is appropriate for spoken or written Japanese.
When writing in a second language, it is often the case that one knows the meaning of a noun or verb, but does not know which verb goes together with which noun. Conventional dictionaries often contain only a limited amount of information on frequently co-occurring patterns of words. These frequently co-occurring patterns of words are called collocations and are important because they offer more contextual information about a word than what is found in conventional dictionaries. Moreover, knowledge of collocations has been shown to be essential to achieving high second language proficiency (Pawley & Syder, 1983).
Users can use the system to find collocations of a word, check the correct use of a word or collocation by looking at example sentences, and compare observed frequencies in various genres. This follows the philosophy of data-driven language learning by giving users access to authentic information which they can then use as the basis for any decisions with respect to writing and word choice.
Natsume's current target users are intermediate to advanced learners of Japanese, as well as Japanese native speakers.

Main Features
The interface can be divided into three views:
1. Collocation view: where users search for the collocate words of any noun, adjective or verb.
2. Genre comparison view: looking at the genre frequency distribution of a collocation reveals that collocation's genre tendencies.
3. Example sentence view: authentic examples enable the learner to see how the collocation is used at the sentence level.
Users must select the particular collocation pattern they want to search for and enter a matching noun, verb or adjective into the search box to start the search. Searching for a word will present several lists of the searched word's collocates, grouped by case particle and sorted by frequency. The sorting scheme is user-selectable: besides the default frequency, one can choose Dice's coefficient, t-score, Jaccard similarity coefficient, log-likelihood ratio, chi-square coefficient, and mutual information score for different types of collocations. The color bars at the right of every collocate indicate the relative frequency (or score) of the collocation in all corpora. Additionally, users can search for and compare two or more similar patterns at the same time to help decide which one is better suited to their needs. Using this feature, users can additionally re-sort on any input word, which makes it easy to see at a glance which words collocate with which input words. When the user is interested in seeing more information on a particular collocation triplet, clicking on the collocate will load the genre comparison view at the bottom of the main collocation view. The behavior of the click can be set to one of:
- particle/conjugation expansion: can be used to compare among different grammatical uses of collocates
- synonym expansion: can be used to automatically compare among similar collocates
- no expansion (default): standard view, only provides genre information of selected collocation
- click expansion: can be used to manually compare genre information of collocates
Figure 7: Comparing three collocates of /jikken/ "experiment" taking the /wo/ case particle: /yaru/ "to do" (colloquial), /suru/ "to do", and /okonau/ "to conduct, carry out".
In the genre comparison view, users can visually compare a collocation's usage across different genres. The frequency numbers visible in the genre comparison interface are the relative frequency of occurrence of a collocation per 100,000 collocations. This is done to ensure the frequencies are comparable even if the corpus sizes differ, as is the case here. Additionally, Natsume uses the chi-square test to color-code genres as blue if the frequency of occurrence is significantly larger than the average across all genres, and pink if the frequency is significantly lower. Genres that are not color-coded do not significantly differ from the mean.
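To make the selectable sorting schemes more concrete, the sketch below computes four of the named measures from co-occurrence counts using their common textbook definitions. The exact formulas Natsume uses are not given in this report, so these definitions are an assumption, and the counts in `candidates` are invented purely for illustration.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Textbook association measures for a word x and a collocate y.

    f_xy: co-occurrence count (must be > 0), f_x and f_y: marginal counts,
    n: total number of collocation tokens in the corpus.
    """
    expected = f_x * f_y / n   # co-occurrences expected under independence
    return {
        "frequency": f_xy,
        "dice": 2 * f_xy / (f_x + f_y),
        "jaccard": f_xy / (f_x + f_y - f_xy),
        "mi": math.log2(f_xy / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
    }


def rank_collocates(candidates, f_x, n, measure="frequency"):
    """Sort collocates (name -> (f_xy, f_y)) by the chosen measure, best first."""
    scored = {
        name: association_scores(f_xy, f_x, f_y, n)[measure]
        for name, (f_xy, f_y) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Invented counts for verbs following /jikken wo/; not taken from Natsume.
candidates = {"okonau": (400, 5_000), "suru": (300, 200_000), "yaru": (20, 8_000)}
by_frequency = rank_collocates(candidates, f_x=1_000, n=1_000_000)
by_mi = rank_collocates(candidates, f_x=1_000, n=1_000_000, measure="mi")
```

With these toy counts, raw frequency ranks the very common verb /suru/ above /yaru/, while mutual information ranks it last because it discounts collocates that are frequent everywhere. This is one reason a choice of measures is useful.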
When even more information is desired, the user can bring up the example sentence view, which shows example sentences from the selected corpora. Sentences are displayed randomly by genre up to a limit of six sentences per genre. One can judge if a collocation is suitable for one's writing context by comparing its frequency across genres, its differences with similar collocations, and the actual usage as seen in example sentences from different corpora.
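The random selection of at most six sentences per genre can be sketched as follows. The function name, the `seed` parameter, and the placeholder sentences are assumptions for illustration; the report does not describe how Natsume implements the sampling.

```python
import random


def sample_examples(sentences_by_genre, limit=6, seed=None):
    """Pick up to `limit` sentences per genre uniformly at random."""
    rng = random.Random(seed)   # a seed makes the selection reproducible
    sampled = {}
    for genre, sentences in sentences_by_genre.items():
        k = min(limit, len(sentences))
        sampled[genre] = rng.sample(sentences, k)
    return sampled


# Placeholder data: ten sentences in one genre, three in another.
examples = {
    "science": ["science sentence %d" % i for i in range(10)],
    "newspaper": ["newspaper sentence %d" % i for i in range(3)],
}
shown = sample_examples(examples, seed=0)
```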

Related Work
In parallel to the construction of the BCCWJ, NINJAL commissioned the construction of two search systems, one that is freely available and offers basic KWIC search features, called Shonagon, and another subscription-based one that allows searching with regular expressions over short and long unit words, called Chunagon 14. Another system that shares Natsume's focus on Japanese language education is NINJAL-LWP 15, a lexical profiler for a subset of the BCCWJ (Pardeshi, 2012). It contains features similar to Natsume, but differentiates itself by providing many different kinds of collocations.
Perhaps the most sophisticated collocation query system for Japanese is the Sketch Engine, a "Corpus Query System incorporating word sketches, one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour" (Sketch Engine, 2012; Kilgarriff, Rychly, Smrz, & Tugwell, 2004). The Sketch Engine supports multiple languages including Japanese through a 400 million token web-based corpus (JpWaC) that was first released in 2008. More than 50 collocational and grammatical relations are in use in the word sketch grammar (Srdanović-Erjavec, Erjavec, & Kilgarriff, 2008). The Sketch Engine also contains a unique word comparison feature, called word sketch difference, which is in some aspects similar to searching for several words at the same time using Natsume, though it is also more sophisticated.

Collocation Data Extraction
The Japanese dependency analyzer CaboCha was used to extract the dependency structure of all sentences in the corpora, from which noun-particle-verb and noun-particle-adjective dependency patterns were extracted. Post-processing was performed on verbs to differentiate passive (/iwareru/, passive of "say") or potential (/ieru/ "be able to say") 16 from causative (/iwasu/ "to make talk") voice usage, as well as to combine verbal compounds into single units (/kaki + hajimeru/ "begin to write"). Nouns were post-processed to normalize numbers, dates and personal names.
14 Available at http://www.kotonoha.gr.jp/shonagon/ and https://chunagon.ninjal.ac.jp/, respectively.
15 NINJAL-LWP is accessible from http://nlb.ninjal.ac.jp/.
16 Passive and potential usage is not always discriminated by the underlying MeCab morphological analyzer and IPA electronic dictionary.
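The extraction step can be sketched on a simplified representation of a dependency-parsed sentence. This does not call CaboCha: the `Chunk` class, its fields, and the `<NUM>` placeholder are hypothetical stand-ins for whatever the parser output and normalization rules actually look like, and only number normalization is shown.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Simplified stand-in for one chunk of a dependency parse (hypothetical)."""
    content: str              # lemma of the content word
    pos: str                  # "noun", "verb" or "adjective"
    particle: Optional[str]   # trailing case particle, if any
    head: int                 # index of the chunk this one depends on, -1 for root


def normalize_noun(noun):
    """Collapse numerals into one placeholder so that counts are pooled."""
    if re.fullmatch(r"[0-9０-９]+", noun):
        return "<NUM>"
    return noun


def extract_triples(chunks):
    """Collect (noun, particle, predicate) triples from noun chunks and their heads."""
    triples = []
    for chunk in chunks:
        if chunk.pos != "noun" or chunk.particle is None or chunk.head < 0:
            continue
        head = chunks[chunk.head]
        if head.pos in ("verb", "adjective"):
            triples.append((normalize_noun(chunk.content), chunk.particle, head.content))
    return triples


# /nisankatanso ga mizu ni yōkaisuru/ "CO2 dissolves in water"
sentence = [
    Chunk("nisankatanso", "noun", "ga", 2),
    Chunk("mizu", "noun", "ni", 2),
    Chunk("yōkaisuru", "verb", None, -1),
]
counts = Counter(extract_triples(sentence))
```

Accumulating such counters over every sentence of a corpus, once per genre, yields the frequency tables on which collocation search and genre comparison rest.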

Genre
The defining feature of Natsume is the ability to differentiate between expressions that are suitable for writing in an academic context and those that are not. Consider the following example from the STJC corpus:
/Taisha ni yori hasseishita nisankatanso wa mizu ni yōkaishi, .../ "The CO2 produced from metabolism dissolved in water, ..." 17
Comparing the expression /nisankatanso ga mizu ni yōkaisuru/ "CO2 dissolves in water", taken from the example above, with the expression /satō ga mizu ni tokeru/ "sugar dissolves in water", it is clear that the former is written in an academic, technical style, while the latter is of a more informal, spoken variety. For learners without the native language intuition needed to arrive at the same conclusion, Natsume provides a data-driven way of helping them take a first step towards gaining this kind of intuition.
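The kind of genre comparison Natsume offers, relative frequency per 100,000 collocations plus a chi-square test for color coding, can be sketched as below. How Natsume sets up the test is not specified in this report; the sketch assumes a one-genre-versus-rest 2x2 table at the 5% level, and the genre names and counts are invented.

```python
CRITICAL_95 = 3.841   # chi-square critical value for 1 degree of freedom, p = 0.05


def per_100k(count, total):
    """Relative frequency per 100,000 collocations."""
    return count / total * 100_000


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def classify_genres(counts, totals):
    """Label each genre 'blue' (over-represented), 'pink' (under) or 'neutral'."""
    all_count, all_total = sum(counts.values()), sum(totals.values())
    result = {}
    for genre in counts:
        a = counts[genre]                       # collocation in this genre
        b = totals[genre] - a                   # other collocations in this genre
        c = all_count - a                       # collocation in the other genres
        d = (all_total - totals[genre]) - c     # other collocations elsewhere
        rel = per_100k(a, totals[genre])
        rest = per_100k(c, all_total - totals[genre])
        if chi_square_2x2(a, b, c, d) < CRITICAL_95:
            label = "neutral"
        else:
            label = "blue" if rel > rest else "pink"
        result[genre] = (round(rel, 2), label)
    return result


# Invented counts for one collocation in three genres of different sizes.
counts = {"science": 60, "newspaper": 32, "blog": 8}
totals = {"science": 200_000, "newspaper": 400_000, "blog": 400_000}
genre_labels = classify_genres(counts, totals)
```

Normalizing to a common base lets the 200,000-collocation genre be compared directly with the 400,000-collocation ones, and the significance test keeps small differences from being shown as meaningful.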

Conclusion and Future Work
Natsume is primarily a system to assist a specific part of the writing process: finding the right words for a particular writing context, which in this case is technical Japanese. Currently, example sentences in Natsume are displayed randomly. Tailoring the example sentence view to display examples appropriate to the learner's proficiency level is a future goal of the project (Hodošček, Abekawa, Murota, & Nishina, 2012).

Introduction
Having introduced the corpus Natane in Chapter 2, this chapter focuses on the search capabilities of Natane and how they might help the researcher or Japanese language teacher analyze their own students' errors or identify particular areas where learners tend to make errors. Searching for errors relating to the verb /yaru/ returns two errors. One example is the sentence /intānetto wo tsūjite shigoto wo yaru hito wa ōku natte iru/ "the number of people working on the Internet is increasing", where the usage of /yaru/ is wrong because it is the colloquial form of /suru/.
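A search of this kind amounts to filtering annotation records by word and by learner background. In the sketch below the record layout, the field names, and the placeholder values (`L1-A`, `L1-B`, the error types, and the second and third records) are invented for illustration and do not describe Natane's actual schema or contents; only the first sentence is the example quoted above.

```python
from collections import Counter

# Invented toy records; first-language labels are placeholders, not corpus facts.
ANNOTATIONS = [
    {"lemma": "yaru", "first_language": "L1-A", "error_type": "register",
     "sentence": "intānetto wo tsūjite shigoto wo yaru hito wa ōku natte iru"},
    {"lemma": "yaru", "first_language": "L1-B", "error_type": "register",
     "sentence": "(second example sentence)"},
    {"lemma": "suru", "first_language": "L1-A", "error_type": "collocation",
     "sentence": "(third example sentence)"},
]


def search_errors(annotations, lemma=None, first_language=None):
    """Return the annotations matching every filter that is not None."""
    hits = []
    for ann in annotations:
        if lemma is not None and ann["lemma"] != lemma:
            continue
        if first_language is not None and ann["first_language"] != first_language:
            continue
        hits.append(ann)
    return hits


def errors_by_first_language(annotations):
    """Count annotated errors per learner first language."""
    return Counter(ann["first_language"] for ann in annotations)


yaru_errors = search_errors(ANNOTATIONS, lemma="yaru")
l1_counts = errors_by_first_language(ANNOTATIONS)
```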

Conclusion and Future Work
Natane is a learner corpus that has many potential uses, though we envision two main types of usages, one by Japanese language educators and the other by NLP researchers.
In this chapter, we described the search interface for the Natane corpus, which is targeted at the former. Japanese language educators can make use of Natane to find examples of learner errors. The data provided is also useful for analyzing error tendencies due to first language interference, as well as for observing the language acquisition process.
The latter usage is primarily aimed at applications in NLP and machine learning, where Natane can be used to construct novel error correction systems. An example of one such system is introduced in the next chapter.
With the existence of several Japanese language learner corpora that all make use of different error classification frameworks, a movement towards a common standard is, perhaps, the most pressing issue.

Introduction
Natsume, while useful for finding collocations, does not automatically correct the learner's writing. In an evaluation of Natsume, it became clear that for every collocation the learner checked using Natsume, there were many more that went unchecked (Hodošček, Abekawa, Bekeš, & Nishina, 2011). The next obvious step was to develop a system that checks learners' writing and provides feedback on any errors they may have made. This writing assistance system was named Nutmeg and provides basic feedback for learners' writing using automatic error identification (Yagi, Hodošček, & Nishina, 2012). The system is unique in that it does this from two sources: native and learner corpora.

Related Work
Compared to other existing Japanese language automatic composition correction systems, Nutmeg strives to incorporate both native and learner corpora in its correction model. An example of a narrower application of a similar system is Chantokun. Developed at the Nara Institute of Science and Technology (NAIST), Chantokun 18 is a system that detects and corrects case particle misuse based on corrected Japanese language sentences from the Lang-8 website 19, a language-exchange social networking website where users with different first languages correct each other's writing (Mizumoto & Komachi, 2012).
An example of a system that focuses on the native corpus side of automatic error correction is the Japanese proofreading system Tomarigi 20 (Oono & Inazumi, 2011). Another example, the jcorrect tool, uses the dependency structure of a sentence to revise complex sentences into ones that are easier to understand (Oosaki, 2006).

Error Correction Method
In general, Nutmeg uses the native BCCWJ and STJC corpora as "correct data", whereas it uses learner errors from Natane as "incorrect data". Thus, expressions that are tagged as errors in Natane become candidates for automatic correction.
Natane contains 386 orthographic errors. One way of detecting outright orthographic errors is to check whether they go unrecognized in morphological analysis. Additionally, most such errors differ from the intended word by no more than two letters. A word containing an error is replaced with the corresponding word in the native word list. For example, suppose a learner were to misspell the word /messeeji/ "message" as /meeseji/. If there is no prior learner error for the same word in Natane, then morphological analysis reveals that it is an unknown word. The unknown word can then be matched to similar words contained in the morphological dictionary. Finally, the correct orthography can be presented to the learner.
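The matching of an unknown word to similar dictionary words can be sketched as follows. This is only an illustration under stated assumptions: the report does not say which similarity measure Nutmeg uses, so Levenshtein edit distance is swapped in here, the two-edit limit is an assumed reading of the "two letters" observation, and the tiny romanized `lexicon` stands in for a real morphological dictionary. The function names are hypothetical.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(word: str, lexicon: List[str], max_distance: int = 2) -> Optional[str]:
    """Return the closest lexicon entry within max_distance, or None.

    None is also returned when the word is already in the lexicon,
    i.e. when it would not be an unknown word in the first place.
    """
    if word in lexicon:
        return None
    best, best_distance = None, max_distance + 1
    for candidate in lexicon:
        distance = edit_distance(word, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


# Toy stand-in for the morphological dictionary (assumed entries).
lexicon = ["messeeji", "meishi", "sekai"]
suggestion = suggest("meeseji", lexicon)
```

A real system would probably compare kana strings rather than romanized forms and would need a way to rank several candidates at the same distance, for example by corpus frequency.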
Though the language contained in the BCCWJ and STJC should, in principle, be considered correct, this does not preclude the use of native corpora as instruments in identifying learner errors. One example is making use of the various genres available in Natsume, through which it is possible to correct collocation usage from the genre perspective. For example, using data from Natsume it is possible to automate the process of checking whether a collocation is appropriate for an academic report, as outlined in Chapter 4. If a learner uses the collocation /jikken wo yaru/ in an academic report, the system will be able to identify the inappropriate usage and offer the replacement collocation /jikken wo okonau/ as a correction. This is possible because of the existence of relatively incompatible genres, such as those that lean towards a formal writing style (STJC and White papers) and those that lean towards an informal or spoken writing style (Yahoo! Blogs, Yahoo! Q&A and Diet minutes) (Hodošček & Nishina, 2011). The corpora in Natsume can thus be divided into so-called positive and negative genres, and the relative frequencies of collocations in those genres can be tested using the chi-square test. When an expression like /jikken wo yaru/ is used, we can determine that it is incorrect because its frequency in the negative genres is significantly high, while its frequency in the positive genres is significantly low. A replacement collocation could be found by searching through similar collocations and testing them in the same manner, or by using WordNet to expand the available search space (Bond et al., 2009; Isahara et al., 2012).
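The genre test described above can be sketched with a chi-square test on a 2x2 contingency table (the collocation versus all other collocations, positive versus negative genres). The sketch below is an illustration under assumptions: the counts are invented, the 0.05 significance level (critical value 3.841 at one degree of freedom) is assumed rather than taken from the report, no continuity correction is applied, and `chi_square_2x2` and `genre_leaning` are hypothetical names.

```python
CRITICAL_VALUE_05 = 3.841  # chi-square critical value, df = 1, p = 0.05


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def genre_leaning(pos_count: int, pos_total: int,
                  neg_count: int, neg_total: int) -> str:
    """Classify a collocation as leaning 'positive', 'negative' or 'neutral'.

    pos_count / neg_count: frequency of the collocation in each genre group.
    pos_total / neg_total: total collocation tokens in each genre group.
    """
    statistic = chi_square_2x2(
        pos_count, neg_count,
        pos_total - pos_count, neg_total - neg_count,
    )
    if statistic <= CRITICAL_VALUE_05:
        return "neutral"
    pos_rate = pos_count / pos_total
    neg_rate = neg_count / neg_total
    return "positive" if pos_rate > neg_rate else "negative"


# Invented counts per 100,000 collocation tokens in each genre group.
yaru = genre_leaning(pos_count=2, pos_total=100_000, neg_count=40, neg_total=100_000)
okonau = genre_leaning(pos_count=300, pos_total=100_000, neg_count=50, neg_total=100_000)
```

Under these made-up counts, /jikken wo yaru/ leans towards the negative genres and /jikken wo okonau/ towards the positive ones, which is the pattern that would trigger a replacement suggestion in an academic report.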

Conclusion and Future Work
The aim of Nutmeg is to become a composition tool that is able to automatically warn learners of potential mistakes as they are making them. The two types of data backing Nutmeg's error correction facilities are native and learner corpora, corresponding to the data provided by Natsume and Natane, respectively.
Although the native corpora available are much larger than the learner corpora, collocation error correction could still be improved by providing candidate replacement expressions for learner errors, perhaps using WordNet. More effort must be put into obtaining or constructing other specific-purpose corpora if other writing genres, such as business writing, are to be considered.
It is also clear that Natane should be expanded in scale in order to conduct a more comprehensive quantitative evaluation. The implementation of an automatic error correction system must be treated cautiously, because the results of automatic error correction depend on the annotations being objective. This is especially difficult for learner errors at the semantic or discourse levels, as it is here that annotators' subjectivity most easily comes into play. Thus, as a first step, easier items such as orthographic errors should be considered (Yagi, Hodošček, & Nishina, 2012).

Conclusion and Future Work
In the span of just over a decade, the Hinoki project has produced the Asunaro, Natsume and Nutmeg systems as well as the Natane learner corpus. As the project is led by linguists, language teachers, computer engineers and educational engineering researchers, it has been able to synthesize ideas from these disciplines into several multi-viewpoint CALL systems.
The construction of Asunaro resulted in the construction of a novel electronic multilingual dictionary that contains several often underrepresented Asian languages.
Asunaro also applied state-of-the-art NLP research to provide a practical dependency grammar-based reading assistance system.
Natsume was developed as a corpus-backed collocation search tool that allows users to find new collocations that fit their writing style by enabling them to check the correctness of Japanese collocations they are not confident about. An immediate goal of the development of Natsume is the addition of new types of collocations to the search interface. Another goal is being pursued in ongoing work to channel Natsume's knowledge of genres and collocations into Nutmeg for use in automatic error correction. Finally, the extension of available native corpora to other learner-specific purposes, such as the writing of emails or business writing, is also being considered.
The development of Natane has resulted in a unique Japanese learner corpus and an accompanying search system. It has applications for both language researchers and educators, as well as NLP applications. The future direction of Natane is closely aligned with that of Nutmeg, the usage of which will hopefully contribute to the development and further validation of Natane's error classification framework.
Nutmeg is an extension of both Natsume and Natane into automatic error correction for learner writing. From the development of Natane it became clear that simpler orthographic and syntactic factors are easier to annotate objectively than semantic and discourse factors, which are more prone to subjective decision making on the part of the annotator. This subjective decision making also leads to greater difficulty in automating error correction at a reasonable precision. There is thus a need for a greater volume of annotations that are objectively classified in the error classification framework of Natane. This is essential in order to realize more sophisticated error correction and composition assistance.
Finally, an effort should be made to move from the localized lexical writing assistance seen in Natsume and Nutmeg towards more comprehensive discourse-level composition assistance. For this purpose, more inter-system collaboration with other projects is needed.

Figure 1: The hierarchical error classification framework used in Natane.

Figure 2: An example of composition errors annotated with Slate; marked areas represent errors, with the left pane providing detailed information including all error annotations.

Figure 3: Bottom area containing morphologically analyzed user input with readings and word class information provided by MeCab (Kudo, 2012).

Figure 4: Top right area containing translations and example sentences of user selected words or phrases.

Figure 6: Main interface containing word search input area, similar words feature and collocates of the three input words. Frequency information for each verb is uniquely color coded.

Figure 8: Comparing genre frequencies between different patterns including /jikken/ and /okonau/ using the case particle and conjugation expansion feature.

Figure 9: Comparing genre frequencies between similar collocates of /jikken/ using the similarity expansion feature.

Figure 11: Natane interface: search for learner errors and filter based on first language and specific error types.

Figure 12: Searching for /yaru/ "to do" will return all learner errors containing the word. Here the correct way of writing the second sentence is to replace /yaru/ with its polite version /suru/.

Figure 13: Situating the previous error in the learner essay.

Figure 14: Viewing all learner errors in a given essay.

Table 1: Character counts and average sentence length for all corpora.

Table 2: Distribution of essays by first language.

Table 3: Collocation token and type count per genre.