ÅBO AKADEMI UNIVERSITY DEPARTMENT OF ENGLISH

ICLE: An International Corpus of Learner English

ICLE is a computerized corpus of argumentative essays on different topics written by advanced learners of English (university students of English, mainly in their second or third year). The ICLE project was launched in 1990 by Sylviane Granger, University of Louvain-la-Neuve, Belgium, and in 2002 the corpus was released in CD-ROM format, accompanied by a handbook which describes its structure and the status of English in the learners' countries of origin. The corpus is made up of a number of subcorpora representing the following language backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish. There is also a smaller comparable corpus of British and American undergraduate essays. The essays vary in length between 500 and 1000 words.

The Finnish subcorpus consists of essays written by Finnish-speaking and Swedish-speaking Finns. The essays were collected from several different universities. The Finnish coordinators of the ICLE project were Håkan Ringbom and Tuija Virtanen, and Signe-Anita Lindgrén served as project researcher (English Department at Åbo Akademi University). Many other people kindly offered their time and help; these collaborators included R. Goldblatt, P. Hirvonen, C. Rohlich and G. Watson from the University of Joensuu; A. Mauranen from Savonlinna School of Translation Studies; R. Alanen, S. Leppänen, A. Pitkänen-Huhta and K. Sajavaara from the University of Jyväskylä; A. Chesterman and M. Hatakka from the University of Helsinki; B. Pettersson and O. Pickering from the University of Turku; and K. Timlin from the University of Oulu.

The existence of a corpus of advanced learner English makes possible a new, more concrete approach to the features of learner English. Opinions about how learner language actually differs from native speaker language are frequently voiced, but they have seldom been substantiated by concrete evidence from larger collections of texts. The present corpus can be used for many different purposes. It will, for instance, now be possible to find concrete answers to two questions: to what extent there is a general 'advanced learner language' that shows consistent differences from equivalent native speaker language, and to what extent the influence of the different first languages (language transfer) is manifested. There is a list of publications based on data from learner corpora such as ICLE.

Research in the Department

The Department's research on the corpus has dealt with two main aspects: vocabulary frequencies and discourse analysis. Håkan Ringbom's studies of vocabulary frequencies have investigated the occurrences of high-frequency words in the seven Western European subcorpora and the native speaker corpus. There are words that are overused by all learner groups (auxiliary verbs, personal pronouns, some conjuncts such as 'but' and 'and', the verbs 'get' and 'think', some vague words such as 'people', 'things', and 'very') as well as words that are underused by them ('the', 'this', 'these', 'by'). Analysis of the contexts in which these overused and underused words occur reveals that Western European learners are, for example, particularly fond of the phrase 'I think', and of the construction 'get' + object.
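The kind of comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used in the Department's studies: the function names, the simple tokenizer, the normalization per 10,000 words, and the two sample sentences are all assumptions made for illustration, and the sample sentences are invented, not ICLE data. A ratio above 1 suggests overuse by the learner group relative to the native speaker text, and a ratio below 1 suggests underuse.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def compare_frequencies(learner_text, native_text, words):
    """Return {word: (learner_per_10k, native_per_10k, ratio)} for each word.

    Frequencies are normalized per 10,000 tokens so that corpora of
    different sizes can be compared. The ratio is None when the word
    does not occur in the native speaker text.
    """
    learner = tokenize(learner_text)
    native = tokenize(native_text)
    learner_counts = Counter(learner)
    native_counts = Counter(native)

    result = {}
    for word in words:
        learner_freq = learner_counts[word] / len(learner) * 10_000
        native_freq = native_counts[word] / len(native) * 10_000
        ratio = learner_freq / native_freq if native_freq else None
        result[word] = (learner_freq, native_freq, ratio)
    return result


if __name__ == "__main__":
    # Invented sample sentences, used only to show the mechanics.
    learner_sample = "I think people get the things very wrong and I think we get it"
    native_sample = "The author thinks the figures cited by the report support the claim I think"

    for word, (lf, nf, ratio) in compare_frequencies(
        learner_sample, native_sample, ["think", "the", "get"]
    ).items():
        if ratio is None:
            label = "not found in native sample"
        elif ratio > 1:
            label = "overused"
        elif ratio < 1:
            label = "underused"
        else:
            label = "same frequency"
        print(f"{word}: learner {lf:.0f}, native {nf:.0f} per 10,000 words ({label})")
```

A raw ratio of this kind is only a first indication. With real corpus data, a significance test would normally be applied before a word is described as overused or underused.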

When words are underused by learners, this often indicates some kind of subconscious awareness of a learning problem. Thus the common verb 'become' has a much lower frequency in the German corpus than in the other learner corpora and the native corpus. The German word 'bekommen' is frequent too, but it has an entirely different sense ('to get, receive'), which makes the pair so-called false friends. The learners writing for the German subcorpus of ICLE are at a sufficiently advanced stage not to make any errors when they use this verb, but there appears to be an indirect influence from the mother tongue which is manifested in avoidance, or underuse. Learners in general tend not to use words where some problem is even vaguely or subconsciously anticipated. They cling to what they feel is safe and familiar, a tendency that has been termed 'the teddy bear principle'.

ICLE material can also be used for investigations of discourse-pragmatic, rhetorical and stylistic topics, which often imply a cross-cultural perspective. For example, Tuija Virtanen has investigated the use of direct questions and found that they are considerably less frequent in the native speaker data than in the non-native speaker data taken as a whole. At the same time, there are statistically highly significant differences in this respect between some of the learner corpora. The highest relative frequency is found in the Finland-Swedish essays, which often manifest clusters of questions. Further, the type of questions (topical vs. rhetorical questions) and their placement vary across ICLE subcorpora. An overuse of questions can reduce their argumentative value and give the text an informal flavour. Her studies have also focused on the use of the progressive in these data, and on the attribution of knowledge to a source. Tuija Virtanen and Signe-Anita Lindgrén examined the use of British and American English in ICLE data and compared this use with the results of a questionnaire given to comparable groups of students in Finland and Sweden. Two unpublished MA theses have been based on ICLE material: Signe-Anita Lindgrén investigated the use of contracted forms, while Maria Svenfelt studied hedging in several subcorpora.

   