Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, February 05, 2018

Four Favourite Corpora



Recently, I gave a 10-minute talk at the ELT Freelancers’ Awayday in Oxford about “Simple corpus hacks for ELT editors”. I only had time to look at one corpus and a handful of searches, but I promised to share some of my other favourites in a blog post. So here goes…

1 Monco: In my talk, I looked at the Monco corpus. I chose it because it’s a monitor corpus: it monitors current usage and is updated with new data daily. As such, I find it useful for answering language questions that haven’t yet made it into conventional reference sources like dictionaries. For example, in my talk, we looked at how wellbeing (spelled as a single word) may be catching up with the more traditional hyphenated form (well-being) that you’ll find in most dictionaries, simply by typing well-being|wellbeing into the search box. The split between wellbeing and well-being was 35% – 65% in Monco, compared with 17% – 83% in the British National Corpus (with data from the 1980s and 90s). We also turned up some potentially useful verb collocates for newsfeed, including scroll through and pop up, which won’t yet have made it into a collocations dictionary. One of my favourite features of Monco, especially for the corpus novice, is its user-friendly search screen and its nice graphics for results.
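If you’d like to try the same kind of comparison on a text of your own, the arithmetic behind that split is easy to sketch in Python. This is only a rough illustration and not how Monco itself works: the function name variant_shares and the sample sentences are invented for the example.

```python
import re
from collections import Counter


def variant_shares(text, variants):
    """Return each variant's share (in %) of all the variant hits in text."""
    # One case-insensitive pattern, much like typing well-being|wellbeing
    # into a search box; \b stops matches inside longer words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in variants}
    return {v: round(100 * counts[v.lower()] / total, 1) for v in variants}


# Invented sample text, purely for illustration (not corpus data).
sample = (
    "Staff wellbeing is a priority. The report covers well-being at work. "
    "Well-being matters to everyone. Their well-being improved."
)
print(variant_shares(sample, ["well-being", "wellbeing"]))
# → {'well-being': 75.0, 'wellbeing': 25.0}
```

With a sample this small the percentages mean nothing, of course; the point is just that the split is each form’s count divided by the total for both forms.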


On the downside, Monco’s data is drawn entirely from online news sources, which means that it’s really only reflective of journalism, rather than language usage in general. And although it includes sources from the UK, US, Canada and Australia, it isn’t balanced, so there’s significantly more data from some sources than others – a factor to bear in mind, as it can skew the results.

2 Brigham Young University: Not strictly a single corpus, but a collection of different corpora available via the same site and the go-to source for lots of queries. Personally, I tend to use COCA (the Corpus of Contemporary American English) for checking US usage. It’s a large corpus containing a nice variety of contemporary sources (1990 – present), including radio & TV transcripts, fiction, newspapers, magazines and academic data. Through BYU, you can also find a host of specialized corpora, including a corpus of Wikipedia entries and even, slightly weirdly, the Hansard corpus of British parliamentary proceedings, should that happen to fit your purpose!

My main grumble with BYU is that I find the interface clunky and frustrating to use, especially with its rather distracting colour-coding.


3 BAWE and BASE: The British Academic Written English corpus (BAWE) and the British Academic Spoken English corpus (BASE) are composed of written and spoken data collected from university students at a number of British universities. The written corpus contains essays and other coursework which received a good pass mark and the spoken data includes lectures and seminars. I particularly like these corpora because they’re an example of language as it might be used by the peers of the students we’re aiming at, rather than text produced by professional writers, journalists, academics, etc., which doesn’t necessarily provide an appropriate model for the average ELT student. This is obviously university-level language, so it’s especially relevant for EAP, but I think BAWE could be useful for any advanced students who need to write formal essays (IELTS, CAE, Proficiency). And if you’re looking for US academic equivalents, you could also check out MICUSP and MICASE.


BAWE and BASE are actually available via several sources, but I wanted the excuse to get you to experience Sketch Engine, which for me is the gold standard when it comes to corpus tools and the interface used by all the major dictionary publishers for their large corpora.

4 Spoken BNC2014: I admit this is the corpus on my list that I’ve probably used least so far, but I’m including it because it’s one I’m quite excited about finding uses for. Slightly contrary to its name, it was only released in 2017 and is the result of a massive project to collect data about current spoken English used in everyday contexts. If you’re working on speaking materials, looking at evidence from written English is not going to tell you anything terribly useful, because we just don’t speak how we write. So I think this could become the go-to corpus for anyone who wants to know how people actually say things.

Unfortunately, the Spoken BNC2014 doesn’t have the most user-friendly interface and getting access involves a bit of a faffy sign-up process which could be off-putting for the casual user. If spoken language is your thing though, I think it’s worth investing the time and effort to check it out, not least because some of the content is just really funny!


A note about corpora and copyright: It’s important to remember that, in general, the data that appears in a corpus is liable to all the usual copyright restrictions. That means you can’t just pull a big chunk of language from the corpus and use it in your activity, especially not if it’s for commercial publication. Occasionally, of course, you come across very short, ‘vanilla’ examples which could have come from almost anywhere (A young woman opened the door. The traffic was particularly bad.), but to be honest, these are few and far between. Generally, when I search for a particular language item, I’ll scan through the examples and jot down a ‘frame’:
I/you scroll through my/your (Facebook) newsfeed to see/searching for/on the train …
Then I’ll use my notes as the basis for an example that keeps the feel and pattern of the ones I’ve looked at, but fits my teaching purpose … and doesn’t infringe copyright.

There are lots of different corpora out there and corpus fans will have their personal favourites. If you’re new to corpora though, I’d say pick one or two to check out, play around with a few simple searches, use the help to get you started, and see what’s most useful for you. Be warned though, it can be addictive!


Thursday, January 04, 2018

Using a corpus to fish for inspiration



When I think about using corpus tools to help in writing ELT materials, I tend to think of checking details. So I’ll often use a corpus to check the most common form of a word or phrase, or a typical collocation or colligation pattern. An example that cropped up on Facebook yesterday was whether we say “in winter” or “in the winter” (the answer, by the way, seems to be we use both, sometimes interchangeably and sometimes in different contexts). Today though, I’ve been using a corpus in a slightly different way for much more general inspiration.

I’m currently working on some grammar practice materials and one of the grammar points I need to cover is “compound future tenses” (will have done, will be doing, will have been doing). They’re supplementary materials and part of the brief is to choose different topics and contexts from those used in the student’s book. Of course, the SB author has already nabbed perhaps the most obvious context: predictions about life in the future (By 2050, we’ll all be travelling in driverless cars, etc.). I was casting about for an alternative angle and drawing a blank, so I turned to a corpus*.

Corpora aren’t always ideal when it comes to grammar because it’s difficult to be specific in your searches. Yes, you can use grammar tags to search for particular word forms, but many common forms have so many different uses that what comes up is often too broad to be useful in an ELT context (imagine how many different uses you’d find if you searched for all present continuous verb forms, for example). When you can narrow things down to more specific words or combinations though, you can uncover some more useful results. So here, I ran a quick series of searches:

will have + past participle
will be + present participle
will have been + present participle

Up came a whole load of contexts which I’d probably never have thought of off the top of my head. And interestingly, a lot of them actually referred to the near future rather than distant futuristic predictions. A couple of the recurrent themes I spotted were:


Weather forecasts:
By 6 o’clock, the showers will have passed.
By Wednesday morning, the winds will be dying down.
The storm will have reached the coast of Cuba by early next week.

Sports reporting:
The team will have played nine games in four weeks.
She’ll be competing in three events at the upcoming Winter Olympics.
The coaching staff will have been preparing the players all winter.

I may not end up using the exact examples turned up by the corpus, but they’ve provided some much-needed inspiration and sent me off down some potentially useful paths.
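For anyone curious about what searches like these involve, here’s a very rough Python sketch that sorts the example sentences above into the three patterns. It’s not what a corpus tool actually does: a real corpus relies on proper part-of-speech tags, whereas here I’m assuming that an -ing word stands in for a present participle and that an -ed word (plus a handful of irregular forms I’ve picked by hand) stands in for a past participle. The names PATTERNS and classify are made up for the illustration.

```python
import re

# Crude stand-ins for grammar tags (an assumption, not real tagging):
# "-ing" approximates a present participle; "-ed" plus a few hand-picked
# irregular forms approximates a past participle.
PRESENT_PART = r"\w+ing"
PAST_PART = r"(?:\w+ed|been|done|gone|seen|taken)"
# Matches "will" as a whole word, or the contraction 'll / ’ll.
WILL = r"(?:\bwill|['’]ll)"

# Order matters: the longest pattern comes first, so that
# "will have been preparing" isn't counted as "will have + been".
PATTERNS = {
    "will have been + present participle": rf"{WILL}\s+have\s+been\s+{PRESENT_PART}\b",
    "will be + present participle": rf"{WILL}\s+be\s+{PRESENT_PART}\b",
    "will have + past participle": rf"{WILL}\s+have\s+{PAST_PART}\b",
}


def classify(sentence):
    """Return the label of the first pattern found in sentence, else None."""
    for label, pattern in PATTERNS.items():
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            return label
    return None


examples = [
    "By 6 o’clock, the showers will have passed.",
    "By Wednesday morning, the winds will be dying down.",
    "The storm will have reached the coast of Cuba by early next week.",
    "The team will have played nine games in four weeks.",
    "She’ll be competing in three events at the upcoming Winter Olympics.",
    "The coaching staff will have been preparing the players all winter.",
]

for sentence in examples:
    print(f"{classify(sentence)}: {sentence}")
```

The word-ending shortcuts will miss plenty of irregular verbs and pick up some false hits, which is exactly why tagged corpora are worth having.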


*When you’re just fishing for inspiration, I don’t think it matters quite so much which corpus you use. For these searches, I used the Monco corpus, just because it’s what I’d been using recently and as a continuously-updated news-based corpus, it throws up a range of current topics.


Wednesday, December 27, 2017

2017 Part 2: What’s next?



In my last post, I talked about some of the reasons why I’ve become frustrated with my work in ELT publishing and started to question where I want to go next. In this post, I want to share some of the thoughts that have been floating around my head, in no particular order, about what I can do to change that.

One of the first questions I’ve had to face is whether I still enjoy working in ELT at all. And you’ll be pleased to hear that the answer is essentially yes. When I’m not getting frustrated dealing with fees and schedules and restrictive briefs, and it’s just me and a Word document and a load of language, then yes, I still love it. What’s not to love about playing with words for a living? The task ahead, then, seems to be one of picking the right projects or, as Tania Pattison put it in her recent blog post, taking on projects that “fit with your vision of yourself as a writer” … which is perhaps easier said than done!

Having started off as a lexicographer, I’m still at my happiest using a corpus to tease out how language works. I really could happily spend all my time investigating how words fit together: making lists of collocations and phrases and colligations and dependent prepositions and explaining all the subtle and quirky differences between them. That should equate to writing vocab materials, but having done a lot of that in recent years, I know it’s not always as satisfying as I’d like. The briefs for many vocab projects involve a pre-determined syllabus and format, and an infuriating reliance on wordlists which take no account of chunks and phrases and multiple meanings or the difference between receptive and productive lexis*. And you find yourself being told that you can’t use a word because it’s ‘above level’ or has been ‘covered’ before or … any number of other completely nonsensical reasons why you can’t do what you know is pedagogically sound.

Which perhaps leads me naturally to think about breaking away from publishers to go it alone. Several people I’ve spoken to have talked about self-publishing as an alternative. It does have a certain appeal, but from what I’ve heard of others’ experiences, self-publishing involves a huge amount of investment of both time and money, for very little return. You simply don’t make money from self-published materials. And whilst I’m not only in it for the money, this is my job and I do need to pay the mortgage. From a practical point of view, that means either writing something quite small in scope, like the How to Write EAP Materials title I did for ELT T2W, or stretching work on a bigger project out over a longer period of time, squeezing in bits and pieces when I can. I do have a few half-ideas floating around, but nothing fully formed and ready-to-go just yet.

Another option is to more actively push for the types of work I enjoy most … again, not always easy. A few of my ‘big breaks’ and changes of direction have come from proactively pushing. My first book (Common Mistakes at Proficiency) came about because I was doing corpus research for the series and I summoned up the courage to ask the editor if they had authors for all the titles. They hadn’t and she asked if I’d like to write one of them. Other work has come, either directly or indirectly, from chatting to the right people at conferences. A huge amount is down to luck and timing, but sometimes going along to the right events and making your interest in a specific area well known can help. To this end, I’ve started to nudge myself in a couple of directions …

Firstly, I’ve realized that one of the things I enjoy most is messing about with a corpus. Sadly though, it’s something that I only rarely get paid to do. So I’ve started to edge my way a bit more into the corpus linguistics world. Back in October, I went along to a Corpus Linguistics in the South event in Cambridge. Most of the people there were academics talking about their research, but there were a few other folks who bridge the gap between the academic and the commercial. I haven’t yet spotted an obvious opportunity for work beyond what I’ve already been involved in, but I’m enjoying getting back into the academic side of the discipline and you never know what might crop up. I’m planning to put a proposal in for at least one corpus linguistics conference in 2018, so we’ll see where that leads.

Another aim is to get away from my desk a bit more. Most years, I manage to go to a handful of conferences and events, either as a participant or a speaker, and I generally come back feeling energised and having learnt something new, about a different teaching context or a different area of ELT. Unfortunately, unless I can get sponsored by a publisher to do a talk on their behalf, the costs come out of my own pocket, and with lots of conferences expecting speakers to pay a conference fee as well as their travel expenses, that soon becomes unaffordable. One option I’d like to explore more though is doing more teacher training. I love getting to meet and work with teachers from different places and, as well as being fun, it feeds neatly back into my writing. I recently ran a teacher training workshop in Moscow, which I really enjoyed, and I’ll be looking out for more similar opportunities in the year ahead.


So I guess that’s a few leads to be getting on with, nothing radical and no magic bullet solution, but hopefully, a general push into slightly new directions for 2018.


*I'll be talking about wordlists and their (mis)use in ELT publishing at the IATEFL conference in Brighton in April.
