Lexicoblog

The occasional ramblings of a freelance lexicographer

Monday, January 30, 2012

Corpus frequencies: what exactly counts?

Senses, idioms and phrasal verbs

Following on from my last post, prompted by Michael Rundell’s webinar Tweets, blogs and corpora: How computer technology helps us make better dictionaries, there was one more question from a webinar participant which I think opens up a whole area of corpus research, and of word frequencies as shown in dictionaries, that tends to get glossed over.

“Do these numbers [corpus frequencies] consider all the meanings of a word or only the common ones?”

Response from another participant:
“I’d guess that the counting program [the corpus software] doesn’t understand the meaning so it is for all meanings of the word.”

An astute question and a correct answer! Corpus software is very clever at number crunching and identifying patterns, but computers still fall down when it comes to actually understanding language. When you do a corpus search, you can choose the part of speech you’re interested in (separating out noun and verb senses of a word like walk, for example) and you can search for a ‘lemma’ rather than just a string of letters (so searching for the verb walk will include walk, walks, walked and walking). When it comes to differentiating between different senses or uses of a word though, that can still only be done “by hand” by a human being sorting through a sample of corpus lines one by one. Sometimes, where one sense is overwhelmingly more frequent, the sense frequencies are obvious at a glance. In other cases, especially with very polysemous words, it’s a trickier business. Thus, sense ordering by frequency is, to a degree, impressionistic and doesn’t involve exact statistics.
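For anyone curious what a lemma search amounts to under the bonnet, here is a minimal sketch in Python. It isn't how any real corpus tool is built: the tiny tagged "corpus" and the lemma table are both invented for illustration. What it does show is that the count is keyed on the headword and part of speech only, so every sense of the verb walk lands in the same total.

```python
from collections import Counter

# Hypothetical mini-corpus: (word form, part of speech) pairs, tagged by hand.
TAGGED = [
    ("walk", "VERB"), ("walks", "VERB"), ("walked", "VERB"),
    ("walking", "VERB"), ("walk", "NOUN"), ("walks", "NOUN"),
    ("walked", "VERB"),
]

# Hypothetical lemma table: maps an inflected form plus its tag to a headword.
LEMMAS = {
    ("walk", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("walking", "VERB"): "walk",
    ("walk", "NOUN"): "walk",
    ("walks", "NOUN"): "walk",
}


def lemma_counts(tagged, lemmas):
    """Count occurrences per (lemma, part of speech); senses are not distinguished."""
    counts = Counter()
    for form, pos in tagged:
        # Fall back to the lowercased form if it is missing from the table.
        lemma = lemmas.get((form.lower(), pos), form.lower())
        counts[(lemma, pos)] += 1
    return counts


counts = lemma_counts(TAGGED, LEMMAS)
print(counts[("walk", "VERB")])  # 5
print(counts[("walk", "NOUN")])  # 2
```

Nothing in that loop looks at meaning, which is exactly the participant's point: splitting the five verb hits by sense would still mean reading the lines.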

Does this matter? Again, my answer is “not really”, provided we’re only taking frequency information in a dictionary as a general guide. For many words, the most frequent sense(s) of a word will probably account for the majority of its occurrences, so it’s fair to say that overall its core meaning(s) will fall within a general frequency band. It’s unlikely that in many cases there will be lots of obscure senses of a word that significantly distort the frequency statistics.

Where caution may be required though is where it’s the less frequent senses you’re actually interested in. To take an example I came across recently working on EAP vocabulary, if you do a corpus search for chemist, physicist and biologist, chemist comes up as much more frequent – as reflected in most of the learner’s dictionaries. Now that isn’t because there are more scientists studying chemistry than there are physics or biology. But of course, in British English at least, a chemist can be a pharmacy or a pharmacist as well as a scientist in a lab with their test tubes.

And it’s not just the effect that different senses of a word might have on frequency that needs considering. Another big area to take into account is words that form part of a phrase of some kind. Going back to my example of walk, it crops up in various phrases or idioms – walk the walk, run before you can walk, walk free, etc. – and a whole list of phrasal verbs – walk away with, walk in on, walk off, walk out … In most learner’s dictionaries, these come at the end of the entry for the headword and in most of the major dictionaries (I think with the exception of Cambridge), they don’t have frequency information in their own right. Instead, they get lumped into the overall frequency for the whole entry. This has two consequences: firstly, learners can’t see which phrases and phrasal verbs are most frequent and secondly, it further undermines the frequency information for some words, as we can’t be certain what it’s referring to. Take the verb deal as an example, highlighted as frequent in most dictionaries. In fact, something like 85% of occurrences of the verb deal are actually instances of the phrasal verb deal with. Yet, in most dictionaries, it appears that the basic verb senses (giving out cards or drugs) are common, while the phrasal verb deal with has no highlighting at all.

It is possible to construct corpus searches to find particular phrases or phrasal verbs, even where their form varies slightly - for example, where phrasal verbs have moveable particles. So it is possible to get frequency information for them, albeit not as simply or reliably as for single words. And if you look in specialist dictionaries of phrasal verbs or idioms, you’ll often find the most common ones highlighted. So why don’t most general learner’s dictionaries include this information? Well, firstly, it’s very time-consuming to research and secondly, it isn’t easy within a traditional dictionary format to devise a system that encompasses frequency information for both whole words (with senses lumped together) and individual usages in the form of phrases and phrasal verbs.
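To give a flavour of what such a search might look like, here is a rough sketch in Python using a regular expression. The verb (pick up), the sample lines and the three-word gap are all my own assumptions for illustration, not a query from any particular corpus tool. The pattern allows up to three words between the verb and its particle, which is how it catches the moveable-particle cases.

```python
import re

# Assumed inflected forms of the example verb; a real query would use a lemma search.
FORMS = r"(?:pick|picks|picked|picking)"

# Verb form, then up to three intervening words, then the particle "up".
PATTERN = re.compile(rf"\b{FORMS}\b(?:\s+\w+){{0,3}}?\s+up\b", re.IGNORECASE)
VERB_ONLY = re.compile(rf"\b{FORMS}\b", re.IGNORECASE)

# Hypothetical corpus lines, invented for this example.
LINES = [
    "She picked up the phone.",
    "She picked the phone up.",
    "He picks it up every morning.",
    "They picked a winner.",
    "Picking the children up from school takes ages.",
]


def matching_lines(lines, pattern):
    """Return the lines in which the pattern is found at least once."""
    return [line for line in lines if pattern.search(line)]


hits = matching_lines(LINES, PATTERN)
verb_lines = matching_lines(LINES, VERB_ONLY)

# Share of the verb's lines that look like the phrasal verb.
share = len(hits) / len(verb_lines)
print(len(hits), len(verb_lines), share)  # 4 5 0.8
```

It's a crude net: a line like "picked the wrong way up" would be caught too, so the hits still need checking by eye. That is the "not as simply or reliably" part.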

So having completely ripped apart the frequency information in dictionaries, am I saying that it’s useless and should be ignored? No, far from it! As a broad guide to which words are generally more frequent (and so worth focusing on), I still think it’s an incredibly useful tool. But as in any area of life, statistics should always be approached critically and before you rely too much on them, you need to understand what’s behind them, how they’re compiled and what caveats you might need to take into consideration.

Labels: corpora, dictionaries, Macmillan, Michael Rundell, webinar

Friday, January 27, 2012

Corpus: gospel or guide?

A response to a webinar:

Lately, I’ve been starting to explore Twitter and linking through to various blogs and websites to see what folks are talking about in ELT at the moment, all in the name of “professional development”. Yesterday, I also ventured into the world of webinars for the first time. I started off with a recording from Macmillan’s Interactive webinars series. My first aim was really just to explore the medium, so I chose a familiar topic with Michael Rundell’s Tweets, blogs and corpora: How computer technology helps us make better dictionaries. The actual content wasn’t particularly exciting - not because there was anything wrong with Michael’s presentation, but as an experienced lexicographer, I clearly wasn’t his target audience. That meant though that I could focus more on what actually goes on in a webinar.

Rather unexpectedly, I got particularly caught up with the reactions of the participants as they appeared in the little text box on the side of the screen. Unfortunately, Michael ran out of time, so didn’t get to address any of the comments or questions that popped up. I, however, was itching to respond to them! So I thought I’d tackle some of the points here which I’ve been mulling over since. And in fact, I’ve got so much to say, I’m going to split this into two posts.

The part of the webinar that interested me most in terms of participant feedback was when Michael was talking about how we use frequency information from corpora to highlight the most common and so "useful" words in a dictionary. Below are some of the comments and questions and my reactions:

“Are there standard lists with the top 250 words?”
“But where can we find the list of words (to know if they are frequent or not)”

I always find interest in wordlists from teachers and students a little bit worrying. It seems to suggest that language learners are rather like computers and if we can just input the right list of words, then they’ll output English at a given level! Whilst I think frequency lists can have a role to play in helping prioritise what to focus on, my feeling is that generally checking frequency should be something that comes after you encounter new vocabulary. You look up a new word you’ve come across in the dictionary and you might use the information about frequency to decide whether it’s worth putting in your vocabulary notebook or whether it’s a word that you can naturally drop into conversation or not. Language is a wonderfully messy, organic, personal sort of a thing and what vocabulary you choose to teach or learn should be governed by all sorts of different factors - interests, needs, context, personality - not some (inevitably very dull) standard list of frequent words.

“Are the top 3000 words in Oxford the same top 3000 words in Macmillan?”

I haven’t researched the answer to this one, but I think I can fairly confidently say “more or less” if we’re just talking about frequency (more on that below). Each of the major dictionary publishers uses a different corpus – or rather a different collection of corpora, some of which overlap (like the BNC). In the early days, with relatively small corpora, you would have expected some variation, with different corpora slightly skewed towards particular types of language. Nowadays though, with all the big publishers using really huge and diverse collections of corpora, I think you’d probably expect a straightforward frequency list (at least at the most frequent end) to come out more or less the same, with only minor variations.
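If you did want to check, the comparison itself is simple enough to sketch. The Python below is only an illustration: the two one-line "corpora" are made up and nothing like the publishers' real data. It takes the top N words from each and measures how much the two lists overlap.

```python
from collections import Counter

# Two hypothetical toy corpora; real ones run to hundreds of millions of words.
CORPUS_A = "the cat sat on the mat and the dog sat on the rug".split()
CORPUS_B = "the dog ran to the park and the cat ran to the tree".split()


def top_n(tokens, n):
    """Return the set of the n most frequent word forms in a list of tokens."""
    return {word for word, _ in Counter(tokens).most_common(n)}


N = 3
top_a = top_n(CORPUS_A, N)
top_b = top_n(CORPUS_B, N)

shared = top_a & top_b
overlap = len(shared) / N
print(sorted(top_a))   # ['on', 'sat', 'the']
print(sorted(top_b))   # ['ran', 'the', 'to']
print(sorted(shared))  # ['the']
```

With samples this tiny, the two lists share only one word, a miniature version of the skew you would expect from small corpora. My guess is that the overlap climbs as the corpora get bigger and more varied, but that is still a guess until someone runs it on the real lists.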

Having said that, each dictionary publisher has its own criteria for how it shows frequency information – where it sets its limits and how it puts words into frequency bands. The Oxford 3000™, for example, isn’t just based on frequency, but was put together using three criteria: frequency, range and familiarity (if you're interested, you can read more about it here). Does this variation matter? Personally, I don’t think so. How many students, or even teachers, ever read the blurb in the front (or back) of a dictionary that explains the frequency information? My feeling is that most students either don’t even notice it, or if they do, it’s just some general sense that a word is highlighted or has stars next to it, therefore it must be useful to learn. Of course, there’ll be times when some teachers (esp. vocab nerds like myself!) will make a point in class about frequent and more marked synonyms by pointing to the frequency information (and often register labels) in the dictionary. But my feeling is that’s the exception, not the rule. And that’s fine. It’s still worth the information being there as one more tool in the language learning toolbox.

Coming back to the title of this post, corpus information has been incredibly useful over the past couple of decades in understanding how language is actually used and in making teaching materials more natural, but it's still only a guide. Despite much of my work being in the area of corpus research, I'm still very wary about taking corpus data as gospel, partly for some of the reasons I'll talk about in my next post ...

Labels: corpora, dictionaries, Macmillan, Michael Rundell, webinar

Tuesday, January 24, 2012

Website update

I set up my website - www.juleswords.co.uk - 5 years ago now to act as a kind of online CV; an easy way for people to get an idea of the range of different things I work on. I built it myself after going on a couple of web design courses and overall, I’ve been pretty pleased with it. I keep it updated every now and again, adding new projects I’ve worked on. Annoyingly though, a while ago, I discovered that the links in the main menu along the top of the page don’t work in some browsers (they’re okay in IE, but not Firefox). Having completely forgotten how I set them up in the first place, despite numerous attempts to fiddle, I haven’t been able to fix them. I know I should get a professional to look at it, but I have a bit of a fear that the whole thing will unravel if anyone looks too closely at it, so I keep putting it off. Thankfully, the links in the rest of the site still work, so it’s still possible to navigate around, if rather clunkily.

As often happens in such cases, because of the big things that needed fixing, I’d been putting off updating too - arguing to myself that I should do it all at the same time. The other day though, I realised that the big overhaul wasn’t going to happen anytime soon, so just got on with some simple updating. I finally changed the picture of myself on the home page. Although it was quite a classy black and white shot, it was taken quite some years ago when I still had long hair. Having had short hair for over two years now, I figured it was time to go for something more representative. I’m not sure I’m that keen on the picture I’ve chosen, but I guess I can always change it again if I can find something better.

I also trawled back over the projects I’ve worked on in the past couple of years to see what’s now been published and so can go on the website. A lot of the corpus research work I do, in particular, tends to happen right at the start of a project, so it can be a year or more before it gets as far as publication. Working through old invoices, I made a list of titles, then went through the publishers’ websites to see which ones had appeared in the catalogue. I was quite pleased to have new stuff to add under corpus research, editing and writing. So I added some new bits of text, some new pictures and links and clicked on “update”, feeling satisfied to be finally up-to-date.

Then yesterday, a shiny, new-smelling copy of Objective First (CUP) arrived in the post.

I’d done some learner corpus research for it a while ago and was pleased to see how it's slotted into new Corpus Spot boxes warning about common exam errors. However, whilst it’s nice to see how my research gets used and to have a shiny new book for my shelves, it is going to mean another website update!

Labels: Cambridge learner corpus, Objective First, website

Tuesday, January 17, 2012

Expat in an internet age

This week I've been in Malta, staying in a lovely little rented apartment in Vittoriosa, across the harbour from the capital, Valletta. I'd intended it to be a week of working away - bringing my laptop with me to carry on with work - just an escape from the grim British weather rather than a full-blown holiday. Unfortunately, both of the jobs I'd hoped to get on with while I'm here got delayed (story of my life at the moment!), so I find myself with no work to do. And as I wasn't really prepared for a 'holiday' - either psychologically or financially! - I've been a bit unsure about how to spend my time. For the first few days, the sun was shining and I was happy to go out exploring - mostly around Valletta and the "Three Cities". The weather's now turned a bit cloudy and rainy though, so I've been spending more time pottering around the apartment instead, reading and surfing, and nipping out onto the roof terrace to catch some rays when there's the occasional break in the clouds!

What has really struck me being away with my laptop (and Wi-Fi in the apartment) is just how different it must be living abroad in the internet age. As well as having access to all the usual email and Facebook, of course, there's news online - at the weekend, I enjoyed my usual "flick through" the Saturday Guardian and even did the crossword. The real revelation though has been listening to UK radio online - on Friday evening I chuckled along to the News Quiz on Radio 4, then laughed out loud at the weirdness of The Archers theme echoing around my Maltese apartment!

It's made me realise just how very different life as an expat must be now compared to when I first headed off to Greece as a young EFL teacher some 20 years ago. Without wanting to slip into some kind of Monty Python sketch, back then we had very little contact with home bar the odd letter from parents and the occasional out-of-date copy of a newspaper. We didn't have TVs and you could only get World Service radio if you had a short-wave radio and even then, I remember having to stand holding the aerial to get reception! Thus I spent the best part of seven years in a kind of expat bubble, not really part of the host culture (first in Greece then in the Czech Republic), but fairly cut off from British culture too. I have a big gap in my knowledge of UK popular culture through the first part of the 90s - I completely missed Take That the first time round (no great loss there, perhaps?!) and there are still certain pop culture references from that time that go right over my head.

Do my current counterparts living in my old Greek apartment now have wireless broadband? Do they all have iPhones and laptops - keeping up with their friends on Facebook and continuing their usual media consumption almost uninterrupted? Of course, I knew that the internet had opened up all this stuff, but being here this week has really brought it home to me just how much things have changed. It must make the whole expat adventure a very different experience, both as a lifestyle and as a teacher.

Labels: ELT, expat, Malta
