Parts 1-4 of the Santa Barbara Corpus of Spoken American English (SBCSAE) are now available, for a total of approximately 249,000 words. The Santa Barbara Corpus includes transcriptions, audio, and timestamps which correlate transcription and audio at the level of individual intonation units.
To access individual conversations and other discourse segments in the Santa Barbara Corpus, you may select the audio file and transcription you wish to download by consulting the Contents and Summaries.
To download the audio files in WAV (recommended) or MP3 format, do the following:
- Select the transcription you want (e.g. SBC001 Actual Blacksmithing) under the listing of Contents and Summaries
- Right-click on the audio format you want (WAV or MP3)
- Select "Save link as...". This should save the file to your computer in the format you have selected.
Alternatively, you can do the following:
- Select a transcription (e.g. SBC001 Actual Blacksmithing) under the listing of Contents and Summaries
- Click on the audio format you want (WAV or MP3)
- The sound will start playing on your computer, and you will see a bar on your screen. Wait a little while for the file to download to your computer (during which time you can listen to the streaming audio)
- Click on the downward-pointing arrow at the right edge of the bar, and choose "Save as source". This should save the file to your computer in the format you have selected.
Although it is now available for free on-line (see above), the Santa Barbara Corpus of Spoken American English can still be purchased on CD and DVD from the Linguistic Data Consortium, at the following web pages:
A version of the Santa Barbara Corpus transcriptions in CHAT format, including metadata, is available for download here; CHAT transcriptions of individual conversations are also available here under Contents and Summaries.
The audio files for the Santa Barbara Corpus can also be downloaded from TalkBank.org, in either MP3 or WAV file format, from the following locations:
For MP3 files: https://talkbank.org/media/CABank/SBCSAE/
For WAV files: https://talkbank.org/media/CABank/SBCSAE/0wave/
SBCSAE by John W. Du Bois is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States License.
The Santa Barbara Corpus of Spoken American English is based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more.
The Santa Barbara Corpus was compiled by researchers in the Linguistics Department of the University of California, Santa Barbara. The Director of the Santa Barbara Corpus is John W. Du Bois, working with Associate Editors Wallace L. Chafe and Sandra A. Thompson (all of UC Santa Barbara), and Charles Meyer (UMass, Boston). For the publication of Parts 3 and 4, the authors are John W. Du Bois and Robert Englebretson.
The Santa Barbara Corpus of Spoken American English also forms part of the International Corpus of English (ICE). The Santa Barbara Corpus provides the main source of data for the spontaneous spoken portions of the American component of the International Corpus of English. In order to meet the specific design specifications of the International Corpus of English (allowing comparison between American and other national varieties of English), the Santa Barbara Corpus data have been supplemented by additional materials in certain genres (e.g. read speech), filling out the American component of ICE.
This is a conversation recorded in rural Hardin, Montana. Mae Lynne is a student of equine science, and is the main speaker. She is telling Lenore (a visitor and near stranger) about her studies. Doris, Mae Lynne's mother, is doing housework, but joins the conversation near the end to discuss friends of their family.
After-dinner conversation among four friends in San Francisco, California. Participants are in their late twenties or early thirties. Harold and Jamie are a married couple, Miles is a doctor, and Pete is a graduate student from Southern California.
A conversation among three friends who are preparing dinner together, recorded in Southern California. Roy and Marilyn are a married couple, and Pete is a friend visiting from out of town. All participants are in their early thirties.
Family conversation recorded in Santa Fe, New Mexico. The primary participants are three sisters all in their twenties.
A conversation between a couple who are lying in bed, recorded in Santa Barbara, California.
A very lively interaction between two female cousins in their mid-thirties, recorded in Los Angeles, California.
Late-night conversation between two sisters, recorded in Montana.
Task-related interaction: an attorney preparing two witnesses to testify in a criminal trial. Recorded in San Francisco, California. Rebecca is a lawyer, June and Rickie are the witnesses, and Arnold is Rickie's husband.
Task-related talk, a teenage couple recorded in Mobile, Alabama. Kathy is helping her boyfriend Nathan prepare for a math test.
A business conversation recorded in New Mexico. Brad and Phil are board members of a local arts society. Phil wants to talk business, while Brad keeps trying to leave to pick up his wife who's waiting for him at a bookstore.
A conversation among three friends before lunch, recorded in Tucson, Arizona. All three participants are retired women; Samantha (Sam) is 72, Doris is 83, and Angela is 90.
University lecture, recorded in Riverside, California. This is a Chicano Studies class; the professor is the primary participant, although it is a small, summer school class, and nine members of the class occasionally interact.
This is a family conversation/birthday party, recorded in Fort Wayne, Indiana. The five participants are family members: Kendra (the birthday girl) and Kevin are siblings, Ken and Marci are their parents, and Wendy is Kevin's wife. This segment is highly interactional and contains a lot of overlap.
Task-related talk: this is a loan officers' meeting, recorded in a bank in a small town in rural southern Illinois. Joe and Fred are loan officers working for the bank. Jim is the president of the bank, and Kurt is a board member.
A conversation among three friends, recorded in Los Angeles, California. Ken and Joanne are a couple, and Lenore is a friend of theirs.
A sales encounter, recorded in an audio store in Santa Barbara. Tammy is planning to buy a new tape deck. Brad, a salesman at the audio store, is discussing various tape decks which he is trying to sell her.
A conversation between two male friends, recorded in Southern California.
A task-related interaction recorded in a veterinarian's office near Madison, Wisconsin. All five participants work in the office, some as secretaries and assistants and some as veterinarians.
A family conversation, recorded in Michigan. Frank and Jan (a married couple) are talking with Ron, Jan's brother, who is visiting from California. Brett and Melissa are Frank and Jan's junior-high-age children, who are doing homework and also taking part in the conversation.
A segment from a sermon/lecture recorded at a small conference near Chicago, Illinois. The speaker is a pastor in his mid seventies.
A segment from a rather lively sermon recorded in Boston, Massachusetts.
Task-related interaction, recorded in an air traffic control tower in Portland, Oregon. Lance is training to be an air traffic controller, and has just finished working a shift. Randy, an experienced controller, is giving Lance feedback/briefing on his performance on that shift.
A segment from a book discussion group, recorded in Topeka, Kansas. The eleven participants are all women between the ages of 46 and 85.
This segment consists of game-playing and game-teaching on a computer, and was recorded near Cape Cod, Massachusetts. Jennifer and Dan are a couple in their early twenties.
This is a segment from a lecture on the history and theology of Martin Luther, part of an evening class held at a church, recorded in Delaware.
This is a city meeting, recorded in Chicago, Illinois. City officials interact with the public about a government grant which is being applied for, to fund community development. The city can only apply once, so officials are soliciting applications from various organizations and will submit the one they judge to be best.
An entertaining science lecture and demonstration, recorded at a large public science museum in Chicago, Illinois.
A very intimate long-distance telephone conversation between a romantic couple in their early twenties, which took place between Pennsylvania and California.
This is a business conversation recorded in Northern California between Seth and Larry, who are meeting for the first time. Seth works as an engineer who designs, installs, and sells heating and air conditioning units. Larry has invited him to his home to give him an estimate.
A segment from a sermon, recorded at a large Baptist church in Chicago, Illinois.
Face-to-face conversation recorded in a restaurant in Pullman, Washington. Sherry and Beth are sisters (in their late twenties), and Rosemary is their mother. The participants discuss what to order for lunch, interact with the waitress (Jamie) and engage in talk about family and friends while waiting for their food.
A face-to-face conversation that takes place at an outdoor neighborhood 'block party' in Santa Fe, New Mexico. The three main participants are neighbors, age 60 and upward, all of whom happen to be named Tom. Discussion centers on life histories, World War II experiences, and neighborhood gossip. The three are briefly joined by Tucker (the daughter of Tom_1), and Elaine (the wife of Tom_3).
A lively family argument/discussion recorded at a vacation home in Falmouth, Massachusetts. There are eight participants, all relatives or close friends. Discussion centers around a disagreement Jennifer (age 23) is having with her mother (Lisbeth).
A late-night face-to-face conversation recorded in Northampton, Massachusetts. Participants are a married couple (Karen and Scott) in their early twenties. Karen has just returned home from work, and the two are talking while winding down for the evening.
Lively family argument/discussion recorded in the kitchen of a family home in Pittsburgh, Pennsylvania.
Face-to-face conversation recorded in Albuquerque, New Mexico. There are three participants and a baby. Lisa and Kevin are siblings, Marie (the baby's mother) is a friend of Lisa's. Much of the speech event focuses on interaction with, and talk about, the baby, as well as gossip about friends and co-workers.
Informal, task-related (cooking) talk recorded in the kitchen of a family home in Corpus Christi, Texas. A family is making tamales. Main participants are Julia (an 80-year-old woman), her daughter (Dolores), and grandson (Shane). They are briefly joined by Kate (Shane's sister) who is watching TV in another room. The segment contains occasional codeswitching (English/Spanish).
This segment is part of a tour of Hoover Dam, on the Nevada-Arizona border. The presentation is highly practiced. The main speaker also answers audience questions.
Task-related talk, a training meeting recorded at an aquarium in Chicago, Illinois.
Scripted tour of the Kentucky Horse Park / Museum. Presenter also addresses questions from the audience.
Medical interaction recorded in Southern California. A patient (Paige) is consulting with her dietician (Kristen) regarding management of diabetes.
Family argument and task-related talk, recorded in Pasco, Washington. The recording begins in a car, and moves to the kitchen of a family home. Main participants are three teenage sisters (Sabrina, Kendra, and Marlena), their mother (Kitty), and step-father (Curt). A friend of Sabrina's (Gemini) is also present. The dispute centers around Kitty's belief that Kendra stayed the night at a friend's house without permission, something which Kendra denies having done. Argument and shouting are interspersed with Saturday-morning housekeeping chores such as doing dishes and laundry.
Face-to-face conversation recorded in the living room of a private home in Boise, Idaho, between Alice (a nurse, age 49) and her daughter Annette (a student and bank employee, age 24). Topics center mostly on their work day, as well as mutual acquaintances.
Face-to-face conversation recorded in the living room of a private home in Milwaukee, Wisconsin. Two friends (Cam and Lajuan) are talking about their families and friends, and their own experiences as gay men.
Face-to-face conversation recorded in the living room of an apartment in Milwaukee, Wisconsin. Two friends (Corinna and Patrick) are talking and watching TV. Topics are at times rather raunchy.
Medical interaction, recorded in Shreveport, Louisiana. A patient (Darren) is consulting with his orthopedist (Reed) regarding a knee injury from a recent skiing accident.
Face-to-face conversation between two cousins (Fred and Richard) in their early thirties, recorded in a private home in east Los Angeles, California. Topics include Richard's new job selling cars, Fred's frustration with factory work, and Richard's recent breakup with his girlfriend.
Christmas morning traditions and gift-exchange among family members, recorded in Fresno, California. Tim and Lea are a couple in their late fifties, Judy is their daughter, and Dan is Judy's boyfriend.
Face-to-face conversation recorded at an outdoor family birthday party near Boston, Massachusetts. There are ten speakers, all related. Four siblings in their mid thirties to mid forties: Dan, Al, Lucy, and Annette. Allen (Sr.), age 76, is their father. Al and Annette are twins. Linda is Al's wife, John is Annette's husband. Dave and Jane are Al and Linda's children. Glen is Lucy's son. Topics center primarily on recent renovations to Lucy's home.
Face-to-face conversation among four roommates, recorded in a shared apartment in Burlington, Vermont. Speakers are all students at the University of Vermont, women ages 20-21. Speakers engage in small-talk, make plans for the evening, and discuss household matters.
Conversation recorded before and during dinner, in a private home in Laguna Beach, California. There are four speakers, ranging in age from mid forties to early fifties. Sean and Bernard are a couple, Fran is a long-time friend visiting from New York. Alice is also a friend of Sean and Bernard, but had never met Fran. Discussion focuses on travels, and reminiscing about New York City.
Phone conversation between family members at Christmas. Andrew and Cindy, a couple in their mid forties in Albuquerque, NM, are calling Andrew's sisters in San Antonio, Texas. Discussion centers primarily on Christmas and Christmas gifts, and topics prompted by recent television news shows.
Task-related talk recorded in a small claims court in Santa Barbara, California. This segment consists of a judge pro tem hearing and deciding two cases.
Public storytelling event recorded after a church potluck in Chicago, Illinois. The speaker, a professional storyteller in her mid forties, tells several stories and interacts with the audience.
Public lecture/forum in Santa Barbara, California. Noted artist and ceramist Beatrice Wood gives a public lecture at the Santa Barbara Museum of Art, shortly after her 101st birthday. Wood talks about her life and answers audience questions.
Face-to-face conversation recorded on a ranch near Colorado Springs, Colorado. Julie has recently bought a pony from Gary's wife, and is giving him a bill-of-sale. She then gives a brief tour of her property and barn.
Task-related talk, a recording of a judo class in Shreveport, Louisiana. The five students and their instructor are males between the ages of 22 and 37. The instructor is demonstrating and coaching the Hane-Makikomi throw, which students are practicing with varying degrees of success.
Face-to-face conversation recorded in a private home in Boise, Idaho. Sheri, a single mom in her mid thirties, and her son Steven (age 11) talk while Sheri prepares dinner.
Face-to-face conversation, recorded in a family home near Beloit, Wisconsin on Christmas Eve. Cam and Fred are a couple in their early thirties. Jo and Wess are Cam's parents. Topics include talk about family and friends, a football game which Wess and Fred had just finished watching, and holiday baking.
Face-to-face casual conversation recorded in an office in Shreveport, Louisiana. The two speakers, Jon (age 72) and Alan (age 66) are friends/co-workers taking a break from work. Alan is primarily telling Jon about his travel adventures and interests.
To reference the Santa Barbara Corpus as a whole, the following bibliographical model may be used:
Du Bois, John W., Wallace L. Chafe, Charles Meyer, Sandra A. Thompson, Robert Englebretson, and Nii Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.
To reference individual parts of the Santa Barbara Corpus, the following bibliographical models may be used:
Du Bois, John W., Chafe, Wallace L., Meyer, Charles, and Thompson, Sandra A. 2000. Santa Barbara corpus of spoken American English, Part 1. Philadelphia: Linguistic Data Consortium. ISBN 1-58563-164-7.
Du Bois, John W., Chafe, Wallace L., Meyer, Charles, Thompson, Sandra A., and Martey, Nii. 2003. Santa Barbara corpus of spoken American English, Part 2. Philadelphia: Linguistic Data Consortium. ISBN 1-58563-272-4.
Du Bois, John W., and Englebretson, Robert. 2004. Santa Barbara corpus of spoken American English, Part 3. Philadelphia: Linguistic Data Consortium. ISBN 1-58563-308-9.
Du Bois, John W., and Englebretson, Robert. 2005. Santa Barbara corpus of spoken American English, Part 4. Philadelphia: Linguistic Data Consortium. ISBN 1-58563-348-8.
Most of the audio recordings were originally made on Digital Audio Tape (DAT), recorded in stereo at 32 kHz or 48 kHz, on Sony TCD-D6 or TCD-D7 portable DAT recorders, using small, high quality stereo microphones. (A few early recordings were made on high quality analog cassette recorders.)
The audio data as published by the Linguistic Data Consortium consist of 16-bit, stereo, 22.05 kHz audio files in WAV format (PCM).
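A downloaded file can be checked against the published format by reading its WAV header. The following is a minimal sketch using Python's standard `wave` module; the function names and the `EXPECTED` table are illustrative and not part of the corpus distribution, and copies obtained from other sources may have been re-encoded with different parameters.

```python
import wave

# Format stated for the LDC release: 16-bit, stereo, 22.05 kHz PCM.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "frame_rate_hz": 22050}


def describe_wav(path):
    """Return channel count, sample width, frame rate and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_published_format(info):
    """True if the header values equal the published 16-bit stereo 22.05 kHz format."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

For example, `matches_published_format(describe_wav("SBC001.wav"))` would report whether a local copy of that file still carries the published parameters.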
Personal names of speakers on the recordings, as well as other identifying information such as telephone numbers, have been replaced by pseudonyms in the transcripts. In the audio files, the corresponding portions of the recordings have been filtered to make them unrecognizable, preserving the anonymity of the speakers. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (e.g. SBC001.flt) associated with each transcription/waveform file pair (e.g. SBC001.trn, SBC001.wav) is provided to list the beginning and ending times of the filtered regions. (The file SBC040.flt is empty, indicating that there was no personal information to filter out.)
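The filter list files can be read programmatically to find out which stretches of a recording were filtered. The sketch below is illustrative only: the exact layout of the .flt files is not described here, so it assumes that each non-blank line begins with two whitespace-separated numbers giving the beginning and ending time of one region. The function names are hypothetical, and the parsing should be checked against a real file before use.

```python
def parse_filter_list(text):
    """Parse filter-list text into a list of (start, end) pairs.

    ASSUMPTION: each non-blank line starts with two whitespace-separated
    numbers, the beginning and ending time of one filtered region.
    """
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines; an empty file yields no regions
        start, end = float(fields[0]), float(fields[1])
        if end < start:
            raise ValueError(f"region ends before it starts: {line!r}")
        regions.append((start, end))
    return regions


def is_filtered(regions, t):
    """True if time t falls inside any filtered region."""
    return any(start <= t <= end for start, end in regions)
```

Under this assumption an empty file such as SBC040.flt parses to an empty list, so `is_filtered` returns `False` for every time point.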
The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each region over a 1,000-sample span (roughly 45 milliseconds) to avoid abrupt transitions in the resulting waveform.
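The general technique can be illustrated with a short sketch. This is not the code used to produce the corpus: the filter design (a 101-tap Hamming-windowed sinc), the linear shape of the fade, the placement of the fade inside the region, and all function names are assumptions made for the example. Only the 400 Hz cut-off, the 1,000-sample fade length, and the 22.05 kHz sampling rate come from the description above. At that rate, 1,000 samples last 1000 / 22050 ≈ 0.0454 seconds.

```python
import math


def lowpass_taps(cutoff_hz, rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass coefficients, scaled to unit gain at 0 Hz."""
    fc = cutoff_hz / rate_hz  # cut-off as a fraction of the sampling rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(samples, taps):
    """Centered FIR convolution; output has the same length as the input."""
    half = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):  # treat samples beyond the edges as zero
                acc += tap * samples[j]
        out.append(acc)
    return out


def mix_weight(i, start, end, fade):
    """Share of the filtered signal at sample i: 0 outside [start, end),
    ramping linearly up to 1 over `fade` samples at each edge (assumed shape)."""
    if i < start or i >= end:
        return 0.0
    return min(1.0, (i - start + 1) / fade, (end - i) / fade)


def mask_region(samples, start, end, rate_hz=22050, cutoff_hz=400.0, fade=1000):
    """Cross-fade between the original and its low-passed copy inside one region."""
    filtered = fir_filter(samples, lowpass_taps(cutoff_hz, rate_hz))
    out = []
    for i, (dry, wet) in enumerate(zip(samples, filtered)):
        w = mix_weight(i, start, end, fade)
        out.append((1 - w) * dry + w * wet)
    return out
```

Cross-fading between the original and the filtered signal, rather than switching abruptly, is what avoids a discontinuity at the region boundaries. Because the cut-off lies above typical fundamental frequencies of speech, a filter of this kind is consistent with the note above that pitch information remains recoverable.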
The following additional files are included on the published CDs and DVDs from the Linguistic Data Consortium:
| File | Contents |
|---|---|
| segment.txt | explanation of the information contained in segment.tbl |
| segment.tbl | information about the speech event context |
| segment_summaries.txt | brief summary of the content of each discourse segment |
| speaker.txt | explanation of the information in speaker.tbl |
| speaker.tbl | speaker demographic information |
| table.txt | description of file names and informal titles |
list of conventions and prosodic annotations
Major funding for the creation of the Santa Barbara Corpus of Spoken American English was received from the National Endowment for the Humanities in the form of a grant [Grant #RT-21433-92] to Wallace L. Chafe, John W. Du Bois, and Sandra A. Thompson of the UCSB Linguistics Department, and Charles Meyer of the University of Massachusetts, Boston. The initial phases of the project to develop the Santa Barbara Corpus were made possible by a series of grants awarded to Chafe, Du Bois, and Thompson by the Interdisciplinary Humanities Center, the College of Letters and Science, and the Office of Research, all of UC Santa Barbara. Additional funds were received from the Linguistic Data Consortium at the University of Pennsylvania. The completion and release of Parts 2-4 of the Santa Barbara Corpus were facilitated by funding extended by TalkBank, an interdisciplinary research project funded by a grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
For more information about the Santa Barbara Corpus, contact: