Stefan Th. Gries
Contact information
Last updated: 01 August 2017

Teaching at the University of California, Santa Barbara

Ling 104: Statistical Methods in Linguistics (Fall 2017)

Syllabus and overview

This course is a hands-on introduction to the fundamentals of statistical and data mining / machine learning methods in linguistics; it is based largely on the second edition (2013) of my Statistics for linguistics with R: a practical introduction. We begin by looking at a few basic notions of statistical analysis (e.g., variables, hypotheses, significance, etc.) and then discuss the logic of quantitative studies using the null-hypothesis falsification approach as well as how data should be set up for subsequent statistical evaluation. Then, we will explore data preparation and processing with the open-source programming language and environment R. The largest part of the course is concerned with a variety of classification and regression tools such as different kinds of regression models, classification and regression trees, missing data imputation, and unsupervised learning. Note that, because we use R throughout, the course requires computer literacy beyond swiping, pinching, long-tapping, and uploading/sending something to/via Facebook, Instagram, Snapchat, or whatever: if you install a program or download a file and then don't know 'where the program/file is', or if you don't know what unzipping a file means, this course is not the right one for you.

Downloads for class sessions
(files will be made available over time)

Session 01: slides
Session 02: exercise code, exercise data (must be unzipped), and the answer key
Session 03: exercise code and the answer key
Session 04: linear regression code and example data
Session 05: linear regression exercise 1, its data, and the answer key; linear regression exercise 2, its data, and the answer key
Session 06: binary logistic regression code and example data
Session 07: binary logistic regression exercise 1, its example data, and the answer key; binary logistic regression exercise 2, its example data, and the answer key
Session 08: CART 1 (code) and CART 1 (data); CART 2 (code) and CART 2 (data); CART 3 (code) and CART 3 (data)
Session 09: mpg example (exercise assignment)
Session 10: cluster analysis (code), consonants data for clustering, collocation data for clustering

info on assignment data

Links to relevant software and sites

R (current stable version: 3.4.2)
RStudio (current stable version: 1.0.153)

LibreOffice (current stable version:
my 2013 statistics textbook, its companion website, and its StatForLing with R newsgroup, which I moderate.