Introduction to Data Mining and Quantitative Text Analysis with R

Institution: see Organisers & Supporters

Programme of study: International Research Workshop

Lecturer: Pascal Jürgens (Johannes Gutenberg-University Mainz)

Date: see Workshop Programme

Max. number of participants: 15

Credit Points: 5 CP for participating in the whole IRWS

Language of instruction: English

Contents: This course offers a simple, pragmatic introduction to the quantitative analysis of textual data in R and to simple data mining tasks. There are four main themes:

1) Data logistics: Data preparation is a crucial task that often takes a lot of work and significantly influences results. We will therefore spend some time on how to load, prune, re-arrange and represent textual datasets.

2) Text analysis tools: This section introduces methods for answering research questions through quantitative approaches, such as word frequency analysis, topic modeling and selected semantic methods. Participants who are particularly interested in a specific application are encouraged to reach out in advance to make sure it will be covered.

3) Data mining: Part three covers simple but powerful types of machine learning, including clustering and linear models. More advanced methods (such as neural networks) will not be covered in the DIY exercises, although we may cover their basic mechanisms if time permits.

4) Rigor: The quantitative methods at hand are particularly sensitive to conceptual and empirical variation. We will therefore take apart some of our example models in order to understand how and when they fail.

Basic familiarity with the R environment and RStudio is required; introductory material will be provided in advance so that participants can read up and reach the necessary skill level before taking part. Participants should bring a laptop with RStudio pre-installed (www.rstudio.com).

You must register for the International Research Workshop to participate in this course.