Graduation Year

2024

Document Type

Dissertation

Degree

Ph.D.

Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Computer Science and Engineering

Major Professor

John Licato, Ph.D.

Committee Member

Shaun Canavan, Ph.D.

Committee Member

Gene Kim, Ph.D.

Committee Member

Ankit Shah, Ph.D.

Committee Member

Mark Pezzo, Ph.D.

Keywords

language model, artificial intelligence, NLP, Natural Language Inference (NLI)

Abstract

Reasoning over natural text is highly nuanced, and interpretations can vary widely with cultural background, financial status, age, gender, or even mood. This doctoral dissertation seeks not only to mimic human reasoning behaviors but also to improve the natural language processing (NLP) task used to capture naturalistic reasoning: Natural Language Inference (NLI). NLI involves determining whether a hypothesis is true (entailment), false (contradiction), or indeterminate (neutral) given a premise. We will first investigate the extent to which NLP systems designed to capture semantic equivalence actually measure meaning equivalence. After establishing that they do not fully capture it, we will enhance their ability to model the inferential properties of sentences. We will also demonstrate that the NLI task has fundamental limitations, such as the poor operationalization of one of its three labels, and examine several state-of-the-art datasets to show that they suffer from this issue. The primary goal of this dissertation is to assess whether language models can mimic human reasoning patterns. To this end, we will first analyze language models' ability to mimic human memory-retrieval patterns using the simpler Semantic Fluency Task (SFT). We will then describe our creation of a novel dataset that incorporates the dual-process theory of human cognition into NLI questions. We will detail methods for eliciting System 1 and System 2 responses from both humans and language models, showing that language models can, to a certain extent, mimic human reasoning behaviors, particularly in identifying individuating behaviors. Finally, we will compare various prompting styles and provide evidence against the common assumption that zero-shot prompting (providing a model with a task without prior examples) is best for eliciting System 1 behaviors and that chain-of-thought prompting (guiding the model through a step-by-step reasoning process to solve complex tasks) is optimal for System 2 behaviors.
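
For readers unfamiliar with the task, the following is a minimal sketch of the NLI label scheme and the two prompting styles the abstract contrasts. The premise, hypothesis, and prompt wording are hypothetical illustrations, not material from the dissertation itself.

```python
# Hypothetical illustration of the NLI task and of zero-shot vs.
# chain-of-thought prompting. All example text below is invented
# for illustration and is not drawn from the dissertation's dataset.

# The three NLI labels: given a premise, a hypothesis is true
# (entailment), false (contradiction), or indeterminate (neutral).
NLI_LABELS = ("entailment", "contradiction", "neutral")

PREMISE = "A violinist is performing on a street corner."
HYPOTHESIS = "Someone is playing music outdoors."  # entailed by the premise


def zero_shot_prompt(premise: str, hypothesis: str) -> str:
    """Zero-shot: state the task with no prior examples or reasoning steps."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Answer with one of {NLI_LABELS}:"
    )


def chain_of_thought_prompt(premise: str, hypothesis: str) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the label."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Reason through the inference step by step, "
        f"then answer with one of {NLI_LABELS}:"
    )


if __name__ == "__main__":
    print(zero_shot_prompt(PREMISE, HYPOTHESIS))
    print()
    print(chain_of_thought_prompt(PREMISE, HYPOTHESIS))
```

The common assumption the dissertation argues against maps the first style to fast, intuitive System 1 responses and the second to deliberate System 2 responses.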
