JERAELYN TAN MING LI
Chan Chee Seng
Dr Fan Lixin
ChatGPT is an
advanced language model based on the GPT-3.5 architecture, designed to
generate human-like text in conversational contexts. It employs a
transformer-based neural network to capture context and produce high-quality
text. However, it has notable limitations: it may generate incorrect or
incoherent responses (hallucination), and its token limit restricts the length
of text it can process and generate. Understanding these limitations is
crucial for using the model effectively. This project aims to
comprehensively understand ChatGPT's functioning and explore the token limit
challenge. The objectives include analyzing the model's architecture,
investigating token limit challenges, and optimizing text generation. The
literature review covers advancements in language modeling, tokenization's
impact on text quality, prompt engineering, and hallucination detection.
Instruction tuning is introduced as a technique for improving language models.
The problem statement identifies two gaps: an incomplete understanding of
ChatGPT's functioning and limited research on the token limit challenge. The
research methodology comprises a preliminary study, categorization, algorithm
implementation, and evaluation.
The proposed algorithm incorporates document embedding, vector search, and
clustering techniques to overcome the token limit issue. It shows promise in
improving response precision and relevance within the token limit. By
addressing these objectives through the proposed methodology, this study
aims to deepen understanding of ChatGPT, optimize text generation within the
token limit, and contribute to advances in natural language processing and
conversational AI.
Keywords: ChatGPT, Tokenization, Vectorization, Clustering.
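The embedding-and-vector-search pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the bag-of-words embedding, cosine ranking, and word-count token budget are simplified stand-ins for a learned embedding model, a vector index, and the model's real tokenizer, and the clustering step is omitted for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def embed(text, vocab):
    # Toy bag-of-words embedding over a shared vocabulary, normalized to
    # unit length (stand-in for a learned document embedding model).
    counts = Counter(tokenize(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Cosine similarity of unit vectors reduces to a dot product.
    return sum(x * y for x, y in zip(a, b))

def select_chunks(chunks, query, token_budget):
    # Vector search: rank document chunks by similarity to the query,
    # then greedily keep the most relevant ones within the token budget.
    vocab = sorted(set(tokenize(query)).union(*(set(tokenize(c)) for c in chunks)))
    q = embed(query, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c, vocab), q), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(tokenize(chunk))  # crude proxy for the model's token count
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

In a full system, the selected chunks would be concatenated into the prompt sent to ChatGPT, so that only the most query-relevant material consumes the limited context window; a production version would also cluster chunks to deduplicate near-identical passages before selection.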