LLM-assisted Abbreviation Mining for Legacy Systems

“Archaeologists try to find the clues left by people who lived before us, and they try to make sense of them.” (Source). This work is no different for software developers maintaining older systems, who venture as software archaeologists into old codebases like adventurers in a long-forgotten city. Many tasks are the same, such as deciphering the hieroglyphs (aka cryptic abbreviations) left behind by the original creators and unlocking valuable knowledge hidden beneath layers of history. And just as traditional archaeologists piece together ancient stories, software archaeologists help to understand the stories of legacy systems that have stood the test of time, paving the way for a successful modernization journey.

Analysis goal

Figure: An overview of the expected results, showing a list of files with marked abbreviations that might be part of a certain concept. The goal is to find initial ideas of what concepts the codebase is made of.

The goal of this archaeological analysis is to decipher the dense and often obscure language that developers used to describe their domain, processes, and business rules within a legacy codebase, gaining an initial understanding of what the code is about. This analysis uses Large Language Models (LLMs) to assist in uncovering the fundamental concepts buried within a legacy COBOL codebase, focusing specifically on mining abbreviations and other cryptic identifiers from filenames.

Why does this work? Because abbreviations used in filenames often act as shorthand for business terms, functional areas, framework specifics, or even system modules. They are typically used consistently throughout the codebase due to established (explicit or implicit) naming conventions. Recognizing, grouping, and extending these abbreviations to their original meanings helps create an initial understanding of the system, including its domain-specific terminology and the various functional areas it might cover.

The analysis in this blog post can serve as a first step towards a deeper comprehension of the codebase, helping to identify potential areas of interest, critical functionalities, and candidate modules for further examination or refactoring. It aims to reduce the time and effort required to make sense of legacy software and provides a foundation for upcoming modernization tasks.

Limitations

There are also limitations when using this approach. While LLMs might excel at pattern recognition and can suggest plausible expansions for abbreviations based on context, they still operate without true domain understanding and may hallucinate. This can lead to incorrect interpretations, especially in specialized or highly customized environments such as COBOL mainframe systems, where abbreviations may have evolved based on internal company jargon or historical practices rather than standard terminology.

Moreover, relying only on abbreviation mining from file paths and names may provide a fragmented view of the codebase. There might be insufficient context, as different parts of the code may use similar or overlapping terms with entirely distinct meanings. This can result in ambiguous or misleading concept identification without careful cross-checking against the actual code logic and domain knowledge.

But I think this approach is better than having absolutely no clue what the codebase is about. Combining the automated identification and expansion of relevant abbreviations (including their possible meaning) with manual inspections using domain expertise opens up plenty of opportunities to speculate about the system’s design and is absolutely no rocket science (as you’ll see in this blog post).

Today’s subject under inspection

I attempted to find and analyze a real legacy codebase, but obtaining a realistic scenario was challenging. This is where the “Mainframe CardDemo Application” (GitHub) from AWS comes into play: This project is a sample application designed to showcase modernization strategies for mainframe workloads using AWS services. It features a typical banking scenario with COBOL programs, JCL scripts, and related data files. The demo project should provide a realistic environment to demonstrate refactoring, replatforming, and migration of legacy code to cloud-based solutions. Although the number of files in this project is small, I believe it is a realistic starting point for exploring an unfamiliar legacy application written in a programming language I’m not familiar with.

Analysis

Step 1: Clone the repository

To begin, we clone the source code repository:

git clone https://github.com/aws-samples/aws-mainframe-modernization-carddemo.git

Then, we get a nice and clean file list we want to work with.

import glob

root_dir = "../aws-mainframe-modernization-carddemo/app/"
glob_list = glob.glob(f"{root_dir}**/*.*", recursive=True)
file_list = [f.replace("\\","/").replace(root_dir, "") for f in glob_list]
file_list[:5]

This gives us a list of paths (only the first five entries are shown):

['bms/COACTUP.bms',
 'bms/COACTVW.bms',
 'bms/COADM01.bms',
 'bms/COBIL00.bms',
 'bms/COCRDLI.bms']
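Even before an LLM is involved, a quick frequency count can hint at the naming conventions hiding in such a list. The following is a minimal sketch and not part of the original analysis: the helper name `prefix_counts` and the choice of a two-character prefix are assumptions made for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def prefix_counts(paths, length=2):
    """Count how often each filename prefix of the given length occurs."""
    # .stem strips the directory part and the last file extension
    stems = [PurePosixPath(p).stem for p in paths]
    return Counter(stem[:length].upper() for stem in stems)


# the five sample paths shown above
sample = ['bms/COACTUP.bms', 'bms/COACTVW.bms', 'bms/COADM01.bms',
          'bms/COBIL00.bms', 'bms/COCRDLI.bms']
print(prefix_counts(sample).most_common(3))  # [('CO', 5)]
```

All five sample files share the prefix `CO`, which is the kind of convention the LLM is asked to interpret in the following steps.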

Step 2: Set up LLM assistant

I use Claude 3.5 Sonnet, a large language model from Anthropic, to expand the abbreviations. For this type of application, I always set a low temperature to get more consistent results across different runs. I’ve also noticed that LLMs tend to get lazy when working on tasks I delegate because I don’t want to do them myself. Perhaps my own approach is rubbing off on the model. That’s why I make sure to remind the LLM not to be lazy in this case.

import anthropic
client = anthropic.Anthropic()

def ask(prompt):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2000,
        temperature=0.0,
        system="""
        You're a software archaeologist who tries to make sense of the past.
        Respond in short and clear explanations.
        Don't be lazy!""",
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

Step 3: Define the base prompt

The base prompt was designed to guide the LLM in extracting and expanding abbreviations. I paid attention to providing clear, step-by-step instructions to help the LLM through abbreviation extraction and concept definition. It includes handling alternative meanings and uncertainties by suggesting confidence scores and providing alternative entries. I also requested simple regex patterns for practical file matching, useful for locating relevant files later on. I specified a structured output as a JSON schema for consistency and easier integration with subsequent analysis. To my knowledge, Sonnet doesn’t have a built-in option for this kind of structured output, so I also included an example to clarify expectations (and to avoid breaking the following code).

base_prompt = """

Below is a list of paths from a software program containing numerous abbreviations.

Your task is to build a list of abbreviations and a glossary entry of their corresponding concepts.
For this, determine the meaning of an abbreviation found in the paths and filenames.
For very similar abbreviations, make a separate entry. Don't put more than one abbreviation in one entry.

Then, provide a definition of each concept and provide the information if it is a business concept or a technical concept.
With that, also estimate a confidence score between 0 and 1 indicating how certain you are about the term and its definition.

Additionally, find a simple regular expression that identifies all files related to the abbreviation (and their alternative spellings).

If there are no more abbreviations to discover, return an empty JSON list.

Output only and directly as a JSON array (not with a key), strictly using the following schema:
- 'abbreviation': The abbreviation.
- 'meaning': The meaning of the abbreviation.
- 'type': business, technical
- 'definition': An explanation of the concept.
- 'regex': A regular expression to locate related files.
- 'confidence': A value between 0 and 1 indicating certainty.
- 'alternative': An alternative meaning, or "-" if none exists.

Example output:
[
    {
      "abbreviation": "ABC",
      "meaning": "Air Bullet Container",
      "type": "technical",
      "definition": "A storage unit used for air bullets in testing scenarios.",
      "regex": ".*ABC.*",
      "confidence": 0.8,
      "alternative": "-"
    }
]

Now, expand the following abbreviations to their full meanings:
"""

Step 4: Assemble the initial prompt

Next, I combine the base prompt with the file list from above. I intentionally shuffle the filenames to avoid any biases and (hopefully) keep the LLM engaged with a varied input. The shuffled list is then appended to the base prompt for analysis. Concatenating the shuffled list with the base prompt creates a single string where each file path appears on a new line, making it easier for the model to process.

import random

random.shuffle(file_list)
prompt = base_prompt + "\n".join(file_list)
print(prompt[:100] + "...")

The prompt looks like this:

Below is a list of paths from a software program containing numerous abbreviations.
    
Your task is ...
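One side effect of the unseeded shuffle is that every run sends a different prompt, which works against the consistency the low temperature is meant to provide. If repeatable prompts matter, a seeded shuffle is one option. This is a minimal sketch; the helper name `shuffled_copy` and the seed value are assumptions, not part of the original notebook.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = list(items)       # copy, so the caller's list keeps its order
    rng.shuffle(result)
    return result


sample = ['bms/COACTUP.bms', 'bms/COACTVW.bms', 'bms/COADM01.bms']
# two calls with the same seed produce the identical order
print(shuffled_copy(sample) == shuffled_copy(sample))  # True
```

Working on a copy also leaves `file_list` in its original order for the later matching steps.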

Step 5: Ask Claude via backfeeding

The easy part here is that I simply send the prompt to the LLM and hope that we get the results as a JSON data structure (fingers always crossed). In more detail, this process leverages an iterative feedback loop to refine the extraction of abbreviations. After each pass, it updates the prompt by including abbreviations already identified, guiding the AI to focus on new terms and avoid redundancy. The analysis continues as long as the extracted meanings maintain a high confidence level, stopping when confidence drops below a set threshold.

import json

min_confidence = 1.0

json_results = []

backfeed = ""

while min_confidence > 0.7:

    evolving_prompt = prompt + "\n\n" + backfeed

    res = ask(evolving_prompt)
    json_result = json.loads(res)

    if len(json_result) == 0:
        break

    min_confidence = min([i['confidence'] for i in json_result])
    # add the new entries first so that the backfeed also lists them
    json_results.extend(json_result)
    backfeed = "\n\nI already discovered these abbreviations that I don't need anymore:\n" + "\n".join([i['abbreviation'] for i in json_results])
    
print(str(json_results)[:100])

That’s what the result looks like:

[{'abbreviation': 'CBL', 'meaning': 'COBOL', 'type': 'technical', 'definition': 'Common Business-Ori
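The loop above passes the raw response straight to `json.loads`, which fails as soon as the model wraps the array in extra text. A slightly more forgiving parser could look like the following sketch. The helper name `parse_json_array` is an assumption, and the approach (cutting from the first `[` to the last `]`) is a simple heuristic rather than a complete solution.

```python
import json


def parse_json_array(text):
    """Extract and parse the outermost JSON array from an LLM response."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    # json.loads still raises json.JSONDecodeError if the slice is malformed
    return json.loads(text[start:end + 1])


response = 'Sure, here it is: [{"abbreviation": "CBL", "confidence": 0.95}] Hope that helps.'
print(parse_json_array(response)[0]["abbreviation"])  # CBL
```

In the loop, `json.loads(res)` could be swapped for `parse_json_array(res)`; a response without any array then raises a `ValueError` that can be caught to retry or stop.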

Step 6: Prepare for first analyses

For easier inspection of the result (and later analysis steps), I like to load the expanded abbreviations and their meanings in a pandas DataFrame.

import pandas as pd
abbreviations = pd.DataFrame.from_dict(json_results)\
    .sort_values(by='abbreviation')\
    .drop_duplicates(subset=["abbreviation", "meaning"])
abbreviations.head()

Here are the first five entries of the DataFrame:

	abbreviation	meaning	type	definition	regex	confidence	alternative
19	ACCT	Account	business	A financial record or arrangement between a cu...	.AC+T.	0.95	-
32	ACTUP	Account Update	technical	Process or interface for updating account info...	.ACTUP.	0.90	-
42	ACTVW	Account View	business	Process or interface for viewing account details	.ACTVW.	0.90	-
20	ADM	Administration	business	System administration and management functions	.ADM.	0.90	-
18	ADMIN	Administrator	business	A user with administrative privileges in the s...	.ADM.	0.90	-
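The next step feeds the regex column directly into a pattern search, so it may be worth checking each entry against the requested schema first. The following is a minimal sketch under the assumption that the entries are plain dictionaries as returned by the LLM; the names `REQUIRED_KEYS` and `validate_entry` are hypothetical.

```python
import re

# the fields requested in the base prompt
REQUIRED_KEYS = {"abbreviation", "meaning", "type", "definition",
                 "regex", "confidence", "alternative"}


def validate_entry(entry):
    """Return a list of problems found in one glossary entry (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if entry.get("type") not in ("business", "technical"):
        problems.append(f"unexpected type: {entry.get('type')!r}")
    confidence = entry.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        problems.append(f"confidence out of range: {confidence!r}")
    try:
        re.compile(entry.get("regex", ""))
    except (re.error, TypeError) as exc:
        problems.append(f"invalid regex: {exc}")
    return problems


# the example entry from the base prompt
example = {"abbreviation": "ABC", "meaning": "Air Bullet Container",
           "type": "technical",
           "definition": "A storage unit used for air bullets in testing scenarios.",
           "regex": ".*ABC.*", "confidence": 0.8, "alternative": "-"}
print(validate_entry(example))  # []
```

Entries with a non-empty problem list could be dropped or inspected manually before they reach the file matching.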

Step 7: Connect files on found patterns

Next, I search for all files that match the pattern in the regex column, which maps the actual files to specific abbreviations or concepts. I iterate through the list of known abbreviations; for each one, I adjust its regular expression (turning capture groups into non-capturing groups and relaxing start anchors) and apply it to filter the file list. The matched files and their proportion of the total file count are stored for each abbreviation, and the results are added to the DataFrame for analysis. This approach isn’t entirely clean, as a file can belong to multiple concepts or match more than one abbreviation, leading to overlaps. But I think this is good enough for now.

files = pd.Series(file_list)

abbreviation_files = []
proportions = []

length_all_files = len(file_list)

for i, entry in abbreviations.iterrows():
    # remove possible capture groups and start markers
    regex = entry['regex'].replace("(", "(?:").replace("^", ".*")
    files_found = files[files.str.contains(regex)].sort_values().to_list()
    
    proportions.append(len(files_found)/length_all_files)
    abbreviation_files.append(files_found)

abbreviations['prop'] = proportions
abbreviations['paths'] = abbreviation_files
abbreviations.head()

Here are the first five entries of the DataFrame; the complete table follows below.

	abbreviation	meaning	type	definition	regex	confidence	alternative	prop	paths
19	ACCT	Account	business	A financial record or arrangement between a cu...	.AC+T.	0.95	-	0.110345	[bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01...
32	ACTUP	Account Update	technical	Process or interface for updating account info...	.ACTUP.	0.90	-	0.020690	[bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO...
42	ACTVW	Account View	business	Process or interface for viewing account details	.ACTVW.	0.90	-	0.020690	[bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO...
20	ADM	Administration	business	System administration and management functions	.ADM.	0.90	-	0.034483	[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO...
18	ADMIN	Administrator	business	A user with administrative privileges in the s...	.ADM.	0.90	-	0.034483	[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO...
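Because a file can match several abbreviations, it can help to invert the mapping and see which concepts meet in each file. The following sketch uses plain Python instead of pandas; the function name `concepts_per_file` and the simplified patterns are assumptions for illustration, not the regexes the LLM produced.

```python
import re
from collections import defaultdict


def concepts_per_file(paths, patterns):
    """Map each path to the abbreviations whose regex matches it."""
    result = defaultdict(list)
    for abbreviation, pattern in patterns.items():
        compiled = re.compile(pattern)
        for path in paths:
            if compiled.search(path):
                result[path].append(abbreviation)
    return dict(result)


paths = ["bms/COACTUP.bms", "cbl/COACTUPC.cbl", "bms/COADM01.bms"]
# simplified, hypothetical patterns
patterns = {"ACTUP": "ACTUP", "ADM": "ADM", "BMS": r"\.bms$"}

mapping = concepts_per_file(paths, patterns)
# keep only the files that belong to more than one concept
overlaps = {p: a for p, a in mapping.items() if len(a) > 1}
print(overlaps)
# {'bms/COACTUP.bms': ['ACTUP', 'BMS'], 'bms/COADM01.bms': ['ADM', 'BMS']}
```

Here the overlaps are harmless, since a technical concept (the file type) meets a business concept. Overlaps between two business concepts would be the more interesting ones to inspect.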

Complete table

	abbreviation	meaning	type	definition	regex	confidence	alternative	prop	paths
19	ACCT	Account	business	A financial record or arrangement between a cu...	.AC+T.	0.95	-	0.110345	[bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01...
32	ACTUP	Account Update	technical	Process or interface for updating account info...	.ACTUP.	0.90	-	0.020690	[bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO...
42	ACTVW	Account View	business	Process or interface for viewing account details	.ACTVW.	0.90	-	0.020690	[bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO...
20	ADM	Administration	business	System administration and management functions	.ADM.	0.90	-	0.034483	[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO...
18	ADMIN	Administrator	business	A user with administrative privileges in the s...	.ADM.	0.90	-	0.034483	[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO...
8	ASCII	American Standard Code for Information Interch...	technical	A character encoding standard used by most mod...	.ASCII.	0.95	-	0.062069	[data/ASCII/acctdata.txt, data/ASCII/carddata....
21	BIL	Billing	business	Process of generating and managing customer bills	.BIL.	0.95	-	0.020690	[bms/COBIL00.bms, cbl/COBIL00C.cbl, cpy-bms/CO...
11	BMS	Basic Mapping Support	technical	CICS facility for handling screen layouts and ...	.*\.bms\|\.BMS$	0.90	-	0.117241	[bms/COACTUP.bms, bms/COACTVW.bms, bms/COADM01...
54	CATG	Category	business	Classification or grouping of items or transac...	.CATG.	0.90	-	0.013793	[data/EBCDIC/AWS.M2.CARDDEMO.TRANCATG.PS, jcl/...
0	CBL	COBOL	technical	Common Business-Oriented Language, a programmi...	.*\.cbl\|\.CBL$	0.95	-	0.193103	[cbl/CBACT01C.cbl, cbl/CBACT02C.cbl, cbl/CBACT...
56	COM	Common	technical	Shared or common components or modules in the ...	.COM.	0.80	Communication	0.041379	[bms/COMEN01.bms, cbl/COMEN01C.cbl, cpy-bms/CO...
12	CPY	Copy Book	technical	A COBOL include file containing shared code or...	.*\.cpy\|\.CPY$	1.00	-	0.310345	[cpy-bms/COACTUP.CPY, cpy-bms/COACTVW.CPY, cpy...
22	CRD	Card	business	Credit or debit card related functionality	.CRD.	0.95	-	0.068966	[bms/COCRDLI.bms, bms/COCRDSL.bms, bms/COCRDUP...
40	CRDLI	Card List	business	Process or interface for listing credit card i...	.CRDLI.	0.80	-	0.020690	[bms/COCRDLI.bms, cbl/COCRDLIC.cbl, cpy-bms/CO...
39	CRDSL	Card Select	business	Process or interface for selecting/querying cr...	.CRDSL.	0.80	Card Sale	0.020690	[bms/COCRDSL.bms, cbl/COCRDSLC.cbl, cpy-bms/CO...
29	CRDU	Card Update	technical	Operations or processes related to updating cr...	.CRDU.	0.85	-	0.020690	[bms/COCRDUP.bms, cbl/COCRDUPC.cbl, cpy-bms/CO...
38	CRDUP	Card Update	business	Process or interface for updating credit card ...	.CRDUP.	0.90	-	0.020690	[bms/COCRDUP.bms, cbl/COCRDUPC.cbl, cpy-bms/CO...
58	CS	Common System	technical	Prefix for system-wide utility or common syste...	.CS[A-Z0-9].	0.80	Customer Service	0.075862	[cbl/CSUTLDTC.cbl, cpy/CSDAT01Y.cpy, cpy/CSLKP...
6	CSD	CICS System Definition	technical	A file containing CICS resource definitions an...	.*\.CSD\|\.csd$	0.90	-	0.006897	[csd/CARDDEMO.CSD]
5	CTL	Control	technical	A control file that defines processing paramet...	.*\.ctl$	0.85	-	0.006897	[ctl/REPROCT.ctl]
15	CUST	Customer	business	A person or entity that uses the services of t...	.CUST.	1.00	-	0.034483	[cpy/CUSTREC.cpy, data/EBCDIC/AWS.M2.CARDDEMO....
59	CV	Conversion	technical	Related to data conversion or transformation p...	.CV[A-Z0-9].	0.70	-	0.082759	[cpy/CVACT01Y.cpy, cpy/CVACT02Y.cpy, cpy/CVACT...
24	DALY	Daily	business	Daily processing or operations	.DALY.	0.90	DAILY	0.020690	[data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data...
48	DISCGRP	Discount Group	business	A grouping of customers or products that share...	.DISCGRP.	0.85	-	0.013793	[data/EBCDIC/AWS.M2.CARDDEMO.DISCGRP.PS, jcl/D...
61	DPY	Display	technical	Related to screen display or output formatting	.DPY.	0.80	-	0.006897	[cpy/CSUTLDPY.cpy]
62	DWY	Data Way	technical	Related to data path or data flow handling	.DWY.	0.60	Dataway	0.006897	[cpy/CSUTLDWY.cpy]
14	EBCDIC	Extended Binary Coded Decimal Interchange Code	technical	A character encoding standard used mainly by I...	.EBCDIC.	1.00	-	0.082759	[data/EBCDIC/AWS.M2.CARDDEMO.ACCDATA.PS, data/...
44	GDG	Generation Data Group	technical	IBM mainframe concept for managing multiple ge...	.GDG.	0.95	-	0.006897	[jcl/DEFGDGB.jcl]
10	JCL	Job Control Language	technical	A scripting language used on IBM mainframes to...	.*\.jcl\|\.JCL$	1.00	-	0.200000	[jcl/ACCTFILE.jcl, jcl/CARDFILE.jcl, jcl/CBADM...
37	MEN	Menu	technical	Menu-related functionality or display screens	.MEN.	0.85	-	0.027586	[bms/COMEN01.bms, cbl/COMEN01C.cbl, cpy-bms/CO...
13	PROC	Procedure	technical	A set of instructions or commands that can be ...	.*\.prc\|\.proc$	0.90	-	0.013793	[proc/REPROC.prc, proc/TRANREPT.prc]
51	PS	Physical Sequential	technical	A type of dataset organization in mainframe sy...	.\.PS(\..)?$	0.95	-	0.082759	[data/EBCDIC/AWS.M2.CARDDEMO.ACCDATA.PS, data/...
55	REJS	Rejects	business	Transactions or records that have been rejecte...	.REJS.	0.85	-	0.006897	[jcl/DALYREJS.jcl]
35	RPT	Report	technical	Reporting functionality or report generation m...	.RPT.	0.95	-	0.020690	[bms/CORPT00.bms, cbl/CORPT00C.cbl, cpy-bms/CO...
26	SGN	Sign-on	technical	User authentication and login functionality	.SGN.	0.85	Signature	0.020690	[bms/COSGN00.bms, cbl/COSGN00C.cbl, cpy-bms/CO...
43	STM	Statement	business	Related to account or credit card statement pr...	.STM.	0.80	-	0.027586	[cbl/CBSTM03A.CBL, cbl/CBSTM03B.CBL, cpy/COSTM...
52	TCATBAL	Transaction Category Balance	business	A record or file containing balance informatio...	.TCATBAL.	0.85	-	0.013793	[data/EBCDIC/AWS.M2.CARDDEMO.TCATBALF.PS, jcl/...
17	TRAN	Transaction	business	A financial operation or exchange recorded in ...	.TRAN.	1.00	-	0.089655	[data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data...
27	TRN	Transaction	business	Financial transaction processing and management	.TR[AN]N.	0.95	-	0.089655	[data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data...
63	TTL	Title	business	Related to title or header information	.TTL.	0.70	Total	0.006897	[cpy/COTTL01Y.cpy]
28	USR	User	technical	User management and access control	.USR.	0.95	-	0.103448	[bms/COUSR00.bms, bms/COUSR01.bms, bms/COUSR02...
60	UTL	Utility	technical	Programs or modules that provide utility funct...	.UTL.	0.90	-	0.020690	[cbl/CSUTLDTC.cbl, cpy/CSUTLDPY.cpy, cpy/CSUTL...
49	XREF	Cross Reference	technical	A system or file that maps relationships betwe...	.XREF.	0.95	-	0.020690	[data/EBCDIC/AWS.M2.CARDDEMO.CARDXREF.PS, jcl/...
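For the manual inspection with domain expertise, entries with a low confidence score or a competing interpretation are natural candidates to look at first. This is a minimal sketch; the helper name `review_candidates` and the threshold of 0.8 are assumptions, and the sample values are taken from the table above.

```python
def review_candidates(entries, threshold=0.8):
    """Select abbreviations that are uncertain or have an alternative meaning."""
    return [
        e["abbreviation"] for e in entries
        if e["confidence"] < threshold or e["alternative"] != "-"
    ]


sample = [
    {"abbreviation": "CUST", "confidence": 1.00, "alternative": "-"},
    {"abbreviation": "DWY", "confidence": 0.60, "alternative": "Dataway"},
    {"abbreviation": "TTL", "confidence": 0.70, "alternative": "Total"},
]
print(review_candidates(sample))  # ['DWY', 'TTL']
```

With the real results, the function could be called as `review_candidates(json_results)`.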

Step 8: Coverage of abbreviation information

This step calculates the proportion of files in the project that contain identifiable abbreviation information. The resulting value indicates the percentage of files for which we were able to extract meaningful details based on the identified abbreviations.

# count each matched file once; nunique() ignores the NaN left by abbreviations without matches
files_with_info_about_abbreviations = abbreviations.explode('paths')['paths'].nunique()
coverage = files_with_info_about_abbreviations / length_all_files
coverage_text = f"{coverage*100:.2f}% of the files contain one or more abbreviations of concepts we know about"
print(coverage_text)

This gives us this result:

99.31% of the files contain one or more abbreviations of concepts we know about
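The complement of this number can be just as informative: files that no abbreviation matched may follow a different naming scheme or belong to a concept the LLM missed. A set difference lists them. This is a sketch with a hypothetical helper name `uncovered_files` and made-up sample data.

```python
def uncovered_files(all_files, matched_lists):
    """Return the files that no abbreviation pattern matched, sorted."""
    covered = {path for paths in matched_lists for path in paths}
    return sorted(set(all_files) - covered)


# hypothetical sample: three files, two of them matched by some abbreviation
all_files = ["bms/COACTUP.bms", "cbl/COACTUPC.cbl", "misc/NOTES.txt"]
matched = [["bms/COACTUP.bms", "cbl/COACTUPC.cbl"], ["bms/COACTUP.bms"]]
print(uncovered_files(all_files, matched))  # ['misc/NOTES.txt']
```

With the data from this analysis, the call would be `uncovered_files(file_list, abbreviations['paths'])`.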

Assessment

Step 9: Generate assessment report

At this point, a software archaeologist can already get to know the system and reason about it. But this could be a tedious job for larger applications. So why not generate a summary of the results also via the LLM? That’s exactly what I’m doing here:

from IPython.display import display, Markdown

assessment_prompt = f"""
Here is a table with the information about a COBOL application.
Find the key insights for a software archeologist and summarize the findings in this assessment:
    
{abbreviations.to_markdown()}

Also: {coverage_text}
"""

result = ask(assessment_prompt)
display(Markdown(result))

This generates an assessment report of the results:

Here’s a clear archaeological assessment of this COBOL application:

Key Findings:

System Type

This is a Credit Card Management System running on IBM mainframe

Uses CICS for transaction processing

Handles customer accounts, cards, and financial transactions

Technical Architecture

Core components: COBOL programs (.cbl), CICS screens (.bms), copybooks (.cpy)

Data stored in EBCDIC format physical sequential files (.PS)

Batch processing through JCL jobs

Heavy use of copybooks (31% of codebase) suggesting modular design

Main Business Domains

Account management (ACCT*)

Card operations (CRD*)

Customer data (CUST*)

Transaction processing (TRAN*)

User security (USR*)

Billing/Statements (BIL, STM)

Notable Patterns

Consistent naming conventions (CO prefix for programs)

Clear separation between business and technical components

Strong batch processing component (20% JCL)

Comprehensive user interface (multiple BMS screens)

System Maturity

Well-structured with clear naming conventions

High confidence in identified abbreviations (99.31% coverage)

Complete mainframe ecosystem (online + batch)

Comprehensive security and user management

This appears to be a mature, production-grade mainframe application following standard IBM mainframe architectural patterns of its era.

(Output slightly formatted for the blog post version for better readability)

OK, there might be some mistakes regarding the interpretation. But I think it is overall a great overview of the software system at hand.

Summary

In this analysis, I applied an LLM-assisted approach to decipher abbreviations embedded in a legacy codebase, focusing on enhancing our understanding of key business and technical concepts. By leveraging a well-structured prompt and an iterative feedback loop, I extracted abbreviations, expanded their meanings, and linked them to relevant files using regular expressions. The high coverage of identifiable abbreviations indicates that the resulting glossary reaches nearly every file and gives a good first grasp of the code’s structure and purpose. This iterative method allows us to refine the analysis based on previously discovered information, avoiding redundancy and increasing the accuracy of the interpretations. The insights gained can provide a solid foundation for upcoming modernization efforts and help in understanding the critical business logic of a legacy application.

What do you think? Can this be useful for your legacy application as well? Let me know!

You can find an earlier version of this article as a Jupyter Notebook on GitHub
