Introduction
Software archaeology
“Archaeologists try to find the clues left by people who lived before us, and they try to make sense of them.” (Source). This work is no different for software developers maintaining older systems, who venture as software archaeologists into old codebases like adventurers in a long-forgotten city. Many tasks are the same, such as deciphering the hieroglyphs (aka cryptic abbreviations) left behind by the original creators, unlocking valuable knowledge hidden beneath layers of history. And just as traditional archaeologists piece together ancient stories, software archaeologists help to understand the stories of legacy systems that have stood the test of time, and paving the way for a successful modernization journey.
Analysis goal
The goal of this archaeological analysis is to decipher the dense and often obscure language that developers used to describe their domain, processes, and business rules within a legacy codebase, gaining an initial understanding of what the code is about. This analysis uses Large Language Models (LLMs) to assist in uncovering the fundamental concepts buried within a legacy COBOL codebase, focusing specifically on mining abbreviations and other cryptic identifiers from filenames.
Why does this work? Because abbreviations used in filenames often act as shorthand for business terms, functional areas, framework specifics, or even system modules. They are typically used consistently throughout the codebase due to established (explicit or implicit) naming conventions. Recognizing, grouping, and extending these abbreviations to their original meanings helps create an initial understanding of the system, including its domain-specific terminology and the various functional areas it might cover.
The analysis in this blog post can serve as a first step towards a deeper comprehension of the codebase, helping to identify potential areas of interest, critical functionalities, and candidate modules for further examination or refactoring. It aims to reduce the time and effort required to make sense of legacy software and provides a foundation for upcoming modernization tasks.
Limitations
There are also limitations when using this approach. While LLMs might excel at pattern recognition and can suggest plausible expansions for abbreviations based on context, they still operate without true domain understanding and may hallucinate. This can lead to incorrect interpretations, especially in a specialized or highly customized business domain like COBOL mainframe systems, where abbreviations may have evolved based on internal company jargon or historical practices rather than standard terminology.
Moreover, relying only on abbreviation mining from file paths and names may provide a fragmented view of the codebase. There might be insufficient context, as different parts of the code may use similar or overlapping terms with entirely distinct meanings. This can result in ambiguous or misleading concept identification without careful cross-checking against the actual code logic and domain knowledge.
But I think this approach is better than having absolutely no clue what the codebase is about. Combining the automated identification and expansion of relevant abbreviations (including their possible meaning) with manual inspections using domain expertise opens up plenty opportunities to speculate about the system’s design and is absolutely no rocket science (as you’ll see in this blog post).
Today’s subject under inspection
I attempted to find and analyze a real legacy codebase, but obtaining a realistic scenario was challenging. This is where the “Mainframe CardDemo Application” (GitHub) from AWS comes into play: This project is a sample application designed to showcase modernization strategies for mainframe workloads using AWS services. It features a typical banking scenario with COBOL programs, JCL scripts, and related data files. The demo project should provide a realistic environment to demonstrate refactoring, replatforming, and migration of legacy code to cloud-based solutions. Although the number of files in this project is small, I believe it is a realistic starting point for exploring an unfamiliar legacy application written in a programming language I’m not familiar with.
Analysis
Step 1: Clone the repository
To begin, we clone the source code repository:
Then, we get a nice and clean file list we want to work with.
This give us a list with paths (only first five entries shown):
['bms/COACTUP.bms', 'bms/COACTVW.bms', 'bms/COADM01.bms', 'bms/COBIL00.bms', 'bms/COCRDLI.bms']
Step 2: Set up LLM assistant
I use Claude Sonnet 3.5, a large language model from Anthropic, to expand the abbreviations. For this type of application, I always set a low temperature to get consistent, reproducible results across different runs. I’ve also noticed that LLMs tend to get lazy when working on tasks I delegate because I don’t want to do them myself. Perhaps my own approach is rubbing off on the model. That’s why I make sure to remind the LLM not to be lazy in this case.
Step 3: Define the base prompt
The base prompt was designed to guide the LLM in extracting and expanding abbreviations. I paid attention to providing clear, step-by-step instructions to help the LLM through abbreviation extraction and concept definition. It includes handling alternative meanings and uncertainties by suggesting confidence scores and providing alternative entries. I also requested simple regex patterns for practical file matching, useful for locating relevant files later on. I specified a structured output as a JSON schema for consistency and easier integration with subsequent analysis. To my knowledge, Sonnet doesn’t have a built-in option for this kind of structured output, so I also included an example to clarify expectations (and to avoid breaking the following code).
Step 4: Assemble the initial prompt
Next, I combine the base prompt with the file list from above. I intentionally shuffle the filenames to avoid any biases and (hopefully) keep the LLM engaged with a varied input. The shuffled list is then appended to the base prompt for analysis. Concatenating the shuffled list with the base prompt creates a single string where each file path appears on a new line, making it easier for the model to process.
The prompt looks like this:
Below is a list of paths from a software program containing numerous abbreviations. Your task is ...
Step 5: Ask Claude via backfeeding
The easy part here is that I simply send the prompt to the LLM and hope that we get the results as a JSON data structure (fingers always crossed). In more detail, this process leverages an iterative feedback loop to refine the extraction of abbreviations. After each pass, it updates the prompt by including abbreviations already identified, guiding the AI to focus on new terms and avoid redundancy. The analysis continues as long as the extracted meanings maintain a high confidence level, stopping when confidence drops below a set threshold.
That’s what the result looks like:
[{'abbreviation': 'CBL', 'meaning': 'COBOL', 'type': 'technical', 'definition': 'Common Business-Ori
Step 6: Prepare for first analyses
For easier inspection of the result (and later analysis steps), I like to load the expanded abbreviations and their meanings in a pandas DataFrame.
Here are the first five entries of the DataFrame:
abbreviation | meaning | type | definition | regex | confidence | alternative | |
---|---|---|---|---|---|---|---|
19 | ACCT | Account | business | A financial record or arrangement between a cu... | .*AC+T.* | 0.95 | - |
32 | ACTUP | Account Update | technical | Process or interface for updating account info... | .*ACTUP.* | 0.90 | - |
42 | ACTVW | Account View | business | Process or interface for viewing account details | .*ACTVW.* | 0.90 | - |
20 | ADM | Administration | business | System administration and management functions | .*ADM.* | 0.90 | - |
18 | ADMIN | Administrator | business | A user with administrative privileges in the s... | .*ADM.* | 0.90 | - |
Step 7: Connect files on found patterns
Next, I search for all files that correspond to the pattern in the regex
column. This allows us to map the actual files to specific abbreviations or concepts. I iterate through the list of known abbreviations, using their regular expressions to find matching files in the project. For each abbreviation, I refine the regex, apply it to filter the file list, and update the list of unmatched files. The matched files and their proportion of the total file count are stored for each abbreviation, and the results are added to the DataFrame for analysis. However, this approach isn’t entirely clean, as files can belong to multiple concepts or match more than one abbreviation, leading to potential overlaps. But I think this is good enough for now.
Here are the first five entries of the DataFrame. Click on the info box below for the whole table.
abbreviation | meaning | type | definition | regex | confidence | alternative | prop | paths | |
---|---|---|---|---|---|---|---|---|---|
19 | ACCT | Account | business | A financial record or arrangement between a cu... | .*AC+T.* | 0.95 | - | 0.110345 | [bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01... |
32 | ACTUP | Account Update | technical | Process or interface for updating account info... | .*ACTUP.* | 0.90 | - | 0.020690 | [bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO... |
42 | ACTVW | Account View | business | Process or interface for viewing account details | .*ACTVW.* | 0.90 | - | 0.020690 | [bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO... |
20 | ADM | Administration | business | System administration and management functions | .*ADM.* | 0.90 | - | 0.034483 | [bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO... |
18 | ADMIN | Administrator | business | A user with administrative privileges in the s... | .*ADM.* | 0.90 | - | 0.034483 | [bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO... |
abbreviation | meaning | type | definition | regex | confidence | alternative | prop | paths | |
---|---|---|---|---|---|---|---|---|---|
19 | ACCT | Account | business | A financial record or arrangement between a cu... | .*AC+T.* | 0.95 | - | 0.110345 | [bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01... |
32 | ACTUP | Account Update | technical | Process or interface for updating account info... | .*ACTUP.* | 0.90 | - | 0.020690 | [bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO... |
42 | ACTVW | Account View | business | Process or interface for viewing account details | .*ACTVW.* | 0.90 | - | 0.020690 | [bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO... |
20 | ADM | Administration | business | System administration and management functions | .*ADM.* | 0.90 | - | 0.034483 | [bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO... |
18 | ADMIN | Administrator | business | A user with administrative privileges in the s... | .*ADM.* | 0.90 | - | 0.034483 | [bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO... |
8 | ASCII | American Standard Code for Information Interch... | technical | A character encoding standard used by most mod... | .*ASCII.* | 0.95 | - | 0.062069 | [data/ASCII/acctdata.txt, data/ASCII/carddata.... |
21 | BIL | Billing | business | Process of generating and managing customer bills | .*BIL.* | 0.95 | - | 0.020690 | [bms/COBIL00.bms, cbl/COBIL00C.cbl, cpy-bms/CO... |
11 | BMS | Basic Mapping Support | technical | CICS facility for handling screen layouts and ... | .*\.bms|\.BMS$ | 0.90 | - | 0.117241 | [bms/COACTUP.bms, bms/COACTVW.bms, bms/COADM01... |
54 | CATG | Category | business | Classification or grouping of items or transac... | .*CATG.* | 0.90 | - | 0.013793 | [data/EBCDIC/AWS.M2.CARDDEMO.TRANCATG.PS, jcl/... |
0 | CBL | COBOL | technical | Common Business-Oriented Language, a programmi... | .*\.cbl|\.CBL$ | 0.95 | - | 0.193103 | [cbl/CBACT01C.cbl, cbl/CBACT02C.cbl, cbl/CBACT... |
56 | COM | Common | technical | Shared or common components or modules in the ... | .*COM.* | 0.80 | Communication | 0.041379 | [bms/COMEN01.bms, cbl/COMEN01C.cbl, cpy-bms/CO... |
12 | CPY | Copy Book | technical | A COBOL include file containing shared code or... | .*\.cpy|\.CPY$ | 1.00 | - | 0.310345 | [cpy-bms/COACTUP.CPY, cpy-bms/COACTVW.CPY, cpy... |
22 | CRD | Card | business | Credit or debit card related functionality | .*CRD.* | 0.95 | - | 0.068966 | [bms/COCRDLI.bms, bms/COCRDSL.bms, bms/COCRDUP... |
40 | CRDLI | Card List | business | Process or interface for listing credit card i... | .*CRDLI.* | 0.80 | - | 0.020690 | [bms/COCRDLI.bms, cbl/COCRDLIC.cbl, cpy-bms/CO... |
39 | CRDSL | Card Select | business | Process or interface for selecting/querying cr... | .*CRDSL.* | 0.80 | Card Sale | 0.020690 | [bms/COCRDSL.bms, cbl/COCRDSLC.cbl, cpy-bms/CO... |
29 | CRDU | Card Update | technical | Operations or processes related to updating cr... | .*CRDU.* | 0.85 | - | 0.020690 | [bms/COCRDUP.bms, cbl/COCRDUPC.cbl, cpy-bms/CO... |
38 | CRDUP | Card Update | business | Process or interface for updating credit card ... | .*CRDUP.* | 0.90 | - | 0.020690 | [bms/COCRDUP.bms, cbl/COCRDUPC.cbl, cpy-bms/CO... |
58 | CS | Common System | technical | Prefix for system-wide utility or common syste... | .*CS[A-Z0-9].* | 0.80 | Customer Service | 0.075862 | [cbl/CSUTLDTC.cbl, cpy/CSDAT01Y.cpy, cpy/CSLKP... |
6 | CSD | CICS System Definition | technical | A file containing CICS resource definitions an... | .*\.CSD|\.csd$ | 0.90 | - | 0.006897 | [csd/CARDDEMO.CSD] |
5 | CTL | Control | technical | A control file that defines processing paramet... | .*\.ctl$ | 0.85 | - | 0.006897 | [ctl/REPROCT.ctl] |
15 | CUST | Customer | business | A person or entity that uses the services of t... | .*CUST.* | 1.00 | - | 0.034483 | [cpy/CUSTREC.cpy, data/EBCDIC/AWS.M2.CARDDEMO.... |
59 | CV | Conversion | technical | Related to data conversion or transformation p... | .*CV[A-Z0-9].* | 0.70 | - | 0.082759 | [cpy/CVACT01Y.cpy, cpy/CVACT02Y.cpy, cpy/CVACT... |
24 | DALY | Daily | business | Daily processing or operations | .*DALY.* | 0.90 | DAILY | 0.020690 | [data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data... |
48 | DISCGRP | Discount Group | business | A grouping of customers or products that share... | .*DISCGRP.* | 0.85 | - | 0.013793 | [data/EBCDIC/AWS.M2.CARDDEMO.DISCGRP.PS, jcl/D... |
61 | DPY | Display | technical | Related to screen display or output formatting | .*DPY.* | 0.80 | - | 0.006897 | [cpy/CSUTLDPY.cpy] |
62 | DWY | Data Way | technical | Related to data path or data flow handling | .*DWY.* | 0.60 | Dataway | 0.006897 | [cpy/CSUTLDWY.cpy] |
14 | EBCDIC | Extended Binary Coded Decimal Interchange Code | technical | A character encoding standard used mainly by I... | .*EBCDIC.* | 1.00 | - | 0.082759 | [data/EBCDIC/AWS.M2.CARDDEMO.ACCDATA.PS, data/... |
44 | GDG | Generation Data Group | technical | IBM mainframe concept for managing multiple ge... | .*GDG.* | 0.95 | - | 0.006897 | [jcl/DEFGDGB.jcl] |
10 | JCL | Job Control Language | technical | A scripting language used on IBM mainframes to... | .*\.jcl|\.JCL$ | 1.00 | - | 0.200000 | [jcl/ACCTFILE.jcl, jcl/CARDFILE.jcl, jcl/CBADM... |
37 | MEN | Menu | technical | Menu-related functionality or display screens | .*MEN.* | 0.85 | - | 0.027586 | [bms/COMEN01.bms, cbl/COMEN01C.cbl, cpy-bms/CO... |
13 | PROC | Procedure | technical | A set of instructions or commands that can be ... | .*\.prc|\.proc$ | 0.90 | - | 0.013793 | [proc/REPROC.prc, proc/TRANREPT.prc] |
51 | PS | Physical Sequential | technical | A type of dataset organization in mainframe sy... | .*\.PS(\..*)?$ | 0.95 | - | 0.082759 | [data/EBCDIC/AWS.M2.CARDDEMO.ACCDATA.PS, data/... |
55 | REJS | Rejects | business | Transactions or records that have been rejecte... | .*REJS.* | 0.85 | - | 0.006897 | [jcl/DALYREJS.jcl] |
35 | RPT | Report | technical | Reporting functionality or report generation m... | .*RPT.* | 0.95 | - | 0.020690 | [bms/CORPT00.bms, cbl/CORPT00C.cbl, cpy-bms/CO... |
26 | SGN | Sign-on | technical | User authentication and login functionality | .*SGN.* | 0.85 | Signature | 0.020690 | [bms/COSGN00.bms, cbl/COSGN00C.cbl, cpy-bms/CO... |
43 | STM | Statement | business | Related to account or credit card statement pr... | .*STM.* | 0.80 | - | 0.027586 | [cbl/CBSTM03A.CBL, cbl/CBSTM03B.CBL, cpy/COSTM... |
52 | TCATBAL | Transaction Category Balance | business | A record or file containing balance informatio... | .*TCATBAL.* | 0.85 | - | 0.013793 | [data/EBCDIC/AWS.M2.CARDDEMO.TCATBALF.PS, jcl/... |
17 | TRAN | Transaction | business | A financial operation or exchange recorded in ... | .*TRAN.* | 1.00 | - | 0.089655 | [data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data... |
27 | TRN | Transaction | business | Financial transaction processing and management | .*TR[AN]N.* | 0.95 | - | 0.089655 | [data/EBCDIC/AWS.M2.CARDDEMO.DALYTRAN.PS, data... |
63 | TTL | Title | business | Related to title or header information | .*TTL.* | 0.70 | Total | 0.006897 | [cpy/COTTL01Y.cpy] |
28 | USR | User | technical | User management and access control | .*USR.* | 0.95 | - | 0.103448 | [bms/COUSR00.bms, bms/COUSR01.bms, bms/COUSR02... |
60 | UTL | Utility | technical | Programs or modules that provide utility funct... | .*UTL.* | 0.90 | - | 0.020690 | [cbl/CSUTLDTC.cbl, cpy/CSUTLDPY.cpy, cpy/CSUTL... |
49 | XREF | Cross Reference | technical | A system or file that maps relationships betwe... | .*XREF.* | 0.95 | - | 0.020690 | [data/EBCDIC/AWS.M2.CARDDEMO.CARDXREF.PS, jcl/... |
Step 8: Coverage of abbreviation information
This step calculates the proportion of files in the project that contain identifiable abbreviation information. The resulting value indicates the percentage of files for which we were able to extract meaningful details based on the identified abbreviations.
This gives us this result:
99.31% of the files contain one or more abbreviations of concepts we know about
Assessment
Step 9: Generate assessment report
At this point, a software archaeologist can already get to know the system and reason about it. But this could be a tedious job for larger applications. So why not generate a summary of the results also via the LLM? That’s exactly what I’m doing here:
This generates an assessment report of the results:
Here’s a clear archaeological assessment of this COBOL application:
Key Findings:
System Type
- This is a Credit Card Management System running on IBM mainframe
- Uses CICS for transaction processing
- Handles customer accounts, cards, and financial transactions
Technical Architecture
- Core components: COBOL programs (.cbl), CICS screens (.bms), copybooks (.cpy)
- Data stored in EBCDIC format physical sequential files (.PS)
- Batch processing through JCL jobs
- Heavy use of copybooks (31% of codebase) suggesting modular design
Main Business Domains
- Account management (ACCT*)
- Card operations (CRD*)
- Customer data (CUST*)
- Transaction processing (TRAN*)
- User security (USR*)
- Billing/Statements (BIL, STM)
Notable Patterns
- Consistent naming conventions (CO prefix for programs)
- Clear separation between business and technical components
- Strong batch processing component (20% JCL)
- Comprehensive user interface (multiple BMS screens)
System Maturity
- Well-structured with clear naming conventions
- High confidence in identified abbreviations (99.31% coverage)
- Complete mainframe ecosystem (online + batch)
- Comprehensive security and user management
This appears to be a mature, production-grade mainframe application following standard IBM mainframe architectural >patterns of its era.
(Output slightly formatted for the blog post version for better readability)
OK, there might be some mistakes regarding the interpretation. But I think it is overall a great overview of the software system at hand.
Summary
In this analysis, I applied a LLM-assisted approach to decipher abbreviations embedded in a legacy codebase, focusing on enhancing our understanding of key business and technical concepts. By leveraging a well-structured prompt and an iterative feedback loop, I extracted abbreviations, expanded their meanings, and linked them to relevant files using regular expressions. The high coverage of identifiable abbreviations indicates a comprehensive grasp of the code’s structure and purpose. This iterative method allows us to refine the analysis based on previously discovered information, avoiding redundancy and increasing the accuracy of the interpretations. The generated insights gained can provide a solid foundation for upcoming modernization efforts and help understanding the critical business logic of a legacy application.
What do you think? Can this be useful for your legacy application as well? Let me know!
You can find an earlier version of this article as a Jupyter Notebook on GitHub