Performance of Fine-Tuned Models

Author

Srinivas Sundar

Published

February 16, 2025

!pip install librosa soundfile

import pandas as pd
import numpy as np

df_whisper=pd.read_csv('whisper_ft_results.csv')
df_marian=pd.read_csv('marian_ft_results.csv')

# Samples where fine-tuning lowered the character error rate (CER)
df_improved_whisper_cer = df_whisper[df_whisper["cer_base"] > df_whisper["cer_ft"]]
print("Number of samples improved:", len(df_improved_whisper_cer))
# First 10 rows (not rendered: a notebook cell displays only its last expression)
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "cer_base", "cer_ft"]
df_improved_whisper_cer[display_columns].head(10)
# Samples where fine-tuning lowered the word error rate (WER)
df_improved_whisper_wer = df_whisper[df_whisper["wer_base"] > df_whisper["wer_ft"]]
print("Number of samples improved:", len(df_improved_whisper_wer))
# First 10 rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "wer_base", "wer_ft"]
df_improved_whisper_wer[display_columns].head(10)

Number of samples improved: 57
Number of samples improved: 60

	sample_idx	reference	base_pred	ft_pred	wer_base	wer_ft
1	1	and fancied his countenance was not altogether...	And fancy it his countenance was not all toge...	and fancy it his countenance was now altogeth...	0.351351	0.297297
3	3	you've been jawin like a lot a old hens	You've been joined like a lot of old hands.	you've been drawing like a lot of old hands	0.444444	0.333333
4	4	let us finally confess it that what is most di...	Let us finally confess it that what is most d...	Let us finally confess it that what is most d...	0.129032	0.096774
6	6	the tireless machines marched back and forth a...	The tireless machines march back and forth ac...	The tireless machines march back and forth ac...	0.142857	0.071429
7	7	before the middle of the day they were visited...	Before the middle of the day there were visit...	Before the middle of the day there were visit...	0.340000	0.300000
9	9	alarmed but not discouraged she tried it anoth...	alarmed but not discouraged. She tried it ano...	Alarmed but not discouraged, she tried it ano...	0.212121	0.151515
11	11	but at least it was obvious that some one must...	But at least it was obvious that someone must...	but at least it was obvious that someone must...	0.222222	0.155556
13	13	they began to fell trees for the timbers of th...	They began to felt trees for the timbers of t...	They began to fell trees for the timbers of t...	0.200000	0.150000
14	14	it's a technical problem of the exigencies of ...	It's a technical problem of the exigenesis of...	It's a technical problem of the exigenacies o...	0.425000	0.375000
15	15	and in their fury the women fell upon him deal...	and in their fury the women fell upon him, de...	and in their fury the women fell upon him dea...	0.225000	0.050000
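The wer_* and cer_* columns are read from the CSV, so the way they were computed is not shown in this notebook. As a minimal sketch, assuming the standard definition (Levenshtein edit distance over words or characters, divided by the reference length, with no text normalization), the code below agrees with the WER values listed for sample 3 in the table above. The function names are illustrative and do not come from the original analysis, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Strings copied from sample 3 in the table above
reference = "you've been jawin like a lot a old hens"
base_pred = "You've been joined like a lot of old hands."
ft_pred = "you've been drawing like a lot of old hands"

print(f"{wer(reference, base_pred):.6f}")  # 0.444444
print(f"{wer(reference, ft_pred):.6f}")    # 0.333333
```

Note that the base prediction is charged one substitution for "You've" and one for "hands." even though only case and punctuation differ, which suggests the scores were computed on raw, unnormalized text.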

# Samples improved on CER where the fine-tuned transcript is exact (CER = 0).
# Both conditions are built from df_whisper so the boolean masks share one index.
df_perfect_whisper_cer = df_whisper[(df_whisper["cer_base"] > df_whisper["cer_ft"]) & (df_whisper["cer_ft"] == 0)]
print("Number of samples improved:", len(df_perfect_whisper_cer))
# First 10 rows (not rendered: a notebook cell displays only its last expression)
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "cer_base", "cer_ft"]
df_perfect_whisper_cer[display_columns].head(10)
# Samples improved on WER where the fine-tuned transcript is exact (WER = 0)
df_perfect_whisper_wer = df_whisper[(df_whisper["wer_base"] > df_whisper["wer_ft"]) & (df_whisper["wer_ft"] == 0)]
print("Number of samples improved:", len(df_perfect_whisper_wer))
# First 10 rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "wer_base", "wer_ft"]
df_perfect_whisper_wer[display_columns].head(10)

Number of samples improved: 3
Number of samples improved: 3

	sample_idx	reference	base_pred	ft_pred	wer_base
46	46	his little brown eyed girlish faced mother he ...	his little brown-eyed, girlish-faced mother. ...	his little brown eyed girlish faced mother he...	0.239130
78	78	began in the log cabin when he had no idea he ...	began in the log cabin when he had no idea he...	began in the log cabin when he had no idea he...	0.195122
79	79	that the physician alone is called to administ...	that the physician alone is called to adminis...	that the physician alone is called to adminis...	0.147059

df_improved_marian = df_marian[df_marian["bleu_ft"] > df_marian["bleu_base"]]
print("Number of samples improved:", len(df_improved_marian))

# Show the first 10 improved examples (head() returns rows in file order, not ranked by gain)
df_improved_marian[["sample_idx", "src_text", "reference_fr", "base_pred", "ft_pred", "bleu_base", "bleu_ft"]].head(10)

Number of samples improved: 166

	sample_idx	src_text	reference_fr	base_pred	ft_pred	bleu_base	bleu_ft
6	6	PCT/GL/ISPE/1 Page 174 Chapter 19 Examination ...	PCT/GL/ISPE/1 Page 199 Chapitre 19 Procédure d...	PCT/GL/ISPE/1 Page 174 Chapitre 19 Procédure d...	PCT/GL/ISPE/1 Page 174 Chapitre 19 Procédure d...	46.334186	52.512434
11	11	- Makes your eyes wanna tear up.	- Ça fait pleurer.	- Tes yeux veulent se déchirer.	- Ça te donne envie de te déchirer les yeux.	7.809850	8.295194
19	19	If we are to escape from France, we must have ...	Si nous voulons fuir la France, nous devons av...	Si nous voulons échapper à la France, nous dev...	Si nous voulons fuir la France, nous devons av...	69.975223	100.000000
25	25	“In its decision 2004/128, the Commission on H...	“In its decision 2004/128, the Commission on H...	Dans sa décision 2004/128, la Commission des d...	«Dans sa décision 2004/128, la Commission des ...	8.103715	9.242725
27	27	38 500 ð.	35 000 ð.	38 500 ð. . . . . . . . . . . . . . . . . . . ...	38 500 ð.	0.207668	31.947155
31	31	Hey, you.	Hé, toi.	Hé, vous.	Hé, toi.	35.355339	100.000000
49	49	Country office reports indicate a remarkably r...	Source de données.	Les rapports des bureaux de pays indiquent une...	Les rapports des bureaux de pays font état d'u...	0.786056	0.966867
57	57	@EhCherif: Live shots fired by snipers who kil...	@EhCherif: Tirs à balles réelles par snipers q...	@EhCherif: Coups de feu en direct tirés par de...	@EhCherif: Coups de feu en direct tirés par de...	22.325877	29.106249
59	59	Our nostalgic postcard range now numbers over ...	Code: 01CH17. La sélection de cartes que nous ...	Notre sélection de cartes nostalgiques compte ...	Notre sélection de cartes nostalgiques compte ...	30.191579	36.447528
64	64	Acknowledgement The Chair of the Strategic Opt...	Remerciements Le président de la Table de conc...	Remerciements Le président de la Table de conc...	Remerciements Le président de la Table de conc...	56.081407	57.150105
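The count above covers only samples whose BLEU rose; it does not say how many stayed the same or got worse, or how large the average change was. A hedged sketch of a fuller summary follows. The helper name summarize_change and the toy values are hypothetical (the numbers are not taken from marian_ft_results.csv); only the column names bleu_base and bleu_ft come from the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only; NOT taken from marian_ft_results.csv.
toy = pd.DataFrame({
    "sample_idx": [0, 1, 2, 3, 4],
    "bleu_base": [40.0, 55.0, 10.0, 70.0, 30.0],
    "bleu_ft": [52.0, 55.0, 8.0, 100.0, 30.0],
})


def summarize_change(df, base_col, ft_col, higher_is_better=True):
    """Count improved / unchanged / worse samples and return the mean gain."""
    delta = df[ft_col] - df[base_col]
    if not higher_is_better:  # e.g. WER and CER, where lower scores are better
        delta = -delta
    outcome = pd.Series(
        np.select([delta > 0, delta < 0], ["improved", "worse"], default="unchanged"),
        index=df.index,
    )
    counts = outcome.value_counts().reindex(
        ["improved", "unchanged", "worse"], fill_value=0
    )
    return counts, delta.mean()


counts, mean_gain = summarize_change(toy, "bleu_base", "bleu_ft")
print(counts.to_dict())  # {'improved': 2, 'unchanged': 2, 'worse': 1}
print(mean_gain)         # 8.0
```

The same helper could be pointed at the Whisper results with higher_is_better=False, for example summarize_change(df_whisper, "wer_base", "wer_ft", higher_is_better=False).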

df_perfect_marian = df_marian[(df_marian["bleu_ft"] > df_marian["bleu_base"]) & (np.isclose(df_marian["bleu_ft"], 100.0, atol=1e-7))]
print("Number of samples improved:", len(df_perfect_marian))

# Show the first 10 examples where the fine-tuned output reaches BLEU 100
df_perfect_marian[["sample_idx", "src_text", "reference_fr", "base_pred", "ft_pred", "bleu_base", "bleu_ft"]].head(10)

Number of samples improved: 9

	sample_idx	src_text	reference_fr	base_pred	ft_pred	bleu_base	bleu_ft
19	19	If we are to escape from France, we must have ...	Si nous voulons fuir la France, nous devons av...	Si nous voulons échapper à la France, nous dev...	Si nous voulons fuir la France, nous devons av...	69.975223	100.0
31	31	Hey, you.	Hé, toi.	Hé, vous.	Hé, toi.	35.355339	100.0
86	86	Yeah, a dream come true.	Oui, un rêve devenu réalité.	Oui, un rêve devient réalité.	Oui, un rêve devenu réalité.	48.892302	100.0
204	204	Wind gusts 45.6 km/h 33.4 km/h 23.5 km/h 16.8 ...	Raffiche 45.6 km/h 33.4 km/h 23.5 km/h 16.8 km...	rafales de vent 45,6 km/h 33,4 km/h 23,5 km/h ...	Raffiche 45.6 km/h 33.4 km/h 23.5 km/h 16.8 km...	19.212952	100.0
297	297	- You do?	- Ah oui ?	- Vraiment ?	- Ah oui ?	0.000000	100.0
705	705	We're late.	On est en retard.	Nous sommes en retard.	On est en retard.	39.763536	100.0
733	733	Snacklesnaaamps...!	Snacklesnaaamps...!	Des snacklesnaaamps...!	Snacklesnaaamps...!	50.813275	100.0
741	741	Carter Blue Label.	Carter Blue Label.	Label bleu Carter.	Carter Blue Label.	21.022410	100.0
808	808	Call the sheriff.	Appelle le shérif.	Appelez le shérif.	Appelle le shérif.	59.460356	100.0

import tempfile
import zipfile
import os
from datasets import load_from_disk

zip_path = "small_validation_set.zip"

# Create a temporary directory
tmpdirname = tempfile.mkdtemp()
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(tmpdirname)

dataset_path = os.path.join(tmpdirname, "small_validation_set")
small_validation_set = load_from_disk(dataset_path)

print("Loaded small_validation_set with", len(small_validation_set), "samples.")

Loaded small_validation_set with 100 samples.

import difflib
from IPython.display import Audio, display, HTML

def highlight_diff(ref, pred):
    """
    Use difflib.HtmlDiff to highlight token-level differences
    side by side in an HTML table.
    """
    ref_tokens = ref.split()
    pred_tokens = pred.split()
    diff_html = difflib.HtmlDiff().make_table(
        ref_tokens,
        pred_tokens,
        fromdesc='Reference',
        todesc='Prediction',
        context=True,
        numlines=1
    )
    return diff_html

def show_whisper_sample(df, dataset, row_idx=0):
    """
    Display audio playback and text differences for a single row in `df`.
    `dataset` is your loaded small_validation_set (a Dataset object).
    """
    if row_idx >= len(df):
        print("Row index out of range.")
        return

    row = df.iloc[row_idx]
    
    # Convert to a pure Python int, since HF Datasets won't accept np.int64
    sample_idx = int(row["sample_idx"])

    # Retrieve the audio from the dataset
    sample = dataset[sample_idx]
    audio_data = sample["audio"]["array"]

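    # Play back the audio at the sampling rate stored with the sample
    # (16 kHz for this dataset) rather than hard-coding it
    display(Audio(audio_data, rate=sample["audio"]["sampling_rate"]))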

    print(f"Sample index: {sample_idx}")
    print("Reference:", row["reference"])
    print("Base Prediction:", row["base_pred"])
    print("Fine-tuned Prediction:", row["ft_pred"])
    print(f"CER Base: {row['cer_base']:.3f}, CER FT: {row['cer_ft']:.3f}")
    print(f"WER Base: {row['wer_base']:.3f}, WER FT: {row['wer_ft']:.3f}")

    # Highlight diffs side-by-side
    base_diff_html = highlight_diff(row["reference"], row["base_pred"])
    ft_diff_html   = highlight_diff(row["reference"], row["ft_pred"])

    display(HTML("<h4>Reference vs. Base Prediction</h4>" + base_diff_html))
    display(HTML("<h4>Reference vs. Fine-Tuned Prediction</h4>" + ft_diff_html))

df_perfect_whisper_wer

	Unnamed: 0	sample_idx	reference	base_pred	ft_pred	wer_base	cer_base
46	46	46	his little brown eyed girlish faced mother he ...	his little brown-eyed, girlish-faced mother. ...	his little brown eyed girlish faced mother he...	0.239130	0.066079
78	78	78	began in the log cabin when he had no idea he ...	began in the log cabin when he had no idea he...	began in the log cabin when he had no idea he...	0.195122	0.057692
79	79	79	that the physician alone is called to administ...	that the physician alone is called to adminis...	that the physician alone is called to adminis...	0.147059	0.024038

for i in range(len(df_perfect_whisper_wer)):
    show_whisper_sample(df_perfect_whisper_wer, small_validation_set, row_idx=i)

Sample index: 46
Reference: his little brown eyed girlish faced mother he had lived on the homestead until he was twenty he had tilled the broad fields and gone in and out among the people and their life had been his life but his heart was not in his work
Base Prediction:  his little brown-eyed, girlish-faced mother. He had lived on the homestead until he was 20. He had tilled the broad fields and gone in and out among the people, and their life had been his life, but his heart was not in his work.
Fine-tuned Prediction:  his little brown eyed girlish faced mother he had lived on the homestead until he was twenty he had tilled the broad fields and gone in and out among the people and their life had been his life but his heart was not in his work
CER Base: 0.066, CER FT: 0.000
WER Base: 0.239, WER FT: 0.000

Reference vs. Base Prediction

	Reference		Prediction
	2	little		2	little
n	3	brown	n	3	brown-eyed,
	4	eyed		4	girlish-faced
	5	girlish
	6	faced
	7	mother		5	mother.
	8	he		6	He
	9	had		7	had
	16	was		14	was
n	17	twenty	n	15	20.
	18	he		16	He
	19	had		17	had
	30	the		28	the
n	31	people	n	29	people,
	32	and		30	and
	37	his		35	his
n	38	life	n	36	life,
	39	but		37	but
	45	his		43	his
t	46	work	t	44	work.

Reference vs. Fine-Tuned Prediction

	Reference		Prediction
t		No Differences Found	t		No Differences Found

Sample index: 78
Reference: began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion the home of the nation with malice toward none with charity for all was the rule of his life
Base Prediction:  began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion, the home of the nation. With Malice Tordon, with charity for all, was the rule of his life.
Fine-tuned Prediction:  began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion the home of the nation with malice toward none with charity for all was the rule of his life
CER Base: 0.058, CER FT: 0.000
WER Base: 0.195, WER FT: 0.000

Reference vs. Base Prediction

	Reference		Prediction
	21	executive		21	executive
n	22	mansion	n	22	mansion,
	23	the		23	the
	26	the		26	the
n	27	nation	n	27	nation.
	28	with		28	With
	29	malice		29	Malice
	30	toward		30	Tordon,
	31	none
	32	with		31	with
	34	for		33	for
n	35	all	n	34	all,
	36	was		35	was
	40	his		39	his
t	41	life	t	40	life.

Reference vs. Fine-Tuned Prediction

	Reference		Prediction
t		No Differences Found	t		No Differences Found

Sample index: 79
Reference: that the physician alone is called to administer psychotherapeutic work but that he needs a thorough psychological training besides his medical one but the interest of the community is not only a negative one
Base Prediction:  that the physician alone is called to administer psychotherapeutic work, but that he needs a thorough, psychological training besides his medical one. But the interest of the community is not only a negative one.
Fine-tuned Prediction:  that the physician alone is called to administer psychotherapeutic work but that he needs a thorough psychological training besides his medical one but the interest of the community is not only a negative one
CER Base: 0.024, CER FT: 0.000
WER Base: 0.147, WER FT: 0.000

Reference vs. Base Prediction

	Reference		Prediction
	9	psychotherapeutic		9	psychotherapeutic
n	10	work	n	10	work,
	11	but		11	but
	15	a		15	a
n	16	thorough	n	16	thorough,
	17	psychological		17	psychological
	21	medical		21	medical
n	22	one	n	22	one.
	23	but		23	But
	24	the		24	the
	33	negative		33	negative
t	34	one	t	34	one.

Reference vs. Fine-Tuned Prediction

	Reference		Prediction
t		No Differences Found	t		No Differences Found
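For sample 79, every difference between the reference and the base prediction is punctuation or capitalization ("work," vs. "work", "But" vs. "but"), and raw WER counts each one as a word substitution. The sketch below checks this with a deliberately simple normalizer (lowercase, strip punctuation except apostrophes). The normalizer is an assumption for illustration and is not part of the original analysis; it would not reconcile cases such as "20." vs. "twenty" or "brown-eyed" vs. "brown eyed" in sample 46.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation other than apostrophes, and split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


# Strings copied from the sample 79 output above
reference = (
    "that the physician alone is called to administer psychotherapeutic work "
    "but that he needs a thorough psychological training besides his medical one "
    "but the interest of the community is not only a negative one"
)
base_pred = (
    " that the physician alone is called to administer psychotherapeutic work, "
    "but that he needs a thorough, psychological training besides his medical one. "
    "But the interest of the community is not only a negative one."
)

# Position-by-position comparison is valid here only because both
# transcripts contain the same number of words (34).
raw_mismatches = sum(r != p for r, p in zip(reference.split(), base_pred.split()))
norm_mismatches = sum(r != p for r, p in zip(normalize(reference), normalize(base_pred)))

print(raw_mismatches, norm_mismatches)  # 5 0
```

Five mismatches over 34 reference words gives 0.147, matching the WER Base value printed for this sample, so the base model's entire error here would disappear under this normalization.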

def highlight_diff_tokens(s1, s2):
    """
    Return an HTML diff table showing token-level differences between s1 and s2.
    """
    tokens1 = s1.split()
    tokens2 = s2.split()
    diff_html = difflib.HtmlDiff().make_table(
        tokens1, tokens2,
        fromdesc='String 1',
        todesc='String 2',
        context=True,
        numlines=1
    )
    return diff_html

def show_marian_sample(df, row_idx=0):
    if row_idx >= len(df):
        print("Row index out of range.")
        return
    
    row = df.iloc[row_idx]
    print(f"Sample index: {row['sample_idx']}")
    print("Source (EN)     :", row["src_text"])
    print("Reference (FR)  :", row["reference_fr"])
    print("Base Prediction :", row["base_pred"])
    print("Fine-tuned Pred :", row["ft_pred"])
    print(f"BLEU Base: {row['bleu_base']:.2f}, BLEU FT: {row['bleu_ft']:.2f}")

    # Diff the reference against both the base and the fine-tuned prediction
    base_diff_html = highlight_diff_tokens(row["reference_fr"], row["base_pred"])
    ft_diff_html   = highlight_diff_tokens(row["reference_fr"], row["ft_pred"])

    display(HTML("<h4>Reference vs. Base Prediction</h4>" + base_diff_html))
    display(HTML("<h4>Reference vs. Fine-Tuned Prediction</h4>" + ft_diff_html))

show_marian_sample(df_perfect_marian, row_idx=0)

Sample index: 19
Source (EN)     : If we are to escape from France, we must have faith!
Reference (FR)  : Si nous voulons fuir la France, nous devons avoir la foi !
Base Prediction : Si nous voulons échapper à la France, nous devons avoir la foi !
Fine-tuned Pred : Si nous voulons fuir la France, nous devons avoir la foi !
BLEU Base: 69.98, BLEU FT: 100.00
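The HTML diff tables produced by show_marian_sample only render inside a notebook. As a plain-text alternative, the sketch below lists the changed token spans with difflib.SequenceMatcher, using the strings printed for sample 19. The function name token_changes is hypothetical and not part of the original notebook.

```python
import difflib


def token_changes(reference, prediction):
    """Return (operation, reference_span, prediction_span) for each non-matching span."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(ref_tokens[i1:i2]), " ".join(pred_tokens[j1:j2]))
            )
    return changes


# Strings copied from the sample 19 output above
reference = "Si nous voulons fuir la France, nous devons avoir la foi !"
base_pred = "Si nous voulons échapper à la France, nous devons avoir la foi !"
ft_pred = "Si nous voulons fuir la France, nous devons avoir la foi !"

print(token_changes(reference, base_pred))  # [('replace', 'fuir', 'échapper à')]
print(token_changes(reference, ft_pred))    # []
```

The base model's only deviation is the choice of "échapper à" over "fuir", which is enough to pull its sentence-level BLEU down to 69.98.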