Performance of Fine-Tuned Models

Author

Srinivas Sundar

Published

February 16, 2025

!pip install librosa soundfile
import pandas as pd
import numpy as np

# Per-sample evaluation results for the base and fine-tuned models
df_whisper = pd.read_csv('whisper_ft_results.csv')
df_marian = pd.read_csv('marian_ft_results.csv')
# Samples where fine-tuning lowered the character error rate (CER)
df_improved_whisper_cer = df_whisper[df_whisper["cer_base"] > df_whisper["cer_ft"]]
print("Number of samples improved:", len(df_improved_whisper_cer))
# Display the top rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "cer_base", "cer_ft"]
df_improved_whisper_cer[display_columns].head(10)
# Samples where fine-tuning lowered the word error rate (WER)
df_improved_whisper_wer = df_whisper[df_whisper["wer_base"] > df_whisper["wer_ft"]]
print("Number of samples improved:", len(df_improved_whisper_wer))
# Display the top rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "wer_base", "wer_ft"]
df_improved_whisper_wer[display_columns].head(10)
Number of samples improved: 57
Number of samples improved: 60
| | sample_idx | reference | base_pred | ft_pred | wer_base | wer_ft |
|---|---|---|---|---|---|---|
| 1 | 1 | and fancied his countenance was not altogether... | And fancy it his countenance was not all toge... | and fancy it his countenance was now altogeth... | 0.351351 | 0.297297 |
| 3 | 3 | you've been jawin like a lot a old hens | You've been joined like a lot of old hands. | you've been drawing like a lot of old hands | 0.444444 | 0.333333 |
| 4 | 4 | let us finally confess it that what is most di... | Let us finally confess it that what is most d... | Let us finally confess it that what is most d... | 0.129032 | 0.096774 |
| 6 | 6 | the tireless machines marched back and forth a... | The tireless machines march back and forth ac... | The tireless machines march back and forth ac... | 0.142857 | 0.071429 |
| 7 | 7 | before the middle of the day they were visited... | Before the middle of the day there were visit... | Before the middle of the day there were visit... | 0.340000 | 0.300000 |
| 9 | 9 | alarmed but not discouraged she tried it anoth... | alarmed but not discouraged. She tried it ano... | Alarmed but not discouraged, she tried it ano... | 0.212121 | 0.151515 |
| 11 | 11 | but at least it was obvious that some one must... | But at least it was obvious that someone must... | but at least it was obvious that someone must... | 0.222222 | 0.155556 |
| 13 | 13 | they began to fell trees for the timbers of th... | They began to felt trees for the timbers of t... | They began to fell trees for the timbers of t... | 0.200000 | 0.150000 |
| 14 | 14 | it's a technical problem of the exigencies of ... | It's a technical problem of the exigenesis of... | It's a technical problem of the exigenacies o... | 0.425000 | 0.375000 |
| 15 | 15 | and in their fury the women fell upon him deal... | and in their fury the women fell upon him, de... | and in their fury the women fell upon him dea... | 0.225000 | 0.050000 |
# Samples that improved on CER and that the fine-tuned model transcribed perfectly (CER = 0).
# Both conditions are built from df_whisper so the mask index matches the frame being filtered.
df_perfect_whisper_cer = df_whisper[(df_whisper["cer_base"] > df_whisper["cer_ft"]) & (df_whisper["cer_ft"] == 0)]
print("Number of samples improved:", len(df_perfect_whisper_cer))
# Display the top rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "cer_base", "cer_ft"]
df_perfect_whisper_cer[display_columns].head(10)
# Samples that improved on WER and that the fine-tuned model transcribed perfectly (WER = 0)
df_perfect_whisper_wer = df_whisper[(df_whisper["wer_base"] > df_whisper["wer_ft"]) & (df_whisper["wer_ft"] == 0)]
print("Number of samples improved:", len(df_perfect_whisper_wer))
# Display the top rows
display_columns = ["sample_idx", "reference", "base_pred", "ft_pred", "wer_base", "wer_ft"]
df_perfect_whisper_wer[display_columns].head(10)
Number of samples improved: 3
Number of samples improved: 3
| | sample_idx | reference | base_pred | ft_pred | wer_base | wer_ft |
|---|---|---|---|---|---|---|
| 46 | 46 | his little brown eyed girlish faced mother he ... | his little brown-eyed, girlish-faced mother. ... | his little brown eyed girlish faced mother he... | 0.239130 | 0.0 |
| 78 | 78 | began in the log cabin when he had no idea he ... | began in the log cabin when he had no idea he... | began in the log cabin when he had no idea he... | 0.195122 | 0.0 |
| 79 | 79 | that the physician alone is called to administ... | that the physician alone is called to adminis... | that the physician alone is called to adminis... | 0.147059 | 0.0 |
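The notebook does not show how the `wer_*` columns were computed, so the following is only a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. Applied to the raw strings of sample 79, it reproduces the reported base WER of 0.147059, which suggests the metric was computed without text normalization. The `normalize` helper is a hypothetical addition (lowercasing and stripping ASCII punctuation) and is not part of the original evaluation.

```python
import string


def edit_distance(ref_seq, hyp_seq):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp_seq) + 1))
    for i, r in enumerate(ref_seq, start=1):
        curr = [i]
        for j, h in enumerate(hyp_seq, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Hypothetical normalizer: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def wer(reference, hypothesis, normalized=False):
    """Word error rate: word-level edit distance / number of reference words."""
    if normalized:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Reference and base prediction for sample 79, copied from the results
reference = ("that the physician alone is called to administer psychotherapeutic work "
             "but that he needs a thorough psychological training besides his medical one "
             "but the interest of the community is not only a negative one")
base_pred = (" that the physician alone is called to administer psychotherapeutic work, "
             "but that he needs a thorough, psychological training besides his medical one. "
             "But the interest of the community is not only a negative one.")

print(round(wer(reference, base_pred), 6))          # 0.147059
print(wer(reference, base_pred, normalized=True))   # 0.0
```

With normalization applied, the base prediction for this sample matches the reference exactly, so all five of its word errors come from punctuation and casing rather than misrecognized words.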
df_improved_marian = df_marian[df_marian["bleu_ft"] > df_marian["bleu_base"]]
print("Number of samples improved:", len(df_improved_marian))

# Show the top 10 improved examples
df_improved_marian[["sample_idx", "src_text", "reference_fr", "base_pred", "ft_pred", "bleu_base", "bleu_ft"]].head(10)
Number of samples improved: 166
| | sample_idx | src_text | reference_fr | base_pred | ft_pred | bleu_base | bleu_ft |
|---|---|---|---|---|---|---|---|
| 6 | 6 | PCT/GL/ISPE/1 Page 174 Chapter 19 Examination ... | PCT/GL/ISPE/1 Page 199 Chapitre 19 Procédure d... | PCT/GL/ISPE/1 Page 174 Chapitre 19 Procédure d... | PCT/GL/ISPE/1 Page 174 Chapitre 19 Procédure d... | 46.334186 | 52.512434 |
| 11 | 11 | - Makes your eyes wanna tear up. | - Ça fait pleurer. | - Tes yeux veulent se déchirer. | - Ça te donne envie de te déchirer les yeux. | 7.809850 | 8.295194 |
| 19 | 19 | If we are to escape from France, we must have ... | Si nous voulons fuir la France, nous devons av... | Si nous voulons échapper à la France, nous dev... | Si nous voulons fuir la France, nous devons av... | 69.975223 | 100.000000 |
| 25 | 25 | “In its decision 2004/128, the Commission on H... | “In its decision 2004/128, the Commission on H... | Dans sa décision 2004/128, la Commission des d... | «Dans sa décision 2004/128, la Commission des ... | 8.103715 | 9.242725 |
| 27 | 27 | 38 500 ð. | 35 000 ð. | 38 500 ð. . . . . . . . . . . . . . . . . . . ... | 38 500 ð. | 0.207668 | 31.947155 |
| 31 | 31 | Hey, you. | Hé, toi. | Hé, vous. | Hé, toi. | 35.355339 | 100.000000 |
| 49 | 49 | Country office reports indicate a remarkably r... | Source de données. | Les rapports des bureaux de pays indiquent une... | Les rapports des bureaux de pays font état d'u... | 0.786056 | 0.966867 |
| 57 | 57 | @EhCherif: Live shots fired by snipers who kil... | @EhCherif: Tirs à balles réelles par snipers q... | @EhCherif: Coups de feu en direct tirés par de... | @EhCherif: Coups de feu en direct tirés par de... | 22.325877 | 29.106249 |
| 59 | 59 | Our nostalgic postcard range now numbers over ... | Code: 01CH17. La sélection de cartes que nous ... | Notre sélection de cartes nostalgiques compte ... | Notre sélection de cartes nostalgiques compte ... | 30.191579 | 36.447528 |
| 64 | 64 | Acknowledgement The Chair of the Strategic Opt... | Remerciements Le président de la Table de conc... | Remerciements Le président de la Table de conc... | Remerciements Le président de la Table de conc... | 56.081407 | 57.150105 |
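The filter above counts only the samples whose score improved. To see how many stayed the same or regressed as well, a small helper along these lines could be used. This is a sketch, not part of the original analysis: `summarize_change`, its `tol` parameter, and the toy scores are hypothetical names and values introduced purely for illustration.

```python
def summarize_change(base_scores, ft_scores, higher_is_better=True, tol=1e-9):
    """Count samples that improved, stayed the same, or regressed after fine-tuning."""
    counts = {"improved": 0, "unchanged": 0, "regressed": 0}
    for base, ft in zip(base_scores, ft_scores):
        # Positive gain always means "fine-tuned is better", whatever the metric direction
        gain = (ft - base) if higher_is_better else (base - ft)
        if gain > tol:
            counts["improved"] += 1
        elif gain < -tol:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Toy scores for illustration only, not taken from the results files
toy_bleu_base = [35.0, 50.0, 12.5, 80.0]
toy_bleu_ft = [100.0, 50.0, 10.0, 85.0]
print(summarize_change(toy_bleu_base, toy_bleu_ft))
# {'improved': 2, 'unchanged': 1, 'regressed': 1}
```

Because the function only iterates over paired values, it should also accept DataFrame columns, for example `summarize_change(df_marian["bleu_base"], df_marian["bleu_ft"])`, or `summarize_change(df_whisper["wer_base"], df_whisper["wer_ft"], higher_is_better=False)` for error rates where lower is better.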
df_perfect_marian = df_marian[(df_marian["bleu_ft"] > df_marian["bleu_base"]) & (np.isclose(df_marian["bleu_ft"], 100.0, atol=1e-7))]
print("Number of samples improved:", len(df_perfect_marian))

# Show the top 10 improved examples
df_perfect_marian[["sample_idx", "src_text", "reference_fr", "base_pred", "ft_pred", "bleu_base", "bleu_ft"]].head(10)
Number of samples improved: 9
| | sample_idx | src_text | reference_fr | base_pred | ft_pred | bleu_base | bleu_ft |
|---|---|---|---|---|---|---|---|
| 19 | 19 | If we are to escape from France, we must have ... | Si nous voulons fuir la France, nous devons av... | Si nous voulons échapper à la France, nous dev... | Si nous voulons fuir la France, nous devons av... | 69.975223 | 100.0 |
| 31 | 31 | Hey, you. | Hé, toi. | Hé, vous. | Hé, toi. | 35.355339 | 100.0 |
| 86 | 86 | Yeah, a dream come true. | Oui, un rêve devenu réalité. | Oui, un rêve devient réalité. | Oui, un rêve devenu réalité. | 48.892302 | 100.0 |
| 204 | 204 | Wind gusts 45.6 km/h 33.4 km/h 23.5 km/h 16.8 ... | Raffiche 45.6 km/h 33.4 km/h 23.5 km/h 16.8 km... | rafales de vent 45,6 km/h 33,4 km/h 23,5 km/h ... | Raffiche 45.6 km/h 33.4 km/h 23.5 km/h 16.8 km... | 19.212952 | 100.0 |
| 297 | 297 | - You do? | - Ah oui ? | - Vraiment ? | - Ah oui ? | 0.000000 | 100.0 |
| 705 | 705 | We're late. | On est en retard. | Nous sommes en retard. | On est en retard. | 39.763536 | 100.0 |
| 733 | 733 | Snacklesnaaamps...! | Snacklesnaaamps...! | Des snacklesnaaamps...! | Snacklesnaaamps...! | 50.813275 | 100.0 |
| 741 | 741 | Carter Blue Label. | Carter Blue Label. | Label bleu Carter. | Carter Blue Label. | 21.022410 | 100.0 |
| 808 | 808 | Call the sheriff. | Appelle le shérif. | Appelez le shérif. | Appelle le shérif. | 59.460356 | 100.0 |
import tempfile
import zipfile
import os
from datasets import load_from_disk

zip_path = "small_validation_set.zip"

# Create a temporary directory
tmpdirname = tempfile.mkdtemp()
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(tmpdirname)

dataset_path = os.path.join(tmpdirname, "small_validation_set")
small_validation_set = load_from_disk(dataset_path)

print("Loaded small_validation_set with", len(small_validation_set), "samples.")
Loaded small_validation_set with 100 samples.
import difflib
from IPython.display import Audio, display, HTML

def highlight_diff(ref, pred):
    """
    Use difflib.HtmlDiff to highlight token-level differences
    side by side in an HTML table.
    """
    ref_tokens = ref.split()
    pred_tokens = pred.split()
    diff_html = difflib.HtmlDiff().make_table(
        ref_tokens,
        pred_tokens,
        fromdesc='Reference',
        todesc='Prediction',
        context=True,
        numlines=1
    )
    return diff_html

def show_whisper_sample(df, dataset, row_idx=0):
    """
    Display audio playback and text differences for a single row in `df`.
    `dataset` is your loaded small_validation_set (a Dataset object).
    """
    if row_idx >= len(df):
        print("Row index out of range.")
        return

    row = df.iloc[row_idx]
    
    # Convert to a pure Python int, since HF Datasets won't accept np.int64
    sample_idx = int(row["sample_idx"])

    # Retrieve the audio from the dataset
    sample = dataset[sample_idx]
    audio_data = sample["audio"]["array"]

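    # Play back the audio at the sampling rate stored with the sample
    # (expected to be 16 kHz for this dataset) instead of hard-coding it
    display(Audio(audio_data, rate=sample["audio"]["sampling_rate"]))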

    print(f"Sample index: {sample_idx}")
    print("Reference:", row["reference"])
    print("Base Prediction:", row["base_pred"])
    print("Fine-tuned Prediction:", row["ft_pred"])
    print(f"CER Base: {row['cer_base']:.3f}, CER FT: {row['cer_ft']:.3f}")
    print(f"WER Base: {row['wer_base']:.3f}, WER FT: {row['wer_ft']:.3f}")

    # Highlight diffs side-by-side
    base_diff_html = highlight_diff(row["reference"], row["base_pred"])
    ft_diff_html   = highlight_diff(row["reference"], row["ft_pred"])

    display(HTML("<h4>Reference vs. Base Prediction</h4>" + base_diff_html))
    display(HTML("<h4>Reference vs. Fine-Tuned Prediction</h4>" + ft_diff_html))
df_perfect_whisper_wer
| | Unnamed: 0 | sample_idx | reference | base_pred | ft_pred | wer_base | cer_base | wer_ft | cer_ft |
|---|---|---|---|---|---|---|---|---|---|
| 46 | 46 | 46 | his little brown eyed girlish faced mother he ... | his little brown-eyed, girlish-faced mother. ... | his little brown eyed girlish faced mother he... | 0.239130 | 0.066079 | 0.0 | 0.0 |
| 78 | 78 | 78 | began in the log cabin when he had no idea he ... | began in the log cabin when he had no idea he... | began in the log cabin when he had no idea he... | 0.195122 | 0.057692 | 0.0 | 0.0 |
| 79 | 79 | 79 | that the physician alone is called to administ... | that the physician alone is called to adminis... | that the physician alone is called to adminis... | 0.147059 | 0.024038 | 0.0 | 0.0 |
for i in range(len(df_perfect_whisper_wer)):
    show_whisper_sample(df_perfect_whisper_wer, small_validation_set, row_idx=i)
Sample index: 46
Reference: his little brown eyed girlish faced mother he had lived on the homestead until he was twenty he had tilled the broad fields and gone in and out among the people and their life had been his life but his heart was not in his work
Base Prediction:  his little brown-eyed, girlish-faced mother. He had lived on the homestead until he was 20. He had tilled the broad fields and gone in and out among the people, and their life had been his life, but his heart was not in his work.
Fine-tuned Prediction:  his little brown eyed girlish faced mother he had lived on the homestead until he was twenty he had tilled the broad fields and gone in and out among the people and their life had been his life but his heart was not in his work
CER Base: 0.066, CER FT: 0.000
WER Base: 0.239, WER FT: 0.000

Reference vs. Base Prediction


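| Ref. line | Reference | Pred. line | Prediction |
|---|---|---|---|
| 2 | little | 2 | little |
| 3 | brown | 3 | brown-eyed, |
| 4 | eyed | 4 | girlish-faced |
| 5 | girlish | | |
| 6 | faced | | |
| 7 | mother | 5 | mother. |
| 8 | he | 6 | He |
| 9 | had | 7 | had |
| 16 | was | 14 | was |
| 17 | twenty | 15 | 20. |
| 18 | he | 16 | He |
| 19 | had | 17 | had |
| 30 | the | 28 | the |
| 31 | people | 29 | people, |
| 32 | and | 30 | and |
| 37 | his | 35 | his |
| 38 | life | 36 | life, |
| 39 | but | 37 | but |
| 45 | his | 43 | his |
| 46 | work | 44 | work. |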

Reference vs. Fine-Tuned Prediction


| Reference | Prediction |
|---|---|
| No Differences Found | No Differences Found |
Sample index: 78
Reference: began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion the home of the nation with malice toward none with charity for all was the rule of his life
Base Prediction:  began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion, the home of the nation. With Malice Tordon, with charity for all, was the rule of his life.
Fine-tuned Prediction:  began in the log cabin when he had no idea he could ever be exercising his loving kindness in the executive mansion the home of the nation with malice toward none with charity for all was the rule of his life
CER Base: 0.058, CER FT: 0.000
WER Base: 0.195, WER FT: 0.000

Reference vs. Base Prediction


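| Ref. line | Reference | Pred. line | Prediction |
|---|---|---|---|
| 21 | executive | 21 | executive |
| 22 | mansion | 22 | mansion, |
| 23 | the | 23 | the |
| 26 | the | 26 | the |
| 27 | nation | 27 | nation. |
| 28 | with | 28 | With |
| 29 | malice | 29 | Malice |
| 30 | toward | 30 | Tordon, |
| 31 | none | | |
| 32 | with | 31 | with |
| 34 | for | 33 | for |
| 35 | all | 34 | all, |
| 36 | was | 35 | was |
| 40 | his | 39 | his |
| 41 | life | 40 | life. |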

Reference vs. Fine-Tuned Prediction


| Reference | Prediction |
|---|---|
| No Differences Found | No Differences Found |
Sample index: 79
Reference: that the physician alone is called to administer psychotherapeutic work but that he needs a thorough psychological training besides his medical one but the interest of the community is not only a negative one
Base Prediction:  that the physician alone is called to administer psychotherapeutic work, but that he needs a thorough, psychological training besides his medical one. But the interest of the community is not only a negative one.
Fine-tuned Prediction:  that the physician alone is called to administer psychotherapeutic work but that he needs a thorough psychological training besides his medical one but the interest of the community is not only a negative one
CER Base: 0.024, CER FT: 0.000
WER Base: 0.147, WER FT: 0.000

Reference vs. Base Prediction


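| Ref. line | Reference | Pred. line | Prediction |
|---|---|---|---|
| 9 | psychotherapeutic | 9 | psychotherapeutic |
| 10 | work | 10 | work, |
| 11 | but | 11 | but |
| 15 | a | 15 | a |
| 16 | thorough | 16 | thorough, |
| 17 | psychological | 17 | psychological |
| 21 | medical | 21 | medical |
| 22 | one | 22 | one. |
| 23 | but | 23 | But |
| 24 | the | 24 | the |
| 33 | negative | 33 | negative |
| 34 | one | 34 | one. |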

Reference vs. Fine-Tuned Prediction


| Reference | Prediction |
|---|---|
| No Differences Found | No Differences Found |
def highlight_diff_tokens(s1, s2):
    """
    Return an HTML diff table showing token-level differences between s1 and s2.
    """
    tokens1 = s1.split()
    tokens2 = s2.split()
    diff_html = difflib.HtmlDiff().make_table(
        tokens1, tokens2,
        fromdesc='String 1',
        todesc='String 2',
        context=True,
        numlines=1
    )
    return diff_html

def show_marian_sample(df, row_idx=0):
    if row_idx >= len(df):
        print("Row index out of range.")
        return
    
    row = df.iloc[row_idx]
    print(f"Sample index: {row['sample_idx']}")
    print("Source (EN)     :", row["src_text"])
    print("Reference (FR)  :", row["reference_fr"])
    print("Base Prediction :", row["base_pred"])
    print("Fine-tuned Pred :", row["ft_pred"])
    print(f"BLEU Base: {row['bleu_base']:.2f}, BLEU FT: {row['bleu_ft']:.2f}")

    # Compare the reference against the base and fine-tuned predictions
    base_diff_html = highlight_diff_tokens(row["reference_fr"], row["base_pred"])
    ft_diff_html   = highlight_diff_tokens(row["reference_fr"], row["ft_pred"])

    display(HTML("<h4>Reference vs. Base Prediction</h4>" + base_diff_html))
    display(HTML("<h4>Reference vs. Fine-Tuned Prediction</h4>" + ft_diff_html))
show_marian_sample(df_perfect_marian, row_idx=0)
Sample index: 19
Source (EN)     : If we are to escape from France, we must have faith!
Reference (FR)  : Si nous voulons fuir la France, nous devons avoir la foi !
Base Prediction : Si nous voulons échapper à la France, nous devons avoir la foi !
Fine-tuned Pred : Si nous voulons fuir la France, nous devons avoir la foi !
BLEU Base: 69.98, BLEU FT: 100.00

Reference vs. Base Prediction


| Line (String 1) | String 1 | Line (String 2) | String 2 |
|---|---|---|---|
| 3 | voulons | 3 | voulons |
| 4 | fuir | 4 | échapper |
| | | 5 | à |
| 5 | la | 6 | la |

Reference vs. Fine-Tuned Prediction


| String 1 | String 2 |
|---|---|
| No Differences Found | No Differences Found |
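The `HtmlDiff` tables only render where HTML output is available. For plain-text contexts such as logs or exported notebooks, a word-level diff can be sketched with `difflib.SequenceMatcher` instead. This is an illustrative alternative, not the method used above, and `word_diff` is a hypothetical helper name.

```python
import difflib


def word_diff(reference, prediction):
    """Return the non-matching word spans between a reference and a prediction."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", or "insert"
            changes.append((tag, ref_tokens[i1:i2], pred_tokens[j1:j2]))
    return changes


# Reference and base prediction for sample 19, copied from the output above
reference = "Si nous voulons fuir la France, nous devons avoir la foi !"
base_pred = "Si nous voulons échapper à la France, nous devons avoir la foi !"

for tag, ref_part, pred_part in word_diff(reference, base_pred):
    print(f"{tag}: {' '.join(ref_part)!r} -> {' '.join(pred_part)!r}")
# replace: 'fuir' -> 'échapper à'
```

An empty list means the two strings match word for word, which corresponds to the "No Differences Found" rows in the HTML tables.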