[LLM] Designing a System That Performs STT to TTS During Video Conferences - 1. Implementing Real-Time STT with the OpenAI API 'Whisper-1'
.23 2025. 4. 18. 02:17
🗣️ STT?
Speech-To-Text: the task of converting speech into text.
As if anyone didn't know that.
As part of a 'trouble-handling support system between headquarters and local factories', I ended up in charge of designing and implementing the real-time STT - translation - TTS pipeline for video conferences (honestly, STT looked fun, so I volunteered). I wanted to start small with the 'real-time STT system', so I designed a simple (1,823,641 lines of code) program.
Model used
The training program I am currently attending provides a key for the OpenAI API (SKALA rocks), so among the STT models I went with 'whisper-1', which is inexpensive, fairly stable, and has plenty of coding references.
Whisper-1 official API docs: https://platform.openai.com/docs/models/whisper-1
I knew that Google, Papago, AWS, and others offer various APIs for STT,
and I had picked up bits and pieces from others about other OpenAI models too, such as the relatively recent 'GPT-4o-Audio'..
Anyway, I wanted to make the most of an environment that practically spoon-feeds me OpenAI API usage,
and taking 'multilingual support' + 'real-time performance' + 'a local test environment that is far from smooth' all into account,
my plan was to use whichever worked better: a lightweight open-source model, or an OpenAI API model that makes full use of the environment I was given.
However..
GPT-4o-Audio's tokens were, first of all, absurdly expensive.
That one stung.
opensource whisper https://github.com/openai/whisper
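With real-time performance in mind I also tried the open-source whisper models, but everything from 'small' upward was slow and nearly blew up my computer (M1 Pro, sob), while 'base' was far too inaccurate.
Midway through the project I had actually done an STT-to-TTS exercise with the whisper base model...
but that was tested at home, when it was quiet..
When I tested it in the training environment, with a fair amount of background noise, it failed to produce Korean and veered off into some strange language instead.
Another option was faster_whisper, a lighter-weight version..
but, perhaps because I have no GPU, it was slower than calling the API (around 30 seconds) + I honestly thought my laptop was going to explode.
This is what passes for STT?
The API was the only option that produced reasonably human-sounding text, so.. in the end, Whisper-1 wins.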
Goal
The project itself runs inside video conferences, so the STT system has to be built around the constraints a video conference imposes.
Because it is a video conference, what matters most is real-time behavior 'regardless of utterance length'.
Given the latency of the translation and TTS steps that follow, I decided that waiting until the speaker has finished and then sending the whole sentence to the server would be pointless, so I set out to build a system that works as follows:
1. Use multithreading, splitting the task that collects audio and the task that transcribes it into separate threads.
2. The audio thread pushes the recorded data, in chunks of 0.5 to 1 second, onto the queue that feeds transcription.
3. As soon as data arrives in the transcription queue, it is transcribed.
4. When the speaker finishes, that is, when silence between sentences is detected, the transcription accumulated so far is treated as one sentence and handed to the subsequent translation task.
That was the plan I used to lay out the skeleton of the code,
and with (a lot of) help from GPT I got as far as code that shows the result in the console.
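The four steps above can be sketched roughly as follows. This is only a minimal illustration with no real audio: `fake_transcribe`, `SILENCE`, `STOP`, and `run_pipeline` are hypothetical names made up for this sketch, and each "chunk" is simply a piece of text standing in for recorded audio.

```python
import queue
import threading

SILENCE = None   # hypothetical marker an audio thread would emit after a pause
STOP = object()  # sentinel telling the worker to finish


def fake_transcribe(chunk: str) -> str:
    """Stand-in for a real STT call; here a chunk is already text."""
    return chunk


def transcription_worker(chunks: queue.Queue, sentences: list) -> None:
    """Transcribe chunks as they arrive and flush one sentence per silence."""
    parts = []
    while True:
        chunk = chunks.get()
        if chunk is STOP:
            break
        if chunk is SILENCE:
            # Step 4: silence closes the sentence collected so far
            if parts:
                sentences.append(" ".join(parts))
                parts = []
            continue
        text = fake_transcribe(chunk)  # Step 3: transcribe immediately
        if text:
            parts.append(text)
    if parts:  # flush whatever is left when the stream ends
        sentences.append(" ".join(parts))


def run_pipeline(stream) -> list:
    """Steps 1 and 2: a producer feeds chunks to a separate worker thread."""
    chunks = queue.Queue()
    sentences = []
    worker = threading.Thread(target=transcription_worker, args=(chunks, sentences))
    worker.start()
    for item in stream:  # plays the role of the audio collection thread
        chunks.put(item)
    chunks.put(STOP)
    worker.join()
    return sentences
```

Calling `run_pipeline(["hello", "world", None, "next", "one"])` yields two sentences, split at the silence marker. In a real system each finished sentence would go to the translation task instead of a list.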
Code
main
if __name__ == "__main__":
    try:
        clear_screen()
        t1 = threading.Thread(target=audio_collection_thread)
        t2 = threading.Thread(target=stt_processing_thread)
        # Daemon threads are stopped automatically when the main thread exits
        t1.daemon = True
        t2.daemon = True
        t1.start()
        t2.start()
        update_captions()
        # Keep the main thread alive until Ctrl+C
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        clear_screen()
        print("\nShutting down...")
        time.sleep(0.5)
        print("Shutdown complete")
As described in step 1, to get real-time behavior I implemented one function that collects audio, audio_collection_thread, and another that performs the transcription (STT), stt_processing_thread.
Both threads are marked as daemon threads so that they stop along with the program when it exits,
and until then the main thread simply sleeps in 0.1-second steps while the two threads carry out the real-time STT.
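Daemon threads are cut off abruptly when the program exits. If a worker needs to finish its current step first, one common alternative is a `threading.Event` used as a stop flag. The sketch below is an assumption-laden illustration, not the code used in this post; `worker` and `run_for` are hypothetical names.

```python
import threading
import time


def worker(stop_event: threading.Event, ticks: list) -> None:
    """Loop until asked to stop, then return cleanly."""
    while not stop_event.is_set():
        ticks.append(time.monotonic())  # stand-in for one unit of real work
        stop_event.wait(0.01)           # sleeps, but wakes early once the event is set


def run_for(seconds: float) -> tuple:
    """Run the worker for a while, then shut it down and report the outcome."""
    stop_event = threading.Event()
    ticks = []
    t = threading.Thread(target=worker, args=(stop_event, ticks), daemon=True)
    t.start()
    time.sleep(seconds)
    stop_event.set()     # ask the worker to stop
    t.join(timeout=1.0)  # wait for it to finish its current iteration
    return t.is_alive(), len(ticks)
```

Keeping `daemon=True` as well means the program can still exit even if the worker ignores the stop flag for some reason.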
stt_processing_thread
# STT processing thread (latest OpenAI Whisper API version)
def stt_processing_thread():
    global current_caption
    buffer = np.zeros((0, 1), dtype=np.float32)
    max_buffer_size = samplerate * 5    # cap the buffer at 5 seconds of audio
    chunk_size = int(samplerate * 3.0)  # transcribe 3 seconds at a time
    try:
        while True:
            try:
                data = audio_queue.get(timeout=1)
                buffer = np.concatenate((buffer, data), axis=0)
                if len(buffer) > max_buffer_size:
                    buffer = buffer[-max_buffer_size:]
                if len(buffer) >= chunk_size:
                    # Write the chunk to a temporary .wav file and send it to whisper-1
                    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                        wav_path = f.name
                    try:
                        sf.write(wav_path, buffer[:chunk_size], samplerate)
                        with open(wav_path, "rb") as audio_file:
                            response = client.audio.transcriptions.create(
                                model="whisper-1",
                                file=audio_file,
                                language="ko",
                                # Korean for: "This is a meeting. Write down what is being said clearly."
                                prompt="회의 중입니다. 또박또박 말하는 내용을 받아적어.",
                            )
                    finally:
                        os.unlink(wav_path)  # always remove the temporary file
                    text = response.text.strip()
                    if text:
                        with caption_lock:
                            # Start a new caption if the previous one ended a sentence
                            if not current_caption or text[0].isupper() or any(
                                current_caption.endswith(p)
                                for p in ['.', '!', '?', '。', '！', '？']
                            ):
                                if current_caption:
                                    caption_history.append(current_caption)
                                current_caption = text
                            else:
                                current_caption += " " + text
                        update_captions()
                    buffer = np.zeros((0, 1), dtype=np.float32)
                audio_queue.task_done()
            except queue.Empty:
                continue
    except KeyboardInterrupt:
        pass
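To be honest, this version does not detect silence at all. It keeps the audio stream open,
keeps appending chunks to the buffer variable (buffer) until it holds (sample rate * 3) samples, that is, 3 seconds of audio, and then transcribes that. (To prevent overflow the maximum size is declared as 5 seconds of audio, and the buffer is flushed every time 3 seconds have been collected.)
The transcription itself works like this:
create a temporary .wav file and pass that audio to the 'whisper-1' model,
store only the text part of the model response in text,
and delete the temporary file.
Easy, right?
The code below that acquires the lock that guards the caption state, shared with the output code, and appends the transcription result to the caption text.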
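As a side note, the temporary file can be avoided by building the WAV data in memory. The sketch below uses only the standard library and is not the code used in this post; `float_to_wav_bytes` is a hypothetical helper, and whether the resulting bytes can be handed directly to the transcription call depends on the client library, so check its documentation first.

```python
import io
import struct
import wave


def float_to_wav_bytes(samples, samplerate=16000):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))                # clip out-of-range values
        pcm += struct.pack("<h", int(s * 32767))  # little-endian signed 16-bit
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes per sample
        wf.setframerate(samplerate)
        wf.writeframes(bytes(pcm))
    return buf.getvalue()
```

For a NumPy buffer like the one above, something like `float_to_wav_bytes(buffer[:chunk_size, 0].tolist())` would produce the bytes for one 3-second chunk.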
audio_collection_thread
# Audio collection thread
def audio_collection_thread():
    try:
        # The stream calls audio_callback for every block while it is open
        with sd.InputStream(samplerate=samplerate, channels=1,
                            callback=audio_callback, blocksize=block_size):
            print("🎙️ Starting real-time STT... please wait a moment.")
            while True:
                time.sleep(0.1)
    except Exception as e:
        print(f"Audio stream error: {e}", file=sys.stderr)
    except KeyboardInterrupt:
        pass
This one is really..
just a function that opens the audio stream and records.
I was working on a Mac,
and getting sounddevice to find the input device was the hardest part.
Full code
import os
import queue
import sys
import tempfile
import threading
import time
from collections import deque

import numpy as np
import sounddevice as sd
import soundfile as sf
from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from .env; OpenAI() reads it from the environment
load_dotenv()

# Create the OpenAI Whisper API client
client = OpenAI()

# Queue setup
audio_queue = queue.Queue()

# Audio settings
samplerate = 16000
block_size = 4000  # 0.25 seconds of audio

# Caption state
caption_history = deque(maxlen=5)  # keep the 5 most recent sentences
current_caption = ""
caption_lock = threading.Lock()


# Audio callback (time_info is unused; named so it does not shadow the time module)
def audio_callback(indata, frames, time_info, status):
    if status:
        print(f"Status: {status}", file=sys.stderr)
    audio_queue.put(indata.copy())


# Clear the screen
def clear_screen():
    os.system('cls' if os.name == 'nt' else 'clear')


# Print the captions
def update_captions():
    clear_screen()
    print("\n\n\n")
    print("=" * 60)
    print("🎙️ Real-time speech recognition captions (Ctrl+C to quit)")
    print("=" * 60)
    # Older sentences in gray, the latest finished one in the default color
    for prev in list(caption_history)[:-1]:
        print(f"\033[90m{prev}\033[0m")
    if caption_history:
        print(list(caption_history)[-1])
    # The sentence currently being spoken in bold, followed by a cursor
    if current_caption:
        print(f"\033[1m{current_caption}\033[0m", end="▌\n")
    else:
        print("▌")
    print("=" * 60)


# Audio collection thread
def audio_collection_thread():
    try:
        # The stream calls audio_callback for every block while it is open
        with sd.InputStream(samplerate=samplerate, channels=1,
                            callback=audio_callback, blocksize=block_size):
            print("🎙️ Starting real-time STT... please wait a moment.")
            while True:
                time.sleep(0.1)
    except Exception as e:
        print(f"Audio stream error: {e}", file=sys.stderr)
    except KeyboardInterrupt:
        pass


# STT processing thread (latest OpenAI Whisper API version)
def stt_processing_thread():
    global current_caption
    buffer = np.zeros((0, 1), dtype=np.float32)
    max_buffer_size = samplerate * 5    # cap the buffer at 5 seconds of audio
    chunk_size = int(samplerate * 3.0)  # transcribe 3 seconds at a time
    try:
        while True:
            try:
                data = audio_queue.get(timeout=1)
                buffer = np.concatenate((buffer, data), axis=0)
                if len(buffer) > max_buffer_size:
                    buffer = buffer[-max_buffer_size:]
                if len(buffer) >= chunk_size:
                    # Write the chunk to a temporary .wav file and send it to whisper-1
                    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                        wav_path = f.name
                    try:
                        sf.write(wav_path, buffer[:chunk_size], samplerate)
                        with open(wav_path, "rb") as audio_file:
                            response = client.audio.transcriptions.create(
                                model="whisper-1",
                                file=audio_file,
                                language="ko",
                                # Korean for: "This is a meeting. Write down what is being said clearly."
                                prompt="회의 중입니다. 또박또박 말하는 내용을 받아적어.",
                            )
                    finally:
                        os.unlink(wav_path)  # always remove the temporary file
                    text = response.text.strip()
                    if text:
                        with caption_lock:
                            # Start a new caption if the previous one ended a sentence
                            if not current_caption or text[0].isupper() or any(
                                current_caption.endswith(p)
                                for p in ['.', '!', '?', '。', '！', '？']
                            ):
                                if current_caption:
                                    caption_history.append(current_caption)
                                current_caption = text
                            else:
                                current_caption += " " + text
                        update_captions()
                    buffer = np.zeros((0, 1), dtype=np.float32)
                audio_queue.task_done()
            except queue.Empty:
                continue
    except KeyboardInterrupt:
        pass


# Main entry point
if __name__ == "__main__":
    try:
        clear_screen()
        t1 = threading.Thread(target=audio_collection_thread)
        t2 = threading.Thread(target=stt_processing_thread)
        # Daemon threads are stopped automatically when the main thread exits
        t1.daemon = True
        t2.daemon = True
        t1.start()
        t2.start()
        update_captions()
        # Keep the main thread alive until Ctrl+C
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        clear_screen()
        print("\nShutting down...")
        time.sleep(0.5)
        print("Shutdown complete")
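Results
When testing, I usually just read aloud from whatever study material happens to be in front of me,
and this is the result of running the code while I read a piece about operating systems.
Because my speech is pulled in and transcribed in 3-second units, and because the model, being an AI model, partly predicts what comes next, it sometimes finishes a sentence on its own, as it did here.
GPT wrote the console output for me so that it looks nice in the CLI:
{prev} and {current_caption} separate the sentence currently being spoken from the ones already processed, and the earlier sentences are shown in gray.
However....
The problem that gave me the most trouble
was this:
when a silence drags on,
and the silence is not recognized as silence because background noise gets captured in the audio,
random YouTube-style phrases that I never said, such as '시청해주셔서 감사합니다.' ("Thank you for watching."), show up in the output.
The reason,
as far as I understand, is that Korean speech models are trained mainly on pairs of YouTube audio and YouTube transcripts from Korean channels, which is why phrases like that come out..
This kept causing me grief in later tests too.. (sob)
Possible fixes include
setting a threshold so that only chunk data above a certain frequency is recognized (this did not work well),
or, instead of fixing it at the transcription stage, writing the prompt for the translation stage, which uses a more capable model, so that it works out the context of what it is handed and translates accordingly,
and so on..
I went with asking the model, in the translation-stage prompt, to weed out such content on its own.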
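For reference, a threshold on chunk data can be sketched as an energy gate. This is a variation on the first fix above, not the code I actually ran: it thresholds on loudness (RMS) rather than frequency, `rms` and `is_speech` are hypothetical helpers, and the value 0.01 is an arbitrary assumption that would need tuning for each microphone and room.

```python
import math


def rms(samples) -> float:
    """Root-mean-square level of a chunk of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, threshold: float = 0.01) -> bool:
    """Treat a chunk as speech only if its RMS level reaches the threshold.

    The threshold is a made-up default; measure the noise floor of the
    actual environment before relying on it.
    """
    return rms(samples) >= threshold
```

Chunks for which `is_speech` returns False would simply be skipped instead of being sent to the STT call. In a noisy room a plain energy gate like this can still let noise through, which matches my experience that thresholding alone did not work well.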