๐˜š๐˜ญ๐˜ฐ๐˜ธ ๐˜ฃ๐˜ถ๐˜ต ๐˜ด๐˜ต๐˜ฆ๐˜ข๐˜ฅ๐˜บ

[LLM] ํ™”์ƒํšŒ์˜ ์ค‘ STT to TTS ์ˆ˜ํ–‰ํ•˜๋Š” ์‹œ์Šคํ…œ ์„ค๊ณ„ - 2. ์‹ค์‹œ๊ฐ„ STT์™€ ๋ฒˆ์—ญ์ด ๊ฐ€๋Šฅํ•œ ์‹œ์Šคํ…œ ๊ตฌํ˜„(+ FastAPI ๋ชจ๋ธ์„œ๋น™) ๋ณธ๋ฌธ

machine learning/LLM

[LLM] ํ™”์ƒํšŒ์˜ ์ค‘ STT to TTS ์ˆ˜ํ–‰ํ•˜๋Š” ์‹œ์Šคํ…œ ์„ค๊ณ„ - 2. ์‹ค์‹œ๊ฐ„ STT์™€ ๋ฒˆ์—ญ์ด ๊ฐ€๋Šฅํ•œ ์‹œ์Šคํ…œ ๊ตฌํ˜„(+ FastAPI ๋ชจ๋ธ์„œ๋น™)


Special thanks to 다건 🙌

 

 

์ด์ „ ํฌ์ŠคํŒ…: [LLM] ํ™”์ƒํšŒ์˜ ์ค‘ STT to TTS ์ˆ˜ํ–‰ํ•˜๋Š” ์‹œ์Šคํ…œ ์„ค๊ณ„ - 1. OpenAI API 'Whisper-1' ํ™œ์šฉํ•˜์—ฌ ์‹ค์‹œ๊ฐ„ STT ๊ตฌํ˜„

 

์ด์ „์— ๋งŒ๋“ค์—ˆ๋˜ ์Œ์„ฑ ๋…น์Œ + ์‹ค์‹œ๊ฐ„ STT ์ฝ”๋“œ๋ฅผ ๊ฐ€์ง€๊ณ 

๋…น์Œ ์‹œ์ž‘ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด ์ค‘๋‹จ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ์ž์ฒด์ ์œผ๋กœ ๋ฌธ์žฅ์˜ ๋์„ ํŒ๋‹จํ•˜์—ฌ ํ•œ ๋ฌธ์žฅ์”ฉ ์ „์‚ฌ์™€ ๋ฒˆ์—ญ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ์„ ์„ค๊ณ„ํ–ˆ๋‹ค.

์‚ฌ์‹ค FE/BE ์„ค์ •์€ ๊ฑฐ์˜ ๋‹ค๊ฑด์ด๊ฐ€ ๋‹ค ํ•ด์คฌ๊ณ ,

์Œ์„ฑ ๊ด€๋ จ ํŠธ๋Ÿฌ๋ธ”์ŠˆํŒ… ๋ถ€๋ถ„๋งŒ ๋‚ด๊ฐ€ ๋ฐœ ์–น์—ˆ๋‹ค. ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค๐Ÿ‘

 

์‚ฌ์ง„์€ 5์ดˆ ์ •๋„ ๊ธธ์ด์˜ ๋ฌธ์žฅ์„ 3์ดˆ๋งŒ์— ์ „์‚ฌํ•˜๊ณ  -> ๊ณง์ด์–ด 1์ดˆ ๋‚ด๋กœ ๋ฒˆ์—ญ์ด ์™„๋ฃŒ๋˜๋Š” ์‹คํ–‰ ๊ฒฐ๊ณผ ๋ชจ์Šต์ด๋‹ค.

 

 

ํ”„๋ก ํŠธ ๋ชปํ•จ + ์ €๊ฒŒ ๋ฉ”์ธ ํ”„๋กœ์ ํŠธ๊ฐ€ ์•„๋‹ˆ์—ˆ์Œ ์ด์Šˆ๋กœ ๋ณด์—ฌ์ฃผ๋Š” ์™„์„ฑ๋„๋ฅผ ์ƒ๊ฐํ•˜์ง€ ์•Š๊ณ  ๊ธฐ๋Šฅ์—๋งŒ ์ง‘์ค‘ํ•ด์„œ ๊ตฌํ˜„ํ•˜๊ณ ์ž ํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—

์ด์ „ ์ฝ˜์†” ํ™”๋ฉด์ฒ˜๋Ÿผ ์‹ค์‹œ๊ฐ„ ์ „์‚ฌ๋˜๋Š” ๋ชจ์Šต์€ ํ™•์ธ์ด ์•ˆ๋œ๋‹ค

 

๋งž๋‹ค. ๋ณ€๋ช…์ด๋‹ค. ์•”ํŠผ ์‹ค์‹œ๊ฐ„์ด๋‹ค.

?? : ์‹ค์‹œ๊ฐ„์˜ ์˜๋ฏธ๋Š” ์ž์‹ ์ด ์ •์˜ํ•จ์— ๋”ฐ๋ผ์„œ ๋‹ฌ๋ผ.

์‚ฌ์šฉ ๋ชจ๋ธ

- STT: whisper

- Translation: gpt-4o-mini

- TTS(์—ฌ๊ธฐ์„  ์–ธ๊ธ‰ ์•ˆํ•จ): gpt-4o-mini-tts

 

ํ† ํฐ์„ ์ตœ๋Œ€ํ•œ ์ ˆ์•ฝํ•˜๋ฉด์„œ ์‹ค์‹œ๊ฐ„์„ฑ์„ ํ™•๋ณดํ•˜๊ธฐ ์œ„ํ•ด gpt-4o๋‚˜ tts-1๊ฐ™์€ ๋ชจ๋ธ๋ณด๋‹จ ์ž‘์€ ๋ชจ๋ธ๋“ค์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 

์—„์ฒญ ๋ฐฉ๋Œ€ํ•œ ์–‘์˜ ๋Œ€ํ™”๋ฅผ ๋…น์Œํ•˜๊ฑฐ๋‚˜, ๋ณต์žกํ•œ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•œ ๊ณผ์ •์ด ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— mini ๋ชจ๋ธ๋กœ๋„ ์ถฉ๋ถ„ํžˆ ์ž˜ ๋Œ์•„๊ฐ”๋‹ค.๐Ÿ‘

 

์ „์ฒด ์ง„ํ–‰ ํ”Œ๋กœ์šฐ

๋Œ€์ถฉ๊ทธ๋ฆฐ๊ธฐ๋ฆฐ๊ทธ๋ฆผ

1. ์ฒ˜์Œ ์›น์— ๋“ค์–ด์™”์„ ๋•Œ, '๋…น์Œ ์‹œ์ž‘'์„ ๋ˆ„๋ฅด๋ฉด ์›น์†Œ์ผ“์ด ์—ฐ๊ฒฐ๋˜์–ด 2์ดˆ ๋‹จ์œ„ ์ฒญํฌ๋กœ webm blob์„ fastAPI๋กœ ์ „์†กํ•œ๋‹ค.

 

2. fastAPI๋Š” ์‹คํ–‰ ์‹œ ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ€๋™์‹œ์ผœ ๊ฐ ์ „์‚ฌ ํ / ๋ฒˆ์—ญ ํ์— ๋‚ด์šฉ์ด ๋“ค์–ด์˜ค๋ฉด ๋ฐ”๋กœ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๊ณ ,

websocket์ด ์—ฐ๊ฒฐ๋˜๋ฉด ์ „๋‹ฌ๋ฐ›์€ .webm ํŒŒ์ผ์„ .wav๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , audio_queue์— ๊ฐ’์„ ๋„ฃ์–ด์ค€๋‹ค.

 

3. stt_processing_thread() transcribes whatever arrives in audio_queue and puts the transcribed text into the transcription result queue,

and translation_thread() translates whatever arrives in the transcription result queue and puts the result into the translation result queue.

 

4. ๋™์ผํ•˜๊ฒŒ fastAPI์˜ ๋ฉ”์ธ ๋ผ์šฐํ„ฐ์—์„œ ๋น„๋™๊ธฐ์ ์œผ๋กœ ๊ณ„์† ์ „์‚ฌ ๊ฒฐ๊ณผ ํ์™€ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ ํ๋ฅผ ํ™•์ธํ•˜๋ฉฐ

๊ฐ’์ด ๋“ค์–ด์˜ฌ๋•Œ๋งˆ๋‹ค websocket์œผ๋กœ ์›น์— ๊ฐ ๊ฒฐ๊ณผ๋ฅผ ์ „์†กํ•˜๋ฉด, ๊ฒฐ๊ณผ๊ฐ€ ํ™”๋ฉด์— ๋ณด์—ฌ์ง„๋‹ค.

 

5. ์‚ฌ์šฉ์ž๊ฐ€ ๊ทธ๋งŒ ๊ฐ–๊ณ  ๋†€๊ณ ์‹ถ์–ด์„œ '๋…น์Œ ์ค‘๋‹จ'์„ ๋ˆŒ๋Ÿฌ ์›น์†Œ์ผ“์ด ๋‹ซํž๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•ด์„œ ์ง„ํ–‰๋œ๋‹ค.

 

ํŒŒ์ผ ๊ตฌ์กฐ

project/
├── config.py           # Global settings (API key, audio settings, shared client, etc.)
├── main.py             # Program entry point: loads each module and starts the threads
├── modules/
│   ├── __init__.py
│   ├── stt.py          # STT processing (Whisper-1, VAD, automatic language detection)
│   ├── translation.py  # Translation (uses GPT-4o-mini)
│   ├── tts.py          # TTS (uses GPT-4o-mini-tts)
│   └── utils.py        # Shared utilities (language code sanitizing, log filename generation, etc.)
└── templates/
    └── index.html      # Simple frontend for testing

์ฝ”๋“œ

ํ”„๋ก ํŠธ์—์„œ ๊ผญ ํ•ด์ค˜์•ผ ํ•  ์ผ!!!

์‚ฌ์‹ค ์ด ํ”„๋กœ์ ํŠธ ํ•˜๋ฉด์„œ ๋А๋‚€๊ฑด ํ”„๋ก ํŠธ๊ฐ€ ์ œ์ผ ์ค‘์š”ํ–ˆ๋‹ค.... ์ •๋ง ๋„ˆ๋ฌดํž˜๋“ค์—ˆ๋‹ค๐Ÿ˜ญ

ํ”„๋ก ํŠธ๋‹จ์—์„œ ์ œ๋Œ€๋กœ ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋‚ด์ฃผ์ง€ ์•Š์œผ๋ฉด, ์•„๋ฌด๋ฆฌ ํŒŒ์ด์ฌ ์ชฝ์—์„œ ์ฝ”๋“œ๋ฅผ ๊ธฐ๊น”๋‚˜๊ฒŒ ๊ตฌ์„ฑํ–ˆ์–ด๋„ ์•„๋ฌด๋Ÿฐ ์ž‘๋™ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์—†๋‹ค.

 

 

ํ•ด๋‹น ํ”„๋กœ์ ํŠธ์—์„œ ์ฃผ์˜ํ•  ๊ฒƒ์€ webmํŒŒ์ผ์„ ์ฒญํฌ๋‹จ์œ„๋กœ ๋ณด๋‚ผ ๋•Œ 'ํ—ค๋” ์ •๋ณด'๋ฅผ ๋ฐ˜๋“œ์‹œ ํฌํ•จํ•ด์„œ ๋ณด๋‚ด์•ผ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

๊ทธ๋Ÿฌ๋‚˜, ์ตœ์ดˆ 1ํšŒ ์ „์†ก๋˜๋Š” ํŒŒ์ผ์—๋งŒ ํ—ค๋” ์ •๋ณด๊ฐ€ ๋ถ™๊ณ  ๊ทธ ์ดํ›„์—๋Š” ๋…น์Œ๋œ ๋ฐ์ดํ„ฐ๋งŒ ์ „์†ก๋˜๋Š”๋ฐ, ๊ทธ๋Ÿผ ffmpeg ๋ชจ๋“ˆ์ด ์ œ๋Œ€๋กœ ๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹ˆ๋ผ๊ณ  ๋‚œ ์ด๊ฑฐ ๋ณ€ํ™˜ ๋ชปํ•œ๋‹ค๋ฉฐ ๋ฏธ์ณ ๋‚ ๋›ด๋‹ค.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์€

 

1. ํ—ค๋” ์ •๋ณด๋ฅผ ๋ณ€์ˆ˜์— ์ง€์ •ํ•ด๋†“๊ณ , ๋งค blob์— ํ—ค๋” ์ •๋ณด๋ฅผ ๋ถ™์—ฌ ๋ณด๋‚ด์ค€๋‹ค.

2. ๊ทธ๋ƒฅ ๋…น์Œ์„ ์•„์˜ˆ ์ฒญํฌ๋งˆ๋‹ค ์ƒˆ๋กœ ํ•œ๋‹ค.

 

 

์˜€๋Š”๋ฐ, webm์˜ ํ—ค๋”์ •๋ณด๋Š” ๊ณ ์ •๊ธธ์ด๊ฐ€ ์•„๋‹ˆ๋ผ ๋งค๋ฒˆ ํฌ๊ธฐ๊ฐ€ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ 2๋ฒˆ์ด ๋ฌด์‹ํ•ด๋ณด์ด์ง€๋งŒ ์ƒ๊ฐ๋ณด๋‹ค ์ž˜๋จนํ˜”๋‹ค.

๊ทผ๋ฐ ๊ทธ๋Ÿผ

์–ด? ๊ทธ๋Ÿผ ์ค‘๊ฐ„์ค‘๊ฐ„ ๋…น์Œ์ด ์•ˆ๋  ์ˆ˜ ์žˆ๋Š”๊ฑฐ ์•„๋‹Œ๊ฐ€?

 

์‹ถ๊ธฐ๋„ ํ–ˆ์ง€๋งŒ, ์‹ค์ œ ์‹คํ–‰ ๊ฒฐ๊ณผ ๋‹ค์‹œ ๋…น์Œ๊ธฐ๋ฅผ ๊ป๋‹ค ํ‚ค๋Š” ๊ณผ์ •์€ ๊ต‰์žฅํžˆ ๋น ๋ฅธ ์‹œ๊ฐ„์•ˆ์— ์ด๋ฃจ์–ด์กŒ๊ณ ,

๊ทธ ๊ณผ์ •์—์„œ ๋Œ€๋‹จํ•œ ๋‚ด์šฉ ์†์‹ค์ด ๋ฐœ์ƒํ•˜์ง€๋„ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์—

2์ดˆ๋งˆ๋‹ค ๋…น์ŒํŒŒ์ผ์„ ํ•œ๋ฒˆ์”ฉ ๋ณด๋‚ด์ฃผ๋Š” ๋ฐฉ์‹์„ ์„ ํƒํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค.

 

function startRecording() {
  const chunks = [];
  
  mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  
  mediaRecorder.ondataavailable = (event) => {
    if (event.data.size > 0) {
      chunks.push(event.data);
    }
  };
  
  mediaRecorder.onstop = () => {
    if (chunks.length > 0 && ws && ws.readyState === WebSocket.OPEN) {
      const blob = new Blob(chunks, { type: 'audio/webm' });
      ws.send(blob);
      console.log('๋…น์Œ ๋ฐ์ดํ„ฐ ์ „์†ก๋จ, ํฌ๊ธฐ:', blob.size);
    }
  };
  
  mediaRecorder.start();
  console.log('์ƒˆ ๋…น์Œ ์„ธ์…˜ ์‹œ์ž‘');
}

 

startRecording()์—์„œ๋Š” audio/webm ์ง€์ •ํ•ด์ค˜์„œ MediaRecorder ์‚ฌ์šฉํ•˜์—ฌ ๋…น์Œ์„ ์ˆ˜ํ–‰ํ•˜๊ณ ,

์•„๋ž˜์„œ ์„ค์ •ํ•ด์ฃผ๋Š” Interval ์‹œ๊ฐ„๋งˆ๋‹ค blob์— ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ด์•„ ์›น์†Œ์ผ“์œผ๋กœ ์ „์†กํ•ด์ค€๋‹ค.

// ๋งˆ์ดํฌ ์ŠคํŠธ๋ฆผ ์–ป๊ธฐ
stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// WebSocket ์—ฐ๊ฒฐ (HTTP ํ™˜๊ฒฝ์ด๋ฉด "ws://", HTTPS์ด๋ฉด "wss://")
ws = new WebSocket(`ws://${window.location.host}/ws/stt`);

ws.onopen = () => {
  console.log('WebSocket ์—ฐ๊ฒฐ๋จ');
  status.textContent = '์ƒํƒœ: WebSocket ์—ฐ๊ฒฐ๋จ';

  // ๋…น์Œ ์‹œ์ž‘
  startRecording();
  
  // 3์ดˆ๋งˆ๋‹ค ์ƒˆ๋กœ์šด ๋…น์Œ ์„ธ์…˜ ์‹œ์ž‘
  recordingInterval = setInterval(() => {
    if (isRecording) {
      stopRecording();
      startRecording();
    }
  }, 2000);
// ์ดํ›„ ์ฝ”๋“œ

 

main

app = FastAPI()

@app.on_event("startup")
async def startup_event():
    # stt, ๋ฒˆ์—ญ ์Šค๋ ˆ๋“œ ์‹œ์ž‘
    threading.Thread(
        target=stt_processing_thread,
        args=(audio_queue, sentence_queue, transcription_queue, recording_active, "ko"),
        daemon=True
    ).start()
    threading.Thread(
        target=translation_thread,
        args=(sentence_queue, translation_queue, translated_queue, "en"),
        daemon=True
    ).start()

    asyncio.create_task(result_sender_task())

 

on_event ๋ถ€๋ถ„์—์„œ fastAPI ์‹คํ–‰ ์‹œ ์–ด๋–ค ๊ฒƒ๋“ค์ด ์‹คํ–‰๋˜์–ด์•ผ ํ•˜๋Š”์ง€ ์ •์˜ํ•ด์ฃผ์—ˆ๋‹ค.

 

"startup" ์ด๋ฒคํŠธ ์‹คํ–‰์ด ์‹œ์ž‘๋˜๋ฉด, 

๋‘ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ ์‹คํ–‰

- stt_processing_thread ํ˜ธ์ถœ

- translation_thread ํ˜ธ์ถœ

๊ฒฐ๊ณผ ์ „์†กํ•ด์ฃผ๋Š” asyncio ์ฝ”๋ฃจํ‹ด ์ •์˜

๊ฐ€ ์ˆ˜ํ–‰๋œ๋‹ค.

 

์ฐธ๊ณ ๋กœ on_event๋Š” deprecated ๋œ ๊ธฐ๋Šฅ์ด๊ธฐ ๋•Œ๋ฌธ์—, FastAPI 0.95+ ๋ฒ„์ „๋ถ€ํ„ฐ๋Š” lifespan ์ด๋ฒคํŠธ ์ฒ˜๋ฆฌ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋Š”๊ฑธ ๊ถŒ์žฅํ•œ๋‹ค. ๋ณดํ†ต์€ lifespan ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ด์„œ ์“ฐ์ง€๋งŒ, ๊ทธ๋ƒฅ if name=='__main__' ์•ˆ์— ์ •์˜ํ•ด์ค˜๋„ ์ƒ๊ด€์€ ์—†๋‹ค. ๊ทผ๋ฐ ๋‚˜๋Š” ๊ทธ๋ƒฅ ์‹คํ–‰์— ๋ฌธ์ œ์—†์–ด์„œ ์ €๋Œ€๋กœ ์ผ๋‹ค.

 

# STT WebSocket endpoint
@app.websocket("/ws/stt")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    print("[DEBUG] WebSocket connected")
    websocket_clients.append(websocket)

    try:
        while True:
            data = await websocket.receive_bytes()
            # Save to a temporary file
            with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as temp_file:
                temp_file_path = temp_file.name
                temp_file.write(data)
            print(f"[DEBUG] Temporary file created: {temp_file_path}")

            try:
                # Convert .webm -> .wav with ffmpeg (as a BytesIO buffer)
                wav_buffer = convert_webm_to_wav_bytes(temp_file_path)
                os.unlink(temp_file_path)  # delete the temporary file

                if wav_buffer is None:
                    print("[DEBUG] Skipping chunk that failed to convert")
                    continue

                # Read the audio data from the BytesIO buffer
                wav_buffer.seek(0)
                audio_data, sample_rate = sf.read(wav_buffer, dtype='float32')
                print(f"[DEBUG] Received audio data: shape {audio_data.shape}, sample_rate {sample_rate}")

                # Downmix to mono if the audio is stereo
                if len(audio_data.shape) > 1:
                    audio_data = audio_data.mean(axis=1).reshape(-1, 1)
                else:
                    audio_data = audio_data.reshape(-1, 1)

                # Push into audio_queue (handed off to the STT processing thread)
                audio_queue.put(audio_data)
                print(f"[DEBUG] Data added to audio_queue. Current queue size: {audio_queue.qsize()}")
            except Exception as e:
                print(f"[DEBUG] Audio processing error: {e}")
                if os.path.exists(temp_file_path):
                    os.unlink(temp_file_path)
    except WebSocketDisconnect:
        print("[DEBUG] WebSocket connection closed")
        if websocket in websocket_clients:
            websocket_clients.remove(websocket)

 

๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ฐ์ดํ„ฐ ๋ฐ›์•„์˜ค๋Š” ๋ถ€๋ถ„..

์›น์†Œ์ผ“ ์‚ฌ์šฉ + ๋™์‹œ๋‹ค๋ฐœ์ ์œผ๋กœ ์Œ์„ฑ์„ ๋ฐ›์•„์˜ค๋ฉด ๋ฐ›์•„์˜ค๋Š” ๋Œ€๋กœ ํ์— ๋„ฃ๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ๋น„๋™๊ธฐ ํ•จ์ˆ˜๋กœ ์œ„์™€ ๊ฐ™์ด ์ฝ”๋“œ๋ฅผ ์งฐ๋‹ค.

 

ํ”„๋ก ํŠธ๋‹จ์—์„œ ์Œ์„ฑ ๋ฐ์ดํ„ฐ๋ฅผ ์›น์†Œ์ผ“์œผ๋กœ FastAPI์— ์ด์ฃผ๋ฉด, ์ด๋ฅผ stt๋ชจ๋ธ์ธ whisper์— ์ „๋‹ฌํ•ด์ฃผ๊ธฐ ์œ„ํ•ด ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

whisper ๋ชจ๋ธ์€ .wav๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ ๋•Œ๋ฌธ์— webm์œผ๋กœ ๋…น์Œ๋œ ํŒŒ์ผ์„ ํ•œ๋ฒˆ ๋ณ€ํ™˜ํ•ด์ฃผ๋Š” ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค.

๋•Œ๋ฌธ์— ffmpeg ์„ค์น˜๊ฐ€ ํ•„์ˆ˜์ด๋ฉฐ, ๋ฐ˜๋“œ์‹œ ํŒŒ์ด์ฌ๊ณผ ๋กœ์ปฌ์— ๊ฐ๊ฐ ์„ค์น˜๋ฅผ ํ•ด์ค˜์•ผ ํ•œ๋‹ค.

 

pip install ffmpeg-python
brew install ffmpeg
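convert_webm_to_wav_bytes() is also not shown in the post. A minimal version built on ffmpeg-python might look like this (assumed implementation: decode the webm chunk to 16 kHz mono 16-bit WAV held in a BytesIO buffer, and return None if ffmpeg fails):

import io
import ffmpeg  # ffmpeg-python wrapper; the ffmpeg binary must also be installed locally

def convert_webm_to_wav_bytes(webm_path, sample_rate=16000):
    """Convert a .webm file to 16-bit mono WAV and return it as a BytesIO buffer."""
    try:
        out, _ = (
            ffmpeg
            .input(webm_path)
            .output("pipe:", format="wav", acodec="pcm_s16le", ac=1, ar=sample_rate)
            .run(capture_stdout=True, capture_stderr=True)
        )
        return io.BytesIO(out)
    except ffmpeg.Error as e:
        print(f"[DEBUG] ffmpeg conversion failed: {e.stderr.decode(errors='ignore')}")
        return None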

 

stt

def is_speech(buffer, sample_rate=16000, frame_duration_ms=30, speech_threshold=0.3):
    audio_int16 = np.int16(buffer * 32767)
    audio_bytes = audio_int16.tobytes()
    frame_size = int(sample_rate * (frame_duration_ms / 1000.0))
    num_frames = len(audio_int16) // frame_size
    if num_frames == 0:
        return False
    speech_frames = 0
    for i in range(num_frames):
        start = i * frame_size * 2  # 2 bytes per sample
        frame = audio_bytes[start: start + frame_size * 2]
        if len(frame) < frame_size * 2:
            break
        if vad.is_speech(frame, sample_rate):
            speech_frames += 1
    fraction = speech_frames / num_frames
    return fraction >= speech_threshold

 

stt ํ•จ์ˆ˜์—์„œ๋Š” ์šฐ์„  ํ˜„์žฌ ๋“ค์–ด์˜ค๋Š” ์ฒญํฌ ์ •๋ณด๊ฐ€ ์ œ๋Œ€๋กœ ๋œ ์Œ์„ฑ ์ •๋ณด์ธ์ง€, ๋„ˆ๋ฌด ์งง์€ ๋‹จ์œ„์˜ ์ฒญํฌ๊ฐ€ ๋“ค์–ด์˜จ ๊ฒƒ์€ ์•„๋‹Œ์ง€๋ฅผ ๊ตฌ๋ณ„ํ•˜๊ธฐ ์œ„ํ•ด is_speech๋ผ๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ–ˆ๋‹ค.

๋ฒ„ํผ์— ํฌํ•จ๋œ ์Œ์„ฑ์ •๋ณด๊ฐ€ ๋ฐœํ™”๋กœ ์ธ์‹ํ•˜๊ธฐ ํž˜๋“  ์ง€๋‚˜์น˜๊ฒŒ ์งง์€(0.3์ดˆ ์ดํ•˜)์ง€, ๋งํ•˜๊ณ ์žˆ๊ธด ํ•œ๊ฒƒ์ธ์ง€๋ฅผ ๊ตฌ๋ถ„ํ•œ๋‹ค.

 

def detect_language(audio_path):
    try:
        with open(audio_path, "rb") as audio_file:
            response = CLIENT.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="verbose_json"
            )
        detected_lang = response.language
        sanitized = sanitize_language_code(detected_lang)
        print(f"[DEBUG] Detected language (sanitized): {sanitized}")
        return sanitized
    except Exception as e:
        print(f"Language detection error: {e}", file=sys.stderr)
        return DEFAULT_LANGUAGE

 

๋Œ์•„๊ฐ€๋Š” ํ”„๋กœ๊ทธ๋žจ์ด ๋‹จ์ˆœํžˆ ํ•œ๊ตญ์–ด -> ์˜์–ด๋กœ ๋ฒˆ์—ญํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ณ , ์ดํ›„ ํ™”์ƒํšŒ์˜์—์„œ ๋‹ค๊ตญ์–ด ์˜์—ญ์— ํ™œ์šฉ๋  ๊ฒƒ์„ ๊ณ ๋ คํ•˜์—ฌ ์–ธ์–ด๋ฅผ ์ง์ ‘ ๊ฐ์ง€ํ•˜๊ฒŒ ํ•ด๋ณด๊ณ  ์‹ถ์—ˆ๋‹ค. ๊ทธ๋ž˜์„œ detect_language๋ฅผ ๋งŒ๋“ค์–ด ๋ฒ„ํผ์— ์ €์žฅ๋œ ์˜ค๋””์˜ค ํŒŒ์ผ์„ ๊ณต์œ ํ•˜๋ฉฐ ์–ธ์–ด๋ฅผ ์ž๋™์œผ๋กœ ํƒ์ง€ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด์คฌ๋‹ค.

 

๊ทธ๋ž˜์„œ ๋น„๋ก UI๋Š” ์ˆ˜์ •์ด ๊ท€์ฐฎ์€ ๊ด€๊ณ„๋กœ ์›๋ณธ ํ…์ŠคํŠธ(ํ•œ๊ตญ์–ด) ๋กœ ์ ํ˜€์žˆ๊ธด ํ•˜์ง€๋งŒ..

 

์ด๋ ‡๊ฒŒ ์ผ๋ณธ์–ด๋ž‘ ํ”„๋ž‘์Šค์–ด๋„ ์ž˜ ๋œ๋‹ค๐Ÿ‘

์›๋ž˜ ๊ฐ•๋‚จ์ด๋ผ๊ณ  ๋ฐœ์Œํ•˜๊ณ ์‹ถ์—ˆ๋Š”๋ฐ ๋ฐœ์Œ์ด์Šˆ๋กœ ์ฝ”๋‚œ๋œ๊ฑฐ๋งŒ ๋นผ๊ณ 

 

def stt_processing_thread(audio_queue, sentence_queue, transcription_queue, recording_active, target_language):
    global detected_language
    buffer = np.zeros((0, 1), dtype=np.float32)
    silence_threshold = 0.02
    silence_duration_threshold = 0.5
    silence_start = None
    language_detected_once = False

    while True:
        try:
            data = audio_queue.get(timeout=1)
            buffer = np.concatenate((buffer, data), axis=0)
            current_time = time.time()
            amplitude = np.mean(np.abs(data))
            if amplitude < silence_threshold:
                if silence_start is None:
                    silence_start = current_time
                elif current_time - silence_start >= silence_duration_threshold:
                    if len(buffer) > int(SAMPLE_RATE * 0.5):
                        if not is_speech(buffer):
                            buffer = np.zeros((0, 1), dtype=np.float32)
                            silence_start = None
                            audio_queue.task_done()
                            continue
                        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                            sf.write(f.name, buffer, SAMPLE_RATE, format='WAV', subtype='PCM_16')  
                            if not language_detected_once:
                                lang_code = detect_language(f.name)
                                if lang_code:
                                    with language_lock:
                                        detected_language = lang_code
                                    language_detected_once = True

                            with language_lock:
                                current_lang = detected_language if detected_language is not None else DEFAULT_LANGUAGE
                            with open(f.name, "rb") as audio_file:
                                response = CLIENT.audio.transcriptions.create(
                                    model="whisper-1",
                                    file=audio_file,
                                    language=current_lang,
                                    prompt="We're now on meeting. Please transcribe exactly what you hear."
                                )
                            os.unlink(f.name)
                        text = response.text.strip()
                        if text:
                            print(f"[DEBUG] STT result: {text}")
                            source_log, _ = get_log_filenames(detected_language, target_language)
                            with open(source_log, "a", encoding="utf-8") as f:
                                f.write(text + "\n")
                            with language_lock:
                                src_lang = detected_language if detected_language is not None else DEFAULT_LANGUAGE
                            sentence_queue.put((text, src_lang))
                            transcription_queue.put((text, src_lang))
                    buffer = np.zeros((0, 1), dtype=np.float32)
                    silence_start = None
            else:
                silence_start = None
            audio_queue.task_done()
        except queue.Empty:
            continue

 

stt๋ฅผ ์‹ค์ œ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•จ์ˆ˜์—์„œ๋Š” ์ด์ „๊ณผ ํฌ๊ฒŒ ๋‹ฌ๋ผ์ง„ ๋ถ€๋ถ„์€ ์—†๊ณ ,

๋…ธ์ด์ฆˆ๋ฅผ ์ตœ๋Œ€ํ•œ ๋ฐ˜์˜ํ•˜์ง€ ์•Š๊ธฐ ์œ„ํ•ด ์ง„ํญ์„ ๊ฐ์ง€ํ•˜์—ฌ silence_threshold(0.02)๊ฐ’์„ ์ถ”๊ฐ€ํ•˜์—ฌ ํŠน์ • ์ง„ํญ๋ณด๋‹ค ๋‚ฎ์€ ๊ฐ’์ด ๋“ค์–ด์˜ค๋ฉด ์นจ๋ฌต์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๋„๋ก ํ–ˆ๊ณ ,

์ด๋Ÿฌํ•œ ์นจ๋ฌต์‹œ๊ฐ„์ด silence_duration_threshold(0.5์ดˆ)๋ณด๋‹ค ๊ธธ์–ด์ง€๋ฉด ํ•œ ๋ฌธ์žฅ์ด ๋๋‚ฌ๋‹ค๊ณ  ๊ฐ„์ฃผํ•˜์—ฌ data์— ๋ˆ„์ ํ•ด๋†“์€ ๋ฒ„ํผ๋“ค์˜ ์ „์‚ฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.

 

silence_threshold๊ฐ’์€ ๋„ˆ๋ฌด ๋‚ฎ์œผ๋ฉด ์นจ๋ฌต์— ์„ž์ธ ๋…ธ์ด์ฆˆ๋„ ๋ฐœํ™”๋ผ๊ณ  ๊ฐ„์ฃผ๋˜์–ด ๊ตฌ๋…ํ•ด์ฃผ์…”์„œ ๊ฐ์‚ฌํ•˜๋‹ค๋Š” ๋ฌธ๊ตฌ๋กœ ์ธ์‹ํ•˜๊ณ ,

๋„ˆ๋ฌด ๋†’์œผ๋ฉด ๋‚ด๊ฐ€ ์‹ค์ œ๋กœ ๋งํ•œ ๊ฒƒ๋„ ์ธ์‹์ด ๋˜์ง€ ์•Š์œผ๋‹ˆ ์ ๋‹นํ•œ ๊ฐ’์„ ๊ฒฝํ—˜์ ์œผ๋กœ ์„ค์ •ํ•ด์ฃผ๋Š” ๊ฒƒ์ด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹น.

 

์ „์‚ฌ๊ฐ€ ์™„๋ฃŒ๋œ ๋ฌธ์žฅ์€ sentence_queue์™€ transciption_queue๋กœ ์ „๋‹ฌ๋˜๋Š”๋ฐ,

๊ฐ๊ฐ translation_thread์™€ ์›น์†Œ์ผ“์œผ๋กœ ์ „์†ก๋˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

ํ์—์„œ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ get(pop)์„ ํ†ตํ•ด ๊ฐ’์„ ๊บผ๋‚ด์˜ค๊ธฐ ๋•Œ๋ฌธ์—, ์ด์ค‘์œผ๋กœ ๊บผ๋‚ด์˜ค๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋‘๊ฐœ์˜ ํ๋ฅผ ์ •์˜ํ•˜์—ฌ ์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•˜๊ฒŒ ํ–ˆ๋‹ค.

 

translation

def translation_thread(sentence_queue, translation_queue, translated_queue, target_language):
    while True:
        try:
            sentence_data = sentence_queue.get(timeout=1)
            if isinstance(sentence_data, tuple):
                sentence, source_lang = sentence_data
            else:
                sentence = sentence_data
                source_lang = "ko"
            print(f"[DEBUG] Sentence to translate: {sentence} (source language: {source_lang})")
            if source_lang == target_language:
                print("[DEBUG] Source and target languages are identical; skipping translation")
                translation_queue.put(sentence)
                sentence_queue.task_done()
                continue
            try:
                source_name = language_map.get(source_lang, "the detected language")
                target_name = language_map.get(target_language, "English")
                response = CLIENT.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {"role": "system", "content": f"Translate the following text from {source_name} to {target_name}. Only provide the translation without any additional explanation."},
                        {"role": "user", "content": sentence}
                    ]
                )
                translation = response.choices[0].message.content.strip()
                print(f"[DEBUG] Translation result: {translation}")
            except Exception as e:
                print(f"Translation error: {e}", file=sys.stderr)
                translation = ""
            if translation:
                _, target_log = get_log_filenames(source_lang, target_language)
                with open(target_log, "a", encoding="utf-8") as f:
                    f.write(translation + "\n")
                translation_queue.put(translation)
                translated_queue.put(translation)
            sentence_queue.task_done()
        except queue.Empty:
            continue

 

์ด์ „ ํ”„๋กœ์ ํŠธ์™€ ๋‹ฌ๋ผ์ง„ ๊ฒƒ์€ ์—ฌ๊ธฐ์„œ๋Š” ๋ฒˆ์—ญ์ด ์ถ”๊ฐ€๋˜์—ˆ๋‹ค๋Š” ์ !

 

๊ทธ๋Ÿฌ๋‚˜ ๋ฒˆ์—ญ์€ ์‚ฌ์‹ค ์‰ฝ๋‹ค..

์ž…์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ชจ๋‘ ํ…์ŠคํŠธ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋“ค ๋กœ๊ทธ ์ฐ์–ด์„œ ๋ณด๊ธฐ๋„ ๋„ˆ๋ฌด ์ข‹๊ณ 

๋ฒˆ์—ญ๋„ GPT-4o-mini ๋ชจ๋ธ์„ ์ผ๊ธฐ ๋•Œ๋ฌธ์— stt๋กœ ๋“ค์–ด๊ฐ”๋˜ ๋ฌธ์žฅ์ค‘ ๋งฅ๋ฝ์— ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์šด ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋˜์–ด์žˆ์–ด๋„ ์•Œ์•„์„œ ๊ฐœ์„ ๋œ ๋ฒˆ์—ญ์„ ๋‚ด๋†“๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฐ๊ณผ๋ฌผ์— ๋Œ€ํ•œ ๊ฑฑ์ •๋„ ์—†๋‹ค.

 

๊ต์ˆ˜๋‹˜์ฃ„์†กํ•ฉ๋‹ˆ๋‹ค..

์ˆ˜์—…ํ•˜์‹ค๋•Œ ๋ชฐ๋ž˜ ํ…Œ์ŠคํŠธํ•ด๋ณธ์ ์ด ์žˆ๋Š”๋ฐ ์—ฌ๋Ÿฌ๊ฐœ์˜ ๊ฝค ๊ธด ๋ฌธ์žฅ์˜ ๋ฐœํ™”๋„ ์ž˜๋˜๋Š” ๊ฑธ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

์œ„์˜ ์ „์‚ฌ-๋ฒˆ์—ญ ๊ฒฐ๊ณผ ์ค‘ ์ฒซ๋ฒˆ์งธ ๋ฌธ์žฅ์—์„œ

์›๋ฌธ์ด

๋ฐ์ดํ„ฐ ์›จ์–ด ํ•˜์šฐ์Šค๊ฐ€ ๊ฐ–๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ ์‚ฌ์ „์— ๋งž๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜์‹œํ‚ค๊ณ  ๊ทธ ๋‹ค์Œ์— ์ด์ œ ๋ฐ์ดํ„ฐ ์›จ์–ด ํ•˜์šฐ์Šค์—์„œ ์šธ๋ ค๋„ฃ๋Š”๊ฑฐ๋‹ค.

๋กœ ์ธ์‹๋˜์—ˆ์œผ๋‚˜,

 

๋ฒˆ์—ญ์—์„œ๋Š”

The data is transformed to match the data dictionary of the data warehouse, and then it is loaded into the data warehouse.

 

์ด๋ ‡๊ฒŒ ์•Œ์•„์„œ ์˜ฌ๋ ค ๋„ฃ๋‹ค / ์ ์žฌ์˜ ์˜๋ฏธ์ธ load๋กœ ์ž˜ ์ˆ˜์ •๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

sentence_data์—์„œ ๊บผ๋‚ด์˜จ ๋ฐ์ดํ„ฐ์—์„œ๋ถ€ํ„ฐ ๋ฒˆ์—ญ์„ ์ง„ํ–‰ํ•˜๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ๋ถ€ํ„ด 'ํ•œ ๋ฌธ์žฅ ๋‹จ์œ„'์˜ ์ „์‚ฌ ๊ฒฐ๊ณผ๊ฐ€ ์ „๋‹ฌ๋˜์—ˆ์„๊ฒƒ์ด๋ฏ€๋กœ, ๋ฌธ์žฅ์˜ ๊ธธ์ด๊ฐ€ ์งง๊ณ  ๋ง๊ณ ๋ฅผ ํŒ๋‹จํ•ด ๋ฒˆ์—ญ ์ง„ํ–‰์„ ๊ฒฐ์ •ํ•˜์ง„ ์•Š๋Š”๋‹ค. ๊ทธ๋ž˜์„œ ์˜ˆ์™ธ์ฒ˜๋ฆฌ ํ•ด์ค„ ๊ฒƒ๋„ ์ ์—ˆ๋‹ค.

 

์•„๋ž˜์— stt์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ฒˆ์—ญ๋œ ํ…์ŠคํŠธ ๊ฒฐ๊ณผ๊ฐ€ ๋ฐ˜ํ™˜๋˜์—ˆ์„ ๊ฒฝ์šฐ tts ์ „์†ก ๋‹จ๊ณ„๋ฅผ ๊ณ ๋ คํ•˜์—ฌ translation_queue์™€ translated_queue์— ๊ฐ๊ฐ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ๋ฅผ ๋„ฃ์–ด์ค€๋‹ค.

 


์‹คํ–‰ ๊ฒฐ๊ณผ

 
