A document processing system designed to extract and analyze information from Greek government documents using OCR and NLP technologies.
- OCR processing with support for Greek and English text
- Intelligent metadata extraction using NLP
- Document type classification
- REST API for document management
- Authentication and authorization
- MongoDB database integration
- Node.js 18+
- MongoDB
- Tesseract OCR with Greek and English language support
- Clone the repository
- Install dependencies:
npm install
- Set up environment variables by copying .env.example to .env and configuring:
PORT=3000
MONGODB_URI=mongodb://localhost:27017/govdoc-scanner
JWT_SECRET=your-secret-key
API_KEY=your-external-api-key
OCR_LANG=ell,eng
Start the development server:
npm run dev
Run tests:
npm test
- GET /api/documents - List all documents
- GET /api/documents/:id - Get a specific document
- POST /api/documents - Upload and process a new document
- PATCH /api/documents/:id - Update document metadata
- DELETE /api/documents/:id - Delete a document
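As a rough illustration of how a client might call the endpoints listed above, here is a minimal sketch using the `fetch` built into Node.js 18+. The helper names (`buildRequest`, `callApi`), the `API_BASE_URL` variable, the `Bearer` header format, and the request body formats are assumptions made for this example, not part of the documented API.

```javascript
// Hypothetical client helper. The base URL defaults to the PORT=3000 example config.
const BASE_URL = process.env.API_BASE_URL || "http://localhost:3000";

// One entry per documented endpoint.
const ROUTES = {
  list: { method: "GET", needsId: false },
  get: { method: "GET", needsId: true },
  upload: { method: "POST", needsId: false },
  update: { method: "PATCH", needsId: true },
  remove: { method: "DELETE", needsId: true },
};

// Turn an action name (and optional document id) into an HTTP method and URL.
function buildRequest(action, id) {
  const route = ROUTES[action];
  if (!route) throw new Error(`Unknown action: ${action}`);
  if (route.needsId && !id) throw new Error(`Action "${action}" requires a document id`);
  const path = route.needsId
    ? `/api/documents/${encodeURIComponent(id)}`
    : "/api/documents";
  return { method: route.method, url: BASE_URL + path };
}

// Send the request. Assumes a JWT is passed as a Bearer token.
async function callApi(action, { id, token, body } = {}) {
  const { method, url } = buildRequest(action, id);
  const headers = {};
  if (token) headers.Authorization = `Bearer ${token}`;
  const init = { method, headers };
  if (action === "update") {
    // Assumption: metadata updates are sent as JSON.
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(body ?? {});
  } else if (action === "upload") {
    // Assumption: the caller supplies a ready-made body such as a FormData instance.
    init.body = body;
  }
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`${method} ${url} failed with status ${res.status}`);
  return res.status === 204 ? null : res.json();
}
```

With the development server running, `callApi("list", { token })` would request the document list and `callApi("update", { id, token, body: { title: "..." } })` would send a metadata update.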
The system processes documents in the following steps:
- OCR Processing: Extracts text from document images using Tesseract OCR
- NLP Analysis: Analyzes the extracted text to identify:
  - Company names
  - Legal representatives
  - Board members
  - Important dates
  - Document type
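The two steps above can be sketched roughly as follows. This example assumes the `tesseract` command-line binary is installed and on `PATH` (the project may instead use a Node.js binding), and it substitutes a simple regular expression for real NLP, covering only numeric dates. The function names are hypothetical.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const execFileAsync = promisify(execFile);

// OCR_LANG is comma-separated in .env ("ell,eng"); the tesseract CLI joins languages with "+".
function buildTesseractArgs(imagePath, ocrLang = process.env.OCR_LANG || "ell,eng") {
  const langs = ocrLang
    .split(",")
    .map((s) => s.trim())
    .filter(Boolean)
    .join("+");
  // "stdout" as the output base makes tesseract print the recognized text.
  return [imagePath, "stdout", "-l", langs];
}

// Step 1: OCR. Assumes the tesseract binary is available on PATH.
async function runOcr(imagePath) {
  const { stdout } = await execFileAsync("tesseract", buildTesseractArgs(imagePath));
  return stdout;
}

// Step 2 (simplified stand-in for NLP): find day/month/year dates and
// normalize them to ISO format (yyyy-mm-dd).
function extractDates(text) {
  const pattern = /\b(\d{1,2})[\/.-](\d{1,2})[\/.-](\d{4})\b/g;
  const dates = [];
  for (const [, d, m, y] of text.matchAll(pattern)) {
    const day = Number(d);
    const month = Number(m);
    if (day >= 1 && day <= 31 && month >= 1 && month <= 12) {
      dates.push(`${y}-${String(month).padStart(2, "0")}-${String(day).padStart(2, "0")}`);
    }
  }
  return dates;
}

// Run both steps for one scanned image.
async function processDocument(imagePath) {
  const text = await runOcr(imagePath);
  return { text, dates: extractDates(text) };
}
```

Extracting company names, legal representatives, and board members would need a proper NLP component; the date pattern only shows where such analysis plugs into the flow.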
- JWT-based authentication
- Role-based access control (User/Admin)
- API key protection for external services
- Request validation and sanitization
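To make the JWT and role checks above more concrete, here is a minimal sketch built only on Node's `crypto` module. It assumes HS256 signing with `JWT_SECRET`, lowercase role names, and that Admin implies User permissions; none of these details are confirmed by the project, which may well rely on an established JWT library instead.

```javascript
const crypto = require("node:crypto");

const b64url = (input) => Buffer.from(input).toString("base64url");

// HMAC-SHA256 signature over "header.payload", base64url-encoded.
function sign(data, secret) {
  return crypto.createHmac("sha256", secret).update(data).digest("base64url");
}

// Create an HS256 token for the given payload.
function signToken(payload, secret) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  return `${header}.${body}.${sign(`${header}.${body}`, secret)}`;
}

// Verify the signature (and expiry, if present) and return the payload.
function verifyToken(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("Malformed token");
  const [header, body, signature] = parts;
  const expected = Buffer.from(sign(`${header}.${body}`, secret));
  const actual = Buffer.from(signature);
  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  if (expected.length !== actual.length || !crypto.timingSafeEqual(expected, actual)) {
    throw new Error("Invalid signature");
  }
  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  if (payload.exp !== undefined && payload.exp * 1000 <= Date.now()) {
    throw new Error("Token expired");
  }
  return payload;
}

// Role check. Assumption: "admin" outranks "user".
function hasRole(user, requiredRole) {
  const rank = { user: 1, admin: 2 };
  return (rank[user.role] || 0) >= rank[requiredRole];
}
```

A route guard could then call `verifyToken(token, process.env.JWT_SECRET)` and reject the request unless `hasRole(payload, "admin")` holds for admin-only operations.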
Error handling in the system includes:
- Custom error classes
- Standardized error responses
- Detailed logging in development mode
- Generic error messages in production
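One way these pieces could fit together is sketched below. The class names, the response shape, and the rule that only 5xx messages are replaced in production are assumptions for illustration; the project's actual error classes and response format may differ.

```javascript
// Base class for application errors that carry an HTTP status and a machine-readable code.
class AppError extends Error {
  constructor(message, statusCode = 500, code = "INTERNAL_ERROR") {
    super(message);
    this.name = this.constructor.name;
    this.statusCode = statusCode;
    this.code = code;
  }
}

// Example of a more specific error type (hypothetical).
class NotFoundError extends AppError {
  constructor(resource = "Resource") {
    super(`${resource} not found`, 404, "NOT_FOUND");
  }
}

// Convert any thrown error into a standardized response.
function toErrorResponse(err, env = process.env.NODE_ENV) {
  const known = err instanceof AppError;
  const statusCode = known ? err.statusCode : 500;
  const isProd = env === "production";
  const body = {
    status: "error",
    code: known ? err.code : "INTERNAL_ERROR",
    // In production, hide internal details behind a generic message.
    message: isProd && statusCode >= 500 ? "Something went wrong" : err.message,
  };
  // Outside production, include the stack trace to help debugging.
  if (!isProd) body.stack = err.stack;
  return { statusCode, body };
}
```

In an HTTP handler, the result could be sent with the returned `statusCode` and the `body` serialized as JSON, so every failure reaches the client in the same shape.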
MIT License