Introducing Dataplat.Dbatools.Csv #33
Introduces a default 10GB decompressed size limit to prevent decompression bomb attacks, with configurable limits in CsvReaderOptions. Adds maximum quoted field length enforcement to mitigate memory and denial-of-service risks. Improves buffer reuse and error handling in CsvDataReader, refactors TypeConverterRegistry to prevent shared state mutation, and optimizes CsvWriter quoting logic for multi-character delimiters.
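The decompressed size limit described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the library's C# code; the names `read_limited` and `DecompressedSizeLimitExceeded`, and the convention that a cap of 0 means unlimited, are assumptions made for the example.

```python
import gzip
import io


class DecompressedSizeLimitExceeded(Exception):
    """Raised when decompressed output grows past the configured cap."""


def read_limited(compressed: bytes, max_decompressed_bytes: int,
                 chunk_size: int = 64 * 1024) -> bytes:
    """Decompress gzip data in chunks, aborting once the cap is exceeded.

    A cap of 0 or less means "unlimited" in this sketch (an assumption).
    """
    out = bytearray()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out.extend(chunk)
            # Check after every chunk so a bomb is caught early,
            # long before the whole payload has been inflated.
            if max_decompressed_bytes > 0 and len(out) > max_decompressed_bytes:
                raise DecompressedSizeLimitExceeded(
                    f"decompressed data exceeds {max_decompressed_bytes} bytes"
                )
    return bytes(out)
```

The key design point is that the check runs while streaming, so memory use stays bounded by the cap plus one chunk rather than by the attacker-controlled payload size.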
Adds support for distinguishing null vs empty fields, duplicate header handling, culture-aware parsing, lenient quote mode, smart quote normalization, and flexible field count mismatch handling. Updates CsvDataReader, CsvReaderOptions, and type converters to support these features, and introduces new enums and interfaces for configuration. Extensive tests added to validate new behaviors.
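As one illustration of the features listed above, smart quote normalization can be sketched as a character mapping applied before parsing. This is a Python sketch under the assumption that normalization means replacing typographic quotes with their ASCII equivalents; the function name and the exact set of characters handled are hypothetical, not taken from the library.

```python
# Map curly double and single quotes to their ASCII equivalents.
_SMART_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_smart_quotes(text: str) -> str:
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(_SMART_QUOTE_MAP)
```

After normalization, a field wrapped in curly double quotes (as produced by some word processors) is seen by the parser as an ordinary quoted field.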
Introduces the .csproj for Dataplat.Dbatools.Csv targeting net472 and net8.0, with NuGet metadata, build settings, source linking, and polyfill dependencies for net472.

dotnet nuget push bin/Release/Dataplat.Dbatools.Csv.1.0.0.nupkg \
  --api-key YOUR_API_KEY \
  --source https://api.nuget.org/v3/index.json
Improves CsvDataReader performance by caching the max source index and reusing a pooled StringBuilder for quoted field parsing, reducing allocations and LINQ usage. Enhances documentation and security notes in CsvReaderOptions and QuoteMode for better guidance on handling malformed data and field count mismatches. Adds a comprehensive README for Dataplat.Dbatools.Csv with usage examples and feature overview.
Introduces build.ps1 to automate cleaning, building, packaging, and optionally publishing the Dataplat.Dbatools.Csv project to NuGet. Supports configurable version, configuration, and API key for publishing.
Introduced CultureAwareConverterBase to centralize culture and number style handling for numeric type converters. Refactored Int16, Int32, Int64, Single, Double, Decimal, and Byte converters to inherit from the new base class, reducing code duplication and improving maintainability. Optimized smart quote normalization in CsvDataReader for performance using stackalloc and ArrayPool.
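The refactoring pattern described above, a shared base class that owns the culture handling while each numeric converter only supplies the final parse, might look roughly like this. It is a Python sketch, not the library's C# code; the class names other than `CultureAwareConverterBase`, and the reduction of "culture" to a decimal and a group separator, are simplifying assumptions.

```python
from decimal import Decimal


class CultureAwareConverterBase:
    """Holds the culture settings once so subclasses do not duplicate them."""

    def __init__(self, decimal_sep: str = ".", group_sep: str = ","):
        self.decimal_sep = decimal_sep
        self.group_sep = group_sep

    def _normalize(self, text: str) -> str:
        # Drop group separators first, then map the decimal separator to ".".
        return text.strip().replace(self.group_sep, "").replace(self.decimal_sep, ".")

    def convert(self, text: str):
        raise NotImplementedError


class IntConverter(CultureAwareConverterBase):
    def convert(self, text: str) -> int:
        return int(self._normalize(text))


class DecimalConverter(CultureAwareConverterBase):
    def convert(self, text: str) -> Decimal:
        return Decimal(self._normalize(text))
```

With a German-style configuration, `DecimalConverter(decimal_sep=",", group_sep=".").convert("1.234,56")` yields `Decimal("1234.56")`.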
Eliminates all references to LumenWorks from the build script and dbatools.library.psm1, including directory creation, package download, extraction, file copying, and module import. This streamlines the build process and reduces external dependencies.
Columns are now initialized during reader setup when HasHeaderRow is false, allowing SetColumnType and HasColumn to work before Read() is called. Added tests to verify correct behavior for no-header CSVs.
Adds string interning support to reduce memory allocations for repeated values in CSV files, configurable via CsvReaderOptions. Caches type converters per column to avoid per-row registry lookups, improving performance. Refactors quoted/unquoted field parsing for efficiency and updates type converters to use Span-based parsing on .NET 8+. Also adds App.config for test assembly binding redirects.
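String interning for repeated values can be sketched as a small pool that hands back one shared instance per distinct value. This Python version is only an illustration; the `StringPool` name and the `max_entries` cap are assumptions, and the library's actual interning strategy is not shown in this PR text.

```python
class StringPool:
    """Returns one shared string object per distinct value, up to a size cap."""

    def __init__(self, max_entries: int = 10_000):
        self._pool = {}
        self._max_entries = max_entries

    def intern(self, value: str) -> str:
        cached = self._pool.get(value)
        if cached is not None:
            return cached  # reuse the instance stored earlier
        if len(self._pool) < self._max_entries:
            self._pool[value] = value
        return value
```

The cap keeps the pool from growing without bound on high-cardinality columns, where interning saves nothing.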
Introduces parallel processing capabilities to CsvDataReader and CsvReaderOptions, enabling multi-threaded reading, parsing, and type conversion for improved performance on large CSV files. Includes new pipeline structures, thread management, error handling, and extensive tests for parallel scenarios. Parallelism is configurable via CsvReaderOptions.
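The general shape of a parallel parse stage, where worker threads convert lines while results stay in input order, can be sketched as follows. This is a Python illustration only: `parse_line` and `parse_parallel` are hypothetical names, the per-line work is deliberately trivial, and Python threads do not provide the CPU parallelism that .NET threads do, so the sketch shows structure rather than speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_line(line: str) -> list:
    """Split one line and convert each field to int (stand-in for type conversion)."""
    return [int(field) for field in line.split(",")]


def parse_parallel(lines, workers: int = 4) -> list:
    """Parse lines on a thread pool; Executor.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_line, lines))
```

An exception raised inside a worker surfaces when its result is consumed, which gives a single place to apply error handling.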
Introduces a high-performance direct field-by-field parsing path that eliminates intermediate line string allocations, improving CSV reading speed by 10-15%. Adds buffer-based parsing state, SIMD-accelerated delimiter search for .NET 8+, and refactors error and field count mismatch handling for both line-based and direct parsing modes.
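The idea of parsing fields directly from the buffer, instead of first materializing each line as a string and then splitting it, can be sketched as a single-pass state machine. This Python version assumes a single-character delimiter and double-quote escaping by doubling; it is not the library's implementation and omits the SIMD search entirely.

```python
def parse_records(buffer: str, delimiter: str = ","):
    """Yield records by scanning the buffer once, without per-line strings."""
    record, field, in_quotes = [], [], False
    i, n = 0, len(buffer)
    while i < n:
        ch = buffer[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and buffer[i + 1] == '"':
                    field.append('"')  # doubled quote is an escaped quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # delimiters and newlines are literal here
        elif ch == '"':
            in_quotes = True
        elif ch == delimiter:
            record.append("".join(field))
            field = []
        elif ch == "\n":
            record.append("".join(field))
            field = []
            yield record
            record = []
        elif ch != "\r":  # this sketch simply drops carriage returns
            field.append(ch)
        i += 1
    if field or record:  # final record without a trailing newline
        record.append("".join(field))
        yield record
```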
Enhanced documentation for Dataplat.Dbatools.Csv in main and package README files, including features, installation, and performance details. Updated build script to download and extract LumenWorksCsvReader for comparison. Bumped module version to 2025.11.28. Improved CSV package metadata in csproj. Added comprehensive PowerShell tests for CSV import, edge cases, and performance validation.
Introduces build-csv.ps1 to automate versioning, building, signing, and publishing of the Dataplat.Dbatools.Csv NuGet package. Supports optional signing and publishing, and manages build artifacts.
Introduced a new 'Development' section detailing the creation process, initial prompt, and refinement steps for the CSV library, including references to Claude Code and dbatools GitHub issues.
Pull Request Review: Introducing Dataplat.Dbatools.Csv
This is an impressive and comprehensive implementation of a modern CSV reader/writer library. The PR introduces 9,287 additions with a complete high-performance CSV solution to replace the legacy LumenWorks dependency. Here's my detailed review:
✅ Strengths
Code Quality & Best Practices
Refactored build-csv.ps1 to build and sign DLLs prior to packing the NuGet package, ensuring signed binaries are included. Removed post-pack signing and repacking logic, streamlining the signing process and improving package integrity.
Included Microsoft.SourceLink.GitHub package reference to enable SourceLink support for improved debugging and source code linking.
Improves thread-safety in CsvDataReader when parallel processing is enabled by using volatile fields, atomic operations, and locks to prevent torn reads and race conditions. Updates documentation and README to clarify thread-safety guarantees, and adds comprehensive stress tests to verify concurrent access to GetValue, GetValues, and CurrentRecordIndex.
Added detailed documentation and examples explaining how the CsvReaderOptions.DistinguishEmptyFromNull option affects parsing of empty fields and quoted empty strings in CSV files. Includes code samples, use case recommendations, and a quick reference table for expected behavior.
Introduced DefaultBufferSize constants in CsvReaderOptions and CsvWriterOptions and updated CompressionHelper to use these constants instead of hardcoded values. This improves maintainability and ensures consistent buffer size configuration across CSV reading, writing, and compression operations.
Replaces warning with a thrown error when DLL signing fails, preventing continuation with unsigned DLLs. Ensures build process integrity by stopping on signing errors.
Updated the catch block in TypeConverterRegistry to only handle FormatException, InvalidCastException, and OverflowException during type conversion, improving error handling specificity.
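The effect of narrowing the catch block can be shown with a small sketch: expected conversion failures are reported as "could not convert", while anything else propagates as a real bug. This is Python, so the exception types are analogues (`ValueError` and `OverflowError` standing in for the .NET exceptions named above), and `try_convert` is a hypothetical helper.

```python
def try_convert(converter, text: str):
    """Return (True, value) on success, (False, None) on an expected failure."""
    try:
        return True, converter(text)
    except (ValueError, OverflowError):
        # Only conversion-related errors are swallowed; programming errors
        # such as KeyError or AttributeError still reach the caller.
        return False, None
```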
Replaced manual character-by-character delimiter comparison with Span.SequenceEqual for improved performance and readability in MatchesDelimiterAt and MatchesDelimiterAtPosition methods.
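The change from a hand-written comparison loop to a single sequence comparison can be illustrated side by side. This is a Python analogue in which `str.startswith` with a start offset plays the role of the span comparison; the function names below are hypothetical and are not the methods named above.

```python
def matches_delimiter_manual(buffer: str, pos: int, delimiter: str) -> bool:
    """Character-by-character comparison, as in the older style."""
    if pos + len(delimiter) > len(buffer):
        return False
    for offset, ch in enumerate(delimiter):
        if buffer[pos + offset] != ch:
            return False
    return True


def matches_delimiter_at(buffer: str, pos: int, delimiter: str) -> bool:
    """Single built-in comparison; bounds are handled by startswith itself."""
    return buffer.startswith(delimiter, pos)
```

Both functions return the same result for a non-empty delimiter; the second leaves bounds checks and the comparison loop to the runtime.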
Pull Request Review: Dataplat.Dbatools.Csv
Overall Assessment
This is an impressive and well-engineered PR that introduces a high-performance CSV library to replace the legacy LumenWorks dependency. The implementation demonstrates strong engineering practices, comprehensive testing, and thoughtful design. I recommend approval with minor suggestions for consideration.
✅ Strengths
Code Quality & Architecture
🔍 Code Quality Issues & Recommendations
Minor Issues
🛡️ Security Review
✅ Security Strengths
⚡ Performance Considerations
✅ Performance Strengths
💡 Performance Suggestions
🧪 Test Coverage Assessment
✅ Testing Strengths
💡 Testing Suggestions
📝 Summary & Rating
Overall Recommendation: ✅ APPROVE (with minor suggestions for future consideration)
🎯 Action Items for Consideration
Optional Before Merge
Post-Merge Future Enhancements
📋 Additional Observations
Project Structure
✅ Excellent Separation: Standalone NuGet package structure is well-designed
Future Opportunities
🎉 Conclusion
Great work on this PR! The CSV library is a significant improvement over LumenWorks with:
The standalone NuGet package approach also provides value to the broader .NET community beyond dbatools users.
Reviewed with: Claude Code (Sonnet 4.5)
Replaced regex-based version update and read logic with XML parsing for safer and more reliable manipulation of the .csproj file's Version property.
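Reading and updating the version through an XML parser rather than a regex can be sketched as below. The build script itself is PowerShell; this Python version only illustrates the approach, and it assumes an SDK-style .csproj with no XML namespace and a `<Version>` element directly under a `<PropertyGroup>`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def get_version(csproj_xml: str):
    """Return the text of the first PropertyGroup/Version element, or None."""
    node = ET.fromstring(csproj_xml).find("./PropertyGroup/Version")
    return None if node is None else node.text


def set_version(csproj_xml: str, new_version: str) -> str:
    """Return the project XML with PropertyGroup/Version set to new_version."""
    root = ET.fromstring(csproj_xml)
    node = root.find("./PropertyGroup/Version")
    if node is None:
        raise ValueError("no <Version> element found")
    node.text = new_version
    return ET.tostring(root, encoding="unicode")
```

Unlike a regex, the parser cannot accidentally match a `Version` attribute on a package reference or a commented-out element.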
Introduces a CHANGELOG.md documenting notable features, improvements, and changes for Dataplat.Dbatools.Csv, including initial release details and unreleased updates.
Introduces MIGRATING-FROM-LUMENWORKS.md, a comprehensive guide for transitioning from LumenWorks.Framework.IO.Csv to Dataplat.Dbatools.Csv. The guide covers feature comparisons, API mappings, constructor and property changes, error handling, new features, and troubleshooting tips to assist developers in updating their CSV processing code.
Introduces unit tests to verify decompression bomb protection in CSV reading, including cases for exceeding limits, allowing within limits, and unlimited decompression. Ensures security against resource exhaustion attacks when handling compressed CSV input.
Introduced backing field and validation logic for the Delimiter property to prevent null or empty values. This change ensures that the delimiter is always a non-empty string, addressing potential runtime issues.
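The backing-field-plus-validation pattern described above looks roughly like this in a Python sketch. `ReaderOptions` is a hypothetical stand-in for the options class, and the comma default and the exception type are assumptions for the example.

```python
class ReaderOptions:
    """Options object whose delimiter can never be null or empty."""

    def __init__(self):
        self._delimiter = ","  # backing field with an assumed default

    @property
    def delimiter(self) -> str:
        return self._delimiter

    @delimiter.setter
    def delimiter(self, value: str) -> None:
        # Reject invalid values at assignment time so the parser never
        # has to handle a missing delimiter later.
        if value is None or value == "":
            raise ValueError("Delimiter cannot be null or empty")
        self._delimiter = value
```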
Pull Request Review: Introducing Dataplat.Dbatools.Csv
I've completed a comprehensive review of this PR. Overall, this is excellent work with professional-grade code quality and thoughtful implementation. The CSV library is well-architected and production-ready.
✅ Code Quality & Best Practices
Strengths:
Minor Suggestions:
🧪 Test Coverage
Strengths:
Suggestions:
🔒 Security Review
Strengths:
Observations:
⚡ Performance Considerations
Strengths:
Observations:
📦 Build & Configuration
Strengths:
Issues Found:
Minor suggestions:
📚 Documentation Quality
Strengths:
Suggestions:
🐛 Potential Bugs
None found - the code is very clean. The use of
🎯 Breaking Changes & Compatibility
Observations:
✨ Highlights
📋 Recommendations
Before Merge:
Nice to Have (Post-Merge):
🎉 Final Verdict
APPROVED ✅
This is high-quality, production-ready code. The CSV library is well-architected, thoroughly tested, and thoughtfully documented. The migration guide alone shows great care for users. The security protections (decompression bombs, field length limits) demonstrate mature engineering. Great work, @potatoqualitee and the dbatools team! This will be a valuable standalone package for the .NET community.
This pull request introduces a new standalone high-performance CSV reader/writer library (Dataplat.Dbatools.Csv) to the dbatools.library project, replacing the legacy LumenWorks dependency and modernizing CSV handling for .NET. It adds build scripts, documentation, and integration details for the new package, and updates the main library to reference and describe the CSV functionality.

CSV Library Addition and Integration
Added project/Dataplat.Dbatools.Csv with its own .csproj, README, and build scripts, providing a modern CSV reader/writer for .NET Framework 4.7.2 and .NET 8.0. Features include streaming IDataReader for bulk operations, parallel processing, automatic compression, robust error handling, and security protections. [1] [2] [3]
Added build/build-csv.ps1 to automate versioning, signing, and publishing of the CSV NuGet package, including support for Azure Trusted Signing and decompression bomb protection.

Documentation and Project Structure Updates
Updated README.md to describe the new CSV library, its features, installation instructions, and how it fits into the overall project structure. Added references and badges for the standalone CSV NuGet package. [1] [2] [3] [4]

Codebase Modernization and Cleanup
Removed LumenWorks references from dbatools.library.psm1, signaling the transition to the new CSV implementation.
Updated dbatools.library.psd1 to reflect the new release and CSV functionality.

(References: [1] [2] [3] [4] [5] [6] [7] [8] [9] [10])