Alines() used with multiple String „Parsechar“ to parse large files ?

This forum is meant for questions about the Visual FoxPro Language support in X#.

mainhatten
Posts: 200
Joined: Wed Oct 09, 2019 6:51 pm

Post by mainhatten »

Hi,
I volunteered to code the ALINES() implementation in X#, as it fits with the GetWord* functions I did earlier and is often used here. I don't think I know enough of X# yet to try for TableUpdate(), CursorSetProp()/CursorGetProp(), buffer mode or the CursorAdapter package, which are what I currently miss most in X#.

In my use cases there are ~2.8 main usages for ALINES():
Splitting texts into an array of lines, as the name suggests, often with default parameters
Splitting CSV lines (often Excel SSV, semicolon-separated values...) into an array of items
Splitting into words using multi-char separators when GetWordNum() does not fit (the 0.8 usage)

For these use cases I have ample test material from War & Peace. It is needed because pushing .NET performance to where Fox is when running inside the Fox C runtime is not easy, especially as .NET works on Unicode „char“, which is more demanding than the always-1-byte char used in Fox and strings thereof.

Now the ALINES() parameter name „cParseChar“ is a bit misleading: you can use Len(separator) > 1 and it works. The default parse „chars“ are Chr(13), Chr(13)+Chr(10), Chr(10). The worst case would be a long file with many separators all starting with the same char. One possible and relatively demanding scenario I came up with is a special „partial parsing“ of XML/HTML, with most separators starting with „<“. Perhaps as a multi-step sequence, eliminating void tags in all forms of tagging first and getting to the interesting meat afterwards...
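To make the worst case above concrete, here is a rough sketch in Python (not X# or VFP code). It assumes the splitter tries separators in the order listed, and `index_by_first_char` and `split_with_index` are hypothetical helper names. Bucketing separators by first character skips most comparisons on ordinary text, but when every separator starts with „<“ they all land in one bucket and the shortcut buys nothing.

```python
from collections import defaultdict


def index_by_first_char(separators):
    """Group separators by first character, keeping their listed order."""
    index = defaultdict(list)
    for sep in separators:
        if sep:  # ignore empty separators
            index[sep[0]].append(sep)
    return dict(index)


def split_with_index(text, separators):
    """Split text, comparing only separators that start with the current char."""
    index = index_by_first_char(separators)
    parts, start, pos = [], 0, 0
    while pos < len(text):
        for sep in index.get(text[pos], ()):
            if text.startswith(sep, pos):
                parts.append(text[start:pos])
                pos += len(sep)
                start = pos
                break
        else:
            pos += 1  # no separator starts at this position
    parts.append(text[start:])
    return parts


# Hypothetical tag list: all three separators share the bucket for "<".
tags = ["<br/>", "<p>", "</p>"]
pieces = split_with_index("<p>a<br/>b</p>", tags)
print(pieces)
```

Because separators with different first characters can never match at the same position, the bucketing does not change which separator wins; it only reduces how many are compared.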

I can build a table with heavy HTML content by de-7zipping .chm files and use those in somewhat concocted runs with various tags, but I would prefer to employ anything formed by real outside needs.

If anybody has a data file and ALINES() calls doing hefty but real-world work with
alines(taAlines, cHeftySourceStrOrTableofSuchStrings, nWhatFitsThePurpose ;
, cStringWithLenGT1_1, cStringWithLenGT1_2, cStringWithLenGT1_3 ;
[ , cStringWithLenGT1_4, cStringWithLenGT1_5, cStringWithLenGT1_6...] )

to check and optimize my implementation on, receiving such data and alines() calls here or via private msg or directly via email would be splendid.

tia
thomas
lumberjack
Posts: 723
Joined: Fri Sep 25, 2015 3:11 pm

Post by lumberjack »

Hi Thomas,
Have you looked at the .NET String.Split overloads that accept arrays of char and strings?
mainhatten
Posts: 200
Joined: Wed Oct 09, 2019 6:51 pm

Post by mainhatten »

lumberjack wrote: Hi Thomas,
Have you looked at the .NET String.Split overloads that accept arrays of char and strings?
Hi Johan,
thx for chiming in. Yes, I have looked at String.Split(), which might resolve ~90% of the calls/use cases if everything falls into place, but I have not even started to check in earnest.

OTOH it is already clear that not all of the VFP nFlags can be handled, even if those use cases might crop up less often. The current .NET 5 docs list [None, RemoveEmptyEntries, TrimEntries], while XIDE only shows [None, RemoveEmptyEntries] as options. I must check whether TrimEntries is available in all .NET versions, as the issue proposing it, https://github.com/dotnet/runtime/issues/31038, seems to be from 2019. Perhaps Chris or Robert know a shortcut and can chime in before I try to search versions.

Then there might be "different opinions" of what is whitespace / empty: VFP Empty() includes Chr(13), the default value of GetWord* does not, and .NET has its own idea of whitespace. In GetWord* I resolved this by giving X# coders the explicit option to use the .NET whitespace definition to separate words. If similar differences exist in the trimming of ALINES(), I will probably offer a new flag to run with the .NET specification. So testing edge cases will have to look into more edges.
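A small Python illustration of why the whitespace definition matters for trimming. Python's `str.strip()` merely stands in for "a runtime with its own whitespace set"; it is not a claim about what VFP or .NET actually trim, and `trim_entries` is a hypothetical helper.

```python
def trim_entries(parts, trim_chars=" "):
    """Trim each entry; trim_chars=None means Python's own whitespace set."""
    return [part.strip(trim_chars) for part in parts]


raw = ["  alpha ", "\tbeta\r", " \r "]

spaces_only = trim_entries(raw)              # strips blanks only
python_whitespace = trim_entries(raw, None)  # also strips tab, CR, LF, ...

# The two definitions disagree on what the entries look like afterwards,
# and even on how many entries end up empty.
empty_spaces_only = sum(1 for p in spaces_only if p == "")
empty_python_ws = sum(1 for p in python_whitespace if p == "")
print(spaces_only, python_whitespace)
```

The same input yields a different number of empty entries under the two definitions, so a "remove empty elements" flag combined with trimming can change the array length depending on which definition is applied.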

Use cases for the other flags?
Including the last element in the array even if the element is empty: might be needed when searching for EOF, EOT, EOS and similar markers used in last-century OSes with different values.
Case-insensitive parsing might be easy to retrofit on char/char arrays, but would it need all case permutations in string[], given that the result probably has to keep its original case?
Including the parsing characters in the array might be beneficial for ground-level parsing.

So yes, Split() will be offered and perhaps be used for a subset of flags, but how much it will cover is yet uncertain.

thx again
thomas
Chris
Posts: 4562
Joined: Thu Oct 08, 2015 7:48 am
Location: Greece

Post by Chris »

Hi Thomas,

Indeed, this is not included in the version of the .NET Framework I have either. I suspect it is only included in .NET Core. In any case, I wouldn't write code for the runtime that depends on it!
Chris Pyrgas

XSharp Development Team test
chris(at)xsharp.eu
mainhatten
Posts: 200
Joined: Wed Oct 09, 2019 6:51 pm

Post by mainhatten »

Hi Chris,
thx for confirming. Adding .Trim() in fluent interfaces probably won't inflate code size very much ;-)

More a reminder for myself to think about such possible side effects when developing with the "latest" .NET version. VFP did not give us that many options, including options to shoot yourself in the foot ;-)

regards
thomas
Chris wrote: Indeed, this is not included in the version of the .Net framework I have either, I suspect it's only being included in .Net core. In any case, I wouldn't write code for the runtime that depends on it!
atlopes
Posts: 83
Joined: Sat Sep 07, 2019 11:43 am

Post by atlopes »

Thomas, although not aimed directly at the issue of parsing large files, and since you're working with the ALINES() function implementation, please keep in mind that the order of the parsing characters is relevant to how the string is split into array elements.

Also, contrary to the VFP documentation, CHR(10) + CHR(13) is not handled as a single splitter.

Code:

CLEAR

LOCAL ARRAY Test1[1], Test2[1], Test3[1]
LOCAL Source AS String

m.Source = "A" + CHR(13) + CHR(10) + "B" + CHR(10) + CHR(13) + CHR(10) + "C"

ALINES(m.Test1, m.Source)
ALINES(m.Test2, m.Source, 0, CHR(13) + CHR(10), CHR(13), CHR(10))  && same as default
ALINES(m.Test3, m.Source, 0, CHR(13), CHR(10), CHR(13) + CHR(10))

DISPLAY MEMORY LIKE Test?
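
The order sensitivity can be modeled outside VFP. The following Python sketch is only a model: it assumes that, at each position, the first listed separator that matches wins, and it ignores nFlags entirely. `split_in_listed_order` is a hypothetical name, and the results should be compared against the DISPLAY MEMORY output of the code above rather than taken as VFP's behavior.

```python
def split_in_listed_order(text, separators):
    """Split text; at each position the first listed separator that matches wins."""
    parts, start, pos = [], 0, 0
    while pos < len(text):
        for sep in separators:
            if sep and text.startswith(sep, pos):
                parts.append(text[start:pos])
                pos += len(sep)
                start = pos
                break
        else:
            pos += 1  # no separator matches here, move on
    parts.append(text[start:])
    return parts


CR, LF = "\r", "\n"
source = "A" + CR + LF + "B" + LF + CR + LF + "C"

crlf_first = split_in_listed_order(source, [CR + LF, CR, LF])
crlf_last = split_in_listed_order(source, [CR, LF, CR + LF])
print(len(crlf_first), len(crlf_last))
```

With CR+LF listed first, each CR+LF pair is consumed as one separator. With CR and LF listed before it, the pair is never reached as a unit and every CR+LF produces an extra empty element. In both orders the LF+CR sequence in the middle is split as two separate separators in this model.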
mainhatten
Posts: 200
Joined: Wed Oct 09, 2019 6:51 pm

Post by mainhatten »

Hi Antonio,
atlopes wrote: please keep in mind that the order of the parsing characters is relevant to how the string is split into array elements.
Yupp, this is an implementation detail I have to follow, even if from my very own POV it borders on a bug. I'd divide on the longest fitting string first, but the VFP approach might lead to a faster implementation, as less search/comparison effort is needed. Keeping VFP compatibility is the target, not my sometimes warped sense of fitness ;-)
atlopes wrote: Also, contrary to the VFP documentation, CHR(10) + CHR(13) is not handled as a single splitter.
Again ACK. This is as strange as CHR(13) not being classified as a separator in GetWord* while it counts as empty for Empty().
Added your code to my examples. I had reached the same conclusion when splitting long strings of hex digits.
thx & regards
thomas