c - Sequential, subsequent loading of files gets much slower over time
I have the following code to read and process several very large files, one after another.
    for (j = 0; j < CORES; ++j) {
        double time = omp_get_wtime();
        printf("File: %d, time: %f\n", j, time);

        char in[256];
        sprintf(in, "%s.%d", FIN, j);   /* input files are named FIN.0, FIN.1, ... */
        FILE *f = fopen(in, "r");
        if (f == NULL) {
            fprintf(stderr, "open failed: %s\n", in);
            exit(1);
        }

        int i;
        char buffer[1024];
        char *tweet;
        int takeTime = 1;
        /* each file fills its own slice of the big TWEETS array */
        for (i = 0, tweet = TWEETS + (size_t)j * (size_t)TNUM * (size_t)TSIZE;
             i < TNUM; i++, tweet += TSIZE) {
            double start;
            double end;
            if (takeTime) {
                start = omp_get_wtime();
                takeTime = 0;
            }
            char *line = fgets(buffer, 1024, f);
            if (line == NULL) {
                fprintf(stderr, "error reading line %d\n", i);
                exit(2);
            }
            int fn = readNumber(&line);
            int ln = readNumber(&line);
            int month = readMonth(&line);
            int day = readNumber(&line);
            int hits = countHits(line, key);
            writeTweet(tweet, fn, ln, hits, month, day, line);

            if (i % 1000000 == 0) {
                end = omp_get_wtime();
                printf("Line: %d, time: %f\n", i, end - start);
                takeTime = 1;
            }
        }
        fclose(f);
    }

Each file contains 24,000,000 tweets and I read 8 files in total, one after the other. Each line (1 tweet) is processed, and writeTweet() basically copies a modified line into one big char array.
As you can see, I measure the time to see how long it takes to read and process 1 million tweets. For the first file it is about 0.5 seconds per 1 million, which is fast enough, but with every additional file it takes longer. File 2 takes about 1 second per 1 million lines (not every time, only in some iterations), and it goes up to 8 seconds for file number 8. Is this to be expected? Can I speed things up? All files are more or less identical and always have 24 million lines.
EDIT: Additional information: each file needs, in its processed form, about 730 MB of RAM. That means that with 8 files we end up with a memory requirement of approximately 6 GB.
As requested, the content of writeTweet():
    void writeTweet(char *tweet, const int fn, const int ln, const int hits,
                    const int month, const int day, char *line) {
        /* record layout: bytes 0-1 fn, 2-5 ln, 6 hits, 7 month, 8 day, 9.. text */
        short *ptr1 = (short *) tweet;
        *ptr1 = (short) fn;
        int *ptr2 = (int *) (tweet + 2);
        *ptr2 = ln;
        *(tweet + 6) = (char) hits;
        *(tweet + 7) = (char) month;
        *(tweet + 8) = (char) day;
        int i;
        int n = TSIZE - 9;
        for (i = strlen(line); i < n; i++)
            line[i] = ' ';   // padding
        memcpy(tweet + 9, line, n);
    }
Possibly writeTweet() is the bottleneck. If you copy all processed tweets into memory, then over time you build up a huge data array that the operating system has to manage. If there is not enough memory, or other processes on the system are actively using it, the OS will (in most cases) swap part of the data out to disk. That increases the access time to the array. There are also more hidden mechanisms in the OS that can affect performance. You should not store all processed lines in memory. The easiest way out is to dump the processed tweets to disk (write them to a file). However, the right solution depends on how you use the processed tweets afterwards. If you do not access the data in the array sequentially, it is worth thinking about a specialized data structure for storage. Many libraries already exist for this purpose.
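A minimal sketch of the "dump to disk" idea, under stated assumptions: the record size TSIZE is set to 32 here only for illustration (the question does not give its value), and dump_tweet() and load_tweet() are hypothetical helper names, not part of the original code. Each processed record is appended to a binary file and can later be read back by its index.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed record size; the question uses TSIZE but never states its value. */
#define TSIZE 32

/* Hypothetical helper: append one processed TSIZE-byte record to an open
 * binary file. Returns 0 on success, -1 on a short write. */
int dump_tweet(FILE *out, const char *record)
{
    assert(out != NULL && record != NULL);
    return fwrite(record, 1, TSIZE, out) == TSIZE ? 0 : -1;
}

/* Hypothetical helper: read record number `index` back from the file.
 * Note: fseek() takes a long offset; where long is 32 bits, files beyond
 * 2 GB would need fseeko() or a similar 64-bit interface instead. */
int load_tweet(FILE *in, long index, char *record)
{
    assert(in != NULL && record != NULL);
    if (fseek(in, index * (long)TSIZE, SEEK_SET) != 0)
        return -1;
    return fread(record, 1, TSIZE, in) == TSIZE ? 0 : -1;
}
```

In the reading loop, writeTweet() would then fill a small local buffer of TSIZE bytes that is passed to dump_tweet(), so only one record at a time has to live in memory. Whether this is faster overall depends on how the records are used afterwards.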
UPD: To manage such a large amount of memory, the memory manager in the kernel builds special reference structures. Usually these are maps that refer to sub-maps, so for large memory volumes this becomes a rather large tree structure, and random pieces of memory that are addressed during work often require lookups in arbitrary pages of it. The OS uses a special cache to speed up addressing. I do not know all the nuances of this process, but I think that in this case the cache is invalidated often, because there is not enough room to hold all the references at the same time. That is an expensive operation and it reduces performance. The more memory is used, the stronger the effect.
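One way to check whether paging is actually involved, rather than guessing, is to sample the process's page-fault counters before and after each file. The sketch below assumes a Linux or BSD system, where getrusage() fills in ru_majflt and ru_minflt (POSIX itself only guarantees the CPU-time fields); fault_count, read_faults() and print_fault_delta() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Snapshot of the fault counters of the current process. */
typedef struct {
    long major;  /* faults that required I/O, e.g. reading a page back from swap */
    long minor;  /* faults served without I/O */
} fault_count;

/* Hypothetical helper: fill `out` with the current counters.
 * Returns 0 on success, -1 if getrusage() fails. */
int read_faults(fault_count *out)
{
    struct rusage ru;
    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->major = ru.ru_majflt;
    out->minor = ru.ru_minflt;
    return 0;
}

/* Hypothetical helper: print how many faults occurred between two snapshots. */
void print_fault_delta(int file_index, const fault_count *before,
                       const fault_count *after)
{
    printf("File %d: major faults +%ld, minor faults +%ld\n", file_index,
           after->major - before->major, after->minor - before->minor);
}
```

Calling read_faults() at the start and end of each iteration of the outer file loop and printing the difference would show whether the slow files coincide with a jump in major faults. If they do, swapping is a plausible cause; if not, the slowdown probably lies elsewhere.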
If you need to sort the big tweets array, you do not have to store everything in memory. There are ways to sort data on disk. If you do want to sort the data in memory, it is not necessary to actually swap the array elements. It is better to use an intermediate structure holding references to the elements of the tweets array, and to sort the references rather than the data.
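Sorting references instead of records could look roughly like this. It is a sketch under assumptions: TSIZE is again set to 32 for illustration, the sort key (month at byte 7, then day at byte 8, following the layout in writeTweet()) is chosen arbitrarily, and sorted_refs() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed record size; the question uses TSIZE but never states its value. */
#define TSIZE 32

/* Compare two records through their pointers: by month (byte 7),
 * then by day (byte 8). The key choice is only an example. */
static int by_month_day(const void *a, const void *b)
{
    const char *ra = *(const char *const *)a;
    const char *rb = *(const char *const *)b;
    if (ra[7] != rb[7])
        return (ra[7] > rb[7]) - (ra[7] < rb[7]);
    return (ra[8] > rb[8]) - (ra[8] < rb[8]);
}

/* Build an array of pointers to the `count` records stored back to back in
 * `tweets` and sort the pointers. The records themselves are never moved.
 * Returns NULL if allocation fails; the caller frees the result. */
char **sorted_refs(char *tweets, size_t count)
{
    assert(tweets != NULL && count > 0);
    char **refs = malloc(count * sizeof *refs);
    if (refs == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        refs[i] = tweets + i * TSIZE;
    qsort(refs, count, sizeof *refs, by_month_day);
    return refs;
}
```

Each swap inside qsort() then moves one pointer instead of a full TSIZE-byte record. The pointer array costs memory of its own, so with very many records an array of 32-bit indices might be the more compact variant.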