QQ: Nanoscale Timing and Profiling James Frye † *, James G. King † *, Christine J. Wilson * ◊, Frederick C. Harris, Jr. † * † Department of Computer Science.


1 QQ: Nanoscale Timing and Profiling James Frye † *, James G. King † *, Christine J. Wilson * ◊, Frederick C. Harris, Jr. † * † Department of Computer Science and Engineering *Brain Computation Lab ◊ Biomedical Engineering University of Nevada Reno, NV 89557

2 What is QQ
- QQ is a simple and efficient tool for measuring timing and memory use
- Developed for the examination of a massively parallel program (NCS)
- Easily extensible to inspect other programs

3 The Place: The Human Brain
- Goal: create the first large-scale, synaptically realistic cortical computational model.
- Purpose:
  - Simulation Experiments
  - Drug Trials
  - Alzheimer’s Research
  - Robotics

4 The Science:
- Neurons
  - Excitatory
  - Interneurons (inhibitory)
- Columns
  - High connectivity within columns.
  - Less connectivity across columns.

5 The Science (cont):
- Channels
  - Potassium Family
  - M, A, AHP Channels
  - Suppressing behavior on parent cell
- Synapses
  - Analog converter of binary spike event.
  - Contextual filters.

6 The Science (cont): Neurons

7 NCS Biology
- The membrane voltage determines the cell’s firing rate
- Once the threshold voltage is reached, the cell sends an action potential to its connected synapses
[Figure: action potential trace, membrane voltage in mV (axis marks at -45, 0 and 30) against time in ms]

8 2-Cell Model
[Figure: pre-synaptic cell and post-synaptic cell; voltage trace with a 0.2 mV scale against time, 0 to 500 ms]

9 No Channels Sustained firing at maximum rate during a continuous stimulus

10 Ka Channel Slows the initial response during a sustained stimulus

11 Km Channel Prevents continuous bursting during a continuous stimulus

12 Kahp Channel Dampens the effect while still allowing for some action potentials during a sustained stimulus

13 QQ Development
- QQ was developed to optimize a parallel program used to simulate cortical neurons: the NeoCortical Simulator (NCS)
- Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time
- Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity
- Clearly, optimization of NCS was needed

14 QQ Design
- QQ is designed so that all of its routines can be selectively compiled into a program
- In the QQ.h header file, each routine is defined with a preprocessor directive, so that if profiling is not enabled, it reduces to an empty statement.

    #ifdef QQ_ENABLE
    void QQInit (int);
    #else
    #define QQInit(dummy)
    #endif
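The selective-compilation pattern on this slide can be sketched as follows. Only `QQ_ENABLE` and `QQInit` come from the slides; the function body, the counter `qq_init_calls`, the meaning of the integer argument and `run_simulation` are hypothetical stand-ins for illustration.

```c
#include <assert.h>

/* Profiling is switched on for this sketch. Remove the define (or pass
 * -DQQ_ENABLE on the compiler command line instead) to toggle it. */
#define QQ_ENABLE

#ifdef QQ_ENABLE
static int qq_init_calls = 0;          /* hypothetical bookkeeping, not part of QQ */

static void QQInit(int nodeId)         /* stand-in body; the real routine is not shown in the slides */
{
    assert(nodeId >= 0);
    qq_init_calls++;
}
#else
/* Function-like macro: there must be no space between the name and "(",
 * so that the call QQInit(0); expands to the empty statement ";". */
#define QQInit(dummy)
#endif

/* Application code is written once and compiles in both configurations. */
static int run_simulation(void)
{
    QQInit(0);
    return 42;                         /* placeholder for real work */
}
```

With `QQ_ENABLE` undefined, the call inside `run_simulation` disappears at preprocessing time, so the disabled build carries no call overhead.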

15 QQ Design
- Memory profiling routines also use the C preprocessor to intercept library calls

    #ifdef QQ_ENABLE
    #define malloc(arg) MemMalloc (MEM_KEY, arg)
    #endif

- The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
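A minimal sketch of such an intercepting wrapper, under stated assumptions: `MemMalloc`, `MEM_KEY` and `KEY_CHANNEL` are named in the slides, but the counters, the second key `KEY_SYNAPSE` and the helper `make_channel_state` are hypothetical, and the real MemMalloc likely records more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_TOTAL };   /* hypothetical key set */

static size_t mem_calls[KEY_TOTAL];   /* allocation calls per key */
static size_t mem_bytes[KEY_TOTAL];   /* bytes requested per key */

/* Defined before malloc is redefined, so this call reaches the C library. */
static void *MemMalloc(int key, size_t size)
{
    assert(key >= 0 && key < KEY_TOTAL);
    void *p = malloc(size);
    if (p != NULL) {                  /* record only successful allocations */
        mem_calls[key]++;
        mem_bytes[key] += size;
    }
    return p;
}

/* From here on, every malloc call in this file is charged to MEM_KEY. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, arg)

static double *make_channel_state(size_t n)
{
    return malloc(n * sizeof(double));   /* expands to MemMalloc(KEY_CHANNEL, ...) */
}
```

The macro has to be defined after the system headers are included, otherwise it would rewrite the library's own declaration of `malloc`.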

16 QQ Timing
- Extremely accurate measurement of execution speed.
- In theory, fine-grained resolution to a single clock cycle, using the IA32 instruction RDTSC
- In practice, measurements are accurate to tens of cycles, because of instruction reordering and multiple pipelines in the CPU
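Reading the cycle counter can be sketched as below. This is not QQ's own code: it assumes a GCC or Clang toolchain, where `<x86intrin.h>` provides the `__rdtsc()` intrinsic, and the names `read_cycles` and `timer_overhead` are hypothetical. On non-x86 targets the sketch falls back to `clock()`, which counts processor-time ticks rather than cycles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* GCC/Clang header that provides __rdtsc() */
static uint64_t read_cycles(void) { return (uint64_t)__rdtsc(); }
#else
#include <time.h>
/* Fallback for other CPUs: ticks from clock(), not cycles. */
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Smallest observed cost of two back-to-back counter reads. This is an
 * estimate of the fixed measurement overhead that can be subtracted
 * from the cycle count of a timed region. */
static uint64_t timer_overhead(int samples)
{
    assert(samples > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t start = read_cycles();
        uint64_t stop = read_cycles();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

Taking the minimum over many samples filters out interrupts and cache misses. The processor may still reorder instructions around the counter read, which is consistent with the slide's figure of tens of cycles in practice.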

17 Timing Measurements
- Measuring the impact of a one-line change in the calculation for the Km channel
  From:
    I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);
  To:
    I = unitaryG * strength * (ReversePot - CmpV);
- For a Km-type channel, mPower is always 1, so we were able to change the equation to streamline the execution
- Wrapping the line in calls to QQ, we measure the effect of this single change

    QQStateOn (QQ_Km);
    I = unitaryG * strength * (ReversePot - CmpV);
    QQStateOff (QQ_Km);

18 Timing Measurements
- Note that both code versions give similar cycle counts on different processors, though the counts are more consistent and somewhat lower on the P4 than on the P3.
- Times for similar counts are proportional to processor speed, as expected.
- The function call pays a heavy penalty on its first call. It is called only by the Km channel code in this program, so the time represents the first load of the code into cache

19 Timing Measurements PIII – 800 MHz

20 Timing Measurements P4 – 2200 MHz

21 Expanding Timing Information
- QQ allows the user to record an additional item of information with the normal timing.
- QQCount records an integer with the key
    QQCount( eventKey, integer_of_interest );
- QQValue records a double precision floating point value with the key
    QQValue( eventKey, double_of_interest );
- QQState records a state of ON or OFF with the key
    QQStateOn( eventKey );
    QQStateOff( eventKey );
- These will be described during discussion of the output format

22 QQ Memory
- Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy

23 QQ Memory Example
- NCS implementation of ion channels
- Suppose we want to know the total memory used by all channels. Each channel function would require the channel key:
    #define MEM_KEY KEY_CHANNEL
- Then at any point in the program execution, just call the MemPrint function to display memory use

24 Memory Usage Output
Memory Allocation: Total Allocated = 988 KBytes

Object               Number   Number   Object  Alloc  Total  Max
Item         Size    Created  Deleted  KB      KB     KB     KB
Brain        120     1        0        1       0      1      1
CellManager  44      1        0        1       1      1      1
Cell         16      100      0        2       0      2      2
Channel      252     300      0        74      0      74     74
Compartment  324     100      0        32      2      33     33
MessageMgr   16      1        0        1       205    205    205
MessageBus   0       0        0        0       1      1      1
Report       80      1        0        1       1      1      1
Stimulus     252     1        0        1       1      1      1
Synapse      44      10000    0        430     118    547    547
-----------------------------------------------------------------
1            2       3        4        5       6      7      8

Key
1 - Internal name given to recording category
2 - The size of the object being allocated - it's valid only if all allocations are the same size, as with "new Object".
3 - Number of allocation calls made: new, malloc, calloc, etc.
4 - Number of free or delete calls made
5 - KBytes allocated via object creation (new)
6 - KBytes allocated via *alloc calls
7 - Total memory currently allocated
8 - Max memory ever allocated = high-water mark.

25 QQ Applications
- Brain Communication Server (BCS)
- NCS

26 Further experimentation with the simulator required that another application be developed to coordinate communication between NCS and numerous potential clients:
- virtual creatures
- physical robots
- visualization tools
[Diagram labels: BCS (Brain Communication Server), NCS]

27 Optimizing BCS
- Different applications make non-sequential requests.
- No single function was called in a loop iterating several times, so time needed to be measured over the course of execution, followed by an analysis of QQ’s final output.

28 Parsing QQ’s output
- QQ uses a straightforward layout for the final output file
- The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display
- The following slides describe the output format and how to manage the information

29 QQ file format
- Header: Number of Keys (int), Key Name string length (int)
- Key Table: for each Key, Key ID (int), Key type (int), Key name (char *)
- Node Information: Number of nodes (int)
- Node Table: for each Node, Byte offset to data (size_t), Number of entries (int), Starting Base Time (unsigned long long), MHz (double)
- Data: for each Node, for each entry, item (QQItem)
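As an illustration of the first two sections of this layout, the sketch below writes and reads back a header and key table. Several details are assumptions, since the slides do not specify them: key names are taken to be fixed-width fields of "Key Name string length" bytes, integers are written in the machine's native size and byte order, and `KeyEntry`, `write_key_table`, `read_key_table` and `free_key_table` are hypothetical names. The node table and data sections are not covered.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int id;
    int type;
    char *name;   /* NUL-terminated */
} KeyEntry;

static void free_key_table(KeyEntry *keys, int nKeys)
{
    if (keys == NULL) return;
    for (int i = 0; i < nKeys; i++) free(keys[i].name);
    free(keys);
}

/* Writes the header and key table. Names are padded or truncated to
 * nameLen bytes (assumed fixed-width field). Returns 0 on success. */
static int write_key_table(FILE *fp, const KeyEntry *keys, int nKeys, int nameLen)
{
    assert(fp != NULL && keys != NULL && nKeys > 0 && nameLen > 0);
    char *buf = malloc((size_t)nameLen);
    if (buf == NULL) return -1;
    int ok = fwrite(&nKeys, sizeof nKeys, 1, fp) == 1 &&
             fwrite(&nameLen, sizeof nameLen, 1, fp) == 1;
    for (int i = 0; ok && i < nKeys; i++) {
        strncpy(buf, keys[i].name, (size_t)nameLen);   /* zero-pads short names */
        ok = fwrite(&keys[i].id, sizeof(int), 1, fp) == 1 &&
             fwrite(&keys[i].type, sizeof(int), 1, fp) == 1 &&
             fwrite(buf, 1, (size_t)nameLen, fp) == (size_t)nameLen;
    }
    free(buf);
    return ok ? 0 : -1;
}

/* Reads the header and key table back. Returns a malloc'd array of
 * *nKeys entries (release with free_key_table), or NULL on error. */
static KeyEntry *read_key_table(FILE *fp, int *nKeys)
{
    assert(fp != NULL && nKeys != NULL);
    int nameLen = 0;
    if (fread(nKeys, sizeof *nKeys, 1, fp) != 1 ||
        fread(&nameLen, sizeof nameLen, 1, fp) != 1 ||
        *nKeys <= 0 || nameLen <= 0)
        return NULL;
    KeyEntry *keys = calloc((size_t)*nKeys, sizeof *keys);
    if (keys == NULL) return NULL;
    for (int i = 0; i < *nKeys; i++) {
        keys[i].name = calloc((size_t)nameLen + 1, 1);   /* +1 keeps a terminator */
        if (keys[i].name == NULL ||
            fread(&keys[i].id, sizeof(int), 1, fp) != 1 ||
            fread(&keys[i].type, sizeof(int), 1, fp) != 1 ||
            fread(keys[i].name, 1, (size_t)nameLen, fp) != (size_t)nameLen) {
            free_key_table(keys, *nKeys);
            return NULL;
        }
    }
    return keys;
}
```

A reader for real QQ output would continue with the node count and node table, then seek to each node's byte offset to read its entries.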

30 QQ Format – Data Close Up
- Previous sections: the node table holds a byte offset for each node (Node 0, Node 1, Node 2, ...) into the Data section
- Data:
  - Node 0: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  - Node 1: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  - Node 2: for each entry, ...
- Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
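One way to model an entry whose middle field is "the size of a double" is a union, sketched below. The type and function names (`OptionalInfo`, `Entry`, `InfoType`, `format_entry`) and the numeric type codes are hypothetical; the slides name the real record type QQItem but do not show its definition or its exact on-disk padding.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { INFO_STATE, INFO_COUNT, INFO_VALUE } InfoType;   /* hypothetical codes */

/* One slot, three interpretations. On common platforms the union is
 * exactly as large as its widest member, the double. */
typedef union {
    int state;      /* 1 = on, 0 = off */
    int count;
    double value;
} OptionalInfo;

typedef struct {
    int key;
    OptionalInfo info;
    unsigned long long time;   /* event time in cycles */
} Entry;

/* Formats one entry as text. The key's type, looked up in the key table
 * by the caller, decides which union member is read. Returns the number
 * of characters produced, as snprintf does, or -1 for an unknown type. */
static int format_entry(const Entry *e, InfoType type, char *out, size_t outLen)
{
    assert(e != NULL && out != NULL);
    switch (type) {
    case INFO_STATE:
        return snprintf(out, outLen, "%d %s %llu", e->key,
                        e->info.state ? "on" : "off", e->time);
    case INFO_COUNT:
        return snprintf(out, outLen, "%d %d %llu", e->key, e->info.count, e->time);
    case INFO_VALUE:
        return snprintf(out, outLen, "%d %.4f %llu", e->key, e->info.value, e->time);
    }
    return -1;
}
```

Reading the wrong member would silently reinterpret the bytes, which is why the key table's type field has to be consulted before decoding.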

31 Gathering the Results
- After reading a node’s data section, entries with the same key can be gathered.
- Using the key table, the user knows what is contained in the second block of a timing entry
- Example: Key 2 has type “State”
  - The second block contains integer 1 for “on” or integer 0 for “off”
  - By subtracting the event times, the length of time spent in the “on” state is determined

    Key  State  Event Time
    2    1      109342759
    2    0      109342768
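The subtraction described here can be sketched as a small helper. `StateEvent`, `cycles_on` and `cycles_to_ns` are hypothetical names, and the sketch assumes the events for a node are already in time order. For the two entries on this slide the result is 109342768 - 109342759 = 9 cycles; at the 800 MHz clock of the PIII used elsewhere in the talk, that would be 11.25 ns.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int key;
    int state;                 /* 1 = on, 0 = off */
    unsigned long long time;   /* cycle count at the event */
} StateEvent;

/* Sums the cycles spent in the "on" state for one key.
 * Events must be in time order; unmatched "off" events are ignored. */
static unsigned long long cycles_on(const StateEvent *ev, size_t n, int key)
{
    unsigned long long total = 0, onTime = 0;
    int isOn = 0;
    for (size_t i = 0; i < n; i++) {
        if (ev[i].key != key)
            continue;
        if (ev[i].state == 1 && !isOn) {
            onTime = ev[i].time;
            isOn = 1;
        } else if (ev[i].state == 0 && isOn) {
            total += ev[i].time - onTime;
            isOn = 0;
        }
    }
    return total;
}

/* Converts cycles to nanoseconds using the node's clock rate in MHz:
 * ns = cycles * 1000 / MHz. */
static double cycles_to_ns(unsigned long long cycles, double mhz)
{
    assert(mhz > 0.0);
    return (double)cycles * 1000.0 / mhz;
}
```

The MHz value stored per node in the node table is what makes this conversion possible when results from machines with different clock rates are combined.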

32 Another example
- Example: Key 4 has type “Value”
  - The second block contains a double precision value passed in during execution
  - The value can be saved and displayed with timing information, or sent to a separate graph
  - Timing is obtained the same as before, by subtracting the event times

    Key  Value     Event Time
    4    -65.3477  109342735
    4    -58.2367  109342819

33 NCS Performance Measurement
- QQ was able to home in on specific blocks of code and allow measurement at a resolution necessary to allow for easy interpretation

34 Optimization Targets
- QQ analysis quickly identified two major targets within the code
  - Synapses
  - Message Passing

35 Synapses
- Synapses were by far the most common element of any NCS model, with the most memory usage
- Active only when an action potential was processed through the synapse
- Pass information between the nodes via message passing

36 Message Parsing Overhead
- Using QQ, we were able to identify areas for improvement within NCS 3
- Many unneeded fields requiring better encoding of their destination
- Fixed number of messages pre-allocated, far more than needed by the program
- Implemented a shared pool, buffers allocated as needed
- Messages sent individually, processed multiple times
- Implemented a packet scheme: process packet once for send, once for receive
- Process messages only when used
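The shared-pool idea (allocate buffers only when needed, then reuse them) can be illustrated with a simple free list. This is a generic sketch and not the NCS implementation: `MsgBuf`, `MsgPool`, `pool_get`, `pool_put` and the 64-byte payload size are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct MsgBuf {
    struct MsgBuf *next;   /* link used only while the buffer is in the pool */
    char payload[64];      /* hypothetical fixed payload size */
} MsgBuf;

typedef struct {
    MsgBuf *freeList;      /* buffers returned and ready for reuse */
    size_t allocated;      /* buffers ever obtained from malloc */
} MsgPool;

/* Hands out a recycled buffer if one is available; otherwise allocates
 * a new one. Nothing is reserved up front. Returns NULL if malloc fails. */
static MsgBuf *pool_get(MsgPool *pool)
{
    assert(pool != NULL);
    MsgBuf *b = pool->freeList;
    if (b != NULL) {
        pool->freeList = b->next;
    } else {
        b = malloc(sizeof *b);
        if (b != NULL)
            pool->allocated++;
    }
    if (b != NULL)
        b->next = NULL;
    return b;
}

/* Returns a buffer to the pool so a later pool_get can reuse it. */
static void pool_put(MsgPool *pool, MsgBuf *b)
{
    assert(pool != NULL && b != NULL);
    b->next = pool->freeList;
    pool->freeList = b;
}
```

With this scheme the `allocated` counter settles at the peak number of messages in flight, rather than at a fixed pre-allocated count.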

37 Optimization Results

38 Execution Time Measurements after Optimization

39 Conclusions
- QQ allows profiling of nanoscale timing of code segments and memory usage analysis
- Fine-grained measurements of specific events
- Ability to measure memory at an object or event level with a small memory and performance footprint
- Simple and effective tool

40 Future Work
- New Opteron cluster
- BlueGene migration
  - NCS is currently being installed at our sister lab, the Brain Mind Institute at EPFL in Switzerland, on their new machine
- Robotic integration

41 Acknowledgements
- Office of Naval Research
  - 6 years of funding for people (3 year renewable)
  - 4 DURIP grants for hardware


44 QQ API

